We evaluated each agent on the complete WebMall task set. The evaluation compares the effectiveness of different agent architectures:
RAG Agent, MCP Agent, and NLWeb Agent. For comparison, we also include results from the strongest HTML Agent configuration from the
original WebMall benchmark, AX+MEM.
Every agent architecture is evaluated in combination with four models, GPT-4.1 (gpt-4.1-2025-04-14), GPT-5 (gpt-5-2025-08-07), GPT-5-mini (gpt-5-mini-2025-08-07), and Claude Sonnet 4 (claude-sonnet-4-20250514), to assess variations in effectiveness across models. Note that
our experiments do not utilize prompt caching. Detailed execution logs for all runs are available in our
GitHub repository, organized
by interface type and model. For a quick overview of agent behavior, shortened execution logs containing one successful task execution per agent and model are also available.
4.1 Evaluation Metrics
- Task completion rate (CR): Computed as a binary success measure. For retrieval tasks, the set of URLs returned by the agent must be identical to the test set. For transactional tasks (e.g., add-to-cart, checkout), completion requires reaching the specified final state. This metric therefore captures strict task correctness.
- Precision, Recall, F1: Computed by comparing the set of answers returned by the agent to the test set. Precision measures the fraction of agent-returned items that are correct, while recall measures the fraction of test-set items recovered. These metrics capture graded performance and are informative in cases of partial matches where completion rate alone would report failure (see the sketch after this list).
- Runtime (s): End-to-end latency per task, measured from task submission to final output, including model reasoning and all tool calls.
- Token usage: Total number of input and output tokens consumed by the agent per task category. We explicitly exclude embedding tokens used for indexing, as their cost is several orders of magnitude lower than that of LLM inference tokens and they provide little information in this context.
- Cost ($): Estimated inference cost based on token usage, calculated using the published per-token input and output prices from the respective model providers. The current agent implementations do not utilize prompt caching.
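To make these definitions concrete, the sketch below shows how completion rate, precision, recall, F1, and cost can be computed for a single retrieval task. It is a minimal illustration rather than the benchmark's actual evaluation code; the function names, example URLs, and prices are placeholders.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    predicted: set  # URLs (or other answer items) returned by the agent
    expected: set   # gold-standard test set for the task


def completion_rate(result: TaskResult) -> int:
    """Binary success: the returned set must match the test set exactly."""
    return int(result.predicted == result.expected)


def precision_recall_f1(result: TaskResult) -> tuple:
    """Set-based precision, recall, and F1 over returned vs. expected items."""
    true_positives = len(result.predicted & result.expected)
    precision = true_positives / len(result.predicted) if result.predicted else 0.0
    recall = true_positives / len(result.expected) if result.expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in USD from token counts and per-million-token prices (no prompt caching)."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# A partial match fails the strict completion-rate check but still earns partial credit on F1.
r = TaskResult(predicted={"https://shop-a.example/p1", "https://shop-b.example/p7"},
               expected={"https://shop-a.example/p1"})
print(completion_rate(r))      # 0
print(precision_recall_f1(r))  # (0.5, 1.0, 0.666...)
print(inference_cost(120_000, 4_000, 2.00, 8.00))  # illustrative prices in $/MTok, not actual provider rates
```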
4.2 Overall Results
The following table shows the overall performance of each interface averaged across all tasks and models:
Key insight: The API-based agents (MCP: F1 0.75, NLWeb: F1 0.76) and the RAG agent (F1 0.76) outperform the HTML browsing agent (AX+MEM: F1 0.69) by roughly 6–7 F1 points on average.
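Assuming the per-task results are exported to a flat table (a hypothetical CSV layout; the actual logs live in the GitHub repository), the per-interface averages reported here could be reproduced roughly as follows:

```python
import pandas as pd

# Hypothetical export of per-task results; column names are assumptions, not the repo's schema.
results = pd.read_csv("webmall_results.csv")  # columns: interface, model, task_id, cr, precision, recall, f1

# Overall table: average over all tasks and models for each interface.
overall = results.groupby("interface")[["cr", "precision", "recall", "f1"]].mean().round(2)
print(overall)

# Breakdown by interface and model, as used in Section 4.3.
by_model = results.groupby(["interface", "model"])[["cr", "precision", "recall", "f1"]].mean().round(2)
print(by_model)
```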
4.3 Results by Interface and Model
The following table shows detailed results for each interface broken down by the model used. This allows for direct comparison of how
different models perform with each interface architecture.
Key insights
- Across all interfaces, GPT-5 yields the highest combination of task completion and F1 (e.g., AX+MEM: CR 0.69/F1 0.76 vs. 0.55–0.59 CR and 0.64–0.68 F1 for other models; RAG: CR 0.79/F1 0.87 vs. 0.61–0.65 CR and 0.68–0.75 F1).
- The GPT-5 RAG agent attains the best overall effectiveness in the table (CR 0.79, F1 0.87), outperforming GPT-5 MCP (CR 0.73, F1 0.82) and GPT-5 NLWeb (CR 0.73, F1 0.84).
4.4 Results by Task Group
Specific Product Search
The Specific Product Search category includes tasks where users need to find particular products by name, model number, or specific
technical requirements. This represents 23 tasks from the WebMall benchmark and tests the agents' ability to accurately retrieve and
filter products based on precise criteria.
Key insights
- For every interface, GPT-5 delivers the strongest effectiveness: it yields the highest combination of completion rate and F1 compared to GPT-4.1, GPT-5-mini, and Claude Sonnet 4 within the same interface.
- The best-performing configurations for specific product search are GPT-5 with RAG, MCP, and NLWeb (CR 0.83–0.87, F1 0.96), which all achieve very high precision and recall (P ≥ 0.95, R ≥ 0.95).
- While different models show some variation within a given interface (e.g., RAG F1 ranges from 0.86 to 0.96, MCP from 0.84 to 0.96), the choice of interface has a larger impact on effectiveness in this task category: for the same model, RAG, MCP, and NLWeb consistently achieve higher CR and F1 than the HTML agent.
Vague Product Search
The Vague Product Search category includes 19 tasks that require interpreting imprecise or subjective requirements, finding product
substitutes, and identifying compatible products. These tasks test the agents' ability to understand fuzzy requirements and apply
semantic reasoning to match products that may not have exact specification matches.
Key insights
- Across interfaces and models, vague product search shows a clear drop in effectiveness compared to specific product search: the average F1 decreases from about 0.87 to 0.76 (≈0.11 F1), and the average completion rate falls from about 0.73 to 0.60 (≈0.13 CR).
- RAG with GPT-5 remains the strongest configuration (CR 0.79, F1 0.90), followed by NLWeb and MCP with GPT-5 or GPT-5-mini (F1 0.82–0.86), while the HTML agent with GPT-5 reaches F1 0.83.
- For GPT-5, the performance gap between interfaces narrows compared to specific product search: AX+MEM, MCP, and NLWeb cluster relatively closely on F1 (0.82–0.86), but RAG still leads with F1 0.90 on vague queries.
- Within each interface, models vary in effectiveness (e.g., RAG F1 ranges from 0.68 to 0.90, NLWeb from 0.73 to 0.86), yet the interface design continues to matter: for GPT-4.1 and Claude Sonnet 4, RAG, MCP, and NLWeb consistently reach higher F1 than the HTML agent on vague product search.
Cheapest Product Search
The Cheapest Product Search category includes 26 tasks focused on finding the most affordable options. These tasks test the agents'
ability to compare prices across products and shops, while also filtering by specific or vague requirements. This category combines
price optimization with product search capabilities.
Key insights
- Cheapest product search is the most challenging category: averaged over all interfaces and models, F1 drops from about 0.88 on specific product search and 0.77 on vague search to about 0.64 here (≈0.24 and ≈0.13 F1 lower, respectively), while the average completion rate decreases from roughly 0.74 to 0.60 (≈0.14 CR).
- RAG with GPT-5 remains the top-performing configuration (CR 0.72, F1 0.78), with RAG+GPT-5-mini (CR 0.69, F1 0.76) and NLWeb+GPT-5 (CR 0.65, F1 0.75) close behind, indicating that several interfaces can reach similar effectiveness on price-focused tasks.
- For GPT-5, the performance gap between interfaces narrows further compared to the other task groups: AX+MEM, RAG, MCP, and NLWeb all achieve F1 scores between 0.72 and 0.78, so the best and worst GPT-5 setups differ by only 0.06 F1.
- Interface design still matters, especially for weaker models: with GPT-4.1, the RAG, MCP, and HTML agents attain F1 scores between 0.58 and 0.68, whereas NLWeb+GPT-4.1 lags noticeably at F1 0.42.
Actions and Transactions
The Actions and Transactions category includes 15 tasks focused on e-commerce operations: adding products to cart (7 tasks) and
completing checkout processes (8 tasks). These tasks test the agents' ability to interact with transactional APIs and navigate
multi-step workflows requiring sequential actions.
Key insights
- Transactional tasks are generally solved very reliably: averaged across all interfaces and models, the completion rate is about 0.81 and F1 about 0.86, clearly higher than for any of the search-oriented task groups.
- The HTML agent shows an unusual pattern: AX+MEM with GPT-4.1 achieves a perfect score (CR 1.00, F1 1.00), while GPT-5 and GPT-5-mini drop to substantially lower effectiveness (CR 0.67/F1 0.64 and CR 0.53/F1 0.56, respectively).
- For the other interfaces, the GPT-5 series performs strongly: RAG, MCP, and NLWeb with GPT-5 all reach CR 0.93 and F1 0.98, closely matched by their GPT-4.1 and Claude Sonnet 4 counterparts (F1 0.96–0.98), indicating that executing structured action sequences is a comparatively easy setting for these architectures.
- GPT-5-mini consistently trails its larger counterparts on actions and transactions (e.g., RAG: F1 0.54, MCP: 0.88, NLWeb: 0.87), suggesting that reduced model capacity has a more pronounced impact on multi-step transactional workflows than on many retrieval tasks.
4.5 Cost & Runtime Analysis
This section analyzes cost and execution time based on model pricing and the runtime results from our experiments.
The total cost of running all experiments across all interfaces and models was approximately
$250. This cost excludes embedding generation (which is negligible at $0.02/MTok) and infrastructure costs. Execution
times shown are averages per task.
4.6 Cost vs Effectiveness Comparison
The scatter plot below visualizes the relationship between cost and effectiveness across different agent interfaces and models. Each
point represents a combination of interface type (RAG, MCP, NLWeb, HTML) and language model (GPT-4.1, GPT-5, GPT-5-mini, Claude Sonnet 4).
Cost–effectiveness insights
- RAG offers a particularly attractive price–performance trade-off. RAG with GPT-5-mini is by far the cheapest configuration in our study (cost $0.01 with CR 0.65 and F1 0.75), while RAG with GPT-5 achieves the highest overall effectiveness (CR 0.79, F1 0.87) at a still moderate cost of $0.15.
- Among the API-based interfaces, NLWeb with GPT-5 combines strong effectiveness (CR 0.73, F1 0.84) with relatively low cost ($0.09), whereas MCP with GPT-5 is slightly more expensive ($0.18, F1 0.82) but remains competitive in terms of performance.
- The HTML agent (AX+MEM) is consistently less cost-efficient: even with GPT-5 it reaches lower effectiveness (F1 0.76) at a higher cost ($0.50) than the best RAG and NLWeb configurations, while Claude Sonnet 4 with the HTML agent is the most expensive configuration ($1.05) without being the most effective (F1 0.66).
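For reference, the kind of scatter plot described above can be sketched directly from the cost/F1 pairs quoted in these bullets (only a subset of all configurations; the full numbers are in the repository):

```python
import matplotlib.pyplot as plt

# Cost and F1 pairs quoted in the insights above (subset of all interface/model combinations).
points = {
    "RAG + GPT-5-mini": (0.01, 0.75),
    "RAG + GPT-5": (0.15, 0.87),
    "NLWeb + GPT-5": (0.09, 0.84),
    "MCP + GPT-5": (0.18, 0.82),
    "AX+MEM + GPT-5": (0.50, 0.76),
    "AX+MEM + Claude Sonnet 4": (1.05, 0.66),
}

fig, ax = plt.subplots(figsize=(7, 5))
for label, (cost, f1) in points.items():
    ax.scatter(cost, f1)
    ax.annotate(label, (cost, f1), textcoords="offset points", xytext=(5, 5), fontsize=8)

ax.set_xlabel("Cost ($)")
ax.set_ylabel("F1")
ax.set_title("Cost vs. effectiveness (quoted configurations)")
plt.tight_layout()
plt.show()
```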
Runtime patterns
- GPT-5 configurations tend to be noticeably slower than other models for the same interface: for example, AX+MEM with GPT-5 averages 522 s per task versus 92 s with GPT-4.1, and RAG with GPT-5 takes 114 s compared to 8 s with GPT-4.1.
- Claude Sonnet 4 and GPT-4.1 generally offer the fastest execution times across interfaces (e.g., RAG+GPT-4.1: 8 s, MCP+GPT-4.1: 11 s, NLWeb+Claude: 20 s), but GPT-4.1 in particular trades some effectiveness for speed (e.g., RAG F1 0.75 vs. 0.87 with GPT-5).
- For many interfaces, GPT-5-mini sits between GPT-4.1 and GPT-5 in terms of runtime (e.g., RAG: 51 s vs. 8 s and 114 s; MCP: 80 s vs. 11 s and 94 s), offering moderate speed but also reduced effectiveness compared to full GPT-5.