MCP vs RAG vs NLWeb vs HTML:
A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web

1. Introduction

LLM-based agents are increasingly deployed to automate web tasks such as product search, offer comparison, and order placement. We see four dominant ways these agents interact with websites: (1) HTML browsing by clicking links and filling forms, (2) retrieval-augmented generation (RAG) over pre-crawled content, (3) Model Context Protocol (MCP) access to site-specific Web APIs, and (4) NLWeb, which lets agents issue natural-language queries that are answered with Schema.org-style JSON data. Despite rapid experimentation with these interfaces, the community still lacks a systematic comparison of their effectiveness and efficiency on the same challenging task sets.

To close this gap we introduce a reproducible testbed with four simulated e-shops. Each shop offers its catalog via HTML, MCP, and NLWeb, while a dedicated crawler provides the corpus needed for the RAG interface. For every interface (HTML, RAG, MCP, NLWeb) we build a specialized agent so that all four agents face exactly the same tasks covering specific and vague product search, cheapest-offer comparisons, and transactional workflows such as cart management and checkout. The architecture diagram below provides an overview of the interfaces and their interaction patterns.

Our study evaluates the agents with GPT-4.1, GPT-5, GPT-5-mini, and Claude Sonnet 4, measuring task completion, precision/recall/F1, execution time, and token consumption to derive per-task cost estimates. Across search-oriented tasks, the RAG, MCP, and NLWeb agents outperform the HTML baseline by roughly 11 percentage points in task completion while using 2–5× fewer tokens. The GPT-5 RAG agent delivers the best overall completion rate (0.79) with moderate token usage, while GPT-5-mini offers a cost-effective alternative when accuracy requirements are less strict.

This work makes two concrete contributions:

  1. Unified testbed: We introduce a testbed for comparing different agent architectures. The testbed consists of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each of the four architectures (HTML, RAG, MCP, and NLWeb) the testbed contains specialized agents that interact with the e-shops using the respective interfaces.
  2. Systematic evaluation: Using different sets of challenging e-commerce tasks, we systematically evaluate agent performance across interfaces and models (GPT-4.1, GPT-5, GPT-5-mini, Claude Sonnet 4) and analyze the effectiveness, efficiency, and cost of the different agents.

The remainder of this page summarizes the architectures, describes the task design, and presents the experimental findings. All code and data needed to reproduce the results are available in our GitHub repository.

News

28-11-2025: Technical report about the experiments uploaded to arXiv: MCP vs RAG vs NLWeb vs HTML - A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web

04-08-2025: Initial release of this webpage.

2. Architectures and Interfaces

The sections below describe the four architectures that we compare in our experimental study, as well as the interfaces that the agents and e-shops use for communication.

2.1 HTML Architecture

Within this architecture, e-shops expose traditional HTML interfaces intended for human consumption. The agents interact with these pages by clicking hyperlinks and performing form-filling actions. For our experiments we use the AX+MEM agent from the WebMall benchmark, which is implemented with the AgentLab library and executed inside BrowserGym. The agent observes the accessibility tree (AXTree) of each page and can store relevant information in short-term memory to maintain context across multiple steps. We disable visual perception in our setup because adding screenshots on top of the AXTree reduced performance in prior WebMall experiments.

HTML Agent Workflow

The HTML agent executes the following interaction loop:

  1. Navigate to target web page using browser automation
  2. Parse accessibility tree (AXTree) to understand page content
  3. Store relevant information in short-term memory
  4. Execute action (click, type, scroll) based on task requirements
  5. Repeat interaction cycle until task completion

More details about the AX+MEM HTML Agent are on the WebMall benchmark page.
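
To make the loop concrete, the sketch below mirrors the five steps above in schematic Python. It is not the actual AgentLab/BrowserGym implementation; `env`, `llm`, and the `decision` object are placeholders that stand in for the real observation, model, and action interfaces.

# Schematic sketch of the AX+MEM interaction loop (illustrative only).
def run_html_agent(env, llm, task: str, max_steps: int = 30) -> str:
    memory: list[str] = []                     # short-term memory across steps
    obs, _ = env.reset()                       # Gym-style reset; obs contains the AXTree
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Memory: {memory}\n"
            f"AXTree: {obs['axtree']}\n"
            "Choose the next action (click, fill, scroll) or report the final answer."
        )
        decision = llm.invoke(prompt)          # model proposes an action and a memory note
        memory.append(decision.note)           # store relevant information
        if decision.is_final:                  # task solved: return the answer
            return decision.answer
        obs, _, terminated, truncated, _ = env.step(decision.action)
        if terminated or truncated:
            break
    return "task not completed"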

2.2 RAG Architecture

The RAG architecture includes a search engine that crawls the HTML interfaces of all four shops, strips navigation and markup, and indexes the remaining text. Our implementation builds a unified Elasticsearch index fed by the Unstructured processing pipeline. The RAG agent retrieves content by issuing queries to this search engine instead of visiting the live sites. It iteratively formulates queries, inspects the returned documents, and refines follow-up queries as needed. For transactional steps (add-to-cart or checkout) we expose dedicated Python functions that the agent can invoke directly.
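
To illustrate the indexing side, the following sketch builds a minimal version of this pipeline, assuming the Unstructured library for HTML cleaning and a local Elasticsearch instance. The index name, field layout, and example URL are illustrative rather than taken from our implementation.

from elasticsearch import Elasticsearch
from unstructured.partition.html import partition_html

es = Elasticsearch("http://localhost:9200")
INDEX = "webmall_products"  # illustrative index name

if not es.indices.exists(index=INDEX):
    es.indices.create(index=INDEX)

def index_product_page(url: str, shop: str) -> None:
    elements = partition_html(url=url)  # parse the page into text elements
    text = "\n".join(el.text for el in elements if el.text)  # keep content, drop markup
    es.index(index=INDEX, document={"shop": shop, "url": url, "content": text})

index_product_page("https://webmall-1.example.org/product/123", shop="webmall_1")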

The agent leverages the LangGraph framework to orchestrate retrieval workflows and incorporates specialized tools for e-commerce actions. Concretely, the RAG agent is a LangGraph ReAct agent with the following capabilities:

  1. search_products: Execute semantic search queries against Elasticsearch (returns title + URL for efficiency)
  2. get_product_details: Fetch detailed product information for specific URLs
  3. add_to_cart_webmall_1-4: Add products to specific shop carts
  4. checkout_webmall_1-4: Complete purchases with customer details

RAG Agent Workflow

The agent follows a multi-query process: it first issues lightweight searches to gather candidates, inspects the responses, and reformulates queries as many times as needed to cover all shops. Only after narrowing down promising items does it call get_product_details for detailed attributes, minimizing token usage.
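
The following sketch shows how such a ReAct-style tool loop can be assembled with LangGraph's prebuilt agent. The tool bodies are stubs, the model choice and example task are illustrative, and only one of the four add-to-cart tools is shown; the actual tool implementations live in the repository linked below.

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_products(query: str) -> str:
    """Semantic search over the unified index; returns product titles and URLs."""
    return "stub: in the real agent this queries Elasticsearch"

@tool
def get_product_details(url: str) -> str:
    """Fetch the full pre-processed page content for a product URL."""
    return "stub: in the real agent this looks up the indexed document"

@tool
def add_to_cart_webmall_1(url: str) -> str:
    """Add the product at `url` to the cart of shop 1 (shops 2-4 have analogous tools)."""
    return "stub: in the real agent this calls the shop's cart function"

agent = create_react_agent(
    model=ChatOpenAI(model="gpt-4.1"),
    tools=[search_products, get_product_details, add_to_cart_webmall_1],
)
result = agent.invoke(
    {"messages": [("user", "Find all offers for the Canon EOS R6 across the four shops.")]}
)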

For more details about the implementation of the RAG agent please refer to the RAG source code in our repository. The prompt used for defining the agent can be found here.

2.3 MCP Architecture

In the Model Context Protocol (MCP) architecture, the e-shops expose their product search, cart manipulation, and checkout functionality via proprietary APIs. Each shop hosts its own MCP server that defines the available functions and parameters, and the agent invokes these endpoints over the protocol instead of scraping HTML content.
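
As a sketch of the server side, a shop-specific MCP server can expose its search function as a tool via the FastMCP helper from the MCP Python SDK. The tool below mirrors the style of the testbed's servers but is illustrative; the real servers query the shop's product catalog and return richer JSON.

import json
from mcp.server.fastmcp import Context, FastMCP

mcp = FastMCP("e-store-athletes")  # one server instance per shop

@mcp.tool()
async def search_products(
    ctx: Context,
    query: str,
    per_page: int = 10,
    page: int = 1,
    include_descriptions: bool = False,
) -> str:
    # The real implementation searches the shop's catalog; this stub returns
    # an empty, shop-specific JSON structure.
    return json.dumps({"query": query, "page": page, "items": []})

if __name__ == "__main__":
    mcp.run()  # serve the tool over the MCP protocol (stdio by default)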

Like the RAG setup, the MCP agent can iteratively refine its queries: it issues a search, inspects the JSON response, and adjusts the next request accordingly. However, every shop uses its own function names, parameter conventions, and response schema. This heterogeneity places the burden of normalization on the agent, which must interpret different JSON formats when comparing or merging results across shops.

Our MCP agent reuses the LangGraph framework from the RAG setup but swaps in MCP tool calls for product search, cart management, and checkout operations.

MCP Agent Workflow

The MCP server for each shop exposes its capabilities as tools, which the agent can discover and execute. The workflow is as follows:

  1. Connection and Discovery: The agent (acting as an MCP Host) connects to the shop-specific MCP server and enumerates the available tools and parameters.
  2. Iterative Tool Execution: The agent invokes search or cart functions with an initial query, inspects the JSON result, and can reissue refined queries to explore additional items or shops.
  3. Response Harmonization: Because each shop returns different attribute names and structures, the agent normalizes the data before making comparisons or taking follow-up actions.

The example below illustrates the heterogeneity of the Web APIs by comparing the search methods of two shops: search_products and find_items_techtalk use different parameter names, and only the second method offers the possibility to sort results by price.

# E-Store Athletes (Shop 1) search signature — no sorting
async def search_products(
    ctx: Context,
    query: str,
    per_page: int = 10,
    page: int = 1,
    include_descriptions: bool = False
) -> str

# TechTalk (Shop 2) search signature — supports sorting
async def find_items_techtalk(
    ctx: Context,
    query: str,
    limit: int = 5,
    page_num: int = 1,
    sort_by_price: str = "none",
    include_descriptions: bool = False
) -> str

The heterogeneity extends beyond function signatures to the data structures returned by the different search endpoints: for example, each shop uses a different set of attribute names and its own product categorization hierarchy.
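
The snippet below illustrates what this harmonization can look like on the agent side: two shops return differently structured search hits, and a small mapping step brings them into a common record before prices can be compared. All field names and values are invented for illustration.

def normalize_athletes_hit(hit: dict) -> dict:
    # hypothetical Shop 1 format: {"product_title": ..., "price_usd": ..., "url": ...}
    return {"title": hit["product_title"], "price": hit["price_usd"], "url": hit["url"]}

def normalize_techtalk_hit(hit: dict) -> dict:
    # hypothetical Shop 2 format: {"name": ..., "cost": {"amount": ...}, "link": ...}
    return {"title": hit["name"], "price": hit["cost"]["amount"], "url": hit["link"]}

hits = [
    normalize_athletes_hit(
        {"product_title": "ZenBook 14", "price_usd": 899.0, "url": "https://shop1.example/zenbook-14"}
    ),
    normalize_techtalk_hit(
        {"name": "ZenBook 14 OLED", "cost": {"amount": 949.0}, "link": "https://shop2.example/zenbook-14-oled"}
    ),
]
cheapest = min(hits, key=lambda h: h["price"])  # offers are now directly comparable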

For complete implementation details, refer to the MCP server code in our repository.

2.4 NLWeb Architecture

The NLWeb interface extends MCP by requiring each shop to expose a standardized natural-language query endpoint. Every provider hosts an ask tool that accepts queries like “laptops under $1000 with 16GB RAM,” performs an internal search, and returns results in schema.org-style JSON. Because all shops respond with the same vocabulary, agents spend less effort harmonizing the data. Transactional workflows (cart actions, checkout) remain available via the shop’s MCP tools.

In our deployment we create one Elasticsearch index per shop to support semantic search within the NLWeb server. Each query is embedded and matched against the shop’s own index before the results are serialized through the schema.org product vocabulary. For reference, a sample ask call and response is available here.
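
A minimal sketch of such a per-shop semantic search is shown below, assuming an OpenAI embedding model and Elasticsearch's kNN search; the index name, field names, and the simplified schema.org serialization are illustrative rather than the actual NLWeb server code.

from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
oai = OpenAI()

def ask_shop(query: str, shop_index: str, k: int = 10) -> list[dict]:
    # Embed the query with the same model used when indexing the shop's products
    vector = oai.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
    response = es.search(
        index=shop_index,
        knn={"field": "embedding", "query_vector": vector, "k": k, "num_candidates": 100},
    )
    # Serialize hits as simplified schema.org-style product items
    return [
        {
            "@type": "Product",
            "name": hit["_source"]["title"],
            "url": hit["_source"]["url"],
            "offers": {"@type": "Offer", "price": hit["_source"]["price"]},
        }
        for hit in response["hits"]["hits"]
    ]

items = ask_shop("laptops under $1000 with 16GB RAM", shop_index="webmall_1_products")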

The NLWeb agent uses the same LangGraph framework as our other agents, combining the standardized search tool with MCP-based cart and checkout functions. The comparison table in Section 2.5 summarizes the main characteristics of all four architectures.

NLWeb Agent Workflow

The agent’s interaction with the NLWeb + MCP interface follows the workflow below:

  1. Connection and Discovery: The agent connects to the NLWeb-enabled MCP server for a specific shop and discovers the available tools.
  2. Natural Language Query: The agent sends a natural language query (e.g., "laptops under $1000 with 16GB RAM") to the server by invoking the ask tool.
  3. Semantic Search Execution: The server generates an embedding from the query and performs a cosine similarity search against the pre-computed vectors in its dedicated Elasticsearch index.
  4. Standardized Response: The agent receives a list of products in the standardized schema.org JSON format.

The complete implementation is available in the NLWeb + MCP directory. The prompt used for the agent can be found here.

2.5 Comparison of the Different Architectures

The table below compares the four architectures with respect to the features offered by the e-shops and the functionality implemented by the agents.

| Aspect | HTML | RAG | MCP | NLWeb |
| --- | --- | --- | --- | --- |
| E-Shops | | | | |
| Interface | HTML pages | Retrieval API | Proprietary APIs | Standardized API |
| Search functionality | Free-text search per shop | Search engine over crawled content | Per-shop index; structured data | Per-shop index; structured data |
| Search response | HTML result list with links | Pre-processed HTML pages | Heterogeneous JSON | Schema.org JSON |
| Agent | | | | |
| Communication protocol | HTML over HTTP | Direct calls to search engine | JSON-RPC via MCP | JSON-RPC via MCP |
| Query strategy | Site search and browsing | Multi-query | Multi-query per shop | Multi-query per shop |
| Query refinement | Interactive page exploration | Self-evaluation & iteration | Self-evaluation & iteration | Self-evaluation & iteration |
| Add to cart / checkout | Clicking & form filling | Direct function calls | MCP tool invocation | MCP tool invocation |

3. Use Case: Online Shopping

To evaluate the effectiveness and efficiency of the different agents, we use the WebMall benchmark. WebMall simulates an online shopping environment with four distinct webshops, each offering around 1,000 products with heterogeneous product descriptions. The WebMall benchmark includes a diverse set of e-commerce tasks that test different agent capabilities, ranging from focused retrieval to advanced reasoning about compatible and substitute products. The tasks are organized into the following four categories:

Specific Product Search (23 tasks)

Example: Find all offers for Fractal Design PC Gaming Cases which support 240mm radiators and 330mm GPUs.

  • Find Specific Product (12 tasks): Locate a particular product by name or model number
  • Products Fulfilling Specific Requirements (11 tasks): Find products matching precise technical specifications

Vague Product Search (19 tasks)

Example: Find all offers for compact keyboards that are best suited for working with a laptop remotely.

  • Products Satisfying Vague Requirements (8 tasks): Interpret and fulfill imprecise or subjective requirements
  • Find Substitutes (6 tasks): Identify alternative products when the requested item is unavailable
  • Find Compatible Products (5 tasks): Locate accessories or components compatible with a given product

Cheapest Product Search (26 tasks)

Example: Find the cheapest offer for a new Xbox gaming console with at least 512 GB disk space in white.

  • Find Cheapest Offer (10 tasks): Identify the lowest-priced option for a specific product across shops
  • Cheapest Offer with Specific Requirements (10 tasks): Find the most affordable product meeting detailed criteria
  • Cheapest Offer with Vague Requirements (6 tasks): Combine price optimization with fuzzy requirement matching

Transactional Tasks (15 tasks)

Example: Add the product on page {url} to the shopping cart and complete the checkout process.

  • Add to Cart (7 tasks): Add selected products to the shopping cart
  • Checkout (8 tasks): Complete the purchase process with payment and shipping information

Further examples of each task type can be found on the WebMall benchmark page. The complete task set, including the solution for each task, is available in the WebMall repository.

4. Experimental Results

We evaluated each agent on the complete WebMall task set. The evaluation compares the effectiveness of different agent architectures: RAG Agent, MCP Agent, and NLWeb Agent. For comparison, we also include results from the strongest HTML Agent configuration from the original WebMall benchmark, AX+MEM.

Every agent architecture is evaluated in combination with GPT-4.1 (gpt-4.1-2025-04-14), GPT-5 (gpt-5-2025-08-07), GPT-5-mini (gpt-5-mini-2025-08-07), and Claude Sonnet 4 (claude-sonnet-4-20250514) to assess variations in effectiveness across models. Note that our experiments do not use prompt caching. Detailed execution logs for all runs are available in our GitHub repository, organized by interface type and model. For a quick impression of agent behavior, shortened execution logs containing one successful task execution per agent and model are also available.

4.1 Evaluation Metrics

  • Task completion rate (CR): Computed as a binary success measure. For retrieval tasks, the set of URLs returned by the agent must be identical to the test set. For transactional tasks (e.g., add-to-cart, checkout), completion requires reaching the specified final state. This metric therefore captures strict task correctness.
  • Precision, Recall, F1: Computed by comparing the set of answers returned by the agent to the test set. Precision measures the fraction of agent-returned items that are correct; recall measures the fraction of test-set items that are recovered. These metrics capture graded performance and are informative in cases of partial matches where the completion rate alone would report failure (see the code sketch after this list).
  • Runtime (s): End-to-end latency per task, measured from task submission to final output, including model reasoning and all tool calls.
  • Token usage: Total number of input and output tokens consumed by the agent per task category. We explicitly exclude embedding tokens used for indexing, as their cost is several orders of magnitude lower than that of LLM inference and they provide little information in this context.
  • Cost ($): Estimated inference cost based on token usage, calculated using the published per-token input and output prices from the respective model providers. The current agent implementations do not utilize prompt caching.
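
The sketch below shows how the set-based metrics and the cost estimate can be computed for a single retrieval task; the per-million-token prices are placeholders to be replaced with the providers' current pricing.

def evaluate(answers: set[str], truth: set[str]) -> dict:
    tp = len(answers & truth)
    precision = tp / len(answers) if answers else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    completed = answers == truth  # strict completion: exact match with the test set
    return {"CR": float(completed), "P": precision, "R": recall, "F1": f1}

def inference_cost(input_tokens: int, output_tokens: int,
                   price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    # cost = tokens * price per million tokens
    return (input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok) / 1_000_000

print(evaluate({"shop1/a", "shop2/b"}, {"shop1/a", "shop2/b", "shop3/c"}))
# -> CR 0.0, P 1.0, R 0.67, F1 0.8
print(inference_cost(120_000, 4_000, price_in_per_mtok=2.0, price_out_per_mtok=8.0))
# -> 0.272 (placeholder prices)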

4.2 Overall Results

The following table shows the overall performance of each interface averaged across all tasks and models:

Key insight: The API-based agents (MCP: F1 0.75, NLWeb: F1 0.76) and the RAG agent (F1 0.76) outperform the HTML browsing agent (AX+MEM: F1 0.69) by roughly 6–7 F1 points on average.

4.3 Results by Interface and Model

The following table shows detailed results for each interface broken down by the model used. This allows for direct comparison of how different models perform with each interface architecture.

Key insights

  • Across all interfaces, GPT-5 yields the highest combination of task completion and F1 (e.g., AX+MEM: CR 0.69/F1 0.76 vs. 0.55–0.59 CR and 0.64–0.68 F1 for other models; RAG: CR 0.79/F1 0.87 vs. 0.61–0.65 CR and 0.68–0.75 F1).
  • The GPT-5 RAG agent attains the best overall effectiveness in the table (CR 0.79, F1 0.87), outperforming GPT-5 MCP (CR 0.73, F1 0.82) and GPT-5 NLWeb (CR 0.73, F1 0.84).

4.4 Results by Task Group

Specific Product Search

The Specific Product Search category includes tasks where users need to find particular products by name, model number, or specific technical requirements. This represents 23 tasks from the WebMall benchmark and tests the agents' ability to accurately retrieve and filter products based on precise criteria.

Key insights

  • For every interface, GPT-5 delivers the strongest effectiveness: it yields the highest combination of completion rate and F1 compared to GPT-4.1, GPT-5-mini, and Claude Sonnet 4 within the same interface.
  • The best-performing configurations for specific product search are GPT-5 with RAG, MCP, and NLWeb (CR 0.83–0.87, F1 0.96), which all achieve very high precision and recall (P ≥ 0.95, R ≥ 0.95).
  • While different models show some variation within a given interface (e.g., RAG F1 ranges from 0.86 to 0.96, MCP from 0.84 to 0.96), the choice of interface has a larger impact on effectiveness in this task category: for the same model, RAG, MCP, and NLWeb consistently achieve higher CR and F1 than the HTML agent.

Vague Product Search

The Vague Product Search category includes 19 tasks that require interpreting imprecise or subjective requirements, finding product substitutes, and identifying compatible products. These tasks test the agents' ability to understand fuzzy requirements and apply semantic reasoning to match products that may not have exact specification matches.

Key insights

  • Across interfaces and models, vague product search shows a clear drop in effectiveness compared to specific product search: the average F1 decreases from about 0.87 to 0.76 (≈0.11 F1), and the average completion rate falls from about 0.73 to 0.60 (≈0.13 CR).
  • RAG with GPT-5 remains the strongest configuration (CR 0.79, F1 0.90), with NLWeb and MCP using GPT-5 or GPT-5-mini (F1 0.82–0.86), while the HTML agent with GPT-5 reaches F1 0.83.
  • For GPT-5, the performance gap between interfaces narrows compared to specific product search: AX+MEM, MCP, and NLWeb cluster relatively closely on F1 (0.82–0.86), but RAG still leads with F1 0.90 on vague queries.
  • Within each interface, models vary in effectiveness (e.g., RAG F1 ranges from 0.68 to 0.90, NLWeb from 0.73 to 0.86), yet the interface design continues to matter: for GPT-4.1 and Claude Sonnet 4, RAG, MCP, and NLWeb consistently reach higher F1 than the HTML agent on vague product search.

Cheapest Product Search

The Cheapest Product Search category includes 26 tasks focused on finding the most affordable options. These tasks test the agents' ability to compare prices across products and shops, while also filtering by specific or vague requirements. This category combines price optimization with product search capabilities.

Key insights

  • Cheapest product search is the most challenging category: averaged over all interfaces and models, F1 drops from about 0.88 on specific product search and 0.77 on vague search to about 0.64 here (≈0.24 and ≈0.13 F1 lower, respectively), while average completion rate decreases from roughly 0.74 to 0.60 (≈0.14 CR).
  • RAG with GPT-5 remains the top-performing configuration (CR 0.72, F1 0.78), with RAG+GPT-5-mini (CR 0.69, F1 0.76) and NLWeb+GPT-5 (CR 0.65, F1 0.75) close behind, indicating that several interfaces can reach similar effectiveness on price-focused tasks.
  • For GPT-5, the performance gap between interfaces narrows further compared to the other task groups: AX+MEM, RAG, MCP, and NLWeb all achieve F1 scores between 0.72 and 0.78, so the best and worst GPT-5 setups differ by only 0.06 F1.
  • Interface design still matters, especially for weaker models: with GPT-4.1, RAG, MCP, and the HTML agent attain F1 scores between 0.58 and 0.68, whereas NLWeb+GPT-4.1 lags noticeably at 0.42 F1.

Actions and Transactions

The Actions and Transactions category includes 15 tasks focused on e-commerce operations: adding products to cart (7 tasks) and completing checkout processes (8 tasks). These tasks test the agents' ability to interact with transactional APIs and navigate multi-step workflows requiring sequential actions.

Key insights

  • Transactional tasks are generally solved very reliably: averaged across all interfaces and models, completion rate is about 0.81 and F1 about 0.86, clearly higher than for any of the search-oriented task groups.
  • The HTML agent shows an unusual pattern: AX+MEM with GPT-4.1 achieves a perfect score (CR 1.00, F1 1.00), while GPT-5 and GPT-5-mini drop to substantially lower effectiveness (CR 0.67/F1 0.64 and CR 0.53/F1 0.56, respectively).
  • For the other interfaces, the GPT-5 series performs strongly: RAG, MCP, and NLWeb with GPT-5 all reach CR 0.93 and F1 0.98, closely matched by their GPT-4.1 and Claude Sonnet 4 counterparts (F1 0.96–0.98), indicating that executing structured action sequences is a comparatively easy setting for these architectures.
  • GPT-5-mini consistently trails its larger counterparts on actions and transactions (e.g., RAG: F1 0.54, MCP: 0.88, NLWeb: 0.87), suggesting that reduced model capacity has a more pronounced impact on multi-step transactional workflows than on many retrieval tasks.

4.5 Cost & Runtime Analysis

The cost and execution time analysis below is based on the model providers' published pricing and the runtime measurements from our experiments.

The total cost of running all experiments across all interfaces and models was approximately $250. This cost excludes embedding generation (which is negligible at $0.02/MTok) and infrastructure costs. Execution times shown are averages per task.

4.6 Cost vs Effectiveness Comparison

The scatter plot below visualizes the relationship between cost and effectiveness across different agent interfaces and models. Each point represents a combination of interface type (RAG, MCP, NLWeb, HTML) and language model (GPT-4.1, GPT-5, GPT-5-mini, Claude Sonnet 4).

Cost–effectiveness insights

  • RAG offers a particularly attractive price–performance trade-off. RAG with GPT-5-mini is by far the cheapest configuration in our study (cost $0.01 with CR 0.65 and F1 0.75), while RAG with GPT-5 achieves the highest overall effectiveness (CR 0.79, F1 0.87) at a still moderate cost of $0.15.
  • Among the API-based interfaces, NLWeb with GPT-5 combines strong effectiveness (CR 0.73, F1 0.84) with relatively low cost ($0.09), whereas MCP with GPT-5 is slightly more expensive ($0.18, F1 0.82) but remains competitive in terms of performance.
  • The HTML agent (AX+MEM) is consistently less cost-efficient: even with GPT-5 it reaches lower effectiveness (F1 0.76) at a higher cost ($0.50) than the best RAG and NLWeb configurations, while Claude Sonnet 4 with HTML is both the most expensive ($1.05) and not the most effective option (F1 0.66).

Runtime patterns

  • GPT-5 configurations tend to be noticeably slower than other models for the same interface: for example, AX+MEM with GPT-5 averages 522 s per task versus 92 s with GPT-4.1, and RAG with GPT-5 takes 114 s compared to 8 s with GPT-4.1.
  • Claude Sonnet 4 and GPT-4.1 generally offer the fastest execution times across interfaces (e.g., RAG+GPT-4.1: 8 s, MCP+GPT-4.1: 11 s, NLWeb+Claude: 20 s), but GPT-4.1 in particular trades some effectiveness for speed (e.g., RAG F1 0.75 vs. 0.87 with GPT-5).
  • For many interfaces, GPT-5-mini sits between GPT-4.1 and GPT-5 in terms of runtime (e.g., RAG: 51 s vs. 8 s and 114 s; MCP: 80 s vs. 11 s and 94 s), offering moderate speed but also reduced effectiveness compared to full GPT-5.

5. Running the Benchmark

To reproduce our experiments or run the benchmarks with your own agent implementations, follow these setup instructions. The complete code and documentation are available in our GitHub repository.

Prerequisites

  • Python 3.8+: Required for all agent implementations
  • Elasticsearch 8.x: Running on http://localhost:9200
  • OpenAI API Key: For embeddings and LLM calls
  • Optional: Anthropic API key for Claude model support

Quick Start

# Clone repository
git clone https://github.com/wbsg-uni-mannheim/WebMall-Interfaces.git
cd WebMall-Interfaces

# Install dependencies  
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your API keys

# Index data (required for NLWeb and API MCP)
cd src/nlweb_mcp
python ingest_data.py --shop all --force-recreate

# Run benchmarks
cd ..
python benchmark_nlweb_mcp.py    # NLWeb interface
python benchmark_rag.py          # RAG interface  
python benchmark_api_mcp.py      # API MCP interface

For detailed setup instructions and interface-specific configuration, see the main README and individual interface documentation in the src/ directory.

6. Feedback

We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact us directly via email.
