MCP vs RAG vs NLWeb vs HTML:
A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web

1. Introduction

LLM agents use different architectures and interfaces to interact with the World Wide Web. Some agents rely on traditional web browsers to navigate HTML pages originally designed for human users. Others do not directly access websites but instead retrieve web content by querying search engines that have indexed the Web. A third architectural approach assumes that websites expose site-specific Web APIs, which agents interact with via the Model Context Protocol (MCP). A fourth architecture, proposed by Microsoft under the NLWeb initiative, defines a standardized interface through which agents query individual websites and receive responses formatted as structured schema.org data.

This page presents the results of the first experimental comparison of these four architectures, using the same set of tasks within an e-commerce scenario. The experiments were conducted across four simulated e-shops whose products are exposed via the four different interfaces. Four corresponding LLM agents — the MCP agent, the RAG agent, the NLWeb agent, and the HTML agent — are evaluated on the same set of 91 tasks, each agent using a different method for interacting with the shops.

We compare the effectiveness (success rate, F1) of the different agents in solving the tasks, which are grouped into categories such as searching for specific products, searching for the cheapest product given concrete or vague requirements, adding products to shopping carts, and finally checking out the products and paying for them by credit card. We also assess the efficiency of each architecture by measuring task runtime and token usage. The analysis of input and output tokens provides a basis for estimating both the operational cost of each agent and its energy consumption and environmental impact.

The experiments show that the MCP, RAG, and NLWeb agents achieve task completion rates comparable to — and in many cases even higher than — those of the HTML agent, while consuming 5 to 10 times fewer tokens. For basic tasks, the NLWeb agent achieves the highest completion rate (88% with Claude Sonnet), while the RAG agent shows strong performance across both basic and advanced tasks. All alternative interfaces demonstrate significantly lower token usage compared to the browser-based agent, with the RAG agent being particularly efficient.

2. Architectures and Interfaces

This section describes the four architectures that we compare in our experimental study, as well as the interfaces that the agents and e-shops use for communication.

2.1 Browser-based Agent (HTML Agent)

The HTML agent accesses the e-shops via their traditional HTML interfaces intended for human users. We employ the AX+MEM HTML agent from the WebMall benchmark for our experiments. The agent is implemented using the AgentLab library which accompanies BrowserGym. The agent uses the accessibility tree (AXTree) of HTML pages as its observation space and has access to a short-term memory in which it can store relevant information at each step in order to maintain context across longer task sequences.

HTML Agent Workflow

The HTML agent executes the following interaction loop:

  1. Navigate to target web page using browser automation
  2. Parse accessibility tree (AXTree) to understand page content
  3. Store relevant information in short-term memory
  4. Execute action (click, type, scroll) based on task requirements
  5. Repeat interaction cycle until task completion

More details about the AX+MEM HTML agent are found on the WebMall benchmark page.
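
The following sketch illustrates this observe-think-act loop in a generic form. It is not the actual AgentLab/BrowserGym implementation; the browser and the LLM are abstracted behind callables, and all names are placeholders.

# Generic sketch of the observe-think-act loop described above; the browser
# and the LLM are abstracted as callables, so all names are placeholders
# rather than the actual AgentLab/BrowserGym API.
from typing import Callable

def run_html_agent(
    observe: Callable[[], str],                # returns the current AXTree as text
    act: Callable[[str], None],                # executes a browser action (click, type, scroll)
    decide: Callable[[str, list[str]], str],   # LLM maps (AXTree, memory) to the next action
    max_steps: int = 30,
) -> list[str]:
    memory: list[str] = []                     # short-term memory across steps
    for _ in range(max_steps):
        axtree = observe()                     # parse the accessibility tree
        memory.append(axtree[:500])            # keep a condensed note of the observation
        action = decide(axtree, memory)        # choose the next browser action
        if action == "stop":                   # the agent signals task completion
            break
        act(action)
    return memory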

2.2 Agent querying a Search Engine (RAG Agent)

The RAG agent does not directly access the e-shops but interacts with a search engine that has crawled and indexed all pages of all e-shops. Our RAG implementation uses Elasticsearch to create a unified search index containing scraped content from all four WebMall shops. Before indexing, we remove navigation elements and HTML tags from the pages using the Unstructured library. An example of a resulting JSON file can be found here. The system generates composite embeddings that combine product titles and descriptions, enabling semantic similarity search. The search engine is presented to the agent as a tool that can be called one or multiple times with differing queries to iteratively refine the results.
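
The following sketch shows how such a unified index with composite embeddings could be built and queried. It assumes Elasticsearch 8.x with a dense_vector mapping and the OpenAI Python client; the index and field names are illustrative rather than the ones used in our repository.

# Sketch of indexing and querying a unified product index with composite
# embeddings. Assumes the "webmall-products" index maps "embedding" as a
# dense_vector field; names are illustrative.
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
oai = OpenAI()

def embed(text: str) -> list[float]:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def index_product(shop: str, title: str, content: str, url: str) -> None:
    es.index(index="webmall-products", document={
        "shop": shop,
        "title": title,
        "content": content,
        "url": url,
        "embedding": embed(f"{title}\n{content}"),  # composite: title + description
    })

def search_products(query: str, k: int = 10) -> list[dict]:
    hits = es.search(index="webmall-products", knn={
        "field": "embedding",
        "query_vector": embed(query),
        "k": k,
        "num_candidates": 100,
    })["hits"]["hits"]
    # return only title and URL, as in the lightweight first search phase
    return [{"title": h["_source"]["title"], "url": h["_source"]["url"]} for h in hits]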

The agent leverages the LangGraph framework to orchestrate retrieval workflows and incorporates specialized Python functions for e-commerce actions like adding items to carts and completing checkouts. More specifically, the RAG agent is implemented as a LangGraph ReAct agent with the following available tools:

  1. search_products: Execute semantic search queries against Elasticsearch (returns title + URL for efficiency)
  2. get_product_details: Fetch detailed product information for specific URLs
  3. add_to_cart_webmall_1-4: Add products to specific shop carts
  4. checkout_webmall_1-4: Complete purchases with customer details

RAG Agent Workflow

The agent follows a two-phase search approach: first using lightweight searches to identify promising products, then fetching detailed information only for relevant items to minimize token usage.
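
A minimal sketch of how such a ReAct agent could be wired up with LangGraph is shown below. The tool bodies are stubbed, only the tools for shop 1 are listed, and the actual tool implementations and system prompt are the ones linked below.

# Sketch of a LangGraph ReAct agent with the RAG tool set; tool bodies are
# stubbed and only shop 1 is shown (shops 2-4 have analogous tools).
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_products(query: str) -> list:
    """Lightweight semantic search over the unified index; returns title + URL pairs."""
    ...

@tool
def get_product_details(url: str) -> dict:
    """Fetch the full product description for a single URL."""
    ...

@tool
def add_to_cart_webmall_1(url: str) -> str:
    """Add a product to the cart of shop 1."""
    ...

@tool
def checkout_webmall_1(customer: dict) -> str:
    """Complete the purchase in shop 1 with customer and payment details."""
    ...

agent = create_react_agent(
    ChatOpenAI(model="gpt-4.1"),  # the experiments also use Claude 4 Sonnet
    tools=[search_products, get_product_details, add_to_cart_webmall_1, checkout_webmall_1],
)
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Find the cheapest offer for product X"}]}
)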

For more details about the implementation of the RAG agent please refer to the RAG source code in our repository. The prompt used for defining the agent can be found here.

2.3 Agent querying Web APIs via MCP (MCP Agent)

The MCP agent interacts with websites through structured APIs provided by website vendors. These APIs are exposed via the Model Context Protocol (MCP). MCP, originally proposed by Anthropic, is an open protocol designed to standardize communication between LLM applications and external tools or data sources. Instead of parsing unstructured web content, an agent (the MCP Host) connects to a dedicated MCP Server that exposes a well-defined set of tools. These tools wrap the WebAPI and can be implemented by the vendor of the website directly or by a third party.

In our setup, we run four independent MCP servers, one for each WebMall shop. These servers expose tools for actions like search, cart management, and checkout. However, to simulate a realistic, multi-provider environment, the WebAPIs are heterogeneous: the data formats and tool definitions exposed by each server are intentionally different. This heterogeneity forces the agent to adapt to different API responses from each shop, testing its ability to handle diverse, non-standardized data structures, which reflects the reality of integrating with multiple independent web services.

The agent leverages the same LangGraph framework as the RAG agent. It uses MCP tools exposed by each shop's server for product search, cart management, and checkout operations.

MCP Agent Workflow

The MCP server for each shop exposes its capabilities as tools, which the agent can discover and execute. The workflow is as follows (a minimal client-side sketch is shown after the list):

  1. Connection and Discovery: The agent, acting as an MCP Host, establishes a connection with the MCP Server for a specific shop and discovers the available tools through the protocol's capability negotiation.
  2. Tool Execution: The agent invokes tools like search_products or add_to_cart by sending JSON-RPC messages to the server. The server executes the corresponding actions.
  3. Response Handling: The agent receives a structured but potentially heterogeneous JSON response.
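
The sketch below illustrates this connect, discover, and call flow using the official MCP Python SDK over a stdio transport. The server command, tool name, and arguments are illustrative; the actual agent accesses these tools through the LangGraph framework described above.

# Sketch of the connect -> discover -> call flow against one shop's MCP
# server using the official MCP Python SDK; names and arguments are
# illustrative.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="python", args=["shop1_mcp_server.py"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()                       # capability negotiation
            tools = await session.list_tools()               # 1. discovery
            print([t.name for t in tools.tools])
            result = await session.call_tool(                # 2. tool execution
                "search_products", arguments={"query": "gaming laptop"}
            )
            print(result.content)                            # 3. structured (heterogeneous) response

asyncio.run(main())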

The heterogeneity of WebAPIs is evident in how different shops implement the same functionality. For example, checkout operations have completely different signatures and parameter names:

# E-Store Athletes (Shop 1) checkout signature
async def checkout(
    ctx: Context,
    first_name: str,
    last_name: str,
    email: str,
    phone: str,
    address_1: str,
    city: str,
    state: str,
    postcode: str,
    country: str,
    credit_card_number: str,
    credit_card_expiry: str,
    credit_card_cvc: str
) -> str

# TechTalk (Shop 2) checkout signature
async def checkout_cart_techtalk(
    ctx: Context,
    customer_first_name: str,
    customer_last_name: str,
    customer_email: str,
    customer_phone: str,
    shipping_street: str,
    shipping_city: str,
    shipping_state: str,
    shipping_zip: str,
    shipping_country_code: str,
    payment_card_number: str,
    card_expiration_date: str,
    card_security_code: str
) -> str

The heterogeneity extends beyond function signatures to the data structures returned by the different search endpoints: for example, each shop uses a different set of attribute names and its own product categorization hierarchy.
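
To make this concrete, the following illustrative (not verbatim) example shows how two shops might describe the same offer with different attribute names and category hierarchies; the actual field names can be found in the server code linked below.

# Illustrative example of heterogeneous search results for the same product;
# attribute names, values, and URLs are made up for this sketch.
shop1_hit = {
    "product_name": "UltraBook 14",
    "price_eur": 899.00,
    "category_path": "Computers > Notebooks",
    "product_url": "https://shop1.example/product/ultrabook-14",
}
shop2_hit = {
    "title": "UltraBook 14",
    "price": {"amount": 879.00, "currency": "EUR"},
    "categories": ["Electronics", "Laptops"],
    "link": "https://shop2.example/shop/ultrabook-14",
}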

For complete implementation details, refer to the MCP server code in our repository.

2.4 Agent querying NLWeb Sites (NLWeb Agent)

The NLWeb agent interacts with websites through a standardized natural language interface provided by website vendors. The vendor must implement and host an "ask" endpoint that accepts natural language queries and returns structured responses according to the Schema.org format. NLWeb (Natural Language for Web), proposed and supported by Microsoft, provides a standardized mechanism for this interaction. It operates by leveraging existing semi-structured data, particularly Schema.org markup, to create a semantic layer over a website's content.

In our implementation, we create one dedicated Elasticsearch index per webshop that enables semantic search of that website's content. Each NLWeb server processes natural language queries by generating embeddings and performing cosine similarity search against its shop-specific index. Additionally, we create an MCP server per shop to enable other functionality like cart management and checkout operations, complementing the ask tool for product search.
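
The sketch below shows what such a per-shop ask handler could look like: the query is embedded, a cosine-similarity kNN search is run against the shop-specific index, and the hits are wrapped as Schema.org Product items. It assumes Elasticsearch 8.x and the OpenAI embedding client; the index layout, field names, and exact Schema.org wrapping are illustrative.

# Sketch of a per-shop "ask" handler: embed the query, run kNN search on the
# shop-specific index, and return Schema.org-shaped product items.
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
oai = OpenAI()

def ask(shop_index: str, query: str, k: int = 10) -> list[dict]:
    query_vector = oai.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    hits = es.search(index=shop_index, knn={
        "field": "embedding",
        "query_vector": query_vector,
        "k": k,
        "num_candidates": 100,
    })["hits"]["hits"]
    return [{
        "@type": "Product",
        "name": h["_source"]["title"],
        "url": h["_source"]["url"],
        "offers": {"@type": "Offer", "price": h["_source"].get("price")},
    } for h in hits]

# e.g. ask("webmall-shop-1", "laptops under $1000 with 16GB RAM")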

The agent leverages the same LangGraph framework as the other agents. An example of an ask tool call and response can be found here.

NLWeb Agent Workflow

The agent's interaction with the NLWeb + MCP interface follows this workflow:

  1. Connection and Discovery: The agent connects to the NLWeb-enabled MCP server for a specific shop and discovers the available tools.
  2. Natural Language Query: The agent sends a natural language query (e.g., "laptops under $1000 with 16GB RAM") to the server by invoking the ask tool.
  3. Semantic Search Execution: The server generates an embedding from the query and performs a cosine similarity search against the pre-computed vectors in its dedicated Elasticsearch index.
  4. Standardized Response: The agent receives a list of products in the standardized Schema.org JSON format.

The complete implementation is available in the NLWeb + MCP directory. The prompt used for the agent can be found here.

2.5 Comparison of the Different Architectures

Aspect | RAG | MCP | NLWeb | HTML

Website Infrastructure
Data Source | Web scraping (HTML) | API access | API access | Real-time HTML + AXTree + Screenshots
Data Storage | Unified Elasticsearch Index | Per-Shop Elasticsearch Indices | Per-Shop Elasticsearch Indices | Browser state only
Index Content | Unstructured Text | Structured Product Data | Structured Product Data | N/A
Preprocessing | HTML cleaning | None required | Schema.org translation | AXTree generation
Response Format | Document fields (title, content, url) | Heterogeneous per-shop JSON | Standardized Schema.org | Multi-modal observations (HTML/AXTree/Visual)

Agent Architecture
Search Type | Semantic (KNN on embeddings) | Semantic (KNN on embeddings) | Semantic (KNN on embeddings) | DOM/AXTree traversal + visual navigation
Communication Protocol | Direct Python Functions | JSON-RPC via MCP | JSON-RPC via MCP | Playwright + Chrome DevTools Protocol
Query Strategy | Multi-query generation | Multi-query possible per shop | Multi-query possible per shop | Multi-action sequences per step
Shop Selection | Searches all shops at once | Agent selects shops | Agent selects shops | Sequential shop visits

Processing Details
Embedding Fields | Title, Content, Composite | Title, Content, Composite | Title, Content, Composite | N/A
Embedding Model | OpenAI text-embedding-3-small | OpenAI text-embedding-3-small | OpenAI text-embedding-3-small | N/A
Result Ranking | Cosine similarity score | Cosine similarity score | Cosine similarity score | Page order/relevance

Agent Capabilities
Search Refinement | Self-evaluation & iteration | Self-evaluation & iteration | Self-evaluation & iteration | Interactive page exploration
Cross-shop Comparison | Native (unified index) | Sequential MCP calls | Sequential MCP calls | Sequential browsing
Cart Management | Python tool functions | MCP tool invocation | MCP tool invocation | Browser interactions
Checkout Process | Direct function calls | MCP tool invocation | MCP tool invocation | Form filling & submission

3. Use Case: Online Shopping

To evaluate the effectiveness and efficiency of the different agents, we use the WebMall benchmark. WebMall simulates an online shopping environment with four distinct webshops, each offering around 1000 products with heterogeneous product descriptions. The WebMall benchmark includes a diverse set of e-commerce tasks that test different agent capabilities, ranging from focused retrieval to advanced reasoning about compatible and substitutional products. These tasks are organized into two main categories based on their complexity:

Basic Tasks

  • Find Specific Product (12 tasks): Locate a particular product by name or model number
  • Find Cheapest Offer (10 tasks): Identify the lowest-priced option for a specific product across shops
  • Products Fulfilling Specific Requirements (11 tasks): Find products matching precise technical specifications
  • Add to Cart (7 tasks): Add selected products to the shopping cart
  • Checkout (8 tasks): Complete the purchase process with payment and shipping information

Advanced Tasks

  • Cheapest Offer with Specific Requirements (10 tasks): Find the most affordable product meeting detailed criteria
  • Products Satisfying Vague Requirements (8 tasks): Interpret and fulfill imprecise or subjective requirements
  • Cheapest Offer with Vague Requirements (6 tasks): Combine price optimization with fuzzy requirement matching
  • Find Substitutes (6 tasks): Identify alternative products when the requested item is unavailable
  • Find Compatible Products (5 tasks): Locate accessories or components compatible with a given product
  • End To End (8 tasks): Complete full shopping workflows from search to checkout

Further examples of each task type are found on the WebMall benchmark page. The complete task set including the solution for each task is found in the WebMall repository.

4. Experimental Results

We evaluated each agent interface on the complete WebMall benchmark task set. The evaluation compares the performance of different agent architectures: RAG Agent, MCP Agent, and NLWeb Agent. For comparison, we also include results from the strongest-performing browser-based agent on the WebMall benchmark, AX+MEM, subsequently referred to as Browser Agent.

Every agent interface is evaluated with both GPT-4.1 (gpt-4.1-2025-04-14) and Claude 4 Sonnet (claude-sonnet-4-20250514) to assess model-dependent performance variations. Note that our experiments do not utilize prompt caching. Detailed execution logs for all runs are available in our GitHub repository, organized by interface type and model. For a quick understanding of agent behavior, shortened execution logs containing one successful task execution per agent interface and model are also available.

4.1 Evaluation Metrics

We assess performance using four metrics derived from comparing the agent's response against the ground-truth solutions (a short computation sketch follows the list):

  • Task Completion Rate: Binary metric (0 or 1) measuring exact task completion. An agent achieves 1.0 only if no elements are missing and no additional elements are returned.
  • Precision: Fraction of agent-returned URLs that are correct (intersection ÷ total URLs returned). Higher precision means fewer incorrect products.
  • Recall: Fraction of correct URLs that the agent found (intersection ÷ total correct URLs). Higher recall means fewer missing products.
  • F1 Score: Harmonic mean of precision and recall.
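
The following snippet sketches how these metrics can be computed from the set of URLs returned by the agent and the set of ground-truth URLs for a task:

# Compute task completion, precision, recall, and F1 from the returned and
# ground-truth URL sets of a single task.
def evaluate(returned: set[str], correct: set[str]) -> dict[str, float]:
    hits = returned & correct
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    completion = 1.0 if returned == correct else 0.0   # exact match: nothing missing, nothing extra
    return {"task_completion": completion, "precision": precision, "recall": recall, "f1": f1}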

4.2 Performance by Task Type

The results are shown in the following tables, sorted by completion rate and categorized by task type. Best results per metric are highlighted in bold.

4.3 Performance by Category

Results grouped by task complexity: Basic tasks include straightforward operations like finding specific products and simple checkout processes, while Advanced tasks require complex reasoning such as interpreting vague requirements, finding substitutes, and multi-step workflows.

4.4 Cost & Runtime Analysis

Cost and execution time analysis is based on model pricing and the actual token counts and runtimes from our experiments. Token prices are as of July 2025:

The total cost of running all experiments across all interfaces and models was approximately $250. This cost excludes embedding generation (which is negligible at $0.02/MTok) and infrastructure costs. Execution times shown are averages per task.
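
As a sketch of how such cost estimates are derived from token counts, the snippet below multiplies input and output tokens by per-million-token prices; the price values shown are placeholders for illustration, not the exact July 2025 list prices.

# Estimate the cost of a task run from its token counts; the prices below are
# placeholder values, not the actual list prices used in the analysis.
PRICES_PER_MTOK = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},           # assumed example prices (USD)
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},  # assumed example prices (USD)
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_MTOK[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]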

Key Findings

  • Performance: Specialized interfaces match or exceed browser-based performance, with NLWeb achieving an 88% completion rate on basic tasks.
  • Efficiency: Alternative interfaces use 5-10x fewer tokens than browser-based agents, translating to significant cost savings.
  • Speed: RAG and API-based agents complete tasks faster due to direct data access without page navigation overhead.

7. Feedback

We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact us directly via email.
