LLM agents use different architectures and interfaces to interact with the World Wide Web. Some agents rely on traditional web browsers to navigate HTML pages originally designed for human users. Others do not directly access websites but instead retrieve web content by querying search engines that have indexed the Web. A third architectural approach assumes that websites expose site-specific Web APIs, which agents interact with via the Model Context Protocol (MCP). A fourth architecture, proposed by Microsoft under the NLWeb initiative, defines a standardized interface through which agents query individual websites and receive responses formatted as structured Schema.org data.
This page presents the results of the first experimental comparison of these four architectures, using the same set of tasks within an e-commerce scenario. The experiments were conducted across four simulated e-shops, each offering products via different interfaces. Four corresponding LLM agents — the MCP agent, RAG agent, NLWeb agent, and HTML agent — are evaluated performing the same set of 91 tasks, each using a different method for interacting with the shops.
We compare the effectiveness (success rate, F1) of the different agents in solving the tasks, which are grouped into categories such as searching for specific products, searching for the cheapest product given concrete or vague requirements, adding products to shopping carts, and finally checking out the products and paying for them by credit card. We also assess the efficiency of each architecture by measuring task runtime and token usage. The analysis of input and output tokens provides a basis for estimating the operational cost of each agent as well as its energy consumption and environmental impact.
The experiments show that the MCP, RAG, and NLWeb agents achieve task completion rates comparable to, and in many cases higher than, the HTML agent, while consuming 5 to 10 times fewer tokens. For basic tasks, the NLWeb agent achieves the highest completion rate (88% with Claude Sonnet), while the RAG agent shows strong performance across both basic and advanced tasks. All alternative interfaces demonstrate significantly lower token usage compared to the browser-based agent, with the RAG agent being particularly efficient.
The section below describes the four different architectures that we compare in our experimental study as well as the interfaces that agents and e-shops use for communication.
The HTML agent accesses the e-shops via their traditional HTML interfaces designed for human users. We employ the AX+MEM HTML agent from the WebMall benchmark for our experiments. The agent is implemented using the AgentLab library, which accompanies BrowserGym. It uses the accessibility tree (AXTree) of HTML pages as its observation space and has access to a short-term memory in which it can store relevant information at each step in order to maintain context across longer task sequences.
The HTML agent executes the following interaction loop: it observes the current page as an AXTree, reasons about the next step, optionally writes a note to its short-term memory, and executes the chosen browser action (e.g., click, type, navigate); the loop repeats until the agent considers the task solved.
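A minimal sketch of this loop in Python, assuming a BrowserGym-style environment; the helper names (`build_prompt`, `parse_response`) and the exact observation/action interfaces are illustrative, not the actual AgentLab API:

```python
# Simplified observe-think-act loop of the AX+MEM HTML agent.
# `llm` is any callable mapping a prompt string to a response string;
# `env` stands in for a BrowserGym-style browser environment.

def run_episode(llm, env, task: str, max_steps: int = 30):
    memory: list[str] = []                 # short-term memory across steps
    obs = env.reset(task)                  # initial observation (AXTree of the page)
    for _ in range(max_steps):
        prompt = build_prompt(task, obs["axtree"], memory)
        response = llm(prompt)             # LLM proposes next action + memory note
        action, note = parse_response(response)
        if note:
            memory.append(note)            # persist relevant facts for later steps
        if action["type"] == "stop":       # agent declares the task finished
            return action.get("answer")
        obs = env.step(action)             # click / type / navigate in the browser
    return None
```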
More details about the AX+MEM HTML agent can be found on the WebMall benchmark page.
The RAG agent does not directly access the e-shops but interacts with a search engine that has crawled and indexed all pages of all e-shops. Our RAG implementation uses Elasticsearch to create a unified search index containing scraped content from all four WebMall shops. Before indexing, we remove navigation elements and HTML tags from the pages using the Unstructured library. An example of a resulting JSON file can be found here. The system generates composite embeddings that combine product titles and descriptions, enabling semantic similarity search. The search engine is presented to the agent as a tool that can be called one or multiple times with differing queries to iteratively refine the results.
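As an illustration, the following sketch shows how such a composite-embedding index could be built and queried with the Elasticsearch and OpenAI Python clients. The index name, field names, and URLs are our assumptions for this example, not necessarily those used in the repository; the index is assumed to map `composite_embedding` as a `dense_vector` field:

```python
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
oai = OpenAI()

def embed(text: str) -> list[float]:
    # text-embedding-3-small is the embedding model used in our experiments
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

# Index a cleaned product page with a composite title+description embedding.
doc = {
    "shop": "webmall_1",
    "title": "Example Product",
    "content": "Cleaned page text ...",
    "url": "https://webmall-1.example/product/123",
}
doc["composite_embedding"] = embed(doc["title"] + " " + doc["content"])
es.index(index="webmall_products", document=doc)

# Semantic search: embed the query and run a KNN search over the composite field.
hits = es.search(
    index="webmall_products",
    knn={
        "field": "composite_embedding",
        "query_vector": embed("cheap mechanical keyboard"),
        "k": 10,
        "num_candidates": 100,
    },
)
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"], hit["_source"]["url"])
```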
The agent leverages the LangGraph framework to orchestrate retrieval workflows and incorporates specialized Python functions for e-commerce actions such as adding items to carts and completing checkouts. More specifically, the RAG agent is implemented as a LangGraph ReAct agent whose tools cover product search against the unified index, retrieval of detailed product information, cart management, and checkout.
The agent follows a two-phase search approach: first using lightweight searches to identify promising products, then fetching detailed information only for relevant items to minimize token usage.
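A condensed sketch of how this two-phase approach can be wired up as a LangGraph ReAct agent. The tool bodies are simplified stand-ins (`knn_search`, `fetch_document`, and `shop_add_to_cart` are assumed helpers), and the actual implementation in our repository differs in detail:

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def search_products(query: str) -> list[dict]:
    """Phase 1: lightweight semantic search returning only titles and URLs."""
    hits = knn_search(query, k=10)          # assumed helper: KNN query on the index
    return [{"title": h["title"], "url": h["url"]} for h in hits]

@tool
def get_product_details(url: str) -> dict:
    """Phase 2: fetch the full indexed document, only for promising candidates."""
    return fetch_document(url)              # assumed helper: full description, price, ...

@tool
def add_to_cart(url: str) -> str:
    """E-commerce action: add the product at the given URL to the cart."""
    return shop_add_to_cart(url)            # assumed helper wrapping the shop action

# Model identifier passed as a string; any LangChain chat model works here.
agent = create_react_agent(
    "anthropic:claude-sonnet-4-20250514",
    tools=[search_products, get_product_details, add_to_cart],
)
result = agent.invoke(
    {"messages": [("user", "Find the cheapest USB-C hub across all shops")]}
)
```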
For more details about the implementation of the RAG agent please refer to the RAG source code in our repository. The prompt used for defining the agent can be found here.
The MCP agent interacts with websites through structured APIs provided by website vendors. These APIs are exposed via the Model Context Protocol (MCP). MCP, originally proposed by Anthropic, is an open protocol designed to standardize communication between LLM applications and external tools or data sources. Instead of parsing unstructured web content, an agent (the MCP Host) connects to a dedicated MCP Server that exposes a well-defined set of tools. These tools wrap the Web API and can be implemented by the vendor of the website directly or by a third party.
In our setup, we run four independent MCP servers, one for each WebMall shop. These servers expose tools for actions like search, cart management, and checkout. However, to simulate a realistic, multi-provider environment, the Web APIs are heterogeneous: the data formats and tools exposed by each server intentionally differ. This heterogeneity forces the agent to adapt to different API responses from each shop, testing its ability to handle diverse, non-standardized data structures, which reflects the reality of integrating with multiple independent web services.
The agent leverages the same LangGraph framework as the RAG agent. It uses MCP tools exposed by each shop's server for product search, cart management, and checkout operations.
The MCP server for each shop exposes its capabilities as tools, which the agent can discover and execute. The workflow is as follows:

1. The agent connects to each shop's MCP server and discovers the available tools.
2. The agent invokes tools such as `search_products` or `add_to_cart` by sending JSON-RPC messages to the server.
3. The server executes the corresponding actions and returns the results.
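For illustration, the JSON-RPC 2.0 messages exchanged in steps 1 and 2 might look as follows (shown as Python dictionaries; the `query` argument name is illustrative, while `tools/list` and `tools/call` are the standard MCP methods):

```python
# Step 1: discovery of available tools via the standard MCP "tools/list" method.
discovery_request = {"jsonrpc": "2.0", "id": 0, "method": "tools/list"}

# Step 2: invoking a tool via the standard MCP "tools/call" method.
call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_products",                 # tool exposed by the shop's server
        "arguments": {"query": "wireless mouse"},  # argument names are illustrative
    },
}
```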
The heterogeneity of WebAPIs is evident in how different shops implement the same functionality. For example, checkout operations have completely different signatures and parameter names:
```python
# E-Store Athletes (Shop 1) checkout signature
async def checkout(
    ctx: Context,
    first_name: str,
    last_name: str,
    email: str,
    phone: str,
    address_1: str,
    city: str,
    state: str,
    postcode: str,
    country: str,
    credit_card_number: str,
    credit_card_expiry: str,
    credit_card_cvc: str,
) -> str

# TechTalk (Shop 2) checkout signature
async def checkout_cart_techtalk(
    ctx: Context,
    customer_first_name: str,
    customer_last_name: str,
    customer_email: str,
    customer_phone: str,
    shipping_street: str,
    shipping_city: str,
    shipping_state: str,
    shipping_zip: str,
    shipping_country_code: str,
    payment_card_number: str,
    card_expiration_date: str,
    card_security_code: str,
) -> str
```
The heterogeneity extends beyond function signatures to the data structures returned by the different search endpoints; e.g., each shop uses a different set of attribute names and its own product categorization hierarchy, as illustrated below.
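To give a flavor of this heterogeneity, two shops might return the same product in structurally different ways. The field names below are illustrative examples, not the exact attribute names used by our MCP servers:

```python
# Shop 1: flat attribute names, category expressed as a string path
result_shop1 = {
    "name": "Logitech MX Master 3S",
    "price": "89.99",
    "currency": "EUR",
    "category": "Electronics > Accessories > Mice",
    "product_url": "https://webmall-1.example/product/42",
}

# Shop 2: nested structure, different attribute names and category hierarchy
result_shop2 = {
    "title": "MX Master 3S Wireless Mouse",
    "pricing": {"amount": 89.99, "currency_code": "EUR"},
    "categories": ["Computer Peripherals", "Input Devices"],
    "links": {"detail": "https://webmall-2.example/items/mx-master-3s"},
}
```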
For complete implementation details, refer to the MCP server code in our repository.
The NLWeb agent interacts with websites through a standardized natural language interface provided by website vendors. The vendor must implement and host an "ask" endpoint that accepts natural language queries and returns structured responses according to the Schema.org format. NLWeb (Natural Language for Web), proposed and supported by Microsoft, provides a standardized mechanism for this interaction. It operates by leveraging existing semi-structured data, particularly Schema.org markup, to create a semantic layer over a website's content.
In our implementation, we create one dedicated Elasticsearch index per webshop that enables semantic search of that website's content. Each NLWeb server processes natural language queries by generating embeddings and performing cosine similarity search against its shop-specific index. Additionally, we create an MCP server per shop to enable other functionality like cart management and checkout operations, complementing the `ask` tool for product search.
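As a rough sketch of the interface (a full example is linked below), a query to the `ask` endpoint and one Schema.org-formatted result item could look as follows; the request and response envelopes are simplified assumptions, while the `Product`/`Offer` markup follows Schema.org conventions:

```python
# Natural language query sent to a shop's "ask" endpoint (illustrative payload)
query = {"question": "Do you have a 27-inch 4K monitor under 400 euros?"}

# One result item carrying Schema.org Product markup (envelope simplified)
response_item = {
    "@type": "Product",
    "name": "ViewSonic VX2776-4K 27-inch Monitor",
    "description": "27-inch 4K UHD IPS monitor ...",
    "url": "https://webmall-3.example/product/vx2776-4k",
    "offers": {
        "@type": "Offer",
        "price": "379.00",
        "priceCurrency": "EUR",
        "availability": "https://schema.org/InStock",
    },
}
```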
The agent leverages the same LangGraph framework as the other agents. An example of an `ask` tool call and response can be found here.
The agent's interaction with the NLWeb + MCP interface follows this workflow:

1. Product search: the agent sends natural language queries to each shop's `ask` tool and receives Schema.org-formatted results.
2. Actions: cart management and checkout operations are performed via the shop's complementary MCP tools.
The complete implementation is available in the NLWeb + MCP directory. The prompt used for the agent can be found here.
Aspect | RAG | MCP | NLWeb | HTML |
---|---|---|---|---|
**Website Infrastructure** | | | | |
Data Source | Web scraping (HTML) | API access | API access | Real-time HTML + AXTree + Screenshots |
Data Storage | Unified Elasticsearch Index | Per-Shop Elasticsearch Indices | Per-Shop Elasticsearch Indices | Browser state only |
Index Content | Unstructured Text | Structured Product Data | Structured Product Data | N/A |
Preprocessing | HTML cleaning | None required | Schema.org translation | AXTree generation |
Response Format | Document fields (title, content, url) | Heterogeneous per-shop JSON | Standardized Schema.org | Multi-modal observations (HTML/AXTree/Visual) |
**Agent Architecture** | | | | |
Search Type | Semantic (KNN on embeddings) | Semantic (KNN on embeddings) | Semantic (KNN on embeddings) | DOM/AXTree traversal + visual navigation |
Communication Protocol | Direct Python Functions | JSON-RPC via MCP | JSON-RPC via MCP | Playwright + Chrome DevTools Protocol |
Query Strategy | Multi-query generation | Multi-query possible per shop | Multi-query possible per shop | Multi-action sequences per step |
Shop Selection | Searches all shops at once | Agent selects shops | Agent selects shops | Sequential shop visits |
**Processing Details** | | | | |
Embedding Fields | Title, Content, Composite | Title, Content, Composite | Title, Content, Composite | N/A |
Embedding Model | OpenAI text-embedding-3-small | OpenAI text-embedding-3-small | OpenAI text-embedding-3-small | N/A |
Result Ranking | Cosine similarity score | Cosine similarity score | Cosine similarity score | Page order/relevance |
**Agent Capabilities** | | | | |
Search Refinement | Self-evaluation & iteration | Self-evaluation & iteration | Self-evaluation & iteration | Interactive page exploration |
Cross-shop Comparison | Native (unified index) | Sequential MCP calls | Sequential MCP calls | Sequential browsing |
Cart Management | Python tool functions | MCP tool invocation | MCP tool invocation | Browser interactions |
Checkout Process | Direct function calls | MCP tool invocation | MCP tool invocation | Form filling & submission |
To evaluate the effectiveness and efficiency of the different agents, we use the WebMall benchmark. WebMall simulates an online shopping environment with four distinct webshops, each offering around 1000 products described by heterogeneous product descriptions. The WebMall benchmark includes a diverse set of e-commerce tasks that test different agent capabilities, ranging from focused retrieval to advanced reasoning about compatible and substitutional products. These tasks are organized into two main categories based on their complexity: basic tasks and advanced tasks.
Further examples of each task type can be found on the WebMall benchmark page. The complete task set, including the solution for each task, is available in the WebMall repository.
We evaluated each agent interface on the complete WebMall benchmark task set. The evaluation compares the performance of different agent architectures: RAG Agent, MCP Agent, and NLWeb Agent. For comparison, we also include results from the strongest-performing browser-based agent on the WebMall benchmark, AX+MEM, subsequently referred to as Browser Agent.
Every agent interface is evaluated with both GPT-4.1 (gpt-4.1-2025-04-14) and Claude 4 Sonnet (claude-sonnet-4-20250514) to assess model-dependent performance variations. Note that our experiments do not utilize prompt caching. Detailed execution logs for all runs are available in our GitHub repository, organized by interface type and model. For a quick impression of agent behavior, shortened execution logs containing one successful task execution per agent interface and model are also available.
We assess performance using four metrics derived from comparing the agent's response against the ground-truth solutions: task completion rate, precision, recall, and F1.
The results are shown in the following tables, sorted by completion rate and categorized by task type. Best results per metric are highlighted in bold.
Results grouped by task complexity: Basic tasks include straightforward operations like finding specific products and simple checkout processes, while Advanced tasks require complex reasoning such as interpreting vague requirements, finding substitutes, and multi-step workflows.
The cost and execution time analysis is based on model pricing and the actual performance measured in our experiments, using token prices as of July 2025.
The total cost of running all experiments across all interfaces and models was approximately $250. This cost excludes embedding generation (which is negligible at $0.02/MTok) and infrastructure costs. Execution times shown are averages per task.
We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact us directly via email.