LLM agents use different architectures and interfaces to interact with the World Wide Web. Some agents rely on traditional web browsers to navigate HTML pages originally designed for human users. Others do not directly access websites but instead retrieve web content by querying search engines that have indexed the Web. A third architectural approach assumes that websites expose site-specific Web APIs, which agents interact with via the Model Context Protocol (MCP). A fourth architecture, proposed by Microsoft under the NLWeb initiative, defines a standardized interface through which agents query individual websites and receive responses formatted as structured Schema.org data.
This page presents the results of the first experimental comparison of these four architectures, using the same set of tasks within an e-commerce scenario. The experiments were conducted across four simulated webshops, each offering products via different interfaces. Four corresponding LLM agents — the MCP Agent, RAG Agent, NLWeb Agent, and HTML Agent — are evaluated on the same set of 91 tasks, each agent using a different method for interacting with the shops.
We compare the effectiveness (success rate, F1) of the different agents in solving the tasks, which are grouped into categories such as searching for specific products, searching for the cheapest product given concrete or vague requirements, adding products to shopping carts, and finally checking out the products and paying for them by credit card. We also assess the efficiency of each architecture by measuring task runtime and token usage. The analysis of input and output tokens provides a basis for estimating the operational cost of each agent as well as its energy consumption and environmental impact.
In our WebMall experiments, the MCP, RAG, and NLWeb Agents achieve comparable — and in many cases higher — task completion rates than the HTML Agent (AX+MEM), while using 5–10× fewer tokens. On basic tasks, the NLWeb Agent achieves the highest completion rate (up to 88% with Claude 4 Sonnet), and the RAG Agent shows competitive effectiveness across both basic and advanced tasks. All three alternatives use significantly fewer tokens than the HTML Agent.
The section below describes the four different architectures that we compare in our experimental study as well as the interfaces that agents and Webshops use for communication.
The HTML Agent accesses the Webshops via their traditional HTML interfaces intended for human use. We employ the AX+MEM HTML Agent from the WebMall benchmark for our experiments. The agent is implemented using the AgentLab library, which accompanies BrowserGym. The agent uses the accessibility tree (AXTree) of HTML pages as its observation space and has access to a short-term memory, which it can use to store relevant information at each step in order to maintain context across longer task sequences.
The HTML agent executes an interaction loop in which it observes the AXTree of the current page, optionally stores relevant information in its memory, and then selects and executes the next browser action until it considers the task solved.
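As a rough illustration of this loop (not the actual AgentLab implementation; all names below are placeholders):

```python
# Illustrative sketch of the observe-remember-act loop of the AX+MEM agent.
# The helper names (browser, choose_next_step, Decision) are placeholders and
# do not correspond to the actual AgentLab/BrowserGym classes.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str          # next browser action, e.g. "click('link_42')"
    note: str = ""       # optional information to store in short-term memory
    answer: str = ""     # final answer once the agent decides to stop

def run_html_agent(browser, choose_next_step, max_steps: int = 30):
    memory: list[str] = []
    axtree = browser.observe()                      # accessibility tree of the current page
    for _ in range(max_steps):
        decision: Decision = choose_next_step(axtree, memory)   # LLM call
        if decision.note:
            memory.append(decision.note)            # keep context across long task sequences
        if decision.action == "stop":
            return decision.answer
        axtree = browser.execute(decision.action)   # act, then observe the new page
    return None
```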
More details about the AX+MEM HTML Agent are on the WebMall benchmark page.
The RAG Agent does not directly access the Webshops but interacts with a search engine that has crawled and indexed all pages of all Webshops. Our RAG implementation uses Elasticsearch to create a unified search index containing scraped content from all four WebMall shops. Before indexing, we remove navigation elements and HTML tags from the pages using the Unstructured library. An example of a resulting JSON file can be found here. The system generates composite embeddings that combine product titles and descriptions, enabling semantic similarity search. The search engine is presented to the agent as a tool that can be called one or multiple times with differing queries to iteratively refine the results.
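The sketch below illustrates this kind of setup with the Elasticsearch Python client, assuming Elasticsearch 8.x; the index name, field names, vector dimensions, and embedding function are illustrative and do not reproduce our actual configuration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Illustrative index with a dense_vector field for the composite embedding.
es.indices.create(
    index="webmall_products",
    mappings={
        "properties": {
            "shop": {"type": "keyword"},
            "title": {"type": "text"},
            "description": {"type": "text"},
            # Composite embedding of title + description, compared via cosine similarity.
            "embedding": {"type": "dense_vector", "dims": 1536, "index": True, "similarity": "cosine"},
        }
    },
)

def embed(text: str) -> list[float]:
    """Stand-in for the embedding model that encodes title + description."""
    import hashlib, random
    random.seed(int(hashlib.md5(text.encode()).hexdigest(), 16))
    return [random.random() for _ in range(1536)]

product = {
    "shop": "webmall-1",
    "title": "USB-C Hub 7-in-1",
    "description": "Aluminium hub with HDMI, USB 3.0 ports and an SD card reader.",
}
product["embedding"] = embed(product["title"] + " " + product["description"])
es.index(index="webmall_products", document=product, refresh=True)

# Semantic similarity search, exposed to the agent as a callable tool.
hits = es.search(
    index="webmall_products",
    knn={"field": "embedding", "query_vector": embed("usb-c hub with hdmi"),
         "k": 5, "num_candidates": 50},
)
print([h["_source"]["title"] for h in hits["hits"]["hits"]])
```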
The agent leverages the LangGraph framework to orchestrate retrieval workflows and incorporates specialized Python functions for e-commerce actions like adding items to carts and completing checkouts. More specifically, the RAG agent is implemented as a LangGraph ReAct agent whose tools cover product search, retrieval of detailed product information, cart management, and checkout.
The agent follows a two-phase search approach: first using lightweight searches to identify promising products, then fetching detailed information only for relevant items to minimize token usage.
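The sketch below illustrates this two-phase pattern as a LangGraph ReAct agent; the tool names and bodies are invented for illustration and do not mirror the exact tools defined in our repository:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_offers(query: str) -> str:
    """Lightweight search: return titles, URLs and prices of candidate products."""
    return "[]"  # illustrative stub; the real tool queries the shared search index

@tool
def get_offer_details(url: str) -> str:
    """Fetch the full description of a single product, called only for promising hits."""
    return "{}"

@tool
def add_to_cart(url: str) -> str:
    """Add a product to the shopping cart of the corresponding shop."""
    return "ok"

@tool
def checkout(name: str, address: str, credit_card: str) -> str:
    """Complete the purchase of the items currently in the cart."""
    return "ok"

# ReAct loop: the model decides which tool to call next and when to stop.
agent = create_react_agent(
    ChatOpenAI(model="gpt-4.1"),
    tools=[search_offers, get_offer_details, add_to_cart, checkout],
)
result = agent.invoke({"messages": [("user", "Find the cheapest USB-C hub across all shops.")]})
print(result["messages"][-1].content)
```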
For more details about the implementation of the RAG agent, please refer to the RAG source code in our repository. The prompt used for defining the agent can be found here.
This architecture assumes that the e-shops provide proprietary APIs together with MCP descriptions of these APIs. The agent interacts with the e-shops by calling shop-specific functions. These APIs are exposed via the Model Context Protocol (MCP), originally proposed by Anthropic. MCP is an open protocol designed to standardize communication between LLM applications and external tools or data sources. Instead of parsing unstructured web content, an agent (the MCP Host) connects to a dedicated MCP Server that exposes a well-defined set of tools (functions).
In our setup, we run four independent MCP servers, one for each WebMall shop. These servers expose tools for actions like search, cart management, and checkout. However, to simulate a realistic, multi-shop environment, the Web APIs are heterogeneous: the tools and data formats exposed by each server are intentionally different. This heterogeneity forces the agent to adapt to different API responses from each shop, testing its ability to handle diverse, non-standardized data structures, which reflects the reality of integrating with multiple independent web APIs.
The MCP agent is implemented using the same LangGraph framework as the RAG agent. It uses MCP tools exposed by each shop's server for product search, cart management, and checkout operations.
The MCP server for each shop exposes its capabilities as tools, which the agent can discover and execute. The workflow is as follows:

1. The agent discovers the tools exposed by each shop's MCP server.
2. It invokes tools such as `search_products` or `add_to_cart` by sending JSON-RPC messages to the server.
3. The server executes the corresponding actions.
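For illustration, the JSON-RPC messages for discovering tools and for calling `search_products` could look as follows under the MCP specification (the argument values are made up):

```python
# Tool discovery: ask the server which tools it exposes.
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Tool invocation: call search_products with illustrative arguments.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_products",
        "arguments": {"query": "usb-c hub", "per_page": 10, "include_descriptions": False},
    },
}
```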
The example below illustrates the heterogeneity of the Web APIs by comparing the search methods of two shops: `search_products` and `find_items_techtalk` use different parameter names, and only the second method offers the possibility to sort by price.
```python
# E-Store Athletes (Shop 1) search signature (no sorting)
async def search_products(
    ctx: Context,
    query: str,
    per_page: int = 10,
    page: int = 1,
    include_descriptions: bool = False,
) -> str:
    ...

# TechTalk (Shop 2) search signature (supports sorting)
async def find_items_techtalk(
    ctx: Context,
    query: str,
    limit: int = 5,
    page_num: int = 1,
    sort_by_price: str = "none",
    include_descriptions: bool = False,
) -> str:
    ...
```
The heterogeneity extends beyond function signatures to the data structures returned by the different search endpoints: for example, each shop uses a different set of attribute names and its own product categorization hierarchy.
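To make this concrete, the two snippets below sketch how a search hit could differ between shops; all attribute names and values are invented for illustration and do not reproduce the actual WebMall data:

```python
# Hypothetical hit returned by E-Store Athletes
athletes_hit = {
    "product_id": 1842,
    "title": "GPS Sports Watch S5",
    "price": 129.99,
    "category_path": "Wearables > Watches > GPS",
}

# Hypothetical hit returned by TechTalk for the same product
techtalk_hit = {
    "sku": "TT-30917",
    "name": "GPS Sports Watch S5",
    "price_eur": "129.99",
    "categories": ["Electronics", "Smartwatches"],
}
```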
For complete implementation details, refer to the MCP server code in our repository.
The NLWeb Agent interacts with the e-shops through a standardized natural language interface that needs to be offered by all NLWeb sites. Each e-shop must implement an "ask" endpoint that accepts natural language queries and returns structured responses in the schema.org format. NLWeb (Natural Language for Web), proposed and supported by Microsoft, provides a standardized mechanism for this interaction. It leverages existing semi-structured formats, particularly schema.org and RSS, to create a semantic layer over a site's content.
In our implementation, we create one dedicated Elasticsearch index per e-shop (NLWeb Site) that enables semantic search of that site's content. Each NLWeb server processes natural language queries by generating embeddings and performing a cosine similarity search against its shop-specific index. Additionally, we create an MCP server per shop to enable other functionality like cart management and checkout operations, complementing the `ask` tool for product search.

The NLWeb Agent uses the same LangGraph framework as the other agents. An example of an `ask` tool call and response can be found here.
The agent's interaction with the NLWeb + MCP interface follows a similar workflow: product search is performed via each shop's `ask` tool, while cart management and checkout are handled through the accompanying MCP tools.
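As a rough sketch of such an interaction (the linked example above shows the actual format), an `ask` call and one result item carrying schema.org data might look like this; field names and values are illustrative:

```python
# Hypothetical ask invocation via the shop's NLWeb endpoint / MCP tool.
ask_request = {
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {"name": "ask", "arguments": {"query": "wireless mouse under 30 euro"}},
}

# Sketch of one result item with schema.org product data.
ask_result_item = {
    "url": "https://webmall-2.example/product/wireless-mouse-m330",
    "score": 0.87,
    "schema_object": {
        "@type": "Product",
        "name": "Wireless Mouse M330",
        "description": "Silent 2.4 GHz wireless mouse with USB receiver.",
        "offers": {"@type": "Offer", "price": "24.99", "priceCurrency": "EUR"},
    },
}
```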
The complete implementation is available in the NLWeb + MCP directory. The prompt used for the agent can be found here.
The table below compares the four architectures along two dimensions: the features offered by the e-shops and the functionality implemented by the agents.
| Aspect | RAG | MCP | NLWeb | HTML |
|---|---|---|---|---|
| E-Shops | | | | |
| Interface Offered | HTML pages | Proprietary APIs | Standardized API | HTML pages + search box |
| Search Functionality | Not used, as shops are crawled | Search structured data | Search structured data | Search free text |
| Search Response | N/A | Heterogeneous JSON | schema.org JSON | HTML result list |
| Agent | | | | |
| Communication Protocol | Direct calls to search engine | JSON-RPC via MCP | JSON-RPC via MCP | HTML over HTTP |
| Query Strategy | Multi-query search engine | Multi-query per shop | Multi-query per shop | Site search and browsing |
| Search Refinement | Self-evaluation & iteration | Self-evaluation & iteration | Self-evaluation & iteration | Interactive page exploration |
| Cart Management | Direct function calls | MCP tool invocation | MCP tool invocation | Browser interactions |
| Checkout Process | Direct function calls | MCP tool invocation | MCP tool invocation | Form filling & submission |
To evaluate the effectiveness and efficiency of the different agents, we use the WebMall benchmark. WebMall simulates an online shopping environment with four distinct webshops, each offering around 1,000 products with heterogeneous product descriptions. The WebMall benchmark includes a diverse set of e-commerce tasks that test different agent capabilities, ranging from focused retrieval to advanced reasoning about compatible and substitutional products. These tasks are organized into two main categories based on their complexity: Basic tasks and Advanced tasks.
Further examples of each task type can be found on the WebMall benchmark page. The complete task set, including the solution for each task, is available in the WebMall repository.
We evaluated each agent on the complete WebMall task set. The evaluation compares the effectiveness of different agent architectures: RAG Agent, MCP Agent, and NLWeb Agent. For comparison, we also include results from the strongest HTML Agent configuration from the original WebMall benchmark, AX+MEM.
Every agent architecture is evaluated in combination with both GPT-4.1 (gpt-4.1-2025-04-14) and Claude 4 Sonnet (claude-sonnet-4-20250514) to assess variations in effectiveness across models. Note that our experiments do not utilize prompt caching. Detailed execution logs for all runs are available in our GitHub repository, organized by interface type and model. To give a quick impression of agent behavior, shortened execution logs containing one successful task execution per agent and model are also available.
We assess effectiveness using four metrics, including task completion rate and F1, which are derived by comparing the agent's response against the ground-truth solutions.
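As a generic illustration of such set-based metrics (not the benchmark's actual evaluation code), precision, recall, and F1 over the returned product URLs can be computed as follows:

```python
def precision_recall_f1(predicted: set[str], ground_truth: set[str]) -> tuple[float, float, float]:
    """Compare the URLs returned by the agent with the ground-truth URLs of a task."""
    if not predicted or not ground_truth:
        return 0.0, 0.0, 0.0
    tp = len(predicted & ground_truth)          # correctly returned products
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```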
The results are shown in the following tables, sorted by completion rate and categorized by task type. Best results per metric are highlighted in bold.
Results grouped by task complexity: Basic tasks include straightforward operations like finding specific products and simple checkout processes, while Advanced tasks require more complex reasoning, such as interpreting vague requirements and finding substitute or compatible products.
The cost and execution time analysis is based on the models' token pricing (as of July 2025) and the runtime results from our experiments.
The total cost of running all experiments across all interfaces and models was approximately $250. This cost excludes embedding generation (which is negligible at $0.02/MTok) and infrastructure costs. Execution times shown are averages per task.
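As a simple illustration of how these estimates are derived from the measured token counts (prices are passed in per million tokens; no specific prices are assumed here):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Estimate the API cost of a run from its token counts and the model's per-MTok prices."""
    return (input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok) / 1_000_000
```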
The scatter plot below visualizes the relationship between cost and effectiveness across different agent interfaces and models. Each point represents a combination of interface type (RAG, MCP, NLWeb, HTML) and language model (GPT-4.1, Claude 4 Sonnet).
To reproduce our experiments or run the benchmarks with your own agent implementations, follow these setup instructions. The complete code and documentation are available in our GitHub repository.
A running Elasticsearch instance, reachable at http://localhost:9200, is required for indexing and search.
```bash
# Clone repository
git clone https://github.com/wbsg-uni-mannheim/WebMall-Interfaces.git
cd WebMall-Interfaces

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your API keys

# Index data (required for NLWeb and API MCP)
cd src/nlweb_mcp
python ingest_data.py --shop all --force-recreate

# Run benchmarks
cd ..
python benchmark_nlweb_mcp.py   # NLWeb interface
python benchmark_rag.py         # RAG interface
python benchmark_api_mcp.py     # API MCP interface
```
For detailed setup instructions and interface-specific configuration, see the main README and individual interface documentation in the src/ directory.
We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact us directly via email.