Introduction

This page presents the WebMall benchmark for evaluating the capability of Web agents to find and compare product offers from multiple e-shops. The benchmark features two task sets: a basic set covering tasks such as searching for and comparing offers, adding offers to the shopping cart, and finally checking out the selected offers, and an advanced set containing searches with vague requirements as well as searches for compatible products or cheaper substitute products.

Given a user task, the agent is asked to visit four e-shops that expose heterogeneous product offers via differing user interfaces. WebMall differs from existing e-commerce benchmarks, such as WebShop, WebArena, or Mind2Web, by (1) requiring the agent to visit multiple e-shops, (2) featuring product offers from different real-world sources, and (3) containing advanced tasks such as finding compatible or substitute products.

News

2024-05-06: Initial release of the WebMall benchmark (version 0.7).

Screencasts

The following screencasts show an agent performing different tasks across the four shops.

Agent finding the cheapest product that meets specific technical requirements.

Advanced task showing agent interpreting vague user requirements to find suitable products.

Advanced task demonstrating agent finding substitute products for a given item.

Demo showing an agent completing the checkout process for selected products.

The WebMall Task Set

The WebMall benchmark covers 84 tasks distributed over 10 task categories. 9 of the 10 task categories require the agents to visit four different webshops to find relevant product offers; 3 of these 9 additionally require comparing product prices across shops in order to find the cheapest offer(s). The task categories are grouped into a basic task set and an advanced task set.

The basic task set covers tasks that typically need to be handled when buying a product online. It contains 5 of the 10 task categories: searching for offers for a specific product, finding the cheapest offers, locating product offers that meet specific requirements, adding products to the cart, and completing the checkout process. The basic set contains 48 tasks.

The advanced task set covers more difficult product search and comparison tasks: searching for product offers given only vague requirements from the user, finding the cheapest offer that fits specific or vague requirements, finding cheaper substitutes for a specific product, and finding compatible products, such as a cable for connecting a monitor to a computer. The advanced set contains 36 tasks.

Each task is defined by a task-specific instruction to the agent in string format as well as, where the task requires answers, the expected answers (the URLs of the correct product offers). The instruction to the agent consists of a general part, which is identical for all tasks, and a task-specific part. The general part contains links to the four webshops as well as instructions on how to submit the final solution after completing a task. An example of a complete instruction string can be found here.
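For illustration, such a task definition could look as follows. All field names, URLs, and the wording of the general part in this sketch are hypothetical placeholders; the linked example shows the actual format.

```python
# Hypothetical sketch of a WebMall task definition. Field names, URLs,
# and the wording of the general part are illustrative placeholders.

SHOP_URLS = [
    "http://shop1.example",  # E-Store Athletics
    "http://shop2.example",  # TechTalk
    "http://shop3.example",
    "http://shop4.example",
]

GENERAL_PART = (
    "You can use the following four webshops: "
    + ", ".join(SHOP_URLS)
    + ". After completing the task, submit your final solution as instructed."
)

task = {
    "category": "Single_Product_Search",
    "task_instruction": "Find all offers for the Asus ROG Ryujin II ARGB 360mm Liquid CPU Cooler.",
    # Expected answers: URLs of the correct product offers.
    "expected_urls": [
        "http://shop1.example/product/asus-rog-ryujin-ii-argb-360",
        "http://shop3.example/product/asus-rog-ryujin-ii-argb-360",
    ],
}

def full_instruction(task: dict) -> str:
    """Concatenate the shared general part with the task-specific part."""
    return GENERAL_PART + "\n\n" + task["task_instruction"]

print(full_instruction(task))
```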

The table below gives an overview of the 10 task categories and includes an example task from each category. A list containing all 84 tasks of the WebMall benchmark is provided here.

Task Categories Overview

| Task Category | Count | Example |
|---|---|---|
| Basic Task Set | | |
| Single Product Search | 12 | Find all offers for the Asus ROG Ryujin II ARGB 360mm Liquid CPU Cooler. |
| Cheapest Product Search | 10 | Find the cheapest offer for the Netac Z Slim 1TB M.2 External SSD. If there are multiple offers having the same cheapest price, return all of them. |
| Best Fit Specific Requirements | 11 | Find all offers for a DDR4 2x16 GB RAM Kit with 3000MHz. |
| Add to Cart | 7 | Find all offers for the GameMax Iceburg 360mm ARGB Liquid CPU Cooler and add each of them to the respective shopping cart of the shop where you found the offer. |
| Checkout | 8 | Add the product on page {{Item_URL}} to the shopping cart and complete the checkout process. Pay via credit card using the following information: {{credit_card_info}} |
| Advanced Task Set | | |
| Cheapest Best Fit Specific Requirements | 10 | Find the cheapest offer for a liquid CPU cooler with a 420mm radiator. If there are multiple offers having the same cheapest price, return all of them. |
| Best Fit Vague Requirements | 9 | Find an offer for a GPU which is best suited for training and running large language models on a local workstation computer. |
| Cheapest Best Fit Vague Requirements | 6 | Find the cheapest offer for a gaming monitor which is suitable for high-level reaction-based e-sports gaming. If there are multiple offers having the same cheapest price, return all of them. |
| Find Substitutes | 6 | Find cheaper alternatives with at least the same capacity and speed for this product: {{Item_URL}}. If there are multiple offers having the same cheapest price, return all of them. |
| Find Compatible Products | 5 | Find compatible CPUs for this item: {{Item_URL}}. |

The WebMall Shops

The WebMall benchmark asks agents to search for products in four distinct webshops which provide heterogeneous user interfaces. The webshops are implemented using the WordPress plugin WooCommerce and can be hosted via Docker, either locally or on a remote machine.

Screenshot Shop 1: E-Store Athletics

Screenshot Shop 2: TechTalk

Each shop contains heterogeneous product offers originating from a wide set of real-world e-shops that annotate product offers within their pages using the schema.org vocabulary. The product offers were extracted from the October 2024 version of the Common Crawl by the Web Data Commons project [Brinkmann2023].

The four WebMall shops contain a total of 1,088 product offers distributed across four main categories: Internal PC Components, PC Peripherals, PC Cases, and Devices. The distribution varies across shops to create diverse shopping environments for agent evaluation.

Product Distribution Across All Shops

| Product Category | Overall | % | Shop 1 | % | Shop 2 | % | Shop 3 | % | Shop 4 | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Internal PC Components | 477 | 43.8% | 120 | 39.6% | 72 | 27.9% | 210 | 61.4% | 75 | 40.5% |
| PC Peripherals | 225 | 20.7% | 91 | 30.0% | 49 | 19.0% | 45 | 13.2% | 40 | 21.6% |
| PC Cases | 270 | 24.8% | 57 | 18.8% | 114 | 44.2% | 56 | 16.4% | 43 | 23.2% |
| Devices | 116 | 10.7% | 35 | 11.6% | 23 | 8.9% | 31 | 9.1% | 27 | 14.6% |
| Total | 1,088 | 100.0% | 303 | 100.0% | 258 | 100.0% | 342 | 100.0% | 185 | 100.0% |

Baseline Experiments

We conduct a series of baseline experiments using web agents implemented as part of the AgentLab library that accompanies BrowserGym [Chezelles2025]. We test 6 agent setups along three dimensions: (1) the observation space (AX-Tree only or AX-Tree plus screenshots), (2) whether short-term memory is enabled, and (3) the LLM used (GPT4.1 or Claude Sonnet 4). The observation space of the agent is either just the AX-Tree of the visited webpages or the AX-Tree together with a screenshot of the currently visible page; in the screenshot, each element of the visible page is annotated with a number that corresponds to the element's AX-Tree id. If short-term memory is activated, the agent can note down information it deems relevant to remember at each step. Examples of the full final message passed to the agent, which also contains an action history, are available for two experimental settings with the GPT4.1 model: here (AX-Tree only) and here (AX-Tree + Memory).
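For illustration, the resulting 2 × 3 experimental grid can be enumerated as follows; the class and value names are our own sketch, not AgentLab identifiers.

```python
# Sketch of the six evaluated agent setups (2 LLMs x 3 observation
# configurations). Names are illustrative, not AgentLab identifiers.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class AgentSetup:
    llm: str          # backbone model
    use_memory: bool  # agent may note down information at each step
    use_vision: bool  # add the annotated screenshot to the AX-Tree

LLMS = ["gpt-4.1", "claude-sonnet-4"]
OBS_CONFIGS = [
    (False, False),  # AX-Tree only
    (True, False),   # AX-Tree + Memory
    (False, True),   # AX-Tree + Vision
]

SETUPS = [
    AgentSetup(llm, mem, vis) for llm, (mem, vis) in product(LLMS, OBS_CONFIGS)
]
assert len(SETUPS) == 6  # the six setups evaluated below
```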

Each of the agent setups is run on the basic and advanced task sets of WebMall. Completion Rate is the fraction of tasks in the respective set that were fully and correctly completed. Partial Completion Rate is the average of the per-task partial completion rates over all tasks in a set; a task is partially completed if, for example, the agent returned 3 of the 4 required product offers, yielding a partial completion rate of 75% for that task.
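A minimal sketch of the two metrics, assuming expected and returned answers are compared as sets of offer URLs (how extra, incorrect URLs are penalized is our assumption, not specified above):

```python
# Minimal sketch of the two per-task metrics, assuming answers are sets
# of offer URLs. The handling of extra, incorrect URLs is an assumption.

def completion(expected: set[str], returned: set[str]) -> float:
    """1.0 only if the returned offers match the expected offers exactly."""
    return 1.0 if expected == returned else 0.0

def partial_completion(expected: set[str], returned: set[str]) -> float:
    """Fraction of the expected offers that were returned."""
    if not expected:
        return 1.0 if not returned else 0.0
    return len(expected & returned) / len(expected)

# Example from the text: 3 of the 4 required offers found -> 75%.
expected = {"url_1", "url_2", "url_3", "url_4"}
returned = {"url_1", "url_2", "url_3"}
assert partial_completion(expected, returned) == 0.75
assert completion(expected, returned) == 0.0
```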

Task Completion Rates

| Model | Task Set | AX-Tree (Compl. % / Partial %) | AX-Tree + Memory (Compl. % / Partial %) | AX-Tree + Vision (Compl. % / Partial %) |
|---|---|---|---|---|
| GPT4.1 | Basic | 62.50 / 72.03 | 75.00 / 81.77 | 50.00 / 59.26 |
| Claude Sonnet 4 | Basic | 70.83 / 79.85 | 75.00 / 79.40 | 70.84 / 76.09 |
| GPT4.1 | Advanced | 19.45 / 30.42 | 27.78 / 42.45 | 19.45 / 23.47 |
| Claude Sonnet 4 | Advanced | 38.89 / 53.84 | 41.67 / 54.12 | 33.34 / 42.13 |

Token Usage and Cost per Task

| Model | Task Set | Observation Space | Avg. Steps | Avg. Input Tokens | Avg. Output Tokens | Avg. Cost ($) |
|---|---|---|---|---|---|---|
| GPT4.1 | Basic | AX-Tree | 20 | 131,451 | 2,187 | 0.28 |
| GPT4.1 | Basic | AX-Tree + Memory | 19 | 131,118 | 3,229 | 0.29 |
| GPT4.1 | Basic | AX-Tree + Vision | 20 | 147,237 | 1,850 | 0.31 |
| Claude Sonnet 4 | Basic | AX-Tree | 20 | 173,443 | 4,459 | 0.59 |
| Claude Sonnet 4 | Basic | AX-Tree + Memory | 21 | 229,343 | 10,613 | 0.85 |
| Claude Sonnet 4 | Basic | AX-Tree + Vision | 25 | 268,075 | 5,907 | 0.89 |
| GPT4.1 | Advanced | AX-Tree | 21 | 154,609 | 2,660 | 0.33 |
| GPT4.1 | Advanced | AX-Tree + Memory | 19 | 147,548 | 3,983 | 0.33 |
| GPT4.1 | Advanced | AX-Tree + Vision | 19 | 156,366 | 2,161 | 0.33 |
| Claude Sonnet 4 | Advanced | AX-Tree | 25 | 257,357 | 6,906 | 0.88 |
| Claude Sonnet 4 | Advanced | AX-Tree + Memory | 23 | 331,334 | 17,901 | 1.26 |
| Claude Sonnet 4 | Advanced | AX-Tree + Vision | 33 | 461,398 | 10,602 | 1.54 |

Completion Rates per Task Category

| Model | Task Category | AX-Tree (Compl. % / Partial %) | AX-Tree + Memory (Compl. % / Partial %) | AX-Tree + Vision (Compl. % / Partial %) |
|---|---|---|---|---|
| Basic Tasks | | | | |
| GPT4.1 | Single Product Search | 58.34 / 69.45 | 58.34 / 69.45 | 41.67 / 59.47 |
| GPT4.1 | Cheapest Product Search | 60.00 / 60.00 | 80.00 / 80.00 | 10.00 / 15.00 |
| GPT4.1 | Best Fit Specific Requirements | 36.37 / 61.26 | 54.55 / 71.94 | 36.37 / 52.79 |
| GPT4.1 | Add to Cart | 85.71 / 85.71 | 100.00 / 100.00 | 85.71 / 85.71 |
| GPT4.1 | Checkout | 87.50 / 93.75 | 100.00 / 100.00 | 100.00 / 100.00 |
| Claude Sonnet 4 | Single Product Search | 75.00 / 85.42 | 58.34 / 66.67 | 66.67 / 77.08 |
| Claude Sonnet 4 | Cheapest Product Search | 70.00 / 70.00 | 70.00 / 70.00 | 70.00 / 70.00 |
| Claude Sonnet 4 | Best Fit Specific Requirements | 54.55 / 73.45 | 63.64 / 73.76 | 54.55 / 61.57 |
| Claude Sonnet 4 | Add to Cart | 71.43 / 78.57 | 100.00 / 100.00 | 71.43 / 78.57 |
| Claude Sonnet 4 | Checkout | 87.50 / 93.75 | 100.00 / 100.00 | 100.00 / 100.00 |
| Advanced Tasks | | | | |
| GPT4.1 | Cheapest Best Fit Specific Requirements | 40.00 / 40.00 | 50.00 / 55.00 | 20.00 / 20.00 |
| GPT4.1 | Best Fit Vague Requirements | 0.00 / 28.15 | 11.11 / 34.63 | 22.23 / 32.78 |
| GPT4.1 | Cheapest Best Fit Vague Requirements | 16.67 / 31.94 | 33.34 / 44.45 | 33.33 / 33.33 |
| GPT4.1 | Find Substitutes | 33.34 / 41.67 | 33.34 / 50.00 | 16.67 / 25.00 |
| GPT4.1 | Find Compatible Products | 0.00 / 0.00 | 0.00 / 20.00 | 0.00 / 0.00 |
| Claude Sonnet 4 | Cheapest Best Fit Specific Requirements | 50.00 / 55.00 | 60.00 / 65.00 | 40.00 / 45.00 |
| Claude Sonnet 4 | Best Fit Vague Requirements | 33.34 / 63.52 | 33.34 / 55.37 | 33.34 / 43.52 |
| Claude Sonnet 4 | Cheapest Best Fit Vague Requirements | 33.34 / 37.50 | 33.34 / 48.61 | 16.67 / 31.95 |
| Claude Sonnet 4 | Find Substitutes | 33.34 / 61.12 | 33.34 / 47.22 | 33.34 / 47.23 |
| Claude Sonnet 4 | Find Compatible Products | 40.00 / 45.00 | 40.00 / 45.00 | 40.00 / 40.00 |

Token Usage and Cost per Task Category

| Model | Task Category | Observation Space | Avg. Steps | Avg. Input Tokens | Avg. Output Tokens | Avg. Cost ($) |
|---|---|---|---|---|---|---|
| Basic Tasks | | | | | | |
| GPT4.1 | Single Product Search | AX-Tree | 18 | 112,625 | 2,071 | 0.24 |
| GPT4.1 | Single Product Search | AX-Tree + Memory | 17 | 113,030 | 3,155 | 0.25 |
| GPT4.1 | Single Product Search | AX-Tree + Vision | 18 | 127,133 | 1,845 | 0.27 |
| Claude Sonnet 4 | Single Product Search | AX-Tree | 19 | 149,838 | 4,102 | 0.51 |
| Claude Sonnet 4 | Single Product Search | AX-Tree + Memory | 21 | 235,463 | 12,541 | 0.89 |
| Claude Sonnet 4 | Single Product Search | AX-Tree + Vision | 21 | 203,623 | 5,024 | 0.69 |
| GPT4.1 | Cheapest Product Search | AX-Tree | 17 | 100,277 | 1,736 | 0.21 |
| GPT4.1 | Cheapest Product Search | AX-Tree + Memory | 16 | 102,181 | 2,774 | 0.23 |
| GPT4.1 | Cheapest Product Search | AX-Tree + Vision | 18 | 125,801 | 1,605 | 0.26 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree | 17 | 126,316 | 3,215 | 0.43 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree + Memory | 17 | 163,060 | 7,779 | 0.61 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree + Vision | 22 | 214,319 | 4,662 | 0.71 |
| GPT4.1 | Best Fit Specific Requirements | AX-Tree | 24 | 182,700 | 3,073 | 0.39 |
| GPT4.1 | Best Fit Specific Requirements | AX-Tree + Memory | 21 | 165,627 | 4,101 | 0.36 |
| GPT4.1 | Best Fit Specific Requirements | AX-Tree + Vision | 22 | 187,407 | 2,384 | 0.39 |
| Claude Sonnet 4 | Best Fit Specific Requirements | AX-Tree | 25 | 262,004 | 6,738 | 0.89 |
| Claude Sonnet 4 | Best Fit Specific Requirements | AX-Tree + Memory | 23 | 328,001 | 14,560 | 1.20 |
| Claude Sonnet 4 | Best Fit Specific Requirements | AX-Tree + Vision | 32 | 427,724 | 8,801 | 1.42 |
| GPT4.1 | Add to Cart | AX-Tree | 19 | 119,019 | 1,969 | 0.25 |
| GPT4.1 | Add to Cart | AX-Tree + Memory | 18 | 122,835 | 2,970 | 0.27 |
| GPT4.1 | Add to Cart | AX-Tree + Vision | 18 | 126,263 | 1,482 | 0.26 |
| Claude Sonnet 4 | Add to Cart | AX-Tree | 18 | 141,195 | 3,441 | 0.48 |
| Claude Sonnet 4 | Add to Cart | AX-Tree + Memory | 18 | 175,459 | 8,457 | 0.65 |
| Claude Sonnet 4 | Add to Cart | AX-Tree + Vision | 23 | 224,813 | 4,311 | 0.74 |
| GPT4.1 | Checkout | AX-Tree | 24 | 139,070 | 1,899 | 0.29 |
| GPT4.1 | Checkout | AX-Tree + Memory | 24 | 154,220 | 2,934 | 0.33 |
| GPT4.1 | Checkout | AX-Tree + Vision | 24 | 167,308 | 1,749 | 0.35 |
| Claude Sonnet 4 | Checkout | AX-Tree | 23 | 174,202 | 4,309 | 0.59 |
| Claude Sonnet 4 | Checkout | AX-Tree + Memory | 23 | 214,509 | 7,724 | 0.76 |
| Claude Sonnet 4 | Checkout | AX-Tree + Vision | 26 | 250,284 | 6,202 | 0.84 |
| Advanced Tasks | | | | | | |
| GPT4.1 | Cheapest Best Fit Specific Requirements | AX-Tree | 20 | 142,851 | 2,198 | 0.30 |
| GPT4.1 | Cheapest Best Fit Specific Requirements | AX-Tree + Memory | 20 | 146,210 | 3,377 | 0.32 |
| GPT4.1 | Cheapest Best Fit Specific Requirements | AX-Tree + Vision | 18 | 139,887 | 1,718 | 0.29 |
| Claude Sonnet 4 | Cheapest Best Fit Specific Requirements | AX-Tree | 21 | 184,652 | 4,768 | 0.63 |
| Claude Sonnet 4 | Cheapest Best Fit Specific Requirements | AX-Tree + Memory | 21 | 306,201 | 19,909 | 1.22 |
| Claude Sonnet 4 | Cheapest Best Fit Specific Requirements | AX-Tree + Vision | 29 | 363,086 | 7,278 | 1.20 |
| GPT4.1 | Best Fit Vague Requirements | AX-Tree | 24 | 182,507 | 3,151 | 0.39 |
| GPT4.1 | Best Fit Vague Requirements | AX-Tree + Memory | 18 | 146,078 | 4,581 | 0.33 |
| GPT4.1 | Best Fit Vague Requirements | AX-Tree + Vision | 23 | 205,822 | 3,036 | 0.44 |
| Claude Sonnet 4 | Best Fit Vague Requirements | AX-Tree | 30 | 360,564 | 9,720 | 1.23 |
| Claude Sonnet 4 | Best Fit Vague Requirements | AX-Tree + Memory | 28 | 466,259 | 23,329 | 1.75 |
| Claude Sonnet 4 | Best Fit Vague Requirements | AX-Tree + Vision | 37 | 542,216 | 13,202 | 1.82 |
| GPT4.1 | Cheapest Best Fit Vague Requirements | AX-Tree | 22 | 169,166 | 2,916 | 0.36 |
| GPT4.1 | Cheapest Best Fit Vague Requirements | AX-Tree + Memory | 19 | 146,537 | 3,988 | 0.32 |
| GPT4.1 | Cheapest Best Fit Vague Requirements | AX-Tree + Vision | 20 | 169,546 | 2,299 | 0.36 |
| Claude Sonnet 4 | Cheapest Best Fit Vague Requirements | AX-Tree | 25 | 254,591 | 6,206 | 0.86 |
| Claude Sonnet 4 | Cheapest Best Fit Vague Requirements | AX-Tree + Memory | 24 | 343,691 | 16,423 | 1.28 |
| Claude Sonnet 4 | Cheapest Best Fit Vague Requirements | AX-Tree + Vision | 37 | 532,580 | 9,576 | 1.74 |
| GPT4.1 | Find Substitutes | AX-Tree | 19 | 125,073 | 2,472 | 0.27 |
| GPT4.1 | Find Substitutes | AX-Tree + Memory | 19 | 138,926 | 3,562 | 0.30 |
| GPT4.1 | Find Substitutes | AX-Tree + Vision | 17 | 136,587 | 1,975 | 0.29 |
| Claude Sonnet 4 | Find Substitutes | AX-Tree | 23 | 234,912 | 6,423 | 0.80 |
| Claude Sonnet 4 | Find Substitutes | AX-Tree + Memory | 17 | 207,122 | 13,662 | 0.83 |
| Claude Sonnet 4 | Find Substitutes | AX-Tree + Vision | 31 | 489,526 | 12,386 | 1.65 |
| GPT4.1 | Find Compatible Products | AX-Tree | 22 | 145,882 | 2,617 | 0.31 |
| GPT4.1 | Find Compatible Products | AX-Tree + Memory | 21 | 164,433 | 4,619 | 0.37 |
| GPT4.1 | Find Compatible Products | AX-Tree + Vision | 15 | 108,222 | 1,528 | 0.23 |
| Claude Sonnet 4 | Find Compatible Products | AX-Tree | 25 | 247,251 | 7,537 | 0.85 |
| Claude Sonnet 4 | Find Compatible Products | AX-Tree + Memory | 22 | 272,964 | 10,975 | 0.98 |
| Claude Sonnet 4 | Find Compatible Products | AX-Tree + Vision | 32 | 393,378 | 11,663 | 1.36 |

Running the WebMall Benchmark

For running the benchmark, we assume a Unix operating system in order to run Docker. If you are using Windows, please refer to the WSL setup of Docker.

How to Set Up the Shops

The setup consists of two Docker containers per shop (the shop itself and its database), which allows for a simple setup using Docker Compose. Please refer to the installation guide on GitHub for setting up the shops.
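Once the containers are up, a quick reachability check of the four shops can look like the following sketch; the localhost ports are placeholders and must match the ports configured in your Docker Compose setup.

```python
# Check that all four shops respond after `docker compose up`.
# The hostnames/ports below are placeholder assumptions.
import urllib.request

SHOP_URLS = [
    "http://localhost:8081",  # Shop 1
    "http://localhost:8082",  # Shop 2
    "http://localhost:8083",  # Shop 3
    "http://localhost:8084",  # Shop 4
]

for url in SHOP_URLS:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{url}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{url}: not reachable ({exc})")
```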

Performing Experiments

Instructions on how to perform experiments on single tasks as well as how to run a full study on the benchmark sets are found in the GitHub installation guide. The system writes comprehensive logs during execution, including agent actions, observations, and performance metrics. All logs, including summary results, are stored in the output directory set in the .env file for later analysis.
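As a sketch of a post-hoc analysis, the snippet below aggregates per-task scores from the output directory. The assumed file layout (one summary.json per task run with a partial_completion field) is hypothetical; adapt the paths and keys to the actual log format documented on GitHub.

```python
# Hypothetical sketch: aggregate per-task results from the output
# directory. File layout and field names are assumptions.
import json
from pathlib import Path
from statistics import mean

OUTPUT_DIR = Path("./results")  # the directory configured in .env

scores = []
for summary in OUTPUT_DIR.glob("**/summary.json"):
    with summary.open() as f:
        record = json.load(f)
    scores.append(record.get("partial_completion", 0.0))

if scores:
    print(f"{len(scores)} tasks, avg. partial completion: {mean(scores):.2%}")
```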

Related Work

[Yehudai2025] survey benchmarks for the evaluation of LLM agents and categorize them according to the agent's application domain as well as the agent capabilities that are evaluated. Other benchmarks that also evaluate the capability of Web agents to perform online shopping are the WebShop benchmark as well as the WebArena [Zhou2023] and Mind2Web benchmarks, which feature e-commerce tasks as part of a wider task set. Compared to these benchmarks, the WebMall benchmark requires agents to perform longer-running tasks (due to visiting multiple shops), to deal with heterogeneous product data originating from different real-world sources, and to perform advanced searches such as finding compatible or substitute products.

Feedback

We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact the authors of the benchmark directly via email.

References

[Brinkmann2023] Brinkmann, Alexander, et al.: The Web Data Commons Schema.org Data Set Series. Companion Proceedings of the ACM Web Conference, 2023.

[Chezelles2025] Le Sellier De Chezelles, Thibault, et al.: The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467, 2025.

[Yehudai2025] Yehudai, Asaf, et al.: Survey on Evaluation of LLM-based Agents. arXiv:2503.16416, 2025.

[Zhou2023] Zhou, Shuyan, et al.: WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023.