WebMall - A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Caspary, Christian Bizer

Web-based Systems Group, University of Mannheim

ralph.peeters@uni-mannheim.de

Introduction

This page presents the WebMall benchmark for evaluating the capability of Web agents to find and compare product offers from multiple e-shops. The benchmark features two sets of tasks. The basic task set contains tasks such as searching for and comparing offers, adding offers to the shopping cart, and checking out the selected offers. The advanced task set contains searches with vague requirements as well as searches for compatible products or cheaper substitute products.

Given a user task, the agent is asked to visit four e-shops exposing heterogeneous product offers via heterogeneous user interfaces. WebMall differs from existing e-commerce benchmarks, such as WebShop, WebArena, or Mind2Web, by 1) requiring the agent to visit multiple e-shops, 2) featuring product offers from different real-world sources, and 3) containing advanced tasks such as finding compatible or substitute products.

News

01-08-2025: Release of WebMall-Interfaces: MCP vs RAG vs NLWeb vs HTML - A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web
29-07-2025: Version 1.0 released: Featuring additional product offers, a new task category "End to End", and a cleanup of tasks and solutions.
05-06-2025: Initial release: Version 0.7 released.

Screencasts

The following screencasts show an agent performing different tasks across the four shops.

Agent finding the cheapest product that meets specific technical requirements.

Advanced task showing agent interpreting vague user requirements to find suitable products.

Advanced task demonstrating agent finding substitute products for a given item.

Demo showing an agent completing the checkout process for selected products.

The WebMall Task Set

The WebMall benchmark covers 91 tasks distributed over 11 task categories. 10 of the 11 task categories require the agents to visit four different webshops to find relevant product offers. 4 of these 10 categories further require the comparison of product prices across shops in order to find the cheapest offer(s). The task categories are grouped into a basic task set and an advanced task set.

The basic task set covers tasks that usually need to be handled in the process of buying a product online. It contains 5 of the 11 task categories and covers the tasks of searching for offers for a specific product, finding the cheapest offers, locating product offers that meet specific requirements, as well as adding products to the cart and completing the checkout process. The basic set contains 48 tasks.

The advanced task set covers difficult product search and comparison tasks. It contains tasks like looking for product offers given only vague requirements from the user, finding the cheapest offer fitting specific or vague requirements, finding cheaper substitutes for a specific product, and finding compatible products like a cable for connecting a monitor to a computer. The task set also contains end-to-end tasks which are a combination of tasks from the basic task set. End-to-end tasks require the agent to find the cheapest offer for a product, put it in the shopping cart and perform the checkout process. The advanced set contains 43 tasks.

Each task is defined by a specific instruction to the agent in string format as well as the expected answers (URLs of the correct product offers) if the task requires answers. The instruction to the agent consists of a general part which is the same for all tasks and a task specific part. The general part contains links to the four webshops as well as instructions on how to submit the final solution after completing a task. An example of a complete instruction string is found here.
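
The task structure described above can be sketched as a small data class. This is only an illustration: the class name, field names, placeholder shop URLs, and the wording of the general part are hypothetical and are not taken from the benchmark code.

```python
from dataclasses import dataclass

# Hypothetical general part shared by all tasks. The real wording ships with
# the benchmark and is not reproduced here.
GENERAL_PART = (
    "You can visit the following webshops: {shops}\n"
    "Submit your final solution as described in the submission instructions.\n"
)


@dataclass(frozen=True)
class Task:
    category: str                          # one of the 11 task categories
    task_part: str                         # task-specific instruction
    expected_urls: frozenset = frozenset() # empty if the task needs no answers

    def instruction(self, shop_urls):
        """Combine the general part and the task-specific part into one string."""
        return GENERAL_PART.format(shops=", ".join(shop_urls)) + self.task_part


# Placeholder URLs, not the real shop addresses.
SHOPS = [f"https://shop-{i}.example" for i in range(1, 5)]

task = Task(
    category="Find Specific Product",
    task_part="Find all offers for the AMD Ryzen 9 5900X.",
    expected_urls=frozenset({"https://shop-1.example/product/placeholder"}),
)
print(task.instruction(SHOPS))
```

Keeping the expected answers as a set reflects that the order in which an agent returns offer URLs should not matter for scoring.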

The table below gives an overview of the 11 task categories and includes an example task from each category. A list containing all 91 tasks of the WebMall benchmark is provided here.

Task Categories Overview

Task Category	Count	Examples
Basic Task Set
Find Specific Product	12	Find all offers for the AMD Ryzen 9 5900X. Find all offers for the Canon EOS R5 Mark II.
Find Cheapest Offer	10	Find the cheapest offer for the Samsung Galaxy S24 Plus. Find the cheapest offer for the Netac Z Slim 1TB M.2 External SSD.
Products Fulfilling Specific Requirements	11	Find all offers for orange straps that fit with the Apple Watch Series 6. Find all offers for Samsung Tablets which support 5G and come with an S-Pen stylus.
Add to Cart	7	Find all offers for the GameMax Iceburg 360mm ARGB Liquid CPU Cooler and add each of them to the respective shopping cart of the shop where you found the offer. Find all offers for the Asus DUAL RTX4070 SUPER OC White and add each of them to the respective shopping cart of the shop where you found the offer.
Checkout	8	Add the product on page https://webmall-3.[local_path].de/product/trust-tk-350-wireless-membrane-keyboard-spill-proof-silent-keys-media-keys-black to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28. Add the product on page https://webmall-1.[local_path].de/product/palit-rtx3050-dual-v2-pcie4-8gb-ddr6-dvi-hdmi-dp-1777mhz-clock-rgb-lighting to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.
Advanced Task Set
Cheapest Offer Specific Requirements	10	Find the cheapest offer for a new Xbox gaming console with at least 512gb disk space in white. Find the cheapest offer for a Samsung Galaxy smartphone from the S24 series which has a camera with 200 Megapixel resolution.
Products Satisfying Vague Requirements	8	Find all offers for the largest available MX500 model by Crucial. Find all offers for an adapter so I can connect my monitor, which does not support HDMI, to an HDMI cable. The monitor's connector looks quite similar to HDMI.
Cheapest Offer Vague Requirements	6	Find the cheapest offer for each Smartphone model of Samsungs budget-friendly smartphone series. Find the cheapest offers for each model of mid-tier nVidia gaming GPUs in the 4000 series.
Find Substitutes	6	Find the cheapest alternative for this item: https://webmall-3.[local_path].de/product/arctic-liquid-freezer-iii-360mm-liquid-cpu-cooler-p12-pwm-pst-fans-pwm-controlled-pump . Find the cheapest alternative with at least the same capacity and speed for this product: https://webmall-3.[local_path].de/product/corsair-1tb-mp600-core-xt-m-2-nvme-ssd-m-2-2280-pcie4-3d-qlc-nand-r-w-5000-3500-mb-s-700k-900k-iops .
Find Compatible Products	5	Find all offers for compatible CPUs for this motherboard: https://webmall-3.[local_path].de/product/asus-pro-ws-wrx80e-sage-se-wifi-ii-workstation-amd-wrx80-swrx8-eatx-8-ddr4-sli-wi-fi-6e-dual-10g-lan-hyper-m-2-card-3x-m-2 . Find kits with single or multiple 32 GB RAM sticks compatible with this motherboard: https://webmall-4.[local_path].de/product/asus-rog-strix-z790-e-gaming-wifi-intel-z790-1700-atx-4-ddr5-hdmi-dp-wi-fi-6e-2-5g-lan-pcie5-rgb-5x-m-2 .
End To End	8	Find the cheapest offer for the Asrock B550 PHANTOM GAMING 4, add it to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28. Find the cheapest offer for the Asus ROG Ryuo III 360 ARGB 360mm Liquid CPU Cooler and the cheapest offer for the Corsair Vengeance LPX 16GB Kit (2 x 8GB), add the respective cheapest offers to the shopping cart and complete the checkout process. If they are found in the same shop, put both in the shopping cart and checkout only once. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.

The WebMall Shops

The WebMall benchmark asks agents to search for products in four distinct webshops which provide heterogeneous user interfaces. The webshops are implemented using the WordPress plugin WooCommerce and can be hosted via Docker either locally or on a remote machine.

Screenshot Shop 1: E-Store Athletics

Screenshot Shop 2: TechTalk

Each shop contains heterogeneous product offers originating from a wide set of e-shops which annotate product offers within their pages using the schema.org vocabulary. The product offers were extracted from the October 2024 version of the CommonCrawl by the WebDataCommons project [Brinkmann2023].

The four WebMall shops contain a total of 4,423 product offers distributed across three main categories: PC Components, PC Peripherals, and Other Electronics. The distribution varies across shops to create diverse shopping environments for agent evaluation. The PC Components category includes internal computer parts such as CPUs, RAM, and motherboards. PC Peripherals covers external devices like monitors, keyboards, and external hard drives, while Other Electronics features consumer tech products such as gaming consoles, headphones, and smartwatches.

Product Distribution Across All Shops

Product Category	Overall Total		Shop 1		Shop 2		Shop 3		Shop 4
	Offers	%	Offers	%	Offers	%	Offers	%	Offers	%
PC Components	1,477	33.4	348	30.2	369	33.7	430	37.2	330	32.4
PC Peripherals	1,388	31.4	432	37.5	255	23.3	336	29.1	365	35.8
Other Electronics	1,558	35.2	372	32.3	471	43.0	390	33.7	325	31.9
Total	4,423	100.0	1,152	100.0	1,095	100.0	1,156	100.0	1,020	100.0

Baseline Experiments

We conduct a series of baseline experiments using web agents implemented as part of the AgentLab library that accompanies BrowserGym [Chezelles2025]. We test 8 agent setups along three dimensions: (1) the observation space (AX-Tree, screenshots, or AX-Tree + screenshots), (2) the use of short-term memory, and (3) the LLM (GPT4.1 or Claude Sonnet 4). The observation space of the agent is either only the AX-Tree of the visited webpage, only a screenshot, or both the AX-Tree and a screenshot of the currently visible page. In the screenshot, each element of the visible page is annotated with a number that corresponds to the AX-Tree id of the element. If short-term memory is activated, the agent can note down information it deems relevant to remember at each step. Examples of the full final message passed to the agent, which also contains an action history, can be found for two experimental settings with the GPT4.1 model here (AXTree only) and here (AXTree+Memory).
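
As a rough illustration, the 8 setups can be enumerated as the cross product of the four observation and memory configurations that appear in the result tables and the two LLMs. The dictionary keys below are hypothetical names chosen for this sketch; they are not AgentLab or BrowserGym parameters.

```python
from itertools import product

# The four configurations reported in the result tables. Memory is only
# combined with the AX-Tree observation space.
CONFIGURATIONS = [
    {"name": "AX-Tree", "axtree": True, "screenshot": False, "memory": False},
    {"name": "AX-Tree + Memory", "axtree": True, "screenshot": False, "memory": True},
    {"name": "AX-Tree + Vision", "axtree": True, "screenshot": True, "memory": False},
    {"name": "Vision", "axtree": False, "screenshot": True, "memory": False},
]
MODELS = ["GPT4.1", "Claude Sonnet 4"]

# One setup per (model, configuration) pair.
setups = [{"model": model, **config} for model, config in product(MODELS, CONFIGURATIONS)]

for setup in setups:
    print(f'{setup["model"]}: {setup["name"]}')
```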

Each of the agent setups is run on the Basic and Advanced task sets of WebMall. The Completion Rate measures the share of tasks in a given set that are completed fully and correctly, meaning that all correct answers are identified and no additional incorrect answers are returned. Precision, Recall and F1 allow a more fine-grained evaluation of the results as they take into account the number of correct and incorrect answers given by the agent. Precision is the ratio of correct answers to all answers given by the agent, Recall is the ratio of correct answers given by the agent to all correct answers in the task set, and F1 is the harmonic mean of Precision and Recall. All four metrics are reported as percentages. The second table reports the average number of steps taken by the agent to complete a task, the average number of input and output tokens used by the agent to complete a single task, as well as the average runtime and cost per task.
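
The metrics can be sketched as follows for tasks that are answered with offer URLs. Two points are assumptions of this sketch rather than statements from the benchmark code: Precision and Recall are computed per task and then averaged over the tasks, and an empty answer counts as a precision of 0. Computing F1 from the averaged Precision and Recall is consistent with the reported tables, where for example 74.48 and 67.59 yield 70.87. The function names and toy URLs are hypothetical.

```python
def score_task(returned_urls, expected_urls):
    """Return (complete, precision, recall) for a single task."""
    returned, expected = set(returned_urls), set(expected_urls)
    correct = returned & expected
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    # A task is complete only if all expected answers and nothing else is returned.
    return returned == expected, precision, recall


def aggregate(task_scores):
    """Average per-task scores and derive F1 from the averages (in percent)."""
    n = len(task_scores)
    completion_rate = 100 * sum(c for c, _, _ in task_scores) / n
    precision = 100 * sum(p for _, p, _ in task_scores) / n
    recall = 100 * sum(r for _, _, r in task_scores) / n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return completion_rate, precision, recall, f1


# Toy data: six fully solved tasks and one task with one of two offers found.
scores = [score_task({"u1", "u2"}, {"u1", "u2"}) for _ in range(6)]
scores.append(score_task({"u1"}, {"u1", "u2"}))

cr, p, r, f1 = aggregate(scores)
print(f"{cr:.2f} {p:.2f} {r:.2f} {f1:.2f}")  # 85.71 100.00 92.86 96.30
```

In the toy data, the single incomplete task lowers the Completion Rate by a full task, while Recall only drops by half a task. This is why Precision, Recall and F1 give a more fine-grained picture than the Completion Rate alone.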

Completion Rates and F1 Score

Model	Task set	AX-Tree				AX-Tree + Memory				AX-Tree + Vision				Vision
Model	Task set	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score
GPT4.1	Basic	56.25	74.48	67.59	70.87	75.00	91.60	83.95	87.61	56.25	72.66	65.77	69.04	41.67	59.64	50.43	54.65
GPT4.1	Advanced	32.56	52.03	45.57	48.59	34.88	52.11	46.25	49.01	39.53	48.46	48.35	48.41	13.95	20.70	18.00	19.26
Claude Sonnet 4	Basic	66.67	76.04	72.44	74.20	70.83	81.25	75.12	78.06	72.92	79.17	76.67	77.90	10.42	35.42	21.99	27.14
Claude Sonnet 4	Advanced	53.49	63.37	63.41	63.39	48.84	61.51	58.40	59.91	37.21	41.11	41.80	41.45	4.65	10.47	6.69	8.16

Token Usage, Cost and Runtime per Task

Model	Task Set	Observation Space	Avg. Steps	Avg. Input Tokens	Avg. Output Tokens	Avg. Runtime	Avg. Cost
GPT4.1	Basic	AX-Tree	22.69	131,301	2,334	130.5s	0.28$
		AX-Tree + Memory	20.88	130,270	3,511	142.4s	0.29$
		AX-Tree + Vision	20.92	135,362	1,901	155.4s	0.29$
		Vision	28.56	104,617	2,453	176.2s	0.23$
GPT4.1	Advanced	AX-Tree	24.98	160,922	2,950	159.2s	0.35$
		AX-Tree + Memory	24.19	178,949	4,658	177.0s	0.40$
		AX-Tree + Vision	23.74	169,956	2,468	187.8s	0.36$
		Vision	33.33	133,972	3,119	216.4s	0.29$
Claude Sonnet 4	Basic	AX-Tree	23.69	188,079	6,791	222.7s	0.67$
		AX-Tree + Memory	22.04	236,631	15,106	334.6s	0.94$
		AX-Tree + Vision	25.62	242,597	6,255	279.5s	0.82$
		Vision	43.40	364,694	13,937	446.9s	1.30$
Claude Sonnet 4	Advanced	AX-Tree	29.65	291,048	10,063	331.7s	1.02$
		AX-Tree + Memory	27.33	364,858	18,149	420.9s	1.37$
		AX-Tree + Vision	37.26	480,199	12,630	471.9s	1.63$
		Vision	47.74	421,704	17,456	536.3s	1.53$
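
The cost column can be approximated from the token counts. The sketch below assumes per-token list prices of 2 USD (input) and 8 USD (output) per million tokens for GPT4.1 and 3 USD (input) and 15 USD (output) per million tokens for Claude Sonnet 4. These prices are an assumption of this example and are not stated on this page, although they reproduce several of the reported values. Discounts such as prompt caching are not modeled.

```python
# Assumed list prices in USD per one million tokens: (input, output).
PRICES_PER_MILLION = {
    "GPT4.1": (2.00, 8.00),
    "Claude Sonnet 4": (3.00, 15.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the cost of one task in USD from its token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Average token counts taken from the AX-Tree rows of the Basic task set.
print(round(estimate_cost("GPT4.1", 131_301, 2_334), 2))           # 0.28
print(round(estimate_cost("Claude Sonnet 4", 188_079, 6_791), 2))  # 0.67
```

Because the estimate is linear in the token counts, applying it to average token counts gives the same result as averaging per-task costs.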

Completion Rates and F1 Score per Task Category

Model	Task category	AX-Tree				AX-Tree + Memory				AX-Tree + Vision				Vision
Model	Task category	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score	Completion Rate	Precision	Recall	F1 Score
Basic Tasks
GPT4.1	Single Product Search	33.33	85.42	66.48	74.77	66.67	88.64	81.69	85.02	33.33	67.71	54.61	60.46	41.67	69.10	56.44	62.13
	Cheapest Product Search	60.00	60.00	60.00	60.00	90.00	90.00	90.00	90.00	40.00	42.50	42.50	42.50	50.00	63.33	57.50	60.28
	Best Fit Specific Requirements	27.27	50.00	40.61	44.82	36.36	84.85	59.01	69.61	45.45	68.18	56.97	62.07	27.27	54.55	38.03	44.81
	Add to Cart	85.71	85.71	85.71	85.71	100.00	100.00	100.00	100.00	85.71	100.00	92.86	96.30	85.71	100.00	92.86	96.30
	Checkout	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	12.50	12.50	12.50	12.50
Claude Sonnet 4	Single Product Search	66.67	83.33	78.41	80.80	75.00	83.33	79.17	81.20	75.00	83.33	79.17	81.20	0.00	58.33	22.98	32.97
	Cheapest Product Search	70.00	75.00	75.00	75.00	70.00	70.00	70.00	70.00	80.00	80.00	80.00	80.00	40.00	60.00	50.00	54.55
	Best Fit Specific Requirements	45.45	63.64	53.31	58.01	45.45	81.82	59.61	68.97	45.45	63.64	57.27	60.29	9.09	36.36	25.45	29.95
	Add to Cart	71.43	71.43	71.43	71.43	85.71	85.71	85.71	85.71	85.71	85.71	85.71	85.71	0.00	0.00	0.00	0.00
	Checkout	87.50	87.50	87.50	87.50	87.50	87.50	87.50	87.50	87.50	87.50	87.50	87.50	0.00	0.00	0.00	0.00
Advanced Tasks
GPT4.1	Cheapest Best Fit Specific Requirements	40.00	40.00	40.00	40.00	30.00	30.00	30.00	30.00	30.00	30.00	30.00	30.00	20.00	20.00	20.00	20.00
	Best Fit Vague Requirements	12.50	64.03	48.09	54.93	25.00	80.09	65.28	71.93	25.00	39.87	44.27	41.95	12.50	43.75	31.77	36.81
	Cheapest Best Fit Vague Requirements	16.67	54.17	48.61	51.24	16.67	66.67	44.44	53.33	16.67	52.50	48.61	50.48	0.00	6.67	3.33	4.44
	Find Substitutes	50.00	50.00	50.00	50.00	33.33	33.33	33.33	33.33	33.33	33.33	33.33	33.33	33.33	33.33	33.33	33.33
	Find Compatible Products	40.00	60.00	46.67	52.50	40.00	40.00	40.00	40.00	60.00	70.00	66.67	68.29	20.00	20.00	20.00	20.00
	End-to-End	37.50	50.00	43.75	46.67	62.50	62.50	62.50	62.50	75.00	75.00	75.00	75.00	0.00	0.00	0.00	0.00
Claude Sonnet 4	Cheapest Best Fit Specific Requirements	60.00	60.00	60.00	60.00	50.00	50.00	50.00	50.00	50.00	50.00	50.00	50.00	10.00	10.00	10.00	10.00
	Best Fit Vague Requirements	37.50	68.39	68.75	68.57	37.50	71.88	57.64	63.97	37.50	58.48	62.15	60.26	0.00	31.25	10.94	16.20
	Cheapest Best Fit Vague Requirements	33.33	52.78	40.56	45.87	33.33	33.33	33.33	33.33	16.67	16.67	16.67	16.67	0.00	0.00	0.00	0.00
	Find Substitutes	83.33	83.33	83.33	83.33	66.67	66.67	66.67	66.67	16.67	16.67	16.67	16.67	0.00	0.00	0.00	0.00
	Find Compatible Products	40.00	52.22	66.67	58.57	20.00	54.00	60.00	56.84	60.00	60.00	60.00	60.00	20.00	20.00	20.00	20.00
	End-to-End	62.50	62.50	62.50	62.50	75.00	87.50	81.25	84.26	37.50	37.50	37.50	37.50	0.00	0.00	0.00	0.00

Token Usage, Cost and Runtime per Task Category

Model	Task Category	Observation Space	Avg. Steps	Avg. Input Tokens	Avg. Output Tokens	Avg. Runtime	Avg. Cost
Basic Tasks
GPT4.1	Single Product Search	AX-Tree	19.83	104,112	2,107	112.8s	0.23$
		AX-Tree + Memory	19.25	116,163	3,519	128.6s	0.26$
		AX-Tree + Vision	21.83	142,731	2,123	164.6s	0.30$
		Vision	25.50	88,910	2,348	178.1s	0.20$
Claude Sonnet 4	Single Product Search	AX-Tree	21.17	152,088	4,969	204.7s	0.53$
		AX-Tree + Memory	20.25	222,168	22,060	371.9s	1.00$
		AX-Tree + Vision	22.83	199,290	5,643	256.6s	0.68$
		Vision	43.50	353,486	13,278	459.9s	1.26$
GPT4.1	Cheapest Product Search	AX-Tree	22.60	125,517	2,262	123.6s	0.27$
		AX-Tree + Memory	18.60	109,401	3,262	130.5s	0.24$
		AX-Tree + Vision	17.70	106,590	1,564	131.9s	0.23$
		Vision	20.90	65,711	1,758	135.1s	0.15$
Claude Sonnet 4	Cheapest Product Search	AX-Tree	18.80	118,040	3,786	148.8s	0.41$
		AX-Tree + Memory	18.10	150,517	10,139	255.6s	0.60$
		AX-Tree + Vision	22.50	181,149	4,451	236.8s	0.61$
		Vision	37.30	265,802	11,051	377.5s	0.96$
GPT4.1	Add to Cart	AX-Tree	17.43	84,593	1,765	99.9s	0.18$
		AX-Tree + Memory	17.71	97,523	3,145	126.4s	0.22$
		AX-Tree + Vision	18.00	106,785	1,602	133.4s	0.23$
		Vision	22.71	74,339	1,967	132.4s	0.16$
Claude Sonnet 4	Add to Cart	AX-Tree	23.29	163,199	13,892	253.1s	0.70$
		AX-Tree + Memory	19.43	164,683	12,305	253.0s	0.68$
		AX-Tree + Vision	21.71	168,716	4,359	206.3s	0.57$
		Vision	45.14	388,697	15,682	485.2s	1.40$
GPT4.1	Checkout	AX-Tree	23.88	130,314	2,025	129.9s	0.28$
		AX-Tree + Memory	23.75	142,864	3,016	138.6s	0.31$
		AX-Tree + Vision	23.75	151,931	1,739	149.1s	0.32$
		Vision	45.88	189,649	3,437	246.7s	0.41$
Claude Sonnet 4	Checkout	AX-Tree	23.38	161,683	4,504	178.5s	0.55$
		AX-Tree + Memory	23.50	204,719	8,058	243.0s	0.74$
		AX-Tree + Vision	26.75	242,492	6,433	250.5s	0.82$
		Vision	47.38	451,284	16,625	452.6s	1.60$
GPT4.1	Best Fit Specific Requirements	AX-Tree	28.36	196,660	3,232	175.9s	0.42$
		AX-Tree + Memory	24.64	176,312	4,320	181.1s	0.39$
		AX-Tree + Vision	22.64	159,616	2,271	185.3s	0.34$
		Vision	30.00	114,546	2,793	188.0s	0.25$
Claude Sonnet 4	Best Fit Specific Requirements	AX-Tree	31.36	326,041	8,655	322.5s	1.11$
		AX-Tree + Memory	28.18	399,688	18,946	484.3s	1.48$
		AX-Tree + Vision	33.18	392,796	9,640	410.8s	1.32$
		Vision	44.82	388,575	14,214	467.5s	1.38$
Advanced Tasks
GPT4.1	Best Fit Vague Requirements	AX-Tree	22.75	140,989	2,897	139.1s	0.31$
		AX-Tree + Memory	25.12	181,787	4,877	188.4s	0.40$
		AX-Tree + Vision	23.25	168,377	2,658	187.8s	0.36$
		Vision	32.62	126,021	3,236	225.6s	0.28$
Claude Sonnet 4	Best Fit Vague Requirements	AX-Tree	31.62	336,629	10,184	354.8s	1.16$
		AX-Tree + Memory	27.38	346,824	19,371	456.8s	1.33$
		AX-Tree + Vision	38.38	534,397	15,310	484.4s	1.83$
		Vision	48.75	475,440	20,131	563.5s	1.73$
GPT4.1	Cheapest Best Fit Specific Requirements	AX-Tree	19.30	107,727	2,213	118.6s	0.23$
		AX-Tree + Memory	18.90	120,608	3,744	136.9s	0.27$
		AX-Tree + Vision	18.80	117,998	1,876	153.6s	0.25$
		Vision	28.80	106,068	2,603	185.9s	0.23$
Claude Sonnet 4	Cheapest Best Fit Specific Requirements	AX-Tree	24.00	217,118	13,293	297.6s	0.85$
		AX-Tree + Memory	25.10	325,667	14,472	367.1s	1.19$
		AX-Tree + Vision	35.60	437,355	11,882	480.8s	1.49$
		Vision	47.30	412,536	15,742	538.1s	1.47$
GPT4.1	Cheapest Best Fit Vague Requirements	AX-Tree	26.67	184,821	3,598	193.9s	0.40$
		AX-Tree + Memory	28.00	246,932	6,551	232.0s	0.55$
		AX-Tree + Vision	26.17	209,730	3,200	226.5s	0.45$
		Vision	36.83	151,453	3,643	231.7s	0.33$
Claude Sonnet 4	Cheapest Best Fit Vague Requirements	AX-Tree	30.00	313,754	9,956	334.9s	1.09$
		AX-Tree + Memory	22.67	259,970	17,054	381.2s	1.04$
		AX-Tree + Vision	39.83	526,701	14,133	473.4s	1.79$
		Vision	42.17	357,716	18,598	524.5s	1.35$
GPT4.1	End-to-End	AX-Tree	32.62	203,866	3,236	212.0s	0.43$
		AX-Tree + Memory	32.75	261,536	5,404	226.7s	0.57$
		AX-Tree + Vision	37.12	282,332	3,141	260.1s	0.59$
		Vision	43.38	192,228	3,598	243.2s	0.41$
Claude Sonnet 4	End-to-End	AX-Tree	39.12	371,083	8,417	406.4s	1.24$
		AX-Tree + Memory	38.62	602,894	20,159	487.4s	2.11$
		AX-Tree + Vision	44.75	554,604	10,537	483.9s	1.82$
		Vision	50.50	434,913	18,041	568.8s	1.58$
GPT4.1	Find Compatible Products	AX-Tree	25.60	200,441	2,976	156.3s	0.42$
		AX-Tree + Memory	23.40	163,766	4,185	164.1s	0.36$
		AX-Tree + Vision	17.60	111,540	2,065	140.4s	0.24$
		Vision	28.80	113,408	3,014	239.9s	0.25$
Claude Sonnet 4	Find Compatible Products	AX-Tree	29.20	281,322	8,635	320.8s	0.97$
		AX-Tree + Memory	26.40	336,625	21,679	433.0s	1.34$
		AX-Tree + Vision	26.00	286,474	10,047	435.4s	1.01$
		Vision	46.80	380,184	15,129	443.6s	1.37$
GPT4.1	Find Substitutes	AX-Tree	25.00	162,071	3,197	151.2s	0.35$
		AX-Tree + Memory	17.17	106,951	3,393	118.1s	0.24$
		AX-Tree + Vision	17.50	117,727	1,911	149.3s	0.25$
		Vision	28.67	113,057	2,749	184.5s	0.25$
Claude Sonnet 4	Find Substitutes	AX-Tree	23.83	232,177	8,014	264.0s	0.82$
		AX-Tree + Memory	21.33	265,259	18,121	403.8s	1.07$
		AX-Tree + Vision	35.33	495,069	13,742	453.3s	1.69$
		Vision	49.83	446,313	16,761	542.7s	1.59$

Running the WebMall Benchmark

To run the benchmark, we assume a Unix-like operating system on which Docker is available. If you are using Windows, please refer to the WSL setup of Docker.

How to Set Up the Shops

The setup consists of different Docker containers, two for each shop (shop + database). This allows for a simple setup using Docker Compose. Please refer to the installation guide on GitHub for setting up the shops.

Performing Experiments

Instructions on how to perform experiments on single tasks as well as how to run a full study on the benchmark sets are found in the GitHub installation guide. The system writes comprehensive logs during execution, including agent actions, observations, and performance metrics. All logs, including summary results, are stored in the output directory set in the .env file for later analysis.

Related Work

[Yehudai2025] survey benchmarks for the evaluation of LLM agents and categorize them according to the agent's application domain as well as the agent capabilities that are evaluated. Other benchmarks that also evaluate the capability of Web agents to perform online shopping are the WebShop benchmark as well as the WebArena [Zhou2023] and Mind2Web benchmarks, which feature e-commerce tasks as part of a wider task set. Compared to these benchmarks, the WebMall benchmark requires agents to perform longer running tasks (due to visiting multiple shops), to deal with heterogeneous product data originating from different real-world sources, and to perform advanced searches such as finding compatible or substitute products.

Feedback

We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact the authors of the benchmark directly via email.

References

[Brinkmann2023] Brinkmann, Alexander, et al.: The Web Data Commons Schema.org Data Set Series. Companion Proceedings of the ACM Web Conference, 2023.

[Chezelles2025] Le Sellier De Chezelles, Thibault, et al.: The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467, 2025.

[Yehudai2025] Yehudai, Asaf, et al.: Survey on Evaluation of LLM-based Agents. arXiv:2503.16416, 2025.

[Zhou2023] Zhou, Shuyan, et al.: WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023.