MaDI-Bench: Mannheim Data Integration Benchmark

Abstract

Data integration combines heterogeneous data sets into a single, coherent representation. It involves a sequence of interdependent tasks: schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only incomplete versions of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on data integration methods that address the integration process as a whole.

The Mannheim Data Integration Benchmark (MaDI-Bench) fills this gap. MaDI-Bench is the first benchmark for the end-to-end integration of relational tables covering all steps of the integration process. MaDI-Bench contributes (i) a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance. We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artefacts are available for public download.

The Base Tasks

MaDI-Bench consists of five end-to-end data integration tasks: Games, Companies, Music, Products, and Scientific Papers. Each task takes several heterogeneous source tables in a domain and asks a system to return one fused target table, exercising schema matching, value normalization, entity matching, and data fusion together. The benchmark provides ground truth to score every step on its own, not only the final table: a gold schema mapping for schema matching, the target schema's constraints and taxonomies for value normalization (measured as consistency with the schema), labeled record pairs for blocking and entity matching, and hand-verified records for data fusion, together with metrics for the end-to-end result. Each base task also ships in easy, medium, and hard variants, for 20 integration tasks in total.

Schema matching → Value normalization → Entity blocking & matching → Data fusion

5 domains · 20 tasks · 93,000 matching records · 11,000 human-verified values

Games. This task requires the integration of three video-game datasets: Metacritic (review scores and ESRB ratings), a sales dataset (commercial performance), and DBpedia (release dates, developers, platforms, genres, and series). The target schema has ten attributes.

Challenges

The same game on a different platform is a separate entity, so a matcher must not over-rely on the title.
Identical titles recur across platforms and sequels, and special editions and downloadable content differ only slightly in name.
Platform, genre, date, and rating vocabularies must be normalized (taxonomies are provided).

Sample records from each source

DBpedia · dbpedia.csv

wiki_ref	title	launch_yr	studio	system	genre	franchise
dbpedia_11	Mario Kart Arcade GP	2005-01-01	Namco	Arcade game	Racing video game	Mario Kart
dbpedia_101	Dr. Mario	1990-01-01	Nintendo Research & Development 1	Game Boy Advance	Puzzle video game	(empty)
dbpedia_177	Sonic Jump	2012-01-01	Hardlight	Android (operating system)	Platform game	Sonic the Hedgehog (series)

Metacritic · metacritic.csv

mc_id	game_title	year_published	made_by	console	genres	press_rating	player_rating	age_rating
metacritic_5	Super Mario Galaxy	2007-01-01	Nintendo	Wii	Action,Platformer,Platformer,3D,3D	97.0	9.1	E
metacritic_9	Super Mario Galaxy 2	2010-01-01	Nintendo EAD Tokyo	Wii	Action,Platformer,Platformer,3D,3D	97.0	9.1	E
metacritic_10	The Legend of Zelda: Ocarina of Time	1998-01-01	Nintendo	Nintendo 64	Action Adventure,Fantasy	99.0	9.0	E

Sales · sales.csv

rec_id	prod_title	launch_dt	studio	dist	hw	genre	press_score	comm_rating	age_classification	units_sold_mm
sales_2	Mario Kart Wii	2008-01-01	Nintendo	Nintendo	Wii	Racing	82	8.3	E	35
sales_4	New Super Mario Bros.	2006-01-01	Nintendo	Nintendo	DS	Platform	89	8.5	E	29
sales_7	Mario Kart DS	2005-01-01	Nintendo	Nintendo	DS	Racing	91	8.6	E	23

Companies. This task requires the integration of the Forbes Global 2000 list, a DBpedia company extract, and a FullContact company-profile sample. The target schema has eight attributes.

Challenges

Company-name variants, spelling differences, and corporate hierarchies.
Companies span the globe, so legal suffixes and locations are hard to normalize and resolve.
Financial figures, countries, and industry categorization taxonomies must be normalized.

Sample records from each source

Forbes · forbes.csv

forbes_url	company	url	region	business_segment	asset_value	sales_figure
http://www.forbes.com/companies/icbc/	ICBC	http://www.forbes.com/companies/icbc/	China	Major Banks	3124900000000	148700000000
http://www.forbes.com/companies/china-…	China Construction Bank	http://www.forbes.com/companies/china-…	China	Regional Banks	2449500000000	121300000000
http://www.forbes.com/companies/agricu…	Agricultural Bank of China	http://www.forbes.com/companies/agricu…	China	Regional Banks	2405400000000	136400000000

DBpedia · dbpedia.csv

entity_uri	org_name	established	nation	headquarters	sector	keypeople_name	total_assets_val	annual_income
http://dbpedia.org/resource/%C3%80_la_…	À la Table de Spanghero	1970-01-01	France	Castelnaudary	Meat	(empty)	(empty)	(empty)
http://dbpedia.org/resource/%C3%87al%C…	Çalık Enerji	1998-01-01	Turkey	Istanbul	(empty)	Çalık Holding	(empty)	(empty)
http://dbpedia.org/resource/%C3%87al%C…	Çalık Holding	1997-01-01	Turkey	Istanbul	(empty)	Ahmet Çalık	8	2.8

FullContact · fullcontact.csv

Attribute_1	Attribute_2	Attribute_3	Attribute_4	Attribute_5	Attribute_6
fullcontact_1	BBMG	United States	Brooklyn	Raphael Bemporad	(empty)
fullcontact_2	CIT Group Inc (DEL)	Canada	Toronto	(empty)	1908-01-01
fullcontact_3	City & National Employment	United States	Waterloo	(empty)	1957-01-01

Music. Integrates release-level records from Discogs, Last.fm, and MusicBrainz that describe albums, EPs, and singles with partial overlap. The target schema has eight attributes.

Challenges

Heterogeneous value formats, for example album durations recorded in different ways across sources.
Title and artist variants, and sparse source records.
Dates, countries, and track lists must be normalized.

Sample records from each source

Discogs · discogs.csv

rec_uid	title_str	performer	pub_dt	origin_loc	imprint	category	tracks_track-name
discogs_3	Fermats Theorem / Sight Beyond	John B	1996-01-01	UK	New Identity Recordings	Electronic	['Fermats Theorem', 'Sight Beyond']
discogs_4	Tempest / Inner Sense	Psychosis	1998-01-01	UK	Renegade Hardware	Electronic	['Tempest', 'Inner Sense']
discogs_5	The Sign's Alive	Lypid	2000-09-05	United States of America	Statra Recordings	Electronic	["The Sign's Alive (Original Mix)", "T…

Last.fm · lastfm.csv

item_code	album_title	band	album_length	tracks_track-name
lastFM_1	John B - Fermats Theorem / Sight Beyo…	John B	903	['Fermats Theorem', 'Sight Beyond']
lastFM_2	Tempest / Inner Sense	Psychosis	734	['Tempest', 'Inner Sense']
lastFM_4	Petalpusher - Surrender	Petalpusher	1626	['Surrender (Petalpusher Original)', "…

MusicBrainz · musicbrainz.csv

Attribute_1	Attribute_2	Attribute_3	Attribute_4	Attribute_5	Attribute_6	Attribute_9
mbrainz_1	Fermats Theorem / Sight Beyond	John B	1996-01-01	United Kingdom of Great Britain and No…	1055	['Fermats Theorem', 'Sight Beyond']
mbrainz_2	Tempest / Inner Sense	Psychosis	1998-12-14	United Kingdom of Great Britain and No…	724	['Tempest', 'Inner Sense']
mbrainz_3	The Sign's Alive	Lypid	2000-09-05	United States of America	2384	["The Sign's Alive (original mix)", "T…

Products. Based on a sample of the WDC Products benchmark, covering GPUs, SSDs, HDDs, and USB sticks. The target schema has 25 attributes, the largest in the benchmark.

Challenges

The large schema makes matching delicate: a small difference in a single technical attribute can separate a match from a non-match.
The same attribute appears under different naming conventions across sources.
Units, capacities, dimensions, and speeds must be normalized against taxonomies.

Sample records from each source

Dataset 1 · dataset_1.json

id	manufacturer	product_name	product_description	list_price	currency_code	cluster_id	product_url	name_and_description	model_name	manufacturer_part_number	category	gpu_chipset	video_memory_gb	capacity_gb	sequential_read_mb_s	sequential_write_mb_s	bus_standard	interface	width_millimeters	length_millimeters	height_millimeters	weight_grams	connector	memory_technology	colour	form_factor
12198483	Gigabyte	Gigabyte NVIDIA GeForce RTX 3080 Gamin…	CUDA Cores: 8704, Boost Clock: 1800MHz…	799.99	GBP	1002037	https://www.novatech.co.uk/products/gi…	Gigabyte NVIDIA GeForce RTX 3080 Gamin…	NVIDIA GeForce RTX 3080 Gaming OC	(empty)	GPU	GeForce RTX 3080	10	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	GDDR6X	(empty)	(empty)
78378158	Western Digital	WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv…	6TB WD Blue WD60EZAZ, 3.5\" HDD, SATA…	129.98	GBP	1004942	https://www.scan.co.uk/products/6tb-wd…	WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv…	WD Blue	WD60EZAZ	HDD	(empty)	(empty)	6000	(empty)	(empty)	SATA	SATA III	(empty)	(empty)	(empty)	(empty)	3.5-inch SATA	(empty)	(empty)	3.5-inch
80641070	Corsair	Corsair Force MP510 M.2 SSD - 960GB	Solid State Drive, 960 GB, intern, M.2…	2124.0	NOK	1007272	https://www.proshop.no/SSD/Corsair-For…	Corsair Force MP510 M.2 SSD - 960GB. D…	Force MP510	(empty)	SSD	(empty)	(empty)	960	(empty)	(empty)	PCI Express x4	NVMe	(empty)	(empty)	(empty)	(empty)	M.2	(empty)	(empty)	M.2 2280

Dataset 2 · dataset_2.json

id	brandName	name	descriptionText	priceAmount	currency	cluster_id	productUrl	titleAndDescription	modelName	mpn	productCategory	chipset	vramGb	capacityGb	readSpeedMbps	writeSpeedMbps	busType	interfaceType	widthMm	depthMm	heightMm	weightG	connectionType	memoryType	color	formFactor
19126355	Gigabyte	Gigabyte NVIDIA GeForce RTX 3080 10GB…	Gigabyte NVIDIA GeForce RTX 3080 GAMIN…	99999.99	GBP	1002037	https://www.scan.co.uk/products/gigaby…	Gigabyte NVIDIA GeForce RTX 3080 10GB…	GeForce RTX 3080 10GB GAMING OC	(empty)	GPU	GeForce RTX 3080	10	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	GDDR6X	(empty)	(empty)
42841911	Western Digital	WD Blue 6TB Desktop Hard Disk Drive -…	WD Blue 6TB Desktop Hard Disk Drive -…	128.99	USD	1004942	https://www.newegg.com/blue-wd60ezaz-6…	WD Blue 6TB Desktop Hard Disk Drive -…	WD Blue	WD60EZAZ	HDD	(empty)	(empty)	6000	(empty)	(empty)	SATA	SATA III	(empty)	(empty)	(empty)	(empty)	3.5 Inch	(empty)	(empty)	3.5-inch
46775597	Corsair	CORSAIR - Force Series MP510 960GB M.2…	CORSAIR Force Series MP510 960GB M.2 S…	(empty)	(empty)	1007272	https://www.dindator.se/corsair-force-…	CORSAIR - Force Series MP510 960GB M.2…	Force Series MP510	CSSD-F960GBMP510	SSD	(empty)	(empty)	960	3480	3000	PCI Express x4	NVMe	(empty)	(empty)	(empty)	(empty)	M.2	(empty)	(empty)	M.2 2280

Dataset 3 · dataset_3.json

id	Brand	ProductTitle	Details	Price	Currency	cluster_id	Link	TitleDetails	Model	PartNo	Type	Chipset	MemorySizeGB	CapacityGB	ReadMBs	WriteMBs	Bus	Interface	WidthMM	LengthMM	HeightMM	WeightG	Connector	MemoryType	Colour	FormFactor
46320085	Gigabyte	GIGABYTE GeForce RTX 3080 GAMING OC 10…	To Avail the offer Click Here	73160.0	INR	1002037	https://www.pcstudio.in/product/gigaby…	GIGABYTE GeForce RTX 3080 GAMING OC 10…	GeForce RTX 3080 GAMING OC 10G	GV-N3080GAMING OC-10GD	GPU	GeForce RTX 3080	10	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)
91583813	Western Digital	WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6…	Western Digital WD Blue 3.5-inch PC ha…	278.0	AUD	1004942	https://netplus.com.au/products/wd-blu…	WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6…	WD Blue	(empty)	HDD	(empty)	(empty)	6000	(empty)	(empty)	SATA	SATA III	(empty)	(empty)	(empty)	640.0	3.5-inch SATA	(empty)	(empty)	3.5-inch
86850217	Corsair	SSD 960GB Corsair Force MP510 NVMe	SSD 960GB Corsair Force MP510 NVMe	2316.0	SEK	1007272	https://www.nordwaystore.se/ssd-960gb-…	SSD 960GB Corsair Force MP510 NVMe. De…	Force MP510	(empty)	SSD	(empty)	(empty)	960	(empty)	(empty)	(empty)	NVMe	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)

Dataset 4 · dataset_4.json

id	mfr	name	desc	amt	cur	cluster_id	link	name_desc	mdl	pn	cat	chip	vram	cap_gb	rd_mbs	wr_mbs	bus	iface	w_mm	l_mm	h_mm	wt_g	conn	mem	clr	ff
66956099	Gigabyte	Gigabyte Video Card GV-N3080GAMING OC-…	Gigabyte Video Card GV-N3080GAMING OC-…	791.99	USD	1002037	https://www.thekeykey.com/detail/GV-N3…	Gigabyte Video Card GV-N3080GAMING OC-…	GeForce RTX 3080 GAMING OC	GV-N3080GAMING OC-10GD	GPU	GeForce RTX 3080	10	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	GDDR6X	(empty)	(empty)
5078198	Western Digital	Western Digital HDD Blue 6TB 3,5\" 256…	Dysk BLUE 6TB 3,5 256MB SATAIII/5400rp…	(empty)	(empty)	1004942	https://www.alsen.pl/podzespoly-komput…	Western Digital HDD Blue 6TB 3,5\" 256…	Blue	(empty)	HDD	(empty)	(empty)	6000	(empty)	(empty)	SATA	SATA III	(empty)	(empty)	(empty)	(empty)	3.5-inch SATA	(empty)	(empty)	3.5-inch
77571226	Corsair	Corsair MP510 M.2-2280 960GB	Huge 960GB storage capacity and super…	(empty)	(empty)	1007272	https://www.cclonline.com/product/2874…	Corsair MP510 M.2-2280 960GB. Descript…	MP510	(empty)	SSD	(empty)	(empty)	960	3480	3000	(empty)	(empty)	(empty)	(empty)	(empty)	(empty)	M.2	(empty)	(empty)	M.2 2280

Scientific Papers. Integrates computer-science paper records from DBLP, Crossref, and OpenAlex. The target schema has 11 attributes.

Challenges

This is the largest task by row count, so blocking efficiency matters: a poor reduction ratio leaves hundreds of thousands of comparisons. The task is therefore well suited to evaluating efficiency as well as effectiveness.
DOIs are not part of the released source data; fusion rows use a source_ids helper list to point back to the contributing source records.
Title and author-list variants, publication-type and venue normalization, and sparse metadata for volume, issue, pages, and citation counts.

Sample records from each source

Crossref · crossref.jsonl

id	work_type	title_text	contributor_names	issued_year	container_title	publisher_name	abstract_text	volume_id	issue_id	page_first	page_last	reference_total	cited_total
crossref-00000	inproceedings	A Simulation-driven Approach in Risk-…	['Ilaria Angela Amantea', ' Antonio D…	2018	(empty)	SCITEPRESS - Science and Technology P…	(empty)	(empty)	(empty)	98	105	0	18
crossref-00001	article	Group Recommendation Systems Based on…	['Guang Fang', ' Lei Su', ' Di Jiang'…	2018	Wireless Communications and Mobile Co…	Wiley	With the development of social networ…	2018	1	None	None	35	14
crossref-00002	article	Relation of Country‐of‐Origin Effect,…	['Juan Manuel Berbel-Pineda', ' Beatr…	2018	Complexity	Wiley	To obtain information about foreign m…	2018	1	None	None	60	10

DBLP · dblp.jsonl

id	entry_type	publication_title	author_list	pub_year	venue_name	volume_no	issue_no	page_start	page_finish
dblp-00000	inproceedings	Thread Weaving: Static Resource Sched…	['Hsuan Hsiao', 'Jason Helge Anderson…	2019	(empty)	(empty)	(empty)	(empty)	(empty)
dblp-00001	inproceedings	LEAD: learning-enabled energy-aware d…	['Mark Clark', 'Avinash Kodi', 'Razva…	2018	(empty)	(empty)	(empty)	1	6
dblp-00002	inproceedings	PlanarONoC: concurrent placement and …	['Yu-Kai Chuang', 'Kuan-Jung Chen', '…	2018	(empty)	(empty)	(empty)	1	6

OpenAlex · open_alex.jsonl

id	work_kind	display_title	authors_list	year_published	source_name	topic_terms	volume_tag	issue_tag	start_page	end_page	refs_count	citations_count
open_alex-00000	article	STRING v11: protein–protein associati…	['Damian Szklarczyk', 'Annika L Gable…	2018	Nucleic Acids Research	KEGG, Interaction network	47	D1	D607	D613	64	16038
open_alex-00001	article	Minimap2: pairwise alignment for nucl…	['Heng Li']	2018	Bioinformatics	Multiple sequence alignment	34	18	3094	3100	42	12292
open_alex-00002	article	SWISS-MODEL: homology modelling of pr…	['Andrew Waterhouse', 'Martino Berton…	2018	Nucleic Acids Research	Homology	46	W1	W296	W303	73	11711

Dataset statistics

Task	Sources	# Sources	Records	Attributes
Games	DBpedia, Metacritic, Sales	3	74,951	10
Companies	Forbes, DBpedia, FullContact	3	14,016	8
Music	Discogs, Last.fm, MusicBrainz	3	37,255	8
Products	Four WDC Products feeds	4	3,012	25
Scientific Papers	DBLP, Crossref, OpenAlex	3	182,059	11

Record counts are for the base task. The per-task pages in the repository list the full sizes for every source and difficulty level.

Artifacts per Task

MaDI-Bench provides the following artifacts for each base task:

Source tables and table metadata. One table per source, each with a metadata file recording its origin, the columns it contains, and the date it was published. Example: forbes.csv, forbes_metadata.json
Target schema. The columns of the fused table, with a value constraint per attribute (for example a valid founding-year range) and categorical attributes tied to standard taxonomies such as GICS industries and CLDR country names. Example: target_schema.json
Gold schema mapping. A correspondence from every source column to the target schema. Example: sm_mapping_gold.json
Entity matching sets. Labeled train, validation, and test pairs for each pair of sources. Example: forbes_2_dbpedia_train.csv
Fusion validation and test sets. 100 validation and 100 test entities per task, hand-annotated with the verified value for every attribute. Example: validation_set.xml
Reference outputs. The output of the human-engineered pipeline, used as a silver standard. Example: companies/base/output/

The sources describe the same kind of entity but rarely agree on column names, value formats, or even the facts themselves, which is what makes the merge hard.

Example · one company in three sources (Companies task)

Forbes · forbes.csv

company	Volkswagen Group
region	(empty)
business_segment	Auto & Truck Manufacturers
asset_value	446900000000
sales_figure	261500000000

DBpedia · dbpedia.csv

org_name	Volkswagen
established	1937-01-01
nation	Germany
headquarters	Wolfsburg
sector	Automotive industry
total_assets_val	323400000000

FullContact · fullcontact.csv

Attribute_2	Volkswagen Group
Attribute_3	Germany
Attribute_4	Wolfsburg
Attribute_5	(empty)
Attribute_6	1937-01-01

The same company in three sources. Forbes leaves the country blank, FullContact hides fields behind names like Attribute_2, the asset figures disagree, and DBpedia gives a shorter name. We follow this company through each step below.

The Integration Pipeline

A system under test (SUT) needs to handle the following four subtasks of the overall integration task. MaDI-Bench provides ground truth for each subtask as well as for the output of the overall integration (the fused dataset). Thus, an SUT can be evaluated one subtask at a time or end to end.

1. Schema Matching

Sources rarely share column names, so every source column has to be matched to a column in the target schema before values can be combined. Names may be descriptive, abbreviated, or meaningless, and where a name gives nothing away, a matcher has to read the values instead.

Ground truth: gold schema mapping · Metric: F1 over column correspondences

Example · company, org_name and Attribute_2 all map to name

forbes · company	→	name
dbpedia · established	→	founded
dbpedia · headquarters	→	city
fullcontact · Attribute_2 (from values only)	→	name
fullcontact · Attribute_6 (from values only)	→	founded
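
The F1 metric over column correspondences can be sketched as follows. This is a minimal illustration, assuming each correspondence is a (source, source column, target attribute) triple; the function name `correspondence_f1` and the predicted mapping are hypothetical and not part of the benchmark's tooling.

```python
def correspondence_f1(predicted, gold):
    """Return (precision, recall, F1) over sets of column correspondences."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Gold correspondences taken from the example above.
gold = {
    ("forbes", "company", "name"),
    ("dbpedia", "established", "founded"),
    ("dbpedia", "headquarters", "city"),
    ("fullcontact", "Attribute_2", "name"),
    ("fullcontact", "Attribute_6", "founded"),
}

# Hypothetical matcher output: one opaque column is mapped to the wrong target.
predicted = (gold - {("fullcontact", "Attribute_6", "founded")}) | {
    ("fullcontact", "Attribute_6", "city")
}

precision, recall, f1 = correspondence_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A single wrong correspondence costs both precision and recall, because the wrong pair counts as a false positive and the missed gold pair as a false negative.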

2. Value Normalization

Matched columns still store values in different formats and vocabularies, so they are rewritten into the forms the target schema requires. The schema sets a constraint for every attribute: a data type and, where applicable, a numeric range, a string pattern, or a date format. A founding date, for instance, must be an ISO YYYY-MM-DD date inside an allowed range, and assets must be a non-negative integer in US dollars below an upper bound. Categorical attributes are pinned to taxonomies that ship with the task and fix both the allowed values and the level to normalize to: countries follow CLDR, industries follow GICS, and each task adds its own, such as platform, genre, and rating for Games. Normalization is scored as consistency, that is, whether the output values satisfy these schema constraints.

Defined by: target-schema constraints & taxonomies · Reference: human-pipeline output

Example · the same fact, written many ways

Field	As found in sources	Normalized to target
industry	"Auto & Truck Manufacturers", "Automotive industry"	Automobiles (GICS)
founded	"January 01, 1937", "01.01.1937"	1937-01-01
assets	"411 818 350 000,00", "323.40" (bn)	411818350000, 323400000000
country	"中华人民共和国", "U.S.A."	China, United States (CLDR)

These messy forms are real values from the easy and hard variants; the base task is cleaner.
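
The date and number rows of the table can be reproduced with a small sketch. It assumes only the input formats shown above, treats the comma as a decimal separator, and takes the unit scale as an explicit argument; the helper names `normalize_date` and `normalize_amount` are hypothetical, and this is not the normalization code used by any of the benchmark pipelines.

```python
from datetime import datetime
from decimal import Decimal

# Assumed input formats, taken from the examples in the table above.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y")


def normalize_date(raw):
    """Try each known format and return an ISO YYYY-MM-DD string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_amount(raw, scale=1):
    """Parse a number written with space thousands separators and a decimal
    comma (assumption), apply the unit scale, and return an integer."""
    cleaned = raw.replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return int(Decimal(cleaned) * scale)


print(normalize_date("January 01, 1937"))
print(normalize_date("01.01.1937"))
print(normalize_amount("411 818 350 000,00"))
print(normalize_amount("323.40", scale=10**9))  # value given in billions
```

Using `Decimal` instead of `float` keeps the scaled figure exact, which matters when the result must be an integer number of dollars.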

3. Entity Blocking & Matching

Records that refer to the same entity have to be found across sources. Scoring the full cartesian product of record pairs is quadratic in the input size and computationally infeasible, so blocking first reduces it to a smaller set of candidate pairs, and matching then classifies each candidate as a match or a non-match. The hardest cases are look-alikes that fall into the same block but refer to different entities.

Ground truth: labeled pairs · Metric: pair completeness & reduction ratio (blocking), F1 (matching)

Example · candidate pairs for one company

forbes · Volkswagen Group ↔ dbpedia · Volkswagen	✓ match
forbes · Volkswagen Group ↔ fullcontact_802 · Volkswagen Group	✓ match
forbes · Volkswagen Group ↔ fullcontact_243 · CST Brands, Inc.	✗ no match
forbes · Volkswagen Group ↔ fullcontact_1462 · Intermountain Healthcare	✗ no match
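
The two blocking metrics can be sketched with their standard definitions: pair completeness is the share of labeled matches that survive blocking, and reduction ratio is the share of the cartesian product that blocking avoids. The Forbes record ids, the second true match, and the source sizes below are hypothetical placeholders chosen for illustration.

```python
def pair_completeness(candidate_pairs, true_matches):
    """Share of labeled true matches that are among the candidate pairs."""
    candidate_pairs, true_matches = set(candidate_pairs), set(true_matches)
    return len(candidate_pairs & true_matches) / len(true_matches)


def reduction_ratio(num_candidates, size_left, size_right):
    """Share of the full cartesian product that blocking avoids comparing."""
    return 1 - num_candidates / (size_left * size_right)


# Forbes-to-FullContact candidates from the example above ("forbes_vw" is a
# placeholder id for the Volkswagen Group record).
candidates = {
    ("forbes_vw", "fullcontact_802"),
    ("forbes_vw", "fullcontact_243"),
    ("forbes_vw", "fullcontact_1462"),
}
# One real match from the example plus one hypothetical match that blocking missed.
true_matches = {
    ("forbes_vw", "fullcontact_802"),
    ("forbes_other", "fullcontact_missed"),
}

print(pair_completeness(candidates, true_matches))
# Illustrative sizes: 1,000 x 2,000 records reduced to 5,000 candidate pairs.
print(reduction_ratio(5_000, 1_000, 2_000))
```

The two metrics pull against each other: a looser blocker raises pair completeness but lowers the reduction ratio, which is why both are reported.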

4. Data Fusion

After matching groups the records of an entity into a cluster, fusion resolves the cluster into a single value for each attribute. Where the sources report conflicting values, fusion has to decide which value to keep, and this conflict resolution is the step where the benchmark leaves the most room for improvement. Each gold value was set by annotators who verified it against independent sources.

Ground truth: hand-verified fused records · Metric: fusion accuracy

Example · one fused record, with where each value came from

name	Volkswagen Group	Forbes, FullContact
country	Germany	DBpedia, FullContact
founded	1937-01-01	DBpedia, FullContact
assets	447910000000	verified
revenue	261910000000	verified

Forbes reports $446.9B in assets, DBpedia reports $323.4B. Neither is taken at face value: the gold value, $447.91B, is the figure annotators confirmed by hand.
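
A minimal sketch of per-attribute conflict resolution and fusion accuracy on this cluster follows. The resolver choices (most frequent value for strings, maximum for assets) are illustrative assumptions, not the benchmark's or any pipeline's configuration, and the records are assumed to be already mapped to the target schema.

```python
from collections import Counter


def most_frequent(values):
    """Pick the value reported by the most sources (first seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


def fuse(cluster, resolvers):
    """Resolve each attribute of a cluster of matched records to one value."""
    fused = {}
    for attribute, resolve in resolvers.items():
        values = [r[attribute] for r in cluster if r.get(attribute) is not None]
        fused[attribute] = resolve(values) if values else None
    return fused


def fusion_accuracy(fused, gold):
    """Share of gold attribute values that the fused record reproduces exactly."""
    return sum(fused.get(a) == v for a, v in gold.items()) / len(gold)


cluster = [  # Forbes, DBpedia, FullContact values from the running example
    {"name": "Volkswagen Group", "country": None, "assets": 446900000000},
    {"name": "Volkswagen", "country": "Germany", "founded": "1937-01-01",
     "assets": 323400000000},
    {"name": "Volkswagen Group", "country": "Germany", "founded": "1937-01-01"},
]
resolvers = {"name": most_frequent, "country": most_frequent,
             "founded": most_frequent, "assets": max}
gold = {"name": "Volkswagen Group", "country": "Germany",
        "founded": "1937-01-01", "assets": 447910000000}

fused = fuse(cluster, resolvers)
print(fused)
print(fusion_accuracy(fused, gold))
```

Under these assumptions the string attributes resolve correctly, but no heuristic that only selects among source values can recover the verified asset figure, since neither source reports it.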

End-to-End Evaluation

Beyond the per-step scores, MaDI-Bench rates the final fused table along three quality dimensions: coverage (did the right entities and values make it in?), consistency (does the table fit the target schema?), and correctness (are the merged values right?). Each is reported at three reference levels: reference-free (structure only, no labels needed), silver (against the human pipeline output), and ground truth (against the annotated test sets).

Validation

We validate MaDI-Bench with three pipelines that span the design space from human-engineered to fully automated. All three build on the PyDI integration framework, and their per-step and end-to-end outputs are released in the repository under results/ and the per-task output/ folders.

P1 · Human-engineered pipeline · silver reference

Domain-specific workflows that master's and PhD students built and refined on PyDI. Every workflow follows the same order: PyDI's LLM schema matcher proposes correspondences, an engineer inspects and hand-corrects them, and the sources are mapped to the target schema. A normalization step then parses dates and numbers, converts units, and applies the taxonomy mappings. Blocking uses embedding or task-specific keys tuned to at least 97% pair completeness. Entity matching combines rule-based and learned tree-based matchers with a global one-to-one assignment, and fusion applies per-attribute PyDI heuristics such as source trust, longest or shortest string, numeric aggregation, and set union.

It is a practical human reference rather than an optimum, and its fused output serves as the benchmark's silver standard.

Code and outputs: PyDI framework · workflow notebook · silver outputs
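
The one-to-one assignment mentioned above can be illustrated with a greedy sketch: pairs are accepted in descending score order, and each record is used at most once. This is an assumption-laden simplification, not PyDI's implementation (an optimal assignment would use a bipartite matching algorithm); the record ids, scores, and threshold are made up.

```python
def greedy_one_to_one(scored_pairs, threshold=0.5):
    """Accept pairs in descending score order, using each record at most once."""
    used_left, used_right, matches = set(), set(), []
    for left, right, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if score < threshold:
            break  # every remaining pair scores lower still
        if left in used_left or right in used_right:
            continue  # one side already has a better-scoring partner
        used_left.add(left)
        used_right.add(right)
        matches.append((left, right))
    return matches


# Hypothetical matcher scores for Forbes-to-FullContact candidate pairs.
scored_pairs = [
    ("forbes_a", "fullcontact_802", 0.95),
    ("forbes_a", "fullcontact_243", 0.62),
    ("forbes_b", "fullcontact_802", 0.71),
    ("forbes_b", "fullcontact_1462", 0.30),
]
print(greedy_one_to_one(scored_pairs))
```

The constraint removes the two lower-scoring pairs that would otherwise link one record to several partners, which is the effect a one-to-one assignment is meant to have when each source is assumed to be duplicate-free.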

P2 · Best-of-breed pipeline · committee per step

Chains state-of-the-art methods from the literature, one per integration step. It runs the four stages once: at each stage a committee of competing methods is scored on the validation set, and the winner's output is passed to the next stage.

Schema matching: COMA, Magneto, a Sentence-BERT embedding matcher, and PyDI's label-, instance-, duplicate- and LLM-based matchers. Blocking: standard, token and sorted-neighborhood blocking, BM25, Sentence-BERT, and SC-Block, keeping the blocker that clears 97% pair completeness at the highest reduction ratio. Entity matching: Ditto and Magellan. Fusion: PyDI heuristics plus the TruthFinder, LTM, AccuSim, CASEFusion, and FusionQuery truth-discovery models, with a per-attribute selector.

Outputs: results/best of breeds · per-stage method selections for one run
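
The blocker selection rule (keep the blocker that clears 97% pair completeness at the highest reduction ratio) can be sketched as below. The validation numbers are invented for illustration, the dictionary layout is an assumption, and returning `None` when no blocker qualifies is a placeholder for whatever fallback the actual pipeline uses.

```python
def select_blocker(validation_results, min_pair_completeness=0.97):
    """Among blockers meeting the completeness floor, pick the highest reduction ratio."""
    eligible = {
        name: metrics
        for name, metrics in validation_results.items()
        if metrics["pair_completeness"] >= min_pair_completeness
    }
    if not eligible:
        return None  # assumption: the fallback behaviour is not specified here
    return max(eligible, key=lambda name: eligible[name]["reduction_ratio"])


# Made-up validation scores for four of the committee's blockers.
validation_results = {
    "standard": {"pair_completeness": 0.990, "reduction_ratio": 0.900},
    "token": {"pair_completeness": 0.995, "reduction_ratio": 0.850},
    "sentence_bert": {"pair_completeness": 0.975, "reduction_ratio": 0.997},
    "sc_block": {"pair_completeness": 0.960, "reduction_ratio": 0.999},
}
print(select_blocker(validation_results))
```

In this toy run the blocker with the best reduction ratio is rejected because it falls below the completeness floor, so the rule trades a little pruning for recall.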

P3 · LLM-based pipeline · automatic, self-configuring

Showcases how well an LLM can configure an end-to-end data integration pipeline. It uses the same PyDI methods as the human pipeline, but an LLM, not a person, sets them up. GPT-5.5 performs schema matching, and the values are normalized by mapping them to the predefined taxonomies, with a column profiler selecting a normalization function for the remaining attributes. An embedding blocker then reduces the sources to candidate pairs.

From a sample of those pairs the pipeline builds its own machine-labeled training and validation sets with GPT-5.2 as the labeler, independent of the labels the benchmark provides, and uses them to configure an entity matcher chosen among rule-based and feature-based models such as random forest and XGBoost. For fusion it writes a configuration that assigns each attribute a conflict-resolution heuristic. The LLM only configures the pipeline and generates the training artifacts; it plays no part in the integration itself, which is what lets the pipeline scale to the largest tasks.

Code: automatic-data-integration · Outputs: results/llm pipeline · a sample run

The three pipelines behave very differently across the steps. Schema matching is close to perfect for all three, entity matching covers a wide range across tasks, and data fusion is the hardest step for every pipeline. Every number below is per task, and the strongest pipeline in each row is shown in green. The columns are the human-engineered pipeline (Human, P1), the best-of-breed pipeline (BoB, P2), and the LLM-based pipeline (LLM, P3).

Schema matching

The three pipelines reach close to optimal F1 by using the LLM schema matcher provided in PyDI, which reads column names and values together. The label-based and instance-based matchers are shown for reference: their much lower scores show that schema matching is genuinely difficult in the benchmark, while state-of-the-art LLMs reach close to perfect performance once they are given both the headers and the values. Values are F1 (%).

Task	Human	BoB	LLM	Label-based	Instance-based
Games	100	100	100	30.00	60.50
Companies	100	100	97.56	26.10	53.80
Music	100	100	100	28.60	64.50
Products	100	100	100	61.60	7.70
Scientific Papers	100	97.14	100	73.80	33.30

Blocking and entity matching

Pair completeness stays between 94.4 and 100% for every pipeline, so the candidate sets keep almost all of the labeled test matches. Reduction ratio is above 99% on every task except Products, where the LLM pipeline's embedding blocker prunes 97.5% of the pairs against 71.9% for the other two. Entity matching itself spans a wide range, from near the ceiling on Papers and Music to clearly more open on Games and Products. Values are percentages. The best entity-matching score per task is in green.

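Task	Pair completeness (Human)	Pair completeness (BoB)	Pair completeness (LLM)	Reduction ratio (Human)	Reduction ratio (BoB)	Reduction ratio (LLM)	Entity matching F1 (Human)	Entity matching F1 (BoB)	Entity matching F1 (LLM)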
Games	99.55	100.00	96.85	99.93	99.90	99.96	89.45	67.30	63.15
Companies	97.73	94.37	100.00	99.34	99.24	99.69	87.65	89.29	90.69
Music	96.55	96.34	99.85	99.67	99.67	99.88	95.35	94.84	98.11
Products	100.00	100.00	98.00	71.91	71.91	97.54	70.37	84.09	63.09
Scientific Papers	97.90	97.90	99.69	99.98	99.98	99.97	99.78	99.70	96.05

Pair completeness and reduction ratio measure blocking quality on the labeled entity-matching test set. Entity matching F1 is on the same test set.

Data fusion

Fusion is the hardest step for every pipeline. The LLM pipeline produces the most accurate fused values on four of the five tasks. Best-of-breed leads on Products. Values are accuracy (%). The best per task is in green.

Task	Human	BoB	LLM
Games	71.70	64.91	84.87
Companies	44.70	45.90	63.24
Music	70.20	76.92	83.12
Products	40.20	56.68	43.65
Scientific Papers	78.90	61.04	79.35

Efficiency

All three pipelines are preconfigured, so the times below cover only the final integration run, not any configuration search or design effort. The main difference comes from entity matching: best-of-breed mostly selects Ditto, a pre-trained language model whose per-task fine-tuning and inference dominate its runtime, while the human and LLM pipelines use lighter matchers. On Companies, where best-of-breed instead picks the cheaper Magellan, its runtime (213 s) is close to the LLM pipeline's (203 s), though still about three times the human pipeline's (70 s). Values are end-to-end wall-clock seconds.

Task	Human	BoB	LLM
Games	180	5,736	196
Companies	70	213	203
Music	184	5,980	211
Products	98	1,388	718
Scientific Papers	637	16,184	1,374
Mean	234	5,900	540
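
The Mean row can be reproduced from the per-task times in the table, as this short check shows. The only assumption is that the mean is an unweighted arithmetic mean over the five tasks.

```python
from statistics import mean

# Per-task wall-clock seconds copied from the table above
# (Games, Companies, Music, Products, Scientific Papers).
runtimes = {
    "Human": [180, 70, 184, 98, 637],
    "BoB": [5736, 213, 5980, 1388, 16184],
    "LLM": [196, 203, 211, 718, 1374],
}

means = {pipeline: mean(seconds) for pipeline, seconds in runtimes.items()}
for pipeline, value in means.items():
    print(f"{pipeline}: {value:.1f} s")

# Relative cost of the best-of-breed pipeline against the human pipeline.
print(f"BoB / Human: {means['BoB'] / means['Human']:.1f}x")
```

The means are dominated by Scientific Papers, the largest task, so the per-task rows remain the better guide to how the pipelines compare on smaller inputs.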

End to end

MaDI-Bench also scores the final fused table as a whole, along three dimensions: coverage (are the right entities and values present), consistency (does the output fit the target schema), and correctness (are the merged values right). Each is reported at three reference levels: reference-free, against the human pipeline (silver), and against the annotated test sets (ground truth). The table gives the mean over the five tasks under the ground-truth reference, with the better pipeline per metric in green.

Metric (ground-truth reference, mean over tasks)	Best-of-breed	LLM-based
Entity recovery (coverage; higher is better)	0.90	0.81
Value drift (coverage; lower is better)	0.48	0.56
BCubed F1 (clustering correctness; higher is better)	0.97	0.94
Fusion accuracy (value correctness; higher is better)	0.61	0.71
Fully-correct rate (every attribute right at once; higher is better)	0.05	0.14

Best-of-breed wins on coverage and clustering. The LLM pipeline wins on value correctness. The fully-correct rate, which needs every attribute of an entity right at once, stays low for both, so the integrated task is far from solved. Schema validity (consistency) is close, near 0.95 for both pipelines. Full per-task numbers at all three reference levels are in the results directory.
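
Two of these metrics can be sketched to show why they behave so differently. The BCubed F1 below is the harmonic mean of record-averaged BCubed precision and recall, which is one common formulation and an assumption about the benchmark's exact variant; the toy clusterings and fused records are invented, and every gold record is assumed to appear in the prediction.

```python
def cluster_members(assignment):
    """Invert a record -> cluster id mapping into cluster id -> set of records."""
    clusters = {}
    for record, cluster in assignment.items():
        clusters.setdefault(cluster, set()).add(record)
    return clusters


def bcubed_f1(predicted, gold):
    """Harmonic mean of record-averaged BCubed precision and recall."""
    pred_clusters, gold_clusters = cluster_members(predicted), cluster_members(gold)
    precision = recall = 0.0
    for record in gold:
        pred_set = pred_clusters[predicted[record]]
        gold_set = gold_clusters[gold[record]]
        overlap = len(pred_set & gold_set)  # at least 1: the record itself
        precision += overlap / len(pred_set)
        recall += overlap / len(gold_set)
    precision, recall = precision / len(gold), recall / len(gold)
    return 2 * precision * recall / (precision + recall)


def value_accuracy(fused, gold):
    """Share of individual gold attribute values that were fused correctly."""
    pairs = [(fused[e].get(a), v) for e, attrs in gold.items() for a, v in attrs.items()]
    return sum(f == g for f, g in pairs) / len(pairs)


def fully_correct_rate(fused, gold):
    """Share of entities for which every gold attribute is correct at once."""
    return sum(
        all(fused[e].get(a) == v for a, v in attrs.items()) for e, attrs in gold.items()
    ) / len(gold)


gold_clusters = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
pred_clusters = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
print(round(bcubed_f1(pred_clusters, gold_clusters), 4))

gold_values = {"e1": {"name": "A", "country": "Germany"},
               "e2": {"name": "B", "country": "Japan"}}
fused_values = {"e1": {"name": "A", "country": "Germany"},
                "e2": {"name": "B", "country": "China"}}
print(value_accuracy(fused_values, gold_values), fully_correct_rate(fused_values, gold_values))
```

A single wrong value leaves per-value accuracy high but makes the whole entity count as not fully correct, which is why the fully-correct rate falls much faster than fusion accuracy as the number of attributes grows.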

Variant Generation

To keep the benchmark hard as systems improve, each base task also ships in easy, medium, and hard versions. The records and the correct answers stay the same. Eight difficulty knobs perturb the data, each set to its own target for easy, medium, and hard. A harder level turns up the kinds of heterogeneity that real integration runs into.

A record changed by the knobs

Here is the DBpedia record for the Taisei Corporation, base against hard. Schema naming divergence (Knob 8) first renames every column to a cryptic code: org_name→nm, established→ey, nation→cn, headquarters→hq, sector→sg, keypeople_name→kpn, total_assets_val→ta, annual_income→ai. On top of that, the values change:

Field	Base value	Hard value	Changed by
name	Taisei Corporation	Taisei Corporation	unchanged
founded	1873-01-01	01.01.1873	Knob 5 · format
country	Japan	Nippon	Knob 1 · surface
city	TokyoShinjuku, Tokyo	TokyoShinjuku, Tokyo	unchanged
industry	Construction	Constructi on	Knob 6 · value noise
key people	Okura Kihachiro	(dropped)	Knob 3 · attribute drop
assets	3,650,187,000	3650.19	Knob 5 · unit scale
revenue	213,195,000	167.18	Knob 5 · unit + currency

Seven changes from one base record: a renamed schema, a reformatted date, a reworded country, a corrupted industry string, a dropped person, and two rescaled financial figures. The easy level moves the same dials the other way, with descriptive column names, gentle date and number formats, and fewer dropped cells.

Difficulty Knobs

Each knob targets one or two pipeline stages and is set to an easy, medium, or hard target. MaDI-Bench uses eight:

Knob 1 Matching · Fusion

Surface augmentation

Rewrites attribute values into other surface forms using abbreviations, token reorderings, and dropped tokens. This erodes the surface overlap that blocking and matching depend on, and turns clean values into plausible-looking disagreements for fusion.

Example: San Francisco → Francisco San
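
The three rewrites named for this knob can be sketched in a few lines. This is a minimal illustration, not the benchmark's generator: the function names are hypothetical, and the real knob may choose among rewrites differently.

```python
import random


def reorder_tokens(value, rng):
    """Return the value with its tokens in a different order."""
    tokens = value.split()
    if len(set(tokens)) < 2:
        return value  # nothing to reorder
    shuffled = tokens[:]
    while shuffled == tokens:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def drop_token(value, rng):
    """Remove one randomly chosen token, keeping at least one."""
    tokens = value.split()
    if len(tokens) < 2:
        return value
    del tokens[rng.randrange(len(tokens))]
    return " ".join(tokens)


def abbreviate(value):
    """Keep the first token and shorten the remaining tokens to initials."""
    tokens = value.split()
    if not tokens:
        return value
    return " ".join([tokens[0]] + [token[0] + "." for token in tokens[1:]])


rng = random.Random(0)  # fixed seed so a variant can be regenerated
reordered = reorder_tokens("San Francisco", rng)  # "Francisco San"
shortened = abbreviate("Taisei Corporation")      # "Taisei C."
dropped = drop_token("San Francisco", rng)        # one of the two tokens
```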

Knob 2 Blocking · Matching

Entity niche density

Adds similar-but-distinct entities within the same domain niche, such as many products from one brand. Blocks fill up with non-matching pairs, and more candidate pairs sit close to the match / non-match boundary.

Example: generates a near-duplicate company and inserts it into a crowded niche

Knob 3 Blocking · Matching · Fusion

Attribute drop

Blanks a share of cells per source. Missing blocking keys make records unreachable, missing attributes weaken the evidence for matching, and sparse coverage leaves fusion less to cross-check.

Example: revenue 358,500,000 → (dropped)
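
Blanking a share of cells can be sketched as follows. The function name, the `protected` parameter, and the sample record are assumptions for illustration; the benchmark's own per-source targets are in the knob specification.

```python
import random


def drop_attributes(records, share, protected=("id",), seed=0):
    """Blank roughly `share` of the non-protected, non-empty cells."""
    rng = random.Random(seed)  # seeded so the same variant can be rebuilt
    result = []
    for record in records:
        result.append({
            column: None
            if column not in protected and value is not None and rng.random() < share
            else value
            for column, value in record.items()
        })
    return result


# Hypothetical source rows; only the revenue figure is taken from the example above.
rows = [{"id": "c1", "name": "Example Corp", "revenue": "358,500,000"}]

untouched = drop_attributes(rows, share=0.0)
emptied = drop_attributes(rows, share=1.0)
```

With `share=1.0` every cell except the protected identifier is blanked, which is the extreme case of a record that blocking can no longer reach through its usual keys.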

Knob 4 Fusion

Coverage skew

Varies how many sources describe each entity. This changes cluster sizes and thins the evidence available to resolve conflicts for entities that only one or two sources cover.

Example: Inter RAO: its DBpedia row is removed, leaving fewer sources
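
One way to picture coverage skew is to cap how many sources may describe an entity and remove the surplus rows. The sketch assumes each row carries a ground-truth `entity` identifier; the function name, parameters, and the source names other than DBpedia are hypothetical.

```python
import random


def skew_coverage(sources, share, max_sources, seed=0):
    """For about `share` of the entities, keep rows in at most `max_sources` sources."""
    rng = random.Random(seed)
    covering = {}
    for name, rows in sources.items():
        for row in rows:
            covering.setdefault(row["entity"], set()).add(name)

    removed = set()  # (source name, entity) pairs whose rows are deleted
    for entity in sorted(covering):
        names = sorted(covering[entity])
        if len(names) > max_sources and rng.random() < share:
            for name in rng.sample(names, len(names) - max_sources):
                removed.add((name, entity))

    return {
        name: [row for row in rows if (name, row["entity"]) not in removed]
        for name, rows in sources.items()
    }


# Toy input: two entities, each described by three sources.
sources = {
    "dbpedia": [{"entity": "inter_rao"}, {"entity": "accor"}],
    "source_b": [{"entity": "inter_rao"}, {"entity": "accor"}],
    "source_c": [{"entity": "inter_rao"}, {"entity": "accor"}],
}
skewed = skew_coverage(sources, share=1.0, max_sources=2)
remaining = sum(
    1 for rows in skewed.values() for row in rows if row["entity"] == "inter_rao"
)
```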

Knob 5 Normalization

Format & unit diversity

Rewrites dates, numbers, and quantities into more formats, units, and scales, for example ISO against dotted dates, or raw figures against millions. More admissible forms per field make value standardization harder.

Example: 65,170,000,000 → 65170 (millions)
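
The date and scale rewrites shown for this knob reduce to two small conversions. The helper names are assumptions; currency conversion, which the revenue example in the record table also involves, is left out because the exchange rate used is not given here.

```python
from datetime import date


def to_dotted(iso_date):
    """Rewrite an ISO date (YYYY-MM-DD) as a dotted date (DD.MM.YYYY)."""
    parsed = date.fromisoformat(iso_date)
    return f"{parsed.day:02d}.{parsed.month:02d}.{parsed.year:04d}"


def to_millions(amount, digits=2):
    """Express a raw figure in millions, rounded to `digits` decimals."""
    return round(amount / 1_000_000, digits)


founded = to_dotted("1873-01-01")            # "01.01.1873"
assets = to_millions(3_650_187_000)          # 3650.19
turnover = to_millions(65_170_000_000, 0)    # 65170.0
```

A normalizer has to undo both steps: recognize which format and scale a source uses, then map the value back to the form the target schema expects.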

Knob 6 Normalization · Matching · Fusion

Value noise

Corrupts a share of values with typos, encoding artefacts, and truncation. The damaged values lower string-similarity scores during matching and show up as spurious disagreements during fusion.

Example: 2005-01-01 → 2o0t-01-o1
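
A minimal sketch of three corruption types follows. The look-alike table, the function names, and the choice of corruptions are assumptions for illustration and do not reproduce the benchmark's noise model.

```python
import random

# Assumed confusion table: digits replaced by similar-looking letters.
LOOKALIKES = {"0": "o", "1": "l", "5": "s"}


def add_typos(value, share, rng):
    """Replace about `share` of the confusable characters with look-alikes."""
    chars = list(value)
    for index, char in enumerate(chars):
        if char in LOOKALIKES and rng.random() < share:
            chars[index] = LOOKALIKES[char]
    return "".join(chars)


def split_token(value, rng):
    """Insert a stray space somewhere inside the value."""
    if len(value) < 2:
        return value
    position = rng.randrange(1, len(value))
    return value[:position] + " " + value[position:]


def truncate(value, keep):
    """Cut the value off after `keep` characters."""
    return value[:keep]


rng = random.Random(0)
noisy_date = add_typos("2005-01-01", 1.0, rng)   # every confusable digit replaced
broken_word = split_token("Construction", rng)   # one space inserted
cut_word = truncate("Construction", 8)           # "Construc"
```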

Knob 8 Schema matching

Schema naming divergence

Renames source columns along a scale from fully descriptive, through abbreviated, to opaque codes. At the hard end the column name carries no signal, so a matcher has to rely on the values alone.

Example: total_assets_val → ta
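
Column renaming at the hard level can be sketched with the codes from the Taisei Corporation example. The mapping is taken from that example; the function names and the sample record are illustrative only.

```python
# Hard-level codes as listed in the Taisei Corporation example.
HARD_NAMES = {
    "org_name": "nm",
    "established": "ey",
    "nation": "cn",
    "headquarters": "hq",
    "sector": "sg",
    "keypeople_name": "kpn",
    "total_assets_val": "ta",
    "annual_income": "ai",
}


def rename_columns(record, mapping):
    """Return the record with its columns renamed; unmapped columns are kept."""
    return {mapping.get(column, column): value for column, value in record.items()}


def invert(mapping):
    """Recover the original column names, as a gold schema mapping would."""
    return {code: name for name, code in mapping.items()}


base = {"org_name": "Taisei Corporation", "nation": "Japan", "sector": "Construction"}
hard = rename_columns(base, HARD_NAMES)
restored = rename_columns(hard, invert(HARD_NAMES))
```

Because the renaming is a one-to-one mapping, the gold schema mapping for a variant can be derived from the base mapping without new annotation.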

Knob 10 Fusion

Source reliability

Reshuffles which source holds the correct value per attribute, so the most reliable source is right for only a share of entities. No fixed source-trust rule stays optimal.

Example: Accor: the verified name is reassigned to a different source
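
The reshuffle can be pictured as swapping values between sources for a share of entities. The sketch assumes one attribute at a time and a source that holds the correct value before the swap; the function name, the source names, and the variant spellings of the Accor name are hypothetical.

```python
import random


def reshuffle_reliability(claims, trusted, share_trusted_right, seed=0):
    """Leave the correct value with `trusted` for about `share_trusted_right`
    of the entities and move it to another source for the rest.

    claims: {entity: {source: value}}, where `trusted` starts out correct.
    """
    rng = random.Random(seed)
    result = {}
    for entity in sorted(claims):
        by_source = dict(claims[entity])
        others = sorted(source for source in by_source if source != trusted)
        if others and rng.random() >= share_trusted_right:
            other = rng.choice(others)
            by_source[trusted], by_source[other] = by_source[other], by_source[trusted]
        result[entity] = by_source
    return result


# Toy claims for one attribute; "Accor" is treated as the verified value.
claims = {
    "accor": {"source_a": "Accor", "source_b": "Accor S.A.", "source_c": "ACCOR Group"}
}
kept = reshuffle_reliability(claims, "source_a", share_trusted_right=1.0)
moved = reshuffle_reliability(claims, "source_a", share_trusted_right=0.0)
```

The set of values per entity is unchanged, so the gold record stays valid; only the source that carries it differs, which is what defeats a fixed trust ranking.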

Two further knobs (value ambiguity and schema distractors) are specified for synthetic tasks. Full specifications for every knob are in the repository.

Difficulty across the variants

The variants produce the intended difficulty gradient, clearest for data fusion: at the hard level fusion accuracy falls in every domain, by up to about 23 points for best-of-breed (Papers) and 28 for the LLM pipeline (Papers). Schema matching stays robust across the levels, and entity matching stays resilient once the matcher is re-tuned per variant. The hard variants raise the difficulty in every domain, while the easy variants keep simpler methods competitive. Values are percentages: schema matching and entity matching F1, and fusion accuracy.

Domain	Level	Best-of-breed (P2) SM	Best-of-breed (P2) EM	Best-of-breed (P2) Fusion	LLM-based (P3) SM	LLM-based (P3) EM	LLM-based (P3) Fusion
Companies	Base	100.0	89.3	45.9	97.6	90.7	63.2
	Easy	100.0	93.6	40.3	100.0	88.2	42.3
	Medium	100.0	88.9	39.6	100.0	77.0	56.8
	Hard	91.3	94.5	33.3	100.0	87.4	44.4
Games	Base	100.0	67.3	64.9	100.0	63.2	84.9
	Easy	98.1	66.9	62.1	97.9	76.0	81.4
	Medium	98.1	69.7	63.6	97.9	72.7	78.5
	Hard	98.1	79.2	52.5	97.9	82.0	72.6
Music	Base	100.0	94.8	76.9	100.0	98.1	83.1
	Easy	100.0	94.9	85.4	100.0	97.7	77.4
	Medium	100.0	93.2	65.1	100.0	90.9	71.0
	Hard	92.0	94.3	54.5	90.9	88.0	58.6
Products	Base	100.0	84.1	56.7	100.0	63.1	43.6
	Easy	99.5	91.5	51.6	100.0	27.5	49.3
	Medium	98.1	88.1	59.3	100.0	36.9	45.9
	Hard	98.1	85.3	49.5	98.0	51.0	34.1
Scientific Papers	Base	97.1	99.7	61.0	100.0	96.0	79.3
	Easy	92.9	99.5	58.1	91.7	90.9	64.6
	Medium	90.5	99.2	47.7	91.7	87.6	62.3
	Hard	92.9	96.7	37.8	91.7	78.1	51.1

SM is schema matching F1, EM is entity matching F1, Fusion is data fusion accuracy. The human pipeline is not run on the variants. These are the per-stage scores from the paper (Table XII).
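
The base-to-hard drops in fusion accuracy can be recomputed from the table. The snippet copies the Base and Hard fusion columns; the variable names are arbitrary.

```python
# (base, hard) fusion accuracy in percent, copied from the table above.
fusion = {
    "Companies": {"P2": (45.9, 33.3), "P3": (63.2, 44.4)},
    "Games": {"P2": (64.9, 52.5), "P3": (84.9, 72.6)},
    "Music": {"P2": (76.9, 54.5), "P3": (83.1, 58.6)},
    "Products": {"P2": (56.7, 49.5), "P3": (43.6, 34.1)},
    "Scientific Papers": {"P2": (61.0, 37.8), "P3": (79.3, 51.1)},
}

# Drop in percentage points per domain and pipeline.
drops = {
    domain: {pipeline: round(base - hard, 1) for pipeline, (base, hard) in scores.items()}
    for domain, scores in fusion.items()
}

# Domain with the largest drop for each pipeline.
largest = {
    pipeline: max(drops, key=lambda domain: drops[domain][pipeline])
    for pipeline in ("P2", "P3")
}
```

The largest drops are 23.2 points (P2) and 28.2 points (P3), both on Scientific Papers, and every domain shows a positive drop for both pipelines.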

Conclusion

This paper introduced MaDI-Bench, a benchmark for end-to-end data integration. The benchmark models the full complexity of the data integration process, including schema matching, value normalization, entity matching, and conflict resolution, while also accounting for the dependencies between these subtasks. To prevent rapid saturation of the benchmark as agentic systems progress, we introduced a generic variant-generation method for deriving harder variants from the base tasks. We validated MaDI-Bench and the generated variants using human-designed pipelines, an LLM-based pipeline, and a best-of-breed pipeline. The validation showed the benchmark's utility for measuring the step-wise as well as the end-to-end performance of data integration systems. We hope that the benchmark will prove useful for the community and will support the development of fully automatic as well as human-in-the-loop data integration systems.

Downloads

Tasks are released as plain files (CSV, JSON, and XML), so any system can read them. PyDI adds the integration steps and an evaluator for every stage, so a run can be scored without labeling any data.

Benchmark repository: all 20 tasks, ground truth, and the validation runs of the three pipelines.
PyDI: the integration framework and the per-step evaluator classes.
The tasks: inputs, target schemas, labeled splits, and fused gold records.

BibTeX

@misc{steiner2026madibench,
  title         = {MaDI-Bench: An End-to-End Data Integration Benchmark},
  author        = {Steiner, Aaron and Peeters, Ralph and Bizer, Christian},
  year          = {2026},
  eprint        = {2606.30371},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DB},
  url           = {https://arxiv.org/abs/2606.30371}
}