Abstract

Data integration combines heterogeneous data sets into a single, coherent representation. It involves a sequence of interdependent tasks: schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only incomplete versions of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on data integration methods that address the integration process as a whole.

The Mannheim Data Integration Benchmark (MaDI-Bench) fills this gap. MaDI-Bench is the first benchmark for the end-to-end integration of relational tables covering all steps of the integration process. MaDI-Bench contributes (i) a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance. We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artefacts are available for public download.

The Base Tasks

MaDI-Bench consists of five end-to-end data integration tasks: Games, Companies, Music, Products, and Scientific Papers. Each task takes several heterogeneous source tables in a domain and asks a system to return one fused target table, exercising schema matching, value normalization, entity matching, and data fusion together. The benchmark provides ground truth to score every step on its own, not only the final table: a gold schema mapping for schema matching, the target schema's constraints and taxonomies for value normalization (measured as consistency with the schema), labeled record pairs for blocking and entity matching, and hand-verified records for data fusion, together with metrics for the end-to-end result. Each base task also ships in easy, medium, and hard variants, for 20 integration tasks in total.

Schema matching  →  Value normalization  →  Entity blocking & matching  →  Data fusion

Games. This task requires the integration of three video-game datasets: Metacritic (review scores and ESRB ratings), a sales dataset (commercial performance), and DBpedia (release dates, developers, platforms, genres, and series). The target schema has ten attributes.

Challenges
  • The same game on a different platform is a separate entity, so a matcher must not over-rely on the title.
  • Identical titles recur across platforms and sequels, and special editions and downloadable content differ only slightly in name.
  • Platform, genre, date, and rating vocabularies must be normalized (taxonomies are provided).
Sample records from each source
DBpedia · dbpedia.csv
wiki_ref title launch_yr studio system genre franchise
dbpedia_11 Mario Kart Arcade GP 2005-01-01 Namco Arcade game Racing video game Mario Kart
dbpedia_101 Dr. Mario 1990-01-01 Nintendo Research & Development 1 Game Boy Advance Puzzle video game (empty)
dbpedia_177 Sonic Jump 2012-01-01 Hardlight Android (operating system) Platform game Sonic the Hedgehog (series)
Metacritic · metacritic.csv
mc_id game_title year_published made_by console genres press_rating player_rating age_rating
metacritic_5 Super Mario Galaxy 2007-01-01 Nintendo Wii Action,Platformer,Platformer,3D,3D 97.0 9.1 E
metacritic_9 Super Mario Galaxy 2 2010-01-01 Nintendo EAD Tokyo Wii Action,Platformer,Platformer,3D,3D 97.0 9.1 E
metacritic_10 The Legend of Zelda: Ocarina of Time 1998-01-01 Nintendo Nintendo 64 Action Adventure,Fantasy 99.0 9.0 E
Sales · sales.csv
rec_id prod_title launch_dt studio dist hw genre press_score comm_rating age_classification units_sold_mm
sales_2 Mario Kart Wii 2008-01-01 Nintendo Nintendo Wii Racing 82 8.3 E 35
sales_4 New Super Mario Bros. 2006-01-01 Nintendo Nintendo DS Platform 89 8.5 E 29
sales_7 Mario Kart DS 2005-01-01 Nintendo Nintendo DS Racing 91 8.6 E 23

Companies. This task requires the integration of the Forbes Global 2000 list, a DBpedia company extract, and a FullContact company-profile sample. The target schema has eight attributes.

Challenges
  • Company-name variants, spelling differences, and corporate hierarchies.
  • Companies span the globe, so legal suffixes and locations are hard to normalize and resolve.
  • Financial figures, countries, and industry categorization taxonomies must be normalized.
Sample records from each source
Forbes · forbes.csv
forbes_url company url region business_segment asset_value sales_figure
http://www.forbes.com/companies/icbc/ ICBC http://www.forbes.com/companies/icbc/ China Major Banks 3124900000000 148700000000
http://www.forbes.com/companies/china-… China Construction Bank http://www.forbes.com/companies/china-… China Regional Banks 2449500000000 121300000000
http://www.forbes.com/companies/agricu… Agricultural Bank of China http://www.forbes.com/companies/agricu… China Regional Banks 2405400000000 136400000000
DBpedia · dbpedia.csv
entity_uri org_name established nation headquarters sector keypeople_name total_assets_val annual_income
http://dbpedia.org/resource/%C3%80_la_… À la Table de Spanghero 1970-01-01 France Castelnaudary Meat (empty) (empty) (empty)
http://dbpedia.org/resource/%C3%87al%C… Çalık Enerji 1998-01-01 Turkey Istanbul (empty) Çalık Holding (empty) (empty)
http://dbpedia.org/resource/%C3%87al%C… Çalık Holding 1997-01-01 Turkey Istanbul (empty) Ahmet Çalık 8 2.8
FullContact · fullcontact.csv
Attribute_1 Attribute_2 Attribute_3 Attribute_4 Attribute_5 Attribute_6
fullcontact_1 BBMG United States Brooklyn Raphael Bemporad (empty)
fullcontact_2 CIT Group Inc (DEL) Canada Toronto (empty) 1908-01-01
fullcontact_3 City & National Employment United States Waterloo (empty) 1957-01-01

Music. Integrates release-level records from Discogs, Last.fm, and MusicBrainz that describe albums, EPs, and singles with partial overlap. The target schema has eight attributes.

Challenges
  • Heterogeneous value formats, for example album durations recorded in different ways across sources.
  • Title and artist variants, and sparse source records.
  • Dates, countries, and track lists must be normalized.
Sample records from each source
Discogs · discogs.csv
rec_uid title_str performer pub_dt origin_loc duration imprint category tracks_track-name
discogs_3 Fermats Theorem / Sight Beyond John B 1996-01-01 UK 0 New Identity Recordings Electronic ['Fermats Theorem', 'Sight Beyond']
discogs_4 Tempest / Inner Sense Psychosis 1998-01-01 UK 0 Renegade Hardware Electronic ['Tempest', 'Inner Sense']
discogs_5 The Sign's Alive Lypid 2000-09-05 United States of America 0 Statra Recordings Electronic ["The Sign's Alive (Original Mix)", "T…
Last.fm · lastfm.csv
item_code album_title band album_length tracks_track-name
lastFM_1 John B - Fermats Theorem / Sight Beyo… John B 903 ['Fermats Theorem', 'Sight Beyond']
lastFM_2 Tempest / Inner Sense Psychosis 734 ['Tempest', 'Inner Sense']
lastFM_4 Petalpusher - Surrender Petalpusher 1626 ['Surrender (Petalpusher Original)', "…
MusicBrainz · musicbrainz.csv
Attribute_1 Attribute_2 Attribute_3 Attribute_4 Attribute_5 Attribute_6 Attribute_9
mbrainz_1 Fermats Theorem / Sight Beyond John B 1996-01-01 United Kingdom of Great Britain and No… 1055 ['Fermats Theorem', 'Sight Beyond']
mbrainz_2 Tempest / Inner Sense Psychosis 1998-12-14 United Kingdom of Great Britain and No… 724 ['Tempest', 'Inner Sense']
mbrainz_3 The Sign's Alive Lypid 2000-09-05 United States of America 2384 ["The Sign's Alive (original mix)", "T…

Products. Based on a sample of the WDC Products benchmark, covering GPUs, SSDs, HDDs, and USB sticks. The target schema has 25 attributes, the largest in the benchmark.

Challenges
  • The large schema makes matching delicate: a small difference in a single technical attribute can separate a match from a non-match.
  • The same attribute appears under different naming conventions across sources.
  • Units, capacities, dimensions, and speeds must be normalized against taxonomies.
Sample records from each source
Dataset 1 · dataset_1.json
id manufacturer product_name product_description list_price currency_code cluster_id product_url name_and_description model_name manufacturer_part_number category gpu_chipset video_memory_gb capacity_gb sequential_read_mb_s sequential_write_mb_s bus_standard interface width_millimeters length_millimeters height_millimeters weight_grams connector memory_technology colour form_factor
12198483 Gigabyte Gigabyte NVIDIA GeForce RTX 3080 Gamin… CUDA Cores: 8704, Boost Clock: 1800MHz… 799.99 GBP 1002037 https://www.novatech.co.uk/products/gi… Gigabyte NVIDIA GeForce RTX 3080 Gamin… NVIDIA GeForce RTX 3080 Gaming OC (empty) GPU GeForce RTX 3080 10 (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) GDDR6X (empty) (empty)
78378158 Western Digital WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv… 6TB WD Blue WD60EZAZ, 3.5\" HDD, SATA… 129.98 GBP 1004942 https://www.scan.co.uk/products/6tb-wd… WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv… WD Blue WD60EZAZ HDD (empty) (empty) 6000 (empty) (empty) SATA SATA III (empty) (empty) (empty) (empty) 3.5-inch SATA (empty) (empty) 3.5-inch
80641070 Corsair Corsair Force MP510 M.2 SSD - 960GB Solid State Drive, 960 GB, intern, M.2… 2124.0 NOK 1007272 https://www.proshop.no/SSD/Corsair-For… Corsair Force MP510 M.2 SSD - 960GB. D… Force MP510 (empty) SSD (empty) (empty) 960 (empty) (empty) PCI Express x4 NVMe (empty) (empty) (empty) (empty) M.2 (empty) (empty) M.2 2280
Dataset 2 · dataset_2.json
id brandName name descriptionText priceAmount currency cluster_id productUrl titleAndDescription modelName mpn productCategory chipset vramGb capacityGb readSpeedMbps writeSpeedMbps busType interfaceType widthMm depthMm heightMm weightG connectionType memoryType color formFactor
19126355 Gigabyte Gigabyte NVIDIA GeForce RTX 3080 10GB… Gigabyte NVIDIA GeForce RTX 3080 GAMIN… 99999.99 GBP 1002037 https://www.scan.co.uk/products/gigaby… Gigabyte NVIDIA GeForce RTX 3080 10GB… GeForce RTX 3080 10GB GAMING OC (empty) GPU GeForce RTX 3080 10 (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) GDDR6X (empty) (empty)
42841911 Western Digital WD Blue 6TB Desktop Hard Disk Drive -… WD Blue 6TB Desktop Hard Disk Drive -… 128.99 USD 1004942 https://www.newegg.com/blue-wd60ezaz-6… WD Blue 6TB Desktop Hard Disk Drive -… WD Blue WD60EZAZ HDD (empty) (empty) 6000 (empty) (empty) SATA SATA III (empty) (empty) (empty) (empty) 3.5 Inch (empty) (empty) 3.5-inch
46775597 Corsair CORSAIR - Force Series MP510 960GB M.2… CORSAIR Force Series MP510 960GB M.2 S… (empty) (empty) 1007272 https://www.dindator.se/corsair-force-… CORSAIR - Force Series MP510 960GB M.2… Force Series MP510 CSSD-F960GBMP510 SSD (empty) (empty) 960 3480 3000 PCI Express x4 NVMe (empty) (empty) (empty) (empty) M.2 (empty) (empty) M.2 2280
Dataset 3 · dataset_3.json
id Brand ProductTitle Details Price Currency cluster_id Link TitleDetails Model PartNo Type Chipset MemorySizeGB CapacityGB ReadMBs WriteMBs Bus Interface WidthMM LengthMM HeightMM WeightG Connector MemoryType Colour FormFactor
46320085 Gigabyte GIGABYTE GeForce RTX 3080 GAMING OC 10… To Avail the offer Click Here 73160.0 INR 1002037 https://www.pcstudio.in/product/gigaby… GIGABYTE GeForce RTX 3080 GAMING OC 10… GeForce RTX 3080 GAMING OC 10G GV-N3080GAMING OC-10GD GPU GeForce RTX 3080 10 (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty)
91583813 Western Digital WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6… Western Digital WD Blue 3.5-inch PC ha… 278.0 AUD 1004942 https://netplus.com.au/products/wd-blu… WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6… WD Blue (empty) HDD (empty) (empty) 6000 (empty) (empty) SATA SATA III (empty) (empty) (empty) 640.0 3.5-inch SATA (empty) (empty) 3.5-inch
86850217 Corsair SSD 960GB Corsair Force MP510 NVMe SSD 960GB Corsair Force MP510 NVMe 2316.0 SEK 1007272 https://www.nordwaystore.se/ssd-960gb-… SSD 960GB Corsair Force MP510 NVMe. De… Force MP510 (empty) SSD (empty) (empty) 960 (empty) (empty) (empty) NVMe (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty)
Dataset 4 · dataset_4.json
id mfr name desc amt cur cluster_id link name_desc mdl pn cat chip vram cap_gb rd_mbs wr_mbs bus iface w_mm l_mm h_mm wt_g conn mem clr ff
66956099 Gigabyte Gigabyte Video Card GV-N3080GAMING OC-… Gigabyte Video Card GV-N3080GAMING OC-… 791.99 USD 1002037 https://www.thekeykey.com/detail/GV-N3… Gigabyte Video Card GV-N3080GAMING OC-… GeForce RTX 3080 GAMING OC GV-N3080GAMING OC-10GD GPU GeForce RTX 3080 10 (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) (empty) GDDR6X (empty) (empty)
5078198 Western Digital Western Digital HDD Blue 6TB 3,5\" 256… Dysk BLUE 6TB 3,5 256MB SATAIII/5400rp… (empty) (empty) 1004942 https://www.alsen.pl/podzespoly-komput… Western Digital HDD Blue 6TB 3,5\" 256… Blue (empty) HDD (empty) (empty) 6000 (empty) (empty) SATA SATA III (empty) (empty) (empty) (empty) 3.5-inch SATA (empty) (empty) 3.5-inch
77571226 Corsair Corsair MP510 M.2-2280 960GB Huge 960GB storage capacity and super… (empty) (empty) 1007272 https://www.cclonline.com/product/2874… Corsair MP510 M.2-2280 960GB. Descript… MP510 (empty) SSD (empty) (empty) 960 3480 3000 (empty) (empty) (empty) (empty) (empty) (empty) M.2 (empty) (empty) M.2 2280

Scientific Papers. Integrates computer-science paper records from DBLP, Crossref, and OpenAlex. The target schema has 11 attributes.

Challenges
  • The largest task by row count, so blocking efficiency matters: a poor reduction ratio leads to hundreds of thousands of comparisons, which makes the task suited for evaluating efficiency as well as effectiveness.
  • DOIs are not part of the released source data; fusion rows use a source_ids helper list to point back to the contributing source records.
  • Title and author-list variants, publication-type and venue normalization, and sparse metadata for volume, issue, pages, and citation counts.
Sample records from each source
Crossref · crossref.jsonl
id work_type title_text contributor_names issued_year container_title publisher_name abstract_text volume_id issue_id page_first page_last reference_total cited_total
crossref-00000 inproceedings A Simulation-driven Approach in Risk-… ['Ilaria Angela Amantea', ' Antonio D… 2018 (empty) SCITEPRESS - Science and Technology P… (empty) (empty) (empty) 98 105 0 18
crossref-00001 article Group Recommendation Systems Based on… ['Guang Fang', ' Lei Su', ' Di Jiang'… 2018 Wireless Communications and Mobile Co… Wiley With the development of social networ… 2018 1 None None 35 14
crossref-00002 article Relation of Country‐of‐Origin Effect,… ['Juan Manuel Berbel-Pineda', ' Beatr… 2018 Complexity Wiley To obtain information about foreign m… 2018 1 None None 60 10
DBLP · dblp.jsonl
id entry_type publication_title author_list pub_year venue_name volume_no issue_no page_start page_finish
dblp-00000 inproceedings Thread Weaving: Static Resource Sched… ['Hsuan Hsiao', 'Jason Helge Anderson… 2019 (empty) (empty) (empty) (empty) (empty)
dblp-00001 inproceedings LEAD: learning-enabled energy-aware d… ['Mark Clark', 'Avinash Kodi', 'Razva… 2018 (empty) (empty) (empty) 1 6
dblp-00002 inproceedings PlanarONoC: concurrent placement and … ['Yu-Kai Chuang', 'Kuan-Jung Chen', '… 2018 (empty) (empty) (empty) 1 6
OpenAlex · open_alex.jsonl
id work_kind display_title authors_list year_published source_name topic_terms volume_tag issue_tag start_page end_page refs_count citations_count
open_alex-00000 article STRING v11: protein–protein associati… ['Damian Szklarczyk', 'Annika L Gable… 2018 Nucleic Acids Research KEGG, Interaction network 47 D1 D607 D613 64 16038
open_alex-00001 article Minimap2: pairwise alignment for nucl… ['Heng Li'] 2018 Bioinformatics Multiple sequence alignment 34 18 3094 3100 42 12292
open_alex-00002 article SWISS-MODEL: homology modelling of pr… ['Andrew Waterhouse', 'Martino Berton… 2018 Nucleic Acids Research Homology 46 W1 W296 W303 73 11711

Dataset statistics

Task               Sources                        # Sources  Records  Attributes
Games              DBpedia, Metacritic, Sales     3          74,951   10
Companies          Forbes, DBpedia, FullContact   3          14,016   8
Music              Discogs, Last.fm, MusicBrainz  3          37,255   8
Products           Four WDC Products feeds        4          3,012    25
Scientific Papers  DBLP, Crossref, OpenAlex       3          182,059  11

Record counts are for the base task. The per-task pages in the repository list the full sizes for every source and difficulty level.

Artifacts per Task

MaDI-Bench provides the following artefacts for each of the base tasks:

  • Source tables and table metadata. One table per source, each with a metadata file recording its origin, the columns it contains, and the date it was published. Example: forbes.csv, forbes_metadata.json
  • Target schema. The columns of the fused table, with a value constraint per attribute (for example a valid founding-year range) and categorical attributes tied to standard taxonomies such as GICS industries and CLDR country names. Example: target_schema.json
  • Gold schema mapping. A correspondence from every source column to the target schema. Example: sm_mapping_gold.json
  • Entity matching sets. Labeled train, validation, and test pairs for each pair of sources. Example: forbes_2_dbpedia_train.csv
  • Fusion validation and test sets. 100 validation and 100 test entities per task, hand-annotated with the verified value for every attribute. Example: validation_set.xml
  • Reference outputs. The output of the human-engineered pipeline, used as a silver standard. Example: companies/base/output/

The sources describe the same kind of entity but rarely agree on column names, value formats, or even the facts themselves, which is what makes the merge hard.

Example  ·  one company in three sources (Companies task)
Forbes · forbes.csv
company Volkswagen Group
region (empty)
business_segment Auto & Truck Manufacturers
asset_value 446900000000
sales_figure 261500000000
DBpedia · dbpedia.csv
org_name Volkswagen
established 1937-01-01
nation Germany
headquarters Wolfsburg
sector Automotive industry
total_assets_val 323400000000
FullContact · fullcontact.csv
Attribute_2 Volkswagen Group
Attribute_3 Germany
Attribute_4 Wolfsburg
Attribute_5 (empty)
Attribute_6 1937-01-01

The same company in three sources. Forbes leaves the country blank, FullContact hides fields behind names like Attribute_2, the asset figures disagree, and DBpedia gives a shorter name. We follow this company through each step below.

The Integration Pipeline

A system under test (SUT) needs to handle the following four subtasks of the overall integration task. MaDI-Bench provides ground truth for each subtask as well as for the output of the overall integration (the fused dataset). Thus, a SUT can be evaluated one subtask at a time or end to end.

1 · Schema Matching

Sources rarely share column names, so every source column has to be matched to a column in the target schema before values can be combined. Names may be descriptive, abbreviated, or meaningless, and where a name gives nothing away, a matcher has to read the values instead.

Ground truth: gold schema mapping  ·  Metric: F1 over column correspondences

Example  ·  company, org_name and Attribute_2 all map to name
forbes · company  →  name
dbpedia · established  →  founded
dbpedia · headquarters  →  city
fullcontact · Attribute_2  →  name (from values only)
fullcontact · Attribute_6  →  founded (from values only)
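
As a rough illustration of the metric, F1 over column correspondences can be computed by treating each correspondence as a (source, column, target) triple and comparing the predicted set with the gold set. The sketch below assumes exact triple equality; the function name and the faulty prediction are hypothetical and not taken from the benchmark code.

```python
def correspondence_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source, column, target) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Gold correspondences from the Companies example above.
gold = {
    ("forbes", "company", "name"),
    ("dbpedia", "established", "founded"),
    ("dbpedia", "headquarters", "city"),
    ("fullcontact", "Attribute_2", "name"),
    ("fullcontact", "Attribute_6", "founded"),
}

# Hypothetical matcher output: one opaque column is mapped to the wrong target.
predicted = (gold - {("fullcontact", "Attribute_6", "founded")}) | {
    ("fullcontact", "Attribute_6", "city")
}

precision, recall, f1 = correspondence_f1(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

A single wrong target counts twice here: the wrong triple lowers precision and the missing gold triple lowers recall.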

2 · Value Normalization

Matched columns still store values in different formats and vocabularies, so they are rewritten into the forms the target schema requires. The schema sets a constraint for every attribute: a data type, and where it applies a numeric range, a string pattern, or a date format. A founding date, for instance, must be an ISO YYYY-MM-DD inside an allowed range, and assets must be a non-negative integer in US dollars below an upper bound. Categorical attributes are pinned to taxonomies that ship with the task and fix both the allowed values and the level to normalize to: countries follow CLDR, industries follow GICS, and each task adds its own, such as platform, genre, and rating for Games. Normalization is scored as consistency, that is, whether the output values satisfy these schema constraints.

Defined by: target-schema constraints & taxonomies  ·  Reference: human-pipeline output

Example  ·  the same fact, written many ways
Field     As found in sources                                  Normalized to target
industry  "Auto & Truck Manufacturers", "Automotive industry"  Automobiles (GICS)
founded   "January 01, 1937", "01.01.1937"                     1937-01-01
assets    "411 818 350 000,00", "323.40" (bn)                  411818350000, 323400000000
country   "中华人民共和国", "U.S.A."                             China, United States (CLDR)

These messy forms are real values from the easy and hard variants; the base task is cleaner.
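
A minimal sketch of what normalizing the date and asset values above could look like, followed by a consistency check in the spirit of the schema constraints. The list of input formats, the separator heuristic, and the year range are assumptions made for illustration; the actual constraints are defined in each task's target_schema.json.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed set of input formats; the variants may contain others.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def normalize_date(raw):
    """Try each known format and return an ISO YYYY-MM-DD string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw, scale=1):
    """Parse a number with space or comma separators and apply a unit scale.

    Heuristic (an assumption): a single comma without any dot is a decimal
    comma; all other commas are thousands separators.
    """
    text = raw.replace(" ", "").replace("\u00a0", "")
    if text.count(",") == 1 and "." not in text:
        text = text.replace(",", ".")
    else:
        text = text.replace(",", "")
    return int(Decimal(text) * scale)


def satisfies_date_constraint(value, min_year, max_year):
    """Consistency check: ISO pattern plus an allowed year range."""
    return bool(value) and bool(ISO_DATE.match(value)) and min_year <= int(value[:4]) <= max_year


print(normalize_date("January 01, 1937"))           # 1937-01-01
print(normalize_date("01.01.1937"))                 # 1937-01-01
print(normalize_amount("411 818 350 000,00"))       # 411818350000
print(normalize_amount("323.40", scale=10**9))      # 323400000000
print(satisfies_date_constraint("1937-01-01", 1000, 2030))  # hypothetical range
```

Decimal is used instead of float so that scaling "323.40" by a billion yields an exact integer.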

3 · Entity Blocking & Matching

Records that refer to the same entity have to be found across sources. Scoring the full Cartesian product of record pairs is quadratic in the input size and computationally infeasible for the larger tasks, so blocking first reduces it to a smaller set of candidate pairs, and matching then classifies each candidate as a match or a non-match. The hardest cases are look-alikes that fall into the same block but refer to different entities.

Ground truth: labeled pairs  ·  Metric: pair completeness & reduction ratio (blocking), F1 (matching)

Example  ·  candidate pairs for one company
forbes · Volkswagen Group  ↔  dbpedia · Volkswagen  ✓ match
forbes · Volkswagen Group  ↔  fullcontact_802 · Volkswagen Group  ✓ match
forbes · Volkswagen Group  ↔  fullcontact_243 · CST Brands, Inc.  ✗ no match
forbes · Volkswagen Group  ↔  fullcontact_1462 · Intermountain Healthcare  ✗ no match
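
The two blocking metrics can be illustrated with a small token blocker. This is a sketch under assumptions: token blocking on the name is only one of many possible blockers, the record id forbes_vw is hypothetical, and pair completeness and reduction ratio follow their usual definitions (share of true matches kept, and share of the cross product pruned).

```python
def token_blocking(left, right):
    """Return candidate (left_id, right_id) pairs that share at least one name token."""
    index = {}
    for rid, name in right.items():
        for token in set(name.lower().split()):
            index.setdefault(token, set()).add(rid)
    candidates = set()
    for lid, name in left.items():
        for token in set(name.lower().split()):
            candidates.update((lid, rid) for rid in index.get(token, ()))
    return candidates


def pair_completeness(candidates, true_matches):
    """Share of the true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, n_left, n_right):
    """Share of the full cross product that blocking prunes away."""
    return 1 - len(candidates) / (n_left * n_right)


left = {"forbes_vw": "Volkswagen Group"}  # hypothetical id for the Forbes record
right = {
    "fullcontact_802": "Volkswagen Group",
    "fullcontact_243": "CST Brands, Inc.",
    "fullcontact_1462": "Intermountain Healthcare",
    "fullcontact_2": "CIT Group Inc (DEL)",  # look-alike: shares the token "group"
}
true_matches = {("forbes_vw", "fullcontact_802")}

candidates = token_blocking(left, right)
print(sorted(candidates))
print(pair_completeness(candidates, true_matches))
print(reduction_ratio(candidates, len(left), len(right)))
```

The look-alike that shares only the token "group" ends up in the candidate set, which is exactly the kind of pair the matcher then has to reject.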

4 · Data Fusion

After matching groups the records of an entity into a cluster, fusion resolves the cluster into a single value for each attribute. Where the sources report conflicting values, fusion has to decide which value to keep, and this conflict resolution is the step where the benchmark leaves the most room for improvement. Each gold value was set by annotators who verified it against independent sources.

Ground truth: hand-verified fused records  ·  Metric: fusion accuracy

Example  ·  one fused record, with where each value came from
name     Volkswagen Group  (Forbes, FullContact)
country  Germany           (DBpedia, FullContact)
founded  1937-01-01        (DBpedia, FullContact)
assets   447910000000      (verified)
revenue  261910000000      (verified)

Forbes reports $446.9B in assets, DBpedia reports $323.4B. Neither is taken at face value: the gold value, $447.91B, is the figure annotators confirmed by hand.
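
To make the scoring concrete, the sketch below fuses the Volkswagen cluster with a simple majority vote and scores the result against the gold record. It assumes the source records are already mapped to the target schema and normalized, that ties go to the first source listed, and that fusion accuracy is the share of attributes that exactly equal the gold value; the benchmark's own metric may apply tolerances that are not modelled here.

```python
from collections import Counter


def majority_vote(values):
    """Return the most frequent non-empty value; ties go to the first one seen."""
    present = [v for v in values if v not in (None, "")]
    return Counter(present).most_common(1)[0][0] if present else None


def fuse(cluster, attributes):
    """Resolve a cluster of source records into one value per attribute."""
    return {a: majority_vote([rec.get(a) for rec in cluster.values()]) for a in attributes}


def fusion_accuracy(fused, gold):
    """Share of gold attributes whose fused value matches exactly (an assumption)."""
    return sum(fused.get(a) == v for a, v in gold.items()) / len(gold)


# Source values from the running example, keyed by target attribute names.
cluster = {
    "forbes": {"name": "Volkswagen Group", "country": None,
               "assets": 446900000000, "revenue": 261500000000},
    "dbpedia": {"name": "Volkswagen", "country": "Germany",
                "founded": "1937-01-01", "assets": 323400000000},
    "fullcontact": {"name": "Volkswagen Group", "country": "Germany",
                    "founded": "1937-01-01"},
}
gold = {"name": "Volkswagen Group", "country": "Germany", "founded": "1937-01-01",
        "assets": 447910000000, "revenue": 261910000000}

fused = fuse(cluster, gold.keys())
print(fused)
print(fusion_accuracy(fused, gold))
```

Under exact matching, no choice among the source values can score on assets or revenue, because the verified figures differ from every reported one.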

End-to-End Evaluation

Beyond the per-step scores, MaDI-Bench rates the final fused table along three quality dimensions: coverage (did the right entities and values make it in?), consistency (does the table fit the target schema?), and correctness (are the merged values right?). Each is reported at three reference levels: reference-free (structure only, no labels needed), silver (against the human pipeline output), and ground truth (against the annotated test sets).

Validation

We validate MaDI-Bench with three pipelines that span the design space from human-engineered to fully automated. All three build on the PyDI integration framework, and their per-step and end-to-end outputs are released in the repository under results/ and the per-task output/ folders.

P1 · Human-engineered pipeline (silver reference)

Domain-specific workflows that master's and PhD students built and refined on PyDI. Every workflow runs the same order: PyDI's LLM schema matcher proposes correspondences, an engineer inspects and hand-corrects them, and the sources are mapped to the target schema. A normalization step then parses dates and numbers, converts units, and applies the taxonomy mappings. Blocking uses embedding or task-specific keys tuned to at least 97% pair completeness. Entity matching combines rule-based and learned tree-based matchers with a global one-to-one assignment, and fusion applies per-attribute PyDI heuristics such as source trust, longest or shortest string, numeric aggregation, and set union.

It is a practical human reference rather than an optimum, and its fused output serves as the benchmark's silver standard.

P2 · Best-of-breed pipeline (committee per step)

Chains state-of-the-art methods from the literature, one per integration step. It runs the four stages once: at each stage a committee of competing methods is scored on the validation set, and the winner's output is passed to the next stage.

  • Schema matching: COMA, Magneto, a Sentence-BERT embedding matcher, and PyDI's label-, instance-, duplicate- and LLM-based matchers.
  • Blocking: standard, token and sorted-neighborhood blocking, BM25, Sentence-BERT, and SC-Block, keeping the blocker that clears 97% pair completeness at the highest reduction ratio.
  • Entity matching: Ditto and Magellan.
  • Fusion: PyDI heuristics plus the TruthFinder, LTM, AccuSim, CASEFusion, and FusionQuery truth-discovery models, with a per-attribute selector.
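
The blocker-selection rule can be written down in a few lines. The sketch below is an illustration, not the pipeline's code: the function name is hypothetical and the validation scores are made-up numbers chosen only to show the rule.

```python
def select_blocker(results, min_pair_completeness=0.97):
    """Pick the blocker with the highest reduction ratio among those that
    reach the pair-completeness threshold; return None if none qualifies.

    `results` maps a blocker name to (pair_completeness, reduction_ratio).
    """
    eligible = {
        name: scores for name, scores in results.items()
        if scores[0] >= min_pair_completeness
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name][1])


# Hypothetical validation scores, for illustration only.
validation_scores = {
    "standard": (0.990, 0.900),
    "token": (1.000, 0.720),
    "sentence-bert": (0.980, 0.995),
    "bm25": (0.950, 0.999),  # best reduction ratio, but below the threshold
}

print(select_blocker(validation_scores))
```

Treating pair completeness as a hard constraint and reduction ratio as the objective keeps recall from being traded away for speed.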

P3 · LLM-based pipeline (automatic, self-configuring)

Showcases how well an LLM can configure an end-to-end data integration pipeline. It uses the same PyDI methods as the human pipeline, but an LLM, not a person, sets them up. GPT-5.5 performs schema matching, and the values are normalized by mapping them to the predefined taxonomies, with a column profiler selecting a normalization function for the remaining attributes. An embedding blocker then reduces the sources to candidate pairs.

From a sample of those pairs the pipeline builds its own machine-labeled training and validation sets with GPT-5.2 as the labeler, independent of the labels the benchmark provides, and uses them to configure an entity matcher chosen among rule-based and feature-based models such as random forest and XGBoost. For fusion it writes a configuration that assigns each attribute a conflict-resolution heuristic. The LLM only configures the pipeline and generates the training artifacts; it plays no part in the integration itself, which is what lets the pipeline scale to the largest tasks.

The three pipelines behave very differently across the steps. Schema matching is close to perfect for all three, entity matching covers a wide range across tasks, and data fusion is the hardest step for every pipeline. Every number below is per task, and the strongest pipeline in each row is shown in green. The columns are the human-engineered pipeline (Human, P1), the best-of-breed pipeline (BoB, P2), and the LLM-based pipeline (LLM, P3).

Schema matching

The three pipelines reach close to optimal F1 by using the LLM schema matcher provided in PyDI, which reads column names and values together. The label-based and instance-based matchers are shown for reference: their much lower scores show that schema matching is genuinely difficult in the benchmark, while state-of-the-art LLMs reach close to perfect performance once they are given both the headers and the values. Values are F1 (%).

Task Human BoB LLM Label-based Instance-based
Games 100 100 100 30.00 60.50
Companies 100 100 97.56 26.10 53.80
Music 100 100 100 28.60 64.50
Products 100 100 100 61.60 7.70
Scientific Papers 100 97.14 100 73.80 33.30

Blocking and entity matching

Pair completeness stays between 94.4 and 100% for every pipeline, so the candidate sets keep almost all of the labeled test matches. Reduction ratio is above 99% on every task except Products, where the LLM pipeline's embedding blocker prunes 97.5% of the pairs against 71.9% for the other two. Entity matching itself spans a wide range, from near the ceiling on Papers and Music to clearly more open on Games and Products. Values are percentages. The best entity-matching score per task is in green.

                   Pair completeness       Reduction ratio        Entity matching F1
Task               Human   BoB     LLM     Human  BoB    LLM      Human  BoB    LLM
Games              99.55   100.00  96.85   99.93  99.90  99.96    89.45  67.30  63.15
Companies          97.73   94.37   100.00  99.34  99.24  99.69    87.65  89.29  90.69
Music              96.55   96.34   99.85   99.67  99.67  99.88    95.35  94.84  98.11
Products           100.00  100.00  98.00   71.91  71.91  97.54    70.37  84.09  63.09
Scientific Papers  97.90   97.90   99.69   99.98  99.98  99.97    99.78  99.70  96.05

Pair completeness and reduction ratio measure blocking quality on the labeled entity-matching test set. Entity matching F1 is on the same test set.

Data fusion

Fusion is the hardest step for every pipeline. The LLM pipeline produces the most accurate fused values on four of the five tasks; best-of-breed leads on Products. Values are accuracy (%). The best per task is in green.

Task Human BoB LLM
Games 71.70 64.91 84.87
Companies 44.70 45.90 63.24
Music 70.20 76.92 83.12
Products 40.20 56.68 43.65
Scientific Papers 78.90 61.04 79.35

Efficiency

All three pipelines are preconfigured, so the times below cover only the final integration run, not any configuration search or design effort. The main difference comes from entity matching: best-of-breed mostly selects Ditto, a pre-trained language model whose per-task fine-tuning and inference dominate its runtime, while the human and LLM pipelines use lighter matchers. On Companies, where best-of-breed instead picks the cheaper Magellan, its runtime (213 s) is close to that of the LLM pipeline (203 s), though still about three times that of the human pipeline (70 s). Values are end-to-end wall-clock seconds.

Task Human BoB LLM
Games 180 5,736 196
Companies 70 213 203
Music 184 5,980 211
Products 98 1,388 718
Scientific Papers 637 16,184 1,374
Mean 234 5,900 540

End to end

MaDI-Bench also scores the final fused table as a whole, along three dimensions: coverage (are the right entities and values present), consistency (does the output fit the target schema), and correctness (are the merged values right). Each is reported at three reference levels: reference-free, against the human pipeline (silver), and against the annotated test sets (ground truth). The table gives the mean over the five tasks under the ground-truth reference, with the better pipeline per metric in green.

Metric (ground-truth reference, mean over tasks)                      Best-of-breed  LLM-based
Entity recovery (coverage, higher is better)                          0.90           0.81
Value drift (coverage, lower is better)                               0.48           0.56
BCubed F1 (clustering correctness, higher is better)                  0.97           0.94
Fusion accuracy (value correctness, higher is better)                 0.61           0.71
Fully-correct rate (every attribute right at once, higher is better)  0.05           0.14

Best-of-breed wins on coverage and clustering. The LLM pipeline wins on value correctness. The fully-correct rate, which needs every attribute of an entity right at once, stays low for both, so the integrated task is far from solved. Schema validity (consistency) is close, near 0.95 for both pipelines. Full per-task numbers at all three reference levels are in the results directory.
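
For readers unfamiliar with BCubed, the following sketch computes it in its standard form: per-record precision and recall from the overlap of the predicted and gold clusters, averaged over records, with F1 as the harmonic mean of the two averages. How MaDI-Bench aggregates the measure is not stated here, so the aggregation choice is an assumption, and the record ids and cluster labels are hypothetical.

```python
def bcubed(predicted, gold):
    """BCubed precision, recall, and F1.

    `predicted` and `gold` map every record id to a cluster label and must
    cover the same record ids.
    """
    def members(assignment):
        groups = {}
        for rid, label in assignment.items():
            groups.setdefault(label, set()).add(rid)
        return {rid: groups[label] for rid, label in assignment.items()}

    pred_sets, gold_sets = members(predicted), members(gold)
    n = len(gold)
    precision = sum(len(pred_sets[r] & gold_sets[r]) / len(pred_sets[r]) for r in gold) / n
    recall = sum(len(pred_sets[r] & gold_sets[r]) / len(gold_sets[r]) for r in gold) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: three records of one company plus one unrelated record.
gold = {"forbes_a": "vw", "dbpedia_a": "vw", "fullcontact_a": "vw", "fullcontact_b": "cst"}
# The system splits the company and merges one of its records with the stranger.
predicted = {"forbes_a": 1, "dbpedia_a": 1, "fullcontact_a": 2, "fullcontact_b": 2}

precision, recall, f1 = bcubed(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the measure is averaged per record, a single wrong merge hurts precision for every record in the merged cluster, and a split hurts recall for every record of the split entity.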

Variant Generation

To keep the benchmark hard as systems improve, each base task also ships in easy, medium, and hard versions. The records and the correct answers stay the same. Eight difficulty knobs perturb the data, each set to its own target for easy, medium, and hard. A harder level turns up the kinds of heterogeneity that real integration runs into.

A record changed by the knobs

Here is the DBpedia record for the Taisei Corporation, base against hard. Schema naming divergence (Knob 8) first renames every column to a cryptic code: org_name→nm, established→ey, nation→cn, headquarters→hq, sector→sg, keypeople_name→kpn, total_assets_val→ta, annual_income→ai. On top of that, the values change:

Field       Base value            Hard value            Changed by
name        Taisei Corporation    Taisei Corporation    unchanged
founded     1873-01-01            01.01.1873            Knob 5 · format
country     Japan                 Nippon                Knob 1 · surface
city        TokyoShinjuku, Tokyo  TokyoShinjuku, Tokyo  unchanged
industry    Construction          Constructi on         Knob 6 · value noise
key people  Okura Kihachiro       (dropped)             Knob 3 · attribute drop
assets      3,650,187,000         3650.19               Knob 5 · unit scale
revenue     213,195,000           167.18                Knob 5 · unit + currency

Seven changes from one base record: a renamed schema, a reformatted date, a reworded country, a corrupted industry string, a dropped person, and two rescaled financial figures. The easy level moves the same dials the other way, with descriptive column names, gentle date and number formats, and fewer dropped cells.

Difficulty Knobs

Each knob targets between one and three pipeline stages and is set to an easy, medium, or hard target. MaDI-Bench uses eight:

Knob 1 Matching · Fusion

Surface augmentation

Rewrites attribute values into other surface forms using abbreviations, token reorderings, and dropped tokens. This erodes the surface overlap that blocking and matching depend on, and turns clean values into plausible-looking disagreements for fusion.

example: San Francisco → Francisco San
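The three rewrites named above can be sketched in a few lines. This is a minimal illustration, not the benchmark's generator: the abbreviation table, the function names, and the token-overlap measure are assumptions made for the example.

```python
import random

# Illustrative abbreviation table; the benchmark's own list is not shown here.
ABBREVIATIONS = {"Corporation": "Corp.", "Incorporated": "Inc."}


def reorder_tokens(value):
    """Reverse the token order, e.g. 'San Francisco' -> 'Francisco San'."""
    return " ".join(reversed(value.split()))


def abbreviate(value):
    """Replace known tokens with their abbreviation."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())


def drop_token(value, rng):
    """Remove one randomly chosen token, keeping at least one."""
    tokens = value.split()
    if len(tokens) < 2:
        return value
    del tokens[rng.randrange(len(tokens))]
    return " ".join(tokens)


def token_jaccard(a, b):
    """Token-set overlap, a simple stand-in for a matcher's similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


rng = random.Random(0)
print(reorder_tokens("San Francisco"))
print(abbreviate("Taisei Corporation"))
print(drop_token("Taisei Corporation", rng))
print(token_jaccard("Taisei Corporation", abbreviate("Taisei Corporation")))
```

Reordering leaves token overlap intact but breaks exact equality, while abbreviation and token drops lower the overlap itself, so the different rewrites stress different similarity measures.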

Knob 2 Blocking · Matching

Entity niche density

Adds similar-but-distinct entities within the same domain niche, such as many products from one brand. Blocks fill up with non-matching pairs, and more candidate pairs sit close to the match / non-match boundary.

example: generates a near-duplicate company and inserts it into a crowded niche

Knob 3 Blocking · Matching · Fusion

Attribute drop

Blanks a share of cells per source. Missing blocking keys make records unreachable, missing attributes weaken the evidence for matching, and sparse coverage leaves fusion less to cross-check.

example: revenue 358,500,000 → (dropped)
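Blanking a share of cells per source can be sketched as follows. The function name, the `protected` parameter, and the seeding scheme are assumptions for illustration; the benchmark's actual drop procedure is specified in the repository.

```python
import random


def drop_attributes(rows, drop_share, protected=("id",), seed=0):
    """Blank each non-protected cell independently with probability drop_share."""
    rng = random.Random(seed)  # seeded so a variant can be regenerated
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for column in row:
            if column not in protected and rng.random() < drop_share:
                new_row[column] = None
        result.append(new_row)
    return result


rows = [
    {"id": 1, "name": "Taisei Corporation", "key_people": "Okura Kihachiro"},
    {"id": 2, "name": "Accor", "key_people": None},
]
for row in drop_attributes(rows, drop_share=0.5, seed=42):
    print(row)
```

Only the identifier is protected in this sketch, so columns that serve as blocking keys can be blanked too, which is the effect the knob description relies on.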

Knob 4 Fusion

Coverage skew

Varies how many sources describe each entity. This changes cluster sizes and thins the evidence available to resolve conflicts for entities that only one or two sources cover.

example: for Inter RAO, the DBpedia row is removed, leaving fewer sources

Knob 5 Normalization

Format & unit diversity

Rewrites dates, numbers, and quantities into more formats, units, and scales, for example ISO against dotted dates, or raw figures against millions. More admissible forms per field make value standardization harder.

example: 65,170,000,000 → 65170 (millions)
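From the integration side, this knob means a normalizer has to accept several forms per field. The sketch below handles only the two date formats and the scale change mentioned above. The format list, the scale table, and the function names are assumptions; currency conversion is left out.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")  # ISO and dotted dates
SCALES = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def normalize_date(text):
    """Return an ISO date string for any of the known input formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def normalize_amount(text, scale="units"):
    """Strip thousands separators and rescale to raw units."""
    number = float(text.replace(",", ""))
    return round(number * SCALES[scale])


print(normalize_date("01.01.1873"))
print(normalize_amount("65170", scale="millions"))
print(normalize_amount("3650.19", scale="millions"))
```

The last call shows that rescaling is lossy: 3650.19 million maps back to 3,650,190,000, not to the base value 3,650,187,000, so an evaluator comparing fused values against gold needs some tolerance or a canonical precision.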

Knob 6 Normalization · Matching · Fusion

Value noise

Corrupts a share of values with typos, encoding artefacts, and truncation. The damaged values lower string-similarity scores during matching and show up as spurious disagreements during fusion.

example: 2005-01-01 → 2o0t-01-o1
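The effect on string similarity can be checked directly. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the matchers used in the benchmark pipelines may score differently.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Clean and noisy pairs taken from the examples in this section.
pairs = [
    ("2005-01-01", "2o0t-01-o1"),
    ("Construction", "Constructi on"),
]
for clean, noisy in pairs:
    print(f"{clean!r} vs {noisy!r}: {similarity(clean, noisy):.2f}")
```

A matcher with a fixed similarity threshold tuned on clean data can miss such pairs, and a fusion step that compares values by exact equality will count them as conflicts.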

Knob 8 Schema matching

Schema naming divergence

Renames source columns along a scale from fully descriptive, through abbreviated, to opaque codes. At the hard end the column name carries no signal, so a matcher has to rely on the values alone.

example: total_assets_val → ta
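The following sketch applies the hard-level renaming listed for the Taisei record and then shows one way a matcher could work from values alone. The value-overlap matcher and the toy columns are assumptions for illustration, not the method of any validated pipeline.

```python
# Hard-level column codes as listed for the Taisei Corporation record.
HARD_NAMES = {
    "org_name": "nm", "established": "ey", "nation": "cn", "headquarters": "hq",
    "sector": "sg", "keypeople_name": "kpn", "total_assets_val": "ta",
    "annual_income": "ai",
}


def rename_columns(row, mapping):
    """Rename the keys of one record; unknown columns keep their name."""
    return {mapping.get(col, col): val for col, val in row.items()}


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_by_values(source_cols, target_cols):
    """Map each source column to the target column with the largest value overlap."""
    return {
        s_name: max(target_cols, key=lambda t: jaccard(s_vals, target_cols[t]))
        for s_name, s_vals in source_cols.items()
    }


# Toy columns, invented for illustration.
source_cols = {"cn": ["Japan", "France", "Russia"], "sg": ["Construction", "Hospitality"]}
target_cols = {
    "country": ["Japan", "Russia", "Germany"],
    "industry": ["Construction", "Energy", "Hospitality"],
}
print(rename_columns({"org_name": "Taisei Corporation", "nation": "Japan"}, HARD_NAMES))
print(match_by_values(source_cols, target_cols))
```

Value overlap only works while the values themselves stay comparable, so this knob interacts with the surface, format, and noise knobs that rewrite them.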

Knob 10 Fusion

Source reliability

Reshuffles which source holds the correct value per attribute, so the most reliable source is right for only a share of entities. No fixed source-trust rule stays optimal.

example: for Accor, the verified name is reassigned to a different source
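Why a fixed trust rule breaks can be seen on a toy case. The sketch compares a rule that always trusts one source against a simple majority vote. The source names and the wrong values are invented placeholders, and neither resolver is claimed to be what the validated pipelines use.

```python
from collections import Counter


def trust_source(cluster, trusted):
    """Fixed rule: always take the value reported by one trusted source."""
    return cluster.get(trusted)


def majority_vote(cluster):
    """Take the most frequent value across sources (ties: first seen wins)."""
    return Counter(cluster.values()).most_common(1)[0][0]


def accuracy(predictions, gold_values):
    return sum(p == g for p, g in zip(predictions, gold_values)) / len(gold_values)


# Toy clusters: (values per source, gold value). Wrong values are placeholders.
clusters = [
    ({"source_a": "Accor", "source_b": "Accor", "source_c": "Accor (wrong)"}, "Accor"),
    ({"source_a": "Inter RAO (wrong)", "source_b": "Inter RAO", "source_c": "Inter RAO"}, "Inter RAO"),
    ({"source_a": "Taisei (wrong)", "source_b": "Taisei Corporation", "source_c": "Taisei Corporation"}, "Taisei Corporation"),
]
gold = [g for _, g in clusters]
fixed = [trust_source(c, "source_a") for c, _ in clusters]
voted = [majority_vote(c) for c, _ in clusters]

print(accuracy(fixed, gold))  # source_a is right for 1 of 3 entities
print(accuracy(voted, gold))  # the vote recovers all 3 in this toy case
```

Voting wins here only because the toy data was built that way. When coverage skew leaves one or two sources per entity, a vote has little to work with, which is why the two fusion knobs are varied together.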

Two further knobs (value ambiguity and schema distractors) are specified for synthetic tasks. Full specifications for every knob are in the repository.

Difficulty across the variants

The variants produce the intended difficulty gradient, clearest for data fusion: at the hard level fusion accuracy falls below the base task in every domain, by up to about 23 points for best-of-breed and 28 points for the LLM pipeline (both on Scientific Papers). Schema matching stays robust across the levels, and entity matching stays resilient once the matcher is re-tuned per variant. The hard variants raise the difficulty in every domain, while the easy variants keep simpler methods competitive. Values are percentages: F1 for schema matching and entity matching, and accuracy for fusion.

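| Domain | Level | Best-of-breed (P2) SM | Best-of-breed (P2) EM | Best-of-breed (P2) Fusion | LLM-based (P3) SM | LLM-based (P3) EM | LLM-based (P3) Fusion |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Companies | Base | 100.0 | 89.3 | 45.9 | 97.6 | 90.7 | 63.2 |
| Companies | Easy | 100.0 | 93.6 | 40.3 | 100.0 | 88.2 | 42.3 |
| Companies | Medium | 100.0 | 88.9 | 39.6 | 100.0 | 77.0 | 56.8 |
| Companies | Hard | 91.3 | 94.5 | 33.3 | 100.0 | 87.4 | 44.4 |
| Games | Base | 100.0 | 67.3 | 64.9 | 100.0 | 63.2 | 84.9 |
| Games | Easy | 98.1 | 66.9 | 62.1 | 97.9 | 76.0 | 81.4 |
| Games | Medium | 98.1 | 69.7 | 63.6 | 97.9 | 72.7 | 78.5 |
| Games | Hard | 98.1 | 79.2 | 52.5 | 97.9 | 82.0 | 72.6 |
| Music | Base | 100.0 | 94.8 | 76.9 | 100.0 | 98.1 | 83.1 |
| Music | Easy | 100.0 | 94.9 | 85.4 | 100.0 | 97.7 | 77.4 |
| Music | Medium | 100.0 | 93.2 | 65.1 | 100.0 | 90.9 | 71.0 |
| Music | Hard | 92.0 | 94.3 | 54.5 | 90.9 | 88.0 | 58.6 |
| Products | Base | 100.0 | 84.1 | 56.7 | 100.0 | 63.1 | 43.6 |
| Products | Easy | 99.5 | 91.5 | 51.6 | 100.0 | 27.5 | 49.3 |
| Products | Medium | 98.1 | 88.1 | 59.3 | 100.0 | 36.9 | 45.9 |
| Products | Hard | 98.1 | 85.3 | 49.5 | 98.0 | 51.0 | 34.1 |
| Scientific Papers | Base | 97.1 | 99.7 | 61.0 | 100.0 | 96.0 | 79.3 |
| Scientific Papers | Easy | 92.9 | 99.5 | 58.1 | 91.7 | 90.9 | 64.6 |
| Scientific Papers | Medium | 90.5 | 99.2 | 47.7 | 91.7 | 87.6 | 62.3 |
| Scientific Papers | Hard | 92.9 | 96.7 | 37.8 | 91.7 | 78.1 | 51.1 |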

SM is schema matching F1, EM is entity matching F1, Fusion is data fusion accuracy. The human pipeline is not run on the variants. These are the per-stage scores from the paper (Table XII).
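The base-to-hard drop in fusion accuracy can be recomputed from the table. The snippet copies the Base and Hard fusion columns and takes the difference per domain; the dictionary layout is only a convenience for this check.

```python
# Fusion accuracy (percent) copied from the table: (base, hard) per pipeline.
fusion = {
    "Companies":         {"P2": (45.9, 33.3), "P3": (63.2, 44.4)},
    "Games":             {"P2": (64.9, 52.5), "P3": (84.9, 72.6)},
    "Music":             {"P2": (76.9, 54.5), "P3": (83.1, 58.6)},
    "Products":          {"P2": (56.7, 49.5), "P3": (43.6, 34.1)},
    "Scientific Papers": {"P2": (61.0, 37.8), "P3": (79.3, 51.1)},
}

drops = {
    pipeline: {
        domain: round(scores[pipeline][0] - scores[pipeline][1], 1)
        for domain, scores in fusion.items()
    }
    for pipeline in ("P2", "P3")
}

for pipeline, per_domain in drops.items():
    worst = max(per_domain, key=per_domain.get)
    print(pipeline, per_domain, "largest drop:", worst)
```

Every difference is positive, and the largest drop for both pipelines is on Scientific Papers (23.2 points for P2, 28.2 for P3).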

Conclusion

This paper introduced the MaDI-Bench integration benchmark. The benchmark models the full complexity of the data integration process, including schema matching, value normalization, entity matching, and conflict resolution, while accounting for the dependencies between the subtasks. To prevent quick saturation of the benchmark as agentic systems progress, we introduced a generic variant-generation method for deriving harder variants from the base tasks. We validated MaDI-Bench and the generated variants using human-designed pipelines, an LLM-based pipeline, and a best-of-breed pipeline. The validation showed the benchmark's utility for measuring the step-wise as well as the end-to-end performance of data integration systems. We hope that the benchmark will prove useful for the community and will support the development of fully automatic as well as human-in-the-loop data integration systems.

Downloads

Tasks are released as plain files (CSV, JSON, and XML), so any system can read them. PyDI adds the integration steps and an evaluator for every stage, so a run can be scored without labeling any data.

  • Benchmark repository: all 20 tasks, ground truth, and the validation runs of the three pipelines.
  • PyDI: the integration framework and the per-step evaluator classes.
  • The tasks: inputs, target schemas, labeled splits, and fused gold records.

BibTeX

@misc{steiner2026madibench,
  title         = {MaDI-Bench: An End-to-End Data Integration Benchmark},
  author        = {Steiner, Aaron and Peeters, Ralph and Bizer, Christian},
  year          = {2026},
  eprint        = {2606.30371},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DB},
  url           = {https://arxiv.org/abs/2606.30371}
}