Data integration combines heterogeneous data sets into a single, coherent representation. It involves a sequence of interdependent tasks: schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only incomplete versions of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on data integration methods that address the integration process as a whole.
The Mannheim Data Integration Benchmark (MaDI-Bench) fills this gap. MaDI-Bench is the first benchmark for the end-to-end integration of relational tables covering all steps of the integration process. MaDI-Bench contributes (i) a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance. We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artefacts are available for public download.
MaDI-Bench consists of five end-to-end data integration tasks: Games, Companies, Music, Products, and Scientific Papers. Each task takes several heterogeneous source tables in a domain and asks a system to return one fused target table, exercising schema matching, value normalization, entity matching, and data fusion together. The benchmark provides ground truth to score every step on its own, not only the final table: a gold schema mapping for schema matching, the target schema's constraints and taxonomies for value normalization (measured as consistency with the schema), labeled record pairs for blocking and entity matching, and hand-verified records for data fusion, together with metrics for the end-to-end result. Each base task also ships in easy, medium, and hard variants, for 20 integration tasks in total.
Schema matching → Value normalization → Entity blocking & matching → Data fusion
Games. This task requires the integration of three video-game datasets: Metacritic (review scores and ESRB ratings), a sales dataset (commercial performance), and DBpedia (release dates, developers, platforms, genres, and series). The target schema has ten attributes.
| wiki_ref | title | launch_yr | studio | system | genre | franchise |
|---|---|---|---|---|---|---|
| dbpedia_11 | Mario Kart Arcade GP | 2005-01-01 | Namco | Arcade game | Racing video game | Mario Kart |
| dbpedia_101 | Dr. Mario | 1990-01-01 | Nintendo Research & Development 1 | Game Boy Advance | Puzzle video game | (empty) |
| dbpedia_177 | Sonic Jump | 2012-01-01 | Hardlight | Android (operating system) | Platform game | Sonic the Hedgehog (series) |
| mc_id | game_title | year_published | made_by | console | genres | press_rating | player_rating | age_rating |
|---|---|---|---|---|---|---|---|---|
| metacritic_5 | Super Mario Galaxy | 2007-01-01 | Nintendo | Wii | Action,Platformer,Platformer,3D,3D | 97.0 | 9.1 | E |
| metacritic_9 | Super Mario Galaxy 2 | 2010-01-01 | Nintendo EAD Tokyo | Wii | Action,Platformer,Platformer,3D,3D | 97.0 | 9.1 | E |
| metacritic_10 | The Legend of Zelda: Ocarina of Time | 1998-01-01 | Nintendo | Nintendo 64 | Action Adventure,Fantasy | 99.0 | 9.0 | E |
| rec_id | prod_title | launch_dt | studio | dist | hw | genre | press_score | comm_rating | age_classification | units_sold_mm |
|---|---|---|---|---|---|---|---|---|---|---|
| sales_2 | Mario Kart Wii | 2008-01-01 | Nintendo | Nintendo | Wii | Racing | 82 | 8.3 | E | 35 |
| sales_4 | New Super Mario Bros. | 2006-01-01 | Nintendo | Nintendo | DS | Platform | 89 | 8.5 | E | 29 |
| sales_7 | Mario Kart DS | 2005-01-01 | Nintendo | Nintendo | DS | Racing | 91 | 8.6 | E | 23 |
Companies. This task requires the integration of the Forbes Global 2000 list, a DBpedia company extract, and a FullContact company-profile sample. The target schema has eight attributes.
| forbes_url | company | url | region | business_segment | asset_value | sales_figure |
|---|---|---|---|---|---|---|
| http://www.forbes.com/companies/icbc/ | ICBC | http://www.forbes.com/companies/icbc/ | China | Major Banks | 3124900000000 | 148700000000 |
| http://www.forbes.com/companies/china-… | China Construction Bank | http://www.forbes.com/companies/china-… | China | Regional Banks | 2449500000000 | 121300000000 |
| http://www.forbes.com/companies/agricu… | Agricultural Bank of China | http://www.forbes.com/companies/agricu… | China | Regional Banks | 2405400000000 | 136400000000 |
| entity_uri | org_name | established | nation | headquarters | sector | keypeople_name | total_assets_val | annual_income |
|---|---|---|---|---|---|---|---|---|
| http://dbpedia.org/resource/%C3%80_la_… | À la Table de Spanghero | 1970-01-01 | France | Castelnaudary | Meat | (empty) | (empty) | (empty) |
| http://dbpedia.org/resource/%C3%87al%C… | Çalık Enerji | 1998-01-01 | Turkey | Istanbul | (empty) | Çalık Holding | (empty) | (empty) |
| http://dbpedia.org/resource/%C3%87al%C… | Çalık Holding | 1997-01-01 | Turkey | Istanbul | (empty) | Ahmet Çalık | 8 | 2.8 |
| Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_5 | Attribute_6 |
|---|---|---|---|---|---|
| fullcontact_1 | BBMG | United States | Brooklyn | Raphael Bemporad | (empty) |
| fullcontact_2 | CIT Group Inc (DEL) | Canada | Toronto | (empty) | 1908-01-01 |
| fullcontact_3 | City & National Employment | United States | Waterloo | (empty) | 1957-01-01 |
Music. Integrates release-level records from Discogs, Last.fm, and MusicBrainz that describe albums, EPs, and singles with partial overlap. The target schema has eight attributes.
| rec_uid | title_str | performer | pub_dt | origin_loc | duration | imprint | category | tracks_track-name |
|---|---|---|---|---|---|---|---|---|
| discogs_3 | Fermats Theorem / Sight Beyond | John B | 1996-01-01 | UK | 0 | New Identity Recordings | Electronic | ['Fermats Theorem', 'Sight Beyond'] |
| discogs_4 | Tempest / Inner Sense | Psychosis | 1998-01-01 | UK | 0 | Renegade Hardware | Electronic | ['Tempest', 'Inner Sense'] |
| discogs_5 | The Sign's Alive | Lypid | 2000-09-05 | United States of America | 0 | Statra Recordings | Electronic | ["The Sign's Alive (Original Mix)", "T… |
| item_code | album_title | band | album_length | tracks_track-name |
|---|---|---|---|---|
| lastFM_1 | John B - Fermats Theorem / Sight Beyo… | John B | 903 | ['Fermats Theorem', 'Sight Beyond'] |
| lastFM_2 | Tempest / Inner Sense | Psychosis | 734 | ['Tempest', 'Inner Sense'] |
| lastFM_4 | Petalpusher - Surrender | Petalpusher | 1626 | ['Surrender (Petalpusher Original)', "… |
| Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Attribute_5 | Attribute_6 | Attribute_9 |
|---|---|---|---|---|---|---|
| mbrainz_1 | Fermats Theorem / Sight Beyond | John B | 1996-01-01 | United Kingdom of Great Britain and No… | 1055 | ['Fermats Theorem', 'Sight Beyond'] |
| mbrainz_2 | Tempest / Inner Sense | Psychosis | 1998-12-14 | United Kingdom of Great Britain and No… | 724 | ['Tempest', 'Inner Sense'] |
| mbrainz_3 | The Sign's Alive | Lypid | 2000-09-05 | United States of America | 2384 | ["The Sign's Alive (original mix)", "T… |
Products. Based on a sample of the WDC Products benchmark, covering GPUs, SSDs, HDDs, and USB sticks. The target schema has 25 attributes, the largest in the benchmark.
| id | manufacturer | product_name | product_description | list_price | currency_code | cluster_id | product_url | name_and_description | model_name | manufacturer_part_number | category | gpu_chipset | video_memory_gb | capacity_gb | sequential_read_mb_s | sequential_write_mb_s | bus_standard | interface | width_millimeters | length_millimeters | height_millimeters | weight_grams | connector | memory_technology | colour | form_factor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12198483 | Gigabyte | Gigabyte NVIDIA GeForce RTX 3080 Gamin… | CUDA Cores: 8704, Boost Clock: 1800MHz… | 799.99 | GBP | 1002037 | https://www.novatech.co.uk/products/gi… | Gigabyte NVIDIA GeForce RTX 3080 Gamin… | NVIDIA GeForce RTX 3080 Gaming OC | (empty) | GPU | GeForce RTX 3080 | 10 | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | GDDR6X | (empty) | (empty) |
| 78378158 | Western Digital | WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv… | 6TB WD Blue WD60EZAZ, 3.5\" HDD, SATA… | 129.98 | GBP | 1004942 | https://www.scan.co.uk/products/6tb-wd… | WD Blue 6TB 3.5\" SATA 3 HDD/Hard Driv… | WD Blue | WD60EZAZ | HDD | (empty) | (empty) | 6000 | (empty) | (empty) | SATA | SATA III | (empty) | (empty) | (empty) | (empty) | 3.5-inch SATA | (empty) | (empty) | 3.5-inch |
| 80641070 | Corsair | Corsair Force MP510 M.2 SSD - 960GB | Solid State Drive, 960 GB, intern, M.2… | 2124.0 | NOK | 1007272 | https://www.proshop.no/SSD/Corsair-For… | Corsair Force MP510 M.2 SSD - 960GB. D… | Force MP510 | (empty) | SSD | (empty) | (empty) | 960 | (empty) | (empty) | PCI Express x4 | NVMe | (empty) | (empty) | (empty) | (empty) | M.2 | (empty) | (empty) | M.2 2280 |
| id | brandName | name | descriptionText | priceAmount | currency | cluster_id | productUrl | titleAndDescription | modelName | mpn | productCategory | chipset | vramGb | capacityGb | readSpeedMbps | writeSpeedMbps | busType | interfaceType | widthMm | depthMm | heightMm | weightG | connectionType | memoryType | color | formFactor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19126355 | Gigabyte | Gigabyte NVIDIA GeForce RTX 3080 10GB… | Gigabyte NVIDIA GeForce RTX 3080 GAMIN… | 99999.99 | GBP | 1002037 | https://www.scan.co.uk/products/gigaby… | Gigabyte NVIDIA GeForce RTX 3080 10GB… | GeForce RTX 3080 10GB GAMING OC | (empty) | GPU | GeForce RTX 3080 | 10 | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | GDDR6X | (empty) | (empty) |
| 42841911 | Western Digital | WD Blue 6TB Desktop Hard Disk Drive -… | WD Blue 6TB Desktop Hard Disk Drive -… | 128.99 | USD | 1004942 | https://www.newegg.com/blue-wd60ezaz-6… | WD Blue 6TB Desktop Hard Disk Drive -… | WD Blue | WD60EZAZ | HDD | (empty) | (empty) | 6000 | (empty) | (empty) | SATA | SATA III | (empty) | (empty) | (empty) | (empty) | 3.5 Inch | (empty) | (empty) | 3.5-inch |
| 46775597 | Corsair | CORSAIR - Force Series MP510 960GB M.2… | CORSAIR Force Series MP510 960GB M.2 S… | (empty) | (empty) | 1007272 | https://www.dindator.se/corsair-force-… | CORSAIR - Force Series MP510 960GB M.2… | Force Series MP510 | CSSD-F960GBMP510 | SSD | (empty) | (empty) | 960 | 3480 | 3000 | PCI Express x4 | NVMe | (empty) | (empty) | (empty) | (empty) | M.2 | (empty) | (empty) | M.2 2280 |
| id | Brand | ProductTitle | Details | Price | Currency | cluster_id | Link | TitleDetails | Model | PartNo | Type | Chipset | MemorySizeGB | CapacityGB | ReadMBs | WriteMBs | Bus | Interface | WidthMM | LengthMM | HeightMM | WeightG | Connector | MemoryType | Colour | FormFactor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46320085 | Gigabyte | GIGABYTE GeForce RTX 3080 GAMING OC 10… | To Avail the offer Click Here | 73160.0 | INR | 1002037 | https://www.pcstudio.in/product/gigaby… | GIGABYTE GeForce RTX 3080 GAMING OC 10… | GeForce RTX 3080 GAMING OC 10G | GV-N3080GAMING OC-10GD | GPU | GeForce RTX 3080 | 10 | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) |
| 91583813 | Western Digital | WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6… | Western Digital WD Blue 3.5-inch PC ha… | 278.0 | AUD | 1004942 | https://netplus.com.au/products/wd-blu… | WD Blue 6TB SATA3 256MB 3.5' 5400RPM 6… | WD Blue | (empty) | HDD | (empty) | (empty) | 6000 | (empty) | (empty) | SATA | SATA III | (empty) | (empty) | (empty) | 640.0 | 3.5-inch SATA | (empty) | (empty) | 3.5-inch |
| 86850217 | Corsair | SSD 960GB Corsair Force MP510 NVMe | SSD 960GB Corsair Force MP510 NVMe | 2316.0 | SEK | 1007272 | https://www.nordwaystore.se/ssd-960gb-… | SSD 960GB Corsair Force MP510 NVMe. De… | Force MP510 | (empty) | SSD | (empty) | (empty) | 960 | (empty) | (empty) | (empty) | NVMe | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) |
| id | mfr | name | desc | amt | cur | cluster_id | link | name_desc | mdl | pn | cat | chip | vram | cap_gb | rd_mbs | wr_mbs | bus | iface | w_mm | l_mm | h_mm | wt_g | conn | mem | clr | ff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66956099 | Gigabyte | Gigabyte Video Card GV-N3080GAMING OC-… | Gigabyte Video Card GV-N3080GAMING OC-… | 791.99 | USD | 1002037 | https://www.thekeykey.com/detail/GV-N3… | Gigabyte Video Card GV-N3080GAMING OC-… | GeForce RTX 3080 GAMING OC | GV-N3080GAMING OC-10GD | GPU | GeForce RTX 3080 | 10 | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | GDDR6X | (empty) | (empty) |
| 5078198 | Western Digital | Western Digital HDD Blue 6TB 3,5\" 256… | Dysk BLUE 6TB 3,5 256MB SATAIII/5400rp… | (empty) | (empty) | 1004942 | https://www.alsen.pl/podzespoly-komput… | Western Digital HDD Blue 6TB 3,5\" 256… | Blue | (empty) | HDD | (empty) | (empty) | 6000 | (empty) | (empty) | SATA | SATA III | (empty) | (empty) | (empty) | (empty) | 3.5-inch SATA | (empty) | (empty) | 3.5-inch |
| 77571226 | Corsair | Corsair MP510 M.2-2280 960GB | Huge 960GB storage capacity and super… | (empty) | (empty) | 1007272 | https://www.cclonline.com/product/2874… | Corsair MP510 M.2-2280 960GB. Descript… | MP510 | (empty) | SSD | (empty) | (empty) | 960 | 3480 | 3000 | (empty) | (empty) | (empty) | (empty) | (empty) | (empty) | M.2 | (empty) | (empty) | M.2 2280 |
Scientific Papers. Integrates computer-science paper records from DBLP, Crossref, and OpenAlex. The target schema has 11 attributes.
A source_ids helper list points back to the contributing source records.
| id | work_type | title_text | contributor_names | issued_year | container_title | publisher_name | abstract_text | volume_id | issue_id | page_first | page_last | reference_total | cited_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| crossref-00000 | inproceedings | A Simulation-driven Approach in Risk-… | ['Ilaria Angela Amantea', ' Antonio D… | 2018 | (empty) | SCITEPRESS - Science and Technology P… | (empty) | (empty) | (empty) | 98 | 105 | 0 | 18 |
| crossref-00001 | article | Group Recommendation Systems Based on… | ['Guang Fang', ' Lei Su', ' Di Jiang'… | 2018 | Wireless Communications and Mobile Co… | Wiley | With the development of social networ… | 2018 | 1 | None | None | 35 | 14 |
| crossref-00002 | article | Relation of Country‐of‐Origin Effect,… | ['Juan Manuel Berbel-Pineda', ' Beatr… | 2018 | Complexity | Wiley | To obtain information about foreign m… | 2018 | 1 | None | None | 60 | 10 |
| id | entry_type | publication_title | author_list | pub_year | venue_name | volume_no | issue_no | page_start | page_finish |
|---|---|---|---|---|---|---|---|---|---|
| dblp-00000 | inproceedings | Thread Weaving: Static Resource Sched… | ['Hsuan Hsiao', 'Jason Helge Anderson… | 2019 | (empty) | (empty) | (empty) | (empty) | (empty) |
| dblp-00001 | inproceedings | LEAD: learning-enabled energy-aware d… | ['Mark Clark', 'Avinash Kodi', 'Razva… | 2018 | (empty) | (empty) | (empty) | 1 | 6 |
| dblp-00002 | inproceedings | PlanarONoC: concurrent placement and … | ['Yu-Kai Chuang', 'Kuan-Jung Chen', '… | 2018 | (empty) | (empty) | (empty) | 1 | 6 |
| id | work_kind | display_title | authors_list | year_published | source_name | topic_terms | volume_tag | issue_tag | start_page | end_page | refs_count | citations_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| open_alex-00000 | article | STRING v11: protein–protein associati… | ['Damian Szklarczyk', 'Annika L Gable… | 2018 | Nucleic Acids Research | KEGG, Interaction network | 47 | D1 | D607 | D613 | 64 | 16038 |
| open_alex-00001 | article | Minimap2: pairwise alignment for nucl… | ['Heng Li'] | 2018 | Bioinformatics | Multiple sequence alignment | 34 | 18 | 3094 | 3100 | 42 | 12292 |
| open_alex-00002 | article | SWISS-MODEL: homology modelling of pr… | ['Andrew Waterhouse', 'Martino Berton… | 2018 | Nucleic Acids Research | Homology | 46 | W1 | W296 | W303 | 73 | 11711 |
Dataset statistics
| Task | Sources | # Sources | Records | Target attributes |
|---|---|---|---|---|
| Games | DBpedia, Metacritic, Sales | 3 | 74,951 | 10 |
| Companies | Forbes, DBpedia, FullContact | 3 | 14,016 | 8 |
| Music | Discogs, Last.fm, MusicBrainz | 3 | 37,255 | 8 |
| Products | Four WDC Products feeds | 4 | 3,012 | 25 |
| Scientific Papers | DBLP, Crossref, OpenAlex | 3 | 182,059 | 11 |
Record counts are for the base task. The per-task pages in the repository list the full sizes for every source and difficulty level.
MaDI-Bench provides the following artefacts for each of the base tasks:
The sources describe the same kind of entity but rarely agree on column names, value formats, or even the facts themselves, which is what makes the merge hard.
| Source | Attribute | Value |
|---|---|---|
| Forbes | company | Volkswagen Group |
| Forbes | region | (empty) |
| Forbes | business_segment | Auto & Truck Manufacturers |
| Forbes | asset_value | 446900000000 |
| Forbes | sales_figure | 261500000000 |
| DBpedia | org_name | Volkswagen |
| DBpedia | established | 1937-01-01 |
| DBpedia | nation | Germany |
| DBpedia | headquarters | Wolfsburg |
| DBpedia | sector | Automotive industry |
| DBpedia | total_assets_val | 323400000000 |
| FullContact | Attribute_2 | Volkswagen Group |
| FullContact | Attribute_3 | Germany |
| FullContact | Attribute_4 | Wolfsburg |
| FullContact | Attribute_5 | (empty) |
| FullContact | Attribute_6 | 1937-01-01 |
The same company in three sources. Forbes leaves the country blank, FullContact hides fields behind names like Attribute_2, the asset figures disagree, and DBpedia gives a shorter name. We follow this company through each step below.
A system under test (SUT) needs to handle the following four subtasks of the overall integration task. MaDI-Bench provides ground truth for each subtask as well as for the output of the overall integration (the fused dataset). Thus, a SUT can be evaluated one subtask at a time or in an end-to-end fashion.
Sources rarely share column names, so every source column has to be matched to a column in the target schema before values can be combined. Names may be descriptive, abbreviated, or meaningless, and where a name gives nothing away, a matcher has to read the values instead.
| Source column | Target attribute |
|---|---|
| forbes · company | name |
| dbpedia · established | founded |
| dbpedia · headquarters | city |
| fullcontact · Attribute_2 (from values only) | name |
| fullcontact · Attribute_6 (from values only) | founded |
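Matching from values only can be illustrated with a minimal sketch. It assumes that one source (here DBpedia) is already mapped to the target schema and serves as a reference, and it assigns each opaque column to the target attribute whose reference values share the most tokens. The function names, the Jaccard measure, and the threshold are illustrative assumptions, not the matcher used in the benchmark.

```python
import re

def tokens(values):
    """Collect lower-cased word tokens across all non-empty values of a column."""
    out = set()
    for value in values:
        if value:
            out.update(re.findall(r"\w+", value.lower()))
    return out

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_by_values(opaque_cols, reference_cols, threshold=0.3):
    """Map each opaque column to the target attribute with the highest value overlap."""
    mapping = {}
    for col, values in opaque_cols.items():
        col_tokens = tokens(values)
        best_attr, best_score = None, 0.0
        for attr, ref_values in reference_cols.items():
            score = jaccard(col_tokens, tokens(ref_values))
            if score > best_score:
                best_attr, best_score = attr, score
        if best_score >= threshold:  # hypothetical cut-off; unmatched columns are skipped
            mapping[col] = best_attr
    return mapping

# Reference values: the DBpedia Volkswagen record, already mapped to target attributes.
reference = {
    "name": ["Volkswagen"],
    "country": ["Germany"],
    "city": ["Wolfsburg"],
    "founded": ["1937-01-01"],
}
# The FullContact record with its opaque column names.
fullcontact = {
    "Attribute_2": ["Volkswagen Group"],
    "Attribute_3": ["Germany"],
    "Attribute_4": ["Wolfsburg"],
    "Attribute_5": [""],
    "Attribute_6": ["1937-01-01"],
}
mapping = match_by_values(fullcontact, reference)
print(mapping)
```

With a single record the overlap is trivially clean. On full sources the same idea needs sampling and a more robust similarity, since columns such as city and headquarters country can share many tokens.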
Matched columns still store values in different formats and vocabularies, so they are rewritten into the forms the target schema requires. The schema sets a constraint for every attribute: a data type, and where it applies a numeric range, a string pattern, or a date format. A founding date, for instance, must be an ISO YYYY-MM-DD inside an allowed range, and assets must be a non-negative integer in US dollars below an upper bound. Categorical attributes are pinned to taxonomies that ship with the task and fix both the allowed values and the level to normalize to: countries follow CLDR, industries follow GICS, and each task adds its own, such as platform, genre, and rating for Games. Normalization is scored as consistency, that is, whether the output values satisfy these schema constraints.
| Field | As found in sources | Normalized to target |
|---|---|---|
| industry | "Auto & Truck Manufacturers", "Automotive industry" | Automobiles (GICS) |
| founded | "January 01, 1937", "01.01.1937" | 1937-01-01 |
| assets | "411 818 350 000,00", "323.40" (bn) | 411818350000, 323400000000 |
| country | "中华人民共和国", "U.S.A." | China, United States (CLDR) |
These messy forms are real values from the easy and hard variants; the base task is cleaner.
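As a rough sketch of what normalization and the consistency check involve, the following code rewrites the date and asset forms from the table into the target forms and then tests them against schema-style constraints. The list of date formats, the lower date bound, and the asset upper bound are assumptions made for this example; the real constraints ship with each task.

```python
from datetime import date, datetime
from decimal import Decimal

# Assumed set of source date formats, taken from the examples in the table above.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y")

def normalize_date(raw):
    """Return the date as ISO YYYY-MM-DD, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_amount(raw, scale=1):
    """Parse '411 818 350 000,00' or '323.40' and rescale to whole US dollars."""
    cleaned = raw.replace(" ", "").replace("\u00a0", "")
    if "," in cleaned:
        # A comma is treated as the decimal separator, dots as thousands separators.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    return round(Decimal(cleaned) * scale)

def is_consistent(record, min_date=date(1000, 1, 1), max_assets=10**14):
    """Check a record against hypothetical constraints for founded and assets."""
    try:
        founded = date.fromisoformat(record["founded"])
    except (TypeError, ValueError):
        return False
    assets = record["assets"]
    return (
        min_date <= founded <= date.today()
        and isinstance(assets, int)
        and 0 <= assets < max_assets
    )

normalized = {
    "founded": normalize_date("January 01, 1937"),
    "assets": normalize_amount("323.40", scale=10**9),  # source value is in billions
}
print(normalized, is_consistent(normalized))
```

Because normalization is scored as consistency with the schema, a check of this kind only tells whether a value is admissible, not whether it is the correct value for the entity.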
Records that refer to the same entity have to be found across sources. Scoring the full cartesian product of record pairs is quadratic in the input size and computationally infeasible, so blocking first reduces it to a smaller set of candidate pairs, and matching then classifies each candidate as a match or a non-match. The hardest cases are look-alikes that fall into the same block but refer to different entities.
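The two blocking measures reported later, pair completeness and reduction ratio, can be sketched on a toy example. The code below uses simple token blocking on titles, which is only one of many possible blockers, and it treats the three Discogs and MusicBrainz sample rows shown earlier as matching pairs purely for illustration.

```python
import re
from itertools import product

def token_blocks(records, key="title"):
    """Index record ids by every word token that occurs in the blocking key."""
    blocks = {}
    for rid, rec in records.items():
        for tok in set(re.findall(r"\w+", rec[key].lower())):
            blocks.setdefault(tok, set()).add(rid)
    return blocks

def candidate_pairs(left, right, key="title"):
    """Pair up records from both sources that share at least one token."""
    left_blocks, right_blocks = token_blocks(left, key), token_blocks(right, key)
    pairs = set()
    for tok in left_blocks.keys() & right_blocks.keys():
        pairs.update(product(left_blocks[tok], right_blocks[tok]))
    return pairs

def pair_completeness(candidates, gold_matches):
    """Share of true matches that survive blocking."""
    return len(candidates & gold_matches) / len(gold_matches)

def reduction_ratio(candidates, n_left, n_right):
    """Share of the cartesian product that blocking prunes away."""
    return 1 - len(candidates) / (n_left * n_right)

discogs = {
    "discogs_3": {"title": "Fermats Theorem / Sight Beyond"},
    "discogs_4": {"title": "Tempest / Inner Sense"},
    "discogs_5": {"title": "The Sign's Alive"},
}
mbrainz = {
    "mbrainz_1": {"title": "Fermats Theorem / Sight Beyond"},
    "mbrainz_2": {"title": "Tempest / Inner Sense"},
    "mbrainz_3": {"title": "The Sign's Alive"},
}
# Assumed gold matches for this illustration.
gold = {("discogs_3", "mbrainz_1"), ("discogs_4", "mbrainz_2"), ("discogs_5", "mbrainz_3")}

candidates = candidate_pairs(discogs, mbrainz)
pc = pair_completeness(candidates, gold)
rr = reduction_ratio(candidates, len(discogs), len(mbrainz))
print(f"pair completeness={pc:.2f}, reduction ratio={rr:.2f}")
```

The two measures pull in opposite directions: a looser blocking key raises pair completeness but lets more non-matching pairs through, which lowers the reduction ratio.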
After matching groups the records of an entity into a cluster, fusion resolves the cluster into a single value for each attribute. Where the sources report conflicting values, fusion has to decide which value to keep, and this conflict resolution is the step where the benchmark leaves the most room for improvement. Each gold value was set by annotators who verified it against independent sources.
| Attribute | Fused value | Provenance |
|---|---|---|
| name | Volkswagen Group | Forbes, FullContact |
| country | Germany | DBpedia, FullContact |
| founded | 1937-01-01 | DBpedia, FullContact |
| assets | 447910000000 | verified |
| revenue | 261910000000 | verified |
Forbes reports $446.9B in assets and DBpedia reports $323.4B. Neither is taken at face value: the gold value, $447.91B, is the figure annotators confirmed by hand.
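To make the conflict-resolution problem concrete, here is a minimal sketch of one common heuristic: majority voting per attribute, with ties broken by source trust. The trust scores and the function name are hypothetical, and this is not the fusion method of any pipeline in the benchmark.

```python
from collections import Counter

def fuse_attribute(values_by_source, trust):
    """Pick the most frequent non-empty value; break ties by the most trusted supporter."""
    present = {s: v for s, v in values_by_source.items() if v not in (None, "")}
    if not present:
        return None, []
    counts = Counter(present.values())
    top = max(counts.values())
    tied = [v for v, c in counts.items() if c == top]

    def best_trust(value):
        # Highest trust score among the sources that report this value.
        return max(trust.get(s, 0.0) for s, v in present.items() if v == value)

    winner = max(tied, key=best_trust)
    supporters = sorted(s for s, v in present.items() if v == winner)
    return winner, supporters

# Hypothetical trust scores, chosen only for this example.
trust = {"forbes": 0.9, "dbpedia": 0.7, "fullcontact": 0.5}

name = fuse_attribute(
    {"forbes": "Volkswagen Group", "dbpedia": "Volkswagen", "fullcontact": "Volkswagen Group"},
    trust,
)
assets = fuse_attribute(
    {"forbes": 446900000000, "dbpedia": 323400000000, "fullcontact": None},
    trust,
)
print(name)
print(assets)
```

The name is resolved as in the gold record, but the asset figure comes out as the Forbes value rather than the verified 447910000000. No rule that only selects among reported values can recover a gold value that none of the sources contains, which is part of why fusion scores stay low.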
Beyond the per-step scores, MaDI-Bench rates the final fused table along three quality dimensions: coverage (did the right entities and values make it in?), consistency (does the table fit the target schema?), and correctness (are the merged values right?). Each is reported at three reference levels: reference-free (structure only, no labels needed), silver (against the human pipeline output), and ground truth (against the annotated test sets).
We validate MaDI-Bench with three pipelines that span the design space from human-engineered to fully automated. All three build on the PyDI integration framework, and their per-step and end-to-end outputs are released in the repository under results/ and the per-task output/ folders.
The human-engineered pipeline consists of domain-specific workflows that master's and PhD students built and refined on PyDI. Every workflow follows the same order: PyDI's LLM schema matcher proposes correspondences, an engineer inspects and hand-corrects them, and the sources are mapped to the target schema. A normalization step then parses dates and numbers, converts units, and applies the taxonomy mappings. Blocking uses embedding or task-specific keys tuned to at least 97% pair completeness. Entity matching combines rule-based and learned tree-based matchers with a global one-to-one assignment, and fusion applies per-attribute PyDI heuristics such as source trust, longest or shortest string, numeric aggregation, and set union.
The pipeline is a practical human reference rather than an optimum, and its fused output serves as the benchmark's silver standard.
Code and outputs: PyDI framework · workflow notebook · silver outputs
The best-of-breed pipeline chains state-of-the-art methods from the literature, one per integration step. It runs the four stages once: at each stage a committee of competing methods is scored on the validation set, and the winner's output is passed to the next stage.
Schema matching: COMA, Magneto, a Sentence-BERT embedding matcher, and PyDI's label-, instance-, duplicate- and LLM-based matchers. Blocking: standard, token and sorted-neighborhood blocking, BM25, Sentence-BERT, and SC-Block, keeping the blocker that clears 97% pair completeness at the highest reduction ratio. Entity matching: Ditto and Magellan. Fusion: PyDI heuristics plus the TruthFinder, LTM, AccuSim, CASEFusion, and FusionQuery truth-discovery models, with a per-attribute selector.
Outputs: results/best of breeds · per-stage method selections for one run
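The blocker selection rule of the best-of-breed pipeline (keep the blocker that clears 97% pair completeness at the highest reduction ratio) can be sketched as follows. The validation scores are made-up numbers, and returning None when no blocker clears the threshold is an assumption of this sketch.

```python
def select_blocker(validation_scores, min_pc=0.97):
    """Among blockers with pair completeness >= min_pc, return the one with the highest reduction ratio."""
    eligible = {
        name: scores
        for name, scores in validation_scores.items()
        if scores["pc"] >= min_pc
    }
    if not eligible:
        return None  # assumed fallback; the source does not describe this case
    return max(eligible, key=lambda name: eligible[name]["rr"])

# Hypothetical validation-set scores (pc = pair completeness, rr = reduction ratio).
validation_scores = {
    "token": {"pc": 0.99, "rr": 0.950},
    "sentence_bert": {"pc": 0.98, "rr": 0.999},
    "sorted_neighborhood": {"pc": 0.92, "rr": 0.9995},
}
chosen = select_blocker(validation_scores)
print(chosen)
```

In this toy setting the sorted-neighborhood blocker prunes the most pairs but misses too many matches, so the rule falls back to the next-best reduction ratio among the blockers that pass the completeness threshold.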
The LLM-based pipeline showcases how well an LLM can configure an end-to-end data integration pipeline. It uses the same PyDI methods as the human pipeline, but an LLM, not a person, sets them up. GPT-5.5 performs schema matching, and the values are normalized by mapping them to the predefined taxonomies, with a column profiler selecting a normalization function for the remaining attributes. An embedding blocker then reduces the sources to candidate pairs.
From a sample of those pairs the pipeline builds its own machine-labeled training and validation sets with GPT-5.2 as the labeler, independent of the labels the benchmark provides, and uses them to configure an entity matcher chosen among rule-based and feature-based models such as random forest and XGBoost. For fusion it writes a configuration that assigns each attribute a conflict-resolution heuristic. The LLM only configures the pipeline and generates the training artifacts; it plays no part in the integration itself, which is what lets the pipeline scale to the largest tasks.
Code: automatic-data-integration · Outputs: results/llm pipeline · a sample run
The three pipelines behave very differently across the steps. Schema matching is close to perfect for all three, entity matching covers a wide range across tasks, and data fusion is the hardest step for every pipeline. Every number below is per task, and the strongest pipeline in each row is shown in green. The columns are the human-engineered pipeline (Human, P1), the best-of-breed pipeline (BoB, P2), and the LLM-based pipeline (LLM, P3).
All three pipelines reach near-optimal F1 by using the LLM schema matcher provided in PyDI, which reads column names and values together. The label-based and instance-based matchers are shown for reference: their much lower scores show that schema matching is genuinely difficult in the benchmark, while state-of-the-art LLMs reach close to perfect performance once they are given both the headers and the values. Values are F1 (%).
| Task | Human | BoB | LLM | Label-based | Instance-based |
|---|---|---|---|---|---|
| Games | 100 | 100 | 100 | 30.00 | 60.50 |
| Companies | 100 | 100 | 97.56 | 26.10 | 53.80 |
| Music | 100 | 100 | 100 | 28.60 | 64.50 |
| Products | 100 | 100 | 100 | 61.60 | 7.70 |
| Scientific Papers | 100 | 97.14 | 100 | 73.80 | 33.30 |
Pair completeness stays between 94.4 and 100% for every pipeline, so the candidate sets keep almost all of the labeled test matches. Reduction ratio is above 99% on every task except Products, where the LLM pipeline's embedding blocker prunes 97.5% of the pairs against 71.9% for the other two. Entity matching itself spans a wide range, from near the ceiling on Papers and Music to clearly more open on Games and Products. Values are percentages. The best entity-matching score per task is in green.
| Task | Pair completeness, Human | Pair completeness, BoB | Pair completeness, LLM | Reduction ratio, Human | Reduction ratio, BoB | Reduction ratio, LLM | Entity matching F1, Human | Entity matching F1, BoB | Entity matching F1, LLM |
|---|---|---|---|---|---|---|---|---|---|
| Games | 99.55 | 100.00 | 96.85 | 99.93 | 99.90 | 99.96 | 89.45 | 67.30 | 63.15 |
| Companies | 97.73 | 94.37 | 100.00 | 99.34 | 99.24 | 99.69 | 87.65 | 89.29 | 90.69 |
| Music | 96.55 | 96.34 | 99.85 | 99.67 | 99.67 | 99.88 | 95.35 | 94.84 | 98.11 |
| Products | 100.00 | 100.00 | 98.00 | 71.91 | 71.91 | 97.54 | 70.37 | 84.09 | 63.09 |
| Scientific Papers | 97.90 | 97.90 | 99.69 | 99.98 | 99.98 | 99.97 | 99.78 | 99.70 | 96.05 |
Pair completeness and reduction ratio measure blocking quality on the labeled entity-matching test set. Entity matching F1 is on the same test set.
Fusion is the hardest step for every pipeline. The LLM pipeline produces the most accurate fused values on four of the five tasks, and best-of-breed leads on Products. Values are accuracy (%). The best per task is in green.
| Task | Human | BoB | LLM |
|---|---|---|---|
| Games | 71.70 | 64.91 | 84.87 |
| Companies | 44.70 | 45.90 | 63.24 |
| Music | 70.20 | 76.92 | 83.12 |
| Products | 40.20 | 56.68 | 43.65 |
| Scientific Papers | 78.90 | 61.04 | 79.35 |
All three pipelines are preconfigured, so the times below cover only the final integration run, not any configuration search or design effort. The main difference comes from entity matching: best-of-breed mostly selects Ditto, a pre-trained language model whose per-task fine-tuning and inference dominate its runtime, while the human and LLM pipelines use lighter matchers. On Companies, where best-of-breed instead picks the cheaper Magellan, it runs about as fast as the others. Values are end-to-end wall-clock seconds.
| Task | Human | BoB | LLM |
|---|---|---|---|
| Games | 180 | 5,736 | 196 |
| Companies | 70 | 213 | 203 |
| Music | 184 | 5,980 | 211 |
| Products | 98 | 1,388 | 718 |
| Scientific Papers | 637 | 16,184 | 1,374 |
| Mean | 234 | 5,900 | 540 |
MaDI-Bench also scores the final fused table as a whole, along three dimensions: coverage (are the right entities and values present), consistency (does the output fit the target schema), and correctness (are the merged values right). Each is reported at three reference levels: reference-free, against the human pipeline (silver), and against the annotated test sets (ground truth). The table gives the mean over the five tasks under the ground-truth reference, with the better pipeline per metric in green.
| Metric (ground-truth reference, mean over tasks) | Best-of-breed | LLM-based |
|---|---|---|
| Entity recovery (coverage, higher is better) | 0.90 | 0.81 |
| Value drift (coverage, lower is better) | 0.48 | 0.56 |
| BCubed F1 (clustering correctness, higher is better) | 0.97 | 0.94 |
| Fusion accuracy (value correctness, higher is better) | 0.61 | 0.71 |
| Fully-correct rate (every attribute right at once, higher is better) | 0.05 | 0.14 |
Best-of-breed wins on coverage and clustering. The LLM pipeline wins on value correctness. The fully-correct rate, which needs every attribute of an entity right at once, stays low for both, so the integrated task is far from solved. Schema validity (consistency) is close, near 0.95 for both pipelines. Full per-task numbers at all three reference levels are in the results directory.
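The gap between fusion accuracy and the fully-correct rate follows from how the two are counted. The sketch below assumes exact-match comparison per cell, which may be stricter or looser than the benchmark's actual comparison, and the second record is illustrative toy data.

```python
def fusion_scores(fused, gold, attributes):
    """Return (cell-level accuracy, share of entities with every attribute correct)."""
    correct_cells = total_cells = fully_correct = 0
    for entity_id, gold_record in gold.items():
        record = fused.get(entity_id, {})
        hits = [record.get(attr) == gold_record[attr] for attr in attributes]
        correct_cells += sum(hits)
        total_cells += len(hits)
        fully_correct += all(hits)
    return correct_cells / total_cells, fully_correct / len(gold)

attributes = ["name", "country", "founded", "assets"]
gold = {
    "volkswagen": {"name": "Volkswagen Group", "country": "Germany",
                   "founded": "1937-01-01", "assets": 447910000000},
    "taisei": {"name": "Taisei Corporation", "country": "Japan",
               "founded": "1873-01-01", "assets": 3650187000},
}
fused = {
    # Assets taken from Forbes, which differs from the verified gold value.
    "volkswagen": {"name": "Volkswagen Group", "country": "Germany",
                   "founded": "1937-01-01", "assets": 446900000000},
    "taisei": {"name": "Taisei Corporation", "country": "Japan",
               "founded": "1873-01-01", "assets": 3650187000},
}
accuracy, fully_correct_rate = fusion_scores(fused, gold, attributes)
print(accuracy, fully_correct_rate)
```

One wrong cell costs an entity its fully-correct status while barely moving cell-level accuracy, so the fully-correct rate drops quickly as the number of attributes per entity grows.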
To keep the benchmark hard as systems improve, each base task also ships in easy, medium, and hard versions. The records and the correct answers stay the same. Eight difficulty knobs perturb the data, each set to its own target for easy, medium, and hard. A harder level turns up the kinds of heterogeneity that real integration runs into.
Here is the DBpedia record for the Taisei Corporation, base against hard. Schema naming divergence (Knob 8) first renames every column to a cryptic code: org_name→nm, established→ey, nation→cn, headquarters→hq, sector→sg, keypeople_name→kpn, total_assets_val→ta, annual_income→ai. On top of that, the values change:
| Field | Base value | Hard value | Changed by |
|---|---|---|---|
| name | Taisei Corporation | Taisei Corporation | unchanged |
| founded | 1873-01-01 | 01.01.1873 | Knob 5 · format |
| country | Japan | Nippon | Knob 1 · surface |
| city | TokyoShinjuku, Tokyo | TokyoShinjuku, Tokyo | unchanged |
| industry | Construction | Constructi on | Knob 6 · value noise |
| key people | Okura Kihachiro | (dropped) | Knob 3 · attribute drop |
| assets | 3,650,187,000 | 3650.19 | Knob 5 · unit scale |
| revenue | 213,195,000 | 167.18 | Knob 5 · unit + currency |
Seven changes from one base record: a renamed schema, a reformatted date, a reworded country, a corrupted industry string, a dropped person, and two rescaled financial figures. The easy level moves the same dials the other way, with descriptive column names, gentle date and number formats, and fewer dropped cells.
Each knob targets one or two pipeline stages and is set to an easy, medium, or hard target. MaDI-Bench uses eight:
Surface augmentation
Rewrites attribute values into other surface forms using abbreviations, token reorderings, and dropped tokens. This erodes the surface overlap that blocking and matching depend on, and turns clean values into plausible-looking disagreements for fusion.
Example: San Francisco → Francisco San
Entity niche density
Adds similar-but-distinct entities within the same domain niche, such as many products from one brand. Blocks fill up with non-matching pairs, and more candidate pairs sit close to the match / non-match boundary.
Example: generates a near-duplicate company and inserts it into a crowded niche
Attribute drop
Blanks a share of cells per source. Missing blocking keys make records unreachable, missing attributes weaken the evidence for matching, and sparse coverage leaves fusion less to cross-check.
Example: revenue 358,500,000 → (dropped)
Coverage skew
Varies how many sources describe each entity. This changes cluster sizes and thins the evidence available to resolve conflicts for entities that only one or two sources cover.
Example: Inter RAO: its DBpedia row is removed, leaving fewer sources
Format & unit diversity
Rewrites dates, numbers, and quantities into more formats, units, and scales, for example ISO against dotted dates, or raw figures against millions. More admissible forms per field make value standardization harder.
Example: 65,170,000,000 → 65170 (millions)
Value noise
Corrupts a share of values with typos, encoding artefacts, and truncation. The damaged values lower string-similarity scores during matching and show up as spurious disagreements during fusion.
Example: 2005-01-01 → 2o0t-01-o1
Schema naming divergence
Renames source columns along a scale from fully descriptive, through abbreviated, to opaque codes. At the hard end the column name carries no signal, so a matcher has to rely on the values alone.
Example: total_assets_val → ta
Source reliability
Reshuffles which source holds the correct value per attribute, so the most reliable source is right for only a share of entities. No fixed source-trust rule stays optimal.
Example: Accor: the verified name is reassigned to a different source
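The effect of the source reliability knob on a fixed source-trust rule can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration, not the benchmark's generator: the source names, the record layout, and the function names are all hypothetical, and the reshuffle is modeled as swapping the verified value with the value of another source for a chosen share of entities.

```python
import random


def make_entities(n):
    """Toy entities in which source_a starts out holding the verified value."""
    return [
        {
            "truth": f"name-{i}",
            "claims": {
                "source_a": f"name-{i}",
                "source_b": f"nam-{i}",
                "source_c": f"name {i} inc",
            },
        }
        for i in range(n)
    ]


def reshuffle_reliability(entities, share, rng):
    """Move the verified value to another source for `share` of the entities."""
    k = round(share * len(entities))
    for entity in rng.sample(entities, k):
        claims = entity["claims"]
        holder = next(s for s, v in claims.items() if v == entity["truth"])
        other = rng.choice([s for s in claims if s != holder])
        claims[holder], claims[other] = claims[other], claims[holder]


def fixed_trust_accuracy(entities, trusted_source):
    """Accuracy of always taking the value from one trusted source."""
    hits = sum(e["claims"][trusted_source] == e["truth"] for e in entities)
    return hits / len(entities)


rng = random.Random(0)
entities = make_entities(10)
before = fixed_trust_accuracy(entities, "source_a")
reshuffle_reliability(entities, share=0.4, rng=rng)
after = fixed_trust_accuracy(entities, "source_a")
print(before, after)
```

In this toy setting the rule "always trust source_a" loses exactly the reshuffled share of entities, so a fusion strategy has to weigh evidence per attribute and per entity instead of relying on one source.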
Two further knobs (value ambiguity and schema distractors) are specified for synthetic tasks. Full specifications for every knob are in the repository.
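Three of the knobs described above can be sketched in a few lines each. The functions below are illustrative assumptions, not the generator shipped with the benchmark: the names, the look-alike character map, and the share and rate parameters are hypothetical placeholders for the easy, medium, and hard targets.

```python
import random


def reorder_tokens(value):
    """Surface augmentation: reverse the token order of a value."""
    return " ".join(reversed(value.split()))


def drop_cells(rows, share, rng):
    """Attribute drop: blank a share of all cells in one source table."""
    cells = [(i, col) for i, row in enumerate(rows) for col in row]
    dropped = [dict(row) for row in rows]  # leave the input untouched
    for i, col in rng.sample(cells, round(share * len(cells))):
        dropped[i][col] = None
    return dropped


def add_typos(value, rate, rng):
    """Value noise: swap characters for look-alikes with probability `rate`."""
    lookalikes = {"0": "o", "1": "l", "5": "s"}
    return "".join(
        lookalikes[ch] if ch in lookalikes and rng.random() < rate else ch
        for ch in value
    )


rng = random.Random(42)
rows = [
    {"name": "A", "city": "San Francisco", "founded": "2005-01-01", "revenue": 1},
    {"name": "B", "city": "New York", "founded": "1999-05-01", "revenue": 2},
]
print(reorder_tokens("San Francisco"))
print(drop_cells(rows, share=0.25, rng=rng))
print(add_typos("2005-01-01", rate=0.5, rng=rng))
```

Passing a seeded `random.Random` keeps a variant reproducible, so the same perturbed tables can be regenerated for every system under test.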
The variants produce the intended difficulty gradient, clearest for data fusion: relative to the base task, fusion accuracy at the hard level falls in every domain, by up to about 23 points for best-of-breed (Scientific Papers) and 28 for the LLM pipeline (Scientific Papers). Schema matching stays robust across the levels, and entity matching stays resilient once the matcher is re-tuned per variant. The hard variants raise the difficulty in every domain, while the easy variants keep simpler methods competitive. Values are percentages: schema matching and entity matching F1, and fusion accuracy.
| Domain | Level | Best-of-breed (P2) SM | Best-of-breed (P2) EM | Best-of-breed (P2) Fusion | LLM-based (P3) SM | LLM-based (P3) EM | LLM-based (P3) Fusion |
|---|---|---|---|---|---|---|---|
| Companies | Base | 100.0 | 89.3 | 45.9 | 97.6 | 90.7 | 63.2 |
| Companies | Easy | 100.0 | 93.6 | 40.3 | 100.0 | 88.2 | 42.3 |
| Companies | Medium | 100.0 | 88.9 | 39.6 | 100.0 | 77.0 | 56.8 |
| Companies | Hard | 91.3 | 94.5 | 33.3 | 100.0 | 87.4 | 44.4 |
| Games | Base | 100.0 | 67.3 | 64.9 | 100.0 | 63.2 | 84.9 |
| Games | Easy | 98.1 | 66.9 | 62.1 | 97.9 | 76.0 | 81.4 |
| Games | Medium | 98.1 | 69.7 | 63.6 | 97.9 | 72.7 | 78.5 |
| Games | Hard | 98.1 | 79.2 | 52.5 | 97.9 | 82.0 | 72.6 |
| Music | Base | 100.0 | 94.8 | 76.9 | 100.0 | 98.1 | 83.1 |
| Music | Easy | 100.0 | 94.9 | 85.4 | 100.0 | 97.7 | 77.4 |
| Music | Medium | 100.0 | 93.2 | 65.1 | 100.0 | 90.9 | 71.0 |
| Music | Hard | 92.0 | 94.3 | 54.5 | 90.9 | 88.0 | 58.6 |
| Products | Base | 100.0 | 84.1 | 56.7 | 100.0 | 63.1 | 43.6 |
| Products | Easy | 99.5 | 91.5 | 51.6 | 100.0 | 27.5 | 49.3 |
| Products | Medium | 98.1 | 88.1 | 59.3 | 100.0 | 36.9 | 45.9 |
| Products | Hard | 98.1 | 85.3 | 49.5 | 98.0 | 51.0 | 34.1 |
| Scientific Papers | Base | 97.1 | 99.7 | 61.0 | 100.0 | 96.0 | 79.3 |
| Scientific Papers | Easy | 92.9 | 99.5 | 58.1 | 91.7 | 90.9 | 64.6 |
| Scientific Papers | Medium | 90.5 | 99.2 | 47.7 | 91.7 | 87.6 | 62.3 |
| Scientific Papers | Hard | 92.9 | 96.7 | 37.8 | 91.7 | 78.1 | 51.1 |
SM is schema matching F1, EM is entity matching F1, Fusion is data fusion accuracy. The human pipeline is not run on the variants. These are the per-stage scores from the paper (Table XII).
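The base-to-hard drops in fusion accuracy can be recomputed from the table. The sketch below copies the Base and Hard fusion values for both pipelines and reports the drop per domain. The dictionary layout and the function names are illustrative choices.

```python
# Fusion accuracy in percent, copied from the table: (base, hard) per pipeline.
FUSION = {
    "Companies": {"P2": (45.9, 33.3), "P3": (63.2, 44.4)},
    "Games": {"P2": (64.9, 52.5), "P3": (84.9, 72.6)},
    "Music": {"P2": (76.9, 54.5), "P3": (83.1, 58.6)},
    "Products": {"P2": (56.7, 49.5), "P3": (43.6, 34.1)},
    "Scientific Papers": {"P2": (61.0, 37.8), "P3": (79.3, 51.1)},
}


def base_to_hard_drop(pipeline):
    """Percentage-point drop in fusion accuracy from Base to Hard, per domain."""
    return {
        domain: round(scores[pipeline][0] - scores[pipeline][1], 1)
        for domain, scores in FUSION.items()
    }


def largest_drop(pipeline):
    """Domain with the largest Base-to-Hard drop and the size of that drop."""
    drops = base_to_hard_drop(pipeline)
    domain = max(drops, key=drops.get)
    return domain, drops[domain]


for pipeline in ("P2", "P3"):
    print(pipeline, base_to_hard_drop(pipeline), largest_drop(pipeline))
```

Every drop is positive, and the largest one for each pipeline is in Scientific Papers: 23.2 points for best-of-breed (P2) and 28.2 points for the LLM-based pipeline (P3).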
This paper introduced the MaDI-Bench integration benchmark. The benchmark models the full complexity of the data integration process, including schema matching, value normalization, entity matching, and conflict resolution, while accounting for the dependencies between the subtasks. To prevent quick saturation of the benchmark as agentic systems progress, we introduced a generic variant-generation method for deriving harder variants from the base tasks. We validated MaDI-Bench and the generated variants using human-designed pipelines, an LLM-based pipeline, and a best-of-breed pipeline. The validation showed the benchmark's utility for measuring the step-wise as well as the end-to-end performance of data integration systems. We hope that the benchmark will prove useful for the community and will support the development of fully automatic as well as human-in-the-loop data integration systems.
Tasks are released as plain files (CSV, JSON, and XML), so any system can read them. PyDI adds the integration steps and an evaluator for every stage, so a run can be scored without labeling any data.
@misc{steiner2026madibench,
title = {MaDI-Bench: An End-to-End Data Integration Benchmark},
author = {Steiner, Aaron and Peeters, Ralph and Bizer, Christian},
year = {2026},
eprint = {2606.30371},
archivePrefix = {arXiv},
primaryClass = {cs.DB},
url = {https://arxiv.org/abs/2606.30371}
}