The paper presents MammoTab 25, a new dataset comprising approximately 838,930 Wikipedia tables extracted from over 63 million English Wikipedia pages and semantically annotated through Wikidata. Each table in MammoTab 25 is accompanied by fine-grained metadata, including column typing, NIL flags, and statistics, and by four prompt templates, making the resource simultaneously suitable for training, fine-tuning, and stress-testing Large Language Models (LLMs). MammoTab 25 covers, in a single benchmark, all key challenges for the semantic interpretation of tables, such as disambiguation issues, homonymy and acronym presence, NIL mentions, and large web-table sizes; the tags attached to every table let researchers isolate and diagnose specific failure cases with precision. The corpus is delivered with an open-source pipeline that can be rerun on future Wikipedia dumps, ensuring long-term sustainability and up-to-date annotations. MammoTab 25 already supports, and will continue to support, a public leaderboard that evaluates the Semantic Table Interpretation (STI) capabilities of state-of-the-art and upcoming LLMs, providing the community with a live yardstick of progress.

Cremaschi, M., Belotti, F., D'Souza, J., Palmonari, M. (2025). MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation - Training, Testing, and Detecting Weaknesses. In The Semantic Web – ISWC 2025: 24th International Semantic Web Conference, Nara, Japan, November 2–6, 2025, Proceedings, Part II (pp. 131-148) [10.1007/978-3-032-09530-5_8].

MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation - Training, Testing, and Detecting Weaknesses

Cremaschi, Marco; Belotti, Federico; D'Souza, J.; Palmonari, Matteo
2025

Type: paper
Keywords: knowledge graph, artificial intelligence, tabular data
Language: English
Conference: 24th International Semantic Web Conference, November 2–6, 2025
Proceedings: The Semantic Web – ISWC 2025 24th International Semantic Web Conference, Nara, Japan, November 2–6, 2025, Proceedings, Part II
ISBN: 9783032095299
Date: 29 Oct 2025
Year: 2025
Pages: 131-148
Access: open
Files in this record:
Cremaschi-2025-24 Int Semantic Web Conf-preprint.pdf

open access

Attachment type: Submitted Version (Pre-print)
License: Not specified
Size: 2.47 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/576467