The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.

Agostini, A., Maurino, A., Spahiu, B. (2025). PuckTrick: A Library for Making Synthetic Data More Realistic. In Proceedings of the 33nd Symposium on Advanced Database Systems (pp.382-398). CEUR-WS.

PuckTrick: A Library for Making Synthetic Data More Realistic

Agostini A.;Maurino A.;Spahiu B.
2025

Abstract

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.
paper
data quality
English
33nd Symposium on Advanced Database Systems, SEBD 2025 - June 16th to 19th, 2025
2025
Bartolini, I; Cima, G; Firmani, D; Lembo, D; Martinenghi, D; Mecella, M; Poggi, A
Proceedings of the 33nd Symposium on Advanced Database Systems
2025
4182
382
398
https://ceur-ws.org/Vol-4182/
open
Agostini, A., Maurino, A., Spahiu, B. (2025). PuckTrick: A Library for Making Synthetic Data More Realistic. In Proceedings of the 33nd Symposium on Advanced Database Systems (pp.382-398). CEUR-WS.
File in questo prodotto:
File Dimensione Formato  
Agostini et al-2025-SEBD-CEUR-VoR.pdf

accesso aperto

Tipologia di allegato: Publisher’s Version (Version of Record, VoR)
Licenza: Creative Commons
Dimensione 1.84 MB
Formato Adobe PDF
1.84 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/607061
Citazioni
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
Social impact