Bicocca Open Archive

DIETA: A Decoder-only transformer-based model for Italian–English machine TrAnslation

Kasela P.; Braga M.; Ghiotto A.; Pilzer A.; Viviani M.; Raganato A.

2025

Abstract

In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian–English machine translation. We collect and curate a large parallel corpus of approximately 207 million Italian–English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, along with 352 million back-translated examples generated with pretrained models. Additionally, we create and release a new small-scale evaluation set of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian–English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are publicly available at https://github.com/pkasela/DIETA-Machine-Translation, facilitating further research and development in specialized Italian–English machine translation.

Contribution type	paper
Keywords	Italian–English Translations; Large Language Models; Machine Translation; Parallel Corpus
Content language	English
Conference name	11th Italian Conference on Computational Linguistics, CLiC-it 2025 - September 24-26, 2025
Conference year	2025
Volume editors	Bosco, C; Jezek, E; Polignano, M; Sanguinetti, M
Proceedings title	Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
Series	CEUR Workshop Proceedings
Publication date	2025
Volume number	4112
First page	1
Last page	11
Alternative URL	https://ceur-ws.org/Vol-4112/
Full text	open
Citation	Kasela, P., Braga, M., Ghiotto, A., Pilzer, A., Viviani, M., Raganato, A. (2025). DIETA: A Decoder-only transformer-based model for Italian–English machine TrAnslation. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025) (pp. 1-11). CEUR-WS.
Appears in categories	02 - Conference contribution

Files in this item:

File	Access	Attachment type	License	Size	Format
Kasela et al-2025-CLiC-it-CEUR-VoR.pdf	Open access	Publisher’s Version (Version of Record, VoR)	Creative Commons	1.34 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/604162
