The main goal of the ParaCrawl project is to create the largest publicly available parallel corpora by crawling hundreds of thousands of websites with open-source tools. As part of this effort, several open-source components have been developed and integrated into Bitextor, a highly modular open-source pipeline for harvesting parallel corpora from multilingual websites or from pre-existing or historical web crawls, such as Common Crawl or the one available as part of the Internet Archive. The processing pipeline consists of five steps: crawling, text extraction, document alignment, sentence alignment, and sentence pair filtering. The ACL paper describes these steps in detail and empirically evaluates alternative methods in terms of their impact on machine translation quality. Hunalign, Bleualign, and Vecalign are evaluated for the sentence alignment step; Zipporah, Bicleaner, and LASER are evaluated for the sentence pair filtering step. The benchmarking data sets for these evaluations are also published. The paper also describes the released parallel corpora and tabulates their size before and after cleaning for different languages. The quality and usefulness of the data are measured by training Transformer-based machine translation models with Marian for five different languages, and improvements in BLEU scores are reported against models trained on WMT data sets. Furthermore, the energy cost of running and maintaining such a computationally expensive pipeline is discussed, and positive environmental impacts are highlighted. The paper aims to contribute to the further development of novel methods for processing raw parallel data and for training neural machine translation on noisy data, especially for low-resource languages.
Watch our pre-recorded talk on the ACL 2020 Virtual Conference website and join the live Q&A sessions on Tuesday, July 7, 2020:
Session 8A: Resources and Evaluation-7 14:00–15:00 CEST
Session 9A: Resources and Evaluation-9 19:00–20:00 CEST