Web-Scale Acquisition of Parallel Corpora, ParaCrawl in ACL
The main goal of the ParaCrawl project is to create the largest publicly available parallel corpora by crawling hundreds of thousands of websites using open-source tools. As part of this effort, several open-source components have been developed and integrated into Bitextor, a highly modular pipeline for harvesting parallel corpora from multilingual websites or from pre-existing or historical web crawls such as Common Crawl or the one available as part of the Internet Archive. The processing pipeline consists of five steps: crawling, text extraction, document alignment, sentence alignment, and sentence pair filtering.

The ACL paper describes these steps in detail and empirically evaluates alternative methods in terms of their impact on machine translation quality. The Hunalign, Bleualign and Vecalign tools are evaluated for the sentence alignment step; similarly, Zipporah, Bicleaner and LASER are evaluated for the sentence pair filtering step. Benchmark data sets for these evaluations are also published.

The released parallel corpora are also described in the paper, with statistics tabulated on corpus sizes before and after cleaning for different languages. The quality and usefulness of the data is measured by training Transformer-based machine translation models with Marian for five languages, and improvements in BLEU scores are reported against models trained on WMT data sets. Furthermore, the energy consumption of running and maintaining such a computationally expensive pipeline is discussed and positive environmental impacts are highlighted. The paper aims to contribute to the further development of novel methods for processing raw parallel data and for neural machine translation training with noisy data, especially for low-resource languages.
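To give a flavour of the final pipeline stage, here is a toy sentence-pair filter. Tools such as Zipporah and Bicleaner use trained classifiers over many features; the length-ratio rule below is only a simplified, hypothetical stand-in for that kind of cleaning heuristic:

```python
def keep_pair(src: str, tgt: str, max_ratio: float = 2.0, min_len: int = 3) -> bool:
    """Reject pairs that are too short or whose token counts diverge too much.
    Illustrative only; real filters (Zipporah, Bicleaner, LASER margin
    scoring) are far more sophisticated."""
    s, t = src.split(), tgt.split()
    if len(s) < min_len or len(t) < min_len:
        return False
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio <= max_ratio

pairs = [
    ("the cat sat on the mat", "le chat était assis sur le tapis"),      # plausible pair
    ("click here", "cliquez ici pour plus d'informations légales"),      # too short / skewed
]
kept = [p for p in pairs if keep_pair(*p)]
```

A filter like this discards obvious extraction noise cheaply before the more expensive model-based scoring is applied.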
Watch our pre-recorded talk on the ACL 2020 Virtual Conference website and join the live Q&A sessions on Tuesday, July 7, 2020:
Session 8A: Resources and Evaluation-7 14:00–15:00 CEST
Session 9A: Resources and Evaluation-9 19:00–20:00 CEST
ParaCrawl Corpus Release 6
Release 6 includes a new language pair, English–Icelandic, along with substantially more data for many other languages. Restorative cleaning with Bifixer yields more data through improved sentence splitting; better data by fixing wrong encodings, HTML issues, alphabet issues and typos; and more unique data by identifying not only exact duplicates but also near-duplicates. Improved Bicleaner models have also been applied to filter out noisy parallel sentences for this release.
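Near-duplicate detection can be sketched as hashing an aggressively normalized form of each sentence, so that variants differing only in case, digits or punctuation collapse to the same key. This is a minimal illustration, not Bifixer's actual normalization or hashing scheme:

```python
import hashlib
import re

def near_dup_key(sentence: str) -> str:
    """Lowercase, strip digits/punctuation, collapse whitespace, then hash.
    (Illustrative only; Bifixer's real normalization differs.)"""
    norm = re.sub(r"[\W\d_]+", " ", sentence.lower()).strip()
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

def dedupe(sentences):
    """Keep the first representative of each near-duplicate group."""
    seen, kept = set(), []
    for s in sentences:
        k = near_dup_key(s)
        if k not in seen:
            seen.add(k)
            kept.append(s)
    return kept
```

With this scheme, "Order now!" and "order NOW" map to the same key and only the first is kept.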
Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v6).
The latest release of the ParaCrawl open-source pipeline (Bitextor) is available on GitHub.
The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).
The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)
ParaCrawl Corpus Release 5.1
Version 5.1 builds on the same raw corpus as version 5. Thanks to improvements in the filtering procedure, the official subset extracted as version 5.1 is larger for almost all language pairs (except ga, de, sl and et). Quality, measured extrinsically through MT for several language pairs, has also improved.
Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v5-1).
This is the official release to be used in WMT20. Stay tuned for more news and follow us on Twitter @ParaCrawl.
ParaCrawl - A CEF Digital Success Story
EU funding supports ParaCrawl, the largest collection of language resources for many European languages – significantly improving machine translation quality. Read the Success Story published by CEF Digital, titled "ParaCrawl taps the World Wide Web for language resources".
Kick-off meeting of ParaCrawl 3: Continued Web-Scale Provision of Parallel Corpora for European Languages
The kick-off meeting of the third CEF-funded Action took place last week. The Action aims at improving and expanding the parallel corpora developed in the two previous Actions (ParaCrawl-1, Action no 2016-EU-IA-0114, and ParaCrawl-2, Action no 2017-EU-IA-0178). These previous Actions have already resulted in the release of the largest publicly available parallel corpora ever, for all EU/EEA official languages paired with English, as well as a complete end-to-end crawling and extraction open-source software toolkit.
ParaCrawl 3 will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than one petabyte of compressed data). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences and to the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced. The corpora will be made more useful for training machine translation (MT) systems by post-processing the data to split long sentences, repair broken sentences, and synthesise new sentences.
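One of the planned post-processing steps, splitting overlong "sentences" (often several sentences fused together by imperfect extraction), can be sketched as a split at sentence-final punctuation. This is a hypothetical illustration; the Action's actual method is not specified in this announcement:

```python
import re

def split_long(sentence: str, max_tokens: int = 40):
    """If a segment exceeds max_tokens, split it at sentence-final
    punctuation (., !, ?); otherwise return it unchanged.
    Purely illustrative of the planned post-processing."""
    if len(sentence.split()) <= max_tokens:
        return [sentence]
    parts = re.split(r"(?<=[.!?])\s+", sentence)
    return [p for p in parts if p]
```

A real implementation would also need to keep the source and target sides of each pair aligned after splitting.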
The new corpus releases will be made available via a data portal that will allow users building machine translation systems to select the types of text that best fit their purpose.
Keep posted!