Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages

Scope & Objectives of ParaCrawl

Action: Broader Web-Scale Provision of Parallel Corpora for European Languages

This action builds upon the ongoing effort to collect parallel corpora for the 24 official EU languages by adding Icelandic, Norwegian (Bokmål and Nynorsk), Basque, Catalan/Valencian, and Galician. It goes beyond HTML by ingesting PDFs and word-processing formats, and expands the current crawling effort to 1 petabyte of compressed web pages from the latest Internet Archive crawl. It also adds domain filtering and weighting with a freely provided open-source tool, together with better document and segment alignment, better cleaning, and corpus postprocessing. We expect to create the largest parallel corpus for many of these languages, focusing on the needs of CEF-AT.

Action: Provision of Web-Scale Parallel Corpora for Official European Languages

ParaCrawl will create and release large parallel corpora to and from English for all official EU languages through a broad web crawling effort. State-of-the-art methods will be applied across the entire processing chain, from identifying websites with translated text to collecting, cleaning and delivering parallel corpora that are ready for use as training data for CEF-AT and as translation memories for DG Translation. The project will also make consortium partners’ open-source tools available to CEF Automated Translation and all other interested parties. Throughout the project there will be four large parallel corpora releases and two software releases.

24 Languages

The target is to collect parallel corpora for the 24 official languages of the European Union.

Multilingual Web Crawling

ParaCrawl will discover multilingual content from candidate websites and crawl it.

Quality Testing

Improved document and sentence-level alignment. Cleaned, anonymised and annotated translation units.

Open Source Data Collection Pipeline

The ParaCrawl project will use state-of-the-art open-source tools to crawl, align and clean data, and will bundle them together in an open-source pipeline.

Project Milestones

Action: Broader Web-Scale Provision of Parallel Corpora for European Languages

November 2018, Website updated with Broader WebCrawl.

February 2019, Integration of Initial Domain Identification Technology.

June 2019, Integration of Processing of Broader Document Formats.

September 2019, Inclusion of Data from Internet Archive in Data Release.

September 2019, Data Release 1

March 2020, Data Release 2

August 2020, Final Code Release

August 2020, Domain Identification for Data Release 1

September 2020, Data Release 3

Action: Continued Web-Scale Provision of Parallel Corpora for European Languages

September 2020, Augmented Data Release including Manufactured Data

September 2020, Implementation of Deferred Crawling

March 2021, Data Release 1 (This will include extra data from the Internet Archive and an updated version of the manufactured data)

May 2021, Completion of data portal, with data from release 1

September 2021, Data Release 2 (This will include more data from the Internet Archive and an updated version of the manufactured data)

September 2021, Code Release

September 2021, Quality testing for release 1

September 2021, Data promotion plan

September 2021, Validation: Metadata related to "Augmented Data, Data release 1 and 2" to be compliant with the ELRC-SHARE specifications and available on the ELRC-SHARE repository.

Project Partners:

Other Contributors:

Follow ParaCrawl on GitHub and Twitter

© Copyright 2019. All rights reserved.

Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.