Provision of Web-Scale Parallel Corpora for Official European Languages

Scope & Objectives of ParaCrawl

ParaCrawl will create and release large parallel corpora to and from English for all official EU languages through a broad web-crawling effort. State-of-the-art methods will be applied across the entire processing chain, from identifying websites with translated text to collecting, cleaning and delivering parallel corpora that are ready for use as training data for CEF.AT and as translation memories for DG Translation. The project will also make the consortium partners’ open-source tools available to CEF Automated Translation and all other interested parties. Over the course of the project there will be four large parallel corpus releases and two software releases.

24 Languages

The target is to collect parallel corpora for the 24 official languages of the European Union.

Multilingual Web Crawling

ParaCrawl will discover multilingual content from candidate websites and crawl it.

Quality Testing

Improved document- and sentence-level alignment. Cleaned, anonymised and annotated translation units.

Open Source Data Collection Pipeline

The ParaCrawl project will use state-of-the-art open-source tools to crawl, align and clean data, and will bundle them together in an open-source pipeline.
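
As an illustration of what the cleaning stage of such a pipeline might involve, here is a minimal Python sketch. It is not the ParaCrawl tooling: the function names, the sample sentence pairs and the `max_ratio` threshold are all assumptions made for this example. It normalizes whitespace, drops empty or untranslated pairs, rejects pairs with implausible length ratios, and removes duplicates.

```python
import re
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def normalize(text: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> Iterator[Pair]:
    """Yield normalized, de-duplicated pairs with plausible length ratios.

    max_ratio is a hypothetical threshold chosen for illustration only.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # identical text, most likely left untranslated
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different to be a likely translation
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        yield src, tgt


# Made-up sample data, for illustration only.
sample = [
    ("Hello  world", "Hola mundo"),
    ("Hello world", "Hola mundo"),
    ("Contact", "Contact"),
    ("Yes", "Este texto es demasiado largo para ser una traducción"),
    ("", "Vacío"),
]
cleaned = list(clean_pairs(sample))
print(cleaned)
```

On this sample only the first pair survives. A production pipeline would typically add language identification and more robust quality scoring on top of simple rules like these.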

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2

Follow ParaCrawl on GitHub

Project Partners:

Other Contributors:


© Copyright 2018. All rights reserved.