Broader Web-Scale Provision of Parallel Corpora for European Languages

Scope & Objectives of ParaCrawl

Action: Broader Web-Scale Provision of Parallel Corpora for European Languages

This action builds upon the ongoing effort to collect parallel corpora for the 24 official EU languages by adding Icelandic, Norwegian (Bokmål and Nynorsk), Basque, Catalan/Valencian, and Galician. It goes beyond HTML by ingesting PDFs and word-processing formats, and it expands the current crawling effort to 1 petabyte of compressed web pages from the latest Internet Archive crawl. It also provides domain filtering and weighting through a freely available open-source tool, along with better document and segment alignment, better cleaning, and corpus post-processing. We expect to create the largest parallel corpus for many of these languages, focusing on the needs of CEF-AT.

Action: Provision of Web-Scale Parallel Corpora for Official European Languages

ParaCrawl will create and release large parallel corpora to/from English for all official EU languages through a broad web-crawling effort. State-of-the-art methods will be applied across the entire processing chain, from identifying websites with translated text to collecting, cleaning, and delivering parallel corpora that are ready to use as training data for CEF.AT and as translation memories for DG Translation. The project will also make the consortium partners' open-source tools available to CEF Automated Translation and all other interested parties. Over the course of the project there will be four large parallel corpus releases and two software releases.

24 Languages

The target is to collect parallel corpora for the 24 official languages of the European Union.

Multilingual Web Crawling

ParaCrawl will discover multilingual content from candidate websites and crawl it.

Quality Testing

Improved document and sentence-level alignment. Cleaned, anonymised and annotated translation units.
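
As a rough illustration of what cleaning and anonymising translation units can involve, the following minimal Python sketch drops empty segments, mismatched-length pairs, and exact duplicates, and masks e-mail addresses. It is not the ParaCrawl implementation: the function names, the `max_ratio` threshold, the `<EMAIL>` placeholder, and the simple e-mail pattern are all assumptions chosen for the example.

```python
import re

# Assumption: a deliberately simple e-mail matcher, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymise(text):
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean_units(pairs, max_ratio=3.0):
    """Return cleaned (source, target) translation units.

    Drops pairs with an empty side, pairs whose character-length ratio
    exceeds max_ratio (a hypothetical threshold), and exact duplicates;
    e-mail addresses in the remaining units are masked.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different to be a plausible translation
        unit = (anonymise(src), anonymise(tgt))
        if unit in seen:
            continue  # exact duplicate of an earlier unit
        seen.add(unit)
        cleaned.append(unit)
    return cleaned
```

A call such as `clean_units([("Hello world", "Hallo Welt"), ("Hello world", "Hallo Welt")])` keeps a single copy of the pair. Production pipelines typically add language identification and model-based scoring on top of simple rules like these.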

Open Source Data Collection Pipeline

The ParaCrawl project will use state-of-the-art open-source tools to crawl, align, and clean data, and will bundle them together in an open-source pipeline.

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2

© Copyright 2019. All rights reserved.

Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.