1st corpus release for ParaCrawl

Sunday, 14 January 2018 00:00

The first version of the ParaCrawl corpus has been released. It contains parallel corpora for 11 languages paired with English: German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish and Latvian. The text was crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that covers the selected websites much more fully than CommonCrawl does.

Corpus size, BLEU score evaluations and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages paired with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

The corpora are released under the Creative Commons CC0 license ("no rights reserved"): https://creativecommons.org/share-your-work/public-domain/cc0/

Last modified on Tuesday, 26 November 2019 10:34

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages
