Monday, 19 November 2018 00:00

ParaCrawl corpus release v3.0

The third version of the ParaCrawl corpus has been released. It contains parallel corpora for 23 languages paired with English. Six new languages are added in the v3 release: Bulgarian, Danish, Greek, Slovak, Slovenian and Swedish. For the previously released languages, more data has been added to the corpus. For each language, two versions of the corpus are released, each produced with a different cleaning tool: BiCleaner or Zipporah. The ParaCrawl corpus is crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl that has much higher coverage of the selected websites than CommonCrawl.

Corpus size and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

The source code of the ParaCrawl OpenSource Pipeline (Bitextor) is also available on GitHub.

The corpus and software are released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). The next release is scheduled for March 2019.

The corpora are released under the Creative Commons CC0 license ("no rights reserved"): https://creativecommons.org/share-your-work/public-domain/cc0/

Last modified on Tuesday, 26 November 2019 10:43