The fourth version of the ParaCrawl corpus has been released. It is the final release for the first ParaCrawl project, Provision of Web-Scale Parallel Corpora for Official European Languages, and contains parallel corpora for 23 European languages paired with English. This release brings cutting-edge improvements to the processing pipeline, mainly aimed at obtaining high-quality bilingual sentences. To that end, extensive cleaning techniques have been applied, such as character-based language model filtering and safe restorative cleaning.
The source code of the ParaCrawl open-source pipeline (Bitextor) is available on GitHub.
The ParaCrawl effort will continue with a second iteration, Broader Web-Scale Provision of Parallel Corpora for European Languages, which will focus on covering more language pairs, ingesting file formats beyond HTML, expanding crawl coverage, and applying domain filtering. Stay tuned for more news and follow us on Twitter @ParaCrawl.
The corpus and software are released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages paired with English).
The corpora are released under the Creative Commons CC0 license ("no rights reserved"): https://creativecommons.org/share-your-work/public-domain/cc0/