ParaCrawl Corpus release v9
This corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.
Release 9 is the final release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".
ParaCrawl 9 brings new content and higher quality, the result of an improved pipeline with:
- better PDF processing
- language identification based on the full version of CLD2 instead of the lite version
- improved machine translation models (almost all neural) used to align parallel sentences
- neural cleaning applied for the first time
As a bonus, we release an English-Chinese corpus and monolingual data.