ParaCrawl Corpus release v8
The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 8 is the first release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".
ParaCrawl 8 adds a large amount of data to previous releases, along with additional cleaning routines such as the removal of machine-translated content, detected through the use of MT plugins in websites (more details). The corpus is the result of a full reprocessing of all content from already crawled sources, plus new sources from the Internet Archive and from new crawls.
This release relies on an updated and enhanced version of Bitextor (see changes), including minor fixes for Bifixer (fixing), Bicleaner (filtering) and Biroamer (anonymization). For the first time, Bitextor provides deferred crawled corpora as part of this release.
As a bonus, a corpus made of all the monolingual English data in V8 (96 billion sentences!) has been produced, along with a new version of the English-Russian corpus. New synthesized data for four domains (Financial, Law, IT and Medical) is also available as part of this release.
New version 8.1 for Spanish-Galician and Spanish-Catalan: we discovered that, due to a processing error, a lot of Spanish content had ended up in Catalan and Galician sentences. We have produced new filtered versions of these two pairs to fix this issue.