ParaCrawl Corpus release v8
The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 8 is the first release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".
ParaCrawl 8 adds a large amount of data to previous releases, along with additional cleaning routines such as the removal of machine-translated content, which is detected through the use of MT plugins in websites (more details). The corpus is the result of a full reprocessing of all content from previously crawled sources, plus new sources drawn from the Internet Archive or from new crawls.
This release relies on an updated and enhanced version of Bitextor (see changes), including minor fixes for Bifixer (fixing), Bicleaner (filtering) and Biroamer (anonymization). For the first time, Bitextor provides deferred crawled corpora as part of this version.