ParaCrawl 7 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages". It uses a new version of Bicleaner, version 0.14 (see full log of changes). Some highlights are as follows:
- new rules have been implemented to filter out noise, e.g. sentences containing many glued words or inappropriate language
- the default classifier now uses a different technology: extremely randomised trees instead of a random forest
- classifier features have been improved to cope better with out-of-vocabulary words (OOVs) and make the most of the probabilistic dictionaries
- the training procedure has been simplified and log messages are now more informative
- pre-trained language packs are now easier to access
- the 29 available language packs have been updated
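
To illustrate the kind of noise rule mentioned in the first highlight, here is a minimal sketch of a "glued words" check. It is not Bicleaner's actual rule: the function names (`count_case_joins`, `has_glued_words`, `filter_pairs`), the heuristic (counting lowercase-to-uppercase transitions inside a token) and the `max_joins` threshold are all assumptions made for illustration only.

```python
def count_case_joins(token: str) -> int:
    """Count lowercase-to-uppercase transitions inside a single token."""
    return sum(1 for a, b in zip(token, token[1:]) if a.islower() and b.isupper())


def has_glued_words(sentence: str, max_joins: int = 1) -> bool:
    """Hypothetical rule: flag a sentence if any whitespace-separated token
    contains more than `max_joins` lowercase-to-uppercase transitions."""
    return any(count_case_joins(tok) > max_joins for tok in sentence.split())


def filter_pairs(pairs, max_joins: int = 1):
    """Keep only the sentence pairs in which neither side is flagged."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not has_glued_words(src, max_joins) and not has_glued_words(tgt, max_joins)
    ]


if __name__ == "__main__":
    sample = [
        ("Hello world", "Hola mundo"),
        ("ClickHereToWinNow", "Haga clic aquí"),
    ]
    for src, tgt in filter_pairs(sample):
        print(src, "|||", tgt)
```

The threshold of one transition lets ordinary camel-case names such as "ParaCrawl" through while flagging tokens that look like several words run together. A real filter would likely need language-specific handling, since scripts without letter case give this heuristic nothing to work with.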
Corpus sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v7).
The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on GitHub.