ParaCrawl Corpus release v6
The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release 6 adds a new language pair, English-Icelandic, and considerably more data for many other languages. Restorative cleaning with Bifixer yields more data by improving sentence splitting, better data by fixing wrong encoding, HTML issues, alphabet issues and typos, and unique data by identifying not only exact duplicates but also near-duplicates. Improved Bicleaner models have also been applied in this release to filter out noisy parallel sentences.
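As a rough illustration of near-duplicate detection on parallel data, the Python sketch below normalizes both sides of a sentence pair and hashes the result, so that pairs differing only in case, accents, ASCII punctuation or spacing share one key. This is not Bifixer's actual implementation. The function names (`normalize`, `pair_key`, `drop_near_duplicates`) and the normalization rules are assumptions made for the example.

```python
import hashlib
import string
import unicodedata

# Translation table that deletes ASCII punctuation (illustrative choice).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip combining accents, ASCII punctuation and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.translate(_PUNCT_TABLE).split())


def pair_key(source: str, target: str) -> str:
    """Hash the normalized pair; near-duplicate pairs get the same key."""
    joined = normalize(source) + "\t" + normalize(target)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def drop_near_duplicates(pairs):
    """Keep the first sentence pair seen for each normalized key."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = pair_key(source, target)
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept
```

For example, under these assumed rules the pairs ("Hello, world!", "Halló, heimur!") and ("hello world", "hallo heimur") map to the same key, so only the first of the two is kept.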