ParaCrawl Corpus Bonus Release

Two language pairs, Dutch-French and Polish-German, are part of this bonus release. These language pairs were crawled in collaboration with an industry partner. The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility.

Language          Sentences     Source Words
Russian           New   0   0   491,941,804   492,260,972   5,377,911   101,312,142   5,377,911   101,312,142
English-Korean    4,002,441     61,963,744     4,002,441     61,963,744
Dutch-French      2,687,331     60,504,313     2,687,331     60,504,313     38,164,560     770,141,393
Polish-German     916,522       18,883,576     916,522       18,883,576     11,060,105     202,765,359

The BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
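
For readers who want to see what such a conversion involves, here is a minimal Python sketch using only the standard library. It is not the TMXT tool and does not reproduce its options. The function name `tmx_to_tsv`, the simplified sample document, and the language codes `nl` and `fr` are assumptions made for illustration. The sketch assumes a standard TMX layout in which each `tu` element holds one `tuv` per language, each with a `seg` child.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(source, out, src_lang, tgt_lang):
    """Write one 'source<TAB>target' line per translation unit.

    `source` is a file name or binary file object, `out` is a text stream.
    Returns the number of lines written. Hypothetical helper, not TMXT.
    """
    written = 0
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse whitespace so embedded tabs or newlines cannot break the TSV.
                segs[lang] = " ".join(text.split())
        if src_lang in segs and tgt_lang in segs:
            out.write(f"{segs[src_lang]}\t{segs[tgt_lang]}\n")
            written += 1
        elem.clear()  # release the processed translation unit
    return written


# Simplified sample document (assumption: real files carry a full TMX header).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>Goedemorgen</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    buffer = io.StringIO()
    count = tmx_to_tsv(io.BytesIO(SAMPLE), buffer, "nl", "fr")
    print(count, repr(buffer.getvalue()))
```

Translation units that lack either requested language are skipped rather than written with an empty column, which keeps every output line at exactly two fields.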