ParaCrawl Corpus Bonus Release

Two language pairs, Dutch-French and Polish-German, are part of this bonus release. These language pairs were crawled in collaboration with an industry partner. The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility.

| Language            |     | Sentences  | Source Words |               |                |
|---------------------|-----|------------|--------------|---------------|----------------|
| English-Azerbaijani | New | 3,158,025  | 47,117,416   | 336,067,622   | 5,513,087,127  |
| English-Tajik       | New | 343,401    | 5,513,041    | 11,394,854    | 244,332,910    |
| English-Armenian    | New | 1,988,287  | 35,997,571   | 31,671,924    | 233,771,297    |
| Polish-Czech        |     | 24,001,403 | 288,826,678  | 6,055,618,075 | 28,559,061,699 |
| Ukrainian           |     | 13,354,365 | 505,831,880  | 235,700,383   | 5,832,658,894  |
| Chinese             |     | 14,170,585 | 217,604,664  | 1,207,487,761 | 8,953,713,029  |
| Russian             |     | 5,377,911  | 101,312,142  | 491,941,804   | 492,260,972    |
| English-Korean      |     | 4,002,441  | 61,963,744   |               |                |
| Dutch-French        |     | 2,687,331  | 60,504,313   | 38,164,560    | 770,141,393    |
| Polish-German       |     | 916,522    | 18,883,576   | 11,060,105    | 202,765,359    |

The BiCleaner files are provided in two formats: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
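
For readers who only need a quick look at the data, the TMX-to-tab-separated conversion can also be sketched in a few lines of Python. This is not the TMXT tool and makes no claim about how it works. It is a minimal illustration that assumes the usual TMX layout (`tu` elements containing one `tuv` per language, each with a `seg` child). The function names `tmx_to_rows` and `tmx_to_tsv` are hypothetical and chosen for this example only.

```python
import io
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each translation unit.

    `source` is a file name or a binary file object. Language codes are
    compared in lower case, so pass e.g. "nl" and "fr".
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Assumption: xml:lang is used; fall back to a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the unit's children; corpora hold millions of units


def tmx_to_tsv(source, out, src_lang, tgt_lang):
    """Write one tab-separated sentence pair per line; return the pair count."""
    count = 0
    for src, tgt in tmx_to_rows(source, src_lang, tgt_lang):
        # Collapse whitespace so embedded tabs or newlines cannot break the layout.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        out.write(f"{src}\t{tgt}\n")
        count += 1
    return count
```

A call such as `tmx_to_tsv("corpus.tmx", open("corpus.tsv", "w", encoding="utf-8"), "nl", "fr")` would then write one sentence pair per line, where `corpus.tmx` stands in for whichever downloaded file is being converted. Streaming with `iterparse` is used here because the released files are large; any metadata stored alongside the segments is ignored in this sketch.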