ParaCrawl Corpus Bonus Release
Two language pairs, Dutch-French and Polish-German, are part of this bonus release. These language pairs were crawled in collaboration with an industry partner. The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility.
Language                  Sentences   Source Words  Sentences   Source Words  Sentences      Source Words
English-Hindi (New)       4,712,564   74,000,000    4,712,564   74,000,000    4,712,564      74,000,000
English-Indonesian (New)  7,133,323   109,000,000   7,133,323   109,000,000   7,133,323      109,000,000
English-Khmer v2 (New)    1,501,304   23,000,000    1,501,304   23,000,000    1,501,304      23,000,000
English-Korean v2 (New)   7,709,312   114,000,000   7,709,312   114,000,000   7,709,312      114,000,000
English-Lao (New)         1,994,053   27,000,000    1,994,053   27,000,000    1,994,053      27,000,000
English-Burmese v2 (New)  1,666,530   28,000,000    1,666,530   28,000,000    1,666,530      28,000,000
English-Nepali v2 (New)   2,243,954   32,000,000    2,243,954   32,000,000    2,243,954      32,000,000
English-Thai (New)        2,175,890   22,000,000    2,175,890   22,000,000    2,175,890      22,000,000
English-Vietnamese (New)  6,291,407   93,000,000    6,291,407   93,000,000    6,291,407      93,000,000
Polish-Czech              24,001,403  288,826,678   24,001,403  288,826,678   6,055,618,075  28,559,061,699
English-Ukrainian         13,354,365  505,831,880   13,354,365  505,831,880   235,700,383    5,832,658,894
English-Chinese           14,170,585  217,604,664   14,170,585  217,604,664   1,207,487,761  8,953,713,029
English-Russian           5,377,911   101,312,142   5,377,911   101,312,142   491,941,804    492,260,972
English-Korean            4,002,441   61,963,744    4,002,441   61,963,744    0              0
Dutch-French              2,687,331   60,504,313    2,687,331   60,504,313    38,164,560     770,141,393
Polish-German             916,522     18,883,576    916,522     18,883,576    11,060,105     202,765,359
BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
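For readers who want to see what such a conversion involves, the following is a minimal Python sketch that uses only the standard library. It is not the TMXT tool and makes no claim about how that tool works. It assumes the usual TMX layout, in which each `tu` element holds one `tuv` per language with the text in a `seg` child. The function name `tmx_to_tsv_rows`, the language codes, and the sample document are hypothetical and chosen for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv_rows(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) pairs from a TMX file or file object.

    Hypothetical helper: assumes each <tu> holds one <tuv> per language.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Collapse tabs and newlines so each pair fits on one TSV line.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # free the translation unit that was just processed


# Tiny made-up sample standing in for a downloaded TMX file.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>Goedemorgen</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="nl"><seg>Dank	u
wel</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci  beaucoup</seg></tuv>
    </tu>
  </body>
</tmx>"""

rows = list(tmx_to_tsv_rows(io.BytesIO(SAMPLE_TMX), "nl", "fr"))
for src, tgt in rows:
    print(f"{src}\t{tgt}")
```

To process a real download, pass the path of the TMX file instead of the in-memory sample and write each pair to an output file. Whitespace inside a segment is collapsed to single spaces because a literal tab or newline would otherwise break the tab-separated layout.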