ParaCrawl Corpus release v5.1

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Version 5.1 is built on the same raw corpus as v5. Thanks to improvements in the filtering procedure, the official subset released as version 5.1 is larger for almost all language pairs (the exceptions are ga, de, sl and et). Extrinsic evaluation through MT on several language pairs also shows an improvement in quality.

A newer version is available
See Latest Releases
Language        Sentences       Source Words
Bulgarian       2,775,256       60,246,308
                2,775,256       60,246,308
                248,555,951     1,564,051,100
Czech           5,345,693       105,973,351
                5,345,693       105,973,351
                665,535,115     4,025,512,842
Danish          4,851,772       111,476,139
                4,851,772       111,476,139
                447,743,455     3,347,135,236
German          34,371,306      708,068,143
                34,371,306      708,068,143
                5,038,103,659   27,994,213,177
Greek           4,038,777       93,473,163
                4,038,777       93,473,163
                640,502,801     3,768,712,672
Spanish         44,587,162      1,072,236,916
                44,587,162      1,072,236,916
                2,674,900,280   16,598,620,402
Estonian        1,452,963       31,597,344
                1,452,963       31,597,344
                168,091,382     915,074,587
Finnish         3,421,382       66,385,933
                3,421,382       66,385,933
                460,181,215     2,731,068,033
French          63,634,915      1,518,457,124
                63,634,915      1,518,457,124
                4,273,819,421   24,983,683,983
Irish           521,768         12,089,677
                521,768         12,089,677
                64,628,733      667,211,260
Croatian        1,993,180       44,945,371
                1,993,180       44,945,371
                273,330,006     1,738,164,401
Hungarian       4,782,328       115,330,046
                4,782,328       115,330,046
                461,181,772     3,208,285,083
Italian         24,089,063      587,087,473
                24,089,063      587,087,473
                2,251,771,798   13,150,606,108
Lithuanian      1,368,514       27,894,906
                1,368,514       27,894,906
                198,101,611     963,384,230
Latvian         1,056,252       22,810,714
                1,056,252       22,810,714
                176,113,669     1,069,218,155
Maltese         186,630         4,280,211
                186,630         4,280,211
                3,693,930       38,492,028
Dutch           11,272,396      247,536,605
                11,272,396      247,536,605
                1,101,087,006   6,792,400,704
Polish          6,577,804       143,702,545
                6,577,804       143,702,545
                723,052,912     4,123,972,411
Portuguese      15,259,967      337,394,318
                15,259,967      337,394,318
                1,068,161,866   6,537,298,891
Romanian        3,176,488       69,998,913
                3,176,488       69,998,913
                510,209,923     3,034,045,929
Slovak          2,496,533       48,160,348
                2,496,533       48,160,348
                269,067,288     1,416,750,646
Slovenian       1,220,652       29,042,458
                1,220,652       29,042,458
                175,682,959     1,003,867,134
Swedish         6,633,761       149,048,559
                6,633,761       149,048,559
                620,338,561     3,496,650,816
Bonus Release (Low resource languages) - Last Updates on Sep 2021
Language        Sentences       Source Words
Khmer           65,113          1,511,950
                21,560,446      21,565,078
                65,113          1,511,950
Burmese         31,374          661,577
                40,590,354      40,595,755
                31,374          661,577
Nepali          92,084          2,941,031
                36,454,553      36,466,101
                92,084          2,941,031
Pashto          26,321          692,651
                2,587,950       2,593,163
                26,321          692,651
Singhalese      217,407         5,791,982
                38,720,907      38,724,422
                217,407         5,791,982
Somali          14,879          506,201
                28,387,922      28,396,227
                14,879          506,201
Swahili         132,517         3,696,543
                84,605,506      84,605,506
                132,517         3,696,543
Tagalog         248,684         6,327,801
                108,260,601     108,260,601
                248,684         6,327,801
Bonus Release - Last Updates on Sep 2021
Language        Sentences       Source Words
Chinese (New)   14,170,585      217,604,664
                14,170,585      217,604,664
                1,207,487,761   8,953,713,029
Russian         5,377,911       101,312,142
                5,377,911       101,312,142
                491,941,804     492,260,972
English-Korean  4,002,441       61,963,744
                4,002,441       61,963,744
Dutch-French    2,687,331       60,504,313
                2,687,331       60,504,313
                38,164,560      770,141,393
Polish-German   916,522         18,883,576
                916,522         18,883,576
                11,060,105      202,765,359
Two formats of the BiCleaner files are provided: you can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
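For readers who only need a quick look at a TMX file, the conversion to tab-separated text can also be sketched with Python's standard library. This is not the TMXT tool and does not reproduce its options. It is a minimal sketch that assumes the usual TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child) and assumes language codes such as `en` and `bg`. The function name `tmx_to_tsv` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(tmx_source, out, src_lang="en", tgt_lang="bg"):
    """Write one 'source<TAB>target' line per translation unit.

    tmx_source: a path or a binary file object containing TMX (assumed layout).
    out: a text file object to write to.
    Returns the number of sentence pairs written.
    """
    written = 0
    # iterparse streams the file instead of loading it all at once.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                text = "".join(seg.itertext())
                # Collapse whitespace so tabs or newlines inside a segment
                # cannot break the one-pair-per-line layout.
                segs[lang] = " ".join(text.split())
        if src_lang in segs and tgt_lang in segs:
            out.write(f"{segs[src_lang]}\t{segs[tgt_lang]}\n")
            written += 1
        elem.clear()  # drop the unit's children to keep memory use low
    return written


# Tiny made-up sample, only to show the call.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="bg"><seg>Здравей, свят</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

buffer = io.StringIO()
count = tmx_to_tsv(io.BytesIO(sample), buffer, "en", "bg")
print(count, buffer.getvalue(), end="")
```

For a real release file, pass the path of the decompressed TMX file and an output file opened with `encoding="utf-8"`. Units that lack either of the two requested languages are skipped.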