ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". It adds newly crawled data, including data from the Internet Archive. Improvements to the document and sentence aligners, together with an updated BiCleaner strategy, produced corpora twice the size of those in release v4 for all official EU languages (23 languages paired with English).

A newer version is available; see Latest Releases.
Language    Sentences   Source Words
Bulgarian   2,586,277   55,725,444     2,586,277   55,725,444     248,555,951    1,564,051,100
Czech       5,280,149   117,385,158    5,280,149   117,385,158    665,535,115    4,025,512,842
Danish      4,606,183   106,565,546    4,606,183   106,565,546    447,743,455    3,347,135,236
German      36,936,714  929,818,868    36,936,714  929,818,868    5,038,103,659  27,994,213,177
Greek       3,830,643   88,669,279     3,830,643   88,669,279     640,502,801    3,768,712,672
Spanish     38,971,348  897,891,704    38,971,348  897,891,704    2,674,900,280  16,598,620,402
Estonian    1,387,869   30,858,140     1,387,869   30,858,140     168,091,382    915,074,587
Finnish     3,097,223   66,385,933     3,097,223   66,385,933     460,181,215    2,731,068,033
French      51,316,168  1,178,317,233  51,316,168  1,178,317,233  4,273,819,421  24,983,683,983
Irish       782,769     21,909,039     782,769     21,909,039     64,628,733     667,211,260
Croatian    1,861,590   43,464,197     1,861,590   43,464,197     273,330,006    1,738,164,401
Hungarian   4,187,051   104,292,635    4,187,051   104,292,635    461,181,772    3,208,285,083
Italian     22,100,078  533,512,632    22,100,078  533,512,632    2,251,771,798  13,150,606,108
Lithuanian  1,270,933   27,214,054     1,270,933   27,214,054     198,101,611    963,384,230
Latvian     1,019,003   23,656,140     1,019,003   23,656,140     176,113,669    1,069,218,155
Maltese     177,244     4,252,814      177,244     4,252,814      3,693,930      38,492,028
Dutch       10,596,717  233,087,345    10,596,717  233,087,345    1,101,087,006  6,792,400,704
Polish      6,382,371   145,802,939    6,382,371   145,802,939    723,052,912    4,123,972,411
Portuguese  13,860,663  299,634,135    13,860,663  299,634,135    1,068,161,866  6,537,298,891
Romanian    2,870,687   62,189,306     2,870,687   62,189,306     510,209,923    3,034,045,929
Slovak      2,365,339   45,636,383     2,365,339   45,636,383     269,067,288    1,416,750,646
Slovenian   1,406,645   31,855,427     1,406,645   31,855,427     175,682,959    1,003,867,134
Swedish     6,079,175   138,264,978    6,079,175   138,264,978    620,338,561    3,496,650,816
Bonus Release (Low-resource languages) - Last updated Oct. 2024
English-Azerbaijani  3,158,025  47,117,416  3,158,025    47,117,416   3,158,025  47,117,416  336,067,622  5,513,087,127
English-Tajik        343,401    5,513,041   343,401      5,513,041    343,401    5,513,041   11,394,854   244,332,910
English-Armenian     1,988,287  35,997,571  1,988,287    35,997,571   1,988,287  35,997,571  31,671,924   233,771,297
English-Khmer v1     65,113     1,511,950   21,560,446   21,565,078   65,113     1,511,950
English-Burmese v1   31,374     661,577     40,590,354   40,595,755   31,374     661,577
English-Nepali v1    92,084     2,941,031   36,454,553   36,466,101   92,084     2,941,031
English-Pashto       26,321     692,651     2,587,950    2,593,163    26,321     692,651
English-Singhalese   217,407    5,791,982   38,720,907   38,724,422   217,407    5,791,982
English-Somali       14,879     506,201     28,387,922   28,396,227   14,879     506,201
English-Swahili      132,517    3,696,543   84,605,506   84,605,506   132,517    3,696,543
English-Tagalog      248,684    6,327,801   108,260,601  108,260,601  248,684    6,327,801
Bonus Release - Last updated October 2024
English-Hindi (New)       4,712,564   74,000,000   4,712,564   74,000,000   4,712,564      74,000,000
English-Indonesian (New)  7,133,323   109,000,000  7,133,323   109,000,000  7,133,323      109,000,000
English-Khmer v2 (New)    1,501,304   23,000,000   1,501,304   23,000,000   1,501,304      23,000,000
English-Korean v2 (New)   7,709,312   114,000,000  7,709,312   114,000,000  7,709,312      114,000,000
English-Lao (New)         1,994,053   27,000,000   1,994,053   27,000,000   1,994,053      27,000,000
English-Burmese v2 (New)  1,666,530   28,000,000   1,666,530   28,000,000   1,666,530      28,000,000
English-Nepali v2 (New)   2,243,954   32,000,000   2,243,954   32,000,000   2,243,954      32,000,000
English-Thai (New)        2,175,890   22,000,000   2,175,890   22,000,000   2,175,890      22,000,000
English-Vietnamese (New)  6,291,407   93,000,000   6,291,407   93,000,000   6,291,407      93,000,000
Polish-Czech              24,001,403  288,826,678  24,001,403  288,826,678  6,055,618,075  28,559,061,699
English-Ukrainian         13,354,365  505,831,880  13,354,365  505,831,880  235,700,383    5,832,658,894
English-Chinese           14,170,585  217,604,664  14,170,585  217,604,664  1,207,487,761  8,953,713,029
English-Russian           5,377,911   101,312,142  5,377,911   101,312,142  491,941,804    492,260,972
English-Korean            4,002,441   61,963,744   4,002,441   61,963,744
Dutch-French              2,687,331   60,504,313   2,687,331   60,504,313   38,164,560     770,141,393
Polish-German             916,522     18,883,576   916,522     18,883,576   11,060,105     202,765,359
Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
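For readers who want to see what the TMX to tab-separated conversion involves, here is a minimal Python sketch. It is not the TMXT tool and does not reproduce its options. It assumes the standard TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child). The function name `tmx_to_tsv_rows`, the sample document, and the language codes `en` and `de` are illustrative assumptions, not part of the release.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv_rows(source, src_lang, tgt_lang):
    """Yield one 'source<TAB>target' string per translation unit.

    `source` is a file path or a binary file object containing TMX.
    Language codes are compared in lowercase. Units that lack either
    language are skipped.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # Fall back to a plain "lang" attribute (assumed for older files).
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Collect text inside inline markup too, and collapse
                # whitespace so stray tabs or newlines cannot break a row.
                text = " ".join("".join(seg.itertext()).split())
                segments[lang.lower()] = text
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang] + "\t" + segments[tgt_lang]
        elem.clear()  # release the processed unit


# Hypothetical two-unit sample; real release files are far larger.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="de"><seg>Danke.</seg></tuv>
    </tu>
  </body>
</tmx>"""

rows = list(tmx_to_tsv_rows(io.BytesIO(SAMPLE_TMX), "en", "de"))
print("\n".join(rows))
```

To process a downloaded file instead of the sample, pass its path as `source` and write each yielded row to an output file opened with UTF-8 encoding.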