ParaCrawl Corpus release v4.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v4 is the final release for the Action "Provision of Web-Scale Parallel Corpora for Official European Languages" and covers all official EU languages (23 languages paired with English).

A newer version is available; see Latest Releases.
Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words
Bulgarian | 1,039,885 | 21,109,546 | 1,039,885 | 21,109,546 | 288,395,110 | 1,552,588,179
Czech | 2,981,949 | 48,918,151 | 2,981,949 | 48,918,151 | 1,189,317,247 | 5,621,562,488
Danish | 2,414,895 | 48,240,290 | 2,414,895 | 48,240,290 | 586,535,848 | 3,484,768,564
German | 16,264,450 | 307,786,150 | 16,264,450 | 307,786,150 | 7,387,809,953 | 32,358,035,774
Greek | 1,985,233 | 38,322,532 | 1,985,233 | 38,322,532 | 740,094,469 | 3,340,324,438
Spanish | 21,987,267 | 476,409,854 | 21,987,267 | 476,409,854 | 3,959,845,706 | 18,128,847,778
Estonian | 853,422 | 16,537,397 | 853,422 | 16,537,397 | 342,677,535 | 1,522,504,098
Finnish | 2,156,069 | 41,564,859 | 2,156,069 | 41,564,859 | 736,050,617 | 3,494,554,815
French | 31,374,161 | 664,924,148 | 31,374,161 | 664,924,148 | 6,429,921,903 | 28,529,875,306
Irish | 357,399 | 8,241,515 | 357,399 | 8,241,515 | 156,189,807 | 1,194,451,883
Croatian | 1,002,053 | 19,904,218 | 1,002,053 | 19,904,218 | 411,950,164 | 1,996,212,922
Hungarian | 1,901,342 | 30,835,267 | 1,901,342 | 30,835,267 | 622,224,794 | 2,590,060,050
Italian | 12,162,239 | 260,361,435 | 12,162,239 | 260,361,435 | 3,333,886,336 | 14,519,224,940
Lithuanian | 844,643 | 15,087,805 | 844,643 | 15,087,805 | 294,568,032 | 1,198,118,449
Latvian | 553,060 | 10,996,032 | 553,060 | 10,996,032 | 262,685,954 | 1,371,257,575
Maltese | 195,510 | 4,100,912 | 195,510 | 4,100,912 | 17,602,902 | 164,119,571
Dutch | 5,659,268 | 108,197,376 | 5,659,268 | 108,197,376 | 1,760,140,259 | 8,239,317,278
Polish | 3,503,276 | 65,618,419 | 3,503,276 | 65,618,419 | 1,259,312,618 | 5,555,536,170
Portuguese | 8,141,940 | 156,125,200 | 8,141,940 | 156,125,200 | 1,763,439,122 | 8,465,738,356
Romanian | 1,952,043 | 39,882,223 | 1,952,043 | 39,882,223 | 793,759,210 | 4,059,255,214
Slovak | 1,591,831 | 26,711,854 | 1,591,831 | 26,711,854 | 334,903,774 | 1,418,785,612
Slovenian | 660,161 | 14,489,659 | 660,161 | 14,489,659 | 208,466,320 | 967,461,921
Swedish | 3,476,729 | 70,088,534 | 3,476,729 | 70,088,534 | 739,146,200 | 3,217,514,612
Bonus Release (Low-resource languages) - Last updated Apr 2021
Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
Russian (New) | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 | 5,377,911 | 101,312,142
Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543
Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801
Bonus Release - Last updated Dec 2020
Russian (New) | 0 | 0 | 491,941,804 | 492,260,972 | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142
English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744
Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393
Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359
Two formats of BiCleaner files are provided: you can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
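For readers who want to see what a TMX-to-text conversion involves, here is a minimal sketch using only the Python standard library. It is not the TMXT tool and makes no claim about how that tool works. The function name `tmx_to_tsv` and the default language pair are hypothetical. The sketch assumes the usual TMX layout: each `<tu>` holds one `<tuv>` per language, tagged with `xml:lang` and containing a `<seg>` element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(source, out, langs=("en", "fr")):
    """Write one tab-separated line per translation unit (hypothetical helper).

    source: path or binary file object holding a TMX document.
    out:    text file object that receives the tab-separated lines.
    langs:  language codes, in output column order (assumed example pair).
    Returns the number of lines written.
    """
    count = 0
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each unit stays on one line.
                segments[tuv.get(XML_LANG, "").lower()] = " ".join(text.split())
        # Skip units that lack one of the requested languages.
        if all(lang in segments for lang in langs):
            out.write("\t".join(segments[lang] for lang in langs) + "\n")
            count += 1
        elem.clear()  # drop the finished unit's children to limit memory use
    return count
```

Called as `tmx_to_tsv("corpus.tmx", open("corpus.tsv", "w", encoding="utf-8"), langs=("en", "de"))`, with both file names being placeholders, it would write one sentence pair per line. Metadata stored alongside each segment (for example `<prop>` elements) is ignored in this sketch.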