ParaCrawl Corpus release v9

This corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.

Release 9 is the final release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 9 brings new content and higher quality as the result of an improved pipeline with:

  • better PDF processing
  • language identification based on CLD2 full instead of lite
  • improved machine translation models (almost all neural) used to parallelize sentences
  • neural cleaning applied for the first time

With this version, we reach the best MT results ever obtained with ParaCrawl.

As a bonus, we release an English-Chinese corpus and monolingual data (coming soon!).

Data formats: five variations of each corpus are provided:

  1. Bicleaner TXT format
  2. Bicleaner TMX format
  3. RAW corpus
  4. ROAM (anonymised) format
  5. Deferred crawling format, containing pointers to the URLs so that you can recrawl the corpora on your end

Click on the icon for each language to show the download links. To convert a TMX file to a tab-separated text file, download the TMXT tool. If you use the deferred crawled corpora, our reconstruction tool may also prove useful.

WMT warning: ParaCrawl has been a regular contributor to WMT since 2018. Data sets from various ParaCrawl releases have been used in different shared tasks over the years. To make sure you get the exact version of the data sets needed for these shared tasks, please download them directly from the links provided on the WMT website.
Each cell gives sentences / source words.

Language | Bicleaner TXT | Bicleaner TMX | RAW | ROAM | Deferred
Bulgarian | 13,264,297 / 226,485,024 | 13,264,297 / 226,485,024 | 379,713,465 / 9,061,247,935 | 11,660,661 / 189,446,058 | 13,264,297 / 226,485,024
Czech | 50,632,492 / 692,110,883 | 50,632,492 / 692,110,883 | 2,996,891,961 / 56,087,810,356 | 41,000,528 / 544,920,121 | 50,632,492 / 692,110,883
Danish | 34,207,155 / 555,049,001 | 34,207,155 / 555,049,001 | 2,381,010,643 / 47,495,319,396 | 28,615,957 / 445,634,283 | 34,207,155 / 555,049,001
German | 278,310,907 / 4,269,352,871 | 278,310,907 / 4,269,352,871 | 9,662,113,792 / 210,295,493,345 | 233,362,910 / 3,450,715,114 | 278,310,907 / 4,269,352,871
Greek | 21,402,042 / 340,479,785 | 21,402,042 / 340,479,785 | 1,405,078,151 / 27,686,172,080 | 18,582,458 / 279,761,255 | 21,402,042 / 340,479,785
Spanish | 269,394,967 / 4,374,060,920 | 269,394,967 / 4,374,060,920 | 9,419,403,869 / 206,498,167,547 | 223,941,877 / 3,496,386,983 | 269,394,967 / 4,374,060,920
Estonian | 8,539,879 / 136,598,517 | 8,539,879 / 136,598,517 | 390,991,428 / 7,743,920,994 | 7,290,786 / 114,333,882 | 8,539,879 / 136,598,517
Finnish | 31,315,287 / 454,355,512 | 31,315,287 / 454,355,512 | 1,792,067,249 / 34,145,205,926 | 25,393,902 / 355,039,178 | 31,315,287 / 454,355,512
French | 216,646,826 / 3,761,995,609 | 216,646,826 / 3,761,995,609 | 5,495,920,681 / 134,218,695,949 | 177,473,147 / 2,988,027,816 | 216,646,826 / 3,761,995,609
Irish | 3,245,618 / 56,703,820 | 3,245,618 / 56,703,820 | 615,267,926 / 13,691,594,812 | 2,584,201 / 43,513,633 | 3,245,618 / 56,703,820
Croatian | 3,240,420 / 79,062,603 | 3,240,420 / 79,062,603 | 563,570,419 / 12,185,582,349 | 2,730,607 / 64,058,854 | 3,240,420 / 79,062,603
Hungarian | 36,432,544 / 509,256,374 | 36,432,544 / 509,256,374 | 2,245,577,938 / 41,486,191,342 | 29,128,012 / 398,339,155 | 36,432,544 / 509,256,374
Icelandic | 2,967,519 / 45,093,876 | 2,967,519 / 45,093,876 | 65,379,727 / 1,500,672,709 | 2,446,111 / 35,527,952 | 2,967,519 / 45,093,876
Italian | 96,975,991 / 1,682,841,134 | 96,975,991 / 1,682,841,134 | 2,881,161,252 / 66,782,184,952 | 78,373,609 / 1,303,115,197 | 96,975,991 / 1,682,841,134
Lithuanian | 13,191,973 / 185,525,058 | 13,191,973 / 185,525,058 | 706,153,214 / 14,023,789,635 | 11,024,987 / 150,966,844 | 13,191,973 / 185,525,058
Latvian | 13,063,804 / 197,361,114 | 13,063,804 / 197,361,114 | 621,121,254 / 12,392,015,629 | 10,958,884 / 161,335,336 | 13,063,804 / 197,361,114
Maltese | 1,207,760 / 24,989,869 | 1,207,760 / 24,989,869 | 52,687,808 / 1,132,276,882 | 1,049,566 / 21,163,122 | 1,207,760 / 24,989,869
Norwegian (Bokmål) | 19,281,147 / 325,690,737 | 19,281,147 / 325,690,737 | 1,186,857,903 / 25,059,150,285 | 16,296,119 / 265,306,747 | 19,281,147 / 325,690,737
Dutch | 89,135,870 / 1,306,977,298 | 89,135,870 / 1,306,977,298 | 3,792,110,568 / 80,824,583,076 | 75,744,287 / 1,080,767,873 | 89,135,870 / 1,306,977,298
Norwegian (Nynorsk) | 294,470 / 5,965,872 | 294,470 / 5,965,872 | 26,975,121 / 552,467,707 | 213,538 / 3,956,194 | 294,470 / 5,965,872
Polish | 40,082,037 / 599,286,435 | 40,082,037 / 599,286,435 | 1,494,638,802 / 31,382,315,844 | 32,354,336 / 467,434,142 | 40,082,037 / 599,286,435
Portuguese | 84,925,921 / 1,337,335,551 | 84,925,921 / 1,337,335,551 | 2,489,237,215 / 56,204,193,540 | 69,999,741 / 1,065,769,612 | 84,925,921 / 1,337,335,551
Romanian | 25,048,461 / 403,390,783 | 25,048,461 / 403,390,783 | 1,496,164,926 / 29,847,980,518 | 20,989,631 / 325,653,445 | 25,048,461 / 403,390,783
Slovak | 22,901,690 / 302,051,134 | 22,901,690 / 302,051,134 | 1,572,443,942 / 29,606,225,083 | 18,144,092 / 236,437,555 | 22,901,690 / 302,051,134
Slovenian | 9,516,068 / 151,672,416 | 9,516,068 / 151,672,416 | 458,836,589 / 9,800,697,623 | 8,196,238 / 127,519,352 | 9,516,068 / 151,672,416
Swedish | 49,109,339 / 722,407,192 | 49,109,339 / 722,407,192 | 2,026,885,221 / 41,303,256,881 | 41,573,601 / 594,612,823 | 49,109,339 / 722,407,192
Spanish-Catalan | 17,238,953 / 389,803,201 | 17,238,953 / 389,803,201 | 298,874,794 / 7,071,221,480 | 17,102,682 / 385,834,468 | 17,238,953 / 389,803,201
Spanish-Basque | 3,344,372 / 64,667,201 | 3,344,372 / 64,667,201 | 36,775,535 / 835,871,547 | 3,295,251 / 63,677,141 | 3,344,372 / 64,667,201
Spanish-Galician | 1,879,651 / 44,626,184 | 1,879,651 / 44,626,184 | 96,193,555 / 2,222,137,977 | 1,865,314 / 44,204,985 | 1,879,651 / 44,626,184
Bonus Release (low-resource languages) - Last updated Sep 2021
Khmer: 65,113 / 1,511,950; 21,560,446 / 21,565,078; 65,113 / 1,511,950
Burmese: 31,374 / 661,577; 40,590,354 / 40,595,755; 31,374 / 661,577
Nepali: 92,084 / 2,941,031; 36,454,553 / 36,466,101; 92,084 / 2,941,031
Pashto: 26,321 / 692,651; 2,587,950 / 2,593,163; 26,321 / 692,651
Singhalese: 217,407 / 5,791,982; 38,720,907 / 38,724,422; 217,407 / 5,791,982
Somali: 14,879 / 506,201; 28,387,922 / 28,396,227; 14,879 / 506,201
Swahili: 132,517 / 3,696,543; 84,605,506 / 84,605,506; 132,517 / 3,696,543
Tagalog: 248,684 / 6,327,801; 108,260,601 / 108,260,601; 248,684 / 6,327,801
Bonus Release - Last updated Sep 2021
Chinese: 14,170,585 / 217,604,664; 14,170,585 / 217,604,664; 1,207,487,761 / 8,953,713,029
Russian: 5,377,911 / 101,312,142; 5,377,911 / 101,312,142; 491,941,804 / 492,260,972
English-Korean: 4,002,441 / 61,963,744; 4,002,441 / 61,963,744
Dutch-French: 2,687,331 / 60,504,313; 2,687,331 / 60,504,313; 38,164,560 / 770,141,393
Polish-German: 916,522 / 18,883,576; 916,522 / 18,883,576; 11,060,105 / 202,765,359