A newer version is available
See Latest Releases
Language Sentences Source Words
Russian 12,061,155 157,061,045 1,078,819,759
Bonus Release (Low-resource languages) - Last Updated Sep 2021
Khmer 65,113 1,511,950 21,560,446 21,565,078 65,113 1,511,950
Burmese 31,374 661,577 40,590,354 40,595,755 31,374 661,577
Nepali 92,084 2,941,031 36,454,553 36,466,101 92,084 2,941,031
Pashto 26,321 692,651 2,587,950 2,593,163 26,321 692,651
Singhalese 217,407 5,791,982 38,720,907 38,724,422 217,407 5,791,982
Somali 14,879 506,201 28,387,922 28,396,227 14,879 506,201
Swahili 132,517 3,696,543 84,605,506 84,605,506 132,517 3,696,543
Tagalog 248,684 6,327,801 108,260,601 108,260,601 248,684 6,327,801
Bonus Release - Last Updated Sep 2021
Polish-Czech (New) 24,001,403 288,826,678 24,001,403 288,826,678 6,055,618,075 28,559,061,699
Ukrainian (New) 13,354,365 505,831,880 13,354,365 505,831,880 235,700,383 5,832,658,894
Chinese (New) 14,170,585 217,604,664 14,170,585 217,604,664 1,207,487,761 8,953,713,029
Russian 5,377,911 101,312,142 5,377,911 101,312,142 491,941,804 492,260,972
English-Korean 4,002,441 61,963,744 4,002,441 61,963,744
Dutch-French 2,687,331 60,504,313 2,687,331 60,504,313 38,164,560 770,141,393
Polish-German 916,522 18,883,576 916,522 18,883,576 11,060,105 202,765,359
Language Crawled Websites Download Details
Release 3 of the corpus was used in the proceedings of WMT 2019. For WMT 2018, FILTERED v1.0 of the released corpus was used.
BiCleaner files are provided in two formats: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
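For readers who only want to see what the TMX-to-text conversion involves, the following is a minimal Python sketch. It is not the TMXT tool and does not reproduce its options. It assumes a standard TMX layout (`<tu>` units containing `<tuv xml:lang="...">` elements with a `<seg>` child), and the function name `tmx_to_tsv`, the language codes, and the in-memory sample are illustrative assumptions only.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(tmx_source, out, langs=("en", "fr")):
    """Write one tab-separated line per <tu> that has a segment for every
    language in `langs`. Returns the number of lines written.

    `tmx_source` is a file name or a binary file object; `out` is a text
    file object. Hypothetical helper, not part of any released tool.
    """
    written = 0
    # iterparse streams the file, so the whole TMX is never parsed in one go.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one line.
                segs[tuv.get(XML_LANG, "").lower()] = " ".join(text.split())
        if all(lang in segs for lang in langs):
            out.write("\t".join(segs[lang] for lang in langs) + "\n")
            written += 1
        elem.clear()  # drop the processed unit's children to limit memory use
    return written


if __name__ == "__main__":
    # Tiny made-up sample, not taken from the corpus.
    sample = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<tmx version="1.4"><header/><body>'
        b'<tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>'
        b'<tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv></tu>'
        b"</body></tmx>"
    )
    buffer = io.StringIO()
    tmx_to_tsv(io.BytesIO(sample), buffer)
    print(buffer.getvalue(), end="")
```

For the multi-gigabyte files listed below, the dedicated TMXT tool is likely the better choice; the sketch only illustrates the shape of the transformation.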
Bulgarian 4,762
  File Size Sentence Pairs English Words
RAW v5.0 7.4GB 248,555,951 1,564,051,100
BiCleaner v5.0 692MB 2,586,277 55,725,444
Croatian 8,889
  File Size Sentence Pairs English Words
RAW v5.0 8.33GB 273,330,006 1,738,164,401
BiCleaner v5.0 477MB 1,861,590 43,464,197
Czech 14,335
  File Size Sentence Pairs English Words
RAW v5.0 20GB 665,535,115 4,025,512,842
BiCleaner v5.0 1.2GB 5,280,149 117,385,158
Danish 19,776
  File Size Sentence Pairs English Words
RAW v5.0 17.8GB 447,743,455 3,347,135,236
BiCleaner v5.0 1066MB 4,606,183 106,565,546
Dutch 17,887
  File Size Sentence Pairs English Words
RAW v5.0 32GB 1,101,087,006 6,792,400,704
BiCleaner v5.0 2.8GB 10,596,717 233,087,345
Estonian 9,522
  File Size Sentence Pairs English Words
RAW v5.0 4.5GB 168,091,382 915,074,587
BiCleaner v5.0 338MB 1,387,869 30,858,140
Finnish 11,028
  File Size Sentence Pairs English Words
RAW v5.0 13.5GB 460,181,215 2,731,068,033
BiCleaner v5.0 704MB 3,097,223 66,385,933
French 48,498
  File Size Sentence Pairs English Words
RAW v5.0 128GB 4,273,819,421 24,983,683,983
BiCleaner v5.0 13.9GB 51,316,168 1,178,317,233
German 67,977
  File Size Sentence Pairs English Words
RAW v5.0 142.8GB 5,038,103,659 27,994,213,177
BiCleaner v5.0 11.07GB 36,936,714 929,818,868
Greek 11,343
  File Size Sentence Pairs English Words
RAW v5.0 16.2GB 640,502,801 3,768,712,672
BiCleaner v5.0 1122MB 3,830,643 88,669,279
Hungarian 9,522
  File Size Sentence Pairs English Words
RAW v5.0 14.47GB 461,181,772 3,208,285,083
BiCleaner v5.0 1069MB 4,187,051 104,292,635
Irish 1,283
  File Size Sentence Pairs English Words
RAW v5.0 4.66GB 64,628,733 667,211,260
BiCleaner v5.0 190MB 782,769 21,909,039
Italian 31,518
  File Size Sentence Pairs English Words
RAW v5.0 66GB 2,251,771,798 13,150,606,108
BiCleaner v5.0 6.02GB 22,100,078 533,512,632
Latvian 3,557
  File Size Sentence Pairs English Words
RAW v5.0 5GB 176,113,669 1,069,218,155
BiCleaner v5.0 286MB 1,019,003 23,656,140
Lithuanian 4,678
  File Size Sentence Pairs English Words
RAW v5.0 4.92GB 198,101,611 963,384,230
BiCleaner v5.0 375MB 1,270,933 27,214,054
Maltese 672
  File Size Sentence Pairs English Words
RAW v5.0 173MB 3,693,930 38,492,028
BiCleaner v5.0 38MB 177,244 4,252,814
Polish 13,357
  File Size Sentence Pairs English Words
RAW v5.0 22.6GB 723,052,912 4,123,972,411
BiCleaner v5.0 1.6GB 6,382,371 145,802,939
Portuguese 18,887
  File Size Sentence Pairs English Words
RAW v5.0 34.8GB 1,068,161,866 6,537,298,891
BiCleaner v5.0 3.3GB 13,860,663 299,634,135
Romanian 9,335
  File Size Sentence Pairs English Words
RAW v5.0 15.2GB 510,209,923 3,034,045,929
BiCleaner v5.0 728MB 2,870,687 62,189,306
Slovak 7,980
  File Size Sentence Pairs English Words
RAW v5.0 6.05GB 269,067,288 1,416,750,646
BiCleaner v5.0 568MB 2,365,339 45,636,383
Slovenian 5,016
  File Size Sentence Pairs English Words
RAW v5.0 4.07GB 175,682,959 1,003,867,134
BiCleaner v5.0 406MB 1,406,645 31,855,427
Spanish 36,211
  File Size Sentence Pairs English Words
RAW v5.0 80.4GB 2,674,900,280 16,598,620,402
BiCleaner v5.0 9.6GB 38,971,348 897,891,704
Swedish 13,616
  File Size Sentence Pairs English Words
RAW v5.0 16.54GB 620,338,561 3,496,650,816
BiCleaner v5.0 1542MB 6,079,175 138,264,978
Bonus Release
Dutch-French 7,700
  File Size Sentence Pairs Dutch Words French Words
RAW 1.8GB 38,164,560 770,141,393 817,973,481
BiCleaner 752MB 2,687,331 60,504,313 64,650,034
Polish-German 5,549
  File Size Sentence Pairs Polish Words German Words
RAW 479MB 11,060,105 202,765,359 198,442,547
BiCleaner 216MB 916,522 18,883,576 20,271,637
Extra Languages in release v1.0
Russian 14,035
  File Size Sentence Pairs English Words
RAW v1.0 38GB 1,078,819,759 -
Filtered v1.0 637MB 12,061,155 157,061,045
  • Releases 4 and earlier included unaligned sentences in the raw file with one side empty. Release 5 removes these sentences from the raw file, explaining why the raw sizes dropped.
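
As a rough illustration of how the RAW and BiCleaner rows in the tables above relate, the Python sketch below computes the share of RAW sentence pairs that remain in the BiCleaner file. The sentence-pair counts are copied from the tables; the `retention` helper and the choice of three example languages are assumptions made for illustration.

```python
def retention(raw_pairs: int, clean_pairs: int) -> float:
    """Fraction of RAW sentence pairs that remain in the BiCleaner file."""
    if raw_pairs <= 0:
        raise ValueError("raw_pairs must be positive")
    return clean_pairs / raw_pairs


# (RAW sentence pairs, BiCleaner sentence pairs), copied from the tables above.
rows = {
    "Bulgarian": (248_555_951, 2_586_277),
    "Maltese": (3_693_930, 177_244),
    "Dutch-French": (38_164_560, 2_687_331),
}

for language, (raw, clean) in rows.items():
    print(f"{language}: {retention(raw, clean):.2%} of RAW pairs kept")
```

Because release 5 already drops unaligned sentences from the raw files, ratios computed this way are not directly comparable with ratios computed from release 4 or earlier.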