A Japanese version is available through JParaCrawl (3rd Party Releases)
Language | Sentences | Source Words
Bonus Release (Low-resource languages) - Last updated Oct. 2024
English-Azerbaijani | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 336,067,622 | 5,513,087,127
English-Tajik | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 11,394,854 | 244,332,910
English-Armenian | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 31,671,924 | 233,771,297
English-Khmer v1 | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
English-Burmese v1 | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
English-Nepali v1 | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
English-Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
English-Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
English-Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
English-Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543
English-Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801
Bonus Release - Last updated October 2024
English-Hindi | New | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000
English-Indonesian | New | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000
English-Khmer v2 | New | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000
English-Korean v2 | New | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000
English-Lao | New | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000
English-Burmese v2 | New | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000
English-Nepali v2 | New | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000
English-Thai | New | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000
English-Vietnamese | New | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000
Polish-Czech | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699
English-Ukrainian | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894
English-Chinese | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029
English-Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972
English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744 | 0 | 0
Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393
Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359
Language | Crawled Websites | Download | Details
In the proceedings of WMT 2019, release 3 of the corpus was used. For WMT 2018, FILTERED v1.0 of the released corpus was used.
Two formats of BiCleaner files are provided: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
Releases 4 and earlier included unaligned sentences, with one side empty, in the raw file. Release 5 removes these sentences from the raw file, which is why the raw sizes dropped.
FILTERED v1.0 of the corpus is very rough; it has been significantly refined in later releases.
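For readers who want to see what the TMX-to-tab-separated conversion mentioned above involves, here is a minimal Python sketch. It is not the TMXT tool and does not reproduce its behaviour. It assumes standard TMX markup (`tu` units containing `tuv` elements with an `xml:lang` attribute and a `seg` child) and exact, case-insensitive language-code matches. The function name `tmx_to_tsv` and its parameters are hypothetical, chosen for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang is defined in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(tmx_source, tsv_out, src_lang="en", tgt_lang="az"):
    """Stream <tu> units from a TMX file and write source<TAB>target lines.

    tmx_source: a file path or a binary file object.
    tsv_out: a text file object opened for writing.
    Returns the number of sentence pairs written.
    """
    written = 0
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one line.
                segments[lang] = " ".join(text.split())
        src = segments.get(src_lang)
        tgt = segments.get(tgt_lang)
        if src and tgt:  # skip units where either side is missing or empty
            tsv_out.write(f"{src}\t{tgt}\n")
            written += 1
        elem.clear()  # free the processed unit to keep memory use low
    return written


if __name__ == "__main__":
    sample = (
        b'<tmx version="1.4"><body>'
        b'<tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>'
        b'<tuv xml:lang="az"><seg>Salam dunya</seg></tuv></tu>'
        b"</body></tmx>"
    )
    out = io.StringIO()
    tmx_to_tsv(io.BytesIO(sample), out)
    print(out.getvalue(), end="")
```

Streaming with `iterparse` avoids loading a whole release into memory, which matters at the corpus sizes listed above. Files that use regional codes such as `en-US` would need a looser language match than this sketch performs.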
Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages
Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.