Broader Web-Scale Provision of Parallel Corpora for European Languages

ParaCrawl OpenSource Pipeline (Bitextor)

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

Bitextor 7.2.1
Even young Luke Skywalker had to face some pythons on his way to Yoda! This release brings with it the Force, thanks to all the Jedis who helped.

v7.2.1 Changelog

  • Updated submodules
  • Fixed and updated requirements from bicleaner and bifixer
  • Fixed bifixer silent error output
  • Fixed bifixer output when using flag --ignore_duplicates
  • Fixed possible tabs in URLs from malformed WARCs
  • Improved documentation
  • Note: the bitextor-v7.2.1.zip package includes the submodule code. If you build the project from a clone of the repository, first run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.2.1.zip package or cloning the repository at the v7.2.1 tag.

 

v7.2 Changelog

  • Now you can set a list of WARC files in addition to the URLs as input for Bitextor (thanks @zuny26!)
    • For example, in config file for Snakemake:
      WARCFiles: ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]
  • Switched to the WARC standard .gz compressed format (records individually compressed).
  • warc3-wet completely replaced with warcio (thanks @zuny26!)
  • The xzlang Snakefile parameter allows grouping preprocessing output by language (creating a separate file for each language found, not just LANG1 and LANG2) (thanks @zuny26!)
    • This avoids repeating the preprocessing step when processing the same domain for different language pairs.
  • Added support for giawarc WARC preprocessor (thanks @wwaites!)
    • Activate it using giawarc: true in Snakemake config parameters
    • Installation instructions in README.md
  • Added support in bitextor-warc2preprocess for the HTML/XML Python parser selectolax
    • Select which parser with parser in Snakemake options. Options are 'alcazar', 'bs4' (default) and 'modest'.
      NOTE: this option has no effect when giawarc: true or xzlang: true is set
  • pdf-extract is now installed from, and used via, its PyPI package (thanks @dionwiggins!)
  • Fixed sentence splitter in MT-based document alignment for the target language (thanks @kirefu!)
  • bleualign-cpp implementation is now an external dependency
  • Replaced MD5 with MurmurHash3 in Creepy crawler, WARC preprocessors and deferred module
  • Some PEP8 code compliance changes and code cleaning using latest Snakecharm
  • Updated submodules
    • restorative-cleaning is deprecated. Now bifixer is replacing it! (thanks @mbanon!)
  • Updated documentation.
    • Added Docker installation instructions (thanks @amirkamran!).
    • Added installation instructions for giawarc in Dependencies.
    • Updated kenlm instructions to install it from upstream (thanks @kpu!).
    • Updated diagram with changes.
    • Created Wiki with small tutorials, like running Bitextor with new languages or running Bitextor in Windows 10
  • General system stability improvements to enhance the user's experience.
  • Note: the bitextor-v7.2.zip package includes the submodule code. If you build the project from a clone of the repository, first run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.2.zip package or cloning the repository at the v7.2 tag.
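
The Snakemake options named in the v7.2 changelog (WARCFiles, parser, giawarc, xzlang) can be collected in a small configuration fragment. The following is a minimal sketch only: the helper name build_config_fragment is hypothetical, the fragment covers just the options mentioned above, and a complete Bitextor configuration needs further settings that are not shown here.

```python
import json

# Parser names listed in the v7.2 changelog; 'bs4' is documented as the default.
ALLOWED_PARSERS = {"alcazar", "bs4", "modest"}


def build_config_fragment(warc_files, parser="bs4", giawarc=False, xzlang=False):
    """Return a dict holding only the options named in the changelog (hypothetical helper)."""
    if parser not in ALLOWED_PARSERS:
        raise ValueError(
            f"unknown parser {parser!r}; expected one of {sorted(ALLOWED_PARSERS)}"
        )
    return {
        "WARCFiles": list(warc_files),
        "parser": parser,
        "giawarc": giawarc,
        "xzlang": xzlang,
    }


if __name__ == "__main__":
    fragment = build_config_fragment(
        ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]
    )
    # Print the fragment as JSON so it can be merged into a larger config file.
    print(json.dumps(fragment, indent=2))
```

Rejecting unknown parser names early is a design choice of this sketch, not documented Bitextor behaviour.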

ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". It adds newly crawled data, including data from the Internet Archive. Enhancements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size of release v4 for all official EU languages (23 languages paired with English).

Each entry below gives the language, the number of crawled websites, and the download details for each release.
The WMT 2019 proceedings use release 3 of the corpus. For WMT 2018, the FILTERED v1.0 release of the corpus was used.
BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format.
To convert a TMX file into a tab-separated text file, download the TMXT tool.
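
As a rough illustration of what a TMX to tab-separated conversion involves, here is a minimal sketch using only the Python standard library. It is not the TMXT tool: the function names, the sample document and the language codes "en" and "bg" are assumptions made for this example, and it only relies on the standard TMX layout (tu, tuv with xml:lang, seg).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(tmx_text, src_lang, trg_lang):
    """Yield (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        if src_lang in segments and trg_lang in segments:
            yield segments[src_lang], segments[trg_lang]


def rows_to_tsv(rows):
    """Join pairs into tab-separated lines, collapsing tabs and newlines inside segments."""
    def clean(text):
        return " ".join(text.split())

    return "\n".join(f"{clean(src)}\t{clean(trg)}" for src, trg in rows)


# Tiny hand-written sample; real ParaCrawl TMX files are far larger.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="bg"><seg>Добро утро.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    print(rows_to_tsv(tmx_to_rows(SAMPLE_TMX, "en", "bg")))
```

For multi-gigabyte files, a streaming parser such as ElementTree's iterparse would be a better fit than loading the whole document into memory.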
Bulgarian (4,762 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 7.4GB | 248,555,951 | 1,564,051,100
BiCleaner v5.0 | 692MB | 2,586,277 | 55,725,444
RAW v4.0 | 10.2GB | 288,395,110 | 1,552,588,179
BiCleaner v4.0 | 284MB | 1,039,885 | 21,109,546
RAW v3.0 | 10.2GB | 288,395,110 | 1,552,588,179
BiCleaner v3.0 | 371MB | 1,704,762 | 28,243,306
Zipporah v3.0 | 217MB | 821,464 | 17,578,839
Croatian (8,889 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 8.33GB | 273,330,006 | 1,738,164,401
BiCleaner v5.0 | 477MB | 1,861,590 | 43,464,197
RAW v4.0 | 12.4GB | 411,950,164 | 1,996,212,922
BiCleaner v4.0 | 246MB | 1,002,053 | 19,904,218
RAW v3.0 | 12.4GB | 411,950,164 | 1,996,212,922
BiCleaner v3.0 | 265MB | 1,568,947 | 23,531,438
Zipporah v3.0 | 260MB | 1,004,131 | 18,004,931
BiCleaner v2.0 | 345MB | 1,455,841 | 21,387,649
Zipporah v2.0 | 295MB | 933,190 | 16,655,718
Czech (14,335 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 20GB | 665,535,115 | 4,025,512,842
BiCleaner v5.0 | 1.2GB | 5,280,149 | 117,385,158
RAW v4.0 | 33.9GB | 1,189,317,247 | 5,621,562,488
BiCleaner v4.0 | 697MB | 2,981,949 | 48,918,151
RAW v3.0 | 33.9GB | 1,189,317,247 | 5,621,562,488
BiCleaner v3.0 | 913MB | 5,862,521 | 75,316,848
Zipporah v3.0 | 4.1GB | 17,058,282 | 139,211,417
BiCleaner v2.0 | 1.2GB | 5,488,589 | 69,182,264
Zipporah v2.0 | 4.5GB | 15,846,424 | 123,222,290
RAW v1.0 | 21.5GB | 818,784,053 | -
Filtered v1.0 | 285MB | 10,020,250 | 78,743,955
BiCleaner v1.2 | 237MB | 2,367,609 | 38,913,821
Zipporah v1.2 | 529MB | 9,982,508 | 78,943,174
Danish (19,776 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 17.8GB | 447,743,455 | 3,347,135,236
BiCleaner v5.0 | 1066MB | 4,606,183 | 106,565,546
RAW v4.0 | 24.8GB | 586,535,848 | 3,484,768,564
BiCleaner v4.0 | 553MB | 2,414,895 | 48,240,290
RAW v3.0 | 24.8GB | 586,535,848 | 3,484,768,564
BiCleaner v3.0 | 737MB | 4,891,462 | 67,200,201
Zipporah v3.0 | 565MB | 1,194,589 | 22,476,008
Dutch (17,887 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 32GB | 1,101,087,006 | 6,792,400,704
BiCleaner v5.0 | 2.8GB | 10,596,717 | 233,087,345
RAW v4.0 | 47.8GB | 1,760,140,259 | 8,239,317,278
BiCleaner v4.0 | 1.7GB | 5,659,268 | 108,197,376
RAW v3.0 | 47.8GB | 1,760,140,259 | 8,239,317,278
BiCleaner v3.0 | 2.2GB | 10,408,489 | 143,294,712
Zipporah v3.0 | 1.1GB | 3,291,804 | 56,744,571
BiCleaner v2.0 | 2.7GB | 9,342,505 | 127,895,866
Zipporah v2.0 | 1.2GB | 2,922,152 | 50,402,591
RAW v1.0 | 37.5GB | 1,506,033,538 | -
Filtered v1.0 | 168MB | 2,560,472 | 45,149,412
BiCleaner v1.2 | 581MB | 6,185,906 | 100,284,153
Zipporah v1.2 | 276MB | 2,556,523 | 45,169,303
Estonian (9,522 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 4.5GB | 168,091,382 | 915,074,587
BiCleaner v5.0 | 338MB | 1,387,869 | 30,858,140
RAW v4.0 | 8.4GB | 342,677,535 | 1,522,504,098
BiCleaner v4.0 | 202MB | 853,422 | 16,537,397
RAW v3.0 | 8.4GB | 342,677,535 | 1,522,504,098
BiCleaner v3.0 | 198MB | 1,064,078 | 17,725,513
Zipporah v3.0 | 214MB | 1,163,994 | 17,105,752
BiCleaner v2.0 | 245MB | 960,276 | 15,633,491
Zipporah v2.0 | 271MB | 1,122,289 | 12,820,311
RAW v1.0 | 4.4GB | 191,183,197 | -
Filtered v1.0 | 74.7MB | 1,298,103 | 13,134,231
Finnish (11,028 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 13.5GB | 460,181,215 | 2,731,068,033
BiCleaner v5.0 | 704MB | 3,097,223 | 66,385,933
RAW v4.0 | 20.6GB | 736,050,617 | 3,494,554,815
BiCleaner v4.0 | 469MB | 2,156,069 | 41,564,859
RAW v3.0 | 20.6GB | 736,050,617 | 3,494,554,815
BiCleaner v3.0 | 693MB | 3,944,929 | 54,984,783
Zipporah v3.0 | 432MB | 966,145 | 14,175,421
BiCleaner v2.0 | 985MB | 3,632,447 | 49,751,376
Zipporah v2.0 | 459MB | 831,170 | 12,692,508
RAW v1.0 | 12.7GB | 504,805,915 | -
Filtered v1.0 | 34.5MB | 544,335 | 8,420,501
BiCleaner v1.2 | 177MB | 1,982,774 | 29,979,317
Zipporah v1.2 | 59.8MB | 621,728 | 9,481,646
French (48,498 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 128GB | 4,273,819,421 | 24,983,683,983
BiCleaner v5.0 | 13.9GB | 51,316,168 | 1,178,317,233
RAW v4.0 | 183GB | 6,429,921,903 | 28,529,875,306
BiCleaner v4.0 | 9.9GB | 31,374,161 | 664,924,148
RAW v3.0 | 183GB | 6,429,921,903 | 28,529,875,306
Zipporah v3.0 | 15GB | 39,615,885 | 791,250,385
BiCleaner v2.0 | 11.4GB | 37,823,646 | 600,029,874
Zipporah v2.0 | 16.5GB | 37,743,429 | 754,045,036
RAW v1.0 | 111GB | 4,235,725,445 | -
Filtered v1.0 | 2.1GB | 27,622,881 | 546,401,428
BiCleaner v1.2 | 2.6GB | 25,380,067 | 428,397,408
Zipporah v1.2 | 3.9GB | 33,108,141 | 648,244,663
German (67,977 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 142.8GB | 5,038,103,659 | 27,994,213,177
BiCleaner v5.0 | 11.07GB | 36,936,714 | 929,818,868
RAW v4.0 | 211GB | 7,387,809,953 | 32,358,035,774
BiCleaner v4.0 | 5.4GB | 16,264,450 | 307,786,150
RAW v3.0 | 211GB | 7,387,809,953 | 32,358,035,774
BiCleaner v3.0 | 8.5GB | 31,358,551 | 502,903,379
Zipporah v3.0 | 28.7GB | 61,349,218 | 809,954,481
BiCleaner v2.0 | 9.8GB | 27,702,949 | 456,442,715
Zipporah v2.0 | 26.3GB | 55,849,341 | 740,849,699
RAW v1.0 | 121GB | 4,591,582,415 | -
Filtered v1.0 | 1.8GB | 36,351,593 | 476,398,001
BiCleaner v1.2 | 1.8GB | 17,378,982 | 302,274,816
Zipporah v1.2 | 3.3GB | 40,546,537 | 522,204,110
Greek (11,343 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 16.2GB | 640,502,801 | 3,768,712,672
BiCleaner v5.0 | 1122MB | 3,830,643 | 88,669,279
RAW v4.0 | 24.7GB | 740,094,469 | 3,340,324,438
BiCleaner v4.0 | 572MB | 1,985,233 | 38,322,532
RAW v3.0 | 24.7GB | 740,094,469 | 3,340,324,438
BiCleaner v3.0 | 878MB | 4,533,175 | 57,752,932
Zipporah v3.0 | 454MB | 992,220 | 18,958,591
Hungarian (9,522 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 14.47GB | 461,181,772 | 3,208,285,083
BiCleaner v5.0 | 1069MB | 4,187,051 | 104,292,635
RAW v4.0 | 16.5GB | 622,224,794 | 2,590,060,050
BiCleaner v4.0 | 482MB | 1,901,342 | 30,835,267
RAW v3.0 | 16.5GB | 622,224,794 | 2,590,060,050
BiCleaner v3.0 | 456MB | 3,160,496 | 32,151,740
Zipporah v3.0 | 360MB | 1,023,875 | 17,235,595
BiCleaner v2.0 | 669MB | 308,248 | 631,764,228
Zipporah v2.0 | 338MB | 902,412 | 15,054,278
Irish (1,283 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 4.66GB | 64,628,733 | 667,211,260
BiCleaner v5.0 | 190MB | 782,769 | 21,909,039
RAW v4.0 | 6.8GB | 156,189,807 | 1,194,451,883
BiCleaner v4.0 | 75.9MB | 357,399 | 8,241,515
RAW v3.0 | 6.8GB | 156,189,807 | 1,028,019,178
BiCleaner v3.0 | 117MB | 607,734 | 15,473,067
Zipporah v3.0 | 154MB | 744,375 | 14,525,892
BiCleaner v2.0 | 150MB | 573,451 | 14,813,115
Zipporah v2.0 | 191MB | 732,783 | 14,269,940
Italian (31,518 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 66GB | 2,251,771,798 | 13,150,606,108
BiCleaner v5.0 | 6.02GB | 22,100,078 | 533,512,632
RAW v4.0 | 91.3GB | 3,333,886,336 | 14,519,224,940
BiCleaner v4.0 | 3.6GB | 12,162,239 | 260,361,435
RAW v3.0 | 91.3GB | 3,333,886,336 | 14,519,224,940
BiCleaner v3.0 | 4.3GB | 14,439,190 | 308,244,744
Zipporah v3.0 | 6.9GB | 1,368,691 | 269,587,549
BiCleaner v2.0 | 5.0GB | 17,224,855 | 264,324,830
Zipporah v2.0 | 6.7GB | 12,252,492 | 231,025,420
RAW v1.0 | 45.1GB | 1,727,688,019 | -
Filtered v1.0 | 593MB | 8,318,493 | 155,973,063
BiCleaner v1.2 | 963MB | 11,790,134 | 147,402,459
Zipporah v1.2 | 1.2GB | 12,065,631 | 212,026,083
Latvian (3,557 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 5GB | 176,113,669 | 1,069,218,155
BiCleaner v5.0 | 286MB | 1,019,003 | 23,656,140
RAW v4.0 | 7.7GB | 262,685,954 | 1,371,257,575
BiCleaner v4.0 | 158MB | 553,060 | 10,996,032
RAW v3.0 | 7.7GB | 262,685,954 | 1,371,257,575
BiCleaner v3.0 | 218MB | 1,009,860 | 15,058,052
Zipporah v3.0 | 125MB | 434,479 | 7,742,539
BiCleaner v2.0 | 270MB | 1,133,362 | 16,744,306
Zipporah v2.0 | 137MB | 386,447 | 6,066,997
RAW v1.0 | 4.9GB | 173,585,643 | -
Filtered v1.0 | 16MB | 242,227 | 4,250,040
BiCleaner v1.2 | 43.3MB | 406,742 | 6,995,228
Zipporah v1.2 | 26.9MB | 241,546 | 4,247,908
Lithuanian (4,678 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 4.92GB | 198,101,611 | 963,384,230
BiCleaner v5.0 | 375MB | 1,270,933 | 27,214,054
RAW v4.0 | 7.8GB | 294,568,032 | 1,198,118,449
BiCleaner v4.0 | 231MB | 844,643 | 15,087,805
RAW v3.0 | 7.8GB | 294,568,032 | 1,198,118,449
BiCleaner v3.0 | 273MB | 1,368,691 | 19,471,370
Zipporah v3.0 | 128MB | 432,724 | 6,727,629
BiCleaner v2.0 | 330MB | 1,133,362 | 16,744,306
Zipporah v2.0 | 144MB | 386,447 | 6,066,997
Maltese (672 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 173MB | 3,693,930 | 38,492,028
BiCleaner v5.0 | 38MB | 177,244 | 4,252,814
RAW v4.0 | 723MB | 17,602,902 | 164,119,571
BiCleaner v4.0 | 39.9MB | 195,510 | 4,100,912
RAW v3.0 | 723MB | 17,602,902 | 164,119,571
BiCleaner v3.0 | 38.1MB | 227,499 | 4,429,648
Zipporah v3.0 | 33.9MB | 154,038 | 2,143,321
BiCleaner v2.0 | 41.6MB | 198,537 | 3,884,509
Zipporah v2.0 | 38.6MB | 137,318 | 1,919,196
Polish (13,357 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 22.6GB | 723,052,912 | 4,123,972,411
BiCleaner v5.0 | 1.6GB | 6,382,371 | 145,802,939
RAW v4.0 | 33.8GB | 1,259,312,618 | 5,555,536,170
BiCleaner v4.0 | 970MB | 3,503,276 | 65,618,419
RAW v3.0 | 33.8GB | 1,259,312,618 | 5,555,536,170
BiCleaner v3.0 | 1.3GB | 6,806,796 | 94,612,131
Zipporah v3.0 | 709MB | 1,748,865 | 29,205,973
BiCleaner v2.0 | 1.6GB | 5,787,436 | 81,662,507
Zipporah v2.0 | 692MB | 1,488,056 | 26,329,826
RAW v1.0 | 24.4GB | 984,884,968 | -
Filtered v1.0 | 85.7MB | 1,275,162 | 22,092,316
BiCleaner v1.2 | 330MB | 3,270,262 | 55,467,253
Zipporah v1.2 | 143MB | 1,269,300 | 22,079,132
Portuguese (18,887 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 34.8GB | 1,068,161,866 | 6,537,298,891
BiCleaner v5.0 | 3.3GB | 13,860,663 | 299,634,135
RAW v4.0 | 50.6GB | 1,763,439,122 | 8,465,738,356
BiCleaner v4.0 | 2GB | 8,141,940 | 156,125,200
RAW v3.0 | 50.6GB | 1,763,439,122 | 8,465,738,356
BiCleaner v3.0 | 2GB | 11,698,633 | 171,495,357
Zipporah v3.0 | 1.1GB | 3,834,613 | 79,794,493
BiCleaner v2.0 | 2.3GB | 9,740,600 | 148,240,776
Zipporah v2.0 | 1.1GB | 3,454,349 | 72,592,387
RAW v1.0 | 34.1GB | 1,357,911,799 | -
Filtered v1.0 | 222MB | 2,809,381 | 57,392,721
BiCleaner v1.2 | 611MB | 6,436,491 | 93,021,518
Zipporah v1.2 | 366MB | 3,056,920 | 60,180,429
Romanian (9,335 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 15.2GB | 510,209,923 | 3,034,045,929
BiCleaner v5.0 | 728MB | 2,870,687 | 62,189,306
RAW v4.0 | 22.7GB | 793,759,210 | 4,059,255,214
BiCleaner v4.0 | 498MB | 1,952,043 | 39,882,223
RAW v3.0 | 22.7GB | 793,759,210 | 4,059,255,214
BiCleaner v3.0 | 592MB | 3,284,810 | 49,494,227
Zipporah v3.0 | 621MB | 2,766,706 | 38,673,891
BiCleaner v2.0 | 713MB | 2,684,189 | 39,958,916
Zipporah v2.0 | 607MB | 2,537,851 | 34,596,458
RAW v1.0 | 17.2GB | 635,709,587 | -
Filtered v1.0 | 105MB | 2,459,752 | 32,800,110
BiCleaner v1.2 | 159MB | 1,592,692 | 27,531,812
Zipporah v1.2 | 151MB | 2,459,408 | 32,806,629
Slovak (7,980 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 6.05GB | 269,067,288 | 1,416,750,646
BiCleaner v5.0 | 568MB | 2,365,339 | 45,636,383
RAW v4.0 | 8.9GB | 334,903,774 | 1,418,785,612
BiCleaner v4.0 | 359MB | 1,591,831 | 26,711,854
RAW v3.0 | 8.9GB | 334,903,774 | 1,418,785,612
BiCleaner v3.0 | 540MB | 2,759,451 | 35,247,648
Zipporah v3.0 | 163MB | 599,671 | 11,134,061
Slovenian (5,016 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 4.07GB | 175,682,959 | 1,003,867,134
BiCleaner v5.0 | 406MB | 1,406,645 | 31,855,427
RAW v4.0 | 6GB | 208,466,320 | 967,461,921
BiCleaner v4.0 | 197MB | 660,161 | 14,489,659
RAW v3.0 | 6GB | 208,466,320 | 967,461,921
BiCleaner v3.0 | 309MB | 1,386,819 | 19,915,661
Zipporah v3.0 | 178MB | 478,444 | 9,060,846
Spanish (36,211 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 80.4GB | 2,674,900,280 | 16,598,620,402
BiCleaner v5.0 | 9.6GB | 38,971,348 | 897,891,704
RAW v4.0 | 111GB | 3,959,845,706 | 18,128,847,778
BiCleaner v4.0 | 6.1GB | 21,987,267 | 476,409,854
RAW v3.0 | 111GB | 3,959,845,706 | 18,128,847,778
BiCleaner v3.0 | 6.4GB | 30,535,457 | 491,951,545
Zipporah v3.0 | 8.6GB | 24,634,419 | 505,890,391
BiCleaner v2.0 | 7.2GB | 25,473,946 | 412,852,386
Zipporah v2.0 | 8.7GB | 21,286,014 | 437,009,844
RAW v1.0 | 60.9GB | 2,368,243,619 | -
Filtered v1.0 | 1.3GB | 16,001,341 | 325,745,201
BiCleaner v1.2 | 1.8GB | 17,511,545 | 303,161,256
Zipporah v1.2 | 2.2GB | 18,197,039 | 366,172,313
Swedish (13,616 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v5.0 | 16.54GB | 620,338,561 | 3,496,650,816
BiCleaner v5.0 | 1542MB | 6,079,175 | 138,264,978
RAW v4.0 | 22.5GB | 739,146,200 | 3,217,514,612
BiCleaner v4.0 | 857MB | 3,476,729 | 70,088,534
RAW v3.0 | 22.5GB | 739,146,200 | 3,217,514,612
BiCleaner v3.0 | 967MB | 4,960,282 | 79,278,861
Zipporah v3.0 | 675MB | 1,913,487 | 34,574,753
Bonus Release
Dutch-French (7,700 crawled websites)
File | Size | Sentence Pairs | Dutch Words | French Words
RAW | 1.8GB | 38,164,560 | 770,141,393 | 817,973,481
BiCleaner | 752MB | 2,687,331 | 60,504,313 | 64,650,034
Polish-German (5,549 crawled websites)
File | Size | Sentence Pairs | Polish Words | German Words
RAW | 479MB | 11,060,105 | 202,765,359 | 198,442,547
BiCleaner | 216MB | 916,522 | 18,883,576 | 20,271,637
Extra Languages in release v1.0
Russian (14,035 crawled websites)
File | Size | Sentence Pairs | English Words
RAW v1.0 | 38GB | 1,078,819,759 | -
Filtered v1.0 | 637MB | 12,061,155 | 157,061,045
  • Releases 4 and earlier included unaligned sentences, with one side empty, in the raw file. Release 5 removes these sentences from the raw file, which explains why the raw sizes dropped.
  • FILTERED v1.0 of the corpus is very rough; later releases refine it significantly.
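
The growth from release v4 to release v5 can be checked per language from the figures above. The sketch below copies the BiCleaner sentence-pair counts for four languages from the tables and computes the v5.0 to v4.0 ratio; the selection of languages and the function name growth_ratio are choices made for this illustration only.

```python
# BiCleaner sentence-pair counts copied from the tables above: (v4.0, v5.0).
BICLEANER_PAIRS = {
    "Bulgarian": (1_039_885, 2_586_277),
    "German": (16_264_450, 36_936_714),
    "French": (31_374_161, 51_316_168),
    "Spanish": (21_987_267, 38_971_348),
}


def growth_ratio(v4_pairs, v5_pairs):
    """Return how many times larger the v5.0 file is than the v4.0 file."""
    return v5_pairs / v4_pairs


if __name__ == "__main__":
    for language, (v4, v5) in BICLEANER_PAIRS.items():
        print(f"{language}: {growth_ratio(v4, v5):.2f}x")
```

The ratio varies by language, so "twice the size" is best read as an approximate overall figure rather than a per-language guarantee.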

Quality Assessment

Neural machine translation (NMT) systems using Marian 5-layer transformer models were trained for 3 different scenarios:

  • ParaCrawl5 individual systems: full-size or up to 4 million sentences were randomly selected from the ParaCrawl5 corpus and used to train 23 NMT systems.
  • Europarl individual systems: full-size available corpora from Europarl v7 were used to train 19 NMT systems.
  • Europarl + ParaCrawl: a combination of full-size Europarl v7 and full-size or up to 2 million sentences from ParaCrawl were selected to train 19 NMT systems.

The number of NMT systems for which we report results, a total of 58, was constrained in the different scenarios by the available languages in Europarl and TED test sets. From the original set of 23 language combinations in ParaCrawl5, Europarl covered 19 of them (not available for Irish, Croatian, Latvian and Maltese) and TED talks covered 20 of them (not available for Irish, Latvian and Maltese). This is why we do not report results for some language pairs in some scenarios.

 

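Language Pair (English-*) | Sentence Pairs: ParaCrawl5 | Sentence Pairs: Europarl v7 | Sentence Pairs: Europarl v7 + ParaCrawl5 | BLEU: ParaCrawl5 | BLEU: Europarl v7 | BLEU: Europarl v7 + ParaCrawl5
Bulgarian | 2.5M | 408K | 2.4M | 29.99 | 20.49 | 30.42
Czech | 4.0M | 647K | 2.6M | 19.68 | 14.77 | 20.36
Danish | 4.0M | 1.9M | 3.9M | 37.99 | 31.44 | 38.13
German | 4.0M | 1.9M | 3.9M | 26.82 | 20.71 | 27.34
Greek | 3.8M | 1.2M | 3.2M | 29.28 | 23.05 | 29.51
Spanish | 4.0M | 2.0M | 4.0M | 37.44 | 29.18 | 38.03
Estonian | 1.3M | 651K | 1.9M | 15.74 | 14.15 | 17.35
Finnish | 3.1M | 1.9M | 3.9M | 13.26 | 14.15 | 14.18
French | 4.0M | 2.0M | 4.0M | 31.46 | 24.84 | 30.71
Croatian | 1.8M | - | - | 22.5 | - | -
Hungarian | 4.0M | 625K | 2.6M | 15.29 | 11.9 | 16.21
Italian | 4.0M | 1.9M | 3.9M | 29.87 | 22.81 | 30.29
Lithuanian | 1.3M | 634K | 1.9M | 14.3 | 11.7 | 16.31
Dutch | 4.0M | 2.0M | 4.0M | 29.68 | 23.01 | 29.99
Polish | 4.0M | 631K | 2.6M | 15.11 | 10.99 | 15.43
Portuguese | 4.0M | 2.0M | 4.0M | 36.02 | 25.38 | 34.87
Romanian | 2.8M | 400K | 2.4M | 26.13 | 19.02 | 26.56
Slovak | 2.3M | 639K | 2.6M | 20.12 | 14.79 | 20.98
Slovenian | 1.4M | 624K | 2.0M | 19.44 | 15.31 | 20.67
Swedish | 4.0M | 1.8M | 3.8M | 35.19 | 27.45 | 34.66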

Almost all ParaCrawl5 individual systems have better BLEU results than Europarl individual systems:

  • All pairs but English-Finnish are improved by a mean of +5.18 BLEU points. This is probably because the ParaCrawl5 training sets are larger in all cases.
  • The English-Finnish ParaCrawl5 system underperforms the Europarl system, despite being trained on 1.1 million more sentences. This needs to be further investigated.

The combination of Europarl and a subset of ParaCrawl5 helps improve the results of the individual systems in almost all cases:

  • Compared to individual ParaCrawl5 systems: Only 3 combined systems produce slightly lower results than the individual ParaCrawl5 (French, Portuguese and Swedish) despite receiving +2M sentences each.
  • Compared to individual Europarl: all combined pairs are better than individual Europarl ones by a mean of +5.6 BLEU points. The pairs for which the contribution of ParaCrawl5 seems most significant are Portuguese (+9.49 BLEU points) and Spanish (+8.85 BLEU points). Romanian and Finnish stay almost at the same results as Europarl individual systems despite receiving +2M sentences each.
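
The per-pair comparisons above can be reproduced from the BLEU table. The sketch below transcribes the BLEU columns for the 19 pairs that have all three scenarios (Croatian is left out because it has no Europarl system) and lists the pairs where one scenario scores lower than another; the helper names below and gain are hypothetical and exist only for this illustration.

```python
# BLEU scores transcribed from the table above, in the order
# (ParaCrawl5, Europarl v7, Europarl v7 + ParaCrawl5).
PARACRAWL, EUROPARL, COMBINED = 0, 1, 2

BLEU = {
    "Bulgarian": (29.99, 20.49, 30.42),
    "Czech": (19.68, 14.77, 20.36),
    "Danish": (37.99, 31.44, 38.13),
    "German": (26.82, 20.71, 27.34),
    "Greek": (29.28, 23.05, 29.51),
    "Spanish": (37.44, 29.18, 38.03),
    "Estonian": (15.74, 14.15, 17.35),
    "Finnish": (13.26, 14.15, 14.18),
    "French": (31.46, 24.84, 30.71),
    "Hungarian": (15.29, 11.9, 16.21),
    "Italian": (29.87, 22.81, 30.29),
    "Lithuanian": (14.3, 11.7, 16.31),
    "Dutch": (29.68, 23.01, 29.99),
    "Polish": (15.11, 10.99, 15.43),
    "Portuguese": (36.02, 25.38, 34.87),
    "Romanian": (26.13, 19.02, 26.56),
    "Slovak": (20.12, 14.79, 20.98),
    "Slovenian": (19.44, 15.31, 20.67),
    "Swedish": (35.19, 27.45, 34.66),
}


def below(scores, left, right):
    """Return the languages whose score in column `left` is lower than in column `right`."""
    return sorted(lang for lang, row in scores.items() if row[left] < row[right])


def gain(scores, lang, left, right):
    """Return the BLEU difference (column `left` minus column `right`) for one language."""
    return round(scores[lang][left] - scores[lang][right], 2)


if __name__ == "__main__":
    print("ParaCrawl5 below Europarl:", below(BLEU, PARACRAWL, EUROPARL))
    print("Combined below ParaCrawl5:", below(BLEU, COMBINED, PARACRAWL))
    print("Portuguese, combined vs Europarl:", gain(BLEU, "Portuguese", COMBINED, EUROPARL))
    print("Spanish, combined vs Europarl:", gain(BLEU, "Spanish", COMBINED, EUROPARL))
```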

License

These data are released under this licensing scheme:

 

 

Notice and take down policy

 

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

 

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Kenneth Heafield at the following email address: kheafiel+takedown at inf.ed.ac.uk.

 

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Project Milestones

November 2018, Website updated with Broader WebCrawl

February 2019, Integration of Initial Domain Identification Technology

June 2019, Integration of Processing of Broader Document Formats

September 2019, Inclusion of Data from Internet Archive in Data Release.

September 2019, Data Release 1

March 2020, Data Release 2

August 2020, Final Code Release

August 2020, Domain Identification for Data Release 1

September 2020, Data Release 3

Follow ParaCrawl on Github and Twitter

Project Partners:

Other Contributors:

© Copyright 2019. All rights reserved.

Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.