Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages

ParaCrawl OpenSource Pipeline (Bitextor)

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

Bitextor 7.2.1
Even young Luke Skywalker had to face some pythons on his way to Yoda! This release brings with it the Force, thanks to all the Jedis who helped.

v7.2.1 Changelog

  • Updated submodules
  • Fixed and updated requirements from bicleaner and bifixer
  • Fixed bifixer silent error output
  • Fixed bifixer output when using flag --ignore_duplicates
  • Fixed possible tabs in URLs from malformed WARCs
  • Improved documentation
  • Note: the release tarball includes the submodule code. If you build the project from a clone of the repository, first run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the tarball or cloning the repository at the v7.2.1 tag.


v7.2 Changelog

  • Now you can set a list of WARC files in addition to the URLs as input for Bitextor (thanks @zuny26!)
    • For example, in config file for Snakemake:
      WARCFiles: ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]
  • Switched to the WARC standard .gz compressed format (records individually compressed).
  • warc3-wet completely replaced with warcio (thanks @zuny26!)
  • The xzlang Snakefile parameter allows grouping preprocessing output by language (it creates a separate file for each language found, not just LANG1 and LANG2) (thanks @zuny26!)
    • This avoids repeating the preprocessing step when the same domain is processed for different language pairs.
  • Added support for giawarc WARC preprocessor (thanks @wwaites!)
    • Activate it using giawarc: true in Snakemake config parameters
    • Installation instructions in
  • Added support in bitextor-warc2preprocess for the HTML/XML Python parser selectolax
    • Select the parser with the parser option in the Snakemake config. Options are 'alcazar', 'bs4' (default) and 'modest'.
      NOTE: this option has no effect when giawarc: true or xzlang: true is set.
  • pdf-extract is now installed and used through its PyPI package (thanks @dionwiggins!)
  • Fixed sentence splitter in MT-based document alignment for the target language (thanks @kirefu!)
  • bleualign-cpp implementation is now an external dependency
  • Replaced MD5 with MurmurHash3 in Creepy crawler, WARC preprocessors and deferred module
  • Some PEP8 code compliance changes and code cleaning using latest Snakecharm
  • Updated submodules
    • restorative-cleaning is deprecated. Now bifixer is replacing it! (thanks @mbanon!)
  • Updated documentation.
    • Added Docker installation instructions (thanks @amirkamran!).
    • Added installation instructions for giawarc in Dependencies.
    • Updated kenlm instructions to install it from upstream (thanks @kpu!).
    • Updated diagram with changes.
    • Created a Wiki with small tutorials, such as running Bitextor with new languages or running Bitextor on Windows 10
  • General system stability improvements to enhance the user's experience.
  • Note: the release tarball includes the submodule code. If you build the project from a clone of the repository, first run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the tarball or cloning the repository at the v7.2 tag.
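
The v7.2 changelog entry about the WARC .gz format (records individually compressed) relies on a property of gzip: several gzip members concatenated together still form a valid gzip stream. The following is a minimal sketch of that property using only Python's standard library. The record contents are made-up placeholders, not real WARC records, and this is not Bitextor's own code.

```python
import gzip
import io

# Made-up payloads standing in for WARC records (not real WARC syntax).
records = [b"record one\n", b"record two\n"]

# Compress each record as its own gzip member, then concatenate the members.
archive = b"".join(gzip.compress(record) for record in records)

# A standard gzip reader treats the concatenation as one continuous stream.
with gzip.open(io.BytesIO(archive), "rb") as stream:
    restored = stream.read()

print(restored == b"".join(records))
```

Because every record is a separate member, a reader that knows a record's byte offset can decompress that record alone without reading the rest of the file.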

ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". Newly crawled data has been added, including data from the Internet Archive. Enhancements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size compared to release v4 for all official EU languages (23 languages paired with English).
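
The growth from release v4 to v5 can be checked against the BiCleaner sentence-pair counts listed on this page. The snippet below is a small worked example; the three languages are an arbitrary selection, and the figures are copied from the per-language tables.

```python
# BiCleaner sentence-pair counts (v4.0, v5.0) copied from the tables on this page.
counts = {
    "Bulgarian": (1_039_885, 2_586_277),
    "French": (31_374_161, 51_316_168),
    "German": (16_264_450, 36_936_714),
}

# Growth factor of v5.0 over v4.0 for each selected language.
ratios = {language: v5 / v4 for language, (v4, v5) in counts.items()}

for language, ratio in ratios.items():
    print(f"{language}: {ratio:.2f}x")
```

For these three languages the factor ranges from roughly 1.6 to 2.5, so "twice the size" is best read as an approximate figure that varies by language.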

Each language below is paired with English and listed with its number of crawled websites, followed by the available downloads.
WMT 2019 uses release 3 of the corpus. For WMT 2018, the FILTERED v1.0 release of the corpus was used.
BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format.
To convert a TMX file to a tab-separated text file, download the TMXT tool.
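
To show what a TMX to tab-separated conversion involves, here is a minimal sketch using only Python's standard library. It is not the TMXT tool and makes no claim about how that tool works. The function name tmx_to_tsv_lines and the sample document are hypothetical, and the sketch assumes the usual TMX layout of tu, tuv and seg elements with the language given in xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv_lines(tmx_text, src_lang, tgt_lang):
    """Return one 'source<TAB>target' line per translation unit."""
    root = ET.fromstring(tmx_text)
    lines = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # Collapse whitespace so embedded tabs or newlines cannot break the TSV.
            text = " ".join("".join(seg.itertext()).split())
            segments[lang.lower()] = text
        if src_lang in segments and tgt_lang in segments:
            lines.append(segments[src_lang] + "\t" + segments[tgt_lang])
    return lines


# Tiny hand-written TMX document used only for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="bg"><seg>Добро утро.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for line in tmx_to_tsv_lines(SAMPLE_TMX, "en", "bg"):
    print(line)
```

This version loads the whole document into memory, which is fine for a sample but not for the multi-gigabyte files listed below; a streaming parser such as xml.etree.ElementTree.iterparse would be the natural choice there.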
Bulgarian (4,762 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 7.4GB | 248,555,951 | 1,564,051,100
BiCleaner v5.0 | 692MB | 2,586,277 | 55,725,444
RAW v4.0 | 10.2GB | 288,395,110 | 1,552,588,179
BiCleaner v4.0 | 284MB | 1,039,885 | 21,109,546
RAW v3.0 | 10.2GB | 288,395,110 | 1,552,588,179
BiCleaner v3.0 | 371MB | 1,704,762 | 28,243,306
Zipporah v3.0 | 217MB | 821,464 | 17,578,839
Croatian (8,889 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 8.33GB | 273,330,006 | 1,738,164,401
BiCleaner v5.0 | 477MB | 1,861,590 | 43,464,197
RAW v4.0 | 12.4GB | 411,950,164 | 1,996,212,922
BiCleaner v4.0 | 246MB | 1,002,053 | 19,904,218
RAW v3.0 | 12.4GB | 411,950,164 | 1,996,212,922
BiCleaner v3.0 | 265MB | 1,568,947 | 23,531,438
Zipporah v3.0 | 260MB | 1,004,131 | 18,004,931
BiCleaner v2.0 | 345MB | 1,455,841 | 21,387,649
Zipporah v2.0 | 295MB | |
Czech (14,335 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 20GB | 665,535,115 | 4,025,512,842
BiCleaner v5.0 | 1.2GB | 5,280,149 | 117,385,158
RAW v4.0 | 33.9GB | 1,189,317,247 | 5,621,562,488
BiCleaner v4.0 | 697MB | 2,981,949 | 48,918,151
RAW v3.0 | 33.9GB | 1,189,317,247 | 5,621,562,488
BiCleaner v3.0 | 913MB | 5,862,521 | 75,316,848
Zipporah v3.0 | 4.1GB | 17,058,282 | 139,211,417
BiCleaner v2.0 | 1.2GB | 5,488,589 | 69,182,264
Zipporah v2.0 | 4.5GB | 15,846,424 | 123,222,290
RAW v1.0 | 21.5GB | 818,784,053 | -
Filtered v1.0 | 285MB | 10,020,250 | 78,743,955
BiCleaner v1.2 | 237MB | 2,367,609 | 38,913,821
Zipporah v1.2 | 529MB | 9,982,508 | 78,943,174
Danish (19,776 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 17.8GB | 447,743,455 | 3,347,135,236
BiCleaner v5.0 | 1066MB | 4,606,183 | 106,565,546
RAW v4.0 | 24.8GB | 586,535,848 | 3,484,768,564
BiCleaner v4.0 | 553MB | 2,414,895 | 48,240,290
RAW v3.0 | 24.8GB | 586,535,848 | 3,484,768,564
BiCleaner v3.0 | 737MB | 4,891,462 | 67,200,201
Zipporah v3.0 | 565MB | 1,194,589 | 22,476,008
Dutch (17,887 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 32GB | 1,101,087,006 | 6,792,400,704
BiCleaner v5.0 | 2.8GB | 10,596,717 | 233,087,345
RAW v4.0 | 47.8GB | 1,760,140,259 | 8,239,317,278
BiCleaner v4.0 | 1.7GB | 5,659,268 | 108,197,376
RAW v3.0 | 47.8GB | 1,760,140,259 | 8,239,317,278
BiCleaner v3.0 | 2.2GB | 10,408,489 | 143,294,712
Zipporah v3.0 | 1.1GB | 3,291,804 | 56,744,571
BiCleaner v2.0 | 2.7GB | 9,342,505 | 127,895,866
Zipporah v2.0 | 1.2GB | 2,922,152 | 50,402,591
RAW v1.0 | 37.5GB | 1,506,033,538 | -
Filtered v1.0 | 168MB | 2,560,472 | 45,149,412
BiCleaner v1.2 | 581MB | 6,185,906 | 100,284,153
Zipporah v1.2 | 276MB | 2,556,523 | 45,169,303
Estonian (9,522 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 4.5GB | 168,091,382 | 915,074,587
BiCleaner v5.0 | 338MB | 1,387,869 | 30,858,140
RAW v4.0 | 8.4GB | 342,677,535 | 1,522,504,098
BiCleaner v4.0 | 202MB | 853,422 | 16,537,397
RAW v3.0 | 8.4GB | 342,677,535 | 1,522,504,098
BiCleaner v3.0 | 198MB | 1,064,078 | 17,725,513
Zipporah v3.0 | 214MB | 1,163,994 | 17,105,752
BiCleaner v2.0 | 245MB | 960,276 | 15,633,491
Zipporah v2.0 | 271MB | 1,122,289 | 12,820,311
RAW v1.0 | 4.4GB | 191,183,197 | -
Filtered v1.0 | 74.7MB | 1,298,103 | 13,134,231
Finnish (11,028 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 13.5GB | 460,181,215 | 2,731,068,033
BiCleaner v5.0 | 704MB | 3,097,223 | 66,385,933
RAW v4.0 | 20.6GB | 736,050,617 | 3,494,554,815
BiCleaner v4.0 | 469MB | 2,156,069 | 41,564,859
RAW v3.0 | 20.6GB | 736,050,617 | 3,494,554,815
BiCleaner v3.0 | 693MB | 3,944,929 | 54,984,783
Zipporah v3.0 | 432MB | 966,145 | 14,175,421
BiCleaner v2.0 | 985MB | 3,632,447 | 49,751,376
Zipporah v2.0 | 459MB | 831,170 | 12,692,508
RAW v1.0 | 12.7GB | 504,805,915 | -
Filtered v1.0 | 34.5MB | 544,335 | 8,420,501
BiCleaner v1.2 | 177MB | 1,982,774 | 29,979,317
Zipporah v1.2 | 59.8MB | 621,728 | 9,481,646
French (48,498 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 128GB | 4,273,819,421 | 24,983,683,983
BiCleaner v5.0 | 13.9GB | 51,316,168 | 1,178,317,233
RAW v4.0 | 183GB | 6,429,921,903 | 28,529,875,306
BiCleaner v4.0 | 9.9GB | 31,374,161 | 664,924,148
RAW v3.0 | 183GB | 6,429,921,903 | 28,529,875,306
Zipporah v3.0 | 15GB | 39,615,885 | 791,250,385
BiCleaner v2.0 | 11.4GB | 37,823,646 | 600,029,874
Zipporah v2.0 | 16.5GB | 37,743,429 | 754,045,036
RAW v1.0 | 111GB | 4,235,725,445 | -
Filtered v1.0 | 2.1GB | 27,622,881 | 546,401,428
BiCleaner v1.2 | 2.6GB | 25,380,067 | 428,397,408
Zipporah v1.2 | 3.9GB | 33,108,141 | 648,244,663
German (67,977 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 142.8GB | 5,038,103,659 | 27,994,213,177
BiCleaner v5.0 | 11.07GB | 36,936,714 | 929,818,868
RAW v4.0 | 211GB | 7,387,809,953 | 32,358,035,774
BiCleaner v4.0 | 5.4GB | 16,264,450 | 307,786,150
RAW v3.0 | 211GB | 7,387,809,953 | 32,358,035,774
BiCleaner v3.0 | 8.5GB | 31,358,551 | 502,903,379
Zipporah v3.0 | 28.7GB | 61,349,218 | 809,954,481
BiCleaner v2.0 | 9.8GB | 27,702,949 | 456,442,715
Zipporah v2.0 | 26.3GB | 55,849,341 | 740,849,699
RAW v1.0 | 121GB | 4,591,582,415 | -
Filtered v1.0 | 1.8GB | 36,351,593 | 476,398,001
BiCleaner v1.2 | 1.8GB | 17,378,982 | 302,274,816
Zipporah v1.2 | 3.3GB | 40,546,537 | 522,204,110
Greek (11,343 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 16.2GB | 640,502,801 | 3,768,712,672
BiCleaner v5.0 | 1122MB | 3,830,643 | 88,669,279
RAW v4.0 | 24.7GB | 740,094,469 | 3,340,324,438
BiCleaner v4.0 | 572MB | 1,985,233 | 38,322,532
RAW v3.0 | 24.7GB | 740,094,469 | 3,340,324,438
BiCleaner v3.0 | 878MB | 4,533,175 | 57,752,932
Zipporah v3.0 | 454MB | 992,220 | 18,958,591
Hungarian (9,522 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 14.47GB | 461,181,772 | 3,208,285,083
BiCleaner v5.0 | 1069MB | 4,187,051 | 104,292,635
RAW v4.0 | 16.5GB | 622,224,794 | 2,590,060,050
BiCleaner v4.0 | 482MB | 1,901,342 | 30,835,267
RAW v3.0 | 16.5GB | 622,224,794 | 2,590,060,050
BiCleaner v3.0 | 456MB | 3,160,496 | 32,151,740
Zipporah v3.0 | 360MB | 1,023,875 | 17,235,595
BiCleaner v2.0 | 669MB | |
Zipporah v2.0 | 338MB | 902,412 | 15,054,278
Irish (1,283 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 4.66GB | 64,628,733 | 667,211,260
BiCleaner v5.0 | 190MB | 782,769 | 21,909,039
RAW v4.0 | 6.8GB | 156,189,807 | 1,194,451,883
BiCleaner v4.0 | 75.9MB | 357,399 | 8,241,515
RAW v3.0 | 6.8GB | 156,189,807 | 1,028,019,178
BiCleaner v3.0 | 117MB | 607,734 | 15,473,067
Zipporah v3.0 | 154MB | 744,375 | 14,525,892
BiCleaner v2.0 | 150MB | 573,451 | 14,813,115
Zipporah v2.0 | 191MB | 732,783 | 14,269,940
Italian (31,518 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 66GB | 2,251,771,798 | 13,150,606,108
BiCleaner v5.0 | 6.02GB | 22,100,078 | 533,512,632
RAW v4.0 | 91.3GB | 3,333,886,336 | 14,519,224,940
BiCleaner v4.0 | 3.6GB | 12,162,239 | 260,361,435
RAW v3.0 | 91.3GB | 3,333,886,336 | 14,519,224,940
BiCleaner v3.0 | 4.3GB | 14,439,190 | 308,244,744
Zipporah v3.0 | 6.9GB | 1,368,691 | 269,587,549
BiCleaner v2.0 | 5.0GB | 17,224,855 | 264,324,830
Zipporah v2.0 | 6.7GB | 12,252,492 | 231,025,420
RAW v1.0 | 45.1GB | 1,727,688,019 | -
Filtered v1.0 | 593MB | 8,318,493 | 155,973,063
BiCleaner v1.2 | 963MB | 11,790,134 | 147,402,459
Zipporah v1.2 | 1.2GB | 12,065,631 | 212,026,083
Latvian (3,557 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 5GB | 176,113,669 | 1,069,218,155
BiCleaner v5.0 | 286MB | 1,019,003 | 23,656,140
RAW v4.0 | 7.7GB | 262,685,954 | 1,371,257,575
BiCleaner v4.0 | 158MB | 553,060 | 10,996,032
RAW v3.0 | 7.7GB | 262,685,954 | 1,371,257,575
BiCleaner v3.0 | 218MB | 1,009,860 | 15,058,052
Zipporah v3.0 | 125MB | 434,479 | 7,742,539
BiCleaner v2.0 | 270MB | 1,133,362 | 16,744,306
Zipporah v2.0 | 137MB | 386,447 | 6,066,997
RAW v1.0 | 4.9GB | 173,585,643 | -
Filtered v1.0 | 16MB | 242,227 | 4,250,040
BiCleaner v1.2 | 43.3MB | 406,742 | 6,995,228
Zipporah v1.2 | 26.9MB | 241,546 | 4,247,908
Lithuanian (4,678 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 4.92GB | 198,101,611 | 963,384,230
BiCleaner v5.0 | 375MB | 1,270,933 | 27,214,054
RAW v4.0 | 7.8GB | 294,568,032 | 1,198,118,449
BiCleaner v4.0 | 231MB | 844,643 | 15,087,805
RAW v3.0 | 7.8GB | 294,568,032 | 1,198,118,449
BiCleaner v3.0 | 273MB | 1,368,691 | 19,471,370
Zipporah v3.0 | 128MB | 432,724 | 6,727,629
BiCleaner v2.0 | 330MB | |
Zipporah v2.0 | 144MB | |
Maltese (672 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 173MB | 3,693,930 | 38,492,028
BiCleaner v5.0 | 38MB | 177,244 | 4,252,814
RAW v4.0 | 723MB | 17,602,902 | 164,119,571
BiCleaner v4.0 | 39.9MB | |
RAW v3.0 | 723MB | 17,602,902 | 164,119,571
BiCleaner v3.0 | 38.1MB | |
Zipporah v3.0 | 33.9MB | |
BiCleaner v2.0 | 41.6MB | |
Zipporah v2.0 | 38.6MB | |
Polish (13,357 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 22.6GB | 723,052,912 | 4,123,972,411
BiCleaner v5.0 | 1.6GB | 6,382,371 | 145,802,939
RAW v4.0 | 33.8GB | 1,259,312,618 | 5,555,536,170
BiCleaner v4.0 | 970MB | 3,503,276 | 65,618,419
RAW v3.0 | 33.8GB | 1,259,312,618 | 5,555,536,170
BiCleaner v3.0 | 1.3GB | 6,806,796 | 94,612,131
Zipporah v3.0 | 709MB | 1,748,865 | 29,205,973
BiCleaner v2.0 | 1.6GB | 5,787,436 | 81,662,507
Zipporah v2.0 | 692MB | 1,488,056 | 26,329,826
RAW v1.0 | 24.4GB | 984,884,968 | -
Filtered v1.0 | 85.7MB | 1,275,162 | 22,092,316
BiCleaner v1.2 | 330MB | 3,270,262 | 55,467,253
Zipporah v1.2 | 143MB | 1,269,300 | 22,079,132
Portuguese (18,887 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 34.8GB | 1,068,161,866 | 6,537,298,891
BiCleaner v5.0 | 3.3GB | 13,860,663 | 299,634,135
RAW v4.0 | 50.6GB | 1,763,439,122 | 8,465,738,356
BiCleaner v4.0 | 2GB | 8,141,940 | 156,125,200
RAW v3.0 | 50.6GB | 1,763,439,122 | 8,465,738,356
BiCleaner v3.0 | 2GB | 11,698,633 | 171,495,357
Zipporah v3.0 | 1.1GB | 3,834,613 | 79,794,493
BiCleaner v2.0 | 2.3GB | 9,740,600 | 148,240,776
Zipporah v2.0 | 1.1GB | 3,454,349 | 72,592,387
RAW v1.0 | 34.1GB | 1,357,911,799 | -
Filtered v1.0 | 222MB | 2,809,381 | 57,392,721
BiCleaner v1.2 | 611MB | 6,436,491 | 93,021,518
Zipporah v1.2 | 366MB | 3,056,920 | 60,180,429
Romanian (9,335 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 15.2GB | 510,209,923 | 3,034,045,929
BiCleaner v5.0 | 728MB | 2,870,687 | 62,189,306
RAW v4.0 | 22.7GB | 793,759,210 | 4,059,255,214
BiCleaner v4.0 | 498MB | 1,952,043 | 39,882,223
RAW v3.0 | 22.7GB | 793,759,210 | 4,059,255,214
BiCleaner v3.0 | 592MB | 3,284,810 | 49,494,227
Zipporah v3.0 | 621MB | 2,766,706 | 38,673,891
BiCleaner v2.0 | 713MB | 2,684,189 | 39,958,916
Zipporah v2.0 | 607MB | 2,537,851 | 34,596,458
RAW v1.0 | 17.2GB | 635,709,587 | -
Filtered v1.0 | 105MB | 2,459,752 | 32,800,110
BiCleaner v1.2 | 159MB | 1,592,692 | 27,531,812
Zipporah v1.2 | 151MB | 2,459,408 | 32,806,629
Slovak (7,980 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 6.05GB | 269,067,288 | 1,416,750,646
BiCleaner v5.0 | 568MB | 2,365,339 | 45,636,383
RAW v4.0 | 8.9GB | 334,903,774 | 1,418,785,612
BiCleaner v4.0 | 359MB | 1,591,831 | 26,711,854
RAW v3.0 | 8.9GB | 334,903,774 | 1,418,785,612
BiCleaner v3.0 | 540MB | 2,759,451 | 35,247,648
Zipporah v3.0 | 163MB | 599,671 | 11,134,061
Slovenian (5,016 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 4.07GB | 175,682,959 | 1,003,867,134
BiCleaner v5.0 | 406MB | 1,406,645 | 31,855,427
RAW v4.0 | 6GB | 208,466,320 | 967,461,921
BiCleaner v4.0 | 197MB | 660,161 | 14,489,659
RAW v3.0 | 6GB | 208,466,320 | 967,461,921
BiCleaner v3.0 | 309MB | 1,386,819 | 19,915,661
Zipporah v3.0 | 178MB | 478,444 | 9,060,846
Spanish (36,211 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 80.4GB | 2,674,900,280 | 16,598,620,402
BiCleaner v5.0 | 9.6GB | 38,971,348 | 897,891,704
RAW v4.0 | 111GB | 3,959,845,706 | 18,128,847,778
BiCleaner v4.0 | 6.1GB | 21,987,267 | 476,409,854
RAW v3.0 | 111GB | 3,959,845,706 | 18,128,847,778
BiCleaner v3.0 | 6.4GB | 30,535,457 | 491,951,545
Zipporah v3.0 | 8.6GB | 24,634,419 | 505,890,391
BiCleaner v2.0 | 7.2GB | 25,473,946 | 412,852,386
Zipporah v2.0 | 8.7GB | 21,286,014 | 437,009,844
RAW v1.0 | 60.9GB | 2,368,243,619 | -
Filtered v1.0 | 1.3GB | 16,001,341 | 325,745,201
BiCleaner v1.2 | 1.8GB | 17,511,545 | 303,161,256
Zipporah v1.2 | 2.2GB | 18,197,039 | 366,172,313
Swedish (13,616 crawled websites)
Release | File Size | Sentence Pairs | English Words
RAW v5.0 | 16.54GB | 620,338,561 | 3,496,650,816
BiCleaner v5.0 | 1542MB | 6,079,175 | 138,264,978
RAW v4.0 | 22.5GB | 739,146,200 | 3,217,514,612
BiCleaner v4.0 | 857MB | 3,476,729 | 70,088,534
RAW v3.0 | 22.5GB | 739,146,200 | 3,217,514,612
BiCleaner v3.0 | 967MB | 4,960,282 | 79,278,861
Zipporah v3.0 | 675MB | 1,913,487 | 34,574,753
Bonus Release
Dutch-French (7,700 crawled websites)
Release | File Size | Sentence Pairs | Dutch Words | French Words
Polish-German (5,549 crawled websites)
Release | File Size | Sentence Pairs | Polish Words | German Words
Extra Languages in release v1.0
Russian (14,035 crawled websites): RAW v1.0, FILTERED v1.0
Release | File Size | Sentence Pairs | English Words
RAW v1.0 | 38GB | 1,078,819,759 | -
Filtered v1.0 | 637MB | 12,061,155 | 157,061,045
  • Releases 4 and earlier included unaligned sentences in the raw file, with one side empty. Release 5 removes these sentences from the raw file, which explains why the raw sizes dropped.
  • FILTERED v1.0 of the corpus is very rough; later releases refine it significantly.
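
The first note above can be illustrated with a short filter. This is a minimal sketch, not the ParaCrawl tooling: it assumes a hypothetical layout in which each raw line holds the source and target sentences in its first two tab-separated fields, which this page does not confirm. The function name drop_unaligned is also made up for the example.

```python
def drop_unaligned(lines):
    """Keep only lines whose first two tab-separated fields are both non-empty."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: fields[0] is the source side, fields[1] the target side.
        if len(fields) >= 2 and fields[0].strip() and fields[1].strip():
            kept.append(line)
    return kept


# Made-up sample: one aligned pair and two lines with an empty side.
sample = [
    "Good morning.\tДобро утро.\n",
    "Only the source side is present.\t\n",
    "\tСамо целевата страна.\n",
]

filtered = drop_unaligned(sample)
print(len(sample), "->", len(filtered))
```

Dropping such lines lowers both the sentence-pair count and the file size without removing any aligned pair, which matches the drop in raw sizes between releases 4 and 5.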


These data are released under this licensing scheme:



Notice and take down policy


Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:


  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Kenneth Heafield at the following email address: kheafiel+takedown at


Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Project Milestones

Action: Broader Web-Scale Provision of Parallel Corpora for European Languages

November 2018, Website updated with Broader WebCrawl.

February 2019, Integration of Initial Domain Identification Technology.

June 2019, Integration of Processing of Broader Document Formats.

September 2019, Inclusion of Data from Internet Archive in Data Release.

September 2019, Data Release 1

March 2020, Data Release 2

August 2020, Final Code Release

August 2020, Domain Identification for Data Release 1

September 2020, Data Release 3

Action: Continued Web-Scale Provision of Parallel Corpora for European Languages

September 2020, Augmented Data Release including Manufactured Data

September 2020, Implementation of Deferred Crawling

March 2021, Data Release 1 (This will include extra data from the Internet Archive and an updated version of the manufactured data)

May 2021, Completion of data portal, with data from release 1

September 2021, Data Release 2 (This will include more data from the Internet Archive and an updated version of the manufactured data)

September 2021, Code Release

September 2021, Quality testing for release 1

September 2021, Data promotion plan

September 2021, Validation: Metadata related to "Augmented Data, Data release 1 and 2" to be compliant with the ELRC SHARE specifications and available on the ELRC-SHARE repository.


© Copyright 2019. All rights reserved.

Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.