Broader Web-Scale Provision of Parallel Corpora for European Languages

ParaCrawl OpenSource Pipeline (Bitextor)

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

Bitextor 7
We redesigned Bitextor to be more flexible and scalable: it now runs like make, but in Python, thanks to Snakemake. We also made all Python scripts compatible with Python 3.5-3.8, which is good news for long-term support.

  • bitextor.sh reworked.
    • Now it calls rules from snakemake/Snakefile.
    • It also uses the NMT rules from snakemake/nmt/Snakefile when translation-based document alignment is used.
      • Example YAML config files at snakemake/example/tests.
  • Crawling output format changed to WARC.
  • Reworked file formats.
    • The ETT, LETT and LETTR formats do not exist anymore. Now, there is one file per column to avoid redundancy.
  • Sentence and word tokeniser paths can be specified by the user.
  • Deleted Zipporah and Ulysses.
  • Replaced Tika with Python3 ftfy.
  • Added optional restorative cleaning.
  • Updated installation instructions.
  • Added bleualign (C++ implementation) as an alternative sentence aligner.
  • Reworked Makefile.
    • No need to install Bitextor to run it (or any of the included scripts).
  • Updated submodules, APT dependencies and pip packages.
  • PEP 8 style guidelines in Python scripts.
  • Fully compatible with Python 3.
  • Added computational requirements documentation and reworked README.md.
  • General system stability improvements to enhance the user's experience.

ParaCrawl Corpus release v4.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v4 is the final release for the Action "Provision of Web-Scale Parallel Corpora for Official European Languages" and covers all official EU languages (23 languages paired with English).

Language, Crawled Websites and Download Details
Release 3 of the corpus is used in the WMT 2019 proceedings. For WMT 2018, FILTERED v1.0 of the released corpus was used.
In v4, the BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format.
To effectively transform a TMX file into a tab-separated text file, download the TMXT tool.
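As a rough illustration of what a TMX-to-text conversion involves, the following Python sketch reads translation units from a TMX string and writes one tab-separated line per unit. It is not the TMXT tool and makes no claim about how that tool works; the function name `tmx_to_tsv` and the sample data are hypothetical, and the whole document is assumed to fit in memory.

```python
import xml.etree.ElementTree as ET

# The parser expands the xml:lang attribute to this namespaced name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(tmx_text, langs):
    """Return one tab-separated line per translation unit, columns ordered as in langs.

    Hypothetical helper for illustration only; not part of the TMXT tool.
    """
    root = ET.fromstring(tmx_text)
    lines = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # Collapse tabs and newlines so each segment stays in one column.
            segments[lang.lower()] = " ".join("".join(seg.itertext()).split())
        # Skip units that lack one of the requested languages.
        if all(code in segments for code in langs):
            lines.append("\t".join(segments[code] for code in langs))
    return "\n".join(lines)


# Made-up sample; real TMX files carry more header metadata.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good
        morning</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    print(tmx_to_tsv(SAMPLE_TMX, ["en", "fr"]))
```

For files of the sizes listed on this page, a streaming parser such as `xml.etree.ElementTree.iterparse` would be a more realistic choice than loading the whole document at once.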
Bulgarian 4,762
File                Size    Sentence Pairs  English Words
RAW v4.0            10.2GB  288,395,110     1,552,588,179
BiCleaner v4.0 TMX  284MB   1,039,885       21,109,546
RAW v3.0            10.2GB  288,395,110     1,552,588,179
BiCleaner v3.0      371MB   1,704,762       28,243,306
Zipporah v3.0       217MB   821,464         17,578,839
Danish 19,776
File                Size    Sentence Pairs  English Words
RAW v4.0            24.8GB  586,535,848     3,484,768,564
BiCleaner v4.0      553MB   2,414,895       48,240,290
RAW v3.0            24.8GB  586,535,848     3,484,768,564
BiCleaner v3.0      737MB   4,891,462       67,200,201
Zipporah v3.0       565MB   1,194,589       22,476,008
Greek 11,343
File                Size    Sentence Pairs  English Words
RAW v4.0            24.7GB  740,094,469     3,384,919,588
BiCleaner v4.0      572MB   1,985,233       38,322,532
RAW v3.0            24.7GB  740,094,469     3,384,919,588
BiCleaner v3.0      878MB   4,533,175       57,752,932
Zipporah v3.0       454MB   992,220         18,958,591
Slovak 7,980
File                Size    Sentence Pairs  English Words
RAW v4.0            8.9GB
BiCleaner v4.0      359MB   1,591,831       26,711,854
RAW v3.0            8.9GB
BiCleaner v3.0      540MB   2,759,451       35,247,648
Zipporah v3.0       163MB   599,671         11,134,061
Slovenian 5,016
File                Size    Sentence Pairs  English Words
RAW v4.0            6GB     208,466,320     972,646,305
BiCleaner v4.0      197MB   660,161         14,489,659
RAW v3.0            6GB     208,466,320     972,646,305
BiCleaner v3.0      309MB   1,386,819       19,915,661
Zipporah v3.0       178MB   478,444         9,060,846
Swedish 13,616
File                Size    Sentence Pairs  English Words
RAW v4.0            22.5GB  739,146,200     3,224,270,010
BiCleaner v4.0      857MB   3,476,729       70,088,534
RAW v3.0            22.5GB  739,146,200     3,224,270,010
BiCleaner v3.0      967MB   4,960,282       79,278,861
Zipporah v3.0       675MB   1,913,487       34,574,753
Irish 1,283
File                Size    Sentence Pairs  English Words
RAW v4.0            6.8GB   156,189,807     1,194,451,883
BiCleaner v4.0      75.9MB  357,399         8,241,515
RAW v3.0            6.8GB   156,189,807     1,194,451,883
BiCleaner v3.0      117MB   607,734         15,473,067
Zipporah v3.0       154MB   744,375         14,525,892
BiCleaner v2.0      150MB   573,451         14,813,115
Zipporah v2.0       191MB   732,783         14,269,940
Croatian 8,889
File                Size    Sentence Pairs  English Words
RAW v4.0            12.4GB  411,950,164     2,031,138,976
BiCleaner v4.0      246MB   1,002,053       19,904,218
RAW v3.0            12.4GB  411,950,164     2,031,138,976
BiCleaner v3.0      265MB   1,568,947       23,531,438
Zipporah v3.0       260MB   1,004,131       18,004,931
BiCleaner v2.0      345MB   1,455,841       21,387,649
Zipporah v2.0       295MB   933,190         16,655,718
Maltese 672
File                Size    Sentence Pairs  English Words
RAW v4.0            723MB   17,602,902      183,558,003
BiCleaner v4.0      39.9MB  195,510         4,100,912
RAW v3.0            723MB   17,602,902      183,558,003
BiCleaner v3.0      38.1MB  227,499         4,429,648
Zipporah v3.0       33.9MB  154,038         2,143,321
BiCleaner v2.0      41.6MB  198,537         3,884,509
Zipporah v2.0       38.6MB  137,318         1,919,196
Lithuanian 4,678
File                Size    Sentence Pairs  English Words
RAW v4.0            7.8GB   294,568,032     1,226,507,592
BiCleaner v4.0      231MB   844,643         15,087,805
RAW v3.0            7.8GB   294,568,032     1,226,507,592
BiCleaner v3.0      273MB   1,368,691       19,471,370
Zipporah v3.0       128MB   432,724         6,727,629
BiCleaner v2.0      330MB   1,133,362       16,744,306
Zipporah v2.0       144MB   386,447         6,066,997
Hungarian 9,522
File                Size    Sentence Pairs  English Words
RAW v4.0            16.5GB
BiCleaner v4.0      482MB   1,901,342       30,835,267
RAW v3.0            16.5GB
BiCleaner v3.0      456MB   3,160,496       32,151,740
Zipporah v3.0       360MB   1,023,875       17,235,595
BiCleaner v2.0      669MB   308,248631,764,228
Zipporah v2.0       338MB   902,412         15,054,278
Estonian 9,522
File                Size    Sentence Pairs  English Words
RAW v4.0            8.4GB
BiCleaner v4.0      202MB   853,422         16,537,397
RAW v3.0            8.4GB
BiCleaner v3.0      198MB   1,064,078       17,725,513
Zipporah v3.0       214MB   1,163,994       17,105,752
BiCleaner v2.0      245MB   960,276         15,633,491
Zipporah v2.0       271MB   1,122,289       12,820,311
RAW v1.0            4.4GB   191,183,197     -
Filtered v1.0       74.7MB  1,298,103       13,134,231
German 67,977
File                Size    Sentence Pairs  English Words
RAW v4.0            211GB
BiCleaner v4.0      5.4GB   16,264,450      307,786,150
RAW v3.0            211GB
BiCleaner v3.0      8.5GB   31,358,551      502,903,379
Zipporah v3.0       28.7GB  61,349,218      809,954,481
BiCleaner v2.0      9.8GB   27,702,949      456,442,715
Zipporah v2.0       26.3GB  55,849,341      740,849,699
RAW v1.0            121GB   4,591,582,415   -
Filtered v1.0       1.8GB   36,351,593      476,398,001
BiCleaner v1.2      1.8GB   17,378,982      302,274,816
Zipporah v1.2       3.3GB   40,546,537      522,204,110
French 48,498
File                Size    Sentence Pairs  English Words
RAW v4.0            183GB
BiCleaner v4.0      9.9GB   31,374,161      664,924,148
RAW v3.0            183GB
Zipporah v3.0       15GB    39,615,885      791,250,385
BiCleaner v2.0      11.4GB  37,823,646      600,029,874
Zipporah v2.0       16.5GB  37,743,429      754,045,036
RAW v1.0            111GB   4,235,725,445   -
Filtered v1.0       2.1GB   27,622,881      546,401,428
BiCleaner v1.2      2.6GB   25,380,067      428,397,408
Zipporah v1.2       3.9GB   33,108,141      648,244,663
Spanish 36,211
File                Size    Sentence Pairs  English Words
RAW v4.0            111GB
BiCleaner v4.0      6.1GB   21,987,267      476,409,854
RAW v3.0            111GB
BiCleaner v3.0      6.4GB   30,535,457      491,951,545
Zipporah v3.0       8.6GB   24,634,419      505,890,391
BiCleaner v2.0      7.2GB   25,473,946      412,852,386
Zipporah v2.0       8.7GB   21,286,014      437,009,844
RAW v1.0            60.9GB  2,368,243,619   -
Filtered v1.0       1.3GB   16,001,341      325,745,201
BiCleaner v1.2      1.8GB   17,511,545      303,161,256
Zipporah v1.2       2.2GB   18,197,039      366,172,313
Italian 31,518
File                Size    Sentence Pairs  English Words
RAW v4.0            91.3GB
BiCleaner v4.0      3.6GB   12,162,239      260,361,435
RAW v3.0            91.3GB
BiCleaner v3.0      4.3GB   14,439,190      308,244,744
Zipporah v3.0       6.9GB   1,368,691       269,587,549
BiCleaner v2.0      5.0GB   17,224,855      264,324,830
Zipporah v2.0       6.7GB   12,252,492      231,025,420
RAW v1.0            45.1GB  1,727,688,019   -
Filtered v1.0       593MB   8,318,493       155,973,063
BiCleaner v1.2      963MB   11,790,134      147,402,459
Zipporah v1.2       1.2GB   12,065,631      212,026,083
Portuguese 18,887
File                Size    Sentence Pairs  English Words
RAW v4.0            50.6GB
BiCleaner v4.0      2GB     8,141,940       156,125,200
RAW v3.0            50.6GB
BiCleaner v3.0      2GB     11,698,633      171,495,357
Zipporah v3.0       1.1GB   3,834,613       79,794,493
BiCleaner v2.0      2.3GB   9,740,600       148,240,776
Zipporah v2.0       1.1GB   3,454,349       72,592,387
RAW v1.0            34.1GB  1,357,911,799   -
Filtered v1.0       222MB   2,809,381       57,392,721
BiCleaner v1.2      611MB   6,436,491       93,021,518
Zipporah v1.2       366MB   3,056,920       60,180,429
Dutch 17,887
File                Size    Sentence Pairs  English Words
RAW v4.0            47.8GB
BiCleaner v4.0      1.7GB   5,659,268       108,197,376
RAW v3.0            47.8GB
BiCleaner v3.0      2.2GB   10,408,489      143,294,712
Zipporah v3.0       1.1GB   3,291,804       56,744,571
BiCleaner v2.0      2.7GB   9,342,505       127,895,866
Zipporah v2.0       1.2GB   2,922,152       50,402,591
RAW v1.0            37.5GB  1,506,033,538   -
Filtered v1.0       168MB   2,560,472       45,149,412
BiCleaner v1.2      581MB   6,185,906       100,284,153
Zipporah v1.2       276MB   2,556,523       45,169,303
Polish 13,357
File                Size    Sentence Pairs  English Words
RAW v4.0            33.8GB
BiCleaner v4.0      970MB   3,503,276       65,618,419
RAW v3.0            33.8GB
BiCleaner v3.0      1.3GB   6,806,796       94,612,131
Zipporah v3.0       709MB   1,748,865       29,205,973
BiCleaner v2.0      1.6GB   5,787,436       81,662,507
Zipporah v2.0       692MB   1,488,056       26,329,826
RAW v1.0            24.4GB  984,884,968     -
Filtered v1.0       85.7MB  1,275,162       22,092,316
BiCleaner v1.2      330MB   3,270,262       55,467,253
Zipporah v1.2       143MB   1,269,300       22,079,132
Czech 14,335
File                Size    Sentence Pairs  English Words
RAW v4.0            33.9GB
BiCleaner v4.0      697MB   2,981,949       48,918,151
RAW v3.0            33.9GB
BiCleaner v3.0      913MB   5,862,521       75,316,848
Zipporah v3.0       4.1GB   17,058,282      139,211,417
BiCleaner v2.0      1.2GB   5,488,589       69,182,264
Zipporah v2.0       4.5GB   15,846,424      123,222,290
RAW v1.0            21.5GB  818,784,053     -
Filtered v1.0       285MB   10,020,250      78,743,955
BiCleaner v1.2      237MB   2,367,609       38,913,821
Zipporah v1.2       529MB   9,982,508       78,943,174
Romanian 9,335
File                Size    Sentence Pairs  English Words
RAW v4.0            22.7GB
BiCleaner v4.0      498MB   1,952,043       39,882,223
RAW v3.0            22.7GB
BiCleaner v3.0      592MB   3,284,810       49,494,227
Zipporah v3.0       621MB   2,766,706       38,673,891
BiCleaner v2.0      713MB   2,684,189       39,958,916
Zipporah v2.0       607MB   2,537,851       34,596,458
RAW v1.0            17.2GB  635,709,587     -
Filtered v1.0       105MB   2,459,752       32,800,110
BiCleaner v1.2      159MB   1,592,692       27,531,812
Zipporah v1.2       151MB   2,459,408       32,806,629
Finnish 11,028
File                Size    Sentence Pairs  English Words
RAW v4.0            20.6GB
BiCleaner v4.0      469MB   2,156,069       41,564,859
RAW v3.0            20.6GB
BiCleaner v3.0      693MB   3,944,929       54,984,783
Zipporah v3.0       432MB   966,145         14,175,421
BiCleaner v2.0      985MB   3,632,447       49,751,376
Zipporah v2.0       459MB   831,170         12,692,508
RAW v1.0            12.7GB  504,805,915     -
Filtered v1.0       34.5MB  544,335         8,420,501
BiCleaner v1.2      177MB   1,982,774       29,979,317
Zipporah v1.2       59.8MB  621,728         9,481,646
Latvian 3,557
File                Size    Sentence Pairs  English Words
RAW v4.0            7.7GB
BiCleaner v4.0      158MB   553,060         10,996,032
RAW v3.0            7.7GB
BiCleaner v3.0      218MB   1,009,860       15,058,052
Zipporah v3.0       125MB   434,479         7,742,539
BiCleaner v2.0      270MB   1,133,362       16,744,306
Zipporah v2.0       137MB   386,447         6,066,997
RAW v1.0            4.9GB   173,585,643     -
Filtered v1.0       16MB    242,227         4,250,040
BiCleaner v1.2      43.3MB  406,742         6,995,228
Zipporah v1.2       26.9MB  241,546         4,247,908
Extra Languages in release v1.0
Russian 14,035 RAW v1.0 FILTERED v1.0
File                Size    Sentence Pairs  English Words
RAW v1.0            38GB    1,078,819,759   -
Filtered v1.0       637MB   12,061,155      157,061,045
  • The large drop-off in sentence pairs between RAW and the different FILTERED versions is due to deduplication and to the removal of data after failures in earlier processing steps.
  • FILTERED v1.0 of the corpus is very rough; it is significantly refined in later releases.
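
To give a feel for the drop-off described in the notes above, the sketch below removes duplicate sentence pairs from a small list and computes what share of RAW pairs survives filtering. It is only an illustration: the normalisation used for comparing pairs (lowercasing and collapsing whitespace) is an assumption, not the project's actual deduplication rule, and both function names are hypothetical.

```python
def deduplicate_pairs(pairs):
    """Keep the first occurrence of each (source, target) pair.

    Pairs are compared after lowercasing and collapsing whitespace; this
    normalisation is an assumption made for the example.
    """
    seen = set()
    unique = []
    for source, target in pairs:
        key = (" ".join(source.split()).lower(), " ".join(target.split()).lower())
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


def retention_rate(kept_pairs, raw_pairs):
    """Fraction of RAW sentence pairs that remain in a filtered file."""
    return kept_pairs / raw_pairs


if __name__ == "__main__":
    sample = [
        ("Hello world", "Bonjour le monde"),
        ("hello  world", "Bonjour le monde"),  # duplicate after normalisation
        ("Good morning", "Bonjour"),
    ]
    print(len(deduplicate_pairs(sample)))
    # Bulgarian figures from this page: BiCleaner v4.0 pairs vs RAW v4.0 pairs.
    print(f"{retention_rate(1_039_885, 288_395_110):.2%}")
```

With the Bulgarian figures, roughly 0.36% of the RAW v4.0 sentence pairs remain in the BiCleaner v4.0 file.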

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Kenneth Heafield at the following email address: kheafiel+takedown at inf.ed.ac.uk.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2


© Copyright 2018. All rights reserved.