Broader Web-Scale Provision of Parallel Corpora for European Languages

ParaCrawl OpenSource Pipeline (Bitextor)

Bitextor is a tool for automatically harvesting bitexts from multilingual websites. Release v6.0.0-rc.1 is the code release for the ParaCrawl project.

  • Updated documentation and README.md with new dependencies, commands, and troubleshooting
  • Added links to the original repositories of most compiled dependencies (mgiza, clustercat, bicleaner, ...)
  • Fixed encoding errors in Tika input/output management
  • Added an option to use NLTK as sentence splitter
  • Added many parameters and options to control most parts of the pipeline, along with long-named versions of them (see --help)
  • Replaced mkcls with clustercat and giza-pp with mgiza
  • Added support for a config file in bitextor (see README.md)
  • Added ELRC metrics and filters
  • Added bicleaner and zipporah classifiers and thresholds for filtering
  • Added HTTrack as an alternative crawler
  • Added a JHU script for processing crawled content (option --jhu-lett)
  • Added an alternative translation-based document aligner from ParaCrawl (option --jhu-aligner-command TRANSLATIONCOMMAND)
  • Minor changes and bugfixes

Note: the source packages do not include the submodules' code. If you compile the project from a Git clone, first run git submodule update --init --recursive. This command cannot be run on the automatically generated "Source code" .tar.gz and .zip packages, because they are not Git repositories; we therefore recommend either the bitextor-v6.0.0-rc.1.zip tarball or cloning the repository.


ParaCrawl Corpus v3.0

The v3.0 release contains six new languages: Bulgarian, Danish, Greek, Slovak, Slovenian, and Swedish. More data has also been added to the previously released languages.


ParaCrawl Corpus v2.0

The v2.0 release contains six new languages: Irish, Croatian, Maltese, Lithuanian, Hungarian, and Estonian. More data has also been added to the previously released languages.


ParaCrawl Corpus v1.2

The v1.2 release contains two new filtered versions of the corpus for each language: "ZIPPORAH v1.2" and "BICLEANER v1.2". For more details, see the corpus size table below.


ParaCrawl Corpus v1.0

The first version of the ParaCrawl corpus, released in January 2018, contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl, which has much higher coverage of the selected websites than CommonCrawl. More details on the crawling and extraction process are given below.


Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics.


The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

Corpus Size

Corpus sizes are listed per language below. The number of crawled websites is given in parentheses, and each row lists file, size, sentence pairs, and English words.

Please note that the proceedings of WMT 2018 use the FILTERED v1.0 version of the released corpus.

New languages in release v3.0: Bulgarian, Danish, Greek, Slovak, Slovenian, Swedish. More data has been added to the other languages; please download the v3.0 files.
Bulgarian (4,762 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 10.2GB | 288,395,110 | 1,552,588,179
  BiCleaner v3.0 | 371MB | 1,704,762 | 28,243,306
  Zipporah v3.0 | 217MB | 821,464 | 17,578,839

Danish (19,776 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 24.8GB | 586,535,848 | 3,484,768,564
  BiCleaner v3.0 | 737MB | 4,891,462 | 67,200,201
  Zipporah v3.0 | 565MB | 1,194,589 | 22,476,008

Greek (11,343 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 24.7GB | 740,094,469 | 3,384,919,588
  BiCleaner v3.0 | 878MB | 4,533,175 | 57,752,932
  Zipporah v3.0 | 454MB | 992,220 | 18,958,591

Slovak (7,980 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 8.9GB | - | -
  BiCleaner v3.0 | 540MB | 2,759,451 | 35,247,648
  Zipporah v3.0 | 163MB | 599,671 | 11,134,061

Slovenian (5,016 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 6GB | 208,466,320 | 972,646,305
  BiCleaner v3.0 | 309MB | 1,386,819 | 19,915,661
  Zipporah v3.0 | 178MB | 478,444 | 9,060,846

Swedish (13,616 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 22.5GB | 739,146,200 | 3,224,270,010
  BiCleaner v3.0 | 967MB | 4,960,282 | 79,278,861
  Zipporah v3.0 | 675MB | 1,913,487 | 34,574,753

Irish (1,283 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 6.8GB | 156,189,807 | 1,194,451,883
  BiCleaner v3.0 | 117MB | 607,734 | 15,473,067
  Zipporah v3.0 | 154MB | 744,375 | 14,525,892
  BiCleaner v2.0 | 150MB | 573,451 | 14,813,115
  Zipporah v2.0 | 191MB | 732,783 | 14,269,940

Croatian (8,889 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 12.4GB | 411,950,164 | 2,031,138,976
  BiCleaner v3.0 | 265MB | 1,568,947 | 23,531,438
  Zipporah v3.0 | 260MB | 1,004,131 | 18,004,931
  BiCleaner v2.0 | 345MB | 1,455,841 | 21,387,649
  Zipporah v2.0 | 295MB | 933,190 | 16,655,718

Maltese (672 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 723MB | 17,602,902 | 183,558,003
  BiCleaner v3.0 | 38.1MB | 227,499 | 4,429,648
  Zipporah v3.0 | 33.9MB | 154,038 | 2,143,321
  BiCleaner v2.0 | 41.6MB | 198,537 | 3,884,509
  Zipporah v2.0 | 38.6MB | 137,318 | 1,919,196

Lithuanian (4,678 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 7.8GB | 294,568,032 | 1,226,507,592
  BiCleaner v3.0 | 273MB | 1,368,691 | 19,471,370
  Zipporah v3.0 | 128MB | 432,724 | 6,727,629
  BiCleaner v2.0 | 330MB | 1,133,362 | 16,744,306
  Zipporah v2.0 | 144MB | 386,447 | 6,066,997
Hungarian (9,522 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 16.5GB | - | -
  BiCleaner v3.0 | 456MB | 3,160,496 | 32,151,740
  Zipporah v3.0 | 360MB | 1,023,875 | 17,235,595
  BiCleaner v2.0 | 669MB | 3,082,486 | 31,764,228
  Zipporah v2.0 | 338MB | 902,412 | 15,054,278
Estonian (9,522 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 8.4GB | - | -
  BiCleaner v3.0 | 198MB | 1,064,078 | 17,725,513
  Zipporah v3.0 | 214MB | 1,163,994 | 17,105,752
  BiCleaner v2.0 | 245MB | 960,276 | 15,633,491
  Zipporah v2.0 | 271MB | 1,122,289 | 12,820,311
  RAW v1.0 | 4.4GB | 191,183,197 | -
  Filtered v1.0 | 74.7MB | 1,298,103 | 13,134,231
German (67,977 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 211GB | - | -
  BiCleaner v3.0 | 8.5GB | 31,358,551 | 502,903,379
  Zipporah v3.0 | 28.7GB | 61,349,218 | 809,954,481
  BiCleaner v2.0 | 9.8GB | 27,702,949 | 456,442,715
  Zipporah v2.0 | 26.3GB | 55,849,341 | 740,849,699
  RAW v1.0 | 121GB | 4,591,582,415 | -
  Filtered v1.0 | 1.8GB | 36,351,593 | 476,398,001
  BiCleaner v1.2 | 1.8GB | 17,378,982 | 302,274,816
  Zipporah v1.2 | 3.3GB | 40,546,537 | 522,204,110

French (48,498 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 183GB | - | -
  Zipporah v3.0 | 15GB | 39,615,885 | 791,250,385
  BiCleaner v2.0 | 11.4GB | 37,823,646 | 600,029,874
  Zipporah v2.0 | 16.5GB | 37,743,429 | 754,045,036
  RAW v1.0 | 111GB | 4,235,725,445 | -
  Filtered v1.0 | 2.1GB | 27,622,881 | 546,401,428
  BiCleaner v1.2 | 2.6GB | 25,380,067 | 428,397,408
  Zipporah v1.2 | 3.9GB | 33,108,141 | 648,244,663

Spanish (36,211 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 111GB | - | -
  BiCleaner v3.0 | 6.4GB | 30,535,457 | 491,951,545
  Zipporah v3.0 | 8.6GB | 24,634,419 | 505,890,391
  BiCleaner v2.0 | 7.2GB | 25,473,946 | 412,852,386
  Zipporah v2.0 | 8.7GB | 21,286,014 | 437,009,844
  RAW v1.0 | 60.9GB | 2,368,243,619 | -
  Filtered v1.0 | 1.3GB | 16,001,341 | 325,745,201
  BiCleaner v1.2 | 1.8GB | 17,511,545 | 303,161,256
  Zipporah v1.2 | 2.2GB | 18,197,039 | 366,172,313

Italian (31,518 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 91.3GB | - | -
  BiCleaner v3.0 | 4.3GB | 14,439,190 | 308,244,744
  Zipporah v3.0 | 6.9GB | 1,368,691 | 269,587,549
  BiCleaner v2.0 | 5.0GB | 17,224,855 | 264,324,830
  Zipporah v2.0 | 6.7GB | 12,252,492 | 231,025,420
  RAW v1.0 | 45.1GB | 1,727,688,019 | -
  Filtered v1.0 | 593MB | 8,318,493 | 155,973,063
  BiCleaner v1.2 | 963MB | 11,790,134 | 147,402,459
  Zipporah v1.2 | 1.2GB | 12,065,631 | 212,026,083

Portuguese (18,887 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 50.6GB | - | -
  BiCleaner v3.0 | 2GB | 11,698,633 | 171,495,357
  Zipporah v3.0 | 1.1GB | 3,834,613 | 79,794,493
  BiCleaner v2.0 | 2.3GB | 9,740,600 | 148,240,776
  Zipporah v2.0 | 1.1GB | 3,454,349 | 72,592,387
  RAW v1.0 | 34.1GB | 1,357,911,799 | -
  Filtered v1.0 | 222MB | 2,809,381 | 57,392,721
  BiCleaner v1.2 | 611MB | 6,436,491 | 93,021,518
  Zipporah v1.2 | 366MB | 3,056,920 | 60,180,429

Dutch (17,887 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 47.8GB | - | -
  BiCleaner v3.0 | 2.2GB | 10,408,489 | 143,294,712
  Zipporah v3.0 | 1.1GB | 3,291,804 | 56,744,571
  BiCleaner v2.0 | 2.7GB | 9,342,505 | 127,895,866
  Zipporah v2.0 | 1.2GB | 2,922,152 | 50,402,591
  RAW v1.0 | 37.5GB | 1,506,033,538 | -
  Filtered v1.0 | 168MB | 2,560,472 | 45,149,412
  BiCleaner v1.2 | 581MB | 6,185,906 | 100,284,153
  Zipporah v1.2 | 276MB | 2,556,523 | 45,169,303

Polish (13,357 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 33.8GB | - | -
  BiCleaner v3.0 | 1.3GB | 6,806,796 | 94,612,131
  Zipporah v3.0 | 709MB | 1,748,865 | 29,205,973
  BiCleaner v2.0 | 1.6GB | 5,787,436 | 81,662,507
  Zipporah v2.0 | 692MB | 1,488,056 | 26,329,826
  RAW v1.0 | 24.4GB | 984,884,968 | -
  Filtered v1.0 | 85.7MB | 1,275,162 | 22,092,316
  BiCleaner v1.2 | 330MB | 3,270,262 | 55,467,253
  Zipporah v1.2 | 143MB | 1,269,300 | 22,079,132

Czech (14,335 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 33.9GB | - | -
  BiCleaner v3.0 | 913MB | 5,862,521 | 75,316,848
  Zipporah v3.0 | 4.1GB | 17,058,282 | 139,211,417
  BiCleaner v2.0 | 1.2GB | 5,488,589 | 69,182,264
  Zipporah v2.0 | 4.5GB | 15,846,424 | 123,222,290
  RAW v1.0 | 21.5GB | 818,784,053 | -
  Filtered v1.0 | 285MB | 10,020,250 | 78,743,955
  BiCleaner v1.2 | 237MB | 2,367,609 | 38,913,821
  Zipporah v1.2 | 529MB | 9,982,508 | 78,943,174

Romanian (9,335 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 22.7GB | - | -
  BiCleaner v3.0 | 592MB | 3,284,810 | 49,494,227
  Zipporah v3.0 | 621MB | 2,766,706 | 38,673,891
  BiCleaner v2.0 | 713MB | 2,684,189 | 39,958,916
  Zipporah v2.0 | 607MB | 2,537,851 | 34,596,458
  RAW v1.0 | 17.2GB | 635,709,587 | -
  Filtered v1.0 | 105MB | 2,459,752 | 32,800,110
  BiCleaner v1.2 | 159MB | 1,592,692 | 27,531,812
  Zipporah v1.2 | 151MB | 2,459,408 | 32,806,629

Finnish (11,028 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 20.6GB | - | -
  BiCleaner v3.0 | 693MB | 3,944,929 | 54,984,783
  Zipporah v3.0 | 432MB | 966,145 | 14,175,421
  BiCleaner v2.0 | 985MB | 3,632,447 | 49,751,376
  Zipporah v2.0 | 459MB | 831,170 | 12,692,508
  RAW v1.0 | 12.7GB | 504,805,915 | -
  Filtered v1.0 | 34.5MB | 544,335 | 8,420,501
  BiCleaner v1.2 | 177MB | 1,982,774 | 29,979,317
  Zipporah v1.2 | 59.8MB | 621,728 | 9,481,646

Latvian (3,557 websites)
  File | Size | Sentence Pairs | English Words
  RAW v3.0 | 7.7GB | - | -
  BiCleaner v3.0 | 218MB | 1,009,860 | 15,058,052
  Zipporah v3.0 | 125MB | 434,479 | 7,742,539
  BiCleaner v2.0 | 270MB | 1,133,362 | 16,744,306
  Zipporah v2.0 | 137MB | 386,447 | 6,066,997
  RAW v1.0 | 4.9GB | 173,585,643 | -
  Filtered v1.0 | 16MB | 242,227 | 4,250,040
  BiCleaner v1.2 | 43.3MB | 406,742 | 6,995,228
  Zipporah v1.2 | 26.9MB | 241,546 | 4,247,908
Extra Languages in release v1.0

Russian (14,035 websites)
  File | Size | Sentence Pairs | English Words
  RAW v1.0 | 38GB | 1,078,819,759 | -
  Filtered v1.0 | 637MB | 12,061,155 | 157,061,045

  • The large drop-off between the sentence pairs of the RAW and the various FILTERED versions is due to deduplication and to removal of data caused by failures of earlier processing steps.
  • FILTERED v1.0 of the corpus is very rough; it is significantly refined in newer releases.
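The deduplication mentioned in the first note can be sketched as follows; the hashing key (and the absence of any text normalization) is an illustrative assumption, not the pipeline's exact recipe:

```python
import hashlib

def deduplicate(pairs):
    """Keep only the first occurrence of each (source, target) sentence
    pair. A toy sketch of the deduplication step; the real pipeline may
    normalize text before comparing."""
    seen, unique = set(), []
    for src, tgt in pairs:
        key = hashlib.sha1(("%s\t%s" % (src, tgt)).encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [("Hello.", "Hallo."), ("Hello.", "Hallo."), ("Bye.", "Tschüss.")]
print(len(deduplicate(pairs)))  # 2
```

Hashing the pair instead of storing it directly keeps memory bounded when scanning hundreds of millions of sentence pairs.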

NMT Comparison on WMT Data

These experiments compare WMT systems trained with and without ParaCrawl data.
The systems are shallow NMT models, trained with Marian, covering the language pairs {cs,de,fi,lv,ro} -> en.

Pair | BLEU WMT | BLEU ParaCrawl | Sentences WMT | Sentences ParaCrawl | Over-sampling (WMT)
fi-en | 21.7 | 24.2 | 2,634,433 | 624,058 | 1x
lv-en | 15.6 | 16.5 | 4,461,720 | 242,227 | 1x
ro-en | 29.2 | 32.4 | 608,320 | 2,459,752 | 4x
cs-en | 25.7 | 26.3 | 52,024,546 | 10,020,250 | 1x
de-en | 29.8 | 28.8 | 5,852,458 | 36,351,593 | 7x

Notes:

  • Scores are for a 4-way checkpoint ensemble (checkpoints at intervals of 30k)
  • BPE and truecasing models were trained on WMT16/17 data only
  • Sentence counts are after Moses cleaning (maximum 80 tokens)

Processing Pipeline

The corpus is created through a pipeline of processing steps. The software for these steps is either already open-source or will be released as open-source software in June 2018. Some pre-release code is available on our Github.

Web Site Identification

We ran the CLD2 language classifier on all of CommonCrawl. Domains were selected for crawling if they had significant amounts of text in at least two languages, and the size ratio between the languages had to be somewhat comparable.
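The selection logic can be sketched as follows, assuming a per-page table of (domain, language, text bytes) as produced by running a language classifier such as CLD2 over the crawl; the `min_bytes` and `max_ratio` thresholds are illustrative, not the project's actual values:

```python
from collections import Counter

def select_domains(page_langs, min_bytes=200_000, max_ratio=10.0):
    """Pick domains worth crawling: at least two languages with a
    significant amount of text, and language sizes within a comparable
    ratio. Thresholds are illustrative assumptions."""
    per_domain = {}
    for domain, lang, nbytes in page_langs:
        per_domain.setdefault(domain, Counter())[lang] += nbytes
    selected = []
    for domain, counts in per_domain.items():
        big = [n for n in counts.values() if n >= min_bytes]
        if len(big) >= 2 and max(big) / min(big) <= max_ratio:
            selected.append(domain)
    return selected

records = [
    ("example.com", "en", 900_000), ("example.com", "de", 400_000),
    ("blog.net", "en", 800_000), ("blog.net", "fr", 5_000),
]
print(select_domains(records))  # only example.com qualifies
```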


This release is limited to 11 languages, though we plan to cover all official EU languages.


For more on this effort, please see: "N-gram Counts and Language Models from the Common Crawl", by Christian Buck, Kenneth Heafield, Bas van Ooyen (LREC 2014).

Web Crawling

For web crawling we use the standard tool HTTrack. The raw crawl of the currently processed data set is about 100 TB compressed.

Text Extraction

For text extraction we use Bitextor, which was developed and is being refined by partners of the project. It converts each web page into a standard format, containing the original HTML markup and the stripped-out text. See the Bitextor site for more.
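As a rough illustration of the text-stripping half of this step (Bitextor itself also keeps the original HTML markup and handles many more formats), here is a minimal extractor using only the Python standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal visible-text stripper, a toy stand-in for Bitextor's
    extraction step."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(extract_text("<html><script>x()</script><p>Hello <b>world</b>!</p></html>"))
# Hello world !
```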

Document Alignment

Web pages in English and the foreign language are aligned using a method that matches machine translation of the foreign document with the English document and also uses the URL. It is based on the method described in "Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance" by Christian Buck and Philipp Koehn (WMT 2016) but has been refined since. Partners of the ParaCrawl project organized a shared task on document alignment in 2016, which describes these methods in "Findings of the WMT 2016 Bilingual Document Alignment Shared Task" by Christian Buck and Philipp Koehn (WMT 2016).
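A minimal sketch of the TF/IDF-weighted cosine matching at the core of this method (the real pipeline additionally machine-translates the foreign documents into English first and uses URL information; here the "translated" documents are simply given as English token lists):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF/IDF vectors for a list of tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0

def align(english_docs, translated_docs):
    """Pair each (machine-translated) foreign document with the most
    similar English document by TF/IDF-weighted cosine similarity."""
    vecs = tfidf_vectors(english_docs + translated_docs)
    en, tr = vecs[:len(english_docs)], vecs[len(english_docs):]
    return [max(range(len(en)), key=lambda i: cosine(t, en[i])) for t in tr]

en = [["cheap", "flights", "to", "rome"], ["pasta", "recipes", "from", "rome"]]
translated = [["cheap", "flights", "rome"]]
print(align(en, translated))  # [0]
```

Terms that appear in every document (like "rome" above) receive zero IDF weight, so the matching is driven by the more distinctive vocabulary.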

Sentence Alignment

For sentence alignment, we use the tool Hunalign.

Hunalign reports a score for each sentence pair, which we include in the raw release of the corpus.
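For illustration, here is a toy pair score loosely in the spirit of Hunalign, which combines dictionary evidence with sentence-length similarity; the equal weighting and the tiny dictionary are invented for this example and differ from Hunalign's actual model:

```python
def pair_score(src, tgt, dictionary):
    """Toy sentence-pair score: dictionary coverage of the source side
    plus a length-similarity term. Illustrative only."""
    s, t = src.lower().split(), tgt.lower().split()
    if not s or not t:
        return 0.0
    coverage = sum(1 for w in s if dictionary.get(w) in t) / len(s)
    length_sim = min(len(s), len(t)) / max(len(s), len(t))
    return 0.5 * coverage + 0.5 * length_sim

d = {"the": "das", "house": "haus"}
print(pair_score("the house", "das haus", d))                   # 1.0
print(pair_score("the house", "ich mag katzen sehr gerne", d))  # 0.2
```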

Sentence Filtering

Finally, we filter the raw corpus to remove noisy data. This is done in two variants, using the Zipporah and Bicleaner tools.


The Zipporah tool is described in "Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora" by Hainan Xu and Philipp Koehn (EMNLP 2017). The tool considers language fluency (measured by a language model) and translation adequacy (measured by a translation dictionary). It computes a score for each sentence pair, which can be used for filtering. This score is included in the raw release of the corpus. We use a threshold of 0 for the official filtered release of the corpus.


Bicleaner is based on a blend of generic rules (called rule-based precleaning, or "hardrules") and a classifier that uses probabilistic dictionaries. The classifier outputs a score and suggests whether to keep or discard each sentence pair. Based on sampling and manual inspection, we found that a threshold of >=0.7 is safe for most language combinations. The tool is still under development, and we plan to address some known problems in the coming weeks.
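With either tool, filtering reduces to thresholding a per-pair score. A sketch, assuming a simple src&lt;TAB&gt;tgt&lt;TAB&gt;score layout; the column layout is an assumption, not the exact output format of either tool:

```python
def filter_by_score(lines, threshold, score_col=2, sep="\t"):
    """Keep sentence pairs whose score meets the threshold.

    Thresholds from the text: 0 for Zipporah, 0.7 for Bicleaner.
    The TSV layout (source, target, score) is an illustrative assumption.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if float(fields[score_col]) >= threshold:
            yield line

scored = ["Hello.\tHallo.\t0.91\n", "Hello.\tasdf qwer\t0.12\n"]
print(list(filter_by_score(scored, 0.7)))  # keeps only the first pair
```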


Sentence filtering also includes basic steps such as removing empty lines, overly long sentences (more than 200 words), and sentence pairs with a highly mismatched number of words.
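These basic checks can be written directly; the length-ratio threshold used here for "highly mismatched" is an illustrative assumption, since the text does not specify the exact criterion:

```python
def basic_filter(src, tgt, max_len=200, max_ratio=3.0):
    """Basic cleaning: drop empty lines, overlong sentences (more than
    200 words), and pairs with highly mismatched word counts.
    max_ratio is an illustrative assumption."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False  # empty line on either side
    if len(s) > max_len or len(t) > max_len:
        return False  # too long
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
        return False  # highly mismatched number of words
    return True

print(basic_filter("hello world", "hallo welt"))  # True
print(basic_filter("", "hallo"))                  # False
```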


Note that a popular filtering step is to subsample the corpus for domain relevance, e.g., by a method like "Intelligent Selection of Language Model Training Data" by Robert C Moore and William Lewis (ACL 2010). We do not filter ParaCrawl for any domain, since it is a general-purpose corpus.

License

These data are released under this licensing scheme:


Notice and take down policy


Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:


  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Kenneth Heafield at the following email address: kheafiel+takedown at inf.ed.ac.uk.


Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2

Follow ParaCrawl on Github

Project Partners:

Other Contributors:

Provision of Web-Scale Parallel Corpora for Official European Languages

© Copyright 2018. All rights reserved.