Provision of Web-Scale Parallel Corpora for Official European Languages

ParaCrawl Corpus v1.2

The v1.2 release contains two new filtered versions of the corpus for each language: "ZIPPORAH v1.2" and "BICLEANER v1.2". For more details, see the corpus size table below.


The January 2018 release of the ParaCrawl corpus is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl that has much higher coverage of the selected websites than CommonCrawl. More details of the crawling and extraction process are given below.


Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of these metrics.


The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. This release used an existing toolchain, which will be refined throughout the project and expanded to cover all official EU languages (23 languages paired with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

Corpus Size

Each corpus is available for download in four versions: RAW, FILTERED v1.0, BICLEANER v1.2, and ZIPPORAH v1.2. Please note that the proceedings of WMT 2018 use the FILTERED v1.0 version of the released corpus.

German (49,656 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              121GB       4,591,582,415    -
    Filtered v1.0    1.8GB       36,351,593       476,398,001
    BiCleaner v1.2   1.8GB       17,378,982       302,274,816
    Zipporah v1.2    3.3GB       40,546,537       522,204,110

French (35,512 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              111GB       4,235,725,445    -
    Filtered v1.0    2.1GB       27,622,881       546,401,428
    BiCleaner v1.2   2.6GB       25,380,067       428,397,408
    Zipporah v1.2    3.9GB       33,108,141       648,244,663

Spanish (27,194 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              60.9GB      2,368,243,619    -
    Filtered v1.0    1.3GB       16,001,341       325,745,201
    BiCleaner v1.2   1.8GB       17,511,545       303,161,256
    Zipporah v1.2    2.2GB       18,197,039       366,172,313

Italian (21,940 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              45.1GB      1,727,688,019    -
    Filtered v1.0    593MB       8,318,493        155,973,063
    BiCleaner v1.2   963MB       11,790,134       147,402,459
    Zipporah v1.2    1.2GB       12,065,631       212,026,083

Portuguese (14,786 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              34.1GB      1,357,911,799    -
    Filtered v1.0    222MB       2,809,381        57,392,721
    BiCleaner v1.2   611MB       6,436,491        93,021,518
    Zipporah v1.2    366MB       3,056,920        60,180,429

Dutch (10,212 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              37.5GB      1,506,033,538    -
    Filtered v1.0    168MB       2,560,472        45,149,412
    BiCleaner v1.2   581MB       6,185,906        100,284,153
    Zipporah v1.2    276MB       2,556,523        45,169,303

Polish (10,212 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              24.4GB      984,884,968      -
    Filtered v1.0    85.7MB      1,275,162        22,092,316
    BiCleaner v1.2   330MB       3,270,262        55,467,253
    Zipporah v1.2    143MB       1,269,300        22,079,132

Czech (8,429 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              21.5GB      818,784,053      -
    Filtered v1.0    285MB       10,020,250       78,743,955
    BiCleaner v1.2   237MB       2,367,609        38,913,821
    Zipporah v1.2    529MB       9,982,508        78,943,174

Romanian (7,104 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              17.2GB      635,709,587      -
    Filtered v1.0    105MB       2,459,752        32,800,110
    BiCleaner v1.2   159MB       1,592,692        27,531,812
    Zipporah v1.2    151MB       2,459,408        32,806,629

Finnish (5,990 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              12.7GB      504,805,915      -
    Filtered v1.0    34.5MB      544,335          8,420,501
    BiCleaner v1.2   177MB       1,982,774        29,979,317
    Zipporah v1.2    59.8MB      621,728          9,481,646

Latvian (1,725 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              4.9GB       173,585,643      -
    Filtered v1.0    16MB        242,227          4,250,040
    BiCleaner v1.2   43.3MB      406,742          6,995,228
    Zipporah v1.2    26.9MB      241,546          4,247,908
Extra languages in release v1.0 (RAW and FILTERED v1.0 only)
Russian (14,035 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              38GB        1,078,819,759    -
    Filtered v1.0    637MB       12,061,155       157,061,045

Estonian (1,784 crawled websites)
    Version          File Size   Sentence Pairs   English Words
    Raw              4.4GB       191,183,197      -
    Filtered v1.0    74.7MB      1,298,103        13,134,231
  • The large drop-off in sentence pairs between the RAW and the various FILTERED versions is due to deduplication and to the removal of data affected by failures in earlier processing steps.
  • FILTERED v1.0 of the corpus is very rough and will be significantly refined in future releases.

NMT Comparison on WMT Data

These experiments compare WMT systems trained with and without ParaCrawl.
The systems are shallow NMT models, trained with Marian, covering the language pairs {cs,de,fi,lv,ro} -> en.

Pair    BLEU (WMT only)   BLEU (WMT+ParaCrawl)   Sentences (WMT)   Sentences (ParaCrawl)   Over-sampling (WMT)
fi-en   21.7              24.2                   2,634,433         624,058                  1x
lv-en   15.6              16.5                   4,461,720         242,227                  1x
ro-en   29.2              32.4                   608,320           2,459,752                4x
cs-en   25.7              26.3                   52,024,546        10,020,250               1x
de-en   29.8              28.8                   5,852,458         36,351,593               7x

Notes:

  • Scores are for a 4-way checkpoint ensemble, with checkpoints taken at intervals of 30k.
  • BPE and truecase models were trained on WMT16/17 data only.
  • Sentence counts are after Moses corpus cleaning (sentences up to 80 tokens).
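
The "Over-sampling (WMT)" column in the table above means that the WMT training data was repeated several times before being concatenated with ParaCrawl, so that the smaller in-domain data is not swamped by the much larger web-crawled data. A minimal sketch of this idea (the function name and toy data are illustrative, not from the actual training scripts):

    # Over-sample the smaller WMT corpus before concatenating with ParaCrawl.
    def build_training_corpus(wmt_pairs, paracrawl_pairs, oversample=7):
        # Repeat the WMT sentence pairs `oversample` times, then append ParaCrawl.
        return wmt_pairs * oversample + paracrawl_pairs

    wmt = [("ein satz", "a sentence")]
    paracrawl = [("noch ein satz", "another sentence")] * 5
    corpus = build_training_corpus(wmt, paracrawl, oversample=7)
    print(len(corpus))  # 12 = 1*7 + 5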

Processing Pipeline

The corpus is created through a pipeline of processing steps. The software for these steps is either already open source or will be released as open-source software in June 2018. Some pre-release code is available on our GitHub.

Web Site Identification

We ran the CLD2 language classifier on all of CommonCrawl. Domains were selected for crawling if they contained significant amounts of text in at least two languages, and the amounts of text in those languages had to be roughly comparable.
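
The following sketch shows the shape of this selection step, assuming per-domain language statistics have already been collected (e.g., by running CLD2 over each page). The thresholds MIN_BYTES and MAX_RATIO are illustrative assumptions, not the project's actual values:

    MIN_BYTES = 100_000   # "significant amount of text" (assumed threshold)
    MAX_RATIO = 10.0      # language sizes must be "somewhat comparable" (assumed)

    def select_domains(lang_bytes_by_domain, target_lang="de", pivot_lang="en"):
        """Return domains with significant, comparably sized text in both languages."""
        selected = []
        for domain, lang_bytes in lang_bytes_by_domain.items():
            en = lang_bytes.get(pivot_lang, 0)
            xx = lang_bytes.get(target_lang, 0)
            if en >= MIN_BYTES and xx >= MIN_BYTES:
                if max(en, xx) / min(en, xx) <= MAX_RATIO:
                    selected.append(domain)
        return selected

    stats = {
        "example.de": {"en": 2_000_000, "de": 3_500_000},
        "blog.example.com": {"en": 5_000_000, "de": 40_000},
    }
    print(select_domains(stats))  # -> ['example.de']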


This release is limited to 11 languages, though we plan to cover all official EU languages.


For more on this effort, please see: "N-gram Counts and Language Models from the Common Crawl", by Christian Buck, Kenneth Heafield, Bas van Ooyen (LREC 2014).

Web Crawling

For web crawling we use the standard tool HTTrack. The raw crawl of the currently processed data set is about 100 TB compressed.
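
For illustration only, a hedged sketch of driving HTTrack from Python. The flags shown (-O for the output directory, a URL filter to stay within the site, -q for non-interactive operation) are common HTTrack options; the project's actual crawler configuration is not documented here:

    import subprocess

    def crawl_site(domain, out_dir):
        # Mirror a single site with HTTrack (assumes httrack is on PATH).
        subprocess.run(
            ["httrack", f"http://{domain}/",
             "-O", out_dir,          # store the mirror here
             f"+*.{domain}/*",       # filter: stay within the site
             "-q"],                  # quiet, no interactive prompts
            check=True,
        )

    # crawl_site("example.com", "mirrors/example.com")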

Text Extraction

For text extraction we use Bitextor, which was developed and is being refined by partners of the project. It converts each web page into a standard format containing the original HTML markup and the stripped-out text. See the Bitextor site for more details.

Document Alignment

Web pages in English and the foreign language are aligned using a method that matches a machine translation of the foreign-language document against the English document, and that also exploits URL information. It is based on the method described in "Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance" by Christian Buck and Philipp Koehn (WMT 2016), but has been refined since. Partners of the ParaCrawl project organized a shared task on document alignment in 2016; these methods are described in "Findings of the WMT 2016 Bilingual Document Alignment Shared Task" by Christian Buck and Philipp Koehn (WMT 2016).
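
A minimal sketch of TF/IDF-weighted cosine matching in the spirit of Buck and Koehn (WMT 2016), using scikit-learn. The foreign documents are assumed to have been machine-translated into English already; the real pipeline additionally uses URL matching and enforces 1-1 assignment constraints:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    english_docs = ["the quick brown fox", "paracrawl releases parallel corpora"]
    translated_foreign_docs = ["parallel corpora released by paracrawl"]

    vectorizer = TfidfVectorizer()
    en_vecs = vectorizer.fit_transform(english_docs)
    fr_vecs = vectorizer.transform(translated_foreign_docs)

    sim = cosine_similarity(fr_vecs, en_vecs)   # one row per foreign document
    best = sim.argmax(axis=1)                   # greedy best English match
    for i, j in enumerate(best):
        print(f"foreign doc {i} -> english doc {j} (score {sim[i, j]:.2f})")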

Sentence Alignment

For sentence alignment, we use the tool Hunalign.

Hunalign reports a score for each sentence pair, which we include in the raw release of the corpus.
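
A hedged sketch of running Hunalign and collecting these scores. It assumes the hunalign binary is on PATH, that each output line holds a source index, a target index, and an alignment score (Hunalign's ladder-style output), and that an empty dictionary file (here called null.dic) is acceptable when no bilingual dictionary is available; check the Hunalign documentation before relying on this format:

    import subprocess

    def align(dic, src_file, tgt_file):
        # Run hunalign on two sentence-segmented files and yield
        # (source index, target index, score) triples.
        out = subprocess.run(
            ["hunalign", dic, src_file, tgt_file],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            src_idx, tgt_idx, score = line.split("\t")
            yield int(src_idx), int(tgt_idx), float(score)

    # for s, t, score in align("null.dic", "doc.de.sents", "doc.en.sents"):
    #     print(s, t, score)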

Sentence Filtering

Finally, we filter the raw corpus to remove noisy data. This is done in two different variations, using the Zipporah and Bicleaner tools.


The Zipporah tool is described in "Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora" by Hainan Xu and Philipp Koehn (EMNLP 2017). The tool considers language fluency (measured by a language model) and translation adequacy (measured by a translation dictionary). It computes a score for each sentence pair that can be used for filtering. This score is included in the raw release of the corpus. We use a threshold of 0 for the official filtered release of the corpus.
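
To illustrate the shape of the idea (this is not Zipporah's actual implementation; the placeholder language model, toy dictionary, and weights are invented), each sentence pair receives one score combining fluency and adequacy, and pairs scoring below the threshold are dropped:

    TOY_DICT = {("hund", "dog"), ("katze", "cat"), ("haus", "house")}

    def lm_logprob(sentence):
        # Placeholder for a real language model score (fluency).
        n = len(sentence.split())
        return -abs(n - 15) / 15.0

    def adequacy(src, tgt):
        # Fraction of source words with a dictionary translation in the target.
        src_words, tgt_words = src.lower().split(), set(tgt.lower().split())
        hits = sum(1 for w in src_words if any((w, t) in TOY_DICT for t in tgt_words))
        return hits / max(len(src_words), 1)

    def zipporah_like_score(src, tgt):
        fluency = lm_logprob(src) + lm_logprob(tgt)
        return fluency + 2.0 * adequacy(src, tgt)   # weights are arbitrary here

    print(zipporah_like_score("der hund und die katze", "the dog and the cat"))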


Bicleaner is based on a blend of generic rules (called rule-based pre-cleaning, or "hardrules") and a classifier that uses probabilistic dictionaries. The classifier assigns a score and suggests whether to keep or discard each sentence pair. Based on sampling and manual inspection, we found that a score >= 0.7 is a safe threshold for most language combinations. The tool is still under development, and we plan to introduce improvements for known problems in the following weeks.
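
Applying that threshold is a one-pass filter. The sketch below assumes a tab-separated file where each line carries a sentence pair with the classifier score in the last column; the exact column layout depends on how Bicleaner was invoked:

    import sys

    THRESHOLD = 0.7  # the safe threshold mentioned above

    def keep_clean(lines, threshold=THRESHOLD):
        # Yield only lines whose final (score) column meets the threshold.
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if float(fields[-1]) >= threshold:
                yield line

    # Usage: python filter_bicleaner.py < scored.tsv > clean.tsv
    if __name__ == "__main__":
        sys.stdout.writelines(keep_clean(sys.stdin))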


Sentence filtering also includes basic steps such as removing empty lines, overly long sentences (more than 200 words), and sentence pairs with a highly mismatched number of words.
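
These basic rules fit in a few lines. The 200-word limit comes from the text above; the 1:6 length-ratio bound is an illustrative stand-in for "highly mismatched number of words", not the pipeline's actual value:

    MAX_WORDS = 200      # from the text above
    MAX_LEN_RATIO = 6.0  # assumed bound for mismatched lengths

    def keep_pair(src, tgt):
        ns, nt = len(src.split()), len(tgt.split())
        if ns == 0 or nt == 0:                  # drop empty lines
            return False
        if ns > MAX_WORDS or nt > MAX_WORDS:    # drop overly long sentences
            return False
        if max(ns, nt) / min(ns, nt) > MAX_LEN_RATIO:  # drop mismatched pairs
            return False
        return True

    print(keep_pair("ein kurzer satz", "a short sentence"))  # True
    print(keep_pair("wort", " ".join(["word"] * 50)))        # False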


Note that a popular filtering step is to subsample the corpus for domain relevance, e.g., with a method like "Intelligent Selection of Language Model Training Data" by Robert C. Moore and William Lewis (ACL 2010). We do not filter ParaCrawl for any particular domain, since it is a general-purpose corpus.

License

These data are released under this licensing scheme:



Notice and take down policy


Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:


  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Kenneth Heafield at the following email address: kheafiel+takedown at inf.ed.ac.uk.


Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2

Follow ParaCrawl on GitHub

Project Partners:

Other Contributors:


© Copyright 2018. All rights reserved.