ParaCrawl OpenSource Pipeline (Bitextor)
Bitextor is a tool for automatically harvesting bitexts from multilingual websites.
bleualign (C++ implementation) as an alternative sentence aligner.
ParaCrawl Corpus release v4.0
The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v4 is the final release for the Action "Provision of Web-Scale Parallel Corpora for Official European Languages" and covers all official EU languages (23 languages paired with English).
ParaCrawl Corpus v3.0
The v3.0 release adds six new languages: Bulgarian, Danish, Greek, Slovak, Slovenian and Swedish. More data has also been added for the previously released languages.
ParaCrawl Corpus v2.0
The v2.0 release adds six new languages: Irish, Croatian, Maltese, Lithuanian, Hungarian and Estonian. More data has also been added for the previously released languages.
ParaCrawl Corpus v1.2
The v1.2 release contains two new filtered versions of the corpus for each language: "ZIPPORAH v1.2" and "BICLEANER v1.2". For more details, see the following corpus size table.
ParaCrawl Corpus v1.0
The first version of the ParaCrawl corpus, released in January 2018, contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl that has much higher coverage of these selected websites than CommonCrawl. More details of the crawling and extraction process are given below.
Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics.
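As a rough illustration of metric-based filtering, the sketch below keeps only sentence pairs whose score reaches a threshold. The input layout (tab-separated source, target, score), the function name `filter_by_score`, and the threshold value are all assumptions made for this example; they do not describe the actual format of the released files or the cut-off used for the official "clean" versions.

```python
def filter_by_score(lines, threshold, score_column=2):
    """Return the lines whose score field is >= threshold.

    Assumes (hypothetically) one sentence pair per line, tab-separated,
    with a numeric quality score in column `score_column`.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            score = float(fields[score_column])
        except (IndexError, ValueError):
            # Skip lines with a missing or non-numeric score.
            continue
        if score >= threshold:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    sample = [
        "Hello world\tHallo Welt\t0.91\n",
        "Click here\tImpressum\t0.12\n",
        "Broken line without a score\n",
    ]
    for pair in filter_by_score(sample, threshold=0.5):
        print(pair)
```

A higher threshold yields a smaller, cleaner corpus; the right trade-off depends on the metric and the downstream task.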
The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.
ParaCrawl Corpus release v5.0
The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". Newly crawled data has been added, including data from the Internet Archive. Enhancements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size of release v4 for all official EU languages (23 languages paired with English).
In the proceedings of WMT 2019, release 3 of the corpus is used. For WMT 2018, the FILTERED v1.0 version of the released corpus was used.
To convert a TMX file to a tab-separated text file, download the TMXT tool.
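For readers who want to see what such a conversion involves, here is a minimal Python sketch using only the standard library. It is not the TMXT tool and makes no claim about how that tool works; the function names `tmx_to_rows` and `tmx_to_tsv`, the default language codes, and the exact-match handling of language tags are assumptions for illustration. It relies on the standard TMX layout, where each `tu` element holds one `tuv` per language with the text in a `seg` child.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks the language with xml:lang; older files may use a plain "lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) pairs from a TMX path or binary file object."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Collapse tabs and newlines so each pair fits on one TSV line.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the processed translation unit's children


def tmx_to_tsv(tmx_path, tsv_path, src_lang="en", tgt_lang="de"):
    """Write one 'source<TAB>target' line per translation unit."""
    with open(tsv_path, "w", encoding="utf-8", newline="") as out:
        for src, tgt in tmx_to_rows(tmx_path, src_lang, tgt_lang):
            out.write(f"{src}\t{tgt}\n")
```

The parser streams the file with `iterparse` rather than loading it whole, since TMX corpora can be large. Language codes are compared exactly after lowercasing, so a file tagged `en-US` would need `src_lang="en-us"`.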
|Extra Languages in release v1.0|
|Russian||14,035||RAW v1.0 FILTERED v1.0||
These experiments compare WMT systems with and without ParaCrawl. The systems are shallow NMT models, trained with Marian.
Neural machine translation (NMT) systems using Marian 5-layer transformer models were trained for 3 different scenarios:
The number of NMT systems for which we report results, a total of 58, was constrained in the different scenarios by the available languages in Europarl and TED test sets. From the original set of 23 language combinations in ParaCrawl5, Europarl covered 19 of them (not available for Irish, Croatian, Latvian and Maltese) and TED talks covered 20 of them (not available for Irish, Latvian and Maltese). This is why we do not report results for some language pairs in some scenarios.
|ParaCrawl5||Europarl v7||Europarl v7 + ParaCrawl5||ParaCrawl5||Europarl v7||Europarl v7 + ParaCrawl5|
Almost all ParaCrawl5 individual systems have better BLEU results than Europarl individual systems:
The combination of Europarl and a subset of ParaCrawl5 helps improve the results of the individual systems in almost all cases:
These data are released under this licensing scheme:
Notice and take down policy
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
November 2018, Website updated with Broader WebCrawl
February 2019, Integration of Initial Domain Identification Technology
June 2019, Integration of Processing of Broader Document Formats
September 2019, Inclusion of Data from Internet Archive in Data Release
September 2019, Data Release 1
March 2020, Data Release 2
August 2020, Final Code Release
August 2020, Domain Identification for Data Release 1
September 2020, Data Release 3
Broader Web-Scale Provision of Parallel Corpora for European Languages
© Copyright 2019. All rights reserved.
Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.