Provision of Web-Scale Parallel Corpora for Official European Languages

News

 

Sep 27, 2018

 

ParaCrawl corpus release v2.0

 

The second version of the ParaCrawl corpus has been released. It contains parallel corpora for 17 languages paired with English. Six new languages are added in the v2 release, namely Irish, Croatian, Maltese, Lithuanian, Hungarian and Estonian. For the previously released languages (German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian and Finnish), more data has been added to the corpus. For each language, two versions of the corpus are released, each produced with a different cleaning tool: Bicleaner or Zipporah. The ParaCrawl corpus is crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl.

 

Corpus size and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

 

The source code of the ParaCrawl open-source pipeline (Bitextor) is also available on GitHub.

 

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

 

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

 

 

Mar 7, 2018

 

Meet ParaCrawl at AMTA Technology Forum!

 

Prompsit, one of our partners, will attend and exhibit on behalf of ParaCrawl at the next AMTA Conference in Boston (17-21 March 2018). The exhibition is part of the Technology Forum organised within the AMTA Conference, which will take place on 18 March 2018 from 12:30 to 17:30.

 

By visiting us at AMTA’s Technology Forum:

  • you will learn more about the 11 parallel corpora that we have already released
  • you will see a live demo of some of the tools that we will soon release: Bicleaner, a web-based TMX cleaner, and KEOPS, an evaluation toolkit for parallel sentences.

 

Come and visit us at AMTA’s Technology Forum for free!
If you are coming only for the Technology Forum, you just need to select Complimentary Registration on the AMTA registration site.

 

 

Jan 14, 2018

 

1st corpus release for ParaCrawl

 

The first version of the ParaCrawl corpus has been released. It contains parallel corpora for 11 languages paired with English, namely German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish and Latvian, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl.

 

Corpus size, BLEU score evaluations and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

 

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

 

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

 

 

Nov 17, 2017

 

1st Stakeholder Board Call

 

The 1st call between the ParaCrawl consortium and the Stakeholder Board was held in November 2017.

Corpus Releases/Languages: The stakeholders expressed interest in the forthcoming ParaCrawl parallel corpus releases in the 24 official EU languages, and look forward to the first corpus release v1 in January 2018, comprising 11 language pairs, namely English to/from German, Spanish, French, Polish, Italian, Portuguese, Czech, Romanian, Latvian, Finnish and Dutch. For the second corpus release v2, comprising 18 languages and due in June 2018, the board suggested that the consortium add 6 small, low-resourced and/or complex languages such as Irish, Hungarian, Lithuanian, Estonian, Croatian and Maltese.

Domains: Stakeholders further identified a need for data in the domains of Culture, Hospitality and Travel, as well as for informal, non-domain-specific language data (e.g. for services such as online dispute resolution). Since a domain can be defined by a corpus of in-domain text, the crawled corpus can be filtered with a language model built from such text to extract domain-specific data.
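
As an illustration only, the idea of filtering a corpus with a language model can be sketched as follows. This is not the ParaCrawl toolchain: it assumes a tiny add-one-smoothed unigram model trained on a hypothetical in-domain seed text, and keeps the sentence pairs whose source side has a low per-token cross-entropy under that model. The function names, the seed sentences, the toy corpus and the threshold are all made up for the example; a real setup would use a stronger language model and a tuned threshold.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a function giving the add-one-smoothed log-probability of a token."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(sentence, logprob):
    """Average negative log-probability per token (lower = closer to the domain)."""
    tokens = sentence.lower().split()
    if not tokens:
        return float("inf")
    return -sum(logprob(t) for t in tokens) / len(tokens)


def filter_by_domain(pairs, seed_sentences, threshold):
    """Keep (source, target) pairs whose source side scores below the threshold."""
    logprob = train_unigram_lm(seed_sentences)
    return [p for p in pairs if cross_entropy(p[0], logprob) < threshold]


# Hypothetical in-domain seed text (travel) and a toy crawled corpus.
seed = ["book a hotel room near the beach", "the hotel offers a guided city tour"]
corpus = [
    ("the hotel room has a view of the beach", "das Hotelzimmer hat Blick auf den Strand"),
    ("quarterly revenue exceeded analyst forecasts", "der Quartalsumsatz übertraf die Prognosen"),
]
# The threshold is an assumption chosen for this toy data.
travel_only = filter_by_domain(corpus, seed, threshold=3.0)
```

With this toy data the travel sentence shares most of its words with the seed text and is kept, while the finance sentence consists only of unseen words and is dropped.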


Use cases: The board envisages a number of use cases for the released data, namely training MT engines, integrating translation into search engines, widening the scope of translation services to more verticals, data mining, and enhancing existing language/translation databases.

IP/Copyright issues: In order to avoid IP/copyright issues, all data will be packaged and released across websites without reference to the sources, anonymised, randomised and sorted by quality score, so that the most a user can obtain is a single sentence (without context). The board pointed out that it would nevertheless be useful to have translations in context; as the consortium keeps the information about the data sources (websites) internally for research purposes, different corpora for researchers, companies and governmental institutions could potentially be released in the future.
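
A minimal sketch of such a packaging step is shown below, purely as an illustration. The record layout (`url`, `src`, `tgt`, `score`) and the function name are hypothetical and not taken from the ParaCrawl pipeline: the source field is dropped, the pairs are shuffled with a seeded random generator, and a stable sort by quality score then leaves equally scored pairs in random order.

```python
import random


def package_for_release(records, seed=0):
    """Strip source info, shuffle, then order sentence pairs by quality score."""
    # Keep only the sentence pair and its score; the source URL is dropped.
    stripped = [(r["src"], r["tgt"], r["score"]) for r in records]
    # Shuffle so that neighbouring sentences from one page are separated.
    rng = random.Random(seed)
    rng.shuffle(stripped)
    # sorted() is stable, so equally scored pairs stay in shuffled order.
    return sorted(stripped, key=lambda pair: pair[2], reverse=True)


# Hypothetical crawled records; URLs and scores are made up.
records = [
    {"url": "http://example.com/a", "src": "Hello.", "tgt": "Hallo.", "score": 0.9},
    {"url": "http://example.com/a", "src": "Thanks.", "tgt": "Danke.", "score": 0.7},
    {"url": "http://example.org/b", "src": "Welcome.", "tgt": "Willkommen.", "score": 0.9},
]
released = package_for_release(records)
```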

 

 

Sep 29, 2017

 

The kickoff meeting of the ParaCrawl EU project took place in Alicante, Spain.

 

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2


© Copyright 2018. All rights reserved.