Provision of Web-Scale Parallel Corpora for Official European Languages

Stakeholder Board

The ParaCrawl stakeholder board is a small but influential group of experts in the field. It comprises representatives of large international organizations, namely the European Commission (eTranslation), the World Trade Organization (WTO) and the World Intellectual Property Organization (WIPO), and of global industry, namely Mozilla (Berlin), Facebook (Paris), TransPerfect (Barcelona) and UTH International (Shanghai).

The board advises the consortium on the usefulness of the web sites it identifies for crawling and the types of data it collects, on how to improve data collection (mainly for training MT engines), on the quality of the data, on other useful multilingual web sites to crawl (targeted crawls), and on related issues.

Throughout the project, calls between the stakeholder board and the consortium are scheduled, following ParaCrawl's milestones.

Updates

Nov 17, 2017

1st Stakeholder Board Call

The 1st call between the ParaCrawl consortium and the stakeholder board was held in November 2017.

Corpus Releases/Languages: The stakeholders expressed interest in the forthcoming ParaCrawl parallel corpus releases in the 24 official EU languages and look forward to the first corpus release, v1, in January 2018, comprising 11 language pairs, namely English to/from German, Spanish, French, Polish, Italian, Portuguese, Czech, Romanian, Latvian, Finnish and Dutch. For the second corpus release, v2, comprising 18 languages and due in June 2018, the board suggested that the consortium add six small, low-resourced and/or complex languages, such as Irish, Hungarian, Lithuanian, Estonian, Croatian and Maltese.

Domains: Stakeholders further identified a need for data in the domains of Culture, Hospitality and Travel, as well as for informal, non-domain-specific language data (e.g. for services such as online dispute resolution). Since a domain can be defined by a corpus of in-domain text, the crawled data can be filtered with a language model to extract domain-specific data.
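As an illustration of language-model filtering, the sketch below scores each candidate sentence by the difference between its cross-entropy under an in-domain model and under a general model, and keeps the sentences that look more in-domain. This is a common data-selection heuristic and is only an assumption here; the call notes do not say which method the consortium would use. The unigram model, the function names and the threshold are all hypothetical simplifications.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model over lowercased whitespace tokens."""

    def __init__(self, sentences, vocab):
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen words
        self.counts = Counter(t for s in sentences for t in s.lower().split())
        self.total = sum(self.counts.values())

    def cross_entropy(self, sentence):
        """Average negative log2-probability per token."""
        tokens = sentence.lower().split()
        if not tokens:
            return float("inf")
        logp = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -logp / len(tokens)


def select_in_domain(candidates, in_domain, general, threshold=0.0):
    """Keep candidates whose cross-entropy difference is below the threshold.

    A lower score means the sentence is better explained by the in-domain
    model than by the general model. The threshold is an illustrative value.
    """
    vocab = {t for s in in_domain + general for t in s.lower().split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_gen = UnigramLM(general, vocab)
    scored = [
        (lm_in.cross_entropy(s) - lm_gen.cross_entropy(s), s) for s in candidates
    ]
    return [s for score, s in sorted(scored) if score < threshold]


travel = ["book a hotel room", "the hotel offers breakfast",
          "cheap flights and hotel deals"]
legal = ["the court ruled on the patent",
         "the parliament adopted the regulation",
         "the committee voted on the budget"]
crawled = ["hotel room with breakfast", "the patent regulation"]
selected = select_in_domain(crawled, travel, legal)
```

In practice the unigram models would be replaced by stronger n-gram or neural language models, and the threshold would be tuned on held-out in-domain data.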


Use cases: The board envisages a number of use cases for the released data, namely training MT engines, integrating translation into search engines, widening the scope of translation services to more verticals, data mining, and enhancing existing language/translation databases.

IP/Copyright issues: To avoid IP/copyright issues, all data will be packaged across websites and released without reference to the sources. It will be anonymised, randomised and sorted by quality score, so that the largest unit a user can obtain is a single sentence (without context). The board pointed out, however, that translations in context would be useful. As the consortium keeps the data source information (websites) internally for research purposes, separate corpora for researchers, companies and governmental institutions could potentially be released in the future.
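The packaging step described above can be pictured with a minimal sketch. It assumes each crawled sentence pair carries a source URL and a numeric quality score; the record layout, field names and function name are hypothetical and not taken from the ParaCrawl software. The source is dropped, the pairs are shuffled so that the original document order cannot be recovered, and the result is sorted by score.

```python
import random
from typing import NamedTuple


class CrawledPair(NamedTuple):
    source_url: str  # kept internally only, never released (assumed field)
    src: str         # source-language sentence
    tgt: str         # target-language sentence
    score: float     # hypothetical quality score, higher is better


def package_for_release(pairs, seed=0):
    """Return (src, tgt, score) tuples with no source information."""
    # Drop the source URL so released pairs cannot be traced to a website.
    released = [(p.src, p.tgt, p.score) for p in pairs]
    # Shuffle first so pairs with equal scores lose their document order.
    random.Random(seed).shuffle(released)
    # Stable sort by quality score, best pairs first.
    released.sort(key=lambda r: r[2], reverse=True)
    return released


pairs = [
    CrawledPair("http://a.example/1", "Hello", "Hallo", 0.9),
    CrawledPair("http://a.example/1", "Thanks", "Danke", 0.5),
    CrawledPair("http://b.example/x", "Good morning", "Guten Morgen", 0.9),
]
release = package_for_release(pairs, seed=1)
```

Because the sort is stable, the shuffle only affects the relative order of pairs with identical scores, which is where neighbouring sentences from the same page would otherwise stay adjacent.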

 

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2

© Copyright 2018. All rights reserved.