About

Next Milestone

September 2021

Augmented Data Release including Manufactured Data and Implementation of Deferred Crawling

Continued Web-Scale Provision of Parallel Corpora for European Languages

This Action will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than 1 petabyte of compressed data). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences and to the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced.

Implementation schedule: October 2019 to September 2021

Completed

Broader Web-Scale Provision of Parallel Corpora for European Languages

This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a large collection of web pages, approximately 1 petabyte in size. The system will extract web pages in Hypertext Markup Language (HTML) as well as files in Portable Document Format (PDF), using embedded text where available and optical character recognition otherwise.

Implementation schedule: September 2018 to September 2020

Completed

Provision of Web-Scale Parallel Corpora for Official European Languages

The Action aims to develop parallel corpora (collections of translated text) for all official EU languages. Through web crawling, the Action will provide the CEF building block Automated Translation (CEF-AT) with the same type of large-scale data that is available to large commercial Machine Translation (MT) engines. By the end of the Action, parallel corpora for all 24 official languages will be made available to CEF-AT. For 8 languages (English, German, Spanish, French, Polish, Italian, Portuguese, and Czech), the parallel corpora will have more than 1 billion tokens. For the other 16 languages, the Action aims to collect more than 100 million tokens.

Implementation schedule: September 2017 to March 2019

All official EU languages and more

Multi-lingual crawling pipeline

Broader document format

Domain identification