Provision of Web-Scale Parallel Corpora for Official European Languages

Related Projects

Large-Scale Parallel Web Crawl

This is an ongoing project to crawl a large number of websites. The current corpus is fairly noisy and covers only a few language pairs. The first official release of the corpus was in December 2016.

N-gram counts and language models from the CommonCrawl

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate.
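The two improvements named above can be illustrated with a minimal sketch. It assumes whitespace tokenization and exact line-level deduplication; neither detail is stated here, and the function names `dedupe_lines` and `count_ngrams` are hypothetical, not part of the released tooling.

```python
from collections import Counter
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def count_ngrams(lines: Iterable[str], n: int = 5) -> Counter:
    """Count whitespace-token n-grams per line; no minimum-count cutoff is applied."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


pages = [
    "all rights reserved by the site owner",
    "the quick brown fox jumps over the lazy dog",
    "all rights reserved by the site owner",  # repeated boilerplate line
]

raw_counts = count_ngrams(pages)
deduped_counts = count_ngrams(dedupe_lines(pages))
```

Without deduplication the boilerplate 5-grams are counted once per repetition; after deduplication each appears once, while rare 5-grams are still kept because no count threshold is used.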

ModernMT - Next Generation Machine Translation

MMT is a context-aware, incremental, and distributed general-purpose Neural Machine Translation technology. MMT is:
     - Simple to use, fast to train, and easy to scale with respect to domains, data, and users.
     - Trained by pooling all available project and customer data and translation memories in one folder.
     - Queried by providing the sentence to be translated and optionally some context text.

Tools

Bitextor: The automatic bitext generator

Bitextor is an automatic bitext generator that obtains its base corpora from the Internet. It works by downloading an entire website (applying a filter to keep only HTML files) and comparing every pair of files. It detects the language of each file and, using a set of heuristics (file size, HTML skeleton edit distance, format, etc.), tries to guess which files have the same content in different languages. Once it has identified the file pairs, it generates a bitext file in TMX format.
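One of the heuristics mentioned above, HTML skeleton edit distance, can be sketched as follows. This is an illustration of the idea only, not Bitextor's code: the names `skeleton`, `edit_distance`, and `looks_parallel`, and the `max_ratio` value of 0.2, are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Sequence


class TagSkeleton(HTMLParser):
    """Collect the sequence of opening and closing tags, ignoring text."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def skeleton(html: str) -> List[str]:
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def looks_parallel(html_a: str, html_b: str, max_ratio: float = 0.2) -> bool:
    """Treat two pages as candidate translations if their tag skeletons are close."""
    sa, sb = skeleton(html_a), skeleton(html_b)
    longest = max(len(sa), len(sb)) or 1
    return edit_distance(sa, sb) / longest <= max_ratio


en = "<html><body><h1>Welcome</h1><p>Hello world</p></body></html>"
es = "<html><body><h1>Bienvenido</h1><p>Hola mundo</p></body></html>"
other = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"
```

Here `en` and `es` share an identical tag skeleton and are accepted as a candidate pair, while `other` has a different structure and is rejected. A real system would combine this signal with the other heuristics (language, file size, format).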

Bicrawler: Create bitexts from multilingual websites

Bicrawler is a free web-based tool that lets you, in a couple of clicks, retrieve the content of a multilingual website and create a TMX file with all the parallel sentences available there. Bicrawler runs several steps to produce reliable output: it segments the text into sentences, aligns them, cleans them, and writes the sentence pairs in TMX format.
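To make the final step concrete, here is a minimal sketch of writing aligned sentence pairs as TMX using Python's standard library. It is not Bicrawler's code; the function name `build_tmx` and the header values (such as `creationtool="example-sketch"`) are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str) -> ET.Element:
    """Build a TMX 1.4 tree with one translation unit per sentence pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # placeholder value
        "creationtoolversion": "0.1",       # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


root = build_tmx([("Hello world.", "Hola mundo.")], "en", "es")
tmx_text = ET.tostring(root, encoding="unicode")
```

Each `tu` element holds one aligned pair, with one `tuv` per language and the sentence text inside `seg`.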

HTTrack

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.

hunalign – Sentence Aligner

hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).

zipporah

A fast and scalable data cleaning system for noisy web-crawled parallel corpora, described by Hainan Xu and Philipp Koehn (EMNLP 2017). The tool considers language fluency (measured by a language model) and translation adequacy (measured by a translation dictionary). It computes a score for each sentence pair, which can be used for filtering. This score is included in the raw release of the corpus. We use a threshold of 0 for the official filtered release of the corpus.
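Filtering the raw release by score can be sketched as below. The record layout `ScoredPair` and the function `filter_pairs` are hypothetical, and whether a score of exactly 0 is kept is an assumption (this sketch keeps it); the actual file format of the release is not described here.

```python
from typing import Iterable, Iterator, NamedTuple


class ScoredPair(NamedTuple):
    source: str
    target: str
    score: float  # per-pair cleaning score, as shipped with the raw corpus


def filter_pairs(pairs: Iterable[ScoredPair], threshold: float = 0.0) -> Iterator[ScoredPair]:
    """Keep sentence pairs whose score is at or above the threshold (assumed inclusive)."""
    for pair in pairs:
        if pair.score >= threshold:
            yield pair


raw = [
    ScoredPair("Good morning.", "Buenos días.", 1.3),
    ScoredPair("Click here", "asdf qwer zxcv", -4.2),
    ScoredPair("Thank you.", "Gracias.", 0.0),
]
kept = list(filter_pairs(raw))
```

Raising the threshold trades corpus size for cleanliness, so users of the raw release can pick a different cutoff than the one used for the filtered release.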

Project Milestones

September 2017, Kickoff meeting in Alicante, Spain

November 2017, Website up and running

January 2018, Release of Corpus v1

April 2018, Release of Corpus v1 in ELRC-SHARE

June 2018, Release of Corpus v2

June 2018, Release of Software v1

October 2018, Release of Corpus v3

March 2019, Release of Corpus v4, also in ELRC-SHARE

March 2019, Release of Software v2



© Copyright 2018. All rights reserved.