More parallel data

This section lists additional sources of parallel data that were produced using part or all of the ParaCrawl pipeline, or that use ParaCrawl data to create derived corpora.

Next Data Releases
March 2021
September 2021
Patent parallel corpora pairing English with Croatian, Norwegian (Bokmål), German, Polish, Spanish, and French. Icelandic might also be included.
EuroPat: Unleashing European Patent Translations

Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.

This Action will mine parallel corpora from patents by aggregating, aligning, and converting patent data. The alignment and cleaning modules in the ParaCrawl pipeline will be enhanced and used to carry out this Action.

The first release includes English-German (12.6M parallel sentences) and English-French (9.2M parallel sentences) corpora, built using information from the European Patent Organisation database to identify patents.

Implementation schedule: September 2010 to September 2021

  More info   Project website  Download the data
Completed
Last release
October 2020
Multi-parallel corpus by pivoting via English made from ParaCrawl data.
MultiParaCrawl v7.1

Parallel corpora from web crawls collected in the ParaCrawl project and further processed into a multi-parallel corpus by pivoting via English. This release provides only the additional language pairs that resulted from pivoting; the bitexts that include English are available from the ParaCrawl release.
Statistics for MultiParaCrawl v7.1:

  • 40 languages, 669 bitexts
  • total number of files: 40
  • total number of tokens: 10.14G
  • total number of sentence fragments: 505.48M
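
To illustrate what pivoting via English means, here is a minimal sketch in Python. It assumes the simplest possible rule: a sentence in language X and a sentence in language Y are paired when both are aligned to exactly the same English sentence. The function name `pivot_via_english` and the toy data are hypothetical, and the actual MultiParaCrawl processing may differ (for example in normalization, deduplication, or filtering).

```python
from collections import defaultdict


def pivot_via_english(en_x_pairs, en_y_pairs):
    """Derive X-Y sentence pairs from two English-centric bitexts.

    Assumption (not taken from the release notes): two sentences are
    paired when they are aligned to an identical English sentence.
    Each input is an iterable of (english, foreign) tuples.
    """
    # Index the X side by its English sentence.
    x_by_english = defaultdict(set)
    for english, x_sentence in en_x_pairs:
        x_by_english[english].add(x_sentence)

    # For every English sentence that also occurs in the Y bitext,
    # emit all X-Y combinations that share it.
    pivoted = set()
    for english, y_sentence in en_y_pairs:
        for x_sentence in x_by_english.get(english, ()):
            pivoted.add((x_sentence, y_sentence))
    return sorted(pivoted)


# Toy English-German and English-French bitexts (hypothetical data).
en_de = [("Good morning.", "Guten Morgen."), ("Thank you.", "Danke.")]
en_fr = [("Good morning.", "Bonjour."), ("See you soon.", "À bientôt.")]

de_fr = pivot_via_english(en_de, en_fr)
print(de_fr)
```

Only the English sentence shared by both toy bitexts yields a German-French pair, which matches the note above that the release contains just the language pairs that came out of pivoting.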

  OPUS website  Download the data