More monolingual and parallel data
This section covers additional sources of parallel or monolingual data produced with parts of the ParaCrawl pipeline or with the whole pipeline, as well as corpora derived from ParaCrawl data.
South and East Asian Languages bonus corpus out!
The bonus corpus includes 9 language pairs: Hindi, Nepali, Burmese, Thai, Lao, Khmer, Vietnamese, Indonesian, and Korean, each paired with English. Building on the work of the ParaCrawl project, it follows the same general sequence of steps, although different technologies are applied: targeted web crawling, document alignment (using efficient Marian neural models distilled from NLLB), sentence alignment (using Vecalign), and parallel corpus filtering (using LASER 3), including deduplication and a novel toxicity filter.
Downloading and formats
The new parallel corpora can be downloaded from the bonus section in the Releases tab and are provided in three different formats: txt (only parallel sentences, one per line), tmx (filtered parallel sentences and metadata in XML format) and dedup (filtered parallel sentences and metadata in a tab-separated file format).
More info at: https://www2.statmt.org/wmt24/pdf/2024.wmt-1.132.pdf
Polish-Czech bonus corpus released!
The corpus has 24M bilingual segments. Text was extracted from HTML, classified by language, split, deduplicated, and double-side filtered with Bicleaner AI.
Downloading and formats
The Polish-Czech parallel corpus can be downloaded from the bonus section in the Releases tab and is provided in three different formats: txt (only parallel sentences, one per line), tmx (filtered parallel sentences and metadata in XML format) and raw (unfiltered parallel sentences and metadata).
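To work with the tmx format, here is a minimal Python sketch that extracts sentence pairs from a TMX file. It assumes the standard TMX structure (<tu> translation units containing <tuv xml:lang="..."> variants with <seg> text); the file name and language codes are placeholders, not the actual release file names.

#!/usr/bin/env python3
# Minimal sketch: extract sentence pairs from a TMX file.
# Assumes standard TMX structure (<tu> units with <tuv xml:lang> and <seg>);
# the file name and language codes below are placeholders.
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def translation_units(path):
    for _, elem in ET.iterparse(path):
        if elem.tag != 'tu':
            continue
        pair = {}
        for tuv in elem.iter('tuv'):
            lang = tuv.get(XML_LANG) or tuv.get('lang')
            seg = tuv.find('seg')
            if lang is not None and seg is not None and seg.text:
                pair[lang.lower()] = seg.text.strip()
        elem.clear()  # keep memory use low on large files
        yield pair

for pair in translation_units('pl-cs.tmx'):
    if 'pl' in pair and 'cs' in pair:
        print(pair['pl'] + '\t' + pair['cs'])

The same approach works for the tmx files of the South and East Asian bonus corpus; only the language codes change.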
Language Classified Web Text from ParaCrawl
This is a release of text from the Internet Archive and from targeted crawls performed in the ParaCrawl project. Text was extracted from web pages, with document-level language classification for 189 languages provided by a fastText model.
There is a separate directory for each data source: wide00016, wide00015, wide00006, GWB-20191109192916, hieu, marta, and philipp. The large crawls from the Internet Archive are wide00016, wide00015, and wide00006. GWB-20191109192916 was extracted from a larger collection at the Internet Archive by matching URLs containing the language codes ga, no, nb, nn, hr, and is that had a corresponding URL containing en, though it incidentally collected some other languages too. The hieu and philipp crawls targeted sites with a mix of languages, and marta targeted the languages of Spain. The project also used CommonCrawl, which is already public.
The special language code und contains documents that did not score at least 0.5 for any one language according to the classifier. It is mainly provided so that the set of documents is complete and you can run your own language identifier over the entire corpus. Currently wide00015 does not have an und directory, pending supercomputer time; the rest are complete.
A complete list of files is in files. To download all files, run:
wget https://web-language-models.s3.us-east-1.amazonaws.com/paracrawl/monolingualv9/files
wget -nH -x --cut-dirs=2 -i files
Within each collection, there are separate directories by ISO 639-3 language code, each containing a text.gz and a url.gz. Each line in text.gz is a base64-encoded document, while the corresponding line in url.gz is the URL. Once decoded, the document is UTF-8 text extracted from a web page, with newlines added at block element boundaries; we recommend running your own sentence splitter configured to insert but not remove newlines.
Example to just print text: zcat text.gz | base64 -d
A longer example in Python:
#!/usr/bin/env python3
import gzip
import base64
# Each line of text.gz is a base64-encoded document; the corresponding
# line of url.gz is its URL.
with gzip.open("text.gz") as texts, gzip.open("url.gz") as urls:
    for encoded, url in zip(texts, urls):
        text = base64.standard_b64decode(encoded).decode('utf-8')
        url = url.decode('utf-8')
        print("Contents of " + url)
        print(text)
However, for performance we recommend using pigz for decompression and GNU parallel to parallelize over documents.
Next processing step advice
Run a sentence splitter, then you will probably want to classify language at the sentence level, since English UI elements tend to creep into every language.
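As a rough illustration of that step (not part of the release tooling), the sketch below splits documents on the newlines inserted at block boundaries, applies a naive regex-based sentence split, and classifies each sentence with fastText's publicly available lid.176.bin language identification model; the target language code and file names are placeholder assumptions.

#!/usr/bin/env python3
# Sketch only: naive sentence split plus sentence-level language ID.
# Assumes fastText's public lid.176.bin model is in the working directory;
# a real pipeline should use a proper sentence splitter for the language.
import base64
import gzip
import re
import fasttext

model = fasttext.load_model("lid.176.bin")
keep_lang = "de"      # placeholder: the language this directory should contain
min_confidence = 0.5  # mirrors the 0.5 cutoff used for the und bucket above

def sentences(document):
    # The extracted text already has newlines at block element boundaries;
    # additionally split on sentence-final punctuation as a crude approximation.
    for line in document.split("\n"):
        for sent in re.split(r"(?<=[.!?])\s+", line.strip()):
            if sent:
                yield sent

with gzip.open("text.gz") as texts:
    for encoded in texts:
        document = base64.standard_b64decode(encoded).decode("utf-8")
        for sent in sentences(document):
            labels, probs = model.predict(sent)
            lang = labels[0].replace("__label__", "")
            if lang == keep_lang and probs[0] >= min_confidence:
                print(sent)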
Monolingual data from ParaCrawl V8: English
The corpus has 96,470,655,818 lines, 1,337,127,886,176 tokens, and 9,153,226,323,307 characters of English. Text was extracted from HTML, classified, split, and deduplicated.
The corpus is available as 128 files, split by the hash of the line. The first and last URLs are:
https://neural.mt/data/paracrawl8-mono/en-000.gz
https://neural.mt/data/paracrawl8-mono/en-127.gz
#!/bin/bash
for i in {0..127}; do
  wget https://neural.mt/data/paracrawl8-mono/en-$(printf "%03i" $i).gz
done
Files are hosted on the Internet Archive. Due to their 1 TB limit per directory, there are redirects to the appropriate directory.
Source data
This is all the English data used for ParaCrawl release 8, which is based on the following crawls:
- Internet Archive: wide00006, wide00015, and pages with en, is, hr, no, and ga in their URL.
- CommonCrawl: 2016-30, 2017-30, 2018-30, 2019-18, and 2019-35.
- Targeted: Philipp Koehn crawled domains that have a mix of multilingual content based on language classification in CommonCrawl. Marta Bañón aimed for sites in Basque, Catalan, Galician, and Spanish but picked up some English on the way. Hieu Hoang crawled sites that produced parallel sentences in earlier generations of ParaCrawl.
More languages
Coming, though ParaCrawl release 9 processing takes priority. That will have even more data!
EuroPat: Unleashing European Patent Translations
Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.
This Action will mine parallel corpora from patents by aggregating, aligning, and converting patent data. Alignment and cleaning modules in the ParaCrawl pipeline will be enhanced and used to carry out this Action.
The first release included English-German (12.6M parallel sentences) and English-French (9.2M parallel sentences) corpora, built using information from the European Patent Organisation database to identify patents.
The second release includes 6 language combinations from various sources: English-German (15.5M parallel sentences), English-Spanish (44.4M parallel sentences), English-French (12M parallel sentences), English-Croatian (75k parallel sentences), English-Norwegian (4M parallel sentences) and English-Polish (89k parallel sentences).
Implementation schedule: September 2019 to September 2021
More info: Project website · Download the data
MultiParaCrawl v7.1
Parallel corpora from web crawls collected in the ParaCrawl project and further processed into a multi-parallel corpus by pivoting via English. Only the additional language pairs that came out of pivoting are provided; the bitexts involving English are available from the ParaCrawl release itself. A minimal sketch of the pivoting idea follows the statistics below. Stats about the data in MultiParaCrawl v7.1:
- 40 languages, 669 bitexts
- total number of files: 40
- total number of tokens: 10.14G
- total number of sentence fragments: 505.48M
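For intuition, here is a minimal sketch of pivoting via English: two English-aligned bitexts are joined on their English side to produce a new language pair. This illustrates the general idea only, not the exact procedure used to build MultiParaCrawl, and the tab-separated file names and format are assumptions for the example.

#!/usr/bin/env python3
# Sketch of pivoting via English: join a de-en and a fr-en bitext on the
# English side to produce de-fr pairs. Illustration only; the file names
# and tab-separated format are assumptions, not the MultiParaCrawl tooling.
from collections import defaultdict

def read_bitext(path):
    # One sentence pair per line: foreign<TAB>english
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                yield parts[0], parts[1]

# Index German sentences by their English translation.
en_to_de = defaultdict(set)
for de, en in read_bitext("de-en.txt"):
    en_to_de[en].add(de)

# Emit a de-fr pair whenever a French sentence shares an English pivot sentence.
with open("de-fr.txt", "w", encoding="utf-8") as out:
    for fr, en in read_bitext("fr-en.txt"):
        for de in en_to_de.get(en, ()):
            out.write(de + "\t" + fr + "\n")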