ParaCrawl Synthesized Data Release

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.

The synthesized data targets the COVID-19 domain. It is produced from the parallel corpus of ParaCrawl release 7.0 in combination with a COVID-19 glossary sourced from TICO-19. The synthesis tool takes an existing sentence pair and replaces a word and its translation with a term from the COVID-19 glossary and that term's translation, creating a new sentence pair. To identify the translations of words in the existing parallel corpus, the tool uses word embeddings to assess word similarity and word alignments computed by fast_align. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis.
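
As a rough illustration of how word similarity can be assessed with embeddings, the sketch below ranks candidate words by cosine similarity. It is not taken from the synthesis tool: the use of cosine similarity, the function names and the three-dimensional vectors are all assumptions made up for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only (not real embeddings).
toy_embeddings = {
    "Guangdong": [0.9, 0.1, 0.3],
    "Hubei": [0.8, 0.2, 0.3],
    "office": [0.1, 0.9, 0.2],
}

def most_similar(word, embeddings):
    """Return the other word whose vector is closest to that of `word`."""
    candidates = [w for w in embeddings if w != word]
    return max(
        candidates,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )

print(most_similar("Guangdong", toy_embeddings))
```

With these toy vectors, the place name closest to "Guangdong" is "Hubei" rather than "office". Real embeddings would be trained on large corpora and have far more dimensions.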

Process Description

Assume that the COVID-19 glossary contains the following translation pair:

  • Guangdong
  • Гуандун

The similarity model based on word embeddings may find the following similar translation pair in the corpus:

  • Hubei
  • Hubei

Note that both "Guangdong" and "Гуандун", as well as "Hubei" and "Hubei", must be similar.
The existing parallel corpus may contain this word translation pair in the following sentence pair:

  • Project monitoring and management are handled by our office in Hubei city .
  • Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .

Note that the tool assumes that all data is already tokenized; it does not perform any additional pre-processing.
Based on all this information, the following synthetic sentence pair is generated:

  • Project monitoring and management are handled by our office in Guangdong city .
  • Мониторинг и управление проектом осуществляется в нашем офисе в городе Гуандун .
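
The substitution step in this example can be sketched as follows. This is a minimal illustration and not the synthesis tool's actual code: the function names are hypothetical, and the alignment string is hand-written for this sentence pair in the `i-j` index-pair format (source index, target index, counted from 0) that fast_align writes.

```python
def parse_alignment(line):
    """Parse 'i-j i-j ...' into a list of (source_index, target_index) pairs."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]

def synthesize(src_tokens, tgt_tokens, alignment, old_pair, new_pair):
    """Swap an aligned word pair for a glossary pair.

    Returns the new (source, target) sentences, or None if the old pair
    is not aligned anywhere in this sentence pair.
    """
    old_src, old_tgt = old_pair
    new_src, new_tgt = new_pair
    out_src, out_tgt = list(src_tokens), list(tgt_tokens)
    replaced = False
    for i, j in alignment:
        # Replace only where the alignment links the two old words.
        if src_tokens[i] == old_src and tgt_tokens[j] == old_tgt:
            out_src[i], out_tgt[j] = new_src, new_tgt
            replaced = True
    if not replaced:
        return None
    return " ".join(out_src), " ".join(out_tgt)

# Input is assumed to be tokenized already, so splitting on spaces is enough.
src = "Project monitoring and management are handled by our office in Hubei city .".split()
tgt = "Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .".split()

# Hand-written alignment for illustration; a real one would come from the aligner.
alignment = parse_alignment("0-3 1-0 2-1 3-2 5-4 7-6 8-7 9-8 11-9 10-10 12-11")

result = synthesize(src, tgt, alignment, ("Hubei", "Hubei"), ("Guangdong", "Гуандун"))
print(result[0])
print(result[1])
```

Because the replacement is driven by the alignment links rather than by string search on each side independently, a word is only swapped when its counterpart in the other language is swapped too.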

For the first set of synthesized data, the following data has been made available:

  Language       Sentences
  Bulgarian      1,872,208
  Czech         28,872,079
  Danish         3,967,918
  Estonian         675,770
  Finnish        1,367,263
  Greek            541,089
  Icelandic        920,263
  Latvian          194,573
  Lithuanian     2,924,916
  Portuguese    43,204,196
  Romanian       1,282,807
  Russian          516,767
  Slovak           164,490
  Slovenian         74,015
  Swedish       16,526,006