ParaCrawl Synthesized Data Release
The synthesized data targets the COVID-19 domain. It is built from the ParaCrawl release 7.0 parallel corpus in combination with a COVID-19 glossary sourced from TICO-19. The synthesis tool takes an existing sentence pair and replaces a word and its translation with a glossary term and its translation, creating a new sentence pair. The tool uses word embeddings to assess word similarity, and word alignments computed by fast_align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis.
Assume that the COVID-19 glossary contains the following translation:
- Guangdong → Гуандун
The similarity model based on word embeddings may find the following similar translation pair in the corpus:
- Hubei → Hubei
Note that both ''Guangdong'' and ''Hubei'', as well as ''Гуандун'' and ''Hubei'', must be similar.
The existing parallel corpus may contain this word translation pair in the following sentence pair:
- Project monitoring and management are handled by our office in Hubei city .
- Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .
Note that the tool assumes that all data is already tokenized; it does not perform any additional pre-processing.
Based on all this information, the following synthetic sentence pair is generated:
- Project monitoring and management are handled by our office in Guangdong city .
- Мониторинг и управление проектом осуществляется в нашем офисе в городе Гуандун .
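
The substitution step in the example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation of the synthesis tool: the function names `parse_alignment` and `synthesize` are hypothetical, the alignment string is hand-written in the `i-j` (source index, target index) format that fast_align outputs, and the embedding-based similarity search that selects the Hubei pair is assumed to have already happened.

```python
def parse_alignment(line):
    """Parse an 'i-j' alignment line (0-based source-target token indices)."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def synthesize(src, tgt, alignment, corpus_pair, glossary_pair):
    """Replace an aligned corpus word pair with a glossary word pair.

    Hypothetical helper: returns the new (source, target) sentence pair,
    or None if the corpus pair is not found at any aligned position.
    Both sentences are assumed to be tokenized (space-separated tokens).
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    old_src, old_tgt = corpus_pair
    new_src, new_tgt = glossary_pair
    for i, j in parse_alignment(alignment):
        # Skip alignment points that fall outside either sentence.
        if i >= len(src_tokens) or j >= len(tgt_tokens):
            continue
        if src_tokens[i] == old_src and tgt_tokens[j] == old_tgt:
            src_tokens[i], tgt_tokens[j] = new_src, new_tgt
            return " ".join(src_tokens), " ".join(tgt_tokens)
    return None


src = "Project monitoring and management are handled by our office in Hubei city ."
tgt = "Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei ."
# Hand-written partial alignment for illustration only (an assumption);
# in practice this would come from fast_align.
alignment = "10-10 11-9 12-11"

result = synthesize(src, tgt, alignment,
                    corpus_pair=("Hubei", "Hubei"),
                    glossary_pair=("Guangdong", "Гуандун"))
if result is not None:
    print(result[0])
    print(result[1])
```

Restricting the replacement to aligned positions is what keeps the two sides in sync: a word is only substituted when its counterpart in the other language is known, so the new sentence pair remains a translation pair.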