Bonus corpus: English - Ukrainian parallel corpus v1 release
A new parallel corpus covering English and Ukrainian was released in March 2022.
It is published as a bonus corpus on behalf of the ParaCrawl effort.
Three versions (Raw, clean TMX and clean TXT) have been produced.
The clean version contains more than 13M sentence pairs and 505M source tokens; these counts are identical for the TXT and TMX formats.
Please download them from the following links:
Corpus type | Sentence pairs | Source tokens | Link | Size |
English-Ukrainian, TXT version (TAB-separated files) | 13,354,365 | 505,831,880 | paracrawl-clean-en-uk.txt | 2.8G |
English-Ukrainian, TMX version (XML file) | 13,354,365 | 505,831,880 | paracrawl-clean-en-uk.tmx | 8.3G |
English-Ukrainian, RAW version | 235,700,383 | 5,832,658,894 | paracrawl-raw-en-uk | 15.3G |
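
As a rough illustration of how the two clean formats might be read, the Python sketch below streams sentence pairs from a TAB-separated file and from a TMX file. It rests on assumptions not stated in this announcement: that each TXT line holds the English sentence first and the Ukrainian sentence second, that the TMX file follows the usual `tu` / `tuv` / `seg` layout with `xml:lang` values `en` and `uk`, and that the downloads have already been decompressed if needed. The function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace, so ElementTree
# exposes it under this fully qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_txt_pairs(stream):
    """Yield (english, ukrainian) tuples from a TAB-separated text stream.

    Assumption: column 1 is English and column 2 is Ukrainian.
    """
    for line in stream:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) >= 2:  # skip lines without at least two columns
            yield parts[0], parts[1]


def iter_tmx_pairs(stream, src="en", tgt="uk"):
    """Yield (source, target) segments from a binary TMX stream.

    Assumption: language codes are exactly "en" and "uk".
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]
        elem.clear()  # drop the processed unit's children to limit memory use


if __name__ == "__main__":
    # Tiny in-memory sample standing in for paracrawl-clean-en-uk.txt
    sample = io.StringIO("Hello\tПривіт\n")
    for en, uk in iter_txt_pairs(sample):
        print(en, "|", uk)
```

For the real files, something like `open("paracrawl-clean-en-uk.txt", encoding="utf-8")` could be passed to `iter_txt_pairs`, and `open("paracrawl-clean-en-uk.tmx", "rb")` to `iter_tmx_pairs`. Both functions are generators, so the multi-gigabyte files are not loaded into memory in one piece.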