Friday, 18 March 2022 08:49

Bonus release: English-Ukrainian parallel corpus added

 

Bonus corpus: English-Ukrainian parallel corpus v1 release

A new parallel corpus covering English and Ukrainian was released in March 2022.

It has been released as a bonus corpus on behalf of the ParaCrawl effort. 

Three versions have been produced: raw, clean TMX and clean TXT.

The clean version contains roughly 13.4M sentence pairs and 506M source tokens; these counts are the same for the TXT and TMX formats.

Please download them from the following links:

Corpus type                                           Sentence pairs  Source tokens  Link                       Size
English-Ukrainian, TXT version (TAB-separated files)  13,354,365      505,831,880    paracrawl-clean-en-uk.txt  2.8G
English-Ukrainian, TMX version (XML file)             13,354,365      505,831,880    paracrawl-clean-en-uk.tmx  8.3G
English-Ukrainian, RAW version                        235,700,383     5,832,658,894  paracrawl-raw-en-uk        15.3G
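
For readers who want to process the TXT version, the following is a minimal Python sketch of streaming sentence pairs from a tab-separated file. It assumes the English sentence is in the first column and the Ukrainian sentence in the second, which this announcement does not state; any further columns are ignored. The function name `read_pairs` and the in-memory sample are illustrative only.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (first_column, second_column) for each tab-separated line.

    Assumption: column 1 is English and column 2 is Ukrainian.
    Lines with fewer than two columns are skipped.
    """
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # malformed line: no tab separator found
        yield fields[0], fields[1]


# Small in-memory sample standing in for the real multi-gigabyte file.
sample = io.StringIO("Hello\tПривіт\nno separator here\nThank you\tДякую\n")
pairs = list(read_pairs(sample))
for english, ukrainian in pairs:
    print(english, "|", ukrainian)
```

For the downloaded file, the same function could be given a handle from `open(path, encoding="utf-8")`, where `path` is wherever the file was saved locally. Because the file is read line by line, the whole 2.8G file never has to be held in memory.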

 
Last modified on Friday, 08 April 2022 10:19