ParaCrawl works!

Monday, 01 April 2019 00:00

NMT experiments were run for several language pairs, comparing models trained on WMT data alone with models trained on WMT data plus the released ParaCrawl corpora. All experiments use shallow NMT models trained with Marian. As the following table shows, adding ParaCrawl v4 improves BLEU over the WMT-only baseline in every direction except en-cs, where the score is essentially unchanged (20.4 vs. 20.5). The ParaCrawl pipeline has improved considerably since release 1: the v4 data is much cleaner, and the gains are correspondingly clearer. With v1, three directions (en-cs, en-de, de-en) scored below the baseline, whereas with v4 the gains in the improved directions range from 1.1 to 4.6 BLEU.

Pair	Direction	BLEU (WMT only)	BLEU (WMT + ParaCrawl v1)	BLEU (WMT + ParaCrawl v4)
Finnish-English	en-fi	17.5	17.5	18.7
Finnish-English	fi-en	21.7	24.2	26.3
Latvian-English	en-lv	13.2	13.9	15.1
Latvian-English	lv-en	15.6	16.5	18.1
Romanian-English	en-ro	25.9	26.5	27.2
Romanian-English	ro-en	31.1	33.5	35.1
Czech-English	en-cs	20.5	19.1	20.4
Czech-English	cs-en	25.7	26.3	26.8
German-English	en-de	24.0	20.8	25.2
German-English	de-en	29.8	28.8	32.9
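
The comparison can be checked directly from the table. The following is a minimal sketch, not part of the original experiments: it copies the BLEU scores above into a dictionary and computes each direction's change relative to the WMT-only baseline. The names `scores`, `deltas`, and `not_improved` are illustrative assumptions.

```python
# BLEU scores copied from the table above, per translation direction:
# (WMT only, WMT + ParaCrawl v1, WMT + ParaCrawl v4)
scores = {
    "en-fi": (17.5, 17.5, 18.7),
    "fi-en": (21.7, 24.2, 26.3),
    "en-lv": (13.2, 13.9, 15.1),
    "lv-en": (15.6, 16.5, 18.1),
    "en-ro": (25.9, 26.5, 27.2),
    "ro-en": (31.1, 33.5, 35.1),
    "en-cs": (20.5, 19.1, 20.4),
    "cs-en": (25.7, 26.3, 26.8),
    "en-de": (24.0, 20.8, 25.2),
    "de-en": (29.8, 28.8, 32.9),
}


def deltas(table):
    """Return {direction: (v1 - baseline, v4 - baseline)}, rounded to 0.1 BLEU."""
    return {
        direction: (round(v1 - wmt, 1), round(v4 - wmt, 1))
        for direction, (wmt, v1, v4) in table.items()
    }


def not_improved(table, column):
    """List directions whose delta is <= 0 (column 0 is v1, column 1 is v4)."""
    return [d for d, delta in deltas(table).items() if delta[column] <= 0]


if __name__ == "__main__":
    for direction, (d_v1, d_v4) in deltas(scores).items():
        print(f"{direction}: v1 {d_v1:+.1f}, v4 {d_v4:+.1f}")
    print("No gain with v1:", not_improved(scores, 0))
    print("No gain with v4:", not_improved(scores, 1))
```

Differences are rounded to one decimal because the table reports BLEU at that precision; the sketch says nothing about statistical significance, which would require the underlying system outputs.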

Last modified on Tuesday, 14 January 2020 09:08

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages
