Releases

`>>Check More Data and News sections for updates!!! >>`

ParaCrawl Corpus release v9

This corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.

Release 9 is the final release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 9 brings new content and higher quality as the result of an improved pipeline with:

better PDF processing
language identification based on CLD2 full instead of lite
improved machine translation models (almost all neural) used to parallelize sentences
neural cleaning applied for the first time

With this version, we reach the best MT results ever obtained with ParaCrawl.

As a bonus, we release an English-Chinese corpus and monolingual data.

Data formats: Five variations of each corpus are provided: 1. Bicleaner TXT format, 2. Bicleaner TMX format, 3. RAW corpus, 4. ROAM (anonymised) format, and 5. Deferred crawling format, which contains pointers to URLs so that you can recrawl the corpora on your end. Click on the icon for each language to show the download links. To convert a TMX file to a tab-separated text file, download the TMXT tool. If you use the deferred crawled corpora, our reconstruction tool may also prove useful.
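To illustrate what converting TMX to tab-separated text involves, here is a minimal Python sketch. It is not the TMXT tool and does not reproduce its behaviour. It only assumes the standard TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The function name `tmx_to_tsv_lines` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv_lines(tmx_text, src_lang, tgt_lang):
    """Yield 'source<TAB>target' for each translation unit that has both languages."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # itertext() also collects text nested inside inline markup;
            # split/join collapses tabs and newlines so each pair stays on one line.
            text = " ".join("".join(seg.itertext()).split())
            segments[lang.lower()] = text
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang] + "\t" + segments[tgt_lang]


# Tiny hand-written sample (simplified header), not real ParaCrawl data.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="es"><seg>Buenos días.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="es"><seg>Gracias.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for line in tmx_to_tsv_lines(SAMPLE_TMX, "en", "es"):
    print(line)
```

This sketch loads the whole document into memory, which is fine for a sample but not for corpora of the sizes listed on this page. A streaming parser such as `xml.etree.ElementTree.iterparse` would be the more realistic choice there.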

WMT warning: ParaCrawl has contributed data to WMT continuously since 2018. Data sets from various ParaCrawl releases have been used in different shared tasks over the years. To make sure you get the exact data sets needed for these shared tasks, please download them directly from the links provided on the WMT website.

Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words
Bulgarian | 13,264,297 | 226,485,024 | 13,264,297 | 226,485,024 | 379,713,465 | 9,061,247,935 | 11,660,661 | 189,446,058 | 13,264,297 | 226,485,024
Czech | 50,632,492 | 692,110,883 | 50,632,492 | 692,110,883 | 2,996,891,961 | 56,087,810,356 | 41,000,528 | 544,920,121 | 50,632,492 | 692,110,883
Danish | 34,207,155 | 555,049,001 | 34,207,155 | 555,049,001 | 2,381,010,643 | 47,495,319,396 | 28,615,957 | 445,634,283 | 34,207,155 | 555,049,001
German | 278,310,907 | 4,269,352,871 | 278,310,907 | 4,269,352,871 | 9,662,113,792 | 210,295,493,345 | 233,362,910 | 3,450,715,114 | 278,310,907 | 4,269,352,871
Greek | 21,402,042 | 340,479,785 | 21,402,042 | 340,479,785 | 1,405,078,151 | 27,686,172,080 | 18,582,458 | 279,761,255 | 21,402,042 | 340,479,785
Spanish | 269,394,967 | 4,374,060,920 | 269,394,967 | 4,374,060,920 | 9,419,403,869 | 206,498,167,547 | 223,941,877 | 3,496,386,983 | 269,394,967 | 4,374,060,920
Estonian | 8,539,879 | 136,598,517 | 8,539,879 | 136,598,517 | 390,991,428 | 7,743,920,994 | 7,290,786 | 114,333,882 | 8,539,879 | 136,598,517
Finnish | 31,315,287 | 454,355,512 | 31,315,287 | 454,355,512 | 1,792,067,249 | 34,145,205,926 | 25,393,902 | 355,039,178 | 31,315,287 | 454,355,512
French | 216,646,826 | 3,761,995,609 | 216,646,826 | 3,761,995,609 | 5,495,920,681 | 134,218,695,949 | 177,473,147 | 2,988,027,816 | 216,646,826 | 3,761,995,609
Irish | 3,245,618 | 56,703,820 | 3,245,618 | 56,703,820 | 615,267,926 | 13,691,594,812 | 2,584,201 | 43,513,633 | 3,245,618 | 56,703,820
Croatian | 3,240,420 | 79,062,603 | 3,240,420 | 79,062,603 | 563,570,419 | 12,185,582,349 | 2,730,607 | 64,058,854 | 3,240,420 | 79,062,603
Hungarian | 36,432,544 | 509,256,374 | 36,432,544 | 509,256,374 | 2,245,577,938 | 41,486,191,342 | 29,128,012 | 398,339,155 | 36,432,544 | 509,256,374
Icelandic | 2,967,519 | 45,093,876 | 2,967,519 | 45,093,876 | 65,379,727 | 1,500,672,709 | 2,446,111 | 35,527,952 | 2,967,519 | 45,093,876
Italian | 96,975,991 | 1,682,841,134 | 96,975,991 | 1,682,841,134 | 2,881,161,252 | 66,782,184,952 | 78,373,609 | 1,303,115,197 | 96,975,991 | 1,682,841,134
Lithuanian | 13,191,973 | 185,525,058 | 13,191,973 | 185,525,058 | 706,153,214 | 14,023,789,635 | 11,024,987 | 150,966,844 | 13,191,973 | 185,525,058
Latvian | 13,063,804 | 197,361,114 | 13,063,804 | 197,361,114 | 621,121,254 | 12,392,015,629 | 10,958,884 | 161,335,336 | 13,063,804 | 197,361,114
Maltese | 1,207,760 | 24,989,869 | 1,207,760 | 24,989,869 | 52,687,808 | 1,132,276,882 | 1,049,566 | 21,163,122 | 1,207,760 | 24,989,869
Norwegian (Bokmål) | 19,281,147 | 325,690,737 | 19,281,147 | 325,690,737 | 1,186,857,903 | 25,059,150,285 | 16,296,119 | 265,306,747 | 19,281,147 | 325,690,737
Dutch | 89,135,870 | 1,306,977,298 | 89,135,870 | 1,306,977,298 | 3,792,110,568 | 80,824,583,076 | 75,744,287 | 1,080,767,873 | 89,135,870 | 1,306,977,298
Norwegian (Nynorsk) | 294,470 | 5,965,872 | 294,470 | 5,965,872 | 26,975,121 | 552,467,707 | 213,538 | 3,956,194 | 294,470 | 5,965,872
Polish | 40,082,037 | 599,286,435 | 40,082,037 | 599,286,435 | 1,494,638,802 | 31,382,315,844 | 32,354,336 | 467,434,142 | 40,082,037 | 599,286,435
Portuguese | 84,925,921 | 1,337,335,551 | 84,925,921 | 1,337,335,551 | 2,489,237,215 | 56,204,193,540 | 69,999,741 | 1,065,769,612 | 84,925,921 | 1,337,335,551
Romanian | 25,048,461 | 403,390,783 | 25,048,461 | 403,390,783 | 1,496,164,926 | 29,847,980,518 | 20,989,631 | 325,653,445 | 25,048,461 | 403,390,783
Slovak | 22,901,690 | 302,051,134 | 22,901,690 | 302,051,134 | 1,572,443,942 | 29,606,225,083 | 18,144,092 | 236,437,555 | 22,901,690 | 302,051,134
Slovenian | 9,516,068 | 151,672,416 | 9,516,068 | 151,672,416 | 458,836,589 | 9,800,697,623 | 8,196,238 | 127,519,352 | 9,516,068 | 151,672,416
Swedish | 49,109,339 | 722,407,192 | 49,109,339 | 722,407,192 | 2,026,885,221 | 41,303,256,881 | 41,573,601 | 594,612,823 | 49,109,339 | 722,407,192
Spanish-Catalan | 17,238,953 | 389,803,201 | 17,238,953 | 389,803,201 | 298,874,794 | 7,071,221,480 | 17,102,682 | 385,834,468 | 17,238,953 | 389,803,201
Spanish-Basque | 3,344,372 | 64,667,201 | 3,344,372 | 64,667,201 | 36,775,535 | 835,871,547 | 3,295,251 | 63,677,141 | 3,344,372 | 64,667,201
Spanish-Galician | 1,879,651 | 44,626,184 | 1,879,651 | 44,626,184 | 96,193,555 | 2,222,137,977 | 1,865,314 | 44,204,985 | 1,879,651 | 44,626,184

Bonus Release (Low-resource languages) - Last updated October 2024

English-Azerbaijani | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 336,067,622 | 5,513,087,127
English-Tajik | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 11,394,854 | 244,332,910
English-Armenian | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 31,671,924 | 233,771,297
English-Khmer v1 | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
English-Burmese v1 | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
English-Nepali v1 | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
English-Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
English-Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
English-Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
English-Swahili | 132,517 | 3,696,543 | 84,605,506 | 132,517 | 3,696,543
English-Tagalog | 248,684 | 6,327,801 | 108,260,601 | 248,684 | 6,327,801

Bonus Release - Last updated October 2024

English-Hindi | New | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000
English-Indonesian | New | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000
English-Khmer v2 | New | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000
English-Korean v2 | New | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000
English-Lao | New | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000
English-Burmese v2 | New | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000
English-Nepali v2 | New | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000
English-Thai | New | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000
English-Vietnamese | New | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000
Polish-Czech | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699
English-Ukrainian | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894
English-Chinese | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029
English-Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972
English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744
Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393
Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359

Bitextor

ParaCrawl Open Source pipeline

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

View release on Github

Quickstart

Docker

If you want to easily install Bitextor, just use the Docker version:

docker pull bitextor/bitextor
docker run --name bitextor bitextor/bitextor

Docker Image

Corset

Filtering

Searching

Perform searches on ParaCrawl corpora or get filtered subsets from it using Corset.

View release on Github

Synthesized Data: focus on translation of rare words

manufactured data

ParaCrawl v9

This time, the synthesized data release is made of sentences from the ParaCrawl v8 corpus in which rare words, in the original and in the translation, have been replaced with similar original/translation words based on the average of cosine distance. Parallel data for the 8 lowest-resourced official European languages is available for download.
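The page does not say which embeddings were used or exactly how the average is taken. As one possible reading only, the sketch below picks, for a rare source/target word pair, the candidate pair whose cosine distance averaged over the source side and the target side is smallest. The function names, the toy vectors, and the candidate list are all hypothetical and are not taken from the ParaCrawl pipeline.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def best_replacement(rare_pair, candidate_pairs, src_vecs, tgt_vecs):
    """Pick the candidate (source, target) pair closest to the rare pair.

    Closeness is the mean of the source-side and target-side cosine
    distances. This is an assumed interpretation of "average of cosine distance".
    """
    rare_src, rare_tgt = rare_pair

    def avg_distance(pair):
        cand_src, cand_tgt = pair
        d_src = cosine_distance(src_vecs[rare_src], src_vecs[cand_src])
        d_tgt = cosine_distance(tgt_vecs[rare_tgt], tgt_vecs[cand_tgt])
        return (d_src + d_tgt) / 2.0

    return min(candidate_pairs, key=avg_distance)


# Toy two-dimensional "embeddings", invented for illustration.
SRC_VECS = {"heron": [1.0, 0.0], "stork": [0.9, 0.1], "table": [0.0, 1.0]}
TGT_VECS = {"garza": [1.0, 0.0], "cigüeña": [0.8, 0.2], "mesa": [0.1, 0.9]}
CANDIDATES = [("stork", "cigüeña"), ("table", "mesa")]

print(best_replacement(("heron", "garza"), CANDIDATES, SRC_VECS, TGT_VECS))
```

With these toy vectors the bird pair is selected, because both of its words lie close to the rare pair while the furniture pair is nearly orthogonal on both sides.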

More details

KEOPS

Quality Evaluation

KEOPS provides a complete tool for manual evaluation of parallel sentences and other linguistic tasks.

View release on Github

Bicleaner

Classifier

Bicleaner (bicleaner-classify) is a Python tool that aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood that a pair of sentences are mutual translations (a value near 1) or not (a value near 0). Sentence pairs considered very noisy are scored with 0.
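A score of this kind is typically used to filter a corpus by threshold. The sketch below shows that step in Python under two assumptions that this page does not confirm: that each line is tab-separated with the score in the last column, and that 0.5 is a sensible cut-off (it is only an illustrative value). The function name `filter_by_score` and the sample lines are hypothetical. This is not Bicleaner's own code.

```python
def filter_by_score(lines, threshold=0.5, score_column=-1):
    """Yield the lines whose score is at least `threshold`.

    Assumes tab-separated fields with the score in `score_column`
    (last column by default); lines without a parsable score are skipped.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        try:
            score = float(fields[score_column])
        except (ValueError, IndexError):
            continue  # malformed line: no usable score
        if score >= threshold:
            yield stripped


# Invented sample lines: source, target, score.
SCORED_LINES = [
    "Good morning.\tBuenos días.\t0.93",
    "Click here\tPolítica de privacidad\t0.00",
    "Thank you.\tGracias.\t0.71",
    "line without a score",
]

for kept in filter_by_score(SCORED_LINES, threshold=0.5):
    print(kept)
```

Raising the threshold trades corpus size for cleanliness, so the right value depends on the language pair and the downstream task.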

View release on Github

Third Party Releases

JParaCrawl

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT Communication Science Laboratories. It was built by crawling the web at large scale and automatically aligning parallel sentences.

Read More

Citing ParaCrawl

Research

If you want to cite ParaCrawl, please refer to: ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages

A Japanese version is available through JParaCrawl (see Third Party Releases).

License
