ParaCrawl Release v5.1

ParaCrawl Corpus release v5.1

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Version 5.1 builds upon the same raw corpus as v5. Thanks to improvements in the filtering procedure, the official subset extracted as version 5.1 is now larger for almost all language pairs (the exceptions are ga, de, sl and et). Extrinsic evaluation through machine translation (MT) for several language pairs also shows an improvement in quality.

A newer version is available

See Latest Releases

| Language | Sentences | Source Words | | | | |
|---|---|---|---|---|---|---|
| Bulgarian | 2,775,256 | 60,246,308 | 2,775,256 | 60,246,308 | 248,555,951 | 1,564,051,100 |
| Czech | 5,345,693 | 105,973,351 | 5,345,693 | 105,973,351 | 665,535,115 | 4,025,512,842 |
| Danish | 4,851,772 | 111,476,139 | 4,851,772 | 111,476,139 | 447,743,455 | 3,347,135,236 |
| German | 34,371,306 | 708,068,143 | 34,371,306 | 708,068,143 | 5,038,103,659 | 27,994,213,177 |
| Greek | 4,038,777 | 93,473,163 | 4,038,777 | 93,473,163 | 640,502,801 | 3,768,712,672 |
| Spanish | 44,587,162 | 1,072,236,916 | 44,587,162 | 1,072,236,916 | 2,674,900,280 | 16,598,620,402 |
| Estonian | 1,452,963 | 31,597,344 | 1,452,963 | 31,597,344 | 168,091,382 | 915,074,587 |
| Finnish | 3,421,382 | 66,385,933 | 3,421,382 | 66,385,933 | 460,181,215 | 2,731,068,033 |
| French | 63,634,915 | 1,518,457,124 | 63,634,915 | 1,518,457,124 | 4,273,819,421 | 24,983,683,983 |
| Irish | 521,768 | 12,089,677 | 521,768 | 12,089,677 | 64,628,733 | 667,211,260 |
| Croatian | 1,993,180 | 44,945,371 | 1,993,180 | 44,945,371 | 273,330,006 | 1,738,164,401 |
| Hungarian | 4,782,328 | 115,330,046 | 4,782,328 | 115,330,046 | 461,181,772 | 3,208,285,083 |
| Italian | 24,089,063 | 587,087,473 | 24,089,063 | 587,087,473 | 2,251,771,798 | 13,150,606,108 |
| Lithuanian | 1,368,514 | 27,894,906 | 1,368,514 | 27,894,906 | 198,101,611 | 963,384,230 |
| Latvian | 1,056,252 | 22,810,714 | 1,056,252 | 22,810,714 | 176,113,669 | 1,069,218,155 |
| Maltese | 186,630 | 4,280,211 | 186,630 | 4,280,211 | 3,693,930 | 38,492,028 |
| Dutch | 11,272,396 | 247,536,605 | 11,272,396 | 247,536,605 | 1,101,087,006 | 6,792,400,704 |
| Polish | 6,577,804 | 143,702,545 | 6,577,804 | 143,702,545 | 723,052,912 | 4,123,972,411 |
| Portuguese | 15,259,967 | 337,394,318 | 15,259,967 | 337,394,318 | 1,068,161,866 | 6,537,298,891 |
| Romanian | 3,176,488 | 69,998,913 | 3,176,488 | 69,998,913 | 510,209,923 | 3,034,045,929 |
| Slovak | 2,496,533 | 48,160,348 | 2,496,533 | 48,160,348 | 269,067,288 | 1,416,750,646 |
| Slovenian | 1,220,652 | 29,042,458 | 1,220,652 | 29,042,458 | 175,682,959 | 1,003,867,134 |
| Swedish | 6,633,761 | 149,048,559 | 6,633,761 | 149,048,559 | 620,338,561 | 3,496,650,816 |

Bonus Release (Low resource languages) - Last Updates on Oct. 2024

| Language | Sentences | Source Words | | | | | | |
|---|---|---|---|---|---|---|---|---|
| English-Azerbaijani | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 336,067,622 | 5,513,087,127 |
| English-Tajik | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 11,394,854 | 244,332,910 |
| English-Armenian | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 31,671,924 | 233,771,297 |
| English-Khmer v1 | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950 | | |
| English-Burmese v1 | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577 | | |
| English-Nepali v1 | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031 | | |
| English-Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651 | | |
| English-Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982 | | |
| English-Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201 | | |
| English-Swahili | 132,517 | 3,696,543 | 84,605,506 | 132,517 | 3,696,543 | | | |
| English-Tagalog | 248,684 | 6,327,801 | 108,260,601 | 248,684 | 6,327,801 | | | |

Bonus Release - Last Updates on October 2024

| Language | Sentences | Source Words | | | | |
|---|---|---|---|---|---|---|
| English-Hindi (New) | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000 |
| English-Indonesian (New) | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000 |
| English-Khmer v2 (New) | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000 |
| English-Korean v2 (New) | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000 |
| English-Lao (New) | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000 |
| English-Burmese v2 (New) | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000 |
| English-Nepali v2 (New) | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000 |
| English-Thai (New) | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000 |
| English-Vietnamese (New) | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000 |
| Polish-Czech | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699 |
| English-Ukrainian | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894 |
| English-Chinese | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029 |
| English-Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 |
| English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744 | | |
| Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 |
| Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 |

Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file into a tab-separated text file efficiently, download the TMXT tool.
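For readers who only want to see what such a conversion involves, the following is a minimal Python sketch that streams a TMX file and emits tab-separated sentence pairs. It is not the TMXT tool and makes no claim about how TMXT works. The function names (`tmx_to_rows`, `write_tsv`), the default language codes, and the sample document are assumptions made for illustration. The sketch relies only on the standard TMX layout of `tu`, `tuv` and `seg` elements.

```python
import io
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def clean(text):
    """Collapse whitespace so tabs or newlines in a segment cannot break the TSV layout."""
    return " ".join(text.split())


def tmx_to_rows(source, src_lang="en", tgt_lang="de"):
    """Yield (source_segment, target_segment) pairs from a TMX file or file object.

    The language codes are illustrative defaults, not values taken from the corpus.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = clean("".join(seg.itertext()))
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # drop the unit's children to limit memory use


def write_tsv(rows, out):
    """Write one tab-separated sentence pair per line to a text stream."""
    for src, tgt in rows:
        out.write(f"{src}\t{tgt}\n")


# A tiny hand-written sample, not taken from the corpus.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="de"><seg>Hallo Welt</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    buffer = io.StringIO()
    write_tsv(tmx_to_rows(io.BytesIO(SAMPLE)), buffer)
    print(buffer.getvalue(), end="")
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters for files with millions of translation units.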

Bitextor

ParaCrawl Open Source pipeline

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

View release on Github

Quickstart

Docker

If you want to easily install Bitextor, just use the Docker version:

docker pull bitextor/bitextor
docker run --name bitextor bitextor/bitextor

Docker Image

Corset

Filtering

Searching

Perform searches on ParaCrawl corpora or get filtered subsets from them using Corset.

View release on Github

KEOPS

Quality Evaluation

KEOPS provides a complete tool for manual evaluation of parallel sentences and other linguistic tasks.

View release on Github

Bicleaner

Classifier

Bicleaner (bicleaner-classify) is a Python tool that aims to detect noisy sentence pairs in a parallel corpus. It indicates the likelihood that a pair of sentences are mutual translations (a value close to 1) or not (a value close to 0). Sentence pairs considered very noisy are scored with 0.
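As an illustration of how such scores might be used downstream, the sketch below keeps only the sentence pairs whose score reaches a chosen threshold. It does not call Bicleaner itself. The input layout (tab-separated lines with the score in the last column), the function name `keep_clean_pairs`, and the threshold of 0.5 are all assumptions made for this example, so check the actual output of your Bicleaner run before relying on them.

```python
def keep_clean_pairs(lines, threshold=0.5):
    """Yield (source, target, score) for pairs scoring at or above the threshold.

    Assumes tab-separated lines: source, target, then the score in the last column.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # last column is not a number
        if score >= threshold:
            yield fields[0], fields[1], score


# Hand-written sample lines, not real corpus data.
SAMPLE_LINES = [
    "Hello\tHallo\t0.91\n",
    "Buy now\tKatze\t0.0\n",
    "Good morning\tGuten Morgen\t0.48\n",
]

if __name__ == "__main__":
    for src, tgt, score in keep_clean_pairs(SAMPLE_LINES):
        print(f"{score:.2f}\t{src}\t{tgt}")
```

A higher threshold keeps fewer but cleaner pairs, and the right value depends on the language pair and on how the data will be used.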

View release on Github

Third Party Releases

JParaCrawl

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT Communication Science Laboratories. It was built by crawling the web on a large scale and automatically aligning parallel sentences.

Read More

Citing ParaCrawl

Research

If you want to cite ParaCrawl, please refer to: ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages


A Japanese version is available through JParaCrawl (3rd Party Releases)
