ParaCrawl Release v5

ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". It adds newly crawled data, including data from the Internet Archive. Improvements to the document and sentence aligners, together with an updated BiCleaner strategy, produced corpora twice the size of those in release v4 for all official EU languages (23 languages paired with English).

A newer version is available

See Latest Releases

Language | Sentences | Source Words

Bulgarian | 2,586,277 | 55,725,444 | 2,586,277 | 55,725,444 | 248,555,951 | 1,564,051,100
Czech | 5,280,149 | 117,385,158 | 5,280,149 | 117,385,158 | 665,535,115 | 4,025,512,842
Danish | 4,606,183 | 106,565,546 | 4,606,183 | 106,565,546 | 447,743,455 | 3,347,135,236
German | 36,936,714 | 929,818,868 | 36,936,714 | 929,818,868 | 5,038,103,659 | 27,994,213,177
Greek | 3,830,643 | 88,669,279 | 3,830,643 | 88,669,279 | 640,502,801 | 3,768,712,672
Spanish | 38,971,348 | 897,891,704 | 38,971,348 | 897,891,704 | 2,674,900,280 | 16,598,620,402
Estonian | 1,387,869 | 30,858,140 | 1,387,869 | 30,858,140 | 168,091,382 | 915,074,587
Finnish | 3,097,223 | 66,385,933 | 3,097,223 | 66,385,933 | 460,181,215 | 2,731,068,033
French | 51,316,168 | 1,178,317,233 | 51,316,168 | 1,178,317,233 | 4,273,819,421 | 24,983,683,983
Irish | 782,769 | 21,909,039 | 782,769 | 21,909,039 | 64,628,733 | 667,211,260
Croatian | 1,861,590 | 43,464,197 | 1,861,590 | 43,464,197 | 273,330,006 | 1,738,164,401
Hungarian | 4,187,051 | 104,292,635 | 4,187,051 | 104,292,635 | 461,181,772 | 3,208,285,083
Italian | 22,100,078 | 533,512,632 | 22,100,078 | 533,512,632 | 2,251,771,798 | 13,150,606,108
Lithuanian | 1,270,933 | 27,214,054 | 1,270,933 | 27,214,054 | 198,101,611 | 963,384,230
Latvian | 1,019,003 | 23,656,140 | 1,019,003 | 23,656,140 | 176,113,669 | 1,069,218,155
Maltese | 177,244 | 4,252,814 | 177,244 | 4,252,814 | 3,693,930 | 38,492,028
Dutch | 10,596,717 | 233,087,345 | 10,596,717 | 233,087,345 | 1,101,087,006 | 6,792,400,704
Polish | 6,382,371 | 145,802,939 | 6,382,371 | 145,802,939 | 723,052,912 | 4,123,972,411
Portuguese | 13,860,663 | 299,634,135 | 13,860,663 | 299,634,135 | 1,068,161,866 | 6,537,298,891
Romanian | 2,870,687 | 62,189,306 | 2,870,687 | 62,189,306 | 510,209,923 | 3,034,045,929
Slovak | 2,365,339 | 45,636,383 | 2,365,339 | 45,636,383 | 269,067,288 | 1,416,750,646
Slovenian | 1,406,645 | 31,855,427 | 1,406,645 | 31,855,427 | 175,682,959 | 1,003,867,134
Swedish | 6,079,175 | 138,264,978 | 6,079,175 | 138,264,978 | 620,338,561 | 3,496,650,816

Bonus Release (Low-resource languages) - Last updated Oct. 2024

English-Azerbaijani | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 3,158,025 | 47,117,416 | 336,067,622 | 5,513,087,127
English-Tajik | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 343,401 | 5,513,041 | 11,394,854 | 244,332,910
English-Armenian | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 1,988,287 | 35,997,571 | 31,671,924 | 233,771,297
English-Khmer v1 | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
English-Burmese v1 | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
English-Nepali v1 | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
English-Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
English-Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
English-Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
English-Swahili | 132,517 | 3,696,543 | 84,605,506 | 132,517 | 3,696,543
English-Tagalog | 248,684 | 6,327,801 | 108,260,601 | 248,684 | 6,327,801

Bonus Release - Last updated October 2024

English-Hindi (New) | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000 | 4,712,564 | 74,000,000
English-Indonesian (New) | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000 | 7,133,323 | 109,000,000
English-Khmer v2 (New) | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000 | 1,501,304 | 23,000,000
English-Korean v2 (New) | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000 | 7,709,312 | 114,000,000
English-Lao (New) | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000 | 1,994,053 | 27,000,000
English-Burmese v2 (New) | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000 | 1,666,530 | 28,000,000
English-Nepali v2 (New) | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000 | 2,243,954 | 32,000,000
English-Thai (New) | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000 | 2,175,890 | 22,000,000
English-Vietnamese (New) | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000 | 6,291,407 | 93,000,000
Polish-Czech | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699
English-Ukrainian | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894
English-Chinese | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029
English-Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972
English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744
Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393
Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359

Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file into a tab-separated text file, download the TMXT tool.
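As an illustration of what such a conversion involves, here is a minimal Python sketch that streams a TMX file and emits one tab-separated line per translation unit. It is not the TMXT tool: the function names `tmx_pairs` and `to_tsv_line` are hypothetical, and the sketch assumes the usual TMX layout of `<tu>` units containing `<tuv xml:lang="...">` variants with one `<seg>` each.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_pairs(source, src_lang, trg_lang):
    """Yield (source_text, target_text) for each <tu> holding both languages.

    `source` is a file path or a binary file object (hypothetical helper,
    assuming <tu>/<tuv xml:lang>/<seg> structure).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested in inline markup
                segments[lang] = "".join(seg.itertext())
        if src_lang in segments and trg_lang in segments:
            yield segments[src_lang], segments[trg_lang]
        elem.clear()  # drop the finished unit's children to limit memory use


def to_tsv_line(src, trg):
    """Join a pair with a tab, collapsing tabs and newlines inside the text."""
    return " ".join(src.split()) + "\t" + " ".join(trg.split())


if __name__ == "__main__":
    # Tiny in-memory sample; a real file path could be passed instead.
    sample = io.BytesIO(
        b'<tmx version="1.4"><header/><body>'
        b'<tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>'
        b'<tuv xml:lang="mt"><seg>Bongu dinja</seg></tuv></tu>'
        b"</body></tmx>"
    )
    for src, trg in tmx_pairs(sample, "en", "mt"):
        print(to_tsv_line(src, trg))
```

Whitespace inside each segment is collapsed so that a stray tab or newline cannot break the one-pair-per-line layout. Any metadata stored on the translation units is ignored in this sketch.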

Bitextor

ParaCrawl Open Source pipeline

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

View release on Github

Quickstart

Docker

If you want to easily install Bitextor, just use the Docker version:

docker pull bitextor/bitextor
docker run --name bitextor bitextor/bitextor

Docker Image

Corset

Filtering

Searching

Perform searches on ParaCrawl corpora or get filtered subsets from it using Corset.

View release on Github

KEOPS

Quality Evaluation

KEOPS provides a complete tool for manual evaluation of parallel sentences and other linguistic tasks.

View release on Github

Bicleaner

Classifier

Bicleaner (bicleaner-classify) is a Python tool that aims to detect noisy sentence pairs in a parallel corpus. It scores the likelihood that a pair of sentences are mutual translations: values near 1 indicate likely translations, and values near 0 indicate unlikely ones. Sentence pairs considered very noisy are scored 0.
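To make the scoring concrete, the sketch below keeps only sentence pairs whose score reaches a chosen threshold. It assumes, for illustration, that each scored line is tab-separated with the score in the last column. The function name `filter_by_score` and the 0.5 threshold are hypothetical choices, not values taken from the ParaCrawl release.

```python
def filter_by_score(lines, threshold=0.5):
    """Yield lines (score column removed) whose last field is >= threshold.

    Assumption: each line is tab-separated and ends with a numeric score
    between 0 and 1. Lines without a parseable score are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # no numeric score in the last column
        if score >= threshold:
            yield "\t".join(fields[:-1])


if __name__ == "__main__":
    scored = [
        "Hello world\tBongu dinja\t0.93\n",
        "Click here\tXYZ 123\t0\n",
    ]
    for kept in filter_by_score(scored):
        print(kept)
```

A higher threshold trades corpus size for cleaner pairs, so the right value depends on the downstream task.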

View release on Github

Third Party Releases

JParaCrawl

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT Communication Science Laboratories. It was built largely by crawling the web and automatically aligning parallel sentences.

Read More

Citing ParaCrawl

Research

If you want to cite ParaCrawl, please refer to: ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages


A Japanese version is available through JParaCrawl (3rd Party Releases)
