ParaCrawl Release v8

ParaCrawl Corpus release v8

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 8 is the first release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 8 adds a large amount of data to previous releases, together with additional cleaning routines such as the removal of machine-translated content, which is detected through the use of MT plugins on websites. The corpus is the result of a full reprocessing of all content from previously crawled sources, plus new sources from the Internet Archive and new crawls.

This release relies on an updated and enhanced version of Bitextor, including minor fixes to Bifixer (fixing), Bicleaner (filtering) and Biroamer (anonymization). For the first time, Bitextor provides deferred crawled corpora as part of this release.

As a bonus, a corpus made of all the monolingual English data in v8 (96 billion sentences!) has been produced, along with a new version of the English-Russian corpus. New synthesized data for four domains (Financial, Law, IT and Medical) is also available as part of this release.

New version 8.1 for Spanish-Galician and Spanish-Catalan: because of a processing error, a large amount of Spanish content ended up in Catalan and Galician sentences. We have produced new filtered versions of these two pairs to fix the issue.

Language

Sentences

Source Words

Bulgarian

11,927,063

203,375,210

11,927,063

203,375,210

854,733,138

18,604,567,822

3,009,916

50,070,366

11,927,063

203,375,210

Czech

50,152,749

686,422,416

50,152,749

686,422,416

3,305,572,201

68,597,385,464

41,100,059

541,110,549

50,152,749

686,422,416

Danish

41,939,140

615,918,838

41,939,140

615,918,838

3,347,601,977

70,654,954,456

34,772,889

485,480,287

41,939,140

615,918,838

German

261,109,308

3,814,804,964

261,109,308

3,814,804,964

26,655,605,254

573,283,847,994

213,698,600

3,016,392,423

261,109,308

3,814,804,964

Greek

34,586,041

499,466,027

34,586,041

499,466,027

3,441,314,226

73,822,945,203

28,665,368

397,368,723

34,586,041

499,466,027

Spanish

396,501,181

5,603,516,317

396,501,181

5,603,516,317

41,499,613,769

867,946,490,916

327,845,105

4,414,531,025

396,501,181

5,603,516,317

Estonian

8,585,187

144,690,364

8,585,187

144,690,364

642,176,845

12,878,933,565

7,375,531

121,286,878

8,585,187

144,690,364

Finnish

15,301,979

250,528,619

15,301,979

250,528,619

1,570,205,909

35,656,029,925

12,762,014

200,937,547

15,301,979

250,528,619

French

266,848,268

4,441,921,942

266,848,268

4,441,921,942

24,235,367,738

532,388,485,568

219,494,980

3,519,726,369

266,848,268

4,441,921,942

Irish

1,995,659

39,491,428

1,995,659

39,491,428

1,072,936,263

24,357,497,495

1,702,825

33,425,533

1,995,659

39,491,428

Croatian

11,063,212

164,345,030

11,063,212

164,345,030

1,715,050,486

36,278,818,443

9,062,845

130,636,396

11,063,212

164,345,030

Hungarian

12,681,746

196,278,321

12,681,746

196,278,321

1,307,661,091

27,915,080,653

10,529,565

160,217,586

12,681,746

196,278,321

Icelandic

5,724,258

79,645,858

5,724,258

79,645,858

209,761,975

4,400,395,928

4,557,278

60,363,762

5,724,258

79,645,858

Italian

120,119,878

1,970,999,568

120,119,878

1,970,999,568

11,726,003,767

260,835,888,372

97,967,919

1,529,943,957

120,119,878

1,970,999,568

Lithuanian

8,043,262

130,375,034

8,043,262

130,375,034

550,598,514

11,458,376,058

6,807,010

107,919,671

8,043,262

130,375,034

Latvian

8,177,660

138,752,970

8,177,660

138,752,970

490,189,945

10,062,934,467

6,955,523

115,765,677

8,177,660

138,752,970

Maltese

1,604,135

30,567,571

1,604,135

30,567,571

88,711,521

1,824,228,962

1,376,335

25,861,339

1,604,135

30,567,571

Dutch

98,474,880

1,384,748,238

98,474,880

1,384,748,238

6,703,800,208

140,665,734,012

80,099,994

1,083,914,273

98,474,880

1,384,748,238

Norwegian

59,090,389

785,275,124

59,090,389

785,275,124

4,990,065,958

101,779,188,277

47,373,466

607,754,127

59,090,389

785,275,124

Polish

45,359,213

666,844,113

45,359,213

666,844,113

5,767,226,735

121,958,563,962

37,802,842

533,691,061

45,359,213

666,844,113

Portuguese

102,631,451

1,562,012,122

102,631,451

1,562,012,122

10,122,731,111

225,822,517,114

85,102,715

1,237,102,517

102,631,451

1,562,012,122

Romanian

13,376,424

220,420,304

13,376,424

220,420,304

1,570,816,827

34,846,884,817

11,270,962

179,121,809

13,376,424

220,420,304

Slovak

13,010,434

202,080,757

13,010,434

202,080,757

1,002,007,836

20,601,640,521

11,208,541

170,049,628

13,010,434

202,080,757

Slovenian

7,536,844

136,309,008

7,536,844

136,309,008

615,555,357

13,079,867,865

6,442,830

113,165,698

7,536,844

136,309,008

Swedish

44,066,693

657,463,968

44,066,693

657,463,968

4,476,351,734

91,348,264,048

36,993,856

531,048,243

44,066,693

657,463,968

Spanish-Catalan

New

39,688,735

732,461,360

39,688,735

732,461,360

2,953,756,667

90,271,047,717

39,312,412

725,174,709

39,688,735

732,461,360

Spanish-Basque

2,864,354

44,790,206

2,864,354

44,790,206

509,767,791

15,786,006,832

2,815,365

44,003,473

2,864,354

44,790,206

Spanish-Galician

New

5,261,521

78,298,840

5,261,521

78,298,840

1,912,182,856

65,242,661,658

5,232,239

77,721,927

5,261,521

78,298,840

Bonus Release (low-resource languages) - Last updated October 2024

English-Azerbaijani

3,158,025

47,117,416

3,158,025

47,117,416

3,158,025

47,117,416

336,067,622

5,513,087,127

English-Tajik

343,401

5,513,041

343,401

5,513,041

343,401

5,513,041

11,394,854

244,332,910

English-Armenian

1,988,287

35,997,571

1,988,287

35,997,571

1,988,287

35,997,571

31,671,924

233,771,297

English-Khmer v1

65,113

1,511,950

21,560,446

21,565,078

65,113

1,511,950

English-Burmese v1

31,374

661,577

40,590,354

40,595,755

31,374

661,577

English-Nepali v1

92,084

2,941,031

36,454,553

36,466,101

92,084

2,941,031

English-Pashto

26,321

692,651

2,587,950

2,593,163

26,321

692,651

English-Singhalese

217,407

5,791,982

38,720,907

38,724,422

217,407

5,791,982

English-Somali

14,879

506,201

28,387,922

28,396,227

14,879

506,201

English-Swahili

132,517

3,696,543

84,605,506

132,517

3,696,543

English-Tagalog

248,684

6,327,801

108,260,601

248,684

6,327,801

Bonus Release - Last updated October 2024

English-Hindi

New

4,712,564

74,000,000

4,712,564

74,000,000

4,712,564

74,000,000

English-Indonesian

New

7,133,323

109,000,000

7,133,323

109,000,000

7,133,323

109,000,000

English-Khmer v2

New

1,501,304

23,000,000

1,501,304

23,000,000

1,501,304

23,000,000

English-Korean v2

New

7,709,312

114,000,000

7,709,312

114,000,000

7,709,312

114,000,000

English-Lao

New

1,994,053

27,000,000

1,994,053

27,000,000

1,994,053

27,000,000

English-Burmese v2

New

1,666,530

28,000,000

1,666,530

28,000,000

1,666,530

28,000,000

English-Nepali v2

New

2,243,954

32,000,000

2,243,954

32,000,000

2,243,954

32,000,000

English-Thai

New

2,175,890

22,000,000

2,175,890

22,000,000

2,175,890

22,000,000

English-Vietnamese

New

6,291,407

93,000,000

6,291,407

93,000,000

6,291,407

93,000,000

Polish-Czech

24,001,403

288,826,678

24,001,403

288,826,678

6,055,618,075

28,559,061,699

English-Ukrainian

13,354,365

505,831,880

13,354,365

505,831,880

235,700,383

5,832,658,894

English-Chinese

14,170,585

217,604,664

14,170,585

217,604,664

1,207,487,761

8,953,713,029

English-Russian

5,377,911

101,312,142

5,377,911

101,312,142

491,941,804

492,260,972

English-Korean

4,002,441

61,963,744

4,002,441

61,963,744

Dutch-French

2,687,331

60,504,313

2,687,331

60,504,313

38,164,560

770,141,393

Polish-German

916,522

18,883,576

916,522

18,883,576

11,060,105

202,765,359

Bitextor

ParaCrawl Open Source pipeline

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

View release on Github

Quickstart

Docker

If you want to easily install Bitextor, just use the Docker version:

docker pull bitextor/bitextor
docker run --name bitextor bitextor/bitextor

Docker Image

Corset

Filtering

Searching

Use Corset to search the ParaCrawl corpora or to extract filtered subsets from them.

View release on Github

Synthesized Data: focus on translation of rare words

manufactured data

ParaCrawl v9

This time, the synthesized data release is made of sentences from the ParaCrawl v8 corpus in which rare words in the original and in the translation have been replaced with similar words, selected on the basis of the average cosine distance. Parallel data for the eight lowest-resourced official European languages is available for download.
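The replacement idea can be illustrated with a minimal sketch. It assumes, since the description above does not spell this out, that "average" means the mean of the source-side and target-side cosine distances between a rare word pair and each candidate pair. The function names, the toy vectors and the candidate list are all hypothetical and are not taken from the ParaCrawl tooling.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def best_replacement(rare_pair, candidates, src_vecs, tgt_vecs):
    """Pick the candidate (source, target) word pair with the lowest
    average cosine distance to the rare pair (assumed selection rule)."""
    src, tgt = rare_pair

    def avg_distance(pair):
        cand_src, cand_tgt = pair
        d_src = cosine_distance(src_vecs[src], src_vecs[cand_src])
        d_tgt = cosine_distance(tgt_vecs[tgt], tgt_vecs[cand_tgt])
        return (d_src + d_tgt) / 2.0

    return min(candidates, key=avg_distance)


# Toy two-dimensional embeddings, invented for illustration only.
src_vecs = {"heron": [1.0, 0.1], "stork": [0.9, 0.2], "engine": [0.0, 1.0]}
tgt_vecs = {"garza": [1.0, 0.2], "cigüeña": [0.8, 0.3], "motor": [0.1, 1.0]}
candidates = [("stork", "cigüeña"), ("engine", "motor")]

choice = best_replacement(("heron", "garza"), candidates, src_vecs, tgt_vecs)
print(choice)  # ('stork', 'cigüeña')
```

In practice the vectors would come from trained word embeddings for each language; the sketch only shows how an averaged distance can rank candidate pairs.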

More details

KEOPS

Quality Evaluation

KEOPS provides a complete tool for manual evaluation of parallel sentences and other linguistic tasks.

View release on Github

Bicleaner

Classifier

Bicleaner (bicleaner-classify) is a Python tool that detects noisy sentence pairs in a parallel corpus. It scores each pair of sentences with the likelihood that they are mutual translations: a value close to 1 means they probably are, and a value close to 0 means they probably are not. Sentence pairs considered very noisy are scored 0.
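A common use of such scores is to keep only the pairs above a chosen threshold. The sketch below is one way to do that. It assumes tab-separated input in which the score is the last column; the 0.5 threshold, the function name and the sample lines are illustrative assumptions and are not values recommended by the project.

```python
import io


def filter_by_score(lines, threshold=0.5):
    """Yield tab-separated lines whose last column, read as a score,
    is greater than or equal to the threshold."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # skip lines whose last column is not a number
        if score >= threshold:
            yield line


# Made-up sample: source sentence, target sentence, score.
sample = io.StringIO(
    "Hello world\tHola mundo\t0.93\n"
    "Click here\tzzzz\t0.0\n"
    "Good morning\tBuenos días\t0.71\n"
)

kept = list(filter_by_score(sample, threshold=0.5))
for row in kept:
    print(row)
```

Raising the threshold trades corpus size for cleanliness, so the right value depends on the language pair and the downstream task.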

View release on Github

Third Party Releases

JParaCrawl

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT Communication Science Laboratories. It was built by crawling the web at large scale and automatically aligning parallel sentences.

Read More

Citing ParaCrawl

Research

If you want to cite ParaCrawl, please refer to: ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages

A Japanese version is available through JParaCrawl (3rd Party Releases)
