Polish-Kashubian parallel translation corpus - Open Research Data - Bridge of Knowledge

Polish-Kashubian parallel translation corpus

version 2.0

Description

The dataset contains Polish words and sentences and their translations into Kashubian. It consists of train and test subsets. The train subset contains about 100,000 parallel translations and was created from two types of sources. The first type is online dictionaries:

  1. kaszebe.org
  2. sloworz.org
  3. odmiana.net

The second type is existing datasets that were incorporated into this one:

  1. OPUS
  2. Tatoeba Challenge

The dataset was carefully cleaned and duplicates were removed.

The test subset is distributed together with the train subset. It contains 70 parallel Polish-Kashubian sentences. The sentences were obtained from sources other than those used for the train subset, so that evaluation on them is reliable.

Dataset file

Polish-Kashubian parallel translation corpus.zip
814.0 kB, S3 ETag 44810ca14f445862b0bbd85c3fa03ec7-1, downloads: 10
The file hash is calculated from the formula hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where a single part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
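
The hash formula above can also be sketched directly in Python, as a way to check a downloaded copy against the listed S3 ETag. This is a minimal illustration rather than the repository's own tool: the function name s3_multipart_etag, the block_size parameter, and the reading of "512 MB" as 512 * 1024 * 1024 bytes are assumptions.

```python
import hashlib

# Assumption: "512 MB" in the formula means 512 * 1024 * 1024 bytes.
PART_SIZE = 512 * 1024 * 1024


def s3_multipart_etag(path, part_size=PART_SIZE, block_size=1024 * 1024):
    """Return hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part_md5 = hashlib.md5()
            remaining = part_size
            # Stream one part in small blocks so it never sits fully in memory.
            while remaining > 0:
                block = f.read(min(block_size, remaining))
                if not block:
                    break
                part_md5.update(block)
                remaining -= len(block)
            if remaining == part_size:
                break  # nothing was read for this part: end of file
            part_digests.append(part_md5.digest())
    # Hash the concatenated binary MD5 digests, then append the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Calling s3_multipart_etag("Polish-Kashubian parallel translation corpus.zip") on a local copy (the path is hypothetical) returns a string in the same form as the ETag listed above. If the download is intact and the part-size assumption holds, the two values should match.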

File details

License:
Creative Commons CC0 1.0 (Public Domain Dedication)

Details

Year of publication:
2024
Verification date:
2025-02-01
Dataset language:
Polish
Fields of science:
  • information and communication technology (Engineering and Technology)
DOI:
10.34808/5whb-dk74
Verified by:
Gdańsk University of Technology

Version

This document has several versions.

DOI 10.34808/4sbd-2v21 represents the latest version of the data.
