Polish-Kashubian parallel translation corpus - Open Research Data - Bridge of Knowledge

Polish-Kashubian parallel translation corpus

Description

The dataset contains about 120,000 Polish words and sentences together with their Kashubian translations. It was compiled from two types of sources. The first type is online dictionaries:

  1. kaszebe.org
  2. sloworz.org
  3. odmiana.net

The second type is existing datasets that were incorporated into this one:

  1. OPUS
  2. Tatoeba Challenge

The dataset was pre-cleaned and duplicates were removed.

Dataset file

dataset.zip
933.8 kB, S3 ETag 300ef6fafbc762e50e991f8d2ae60466-1, downloads: 27
The file hash (S3 ETag) is calculated using the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
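
The formula above can also be sketched in a few lines of Python. This is a minimal illustration, not the repository's own tooling: the function name s3_multipart_etag and its parameters are hypothetical, and it assumes that "512 MB" means 512 * 1024 * 1024 bytes. For a file smaller than one part, such as dataset.zip, that assumption does not affect the result.

```python
import hashlib

# Assumption: "512 MB" is interpreted as 512 MiB (512 * 1024 * 1024 bytes).
PART_SIZE = 512 * 1024 * 1024


def s3_multipart_etag(path, part_size=PART_SIZE, block_size=1024 * 1024):
    """Return hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            md5 = hashlib.md5()
            remaining = part_size
            # Stream one part in small blocks to avoid loading it whole.
            while remaining > 0:
                block = f.read(min(block_size, remaining))
                if not block:
                    break
                md5.update(block)
                remaining -= len(block)
            if remaining == part_size:
                break  # nothing was read for this part: end of file
            part_digests.append(md5.digest())
    # Hash the concatenated binary digests of all parts, then append the count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

After downloading the file, the value returned by s3_multipart_etag("dataset.zip") can be compared with the S3 ETag listed above. The "-1" suffix of the listed ETag indicates that the file was hashed as a single part.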

File details

License:
Creative Commons CC0 1.0
Public Domain Dedication

Details

Year of publication:
2024
Verification date:
2024-09-30
Dataset language:
Polish
Fields of science:
  • information and communication technology (Engineering and Technology)
DOI:
10.34808/t930-fs97
Verified by:
Gdańsk University of Technology
