Elgold intermediate: raw texts

Description

The dataset contains raw texts scraped from various internet sources, which were used to create the Elgold dataset.

The texts were collected from 7 main categories: "News", "Job offers", "Movie reviews", "Automotive blogs", "Amazon product reviews", "Scientific papers abstracts", and "Historic blogs". The "Scientific papers abstracts" category was further divided into five subcategories: "Biomedicine", "Life Sciences", "Mathematics", "Medicine & Public Health", and "Science, Humanities and Social Sciences, multidisciplinary".

The raw texts were collected from publicly available internet sources by a group of 14 participants, with 2-3 participants assigned to each category.

The dataset consists of approximately 100 texts for each category (and subcategory in the case of "Scientific papers abstracts").

Dataset file

raw.zip

1.0 MB, S3 ETag 22f7c4a62f2f469187172d70e1df7b98-1

The file hash (the S3 ETag above) is calculated with the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each part of the file is 512 MB in size and the final part may be smaller.

Example script for calculation:
https://github.com/antespi/s3md5
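
The hash formula can also be sketched in a few lines of Python. This is a minimal illustration and not the repository's own tool. It assumes that md5(part) in the formula means the raw 16-byte digest of each part (the usual convention for S3 multipart ETags) and that the last part may be shorter than 512 MB. The function name s3_etag and its parameters are hypothetical.

```python
import hashlib
import io

PART_SIZE = 512 * 1024 * 1024  # 512 MB per part, as stated in the description
READ_SIZE = 1024 * 1024        # read in 1 MB chunks to keep memory use low


def s3_etag(stream, part_size=PART_SIZE):
    """Return hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a binary stream.

    Assumption: md5(partN) is the raw (binary) digest of each part.
    """
    digests = []
    while True:
        md5 = hashlib.md5()
        remaining = part_size
        while remaining > 0:
            chunk = stream.read(min(remaining, READ_SIZE))
            if not chunk:
                break  # end of stream
            md5.update(chunk)
            remaining -= len(chunk)
        if remaining == part_size:
            break  # nothing was read for this part: no more data
        digests.append(md5.digest())
        if remaining > 0:
            break  # the stream ended inside this (final, shorter) part
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


# Small in-memory demonstration with an artificially tiny part size:
# six bytes split into parts of four bytes give two parts.
demo_etag = s3_etag(io.BytesIO(b"abcdef"), part_size=4)
```

To check a downloaded copy, open raw.zip in binary mode, pass the file object to s3_etag, and compare the result with the S3 ETag listed above. Because the file is about 1.0 MB, it fits in a single part, which matches the -1 suffix of the listed ETag.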

File details

License:

CC 0

Public Domain Dedication

Details

Year of publication:

2024

Verification date:

2024-06-28

Creation date:

2024

Dataset language:

English

Fields of science:

information and communication technology (Engineering and Technology)

DOI:

10.34808/py0a-xj82

Series:

Elgold intermediate

Verified by:

Gdańsk University of Technology
