Elgold intermediate: raw texts - Open Research Data - Bridge of Knowledge

Elgold intermediate: raw texts

Description

The dataset contains raw texts scraped from various internet sources, which were used to create the Elgold dataset.

The texts were collected from 7 main categories: "News", "Job offers", "Movie reviews", "Automotive blogs", "Amazon product reviews", "Scientific papers abstracts", and "Historic blogs". The "Scientific papers abstracts" category was further divided into five subcategories: "Biomedicine", "Life Sciences", "Mathematics", "Medicine & Public Health", and "Science, Humanities and Social Sciences, multidisciplinary".

The raw texts were collected from publicly available Internet sources by a group of 14 participants, with 2-3 participants assigned to each category.

The dataset consists of approximately 100 texts for each category (and subcategory in the case of "Scientific papers abstracts").

Dataset file

raw.zip
1.0 MB, S3 ETag 22f7c4a62f2f469187172d70e1df7b98-1, downloads: 45
The file hash (the S3 ETag above) is calculated using the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
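
The formula above can also be sketched in a few lines of Python. This is a minimal illustration, not the repository's own tool: the function name `s3_multipart_etag` and the `chunk_size` parameter are hypothetical, and "512 MB" is assumed here to mean 512 * 1024 * 1024 bytes.

```python
import hashlib

# Assumption: "512 MB" is interpreted as 512 MiB.
PART_SIZE = 512 * 1024 * 1024


def s3_multipart_etag(path, part_size=PART_SIZE, chunk_size=1024 * 1024):
    """Return hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            md5 = hashlib.md5()
            remaining = part_size
            # Stream one part in small chunks to avoid loading it all at once.
            while remaining > 0:
                chunk = f.read(min(chunk_size, remaining))
                if not chunk:
                    break
                md5.update(chunk)
                remaining -= len(chunk)
            if remaining == part_size:
                break  # nothing was read for this part: end of file
            part_digests.append(md5.digest())
    # Hash the concatenated binary digests, then append the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Calling `s3_multipart_etag("raw.zip")` on the downloaded file and comparing the result with the ETag listed above is one way to check the download. Since raw.zip is about 1.0 MB, it fits in a single part, which matches the `-1` suffix of the listed ETag.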

File details

License:
Creative Commons CC0 1.0 (Public Domain Dedication)

Details

Year of publication:
2024
Verification date:
2024-06-28
Creation date:
2024
Dataset language:
English
Fields of science:
  • information and communication technology (Engineering and Technology)
DOI:
10.34808/py0a-xj82
Verified by:
Gdańsk University of Technology
