Description
The dataset contains raw texts scraped from various internet sources, which were used to create the Elgold dataset.
The texts were collected from 7 main categories: "News", "Job offers", "Movie reviews", "Automotive blogs", "Amazon product reviews", "Scientific papers abstracts", and "Historic blogs". The "Scientific papers abstracts" category was further divided into five subcategories: "Biomedicine", "Life Sciences", "Mathematics", "Medicine & Public Health", and "Science, Humanities and Social Sciences, multidisciplinary".
The raw texts were collected from publicly available Internet sources by a group of 14 participants. Each category was assigned 2-3 participants.
The dataset consists of approximately 100 texts for each category (and subcategory in the case of "Scientific papers abstracts").
Dataset file
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}
where a single part of the file is 512 MB in size.
Example script for calculation:
https://github.com/antespi/s3md5
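The checksum formula above can also be sketched in Python. This is a minimal illustration, not the repository's own tool: the function name `multipart_md5` and the `chunk_size` parameter are hypothetical, "512 MB" is assumed to mean 512 × 1024 × 1024 bytes, and the per-part MD5 values are assumed to be concatenated as raw binary digests (the usual convention for S3-style multipart checksums) before the final MD5 is taken.

```python
import hashlib

# Assumption: "512 MB" is interpreted as 512 MiB.
PART_SIZE = 512 * 1024 * 1024


def multipart_md5(stream, part_size=PART_SIZE, chunk_size=1024 * 1024):
    """Return 'hexmd5(md5(part1)+md5(part2)+...)-{parts_count}' for a binary stream.

    Hypothetical helper: each part is hashed incrementally in small chunks
    so that a whole 512 MB part never has to be held in memory.
    """
    digests = []
    while True:
        part_hash = hashlib.md5()
        remaining = part_size
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:
                break  # end of stream inside (or at the start of) this part
            part_hash.update(chunk)
            remaining -= len(chunk)
        if remaining == part_size:
            break  # nothing was read for this part: the stream is exhausted
        digests.append(part_hash.digest())
        if remaining > 0:
            break  # the last part was shorter than part_size
    # Assumption: raw (binary) digests are concatenated, then hashed again.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

To check a downloaded file, open it in binary mode and pass the file object to the function, for example `with open(path, "rb") as f: print(multipart_md5(f))`, then compare the result with the checksum listed for the dataset file. If the published value was produced differently (for instance, with a different part size), this sketch will not match it.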
File details
- License:
- CC0 Public Domain Dedication
Details
- Year of publication:
- 2024
- Verification date:
- 2024-06-28
- Creation date:
- 2024
- Dataset language:
- English
- Fields of science:
- information and communication technology (Engineering and Technology)
- DOI:
- 10.34808/py0a-xj82
- Series:
- Verified by:
- Gdańsk University of Technology
References
- dataset Elgold: gold standard, multi-genre dataset for named entity recognition and linking
- dataset Elgold intermediate: annotated raw