Rust QA: question answering dataset for "The Rust Programming Language" in SQuAD 2.0 format

Description

Rust QA is a dataset for training and evaluating QA systems. The dataset consists of 1068 questions about "The Rust Programming Language" book (https://doc.rust-lang.org/stable/book/), with the answers provided as text spans from the book. The dataset is released in SQuAD 2.0 format.

The dataset is split into 854 train, 107 validation and 107 test samples. Each split is saved in a separate JSON file. Each data sample consists of the following notable fields:

"context" - a larger fragment of text. In our dataset it corresponds to a particular chapter of the language book.
"qas" - a list of questions with answers for the specified context. Each question is an object with "question", "id", "answers" and "is_impossible" fields. Every question in the dataset has exactly one answer and is possible to answer.
"question" - the question in textual format.
"text" - the answer in textual format.
"answer_start" - the position of the first character of the answer in the context text.

The dataset was created using the Haystack annotation tool (https://docs.haystack.deepset.ai/docs/annotation). All 105 chapters of the language book were split evenly between five annotators, who then devised questions based on each chapter’s content.

Together with the dataset we release the Rust book that was used for creating the annotations.
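
The fields described above can be read with a few lines of Python. The following is a minimal sketch that assumes the usual SQuAD 2.0 nesting ("data" → "paragraphs" → "context" and "qas"); the function names and the file name in the usage note are hypothetical and not part of the dataset.

```python
import json


def load_split(path):
    """Load one split (a SQuAD 2.0 style JSON file) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_samples(squad):
    """Yield (id, question, answer text, answer_start) for every answer."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield qa["id"], qa["question"], answer["text"], answer["answer_start"]


def find_misaligned(squad):
    """Return ids of questions whose answer text does not match the context span.

    Assumes "answer_start" is a 0-based character offset into "context".
    """
    bad = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad.append(qa["id"])
    return bad
```

For example, `find_misaligned(load_split("train.json"))` would list any questions whose stored offset does not point at the answer text, where "train.json" stands in for whatever the split file is actually called in the archive.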

Dataset file

Rust QA.zip

9.5 MB, S3 ETag 9aabe795aa37db0ffd0dd3f75b1cf245-1

The file hash is calculated using the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
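
As an alternative to the linked script, the formula above can be sketched directly in Python with the standard hashlib module. This is an illustrative sketch, not the repository's own tool: the function name is hypothetical, and it assumes that "512 MB" means 512 * 1024 * 1024 bytes and that md5(part) in the formula denotes the raw 16-byte digest of each part.

```python
import hashlib

PART_SIZE = 512 * 1024 * 1024  # assumed meaning of "512 MB" per part
BLOCK_SIZE = 1024 * 1024       # read in 1 MB blocks to keep memory use low


def multipart_etag(path, part_size=PART_SIZE, block_size=BLOCK_SIZE):
    """Return hexmd5(md5(part1) + md5(part2) + ...) followed by "-{parts_count}"."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part_md5 = hashlib.md5()
            remaining = part_size
            # Consume up to part_size bytes for the current part.
            while remaining > 0:
                block = f.read(min(block_size, remaining))
                if not block:
                    break
                part_md5.update(block)
                remaining -= len(block)
            if remaining == part_size:
                break  # nothing was read for this part: end of file
            part_digests.append(part_md5.digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

If these assumptions hold, calling `multipart_etag("Rust QA.zip")` on the downloaded archive should reproduce the ETag listed above; the file is smaller than one part, so the suffix is -1.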

File details

License:

CC BY (Attribution)

Details

Year of publication:

2024

Verification date:

2024-02-28

Dataset language:

English

Fields of science:

information and communication technology (Engineering and Technology)

DOI:

10.34808/c05c-9542

Verified by:

Gdańsk University of Technology

Authors

Szymon Olewniczak, mgr inż.
Department of Computer Architecture
ORCID: 0000-0002-9387-8546
Project Manager

Michał Maciszka
Faculty of Electronics, Telecommunications and Informatics
ORCID: 0009-0007-3424-4491
Creator

Kamil Paluszewski
Faculty of Electronics, Telecommunications and Informatics
ORCID: 0009-0001-5582-8847
Creator

Grzegorz Pozorski
Faculty of Electronics, Telecommunications and Informatics
ORCID: 0009-0002-9260-9878
Creator

Wojciech Rosenthal
Faculty of Electronics, Telecommunications and Informatics
ORCID: 0009-0007-2524-7295
Creator

Łukasz Zaleski
Faculty of Electronics, Telecommunications and Informatics
ORCID: 0009-0000-5861-0176
Creator
