Description
WikiPrefs
The WikiPrefs dataset is a human preferences dataset for Large Language Models alignment. It was built using the EditPrefs method from historical edits of Wikipedia featured articles
The code used for creating the dataset is available on GitHub: https://github.com/jmajkutewicz/EditPrefs
Dataset Description
- Language: English
- License: CC BY-SA 4.0
- Note that:
- the text comes from Wikipedia and is subjected to CC BY-SA 4.0 license
- the prompts were created using the GPT-3.5-turbo and are subjected to OpenAI license restrictions
Dataset Structure
The dataset is split into 63345 train and 2000 test samples. Each sample consists of:
- page_id - Wikipedia article id
- page_title - Wikipedia article title
- section - section of the Wikipedia article
- rev_id - the revision of the Wikipedia article
- prev_rev_id - parent revision
- timestamp - date of the edit
- contributor - author of the edit
- comment - comment associated with the edit
- prompt - synthetic instruction that matches the responses
- chosen - chosen response, created from the edited revision of the Wikipedia article; formatted as a list of messages
- rejected - rejected response, created from the original revision of the Wikipedia article; formatted as a list of messages
Source Data
The dataset was created from the English Wikipedia dump from 01.04.2024
Applications
The dataset can be used for aligning Large Language Models with standard techniques such as RLHF or DPO
Dataset file
wiki_prefs.zip
20.4 MB,
S3 ETag
87408b4237e1bec4ff3ffc740ba810b8-1,
downloads: 35
The file hash is calculated from the formula
Example script for calculation:
https://github.com/antespi/s3md5
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}
where a single part of the file is 512 MB in size.Example script for calculation:
https://github.com/antespi/s3md5
File details
- License:
-
open in new tabCC BY-SAShare-alike
Details
- Year of publication:
- 2024
- Verification date:
- 2024-10-21
- Dataset language:
- English
- Fields of science:
-
- information and communication technology (Engineering and Technology)
- DOI:
- DOI ID 10.34808/vnjf-8275 open in new tab
- Verified by:
- Gdańsk University of Technology
Keywords
Cite as
Authors
seen 84 times