WikiPrefs: human preferences dataset build from text edits - Open Research Data - Bridge of Knowledge

Search

WikiPrefs: human preferences dataset build from text edits

Description

WikiPrefs

The WikiPrefs dataset is a human preferences dataset for Large Language Models alignment. It was built using the EditPrefs method from historical edits of Wikipedia featured articles

The code used for creating the dataset is available on GitHub: https://github.com/jmajkutewicz/EditPrefs

Dataset Description

  • Language: English
  • License: CC BY-SA 4.0
  • Note that:
    • the text comes from Wikipedia and is subjected to CC BY-SA 4.0 license
    • the prompts were created using the GPT-3.5-turbo and are subjected to OpenAI license restrictions

Dataset Structure

The dataset is split into 63345 train and 2000 test samples. Each sample consists of:

  • page_id - Wikipedia article id
  • page_title - Wikipedia article title
  • section - section of the Wikipedia article
  • rev_id - the revision of the Wikipedia article
  • prev_rev_id - parent revision
  • timestamp - date of the edit
  • contributor - author of the edit
  • comment - comment associated with the edit
  • prompt - synthetic instruction that matches the responses
  • chosen - chosen response, created from the edited revision of the Wikipedia article; formatted as a list of messages
  • rejected - rejected response, created from the original revision of the Wikipedia article; formatted as a list of messages

Source Data

The dataset was created from the English Wikipedia dump from 01.04.2024

Applications

The dataset can be used for aligning Large Language Models with standard techniques such as RLHF or DPO

Dataset file

wiki_prefs.zip
20.4 MB, S3 ETag 87408b4237e1bec4ff3ffc740ba810b8-1, downloads: 5
The file hash is calculated from the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count} where a single part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
download file wiki_prefs.zip

File details

License:
Creative Commons: by-sa 4.0 open in new tab
CC BY-SA
Share-alike

Details

Year of publication:
2024
Verification date:
2024-10-21
Dataset language:
English
Fields of science:
  • information and communication technology (Engineering and Technology)
DOI:
DOI ID 10.34808/vnjf-8275 open in new tab
Verified by:
Gdańsk University of Technology

Keywords

Cite as

seen 40 times