TF-IDF weighted bag-of-words preprocessed text documents from Simple English Wikipedia

Description

The SimpleWiki2K-scores dataset contains text documents preprocessed into a TF-IDF weighted bag-of-words representation (raw strings are not available) [feature matrix] and their multi-label assignments [label matrix]. Label scores for each document are also provided for two classifiers: an enhanced multi-label KNN [1] and LEML [2]. The dataset is intended as a benchmark for score thresholding methods, which are needed to turn classifier scores into multi-label predictions.
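To illustrate what score thresholding means here, the sketch below binarizes a sparse score matrix with a single global threshold. This is only the simplest possible baseline, not a method proposed with the dataset. The function name `threshold_scores`, the toy scores, and the threshold value are all assumptions made for illustration, and the sketch assumes a non-negative threshold so that scores not stored in the sparse matrix are never predicted.

```python
import numpy as np
from scipy import sparse


def threshold_scores(scores, threshold):
    """Predict label j for document i when its stored score exceeds `threshold`.

    Hypothetical helper: assumes `threshold >= 0`, so entries that are not
    stored in the sparse matrix (implicit zeros) are never predicted.
    """
    predictions = sparse.csr_matrix(scores).copy()
    predictions.data = (predictions.data > threshold).astype(np.int32)
    predictions.eliminate_zeros()  # drop entries that fell below the threshold
    return predictions


# Toy score matrix: 2 documents x 3 labels (values are made up).
toy_scores = sparse.csr_matrix(
    np.array([[0.9, 0.1, 0.0], [0.4, 0.6, 0.7]], dtype=np.float32)
)
toy_predictions = threshold_scores(toy_scores, threshold=0.5)
print(toy_predictions.toarray())
```

The result has the same layout as the label matrix (documents in rows, labels in columns), so it can be compared against it directly.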

Original source of data and preprocessing: Simple English Wikipedia (dump from 2012-05-07) is the source of the text documents and category assignments. All articles from the main categories were taken, up to level 5 of the category hierarchy. All categories with fewer than 10 articles were then removed from the dataset, followed by all articles left without any category assignment. This category/article removal process was repeated until there were no categories with fewer than 10 documents and no documents without at least one category assignment. Documents are represented as bag-of-words vectors with a TF-IDF weighting scheme.
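The removal process described above can be sketched on a binary document-by-label matrix as follows. This is a reconstruction from the description, not the authors' preprocessing code: the function name `prune_labels_and_documents`, its parameter, and the toy matrix are assumptions, and the real pipeline operated on Wikipedia articles and categories rather than on a ready-made matrix.

```python
import numpy as np
from scipy import sparse


def prune_labels_and_documents(label_mtx, min_docs_per_label=10):
    """Repeatedly drop rare labels, then unlabeled documents (hypothetical helper).

    Assumes `label_mtx` is a binary documents x labels matrix.
    Returns the pruned matrix and the original indices of the kept rows/columns.
    """
    label_mtx = sparse.csr_matrix(label_mtx)
    doc_idx = np.arange(label_mtx.shape[0])
    label_idx = np.arange(label_mtx.shape[1])
    while True:
        # Step 1: remove labels assigned to too few documents.
        docs_per_label = np.asarray(label_mtx.sum(axis=0)).ravel()
        keep_labels = np.flatnonzero(docs_per_label >= min_docs_per_label)
        labels_removed = len(keep_labels) < label_mtx.shape[1]
        label_mtx = label_mtx[:, keep_labels]
        label_idx = label_idx[keep_labels]

        # Step 2: remove documents left without any label.
        labels_per_doc = np.asarray(label_mtx.sum(axis=1)).ravel()
        keep_docs = np.flatnonzero(labels_per_doc > 0)
        docs_removed = len(keep_docs) < label_mtx.shape[0]
        label_mtx = label_mtx[keep_docs]
        doc_idx = doc_idx[keep_docs]

        # Stop once a full pass removes nothing.
        if not labels_removed and not docs_removed:
            return label_mtx, doc_idx, label_idx


# Toy example with a lowered limit of 2 documents per label.
toy_labels = sparse.csr_matrix(
    np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 1]], dtype=np.int32)
)
pruned, kept_docs, kept_labels = prune_labels_and_documents(
    toy_labels, min_docs_per_label=2
)
print(pruned.toarray(), kept_docs, kept_labels)
```

In the toy example the middle label occurs only once, so it is dropped, which in turn leaves the third document without labels and removes it as well.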

The dataset is split into train and test parts. Additionally, the train part is subdivided into 10 validation folds. All these partitions were obtained using the iterative multi-label stratification algorithm [3]. For each validation fold, scores from the KNN and LEML classifiers are provided for both the train data part and the validation data part, computed after the classifiers were trained on that fold. Scores for the test part are also provided, computed after the classifiers were trained on the whole training split.

Both the feature and label matrices, as well as all the provided scores, are Python scipy.sparse.csr_matrix objects saved in the .npz format. Each of them can be loaded with the scipy.sparse.load_npz(fp) function.
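A minimal sketch of the save/load round trip is shown below. The names of the files inside the archive are not listed on this page, so the example writes a small toy matrix to a temporary file (the name `toy_features.npz` is hypothetical) and reads it back the same way a dataset file would be read.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy stand-in for a dataset matrix; the real matrices are far larger.
toy = sparse.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]], dtype=np.float32))

with tempfile.TemporaryDirectory() as tmp_dir:
    fp = os.path.join(tmp_dir, "toy_features.npz")  # hypothetical file name
    sparse.save_npz(fp, toy)
    loaded = sparse.load_npz(fp)  # same call as for the dataset files

# Basic properties, in the same form as the summary lines for the dataset.
print(type(loaded), loaded.shape, loaded.nnz, loaded.dtype)
```

For the actual dataset, `fp` would point to one of the .npz files extracted from the downloaded archive.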

[1] Han X, Li S, Shen Z. A k-NN method for large scale hierarchical text classification at LSHTC3. In: Proceedings of the 2012 ECML/PKDD Discovery Challenge Workshop on Large-Scale Hierarchical Text Classification, Bristol 2012.

[2] Yu HF, Jain P, Kar P, Dhillon I. Large-scale multi-label learning with missing labels. In: International conference on machine learning 2014 Jan 27 (pp. 593-601). PMLR.

[3] Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III (pp. 145-158). Springer Berlin Heidelberg.

Whole dataset (train and test parts jointly) summary:

nDocs = 67505, nLabels = 1849, nFeatures = 97179

Label_mtx:

type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 1849), nnz=108076, dtype=int32

Documents per label:
in [0.0-1.0): 0 items
in [1.0-3.0): 0 items
in [3.0-10.0): 0 items
in [10.0-30.0): 1363 items
in [30.0-100.0): 403 items
in [100.0-300.0): 48 items
in [300.0-1000.0): 21 items
in [1000.0-3000.0): 8 items
in [3000.0-inf): 6 items
min=10, mean=58.45105462412115, max=10307

Labels per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 59639 items
in [3.0-10.0): 7853 items
in [10.0-30.0): 13 items
in [30.0-100.0): 0 items
in [100.0-300.0): 0 items
in [300.0-1000.0): 0 items
in [1000.0-3000.0): 0 items
in [3000.0-inf): 0 items
min=1, mean=1.6010073327901637, max=14

Features_mtx:

type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 97179), nnz=4158000, dtype=float32

Documents per feature:
in [0.0-1.0): 0 items
in [1.0-3.0): 32381 items
in [3.0-10.0): 39198 items
in [10.0-30.0): 13271 items
in [30.0-100.0): 7070 items
in [100.0-300.0): 2945 items
in [300.0-1000.0): 1560 items
in [1000.0-3000.0): 568 items
in [3000.0-inf): 186 items
min=2, mean=42.787021887444816, max=23616

Features per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 47 items
in [3.0-10.0): 10085 items
in [10.0-30.0): 18785 items
in [30.0-100.0): 28828 items
in [100.0-300.0): 7976 items
in [300.0-1000.0): 1731 items
in [1000.0-3000.0): 53 items
in [3000.0-inf): 0 items
min=1, mean=61.59543737500926, max=2735
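Summaries of this kind can be recomputed from a loaded matrix by counting stored entries per column and per row and binning the counts. The sketch below is an assumption about how the figures were derived, not the authors' script; the helper `summarize_counts` and the toy matrix are hypothetical, and the bin edges are copied from the listings above.

```python
import numpy as np
from scipy import sparse

# Lower bin edges used in the listings; the last bin is [3000, inf).
BIN_EDGES = np.array([0, 1, 3, 10, 30, 100, 300, 1000, 3000])


def summarize_counts(counts):
    """Bin non-negative integer counts into half-open intervals (hypothetical helper)."""
    counts = np.asarray(counts).ravel()
    # Index of the bin whose lower edge is the largest edge <= count.
    bin_idx = np.searchsorted(BIN_EDGES, counts, side="right") - 1
    hist = np.bincount(bin_idx, minlength=len(BIN_EDGES))
    return {
        "hist": hist.tolist(),
        "min": int(counts.min()),
        "mean": float(counts.mean()),
        "max": int(counts.max()),
    }


# Toy label matrix: 4 documents x 3 labels.
toy_label_mtx = sparse.csr_matrix(
    np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 1]], dtype=np.int32)
)
docs_per_label = summarize_counts(toy_label_mtx.getnnz(axis=0))
labels_per_doc = summarize_counts(toy_label_mtx.getnnz(axis=1))
print(docs_per_label)
print(labels_per_doc)
```

Applied to the feature matrix, the same two calls would give the "Documents per feature" and "Features per document" summaries, assuming explicit zeros are not stored.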

Dataset file

SimpleWiki2K.zip

545.4 MB, S3 ETag 9ee7710ee98322af5edcf12ef7438adf-2, downloads: 114

The file hash is calculated from the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count} where a single part of the file is 512 MB in size.

Example script for calculation:
https://github.com/antespi/s3md5
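As an alternative to the linked script, the formula above can be sketched in Python with only the standard library. The function name `s3_multipart_etag` and its parameters are assumptions; the 512 MB part size comes from the description above. The formula applies to files uploaded in multiple parts, and a 545.4 MB file with 512 MB parts yields two parts, which agrees with the "-2" suffix of the listed ETag.

```python
import hashlib


def s3_multipart_etag(path, part_size=512 * 1024 * 1024, chunk_size=8 * 1024 * 1024):
    """Compute hexmd5(md5(part1) + md5(part2) + ...)-{parts_count} for a local file.

    Hypothetical helper: reads each part in smaller chunks to limit memory use.
    """
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part_md5 = hashlib.md5()
            remaining = part_size
            while remaining > 0:
                chunk = f.read(min(chunk_size, remaining))
                if not chunk:
                    break  # end of file
                part_md5.update(chunk)
                remaining -= len(chunk)
            if remaining == part_size:
                break  # no bytes were read for this part, so the file is exhausted
            part_digests.append(part_md5.digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Usage, assuming the archive was downloaded to the current directory:
# print(s3_multipart_etag("SimpleWiki2K.zip"))
```

If the printed value matches the S3 ETag listed for the file, the download is intact.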

File details

License:

CC BY

Attribution
Software: Python + SciPy

Details

Year of publication:

2023

Verification date:

2023-04-25

Creation date:

2020

Dataset language:

English

Fields of science:

information and communication technology (Engineering and Technology)

DOI:

10.34808/fmnf-4767

Verified by:

Gdańsk University of Technology
