Description
The SimpleWiki2K-scores dataset contains text documents preprocessed into a TF-IDF-weighted bag-of-words representation (raw strings are not available) [feature matrix] and their multi-label assignments [label matrix]. Label scores for each document, produced by an enhanced multi-label KNN classifier [1] and a LEML classifier [2], are also provided. The dataset is intended as a benchmark for score-thresholding methods, which are needed to turn label scores into multi-label predictions.
Original source of data and preprocessing: Simple English Wikipedia (dump from 2012-05-07) is the source of the text documents and category assignments. All articles from the main categories were taken, up to level 5 of the category hierarchy. All categories with fewer than 10 articles were then removed from the dataset, followed by all articles left without any category assignment. This category/article removal process was repeated until no category had fewer than 10 documents and every document had at least one category assignment. Documents are represented as bags of words with a TF-IDF weighting scheme.
The dataset is split into train and test parts. Additionally, the train part is subdivided into 10 validation folds. All these partitions were obtained using the iterative multi-label stratification algorithm [3]. For each validation fold, the KNN and LEML scores are provided for both the train data part and the validation data part, computed after the classifiers were trained in that fold. Scores for the test part are also provided, computed after the classifiers were trained on the whole train part.
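As an illustration of the kind of method this benchmark targets, the sketch below selects a single global score threshold on a validation fold by maximizing micro-averaged F1, then applies it. This is only one simple baseline, not a method prescribed by the dataset. The function names and the toy matrices are hypothetical; real score and label matrices would be loaded from the dataset files instead.

```python
import numpy as np
from scipy import sparse


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for two binary sparse matrices of equal shape."""
    tp = y_true.multiply(y_pred).sum()
    denom = y_true.sum() + y_pred.sum()  # equals 2*TP + FP + FN
    return 2.0 * tp / denom if denom > 0 else 0.0


def apply_threshold(scores, t):
    """Predict every (document, label) pair whose stored score is >= t.

    Assumes t > 0, so pairs without a stored score are never predicted.
    """
    pred = sparse.csr_matrix(scores).copy()
    pred.data = (pred.data >= t).astype(np.int32)
    pred.eliminate_zeros()
    return pred


def pick_global_threshold(val_scores, val_labels, candidates):
    """Return the candidate threshold with the best validation micro-F1."""
    return max(
        candidates,
        key=lambda t: micro_f1(val_labels, apply_threshold(val_scores, t)),
    )


# Toy validation fold: 3 documents x 3 labels (hypothetical values).
val_scores = sparse.csr_matrix(
    np.array([[0.9, 0.2, 0.0], [0.1, 0.8, 0.6], [0.0, 0.3, 0.7]])
)
val_labels = sparse.csr_matrix(
    np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=np.int32)
)

best_t = pick_global_threshold(val_scores, val_labels, [0.1, 0.3, 0.5, 0.7])
val_pred = apply_threshold(val_scores, best_t)
print(best_t, micro_f1(val_labels, val_pred))
```

In a benchmark run, the threshold chosen on the validation scores would then be applied to the test scores. Per-label or per-document thresholds follow the same pattern with a different search loop.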
The feature and label matrices, as well as all the provided scores, are Python scipy.sparse.csr_matrix objects saved in the npz format. All of them can be loaded with the scipy.sparse.load_npz(fp) function.
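A minimal loading sketch follows. The file name used here is a hypothetical stand-in, since the actual file names are not listed in this description. To keep the example runnable anywhere, it first writes a tiny matrix with scipy.sparse.save_npz and then reads it back.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; replace with a real file from the dataset.
    fp = os.path.join(tmp_dir, "example_label_mtx.npz")

    # Build and save a tiny 2 x 3 label matrix as a stand-in.
    demo = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]], dtype=np.int32))
    sparse.save_npz(fp, demo)

    # The loading call referred to in the description.
    label_mtx = sparse.load_npz(fp)

print(type(label_mtx).__name__, label_mtx.shape, label_mtx.nnz, label_mtx.dtype)
```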
[1] Han X, Li S, Shen Z. A k-NN method for large scale hierarchical text classification at LSHTC3. In: Proceedings of the 2012 ECML/PKDD Discovery Challenge Workshop on Large-Scale Hierarchical Text Classification, Bristol 2012.
[2] Yu HF, Jain P, Kar P, Dhillon I. Large-scale multi-label learning with missing labels. In: International conference on machine learning 2014 Jan 27 (pp. 593-601). PMLR.
[3] Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III (pp. 145-158). Springer Berlin Heidelberg.
Whole dataset (train and test parts jointly) summary:
nDocs = 67505, nLabels = 1849, nFeatures = 97179
Label_mtx:
type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 1849), nnz=108076, dtype=int32
Documents per label:
in [0.0-1.0): 0 items
in [1.0-3.0): 0 items
in [3.0-10.0): 0 items
in [10.0-30.0): 1363 items
in [30.0-100.0): 403 items
in [100.0-300.0): 48 items
in [300.0-1000.0): 21 items
in [1000.0-3000.0): 8 items
in [3000.0-inf): 6 items
min=10, mean=58.45105462412115, max=10307
Labels per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 59639 items
in [3.0-10.0): 7853 items
in [10.0-30.0): 13 items
in [30.0-100.0): 0 items
in [100.0-300.0): 0 items
in [300.0-1000.0): 0 items
in [1000.0-3000.0): 0 items
in [3000.0-inf): 0 items
min=1, mean=1.6010073327901637, max=14
Features_mtx:
type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 97179), nnz=4158000, dtype=float32
Documents per feature:
in [0.0-1.0): 0 items
in [1.0-3.0): 32381 items
in [3.0-10.0): 39198 items
in [10.0-30.0): 13271 items
in [30.0-100.0): 7070 items
in [100.0-300.0): 2945 items
in [300.0-1000.0): 1560 items
in [1000.0-3000.0): 568 items
in [3000.0-inf): 186 items
min=2, mean=42.787021887444816, max=23616
Features per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 47 items
in [3.0-10.0): 10085 items
in [10.0-30.0): 18785 items
in [30.0-100.0): 28828 items
in [100.0-300.0): 7976 items
in [300.0-1000.0): 1731 items
in [1000.0-3000.0): 53 items
in [3000.0-inf): 0 items
min=1, mean=61.59543737500926, max=2735
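The per-label and per-document counts above can be recomputed from the matrices by counting stored entries along each axis. The following is a minimal sketch, not the authors' original script: the bin edges are copied from the summary, while the helper name and the toy matrix are hypothetical.

```python
import numpy as np
from scipy import sparse

# Lower bin edges copied from the summary; the last bin is open-ended.
BIN_EDGES = np.array([0, 1, 3, 10, 30, 100, 300, 1000, 3000])


def summarize_counts(counts):
    """Return (histogram over BIN_EDGES, min, mean, max) for integer counts."""
    counts = np.asarray(counts)
    # Index of the half-open bin [edge_i, edge_{i+1}) each count falls into.
    bin_idx = np.searchsorted(BIN_EDGES, counts, side="right") - 1
    hist = np.bincount(bin_idx, minlength=len(BIN_EDGES))
    return hist, counts.min(), counts.mean(), counts.max()


# Toy label matrix (4 documents x 3 labels) standing in for Label_mtx.
label_mtx = sparse.csr_matrix(
    np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]], dtype=np.int32)
)

# getnnz counts stored entries, so explicitly stored zeros would be included.
docs_per_label = label_mtx.getnnz(axis=0)  # one count per column (label)
labels_per_doc = label_mtx.getnnz(axis=1)  # one count per row (document)

for name, counts in [("Documents per label", docs_per_label),
                     ("Labels per document", labels_per_doc)]:
    hist, lo, mean, hi = summarize_counts(counts)
    print(f"{name}:")
    uppers = list(BIN_EDGES[1:]) + [np.inf]
    for lower, upper, n in zip(BIN_EDGES, uppers, hist):
        print(f"in [{float(lower)}-{float(upper)}): {n} items")
    print(f"min={lo}, mean={mean}, max={hi}")
```

The same two getnnz calls applied to the feature matrix would give the documents-per-feature and features-per-document counts.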
Dataset file
The file checksum is computed as hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where a single part of the file is 512 MB in size. Example script for calculation:
https://github.com/antespi/s3md5
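For readers who prefer Python, the multipart checksum formula can be sketched as below. This is not the linked script. It assumes that md5(part) means the raw 16-byte digest of each part (as in the common multipart-upload convention) and that 512 MB means 512 * 1024 * 1024 bytes; neither detail is stated explicitly here. The function name and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def multipart_md5(path, part_size=512 * 1024 * 1024):
    """Return hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as fh:
        while True:
            part = fh.read(part_size)  # holds one whole part in memory
            if not part:
                break
            # Assumption: parts are combined as raw digests, not hex strings.
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Demo on a 10-byte file with a 4-byte part size, which gives 3 parts.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"abcdefghij")
    checksum = multipart_md5(demo_path, part_size=4)

print(checksum)
```

For the real dataset file, calling the function with the default part size should yield a value comparable to the published checksum, provided the two assumptions above hold.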
File details
- License:
- CC BY (Attribution)
- Software:
- Python + SciPy
Details
- Year of publication:
- 2023
- Verification date:
- 2023-04-25
- Creation date:
- 2020
- Dataset language:
- English
- Fields of science:
- information and communication technology (Engineering and Technology)
- DOI:
- 10.34808/fmnf-4767
- Verified by:
- Gdańsk University of Technology