TF-IDF weighted bag-of-words preprocessed text documents from Simple English Wikipedia

Description

The SimpleWiki2K-scores dataset contains text documents preprocessed into a TF-IDF weighted bag-of-words representation [feature matrix] (raw strings are not available) together with their multi-label assignments [label matrix]. Label scores for each document are also provided, produced by an enhanced multi-label KNN classifier [1] and a LEML classifier [2]. The dataset is intended as a benchmark for score thresholding methods, which are needed to turn label scores into multi-label predictions.
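To make the notion of score thresholding concrete, here is a minimal sketch. It assumes a sparse score matrix with documents in rows and labels in columns (the same layout as the label matrix). The function name `threshold_scores`, the toy values, and the cut-off of 0.5 are illustrative assumptions and are not part of the dataset.

```python
import numpy as np
from scipy import sparse


def threshold_scores(scores, threshold):
    """Turn a sparse score matrix into a binary multi-label prediction matrix.

    A label is predicted for a document when its stored score is >= threshold.
    Scores that are not stored (implicit zeros) are never predicted.
    """
    predictions = sparse.csr_matrix(scores).copy()
    # Replace each stored score by 1 (kept) or 0 (dropped).
    predictions.data = (predictions.data >= threshold).astype(np.int32)
    # Remove the entries that fell below the threshold from the structure.
    predictions.eliminate_zeros()
    return predictions


# Toy example (made-up values): 2 documents, 3 labels.
toy_scores = sparse.csr_matrix(
    np.array([[0.9, 0.2, 0.0],
              [0.4, 0.0, 0.7]], dtype=np.float64)
)
toy_predictions = threshold_scores(toy_scores, threshold=0.5)
```

A single global cut-off is only the simplest option; per-label or per-document thresholds follow the same pattern with a different comparison.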

Original source of data and preprocessing: the text documents and category assignments come from Simple English Wikipedia (dump from 2012-05-07). All articles from the main categories were taken, up to level 5 of the category hierarchy. All categories with fewer than 10 articles were then removed from the dataset, followed by the articles left without any assignment. This category/article removal process was repeated until there were no categories with fewer than 10 documents and no documents without at least one category assignment. The documents are represented as bags of words with a TF-IDF weighting scheme.
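The category/article removal loop described above can be sketched as follows. This is an illustration on a binary document-by-label matrix, not the authors' preprocessing code; the function name `prune_labels_and_documents` and the toy matrix are assumptions.

```python
import numpy as np
from scipy import sparse


def prune_labels_and_documents(label_mtx, min_docs_per_label=10):
    """Repeatedly drop rare labels and unlabeled documents until nothing changes.

    Returns the pruned matrix plus the original indices of kept rows and columns.
    """
    label_mtx = sparse.csr_matrix(label_mtx).copy()
    label_mtx.eliminate_zeros()  # make sure stored entries are real assignments
    kept_rows = np.arange(label_mtx.shape[0])
    kept_cols = np.arange(label_mtx.shape[1])
    while True:
        # Step 1: remove labels (columns) with too few documents.
        col_idx = np.flatnonzero(label_mtx.getnnz(axis=0) >= min_docs_per_label)
        cols_changed = len(col_idx) < label_mtx.shape[1]
        label_mtx = label_mtx[:, col_idx]
        kept_cols = kept_cols[col_idx]

        # Step 2: remove documents (rows) left without any label.
        row_idx = np.flatnonzero(label_mtx.getnnz(axis=1) >= 1)
        rows_changed = len(row_idx) < label_mtx.shape[0]
        label_mtx = label_mtx[row_idx]
        kept_rows = kept_rows[row_idx]

        if not cols_changed and not rows_changed:
            return label_mtx, kept_rows, kept_cols


# Toy example with a lowered limit of 2 documents per label.
toy_labels = sparse.csr_matrix(np.array([[1, 1, 0],
                                         [1, 0, 0],
                                         [0, 0, 1],
                                         [0, 1, 0]], dtype=np.int32))
pruned, kept_rows, kept_cols = prune_labels_and_documents(toy_labels, 2)
```

In the toy matrix the third label occurs in only one document, so it is dropped, which in turn leaves the third document without labels and removes it as well. The returned row indices can be used to apply the same selection to the feature matrix.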

The dataset is split into train and test parts. Additionally, the train part is subdivided into 10 validation folds. All these partitions were obtained using the iterative multi-label stratification algorithm [3]. For each validation fold, the KNN and LEML classifiers were trained on that fold's training portion, and their scores are provided for both the training portion and the validation portion. Scores for the test part are also provided; they come from classifiers trained on the whole train part.
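One way the validation scores could be used is to tune a threshold on a validation fold before applying it to the test scores. The sketch below picks a single global threshold from a candidate grid by maximizing micro-averaged F1. The choice of micro-F1, the function names, and the toy data are assumptions made for illustration; the dataset description does not prescribe a metric or a tuning procedure.

```python
import numpy as np
from scipy import sparse


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for two binary sparse matrices of equal shape."""
    y_true = sparse.csr_matrix(y_true).astype(bool)
    y_pred = sparse.csr_matrix(y_pred).astype(bool)
    true_pos = y_true.multiply(y_pred).count_nonzero()
    denom = y_true.count_nonzero() + y_pred.count_nonzero()
    # 2*TP / (2*TP + FP + FN) equals 2*TP / (|true| + |predicted|).
    return 2.0 * true_pos / denom if denom else 0.0


def pick_global_threshold(val_scores, val_labels, candidates):
    """Return the candidate threshold with the highest micro-F1 and that F1."""
    val_scores = sparse.csr_matrix(val_scores)
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        predictions = val_scores.copy()
        predictions.data = predictions.data >= threshold
        predictions.eliminate_zeros()
        f1 = micro_f1(val_labels, predictions)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy validation fold (made-up values): 2 documents, 3 labels.
toy_val_scores = sparse.csr_matrix(np.array([[0.9, 0.2, 0.0],
                                             [0.4, 0.0, 0.7]]))
toy_val_labels = sparse.csr_matrix(np.array([[1, 0, 0],
                                             [1, 0, 1]], dtype=np.int32))
best_threshold, best_f1 = pick_global_threshold(
    toy_val_scores, toy_val_labels, candidates=[0.1, 0.3, 0.5]
)
```

With 10 folds, the same selection could be repeated per fold and the chosen thresholds averaged or compared, but that aggregation is again a design choice rather than something the dataset fixes.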

The feature matrix, the label matrix, and all provided scores are Python scipy.sparse.csr_matrix objects saved in the npz format. Each of them can be loaded with the scipy.sparse.load_npz(fp) function.
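A short loading sketch follows. The names of the files inside the archive are not listed in this description, so the example uses a made-up temporary file (`demo_matrix.npz`) to show the save/load round trip; the helper name `load_csr` is likewise hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def load_csr(path):
    """Load a matrix written by scipy.sparse.save_npz and return it as CSR."""
    return sparse.load_npz(path).tocsr()


# Round-trip demonstration; the file name is invented for this example
# and does not correspond to any file in the archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_matrix.npz")
    demo = sparse.csr_matrix(
        np.array([[0.0, 1.5], [2.0, 0.0]], dtype=np.float32)
    )
    sparse.save_npz(demo_path, demo)
    loaded = load_csr(demo_path)
```

After loading, `loaded.shape`, `loaded.nnz`, and `loaded.dtype` can be compared against the summary values reported for the dataset as a quick sanity check.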

[1] Han X, Li S, Shen Z. A k-NN method for large scale hierarchical text classification at LSHTC3. In: Proceedings of the 2012 ECML/PKDD Discovery Challenge Workshop on Large-Scale Hierarchical Text Classification, Bristol 2012.

[2] Yu HF, Jain P, Kar P, Dhillon I. Large-scale multi-label learning with missing labels. In: International conference on machine learning 2014 Jan 27 (pp. 593-601). PMLR.

[3] Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III (pp. 145-158). Springer Berlin Heidelberg.

Whole dataset (train and test parts jointly) summary:

nDocs = 67505, nLabels = 1849, nFeatures = 97179

Label_mtx:

type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 1849), nnz=108076, dtype=int32

Documents per label:
in [0.0-1.0): 0 items
in [1.0-3.0): 0 items
in [3.0-10.0): 0 items
in [10.0-30.0): 1363 items
in [30.0-100.0): 403 items
in [100.0-300.0): 48 items
in [300.0-1000.0): 21 items
in [1000.0-3000.0): 8 items
in [3000.0-inf): 6 items
min=10, mean=58.45105462412115, max=10307

Labels per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 59639 items
in [3.0-10.0): 7853 items
in [10.0-30.0): 13 items
in [30.0-100.0): 0 items
in [100.0-300.0): 0 items
in [300.0-1000.0): 0 items
in [1000.0-3000.0): 0 items
in [3000.0-inf): 0 items
min=1, mean=1.6010073327901637, max=14

Features_mtx:

type=<class 'scipy.sparse._csr.csr_matrix'>, shape=(67505, 97179), nnz=4158000, dtype=float32

Documents per feature:
in [0.0-1.0): 0 items
in [1.0-3.0): 32381 items
in [3.0-10.0): 39198 items
in [10.0-30.0): 13271 items
in [30.0-100.0): 7070 items
in [100.0-300.0): 2945 items
in [300.0-1000.0): 1560 items
in [1000.0-3000.0): 568 items
in [3000.0-inf): 186 items
min=2, mean=42.787021887444816, max=23616

Features per document:
in [0.0-1.0): 0 items
in [1.0-3.0): 47 items
in [3.0-10.0): 10085 items
in [10.0-30.0): 18785 items
in [30.0-100.0): 28828 items
in [100.0-300.0): 7976 items
in [300.0-1000.0): 1731 items
in [1000.0-3000.0): 53 items
in [3000.0-inf): 0 items
min=1, mean=61.59543737500926, max=2735
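Summaries of this kind can be recomputed from a loaded matrix. The sketch below counts stored entries per column and per row and bins them with the same bin edges as in the listing above. It is a reconstruction under the assumption that the statistics count stored non-zero entries; the function name `summarize_counts` and the toy matrix are illustrative.

```python
import numpy as np
from scipy import sparse

# Lower bin edges used in the summary; the last bin is [3000, inf).
BIN_LOWER_EDGES = np.array([0, 1, 3, 10, 30, 100, 300, 1000, 3000])


def summarize_counts(counts):
    """Histogram plus min/mean/max for an array of non-negative counts."""
    counts = np.asarray(counts)
    # Index of the bin [edge_i, edge_{i+1}) that each count falls into.
    bin_index = np.searchsorted(BIN_LOWER_EDGES, counts, side="right") - 1
    histogram = np.bincount(bin_index, minlength=len(BIN_LOWER_EDGES))
    return {
        "histogram": histogram.tolist(),
        "min": int(counts.min()),
        "mean": float(counts.mean()),
        "max": int(counts.max()),
    }


# Toy label matrix: 3 documents, 2 labels.
toy = sparse.csr_matrix(np.array([[1, 0],
                                  [1, 1],
                                  [1, 0]], dtype=np.int32))
docs_per_label = summarize_counts(toy.getnnz(axis=0))  # column-wise counts
labels_per_doc = summarize_counts(toy.getnnz(axis=1))  # row-wise counts
```

The same two calls applied to the feature matrix would give the "Documents per feature" and "Features per document" summaries.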

Research data file

SimpleWiki2K.zip

545.4 MB, S3 ETag 9ee7710ee98322af5edcf12ef7438adf-2, downloads: 114

The file hash is computed with the formula
hexmd5(md5(part1)+md5(part2)+...)-{parts_count}, where each file part is 512 MB in size.

Example script for computing it:
https://github.com/antespi/s3md5
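For readers who prefer Python over the linked script, the formula above can be sketched with the standard library alone. This is an illustrative implementation, not the repository's own tool. The function name `s3_multipart_etag` is hypothetical, and interpreting "512 MB" as 512 * 1024 * 1024 bytes is an assumption.

```python
import hashlib
import os
import tempfile

# Assumption: "512 MB" means 512 * 1024 * 1024 bytes per part.
DEFAULT_PART_SIZE = 512 * 1024 * 1024


def s3_multipart_etag(path, part_size=DEFAULT_PART_SIZE):
    """Compute hexmd5(md5(part1)+md5(part2)+...)-{parts_count} for a file."""
    part_digests = []
    with open(path, "rb") as handle:
        while True:
            # Note: this reads a whole part into memory at once.
            part = handle.read(part_size)
            if not part:
                break
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Demonstration on a tiny temporary file with an artificially small part size.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"abcdefghij")  # 10 bytes -> parts of 4, 4 and 2 bytes
    demo_etag = s3_multipart_etag(demo_path, part_size=4)
```

For the downloaded archive, the result of `s3_multipart_etag("SimpleWiki2K.zip")` can be compared with the ETag listed above; the `-2` suffix there is consistent with a 545.4 MB file split into two parts.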


File details

License:

CC BY

Attribution
Software: Python + SciPy

Details

Year of publication:

2023

Date of approval:

2023-04-25

Date of creation:

2020

Research data language:

English

Disciplines:

technical computer science and telecommunications (field of engineering and technical sciences)

DOI:

10.34808/fmnf-4767

Verified by:

Politechnika Gdańska
