An Analysis of Neural Word Representations for Wikipedia Articles Classification

Julian Szymański; Nathan Kawalec

doi:10.1080/01969722.2019.1565124

Abstract

One currently popular method of generating word representations is based on the analysis of large document collections with neural networks. It creates word embeddings that attempt to learn relationships between words and encode this information in the form of a low-dimensional vector. The goal of this paper is to examine the differences between the most popular embedding models and the typical bag-of-words (BoW) approach used for document representation. The hypothesis behind the experiments is that the more informative the representation is, the better the classification results it produces. The representations have been evaluated with regard to the accuracy of three text classifiers. The experiments have been performed on subsets of articles selected from Wikipedia. To test whether the results are independent of the language of the dataset, we created datasets for the Polish and English versions of this repository. The datasets have been made publicly available to create a baseline for studying the different representation methods. The classification tasks, which aim to reconstruct the human-made Wikipedia categories, confirm that word embeddings can be successfully used for text classification. However, document representations built from word embeddings with typical vector averaging methods do not outperform BoW. We use a modification of the document representation based on kernel transformations that improves the text classification results. We also find that in most cases dimensionality reduction with neural embeddings outperforms that of LSA.
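
The two document representations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding values, the vocabulary, and the helper names `bow_vector` and `averaged_embedding` are assumptions made up for illustration, and the kernel-based modification mentioned in the abstract is not shown.

```python
from collections import Counter

# Toy 3-dimensional embeddings. The values are invented for illustration
# and do not come from any trained model.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "river": [0.0, 0.9, 0.3],
    "bank": [0.1, 0.7, 0.6],
}


def bow_vector(tokens, vocabulary):
    """Bag-of-words: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def averaged_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


vocabulary = sorted(TOY_EMBEDDINGS)  # ['bank', 'network', 'neural', 'river']
document = "neural network neural bank".split()

bow = bow_vector(document, vocabulary)
avg = averaged_embedding(document, TOY_EMBEDDINGS)
print(bow)
print(avg)
```

The BoW vector has one dimension per vocabulary word, so it grows with the vocabulary, whereas the averaged embedding keeps the fixed, low dimensionality of the word vectors regardless of vocabulary size.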

Citations

CrossRef: 6
Web of Science: 0
Scopus: 5

Authors (2)

Julian Szymański, dr hab. inż.
Nathan Kawalec

Keywords

BAG-OF-WORDS, DOCUMENT CATEGORIZATION, NEURAL NETWORKS, TEXT CLASSIFICATION, TEXT REPRESENTATION, WIKIPEDIA, WORD EMBEDDINGS

Details

Category: Journal publication
Type: journal articles
Published in: CYBERNETICS AND SYSTEMS, vol. 50, pages 176-196
ISSN: 0196-9722
Language: English
Year of publication: 2019
Bibliographic description: Szymański J., Kawalec N.: An Analysis of Neural Word Representations for Wikipedia Articles Classification // CYBERNETICS AND SYSTEMS, Vol. 50, iss. 2 (2019), pp. 176-196
DOI: 10.1080/01969722.2019.1565124
Verified by: Politechnika Gdańska