Comparative Analysis of Text Representation Methods Using Classification

Julian Szymański

doi:10.1080/01969722.2014.874828

Abstract

In our work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article, an evaluation of approaches to text representation for machine learning tasks, indicates that text representation is fundamental to achieving good categorization results. The analysis of the representation methods establishes a baseline: a poor representation cannot be compensated for even by sophisticated machine learning algorithms. This confirms the thesis that proper data representation is a prerequisite for high-quality data analysis. The text representations were evaluated within the Wikipedia repository by examining classification parameters observed during automatic reconstruction of human-made categories. For that purpose, we use a classifier based on support vector machines, extended with multilabel and multiclass functionality. During classifier construction we observed parameters such as learning time, representation size, and classification quality, which allow us to draw conclusions about the text representations. For the experiments presented in the article, we use data sets created from Wikipedia dumps. We describe our software, called Matrix’u, which allows a user to build computational representations of Wikipedia articles. The software is the second contribution of our research: it is a universal tool for converting Wikipedia from a human-readable form to a form that can be processed by a machine. Results generated using Matrix’u can be used in a wide range of applications that use Wikipedia data.
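As a rough illustration of what comparing text representations by size involves, the sketch below builds two feature sets for the same documents and counts the distinct features in each. The abstract does not name the five methods evaluated, so the two shown here (word counts and character trigrams), the TF-IDF weighting, and all function names are assumptions chosen for illustration. This is not the Matrix’u implementation.

```python
from collections import Counter
import math


def word_features(text):
    """Bag-of-words counts over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def char_ngram_features(text, n=3):
    """Counts of overlapping character n-grams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def tfidf(docs_features):
    """Weight raw counts by log(N / document frequency), one dict per document."""
    n_docs = len(docs_features)
    doc_freq = Counter()
    for feats in docs_features:
        doc_freq.update(feats.keys())
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in feats.items()}
        for feats in docs_features
    ]


def representation_size(docs_features):
    """Number of distinct features (vector dimensions) across all documents."""
    vocabulary = set()
    for feats in docs_features:
        vocabulary.update(feats)
    return len(vocabulary)


# Hypothetical toy corpus standing in for Wikipedia articles.
docs = ["the cat sat", "the dog sat"]
by_word = [word_features(d) for d in docs]
by_trigram = [char_ngram_features(d) for d in docs]
print("word features:", representation_size(by_word))
print("trigram features:", representation_size(by_trigram))
```

Even on this toy corpus the character trigram representation has more dimensions than the word representation. Differences of this kind are what the representation size parameter mentioned in the abstract would capture, alongside learning time and classification quality.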

Citations

CrossRef: 30
Web of Science: 0
Scopus: 29

Author (1)

Julian Szymański dr hab. inż.

Details

Category: Journal publication
Type: article in a journal listed in the JCR
Published in: CYBERNETICS AND SYSTEMS, vol. 45, no. 2, pages 180-199
ISSN: 0196-9722
Language: English
Year of publication: 2014
Bibliographic description: Szymański J.: Comparative Analysis of Text Representation Methods Using Classification // CYBERNETICS AND SYSTEMS. Vol. 45, no. 2 (2014), pp. 180-199
DOI: 10.1080/01969722.2014.874828
Verified by: Politechnika Gdańska

Publications that may interest you

Text classifiers for automatic articles categorization

Spectral Clustering Wikipedia Keyword-Based search Results

An Analysis of Neural Word Representations for Wikipedia Articles Classification

Path-based methods on categorical structures for conceptual representation of wikipedia articles
