Study of Statistical Text Representation Methods for Performance Improvement of a Hierarchical Attention Network

Adam Wawrzyński; Julian Szymański

doi:10.3390/app11136113

Abstract

Many approaches have been proposed for creating text representations so that textual data can be processed effectively. Transforming a text into a numerical form that computers can process is crucial for downstream tasks such as document classification and document summarization. In our work, we study the quality of text representations produced by statistical methods and compare them to approaches based on neural networks. We describe in detail nine different algorithms used for text representation and then evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For the second group, based on deep neural networks, Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN) and Longformer were selected. The text representation methods were benchmarked on the document classification task, with the BoW and TFIDF models used as baselines. Based on the identified weaknesses of the HAN method, an improvement in the form of a Hierarchical Weighted Attention Network (HWAN) was proposed. Incorporating statistical features into HAN latent representations improves the results or gives comparable results on four out of five datasets. The article also shows how the length of the processed text affects the results of HAN and the variants of the HWAN model.
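As a rough illustration of the two baseline representations named in the abstract, the sketch below builds Bag of Words count vectors and reweights them with TFIDF. It is not the authors' implementation: the function names `bag_of_words` and `tfidf`, the whitespace tokenization, the idf variant log(N / df), and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf(vectors):
    """Reweight count vectors with idf = log(N / df), one common textbook variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    # Document frequency: in how many documents each term occurs at least once.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[count * w for count, w in zip(vec, idf)] for vec in vectors]


# Toy documents, invented for illustration only.
docs = [
    "the match ended in a draw",
    "the market ended lower",
    "the team won the match",
]
vocab, bow = bag_of_words(docs)
weighted = tfidf(bow)
```

In this variant a term that occurs in every document, such as "the" here, receives zero weight, while a term confined to a single document keeps the largest weight. Library implementations usually add smoothing and normalization, so their values differ from this sketch.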

Citations

CrossRef: 2
Web of Science: 0
Scopus: 2

Details

Category: Journal publication
Type: journal articles
Published in: Applied Sciences-Basel, no. 11
ISSN: 2076-3417
Language: English
Year of publication: 2021
Bibliographic description: Wawrzyński A., Szymański J.: Study of Statistical Text Representation Methods for Performance Improvement of a Hierarchical Attention Network // Applied Sciences-Basel, Vol. 11, iss. 13 (2021), p. 6113
DOI: 10.3390/app11136113
Verified by: Gdańsk University of Technology