Search results for: text representation, documents categorization


total: 60

Representation of hypertext documents based on terms, Links and text compressibility
Publication
- J. Szymański
- W. Duch
- LECTURE NOTES IN COMPUTER SCIENCE - Year 2010
Methods for representing text documents based on words, mutual links, and compression methods are described. They are evaluated using an SVM classifier.
Text Categorization Improvement via User Interaction
Publication
- J. Atroszko
- J. Szymański
- D. Gil
- H. Mora
- Year 2018
In this paper, we propose an approach to improvement of text categorization using interaction with the user. The quality of categorization has been defined in terms of a distribution of objects related to the classes and projected on the self-organizing maps. For the experiments, we use the articles and categories from the subset of Simple Wikipedia. We test three different approaches for text representation. As a baseline we use...

Full text to download in external service
Evaluation of Path Based Methods for Conceptual Representation of the Text
Publication
- Ł. Kucharczyk
- J. Szymański
- Year 2014
Typical text clustering methods use the bag of words (BoW) representation to describe content of documents. However, this method is known to have several limitations. Employing Wikipedia as the lexical knowledge base has shown an improvement of the text representation for data-mining purposes. Promising extensions of that trend employ hierarchical organization of Wikipedia category system. In this paper we propose three path-based...

Full text to download in external service
Path-based methods on categorical structures for conceptual representation of wikipedia articles
Publication
- Ł. Kucharczyk
- J. Szymański
- JOURNAL OF INTELLIGENT INFORMATION SYSTEMS - Year 2017
Machine learning algorithms applied to text categorization mostly employ the Bag of Words (BoW) representation to describe the content of the documents. This method has been successfully used in many applications, but it is known to have several limitations. One way of improving text representation is usage of Wikipedia as the lexical knowledge base – an approach that has already shown promising results in many research studies....

Full text available to download
Text classifiers for automatic articles categorization
Publication
- Year 2012
The article concerns the problem of automatic classification of textual content. We present selected methods for generation of documents representation and we evaluate them in classification tasks. The experiments have been performed on Wikipedia articles classified automatically to their categories made by Wikipedia editors.
Comparative Analysis of Text Representation Methods Using Classification
Publication
- J. Szymański
- CYBERNETICS AND SYSTEMS - Year 2014
In our work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article—evaluation of approaches to text representation for machine learning tasks—indicates that the text representation is fundamental for achieving good categorization results. The analysis of the representation methods creates a baseline that cannot...

Full text to download in external service
Wikipedia Articles Representation with Matrix'u
Publication
- J. Szymański
- Year 2013
In the article we evaluate different text representation methods used for the task of Wikipedia article categorization. We present the Matrix’u application used for creating computational datasets of Wikipedia articles. The representations have been evaluated with SVM classifiers used for the reconstruction of human-made categories.

Full text to download in external service
Spectral Clustering Wikipedia Keyword-Based search Results
Publication
- J. Szymański
- T. Dziubich
- FRONTIERS IN ROBOTICS AND AI - Year 2017
The paper summarizes our research in the area of unsupervised categorization of Wikipedia articles. As a practical result of our research, we present an application of the spectral clustering algorithm used for grouping Wikipedia search results. The main contribution of the paper is a representation method for Wikipedia articles that is based on a combination of words and links and used for categorization of search results in this...

Full text available to download
TF-IDF weighted bag-of-words preprocessed text documents from Simple English Wikipedia
Open Research Data
open access
The SimpleWiki2K-scores dataset contains TF-IDF weighted bag-of-words preprocessed text documents (raw strings are not available) [feature matrix] and their multi-label assignments [label-matrix]. Label scores for each document are also provided for an enhanced multi-label KNN [1] and LEML [2] classifiers. The aim of the dataset is to establish a benchmark...
Review on Wikification methods
Publication
- J. Szymański
- M. Naruszewicz
- AI COMMUNICATIONS - Year 2019
The paper reviews methods on automatic annotation of texts with Wikipedia entries. The process, called Wikification aims at building references between concepts identified in the text and Wikipedia articles. Wikification finds many applications, especially in text representation, where it enables one to capture the semantic similarity of the documents. Also, it can be considered as automatic tagging of the text. We describe typical...

Full text to download in external service
Improving css-KNN Classification Performance by Shifts in Training Data
Publication
- K. Draszawka
- J. Szymański
- F. Guerra
- Year 2015
This paper presents a new approach to improve the performance of a css-k-NN classifier for categorization of text documents. The css-k-NN classifier (i.e., a threshold-based variation of a standard k-NN classifier we proposed in [1]) is a lazy-learning instance-based classifier. It does not have parameters associated with features and/or classes of objects, that would be optimized during off-line learning. In this paper we propose...
Two Stage SVM and kNN Text Documents Classifier
Publication
- M. Kępa
- J. Szymański
- Year 2015
The paper presents an approach to the large scale text documents classification problem in parallel environments. A two stage classifier is proposed, based on a combination of k-nearest neighbors and support vector machines classification methods. The details of the classifier and the parallelisation of classification, learning and prediction phases are described. The classifier makes use of our method named one-vs-near. It is...
Parallel Computations of Text Similarities for Categorization Task
Publication
- J. Szymański
- Year 2013
In this chapter we describe an approach to the parallel implementation of similarity computations in high-dimensional spaces. The similarity computations have been used for textual data categorization. We created test datasets from Wikipedia articles, which together with their hyper references formed a graph used in our experiments. Similarities based on Euclidean distance and the cosine measure have been used to process the data using the k-means algorithm....
Study of Statistical Text Representation Methods for Performance Improvement of a Hierarchical Attention Network
Publication
- A. Wawrzyński
- J. Szymański
- Applied Sciences-Basel - Year 2021
To effectively process textual data, many approaches have been proposed to create text representations. The transformation of a text into a form of numbers that can be computed using computers is crucial for further applications in downstream tasks such as document classification, document summarization, and so forth. In our work, we study the quality of text representations using statistical methods and compare them to approaches...

Full text available to download
Text categorization with semantic commonsense knowledge: First results
Publication
- P. Majewski
- J. Szymański
- Year 2008
Text processing typically uses the BOW representation. However, this approach does not give good results when similar documents do not share words. The article presents an approach to constructing a kernel function for SVM classifiers based on an external knowledge base of linguistic concepts.
Text Documents Classification with Support Vector Machines
Publication
- P. Majewski
- Year 2008
External Validation Measures for Nested Clustering of Text Documents
Publication
- K. Draszawka
- J. Szymański
- Year 2011
This article handles the problem of validating the results of nested (as opposed to "flat") clusterings. It shows that standard external validation indices used for partitioning clustering validation, like Rand statistics, Hubert Γ statistic or F-measure are not applicable in nested clustering cases. Additionally to the work, where F-measure was adopted to hierarchical classification as hF-measure, here some methods to...
Development and Research of the Text Messages Semantic Clustering Methodology
Publication
- N. Rizun
- P. Kapłański
- Y. Taranenko
- Year 2016
The methodology of semantic clustering analysis of customer’s text-opinions collection is developed. The author's version of the mathematical models of formalization and practical realization of short textual messages semantic clustering procedure is proposed, based on the customer’s text-opinions collection Latent Semantic Analysis knowledge extracting method. An algorithm for semantic clustering of the text-opinions is developed,...

Full text available to download
Intelligent information services 23/24
e-Learning Courses
- J. Szymański
Information retrieval Text categorization Natural language processing
DEVELOPMENT OF THE ALGORITHM OF POLISH LANGUAGE FILM REVIEWS PREPROCESSING
Publication
- N. Rizun
- J. Taranenko
- Rocznik Naukowy Wydzialu Zarzadzania w Ciechanowie - Year 2017
The algorithm and the software for conducting the procedure of Preprocessing of the reviews of films in the Polish language were developed. This algorithm contains the following steps: Text Adaptation Procedure; Procedure of Tokenization; Procedure of Transforming Words into the Byte Format; Part-of-Speech Tagging; Stemming / Lemmatization Procedure; Presentation of Documents in the Vector Form (Vector Space Model) Procedure; Forming...

Full text available to download
An Analysis of Neural Word Representations for Wikipedia Articles Classification
Publication
- J. Szymański
- N. Kawalec
- CYBERNETICS AND SYSTEMS - Year 2019
One of the current popular methods of generating word representations is an approach based on the analysis of large document collections with neural networks. It creates so-called word-embeddings that attempt to learn relationships between words and encode this information in the form of a low-dimensional vector. The goal of this paper is to examine the differences between the most popular embedding models and the typical bag-of-words...

Full text to download in external service
Internal legal acts of technical and medical universities in Poland regulating classes conducted in-person during the Covid-19 pandemic
Open Research Data
open access
- K. Górak-Sosnowska
- L. Tomaszewska
A database of legal acts and other internal documents of medical and technical universities in Poland regulating the way of organizing in-person or hybrid classes during the COVID-19 pandemic from the summer semester 2019/2020 to the winter semester 2020/2021. Documents were encoded in two separate coding systems using the MAXQDA program for qualitative...
Contextual ontology for tonality assessment
Publication
- W. Waloszek
- N. Rizun
- Procedia Computer Science - Year 2020
classification tasks. The discussion focuses on two important research hypotheses: (1) whether it is possible to construct such an ontology from a corpus of textual document, and (2) whether it is possible and beneficial to use inferencing from this ontology to support the process of sentiment classification. To support the first hypothesis we present a method of extraction of hierarchy of contexts from a set of textual documents...

Full text available to download
Just look at to open it up: A biometric verification facility for password autofill to protect electronic documents
Publication
- M. Smiatacz
- B. Wiszniewski
- MULTIMEDIA TOOLS AND APPLICATIONS - Year 2021
Electronic documents constitute specific units of information, and protecting them against unauthorized access is a challenging task. This is because a password protected document may be stolen from its host computer or intercepted while on transfer and exposed to unlimited offline attacks. The key issue is, therefore, making document passwords hard to crack. We propose to augment a common text password authentication interface...

Full text available to download
System of specific grants for local government units in Poland
Publication
- A. Sekuła
- Year 2009
The article analyses the system of specific grants in local governments in Poland. First, main revenue sources of local self-governments are presented. Their presentation is based upon the consideration of one of the basic important principles in democratic states today, i.e. decentralization. The text then, in more details, describes specific grants with respect to the European Charter of Local Self-Government. Subsequently, the...
Agile Commerce in the light of Text Mining
Publication
- A. Baj-Rogowska
- Przedsiębiorczość i Zarządzanie - Year 2017
The survey conducted for this study reveals that more than 84% of respondents have never encountered the term “agile commerce” and do not understand its meaning. At the same time, they are active participants of this strategy. Using digital channels as customers more often than ever before, they have already been included in the agile philosophy. Based on the above, the purpose of the study is to analyse major text sets containing...

Full text available to download
Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System
Publication
- A. Sobecki
- M. Kępa
- Year 2018
The plagiarism detection problem involves finding patterns in unstructured text documents. Similarity of documents in this approach means that the documents contain some identical phrases with defined minimal length. The typical methods used to find similar documents in digital libraries are not suitable for this task (plagiarism detection) because found documents may contain similar content and we have no warranty that...

Full text to download in external service
Information Retrieval with the Use of Music Clustering by Directions Algorithm
Publication
- A. Kaczmarek
- Year 2013
This paper introduces the Music Clustering by Directions (MCBD) algorithm. The algorithm is designed to support users of query by humming systems in formulating queries. This kind of systems makes it possible to retrieve songs and tunes on the basis of a melody recorded by the user. The Music Clustering by Directions algorithm is a kind of an interactive query expansion method. On the basis of query, the algorithm provides suggestions...

Full text to download in external service
Extraction of information from born-digital PDF documents for reproducible research
Publication
- B. Wiszniewski
- J. Siciarek
- Journal of Advanced Management - Year 2016
Born-digital PDF electronic documents might reasonably be expected to preserve useful data units of their source originals that suffice to produce executable papers for reproducible research. Unfortunately, developers of authoring tools may adopt arbitrary PDF generation strategies, producing a plethora of internal data representations. Such common information units as text paragraphs, tables, function graphs and flow diagrams,...

Full text available to download
Self-Organizing Map representation for clustering Wikipedia search results
Publication
- J. Szymański
- LECTURE NOTES IN COMPUTER SCIENCE - Year 2011
The article presents an approach to automated organization of textual data. The experiments have been performed on selected sub-set of Wikipedia. The Vector Space Model representation based on terms has been used to build groups of similar articles extracted from Kohonen Self-Organizing Maps with DBSCAN clustering. To warrant efficiency of the data processing, we performed linear dimensionality reduction of raw data using Principal...
Self–Organizing Map representation for clustering Wikipedia search results
Publication
- J. Szymański
- Year 2011
The article presents an approach to automated organization of textual data. The experiments have been performed on selected sub-set of Wikipedia. The Vector Space Model representation based on terms has been used to build groups of similar articles extracted from Kohonen Self-Organizing Maps with DBSCAN clustering. To warrant efficiency of the data processing, we performed linear dimensionality reduction of raw data using Principal...

Full text to download in external service
Ontologies vs. Rules — Comparison of Methods of Knowledge Representation Based on the Example of IT Services Management
Publication
- A. Czarnecki
- T. Sitek
- Year 2013
This text provides a brief overview of selected structures aimed at knowledge representation in the form of ontologies based on description logic and aims at comparing them with their counterparts based on the rule-based approach. Due to the limitations on the length of the article, only elements associated with the representation of concepts could be shown, without including roles. The formalisms of the OWL language were used...

Full text to download in external service
Management of Textual Data at Conceptual Level
Publication
- J. Szymański
- Year 2011
The article presents an approach to the management of a large repository of documents at the conceptual level. We describe our approach to representing Wikipedia articles using their categories. The representation has been used to construct groups of similar articles. The proposed approach has been implemented in a prototype system that allows organizing articles that are search results for a given query. Constructed clusters allow to...
Retrieval with Semantic Sieve
Publication
- Year 2013
The article presents an algorithm we called Semantic Sieve, applied for refining search results in a text document repository. The algorithm calculates so-called conceptual directions that enable interaction with the user and allow narrowing the set of results to the most relevant ones. We present the system where the algorithm has been implemented. The system also offers in the presentation layer clustering of the results into...

Full text to download in external service
Semantic Analysis and Text Summarization in Socio-Technical Systems
Publication
- N. Rizun
- Year 2018
In this chapter the authors present the results of developing a methodology for increasing the reliability of the functioning of the Socio-Technical System. Existing methods and algorithms for processing unstructured (textual) information were studied. Taking into account the above-noted strengths and weaknesses of the Discriminant and Probabilistic approaches of Latent Semantic Relations analysis in of the summarization projection...

Full text to download in external service
Ontologie vs. reguły — porównanie metod reprezentacji wiedzy na przykładzie dziedziny zarządzania usługami informatycznymi
Publication
- A. Czarnecki
- T. Sitek
- Ekonomiczne Problemy Usług - Year 2013
This text is a brief overview of selected constructs for knowledge representation in the form of ontologies based on description logic, and a comparison of them with their rule-based counterparts. Because of the page limit, only elements related to the representation of concepts are shown, without roles. The formalisms of the OWL language were used to express the ontologies, while the rules were expressed in Prolog. To better illustrate these...

Full text available to download
SEMANTIC ANALYSIS ALGORITHMS FOR KNOWLEDGE WORKERS SUPPORT
Publication
- N. Rizun
- M. Rizun
- J. Taranenko
- Year 2017
The paper examines various aspects of text analysis application for knowledge worker’s activity realization. Conclusions are drawn about the relevance and importance of processing the non-structured textual information in order to increase knowledge worker’s efficiency, as well as their awareness in different branches of science. The paper considers the existing algorithms of texts semantic analysis as the sphere of documents topical...

Full text available to download
Selection of Relevant Features for Text Classification with K-NN
Publication
- Year 2013
In this paper, we describe five feature selection techniques used for text classification. Information gain, the independent significance feature test, the chi-squared test, the odds ratio test, and frequency filtering have been compared on text benchmarks based on Wikipedia. For each method we present the results of classification quality obtained on the test datasets using a K-NN based approach. A main advantage of evaluated...

Full text to download in external service
Concept description vectors and the 20 question game
Publication
- J. Szymański
- T. Sarnatowicz
- W. Duch
- Year 2005
Knowledge of properties that are applicable to a given object is a necessary prerequisite to formulate intelligent questions. Concept description vectors provide the simplest representation of this knowledge, storing for each object information about the values of its properties. Experiments with automatic creation of concept description vectors from various sources, including ontologies, dictionaries, encyclopedias and unstructured...

Full text to download in external service
Passing from requirements specification to class model using application domain ontology
Publication
- J. Kuchta
- Zeszyty Naukowe Wydziału ETI Politechniki Gdańskiej. Technologie Informacyjne - Year 2010
The quality of a classic software engineering process depends on the completeness of project documents and on the inter-phase consistency. In this paper, a method for passing from the requirement specification to the class model is proposed. First, a developer browses the text of the requirements, extracts the word sequences, and places them as terms into the glossary. Next, the internal ontology logic for the glossary needs to...
Methodology for Text Classification using Manually Created Corpora-based Sentiment Dictionary
Publication
- N. Rizun
- W. Waloszek
- Year 2018
This paper presents the methodology of Textual Content Classification, which is based on a combination of algorithms: preliminary formation of a contextual framework for the texts in particular problem area; manual creation of the Hierarchical Sentiment Dictionary (HSD) on the basis of a topically-oriented Corpus; tonality texts recognition via using HSD for analysing the documents as a collection of topically completed fragments...

Full text available to download
Improving the Accuracy in Sentiment Classification in the Light of Modelling the Latent Semantic Relations
Publication
- N. Rizun
- W. Waloszek
- Y. Taranenko
- Information - Year 2018
The research presents the methodology of improving the accuracy in sentiment classification in the light of modelling the latent semantic relations (LSR). The objective of this methodology is to find ways of eliminating the limitations of the discriminant and probabilistic methods for LSR revealing and customizing the sentiment classification process (SCP) to the more accurate recognition of text tonality. This objective was achieved...

Full text available to download
Towards Increasing Density of Relations in Category Graphs
Publication
- Year 2014
In the chapter we propose methods for identifying new associations between Wikipedia categories. The first method is based on the Bag-of-Words (BOW) representation of Wikipedia articles. Using the similarity of articles belonging to different categories allows calculating the similarity of the categories. The second method is based on average scores given to categories while categorizing documents by our dedicated score-based...

Full text to download in external service
Wykluczenie finansowe starszych konsumentów na rynku usług finansowych
Publication
- B. Czerwiński
- MARKETING I RYNEK - Year 2014
The aim of the study is to identify the determinants of the financial exclusion of older consumers in the financial services market. To this end, the study identifies, among other things, the basic concepts concerning the exclusion (including financial exclusion) of older people. The threats resulting from the financial exclusion of older consumers are illustrated through an analysis of statistical data on the ageing of society. The basic...

Full text to download in external service
Machine Learning and Text Analysis in an Artificial Intelligent System for the Training of Air Traffic Controllers
Publication
- T. Shmelova
- Y. Sikirda
- N. Rizun
- V. Lazorenko
- V. Kharchenko
- Year 2020
This chapter presents the application of new information technology in education for the training of air traffic controllers (ATCs). Machine learning, multi-criteria decision analysis, and text analysis as the methods of artificial intelligence for ATCs training have been described. The authors have made an analysis of the International Civil Aviation Organization documents for modern principles of ATCs education. The prototype...

Full text available to download
China and the Chinese in the modern world. An interdisciplinary study
Publication
- I. Szpotakowski
- Z. Kopania
- Year 2020
This monograph is a collection of chapters devoted to modern China on various approaches. There is no future without a past and a modern China is a country that skillfully combines the new with the old and the authors have attempted to present this phenomenon in this book. It brings to light issues such as a honorificativity in Chinese administrative and legal documents, a comparison of Chinese and...

Full text available to download
Inter Applied Chemistry Programme 3 Miodrag Guzvic
e-Learning Courses
- A. Głowacz-Różyńska
- M. Tobiszewski
- C. Jungnickel
- P. Filipkowski
Inter Applied Chemistry Programme 3 Dr. Miodrag Gužvić Title: Genetic toxicology
Semantic Memory for Avatars in Cyberspace
Publication
- J. Szymański
- T. Sarnatowicz
- W. Duch
- Year 2005
Avatars that show intelligent behavior should have access to general knowledge about the world, knowledge that humans store in their semantic memories. The simplest knowledge representation for semantic memory is based on the Concept Description Vectors (CDVs) that store, for each concept, information on whether a given property can be applied to this concept or not. Unfortunately large-scale semantic memories are not available....
Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention
Publication
- D. Korzekwa
- R. Barra-Chicote
- S. Zaporowski
- G. Beringer
- J. Lorenzo-trueba
- A. Serafinowicz
- J. Droppo
- T. Drugman
- B. Kostek
- Year 2021
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically de...

Full text available to download
CAD. Integrated Architectural Design, MSc Arch (2022/2023)
e-Learning Courses
- D. Cyparski
The programme will provide students with a solid grounding in BIM (Building Information Modelling) using Autodesk's Revit Architecture. Students will review the advanced features of Revit for Architecture, a tool to support BIM and the delivery of 3D digital models and related documentation. The lesson plans will specifically introduce students to common workflows and problem-solving skills while creating...
