Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System

Andrzej Sobecki; Marcin Kępa

doi:10.1007/978-3-319-74497-1

Abstract

The plagiarism detection problem involves finding patterns in unstructured text documents. In this approach, documents are considered similar when they contain identical phrases of at least a defined minimum length. The typical methods used to find similar documents in digital libraries are not suitable for plagiarism detection, because the documents they find may have similar content without any guarantee that they contain identical phrases. The article describes an example method of searching big document repositories for similar documents that contain identical phrases, and presents the problem of selecting a storage and computing platform suitable for using this method in plagiarism detection systems. We compare implementations of this method on two computing platforms, KASKADA and Hadoop, in different configurations, in order to test and compare their performance and scalability. The method using the default tools available on the Hadoop platform, i.e. HDFS and Apache Spark, offers worse performance than the method implemented on the KASKADA platform using NFS (Network File System) and the Master/Slave processing model. The advantage of the Hadoop platform increases with the use of additional data structures (hash-map) and tools offered on this platform, i.e. HBase (NoSQL). The tools integrated with the Hadoop platform make it possible to create an efficient and scalable method for finding similar documents in big repositories. The KASKADA platform offers efficient tools for analysing data in real-time processes, i.e. when there is no need to compare the input data to a large collection of information (patterns) or to use advanced data structures. The contribution of this article is the comparison of the two computing and storage platforms in order to achieve better performance of the method used in the plagiarism detection system to find similar documents containing identical phrases.
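The abstract defines similarity as sharing identical phrases of a minimum length and mentions a hash-map as the supporting data structure. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes that a phrase is a run of `min_len` consecutive whitespace-separated words, and the names `phrases`, `build_index` and `find_similar` are hypothetical. It runs in memory and does not use Hadoop, HBase or KASKADA.

```python
from collections import defaultdict


def phrases(text, min_len):
    """Yield every run of `min_len` consecutive words as a tuple."""
    words = text.lower().split()
    for i in range(len(words) - min_len + 1):
        yield tuple(words[i:i + min_len])


def build_index(repository, min_len):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for phrase in phrases(text, min_len):
            index[phrase].add(doc_id)
    return index


def find_similar(query_text, index, min_len):
    """Return {doc_id: number of distinct phrases shared with the query}."""
    hits = defaultdict(int)
    for phrase in set(phrases(query_text, min_len)):
        # .get avoids creating empty entries in the defaultdict index
        for doc_id in index.get(phrase, ()):
            hits[doc_id] += 1
    return dict(hits)


# Toy repository (made-up example data, not from the article)
repository = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a completely different sentence about foxes and dogs",
}
index = build_index(repository, min_len=3)
result = find_similar("he saw the quick brown fox yesterday", index, min_len=3)
print(result)
```

Here the query shares two three-word phrases with document "a" and none with document "b". Because the index is a map from phrase to document ids, a query costs one lookup per phrase instead of a comparison against every stored document. The same access pattern could in principle be served by a key-value store, but how the article maps it onto HBase is not described in the abstract.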

Citations

CrossRef: 3
Web of Science: 0
Scopus: 0

Details

Category: Monographic publication
Type: chapter, article in a book (collective work or textbook) in a language of international reach
Title of issue: Semantic Keyword-Based Search on Structured Data Sources, pages 56-69
ISSN: 0302-9743
Language: English
Publication year: 2018
Bibliographic description: Sobecki A., Kępa M.: Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System // Semantic Keyword-Based Search on Structured Data Sources / ed. Julian Szymański, Yannis Velegrakis: Springer, 2017, pp. 56-69
DOI: 10.1007/978-3-319-74497-1
Verified by: Gdańsk University of Technology
