Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System - Publication - Bridge of Knowledge

Abstract

Plagiarism detection involves finding patterns in unstructured text documents. In this approach, two documents are similar if they share identical phrases of a defined minimum length. The methods typically used to find similar documents in digital libraries are not suitable for this task, because the documents they return may have similar content without any guarantee that they contain identical phrases. The article describes an example method of searching large document repositories for similar documents that contain identical phrases, and presents the problem of selecting a storage and computing platform suitable for using this method in plagiarism detection systems. We compare implementations of the method on two computing platforms, KASKADA and Hadoop, in different configurations, in order to test and compare their performance and scalability. The method implemented with the default tools available on the Hadoop platform, i.e. HDFS and Apache Spark, performs worse than the method implemented on the KASKADA platform using NFS (Network File System) and the Master/Slave processing model. The advantage of the Hadoop platform increases with the use of additional data structures (hash-map) and tools offered on this platform, i.e. HBase (NoSQL). The tools integrated with the Hadoop platform make it possible to create an efficient and scalable method for finding similar documents in big repositories. The KASKADA platform offers efficient tools for analysing data in real-time processes, i.e. when there is no need to compare the input data to a large collection of information (patterns) or to use advanced data structures.
The contribution of this article is the comparison of the two computing and storage platforms in order to achieve better performance of the method used in the plagiarism detection system to find similar documents containing identical phrases.
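
The notion of similarity used in the abstract (shared identical phrases of a minimum length, looked up through a hash-map) can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a "phrase" is a run of `min_len` consecutive whitespace-separated words, and the function names `phrases`, `build_index` and `find_similar` are hypothetical. It runs in memory on a single machine, whereas the article concerns distributed storage and processing.

```python
from collections import defaultdict


def phrases(text, min_len):
    """Yield every run of min_len consecutive words in text (lowercased).

    Assumption: a phrase is a word n-gram; the article may define it differently.
    """
    words = text.lower().split()
    for i in range(len(words) - min_len + 1):
        yield tuple(words[i:i + min_len])


def build_index(repository, min_len):
    """Build a hash-map from each phrase to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for phrase in phrases(text, min_len):
            index[phrase].add(doc_id)
    return index


def find_similar(query_text, index, min_len):
    """Return {doc_id: number of distinct phrases shared with the query}."""
    hits = defaultdict(int)
    for phrase in set(phrases(query_text, min_len)):
        # .get avoids inserting empty entries into the defaultdict index
        for doc_id in index.get(phrase, ()):
            hits[doc_id] += 1
    return dict(hits)


# Toy repository; the contents are made up for illustration only.
repository = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a completely different sentence about distributed storage",
}
index = build_index(repository, min_len=3)
result = find_similar("we saw the quick brown fox yesterday", index, min_len=3)
print(result)
```

Documents that merely discuss the same topic do not match here unless they share a phrase verbatim, which is the distinction the abstract draws against general similar-document search. In a distributed setting the index would live in a shared store rather than in a Python dictionary.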

Citations

  • CrossRef: 0

  • Web of Science: 0

  • Scopus: 0

Details

Category:
Monographic publication
Type:
chapter, article in a book - collective work/textbook in a language of international reach
Title of issue:
Semantic Keyword-Based Search on Structured Data Sources, pages 56-69
ISSN:
0302-9743
Language:
English
Publication year:
2018
Bibliographic description:
Sobecki A., Kępa M.: Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System// Semantic Keyword-Based Search on Structured Data Sources/ ed. Julian Szymański, Yannis Velegrakis : Springer, 2017, pp. 56-69
DOI:
10.1007/978-3-319-74497-1
Verified by:
Gdańsk University of Technology
