Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System
Abstract
The plagiarism detection problem involves finding patterns in unstructured text documents. In this approach, two documents are considered similar when they contain identical phrases of a defined minimal length. The typical methods used to find similar documents in digital libraries are not suitable for plagiarism detection, because the documents they find may have similar content without any guarantee that they contain identical phrases. The article describes an example method of searching large document repositories for similar documents that contain identical phrases, and presents the problem of selecting a storage and computing platform suitable for using this method in plagiarism detection systems. We compare implementations of the method on two computing platforms, KASKADA and Hadoop, in different configurations, in order to test and compare their performance and scalability. The method using the default tools available on the Hadoop platform, i.e. HDFS and Apache Spark, offers worse performance than the method implemented on the KASKADA platform using NFS (Network File System) and the Master/Slave processing model. The advantage of the Hadoop platform increases with the use of additional data structures (a hash map) and other tools offered on this platform, i.e. HBase (NoSQL). The tools integrated with the Hadoop platform make it possible to create an efficient and scalable method for finding similar documents in big repositories. The KASKADA platform offers efficient tools for analysing data in real-time processes, i.e. when there is no need to compare the input data to a large collection of information (patterns) or to use advanced data structures.
The contribution of this article is the comparison of the two computing and storage platforms in order to achieve better performance of the method used in the plagiarism detection system to find similar documents containing identical phrases.
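To make the notion of similarity used in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a "phrase" is a run of consecutive words and that the hash map is an in-memory Python dictionary; the function names (`phrases`, `build_index`, `find_similar`), the sample repository and the minimal length of three words are all hypothetical. In a Hadoop setting a store such as HBase could play the role of the dictionary, but that mapping is an assumption of this sketch.

```python
from collections import defaultdict


def phrases(text, min_words):
    """Yield every run of `min_words` consecutive lowercased words in `text`."""
    words = text.lower().split()
    for i in range(len(words) - min_words + 1):
        yield " ".join(words[i:i + min_words])


def build_index(repository, min_words):
    """Build a hash map from each phrase to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for phrase in phrases(text, min_words):
            index[phrase].add(doc_id)
    return index


def find_similar(query, index, min_words):
    """Return {doc_id: shared phrases} for documents sharing a phrase with `query`."""
    matches = defaultdict(set)
    for phrase in phrases(query, min_words):
        # .get() avoids inserting empty entries into the defaultdict index
        for doc_id in index.get(phrase, ()):
            matches[doc_id].add(phrase)
    return dict(matches)


# Hypothetical two-document repository, used only for illustration.
repository = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a completely unrelated sentence about distributed storage",
}
index = build_index(repository, min_words=3)
result = find_similar("we saw the quick brown fox yesterday", index, min_words=3)
```

Here `result` reports only document `a`, because it is the only one that shares identical three-word phrases with the query. A document with merely related content but no identical phrase is not reported, which is the distinction the abstract draws from typical digital-library similarity search.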
Citations
- CrossRef: 2
- Web of Science: 0
- Scopus: 0
Details
- Category: Monographic publication
- Type: chapter or article in a book (collective work or textbook) in a language of international reach
- Title of issue: Semantic Keyword-Based Search on Structured Data Sources, pages 56-69
- ISSN: 0302-9743
- Language: English
- Publication year: 2018
- Bibliographic description: Sobecki A., Kępa M.: Methodology of Selecting the Hadoop Ecosystem Configuration in Order to Improve the Performance of a Plagiarism Detection System// Semantic Keyword-Based Search on Structured Data Sources/ ed. Julian Szymański, Yannis Velegrakis : Springer, 2017, pp. 56-69
- DOI: 10.1007/978-3-319-74497-1
- Verified by: Gdańsk University of Technology