AN ONTOLOGY BASED WEB CRAWLER WITH A NEAR-DUPLICATE DETECTION SYSTEM TO IMPROVE THE PERFORMANCE OF A WEB CRAWLER

Daines Walowe NGULAMU

Abstract


A web crawler is a program that searches the World Wide Web in an orderly manner in order to collect data based on a search query. Web crawling is therefore the process of finding web pages and downloading them automatically. Because of the sheer volume of the World Wide Web, crawlers have difficulty retrieving relevant, high-quality information that matches the user's search query. The same characteristic means that crawlers may download duplicate and near-duplicate web pages for a given query. These pages reduce the quality of search indexes, increase storage cost, and affect page ranking. In order to improve the performance of the web crawler, an ontology-based web crawler with a near-duplicate detection system was designed. The experiment was carried out on secondary data from a sample web site, which was used because crawling is an endless process. With the two approaches combined, the ontology-based web crawler retrieves pages relevant to the user's search query, while the near-duplicate detection system eliminates redundant data.

Key Words: Crawlers, Ontology-Based Web Crawler, Near-Duplicate Detection System
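
The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which ontology, relevance measure, or fingerprinting method the system uses, so everything below is an assumption made for illustration: the toy `ONTOLOGY` dictionary, the term-overlap `relevance` score, the SimHash-style fingerprint with a Hamming-distance threshold, and the names `filter_pages`, `min_relevance`, and `max_distance` are all hypothetical and are not taken from the paper.

```python
import hashlib
import re

# Hypothetical toy ontology: concept -> related terms (illustrative only).
ONTOLOGY = {
    "crawler": {"crawler", "crawling", "spider", "index"},
    "duplicate": {"duplicate", "redundant", "fingerprint"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text, query_concepts):
    """Fraction of tokens that are ontology terms of the queried concepts."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    terms = set().union(*(ONTOLOGY.get(c, set()) for c in query_concepts))
    return sum(t in terms for t in tokens) / len(tokens)


def simhash(text, bits=64):
    """SimHash-style fingerprint: each token votes on every bit position."""
    counts = [0] * bits
    for tok in tokenize(text):
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def filter_pages(pages, query_concepts, min_relevance=0.1, max_distance=3):
    """Keep pages that are relevant and not near-duplicates of kept pages.

    `pages` is an iterable of (url, text) pairs; thresholds are assumptions.
    """
    kept, fingerprints = [], []
    for url, text in pages:
        # Stage 1: ontology-based relevance check.
        if relevance(text, query_concepts) < min_relevance:
            continue
        # Stage 2: near-duplicate check against fingerprints seen so far.
        fp = simhash(text)
        if any(hamming(fp, seen) <= max_distance for seen in fingerprints):
            continue
        fingerprints.append(fp)
        kept.append(url)
    return kept
```

In this sketch a page is first scored against the ontology terms for the query, and only relevant pages are fingerprinted and compared with those already kept. The linear scan over stored fingerprints is chosen for brevity; a crawler handling many pages would likely need an indexed lookup instead.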




