Machine learning-based topical web crawler: An ensemble approach incorporating meta-features

Tae Jun Kim, Han Joon Kim

Research output: Contribution to journal, Article, peer-review

1 Scopus citation

Abstract

A topical web crawler collects web pages that describe pre-specified topics. The pages collected by a topical crawler share the same or similar words, yet a considerable number of them can still be irrelevant to the given topics. In particular, the performance of a topical crawler degrades as the topic becomes more specific. Successful topical crawling therefore requires an additional step that actively filters out pages irrelevant to the given topics. For this, we propose an ensemble-style machine learning architecture that can effectively handle not only literal term features but also numeric meta-features to improve the topical web crawler; in our work, we aim to crawl web pages about 'fire accidents', as a specific topic, more precisely. For the fire accident topic, we have found that significant meta-features for topical crawling include tag information, the number of words in the title, the number of person names, the number of location names in the web pages, and so forth. For the numeric meta-features we use the logistic regression and random forest learning algorithms, and for the literal word features we use the Naive Bayes and support vector machine learning algorithms. Through extensive experiments using fire accident-related news articles, we show that the proposed method outperforms the conventional ones.
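The meta-features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how names are detected, how tag information is encoded, or how the classifiers are combined. The sketch assumes hypothetical gazetteers (`PERSON_NAMES`, `LOCATION_NAMES`) in place of a named-entity recognizer, a plain count of opening tags as the tag feature, and simple probability averaging (`soft_vote`) as the ensemble rule.

```python
import re
from html.parser import HTMLParser

# Hypothetical gazetteers (assumption): a real system would more likely
# rely on a named-entity recognizer than on fixed word lists.
PERSON_NAMES = {"kim", "lee", "park"}
LOCATION_NAMES = {"seoul", "busan", "incheon"}


class PageScanner(HTMLParser):
    """Counts opening tags and collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_meta_features(html):
    """Return numeric meta-features for one page (illustrative only)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    title = "".join(scanner.title_parts)
    # Strip tags, then tokenize the remaining text into lowercase words.
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "num_tags": scanner.tag_count,
        "title_words": len(title.split()),
        "num_person_names": sum(t in PERSON_NAMES for t in tokens),
        "num_location_names": sum(t in LOCATION_NAMES for t in tokens),
    }


def soft_vote(probabilities, threshold=0.5):
    """Average per-classifier relevance probabilities (assumed ensemble rule)."""
    mean = sum(probabilities) / len(probabilities)
    return mean >= threshold, mean


page = (
    "<html><head><title>Fire in Seoul warehouse</title></head>"
    "<body><p>Officer Kim said the fire in Seoul spread fast.</p></body></html>"
)
features = extract_meta_features(page)
# Placeholder scores standing in for the four classifiers' outputs.
relevant, score = soft_vote([0.9, 0.7, 0.4, 0.6])
```

In a fuller pipeline, the `features` dictionary would feed the classifiers trained on numeric meta-features, while the page text would feed the classifiers trained on word features; the four scores passed to `soft_vote` here are placeholders, not results from the paper.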

Original language: English
Pages (from-to): 4651-4656
Number of pages: 6
Journal: Journal of Engineering and Applied Sciences
Volume: 12
Issue number: 18
State: Published - 2017

Keywords

  • Ensemble
  • Extensive
  • Filtering
  • Machine Learning
  • Meta-features
  • Web crawler
