Abstract
A topical web crawler collects web pages that describe pre-specified topics. The pages it collects share the same or similar words, yet quite a few of them can be irrelevant to the given topics. In particular, the performance of a topical crawler degrades for more specific topics. Successful topical crawling therefore requires an additional step that actively filters out the pages irrelevant to the given topics. For this, we propose an ensemble-style machine learning architecture that can effectively handle not only literal term features but also numeric meta-features to improve a topical web crawler; in our work, we aim to crawl web pages about 'fire accidents', as a specific topic, more precisely. For fire accidents, we have found that the significant meta-features for topical crawling include tag information, the number of words in the title, the number of person names, the number of location names in a web page, and so forth. For the numeric meta-features we use the logistic regression and random forest learning algorithms, and for the literal word features, the Naive Bayes and support vector learning algorithms. Through extensive experiments using fire accident-related news articles, we show that the proposed method outperforms the conventional ones.
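The filtering idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MetaFeatureParser`, `extract_meta_features`, and `soft_vote` are hypothetical, the specific tags counted are chosen only for illustration, and averaging the member probabilities (soft voting) is an assumption, since the abstract does not say how the ensemble combines its classifiers. Counting person and location names would need a named-entity recognizer and is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaFeatureParser(HTMLParser):
    """Collects tag counts and the title text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_meta_features(html):
    """Return a few numeric meta-features (an illustrative subset only)."""
    parser = MetaFeatureParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts)
    return {
        "title_word_count": len(title.split()),
        "num_paragraph_tags": parser.tag_counts["p"],
        "num_link_tags": parser.tag_counts["a"],
    }


def soft_vote(probabilities, threshold=0.5):
    """Average the members' relevance probabilities (assumed combiner).

    Returns the mean probability and whether the page is kept.
    """
    mean = sum(probabilities) / len(probabilities)
    return mean, mean >= threshold
```

In such a setup, the meta-feature dictionary would feed the numeric classifiers (logistic regression, random forest), the page text would feed the word-based classifiers (Naive Bayes, support vector), and `soft_vote` would take one relevance probability from each of them to decide whether the crawler keeps the page.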
Original language | English |
---|---|
Pages (from-to) | 4651-4656 |
Number of pages | 6 |
Journal | Journal of Engineering and Applied Sciences |
Volume | 12 |
Issue number | 18 |
State | Published - 2017 |
Keywords
- Ensemble
- Extensive
- Filtering
- Machine Learning
- Meta-features
- Web crawler