News Filter By Using Scraping Web with Crawling RegEx and TF-IDF Filter

Warna Agung C., Nur Lailatul Aqromi


The most important information in education takes the form of announcements, as shown by a poll on the Blitar Education Office website, where this response reached 69.8% on 10 November 2016. In practice, such information can also come from other Education Offices, provinces, or CSR. To address this problem, the researchers propose two steps for collecting relevant news from various sites. First, a virtual robot is created for crawling. Second, the HTML documents returned by the crawl are filtered and ranked. Crawling is carried out by a scraping method based on regular expressions (RegEx), which cut out parts of an HTML page according to a predetermined pattern. Each URL obtained is then used to crawl the detailed content from its original location. The documents are then stored and ranked using TF-IDF. Next, the ranking results are sent via a POST-method agent request. In the first experiment, RegEx parsing ran successfully on 100% of the URL addresses. In the second experiment, which measured recall and precision, the two reached a similar level of up to 71.4% when the filtered news displayed was capped at half of the total documents.
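The two steps described in the abstract, RegEx-based link extraction followed by TF-IDF ranking, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the link pattern, the function names `extract_links` and `tfidf_rank`, the query-based scoring, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical pattern (an assumption, not taken from the paper):
# captures the href value and the inner text of each <a> element.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def extract_links(html):
    """Return (url, anchor_text) pairs matched by the regular expression."""
    return [(url, TAG_RE.sub("", text).strip())
            for url, text in LINK_RE.findall(html)]


def tokenize(text):
    """Strip tags, lowercase, and split into alphanumeric tokens."""
    return TOKEN_RE.findall(TAG_RE.sub(" ", text).lower())


def tfidf_rank(documents, query):
    """Return document indices ordered by summed TF-IDF of the query terms."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        length = max(len(tokens), 1)  # avoid division by zero on empty docs
        score = 0.0
        for term in tokenize(query):
            if df[term]:
                score += (tf[term] / length) * math.log(n_docs / df[term])
        scored.append((score, idx))

    # Highest score first; ties keep the original document order.
    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

In such a pipeline, each URL returned by `extract_links` would be fetched to obtain the detail page, and the fetched pages would be passed to `tfidf_rank`; keeping only the top half of the returned indices would correspond to the cutoff mentioned for the second experiment.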

