Online Web news extraction via tag path feature weighted by text block density

Gongqing WU; Pengcheng LIU; Jun HU; Xuegang HU (School of Computer Science and Information Engineering, Hefei University of Technology)

Abstract: Web news extraction underpins many "big data" and "big knowledge" applications and remains an open research problem. Tag paths and text block density are two effective features for this task. The tag path feature distinguishes content from noise well across a whole webpage, but it has difficulty recognizing noise inside a content block or content inside a noise block. The text block density feature recognizes high-density content blocks well, but it is not robust enough. To address these problems, we propose a Web information extraction model, referred to as CEDP, that effectively combines the tag path feature and the text block density feature. We design a tag path feature weighted by text block density in order to exploit the merits of both features. In addition, we design CEDP-NLTD, a Web news extraction method based on the weighted tag path feature. CEDP-NLTD is a fast, universal, training-free, online Web news extraction algorithm suited to extracting heterogeneous Web news from the big data environment of the Web across various sources, styles, and languages. Experiments on public datasets such as CleanEval show that CEDP-NLTD outperforms the state-of-the-art CETR, CETD, CEPR, and CEPF methods, and also outperforms CEDP-TD, CEDP-CTD, and CEDP-DSum, each of which is generated from CEDP by using one of the three block density features of CETD.
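
As a rough illustration of the two features named in the abstract, the sketch below groups the text nodes of a page by tag path and scores each path with a simple density-like weight. It is not the CEDP or CEDP-NLTD algorithm, whose formulas are not given here: the weight (average characters per text node under a path) and the names `TagPathCollector`, `score_tag_paths`, and `VOID_TAGS` are assumptions made for this example only.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Tags whose text is never visible page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: groups text nodes by tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags, outermost first
        self.chars = defaultdict(int)   # tag path -> total text characters
        self.nodes = defaultdict(int)   # tag path -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIP_TAGS:
            path = "/".join(self.stack)
            self.chars[path] += len(text)
            self.nodes[path] += 1


def score_tag_paths(html):
    """Return {tag path: average characters per text node}.

    Assumed, illustrative weight only; not the weighting used by CEDP.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {p: collector.chars[p] / collector.nodes[p] for p in collector.chars}


if __name__ == "__main__":
    page = (
        "<html><body>"
        "<div><a>Home</a><a>About</a></div>"
        "<div><p>" + "x" * 100 + "</p><p>" + "y" * 80 + "</p></div>"
        "</body></html>"
    )
    scores = score_tag_paths(page)
    # Paths with long text nodes (likely article text) score higher than link bars.
    print(max(scores, key=scores.get))
```

In this toy page the navigation links share the path `html/body/div/a` and the paragraphs share `html/body/div/p`, so a per-path score separates them even though both sit under `div` elements. A real extractor would need a more careful density measure and a threshold for selecting content paths.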
  • Series:

    (I) Electronic Technology & Information Science

  • Subject:

    Computer Software and Application of Computer; Internet Technology

  • Classification Code:

    TP391.1;TP393.09

Pages: 1078-1094 (17 pages)
