
Text extraction method based on text and symbol density

HONG Hong-hui; DING Shi-tao; HUANG Ao; GUO Zhi-yuan
Wuhan Research Institute of Posts and Telecommunications

Abstract: Most web pages contain not only the main content but also navigation bars, advertising, copyright notices, and other irrelevant information. This extra content, usually unrelated to the page's topic, is referred to as noise. Because noise hampers the performance of search engines in Web data mining, it needs to be removed. In this paper, we propose a fast, accurate, and general web content extraction algorithm based on text density and symbol density that preserves the original structure. Compared with some existing algorithms, the proposed algorithm shows better accuracy and better supports body-text extraction over large volumes of web pages.
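
The abstract names text density and symbol density as the two signals but does not give their formulas. The following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes that text density means non-link characters per tag, that symbol density can be approximated by counting punctuation marks, and that the two are combined as `text_density * log(1 + symbols)`. The names `BlockCollector`, `score`, and `extract_main_text`, the symbol set, and the list of block tags are all hypothetical choices made for this example.

```python
import math
from html.parser import HTMLParser

# Assumed symbol set: common Western and Chinese punctuation marks.
SYMBOLS = set(",.;:!?，。；：！？、")
# Assumed set of tags treated as candidate content blocks.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Gather text statistics for each block, crediting text to the innermost open block."""

    def __init__(self):
        super().__init__()
        self.stack = []       # blocks that are currently open
        self.blocks = []      # blocks that have been closed
        self.link_depth = 0   # > 0 while inside <a>...</a>
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"text": "", "link_chars": 0, "tags": 0})
        else:
            if tag == "a":
                self.link_depth += 1
            if self.stack:
                self.stack[-1]["tags"] += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        block = self.stack[-1]
        block["text"] += data
        if self.link_depth:
            block["link_chars"] += len(data.strip())


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def score(block):
    """Assumed scoring: non-link characters per tag, weighted by punctuation count."""
    text = normalize(block["text"])
    plain_chars = max(0, len(text) - block["link_chars"])
    symbols = sum(ch in SYMBOLS for ch in text)
    text_density = plain_chars / (block["tags"] + 1)
    return text_density * math.log(1 + symbols)


def extract_main_text(html):
    """Return the text of the highest-scoring block, or "" if there is none."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    best = max(collector.blocks, key=score, default=None)
    return normalize(best["text"]) if best else ""
```

Under these assumptions, a navigation bar scores low because most of its characters sit inside links and it contains almost no punctuation, while a body paragraph has long runs of plain text with frequent punctuation. A real extractor would also need to merge neighbouring high-scoring blocks rather than return a single one.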
  • DOI: 10.14022/j.cnki.dzsjgc.2019.08.029
  • Series: (I) Electronic Technology & Information Science
  • Subject: Computer Software and Application of Computer; Internet Technology
  • Classification Code: TP391.1; TP393.092


Pages: 133-137 (5 pages)
