Python: Converting HTML to Plain Text (Wally_Yu's blog)

>>> import nltk
>>> aa = r'''
<html>
 <body>
 <b>Project:</b> DeHTML<br>
 <b>Description</b>:<br>
 This small script is intended to allow conversion from HTML markup to 
 plain text.
 </body>
 </html>
 '''
>>> aa
'\n<html>\n <body>\n <b>Project:</b> DeHTML<br>\n <b>Description</b>:<br>\n This small script is intended to allow conversion from HTML markup to \n plain text.\n </body>\n </html>\n '
>>> print nltk.clean_html(aa)
Project: DeHTML Description : This small script is intended to allow conversion from HTML markup to plain text.
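
The session above is Python 2 with an old nltk; as far as I know, NLTK 3.x no longer implements `clean_html` (calling it raises `NotImplementedError` and points to BeautifulSoup's `get_text()`). As a rough stand-in, here is a minimal standard-library sketch that yields the same single-line result for this sample. The names `_Collector` and `flatten_html` are hypothetical, and this is not nltk's algorithm, only an approximation of its output.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collects every text node and ignores all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_html(html):
    """Return the document's text with all whitespace runs collapsed to one space."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    # Join the text nodes with spaces, then split/join to normalise whitespace.
    return " ".join(" ".join(collector.chunks).split())


sample = """
<html>
 <body>
 <b>Project:</b> DeHTML<br>
 <b>Description</b>:<br>
 This small script is intended to allow conversion from HTML markup to
 plain text.
 </body>
</html>
"""

print(flatten_html(sample))
```

Because every text node is joined with a space, `Description` and `:` come out separated, just as in the nltk output shown above.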

If nltk feels too heavyweight for such a small job, you can write the code yourself. The following code is taken from: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python

# Python 3 import; the original post used `from HTMLParser import HTMLParser` (Python 2).
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc


class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # Collapse whitespace runs inside each text node.
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # <p> starts a new paragraph, <br> a new line.
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/>.
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parse failure, log the traceback and return the input unchanged.
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

Output:
>>> ================================ RESTART ================================
Project: DeHTML 
Description : 
This small script is intended to allow conversion from HTML markup to plain text.
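
The parser above passes every text node through, so the contents of `<script>` and `<style>` elements would end up in the output as well. A hedged sketch of one way to skip them, written for Python 3; `TextOnlyParser` and `html_to_text` are hypothetical names and are not part of the Stack Overflow answer.

```python
import re
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical variant: same <p>/<br> handling, but drops script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._parts.append("\n\n")
        elif tag == "br":
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>: ignore
        text = re.sub(r"\s+", " ", data.strip())
        if text:
            self._parts.append(text + " ")

    def text(self):
        return "".join(self._parts).strip()


def html_to_text(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Tom &amp; Jerry</p><script>var x = 1;</script><br>done"))
```

The skip counter is a deliberate simplification: it assumes script and style elements are properly closed, which is usually true but not guaranteed for arbitrary pages.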

Truly, one generation plants the trees and the next enjoys the shade. It reminds me of a picture:
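Today my project needed to convert HTML to plain text. A quick search showed that Python, versatile as ever, offers all sorts of ways to do it. Here are the two methods I tried myself today, for the benefit of those who come after:
Method 1:
1. Install nltk, available from PyPI (note: it depends on the numpy and PyYAML packages).
2. Test code:
>>> import nltk
>>> aa = r'''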