Counting Chinese Characters and Segmenting Words in Python

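I wanted to split a piece of text into words, and doing that means working out how the words in it relate to one another.

I downloaded a Chinese novel more or less at random from the internet. Even an arbitrary .txt novel comes to a little over 1 MB. The goal is to find out how many characters that 1 MB of Chinese contains, how many words it segments into, and what part of speech each of those words has.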

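The approach:

1) Read the novel into memory.

2) Tokenize the novel with a regular expression to get the total number of Chinese characters.

3) POST each paragraph of the in-memory novel to an API that provides a word segmentation service, and collect the segmentation results.

4) Extract the words as described in the API documentation.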
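Steps 1 and 2 can be sketched in Python 3 as follows. This is a minimal illustration, not the script used in this post: the helper names `count_han` and `count_han_in_file` are hypothetical, and it assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so full-width punctuation and the rarer extension blocks are not counted.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as a
# "Chinese character"; punctuation and extension blocks are ignored.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")


def count_han(text):
    """Return the number of Chinese characters in an already-decoded string."""
    return len(HAN_RE.findall(text))


def count_han_in_file(path, encoding="gbk"):
    """Read a text file line by line and sum the per-line character counts."""
    total = 0
    # errors="replace" keeps the count going if a few bytes are malformed
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            total += count_han(line)
    return total


sample = "你好，world！这是测试。"
print(count_han(sample))  # prints 6: the punctuation and "world" are skipped
```

Working on decoded text rather than raw bytes keeps the pattern independent of the file encoding.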

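Materials:

1. GNU/Linux: Debian, Ubuntu 12.04, or Linux Mint 13 preferred
2. Python 2 (the script relies on urllib2)
3. A Chinese word segmentation API; here we use http://www.vapsec.com/fenci/
4. The reference file describing the part-of-speech tags, available from http://vdisk.weibo.com/s/qR7KSFDa9ON or http://ishare.iask.sina.com.cn/f/68191875.html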

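A test script is already written. It calls the API from a single process only; concurrent access has not been tested yet.

I will add concurrency in later tests.

The test script, test.py, follows.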
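As a rough idea of what a concurrent version might look like, the sketch below uses a thread pool from the Python 3 standard library. It is not part of the original test: `segment_all` and `fake_segment` are hypothetical names, and `fake_segment` is a local stand-in for the network call so the example runs offline. Whether the remote service tolerates parallel requests is unknown.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_all(lines, segment, max_workers=4):
    """Apply `segment` to every line using a thread pool.

    Executor.map yields results in input order, so the output list
    lines up with `lines` even though the calls overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(segment, lines))


def fake_segment(line):
    # Hypothetical stand-in for the HTTP request: split on whitespace only.
    return line.split()


results = segment_all(["a b", "c d e"], fake_segment)
print(results)  # prints [['a', 'b'], ['c', 'd', 'e']]
```

Threads suit this workload because each call spends most of its time waiting on the network rather than computing.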

#!/usr/bin/env python
# coding: utf-8
# Python 2 script: it relies on urllib2 and the print statement.
import re
import urllib
import urllib2
from datetime import datetime


def url_post(word='My name is Jake Anderson', geshi="json"):
    # POST one piece of text to the segmentation API and print the raw reply.
    url = "http://open.vapsec.com/segment/get_word"
    postDict = {
        "word": word,
        "format": geshi,
    }
    postData = urllib.urlencode(postDict)
    request = urllib2.Request(url, postData)
    request.get_method = lambda: 'POST'
    #request.add_header('Authorization', basic)
    response = urllib2.urlopen(request)
    r = response.readlines()
    print r


if __name__ == "__main__":
    # Matches either an ASCII word or one 3-byte UTF-8 sequence, which is how
    # most Chinese characters (and full-width punctuation) are encoded.
    regex = re.compile(r"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")

    # First pass: count the tokens in the novel.
    count = 0
    with open('novel2.txt', 'r') as f:
        for line in f:
            # The file is GBK; the regex expects UTF-8 bytes.
            line = line.decode('gbk').encode('utf8')
            # findall returns the matches themselves; split would return
            # the text between them.
            count += len(regex.findall(line))
    #print count

    # Second pass: send every line to the API and time the whole run.
    start_time = datetime.now()
    with open('novel2.txt', 'r') as f:
        for line in f:
            line = line.decode('gbk').encode('utf8')
            print line
            url_post(line)
    end_time = datetime.now()
    tdelta = end_time - start_time
    print "It takes " + str(tdelta.total_seconds()) + " seconds to segment " + str(count) + " Chinese words!"
    print "This means it can segment " + str(count / tdelta.total_seconds()) + " Chinese characters per second!"
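
Because urllib2 does not exist in Python 3, the request helper could be written along these lines there. This is a sketch under assumptions: the endpoint and the `word` and `format` field names are taken from test.py, while `build_request`, `segment`, the 10-second timeout, and the UTF-8 response encoding are hypothetical choices, not anything documented by the API.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from test.py; it may no longer be reachable.
API_URL = "http://open.vapsec.com/segment/get_word"


def build_request(text, fmt="json"):
    """Build, but do not send, the form-encoded POST request."""
    # urlencode percent-encodes non-ASCII text as UTF-8, so the result is ASCII
    data = urlencode({"word": text, "format": fmt}).encode("ascii")
    return Request(API_URL, data=data, method="POST")


def segment(text, fmt="json", timeout=10):
    """Send the request and return the raw response body as text.

    Assumption: the service answers in UTF-8. The structure of the
    reply is not described in this post, so it is returned unparsed.
    """
    with urlopen(build_request(text, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_request("你好")
print(req.get_method(), req.data)
```

Separating `build_request` from `segment` makes it possible to inspect the outgoing payload without touching the network.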

novel2.txt is the downloaded novel. It is 1.2 MB in size, roughly 580,000 characters.

The novel is encoded in GBK, so after downloading it has to be converted to UTF-8.
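The test script does this conversion in memory, one line at a time. If a converted copy on disk is preferred, a Python 3 sketch could look like the following; the function name `transcode` and the output file name in the usage note are hypothetical.

```python
def transcode(src_path, dst_path, src_encoding="gbk", dst_encoding="utf-8"):
    """Copy a text file, re-encoding it from src_encoding to dst_encoding.

    newline="" on both files leaves the original line endings untouched;
    errors="replace" substitutes U+FFFD for bytes that are not valid GBK.
    """
    with open(src_path, "r", encoding=src_encoding,
              errors="replace", newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

Calling `transcode("novel2.txt", "novel2.utf8.txt")` would write a UTF-8 copy next to the original, reading one line at a time so the whole file never has to sit in memory.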

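The terminal output looks roughly like this, as every word in the novel is run through the remote segmentation service.

[Figure 1: screenshot of the terminal output]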
