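scws stands for Simple Chinese Word Segmentation. It is a mechanical Chinese word segmentation engine written in C and based on a word-frequency dictionary. scws was written by hightman and is released under the BSD license. The author of nodescws added features on top of libscws (including stop words, punctuation ignoring, and JSON-format configuration) together with a node.js binding, and holds no copyright on libscws apart from that added code.
scws homepage: http://www.xunsearch.com/scws, GitHub: https://github.com/hightman/scws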
Current release: v0.5.1
- Project homepage: https://github.com/dotSlashLu/nodescws
- Usage questions and bug reports: https://github.com/dotSlashLu/nodescws/issues
npm install scws
var Scws = require("scws");
var scws = new Scws(settings); // NOTE: before v0.5.0, do new Scws.init(settings)
var results = scws.segment(text);
scws.destroy(); // DO NOT forget this or your memory may be corrupted
Note: before v0.5.0, initialize with new Scws.init(settings).
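Because a forgotten destroy() can leave native memory in a bad state, it may help to wrap the segment call in try/finally. The following is a minimal sketch, not part of the scws API: `withScws` is a hypothetical helper, and the constructor is passed in as a parameter (in real use it would be `require("scws")`). `FakeScws` is a stand-in so the pattern can be run without the native module.

```javascript
// Hypothetical helper (not provided by scws): runs `fn` with a fresh instance
// and always calls destroy(), even if `fn` throws.
// `ScwsCtor` is assumed to behave like the constructor exported by require("scws").
function withScws(ScwsCtor, settings, fn) {
  var scws = new ScwsCtor(settings);
  try {
    return fn(scws);
  } finally {
    scws.destroy();
  }
}

// Stand-in constructor so this sketch runs without the native module.
function FakeScws(settings) {
  this.settings = settings;
  this.destroyed = false;
}
FakeScws.prototype.segment = function (text) {
  // Splits on spaces only; real scws performs dictionary-based segmentation.
  return text.split(" ").map(function (w) { return { word: w }; });
};
FakeScws.prototype.destroy = function () {
  this.destroyed = true;
};

var seen = null;
var words = withScws(FakeScws, { charset: "utf8" }, function (scws) {
  seen = scws;
  return scws.segment("hello world");
});
console.log(words.length, seen.destroyed);
```

With the real module, the call would look like withScws(require("scws"), settings, function (scws) { return scws.segment(text); }).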
- settings: Object, segmentation settings. Supported entries: charset, dicts, rule, ignorePunct, multi, debug, applyStopWord:
  - charset: String, optional. The encoding to use; "utf8" and "gbk" are supported. Default: "utf8".
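  - dicts: String, required. Filename(s) of the dictionaries to use; separate multiple files with ':'. Both xdb and txt formats are supported; custom dictionaries should use the ".txt" file suffix. For example: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt". The xdb dictionaries bundled with scws are in ./dicts/ under this extension's directory (usually node_modules/scws/), in Simplified and Traditional Chinese variants. If this entry is missing, the bundled utf8 Simplified Chinese dictionary is used by default.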
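  - rule: String, optional. The rule file to use, which sets place names, personal names, stop words, and so on for the corresponding encoding. See rules/rules.utf8.ini under this extension's directory (usually node_modules/scws/). If this entry is missing, the bundled utf8 rule file is used by default. v0.2.3 added JSON support to avoid the cumbersome ini syntax: if the value ends with .json, the corresponding JSON rule file is parsed, and a JSON string can also be passed directly as the configuration. For the syntax, see ./rules/rules.utf8.json.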
  - ignorePunct: Bool, optional. Whether to ignore punctuation.
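  - multi: String, optional. Whether to perform compound segmentation of long words; for example, the word 中国人 yields "中国人", "中国", and "人". Allowed values: "short" (short words), "duality" (combine two adjacent single characters), "zmain" (important single characters), "zall" (all single characters).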
  - debug: Bool, optional. Whether to run in debug mode. If true, scws writes its log, warning, and error output to stdout. Default: false.
  - applyStopWord: Bool, optional. Whether to apply the stop words specified in the [nostats] section of the rule file. Default: true.
- text: String, the string to segment.
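As a rough illustration of the settings described above, the sketch below assembles a `dicts` string from a list of paths and builds a settings object. `joinDicts` is a hypothetical helper and the paths are only examples; the ':' separator, the xdb/txt formats, and the setting names are the parts taken from the description above.

```javascript
var path = require("path");

// Hypothetical helper: joins dictionary paths with ':' as the `dicts` setting expects.
// Rejects extensions other than .xdb and .txt, the two formats mentioned above.
function joinDicts(files) {
  files.forEach(function (file) {
    var ext = path.extname(file);
    if (ext !== ".xdb" && ext !== ".txt") {
      throw new Error("unsupported dictionary format: " + file);
    }
  });
  return files.join(":");
}

// Example paths are illustrative; adjust them to where the dictionaries live.
var settings = {
  charset: "utf8",
  dicts: joinDicts(["./dicts/dict.utf8.xdb", "./dicts/dict.test.txt"]),
  rule: "./rules/rules.utf8.ini",
  ignorePunct: true,
  multi: "short",
  debug: false,
  applyStopWord: true
};

console.log(settings.dicts);
```

The resulting object is what would be passed to new Scws(settings).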
Return Array
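[
  {
    word: '可读性',
    offset: 183, // position of the word in the document
    length: 9, // in bytes
    attr: 'n', // part of speech, per the standard "Specification for Modern Chinese Corpus Processing: Word Segmentation and Part-of-Speech Tagging"; for the meanings see http://blog.csdn.net/dbigbear/article/details/1488087
    idf: 7.800000190734863
  },
  ...
]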
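Since length is counted in bytes rather than characters, slicing the original JavaScript string with these numbers would give wrong results for multi-byte text. Below is a minimal sketch, assuming that offset is also a byte offset into the encoded input (only length is explicitly marked as bytes above) and that the input is utf8. `wordAt` is a hypothetical helper, and the sample entry is hand-written rather than real scws output.

```javascript
// Hypothetical helper: recovers the word for one result entry by slicing
// the utf8-encoded bytes of the original text.
// Assumes entry.offset and entry.length are both byte counts.
function wordAt(text, entry) {
  var bytes = Buffer.from(text, "utf8");
  return bytes.subarray(entry.offset, entry.offset + entry.length).toString("utf8");
}

var text = "大家好";
// Hand-written entry: each of these characters takes 3 bytes in utf8.
var entry = { word: "好", offset: 6, length: 3 };

console.log(wordAt(text, entry) === entry.word);
```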
var fs = require("fs");
var Scws = require("scws");

fs.readFile("./test_doc.txt", {
  encoding: "utf8"
}, function(err, data){
  if (err)
    return console.error(err);

  // initialize scws with config entries
  var scws = new Scws({
    charset: "utf8",
    //dicts: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt",
    dicts: "./dicts/dict.utf8.xdb",
    rule: "./rules/rules.utf8.ini",
    ignorePunct: true,
    multi: "duality",
    debug: true
  });

  // segment text
  var res = scws.segment(data);
  var res1 = scws.segment("大家好我来自德国,我是德国人");
  console.log(res, res1);

  // destroy scws to reclaim its memory
  scws.destroy();
});
For more examples, see the tests in test/.
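The result arrays can be post-processed with ordinary array methods. As one hedged example, the sketch below keeps only entries whose attr starts with "n" (noun tags) and ranks them by descending idf. `topNouns` is a hypothetical helper, and the sample entries, including their offset and idf values, are hand-written for illustration rather than real scws output.

```javascript
// Hypothetical helper: returns up to `limit` noun-like words, highest idf first.
function topNouns(results, limit) {
  return results
    .filter(function (r) {
      return typeof r.attr === "string" && r.attr.charAt(0) === "n";
    })
    .sort(function (a, b) { return b.idf - a.idf; })
    .slice(0, limit)
    .map(function (r) { return r.word; });
}

// Illustrative entries shaped like the documented return value.
var sample = [
  { word: "德国", offset: 18, length: 6, attr: "ns", idf: 5.5 },
  { word: "来自", offset: 12, length: 6, attr: "v", idf: 4.2 },
  { word: "可读性", offset: 183, length: 9, attr: "n", idf: 7.8 }
];

console.log(topNouns(sample, 2));
```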
- Update NAN, supports all major node.js versions
- New js API design
- Fix #11
- Thanks to @mike820324, scws now supports io.js
- Changed project structure
- Refactored node bindings
- Added rule configuration via JSON file or JSON string, which makes adding stop words easier from node
- Some small bug fixes, including issue #5 (thanks to @Frully)
- Add stop words support
- Remove line endings when ignorePunct is set to true
You can add your own stop words in the [nostats] entry of the rule file. Turn off the stop words feature by setting applyStopWord to false.
New syntax to initialize scws: scws = new Scws(config); result = scws.segment(text); scws.destroy(). This allows an scws instance to be reused, which greatly improves performance under repeated use (approximately 1/4 faster).
Added new setting entry debug. Setting config.debug = true will make scws output its log, error, and warning messages to stdout.
Published to the npm registry. Usage: scws(text, settings);
Available setting entries: charset, dicts, rule, ignorePunct, multi.