Performance tuning for initialization #72

yukon12345 · 2020-12-23T08:18:17Z

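I have been using this version of jieba, and loading the dictionary feels too slow.
Profiling showed that the time is split roughly in half between splitting each dictionary line and storing the results into original_freq and total.
An earlier issue suggested running jieba as a long-lived API service that stays resident in memory to avoid the loading cost, but that approach is still cumbersome.
So I modified Jieba::genTrie() to add a cache, so the dictionary does not have to be re-read every time: the first run generates the cache, and later runs load the pre-built original_freq array directly. The author commented out another cache load inside this method, so this has probably been considered already; I am not sure why the feature was never added.
Test results:
Loading the big dictionary: from over 9 seconds down to 2-3 seconds.
Loading the default dictionary: from over 5 seconds down to 2 seconds.
Loading the small dictionary: from over 3 seconds down to under 1 second.
Finally, I hope the author can keep maintaining this version of jieba. I am in crunch mode right now and have also optimized a few other places, so I do not have time for a pull request at the moment.
If you modify the dictionary, just delete the .cache file.
To enable dictionary caching, simply replace the original Jieba::genTrie() method with the code below: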

/**
     * Static method genTrie
     *
     * @param string $f_name # input f_name
     * @param array $options # other options
     *
     * @return array self::$trie
     */
    public static function genTrie($f_name, $options = array())
    {
        $defaults = array(
            'mode' => 'default'
        );
        $options = array_merge($defaults, $options);
        self::$trie = new MultiArray(file_get_contents($f_name . '.json'));
        //self::$trie->cache = new MultiArray(file_get_contents($f_name.'.cache.json'));
        if(file_exists($f_name.'.cache')){
            // A cache file exists: load it instead of re-parsing the dictionary
            $datas=json_decode(file_get_contents($f_name.'.cache'),true);
            self::$original_freq=$datas['original_freq'];
            self::$total = $datas['total'];
        }else {
            $content = fopen($f_name, "r");
            while (($line = fgets($content)) !== false) {
                $explode_line = explode(" ", trim($line));
                $word = $explode_line[0];
                $freq = $explode_line[1];
                $tag = $explode_line[2];
                $freq = (float)$freq;
                if (isset(self::$original_freq[$word])) {
                    self::$total -= self::$original_freq[$word];
                }
                self::$original_freq[$word] = $freq;
                self::$total += $freq;
                #//$l = mb_strlen($word, 'UTF-8');
                #//$word_c = array();
                #//for ($i=0; $i<$l; $i++) {
                #//    $c = mb_substr($word, $i, 1, 'UTF-8');
                #//    array_push($word_c, $c);
                #//}
                #//$word_c_key = implode('.', $word_c);
                #//self::$trie->set($word_c_key, array("end"=>""));
            }
            fclose($content);
            // Write the .cache file so later runs can skip parsing
            $datas=[];
            $datas['original_freq']=self::$original_freq;
            $datas['total']=self::$total ;
            file_put_contents($f_name.'.cache',json_encode($datas));
        }
        return self::$trie;
    }// end function genTrie
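
The post says the .cache file has to be deleted by hand whenever the dictionary changes. As a rough sketch of one way to avoid that step, the cache could be treated as stale whenever the dictionary file has a modification time that is not older than the cache. The function names `parseDict` and `loadDictCached` below are hypothetical and are not part of jieba-php; the code only mirrors the `original_freq` / `total` structure used in the method above and assumes the same "word freq tag" line format.

```php
<?php
declare(strict_types=1);

/**
 * Parse a dictionary file with one "word freq [tag]" entry per line.
 * Hypothetical helper, not part of jieba-php.
 */
function parseDict(string $dictPath): array
{
    $handle = fopen($dictPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open dictionary: $dictPath");
    }
    $freq = [];
    $total = 0.0;
    while (($line = fgets($handle)) !== false) {
        $parts = explode(' ', trim($line));
        if (count($parts) < 2) {
            continue; // skip blank or malformed lines
        }
        $word = $parts[0];
        $f = (float) $parts[1];
        if (isset($freq[$word])) {
            $total -= $freq[$word]; // a repeated word replaces its earlier frequency
        }
        $freq[$word] = $f;
        $total += $f;
    }
    fclose($handle);
    return ['original_freq' => $freq, 'total' => $total];
}

/**
 * Load the dictionary through a JSON cache stored next to it.
 * The cache is used only if it is strictly newer than the dictionary.
 */
function loadDictCached(string $dictPath): array
{
    $cachePath = $dictPath . '.cache';
    if (is_file($cachePath) && filemtime($cachePath) > filemtime($dictPath)) {
        $data = json_decode((string) file_get_contents($cachePath), true);
        if (is_array($data) && isset($data['original_freq'], $data['total'])) {
            return $data;
        }
        // Unreadable or corrupt cache: fall through and rebuild it.
    }
    $data = parseDict($dictPath);
    // Write to a temporary file and rename it, so a concurrent reader
    // does not see a half-written cache.
    $tmpPath = $cachePath . '.' . getmypid() . '.tmp';
    file_put_contents($tmpPath, json_encode($data, JSON_UNESCAPED_UNICODE));
    rename($tmpPath, $cachePath);
    return $data;
}
```

The strict `>` comparison means a cache written in the same second as the dictionary is rebuilt once more on the next load. That costs one extra parse but avoids serving stale data on filesystems with one-second timestamp resolution.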


fukuball · 2020-12-23T09:16:12Z

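@yukon12345 Thanks for the suggestion. I would still appreciate it if you could send a pull request; otherwise I will come back and add the cache feature later, once work is less busy.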

fukuball added the enhancement label Dec 23, 2020
