Performance tuning at initialization #72

Open
yukon12345 opened this issue Dec 23, 2020 · 1 comment

yukon12345 commented Dec 23, 2020

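I tried this version of jieba and found that loading the dictionary is too slow.
Profiling showed that splitting each dictionary line and writing to original_freq and total each account for about half of the time.
An earlier issue suggested keeping the data resident in memory by running it as an API service to reduce the loading cost, but that approach is still cumbersome,
so I modified the Jieba::genTrie() method to add caching, so the dictionary does not have to be re-read. The first run generates a cache file, and later runs load the prebuilt original_freq array directly. The author commented out another cache load in this method, so presumably had the same idea; I'm not sure why the feature was never added.
Test results:
Loading the big dictionary: processing time drops from over 9 seconds to 2-3 seconds
Loading the normal dictionary: from over 5 seconds to 2 seconds
Loading the small dictionary: from over 3 seconds to under 1 second.
Finally, I hope the author keeps maintaining this version of jieba. I'm in crunch mode right now and have also optimized some other parts, so I don't have time for a pull request at the moment.
If you modify the dictionary, just delete the .cache file.
To enable dictionary caching, simply replace the original Jieba::genTrie() method with the code below: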

    /**
     * Static method genTrie
     *
     * @param string $f_name  Path to the dictionary file
     * @param array  $options Other options
     *
     * @return array self::$trie
     */
    public static function genTrie($f_name, $options = array())
    {
        $defaults = array(
            'mode' => 'default'
        );
        $options = array_merge($defaults, $options);
        self::$trie = new MultiArray(file_get_contents($f_name . '.json'));
        //self::$trie->cache = new MultiArray(file_get_contents($f_name.'.cache.json'));
        if (file_exists($f_name . '.cache')) {
            # Load from the cache when it exists
            $datas = json_decode(file_get_contents($f_name . '.cache'), true);
            self::$original_freq = $datas['original_freq'];
            self::$total = $datas['total'];
        } else {
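            $content = fopen($f_name, "r");
            if ($content === false) {
                throw new \RuntimeException("Unable to open dictionary file: " . $f_name);
            }
            while (($line = fgets($content)) !== false) {
                $explode_line = explode(" ", trim($line));
                if (count($explode_line) < 2) {
                    continue; # skip blank or malformed lines
                }
                $word = $explode_line[0];
                $freq = (float)$explode_line[1];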
                if (isset(self::$original_freq[$word])) {
                    self::$total -= self::$original_freq[$word];
                }
                self::$original_freq[$word] = $freq;
                self::$total += $freq;
                #//$l = mb_strlen($word, 'UTF-8');
                #//$word_c = array();
                #//for ($i=0; $i<$l; $i++) {
                #//    $c = mb_substr($word, $i, 1, 'UTF-8');
                #//    array_push($word_c, $c);
                #//}
                #//$word_c_key = implode('.', $word_c);
                #//self::$trie->set($word_c_key, array("end"=>""));
            }
            fclose($content);
            # Write the cache file (.cache)
            $datas = [];
            $datas['original_freq'] = self::$original_freq;
            $datas['total'] = self::$total;
            file_put_contents($f_name . '.cache', json_encode($datas));
        }
        return self::$trie;
    }// end function genTrie
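
The same idea can be sketched outside the class, with one possible refinement: instead of deleting the .cache file by hand after editing the dictionary, the cache is rebuilt whenever the dictionary file is newer than it. This is only an illustration under assumptions, not jieba-php code: the function name loadFreqWithCache and the parameter $dictPath are hypothetical, and the file format is assumed to be the "word freq tag" lines parsed above.

```php
<?php
// Hypothetical helper, not part of jieba-php: loads word frequencies from a
// dictionary file, reusing a JSON cache unless the dictionary is newer.
function loadFreqWithCache(string $dictPath): array
{
    $cachePath = $dictPath . '.cache';

    // Reuse the cache only if it is at least as new as the dictionary.
    if (is_file($cachePath) && filemtime($cachePath) >= filemtime($dictPath)) {
        $data = json_decode(file_get_contents($cachePath), true);
        if (is_array($data) && isset($data['original_freq'], $data['total'])) {
            return $data;
        }
        // An unreadable or incomplete cache falls through to a full rebuild.
    }

    $handle = fopen($dictPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open dictionary: $dictPath");
    }

    $freqs = [];
    $total = 0.0;
    while (($line = fgets($handle)) !== false) {
        $parts = explode(' ', trim($line));
        if (count($parts) < 2) {
            continue; // skip blank or malformed lines
        }
        $word = $parts[0];
        $freq = (float) $parts[1];
        if (isset($freqs[$word])) {
            $total -= $freqs[$word]; // a repeated word replaces its earlier count
        }
        $freqs[$word] = $freq;
        $total += $freq;
    }
    fclose($handle);

    // Write the cache for the next run; LOCK_EX avoids interleaved writes.
    $data = ['original_freq' => $freqs, 'total' => $total];
    file_put_contents($cachePath, json_encode($data), LOCK_EX);
    return $data;
}
```

Calling loadFreqWithCache('dict.txt') twice parses the file only on the first call, and the second call reads dict.txt.cache. File modification times usually have one-second resolution, so a dictionary edited within the same second the cache was written would not be detected; deleting the .cache file still works as a manual reset.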

fukuball (Owner) commented

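@yukon12345 Thanks for the suggestion. I'd still appreciate it if you could send a pull request; otherwise I'll come back and add the cache feature later, once work is less busy.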
