Separate tokenizer from hasher #162

Merged · 43 commits · Mar 5, 2018

Commits
96d2f1a
Separate whitespace tokenizer from hasher
piroor Jun 30, 2017
25b112f
Separate stopword filter from hasher
piroor Jun 30, 2017
010605c
Run tests in deep directories
piroor Jun 30, 2017
b66ac69
Separate stemmer from hasher
piroor Jun 30, 2017
3188c52
Separate tests for stopword and tokenizer from hasher's one
piroor Jun 30, 2017
4f60a6b
Reintroduce method to get hash from clean words
piroor Jun 30, 2017
6b433ee
Fix usage of Stopword filter
piroor Jun 30, 2017
10d3e3a
Add tests for Tokenizer::Token
piroor Jun 30, 2017
19d83d9
Add test for TokenFilter::Stemmer
piroor Jun 30, 2017
91523e8
Remove needless conversion
piroor Jun 30, 2017
d84caa9
Unite stemmer and stopword filter to whitespace tokenizer
piroor Jun 30, 2017
9df9bfa
Fix indent
piroor Jun 30, 2017
164620c
Insert seaparator blank lines between meaningful blocks
piroor Jun 30, 2017
b30c9c5
Revert "Insert seaparator blank lines between meaningful blocks"
piroor Jun 30, 2017
c6c88a5
Revert "Fix indent"
piroor Jun 30, 2017
56fe374
Revert "Unite stemmer and stopword filter to whitespace tokenizer"
piroor Jun 30, 2017
a27c4f3
Fix indent
piroor Jun 30, 2017
d7b2519
Use meaningful variable name
piroor Jun 30, 2017
67caa82
Describe new modules and classes
piroor Dec 18, 2017
ce7bca0
Give tokenizer and token filters from outside of hasher
piroor Jan 16, 2018
d0bdd5b
Uniform coding style
piroor Jan 16, 2018
0647ba7
Apply enable_stemmer option correctly
piroor Jan 16, 2018
d83fc80
Fix invalid URI
piroor Jan 16, 2018
65dc97d
Don't give needless parameters
piroor Jan 16, 2018
a5d8f4d
Load required modules
piroor Jan 16, 2018
1a105ef
Define default token filters for hasher
piroor Jan 16, 2018
52e61f2
Fix path to modules
piroor Jan 16, 2018
148a150
Add description how to use custom tokenizer
piroor Jan 16, 2018
8fe32d3
Define token filter to remove symbol only tokens
piroor Jan 16, 2018
2477ef1
Fix path to required module
piroor Jan 16, 2018
6880ba5
Remove needless parameter
piroor Jan 16, 2018
a9b9639
Use langauge option only for stopwords filter
piroor Jan 16, 2018
35c304e
Add test for TokenFilter::Symbol
piroor Jan 16, 2018
829a176
Remove needless "s"
piroor Jan 16, 2018
751b15b
Add how to use custom token filters
piroor Jan 16, 2018
932a0a1
Reject cat token based on regexp
piroor Jan 16, 2018
14af4d0
Add tests to custom tokenizer and token filters
piroor Jan 16, 2018
3c59f44
Fix usage of custom tokenizer
piroor Jan 16, 2018
d856224
Add note for custom tokenizer
piroor Jan 16, 2018
b82e68d
Describe spec of custom tokenizer at first
piroor Jan 16, 2018
958d3a0
Accept lambda as custom token filter and tokenizer
piroor Mar 5, 2018
7bceef7
Fix mismatched descriptions about method
piroor Mar 5, 2018
81824f5
Add more tests for custom tokenizer and filters
piroor Mar 5, 2018
4 changes: 2 additions & 2 deletions Rakefile
@@ -21,15 +21,15 @@ task default: [:test]
desc 'Run all unit tests'
Rake::TestTask.new(:test) do |t|
t.libs << 'lib'
t.pattern = 'test/*/*_test.rb'
t.pattern = 'test/**/*_test.rb'
t.verbose = true
end

# Run benchmarks
desc 'Run all benchmarks'
Rake::TestTask.new(:bench) do |t|
t.libs << 'lib'
t.pattern = 'test/*/*_benchmark.rb'
t.pattern = 'test/**/*_benchmark.rb'
t.verbose = true
end

2 changes: 1 addition & 1 deletion classifier-reborn.gemspec
@@ -15,7 +15,7 @@ Gem::Specification.new do |s|
s.summary = 'A general classifier module to allow Bayesian and other types of classifications.'
s.authors = ['Lucas Carlson', 'Parker Moore', 'Chase Gilliam']
s.email = ['[email protected]', '[email protected]', '[email protected]']
s.homepage = 'www.classifier-reborn.com'
s.homepage = 'http://www.classifier-reborn.com'

all_files = `git ls-files -z`.split("\x0")
s.files = all_files.grep(%r{^(bin|lib|data)/})
71 changes: 71 additions & 0 deletions docs/bayes.md
@@ -135,6 +135,77 @@ classifier.train("Cat", "I can has cat")
classifier.train("Dog", "I don't always bark at night")
```

## Custom Tokenizer

By default, the classifier tokenizes the given input as white-space separated terms.
If you want to use a different tokenizer, give it via the `:tokenizer` option.
The tokenizer must be an object with a method named `call`, or a lambda.
In either case, it must return the tokens as instances of `ClassifierReborn::Tokenizer::Token`.

```ruby
require 'classifier-reborn'

module BigramTokenizer
module_function
def call(str)
str.each_char
.each_cons(2)
.map { |chars| ClassifierReborn::Tokenizer::Token.new(chars.join) }
end
end

classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer
```

or

```ruby
require 'classifier-reborn'

bigram_tokenizer = lambda do |str|
str.each_char
.each_cons(2)
.map { |chars| ClassifierReborn::Tokenizer::Token.new(chars.join) }
end

classifier = ClassifierReborn::Bayes.new tokenizer: bigram_tokenizer
```
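
A note on this example: the default `Stopword` filter (added later in this diff) rejects any token of two characters or fewer, so plain bigram tokens would all be dropped before hashing. A minimal sketch of one way around that, marking the tokens with `maybe_stopword: false`; the category names and training texts here are made up for illustration:

```ruby
require 'classifier-reborn'

# Bigram tokens are only two characters long, so they must opt out of the
# stopword length check via maybe_stopword: false to survive filtering.
bigram_tokenizer = lambda do |str|
  str.each_char
     .each_cons(2)
     .map { |chars| ClassifierReborn::Tokenizer::Token.new(chars.join, maybe_stopword: false) }
end

classifier = ClassifierReborn::Bayes.new 'Cat', 'Dog', tokenizer: bigram_tokenizer
classifier.train('Cat', 'I can has cat')
classifier.train('Dog', 'I do not bark at night')
classifier.classify('what a cat') # => likely "Cat"
```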

## Custom Token Filters

By default, the classifier rejects stopwords from the tokens.
This behavior is implemented as a filter applied to the tokens.
If you want to use more token filters, give them via the `:token_filters` option.
Each filter must be an object with a method named `call`, or a lambda.

```ruby
require 'classifier-reborn'

module CatFilter
module_function
def call(tokens)
tokens.reject do |token|
/cat/i === token
end
end
end

white_filter = lambda do |tokens|
tokens.reject do |token|
/white/i === token
end
end

filters = [
CatFilter,
white_filter,

Review comment (Contributor):
@piroor, are we missing a comma at the end here? Also, the last element of the array has an unnecessary comma.

# If you want to reject stopwords too, you need to include the stopword
# filter in the list of token filters manually.
ClassifierReborn::TokenFilter::Stopword
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```
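
This PR also adds `ClassifierReborn::TokenFilter::Symbol`, which rejects symbol-only tokens. A minimal sketch of enabling it; note that passing `:token_filters` replaces the default list, so `Stopword` must be listed explicitly if you still want it:

```ruby
require 'classifier-reborn'

filters = [
  ClassifierReborn::TokenFilter::Stopword,
  ClassifierReborn::TokenFilter::Symbol # drops tokens such as "," or "!"
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```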

## Custom Stopwords

The library ships with stopword files in various languages.
22 changes: 18 additions & 4 deletions lib/classifier-reborn/bayes.rb
@@ -4,6 +4,9 @@

require 'set'

require_relative 'extensions/tokenizer/whitespace'
require_relative 'extensions/token_filter/stopword'
require_relative 'extensions/token_filter/stemmer'
require_relative 'category_namer'
require_relative 'backends/bayes_memory_backend'
require_relative 'backends/bayes_redis_backend'
@@ -50,6 +53,14 @@ def initialize(*args)
@threshold = options[:threshold]
@enable_stemmer = options[:enable_stemmer]
@backend = options[:backend]
@tokenizer = options[:tokenizer] || Tokenizer::Whitespace
@token_filters = options[:token_filters] || [TokenFilter::Stopword]
if @enable_stemmer && !@token_filters.include?(TokenFilter::Stemmer)
@token_filters << TokenFilter::Stemmer
end
if @token_filters.include?(TokenFilter::Stopword)
TokenFilter::Stopword.language = @language
end

populate_initial_categories

@@ -65,7 +76,8 @@ def initialize(*args)
# b.train "that", "That text"
# b.train "The other", "The other text"
def train(category, text)
word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
word_hash = Hasher.word_hash(text, @enable_stemmer,
tokenizer: @tokenizer, token_filters: @token_filters)
return if word_hash.empty?
category = CategoryNamer.prepare_name(category)

@@ -95,7 +107,8 @@ def train(category, text)
# b.train :this, "This text"
# b.untrain :this, "This text"
def untrain(category, text)
word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
word_hash = Hasher.word_hash(text, @enable_stemmer,
tokenizer: @tokenizer, token_filters: @token_filters)
return if word_hash.empty?
category = CategoryNamer.prepare_name(category)
word_hash.each do |word, count|
@@ -120,7 +133,8 @@ def untrain(category, text)
# The largest of these scores (the one closest to 0) is the one picked out by #classify
def classifications(text)
score = {}
word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
word_hash = Hasher.word_hash(text, @enable_stemmer,
tokenizer: @tokenizer, token_filters: @token_filters)
if word_hash.empty?
category_keys.each do |category|
score[category.to_s] = Float::INFINITY
@@ -266,7 +280,7 @@ def custom_stopwords(stopwords)
return # Do not overwrite the default
end
end
Hasher::STOPWORDS[@language] = Set.new stopwords
TokenFilter::Stopword::STOPWORDS[@language] = Set.new stopwords
end
end
end
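
A sketch of how the options above interact, assuming the defaults in this diff: `:enable_stemmer` appends the `Stemmer` filter when it is not already listed, and `:language` is applied only to the `Stopword` filter:

```ruby
require 'classifier-reborn'

# Effective filter chain: [TokenFilter::Stopword, TokenFilter::Stemmer]
classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting',
                                         language: 'en',
                                         enable_stemmer: true

# Effective filter chain: [TokenFilter::Stopword] only
plain = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting',
                                    enable_stemmer: false
```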
62 changes: 18 additions & 44 deletions lib/classifier-reborn/extensions/hasher.rb
@@ -5,63 +5,37 @@

require 'set'

require_relative 'tokenizer/whitespace'
require_relative 'token_filter/stopword'
require_relative 'token_filter/stemmer'

module ClassifierReborn
module Hasher
STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]

module_function

# Return a Hash of strings => ints. Each word in the string is stemmed,
# interned, and indexes to its frequency in the document.
def word_hash(str, language = 'en', enable_stemmer = true)
cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
cleaned_word_hash.merge(symbol_hash)
end

# Return a word hash without extra punctuation or short symbols, just stemmed words
def clean_word_hash(str, language = 'en', enable_stemmer = true)
word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer)
end

def word_hash_for_words(words, language = 'en', enable_stemmer = true)
d = Hash.new(0)
words.each do |word|
next unless word.length > 2 && !STOPWORDS[language].include?(word)
if enable_stemmer
d[word.stem.intern] += 1
else
d[word.intern] += 1
def word_hash(str, enable_stemmer = true,
tokenizer: Tokenizer::Whitespace,
token_filters: [TokenFilter::Stopword])
if token_filters.include?(TokenFilter::Stemmer)
unless enable_stemmer
token_filters.reject! do |token_filter|
token_filter == TokenFilter::Stemmer
end
end
else
token_filters << TokenFilter::Stemmer if enable_stemmer
end
words = tokenizer.call(str)
token_filters.each do |token_filter|
words = token_filter.call(words)
end
d
end

# Add custom path to a new stopword file created by user
def add_custom_stopword_path(path)
STOPWORDS_PATH.unshift(path)
end

def word_hash_for_symbols(words)
d = Hash.new(0)
words.each do |word|
d[word.intern] += 1
end
d
end

# Create a lazily-loaded hash of stopword data
STOPWORDS = Hash.new do |hash, language|
hash[language] = []

STOPWORDS_PATH.each do |path|
if File.exist?(File.join(path, language))
hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
break
end
end

hash[language]
end
end
end
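
With this refactoring, `Hasher.word_hash` can be driven directly with an explicit tokenizer and filter chain. A sketch; the exact output depends on the stopword list, and the default `Whitespace` tokenizer also emits symbol-only tokens:

```ruby
require 'classifier-reborn'

include ClassifierReborn

word_hash = Hasher.word_hash('Hello, world!', false,
                             tokenizer: Tokenizer::Whitespace,
                             token_filters: [TokenFilter::Stopword])
# => something like { hello: 1, world: 1, ",": 1, "!": 1 }
```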
23 changes: 23 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stemmer.rb
@@ -0,0 +1,23 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:[email protected])
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
module TokenFilter
# This filter converts given tokens to their stemmed versions.
module Stemmer
module_function

def call(tokens)
tokens.collect do |token|
if token.stemmable?
token.stem
else
token
end
end
end
end
end
end
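
A sketch of the stemmer filter in isolation; `String#stem` comes from the `fast-stemmer` gem this library already depends on:

```ruby
require 'classifier-reborn'

tokens = [
  ClassifierReborn::Tokenizer::Token.new('running'),            # stemmable by default
  ClassifierReborn::Tokenizer::Token.new('!', stemmable: false) # passes through unchanged
]
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => ["run", "!"] (each element is still a Token)
```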
47 changes: 47 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stopword.rb
@@ -0,0 +1,47 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:[email protected])
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
module TokenFilter
# This filter removes stopwords for the configured language from the given tokens.
module Stopword
STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
@language = 'en'

module_function

def call(tokens)
tokens.reject do |token|
token.maybe_stopword? &&
(token.length <= 2 || STOPWORDS[@language].include?(token))
end
end

# Add custom path to a new stopword file created by user
def add_custom_stopword_path(path)
STOPWORDS_PATH.unshift(path)
end

# Create a lazily-loaded hash of stopword data
STOPWORDS = Hash.new do |hash, language|
hash[language] = []

STOPWORDS_PATH.each do |path|
if File.exist?(File.join(path, language))
hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
break
end
end

hash[language]
end

# Changes the language of stopwords
def language=(language)
@language = language
end
end
end
end
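
A sketch of configuring the relocated stopword handling; `/path/to/my/stopwords` is a placeholder for a directory containing one stopword file per language code:

```ruby
require 'classifier-reborn'

# Search a custom directory before the bundled data/stopwords files.
ClassifierReborn::TokenFilter::Stopword.add_custom_stopword_path('/path/to/my/stopwords')

# Switch the filter to another language, e.g. a file named 'de' in that path.
ClassifierReborn::TokenFilter::Stopword.language = 'de'
```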
19 changes: 19 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/symbol.rb
@@ -0,0 +1,19 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:[email protected])
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
module TokenFilter
# This filter removes symbol-only terms from the given tokens.
module Symbol
module_function

def call(tokens)
tokens.reject do |token|
/[^\s\p{WORD}]/ === token
end
end
end
end
end
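
A sketch of the symbol filter in isolation; the regexp rejects any token containing a character that is neither whitespace nor a word character:

```ruby
require 'classifier-reborn'

tokens = ['good', 'cat', ',', '!!'].map do |term|
  ClassifierReborn::Tokenizer::Token.new(term)
end
ClassifierReborn::TokenFilter::Symbol.call(tokens)
# => ["good", "cat"]
```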
35 changes: 35 additions & 0 deletions lib/classifier-reborn/extensions/tokenizer/token.rb
@@ -0,0 +1,35 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:[email protected])
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
module Tokenizer
class Token < String
# The class can be created with one token string and extra attributes. E.g.,
# t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
#
# Available attributes are:
# stemmable: true - whether the token can be stemmed. This must be false for un-stemmable terms; otherwise it should be true.
# maybe_stopword: true - whether the token may be a stopword. This must be false for terms which can never be stopwords; otherwise it should be true.
def initialize(string, stemmable: true, maybe_stopword: true)
super(string)
@stemmable = stemmable
@maybe_stopword = maybe_stopword
end

def stemmable?
@stemmable
end

def maybe_stopword?
@maybe_stopword
end

def stem
stemmed = super
self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
end
end
end
end
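
A usage sketch of the new `Token` class; `#stem` delegates to `String#stem` from the `fast-stemmer` gem and carries the attributes over to the stemmed token:

```ruby
require 'classifier-reborn'

token = ClassifierReborn::Tokenizer::Token.new('running',
                                               stemmable: true,
                                               maybe_stopword: false)
token.stem             # => "run" (a new Token with the same attributes)
token.stemmable?       # => true
token.maybe_stopword?  # => false
```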