Separate tokenizer from hasher #162
Merged
43 commits
96d2f1a
Separate whitespace tokenizer from hasher
piroor 25b112f
Separate stopword filter from hasher
piroor 010605c
Run tests in deep directories
piroor b66ac69
Separate stemmer from hasher
piroor 3188c52
Separate tests for stopword and tokenizer from hasher's one
piroor 4f60a6b
Reintroduce method to get hash from clean words
piroor 6b433ee
Fix usage of Stopword filter
piroor 10d3e3a
Add tests for Tokenizer::Token
piroor 19d83d9
Add test for TokenFilter::Stemmer
piroor 91523e8
Remove needless conversion
piroor d84caa9
Unite stemmer and stopword filter to whitespace tokenizer
piroor 9df9bfa
Fix indent
piroor 164620c
Insert seaparator blank lines between meaningful blocks
piroor b30c9c5
Revert "Insert seaparator blank lines between meaningful blocks"
piroor c6c88a5
Revert "Fix indent"
piroor 56fe374
Revert "Unite stemmer and stopword filter to whitespace tokenizer"
piroor a27c4f3
Fix indent
piroor d7b2519
Use meaningful variable name
piroor 67caa82
Describe new modules and classes
piroor ce7bca0
Give tokenizer and token filters from outside of hasher
piroor d0bdd5b
Uniform coding style
piroor 0647ba7
Apply enable_stemmer option correctly
piroor d83fc80
Fix invalid URI
piroor 65dc97d
Don't give needless parameters
piroor a5d8f4d
Load required modules
piroor 1a105ef
Define default token filters for hasher
piroor 52e61f2
Fix path to modules
piroor 148a150
Add description how to use custom tokenizer
piroor 8fe32d3
Define token filter to remove symbol only tokens
piroor 2477ef1
Fix path to required module
piroor 6880ba5
Remove needless parameter
piroor a9b9639
Use langauge option only for stopwords filter
piroor 35c304e
Add test for TokenFilter::Symbol
piroor 829a176
Remove needless "s"
piroor 751b15b
Add how to use custom token filters
piroor 932a0a1
Reject cat token based on regexp
piroor 14af4d0
Add tests to custom tokenizer and token filters
piroor 3c59f44
Fix usage of custom tokenizer
piroor d856224
Add note for custom tokenizer
piroor b82e68d
Describe spec of custom tokenizer at first
piroor 958d3a0
Accept lambda as custom token filter and tokenizer
piroor 7bceef7
Fix mismatched descriptions about method
piroor 81824f5
Add more tests for custom tokenizer and filters
piroor
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,7 +15,7 @@ Gem::Specification.new do |s| | |
s.summary = 'A general classifier module to allow Bayesian and other types of classifications.' | ||
s.authors = ['Lucas Carlson', 'Parker Moore', 'Chase Gilliam'] | ||
s.email = ['[email protected]', '[email protected]', '[email protected]'] | ||
- s.homepage = 'www.classifier-reborn.com' | ||
+ s.homepage = 'http://www.classifier-reborn.com' | ||
|
||
all_files = `git ls-files -z`.split("\x0") | ||
s.files = all_files.grep(%r{^(bin|lib|data)/}) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# encoding: utf-8 | ||
# Author:: Lucas Carlson (mailto:[email protected]) | ||
# Copyright:: Copyright (c) 2005 Lucas Carlson | ||
# License:: LGPL | ||
|
||
module ClassifierReborn | ||
module TokenFilter | ||
# This filter converts given tokens to their stemmed versions. | ||
module Stemmer | ||
module_function | ||
|
||
def call(tokens) | ||
tokens.collect do |token| | ||
if token.stemmable? | ||
token.stem | ||
else | ||
token | ||
end | ||
end | ||
end | ||
end | ||
end | ||
end |
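The stemmer filter above only needs tokens that respond to `stemmable?` and `stem`, and it exposes a single `call(tokens)` entry point. The following is a minimal sketch of that interface, not the library's code: `DemoToken`, `DemoStemmer`, the toy stemming rule, and the `drop_short` lambda are all hypothetical, and the way the hasher chains filters is not visible in the diff shown here.

```ruby
# Hypothetical stand-in for ClassifierReborn::Tokenizer::Token. A toy rule
# (strip one trailing "s") replaces real stemming so the sketch runs on its own.
class DemoToken < String
  def initialize(string, stemmable: true)
    super(string)
    @stemmable = stemmable
  end

  def stemmable?
    @stemmable
  end

  def stem
    self.class.new(sub(/s\z/, ''), stemmable: @stemmable)
  end
end

# Same shape as the Stemmer filter in the diff: stem what can be stemmed,
# pass everything else through unchanged.
module DemoStemmer
  module_function

  def call(tokens)
    tokens.collect { |token| token.stemmable? ? token.stem : token }
  end
end

# A lambda with the same call(tokens) interface (illustrative only).
drop_short = ->(tokens) { tokens.reject { |token| token.length < 3 } }

tokens = [
  DemoToken.new('cats'),
  DemoToken.new('Paris', stemmable: false), # must not lose its final "s"
  DemoToken.new('as')
]

# Apply each filter in order; modules and lambdas are interchangeable here.
result = [DemoStemmer, drop_short].reduce(tokens) { |acc, filter| filter.call(acc) }
puts result.inspect
```

Because both the module and the lambda respond to `call`, either can sit in the same filter list, which is presumably what the "Accept lambda as custom token filter and tokenizer" commit relies on.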
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# encoding: utf-8 | ||
# Author:: Lucas Carlson (mailto:[email protected]) | ||
# Copyright:: Copyright (c) 2005 Lucas Carlson | ||
# License:: LGPL | ||
|
||
module ClassifierReborn | ||
module TokenFilter | ||
# This filter removes stopwords in the language, from given tokens. | ||
module Stopword | ||
STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')] | ||
@language = 'en' | ||
|
||
module_function | ||
|
||
def call(tokens) | ||
tokens.reject do |token| | ||
token.maybe_stopword? && | ||
(token.length <= 2 || STOPWORDS[@language].include?(token)) | ||
end | ||
end | ||
|
||
# Add custom path to a new stopword file created by user | ||
def add_custom_stopword_path(path) | ||
STOPWORDS_PATH.unshift(path) | ||
end | ||
|
||
# Create a lazily-loaded hash of stopword data | ||
STOPWORDS = Hash.new do |hash, language| | ||
hash[language] = [] | ||
|
||
STOPWORDS_PATH.each do |path| | ||
if File.exist?(File.join(path, language.to_s)) | ||
hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split | ||
break | ||
end | ||
end | ||
|
||
hash[language] | ||
end | ||
|
||
# Changes the language of stopwords | ||
def language=(language) | ||
@language = language | ||
end | ||
end | ||
end | ||
end |
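The `STOPWORDS` hash above is populated lazily: the block passed to `Hash.new` runs only the first time a language is looked up, and the first directory in the search path that contains a matching file wins. The sketch below reproduces that behaviour in isolation under stated assumptions: `build_stopword_cache` is a hypothetical helper, the word lists are made up, and an empty `Set` is used as the fallback where the diff uses an empty array.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper mirroring the lazily-loaded STOPWORDS hash in the diff.
def build_stopword_cache(search_paths)
  Hash.new do |hash, language|
    hash[language] = Set.new # fallback when no file is found (assumption: Set, not Array)

    search_paths.each do |path|
      file = File.join(path, language.to_s)
      next unless File.exist?(file)

      hash[language] = Set.new(File.read(file, encoding: 'utf-8').split)
      break # first matching directory wins
    end

    hash[language]
  end
end

en_words = nil
missing = nil

Dir.mktmpdir do |default_dir|
  Dir.mktmpdir do |custom_dir|
    File.write(File.join(default_dir, 'en'), "the\nand\n")
    File.write(File.join(custom_dir, 'en'), "foo\nbar\n")

    paths = [default_dir]
    paths.unshift(custom_dir) # same effect as add_custom_stopword_path

    stopwords = build_stopword_cache(paths)
    en_words = stopwords['en'] # loaded from custom_dir, which shadows default_dir
    missing = stopwords['xx']  # no file anywhere, so the empty fallback is cached
  end
end

puts en_words.to_a.inspect
puts missing.empty?
```

Note that a custom path added with `unshift` replaces the default list for that language rather than merging with it, since the loop stops at the first file it finds.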
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# encoding: utf-8 | ||
# Author:: Lucas Carlson (mailto:[email protected]) | ||
# Copyright:: Copyright (c) 2005 Lucas Carlson | ||
# License:: LGPL | ||
|
||
module ClassifierReborn | ||
module TokenFilter | ||
# This filter removes symbol-only terms, from given tokens. | ||
module Symbol | ||
module_function | ||
|
||
def call(tokens) | ||
tokens.reject do |token| | ||
/[^\s\p{WORD}]/ === token | ||
end | ||
end | ||
end | ||
end | ||
end |
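To see what the pattern in the symbol filter actually matches, here is a small standalone sketch using the same regular expression on made-up tokens. It rejects every token containing at least one character that is neither whitespace nor a Unicode word character, so it only amounts to removing "symbol-only" terms if the tokenizer already emits symbols as separate tokens, which the part of the diff shown here does not confirm.

```ruby
# The pattern from the diff: any character that is neither whitespace nor a
# Unicode word character (letters, digits, underscore, and similar).
SYMBOL_PATTERN = /[^\s\p{WORD}]/

# Made-up sample tokens for illustration.
tokens = ['hello', '!', 'snake_case', 'foo-bar', '日本語', '42']

kept = tokens.reject { |token| SYMBOL_PATTERN === token }
puts kept.inspect
```

In this sample, `'!'` and `'foo-bar'` are dropped, while plain words, digits, underscored identifiers, and non-Latin words are kept.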
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# encoding: utf-8 | ||
# Author:: Lucas Carlson (mailto:[email protected]) | ||
# Copyright:: Copyright (c) 2005 Lucas Carlson | ||
# License:: LGPL | ||
|
||
module ClassifierReborn | ||
module Tokenizer | ||
class Token < String | ||
# The class can be created with one token string and extra attributes. E.g., | ||
# t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false | ||
# | ||
# Attributes available are: | ||
# stemmable: true Whether the token can be stemmed. This must be false for un-stemmable terms; otherwise it should be true. | ||
# maybe_stopword: true Whether the token may be a stopword. This must be false for terms that can never be stopwords; otherwise it should be true. | ||
def initialize(string, stemmable: true, maybe_stopword: true) | ||
super(string) | ||
@stemmable = stemmable | ||
@maybe_stopword = maybe_stopword | ||
end | ||
|
||
def stemmable? | ||
@stemmable | ||
end | ||
|
||
def maybe_stopword? | ||
@maybe_stopword | ||
end | ||
|
||
def stem | ||
stemmed = super | ||
self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword) | ||
end | ||
end | ||
end | ||
end |
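`Token#stem` above calls `super`, so it depends on `String` already responding to `stem`; the diff does not show where that method comes from. The sketch below illustrates the pattern of re-wrapping the result so the extra attributes survive stemming. `ToyStem` and `SketchToken` are hypothetical names, and the toy suffix rule merely stands in for a real stemmer.

```ruby
# Hypothetical stand-in for the String#stem that the real Token#stem reaches
# through super. It strips a trailing "ing" or "s" and nothing more.
module ToyStem
  def stem
    sub(/(ing|s)\z/, '')
  end
end
String.include(ToyStem)

# Same structure as the Token class in the diff, under a different name.
class SketchToken < String
  def initialize(string, stemmable: true, maybe_stopword: true)
    super(string)
    @stemmable = stemmable
    @maybe_stopword = maybe_stopword
  end

  def stemmable?
    @stemmable
  end

  def maybe_stopword?
    @maybe_stopword
  end

  # super returns a bare string without the flags, so wrap it again.
  def stem
    stemmed = super
    self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
  end
end

token = SketchToken.new('Tokens', stemmable: true, maybe_stopword: false)
stemmed = token.stem

puts stemmed                 # the stemmed text
puts stemmed.class           # still a SketchToken
puts stemmed.maybe_stopword? # flag carried over from the original token
```

Without the re-wrapping step, later filters that call `maybe_stopword?` or `stemmable?` on a stemmed token would have nothing to call, so the order in which filters run would start to matter.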
@piroor, are we missing a comma at the end here? Also, the last element of the array has an unnecessary comma.