Commit

update documentation
MaksimEkin committed Apr 22, 2024
1 parent 7f3643c commit 8610807
Showing 7 changed files with 13 additions and 6 deletions.
2 changes: 1 addition & 1 deletion docs/TELF.pre_processing.Vulture.html
@@ -410,7 +410,7 @@ <h2>Submodules<a class="headerlink" href="#submodules" title="Link to this headi

<dl class="py attribute">
<dt class="sig sig-object py" id="TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE">
<span class="sig-name descname"><span class="pre">DEFAULT_PIPELINE</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">[SimpleCleaner(module_type='CLEANER',</span> <span class="pre">effective_stop_words=['characteristics',</span> <span class="pre">'acknowledgment',</span> <span class="pre">'characteristic',</span> <span class="pre">'substantially',</span> <span class="pre">'significantly',</span> <span class="pre">'unfortunately',</span> <span class="pre">'predominantly',</span> <span class="pre">'automatically',</span> <span class="pre">'approximately',</span> <span class="pre">'corresponding',</span> <span class="pre">'investigation',</span> <span class="pre">'successfully',</span> <span class="pre">'representing',</span> <span class="pre">'demonstrated',</span> <span class="pre">'respectively',</span> <span class="pre">'sufficiently',</span> <span class="pre">'applications',</span> <span class="pre">'specifically',</span> <span class="pre">'introduction',</span> <span class="pre">'particularly',</span> <span class="pre">'consequently',</span> <span class="pre">'demonstrates',</span> <span class="pre">'nevertheless',</span> <span class="pre">'application',</span> <span class="pre">'investigate',</span> <span class="pre">...</span> <span class="pre">(+1359</span> <span class="pre">more)],</span> <span class="pre">patterns={'standardize_hyphens':</span> <span class="pre">(re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'),</span> <span class="pre">'-'),</span> <span class="pre">'remove_copyright_statement':</span> <span class="pre">None,</span> <span class="pre">'remove_stop_phrases':</span> <span class="pre">None,</span> <span class="pre">'make_lower_case':</span> <span class="pre">None,</span> <span class="pre">'normalize':</span> <span class="pre">None,</span> <span class="pre">'remove_trailing_dash':</span> <span 
class="pre">('(?&lt;!\\w)-|-(?!\\w)',</span> <span class="pre">''),</span> <span class="pre">'make_hyphens_words':</span> <span class="pre">('([a-z])\\-([a-z])',</span> <span class="pre">''),</span> <span class="pre">'remove_next_line':</span> <span class="pre">('\\n+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_email':</span> <span class="pre">('\\S*&#64;\\S*\\s?',</span> <span class="pre">''),</span> <span class="pre">'remove_formulas':</span> <span class="pre">('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_dash':</span> <span class="pre">('-',</span> <span class="pre">''),</span> <span class="pre">'remove_between_[]':</span> <span class="pre">('\\[.*?\\]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_between_()':</span> <span class="pre">('\\(.*?\\)',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_[]':</span> <span class="pre">('[\\[\\]]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_()':</span> <span class="pre">('[()]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_\\':</span> <span class="pre">('\\\\',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_numbers':</span> <span class="pre">('\\d+',</span> <span class="pre">''),</span> <span class="pre">'remove_standalone_numbers':</span> <span class="pre">('\\b\\d+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII_boundary':</span> <span class="pre">('\\b[^\\x00-\\x7F]+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII':</span> <span class="pre">('[^\\x00-\\x7F]+',</span> <span class="pre">''),</span> <span class="pre">'remove_tags':</span> <span class="pre">('&amp;lt;/?.*?&amp;gt;',</span> <span class="pre">''),</span> <span 
class="pre">'remove_special_characters':</span> <span class="pre">('[!|&quot;|#|$|%|&amp;|\\|\\\'|(|)|*|+|,|.|/|:|;|&lt;|=|&gt;|?|&#64;|[|\\|]|^|_|`|{|\\||}|~]',</span> <span class="pre">''),</span> <span class="pre">'isolate_frozen':</span> <span class="pre">None,</span> <span class="pre">'remove_extra_whitespace':</span> <span class="pre">('\\s+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_stop_words':</span> <span class="pre">None,</span> <span class="pre">'min_characters':</span> <span class="pre">None},</span> <span class="pre">exclude_hyphenated_stopwords=False,</span> <span class="pre">sw_pattern=re.compile('\\b[\\w-]+\\b'))]</span></em><a class="headerlink" href="#TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE" title="Link to this definition">#</a></dt>
<span class="sig-name descname"><span class="pre">DEFAULT_PIPELINE</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">[SimpleCleaner(module_type='CLEANER',</span> <span class="pre">effective_stop_words=['characteristics',</span> <span class="pre">'characteristic',</span> <span class="pre">'acknowledgment',</span> <span class="pre">'significantly',</span> <span class="pre">'automatically',</span> <span class="pre">'predominantly',</span> <span class="pre">'investigation',</span> <span class="pre">'approximately',</span> <span class="pre">'unfortunately',</span> <span class="pre">'corresponding',</span> <span class="pre">'substantially',</span> <span class="pre">'nevertheless',</span> <span class="pre">'demonstrates',</span> <span class="pre">'specifically',</span> <span class="pre">'applications',</span> <span class="pre">'introduction',</span> <span class="pre">'sufficiently',</span> <span class="pre">'demonstrated',</span> <span class="pre">'particularly',</span> <span class="pre">'consequently',</span> <span class="pre">'representing',</span> <span class="pre">'respectively',</span> <span class="pre">'successfully',</span> <span class="pre">'background:',</span> <span class="pre">'application',</span> <span class="pre">...</span> <span class="pre">(+1359</span> <span class="pre">more)],</span> <span class="pre">patterns={'standardize_hyphens':</span> <span class="pre">(re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'),</span> <span class="pre">'-'),</span> <span class="pre">'remove_copyright_statement':</span> <span class="pre">None,</span> <span class="pre">'remove_stop_phrases':</span> <span class="pre">None,</span> <span class="pre">'make_lower_case':</span> <span class="pre">None,</span> <span class="pre">'normalize':</span> <span class="pre">None,</span> <span class="pre">'remove_trailing_dash':</span> <span 
class="pre">('(?&lt;!\\w)-|-(?!\\w)',</span> <span class="pre">''),</span> <span class="pre">'make_hyphens_words':</span> <span class="pre">('([a-z])\\-([a-z])',</span> <span class="pre">''),</span> <span class="pre">'remove_next_line':</span> <span class="pre">('\\n+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_email':</span> <span class="pre">('\\S*&#64;\\S*\\s?',</span> <span class="pre">''),</span> <span class="pre">'remove_formulas':</span> <span class="pre">('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_dash':</span> <span class="pre">('-',</span> <span class="pre">''),</span> <span class="pre">'remove_between_[]':</span> <span class="pre">('\\[.*?\\]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_between_()':</span> <span class="pre">('\\(.*?\\)',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_[]':</span> <span class="pre">('[\\[\\]]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_()':</span> <span class="pre">('[()]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_\\':</span> <span class="pre">('\\\\',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_numbers':</span> <span class="pre">('\\d+',</span> <span class="pre">''),</span> <span class="pre">'remove_standalone_numbers':</span> <span class="pre">('\\b\\d+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII_boundary':</span> <span class="pre">('\\b[^\\x00-\\x7F]+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII':</span> <span class="pre">('[^\\x00-\\x7F]+',</span> <span class="pre">''),</span> <span class="pre">'remove_tags':</span> <span class="pre">('&amp;lt;/?.*?&amp;gt;',</span> <span class="pre">''),</span> <span 
class="pre">'remove_special_characters':</span> <span class="pre">('[!|&quot;|#|$|%|&amp;|\\|\\\'|(|)|*|+|,|.|/|:|;|&lt;|=|&gt;|?|&#64;|[|\\|]|^|_|`|{|\\||}|~]',</span> <span class="pre">''),</span> <span class="pre">'isolate_frozen':</span> <span class="pre">None,</span> <span class="pre">'remove_extra_whitespace':</span> <span class="pre">('\\s+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_stop_words':</span> <span class="pre">None,</span> <span class="pre">'min_characters':</span> <span class="pre">None},</span> <span class="pre">exclude_hyphenated_stopwords=False,</span> <span class="pre">sw_pattern=re.compile('\\b[\\w-]+\\b'))]</span></em><a class="headerlink" href="#TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE" title="Link to this definition">#</a></dt>
<dd></dd></dl>

<dl class="py attribute">
2 changes: 1 addition & 1 deletion docs/Vulture.html
@@ -482,7 +482,7 @@ <h1>Available Functions<a class="headerlink" href="#available-functions" title="

<dl class="py attribute">
<dt class="sig sig-object py" id="TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE">
<span class="sig-name descname"><span class="pre">DEFAULT_PIPELINE</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">[SimpleCleaner(module_type='CLEANER',</span> <span class="pre">effective_stop_words=['characteristics',</span> <span class="pre">'acknowledgment',</span> <span class="pre">'characteristic',</span> <span class="pre">'substantially',</span> <span class="pre">'significantly',</span> <span class="pre">'unfortunately',</span> <span class="pre">'predominantly',</span> <span class="pre">'automatically',</span> <span class="pre">'approximately',</span> <span class="pre">'corresponding',</span> <span class="pre">'investigation',</span> <span class="pre">'successfully',</span> <span class="pre">'representing',</span> <span class="pre">'demonstrated',</span> <span class="pre">'respectively',</span> <span class="pre">'sufficiently',</span> <span class="pre">'applications',</span> <span class="pre">'specifically',</span> <span class="pre">'introduction',</span> <span class="pre">'particularly',</span> <span class="pre">'consequently',</span> <span class="pre">'demonstrates',</span> <span class="pre">'nevertheless',</span> <span class="pre">'application',</span> <span class="pre">'investigate',</span> <span class="pre">...</span> <span class="pre">(+1359</span> <span class="pre">more)],</span> <span class="pre">patterns={'standardize_hyphens':</span> <span class="pre">(re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'),</span> <span class="pre">'-'),</span> <span class="pre">'remove_copyright_statement':</span> <span class="pre">None,</span> <span class="pre">'remove_stop_phrases':</span> <span class="pre">None,</span> <span class="pre">'make_lower_case':</span> <span class="pre">None,</span> <span class="pre">'normalize':</span> <span class="pre">None,</span> <span class="pre">'remove_trailing_dash':</span> <span 
class="pre">('(?&lt;!\\w)-|-(?!\\w)',</span> <span class="pre">''),</span> <span class="pre">'make_hyphens_words':</span> <span class="pre">('([a-z])\\-([a-z])',</span> <span class="pre">''),</span> <span class="pre">'remove_next_line':</span> <span class="pre">('\\n+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_email':</span> <span class="pre">('\\S*&#64;\\S*\\s?',</span> <span class="pre">''),</span> <span class="pre">'remove_formulas':</span> <span class="pre">('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_dash':</span> <span class="pre">('-',</span> <span class="pre">''),</span> <span class="pre">'remove_between_[]':</span> <span class="pre">('\\[.*?\\]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_between_()':</span> <span class="pre">('\\(.*?\\)',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_[]':</span> <span class="pre">('[\\[\\]]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_()':</span> <span class="pre">('[()]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_\\':</span> <span class="pre">('\\\\',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_numbers':</span> <span class="pre">('\\d+',</span> <span class="pre">''),</span> <span class="pre">'remove_standalone_numbers':</span> <span class="pre">('\\b\\d+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII_boundary':</span> <span class="pre">('\\b[^\\x00-\\x7F]+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII':</span> <span class="pre">('[^\\x00-\\x7F]+',</span> <span class="pre">''),</span> <span class="pre">'remove_tags':</span> <span class="pre">('&amp;lt;/?.*?&amp;gt;',</span> <span class="pre">''),</span> <span 
class="pre">'remove_special_characters':</span> <span class="pre">('[!|&quot;|#|$|%|&amp;|\\|\\\'|(|)|*|+|,|.|/|:|;|&lt;|=|&gt;|?|&#64;|[|\\|]|^|_|`|{|\\||}|~]',</span> <span class="pre">''),</span> <span class="pre">'isolate_frozen':</span> <span class="pre">None,</span> <span class="pre">'remove_extra_whitespace':</span> <span class="pre">('\\s+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_stop_words':</span> <span class="pre">None,</span> <span class="pre">'min_characters':</span> <span class="pre">None},</span> <span class="pre">exclude_hyphenated_stopwords=False,</span> <span class="pre">sw_pattern=re.compile('\\b[\\w-]+\\b'))]</span></em><a class="headerlink" href="#TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE" title="Link to this definition">#</a></dt>
<span class="sig-name descname"><span class="pre">DEFAULT_PIPELINE</span></span><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">[SimpleCleaner(module_type='CLEANER',</span> <span class="pre">effective_stop_words=['characteristics',</span> <span class="pre">'characteristic',</span> <span class="pre">'acknowledgment',</span> <span class="pre">'significantly',</span> <span class="pre">'automatically',</span> <span class="pre">'predominantly',</span> <span class="pre">'investigation',</span> <span class="pre">'approximately',</span> <span class="pre">'unfortunately',</span> <span class="pre">'corresponding',</span> <span class="pre">'substantially',</span> <span class="pre">'nevertheless',</span> <span class="pre">'demonstrates',</span> <span class="pre">'specifically',</span> <span class="pre">'applications',</span> <span class="pre">'introduction',</span> <span class="pre">'sufficiently',</span> <span class="pre">'demonstrated',</span> <span class="pre">'particularly',</span> <span class="pre">'consequently',</span> <span class="pre">'representing',</span> <span class="pre">'respectively',</span> <span class="pre">'successfully',</span> <span class="pre">'background:',</span> <span class="pre">'application',</span> <span class="pre">...</span> <span class="pre">(+1359</span> <span class="pre">more)],</span> <span class="pre">patterns={'standardize_hyphens':</span> <span class="pre">(re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'),</span> <span class="pre">'-'),</span> <span class="pre">'remove_copyright_statement':</span> <span class="pre">None,</span> <span class="pre">'remove_stop_phrases':</span> <span class="pre">None,</span> <span class="pre">'make_lower_case':</span> <span class="pre">None,</span> <span class="pre">'normalize':</span> <span class="pre">None,</span> <span class="pre">'remove_trailing_dash':</span> <span 
class="pre">('(?&lt;!\\w)-|-(?!\\w)',</span> <span class="pre">''),</span> <span class="pre">'make_hyphens_words':</span> <span class="pre">('([a-z])\\-([a-z])',</span> <span class="pre">''),</span> <span class="pre">'remove_next_line':</span> <span class="pre">('\\n+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_email':</span> <span class="pre">('\\S*&#64;\\S*\\s?',</span> <span class="pre">''),</span> <span class="pre">'remove_formulas':</span> <span class="pre">('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_dash':</span> <span class="pre">('-',</span> <span class="pre">''),</span> <span class="pre">'remove_between_[]':</span> <span class="pre">('\\[.*?\\]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_between_()':</span> <span class="pre">('\\(.*?\\)',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_[]':</span> <span class="pre">('[\\[\\]]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_()':</span> <span class="pre">('[()]',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_\\':</span> <span class="pre">('\\\\',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_numbers':</span> <span class="pre">('\\d+',</span> <span class="pre">''),</span> <span class="pre">'remove_standalone_numbers':</span> <span class="pre">('\\b\\d+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII_boundary':</span> <span class="pre">('\\b[^\\x00-\\x7F]+\\b',</span> <span class="pre">''),</span> <span class="pre">'remove_nonASCII':</span> <span class="pre">('[^\\x00-\\x7F]+',</span> <span class="pre">''),</span> <span class="pre">'remove_tags':</span> <span class="pre">('&amp;lt;/?.*?&amp;gt;',</span> <span class="pre">''),</span> <span 
class="pre">'remove_special_characters':</span> <span class="pre">('[!|&quot;|#|$|%|&amp;|\\|\\\'|(|)|*|+|,|.|/|:|;|&lt;|=|&gt;|?|&#64;|[|\\|]|^|_|`|{|\\||}|~]',</span> <span class="pre">''),</span> <span class="pre">'isolate_frozen':</span> <span class="pre">None,</span> <span class="pre">'remove_extra_whitespace':</span> <span class="pre">('\\s+',</span> <span class="pre">'</span> <span class="pre">'),</span> <span class="pre">'remove_stop_words':</span> <span class="pre">None,</span> <span class="pre">'min_characters':</span> <span class="pre">None},</span> <span class="pre">exclude_hyphenated_stopwords=False,</span> <span class="pre">sw_pattern=re.compile('\\b[\\w-]+\\b'))]</span></em><a class="headerlink" href="#TELF.pre_processing.Vulture.vulture.Vulture.DEFAULT_PIPELINE" title="Link to this definition">#</a></dt>
<dd></dd></dl>

<dl class="py attribute">