More amy/anysplit modifications #1337

Merged
merged 23 commits into from
Aug 13, 2022

ampli
Member

@ampli ampli commented Aug 12, 2022

This patch makes anysplit.c potentially grapheme-aware. It works only with the PCRE2 library.
I added a new definition to the affix file, atomic-unit, which defines a sequence that should not be split. Initially I defined it as \X. But I noticed that, for reasons I don't know (I have no idea whether this is a bug or a feature in PCRE2 or Unicode), a word sometimes starts with mark characters (which may render badly because they have no base character to modify). So I changed it to \X\pM*. (A better name than atomic-unit may be needed to prevent confusion with the atomase...).
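
To illustrate the idea of an atomic unit, here is a minimal sketch. It is not the anysplit.c code and it does not use PCRE2: it only approximates \X\pM* by treating one codepoint plus any following codepoints in the Combining Diacritical Marks block (U+0300 to U+036F) as one unit. The names `utf8_decode`, `is_mark` and `atomic_unit_ends` are hypothetical, and valid UTF-8 input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 codepoint starting at s (ASSUMPTION: valid UTF-8).
 * Stores the codepoint in *cp and returns its length in bytes. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* ASSUMPTION: only U+0300..U+036F count as marks here.
 * The real \pM property covers many more codepoints. */
static int is_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Store in ends[] the byte offset just past each atomic unit of word.
 * These offsets are the only places where a split would be allowed.
 * Returns the number of units (which may exceed max_ends). */
size_t atomic_unit_ends(const char *word, size_t *ends, size_t max_ends)
{
    const unsigned char *s = (const unsigned char *)word;
    size_t pos = 0, n = 0;

    while (s[pos] != '\0') {
        uint32_t cp;
        pos += utf8_decode(s + pos, &cp);       /* first codepoint of the unit */
        while (s[pos] != '\0') {                /* absorb trailing marks */
            size_t len = utf8_decode(s + pos, &cp);
            if (!is_mark(cp)) break;
            pos += len;
        }
        if (n < max_ends) ends[n] = pos;
        n++;
    }
    return n;
}
```

For the input "e\xCC\x81t\xC3\xA9" (an "e" followed by U+0301, then "t", then a precomposed "é"), the sketch reports three units, so a splitter restricted to these offsets would never separate the combining accent from its base letter.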

I also simplified the regexes in amy/4.0.regex. I have left there the non-PCRE2 regexes, commented out.
In the affix file (through any/affix-punc) I added [[:punct:]] regexes for RPUNC/LPUNC/MPUNC, to strip off all types of punctuation (of course, this splits numbers and times too). I am not sure how good an idea this is, but it is optional and can be modified. I have left the multi-character punctuation (e.g. ...) in place.
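
As a rough illustration of why [[:punct:]] also splits numbers and times, here is a small sketch that uses POSIX `regex.h` as a stand-in. It is not how the library processes the affix file; `find_match` and `punct_split` are hypothetical helpers, and the default "C" locale is assumed.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first match of pattern in s, or -1 if
 * there is no match (or the pattern does not compile). On a match,
 * *len receives the match length. */
static int find_match(const char *pattern, const char *s, size_t *len)
{
    regex_t re;
    regmatch_t m;
    int off = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
    if (regexec(&re, s, 1, &m, 0) == 0) {
        off = (int)m.rm_so;
        *len = (size_t)(m.rm_eo - m.rm_so);
    }
    regfree(&re);
    return off;
}

/* Hypothetical helper: strip leading (LPUNC-like) and trailing
 * (RPUNC-like) punctuation and report the remaining core of word as
 * the byte range [*start, *end). Returns 1 if punctuation remains
 * inside the core (an MPUNC-like split point), 0 if not, and -1 if
 * the core does not fit the local buffer. */
int punct_split(const char *word, size_t *start, size_t *end)
{
    char core[256];
    size_t len = 0;
    int r;

    *start = 0;
    *end = strlen(word);

    if (find_match("^[[:punct:]]+", word, &len) == 0) *start = len;
    r = find_match("[[:punct:]]+$", word, &len);
    if (r >= 0 && (size_t)r >= *start) *end = (size_t)r;

    if (*end - *start >= sizeof core) return -1;
    memcpy(core, word + *start, *end - *start);
    core[*end - *start] = '\0';

    return find_match("[[:punct:]]", core, &len) >= 0 ? 1 : 0;
}
```

With the input "(12:30)," the core is "12:30" and the function returns 1, because the colon is itself punctuation and so the time would be split in the middle. A plain word such as "hello" returns 0.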

Due to the multi-character punctuation, I added the following, which is actually a bug fix to my previous affix-related modifications:

  • afdict_init(): Validate affixes w/dictionary_word_is_known().

Note that my new ANY_PUNCT accepts subscripted punctuation, and if it is used, this makes it possible to know from which side it was stripped. However, I now see that, due to my syntax decision for affix regexes, this cannot be used with them.
So maybe it would be an improvement to change their syntax to /regex/\1/ after all. Then /regex/\1.y/ could be used to add a subscript and, for example, /regex/\1/a could specify a split as an alternative (instead of as a replacement, as it is now).
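
Purely to make the proposed /regex/replacement/flags form concrete, here is a sketch of how such a specification could be broken into its three fields. The syntax is only a suggestion in this comment, so everything here is hypothetical: the `affix_spec` struct, the field sizes, and the rule that a backslash protects the following character (both are copied verbatim).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for a "/regex/replacement/flags" spec. */
typedef struct {
    char regex[64];
    char repl[64];
    char flags[8];
} affix_spec;

/* Copy characters from *p into dst up to the next unescaped '/'.
 * A backslash and the character after it are copied verbatim.
 * On success, *p is left just past the '/' and 0 is returned;
 * -1 means the delimiter is missing or dst is too small. */
static int copy_field(const char **p, char *dst, size_t cap)
{
    const char *s = *p;
    size_t n = 0;

    while (*s != '\0' && *s != '/') {
        if (*s == '\\' && s[1] != '\0') {
            if (n + 1 >= cap) return -1;
            dst[n++] = *s++;            /* keep the backslash itself */
        }
        if (n + 1 >= cap) return -1;
        dst[n++] = *s++;
    }
    if (*s != '/') return -1;
    dst[n] = '\0';
    *p = s + 1;
    return 0;
}

/* Split s into regex, replacement and flags. Returns 0 on success. */
int parse_affix_spec(const char *s, affix_spec *out)
{
    if (*s++ != '/') return -1;
    if (copy_field(&s, out->regex, sizeof out->regex) != 0) return -1;
    if (copy_field(&s, out->repl, sizeof out->repl) != 0) return -1;
    if (strlen(s) >= sizeof out->flags) return -1;
    strcpy(out->flags, s);
    return 0;
}
```

Parsing the C string "/[[:punct:]]/\\1.y/a" would yield the regex [[:punct:]], the replacement \1.y (a back-reference with subscript y), and the flag a. How the flag and subscript would then be interpreted is left open, as in the proposal itself.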

Refs: Issues #1334, #1333, #1315; PRs #1334, #1329, #1321.

ampli added 23 commits August 12, 2022 17:20
This doesn't work yet for splitting on grapheme boundaries, because ^X
matches at least one codepoint so it matches a split initial morpheme in a part.

This change is needed for the upcoming new code to split at grapheme
boundaries.
No need for them after the grapheme-aware separation modification.
This way morpheme candidates (split parts) do not start with marks.
This looks nicer and gives fewer splits. I don't know whether it is more useful.
...instead of dict_has_word(), to allow punctuation that match a regex.
@linas linas merged commit 1744562 into opencog:master Aug 13, 2022
@linas
Member

linas commented Aug 13, 2022

Thank you!

2 participants