More amy/anysplit modifications #1337
Merged
This doesn't work yet for splitting on grapheme boundaries, because ^X matches at least one codepoint, so it matches a split initial morpheme in a part. This change is needed for the upcoming new code to split at grapheme boundaries.
No need for them after the grapheme-aware separation modification.
This way morpheme candidates (split parts) do not start with marks. This looks nicer and gives fewer splits. I don't know whether it is more useful.
...instead of dict_has_word(), to allow punctuation that matches a regex.
Thank you!
This patch makes `anysplit.c` potentially grapheme-aware. It works only with the PCRE2 library.

I added a new definition to the affix file, `atomic-unit`, that defines a sequence that should not get split. Initially, I defined it as `\X`. But I noted that, for reasons I don't know (I have no idea whether this is due to a bug or a feature in PCRE2 or Unicode), a word sometimes starts with mark characters (which may be rendered badly, as they don't have a base character to modify). So I changed it to `\X\pM*`. (A better name than `atomic-unit` may be needed to prevent confusion with the atomase...).

I also simplified the regexes in `amy/4.0.regex`. I have left the non-PCRE2 regexes there, commented out.

In the affix file (through `any/affix-punc`) I added `[[:punct:]]` regexes for RPUNC/LPUNC/MPUNC, to strip off all types of punctuation (of course this splits numbers and times too). I don't know how good an idea this is, but of course it is optional and can be modified. I have left the multi-character punctuations (e.g. `...`).

Due to the multi-character punctuation, I added the following, which is actually a bug fix to my previous affix-related modifications:
`afdict_init()`: Validate affixes w/`dictionary_word_is_known()`.

Note that my new `ANY_PUNCT` accepts subscripted punctuation, and if subscripts are used, this makes it possible to know from which side the punctuation was stripped. However, I now see that, due to my syntax decision for affix regexes, this cannot be used with them.
So maybe it will be an improvement to change their syntax to `/regex/\1/` after all (and then `/regex/\1.y/` could be used to add a subscript, and, for example, `/regex/\1/a` can specify to split as an alternative (instead of a replacement like now)).

Refs: Issues #1334, #1333, #1315; PRs #1334, #1329, #1321.
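
To illustrate why an atomic unit of `\X\pM*` keeps split parts from starting with a mark, here is a minimal Python sketch. It is not the code in `anysplit.c`, which relies on PCRE2. Python's standard `re` module has no `\X`, so the sketch approximates an atomic unit as one character plus any combining marks (Unicode general category `M*`) that follow it. The function names `atomic_units` and `two_part_splits` are hypothetical, chosen only for this example.

```python
import unicodedata


def atomic_units(word):
    """Group each character with the combining marks that follow it.

    This only approximates the `\\X\\pM*` atomic unit: it does not
    implement full extended grapheme cluster rules.
    """
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and units:
            # Attach the mark to the preceding unit so it is never split off.
            units[-1] += ch
        else:
            units.append(ch)
    return units


def two_part_splits(word):
    """Enumerate all two-part splits that fall on atomic-unit boundaries."""
    units = atomic_units(word)
    return [
        ("".join(units[:i]), "".join(units[i:]))
        for i in range(1, len(units))
    ]


# "n", "e" + COMBINING ACUTE ACCENT (U+0301), "e": four codepoints, three units.
word = "ne\u0301e"
units = atomic_units(word)
splits = two_part_splits(word)
```

Splitting `word` at every codepoint boundary would yield three two-part splits, one of which has a second part beginning with the bare combining accent. Splitting at unit boundaries yields only two, and no part starts with a mark, which matches the "looks nicer and gives fewer splits" observation in the commit message.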