Support Unicode 15.1 new GB9c break rule #1718

Closed
DonKult opened this issue Oct 29, 2023 · 4 comments

@DonKult
Contributor

DonKult commented Oct 29, 2023

ycmd embeds its Unicode support files and tests (currently for version 13), and provides a script (update_unicode.py) to update them to the latest Unicode version. This worked for the upgrade to version 14, but no longer works with 15. For example, the tests fail with:

[ RUN      ] UnicodeTest/WordTest.BreakIntoCharacters/1186
./cpp/ycm/tests/Word_test.cpp:60: Failure
Value of: Word( word_.text_ ).Characters()
Expected: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", false, true, false, false } }
  Actual: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क्, false, true, false, false }, *{ "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", true, true, false, false } }

[  FAILED  ] UnicodeTest/WordTest.BreakIntoCharacters/1186, where GetParam() = { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत" } (0 ms)

The reason is that 15.1 introduces a new rule for (not) breaking, GB9c, and of course the new tests exercising this rule now fail.
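
To make the rule concrete, here is a minimal Python sketch of GB9c as worded in UAX #29 for Unicode 15.1: no break before an InCB=Consonant when it is preceded by an InCB=Consonant followed by a run of InCB=Extend/Linker code points containing at least one Linker. The `INCB` table and the names `incb` and `gb9c_forbids_break` are hypothetical and only cover the code points from the failing test; this is not how ycmd or utf8proc implement it.

```python
# Illustrative subset of Indic_Conjunct_Break (InCB) values. The real data
# covers many more code points; this table is an assumption for the sketch.
INCB = {
    "\u0915": "Consonant",  # DEVANAGARI LETTER KA
    "\u0924": "Consonant",  # DEVANAGARI LETTER TA
    "\u094D": "Linker",     # DEVANAGARI SIGN VIRAMA
    "\u200D": "Extend",     # ZERO WIDTH JOINER
}


def incb(ch):
    """Return the InCB value of a code point, or "None" if it has none."""
    return INCB.get(ch, "None")


def gb9c_forbids_break(text, pos):
    """Return True if GB9c forbids a break between text[pos - 1] and text[pos]."""
    if pos <= 0 or pos >= len(text) or incb(text[pos]) != "Consonant":
        return False
    seen_linker = False
    i = pos - 1
    # Walk back over the run of Extend/Linker code points before the consonant.
    while i >= 0 and incb(text[i]) in ("Extend", "Linker"):
        seen_linker = seen_linker or incb(text[i]) == "Linker"
        i -= 1
    # The run must contain a Linker and must itself follow a Consonant.
    return seen_linker and i >= 0 and incb(text[i]) == "Consonant"


# The sequence from the failing test: KA, VIRAMA, VIRAMA, TA.
failing_case = "\u0915\u094D\u094D\u0924"
print(gb9c_forbids_break(failing_case, 3))
```

For the sequence from the failing test, the check reports that no break is allowed before the final consonant, which matches the single character the test expects.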

Prior art implementing this rule elsewhere: JuliaStrings/utf8proc#253

Would be nice if support for newer Unicode standards could be added to ycmd.

@puremourning
Member

PR welcome.

@bstaletic
Collaborator

I recently tried a naive upgrade to the latest Unicode standard, but saw that the tests were failing.

> The reason is that 15.1 introduces a new rule for (not) breaking, GB9c, and of course the new tests exercising this rule now fail.

Thanks for tracking this down. I have stopped at the previous step, because I was busy. In case you do want to contribute:

https://github.com/ycm-core/ycmd/blob/master/cpp/ycm/Word.cpp#L31

@bstaletic
Collaborator

bstaletic commented Oct 30, 2023

@DonKult I am afraid I am having trouble understanding the new rule. Up to now, the boundary rules table contained only break properties that were explained in the values table. The newly added GB9c talks about "Indic_Conjunct_Break" (InCB), but I don't see it defined anywhere.

On top of that, InCB is not mentioned in the break property data either.

 

EDIT: Found it!

Now, this is a new property. That means we will need to extend our UnicodeData.inc with one more data member. :/
It's definitely doable, but it does make me wish we had a sparse vector to save on space.
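
As a rough illustration of the extra data this implies, the sketch below collects InCB values into a sparse mapping keyed by code point, so that only code points with an InCB value take up space. The `SAMPLE` excerpt, its line format, and the `parse_incb` helper are assumptions for illustration; this is not what update_unicode.py does.

```python
# Hypothetical excerpt in the style of DerivedCoreProperties.txt; the exact
# line format is an assumption made for this sketch.
SAMPLE = """\
# Indic_Conjunct_Break sample
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
200D          ; InCB; Extend # Cf       ZERO WIDTH JOINER
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""


def parse_incb(lines):
    """Build a sparse {code point: InCB value} mapping from property lines."""
    table = {}
    for line in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        # Keep only "<range> ; InCB ; <value>" entries.
        if len(fields) != 3 or fields[1] != "InCB":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for code_point in range(start, end + 1):
            table[code_point] = fields[2]
    return table


incb_table = parse_incb(SAMPLE.splitlines())
print(len(incb_table), incb_table[0x094D])
```

A mapping like this stores nothing for the vast majority of code points, which is the space saving a sparse structure would bring compared with adding one more member to every entry.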

EDIT2: I understand the point of having links to the versioned documents in update_unicode.py, but half of them already aren't versioned, and it is now just confusing. We should pick a side.

I know, I know... I am to blame for that mess.

@bstaletic
Collaborator

We have merged the update.
