ICU-23004: experiment with UTF-8/16 C++ iterators #3096

Draft: wants to merge 23 commits into main

markusicu
Member

@markusicu markusicu commented Aug 12, 2024

Checklist
  • Required: Issue filed: ICU-23004
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/utf16cppiter.h is different
  • icu4c/source/common/unicode/uversion.h is no longer changed in the branch
  • icu4c/source/test/intltest/intltest.vcxproj is different
  • icu4c/source/test/intltest/intltest.vcxproj.filters is different
  • icu4c/source/test/intltest/itutil.cpp is different
  • icu4c/source/test/intltest/Makefile.in is different
  • icu4c/source/test/intltest/utfcppitertest.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu
Member Author

Hi @eggrobin, I think this is worth another look. I rebased on recent main, made changes from our discussions, and I think this now looks roughly like a reasonable validating, forward-only (so far) code point iterator for Unicode 16-bit strings. It no longer tries to be clever: it no longer reads and validates the code point while iterating, and it no longer stores the result in the iterator.
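For readers following along, here is a minimal sketch of what such an iterator could look like. It is not the code in utf16cppiter.h: the names `CodePointIter` and `decodeAll` are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption about what "validating" means here. The point it illustrates is that the iterator holds only pointers, decodes in operator*() on demand, and caches nothing.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch, not the class from utf16cppiter.h.
class CodePointIter {
public:
    CodePointIter(const char16_t *p, const char16_t *limit) : p_(p), limit_(limit) {}

    // Decodes on demand; the result is not stored in the iterator.
    char32_t operator*() const {
        assert(p_ != limit_);
        char16_t c = *p_;
        if (isLead(c) && p_ + 1 != limit_ && isTrail(p_[1])) {
            return 0x10000 + ((char32_t(c) - 0xD800) << 10) + (char32_t(p_[1]) - 0xDC00);
        }
        // Assumption: an unpaired surrogate is reported as U+FFFD.
        return isSurrogate(c) ? char32_t(0xFFFD) : char32_t(c);
    }

    // Skips one code unit, or two for a well-formed surrogate pair.
    CodePointIter &operator++() {
        assert(p_ != limit_);
        char16_t c = *p_++;
        if (isLead(c) && p_ != limit_ && isTrail(*p_)) { ++p_; }
        return *this;
    }

    bool operator==(const CodePointIter &o) const { return p_ == o.p_; }
    bool operator!=(const CodePointIter &o) const { return p_ != o.p_; }

private:
    static bool isLead(char16_t c) { return (c & 0xFC00) == 0xD800; }
    static bool isTrail(char16_t c) { return (c & 0xFC00) == 0xDC00; }
    static bool isSurrogate(char16_t c) { return (c & 0xF800) == 0xD800; }

    const char16_t *p_;
    const char16_t *limit_;
};

// Usage helper: collects all code points of a 16-bit string.
std::vector<char32_t> decodeAll(std::u16string_view s) {
    std::vector<char32_t> out;
    const char16_t *limit = s.data() + s.size();
    for (CodePointIter it(s.data(), limit), end(limit, limit); it != end; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

In this sketch operator*() and operator++() each inspect the surrogate bits separately, which is the simplest arrangement but not necessarily the one the compiler optimizes best.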

Plenty of TODOs and questions left, but I would appreciate feedback on the shape of what I've got so far.

@markusicu
Member Author

I experimented with godbolt and found that the compiler does its best job of fusing operator*() and operator++() when they both call the same implementation function. This makes operator++() look horribly inefficient in the source, but the machine code that clang 19 generates with optimization enabled for a regular range-based for loop looks very concise.
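The pattern described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `FusedIter`, `readAndInc`, `CodePoints`, and `collect` are hypothetical names, and the U+FFFD substitution for unpaired surrogates is assumed. operator*() runs the shared function on a copy of the pointer, and operator++() runs the same function on the real pointer and discards the decoded value, so an optimizer that inlines both into one loop body has a chance to merge the duplicated work.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch of "both operators call the same implementation function".
class FusedIter {
public:
    FusedIter(const char16_t *p, const char16_t *limit) : p_(p), limit_(limit) {}

    char32_t operator*() const {
        const char16_t *p = p_;          // decode from a copy; the iterator does not move
        return readAndInc(p, limit_);
    }

    FusedIter &operator++() {
        readAndInc(p_, limit_);          // same function; decoded value is discarded
        return *this;
    }

    bool operator==(const FusedIter &o) const { return p_ == o.p_; }
    bool operator!=(const FusedIter &o) const { return p_ != o.p_; }

private:
    // Reads one code point starting at p and advances p past it.
    static char32_t readAndInc(const char16_t *&p, const char16_t *limit) {
        assert(p != limit);
        char32_t c = *p++;
        if ((c & 0xF800) != 0xD800) { return c; }              // not a surrogate
        if ((c & 0x400) == 0 && p != limit && (*p & 0xFC00) == 0xDC00) {
            char32_t trail = *p++;                              // well-formed pair
            return 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
        return 0xFFFD;                                          // assumption: unpaired surrogate
    }

    const char16_t *p_;
    const char16_t *limit_;
};

// Minimal range wrapper so that a range-based for loop works.
struct CodePoints {
    std::u16string_view s;
    FusedIter begin() const { return {s.data(), s.data() + s.size()}; }
    FusedIter end() const { return {s.data() + s.size(), s.data() + s.size()}; }
};

std::vector<char32_t> collect(std::u16string_view s) {
    std::vector<char32_t> out;
    for (char32_t c : CodePoints{s}) { out.push_back(c); }
    return out;
}
```

Whether a given compiler actually fuses the two calls is something to check in its generated code; the sketch only shows the source-level shape.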

I then also made the iterator bidirectional and added a special version for efficient rbegin() & rend() using the same principles.
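As a rough illustration of the reverse direction, the sketch below applies the same idea backwards: operator*() and operator++() of a dedicated reverse iterator both call one shared function that steps back over a code point. The names `ReverseIter`, `decAndRead`, and `reverseAll` are hypothetical, the U+FFFD handling is assumed, and none of this is taken from the PR's code.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch of a dedicated reverse code point iterator.
class ReverseIter {
public:
    // p points just past the code point that operator*() will return.
    ReverseIter(const char16_t *start, const char16_t *p) : start_(start), p_(p) {}

    char32_t operator*() const {
        const char16_t *p = p_;          // decode from a copy
        return decAndRead(start_, p);
    }

    ReverseIter &operator++() {          // "forward" for a reverse iterator means backward in the string
        decAndRead(start_, p_);
        return *this;
    }

    bool operator==(const ReverseIter &o) const { return p_ == o.p_; }
    bool operator!=(const ReverseIter &o) const { return p_ != o.p_; }

private:
    // Moves p back over one code point and returns it.
    static char32_t decAndRead(const char16_t *start, const char16_t *&p) {
        assert(p != start);
        char32_t c = *--p;
        if ((c & 0xF800) != 0xD800) { return c; }               // not a surrogate
        if ((c & 0x400) != 0 && p != start && (p[-1] & 0xFC00) == 0xD800) {
            char32_t lead = *--p;                                // well-formed pair
            return 0x10000 + ((lead - 0xD800) << 10) + (c - 0xDC00);
        }
        return 0xFFFD;                                           // assumption: unpaired surrogate
    }

    const char16_t *start_;
    const char16_t *p_;
};

// Usage helper: code points of s from last to first.
std::vector<char32_t> reverseAll(std::u16string_view s) {
    std::vector<char32_t> out;
    const char16_t *start = s.data();
    ReverseIter it(start, start + s.size()), end(start, start);
    for (; it != end; ++it) { out.push_back(*it); }
    return out;
}
```

For comparison, std::reverse_iterator implements operator*() by copying the wrapped iterator, decrementing the copy, and dereferencing it; a dedicated reverse iterator like the one sketched here avoids going through that adapter.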

The bidirectional iterator also exposes explicit but non-colloquial functions.

@eggrobin
Member

eggrobin commented Jan 3, 2025

(As noted over email, I’ll take a look on Monday when I’m back from the holidays.)

@markusicu changed the title from "experiment with UTF-8/16 C++ iterators" to "ICU-23004: experiment with UTF-8/16 C++ iterators" on Jan 7, 2025
3 participants