-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Misuse of wchar_t for UTF-16/32 #180
Comments
Thank you for your comments. As you note, lcms use wchar_t to store utf16 codes and this is the only assumption it does. This is clearly stated in the manual that the char width is 16 bits. See MLU documentation. Utf32 is not supported, nor even mentioned in the documentation. wchar_t was chosen as transport type because at least can hold 16 bits and in some platforms it simplifies string handling. On other platforms, it makes some characters accessible with few effort (those in range 0..65536). In any case it doesnt intend to be a complete solution. I found many compatibility problems with string conversion functions, so I remove all of them for simplicity sake.Lcms2 development is on freeze and no new features will be incorporated right now. Otherwise your suggestions makes a lot of sense and for sure I will have them very present on future developments.Thanks again
|
Yes, any use of these functions requires a boilerplate to cast those 32-bit characters back to uint16. Otherwise no other library or API will read the strings correctly. Not a big deal but it is pretty unusual. Not sure how to handle string literals. There is no equivalent in C99 for “the ASCII functions [are] only valid if the original string has no non-ASCII code points in it” → correct. Don't use them if you are not 100% sure there are only ASCII characters. |
Would it be possible/easier to solve this using UTF-8 on all platforms? |
2.13 has some code to fix this issue. I left it open because it misses the write part, but the engine should now read and convert ICC's utf16 into utf32. |
Even though
wchar_t
does not have a well-defined size in C or C++, LittleCMS uses it to store UTF-16 strings.The problem is that on platforms with a 32-bit
wchar_t
(e.g. Linux, OS X), each UTF-16 code unit (e.g. from a file) is simply cast towchar_t
by_cmsReadWCharArray
which doesn't produce a valid UTF-32 sequence. To produce a valid UTF-32 sequence one would have to decode each Unicode code point from one or more UTF-16 code units.To make things worse
_cmsWriteWCharArray
just down-castswchar_t
tocmsUInt16Number
discarding the upper 16 bits of the UTF-32 code unit. This of course does not convert UTF-32 to UTF-16, as the correct conversion would be to decompose each Unicode code point into one or more UTF-16 code units.Apart from these problems there is also a minor matter of wasted memory - each 32-bit
wchar_t
has 16 bits unused.At the very minimum, I would propose documenting the fact that strings created by LittleCMS functions returning
wchar_t
arrays cannot be used as valid UTF-32 strings on platforms with a 32-bitwchar_t
. Instead the user must do their own conversion, using the lower 16 bits of each value to form a valid UTF-16 sequence. And functions that takewchar_t
strings should also be documented to take UTF-16 sequences where each code unit is stored in awchar_t
.Of course, this behaviour is not intuitive or expected by the user and makes handling textual data with LittleCMS more complex (not to mention resource intensive, as each string conversion may require allocating extra memory). For instance, this confusion can cause issues such as #163.
That is why I also propose deprecating the functions that use
wchar_t
in favour of new functions that operate on some newcmsU16Character
type (always 16-bit, name definitely up for discussion), and explicitly state that strings are encoded as UTF-16 in the native byte order. (I would say, just usechar16_t
but that would require C11/C++11, so a custom type is probably a better choice for compatibility). The new functions could use aUTF16
postfix.With these new functions the internals would need to be reworked. Instead of storing Unicode text as
wchar_t
arrays, it should be stored ascmsU16Character
arrays. As an added bonus, this would allow for more efficient I/O functions that could process multiple characters at once instead of a character at a time (with the byte-swappingif
-statement needing to be called only once for character block).The deprecated functions would just do the up/down-casting themselves instead of deferring that to the I/O layer.
Another issue that I see is with the
ASCII
functions as the conversion from UTF-16 to ASCII (cmsMLUgetASCII
) is only valid if the original string has no non-ASCII code points in it. A better way to handle this may be to provide the ability to replace non-ASCII code points with a character of the user's choosing (e.g.?
). Of course this would require handling strings as actual sequences of code points instead of opaque code units. (I'm not 100% sure, but this probably doesn't even require fully decoding the UTF-32 code point from UTF-16, just skipping over some of the UTF-16 code units.)P.S. I would be glad to create a pull-request with some of the proposed changes if a decision is made to proceed with them.
The text was updated successfully, but these errors were encountered: