Optimizing data for Display Names #3260
I think 2 is a big issue, and I think it also happens for other data. We could, instead of loading a single data struct in the formatter constructor, load all structs for the whole fallback chain. This could use naive fallback (i.e. chopping off tags), so no additional data would be needed. We can then remove redundant entries from
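A minimal sketch of the idea in the comment above, assuming a plain in-memory map from locale to a code-to-name table. The function names (`naive_fallback_chain`, `prune_redundant`, `load_merged`) and the sample strings are hypothetical illustrations, not ICU4X APIs or CLDR data.

```python
def naive_fallback_chain(locale: str) -> list[str]:
    """Return the locale followed by its parents, found by chopping off trailing subtags."""
    subtags = locale.split("-")
    chain = ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]
    if chain[-1] != "und":
        chain.append("und")  # root locale ends every chain
    return chain


def prune_redundant(data: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Drop entries that a locale would inherit unchanged from its fallback chain."""
    pruned = {}
    for locale, names in data.items():
        parents = naive_fallback_chain(locale)[1:]
        kept = {}
        for code, name in names.items():
            # Nearest ancestor that defines this code, if any.
            inherited = next(
                (data[p][code] for p in parents if code in data.get(p, {})), None
            )
            if inherited != name:
                kept[code] = name
        pruned[locale] = kept
    return pruned


def load_merged(data: dict[str, dict[str, str]], locale: str) -> dict[str, str]:
    """Load every struct in the fallback chain; more specific locales override parents."""
    merged: dict[str, str] = {}
    for loc in reversed(naive_fallback_chain(locale)):
        merged.update(data.get(loc, {}))
    return merged


# Made-up sample data for illustration only.
data = {
    "en": {"US": "United States", "GB": "United Kingdom"},
    "en-GB": {"US": "United States", "GB": "UK"},
}
pruned = prune_redundant(data)
```

With this sample, `pruned["en-GB"]` keeps only the `GB` override, and `load_merged(pruned, "en-GB")` still yields the same table as the unpruned data.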
Discuss with:
Discussed on 2023-07-04. We will use the auxiliary key model, similar to currency formatter (#1441), which resolves the issues in the OP.
Options:
Discussion:
Discussion in ECMA-402 meeting: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-11-25.md#display-names-data-slicing
Trying to organize my thoughts here. We are discussing two different ways of reducing data size:
These two paths can be explored mostly independently, though some solutions impact both.
Deduplicating identical strings
Two general paths for this which are compatible with the ICU4X architecture:
Is there another approach I'm missing?
Removing cold strings
Two ways to go about this:
Thoughts on any of the above?
While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to confusing people who use our default data. I'd rather attempt to deduplicate. Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the Rust compiler limits are.
I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries". We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.
I'm not too worried about confusion with our default data. I think our default data should always be at least the "core" data set, which I'm considering to be the cartesian product of all
Yes, deduplication is something we should do regardless of how/if we decide to handle cold strings.
Ideally we work out the compiler fundamentals before implementing a solution. It seems unlikely we'll have this ready in 2024, but it would be nice to have a path forward and implement this in 2025.
True. What I had in mind when I wrote my comment was more like, "if you need one name at a time, a map lookup is about as fast as a data provider lookup, but if you need an iterator over all names, a map lookup is probably faster."
Yep, 100%.
I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter. We also don't have any data on which combinations are "hot".
It's about tradeoffs. @zbraniecki likes to use the following thought experiment: in an environment with a fixed amount of space, what produces the better outcome for global users: including these cold display name strings, or including a whole new locale? I do think we should keep the cold strings in the default configuration. But, I also think we are in a position to give this lever to clients who need it.
CLDR has some data on usage patterns between languages and regions. The data could probably use some improvement.
I think it's bad to slice up the data like this. When a South Korean app asks a user for their country-or-region of residence as part of user registration (let's assume for the sake of simplicity here that it's legitimate to ask for such a thing), why should the menu exclude Uganda?
AFAICT, slicing up the DisplayNames data like this doesn't make much sense, since DisplayNames makes the most sense to use when you need to deal with a large set of possible names, especially if translating them is outside the core competence of the app's UI localizers. If the app only needs to deal with a small set of translatable strings that are familiar to the translators, the app doesn't need DisplayNames.
Continuing with the Korean example: Anecdotally, if a South Korean app has non-Korean UI languages, the UI languages are Korean, English, Japanese, and Chinese (without clearly saying Simplified or Traditional). That's a small set, but it doesn't follow that it makes sense to provide DisplayNames pruned to a small set that's somehow geographically proximate plus English. The app does not need DisplayNames (hot or cold) to get labels for an in-app UI for switching between these. It can use whatever mechanism it uses for its UI strings in general.
(And making English have a broader set of supported names than Korean, on the assumption that people whose country-or-region of residence is in the cold set for Korean would use the English UI anyway, would fail cases like Koreans who have moved abroad and are registering for an account with the UI in Korean but with a country-or-region of residence in the cold set. So that would be bad.)
Having an API that works for presumed-hot values but not for presumed-cold values makes for an unreliable API. If developers perceive an API as too unreliable to use, then the utility per size of even the hot set becomes worse.
As far as the JavaScript ECMA-402 side goes, as I said in the TG2 meeting, I think it would be better to not ship DisplayNames than to ship DisplayNames in an unreliable state.
The DisplayNames component comes with a large amount of data. It is the largest locale-specific data in ICU and will also likely be the largest in ICU4X.
There are a few things that make DisplayNames interesting:
CC @snktd @robertbastian @markusicu