Optimizing data for Display Names #3260

Open
Tracked by #3913
sffc opened this issue Apr 4, 2023 · 12 comments
Labels
A-data Area: Data coverage or quality A-design Area: Architecture or design C-dnames Component: Language/Region/... Display Names S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality U-ecma402 User: ECMA-402 compatibility

Comments

@sffc
Member

sffc commented Apr 4, 2023

The DisplayNames component comes with a large amount of data. It is the largest locale-specific data in ICU and will also likely be the largest in ICU4X.

There are a few things that make DisplayNames interesting:

  1. The majority of the display names are probably not useful to carry for most clients. For example, users speaking Japanese are more likely to need the translation for the Katakana script than the translation for the Cherokee script. We should explore something like japanext and likelysubtagsext where we have a core set and an extended set.
  2. Regional variants often override only a small number of strings. For example, en-GB and en-US might be equivalent for all region names except for one or two. This doesn't play nicely with the deduplication mechanism we've relied on thus far.

CC @snktd @robertbastian @markusicu

@sffc sffc added A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting A-data Area: Data coverage or quality C-dnames Component: Language/Region/... Display Names labels Apr 4, 2023
@robertbastian
Member

I think 2 is a big issue, and I think it also happens for other data. Instead of loading a single data struct in the formatter constructor, we could load the structs for the whole fallback chain. This could use naive fallback (i.e. chopping off subtags), so no additional data would be needed. We can then remove entries from en-GB and en-001 that are redundant with en (with naive fallback we'd still have duplication between GB and 001, though, because the naive chain for en-GB goes straight to en).
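
A minimal sketch of the lookup this implies, assuming payloads are plain maps keyed by locale string. `Payloads`, `fallback_chain`, and `lookup` are hypothetical names for illustration, not ICU4X APIs, and the sample strings in the usage note are illustrative rather than CLDR data:

```rust
use std::collections::HashMap;

/// Display names keyed by locale, then by region code.
/// Hypothetical stand-in for data provider payloads; not the ICU4X API.
type Payloads = HashMap<&'static str, HashMap<&'static str, &'static str>>;

/// Naive fallback: chop subtags off the end until none are left.
/// For "en-GB" this yields ["en-GB", "en"].
fn fallback_chain(locale: &str) -> Vec<&str> {
    let mut chain = vec![locale];
    let mut current = locale;
    while let Some(idx) = current.rfind('-') {
        current = &current[..idx];
        chain.push(current);
    }
    chain
}

/// Consult the most specific payload first, then walk up the chain.
/// A regional payload only needs to carry the strings it overrides.
fn lookup(payloads: &Payloads, locale: &str, region: &str) -> Option<&'static str> {
    fallback_chain(locale)
        .into_iter()
        .find_map(|loc| payloads.get(loc)?.get(region).copied())
}
```

With an `en` payload holding every name and an `en-GB` payload holding a single override, `lookup(&payloads, "en-GB", code)` returns the override where one exists and the `en` string otherwise.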

@sffc
Member Author

sffc commented May 11, 2023

Discuss with:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label May 25, 2023
@sffc
Member Author

sffc commented Jul 5, 2023

Discussed on 2023-07-04. We will use the auxiliary key model, similar to currency formatter (#1441), which resolves the issues in the OP.

@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023
@sffc sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 5, 2023
@sffc sffc added T-core Type: Required functionality S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 5, 2023
@Manishearth Manishearth moved this to Not a blocker in icu4x 2.0 Feb 23, 2024
@sffc sffc removed this from icu4x 2.0 Feb 29, 2024
@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Nov 20, 2024
@sffc
Member Author

sffc commented Nov 23, 2024

Options:

  1. Fine grained data marker attributes
  2. Core/Extended

Discussion:

  • @zbraniecki Seems like it would be a footgun if clients aren't loading display names that they need.
  • @sffc How would you draw the line?
  • @zbraniecki Not really sure.
  • @zbraniecki Most clients are not in a position to decide whether "Tuvalu" should be included. It seems option 2 is easier for clients, if we can make a reasonable set. We are in a better position to draw this line than our customers.
  • @sffc Option 1 is better for data deduplication. It's possible though that we could do something similar to how we reduced data size for time zones (where we allow fallback to root).
  • @sffc My main worry with Option 1 is its impact on compile times with so many finely sliced data marker attributes.

@sffc sffc added U-ecma402 User: ECMA-402 compatibility and removed discuss-priority Discuss at the next ICU4X meeting labels Nov 23, 2024
@sffc
Member Author

sffc commented Nov 25, 2024

@sffc
Member Author

sffc commented Dec 14, 2024

Trying to organize my thoughts here.

We are discussing two different ways to reduce data size:

  1. Deduplicating identical strings
    • Example: en-GB and en-US and fr have the same translation for Panama. We'd like to ship that translation only once.
  2. Removing cold strings
    • Example: the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

These two paths can be explored mostly independently, though some solutions impact both.

Deduplicating identical strings

Two general paths for this which are compatible with the ICU4X architecture:

  1. Make every string its own data marker attribute. This is what we decided to do last summer (see the earlier comment on this issue, #3260), and it is what we currently do for the experimental unit and currency formats.
    • Pro: Smallest postcard data size
    • Pro: Allows for cold strings to be removed via datagen (solves problem 2)
    • Pro: Easy to implement
    • Con: Finely sliced data markers increase build times and baked data size (see #5230, "Baked data is big, and compiles slowly, for finely sliced data markers")
    • Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)
  2. Load multiple payloads, such as "locale plus root" or "locale plus script", which have been deduplicated. This is what @robertbastian suggested in an earlier comment on this issue (#3260), and it is similar to what he implemented to deduplicate time zone names in #5751 ("Remove generic metazone values that match location values").
    • Pro: Doesn't test the boundaries of the Rust compiler
    • Pro: Should be fairly efficient for display name lookup
    • Con: Leaves some data size on the table (unclear how much)
    • Con: A bit harder to implement
    • Con: Increases stack size of DisplayNames type
    • Con: May favor Latin-script languages, but see #5901 ("Consider using script fallback for time zone names and maybe others")
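
For the multiple-payloads path, the datagen-side deduplication could look roughly like the following sketch. It assumes each payload is a simple code-to-name map. `Names`, `dedup_against_parent`, and `resolve` are hypothetical helpers, not existing ICU4X or datagen functions:

```rust
use std::collections::BTreeMap;

/// Region code -> display name. A hypothetical, simplified payload shape.
type Names = BTreeMap<String, String>;

/// Datagen-time pass: drop every entry in `child` whose value is identical
/// in `parent`. Returns how many entries were removed.
fn dedup_against_parent(child: &mut Names, parent: &Names) -> usize {
    let before = child.len();
    child.retain(|code, name| parent.get(code) != Some(&*name));
    before - child.len()
}

/// Runtime side: after deduplication, a miss in the child payload must be
/// retried against the parent payload.
fn resolve<'a>(child: &'a Names, parent: &'a Names, code: &str) -> Option<&'a str> {
    child
        .get(code)
        .or_else(|| parent.get(code))
        .map(String::as_str)
}
```

The count returned by `dedup_against_parent` gives a quick way to estimate how much a given pair of payloads would shrink, which is the "unclear how much" question raised above.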

Is there another approach I'm missing?

Removing cold strings

Two ways to go about this:

  1. Utilize data marker attributes (see above)
    • All pros and cons as above
    • Pro: The set of strings included for each locale is fully customizable by the client, though we could still ship default configurations
    • Con: A non-default set of strings requires running datagen
  2. Introduce Minimal/Core/Extended keys, similar to LocaleExpander.
    • Pro: Clients do not need to fiddle with datagen in order to meet their requirements
    • Pro: Puts the i18n team in charge of deciding what constitutes a "cold" string
    • Con: The choices offered to clients are coarse (such as minimal, core, or extended) and not customizable
    • Con: Increases stack size of DisplayNames type
    • Con: May reduce performance relative to not having the separate keys

Thoughts on any of the above?

@Manishearth @hsivonen

@Manishearth
Member

Manishearth commented Dec 17, 2024

While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'd rather attempt to deduplicate. Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the Rust compiler limits are.

> Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)

I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".

We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

@sffc
Member Author

sffc commented Dec 17, 2024

> While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'm not too worried about confusion with our default data. I think our default data should always be at least the "core" data set, which I'm considering to be the Cartesian product of all modern locales. If we take the multiple-keys approach, the constructors can have names such as new_minimal, new, and new_extended.
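
A rough sketch of what such a type could look like, assuming each tier is its own payload. `TieredDisplayNames`, the `of` method, and the field layout are hypothetical; only the constructor names come from the paragraph above:

```rust
use std::collections::HashMap;

/// Region code -> display name. A hypothetical, simplified payload shape.
type Names = HashMap<&'static str, &'static str>;

/// One payload per tier. Carrying up to three payloads is what grows the
/// size of the formatter type compared to a single-key design.
struct TieredDisplayNames {
    minimal: Names,
    core: Option<Names>,
    extended: Option<Names>,
}

impl TieredDisplayNames {
    fn new_minimal(minimal: Names) -> Self {
        Self { minimal, core: None, extended: None }
    }

    fn new(minimal: Names, core: Names) -> Self {
        Self { minimal, core: Some(core), extended: None }
    }

    fn new_extended(minimal: Names, core: Names, extended: Names) -> Self {
        Self { minimal, core: Some(core), extended: Some(extended) }
    }

    /// Try each loaded tier in order; a name in an unloaded tier is a miss.
    fn of(&self, code: &str) -> Option<&'static str> {
        self.minimal
            .get(code)
            .or_else(|| self.core.as_ref()?.get(code))
            .or_else(|| self.extended.as_ref()?.get(code))
            .copied()
    }
}
```

Each lookup may probe up to three maps, which is where the possible performance cost relative to a single key would come from.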

> I'd rather attempt to deduplicate.

Yes, deduplication is something we should do regardless of how/if we decide to handle cold strings.

> Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the Rust compiler limits are.

Ideally we work out the compiler fundamentals before implementing a solution. It seems unlikely we'll have this ready in 2024, but it would be nice to have a path forward and implement this in 2025.

> Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)

> I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".

> We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

True. What I had in mind when I wrote my comment was more like, "if you need one name at a time, a map lookup is about as fast as a data provider lookup, but if you need an iterator over all names, a map lookup is probably faster."
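
To make the two access patterns concrete, here is a small sketch that counts provider requests rather than measuring time. `CountingProvider`, `load_one`, `load_all`, and the two `dropdown_*` helpers are hypothetical names, not ICU4X APIs:

```rust
use std::cell::Cell;
use std::collections::BTreeMap;

/// Hypothetical provider that counts how many requests it serves.
struct CountingProvider {
    names: BTreeMap<&'static str, &'static str>,
    queries: Cell<usize>,
}

impl CountingProvider {
    fn new(names: BTreeMap<&'static str, &'static str>) -> Self {
        Self { names, queries: Cell::new(0) }
    }

    /// Per-name model: every display name is its own request.
    fn load_one(&self, region: &str) -> Option<&'static str> {
        self.queries.set(self.queries.get() + 1);
        self.names.get(region).copied()
    }

    /// Map model: a single request returns all names for the locale.
    fn load_all(&self) -> &BTreeMap<&'static str, &'static str> {
        self.queries.set(self.queries.get() + 1);
        &self.names
    }
}

/// Dropdown built from per-name requests: one request per region, and the
/// caller has to know the list of region codes up front.
fn dropdown_per_name(p: &CountingProvider, regions: &[&str]) -> Vec<&'static str> {
    regions.iter().filter_map(|r| p.load_one(r)).collect()
}

/// Dropdown built from a single map load, iterating over its values.
fn dropdown_from_map(p: &CountingProvider) -> Vec<&'static str> {
    p.load_all().values().copied().collect()
}
```

For a single name both models issue one request; for a dropdown of N regions the per-name model issues N requests and the map model issues one.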

@Manishearth
Member

Yep, 100%.

@robertbastian
Member

I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter. We also don't have any data on which combinations are "hot".

@sffc
Member Author

sffc commented Jan 9, 2025

> I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter.

It's about tradeoffs. @zbraniecki likes to use the following thought experiment: in an environment with a fixed amount of space, what produces the better outcome for global users: including these cold display name strings, or including a whole new locale?

I do think we should keep the cold strings in the default configuration. But, I also think we are in a position to give this lever to clients who need it.

> We also don't have any data on which combinations are "hot".

CLDR has some data on usage patterns between languages and regions. The data could probably use some improvement.

@hsivonen
Member

hsivonen commented Jan 9, 2025

> the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

I think it's bad to slice up the data like this. When a South Korean app asks a user for their country-or-region of residence as part of user registration (let's assume for the sake of simplicity here that it's legitimate to ask for such a thing), why should the menu exclude Uganda?

AFAICT, slicing up the DisplayNames data like this doesn't make much sense, since DisplayNames makes the most sense to use when you need to deal with a large set of possible names and especially if translating them is outside the core competence of the app's UI localizers. If the app only needs to deal with a small set of translatable strings that are familiar to the translators, the app doesn't need DisplayNames.

Continuing with the Korean example: Anecdotally, if a South Korean app has non-Korean UI languages, the UI languages are Korean, English, Japanese, and Chinese (without clearly saying Simplified or Traditional). That's a small set, but it doesn't follow that it makes sense to provide DisplayNames pruned to a small set that's somehow geographically proximate plus English. The app does not need DisplayNames, hot or cold, to get labels for an in-app UI for switching between these. It can use whatever mechanism it uses for its UI strings in general.

(And making English have a broader set of supported names than Korean, on the assumption that people whose country-or-region of residence is in the cold set for Korean would use the English UI anyway, would fail cases like Koreans who have moved abroad and are registering for an account with the UI in Korean but a country-or-region of residence in the cold set. So that would be bad.)

Having an API that works for presumed-hot values but not for presumed-cold values makes for an unreliable API. If developers perceive an API as too unreliable to use, then the utility per size of even the hot set becomes worse.

As far as the JavaScript ECMA-402 side goes, as I said in the TG2 meeting, I think it would be better to not ship DisplayNames than to ship DisplayNames in an unreliable state.
