Optimizing data for Display Names #3260

sffc · 2023-04-04T23:19:51Z

The DisplayNames component comes with a large amount of data. It is the largest locale-specific data in ICU and will also likely be the largest in ICU4X.

There are a few things that make DisplayNames interesting:

1. The majority of the display names are probably not useful to carry for most clients. For example, users speaking Japanese are more likely to need the translation for the Katakana script than the translation for the Cherokee script. We should explore something like japanext and likelysubtagsext where we have a core set and an extended set.
2. Regional variants often override only a small number of strings. For example, en-GB and en-US might be equivalent for all region names except for one or two. This doesn't play nicely with the deduplication mechanism we've relied on thus far.

CC @snktd @robertbastian @markusicu

robertbastian · 2023-04-05T08:55:03Z

I think 2 is a big issue, and I think it also happens for other data. We could, instead of loading a single data struct in the formatter constructor, load all structs for the whole fallback chain. This could use naive fallback (i.e. chopping off tags), so no additional data would be needed. We can then remove redundant entries from en-GB and en-001 if they are in en (if we're using naive we'd still have duplication across GB and 001 though).
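
A minimal sketch of that idea, assuming a hypothetical in-memory `Store` in place of a real data provider: the constructor walks the naive fallback chain (chopping off one subtag at a time) and keeps every payload it finds, and lookups return the first match. The names `RegionNames`, `Store`, `Payload`, and `demo_store` are invented for illustration and are not ICU4X APIs; the `XX` entries are placeholder values, not CLDR data.

```rust
use std::collections::HashMap;

/// Region code -> display name for one locale (hypothetical payload shape).
type Payload = HashMap<&'static str, &'static str>;
/// Locale identifier -> payload (hypothetical stand-in for a data provider).
type Store = HashMap<&'static str, Payload>;

/// Naive fallback: repeatedly chop the last subtag off the identifier.
/// "zh-Hant-TW" yields ["zh-Hant-TW", "zh-Hant", "zh"].
fn naive_fallback_chain(locale: &str) -> Vec<&str> {
    let mut chain = vec![locale];
    let mut current = locale;
    while let Some(idx) = current.rfind('-') {
        current = &current[..idx];
        chain.push(current);
    }
    chain
}

/// Hypothetical formatter holding one payload per link of the chain,
/// most specific locale first.
struct RegionNames<'a> {
    payloads: Vec<&'a Payload>,
}

impl<'a> RegionNames<'a> {
    fn new(store: &'a Store, locale: &str) -> Self {
        let payloads = naive_fallback_chain(locale)
            .into_iter()
            .filter_map(|l| store.get(l))
            .collect();
        Self { payloads }
    }

    /// Returns the name from the most specific payload that has one.
    fn of(&self, region: &str) -> Option<&'static str> {
        self.payloads.iter().find_map(|p| p.get(region).copied())
    }
}

/// Placeholder data: "en-GB" stores only the entry that differs from "en".
fn demo_store() -> Store {
    let mut store = Store::new();
    store.insert("en", Payload::from([("PA", "Panama"), ("XX", "name from en")]));
    store.insert("en-GB", Payload::from([("XX", "name from en-GB")]));
    store
}
```

With this layout, `RegionNames::new(&demo_store(), "en-GB")` resolves `PA` from the `en` payload and `XX` from the `en-GB` override. Because the chain for `en-GB` never visits `en-001`, entries shared only between those two would still be stored twice, as noted above.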

sffc · 2023-05-11T18:33:46Z

Discuss with:

sffc · 2023-07-05T08:37:27Z

Discussed on 2023-07-04. We will use the auxiliary key model, similar to currency formatter (#1441), which resolves the issues in the OP.

sffc · 2024-11-23T01:36:01Z

Options:

1. Fine-grained data marker attributes
2. Core/Extended

Discussion:

@zbraniecki Seems like it would be a footgun if clients aren't loading display names that they need.
@sffc How would you draw the line?
@zbraniecki Not really sure.
@zbraniecki Most clients are not in a position to decide whether "Tuvalu" should be included. It seems option 2 is easier for clients, if we can make a reasonable set. We are in a better position to draw this line than our customers.
@sffc Option 1 is better for data deduplication. It's possible though that we could do something similar to how we reduced data size for time zones (where we allow fallback to root).
@sffc My main worry with Option 1 is its impact on compile times with so many finely sliced data marker attributes.

sffc · 2024-11-25T18:06:42Z

Discussion in ECMA-402 meeting: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-11-25.md#display-names-data-slicing

sffc · 2024-12-14T01:13:57Z

Trying to organize my thoughts here.

We are discussing two different ways to reduce data size:

1. Deduplicating identical strings
   - Example: en-GB and en-US and fr have the same translation for Panama. We'd like to ship that translation only once.
2. Removing cold strings
   - Example: the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

These two paths can be explored mostly independently, though some solutions impact both.

Deduplicating identical strings

Two general paths for this which are compatible with the ICU4X architecture:

1. Make every string its own data marker attribute. This is what we decided to do last summer (#3260 (comment)), and it is what we currently do for the experimental unit and currency formats.
- Pro: Smallest postcard data size
- Pro: Allows for cold strings to be removed via datagen (solves problem 2)
- Pro: Easy to implement
- Con: Finely sliced data markers increase build times and baked data size (see #5230, "Baked data is big, and compiles slowly, for finely sliced data markers")
- Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)
2. Load multiple payloads, such as "locale plus root" or "locale plus script", which have been deduplicated. This is what @robertbastian suggested in #3260 (comment), and it is similar to what he implemented to deduplicate time zone names in #5751 ("Remove generic metazone values that match location values").
- Pro: Doesn't test the boundaries of the Rust compiler
- Pro: Should be fairly efficient for display name lookup
- Con: Leaves some data size on the table (unclear how much)
- Con: A bit harder to implement
- Con: Increases stack size of DisplayNames type
- Con: May favor Latin-script languages, but see #5901 ("Consider using script fallback for time zone names and maybe others")
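
To make the "multiple payloads" path concrete, here is a rough sketch of the datagen-side step it implies, under the assumption that lookups always consult the locale payload first and the parent payload second. The function names and the `Names` alias are hypothetical, and the `XX` entries are placeholders rather than CLDR data.

```rust
use std::collections::BTreeMap;

/// Code -> display name for one payload (hypothetical shape).
type Names = BTreeMap<String, String>;

/// Datagen-time pass: drop every entry of `child` whose value is identical
/// in `parent`. Returns the number of entries removed.
fn dedup_against_parent(child: &mut Names, parent: &Names) -> usize {
    let before = child.len();
    child.retain(|code, name| parent.get(code) != Some(&*name));
    before - child.len()
}

/// Runtime lookup over the two loaded payloads: the locale payload wins,
/// and the parent fills in whatever was deduplicated away.
fn lookup<'a>(child: &'a Names, parent: &'a Names, code: &str) -> Option<&'a str> {
    child
        .get(code)
        .or_else(|| parent.get(code))
        .map(|s| s.as_str())
}

/// Small helper to build a payload from string pairs.
fn names(pairs: &[(&str, &str)]) -> Names {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

For example, if both payloads map `PA` to `Panama` and only `XX` differs, `dedup_against_parent` removes one entry from the child, and `lookup` still returns `Panama` for `PA` through the parent. How much data this leaves on the table compared with per-string attributes depends on how many strings are shared between locales that are not in a parent/child relationship.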

Is there another approach I'm missing?

Removing cold strings

Two ways to go about this:

1. Utilize data marker attributes (see above)
- All pros and cons as above
- Pro: The set of strings included for each locale is fully customizable by the client, though we could still ship default configurations
- Con: A non-default set of strings requires running datagen
2. Introduce Minimal/Core/Extended keys, similar to LocaleExpander.
- Pro: Clients do not need to fiddle with datagen in order to meet their requirements
- Pro: Puts the i18n team in charge of deciding what constitutes a "cold" string
- Con: The choices offered to clients are coarse (such as minimal, core, or extended) and not customizable
- Con: Increases stack size of DisplayNames type
- Con: May reduce performance relative to not having the separate keys

Thoughts on any of the above?

@Manishearth @hsivonen

Manishearth · 2024-12-17T01:02:44Z

While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'd rather attempt to deduplicate. Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the rust compiler limits are.

> Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)

I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".

We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

sffc · 2024-12-17T02:06:15Z

> While I am in favor of allowing clients to fully customize which strings they get, I think removing cold strings is prone to being confusing for people using our default data?

I'm not too worried about confusion with our default data. I think our default data should always be at least the "core" data set, which I'm considering to be the cartesian product of all modern locales. If we take the multiple-keys approach, the constructors can have names such as new_minimal, new, and new_extended.
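
A rough sketch of what the multiple-keys constructors could look like, assuming the three tiers are disjoint and additive (each constructor loads its tier plus the smaller ones). The types are hypothetical, and which script lands in which tier is an arbitrary choice for illustration only.

```rust
use std::collections::BTreeMap;

/// Script code -> display name for one tier (hypothetical payload shape).
type Payload = BTreeMap<&'static str, &'static str>;

/// Hypothetical data source exposing three separately loadable keys.
struct TieredData {
    minimal: Payload,
    core: Payload,
    extended: Payload,
}

/// One slot per tier: this fixed-size array is the extra stack space
/// that the multiple-keys approach adds to the formatter type.
struct ScriptNames<'a> {
    tiers: [Option<&'a Payload>; 3],
}

impl<'a> ScriptNames<'a> {
    fn new_minimal(data: &'a TieredData) -> Self {
        Self { tiers: [Some(&data.minimal), None, None] }
    }

    /// Default constructor: minimal plus core.
    fn new(data: &'a TieredData) -> Self {
        Self { tiers: [Some(&data.minimal), Some(&data.core), None] }
    }

    fn new_extended(data: &'a TieredData) -> Self {
        Self { tiers: [Some(&data.minimal), Some(&data.core), Some(&data.extended)] }
    }

    /// Checks each loaded tier in order; unloaded tiers are skipped.
    fn of(&self, script: &str) -> Option<&'static str> {
        self.tiers
            .iter()
            .flatten()
            .find_map(|tier| tier.get(script).copied())
    }
}

/// Arbitrary tier assignment, for illustration only.
fn demo_data() -> TieredData {
    TieredData {
        minimal: Payload::from([("Kana", "Katakana")]),
        core: Payload::from([("Latn", "Latin")]),
        extended: Payload::from([("Cher", "Cherokee")]),
    }
}
```

In this sketch `ScriptNames::new(&data).of("Cher")` returns `None` while `new_extended` returns `Some("Cherokee")`, which is the kind of behavior difference between constructors that clients would need to be aware of.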

> I'd rather attempt to deduplicate.

Yes, deduplication is something we should do regardless of how/if we decide to handle cold strings.

> Multiple payloads sounds like the okay-but-suboptimal solution, and we can plan for that whilst I figure out how fundamental the rust compiler limits are.

Ideally we work out the compiler fundamentals before implementing a solution. It seems unlikely we'll have this ready in 2024, but it would be nice to have a path forward and implement this in 2025.

> Con: The data provider needs to be queried for every individual display name (as opposed to one load that returns a map)
>
> I actually think this can be a pro! I think this depends on the usage patterns, and I could see us even having different types of display names formatters geared towards different use cases. The use case of "show the user the country they are in" is different from "show the users a dropdown of countries".
>
> We're going to have a bit of recurring tension between "optimize data for multiple use cases at once" vs "make different data models for different use cases, but then people with both use cases have to avoid accidentally duplicating data". We've had this question crop up before for BidiAuxiliaryProperties.

True. What I had in mind when I wrote my comment was more like, "if you need one name at a time, a map lookup is about as fast as a data provider lookup, but if you need an iterator over all names, a map lookup is probably faster."

Manishearth · 2024-12-17T02:46:24Z

Yep, 100%.

robertbastian · 2025-01-07T14:57:42Z

I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter. We also don't have any data on which combinations are "hot".

sffc · 2025-01-09T01:10:19Z

> I don't like the cold-data thing. A user with a Korean UI and address in Germany is just as important as a user with a German UI and an address in Germany, even if there are more of the latter.

It's about tradeoffs. @zbraniecki likes to use the following thought experiment: in an environment with a fixed amount of space, what produces the better outcome for global users: including these cold display name strings, or including a whole new locale?

I do think we should keep the cold strings in the default configuration. But, I also think we are in a position to give this lever to clients who need it.

> We also don't have any data on which combinations are "hot".

CLDR has some data on usage patterns between languages and regions. The data could probably use some improvement.

hsivonen · 2025-01-09T12:22:41Z

> the display name of Uganda in Korean maybe doesn't need to be in a minimal data set.

I think it's bad to slice up the data like this. When a South Korean app asks a user for their country-or-region of residence as part of user registration (let's assume for the sake of simplicity here that it's legitimate to ask for such a thing), why should the menu exclude Uganda?

AFAICT, slicing up the DisplayNames data like this doesn't make much sense, since DisplayNames makes the most sense to use when you need to deal with a large set of possible names, especially if translating them is outside the core competence of the app's UI localizers. If the app only needs to deal with a small set of translatable strings that are familiar to the translators, the app doesn't need DisplayNames.

Continuing with the Korean example: Anecdotally, if a South Korean app has non-Korean UI languages, the UI languages are Korean, English, Japanese, and Chinese (without clearly saying Simplified or Traditional). That's a small set, but it doesn't follow that it makes sense to provide DisplayNames pruned to a small set that's somehow geographically proximate plus English. The app does not need DisplayNames, hot or cold, to get labels for an in-app UI for switching between these. It can use whatever mechanism it uses for its UI strings in general.

(And making English have a broader set of supported names than Korean, on the assumption that people whose country-or-region of residence is in the cold set for Korean would use the English UI anyway, would fail cases like Koreans who have moved abroad and are registering for an account with the UI in Korean but with a country-or-region of residence in the cold set. So that would be bad.)

Having an API that works for presumed-hot values but not for presumed-cold values makes for an unreliable API. If developers perceive an API as too unreliable to use, then the utility per size of even the hot set becomes worse.

As far as the JavaScript ECMA-402 side goes, as I said in the TG2 meeting, I think it would be better to not ship DisplayNames than to ship DisplayNames in an unreliable state.

sffc added A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting A-data Area: Data coverage or quality C-dnames Component: Language/Region/... Display Names labels Apr 4, 2023

sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label May 25, 2023

sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023

sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 5, 2023

sffc added T-core Type: Required functionality S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 5, 2023

sffc mentioned this issue Jul 5, 2023

Implement and re-implement auxiliary keys #3632

Closed

sffc mentioned this issue Aug 22, 2023

Finalize the DisplayNames component #3913

Open

5 tasks

sffc modified the milestones: 1.4 Blocking ⟨P1⟩, 1.5 Blocking ⟨P1⟩ Nov 14, 2023

Manishearth added this to icu4x 2.0 Feb 23, 2024

Manishearth moved this to Not a blocker in icu4x 2.0 Feb 23, 2024

sffc modified the milestones: 1.5 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩ Feb 29, 2024

sffc removed this from icu4x 2.0 Feb 29, 2024

sffc mentioned this issue Oct 8, 2024

Fix generic location format for single-tz countries #5657

Merged

sffc added the discuss-priority Discuss at the next ICU4X meeting label Nov 20, 2024

sffc added U-ecma402 User: ECMA-402 compatibility and removed discuss-priority Discuss at the next ICU4X meeting labels Nov 23, 2024
