Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmenter constructor options/preferences normalization #5856

Open
sffc opened this issue Nov 22, 2024 · 12 comments
Open

Segmenter constructor options/preferences normalization #5856

sffc opened this issue Nov 22, 2024 · 12 comments
Labels
C-segmentation Component: Segmentation discuss-priority Discuss at the next ICU4X meeting S-small Size: One afternoon (small bug fix or enhancement)

Comments

@sffc
Copy link
Member

sffc commented Nov 22, 2024

In #3284 we decided to add a content_locale option to segmenter.

In #5839, @zbraniecki asked some questions about the signature of the constructors.

Currently we have constructors of the following shape:

  1. new: singleton static data (infallible), no content locale tailorings
  2. try_new_unstable: dynamic data, no content locale tailorings
  3. try_new_with_options: static data with content locale tailorings
  4. try_new_with_options_unstable: dynamic data with content locale tailorings

Which of the following should we do?

  1. Keep all 4 sets of constructors listed above
  2. Remove constructor sets 1 and 2. Rename 3 and 4 to not have with_options.
  3. Remove constructor set 2. Rename 3 and 4 to not have with_options. Keep constructor 1 as a compiled data infallible optimization.

And, should we take a LanguageIdentifier or a preferences bag for the content locale? (Please understand the discussion in #3284 before stating an opinion on this)

@zbraniecki @makotokato @aethanyc @Manishearth

@sffc sffc added C-segmentation Component: Segmentation needs-approval One or more stakeholders need to approve proposal labels Nov 22, 2024
@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Nov 22, 2024
@aethanyc
Copy link
Contributor

If we were to reduce the number of constructors, I'm leaning toward option 3 to remove constructor set 2. I feel it is not too terrible to ask the user to pass Default::default() as the option to the constructor.

Note: there is probably a naming inconsistency in LineSegmenter. We should prepend try_ to the method name in LineSegmenter::new_auto_with_options, LineSegmenter::new_dictionary_with_options, and LineSegmenter::new_lstm_with_options, because they all take LineBreakOptions which contains content_locale.

@sffc
Copy link
Member Author

sffc commented Nov 28, 2024

@aethanyc I lean toward option 3 a well, but I think we should take the opportunity to give the function ::new() a better name. This is also aligned with the discussion in #5554.

Some ideas:

  • new_root
  • new_invariant
  • new_without_options

@sffc
Copy link
Member Author

sffc commented Dec 10, 2024

@Manishearth @makotokato @zbraniecki, anything to add to the list above?

What should we name the segmenter constructor that takes no arguments, uses compiled data, is infallible, and uses Unicode data with no CLDR locale tailorings? (It is currently named ::new() but there are reasons we want to change it.)

@Manishearth
Copy link
Member

Not a huge fan of using the somewhat-CLDR-internal concept "root" in the name. new_untailored or new_no_tailoring? (tailoring is also a bit of jargon, but it's jargon in the segmentation space, not specific to CLDR. Unclear to me if that distinction actually matters)

@makotokato
Copy link
Member

I don't have strong opinions for this. Actually we have no plan to add new options. If we support auto-phase in css text, it can add it in css property.

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) discuss-priority Discuss at the next ICU4X meeting and removed needs-approval One or more stakeholders need to approve proposal labels Jan 7, 2025
@Manishearth
Copy link
Member

#5958 adds new_root or new_root_foo to all segmenters except Grapheme.

We can rename root to something else if we want, I am happy with root for now. We should perhaps get WG approval on the name.

We should also figure out what we want to do with Grapheme. Grapheme does not support any tailorings right now, but it could in the future? I'm inclined to let Grapheme have a new() and use the deprecation route if we need to split it.

@aethanyc
Copy link
Contributor

aethanyc commented Jan 8, 2025

We can rename root to something else if we want, I am happy with root for now. We should perhaps get WG approval on the name.

I like new_without_options proposed in #5856 (comment). Although it is a bit mouthful, but it clearly indicate that it accepts no options like WordBreakOptions or LineBreakOptions.

@Manishearth
Copy link
Member

without_options somewhat assumes that it's always with an Options type, whereas some of the APIs over FFI take a content_locale instead, so we'd need to use new_without_options on one type and new_without_content_locale on another. I don't quite like that.

But I don't really like root much either: it's CLDR jargon.

🤷‍♂️

@sffc
Copy link
Member Author

sffc commented Jan 9, 2025

new_with_default_options

@robertbastian robertbastian added discuss Discuss at a future ICU4X-SC meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Jan 9, 2025
@Manishearth
Copy link
Member

Start 10:28

  • @Manishearth Some segmenter constructors accept a content locale, which is fallible. Line and Word have the Dictionary/LSTM version. We previously decided that we generally want to use specific names for infallible constructors, not just ::new().

Currently: new() vs new_with_options() (Sentence) or new_auto()/new_lstm()/new_dictionary() vs new_auto_with_options()/etc (line, word).

Situation:

  • LineSegmenter: has auto/lstm/dictionary, Options has a content locale (and other flags) but it does not make it fallible
  • WordSegmenter: has auto/lstm/dictionary. Options has ONLY a content locale which makes it fallible
  • SentenceSegmenter: has just new. Options has ONLY a content locale with makes it fallible
  • GraphemeSegmenter: has just new. No Options/content locale.

Current suggestions:

  • new_root
  • new_invariant
  • new_without_content_locale
  • new_without_options
  • new_with_default_options
  • new_with_undefined_locale
  • new_with_und_locale

Issues with root: CLDR jargon, people don't think of languages as a tree

Issues with without_content_locale and without_options: Some APIs have more than just content locale in options, so we'd either need to pick different names or pick one and have it be inaccurate for one type

  • @robertbastian: I like without_options
  • @robertbastian invariant is a bit weird, und is a bit unclear what's inside
  • @Manishearth same complaint about und
  • @sffc There's no und option? There is with_und_locale
  • @hsivonen: I like root, not that bad to have CLDR jargon in ICU4X, und also feels like jargom
  • @sffc Line break options are infallible, right?
  • @Manishearth Yes. All the constructors are infallible
  • @sffc What we really care about is the locale; that is the thing that makes these constructors fallible. We could potentially have an options bag that doesn't contain locale and is not fallible.
  • @nekevss I like new_without_options
  • @hsivonen Is new_root semantically correct? It replaces new_auto, which is root plus special knowledge for 5 languages, right? I think "root" means "what's in the spec" and "auto" does that but with models for CJ/SEA locales.
  • @Manishearth I find "root" to be a bit better than "und" but nothing is really semantically valid here. We could potentially have content_locale passed separately and call the fns with_content_locale. I think I prefer LineSegmenter to have a single triplet of constructor...
  • @Manishearth current highest preference: Line has single triplet of constructors that always take options, infallible. Word/sentence have a pair of constructors, one which takes nothing, one which takes a content locale (NOT an options bag). Future options can be added with future apis.

Timebox +10min

  • @sffc The reason we ended up with this design deliberately is because we wanted to emphasize that the locale is a content locale and not a formatting locale. If we want to make it a free floating argument, then it makes things easier for us, except that it doesn't make it clear that it is a content locale. The other issue is that, in the options bag, it should still be discoverable. With an options bag, it's more centralized. new_with_content_locale should be the default thing, and new_without_content_locale would be the other one. Maybe those should be the 2 names.
  • @sffc I think we should have the options bag, even if it is empty. @zbraniecki has made the point before that we should be consistent across our constructors, and we should do the same here.

Manish's proposal:

  • LineSegmenter: single (triplet) of constructors, always takes options, separate Option<&LanguageIdentifier> parameter. new_with_content_locale
  • Word, Sentence: new_with_content_locale(options, content_locale), new_without_content_locale(options)

Shane's proposal:

  • Line, Word, Sentence: all have two constructors with_content_locale and without_content_locale
  • In Line, all constructors are infallible; in Word and Sentence, with_content_locale returns Result

Discussion:

  • @Manishearth The client may or may not know the content locale. They shouldn't have to conditionally call one constructor or the other.
  • @sffc I don't think we should have a Rust Option as a positional argument.

Conclusion: Revisit Later

@Manishearth
Copy link
Member

Discussion between @Manishearth and @sffc

  • @Manishearth IMO the intrinsic type of content_locale is Option<&LanguageIdentifier>, being unsure if your content has a particular locale is a very normal state to be in. All of these APIs should be taking Option. This is different from formatting locale where it should be known.
  • @sffc Since we don't have currnetly in all three segmenters we don't have 2 distinct sets of options, some fallible, some not, maybe we just stick with options making word/sentence be fallible. new_default (new_default, new_default_dictionary, new_default_lstm) and try_new. Line has a single API that's just new that takes Options.
  • @Manishearth: yeah, seems good.
  • @Manishearth in the future if we need to add more options we can add more APIs, like new_with_sentence_suppressions().
  • We add WordBreakInvariantOptions, WordBreakOptions, SentenceBreak(same thing), LineBreakOptions, Options types use InvariantOptions types, Invariant Options types are for fallible things. We have new_default and try_new. The API distinction becomes whether things are fallible.

Conclusion:

  • Line gets new_{auto,lstm,dictionary}(options: LineBreakOptions), infallible
  • Word gets new_{auto,lstm,dictionary}(options: WordBreakInvariantOptions) and try_new_{auto,lstm,dictionary}(options: WordBreakOptions)
  • Sentence gets new(SentenceBreakInvariantOptions) or try_new(SentenceBreakOptions)

FFI has LineBreakOptions but uses with_content_locale for the other two, as per our "expand single-element options bag over FFI" convention

Agreed: @sffc and @Manishearth

@sffc
Copy link
Member Author

sffc commented Jan 10, 2025

One more thought: maybe drop Sentence::new() to nudge people toward providing a content locale for that one, where it is probably more important than for the others (for sentence suppressions), and we can add it later if someone needs it.

Then the only type with two constructor flavors is Word.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation discuss-priority Discuss at the next ICU4X meeting S-small Size: One afternoon (small bug fix or enhancement)
Projects
None yet
Development

No branches or pull requests

5 participants