Is there any benchmark testing for segmenter wrt ICU4C WordBreakIterator? #2948
Replies: 2 comments 15 replies
-
ICU4X will likely be faster during data loading. However, note that our segmenter modules are experimental and we have not yet made the data fully zero-copy, so there will be some situations where it's not as fast as it could be. Once the segmenter modules are finished, we will likely always be faster than ICU4C on data loading, as we are in every other component. As for segmentation performance, I haven't measured it myself, but I think @zbraniecki has? I'll let the others give more specific answers, as they have more context.
-
@aethanyc @makotokato - do we understand why the ICU4X data is slightly larger than ICU4C's?
-
We currently use ICU4C WordBreakIterator for our use case. We recently came across ICU4X and thought of trying its wordBreakSegmenter. I compared the WordBreakSegmenter data sizes of ICU4C and ICU4X for SEA languages and found that ICU4X's data is slightly larger than ICU4C's, while your LSTM data appears to be much smaller.
So I have the following two questions: