On Fri, 16 Dec 2022 at 03:21, Tim Starling <tstarl...@wikimedia.org> wrote:
>
> I'm concerned about the time order of using grapheme offsets. For
> example, is subString() O(N) in $offset? If the idea is to be easy to
> use and performant, you don't want to have subtle algorithmic
> complexity traps.
>

This is a good point; it's certainly true of existing functions, like
grapheme_strlen(), and indeed mb_strlen(), which has to iterate variable
width code points.

Perhaps we could take advantage of having a stateful object and
internally optimise this in some way, such as caching a partial lookup
table of graphemes to byte offsets.

For instance, the table might look like this:

10: 22
20: 50
30: 70
35: 82; LAST

Then $string->subString(23, 20) would:

* take a pointer to byte 50
* pass it to the ICU grapheme iterator to skip over 3 graphemes; let's
  say that takes us to byte 58
* since 23 + 20 > 35, the rest of the string is included
* the new object could construct an offset table without examining the
  string:

7: 12 (grapheme 30 - 23; byte 70 - 58)
12: 24; LAST (grapheme 35 - 23; byte 82 - 58)

Whether this complexity would pay off in real-world scenarios, I don't
know, but if people started using this for all the text on an
application, I can see longer strings becoming a more common use case.

Regards,

-- 
Rowan Tommins
[IMSoP]
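
To make the lookup-table idea above concrete, here is a minimal PHP sketch. The class name IndexedString, its methods, and the stride of 10 are assumptions made purely for illustration, not part of any proposal. It relies on grapheme_extract() from ext/intl to skip forward from a checkpoint, and it assumes valid UTF-8 input.

```php
<?php
// Sketch only: IndexedString and its methods are hypothetical names,
// not part of PHP or of any RFC. Requires ext/intl and PHP 8.0+.

final class IndexedString
{
    /** @param array<int,int> $table grapheme index => byte offset, ascending keys */
    private function __construct(
        private string $bytes,
        private array $table,
        private int $length
    ) {}

    /** One full O(N) pass, recording every $stride-th grapheme boundary. */
    public static function fromString(string $s, int $stride = 10): self
    {
        $table = [0 => 0];
        $pos = 0;
        $count = 0;
        $end = strlen($s);
        while ($pos < $end) {
            $pos = self::skip($s, $pos, 1);
            $count++;
            if ($count % $stride === 0) {
                $table[$count] = $pos;
            }
        }
        $table[$count] = $end; // the "LAST" entry
        return new self($s, $table, $count);
    }

    /** Byte offset reached after skipping $n graphemes from byte $pos. */
    private static function skip(string $s, int $pos, int $n): int
    {
        if ($n <= 0 || $pos >= strlen($s)) {
            return min($pos, strlen($s));
        }
        $next = $pos;
        $chunk = grapheme_extract($s, $n, GRAPHEME_EXTR_COUNT, $pos, $next);
        if ($chunk === false || $next <= $pos) {
            throw new InvalidArgumentException('Input must be valid UTF-8');
        }
        return $next;
    }

    /** Jump to the nearest checkpoint at or before $grapheme, then iterate. */
    private function byteOffset(int $grapheme): int
    {
        $base = 0;
        foreach ($this->table as $g => $b) {
            if ($g > $grapheme) {
                break;
            }
            $base = $g;
        }
        return self::skip($this->bytes, $this->table[$base], $grapheme - $base);
    }

    public function subString(int $offset, int $length): self
    {
        $from = max(0, min($offset, $this->length));
        $to = max($from, min($from + $length, $this->length));
        $startByte = $this->byteOffset($from);
        $endByte = $this->byteOffset($to);

        // Derive the new table by subtraction, without re-scanning the string.
        $table = [0 => 0];
        foreach ($this->table as $g => $b) {
            if ($g > $from && $g < $to) {
                $table[$g - $from] = $b - $startByte;
            }
        }
        $table[$to - $from] = $endByte - $startByte;

        $slice = substr($this->bytes, $startByte, $endByte - $startByte);
        return new self($slice, $table, $to - $from);
    }

    public function length(): int { return $this->length; }
    public function table(): array { return $this->table; }
    public function __toString(): string { return $this->bytes; }
}

// "e" + U+0301 is one grapheme but three bytes: 35 graphemes, 105 bytes.
$s = IndexedString::fromString(str_repeat("e\u{0301}", 35));
$sub = $s->subString(23, 20);
echo $sub->length(), "\n";              // 12
echo json_encode($sub->table()), "\n";  // {"0":0,"7":21,"12":36}
```

In this sketch the checkpoint lookup is a linear scan over the table; a binary search would be the natural refinement. Either way, the grapheme iteration per call is bounded by the stride rather than by $offset, at the cost of one full pass when the object is first built.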