Re: [PHP-DEV] [RFC] Unicode Text Processing

Rowan Tommins Fri, 16 Dec 2022 03:22:11 -0800

On Fri, 16 Dec 2022 at 03:21, Tim Starling <tstarl...@wikimedia.org> wrote:


>
> I'm concerned about the time order of using grapheme offsets. For
> example, is subString() O(N) in $offset? If the idea is to be easy to
> use and performant, you don't want to have subtle algorithmic
> complexity traps.
>


This is a good point; it's certainly true of existing functions, like
grapheme_strlen(), and indeed mb_strlen(), which has to iterate over
variable-width code points.

Perhaps we could take advantage of having a stateful object and internally
optimise this in some way, such as caching a partial lookup table of
graphemes to byte offsets.

For instance, the table might look like this (grapheme index: byte offset,
with the final entry marking the end of the string):

10: 22
20: 50
30: 70
35: 82; LAST

Then $string->subString(23, 20) would:

* take a pointer to byte 50, the cached offset for grapheme 20 (the nearest
entry at or before grapheme 23)
* pass it to the ICU grapheme iterator to skip over 3 graphemes; let's say
that takes us to byte 58
* since 23 + 20 > 35, the rest of the string is included
* the new object could construct an offset table without examining the
string:

7: 12 (grapheme 30 - 23; byte 70 - 58)
12: 24; LAST (grapheme 35 - 23; byte 82 - 58)
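
To make the arithmetic concrete, here is a rough sketch in Python (purely
for illustration; a real implementation would live in C alongside ICU).
The names TABLE, nearest_checkpoint and rebase_table are made up for this
example, and byte 58 is the same assumed result of the ICU scan as above,
not something the code computes.

```python
from bisect import bisect_right

# Hypothetical checkpoint table: sorted (grapheme index, byte offset)
# pairs. The final pair marks the end of the string ("LAST" above).
TABLE = [(10, 22), (20, 50), (30, 70), (35, 82)]


def nearest_checkpoint(table, grapheme):
    """Return the last checkpoint at or before `grapheme`.

    Falls back to (0, 0), the start of the string, if there is none.
    """
    i = bisect_right(table, (grapheme, float("inf")))
    return table[i - 1] if i > 0 else (0, 0)


def rebase_table(table, start_grapheme, start_byte, length):
    """Derive a substring's table by subtraction, without reading text.

    Keeps the checkpoints that fall inside the substring and shifts
    them so that the substring starts at grapheme 0, byte 0. If the
    substring stops before the end of the parent string, the caller
    would still need to append its own end entry from a scan.
    """
    total_graphemes = table[-1][0]
    end_grapheme = min(start_grapheme + length, total_graphemes)
    return [
        (g - start_grapheme, b - start_byte)
        for g, b in table
        if start_grapheme < g <= end_grapheme
    ]


# subString(23, 20): start scanning from the checkpoint for grapheme 20.
checkpoint = nearest_checkpoint(TABLE, 23)

# Assumed: the grapheme iterator reports that grapheme 23 is at byte 58.
new_table = rebase_table(TABLE, 23, 58, 20)
```

The lookup is a binary search over the cached entries, so the only linear
work left is the short scan from the checkpoint to the requested offset.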


Whether this complexity would pay off in real-world scenarios, I don't
know, but if people started using this for all the text in an application,
I can see longer strings becoming a more common use case.

Regards,
-- 
Rowan Tommins
[IMSoP]
