[PHP-DEV] Re: Suggestion: Make all PCRE functions return character offsets, rather than byte offsets if the modifier `u` (PCRE_UTF8) is given

Christoph M. Becker Fri, 02 Oct 2020 05:26:28 -0700

On 10/2/2020 at 12:01 PM, Thomas Landauer wrote:

> this is a follow-up of a bug I opened, and cmb suggested to continue
> here: https://bugs.php.net/bug.php?id=80166


Indeed, thanks!

> Advantages:
>
> 1: Easier string manipulation:
> If somebody does (as in my case) `preg_match_all()` with
> PREG_OFFSET_CAPTURE, what will they probably use those returned
> numbers/offsets for?
> My answer: For *splitting the string* - in some way or the other. Now,
> with byte offsets, I can't do such basic things as just `+1` to get to
> the next character. Or extract exactly 3 characters.

The term "character" is ambiguous with respect to Unicode.  The mbstring
functions work on Unicode code points, so it's probably better to use
that term instead.

While it is trivial to get the next code point using index+1, this is
not necessarily the next character, as perceived by a human.  Using
mb_substr(), you may even break "characters", e.g. <https://3v4l.org/5geOr>.
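
As a rough illustration of the difference between a code point and a
perceived character, here is a minimal sketch.  It assumes the mbstring
extension is loaded; the sample string is made up for this example and
is not the one from the 3v4l link.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single
// perceived character ("é"), but consists of two code points.
// The trailing "a" is a third code point.
$s = "e\u{0301}a";

$bytes      = strlen($s);              // byte length of the UTF-8 string
$codePoints = mb_strlen($s, 'UTF-8');  // number of code points

// Taking "one character" with mb_substr() takes one code point,
// which separates the base letter from its combining accent.
$first = mb_substr($s, 0, 1, 'UTF-8');

var_dump($bytes, $codePoints, $first === "e");
```

Here `$first` is the bare letter "e"; the accent stays behind in the
rest of the string, so index+1 did not land on a boundary a human would
recognise.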

> 2: Better performance:
> This may sound odd, since cmb said the exact opposite ;-) (sequential
> access vs. random access). However, if I need character offsets (see 1),
> what can I do? I'm forced to use some workaround on top - as e.g.
> https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
> is certainly way slower than any native implementation.

If mbstring functions are used to find some offset, they always have to
traverse the string from the beginning, even if you are just interested
in the last code point of a long string.  If you have byte offsets, that
code point can be accessed directly.  Of course, that may not suit every
possible scenario, but I still don't think that the PCRE functions
should deal with code point offsets instead of byte offsets.
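
For what it's worth, a byte offset can be used directly with the plain
string functions, and converted to a code point offset only when that
is really needed.  The following is a minimal sketch, not a recommended
implementation; it assumes the mbstring extension is loaded, and the
subject string and variable names are invented for illustration.

```php
<?php
// Subject is "äöü €uro", written with escapes so the example does not
// depend on the encoding of the source file.
$subject = "\u{E4}\u{F6}\u{FC} \u{20AC}uro";

// With PREG_OFFSET_CAPTURE, each match is [matched text, byte offset].
preg_match("/\u{20AC}/u", $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// The byte offset works directly with substr(); no scan is needed.
$fromMatch  = substr($subject, $byteOffset);
$afterMatch = substr($subject, $byteOffset + strlen($m[0][0]));

// If a code point offset is wanted, count the code points in the
// prefix that precedes the match.
$codePointOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

var_dump($byteOffset, $codePointOffset, $fromMatch, $afterMatch);
```

Adding the byte length of the match to the byte offset always lands on
a code point boundary, which covers the common "continue after the
match" case without any mbstring call.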

Regards,
Christoph

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
