[PHP-DEV] Suggestion: Make all PCRE functions return character offsets, rather than byte offsets if the modifier `u` (PCRE_UTF8) is given

Thomas Landauer Fri, 02 Oct 2020 03:02:28 -0700

Hi,

This is a follow-up to a bug report I opened; cmb suggested continuing
the discussion here: https://bugs.php.net/bug.php?id=80166


Advantages:

1: Easier string manipulation:
If somebody calls (as in my case) `preg_match_all()` with
PREG_OFFSET_CAPTURE, what will they probably use the returned
offsets for?
My answer: for *splitting the string* in one way or another. With
byte offsets, I can't do such basic things as adding `+1` to get to
the next character, or extracting exactly 3 characters.
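
To make the problem concrete, here is a minimal sketch. The sample string
is my own assumption, and the `mb_*` calls require the mbstring extension.
It shows that the offset reported with the `u` modifier is still a byte
offset, so it works with `substr()` but not with `mb_substr()`, and that
adding 1 to a byte offset can land in the middle of a character.

```php
<?php
// "ä" (U+00E4) and "ö" (U+00F6) each occupy two bytes in UTF-8.
$subject = "\u{e4}\u{f6}x";

// Even with the `u` modifier, PREG_OFFSET_CAPTURE reports a byte offset.
preg_match('/x/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// Byte-based functions accept this offset ...
$viaBytes = substr($subject, $byteOffset, 1);

// ... but character-based functions interpret it as a character index,
// which here points past the end of the 3-character string.
$viaChars = mb_substr($subject, $byteOffset, 1, 'UTF-8');

// "+1" from the start of the string does not reach the next character:
// it cuts "ä" in half and yields a lone continuation byte.
$broken = substr($subject, 0 + 1, 1);
$brokenIsValidUtf8 = mb_check_encoding($broken, 'UTF-8');
```

In this sketch `$byteOffset` is 4 although "x" is the third character
(character offset 2).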

2: Better performance:
This may sound odd, since cmb said the exact opposite ;-) (sequential
access vs. random access). However, if I need character offsets (see 1),
what can I do? I'm forced to add a userland workaround on top, e.g.
https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
is certainly far slower than a native implementation would be.
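
For illustration, one possible userland workaround looks roughly like
this. It is only a sketch: the function name `preg_match_all_char_offsets()`
is hypothetical, it is not necessarily what the linked comment does, and
it assumes the mbstring extension and a valid UTF-8 subject. Each byte
offset is converted by counting the characters in the prefix before it.

```php
<?php
// Hypothetical helper: like preg_match_all() with PREG_OFFSET_CAPTURE,
// but with every offset converted from bytes to characters.
function preg_match_all_char_offsets(string $pattern, string $subject): array
{
    $matches = [];
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($matches as &$set) {
        foreach ($set as &$group) {
            // $group is [matched text, byte offset]; unmatched groups have offset -1.
            if (is_array($group) && $group[1] >= 0) {
                // Count the characters in the prefix that ends at the byte offset.
                $group[1] = mb_strlen(substr($subject, 0, $group[1]), 'UTF-8');
            }
        }
        unset($group);
    }
    unset($set);

    return $matches;
}

// "äxöx": the two "x" are at byte offsets 2 and 5, character offsets 1 and 3.
$result = preg_match_all_char_offsets('/x/u', "\u{e4}x\u{f6}x");
```

Note that every conversion re-scans the subject from the beginning, so
the cost grows with both the number of matches and the subject length.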

3: Consistency with users' expectations:
The current behavior is causing confusion and is perceived as
counter-intuitive, see
https://www.php.net/manual/en/function.preg-match-all.php#61426 and
https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php

So I'm suggesting:

* Either accept the BC break, and just return character offsets if the
modifier `u` is given.
* Or create *new* functions for it: `mb_preg_match_all()` etc.

--

Cheers,
Thomas

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
