On 10/2/2020 at 12:01 PM, Thomas Landauer wrote:

> this is a follow-up of a bug I opened, and cmb suggested to continue
> here: https://bugs.php.net/bug.php?id=80166

Indeed, thanks!

> Advantages:
>
> 1: Easier string manipulation:
> If somebody does (as in my case) `preg_match_all()` with
> PREG_OFFSET_CAPTURE, what will they probably use those returned
> numbers/offsets for?
> My answer: For *splitting the string* - in some way or the other. Now,
> with byte offsets, I can't do such basic things as just `+1` to get to
> the next character. Or extract exactly 3 characters.

The term "character" is ambiguous with respect to Unicode. The mbstring
functions work on Unicode code points, so it's probably better to use
that term instead. While it is trivial to get the next code point using
index+1, this is not necessarily the next character as perceived by a
human. Using mb_substr(), you may even break "characters", e.g.
<https://3v4l.org/5geOr>.

> 2: Better performance:
> This may sound odd, since cmb said the exact opposite ;-) (sequential
> access vs. random access). However, if I need character offsets (see 1),
> what can I do? I'm forced to use some workaround on top - as e.g.
> https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
> is certainly way slower than any native implementation.

If mbstring functions are used to find some offset, they always have to
traverse the string from the beginning, even if you are just interested
in the last code point of a long string. If you have byte offsets, that
code point can be accessed directly.

Of course, that may not suit every possible scenario, but I still don't
think that the PCRE functions should deal with code point offsets
instead of byte offsets.

Regards,
Christoph

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php