Re: [PHP-DEV] Re: Suggestion: Make all PCRE functions return character offsets, rather than byte offsets if the modifier `u` (PCRE_UTF8) is given

Colin O'Dell Fri, 02 Oct 2020 06:02:48 -0700

The ability to receive the "character" offset would be extremely useful to
the league/commonmark project.  This project is a Markdown parser which
conforms to the CommonMark spec which defines all behavior with regards to
Unicode code points: <https://spec.commonmark.org/0.29/#character>

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecke...@gmx.de>
wrote:

> While it is trivial to get the next code point using index+1, this is
> not necessarily the next character, as perceived by a human.  Using
> mb_substr(), you may even break "characters", e.g.
> <https://3v4l.org/5geOr>.
>

In my particular use case, this is entirely acceptable per the spec linked
to above.

Because the CommonMark spec is "character"-centric, we need to keep track of
character positions within strings while parsing forwards, while still
allowing regular expressions to be matched against UTF-8 strings.  As Thomas
noted, PREG_OFFSET_CAPTURE gives us the byte offset, not the "character"
offset.  We therefore must do additional work to calculate the latter from
the former:

    $offset = \mb_strlen(\substr($subject, 0, $matches[0][1]), 'UTF-8');

This code is frequently executed and therefore leads to worse performance
than if preg_match() could simply return the offsets we need.
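
To make the extra work concrete, here is a minimal, self-contained sketch of
that conversion.  The helper name matchCodePointOffset() is made up for this
example and is not part of league/commonmark; it assumes the mbstring
extension is loaded.

```php
<?php
// Hypothetical helper for illustration only: runs preg_match() and converts
// the byte offset reported by PREG_OFFSET_CAPTURE into a code point offset.
function matchCodePointOffset(string $pattern, string $subject): ?int
{
    if (\preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE) !== 1) {
        return null; // no match (or an error)
    }

    $byteOffset = $matches[0][1];

    // Count the code points in the prefix that precedes the match.
    // This re-scans the prefix on every call, which is the overhead
    // described above.
    return \mb_strlen(\substr($subject, 0, $byteOffset), 'UTF-8');
}

// "é" and "ö" each occupy two bytes in UTF-8, so the match starts at
// byte offset 7 but at code point offset 6.
$subject = "h\u{e9}llo w\u{f6}rld";
$offset  = matchCodePointOffset('/w.rld/u', $subject);
```

Each call costs time proportional to the length of the prefix, so repeated
matches against a long subject add up.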

Would I be correct in assuming that preg_match() already has some knowledge
or awareness about codepoints / "characters" when matching against UTF-8
strings and capturing offsets?  If so, I think it would be very beneficial
to provide that information to userland to avoid unnecessary
re-calculations.

I'd therefore like to propose a third option: a new flag like
PREG_OFFSET_CODEPOINT.  When used in combination with PREG_OFFSET_CAPTURE,
it would return the offsets in terms of "characters" (code points), not
bytes.  The same flag could also cause any $offset argument to be
interpreted as "characters" instead of bytes.
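
For illustration, the intended semantics can be roughly emulated in userland
today.  This is only a sketch: PREG_OFFSET_CODEPOINT does not exist, the
function name preg_match_codepoints() is invented for this example, and it
assumes the mbstring extension is loaded.  A native flag would presumably
avoid the repeated prefix scans done here.

```php
<?php
// Hypothetical emulation of the proposed behavior, for illustration only.
// $cpOffset is the start position in code points, and the offsets written
// to $matches are code point offsets rather than byte offsets.
function preg_match_codepoints(string $pattern, string $subject, ?array &$matches = null, int $cpOffset = 0)
{
    // Convert the code point start position into the byte offset that
    // preg_match() expects.
    $byteStart = \strlen(\mb_substr($subject, 0, $cpOffset, 'UTF-8'));

    $result = \preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $byteStart);
    if ($result !== 1) {
        return $result;
    }

    foreach ($matches as $i => $match) {
        // Groups that did not participate are reported with offset -1;
        // leave those untouched.
        if ($match[1] >= 0) {
            $matches[$i][1] = \mb_strlen(\substr($subject, 0, $match[1]), 'UTF-8');
        }
    }

    return $result;
}

// Start searching at code point 7, i.e. after the "w" of the first "wörld".
$subject = "h\u{e9}llo w\u{f6}rld w\u{f6}rld";
preg_match_codepoints('/w.rld/u', $subject, $m, 7);
// $m[0][1] is now the code point offset of the second "wörld".
```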

The reason I prefer this option is that it doesn't break BC and is entirely
opt-in.  A developer who wants this behavior and understands the
implications can use it; nobody else is affected.

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecke...@gmx.de>
wrote:

> If mbstring functions are used to find some offset, they always have to
> traverse the string from the beginning, even if you are just interested
> in the last code point of a long string.  If you have byte offsets, that
> code point can be accessed directly.  Of course, that may not suit any
> possible scenario, but I still don't think that the PCRE functions
> should deal with code point offset instead of byte offsets.
>

I'll admit that I don't have the best understanding of how PCRE works
under the hood, but because it already offers some functionality for
working with code points, having it also work with code-point-based offsets
seems like a natural extension.  And while it may not be the most efficient
or common way of working with strings, I do believe there are some valid
use cases for it.  If placing this within PCRE violates some principles of
the library, then I'd be okay with placing similar functionality elsewhere.

-- 
Colin O'Dell
colinod...@gmail.com
