The ability to receive the "character" offset would be extremely useful to the league/commonmark project. This project is a Markdown parser that conforms to the CommonMark spec, which defines all behavior in terms of Unicode code points: <https://spec.commonmark.org/0.29/#character>

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecke...@gmx.de> wrote:

> While it is trivial to get the next code point using index+1, this is
> not necessarily the next character, as perceived by a human. Using
> mb_substr(), you may even break "characters", e.g. <https://3v4l.org/5geOr>.

In my particular use case, this is entirely acceptable per the spec linked to above. Because the CommonMark spec is "character"-centric, we need to keep track of character positions within strings while parsing forwards, while also allowing regular expressions to be matched against UTF-8 strings.

As Thomas noted, using PREG_OFFSET_CAPTURE provides us with the byte offset, not the "character" offset. We therefore must do additional work to calculate the latter from the former:

    $offset = \mb_strlen(\substr($subject, 0, $matches[0][1]), 'UTF-8');

This code is frequently executed and therefore leads to worse performance than if preg_match() could simply return the offsets we need.

Would I be correct in assuming that preg_match() already has some knowledge or awareness about codepoints / "characters" when matching against UTF-8 strings and capturing offsets? If so, I think it would be very beneficial to provide that information to userland to avoid unnecessary re-calculations.

I'd therefore like to propose a third alternative option: a new flag like PREG_OFFSET_CODEPOINT. When used in combination with PREG_OFFSET_CAPTURE, it would return the offset position in terms of "characters", not bytes. This could also be used to interpret any $offset argument as "characters" instead of bytes.

The reason I prefer this option is that it doesn't break BC and is entirely opt-in. If a developer wants this behavior and understands the implications, they can use it. Nobody else is affected.

On Fri, Oct 2, 2020 at 8:25 AM Christoph M. Becker <cmbecke...@gmx.de> wrote:

> If mbstring functions are used to find some offset, they always have to
> traverse the string from the beginning, even if you are just interested
> in the last code point of a long string. If you have byte offsets, that
> code point can be accessed directly. Of course, that may not suit any
> possible scenario, but I still don't think that the PCRE functions
> should deal with code point offset instead of byte offsets.

I'll admit that I don't have the best understanding of how PCRE works under the hood, but I do believe that because it offers some functionality for working with codepoints, having it also work with codepoint-based offsets seems like a natural extension. And while it may not be the optimal or most common way of working with strings, I do believe there are some valid use cases for it. If placing this within PCRE violates some principles of the library, then I'd be okay placing similar functionality elsewhere.

-- 
Colin O'Dell
colinod...@gmail.com
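
P.S. For anyone who wants to see the workaround in context, here is a minimal sketch of the byte-to-code-point conversion mentioned above. The helper name matchWithCodePointOffset() is made up for illustration and is not part of league/commonmark or PHP itself. It assumes the mbstring extension is loaded and that the subject is valid UTF-8.

```php
<?php
// Hypothetical helper, for illustration only: runs preg_match() with
// PREG_OFFSET_CAPTURE and converts the byte offset of the whole match
// into a code point offset.
function matchWithCodePointOffset(string $pattern, string $subject): ?array
{
    if (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE) !== 1) {
        return null; // no match, or a PCRE error
    }

    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    [$text, $byteOffset] = $matches[0];

    // Count the code points in the bytes that precede the match.
    $codePointOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

    return ['text' => $text, 'bytes' => $byteOffset, 'codePoints' => $codePointOffset];
}

// U+00E9 (e with acute accent) occupies 2 bytes in UTF-8, so the match
// starts at byte 6 but at code point 3.
$result = matchWithCodePointOffset('/b+/u', "\u{E9}\u{E9}\u{E9}bb");
var_dump($result);
```

The substr() + mb_strlen() step has to walk every byte before the match, so its cost grows with the offset. That is the repeated work an opt-in flag would let us skip.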