On Sun, Jun 23, 2019 at 5:46 PM Nikita Popov <nikita....@gmail.com> wrote:
> On Sun, Jun 23, 2019 at 5:30 PM Ben Ramsey <b...@benramsey.com> wrote:
>
>> > On Jun 23, 2019, at 05:35, Rowan Collins <rowan.coll...@gmail.com> wrote:
>> >
>> > On 22 June 2019 20:56:24 BST, Ben Ramsey <b...@benramsey.com> wrote:
>> >> Perhaps it would only be an issue with the case-insensitive versions,
>> >> as Nikita points out? If so, can someone provide some example strings
>> >> where an mb_starts_with_ci() would return true, while
>> >> str_starts_with_ci() would return false?
>> >
>> >
>> > That's easy: any character that has a lower- and uppercase form, and is
>> > not represented as one byte in the target encoding. For that matter, any
>> > such character in the non-ASCII section of a single-byte encoding, since a
>> > non-mbstring case insensitive flag would presumably leave everything other
>> > than ASCII letters untouched.
>> >
>> > So, any non-Latin script, like Greek or Cyrillic; any accented
>> > characters, unless you're lucky and they're represented by ASCII-letter
>> > plus combining modifier; the Turkish "i", which if I remember rightly has
>> > three forms not two; and so on.
>>
>>
>> According to Google, “İyi akşamlar” is the Turkish phrase for “Good
>> evening” (Turkish speakers, please correct me, if this wrong). However,
>> using the existing mb_* functions, I can’t get mb_stripos() to return 0
>> when trying to see if the string “İYI AKŞAMLAR” begins with “i̇yi.”
>>
>> I’m just using UTF-8, so maybe there’s an encoding issue here?
>>
>> $string = 'İyi akşamlar';
>> $upper = mb_strtoupper($string);
>> $lowerChars = mb_strtolower(mb_substr($string, 0, 3));
>>
>> var_dump($string, $upper, $lowerChars);
>> var_dump(mb_stripos($upper, $lowerChars));
>>
>
> The reason why this doesn't work is that mb_stripos internally performs a
> simple case fold, while a full case fold would be needed in this case
> (Turkish i is hard). It's a bit tricky due to the need to remap character
> offsets.

I've implemented use of full case folding in
https://github.com/php/php-src/pull/4303. While doing that I kind of
convinced myself that we probably shouldn't actually do this, because it
breaks simple mb_stripos loops in a subtle way: a full case fold can change
the number of characters in a string, so offsets found in the folded string
no longer line up with the original. It probably makes more sense for people
to explicitly call mb_convert_case($string, MB_CASE_FOLD) and then operate on
the resulting strings. That is both much more efficient and avoids the offset
remapping issues.

Nikita
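
For illustration, here is a minimal sketch of the "fold explicitly, then compare" approach, applied to Ben's example. The helper name starts_with_folded() is made up for this sketch and is not part of PHP or of the linked PR. It assumes PHP 7.3 or later with ext/mbstring (where MB_CASE_FOLD is available) and UTF-8 input, and it does not attempt any Unicode normalization.

```php
<?php
// Hypothetical helper: case-insensitive prefix test that folds both
// strings up front instead of relying on mb_stripos().
// Assumes PHP >= 7.3 with ext/mbstring and UTF-8 encoded input.
function starts_with_folded(string $haystack, string $needle): bool
{
    // Fold both sides the same way so that they become directly comparable.
    $h = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
    $n = mb_convert_case($needle, MB_CASE_FOLD, 'UTF-8');

    // Once folded, a plain byte-wise prefix comparison is sufficient.
    return strncmp($h, $n, strlen($n)) === 0;
}

// Same inputs as in the quoted example.
$string = 'İyi akşamlar';
$upper = mb_strtoupper($string, 'UTF-8');
$lowerChars = mb_strtolower(mb_substr($string, 0, 3, 'UTF-8'), 'UTF-8');

var_dump(starts_with_folded($upper, $lowerChars));
```

Note that any offsets computed on the folded strings refer to the folded strings only. Mapping them back to positions in the original input is exactly the remapping problem described above, so this sketch only answers a yes/no question.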