On Sun, Jun 23, 2019 at 5:46 PM Nikita Popov <nikita....@gmail.com> wrote:

> On Sun, Jun 23, 2019 at 5:30 PM Ben Ramsey <b...@benramsey.com> wrote:
>
>> > On Jun 23, 2019, at 05:35, Rowan Collins <rowan.coll...@gmail.com>
>> wrote:
>> >
>> > On 22 June 2019 20:56:24 BST, Ben Ramsey <b...@benramsey.com> wrote:
>> >> Perhaps it would only be an issue with the case-insensitive versions,
>> >> as Nikita points out? If so, can someone provide some example strings
>> >> where an mb_starts_with_ci() would return true, while
>> >> str_starts_with_ci() would return false?
>> >
>> >
>> > That's easy: any character that has a lower- and uppercase form, and is
>> not represented as one byte in the target encoding. For that matter, any
>> such character in the non-ASCII section of a single-byte encoding, since a
>> non-mbstring case insensitive flag would presumably leave everything other
>> than ASCII letters untouched.
>> >
>> > So, any non-Latin script, like Greek or Cyrillic; any accented
>> characters, unless you're lucky and they're represented by ASCII-letter
>> plus combining modifier; the Turkish "i", which if I remember rightly has
>> three forms not two; and so on.
>>
>>
>> According to Google, “İyi akşamlar” is the Turkish phrase for “Good
>> evening” (Turkish speakers, please correct me if this is wrong). However,
>> using the existing mb_* functions, I can’t get mb_stripos() to return 0
>> when trying to see if the string “İYI AKŞAMLAR” begins with “i̇yi”.
>>
>> I’m just using UTF-8, so maybe there’s an encoding issue here?
>>
>> $string = 'İyi akşamlar';
>> $upper = mb_strtoupper($string);
>> $lowerChars = mb_strtolower(mb_substr($string, 0, 3));
>>
>> var_dump($string, $upper, $lowerChars);
>> var_dump(mb_stripos($upper, $lowerChars));
>>
>
> The reason why this doesn't work is that mb_stripos internally performs a
> simple case fold, while a full case fold would be needed in this case
> (Turkish i is hard). It's a bit tricky due to the need to remap character
> offsets.
>

I've implemented use of full case folding in
https://github.com/php/php-src/pull/4303. While doing that I kind of
convinced myself that we probably shouldn't actually do this, because it
breaks simple mb_stripos loops in a subtle way. It probably makes more
sense for people to explicitly call mb_convert_case($string, MB_CASE_FOLD)
and then operate on the resulting strings. That is both much more efficient
and avoids the offset remapping issues.
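
A minimal sketch of that explicit-folding approach, assuming PHP 7.3 or
later (where MB_CASE_FOLD is available) and UTF-8 input. The helper name
starts_with_folded() is hypothetical; it is not part of PHP or of the pull
request above. The sketch also prints the code point counts before and
after folding, since a change in length is what makes offsets in the folded
string differ from offsets in the original.

```php
<?php
// Hypothetical helper (not part of PHP or of the linked pull request):
// case-insensitive prefix test built on explicit full case folding.
function starts_with_folded(string $haystack, string $needle): bool
{
    // MB_CASE_FOLD applies full case folding, so one code point may
    // expand into several, e.g. "ß" becomes "ss".
    $h = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
    $n = mb_convert_case($needle, MB_CASE_FOLD, 'UTF-8');

    // Both strings are already folded, so a plain byte-wise prefix
    // comparison is enough; no case-insensitive search is needed.
    return strncmp($h, $n, strlen($n)) === 0;
}

$upper  = 'İYI AKŞAMLAR';
$folded = mb_convert_case($upper, MB_CASE_FOLD, 'UTF-8');

// Folding can change the number of code points, which is why an offset
// found in the folded string does not map directly onto the original.
var_dump(mb_strlen($upper, 'UTF-8'), mb_strlen($folded, 'UTF-8'));

// The needle is written as "i" followed by U+0307 (combining dot above).
var_dump(starts_with_folded($upper, "i\u{0307}yi"));
```

Note that this only folds case. It does not normalize, so strings that
differ in composed versus decomposed form would still need a separate
normalization step.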

Nikita
