On Fri, Sep 17, 2021 at 4:59 AM Tim Starling <tstarl...@wikimedia.org>
wrote:

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at <https://phabricator.wikimedia.org/T291234>. The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At <https://bugs.php.net/bug.php?id=67815>, Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.
>
> -- Tim Starling
>

We've been slowly moving away from locale-dependent functionality. Since
PHP 8 we no longer inherit any locales from the environment and have made
float to string conversion locale-independent.

I would very much support making strtolower() and friends a simple ASCII
case conversion operation. mb_strtolower() etc. already offer full
Unicode-compliant case conversions that work correctly with multi-byte
encodings. The locale-sensitivity of strtolower() only works with legacy
single-byte encodings and as such is of questionable usefulness even in
cases where it is not actively harmful.

That said, I do think this change requires an RFC.
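
To make the distinction concrete, here is a rough sketch of the strtr()
workaround from the quoted message next to mb_strtolower(). The helper name
ascii_strtolower() is hypothetical and not part of PHP. The mbstring call
assumes the extension is loaded, so it is guarded.

```php
<?php
// Hypothetical helper (not a PHP built-in): lowercases only the bytes A-Z,
// independent of whatever locale has been set with setlocale().
// Bytes >= 0x80 are left untouched, so UTF-8 sequences pass through intact.
function ascii_strtolower(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Pure ASCII input: the usual case for header names, identifiers, etc.
echo ascii_strtolower('CONTENT-TYPE'), "\n";   // content-type

// UTF-8 input: "\u{C9}" is U+00C9 (E with acute). Only the ASCII letters
// change; the non-ASCII character keeps its case and its bytes.
echo ascii_strtolower("\u{C9}COLE"), "\n";

// Full Unicode case conversion, if mbstring is available.
if (function_exists('mb_strtolower')) {
    echo mb_strtolower("\u{C9}COLE", 'UTF-8'), "\n";
}
```

The point of the sketch is only that the ASCII-only operation never touches
multi-byte sequences, while Unicode-aware lowercasing is a separate operation
that already has a home in mbstring.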

Regards,
Nikita
