On Fri, Sep 17, 2021 at 4:59 AM Tim Starling <tstarl...@wikimedia.org> wrote:
> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at <https://phabricator.wikimedia.org/T291234>. The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.
>
> -- Tim Starling
>

We've been slowly moving away from locale-dependent functionality. Since
PHP 8 we no longer inherit any locales from the environment and have made
float to string conversion locale-independent.

I would very much support making strtolower() and friends a simple ASCII
case conversion operation. mb_strtolower() etc already offer full
Unicode-compliant case conversions that work correctly with multi-byte
encodings. The locale-sensitivity of strtolower() only works with legacy
single-byte encodings and as such is of questionable usefulness even in
cases where it is not actively harmful.

That said, I do think this change requires an RFC.

Regards,
Nikita
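
For reference, the strtr() workaround listed in the quoted message can be sketched roughly as below. This is only an illustration of the locale-independent behaviour under discussion, not a proposed implementation: the function names ascii_strtolower() and ascii_strtoupper() are hypothetical and do not exist in PHP.

```php
<?php
// Hypothetical userland helpers (not part of PHP): ASCII-only case
// conversion via strtr(), independent of whatever setlocale() was called with.

function ascii_strtolower(string $s): string
{
    // Only the bytes A-Z are mapped. Every other byte, including the bytes
    // of multi-byte UTF-8 sequences, is passed through unchanged.
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

function ascii_strtoupper(string $s): string
{
    // Mirror image of the above: only the bytes a-z are mapped.
    return strtr($s, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
}
```

With these, ascii_strtolower('TITLE') gives 'title' in any locale, including the Turkish ones, and non-ASCII bytes in a UTF-8 string are left alone rather than mangled.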