Hi Tim Starling,

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
>   escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and
>   also can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at <https://phabricator.wikimedia.org/T291234>. The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.
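
For anyone following along, the byte-level ASCII conversion being
proposed can be sketched in C roughly as follows. This is only an
illustration under my own assumptions: ascii_strtolower is a
hypothetical name, not an existing php-src function, and an actual
patch may well be structured differently.

```c
#include <stddef.h>

/* Hypothetical helper (not part of php-src): lowercase only the ASCII
 * letters A-Z in place. Every other byte, including all bytes >= 0x80
 * that make up multi-byte UTF-8 sequences, is left untouched. Nothing
 * here consults the C locale, so setlocale() cannot change the result. */
static void ascii_strtolower(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z') {
            /* 'a' - 'A' is 32 in ASCII. */
            s[i] = (char)(c + ('a' - 'A'));
        }
    }
}
```

Because only the range A-Z is remapped, 'I' always becomes 'i' (even
where tr_TR would be the active locale), and the bytes of a UTF-8
encoded character such as É pass through unchanged instead of being
mangled.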

I think it's a good idea (but it would still require an RFC).

As you said, the way it acts on bytes rather than codepoints seems like
it's almost always incorrect, except for a narrow range of rare charsets
such as https://en.wikipedia.org/wiki/ISO/IEC_8859-1

The behavior of strtolower is inconvenient for common uses in

- filesystem paths, where strtolower('I') isn't 'i' in tr_TR
- username validation, if it's possible to create a new account whose
  name is considered the same case-insensitive string as an existing
  one in some locales but not others
- etc.

When implementing this, note that Zend/Optimizer/sccp.c has
optimizations for functions such as str_contains. After removing locale
dependence, those optimizations could safely be added for the functions
that become locale-independent as a result of your change.

- This would allow eliminating more dead code, and would make code
  calling those functions (on constant arguments) faster by caching the
  resulting strings in opcache.

The function `zend_string_tolower` can safely be used to efficiently
convert strings to lowercase in a locale-independent way.
(zend_string_toupper hasn't been needed yet, since php-src's internals
have no use cases for it so far, but it could be added in such a PR.)

The existing checks in sccp.c mentioned above:

```
|| zend_string_equals_literal(name, "str_contains")
|| zend_string_equals_literal(name, "str_ends_with")
|| zend_string_equals_literal(name, "str_replace")
|| zend_string_equals_literal(name, "str_split")
|| zend_string_equals_literal(name, "str_starts_with")
```

Thanks,
Tyson