Hi Tim Starling,
 
> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
> 
> The situation with case conversion is inconsistent.
> 
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
> 
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
> 
> I would make them all do ASCII case conversion.
> 
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
> 
> The existing options for ASCII case conversion are:
> 
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
> 
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
> 
> * Use strtr(). But this is ugly and slow.
> 
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
> 
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
> 
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at <https://phabricator.wikimedia.org/T291234>. The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
> 
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.

I think it's a good idea, though it would still require an RFC.
As you said, acting on bytes rather than code points is almost always 
incorrect outside a narrow range of cases, such as legacy single-byte charsets 
like ISO/IEC 8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

The behavior of strtolower is inconvenient for common uses such as
- filesystem paths, where strtolower('I') isn't 'i' in tr_TR
- username validation, where two names may compare as equal 
case-insensitively in some locales but not in others, making it possible to 
create a duplicate account
- etc.
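
To illustrate the kind of comparison those use cases want, here is a rough 
sketch in C of an ASCII-only case-insensitive equality check. The names 
ascii_lower and ascii_equals_ci are hypothetical helpers made up for this 
example (they are not php-src functions); the point is only that the result 
never depends on setlocale().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: lowercase one byte using only the ASCII A-Z range.
 * Every other byte (including all bytes >= 0x80) is returned unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: ASCII case-insensitive equality of two
 * NUL-terminated strings. It never consults the current locale. */
static bool ascii_equals_ci(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b)) {
            return false;
        }
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

With this, "INDEX.PHP" and "index.php" compare as equal under any locale, 
including tr_TR.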

When implementing this, note that Zend/Optimizer/sccp.c already has 
optimizations for calls to functions such as str_contains.
After removing locale dependence, those optimizations could safely be extended 
to the functions that become locale-independent as a result of your change.
- This would allow eliminating more dead code, and would make code calling those 
functions (on constant arguments) faster by caching the resulting strings in 
opcache.

The function `zend_string_tolower` can safely be used to efficiently convert 
strings to lowercase in a locale-independent way.
(zend_string_toupper hasn't been added yet because nothing in php-src's 
internals has needed it, but it could be added in such a PR)
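
For readers unfamiliar with what locale-independent lowercasing amounts to, 
here is a minimal sketch in C. It is not the php-src implementation, and 
ascii_strtolower_inplace is a hypothetical name used only for this example. 
It takes an explicit length so that embedded NUL bytes are handled, and it 
leaves bytes >= 0x80 alone so UTF-8 sequences are not mangled.

```c
#include <stddef.h>

/* Hypothetical helper, not the php-src implementation: lowercase `len`
 * bytes of `s` in place, changing only ASCII 'A'..'Z'. */
static void ascii_strtolower_inplace(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z') {
            /* In ASCII, upper and lower case letters differ only in bit 0x20. */
            s[i] = (char)(c | 0x20);
        }
    }
}
```

Because the output depends only on the input bytes, a function like this is 
the sort of thing that could be evaluated once on constant arguments.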

```
/* Zend/Optimizer/sccp.c (excerpt) */
            || zend_string_equals_literal(name, "str_contains")
            || zend_string_equals_literal(name, "str_ends_with")
            || zend_string_equals_literal(name, "str_replace")
            || zend_string_equals_literal(name, "str_split")
            || zend_string_equals_literal(name, "str_starts_with")
```

Thanks,
Tyson
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
