On 21/03/2025 11:17, Tim Düsterhus wrote:

> I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.


This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same things as rawurlencode, but for multi-byte strings", but that's not what is being proposed.


Notably, a similar feature is actually slated for removal; to quote https://www.php.net/manual/en/migration82.deprecated.php#migration82.deprecated.mbstring

> Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is deprecated for all MBString functions. Unlike all the other text encodings supported by MBString, these do not encode a sequence of Unicode codepoints, but rather a sequence of raw bytes. It is not clear what the correct return values for most MBString functions should be when one of these non-encodings is specified.

The same applies here: if you write mb_rawurlencode($my_string, 'SHIFT-JIS'), does that mean "convert what you can to ASCII, and percent-encode the rest for a URI", or does it mean "convert to UTF-8, and percent-encode as necessary for an IRI"? If the input contains sequences which are not valid SHIFT-JIS, are those bytes treated as unencodable (producing errors or substitution characters), or are they percent-encoded directly?
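
To make the ambiguity concrete, here is a rough sketch using only functions that exist today (rawurlencode() and mb_convert_encoding()). The sample string and variable names are mine, purely for illustration, and neither reading is necessarily what the proposal intends:

```php
<?php
// Illustrative input: the two characters U+65E5 U+672C, stored as Shift-JIS bytes.
$sjis = mb_convert_encoding("\u{65E5}\u{672C}", 'SJIS', 'UTF-8');

// Reading 1: percent-encode the Shift-JIS bytes exactly as they are.
$asRawBytes = rawurlencode($sjis);

// Reading 2: convert to UTF-8 first, then percent-encode.
$asUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

// The two readings give different strings for the same input.
var_dump($asRawBytes === $asUtf8);
```

Only the second result is something another system could decode without being told the source encoding out of band.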


> The correct solution to me is to build a proper thought-through API as part of the proposed new Uri namespace and not adding new standalone functions without a clear vision.


I completely agree.

For instance, the IRI standard does include an algorithm for converting a non-Unicode IRI representation to a URI - but it requires a Unicode Normalization step, which is a complex algorithm not included in ext/standard or ext/mbstring, only ext/intl. However, a function in the URI namespace that only handled the UTF-8 input case might still be useful.
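
As a minimal sketch of what the UTF-8-only case could look like: the function name below is hypothetical (it is not an existing or proposed API), and I am assuming that simply percent-encoding every non-ASCII byte is enough, which ignores things like IDNA handling of the host part:

```php
<?php
// Hypothetical helper, not part of PHP or of any RFC draft.
// Assumes the input is already UTF-8 and rejects it otherwise.
function utf8_iri_to_uri(string $iri): string
{
    // An empty pattern with the /u flag only matches if the subject is valid UTF-8.
    if (preg_match('//u', $iri) !== 1) {
        throw new ValueError('Input is not valid UTF-8');
    }

    // Percent-encode each byte of every non-ASCII character; leave ASCII untouched.
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static fn (array $m): string => sprintf('%%%02X', ord($m[0])),
        $iri
    );
}

echo utf8_iri_to_uri("https://example.org/caf\u{E9}"), "\n";
// prints https://example.org/caf%C3%A9
```

Note that this needs neither ext/mbstring nor ext/intl, precisely because it refuses to deal with any other input encoding.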


> Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:
>
> - `http_build_query()` would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
> - `parse_str()` would need a corresponding `mb_parse_str()`.


I haven't followed the other URI thread at all, but isn't replacing the scattered standard library functions with a consistent API the whole point of that effort?

parse_str() in particular has a non-descriptive name, and a weird function signature because it used to directly overwrite variables by name.
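
For anyone who hasn't used it recently, this is roughly what calling it looks like (the query string and variable name are just an example):

```php
<?php
// The function returns nothing; the parsed result arrives through the
// by-reference second argument, which has been mandatory since PHP 8.0.
parse_str('a=1&b[]=2&b[]=3', $query);

var_dump($query); // ['a' => '1', 'b' => ['2', '3']]
```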

As a comparison, we didn't extend the shuffle() function with an algorithm parameter, we added a shuffleArray() method to the new Randomizer class.
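
Side by side, the difference looks something like this (the seed and the sample array are arbitrary, chosen only for illustration):

```php
<?php
use Random\Engine\Mt19937;
use Random\Randomizer;

$items = [1, 2, 3, 4, 5];

// Old style: modifies the array in place, with no way to pick the algorithm per call.
$inPlace = $items;
shuffle($inPlace);

// New style (PHP 8.2+): the engine is chosen explicitly and a new array is returned.
$randomizer = new Randomizer(new Mt19937(1234));
$shuffled = $randomizer->shuffleArray($items);
```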


--
Rowan Tommins
[IMSoP]
