On 21/03/2025 11:17, Tim Düsterhus wrote:
> I am not sure if that signature makes sense and if the proposed
> functionality fits into mbstring for that reason. IRIs are defined as
> UTF-8, any other encoding results in invalid output / results that are
> not interoperable.
This confirms a nagging feeling I had when I first saw the thread: the
name "mb_rawurlencode" implies "do the same things as rawurlencode, but
for multi-byte strings", but that's not what is being proposed.
Notably, a similar feature is actually slated for removal; to quote
https://www.php.net/manual/en/migration82.deprecated.php#migration82.deprecated.mbstring
> Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text
> encodings' is deprecated for all MBString functions. Unlike all the
> other text encodings supported by MBString, these do not encode a
> sequence of Unicode codepoints, but rather a sequence of raw bytes. It
> is not clear what the correct return values for most MBString functions
> should be when one of these non-encodings is specified.
The same applies here: if you write mb_rawurlencode($my_string,
'SHIFT-JIS'), does that mean convert what you can to ASCII, and percent
encode the rest for a URI; or does it mean convert to UTF-8, and percent
encode as necessary for an IRI? If the input contains sequences which
are not valid SHIFT-JIS, are those bytes treated as unencodable
(producing errors or substitution characters), or are they directly
percent encoded?
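
To make the first two readings concrete, here is a rough sketch using only functions that exist today (rawurlencode() and mb_convert_encoding()). The sample input is my own assumption: the two characters "日本" as Shift_JIS bytes. Neither output is "the" proposed behaviour; the point is only that they differ.

```php
<?php
// Assumed sample input: "日本" as Shift_JIS bytes (0x93FA 0x967B).
$sjis = "\x93\xFA\x96\x7B";

// Reading 1: percent-encode the Shift_JIS bytes exactly as they are.
$rawBytes = rawurlencode($sjis);

// Reading 2: transcode to UTF-8 first, then percent-encode the UTF-8 bytes.
$viaUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

echo $rawBytes, "\n"; // %93%FA%96%7B
echo $viaUtf8, "\n";  // %E6%97%A5%E6%9C%AC
```

Only the second output means the same thing to a consumer that assumes UTF-8, which is the interoperability problem Tim describes.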
> The correct solution to me is to build a proper thought-through API as
> part of the proposed new Uri namespace and not adding new standalone
> functions without a clear vision.
I completely agree.
For instance, the IRI standard does include an algorithm for converting
a non-Unicode IRI representation to a URI - but it requires a Unicode
Normalization step, which is a complex algorithm not included in
ext/standard or ext/mbstring, only ext/intl. However, a function in the
URI namespace that only handled the UTF-8 input case might still be useful.
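
As a minimal sketch of what I mean by "UTF-8 only": the function name below is hypothetical, not something that exists or has been proposed, and I am assuming it would reject invalid input rather than guess. Because the input is already Unicode, this sketch does no normalization at all.

```php
<?php
/**
 * Hypothetical helper, for illustration only: percent-encode a single
 * component that is already UTF-8, refusing anything else.
 */
function encode_utf8_uri_component(string $value): string
{
    // preg_match() with the /u modifier fails (returns false) when the
    // subject is not valid UTF-8, so this works without ext/mbstring.
    if (preg_match('//u', $value) !== 1) {
        throw new ValueError('Input must be valid UTF-8');
    }

    // rawurlencode() percent-encodes every byte outside the RFC 3986
    // unreserved set, including each byte of a multi-byte character.
    return rawurlencode($value);
}

echo encode_utf8_uri_component("caf\u{E9}"), "\n"; // caf%C3%A9
```

Anything that needs to start from another encoding would have to transcode (and, as I read the RFC, normalize) before calling something like this.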
> Along those lines, I think there might need to be two additional
> changes/additions to help with encoding for RFC 3987 and WHATWG-URL component
> values:
> - `http_build_query()` would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and
> corresponding logic (or entirely new functions); and
> - `parse_str()` would need a corresponding `mb_parse_str()`.
I haven't followed the other URI thread at all, but isn't replacing the
scattered standard library functions with a consistent API the whole
point of that effort?
parse_str() in particular has a non-descriptive name, and a weird
function signature because it used to directly overwrite variables by name.
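
For anyone who hasn't used them recently, this is roughly how the existing functions behave today. The input values are made up for illustration; the two PHP_QUERY_* constants shown are the ones that already exist, not the ones suggested above.

```php
<?php
// parse_str() returns nothing: the result lands in the by-reference
// second argument, a leftover from when it populated variables directly.
parse_str('user.name=Ann&tag[]=a&tag[]=b', $parsed);

// Dots in parameter names are rewritten to underscores, so the first
// key comes back as 'user_name', and 'tag' becomes a nested array.
var_export($parsed);
echo "\n";

// http_build_query() already selects its encoding via a flag argument.
$data = ['q' => 'a b'];
$rfc1738 = http_build_query($data, '', '&', PHP_QUERY_RFC1738);
$rfc3986 = http_build_query($data, '', '&', PHP_QUERY_RFC3986);

echo $rfc1738, "\n"; // q=a+b
echo $rfc3986, "\n"; // q=a%20b
```

Adding more flags or an mb_ variant would keep all of these quirks and add new ones on top.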
As a comparison, we didn't extend the shuffle() function with an
algorithm parameter, we added a shuffleArray() method to the new
Randomizer class.
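
A short sketch of that design, assuming PHP 8.2 or later; the seed and the input array are arbitrary values I picked. The algorithm is chosen by the engine object passed to the constructor, not by an extra parameter on the shuffling call.

```php
<?php
use Random\Engine\Mt19937;
use Random\Randomizer;

// The engine (here a seeded Mt19937) decides how randomness is generated.
$randomizer = new Randomizer(new Mt19937(1234));

// shuffleArray() returns a new, shuffled array; the input is not modified.
$input = [1, 2, 3, 4, 5];
$shuffled = $randomizer->shuffleArray($input);

echo implode(',', $shuffled), "\n";
```

I would expect a URI/IRI encoding API to look more like this than like another flag on http_build_query().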
--
Rowan Tommins
[IMSoP]