On Fri, 7 Jul 2023 20:23:17 GMT, Michael Strauß <mstra...@openjdk.org> wrote:
>> From https://datatracker.ietf.org/doc/html/rfc3986#page-11 >> >> >> Therefore, the >> >> >> >> >> >> Berners-Lee, et al. Standards Track [Page 11] >> >> [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986) >> URI Generic Syntax January 2005 >> >> >> integer values used by the ABNF must be mapped back to their >> corresponding characters via US-ASCII in order to complete the syntax >> rules. > >> I wonder if this is all necessary. The data is supposed to be url-encoded, >> so it's essentially ASCII, no? >> >> passing default charset to getBytes() is not right, it probably should be >> >> URLDecoder.decode(data.replace("+", "%2B"), >> charset).getBytes(StandardCharsets.US_ASCII)); >> >> or am I missing something? > > The payload of a data URI is just a sequence of bytes, not characters. Only > when the numeric value of a byte, assuming ASCII encoding, is a *safe URL > character*, it is left as-is; otherwise percent-encoding is used to encode > the byte value. The > [specification](https://datatracker.ietf.org/doc/html/rfc2397) points out: > > Without ";base64", the data (as a sequence of octets) is represented using > ASCII encoding for octets inside the range of safe URL characters and using > the standard %xx hex encoding of URLs for octets outside that range. > > > Decoding the payload back to a byte array is done by simply converting each > assumed ASCII character to its numeric value, and decoding percent-encoded > bytes to their hex value. Note that the assumed ASCII encoding only refers to > the URL, but not to the payload. The payload is not a string, and it doesn't > contain characters; it's a sequence of bytes. > > `URLDecoder` is not a general-purpose class to decode a percent-encoded > sequence of bytes. It's specifically meant to take a HTML forms string and > decode it into a string with some defined charset, using additional rules > that don't generally apply to percent-encoded byte sequences. For example, a > space character is encoded as `+` (that's where the kind-of-hacky > `data.replace("+", "%2B")` comes from). > > Using `URLDecoder` kind of works (if we use a sufficiently rich charset for > both `URLDecoder.decode` and `String.getBytes`), but only by accident. While > it accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says: > > There are two possible ways in which this decoder could deal with illegal > strings. > It could either leave illegal characters alone or it could throw an > IllegalArgumentException. > Which approach the decoder takes is left to the implementation. thank you for clarifications! your approach does make sense. ------------- PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256454118