Re: RFR: 8311216: DataURI can lose information in some charset environments

Andy Goryachev Fri, 07 Jul 2023 13:51:56 -0700

On Fri, 7 Jul 2023 20:23:17 GMT, Michael Strauß <mstra...@openjdk.org> wrote:


>> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
>> 
>>    Therefore, the integer values used by the ABNF must be mapped back to
>>    their corresponding characters via US-ASCII in order to complete the
>>    syntax rules.
>
>> I wonder if this is all necessary. The data is supposed to be url-encoded, 
>> so it's essentially ASCII, no?
>> 
>> Passing the default charset to getBytes() is not right; it should probably be
>> 
>> URLDecoder.decode(data.replace("+", "%2B"), 
>> charset).getBytes(StandardCharsets.US_ASCII);
>> 
>> or am I missing something?
> 
> The payload of a data URI is just a sequence of bytes, not characters. A byte 
> is left as-is only when its numeric value, interpreted as ASCII, is a *safe 
> URL character*; otherwise the byte value is percent-encoded. The 
> [specification](https://datatracker.ietf.org/doc/html/rfc2397) points out:
> 
> Without ";base64", the data (as a sequence of octets) is represented using
> ASCII encoding for octets inside the range of safe URL characters and using
> the standard %xx hex encoding of URLs for octets outside that range.
> 
> 
> Decoding the payload back to a byte array is done by simply converting each 
> assumed ASCII character to its numeric value, and decoding each `%xx` escape 
> to the byte value given by its two hex digits. Note that the assumed ASCII 
> encoding only refers to the URL, not to the payload. The payload is not a 
> string, and it doesn't contain characters; it's a sequence of bytes.
> 
> `URLDecoder` is not a general-purpose class for decoding a percent-encoded 
> sequence of bytes. It's specifically meant to take an HTML form string and 
> decode it into a string with some defined charset, using additional rules 
> that don't generally apply to percent-encoded byte sequences. For example, it 
> treats `+` as an encoded space character (that's where the kind-of-hacky 
> `data.replace("+", "%2B")` comes from).
> 
> Using `URLDecoder` kind of works (if we use a sufficiently rich charset for 
> both `URLDecoder.decode` and `String.getBytes`), but only by accident. While 
> it accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:
> 
> There are two possible ways in which this decoder could deal with illegal 
> strings.
> It could either leave illegal characters alone or it could throw an 
> IllegalArgumentException.
> Which approach the decoder takes is left to the implementation.

Thank you for the clarifications! Your approach does make sense.
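
For readers of the archive, the byte-level decoding described in the quoted text could be sketched roughly as below. This is only an illustration under my own assumptions, not the code from the PR: the class name `PercentDecodeSketch` and the method `decodePercentEncoded` are hypothetical, and rejecting non-ASCII input with an exception is a choice made here for simplicity. The `main` method also shows how going through `URLDecoder` plus `String.getBytes` can lose a byte that is not valid in the chosen charsets.

```java
import java.io.ByteArrayOutputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; names and error handling are assumptions, not the PR's code.
final class PercentDecodeSketch {

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes a percent-encoded payload directly into bytes, with no charset involved.
    static byte[] decodePercentEncoded(String data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length());
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '%') {
                // A %xx escape needs two more characters after the percent sign.
                if (i + 2 >= data.length()) {
                    throw new IllegalArgumentException("Incomplete escape at index " + i);
                }
                int hi = hexValue(data.charAt(i + 1));
                int lo = hexValue(data.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c < 0x80) {
                // An ASCII character stands for the byte with the same numeric value.
                // Note that '+' is kept as the byte 0x2B, not turned into a space.
                out.write(c);
            } else {
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Direct decoding keeps the byte 0xFF (shown as -1) and the '+'.
        byte[] direct = decodePercentEncoded("a%FF+b");
        System.out.println(Arrays.toString(direct)); // prints [97, -1, 43, 98]

        // 0xFF is not valid UTF-8, so URLDecoder yields U+FFFD, which then
        // cannot be encoded in US-ASCII and becomes '?' (63).
        byte[] lossy = URLDecoder.decode("%FF", StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(lossy)); // prints [63]
    }
}
```

The sketch never builds an intermediate `String` from the payload, so no charset can alter the bytes along the way.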

-------------

PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256454118
