Re: RFR: 8311216: DataURI can lose information in some charset environments

Michael Strauß Fri, 07 Jul 2023 13:26:51 -0700

On Fri, 7 Jul 2023 19:17:59 GMT, Andy Goryachev <ango...@openjdk.org> wrote:


>> modules/javafx.graphics/src/main/java/com/sun/javafx/util/DataURI.java line 
>> 115:
>> 
>>> 113:             nameValuePairs,
>>> 114:             base64,
>>> 115:             base64 ? Base64.getDecoder().decode(data) : 
>>> decodePercentEncoding(data));
>> 
>> I wonder if this is all necessary.  The data is supposed to be url-encoded, 
>> so it's essentially ASCII, no?
>> 
>> passing default charset to getBytes() is not right, it probably should be
>> 
>> URLDecoder.decode(data.replace("+", "%2B"), 
>> charset).getBytes(StandardCharsets.US_ASCII));
>> 
>> or am I missing something?
>
> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
> 
> 
> Therefore, the
> 
> 
> 
> 
> 
> Berners-Lee, et al.         Standards Track                    [Page 11]
> 
> [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)                   
> URI Generic Syntax               January 2005
> 
> 
>    integer values used by the ABNF must be mapped back to their
>    corresponding characters via US-ASCII in order to complete the syntax
>    rules.

> I wonder if this is all necessary. The data is supposed to be url-encoded, so 
> it's essentially ASCII, no?
> 
> passing default charset to getBytes() is not right, it probably should be
> 
> URLDecoder.decode(data.replace("+", "%2B"), 
> charset).getBytes(StandardCharsets.US_ASCII));
> 
> or am I missing something?

The payload of a data URI is just a sequence of bytes, not characters. Only 
when the numeric value of a byte, assuming ASCII encoding, is a *safe URL 
character*, it is left as-is; otherwise percent-encoding is used to encode the 
byte value. The [specification](https://datatracker.ietf.org/doc/html/rfc2397) 
points out:

Without ";base64", the data (as a sequence of octets) is represented using
ASCII encoding for octets inside the range of safe URL characters and using
the standard %xx hex encoding of URLs for octets outside that range.


Decoding the payload back to a byte array is done by simply converting each 
assumed ASCII character to its numeric value, and decoding percent-encoded 
bytes to their hex value. Note that the assumed ASCII encoding only refers to 
the URL, but not to the payload. The payload is not a string, and it doesn't 
contain characters; it's a sequence of bytes.

`URLDecoder` is not a general-purpose class to decode a percent-encoded 
sequence of bytes. It's specifically meant to take a HTML forms string and 
decode it into a string with some defined charset, using additional rules that 
don't generally apply to percent-encoded byte sequences. For example, a space 
character is encoded as `+` (that's where the kind-of-hacky `data.replace("+", 
"%2B")` comes from).

Using `URLDecoder` kind of works (if we use a sufficiently rich charset for 
both `URLDecoder.decode` and `String.getBytes`), but only by accident. While it 
accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:

There are two possible ways in which this decoder could deal with illegal 
strings.
It could either leave illegal characters alone or it could throw an 
IllegalArgumentException.
Which approach the decoder takes is left to the implementation.

-------------

PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256426290

Re: RFR: 8311216: DataURI can lose information in some charset environments

Reply via email to