On Wed, 14 Feb 2024 11:35:10 GMT, Raffaello Giulietti <rgiulie...@openjdk.org> 
wrote:

>> While properly encoded modified UTF-8 strings won't have embedded zeros 
>> (`\u0000` will be encoded as `0xC0, 0x80`) the decoding routines in 
>> `DataInputStream` and `ObjectInputStream` allow them and do not throw an 
>> exception if an embedded zero is encountered. This PR does not change 
>> semantics here AFAICT. If you think we need to be stricter in these decoders 
>> that could be done as a separate RFE and I'll put this on hold.
>
> Ah OK.
> 
> I didn't check the current code, only the proposed one.
> Although the specification clearly states that the method should throw, if 
> the current code does not throw on zeros, then it makes sense that the 
> proposed one shouldn't either.

The specification is somewhat ambiguous:
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/DataInput.html#readUTF()

There's a sweeping `Throws UTFDataFormatException - if the bytes do not 
represent a valid modified UTF-8 encoding of a string` but also: `If the first 
byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 
1"), then the group consists of just that byte. The byte is zero-extended to 
form a character.` I think the latter leaves some leeway for decoders to be 
lenient about embedded zeros, even if it's made clear elsewhere that valid 
encoders need to replace zeros with the `0xC0, 0x80` sequence.
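
To make the asymmetry concrete, here is a minimal sketch (not part of the PR) that feeds `DataInputStream.readUTF()` a hand-built payload containing a raw zero byte and compares it with what `DataOutputStream.writeUTF()` emits for the same string. The class name `EmbeddedZeroDemo` and its helpers `decode` and `encode` are made up for illustration; only the `java.io` calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not taken from the JDK sources or the PR.
final class EmbeddedZeroDemo {

    // Prepends the 2-byte big-endian length that readUTF expects,
    // then decodes the payload as modified UTF-8.
    static String decode(byte[] payload) {
        byte[] framed = new byte[payload.length + 2];
        framed[0] = (byte) (payload.length >>> 8);
        framed[1] = (byte) payload.length;
        System.arraycopy(payload, 0, framed, 2, payload.length);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes a string with writeUTF and returns length prefix plus payload.
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Raw 0x00 in the payload: the decoder accepts it as U+0000.
        String lenient = decode(new byte[] {'a', 0x00, 'b'});
        System.out.println("decoded length: " + lenient.length()
                + ", middle char: " + (int) lenient.charAt(1));

        // The encoder never produces a raw zero; it writes 0xC0, 0x80 instead.
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("a\0b")) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println("encoded bytes: " + hex.toString().trim());
    }
}
```

Both the raw `0x00` form and the `0xC0, 0x80` form decode to the same three-character string, while encoding only ever produces the latter, which is the leniency described above.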

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17734#discussion_r1489564324
