On Wed, 14 Feb 2024 11:35:10 GMT, Raffaello Giulietti <rgiulie...@openjdk.org> wrote:
>> While properly encoded modified UTF-8 strings won't have embedded zeros
>> (`\u0000` will be encoded as `0xC0, 0x80`), the decoding routines in
>> `DataInputStream` and `ObjectInputStream` allow them and do not throw an
>> exception if an embedded zero is encountered. This PR does not change
>> semantics here AFAICT. If you think we need to be stricter in these decoders,
>> that could be done as a separate RFE and I'll put this on hold.
>
> Ah OK.
>
> I didn't check the current code, only the proposed one.
> Although the specification clearly states that the method should throw, if
> the current code does not throw on zeros, then it makes sense that the
> proposed one shouldn't either.

The specification is somewhat ambiguous: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/DataInput.html#readUTF()

There's a sweeping `Throws UTFDataFormatException - if the bytes do not represent a valid modified UTF-8 encoding of a string` but also:

`If the first byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 1"), then the group consists of just that byte. The byte is zero-extended to form a character.`

I think the latter leaves some leeway for being lenient about embedded zeros, even if it's made clear elsewhere that valid encoders need to replace zeros with the `0xC0, 0x80` sequence.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17734#discussion_r1489564324
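
For illustration, here is a minimal sketch of the behavior discussed above. It is not taken from the PR; the class name `EmbeddedZeroDemo` and the helpers `encode` and `decode` are hypothetical, and only `DataOutputStream.writeUTF` and `DataInputStream.readUTF` are real JDK calls. It shows a conforming encoder writing `\u0000` as `0xC0, 0x80`, and the decoder accepting a hand-built input that contains a raw zero byte instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK or of the PR.
public class EmbeddedZeroDemo {

    // Encodes a string with writeUTF: a two-byte length followed by
    // the modified UTF-8 bytes.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    // Decodes a length-prefixed modified UTF-8 string with readUTF.
    static String decode(byte[] data) throws IOException {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // A conforming encoder never emits a raw zero byte:
        // the NUL character becomes the two-byte sequence 0xC0, 0x80.
        byte[] proper = encode("a\u0000b");
        System.out.println(Arrays.toString(proper));

        // Hand-built input: length 3, then 'a', a raw 0x00 byte, 'b'.
        // No conforming encoder would produce this.
        byte[] raw = {0x00, 0x03, 'a', 0x00, 'b'};
        String decoded = decode(raw);

        // The decoder zero-extends the 0x00 byte to U+0000 without throwing.
        System.out.println(decoded.length() + " chars, middle char is NUL: "
                + (decoded.charAt(1) == '\u0000'));
    }
}
```

Both inputs decode to the same three-character string, which is the leniency the `0xxxxxxx` wording in the specification appears to permit.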