Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote:
Hi Sherman,

1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080

It might take a couple of days for it to show up on bugs.sun.com, but it has exactly the same content as
my previous email. In fact, I simply copied and pasted it into the email.

3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>


(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)

It appears the Unicode Standard now explicitly recommends returning malformed length 2 for this
scenario, which is what UTF-8 is doing now.

(2) new byte[]{(byte)0xE1, (byte)0x40} ---> CoderResult.malformedForLength(1)
The proposed change actually fixes this one already (malformed length 1).

(3) new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1)
Technically this is not a bug: the decoder will return malformed length 1 if you go with decode(bf, cf, true). But yes, it would be desirable to return malformed length 1 without waiting for the second byte. The code/webrev has been updated to do just this, as "expected".
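To make the three cases above easy to re-check against a given build, here is a minimal sketch of a probe. The class name MalformedLengthProbe and the helper probe() are made up for illustration and are not part of the webrev; it only uses the public CharsetDecoder API with endOfInput = true. The length printed for each case depends on the JDK you run it on, so no expected values are shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the proposed change.
class MalformedLengthProbe {

    // Decodes the whole array in one call with endOfInput = true and
    // returns the first CoderResult the UTF-8 decoder reports.
    // A fresh decoder uses CodingErrorAction.REPORT, so malformed input
    // comes back as a result object instead of being replaced.
    static CoderResult probe(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(16);
        return decoder.decode(in, out, true);
    }

    public static void main(String[] args) {
        byte[][] cases = {
            {(byte) 0xE1, (byte) 0x80, (byte) 0x42},  // case (1)
            {(byte) 0xE1, (byte) 0x40},               // case (2)
            {(byte) 0xC0},                            // case (3)
        };
        for (byte[] c : cases) {
            CoderResult r = probe(c);
            // length() is only defined for malformed/unmappable results.
            String len = r.isError() ? String.valueOf(r.length()) : "n/a";
            System.out.println(r + " length=" + len);
        }
    }
}
```

Passing false as the third argument of decode() in case (3) would show the "waiting for the second byte" behavior discussed above, on builds that still have it.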

Now the 2-byte sequence entry check has been updated to
} else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
...
}

and I no longer check the first byte for malformed2(), which I think has the smallest performance impact for 2-byte sequences. I ran several rounds of benchmark testing and did not see a significant difference. I will try more later.

I'm not sure I understand the suggested b1 < -0x3e patch; I don't see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
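As a quick sanity check, the sketch below evaluates the three predicates over every possible byte value, taken in isolation. The class and method names are invented for illustration; the suggested patch may have relied on surrounding branch order that is not shown in this email, so this only compares the bare expressions.

```java
// Hypothetical comparison harness, not code from the webrev.
class TwoByteLeadCheck {

    // The updated 2-byte entry check: lead byte 110xxxxx, excluding the
    // overlong lead bytes 0xC0 and 0xC1.
    static boolean updatedCheck(byte b1) {
        return (b1 >> 5) == -2 && (b1 & 0x1e) != 0;
    }

    // The range test alone: true for 0xC0..0xDF.
    static boolean rangeCheck(byte b1) {
        return (b1 >> 5) == -2;
    }

    // The suggested expression, read as a standalone signed comparison.
    static boolean suggested(byte b1) {
        return b1 < -0x3e;
    }

    // Counts byte values accepted by updatedCheck.
    static int countUpdated() {
        int n = 0;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            if (updatedCheck((byte) i)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("accepted by updated check: " + countUpdated());
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            if (rangeCheck(b) != suggested(b)) {
                System.out.printf("0x%02X: range=%b suggested=%b%n",
                        b & 0xff, rangeCheck(b), suggested(b));
            }
        }
    }
}
```

Read this way, (b1 >> 5) == -2 is true for 0xC0..0xDF, while b1 < -0x3e is true for 0x80..0xC1, so the two only agree on "true" for 0xC0 and 0xC1.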

Anyway, I hope now you are motivated to take a deep look at the code:-) and maybe want to
run all your tests to confirm the change is fine.

This change does expose an existing bug/issue in StreamDecoder: it fails to replace a "malformed" input when a "leading byte" is at the end of the stream. This is why
I commented out the line in Errors. I will file a bug for this one later.
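For anyone who wants to see the StreamDecoder scenario from the outside, a rough sketch follows. The class name and the two-byte input are made up for illustration; it reads a stream that ends in a lone 3-byte lead byte through InputStreamReader, which sits on top of StreamDecoder. What comes after the 'A' (a U+FFFD replacement or nothing) depends on whether the build you run has the issue, so no expected output is given.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction helper, not the regression test mentioned above.
class TrailingLeadByteRead {

    // Reads every char the reader produces for the given bytes.
    static String readAll(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' followed by a lead byte with no continuation bytes.
        String s = readAll(new byte[] {0x41, (byte) 0xE1});
        for (char ch : s.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch);
        }
    }
}
```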

5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets


We have lots of charsets provided via the "standard charset provider" (in rt.jar) but not listed in StandardCharsets.
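To illustrate, the sketch below lists the charsets the running JVM reports that have no constant in StandardCharsets. The class name is invented, and the set of six constants is the original Java 7 list, which is an assumption about what "listed" means here. Note that Charset.availableCharsets() also includes extended charsets when they are installed, so the output is not limited to the standard provider.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical listing helper for illustration only.
class CharsetsBeyondStandard {

    // Canonical names of available charsets without a StandardCharsets constant
    // (checked against the six constants present since Java 7).
    static SortedSet<String> notInStandardCharsets() {
        Set<Charset> constants = new HashSet<>(Arrays.asList(
                StandardCharsets.US_ASCII,
                StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8,
                StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE,
                StandardCharsets.UTF_16));
        SortedSet<String> names = new TreeSet<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            if (!constants.contains(cs)) {
                names.add(cs.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : notInStandardCharsets()) {
            System.out.println(name);
        }
    }
}
```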

-Sherman
