Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen Wed, 28 Sep 2011 20:28:38 -0700

Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote:

Hi Sherman,
1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080

It might take a couple of days for it to show up on bugs.sun.com, but it has exactly the same content as my previous email. In fact I simply copied and pasted it into the email.

3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>

(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)

It appears the Unicode Standard now explicitly recommends returning malformed length 2 for this scenario, which is what UTF-8 is doing now.

(2) new byte[]{(byte)0xE1, (byte)0x40} ---> CoderResult.malformedForLength(1)

The proposed change actually fixes this one already (malformed length 1).

(3) new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1)

Technically this is not a bug: the decoder will return malformed length 1 if you go with decode(bf, cf, true). But yes, it would be desirable to return malformed length 1 without waiting for the second byte. The code/webrev has been updated to do just this, as "expected".
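The three cases above can be checked directly against whatever JDK is at hand. The following is only a minimal sketch: the class name MalformedLengthProbe and the helper probe() are made up for illustration, and the lengths reported depend on the decoder version being run.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the webrev.
final class MalformedLengthProbe {

    // Decodes the bytes as UTF-8 with endOfInput = true and returns the raw
    // CoderResult. A new decoder uses CodingErrorAction.REPORT by default,
    // so malformed input is reported instead of being replaced.
    static CoderResult probe(byte... bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(8), true);
    }

    public static void main(String[] args) {
        // Case (1): valid lead byte and first continuation byte, bad third byte.
        System.out.println(probe((byte) 0xE1, (byte) 0x80, (byte) 0x42));
        // Case (2): lead byte followed by a non-continuation byte.
        System.out.println(probe((byte) 0xE1, (byte) 0x40));
        // Case (3): a lone 0xC0, which can never start a well-formed sequence.
        System.out.println(probe((byte) 0xC0));
    }
}
```

On a decoder that includes the behavior described in this thread, I would expect lengths 2, 1 and 1 respectively.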


Now the 2-byte sequence entry check has been updated to
} else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
...
}

and I no longer check the first byte in malformed2(), which I think has the smallest performance impact for 2-byte sequences. I ran several rounds of benchmark testing and did not see a significant difference. I will try more later.
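To make the effect of the updated entry check concrete, here is a small standalone sketch. It is not the code from the webrev: the class TwoByteLeadSketch and its methods are hypothetical, and the decode step uses the plain mask-and-shift form rather than whatever the real decoder does.

```java
// Hypothetical illustration of the 2-byte entry check, not the JDK source.
final class TwoByteLeadSketch {

    // True when b1 has the bit pattern 110xxxxx and at least one of the
    // bits selected by 0x1e is set, which excludes 0xC0 and 0xC1.
    static boolean isTwoByteLead(byte b1) {
        return (b1 >> 5) == -2 && (b1 & 0x1e) != 0;
    }

    // True when b has the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xc0) == 0x80;
    }

    // Returns the decoded code point, or -1 if (b1, b2) is not a
    // well-formed 2-byte UTF-8 sequence.
    static int decode2(byte b1, byte b2) {
        if (!isTwoByteLead(b1) || !isContinuation(b2)) {
            return -1;
        }
        return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
    }
}
```

With the lead byte validated up front, only the second byte is left to examine once the branch has been entered.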

I'm not sure I understand the suggested b1 < -0x3e patch. I don't see that we can simply replace ((b1 >> 5) == -2) with (b1 < -0x3e).
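One way to see the difference is to enumerate all 256 byte values under each expression taken in isolation. This is a rough sketch with made-up names (LeadByteRanges, BytePredicate, range); it says nothing about where in the branch chain the suggested comparison was meant to sit, which may change the picture.

```java
// Hypothetical comparison of the two expressions, evaluated on their own.
final class LeadByteRanges {

    interface BytePredicate {
        boolean test(byte b);
    }

    static final BytePredicate SHIFT_CHECK = b1 -> (b1 >> 5) == -2;
    static final BytePredicate COMPARE_CHECK = b1 -> b1 < -0x3e;

    // Summarizes the unsigned byte values accepted by p as
    // "0xLO..0xHI (N values)", or "none" if nothing is accepted.
    static String range(BytePredicate p) {
        int lo = -1, hi = -1, count = 0;
        for (int v = 0; v <= 0xFF; v++) {
            if (p.test((byte) v)) {
                if (lo < 0) {
                    lo = v;
                }
                hi = v;
                count++;
            }
        }
        return count == 0 ? "none"
                : String.format("0x%02X..0x%02X (%d values)", lo, hi, count);
    }

    public static void main(String[] args) {
        System.out.println("(b1 >> 5) == -2 : " + range(SHIFT_CHECK));
        System.out.println("b1 < -0x3e      : " + range(COMPARE_CHECK));
    }
}
```

Taken alone, the shift test accepts 0xC0..0xDF while the comparison accepts 0x80..0xC1, so the two overlap only at 0xC0 and 0xC1.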

Anyway, I hope now you are motivated to take a deep look at the code :-) and maybe want to run all your tests to confirm the change is fine.

This change does expose an existing bug/issue in StreamDecoder, in which the StreamDecoder fails to replace a "malformed" input where a "leading byte" is at the end of the stream. This is why I commented out the line in Errors. I will file a bug for this one later.
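For anyone who wants to reproduce the trailing lead byte case, a sketch along the following lines should do. The class and method names are invented for illustration. viaDecoder() goes through CharsetDecoder with the REPLACE action, and viaReader() goes through InputStreamReader; whether the two agree depends on the JDK build being tested.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical probe, not the actual regression test.
final class TrailingLeadByteProbe {

    // Decodes the whole array in one call, replacing malformed input.
    static String viaDecoder(byte[] in) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(in))
                    .toString();
        } catch (IOException e) { // CharacterCodingException is an IOException
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the same bytes through the stream-based reader path.
    static String viaReader(byte[] in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(in), StandardCharsets.UTF_8)) {
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] in = { 'a', (byte) 0xE1 }; // 0xE1 is a lead byte with nothing after it
        System.out.println(viaDecoder(in).equals(viaReader(in)));
    }
}
```

The decoder path yields "a" followed by U+FFFD for this input, which is the result the reader path would be expected to match.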

5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets

We have lots of charsets provided via the "standard charset provider" (in rt.jar) but not listed in StandardCharsets.
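Such a charset is simply looked up by name. As a sketch, assuming the runtime actually ships "CESU-8" (hence the isSupported guard), and with Cesu8Lookup and encodedLength being hypothetical names:

```java
import java.nio.charset.Charset;

// Hypothetical lookup helper; "CESU-8" availability is an assumption
// about the runtime, so it is checked before use.
final class Cesu8Lookup {

    // Returns the encoded length of s in the named charset, or -1 if this
    // runtime does not provide that charset.
    static int encodedLength(String charsetName, String s) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String supplementary = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println("UTF-8  : " + encodedLength("UTF-8", supplementary));
        System.out.println("CESU-8 : " + encodedLength("CESU-8", supplementary));
    }
}
```

UTF-8 encodes a supplementary character in 4 bytes, while CESU-8 encodes each surrogate separately as 3 bytes, for 6 in total.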


-Sherman
