Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
Am 14.10.2011 10:47, schrieb Ulf Zibis: My new guess for the reason: The unfolding of the bytes to int to serve the isNotContinuation / isMalformedxx methods. So those methods should be coded in byte logic too. + use the "bx <= (byte)abc" logic instead "shift" or "(bx & abc) != def". -Ulf

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back to 2009(?) because the benchmark shows the "shift" version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the "shift" type is f

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
Am 13.10.2011 21:13, schrieb Xueming Shen: On 10/13/2011 09:55 AM, Ulf Zibis wrote: Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolean op2(int b) {

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Xueming Shen
On 10/13/2011 09:55 AM, Ulf Zibis wrote: Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolean op2(int b) { return (b & 0xc0) != 0x80; }

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Ulf Zibis
Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolean op2(int b) { return (b & 0xc0) != 0x80; } private static boolean op3(byte b) {

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Xueming Shen
On 10/11/2011 04:36 AM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back to 2009(?) because the benchmark shows the "shift" version is slightly faster. Do you have any number shows any difference now. My non-scientific be

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis
Am 11.10.2011 13:36, schrieb Ulf Zibis: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back to 2009(?) because the benchmark shows the "shift" version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the "shift" type is fa

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis
Hi Sherman, I didn't read anything from you since longer time. You are in holidays? Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back to 2009(?) because the benchmark shows the "shift" version is slightly faster. Do you have any number sho

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis
Hi again, Am 30.09.2011 00:27, schrieb Xueming Shen: On 09/29/2011 02:16 PM, Ulf Zibis wrote: 280 if (Character.isSurrogate(c)) 281 return malformedForLength(src, sp, dst, dp, 3); Shouldn't we return cr.length() = 1to allow remaining 2 bytes to be

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis
Am 02.10.2011 08:29, schrieb Xueming Shen: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-01 Thread Xueming Shen
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> Cod

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-01 Thread Ulf Zibis
Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, fo

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Xueming Shen
On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind is, that i

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Ulf Zibis
Hi, Am 29.09.2011 05:27, schrieb Xueming Shen: Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-29 Thread Xueming Shen
On 09/29/2011 02:16 PM, Ulf Zibis wrote: Please use spaces with ternary operators: Lines 155, 216 For short you could use sr instead srcRemaining, consistent to sa, sp, sl. 420 // returns -1 if there is malformed byte(s) and the better: 420 // returns -1 if there is/are malfor

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen
Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote: Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 It might take couple days for it to show up on bugs.sun.com. But it has exactly the same content as my previous email. In fact I simply copy/pa

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Ulf Zibis
Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 2. bug 7096080 seems to be a duplicate of 6798514 - Charset UTF-8 accepts CESU-8 codings which was closed. It should be reopened

Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen
Hi, [I combined the proposed charge for #7082884, in which no one appears to be interested:-) into this one] Unicode Standard added "Addition Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated in 6.0 again with further "clarification" [2] regarding how a "conformanc