Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Ulf Zibis
On 2/13/2012 11:07 AM, Bill Shannon wrote: Can you detect the case of creating an InputStreamReader using the default encoding, wrapped around the InputStream from System.in that refers to the console? If so, it might be good to handle that case as well, although at this point I would conside

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Ulf Zibis
Sherman, thanks for your additional explanation. One more nit... Why do you use the "sun." prefix? I think "stdout.encoding" and "stderr.encoding" would be enough and nicer. In some years, nobody will have any association with 'sun'. On the other hand, it would be more accurate to use: "windows
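
A minimal sketch of how a per-stream encoding property like the one discussed here could be consumed. It assumes the property name is "sun.stdout.encoding" (and "sun.stderr.encoding"), as the question about the "sun." prefix implies; the class `ConsoleEncoding` and the helper `resolve` are hypothetical names for illustration only. The property may be absent on a given runtime, so the sketch falls back to the default charset.

```java
import java.nio.charset.Charset;

class ConsoleEncoding {
    // Hypothetical helper: turn a property value into a Charset,
    // falling back to the platform default when it is missing or unusable.
    static Charset resolve(String name) {
        if (name == null || name.isEmpty()) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : Charset.defaultCharset();
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException (syntactically illegal names).
            return Charset.defaultCharset();
        }
    }

    public static void main(String[] args) {
        // Assumed property names; either lookup may return null.
        System.out.println("stdout: " + resolve(System.getProperty("sun.stdout.encoding")));
        System.out.println("stderr: " + resolve(System.getProperty("sun.stderr.encoding")));
    }
}
```

Keeping the lookup behind one helper means a later rename of the property touches a single call site.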

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Ulf Zibis
On 13.02.2012 19:35, Xueming Shen wrote: On 2/13/2012 10:15 AM, Ulf Zibis wrote: Interesting issue, especially for us Germans! What about System.in, if one types some umlauts at the Windows console? System.in is an "InputStream", no charset involved there, you build your own

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Ulf Zibis
Interesting issue, especially for us Germans! What about System.in, if one types some umlauts at the Windows console? Why are there theoretically different code pages for stdout and stderr? -Ulf On 13.02.2012 18:36, Xueming Shen wrote: Hi This is a long standing Windows codepage support is

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
On 14.10.2011 10:47, Ulf Zibis wrote: My new guess for the reason: the widening of the bytes to int to serve the isNotContinuation / isMalformedxx methods. So those methods should be coded in byte logic too, and use the "bx <= (byte)abc" logic instead of "shift" or "(bx & abc) != def". -Ulf

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
On 30.09.2011 22:46, Xueming Shen wrote: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because the benchmark showed the "shift" version is slightly faster. Do you have any numbers that show a difference now? My non-scientific benchmark still suggests the "shift" type is f

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis
On 13.10.2011 21:13, Xueming Shen wrote: On 10/13/2011 09:55 AM, Ulf Zibis wrote: On 11.10.2011 19:49, Xueming Shen wrote: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolea

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Ulf Zibis
On 11.10.2011 19:49, Xueming Shen wrote: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolean op2(int b) { return (b & 0xc0) != 0x80; } private static boolean op3(byte b) {
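
The two complete predicates quoted above can be checked against each other over all 256 byte values. The following is only a sketch: `viaShift` and `viaMask` mirror op1 and op2 as quoted, while `viaByteCompare` is a hypothetical byte-comparison variant in the spirit of the "bx <= (byte)abc" suggestion elsewhere in this thread; it is not the op3 from the message, whose body is cut off here. It says nothing about relative speed, only that the three forms classify bytes identically.

```java
class ContinuationCheck {
    // Mirrors op1: a sign-extended continuation byte (0x80..0xBF) shifts down to -2.
    static boolean viaShift(int b) { return (b >> 6) != -2; }

    // Mirrors op2: the top two bits of a continuation byte are 10.
    static boolean viaMask(int b) { return (b & 0xc0) != 0x80; }

    // Hypothetical byte-only variant: as signed bytes, continuation bytes are
    // -128..-65, so anything greater than (byte) 0xBF is not a continuation byte.
    static boolean viaByteCompare(byte b) { return b > (byte) 0xBF; }

    // True when all three predicates agree for every possible byte value.
    static boolean allAgree() {
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            if (viaShift(b) != viaMask(b) || viaMask(b) != viaByteCompare(b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all three predicates agree: " + allAgree());
    }
}
```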

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis
On 11.10.2011 13:36, Ulf Zibis wrote: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because the benchmark showed the "shift" version is slightly faster. Do you have any numbers that show a difference now? My non-scientific benchmark still suggests

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis
Hi Sherman, I haven't heard from you for quite a while. Are you on holiday? On 30.09.2011 22:46, Xueming Shen wrote: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because the benchmark showed the "shift" version is slightly faster. Do you have any numbers that sho

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis
Hi again, On 30.09.2011 00:27, Xueming Shen wrote: On 09/29/2011 02:16 PM, Ulf Zibis wrote: 280 if (Character.isSurrogate(c)) 281 return malformedForLength(src, sp, dst, dp, 3); Shouldn't we return cr.length() = 1 to allow remaining 2 bytes

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis
On 02.10.2011 08:29, Xueming Shen wrote: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: On 30.09.2011 22:46, Xueming Shen wrote: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-01 Thread Ulf Zibis
On 30.09.2011 22:46, Xueming Shen wrote: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends returning the malformed length 2, which is what UTF-8 is doing
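
To see which malformed length a given runtime actually reports for the byte sequence in case (1), a small probe like the one below can help. This is a sketch, not part of the reviewed patch: the class name `MalformedLengthDemo` and the helper `decode` are hypothetical, and the printed length depends on the JDK in use, so no expected value is stated.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class MalformedLengthDemo {
    // Decode with a fresh decoder; its default action for malformed input is REPORT,
    // so the first error comes back as a CoderResult instead of being replaced.
    static CoderResult decode(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(Math.max(1, bytes.length));
        return decoder.decode(in, out, true);
    }

    public static void main(String[] args) {
        // Case (1) from the message: E1 80 followed by a non-continuation byte.
        CoderResult cr = decode(new byte[]{(byte) 0xE1, (byte) 0x80, (byte) 0x42});
        // length() is only valid for error results, so guard the call.
        System.out.println(cr.isMalformed() ? "malformed, length " + cr.length() : cr.toString());
    }
}
```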

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Ulf Zibis
Hi, On 29.09.2011 05:27, Xueming Shen wrote: Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537> (1) new byte[]{(byte)0xE1, (byte)0x80, (byt

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Ulf Zibis
Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 2. bug 7096080 seems to be a duplicate of 6798514 - Charset UTF-8 accepts CESU-8 codings which was closed. It should be reopened

Re: Codereview request for 7082884: Incorrect UTF8 conversion for sequence ED 31

2011-09-28 Thread Ulf Zibis
On 19.09.2011 22:21, Xueming Shen wrote: The current implementation decode new String(new byte[]{(byte)0xed, 31}, "UTF8") Bug 7082884 refers to ED 31, so it should be: new String(new byte[]{(byte)0xed, 0x31}, "UTF8") -Ulf
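
The difference between the two literals is easy to miss, so here is a small sketch that makes it visible. The class and method names are hypothetical; `StandardCharsets.UTF_8` is used instead of the "UTF8" name only to avoid the checked exception. The decimal literal 31 is the control byte 0x1F, while 0x31 is the ASCII digit '1', which is the byte the bug title "ED 31" refers to.

```java
import java.nio.charset.StandardCharsets;

class HexLiteralDemo {
    // As written in the quoted message: the second byte is decimal 31, i.e. 0x1F.
    static byte[] decimalVariant() { return new byte[]{(byte) 0xed, 31}; }

    // As the bug title intends: the second byte is 0x31, the ASCII digit '1'.
    static byte[] hexVariant() { return new byte[]{(byte) 0xed, 0x31}; }

    public static void main(String[] args) {
        System.out.printf("decimal variant second byte: 0x%02X%n", decimalVariant()[1]);
        System.out.printf("hex variant second byte:     0x%02X%n", hexVariant()[1]);
        // Decoding is shown for completeness; the exact replacement output depends on the JDK.
        String decoded = new String(hexVariant(), StandardCharsets.UTF_8);
        System.out.println("decoded length (hex variant): " + decoded.length());
    }
}
```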