Re: Fwd: Some differences on Window UDC area

Charles Lee Wed, 23 May 2012 02:09:08 -0700

Hi guys,

We have a simple test case:


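    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class UdcTest {
        public static void main(String[] args) throws Exception {
            for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
                Charset charset = Charset.forName(cname);
                System.out.println("charset: " + charset.name());
                CharsetEncoder ce = charset.newEncoder();
                char[] chars = new char[] { 0xE585, 0xE586, 0xE592 };
                ByteBuffer bb = ce.encode(CharBuffer.wrap(chars));

                for (char c : chars) {
                    System.out.printf("\\u%04x", (int) c);
                }
                System.out.print(" -> ");

                // Print only the encoded bytes (position..limit), not the
                // unused tail of the backing array.
                while (bb.hasRemaining()) {
                    System.out.printf("\\x%02x", bb.get() & 0xFF);
                }
                System.out.println();
            }
        }
    }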

The output is
charset: GBK
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: x-mswin-936
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: GB18030
\ue585\ue586\ue592 -> \xa2\xa0\xa3\x40\xa3\x4c

According to MSDN [1], U+E000 through U+F8FF is the EUDC range, so U+E586 is an EUDC character. But MS936/GBK maps it to 0xA2AB, which is not in the EUDC range. With another simple test case, you can see that more code points are not mapped correctly:


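    // Must run inside a method that declares "throws UnsupportedEncodingException".
    for (int i = 0xE000; i < 0xE000 + 1894; i++) {
        String s = new String(new char[] { (char) i });
        byte[] bs = s.getBytes("MS936");
        if (bs.length != 2) {
            // Guard against single-byte results before reading bs[1].
            System.out.printf("U+%04X -> not a double-byte code%n", i);
            continue;
        }
        int b0 = bs[0] & 0xFF;
        int b1 = bs[1] & 0xFF;
        // Skip codes that fall inside one of the three user-defined areas.
        if ((b0 >= 0xAA && b0 <= 0xAF) && (b1 >= 0xA1 && b1 <= 0xFE))
            continue;
        if ((b0 >= 0xF8 && b0 <= 0xFE) && (b1 >= 0xA1 && b1 <= 0xFE))
            continue;
        if ((b0 >= 0xA1 && b0 <= 0xA7) && (b1 >= 0x40 && b1 <= 0xA0))
            continue;
        // Anything printed here was mapped outside the user-defined areas.
        System.out.printf("\\u%04X -> \\x%02X\\x%02X%n", i, b0, b1);
    }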
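
For reference, here is a minimal sketch of what the mapping would look like if the 1894 code points starting at U+E000 were assigned linearly, row by row, to the three areas checked above (0xAAA1-0xAFFE, then 0xF8A1-0xFEFE, then 0xA140-0xA7A0 skipping trail byte 0x7F). The linear layout is my assumption, and the class and method names are made up for illustration. For U+E585, U+E586 and U+E592 it gives A2A0, A340 and A34C, the same bytes GB18030 produced in the first test.

```java
public class UdaMapping {
    // Hypothetical helper: expected double-byte code for a PUA code point,
    // ASSUMING linear row-major assignment over the three user-defined areas.
    // Returns -1 for code points outside U+E000 .. U+E000+1893.
    static int expectedCode(int cp) {
        int off = cp - 0xE000;
        if (off < 0 || off >= 1894) {
            return -1;
        }
        if (off < 6 * 94) {
            // Area 1: lead 0xAA-0xAF, trail 0xA1-0xFE (94 codes per row).
            return ((0xAA + off / 94) << 8) | (0xA1 + off % 94);
        }
        off -= 6 * 94;
        if (off < 7 * 94) {
            // Area 2: lead 0xF8-0xFE, trail 0xA1-0xFE (94 codes per row).
            return ((0xF8 + off / 94) << 8) | (0xA1 + off % 94);
        }
        off -= 7 * 94;
        // Area 3: lead 0xA1-0xA7, trail 0x40-0xA0 without 0x7F (96 codes per row).
        int trail = 0x40 + off % 96;
        if (trail >= 0x7F) {
            trail++; // 0x7F is not used as a trail byte
        }
        return ((0xA1 + off / 96) << 8) | trail;
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0xE585, 0xE586, 0xE592 }) {
            int code = expectedCode(cp);
            System.out.printf("U+%04X -> \\x%02X\\x%02X%n", cp, code >> 8, code & 0xFF);
        }
    }
}
```

Comparing expectedCode(i) with the bytes from getBytes("MS936") in the loop above would list exactly which entries deviate from this layout.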

I have written a generator in C# [2] which outputs the mappings for GB2312 [3] and GB18030 [4] in the range U+E000 to U+F8FF, and found that most of the codes are the same. I therefore suggest we follow the GB2312 mappings; the changed map files for OpenJDK can be found at [5][6].


Could anyone help take a look at this issue?

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd317837%28v=vs.85%29.aspx

[2] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/Program.cs
[3] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb2312Map.txt
[4] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb18030Map.txt
[5] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/GBK.map.new
[6] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/MS936.map.new

P.S: Sorry for the late notice.


On 03/29/2011 03:00 PM, Charles Lee wrote:

On 03/28/2011 11:06 PM, Alan Bateman wrote:
Charles Lee wrote:
:
It looks similar. How can I find the patch quickly? I notice it says "the list is attached to this CR". Is it CR-6183404? Since cr has the pattern cr.openjdk.java.net/~username/id, how can I know who is the committer for this CR?
cr.openjdk.java.net is the place where we push webrevs when a patch is out for review. I don't think this one is on anyone's list for jdk7, and the list attached to the bug is likely the list of incorrect mappings. If this is fixed then I assume the fix will update the mappings in jdk/make/tools/CharsetMapping/MS936.map.
-Alan
I have output more bytes [1] to see whether the other code points are encoded correctly. Unfortunately, they are not. It looks as if, on Windows, MS936 uses the GB18030 mappings for its PUA. Wikipedia says GB18030 is compatible with GBK, which MS936 implements. Can we conclude that MS936 should follow GB18030's behavior?
[1] 0xE585, 0xE586, 0xE587, 0xE588, 0xE589, 0xE58a, 0xE58b, 0xE58c, 0xE58d, 0xE58e, 0xE58f, 0xE590, 0xE591, 0xE592, 0xE593, 0xE594, 0xE595, 0xE596, 0xE597, 0xE598, 0xE599, 0xE59a, 0xE59b, 0xE59c, 0xE59d, 0xE59e, 0xe79f.
Using MS936 charset, we expect:
\xa2\xa0\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa3\x4d\xa3\x4e\xa3\x4f\xa3\x50\xa3\x51\xa3\x52\xa3\x53\xa3\x54\xa3\x55\xa3\x56\xa3\x57\xa3\x58\xa6\xfe
but we got:
\xa2\xa0\xa2\xab\xa2\xac\xa2\xad\xa2\xae\xa2\xaf\xa2\xb0\xa2\xe3\xa2\xe4\xa2\xef\xa2\xf0\xa2\xfd\xa2\xfe\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa7\xa0
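
The expected/actual comparison above can also be automated rather than done by hand. Below is a minimal sketch that encodes each code point in a range with two charsets and collects the ones whose bytes differ. The class and method names are invented for illustration, and the range U+E000..U+E765 is only my choice (the 1894 code points covered by the MS936 user-defined areas); the result depends on the JDK's charset tables, so no particular output is implied.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PuaCharsetDiff {
    // Hypothetical helper: BMP code points in [first, last] whose encoded
    // bytes differ between charsets a and b. Unmappable characters are
    // encoded with each charset's replacement bytes by String.getBytes.
    static List<Integer> differing(String a, String b, int first, int last) {
        Charset ca = Charset.forName(a);
        Charset cb = Charset.forName(b);
        List<Integer> out = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            String s = String.valueOf((char) cp);
            if (!Arrays.equals(s.getBytes(ca), s.getBytes(cb))) {
                out.add(cp);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Range chosen for illustration only (an assumption, see above).
        List<Integer> diff = differing("MS936", "GB18030", 0xE000, 0xE765);
        System.out.println(diff.size() + " code points differ");
        for (int cp : diff) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```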



--
Yours Charles
