Re: [11] RFR: 8181157: CLDR Timezone name fallback implementation

2018-04-26 Thread Xueming Shen
+1 On 4/23/18, 6:25 PM, Naoto Sato wrote: Hi Sherman, thanks for the review. On 4/23/18 1:06 PM, Xueming Shen wrote: Naoto, Here are some comments (1) CLDRTimeZoneNameProviderImpl.java: Ln#58: to use Stream.toArray(String[]::new) ? not sure which one is faster Ln#155-160

Re: [11] RFR: 8181157: CLDR Timezone name fallback implementation

2018-04-23 Thread Xueming Shen
Naoto, Here are some comments (1) CLDRTimeZoneNameProviderImpl.java: Ln#58: to use Stream.toArray(String[]::new) ? not sure which one is faster Ln#155-160: is it worth considering to check all possible empty slots in "names" here (from index_std_long to index_std_short)?
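
The two array-conversion styles weighed in the review comment above can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical, and nothing here is taken from CLDRTimeZoneNameProviderImpl.java itself.

```java
import java.util.Arrays;
import java.util.List;

final class ToArrayDemo {
    // Stream-based conversion, the form suggested in the review comment.
    static String[] viaStream(List<String> names) {
        return names.stream().toArray(String[]::new);
    }

    // Collection-based conversion, one common alternative (an assumption,
    // not necessarily what the reviewed code used).
    static String[] viaCollection(List<String> names) {
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Pacific Standard Time", "PST");
        System.out.println(Arrays.toString(viaStream(names)));
        System.out.println(Arrays.toString(viaCollection(names)));
    }
}
```

Both forms produce equal arrays; which one is faster would need measuring on the JDK in question.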

Re: [11] RFR: 8189784: Parsing with Java 9 AKST timezone returns the SystemV variant of the timezone

2018-04-11 Thread Xueming Shen
+1 On 4/9/18, 6:00 PM, naoto.s...@oracle.com wrote: Thanks, Erik. Modified GensrcCLDR.gmk as suggested: http://cr.openjdk.java.net/~naoto/8189784/webrev.03/ Naoto On 4/9/18 4:15 PM, Erik Joelsson wrote: Hello Naoto, Looks good, just a style issue. When breaking a recipe line, please add 4

Re: [9] RFR: 8173423: Wrong display name for supplemental Japanese era

2017-01-31 Thread Xueming Shen
+1 On Jan 31, 2017, at 5:27 AM, Naoto Sato wrote: Hello, Please review the fix to the following bug: https://bugs.openjdk.java.net/browse/JDK-8173423 The proposed fix is located at: http://cr.openjdk.java.net/~naoto/8173423/webrev.00/ There is a functionality in Japanese Imperial Calendar t

Re: RFR: regex changes

2016-05-03 Thread Xueming Shen
Hi, This one has been out for review for a while. If there are no further comments and feedback the changes will be pushed in shortly. Thanks, Sherman On 3/18/16, 1:05 PM, Xueming Shen wrote: Hi, There are a couple of regex related changes waiting for review. I have pulled them together here (with

Re: CFV: New Internationalization Group Lead: Masayoshi Okutsu

2016-04-14 Thread Xueming Shen
Vote: yes On 4/12/16 10:02 AM, Naoto Sato wrote: I hereby nominate Masayoshi Okutsu to Internationalization Group Lead [1]. Masayoshi is an active member in the Internationalization group since the inception, and ideal for the role. Votes are due by 10:00am (Pacific Time), April 26, 2016.

Re: RFR: regex changes

2016-03-22 Thread Xueming Shen
ce matter enough to warrant adding this extra code. The measurement I did suggests it's still worth adding such a small piece code, given this one probably will be used for most of the greedy {}, with lots of raw "int" in and out, without boxing, and much smaller footprint. Thanks

RFR: regex changes

2016-03-19 Thread Xueming Shen
Hi, There are a couple of regex related changes waiting for review. I have pulled them together here (with the notes) to make it easy to review. http://cr.openjdk.java.net/~sherman/regexBackTrack.Lamnda.CanonEQ/webrev/ (1) Exponential backtracking Note: http://cr.openjdk.java.net/~sherman/regexBackT

RFR: Regex canonical equivalents

2016-03-19 Thread Xueming Shen
Hi, While still waiting patiently for the review for RFR: Regex exponential backtracking issue [1] RFR: Regex exponential backtracking issue --- more cleanup/tuning [2] Here is the third round of change to address the "broken canonical equivalent support" issue listed in JEP-111 [3] . Canoni

Re: RFR JDK-7071819: To support Extended Grapheme Clusters in Regex

2016-02-03 Thread Xueming Shen
ntroducing a local var. Regards, Peter On 02/02/2016 10:25 PM, Xueming Shen wrote: Hi, Have not heard any feedback on this one so far. I'm adding a little more to make it attractive for reviewers :-) On top of the \N now the webrev includes the proposal to add two more matchers, \X for u
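
As a rough illustration of the matchers discussed in this thread, the sketch below uses `\X` (extended grapheme cluster) and `\N{name}` as they exist in java.util.regex from Java 9 onward. The class and method names are hypothetical; the code is not taken from the webrev.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeDemo {
    // Counts extended grapheme clusters by repeatedly matching \X.
    static int countClusters(String s) {
        Matcher m = Pattern.compile("\\X").matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Matches a single character by its Unicode character name via \N{...}.
    static boolean matchesByName(String unicodeName, String input) {
        return Pattern.matches("\\N{" + unicodeName + "}", input);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two chars, one cluster.
        System.out.println(countClusters("e\u0301"));
        System.out.println(matchesByName("LATIN SMALL LETTER A", "a"));
    }
}
```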

RFR JDK-7071819: To support Extended Grapheme Clusters in Regex

2016-02-02 Thread Xueming Shen
https://bugs.openjdk.java.net/browse/JDK-7071819 Issue: https://bugs.openjdk.java.net/browse/JDK-8147531 webrev: http://cr.openjdk.java.net/~sherman/8147531_7071819/webrev/ Thanks! Sherman On 01/18/2016 11:52 PM, Xueming Shen wrote: Hi, Please help review the change to add \N support in regex.

RFR JDK-8042125: Japanese character converters incompatible between Java 7 and Java 8

2015-05-23 Thread Xueming Shen
Hi Please help review the change for 8042125 issue: https://bugs.openjdk.java.net/browse/JDK-8042125 webrev: http://cr.openjdk.java.net/~sherman/8042125 It's a regression caused by the changes of JDK-6653797. The direct triggers are (1) the .c2b mapping table for ms932/0208 is missing (regardless the

Re: RFR : 8071447: IBM1166 Locale Request for Kazakh characters

2015-02-24 Thread Xueming Shen
Sean, Based on https://bugs.openjdk.java.net/browse/JDK-4159519, "historically" our ebcdic charsets always do IBM.map: 0x15 U+000a 0x25 U+000a IBMxxx.c2b 0x15 U+0085 IBMxxx.nr 0x25 U+000a Someone had complained that "this is not correct". But it has been this way for a long

Re: [8u-dev] 8051641: Request for Approval: Africa/Casablanca transitions is incorrectly calculated starting from 2027

2014-12-29 Thread Xueming Shen
On 12/28/2014 03:02 AM, Aleksej Efimov wrote: Hi, Can I have an approval for the backport of 8051641. The backport slightly differs from JDK9 for ZoneRulesBuilder.java. Adding the original reviewer and alias for review. Testing: All TZ related tests shows PASS result, the JPRT testing shows no

Re: RFR: 8051641: Africa/Casablanca transitions is incorrectly calculated starting from 2027

2014-12-15 Thread Xueming Shen
On 12/15/14 8:06 AM, Aleksej Efimov wrote: Hello Sherman, Masayoshi and other experts, Can I ask for a JDK9 review for the sun/util/calendar/zi/TestZoneInfo310.java test failure [1] that was caused by incorrect handling of last rules for Africa/Casablanca time zone. This timezone has two last

Re: RFR: JDK-8042369 Remove duplicated java.time classes in build.tools.tzdb

2014-05-19 Thread Xueming Shen
ps run with the new data in next 24 hrs...) in the future. http://cr.openjdk.java.net/~sherman/tzdbProvider/webrev On 05/04/2014 11:16 PM, Xueming Shen wrote: Hi Please help review the change for #8042369 Issue: https://bugs.openjdk.java.net/browse/JDK-8042369 Webrev: http://cr.openjdk.java.ne

RFR: JDK-8039751: UTF-8 decoder fails to handle some edge cases correctly

2014-04-09 Thread Xueming Shen
Hi, Please help review the fix for JDK-8039751. Issue: https://bugs.openjdk.java.net/browse/JDK-8039751 webrev: http://cr.openjdk.java.net/~sherman/8039751/webrev/ This is the corner case (in 4 bytes sequence) we missed when fixing 7096080 [1]. The UTF_8 decoder correctly returns the malf

Re: RFR: 8037012: (tz) Support tzdata2014a

2014-03-13 Thread Xueming Shen
looks good. On 3/12/14 8:05 AM, Aleksej Efimov wrote: Hello, Can I have a review for a tzdata2014a [1] integration to JDK9: http://cr.openjdk.java.net/~aefimov/8037012/9/webrev.00 The following test sets were executed on the build with latest tzdata: test/sun/util/calendar test/java/util/Cale

Re: RFR: 8027370: (tz) Support tzdata2013h

2013-11-08 Thread Xueming Shen
looks fine. I would assume you've also run the corresponding tests at test/closed repo. -Sherman On 11/5/2013 8:38 AM, Aleksej Efimov wrote: Hi, Can I have a review for tzdata2013h integration [1]. The webrev link can be located here [2]. The following test sets were executed on build with

Re: 8027848: The ZoneInfoFile doesn't honor future GMT offset changes

2013-11-06 Thread Xueming Shen
here: http://cr.openjdk.java.net/~aefimov/8027848/webrev.01/ -Aleksej On 11/05/2013 10:58 PM, Xueming Shen wrote: On 11/05/2013 10:50 AM, Xueming Shen wrote: Aleksej, For better performance (1) the currT should be "static f

Re: 8027848: The ZoneInfoFile doesn't honor future GMT offset changes

2013-11-05 Thread Xueming Shen
On 11/05/2013 10:50 AM, Xueming Shen wrote: Aleksej, For better performance (1) the currT should be "static final" so we don't have to access the System.currentTimeMillis() for each TimeZone/ZoneInfo instance. (2) instead of iterating through the standardTransitions(), shouldn't we

Re: 8027848: The ZoneInfoFile doesn't honor future GMT offset changes

2013-11-05 Thread Xueming Shen
Aleksej, For better performance (1) the currT should be "static final" so we don't have to access the System.currentTimeMillis() for each TimeZone/ZoneInfo instance. (2) instead of iterating through the standardTransitions(), shouldn't we just check the last one? given it's a sorted list. btw, in
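
The two suggestions above (read the clock once, and inspect only the last entry of a sorted list) can be sketched roughly as below. All names are hypothetical and the transitions are modeled as a plain sorted `long[]`; this is not the actual ZoneInfoFile code.

```java
final class TransitionCheckDemo {
    // Read the clock once at class-initialization time instead of once per
    // instance, as point (1) suggests.
    private static final long CURRENT_TIME_MILLIS = System.currentTimeMillis();

    // Point (2): with transitions sorted in ascending order, only the last
    // entry needs checking to learn whether any transition lies in the future.
    static boolean hasFutureTransition(long[] sortedTransitions) {
        if (sortedTransitions.length == 0) {
            return false;
        }
        return sortedTransitions[sortedTransitions.length - 1] > CURRENT_TIME_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(hasFutureTransition(new long[] {0L, Long.MAX_VALUE}));
        System.out.println(hasFutureTransition(new long[] {0L, 1L}));
    }
}
```

One trade-off of the static field is that the "current time" is frozen when the class loads, which may or may not be acceptable for a long-running process.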

RFR JDK-8020054: (tz) Support tzdata2013d

2013-08-08 Thread Xueming Shen
Hi, Please help review the proposed change to update the tz data in JDK8 from 2013c to 2013d. http://cr.openjdk.java.net/~sherman/8020054/webrev http://cr.openjdk.java.net/~sherman/8020054/closed Tests list below have been run and passed (except java/time/tck/java/time/chrono/TCKChronology.java

Re: [threeten-dev] RFR JDK-8013386: (tz) Support tzdata2013c

2013-05-13 Thread Xueming Shen
-) >> I'm concerned about the 24:00 fix. Is there any way to produce the correct rules without hard coding time zone IDs? > I don't know how to do it, yet. I definitely can have a RFE for it and spend some time on it later. Thanks, -- Yuka (2013/05/14 2:22), Xueming Shen wr

Re: [threeten-dev] RFR JDK-8013386: (tz) Support tzdata2013c

2013-05-13 Thread Xueming Shen
It would be appreciated if you guys can help review before 5/15 us time here, so tz update can get into M7, if it matters;-) -Sherman On 5/8/2013 3:20 PM, Xueming Shen wrote: Hi, Please help review the proposed change to update the tz data in JDK8 from 2012i to 2013c. Other than the tzdb

Re: RFR JDK-8013386: (tz) Support tzdata2013c

2013-05-09 Thread Xueming Shen
RFE for it and spend some time on it later. -Sherman Masayoshi On 5/10/2013 8:24 AM, Xueming Shen wrote: Hi Sean, Thanks for the review. It appears I missed jdk/test/sun/util/calendar/zi/tzdata, webrev has been updated to include the test data update. http://cr.openjdk.java.net/~sherman/8013386/web

Re: RFR JDK-8013386: (tz) Support tzdata2013c

2013-05-09 Thread Xueming Shen
N file stored with tzdata. Above points are not necessarily related to 2013c update and should be cleaned up separately perhaps. regards, Sean. On 08/05/2013 23:20, Xueming Shen wrote: Hi, Please help review the proposed change to update the tz data in JDK8 from 2012i to 2013c. Other than

RFR JDK-8013386: (tz) Support tzdata2013c

2013-05-08 Thread Xueming Shen
Hi, Please help review the proposed change to update the tz data in JDK8 from 2012i to 2013c. Other than the tzdb data file update under make/sun/javazic/tzdata, corresponding updates have also been made in TimeZoneNames***.java for the newly added zones, Asia/Khandyga and Ust-Nera, and updated

Re: RFR JDK-8013254: Constructor \w need update to add the support of \p{Join_Control}

2013-04-30 Thread Xueming Shen
My apology, the webrev is at http://cr.openjdk.java.net/~sherman/8013254/webrev/ -Sherman On 04/30/2013 10:01 AM, Xueming Shen wrote: Hi, It appears we dropped the ball on u+200c and u+200d when we updated the "simple word boundaries" back to jdk7 [1]. You can find most of t
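
A small check of the behavior this change is about, assuming a JDK in which the fix is present: with `Pattern.UNICODE_CHARACTER_CLASS`, `\w` is expected to accept the join-control characters U+200C and U+200D. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class JoinControlDemo {
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern ASCII_WORD = Pattern.compile("\\w");

    // True when the single-character string is a word character under the
    // Unicode-aware definition of \w.
    static boolean isUnicodeWordChar(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    // True when it is a word character under the default, ASCII-only \w.
    static boolean isAsciiWordChar(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeWordChar("\u200C")); // ZERO WIDTH NON-JOINER
        System.out.println(isUnicodeWordChar("\u200D")); // ZERO WIDTH JOINER
        System.out.println(isAsciiWordChar("\u200D"));
    }
}
```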

Fwd: RFR JDK-8013254: Constructor \w need update to add the support of \p{Join_Control}

2013-04-30 Thread Xueming Shen
Original Message Message-ID: <517ff8f8.3080...@oracle.com> Date: Tue, 30 Apr 2013 10:01:44 -0700 From: Xueming Shen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10 MIME-Version: 1.0 To: core-libs-de

Re: [8] Review request for JEP 127: Improve Locale Data Packaging and Adopt Unicode CLDR Data

2012-08-21 Thread Xueming Shen
looks fine for me. -sherman On 8/20/2012 10:14 AM, Naoto Sato wrote: I have updated the changeset by removing the copyright headers from all of the CLDR files, and added a LICENSE file at the top of CLDR source directory (src/share/classes/sun/util/cldr/resources/21_0_1). No other changes hav

Re: Fwd: Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-27 Thread Xueming Shen
Thanks Alan! The webrev has been updated to throw OOME as your other nio native dispatcher does. http://cr.openjdk.java.net/~sherman/7130915/webrev. I can wait for your back from the vacation:-) -Sherman On 6/26/12 11:41 PM, Alan Bateman wrote: On 27/06/2012 04:33, Xueming Shen wrote

Re: Fwd: Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-26 Thread Xueming Shen
nfc/nfd layer will be there. Copyright has been re-copy/ pasted and we now use only bugid. -Sherman On 6/26/12 8:02 AM, Alan Bateman wrote: On 26/06/2012 07:00, Xueming Shen wrote: On 6/25/12 10:58 PM, Xueming Shen wrote: Hi, While I still believe that case-insensitive is the ri

Fwd: Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-25 Thread Xueming Shen
On 6/25/12 10:58 PM, Xueming Shen wrote: Hi, While I still believe that case-insensitive is the right choice for File/Path on MacOSX, it is suggested that we might want to be a little conservative in this patch, with the assumption that this patch will be backported to 7u release after being

Re: Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-24 Thread Xueming Shen
6.2012, at 19:01, Xueming Shen wrote: Hi Here is the proposed change to support Unicode nfd/nfc and case insensitive file path on MacOSX file system. 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X 7168427: FileInputStream cannot open f

Re: Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-22 Thread Xueming Shen
On 6/22/12 11:02 AM, Mike Duigou wrote: On Jun 22 2012, at 10:01 , Xueming Shen wrote: Hi Here is the proposed change to support Unicode nfd/nfc and case insensitive file path on MacOSX file system. 7130915: File.equals does not give expected results when path contains Non-English

Codereview request for 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X

2012-06-22 Thread Xueming Shen
Hi Here is the proposed change to support Unicode nfd/nfc and case insensitive file path on MacOSX file system. 7130915: File.equals does not give expected results when path contains Non-English characters on Mac OS X 7168427: FileInputStream cannot open file where the file path contains asian
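
The underlying NFC/NFD mismatch can be illustrated with java.text.Normalizer. This is only a sketch of the general idea (two spellings of the same visible name comparing unequal until normalized); it does not reproduce the native-layer approach of the patch, and the names are hypothetical.

```java
import java.text.Normalizer;

final class PathNormDemo {
    // Compares two names after bringing both to NFC (composed form).
    static boolean sameAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";     // U+00E9, precomposed (NFC)
        String decomposed = "cafe\u0301";  // 'e' + U+0301, decomposed (NFD)
        System.out.println(composed.equals(decomposed));
        System.out.println(sameAfterNfc(composed, decomposed));
    }
}
```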

Re: Fwd: Some differences on Window UDC area

2012-06-03 Thread Xueming Shen
ate for both MS936 and GBK. -Sherman On 5/31/2012 9:47 PM, Xueming Shen wrote: On 5/31/2012 8:04 PM, Charles Lee wrote: Hi Sherman, Thank you for bring these out. The change is great because MS936.map is the same as mine :-) What about GBK.map? Given how those code points are mapped in GB

Re: Fwd: Some differences on Window UDC area

2012-05-31 Thread Xueming Shen
'm confirming with our Solaris people to get the mapping table used in their iconv. -Sherman On 05/31/2012 03:25 PM, Xueming Shen wrote: Hi, Here is the webrev for the updated MS936.map change, which updated the mapping entries for 500+ EUDC code points within the range of A140-A7A0. I'

Re: Fwd: Some differences on Window UDC area

2012-05-31 Thread Xueming Shen
Hi, Here is the webrev for the updated MS936.map change, which updated the mapping entries for 500+ EUDC code points within the range of A140-A7A0. I'm using CR#6183404 http://cr.openjdk.java.net/~sherman/6183404/webrev I re-generated the MS936.b2c and c2b mapping tables via MultiByteToWideChar

Re: Fwd: Some differences on Window UDC area

2012-05-29 Thread Xueming Shen
Hi Charles, The MS936 charset is long overdue for an update. See CR#6183404. The mapping needs to be re-generated from MS's latest 936 table (note, MS936 should just follow MS's mapping table, not GB18030) As noted in MS936.map, the existing mapping table uses 1894 entries from GBK UDC block for

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-16 Thread Xueming Shen
Thanks Alan, webrev has been updated accordingly. http://cr.openjdk.java.net/~sherman/4153167/webrev -Sherman On 02/15/2012 07:00 AM, Alan Bateman wrote: On 13/02/2012 17:36, Xueming Shen wrote: : The webrev is at

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Xueming Shen
On 2/13/2012 11:07 AM, Bill Shannon wrote: Thanks for fixing this! The webrev is at http://cr.openjdk.java.net/~sherman/4153167/webrev You probably don't need to malloc 64 bytes for a string that's going to be less than 16 bytes. And shouldn't you use snprintf in any event? Unlike Unix,

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Xueming Shen
To have separate sun.stdout.encoding and sun.stderr.encoding is mainly because of implementation convenience. I need three things from the native (1) is std.out tty (2) is std.err tty (3) the console encoding if (1) or (2) are true, and I tried to avoid to go down to native multiple times it a

Re: Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Xueming Shen
underlying console can understand. -Sherman -Sherman -Ulf Am 13.02.2012 18:36, schrieb Xueming Shen: Hi This is a long standing Windows codepage support issue on Java platform (we probably have 20 bug/rfes filed for this particular issue and closed as the dup of 4153167). Windows

Codereview request for 4153167: separate between ANSI and OEM code pages on Windows

2012-02-13 Thread Xueming Shen
Hi This is a long standing Windows codepage support issue on Java platform (we probably have 20 bug/rfes filed for this particular issue and closed as the dup of 4153167). Windows supports two sets of codepages, ANSI (Windows) codepage and OEM (IBM) codepage. Windows uses ANSI/Windows codepa

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-10 Thread Xueming Shen
reate a filter stream to deal with stateful encodings with the java.io API. If it's OK to support only 1.4 and later, the java.nio.charset API should be used. Thanks, Masayoshi On 2/10/2012 4:12 AM, Xueming Shen wrote: CCed Bill Shannon. On 02/09/2012 11:10 AM, Xueming Shen wrote: Ch

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-09 Thread Xueming Shen
Jason, I might be misunderstanding your suggestion, but the current implementation of OutputStreamWriter.flushBuffer()/StreamWriter.implFlushBuffer() does not flush the encoder, so even the caller can choose when to invoke flushBuffer(), it does not solve the problem (flush() invokes flushBuff

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-09 Thread Xueming Shen
CCed Bill Shannon. On 02/09/2012 11:10 AM, Xueming Shen wrote: CharsetEncoder has the "flush()" method as the last step (of a series of "encoding" steps) to flush out any internal state to the output buffer. The issue here is that the upper level wrapper class, OutputStre

Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-09 Thread Xueming Shen
really a Java SE bug? The usage of OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo method in the bug report doesn't seem to be able to deal with any stateful encodings. Masayoshi On 2/9/2012 3:26 PM, Xueming Shen wrote: Hi This is a long standing "regression"

Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

2012-02-08 Thread Xueming Shen
Hi This is a long standing "regression" from 1.3.1 on how OutputStreamWriter.flush()/flushBuffer() handles escape or shift sequence in some of the charset/encoding, for example the ISO-2022-JP. ISO-2022-JP is an encoding that starts with ASCII mode and then switches between ASCII and Japanese ch

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Xueming Shen
On 10/13/2011 09:55 AM, Ulf Zibis wrote: Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b >> 6) != -2; } private static boolean op2(int b) { return (b &
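
The check being benchmarked here tests whether a byte is not a UTF-8 continuation byte (bit pattern 10xxxxxx). The sketch below pairs the shift form quoted above with a mask form; the mask form is a common equivalent supplied as an assumption, not necessarily the `op2` from the truncated message, and the class and method names are hypothetical.

```java
final class ContinuationByteDemo {
    // Shift form: for a sign-extended byte, the top two bits are "10"
    // exactly when (b >> 6) == -2.
    static boolean notContinuationShift(int b) {
        return (b >> 6) != -2;
    }

    // Mask form: isolate the top two bits of the low byte and compare to 0x80.
    static boolean notContinuationMask(int b) {
        return (b & 0xc0) != 0x80;
    }

    public static void main(String[] args) {
        // The two forms agree for every possible byte value.
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            if (notContinuationShift(i) != notContinuationMask(i)) {
                throw new AssertionError("forms disagree for " + i);
            }
        }
        System.out.println(notContinuationShift((byte) 0x80)); // continuation byte
        System.out.println(notContinuationShift((byte) 0x42)); // ASCII 'B'
    }
}
```

Note that the equivalence holds for sign-extended byte values, which is how `byte` widens to `int` in Java.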

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Xueming Shen
On 10/11/2011 04:36 AM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back to 2009(?) because the benchmark shows the "shift" version is slightly faster. Do you have any number shows any difference now

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-01 Thread Xueming Shen
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Xueming Shen
On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind is, that i
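
To see what a given JDK reports for the byte sequence discussed above, one can run the decoder directly and print the resulting CoderResult. A minimal sketch with hypothetical class and method names; it makes no claim about which malformed length a particular release returns.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class MalformedDemo {
    // Decodes the bytes with a fresh UTF-8 decoder. A new decoder reports
    // malformed input by default rather than replacing it.
    static CoderResult decode(byte[] bytes) {
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(16), true);
    }

    public static void main(String[] args) {
        CoderResult cr = decode(new byte[] {(byte) 0xE1, (byte) 0x80, (byte) 0x42});
        // Prints the result, including the malformed length this JDK chose.
        System.out.println(cr);
    }
}
```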

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-29 Thread Xueming Shen
"desired"/recommended behavior in this case, from Standard point view? Am 29.09.2011 05:27, schrieb Xueming Shen: Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote: 5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets W

Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen
Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote: Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 It might take couple days for it to show up on bugs.sun.com. But it has exactly the same content as my previous email. In fact I simply copy/pa

Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen
Hi, [I combined the proposed change for #7082884, in which no one appears to be interested:-) into this one] Unicode Standard added "Additional Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated in 6.0 again with further "clarification" [2] regarding how a "conformanc

Re: Java encoder errors

2011-09-20 Thread Xueming Shen
On 09/19/2011 03:26 PM, Tom Christiansen wrote: Mark Davis ☕ wrote on Mon, 19 Sep 2011 14:41:49 PDT: I agree with the first part, disallowing the irregular code sequences. Finding that Java allowed surrogates to sneak through in their UTF-8 streams like that was quite odd. It's said "b

Re: Java encoder errors

2011-09-19 Thread Xueming Shen
Tom, Very good timing:-) I'm back to my encoding related bugs just fixing some corner cases in the new UTF-8 implementation we putback in for JDK7. The surrogates part is a known issue. Unicode Standard can simply change its "terms" [1] and announce "the irregular code unit sequence is no lo

Codereview request for 7082884: Incorrect UTF8 conversion for sequence ED 31

2011-09-19 Thread Xueming Shen
Hi, Unicode Standard added "Additional Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated in 6.0 again with further "clarification" [2] regarding how a "conformance" implementation should handle ill-formed UTF-8 byte sequence. Basically it says (1) the conversion pro

Re: Request for review: 7084245: Update usages of InternalError to use exception chaining

2011-08-30 Thread Xueming Shen
Hi Sebastian, On 08/30/2011 01:23 AM, Sebastian Sickelmann wrote: Sorry i have forgotten the webrev url. http://oss-patches.24.eu/openjdk8/InternalError/part2/7084245_main_1/ with couple changes from your original patch. (1) Undo the changes in DecimalFormat.java and Format.java. whil

Re: Request for review: 7084245: Update usages of InternalError to use exception chaining

2011-08-29 Thread Xueming Shen
Hi Sebastian, I will help to push the patch, if people all agreed the changes proposed. I pulled your patch and generated the webrev at http://cr.openjdk.java.net/~sherman/7084245/webrev with couple changes from your original patch. (1) Undo the changes in DecimalFormat.java and Format.java.

Re: Is(n't) this a Java Unicode compiler bug? [4=OSCON]

2011-07-19 Thread Xueming Shen
Tom, JLS 3.8 [1] Identifiers states "Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit. Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single l

Re: j.u.regex: Negated Character Classes

2011-06-08 Thread Xueming Shen
'll admit to reading through to the end of your note and finding it interesting ;-) Some comments in-lined. On 03/Jun/2011 22:55, Xueming Shen wrote: I'm sure everybody understands what "negated character classes" [^...] in j.u.regex means. You would never have doubt about [^c

Re: Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-27 Thread Xueming Shen
Thanks Alan! webrev has been updated accordingly. -Sherman On 4/27/2011 8:51 AM, Alan Bateman wrote: Xueming Shen wrote: : UNICODE_CHARACTER_CLASS is clear and straightforward. I am OK with it. The webrev, ccc and api docs have been updated accordingly. Yes, I still need a reviewer for

Re: Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-26 Thread Xueming Shen
eaning OO. So for that reason I don't think CLASS(ES) would be optimal. bc; Bidi_Class ccc ; Canonical_Combining_Class http://unicode.org/Public/UNIDATA/PropertyAliases.txt Mark /— Il meglio è l’inimico del bene —/ On Sun, Apr 24, 2011 at 11:22, Xueming Shen <mailto:xuemi

Re: Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-26 Thread Xueming Shen
On 04-26-2011 2:20 AM, Alan Bateman wrote: Xueming Shen wrote: Thanks Mark! Let's go with UNICODE_PROPERTY, if there is no objection. I went through the updates to the javadoc and the approach looks good and nicely done. A minor comment is that the compile(String,int) method repeat

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-24 Thread Xueming Shen
Thanks Tom! The j.u.regex does not have its own direct access to PropList for now, have to use the properties from j.l.Character class. I will have to move those CharacterDataNN classes from the java.lang package (package private) to sun.lang or somewhere that both j.l.Character and j.u.regex

Re: Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-24 Thread Xueming Shen
Two more names, UNICODE_PROPERTIES and UNICODE_CLASSES, are suggested. any opinion? -Sherman On 4/23/2011 6:50 PM, Xueming Shen wrote: Forwarding...forgot to include the list. Original Message Subject: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4

Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen
Forwarding...forgot to include the list. Original Message Subject: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties Date: Sat, 23 Apr 2011 17:53:42 -0700 From: Xueming Shen To: Tom Christiansen Mark

Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen
11 1:00 AM, Xueming Shen wrote: Hi This proposal tries to address (1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1] requirement as Tom pointed out in his email on i18n-dev list [2]. Basically we have 3 problems here. a. j.u.regex word boundary construct \b and \B u

Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

2011-04-23 Thread Xueming Shen
Hi This proposal tries to address (1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1] requirement as Tom pointed out in his email on i18n-dev list [2]. Basically we have 3 problems here. a. j.u.regex word boundary construct \b and \B uses Unicode \p{letter} + \p{digit
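
The flag that this proposal eventually settled on, `Pattern.UNICODE_CHARACTER_CLASS` (Java 7 and later), can be exercised with a short sketch. The class and method names are hypothetical; the example only contrasts the default ASCII-only `\w` with the Unicode-aware one.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // True when the whole input consists of \w characters; the flag selects
    // between the default (ASCII-only) and the Unicode-aware definition.
    static boolean isWord(String s, boolean unicodeAware) {
        int flags = unicodeAware ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "na\u00EFve"; // contains U+00EF, LATIN SMALL LETTER I WITH DIAERESIS
        System.out.println(isWord(word, false));
        System.out.println(isWord(word, true));
    }
}
```

The same behavior can also be requested inline with the embedded flag `(?U)` at the start of the pattern.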

Review request: 7037261: j.l.Character.isLowerCase/isUpperCase need to match the Unicode Standard definition

2011-04-19 Thread Xueming Shen
Hi Tom Christiansen recently contributed an API doc update [1] for j.l.Character, as the followup for the Unicode support discussion in j.l.Character/j.u.regex we had back in January. In his doc patch, Tom recommended to "downgrade" the doc for j.l.Character.isLowerCase/isUpperCase(char/int) metho

Re: java.lang.Character lacuna #1 of 2

2011-04-15 Thread Xueming Shen
Tom I have filed CR/RFE 7036910: j.l.Character.toLowerCaseCharArray/toTitleCaseCharArray for this request. The j.l.Character.toLowerCase/toUpperCase() suggests to use String.toLower/UpperCase() for case mapping, if you want 1:M mapping taken care. And if you trust the API:-), which you shou
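
The 1:M case mapping mentioned above can be seen by comparing the char-level and String-level APIs on U+00DF (LATIN SMALL LETTER SHARP S), whose full uppercase mapping is two characters. The class and method names below are hypothetical.

```java
import java.util.Locale;

final class CaseMapDemo {
    // Character-level mapping is 1:1, so it cannot expand one char into two.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // String-level mapping applies the full (possibly 1:M) case mapping.
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upperChar('\u00DF') == '\u00DF'); // unchanged by the 1:1 API
        System.out.println(upperString("\u00DF"));           // expands to two letters
    }
}
```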

Re: java.lang.Character lacuna #2 of 2

2011-04-14 Thread Xueming Shen
Tom, Welcome back:-) Have you seen that cool \x{h...h}? oh, you saw it:-) Yes, It might be desirable to have a corresponding getCodePointFromName(String name), at least I will need that when I do \N{unicode_name} in regex, but I'm not sure if it is worth to make it a method into j.l.Character

Re: Codereview request for 7033561: Missing Unicode Script aliases

2011-04-06 Thread Xueming Shen
Thanks! webrev has been updated accordingly. -Sherman On 04/06/2011 01:29 PM, Alan Bateman wrote: Xueming Shen wrote: It appears the aliases mapping for Character.UnicodeScript is not updated accordingly when we upgraded the Unicode support to 6.0 for JDK7. The difference between the

Codereview request for 7033561: Missing Unicode Script aliases

2011-04-06 Thread Xueming Shen
It appears the aliases mapping for Character.UnicodeScript is not updated accordingly when we upgraded the Unicode support to 6.0 for JDK7. The difference between the previous version (5.2) and 6.0 of the aliases are these 3 missing names reported in #7033561. The webrev with the change is at
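
For context, script aliases are the short names accepted by `Character.UnicodeScript.forName` alongside the full script names. The sketch below uses the alias "Latn" purely as an illustration; it is not one of the three missing names, which the message above does not list. The class and method names are hypothetical.

```java
final class ScriptAliasDemo {
    // Looks up a script by full name or alias (case-insensitive).
    static Character.UnicodeScript lookup(String nameOrAlias) {
        return Character.UnicodeScript.forName(nameOrAlias);
    }

    public static void main(String[] args) {
        System.out.println(lookup("Latn"));   // alias
        System.out.println(lookup("LATIN"));  // full name
        System.out.println(Character.UnicodeScript.of('A'));
    }
}
```

An unknown name makes `forName` throw IllegalArgumentException, which is how a missing alias would surface to callers.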

Re: RL1.1 Hex Notation

2011-01-27 Thread Xueming Shen
I run public static void main(String[] args) { test("\uD800\uDF3C", "^\\x{1033c}$"); test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$"); test("\uD800\uDF3C", "^\\x{D800}\\x{DF3c}+$"); test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3c}]+$"); test("\uD800\uDF3C", "^
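
The first two cases in the test run above can be reproduced in a self-contained form on Java 7 or later, where `\x{h...h}` names a code point directly. The `test` helper in the original message is not shown, so the hypothetical `matches` method below stands in for it.

```java
import java.util.regex.Pattern;

final class HexNotationDemo {
    // Stand-in for the elided test(...) helper: full-string regex match.
    static boolean matches(String input, String regex) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String s = "\uD800\uDF3C"; // U+1033C as a surrogate pair
        // Code point notation: matches the supplementary character.
        System.out.println(matches(s, "^\\x{1033c}$"));
        // UTF-8 byte values written as \xhh are four separate code points
        // (U+00F0, U+0090, ...), so they do not match.
        System.out.println(matches(s, "^\\xF0\\x90\\x8C\\xBC$"));
    }
}
```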

Re: RL1.1 Hex Notation

2011-01-27 Thread Xueming Shen
On 01/27/2011 12:48 PM, Tom Christiansen wrote: Sherman wrote: Oh, I see the problem. Obviously I have been working on jdk7 too long and forgot the latest release is still 6:-( There is indeed a bug in the previous implementation which I fixed in 7 long time ago (I mentioned this in one of the

Re: RL1.1 Hex Notation

2011-01-27 Thread Xueming Shen
Mark, The high/lowSurrogate(codepoint) pair has been added in jdk1.7 already. http://download.java.net/jdk7/docs/api/java/lang/Character.html#highSurrogate(int) http://download.java.net/jdk7/docs/api/java/l
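
The two methods referenced above split a supplementary code point into its UTF-16 halves. A short sketch with hypothetical class and method names:

```java
final class SurrogateDemo {
    // Returns the two UTF-16 code units for a supplementary code point.
    static char[] split(int codePoint) {
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    public static void main(String[] args) {
        char[] pair = split(0x1033C);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Round trip back to the code point.
        System.out.printf("%X%n", Character.toCodePoint(pair[0], pair[1]));
    }
}
```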

Re: RL1.1 Hex Notation

2011-01-26 Thread Xueming Shen
On 01/26/2011 11:50 AM, Mark Davis ☕ wrote: > I guess you are asking for something like? I'm not asking for that. What I'm saying is that as far as I can tell, there is no way in Java to meet the terms of RL1.1, because there is not a way to use hex numbers in any syntax for values above

Re: Now what?

2011-01-26 Thread Xueming Shen
On 1.27.2011 3:09, Tom Christiansen wrote: 7006289: java.util.regex yields nonsense by breaking the connection between \b and \w Category: java:classes_util State: 1-Dispatched, bug Priority: 4-Low Submit Date: 12-DEC-2010 7006291: Java claims to support Unicode p

Re: RL1.1 Hex Notation

2011-01-25 Thread Xueming Shen
Hi Mark, I guess you are asking for something like? char[] cc = Character.toChars(0x12345); Matcher m = Pattern.compile("[" + "\\u" + HEX(cc[0]) + "\\u" + HEX(cc[1]) + "
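
The approach in the truncated snippet above can be sketched roughly as follows. The `hex` method is a hypothetical stand-in for the `HEX(...)` helper, whose definition is not shown in the message, and the class name is made up. The sketch assumes JDK 7 or later, where two consecutive escapes for a surrogate pair are read as one code point and the `\x{...}` form is also accepted.

```java
import java.util.regex.Pattern;

public class EscapedClassSketch {
    // Hypothetical stand-in for HEX(...): four upper-case hex digits.
    static String hex(char c) {
        return String.format("%04X", (int) c);
    }

    // Builds the regex escape text for each UTF-16 unit of a code point.
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append('\\').append('u').append(hex(c));
        }
        return sb.toString();
    }

    // True if a character class built from the escapes matches the code point.
    static boolean classMatches(int codePoint) {
        String target = new String(Character.toChars(codePoint));
        return Pattern.matches("[" + escape(codePoint) + "]", target);
    }

    // True if the hex-notation form matches the same code point.
    static boolean hexNotationMatches(int codePoint) {
        String target = new String(Character.toChars(codePoint));
        return Pattern.matches("\\x{" + Integer.toHexString(codePoint) + "}", target);
    }

    public static void main(String[] args) {
        System.out.println(escape(0x12345));
        System.out.println(classMatches(0x12345));
        System.out.println(hexNotationMatches(0x12345));
    }
}
```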

Re: Now what?

2011-01-25 Thread Xueming Shen
Tom, Yes, I would need some time to digest all the technical details, though I believe I have a good understanding of most of the issues you raised. Sure, I will keep you updated on the related RFEs I will submit based on your research. The CR#s I have so far are 7014645: Support Perl style Uni

Re: regex rewriting code (part 1 of 3)

2011-01-25 Thread Xueming Shen
Tom, Given that these POSIX/ASCII-only properties/constructs have been there for years ("compatibility") and that "most" developers appear to be happy (habit, performance...) with them, I don't think we can or want to switch to the Unicode version simply for conformance. Java ta

Re: RL1.1 Hex Notation

2011-01-24 Thread Xueming Shen
Tom, I would not read too much into this:-) There is no reason for tr#18 to use any specific encoding in the specification; it's a perfectly good choice to simply pick the syntax notation that uses the code point value directly. However I don't think this "sample" syntax (or might be even further

Re: j.u.r.Pattern documentation errors

2011-01-23 Thread Xueming Shen
Thanks Tom. That part of the doc definitely needs a revisit; it was written before 2002 (probably against Perl 5.6) and has not been touched since, so a lot of it is no longer true given the latest 5.12. -Sherman On 1-23-2011 14:14 02:14 PM, Tom Christiansen wrote: In this message I cover only those e

Re: RL1.4 Simple Word Boundaries

2011-01-23 Thread Xueming Shen
Tom, Thanks for the detailed and excellent "reality check". While I'm still going through all the details it appears that the fact the current Java Unicode property data does not include the properties defined in PropList.txt (current implementation reads the property data only from UnicodeDat

Re: RL1.2 Properties (part 1 of 2)

2011-01-23 Thread Xueming Shen
Tom, The Unicode/Java versions of the lowercase, uppercase, whitespace and letter character classes are provided via \p{javaXYZ}, and \p{Lower/Upper/Alpha/Space} are specified/implemented as the POSIX versions, which is clearly documented in the API document. I would not use "worst" for this. I do
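
The distinction drawn above can be checked with a short sketch. The class and method names are made up for the example; the third method assumes JDK 7 or later, where the `Pattern.UNICODE_CHARACTER_CLASS` flag exists.

```java
import java.util.regex.Pattern;

public class LowerClassSketch {
    // POSIX version: US-ASCII lower-case letters only, by default.
    static boolean posixLower(String s) {
        return Pattern.matches("\\p{Lower}", s);
    }

    // Java version: delegates to Character.isLowerCase.
    static boolean javaLower(String s) {
        return Pattern.matches("\\p{javaLowerCase}", s);
    }

    // Same POSIX syntax, but with the Unicode-aware flag switched on.
    static boolean unicodeLower(String s) {
        return Pattern.compile("\\p{Lower}", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(posixLower("e") + " " + posixLower(eAcute));
        System.out.println(javaLower(eAcute));
        System.out.println(unicodeLower(eAcute));
    }
}
```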

Re: more Oracle MX troubles

2011-01-21 Thread Xueming Shen
800 Received: from [10.159.5.156] (/10.159.5.156) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 21 Jan 2011 12:25:26 -0800 Message-ID:<4d39ebaa.8090...@oracle.com> Date: Fri, 21 Jan 2011 12:25:14 -0800 From: Xueming Shen User-Agent: Mozilla/5.0 (Windows; U; Windows NT

Re: RL1.1 Hex Notation

2011-01-21 Thread Xueming Shen
Tom, Introducing the new Perl-style \x{...} as the hexadecimal notation appears to be a nice-to-have enhancement (I will file an RFE to put this request on record). But I don't think you can simply claim that the Java Unicode escape sequences for UTF-16 are NOT a "mechanism"/notation for specif

Re: Java Regexes vs Unicode Regexes

2011-01-20 Thread Xueming Shen
On 01/20/2011 12:55 PM, Tom Christiansen wrote: Sherman wrote: In the end, Java RegEx is NOT a Unicode RegEx; while it supports Unicode RegEx at a certain level, sometimes via different syntax, I don't feel this is a big problem for most Java developers and should not be a stopper for most program

Re: Java and Unicode

2010-12-11 Thread Xueming Shen
Hi Tom, Thanks for looking into the Unicode support issues in Java RegEx. Since you have been working on Unicode for the past decade, I'm sure you understand that most of the issues you are pointing out here belong to the "Extended Unicode Support: Level 2" as documented in UTS#18 Unicode Re

Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Xueming Shen
.util.Arrays.sort(Arrays.java:1079) at equivs.main(equivs.java:40) make: *** [wrapped] Error 1 ...and the foo.java program gives: $ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo sun.jnu.encoding=ANSI_X3.4-1968 file.encoding=ANSI_X3.4-1968 default locale=en_US Thanks folks. Xueming Shen

Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Xueming Shen
Martin, don't trap people into using -Dfile.encoding, always treat it as a read-only property:-) I believe initializeEncoding(env) gets invoked before -Dxyz=abc overwrites the default one; besides, the "jnu encoding" was introduced in 6.0, so we no longer look at file.encoding since then, I believe y