-------- Original Message --------
Message-ID:     <517ff8f8.3080...@oracle.com>
Date:   Tue, 30 Apr 2013 10:01:44 -0700
From:   Xueming Shen <xueming.s...@oracle.com>
User-Agent:     Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) 
Gecko/20110414 Thunderbird/3.1.10
MIME-Version:   1.0
To:     core-libs-dev core-libs-dev <core-libs-...@openjdk.java.net>
Subject:        RFR JDK-8013254: Constructor \w need update to add the support 
of \p{Join_Control}
Content-Type:   text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding:      7bit



Hi,

It appears we dropped the ball on U+200C and U+200D when we updated
the "simple word boundaries" back in JDK 7 [1]. You can find most of the
related discussion here [2]. These two code points were listed among the
issues we were trying to fix, but the final doc and implementation
don't address them, mainly because \p{Join_Control} was not explicitly
listed in the "compatibility" section of the TR#18 version current at the
time [3], though the two code points are explicitly mentioned in section
RL1.4 Simple Word Boundaries [4]. \p{Join_Control} (U+200C and U+200D) has
since been added to the "compatibility" section in the latest version of
TR#18 [5].

The proposed change here is to
(1) add these two code points back to the collection of \w
(2) list them explicitly in the \w definition as \p{Join_Control}
(3) list Join_Control as one of the supported binary properties.

http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000381.html

The webrev for RegExTest.java above includes the change for 8013252,
which is also under review; I'm not separating them out, just for
convenience. The regression/unit tests may not be that "direct", so here
is a direct version to verify the fix.

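        // Needs java.util.regex.Matcher and java.util.regex.Pattern.
        Matcher wordU = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                               .matcher("");
        // With the fix applied, both lines should print "true".
        System.out.println(wordU.reset("\u200c").find());
        System.out.println(wordU.reset("\u200d").find());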
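
For anyone who wants to exercise item (3) as well, here is a minimal
self-contained sketch. The class name JoinControlCheck and the helper
find() are made up for illustration and are not part of the webrev. It
assumes a JDK that already contains this change, and it uses Java's "Is"
prefix for binary properties.

```java
import java.util.regex.Pattern;

public class JoinControlCheck {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    public static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String[] joiners = { "\u200c", "\u200d" }; // ZWNJ, ZWJ
        for (String s : joiners) {
            // Item (3): Join_Control as a supported binary property.
            System.out.println(find("\\p{IsJoin_Control}", 0, s));
            // Items (1)/(2): Unicode-aware \w should include the joiners.
            System.out.println(find("\\w", Pattern.UNICODE_CHARACTER_CLASS, s));
            // Default \w stays ASCII-only ([a-zA-Z_0-9]) and is unaffected.
            System.out.println(find("\\w", 0, s));
        }
    }
}
```

With the change in place, the first two checks for each code point should
print true and the third should print false, since the default \w is not
touched by this fix.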

thanks
-Sherman

[1] http://ccc.us.oracle.com/7039066
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000381.html
[3] http://www.unicode.org/reports/tr18/tr18-13.html#Compatibility_Properties
[4] http://www.unicode.org/reports/tr18/tr18-13.html#Simple_Word_Boundaries
[5] http://www.unicode.org/reports/tr18/#Compatibility_Properties
