Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

naoto . sato Sun, 19 Jul 2020 11:06:01 -0700

Hi Mark,

Thank you for your comments.


On 7/17/20 8:03 PM, Mark Davis ☕ wrote:

One option is to have a fast path that uses char functions, up to thepoint where you hit a high surrogate, then drop into the slowercodepoint path. That saves time for the high-runner cases.
On the other hand, if the times are good enough, you might not need thecomplication.

The implementation is dealing with bare byte arrays. Only methods thatit uses from Character class are toLowerCase(int) and toUpperCase(int)(sans surrogate check, which is needed at each iteration anyways), andtheir "char" equivalents are merely casting (char) to the int result. Soit might not be so beneficial to differentiate char and int paths.

Having said that, I found that there was an unnecessary surrogate check(always checks high *and* low surrogate on each iteration), so I revisedthe fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto


Mark
//////

On Fri, Jul 17, 2020 at 4:39 PM <naoto.s...@oracle.com<mailto:naoto.s...@oracle.com>> wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

    https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and regionMatchesCI()
    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before and after
    the change:

    before:

Benchmark Mode Cnt Score ErrorUnitsStringCompareToIgnoreCase.lower avgt 25 53.764 ? 2.811ns/opStringCompareToIgnoreCase.supLower avgt 25 24.211 ? 1.135ns/opStringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ? 1.344ns/opStringCompareToIgnoreCase.upperLower avgt 25 18.859 ? 1.499ns/op


    after:

Benchmark Mode Cnt Score ErrorUnitsStringCompareToIgnoreCase.lower avgt 25 58.354 ? 4.603ns/opStringCompareToIgnoreCase.supLower avgt 25 57.975 ? 5.672ns/opStringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ? 0.965ns/opStringCompareToIgnoreCase.upperLower avgt 25 17.744 ? 0.272ns/op


    Here, "sup" means all supplementary characters, BMP otherwise. "lower"
    means each character requires upper->lower casemap. "upperLower" means
    all characters are the same, except the last character which requires
    casemap.

    I think the result is reasonable, considering surrogates check are now
    mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
    it did not seem to benefit here. In fact, the performance degraded
    partly because I implemented the short cut, and possibly for the
    overhead of extra checks.

    Naoto

    On 7/15/20 9:00 AM, naoto.s...@oracle.com
    <mailto:naoto.s...@oracle.com> wrote:
     > Hello,
     >
     > Please review the fix to the following issues:
     >
     > https://bugs.openjdk.java.net/browse/JDK-8248655
     > https://bugs.openjdk.java.net/browse/JDK-8248434
     >
     > The proposed changeset and its CSR are located at:
     >
     > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
     > https://bugs.openjdk.java.net/browse/JDK-8248664
     >
     > A bug was filed against SimpleDateFormat (8248434) where
     > case-insensitive date format/parse failed in some of the new
    locales in
     > JDK15. The root cause was that case-insensitive
    String.regionMatches()
     > method did not work with supplementary characters. The problem is
    that
     > the method's spec does not expect case mappings of supplementary
     > characters, possibly because it was overlooked in the first
    place, JSR
     > 204 - "Unicode Supplementary Character support". Similar behavior is
     > observed in other two case-insensitive methods, i.e.,
     > compareToIgnoreCase() and equalsIgnoreCase().
     >
     > The fix is straightforward to compare strings by code point basis,
     > instead of code unit (16bit "char") basis. Technically this
    change will
     > introduce a backward incompatibility, but I believe it is an
     > incompatibility to wrong behavior, not true to the meaning of those
     > methods' expectations.
     >
     > Naoto

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

Reply via email to