Re: RFR: 8363972: Loose matching of dash/minusSign in number parsing [v6]

Justin Lu Fri, 01 Aug 2025 15:45:28 -0700

On Fri, 1 Aug 2025 19:05:27 GMT, Naoto Sato <na...@openjdk.org> wrote:


>> Enabling lenient minus sign matching when parsing numbers. In some locales, 
>> e.g. Finnish, the default minus sign is the Unicode "Minus Sign" (U+2212), 
>> which is not the "Hyphen Minus" (U+002D) that users type in from keyboard. 
>> Thus the parsing of user input numbers may fail. This change utilizes CLDR's 
>> `parseLenient` element for minus signs and loosely matches them with the 
>> hyphen-minus so that user input numbers can parse. As this is a behavioral 
>> change, a corresponding CSR has been drafted.
>
> Naoto Sato has updated the pull request with a new target base due to a merge 
> or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>  - Spec update
>  - Supplementary/CanonEq tests
>  - flipped again, which was correct
>  - flipped the size check
>  - Address review comments
>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>  - tidying up
>  - test location
>  - spec update
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/8e921aee...3682484d

src/java.base/share/classes/java/text/DecimalFormat.java line 3512:

> 3510: 
> 3511:     /**
> 3512:      * {@return if the text matches the affix} In lenient mode, lenient

It's more clear to state that lenient minus signs in the affix match to lenient 
minus signs in the text. (Not just match to hyphen-minus.)

src/java.base/share/classes/java/text/DecimalFormat.java line 3543:

> 3541:             var lmsp = Pattern.compile("[" + lms + "]", 
> Pattern.CANON_EQ);
> 3542:             var a = lmsp.matcher(affix).replaceAll("-");
> 3543:             var t = lmsp.matcher(text.substring(position, 
> Math.min(tlen, position + alen)))

I don't think this works. If you see my other comment regarding the test case 
and swap it such that `lenientMinusSign` is "-\u00E4" and the pattern (affix) 
includes "a\u0308" and we parse some text such as "\u00E41.5".

Then we have the following,


var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ); // Returns 
pattern [-ä]
var a = lmsp.matcher(affix).replaceAll("-"); // Gets correctly normalized to "-"
// Incorrectly matches against "ä1" and normalized to "-1" since the substring 
returned is indexed from 0 to 2.
var t = lmsp.matcher(text.substring(position, Math.min(tlen, position + alen)))


That is, I think we need to substring after we have performed the 
normalization. Something such as,


var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ);
var a = lmsp.matcher(affix).replaceAll("-");
var t = lmsp.matcher(text).replaceAll("-");
return t.startsWith(a, position);


However, we will still run into issues later in DecimalFormat, as the position 
is incremented by the original prefix length.


        } else if (gotNegative) {
            position += negativePrefix.length();


So for "ä1.5" we would start parsing at position = 2, erroneously returning 
"0.5". So further thought may be needed.

test/jdk/java/text/Format/NumberFormat/LenientMinusSignTest.java line 128:

> 126:             .set(dfs, "-\u00E5");
> 127:         var df = new DecimalFormat("#.#;a\u0308#.#", dfs);
> 128:         assertEquals(df.parse("a\u03081.5"), -1.5);

This one confuses me, should it not be parsing "\u00E51.5" such that the test 
can check if canonical equivalence occurs when matching the text "\u00E5" to 
the affix "a\u0308"? Otherwise, we are just parsing the exact pattern we set? 
Also I think it should be "\u00E4", not "\u00E5".

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2248872126
PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2248949783
PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2248892644

Re: RFR: 8363972: Loose matching of dash/minusSign in number parsing [v6]

Reply via email to