On Tue, 25 Nov 2025 20:21:26 GMT, Xueming Shen <[email protected]> wrote:
>> ### Summary
>>
>> Case folding is a key operation for case-insensitive matching (e.g., string
>> equality, regex matching), where the goal is to eliminate case distinctions
>> without applying locale or language specific conversions.
>>
>> Currently, the JDK does not expose a direct API for Unicode-compliant case
>> folding. Developers instead rely on methods such as:
>>
>> **String.equalsIgnoreCase(String)**
>>
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per
>> code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>>
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>>
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>>
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>>
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural
>> case-insensitive matching.
>> - Requires full string conversion before comparison, which is less efficient.
>>
>> **1:M mapping example, U+00DF (ß)**
>>
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>>
>>
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>>
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>>
>>
>> ### Motivation & Direction
>>
>> Add Unicode standard-compliant caseless comparison methods to the String
>> class, enabling reliable and efficient Unicode-compliant case-insensitive
>> matching.
>>
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming
>> languages/libraries.
>>
>> This PR proposes to introduce the following comparison methods in the
>> `String` class:
>>
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>>
>> These methods are intended to be the preferred choice when Unicode-compliant
>> case-less matching is required.
>>
>> *Note: An early draft also proposed a String.toCaseFold() method returning a
>> new case-folded string.
>> However, during review this was considered error-prone, as the resulting
>> string could easily be mistaken for a general transformation like
>> toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional
> commit since the last revision:
>
> minor doc formatting update
make/jdk/src/classes/build/tools/generatecharacter/GenerateCaseFolding.java line 79:
> 77: // hack, hack, hack! the logic does not pick 0131. just add manually to support 'I's.
> 78: // 0049; T; 0131; # LATIN CAPITAL LETTER I
> 79: final String T_0x0131_0x49 = String.format(" entry(0x%04x, 0x%04x),\n", 0x0131, 0x49);
The 'T' status reads (in CaseFolding.txt):
# T: special case for uppercase I and dotted uppercase I
# - For non-Turkic languages, this mapping is normally not used.
# - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
# Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
Since this case-folding feature is locale-independent, should the `T` status
be ignored? If we do apply the `T` case folding, it would be helpful to say so
in the spec.
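
For reference, here is a minimal sketch of how the Turkic tailoring differs from the locale-independent behaviour. It uses only existing APIs, not the methods proposed in this PR, and the class and helper names (`TurkicCaseDemo`, `lowerRoot`, `lowerTurkish`) are made up for illustration:

```java
import java.util.Locale;

public class TurkicCaseDemo {
    // Locale-independent lowercasing: 'I' (U+0049) maps to 'i' (U+0069).
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish-tailored lowercasing: 'I' maps to dotless 'ı' (U+0131),
    // which is the pairing the 'T' entry in CaseFolding.txt describes.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("I").equals("i"));            // true
        System.out.println(lowerTurkish("I").equals("\u0131"));    // true
        System.out.println(lowerTurkish("\u0130").equals("i"));    // true

        // The simple (locale-independent) uppercase mapping of U+0131 is 'I'.
        System.out.println(Character.toUpperCase('\u0131') == 'I'); // true
    }
}
```

The last line shows that U+0131 already relates to 'I' through the simple uppercase mapping, independent of locale. Whether that is the intent behind the manual entry is for the author to confirm.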
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2579019726