Hello,
I originally reported this on the bug tracker, but was asked to post the topic 
to this mailing list first. I was told that the bug report will be created 
afterwards.

The internal method `java.lang.ConditionalSpecialCasing#lookUpTable` is used 
for special case conversion rules, and is called when either the specified 
locale has special casing rules (e.g. Turkish) or the string to convert 
contains characters with special casing rules, for example U+0130 (Latin 
Capital Letter I with Dot Above). The problem with this method is that it 
creates temporary objects. Given that the method is in the worst case called 
for every character (possibly even twice per character), this can cause a lot 
of temporary memory allocation for large strings.

Below is the original bug report description (slightly modified), with a 
proposal for how it can (at least in part) be implemented without allocating 
any temporary objects; feedback is appreciated. I am not a JDK member and 
therefore cannot submit a pull request for this.

Kind regards

--------------------------

There are two issues with the method `lookUpTable` of the internal class 
java.lang.ConditionalSpecialCasing which is used for special case conversion:
- It uses the int codepoint as key for a Map<Integer, ...> to look up the case 
conversion; therefore this wraps the int as an Integer
- The special case conversion entries are stored in a HashSet<Entry>
  - First of all, usage of a Set seems redundant because Entry does not even 
override `equals`, and it looks like only distinct Entry instances are ever 
added to the Set
  - Usage of a Set means a new Iterator object is created whenever case 
conversion entries are found for a code point

It looks like both of these can be fixed, for example in the following way:
1. Remove ConditionalSpecialCasing.Entry.ch (and the corresponding getter)
2. Remove the static field ConditionalSpecialCasing.entry
3. For every existing entry add a static final field `entry<codepoint>` storing 
an Entry[] (<codepoint> being a placeholder for the codepoint hex string)
4. In ConditionalSpecialCasing.lookUpTable use a `switch` to access the 
corresponding `entry...`

Here is a short example snippet showing that:
```
private static final Entry[] entry0069 = {
    new Entry(new char[]{0x0069}, new char[]{0x0130}, "tr", 0), // # LATIN SMALL LETTER I
    new Entry(new char[]{0x0069}, new char[]{0x0130}, "az", 0) // # LATIN SMALL LETTER I
};
...

private static char[] lookUpTable(String src, int index, Locale locale, boolean 
bLowerCasing) {
    Entry[] entries = switch (src.codePointAt(index)) {
        case 0x0069 -> entry0069;
        ...
        default -> null;
    };
    char[] ret = null;

    if (entries != null) {
        String currentLang = locale.getLanguage();
        for (Entry entry : entries) {
            String conditionLang = entry.getLanguage();
            ...
        }
    }

    return ret;
}
```


Note: `java.lang.ConditionalSpecialCasing.isFinalCased` is also quite 
problematic because it creates a new StringCharacterIterator and a 
RuleBasedBreakIterator for each call.
Unfortunately I don't know of an easy way to avoid this; it would be great if 
you could investigate solving this nonetheless, in the worst case with 
ThreadLocal or similar.
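
As a rough idea of what the ThreadLocal approach could look like, here is a minimal sketch. The class `CachedBreakIteratorSketch` and the method `nextWordBoundary` are hypothetical, and the sketch assumes a single fixed locale (`Locale.ROOT`), whereas `isFinalCased` receives the locale as a parameter. It only avoids creating a new `BreakIterator` per call; `setText(String)` may still wrap the string in a new `CharacterIterator` internally.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch: one word BreakIterator per thread instead of one per call
final class CachedBreakIteratorSketch {
    // BreakIterator instances are not thread-safe, hence one instance per thread.
    // Assumption: a fixed locale; a real solution would have to handle other locales.
    private static final ThreadLocal<BreakIterator> WORD_ITERATOR =
        ThreadLocal.withInitial(() -> BreakIterator.getWordInstance(Locale.ROOT));

    /** Returns the first word boundary after index, reusing this thread's iterator. */
    static int nextWordBoundary(String src, int index) {
        BreakIterator it = WORD_ITERATOR.get();
        // Rebinds the cached iterator to the new text; this may still allocate
        // a CharacterIterator wrapper for src
        it.setText(src);
        return it.following(index);
    }
}
```

One drawback of this design is that the cached iterator keeps a reference to the last string it was given until the next call on the same thread.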

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Profile the object allocations of the `toLowerCase` calls of the following code 
snippets, for example with VisualVM:

1. Snippet:
```
String s = "\u0130".repeat(1000);
s.toLowerCase(Locale.ROOT);
```

2. Snippet:
```
String s = "\u03A3".repeat(1000);
s.toLowerCase(Locale.ROOT);
```


ACTUAL -
1. Snippet:
2000 Integer objects created
2000 HashMap$KeyIterator objects created

2. Snippet:
1000 Integer objects created
1000 HashMap$KeyIterator objects created
1000 StringCharacterIterator objects created
1000 RuleBasedBreakIterator objects created
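
If no profiler is at hand, the allocation volume can also be checked roughly from code. The following is only a sketch under assumptions: `AllocationProbe` and `allocatedBytesForToLowerCase` are hypothetical names, and it relies on the HotSpot-specific `com.sun.management.ThreadMXBean`, which may be unavailable on other JVMs. It reports bytes, not object counts, and the number includes the result string itself.

```java
import java.lang.management.ManagementFactory;
import java.util.Locale;

// Hypothetical helper for a rough allocation measurement
final class AllocationProbe {
    /** Bytes allocated by the current thread during toLowerCase, or -1 if unsupported. */
    static long allocatedBytesForToLowerCase(String s) {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!(bean instanceof com.sun.management.ThreadMXBean)) {
            return -1; // not a JVM exposing the com.sun.management extension
        }
        com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) bean;
        if (!hotspot.isThreadAllocatedMemorySupported()
                || !hotspot.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long id = Thread.currentThread().getId();
        long before = hotspot.getThreadAllocatedBytes(id);
        String lower = s.toLowerCase(Locale.ROOT);
        long after = hotspot.getThreadAllocatedBytes(id);
        // Use the result so the conversion cannot be optimized away
        if (lower.length() < s.length()) {
            throw new IllegalStateException("unexpected result length");
        }
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println(allocatedBytesForToLowerCase("\u0130".repeat(1000)));
        System.out.println(allocatedBytesForToLowerCase("\u03A3".repeat(1000)));
    }
}
```

Comparing the numbers against a string of the same length without special casing characters gives an impression of the overhead.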
