Re: [PR] OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently [jackrabbit-oak]

via GitHub Thu, 01 Aug 2024 23:55:02 -0700


nfsantos commented on PR #1616:
URL: https://github.com/apache/jackrabbit-oak/pull/1616#issuecomment-2264692270


   > (Same as before), this could cause out-of-memory.
   > 
   > What about, to avoid possible OOME, and I think it would be easier to 
understand:
   > 
   > * Instead of a hash map, use an array (fixed array of eg. 16 K entries)
   > * Calculate the index from the hash (using bitwise and)
   > * Store the entry in the array if it doesn't match
   > * Use the entry in the array if it does
   > 
   > That would be even faster I'm pretty sure.
   > 
   > The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 
MB).
   > 
   > And we could store all the (small) path elements, and no longer have to 
worry about "startsWith" etc.
   
   I believe this strategy is already implemented in the `oak-store-document` 
module, here:
   
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/StringCache.java
   
   I think in this particular case that strategy would not work as well. The 
frequency of Strings of the path segments is extremely skewed, with a few words 
appearing in the majority of paths. These are the words of the top level paths 
(for instance, `/content/dam/<companyname>`) or the paths created by JCR which 
contain segments like `jcr:content`. To get the most of interning, these 
Strings must all be cached.
   
   The algorithm you suggest does not guarantee that these very common Strings 
will all be cached. For instance, in an extreme case where all common Strings 
hash into the same slot in the array, they would all be competing with each 
other. And another issue is that if we remove the guard that limits the Strings 
that can be cached, then the large number of infrequent Strings that appear on 
paths would displace from the table the very common Strings.
   
   I have bounded the HashMap to a max of 1024 entries, to avoid OOME. It could 
still happen if those 1024 Strings are huge, but if that was the case, Oak 
would run out of memory much before, because there are many other parts of the 
code that assume it's possible to store in memory more than 1024 full paths.
   
   In the tests that I ran against two medium-sized Flat File Stores taken from 
real OAK repositories, when scanning 10 million paths and storing in memory an 
array with all SortKeys, at the end there were around 50 Strings in the 
HashMap. And interning these Strings reduced the size of the heap at the end of 
the scanning by around 800MB. Interestingly, sorting the array of 10 million 
SortKeys was faster with the interned Strings, it took around 3s instead of 5s. 
I'm assuming that it is because `String.equals()` first checks if the objects 
compared are the same (`==`) before comparing the value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently [jackrabbit-oak]

Reply via email to