On Fri, 2 Feb 2024 17:38:13 GMT, Joshua Cao <d...@openjdk.org> wrote:

>> This change mirrors what we did for ConcurrentHashMap in 
>> https://github.com/openjdk/jdk/pull/17116. When we add all entries from one 
>> map to another, we should resize that map to accommodate the sum of both 
>> maps' sizes.
>> 
>> I used the command below to run the benchmarks. I set a high heap to reduce 
>> garbage collection noise.
>> 
>> java -Xms25G -jar benchmarks.jar -p size=100000 -p addSize=100000 -gc true org.openjdk.bench.java.util.HashMapBench
>> 
>> 
>> Before change
>> 
>> 
>> Benchmark            (addSize)        (mapType)  (size)  Mode  Cnt   Score   Error  Units
>> HashMapBench.putAll     100000         HASH_MAP  100000  avgt    4  22.927 ± 3.170  ms/op
>> HashMapBench.putAll     100000  LINKED_HASH_MAP  100000  avgt    4  25.198 ± 2.189  ms/op
>> 
>> 
>> After change
>> 
>> 
>> Benchmark            (addSize)        (mapType)  (size)  Mode  Cnt   Score   Error  Units
>> HashMapBench.putAll     100000         HASH_MAP  100000  avgt    4  16.780 ± 0.526  ms/op
>> HashMapBench.putAll     100000  LINKED_HASH_MAP  100000  avgt    4  19.721 ± 0.349  ms/op
>> 
>> 
>> We see average time improvements of about 26% in HashMap and 20% in 
>> LinkedHashMap.
>> 
>> ---
>> 
>> In the worst case, we may have two maps with identical keys. In this case, 
>> we would aggressively resize when we do not need to. I'm also adding an 
>> additional `putAllSameKeys` benchmark.
>> 
>> Before change:
>> 
>> 
>> Benchmark                                       (addSize)        (mapType)  (size)  Mode  Cnt        Score   Error   Units
>> HashMapBench.putAllSameKeys                        100000         HASH_MAP  100000  avgt             6.956           ms/op
>> HashMapBench.putAllSameKeys:gc.alloc.rate          100000         HASH_MAP  100000  avgt          1091.383          MB/sec
>> HashMapBench.putAllSameKeys:gc.alloc.rate.norm     100000         HASH_MAP  100000  avgt       7968871.917            B/op
>> HashMapBench.putAllSameKeys:gc.count               100000         HASH_MAP  100000  avgt               ≈ 0          counts
>> HashMapBench.putAllSameKeys                        100000  LINKED_HASH_MAP  100000  avgt             8.417           ms/op
>> HashMapBench.putAllSameKeys:gc.alloc.rate          100000  LINKED_HASH_MAP  100000  avgt           992.543          MB/sec
>> HashMapBench.putAllSameKeys:gc.alloc.rate.norm     100000  LINKED_HASH_MAP  100000  avgt       8768892.941            B/op
>> HashMapBench.putAllSameKeys:gc.count               100000  LINKED_HASH_MAP  100000  avgt               ≈ 0          counts
>> 
>> 
>> Af...
>
> Joshua Cao has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   extract msize variable

I think we should step back from benchmarks a bit and examine the various 
tradeoffs occurring here. Certainly we can speed things up by pre-resizing the 
map to be larger. However, this can result in a map that is too large for the 
number of mappings it contains, in the case where this map and the argument map 
have keys in common. In other words, it might waste space. We really have 
little idea of how frequently this occurs. It's easy to imagine scenarios where 
there is no commonality (which this change will speed up) and also where there 
is 100% commonality (where this change will result in wasted space). What's the 
right tradeoff?
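
To put rough numbers on that tradeoff, here is a small sketch (my own 
illustration, not JDK code) of the power-of-two table capacities involved at 
the default load factor, using the 100,000-mapping sizes from the benchmark:

    // Sketch: capacityFor mirrors the effect of HashMap's internal
    // tableSizeFor -- the smallest power-of-two table that can hold n
    // mappings at the default load factor of 0.75 without resizing.
    public class ResizeTradeoff {
        static int capacityFor(int n) {
            int cap = 1;
            while (cap * 0.75 < n) cap <<= 1;
            return cap;
        }

        public static void main(String[] args) {
            // Disjoint keys: the merged map really holds 200,000
            // mappings, so both policies converge on the same table;
            // eager pre-resizing only saves intermediate resize passes.
            System.out.println(capacityFor(200_000)); // 524288

            // Identical keys: the merged map still holds only 100,000
            // mappings. The conservative policy keeps a 262,144-slot
            // table; sizing for size + m.size() doubles it to 524,288
            // slots -- 262,144 wasted slots, roughly 1-2 MB of table
            // references depending on reference size.
            System.out.println(capacityFor(100_000)); // 262144
        }
    }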

I've linked some older bugs to the bug report for some historical perspective. 
In particular, see [this 
comment](https://bugs.openjdk.org/browse/JDK-4710319?focusedId=12240692&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-12240692)
 from Josh Bloch on JDK-4710319:

> The conservatism of the resizing heuristic for putAll is intentional. It can 
> cause at most one extra resizing, and can result in substantial footprint 
> savings. This too should be documented in the code.

The comment he added to putAll() for that change is still visible in an older 
revision 
[here](https://github.com/openjdk/jdk/blob/e1b3c5b5ba5cfb8243d760e99887bbe1015a9d69/jdk/src/share/classes/java/util/HashMap.java#L1271),
 but unfortunately it was removed by a later refactoring. The comment is:


        /*
         * Expand the map if the map if the number of mappings to be added
         * is greater than or equal to threshold.  This is conservative; the
         * obvious condition is (m.size() + size) >= threshold, but this
         * condition could result in a map with twice the appropriate capacity,
         * if the keys to be added overlap with the keys already in this map.
         * By using the conservative calculation, we subject ourself
         * to at most one extra resize.
         */


Note that this comment addresses fairly directly the essence of the change 
being discussed here. I think it's still applicable; the current code in 
putMapEntries() compares `m.size()` to `threshold` in deciding whether to 
resize immediately. We're not constrained by a 20-year-old comment, though. We 
can change the resizing policy if we have good reason to do so. 
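 

To make the comparison concrete, here is a schematic fragment -- not the 
actual JDK code, which loops on resize() and handles the empty-table case 
separately, though `threshold`, `size`, and `resize()` are HashMap's real 
internal names:

    void putMapEntries(Map<? extends K, ? extends V> m) {
        // Conservative (current): pre-resize only when the incoming map
        // by itself exceeds the threshold. Per the comment above, this
        // costs at most one extra resize inside the insertion loop.
        if (m.size() > threshold)
            resize();

        // Eager (as proposed): pre-resize for the combined size. Exact
        // for disjoint key sets, but up to twice the needed capacity
        // when the key sets overlap completely.
        //     if (m.size() + size > threshold)
        //         resize();

        for (Map.Entry<? extends K, ? extends V> e : m.entrySet())
            put(e.getKey(), e.getValue());
    }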

However, I think the concern about space wastage is still relevant, even after 
20 years of increased memory capacity and decreased memory cost. Memory is 
cheap but not free. Larger memory consumption has a real cost, as shown by 
current cloud pricing.

From the perspective of users of the current API, most collections grow 
automatically but never shrink. If a HashMap's capacity ends up being too 
large, there's no way to shrink it (aside from copying all the mappings into a 
new, smaller HashMap). However, most collections -- including HashMap -- offer 
the ability to pre-size a map at construction time. Thus, an application that 
wanted to merge several disjoint maps could pre-size the destination 
appropriately in order to avoid resizing overhead, as in the sketch below.
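
For instance (a sketch only; `mergeDisjoint` is a hypothetical helper, and 
`HashMap.newHashMap` is the capacity-based factory added in JDK 19):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: merge two maps assumed to have disjoint keys, pre-sizing
    // the destination so no resize occurs during the puts. newHashMap(n)
    // accounts for the default load factor; on older releases the
    // classic idiom is new HashMap<>((int) Math.ceil(n / 0.75)).
    static <K, V> Map<K, V> mergeDisjoint(Map<K, V> left, Map<K, V> right) {
        Map<K, V> merged = HashMap.newHashMap(left.size() + right.size());
        merged.putAll(left);
        merged.putAll(right);
        return merged;
    }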

To me, this indicates that we probably don't want to change the size policy to 
make things bigger while incurring the risk of wasting more space, even if some 
operations get faster. Meanwhile, applications that are suffering from excess 
resizing overhead have at least some means of adjusting the initial capacity to 
avoid resizing. If there are use cases that indicate otherwise, or that 
indicate a need for additional controls (perhaps like `trimToSize` or 
`ensureCapacity`), then maybe we could discuss them. I don't think we should 
proceed with this policy change in isolation, though, merely because it makes 
some cases go faster.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17544#issuecomment-1932858114
