On 5/5/20 9:41 PM, Martin Buchholz wrote:
Hi Peter,

Are you saying guava has a tiny bug?


If it was just 1 too much when expected size is a multiple of 3 then that would not be a bug, just sub-optimal calculation. And the same calculation is performed also in JDK when a copy constructor is called for example.


But I investigated further and what I found could be considered a bug. Sometimes, the following expression:


(int) ((float) expectedSize / 0.75f + 1.0f)


...calculates a value that is not enough (due to floating point arithmetic and conversion to int) to store the expectedSize elements into the HashMap without re-hashing.


What HashMap does with initialCapacity parameter is to round it up to nearest power of 2:

    static int tableSizeFor(int cap) {
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

then it uses this as the initial backing table size. From that table size it calculates the threshold value:

    static int threshold(int cap) {
        float ft = (float) cap * 0.75f;
        return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
                (int) ft : Integer.MAX_VALUE);
    }

... and uses it as the max. number of elements that a HashMap can hold before it is re-hashed. So I did the following test (comparing the effectiveness of above formula with alternative (expectedSize*4+2)/3 formula):


public class HMTest {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    static int threshold(int cap) {
        float ft = (float) cap * 0.75f;
        return (cap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
                (int) ft : Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        for (int expectedSize = 0; expectedSize < (Integer.MAX_VALUE - 2) / 4; expectedSize++) {
            int cap1 = (int) ((float) expectedSize / 0.75f + 1.0f);
            int cap2 = (expectedSize * 4 + 2) / 3;
            int ts1 = tableSizeFor(cap1);
            int ts2 = tableSizeFor(cap2);
            int th1 = threshold(ts1);
            int th2 = threshold(ts2);

            if (th1 < expectedSize || th2 < expectedSize) {
                System.out.printf("%d: (%d, %d, %d)%s (%d, %d, %d)%s\n",
                        expectedSize,
                        cap1, ts1, th1, (th1 < expectedSize) ? "!" : " ",
                        cap2, ts2, th2, (th2 < expectedSize) ? "!" : " "
                );
            }
        }
    }
}


And what this prints is the following:


25165825: (33554432, 33554432, 25165824)! (33554434, 67108864, 50331648)
50331649: (67108864, 67108864, 50331648)! (67108866, 134217728, 100663296)
50331650: (67108864, 67108864, 50331648)! (67108867, 134217728, 100663296)
100663297: (134217728, 134217728, 100663296)! (134217730, 268435456, 201326592) 100663298: (134217728, 134217728, 100663296)! (134217731, 268435456, 201326592) 100663299: (134217728, 134217728, 100663296)! (134217732, 268435456, 201326592) 100663300: (134217728, 134217728, 100663296)! (134217734, 268435456, 201326592) 201326593: (268435456, 268435456, 201326592)! (268435458, 536870912, 402653184) 201326594: (268435456, 268435456, 201326592)! (268435459, 536870912, 402653184) 201326595: (268435456, 268435456, 201326592)! (268435460, 536870912, 402653184) 201326596: (268435456, 268435456, 201326592)! (268435462, 536870912, 402653184) 201326597: (268435456, 268435456, 201326592)! (268435463, 536870912, 402653184) 201326598: (268435456, 268435456, 201326592)! (268435464, 536870912, 402653184) 201326599: (268435456, 268435456, 201326592)! (268435466, 536870912, 402653184) 201326600: (268435456, 268435456, 201326592)! (268435467, 536870912, 402653184) 402653185: (536870912, 536870912, 402653184)! (536870914, 1073741824, 2147483647) 402653186: (536870912, 536870912, 402653184)! (536870915, 1073741824, 2147483647) 402653187: (536870912, 536870912, 402653184)! (536870916, 1073741824, 2147483647) 402653188: (536870912, 536870912, 402653184)! (536870918, 1073741824, 2147483647) 402653189: (536870912, 536870912, 402653184)! (536870919, 1073741824, 2147483647) 402653190: (536870912, 536870912, 402653184)! (536870920, 1073741824, 2147483647) 402653191: (536870912, 536870912, 402653184)! (536870922, 1073741824, 2147483647) 402653192: (536870912, 536870912, 402653184)! (536870923, 1073741824, 2147483647) 402653193: (536870912, 536870912, 402653184)! (536870924, 1073741824, 2147483647) 402653194: (536870912, 536870912, 402653184)! (536870926, 1073741824, 2147483647) 402653195: (536870912, 536870912, 402653184)! (536870927, 1073741824, 2147483647) 402653196: (536870912, 536870912, 402653184)! (536870928, 1073741824, 2147483647) 402653197: (536870912, 536870912, 402653184)! (536870930, 1073741824, 2147483647) 402653198: (536870912, 536870912, 402653184)! (536870931, 1073741824, 2147483647) 402653199: (536870912, 536870912, 402653184)! (536870932, 1073741824, 2147483647) 402653200: (536870912, 536870912, 402653184)! (536870934, 1073741824, 2147483647)


So as you see, for expectedSize < (Integer.MAX_VALUE - 2) / 4 (where the alternative formula does not experience overflow and is enough for Naoto's case) all miscalculations are due to the JDK/Guava formula which in those cases calculates a value that is less than alternative formula's value and too small to adequately pre-size the HashMap table.


Voila, we have some bugs to fix or I am doing something wrong here.


Regards, Peter



On Tue, May 5, 2020 at 12:12 PM Peter Levart <peter.lev...@gmail.com <mailto:peter.lev...@gmail.com>> wrote:

    Hi Martin,

    On 5/5/20 8:26 PM, Martin Buchholz wrote:
    See related:
    
https://guava.dev/releases/23.0/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize-int-


    This is basically the same calculation (or at least gives same
    result) as Naoto did (without the max part):

    Naoto: (int)(expectedSize / 0.75f) + 1

    Guava: (int) ((float) expectedSize / 0.75F + 1.0F)

    but in case expectedSize is a multiple of 3, it gives the result
    which is 1 more than needed. If what is needed is also a power of
    2, then twice the needed space is allocated in the HashMap backing
    table.


    Regards, Peter



    On Tue, May 5, 2020 at 11:03 AM <naoto.s...@oracle.com
    <mailto:naoto.s...@oracle.com>> wrote:

        And here is the fix. Please review.

        http://cr.openjdk.java.net/~naoto/8244459/webrev.00/

        Naoto

        On 5/5/20 10:25 AM, naoto.s...@oracle.com
        <mailto:naoto.s...@oracle.com> wrote:
        > Hi Peter,
        >
        > You are correct. Thanks. I'll remove that initial value of 16.
        >
        > Naoto
        >
        > On 5/5/20 9:37 AM, Peter Levart wrote:
        >> Hi Naoto,
        >>
        >> On 4/30/20 12:18 AM, naoto.s...@oracle.com
        <mailto:naoto.s...@oracle.com> wrote:
        >>> Hello,
        >>>
        >>> Please review this small fix to the following issue:
        >>>
        >>> https://bugs.openjdk.java.net/browse/JDK-8244152
        >>>
        >>> The proposed changeset is located at:
        >>>
        >>> https://cr.openjdk.java.net/~naoto/8244152/webrev.00/
        >>>
        >>> The hash map used there didn't have initial capacity,
        even though the
        >>> exact numbers are known.
        >>
        >>
        >> Well, it has to be calculated 1st (countTokens), but I
        guess this pays
        >> off when HashSet (the backing HashMap) does not have to be
        rehashed then.
        >>
        >> The expression you use:
        >>
        >>      Math.max((int)(tokens.countTokens() / 0.75f) + 1, 16)
        >>
        >> ...has a minimum value of 16. Why is that? 16 is just
        HashMap's
        >> default initialCapacity if not specified explicitly. But
        if you only
        >> want to store say 1 entry in the map, you can specify 2 as
        >> initialCapacity and HashMap will happily work for such
        case without
        >> resizing.
        >>
        >>
        >> So you could just use:
        >>
        >>      (int)(tokens.countTokens() / 0.75f) + 1
        >>
        >> And even this expression is sometimes overshooting the
        minimal
        >> required value by 1 (when # of tokens is "exact" multiple
        of 0.75f,
        >> say 6). I think the following could be used to optimally
        pre-size the
        >> HashMap with default load factor 0.75:
        >>
        >>      (tokens.countTokens() * 4 + 2) / 3
        >>
        >>
        >> Regards, Peter
        >>
        >>>
        >>> Naoto

Reply via email to