By swap pop I mean that a deleted item is replaced by the last item of its group or chain. This does indeed move items.
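For illustration, here is a minimal sketch of swap-pop deletion; the entry and group types are made up for the example, not my actual code:

// Rough sketch of swap-pop deletion: the deleted slot is overwritten by the
// last live entry of the group, and the group shrinks by one. No tombstone
// is needed, but the moved entry changes position, which is what makes it
// awkward for Go's iteration rules.
type entry struct {
    key   string
    value int
}

type group struct {
    entries []entry
}

func (g *group) deleteAt(i int) {
    last := len(g.entries) - 1
    g.entries[i] = g.entries[last] // move the last item into the hole
    g.entries = g.entries[:last]   // pop the now-duplicated last item
}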
I have implemented a new version using 8-bit top hashes, tombstones, a perfect match, quadratic probing and an extensible hash table. It leaves items in place and might thus be compatible with the iteration requirements of Go maps. I plan to publish it on GitHub, but I still need to verify item deletion, on which I haven't spent much time so far.

The following benchmark uses tables of 256 groups. A table is split when it reaches 90% occupancy. The filling rate is then far better than with my 3-table architecture, so quadratic probing looks like the better choice.

goos: darwin
goarch: arm64
pkg: fastmap/map
cpu: Apple M2
                        │ altmap/stats_arm64.txt │     stdmap/stats_arm64.txt      │
                        │         sec/op         │    sec/op      vs base          │
Cache2Hit/_______1-8                6.386n ± 0%     7.081n ±  0%   +10.88% (p=0.000 n=10)
Cache2Hit/______10-8                8.813n ± 0%     8.114n ± 30%         ~ (p=0.481 n=10)
Cache2Hit/_____100-8                8.473n ± 7%    15.720n ±  8%   +85.52% (p=0.000 n=10)
Cache2Hit/____1000-8                8.735n ± 2%    13.715n ± 23%   +57.01% (p=0.000 n=10)
Cache2Hit/___10000-8                11.04n ± 1%     20.38n ±  6%   +84.56% (p=0.000 n=10)
Cache2Hit/__100000-8                13.19n ± 1%     17.62n ± 17%   +33.59% (p=0.000 n=10)
Cache2Hit/_1000000-8                52.02n ± 1%     55.45n ±  0%    +6.59% (p=0.001 n=10)
Cache2Hit/10000000-8                64.91n ± 1%     68.12n ±  3%    +4.96% (p=0.002 n=10)
Cache2Miss/_______1-8               5.003n ± 0%     8.243n ±  0%   +64.73% (p=0.000 n=10)
Cache2Miss/______10-8               5.878n ± 3%     9.238n ± 11%   +57.18% (p=0.000 n=10)
Cache2Miss/_____100-8               6.037n ± 4%    12.600n ±  9%  +108.70% (p=0.000 n=10)
Cache2Miss/____1000-8               6.992n ± 7%    12.015n ± 17%   +71.83% (p=0.000 n=10)
Cache2Miss/___10000-8               10.49n ± 2%     17.66n ±  9%   +68.40% (p=0.000 n=10)
Cache2Miss/__100000-8               20.73n ± 1%     26.56n ±  2%   +28.09% (p=0.000 n=10)
Cache2Miss/_1000000-8               33.52n ± 1%     36.83n ±  1%    +9.89% (p=0.000 n=10)
Cache2Miss/10000000-8               48.62n ± 1%     50.84n ±  1%    +4.56% (p=0.000 n=10)
geomean                             13.25n          18.38n         +38.75%

To be honest, I can't fully explain these numbers. Reducing false positives by using an exact match and 8-bit top hashes might be part of it. It could also be the hash function: xxh3 on arm64 is pure Go only; it is fast for short strings but bad for long strings as it doesn't use SIMD (yet). The 8-byte key strings are fmt.Sprintf("%7d ", i), and fmt.Sprintf("%7d-", i) for misses. I hope these numbers don't result from a bug in my code.

For the top hash I use the H1/H2 terminology of Swiss tables. H2 is the 8 least significant bits of the hash; 0x00 is used for free slots and 0x80 for the tombstone. Computing H2 thus implies a conditional branch to avoid the reserved values, but the CPU should predict its outcome correctly most of the time. I benchmarked 7-bit top hashes, but they ended up being slower; I assumed it was due to the increased number of false positives. Using 0x80 for the tombstone saves one operation when searching for free slots in a group. If the speedup is due to the 8-bit top hashes, I suspect that the C++ Swiss table might also benefit from it.

On amd64, the performance is on par with the Go map, but I don't use SIMD instructions.
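To make the H2 scheme concrete, here is a rough sketch of the idea; the group layout, the constants other than 0x00/0x80, and the helper names are illustrative only, not the actual implementation:

const (
    ctrlFree      = 0x00 // free slot
    ctrlTombstone = 0x80 // deleted slot (tombstone)
    groupSize     = 8
)

// h2 takes the 8 least significant bits of the hash and remaps the two
// reserved control values. This is the conditional branch mentioned above;
// it should be predicted correctly almost always since the reserved values
// are rare.
func h2(hash uint64) uint8 {
    b := uint8(hash)
    if b == ctrlFree || b == ctrlTombstone {
        b |= 1 // any remapping that avoids 0x00 and 0x80 works
    }
    return b
}

// probeGroup scans one group without SIMD: an exact (perfect) match on the
// 8-bit top hash, followed by a full key comparison, keeps false positives
// low. hasFree tells the caller whether probing can stop at this group;
// tombstones do not stop the probe, free slots do.
func probeGroup(ctrl *[groupSize]uint8, keys *[groupSize]string, key string, top uint8) (idx int, hasFree bool) {
    idx = -1
    for i := 0; i < groupSize; i++ {
        if ctrl[i] == top && keys[i] == key {
            return i, hasFree
        }
        if ctrl[i] == ctrlFree {
            hasFree = true
        }
    }
    return idx, hasFree
}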
On Monday, June 23, 2025 at 10:41:46 PM UTC+2, Keith Randall wrote:

> > As key comparison is very fast (no secondary memory access) a simple sequential key comparison is fast.
>
> We've looked at that, see CL 626699 <https://go-review.googlesource.com/c/go/+/626699>. Maybe helps in some cases, seems to hurt for miss lookups. We decided not to move forward with it, but it is definitely something to think about.
>
> > I use 8 bit top hashes instead of 7
>
> How do you record a presence bit?
>
> > We can use a swap pop strategy.
>
> Not sure exactly what that is, but it sounds incompatible with iteration. Our required iteration semantics mean it is very hard to move keys.
>
> > There is a small space usage penalty but when a table is split at 80% for instance, it becomes 40%. It is then very close to need a grow which I considered a waste of effort.
>
> I think when we split a table we use the max-sized table for each part? I'm not sure what effect you're seeing here then.
>
> > I also saw that function calls are expensive. I then used manual inlining for the Get method. It is nice to avoid for code readability and maintenance though.
>
> We could theoretically have the compiler do this for builtin maps.
>
> On Sunday, June 22, 2025 at 1:28:26 AM UTC-7 christoph...@gmail.com wrote:
>
>> Thank you very much for your response. I'll summarize here the main differences with go map.
>>
>> I very easily got faster than go map with integers and the like by dropping the swiss groups and reducing the group size to fit a cache line. As key comparison is very fast (no secondary memory access), a simple sequential key comparison is fast. The speed gain might be due to the hash function, as I rewrote the xxh3 hasher for uint64 values. I didn't spend much time exploring this further as it wasn't my use case.
>>
>> String keys definitely benefit from top hashes (swiss table groups). I use 8-bit top hashes instead of 7, which reduces the number of false positives, and I use a perfect match, which also reduces false positives on arm64. amd64 uses a perfect match as it uses the SIMD instructions.
>>
>> Another difference is that go map uses quadratic probing and my map doesn't. The popular wisdom is that probing is faster because the probability that the data is in cache is slightly higher and the table filling is better. Unfortunately this increases the number of required tests, where false positives increase the penalty. The numbers I showed are for a different kind of table. It is composed of three arrays T1, T2 and T3. T1 is a conventional table of 8-item swiss groups. When a group is full, the item is stored in the overflow hash table T2. When the group in T2 is full, the item is appended to T3, which is treated as a sequence of groups. I do a linear search in this table containing the overflows of overflows. I implemented a table using quadratic probing and it was faster than go map on arm64. I assume it is due to the use of 8-bit hashes reducing false positives. I'm not sure.
>>
>> I tested independent hashes for T1 and T2, which then requires the use of tombstones, and related hashes. By related hash I mean that the index in T2 is, for instance, the fibonacci hash of the index in T1. The best filling rates of T2 were obtained with a random permutation mapping. The size ratio between T1 and T2 should in theory be 2:1, but the best performance was with 16:1. I assumed that this is because T2 data is hotter (in cache) than with a bigger table. The benefit of such a deterministic relation between the indexes of T1 and T2 is that we don't need tombstones: we can use a swap pop strategy. In summary, the idea was to minimize the number of required probes to find a key.
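To make the related-hash idea above concrete: the T2 index is derived deterministically from the T1 index, for instance with a Fibonacci hash, so the overflow position of an item can always be recomputed and no tombstone is needed. A rough sketch, with illustrative sizes and names:

// Hypothetical sketch: deriving the overflow table index from the primary
// table index. Because the mapping is deterministic, deletion can use swap
// pop instead of tombstones: the overflow group of a given T1 group can
// always be found again.
const t2Bits = 8 // illustrative: 256 groups in T2

// fib64 is the usual 64-bit Fibonacci hashing constant (2^64 / golden ratio).
const fib64 = 0x9E3779B97F4A7C15

// t2Index maps a T1 group index to its T2 overflow group.
func t2Index(t1Index uint64) uint64 {
    return (t1Index * fib64) >> (64 - t2Bits)
}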
>> The table also uses an extensible hash, for which I have simpler code than go map. It won't go faster, though. But this directory indirection introduces a significant performance penalty. In a later benchmark, after the one I published here, I saw a significant performance increase by removing the directory. When the hash function is good, most tables have the same depth.
>>
>> Another optimization compared to go map is that my tables in the directory don't grow. Using fixed-size arrays gives some performance advantage, as boundaries and the hash mask are constants. There is a small space usage penalty, but when a table is split at 80% for instance, it becomes 40%. It is then very close to needing a grow, which I considered a waste of effort. For a small number of items fitting in one table, it may be worth using a growable table, which is what I didn't bother to implement.
>>
>> I also saw that function calls are expensive. I then used manual inlining for the Get method. It would be nice to avoid this for code readability and maintenance, though.
>>
>> I tried assembly on arm64, where I could experiment with SIMD instructions, but due to the assembly function call overhead it was slower than pure Go. Passing arguments by register wouldn't help. We would need assembly inlining, but I understand that this would open a can of worms and probably create more problems than it would solve.
>>
>> If you are still interested and want some specific numbers, I would be glad to give them.
>>
>> Here is the cache miss benchmark that shows the benefit of reducing the number of probes:
>>
>> goos: darwin
>> goarch: arm64
>> pkg: fastCache/map
>> cpu: Apple M2
>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>                         │          sec/op          │    sec/op     vs base      │
>> Cache2Miss/_______1-8                4.721n ± 0%     8.298n ±  0%   +75.75% (p=0.002 n=6)
>> Cache2Miss/______10-8                5.452n ± 0%     9.094n ± 25%   +66.81% (p=0.002 n=6)
>> Cache2Miss/_____100-8                5.726n ± 6%    11.550n ±  5%  +101.71% (p=0.002 n=6)
>> Cache2Miss/____1000-8                6.572n ± 4%    12.550n ± 21%   +90.96% (p=0.002 n=6)
>> Cache2Miss/___10000-8                9.428n ± 4%    18.400n ±  7%   +95.17% (p=0.002 n=6)
>> Cache2Miss/__100000-8                14.79n ± 1%     26.60n ±  2%   +79.79% (p=0.002 n=6)
>> Cache2Miss/_1000000-8                29.11n ± 1%     36.83n ±  2%   +26.54% (p=0.002 n=6)
>> Cache2Miss/10000000-8                40.44n ± 5%     51.12n ±  1%   +26.41% (p=0.002 n=6)
>> geomean                              10.60n          17.80n         +67.98%
>>
>> Here is the cache hit benchmark comparing dirstr12 with flat8, which has no directory:
>>
>> goos: darwin
>> goarch: arm64
>> pkg: fastCache/map
>> cpu: Apple M2
>>                         │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
>>                         │        sec/op         │   sec/op      vs base    │
>> Cache2Hit/_______1-8                6.047n ± 0%    6.246n ± 5%    +3.28% (p=0.002 n=6)
>> Cache2Hit/______10-8                8.172n ± 0%    8.499n ± 0%    +4.00% (p=0.002 n=6)
>> Cache2Hit/_____100-8                7.893n ± 0%    8.256n ± 5%    +4.61% (p=0.002 n=6)
>> Cache2Hit/____1000-8                7.585n ± 6%    8.228n ± 2%    +8.48% (p=0.002 n=6)
>> Cache2Hit/___10000-8                9.191n ± 0%   10.375n ± 1%   +12.88% (p=0.002 n=6)
>> Cache2Hit/__100000-8                10.03n ± 4%    12.14n ± 1%   +21.15% (p=0.002 n=6)
>> Cache2Hit/_1000000-8                40.79n ± 1%    42.27n ± 4%    +3.64% (p=0.002 n=6)
>> Cache2Hit/10000000-8                47.29n ± 1%    56.49n ± 9%   +19.47% (p=0.002 n=6)
>> geomean                             12.31n         13.47n         +9.48%
>>
>> I currently don't have access to an x86 computer. Thank you for your mail.
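As a rough illustration of the directory indirection and the fixed-size tables described above, here is a sketch of how the extensible-hash lookup could look; all names and sizes are hypothetical, not the actual code:

// Hypothetical sketch of the extensible-hash directory: the top bits of the
// hash select a fixed-size table; because every table has the same constant
// size, the in-table mask and bounds are compile-time constants.
const (
    tableGroups = 256             // fixed number of groups per table
    groupMask   = tableGroups - 1 // constant mask, no per-table size field needed
)

type table struct {
    // fixed-size arrays; a table never grows, it is split instead
    ctrl [tableGroups][8]uint8
    // keys/values elided for brevity
}

type dirMap struct {
    depth     uint     // number of hash bits used to index the directory
    directory []*table // len(directory) == 1 << depth; entries may share a table
}

// tableFor picks the table from the directory using the top bits of the
// hash, then the group inside the table using the lower bits.
func (m *dirMap) tableFor(hash uint64) (*table, uint64) {
    t := m.directory[hash>>(64-m.depth)]
    group := hash & groupMask
    return t, group
}

Because tableGroups is a constant, the mask and bounds checks can be folded by the compiler, which is the advantage of non-growing tables over growable ones.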
>>
>> On Sunday, June 22, 2025 at 12:12:25 AM UTC+2, Keith Randall wrote:
>>
>>> Certainly interesting. We'd be interested in incorporating some ideas from your experiments if they translate over.
>>>
>>> One thing to caution, you showed no numbers for space utilization, map miss speed, whether your map can do iteration, etc. It is possible to trade off one of these for other ones, so it is not immediately obvious that your techniques would apply to the builtin map. For instance, if you're using 10x the space to get 10% speedup, we probably wouldn't want to do that for the general map. (Not saying that's what you did, just pointing out raw speed is not the only consideration.)
>>>
>>> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7 christoph...@gmail.com wrote:
>>>
>>>> Hello,
>>>>
>>>> trying to implement a fast cache, I inadvertently entered a rabbit hole that led me to implement my own map. In the process I tried to make it faster than the go map, just to see if it is possible. I worked weeks on it, trying out various architectures and methods.
>>>>
>>>> On my MacBook Air M2, I get the following benchmarks for a Get operation. The numbers are the number of items inserted in the table. My keys are 8-byte long strings.
>>>>
>>>> goos: darwin
>>>> goarch: arm64
>>>> pkg: fastCache/map
>>>> cpu: Apple M2
>>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>>>                         │          sec/op          │    sec/op      vs base     │
>>>> Cache2Hit/_______1-8                6.151n ±  6%      7.087n ±  1%   +15.22% (p=0.002 n=6)
>>>> Cache2Hit/______10-8                8.491n ±  0%      8.156n ± 29%         ~ (p=0.394 n=6)
>>>> Cache2Hit/_____100-8                8.141n ±  7%     14.185n ± 13%   +74.24% (p=0.002 n=6)
>>>> Cache2Hit/____1000-8                8.252n ±  3%     10.635n ± 39%   +28.89% (p=0.002 n=6)
>>>> Cache2Hit/___10000-8                10.45n ±  2%      20.99n ±  4%  +100.81% (p=0.002 n=6)
>>>> Cache2Hit/__100000-8                12.16n ±  1%      19.11n ± 10%   +57.05% (p=0.002 n=6)
>>>> Cache2Hit/_1000000-8                42.28n ±  2%      47.90n ±  2%   +13.29% (p=0.002 n=6)
>>>> Cache2Hit/10000000-8                56.38n ± 12%      61.82n ±  6%         ~ (p=0.065 n=6)
>>>> geomean                             13.44n            17.86n         +32.91%
>>>>
>>>> On my amd64 i5 11th gen I get the following benchmarks.
>>>>
>>>> goos: linux
>>>> goarch: amd64
>>>> pkg: fastCache/map
>>>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>>>                          │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt │
>>>>                          │          sec/op          │    sec/op     vs base      │
>>>> Cache2Hit/_______1-12                9.207n ±  1%      7.506n ±  3%  -18.48% (p=0.002 n=6)
>>>> Cache2Hit/______10-12                9.223n ±  0%      8.806n ±  6%        ~ (p=0.058 n=6)
>>>> Cache2Hit/_____100-12                9.279n ±  2%     10.175n ±  3%   +9.66% (p=0.002 n=6)
>>>> Cache2Hit/____1000-12                10.45n ±  2%      11.29n ±  3%   +8.04% (p=0.002 n=6)
>>>> Cache2Hit/___10000-12                16.00n ±  2%      17.21n ±  5%   +7.59% (p=0.002 n=6)
>>>> Cache2Hit/__100000-12                22.20n ± 17%      24.73n ± 22%  +11.42% (p=0.026 n=6)
>>>> Cache2Hit/_1000000-12                87.75n ±  2%      91.05n ±  5%   +3.76% (p=0.009 n=6)
>>>> Cache2Hit/10000000-12                104.2n ±  2%      105.6n ±  5%        ~ (p=0.558 n=6)
>>>> geomean                              20.11n            20.49n         +1.90%
>>>>
>>>> On amd64 the performance is on par with go map, but go map uses inlined SIMD instructions, which I don't use because I don't have access to them in pure Go. I use xxh3 right out of the box for the hash function. The differences are not due to the hash function.
>>>>
>>>> If there is interest in this, please let me know.