Thank you very much for your response. I'll summarize here the main 
differences with the Go map. 

I very easily got faster than the Go map with integer keys and the like by 
dropping the Swiss groups and reducing the group size to fit a cache line. 
As key comparison is very fast (no secondary memory access), a simple 
sequential key comparison works well. The speed gain might also be due to 
the hash function, as I rewrote the xxh3 hasher for uint64 values. I didn't 
spend much time exploring this further as it wasn't my use case. 
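To illustrate the idea, here is a much simplified sketch of such a cache-line-sized group with inline uint64 keys (illustrative names and layout, not my actual code):

```go
package main

import "fmt"

const groupSize = 8 // 8 uint64 keys = 64 bytes = one cache line

// group holds the keys inline; a zero key marks an empty slot in this sketch.
type group struct {
	keys [groupSize]uint64
	vals [groupSize]int
}

// lookup scans the group sequentially; with inline uint64 keys the
// comparison needs no secondary memory access, so a plain loop is fast
// and there is no need for top-hash tags.
func (g *group) lookup(key uint64) (int, bool) {
	for i := 0; i < groupSize; i++ {
		if g.keys[i] == key {
			return g.vals[i], true
		}
	}
	return 0, false
}

func main() {
	g := &group{}
	g.keys[3], g.vals[3] = 42, 7
	v, ok := g.lookup(42)
	fmt.Println(v, ok) // 7 true
}
```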

String keys definitely benefit from top hashes (Swiss table groups). I use 
8-bit top hashes instead of 7, which reduces the number of false positives, 
and on arm64 I use a perfect (exact) match, which also reduces false 
positives. On amd64 the Go map already gets a perfect match since it uses 
SIMD instructions. 
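As a sketch of what I mean by an 8-bit top hash with an exact tag match (simplified, illustrative names):

```go
package main

import "fmt"

// topHash extracts an 8-bit tag from the full hash. Using all 8 bits
// (instead of 7 as in the Go map) halves the expected false-positive
// rate of a tag match before the full key comparison.
func topHash(h uint64) uint8 {
	return uint8(h >> 56) // top byte of the hash
}

// probeGroup compares the tag against every slot tag and only falls
// through to the full key comparison on an exact tag match.
func probeGroup(tags *[8]uint8, keys *[8]uint64, h, key uint64) int {
	t := topHash(h)
	for i := range tags {
		if tags[i] == t && keys[i] == key {
			return i
		}
	}
	return -1
}

func main() {
	var tags [8]uint8
	var keys [8]uint64
	h := uint64(0xDEADBEEF12345678)
	tags[2], keys[2] = topHash(h), 99
	fmt.Println(probeGroup(&tags, &keys, h, 99)) // 2
}
```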

Another difference is that the Go map uses quadratic probing and my map 
doesn't. The popular wisdom is that probing is faster because the 
probability that the data is in cache is slightly higher and the table 
fills better. Unfortunately it also increases the number of required 
tests, and false positives increase the penalty. The numbers I showed are 
for a different kind of table, composed of three arrays T1, T2 and T3. 
T1 is a conventional table of 8-item Swiss groups. When a group is full, 
the item is stored in the overflow hash table T2. When the group in T2 is 
full, the item is appended to T3, which is treated as a sequence of groups; 
I do a linear search in this table containing the overflows of overflows. I 
also implemented a table using quadratic probing, and it was faster than 
the Go map on arm64. I assume this is due to the 8-bit top hashes reducing 
false positives, but I'm not sure.

I tested independent hashes for T1 and T2, which then require the use of 
tombstones, as well as related hashes. By related hash I mean that the 
index in T2 is, for instance, the Fibonacci hash of the index in T1. The 
best filling rates of T2 were obtained with a random permutation mapping. 
The size ratio between T1 and T2 should in theory be 2:1, but the best 
performance was with 16:1. I assume this is because T2's data is hotter 
(in cache) than with a bigger table. The benefit of such a deterministic 
relation between the indices of T1 and T2 is that we don't need tombstones; 
we can use a swap-pop strategy. In summary, the idea was to minimize the 
number of probes required to find a key. 
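For illustration, the related-hash mapping could look like this (a sketch with made-up sizes, not my actual code):

```go
package main

import "fmt"

// t2Index maps a T1 group index deterministically to a T2 group index
// using a Fibonacci hash (multiply by 2^64/phi, keep the top t2Bits bits).
// Because the mapping is deterministic, a deleted item can be replaced by
// swap-pop instead of leaving a tombstone.
func t2Index(t1Index uint64, t2Bits uint) uint64 {
	const phiMul = 0x9E3779B97F4A7C15 // 2^64 / golden ratio
	return (t1Index * phiMul) >> (64 - t2Bits)
}

func main() {
	// With a 16:1 size ratio, a 4096-group T1 maps into a 256-group T2.
	for _, i := range []uint64{0, 1, 2, 4095} {
		fmt.Println(i, "->", t2Index(i, 8))
	}
}
```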

The table also uses an extendible hash, for which my code is simpler than 
the Go map's, though not faster. This directory indirection introduces a 
significant performance penalty: in a later benchmark, after the one I 
published here, I saw a significant performance increase by removing the 
directory. When the hash function is good, most tables have the same depth 
anyway. 

Another optimization compared to the Go map is that my tables in the 
directory don't grow. Using fixed-size arrays gives some performance 
advantage, as the boundaries and hash mask are constants. There is a small 
space-usage penalty, but when a table fills to 80%, for instance, and is 
split, the two halves are at 40% each and will soon need to grow again, 
which I considered a waste of effort. For a small number of items fitting 
in one table, it may be worth using a growable table, but I didn't bother 
to implement that. 

I also saw that function calls are expensive, so I used manual inlining 
for the Get method. It would be nice to avoid this for code readability 
and maintenance, though. 

I tried assembly on arm64, where I could experiment with SIMD instructions, 
but due to the assembly function-call overhead it was slower than pure Go. 
Passing arguments by register wouldn't help; we would need assembly 
inlining, but I understand that this would open a can of worms and probably 
create more problems than it would solve.

If you are still interested and want some specific numbers, I would be glad 
to provide them. 

Here is the cache-miss benchmark that shows the benefit of reducing the 
number of probes:

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
                      │          sec/op          │    sec/op      vs base     │
Cache2Miss/_______1-8                4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
Cache2Miss/______10-8                5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
Cache2Miss/_____100-8                5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
Cache2Miss/____1000-8                6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
Cache2Miss/___10000-8                9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
Cache2Miss/__100000-8                14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
Cache2Miss/_1000000-8                29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
Cache2Miss/10000000-8                40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
geomean                              10.60n         17.80n         +67.98%

Here is the cache-hit benchmark comparing dirstr12 with flat8, which has no directory: 

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
                     │        sec/op         │    sec/op     vs base    │
Cache2Hit/_______1-8             6.047n ± 0%    6.246n ± 5%   +3.28% (p=0.002 n=6)
Cache2Hit/______10-8             8.172n ± 0%    8.499n ± 0%   +4.00% (p=0.002 n=6)
Cache2Hit/_____100-8             7.893n ± 0%    8.256n ± 5%   +4.61% (p=0.002 n=6)
Cache2Hit/____1000-8             7.585n ± 6%    8.228n ± 2%   +8.48% (p=0.002 n=6)
Cache2Hit/___10000-8             9.191n ± 0%   10.375n ± 1%  +12.88% (p=0.002 n=6)
Cache2Hit/__100000-8             10.03n ± 4%    12.14n ± 1%  +21.15% (p=0.002 n=6)
Cache2Hit/_1000000-8             40.79n ± 1%    42.27n ± 4%   +3.64% (p=0.002 n=6)
Cache2Hit/10000000-8             47.29n ± 1%    56.49n ± 9%  +19.47% (p=0.002 n=6)
geomean                          12.31n         13.47n        +9.48%

I currently don't have access to an x86 computer. Thank you for your mail. 

On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:

> Certainly interesting. We'd be interested in incorporating some ideas from 
> your experiments if they translate over.
>
> One thing to caution, you showed no numbers for space utilization, map 
> miss speed, whether your map can do iteration, etc. It is possible to trade 
> off one of these for other ones, so it is not immediately obvious that your 
> techniques would apply to the builtin map. For instance, if you're using 
> 10x the space to get 10% speedup, we probably wouldn't want to do that for 
> the general map. (Not saying that's what you did, just pointing out raw 
> speed is not the only consideration.)
>
> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7 christoph...@gmail.com 
> wrote:
>
>> Hello, 
>>
>> trying to implement a fast cache, I inadvertently entered a rabbit hole 
>> that led me to implement my own map. In the process I tried to make it 
>> faster than the go map just to see if it is possible. I worked weeks on it 
>> trying out various architectures and methods. 
>>
>> On my mac book air M2, I get the following benchmarks for a Get 
>> operation. The numbers are the number of items inserted in the table.  My 
>> keys are 8 byte long strings. 
>>
>> goos: darwin
>> goarch: arm64
>> pkg: fastCache/map
>> cpu: Apple M2
>>                      │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>                      │          sec/op          │    sec/op      vs base     │
>> Cache2Hit/_______1-8               6.151n ±  6%    7.087n ±  1%   +15.22% (p=0.002 n=6)
>> Cache2Hit/______10-8               8.491n ±  0%    8.156n ± 29%         ~ (p=0.394 n=6)
>> Cache2Hit/_____100-8               8.141n ±  7%   14.185n ± 13%   +74.24% (p=0.002 n=6)
>> Cache2Hit/____1000-8               8.252n ±  3%   10.635n ± 39%   +28.89% (p=0.002 n=6)
>> Cache2Hit/___10000-8               10.45n ±  2%    20.99n ±  4%  +100.81% (p=0.002 n=6)
>> Cache2Hit/__100000-8               12.16n ±  1%    19.11n ± 10%   +57.05% (p=0.002 n=6)
>> Cache2Hit/_1000000-8               42.28n ±  2%    47.90n ±  2%   +13.29% (p=0.002 n=6)
>> Cache2Hit/10000000-8               56.38n ± 12%    61.82n ±  6%         ~ (p=0.065 n=6)
>> geomean                            13.44n          17.86n         +32.91%
>>
>> On my amd64 i5 11th gen I get the following benchmarks.
>>
>> goos: linux
>> goarch: amd64
>> pkg: fastCache/map
>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>                       │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt │
>>                       │          sec/op          │    sec/op      vs base     │
>> Cache2Hit/_______1-12               9.207n ±  1%    7.506n ±  3%  -18.48% (p=0.002 n=6)
>> Cache2Hit/______10-12               9.223n ±  0%    8.806n ±  6%        ~ (p=0.058 n=6)
>> Cache2Hit/_____100-12               9.279n ±  2%   10.175n ±  3%   +9.66% (p=0.002 n=6)
>> Cache2Hit/____1000-12               10.45n ±  2%    11.29n ±  3%   +8.04% (p=0.002 n=6)
>> Cache2Hit/___10000-12               16.00n ±  2%    17.21n ±  5%   +7.59% (p=0.002 n=6)
>> Cache2Hit/__100000-12               22.20n ± 17%    24.73n ± 22%  +11.42% (p=0.026 n=6)
>> Cache2Hit/_1000000-12               87.75n ±  2%    91.05n ±  5%   +3.76% (p=0.009 n=6)
>> Cache2Hit/10000000-12               104.2n ±  2%    105.6n ±  5%        ~ (p=0.558 n=6)
>> geomean                             20.11n          20.49n         +1.90%
>>
>> On amd64 the performance is on par with go map, but go map uses inlined 
>> simd instructions which I don’t use because I don’t have access to it in 
>> pure go. I use xxh3 right out of the box for the hash function. The 
>> differences are not due to the hash function.
>>
>> If there is interest in this, please let me know.
>>
>>
>>
>>
