Hello, the code is published here https://github.com/chmike/fastmap.
On Wednesday, June 25, 2025 at 12:16:54 UTC+2, christoph...@gmail.com wrote:

> By swap pop I mean that a deleted item is replaced by the last item of the
> group or the chain. This will indeed move items.
>
> I have implemented a new version using 8-bit top hashes, tombstones, a
> perfect match, quadratic probing and an extensible hash table. It leaves
> items in place and might thus be compatible with the iteration
> requirements of go maps.
>
> I plan to publish it on GitHub but still need to verify deleting items, on
> which I didn't spend much time so far. The following benchmark is with
> tables of 256 groups. A table is split when reaching 90%. The filling rate
> is then far better than with my 3-table architecture. Using quadratic
> probing would then be a better choice.
>
> goos: darwin
> goarch: arm64
> pkg: fastmap/map
> cpu: Apple M2
>                         │ altmap/stats_arm64.txt │   stdmap/stats_arm64.txt    │
>                         │         sec/op         │   sec/op     vs base        │
> Cache2Hit/_______1-8         6.386n ± 0%    7.081n ±  0%   +10.88% (p=0.000 n=10)
> Cache2Hit/______10-8         8.813n ± 0%    8.114n ± 30%         ~ (p=0.481 n=10)
> Cache2Hit/_____100-8         8.473n ± 7%   15.720n ±  8%   +85.52% (p=0.000 n=10)
> Cache2Hit/____1000-8         8.735n ± 2%   13.715n ± 23%   +57.01% (p=0.000 n=10)
> Cache2Hit/___10000-8         11.04n ± 1%    20.38n ±  6%   +84.56% (p=0.000 n=10)
> Cache2Hit/__100000-8         13.19n ± 1%    17.62n ± 17%   +33.59% (p=0.000 n=10)
> Cache2Hit/_1000000-8         52.02n ± 1%    55.45n ±  0%    +6.59% (p=0.001 n=10)
> Cache2Hit/10000000-8         64.91n ± 1%    68.12n ±  3%    +4.96% (p=0.002 n=10)
> Cache2Miss/_______1-8        5.003n ± 0%    8.243n ±  0%   +64.73% (p=0.000 n=10)
> Cache2Miss/______10-8        5.878n ± 3%    9.238n ± 11%   +57.18% (p=0.000 n=10)
> Cache2Miss/_____100-8        6.037n ± 4%   12.600n ±  9%  +108.70% (p=0.000 n=10)
> Cache2Miss/____1000-8        6.992n ± 7%   12.015n ± 17%   +71.83% (p=0.000 n=10)
> Cache2Miss/___10000-8        10.49n ± 2%    17.66n ±  9%   +68.40% (p=0.000 n=10)
> Cache2Miss/__100000-8        20.73n ± 1%    26.56n ±  2%   +28.09% (p=0.000 n=10)
> Cache2Miss/_1000000-8        33.52n ± 1%    36.83n ±  1%    +9.89% (p=0.000 n=10)
> Cache2Miss/10000000-8        48.62n ± 1%    50.84n ±  1%    +4.56% (p=0.000 n=10)
> geomean                      13.25n         18.38n          +38.75%
>
> To be honest I can't fully explain these numbers. Reducing false positives
> by using an exact match and 8-bit top hashes might be part of it. It could
> also be the hash function. xxh3 on arm64 is pure Go only. It is fast for
> short strings, but bad for long strings as it doesn't use SIMD (yet). The
> 8-byte key strings are fmt.Sprintf("%7d ", i), and fmt.Sprintf("%7d-", i)
> for misses. I hope that these numbers don't result from a bug in my code.
>
> For the top hash I use the H1, H2 terminology of swiss tables. H2 is the 8
> least significant bits of the hash. 0x00 is used for free slots, and 0x80
> for the tombstone. Computing H2 thus implies a conditional branch, but the
> CPU may predict its outcome correctly most of the time. I benchmarked
> 7-bit top hashes, but they ended up being slower. I assume that was due to
> the increased number of false positives. Using 0x80 for the tombstone
> saves one operation when searching for free slots in a group. If the
> speedup is due to the 8-bit top hashes, I suspect that the C++ swiss table
> might also benefit from it.
>
> On amd64, the performance is on par with go map, but I don't use SIMD
> instructions.
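>
> For illustration, a minimal sketch of this H2 computation (the helper name
> and the exact remapping of the two reserved values are illustrative; the
> real code may differ):
>
>     // topHash derives the 8-bit top hash (H2) from the full hash.
>     // 0x00 is reserved for free slots and 0x80 for tombstones, so a
>     // stored H2 must avoid both; the remapping is the conditional
>     // branch mentioned above, which the CPU predicts most of the time.
>     func topHash(hash uint64) uint8 {
>         h2 := uint8(hash) // the 8 least significant bits
>         if h2 == 0x00 || h2 == 0x80 {
>             h2 |= 1 // remap the two reserved control values
>         }
>         return h2
>     }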
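>
> The quadratic probing over groups can be sketched the same way, assuming a
> power-of-two group count so that the triangular steps visit every group
> exactly once (names are illustrative):
>
>     // nextGroup advances a triangular-number probe sequence over the
>     // groups: g, g+1, g+3, g+6, ... all masked to the group count.
>     // Start with g = H1 & mask and step = 0.
>     func nextGroup(g, step, mask uint64) (next, nextStep uint64) {
>         step++
>         return (g + step) & mask, step
>     }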
> On Monday, June 23, 2025 at 22:41:46 UTC+2, Keith Randall wrote:
>
>> > As key comparison is very fast (no secondary memory access), a simple
>> > sequential key comparison is fast.
>>
>> We've looked at that, see CL 626699
>> <https://go-review.googlesource.com/c/go/+/626699>. Maybe it helps in
>> some cases, but it seems to hurt miss lookups. We decided not to move
>> forward with it, but it is definitely something to think about.
>>
>> > I use 8-bit top hashes instead of 7
>>
>> How do you record a presence bit?
>>
>> > We can use a swap pop strategy.
>>
>> Not sure exactly what that is, but it sounds incompatible with
>> iteration. Our required iteration semantics mean it is very hard to
>> move keys.
>>
>> > There is a small space usage penalty, but when a table is split at 80%
>> > for instance, it becomes 40%. It is then very close to needing a grow,
>> > which I considered a waste of effort.
>>
>> I think when we split a table we use the max-sized table for each part?
>> I'm not sure what effect you're seeing here then.
>>
>> > I also saw that function calls are expensive. I then used manual
>> > inlining for the Get method. It would be nice to avoid this for code
>> > readability and maintenance, though.
>>
>> We could theoretically have the compiler do this for builtin maps.
>>
>> On Sunday, June 22, 2025 at 1:28:26 AM UTC-7, christoph...@gmail.com wrote:
>>
>>> Thank you very much for your response. I'll summarize here the main
>>> differences with go map.
>>>
>>> I very easily got faster than go map with integers and the like by
>>> dropping the swiss groups and reducing the group size to fit a cache
>>> line. As key comparison is very fast (no secondary memory access), a
>>> simple sequential key comparison is fast. The speed gain might be due
>>> to the hash function, as I rewrote the xxh3 hasher for uint64 values. I
>>> didn't spend much time exploring this further as it wasn't my use case.
>>>
>>> String keys definitely benefit from top hashes (swiss table groups). I
>>> use 8-bit top hashes instead of 7, which reduces the number of false
>>> positives, and I use a perfect match, which also reduces the number of
>>> false positives on arm64. amd64 uses a perfect match as it uses the
>>> SIMD instructions.
>>>
>>> Another difference is that go map uses quadratic probing and my map
>>> doesn't. The popular wisdom is that probing is faster because the
>>> probability that the data is in cache is slightly higher and the table
>>> filling is better. Unfortunately, this increases the number of required
>>> tests, where false positives increase the penalty. The numbers I showed
>>> are for a different kind of table. It is composed of three arrays: T1,
>>> T2 and T3. T1 is a conventional table of 8-item swiss groups. When a
>>> group is full, the item is stored in the overflow hash table T2. When
>>> the group in T2 is full, the item is appended to T3, which is treated
>>> as a sequence of groups. I do a linear search in this table containing
>>> the overflows of overflows. I implemented a table using quadratic
>>> probing and it was faster than go map on arm64. I assume this is due to
>>> the 8-bit top hashes reducing false positives, but I'm not sure.
>>>
>>> I tested independent hashes for T1 and T2, which then require the use
>>> of tombstones, and related hashes. By related hash I mean that the
>>> index in T2 is, for instance, the Fibonacci hash of the index in T1.
>>> The best filling rates of T2 were obtained with a random permutation
>>> mapping. The size ratio between T1 and T2 should in theory be 2:1, but
>>> the best performance was with 16:1. I assume that this is because T2
>>> data is hotter (in cache) than with a bigger table. The benefit of such
>>> a deterministic relation between the indices of T1 and T2 is that we
>>> don't need tombstones. We can use a swap pop strategy.
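>>>
>>> To make the "related hash" idea concrete, a minimal sketch (the
>>> constant and the names are illustrative, not my actual code):
>>>
>>>     // t2Index derives the T2 index deterministically from the T1
>>>     // index with Fibonacci hashing: multiply by 2^64/phi and keep
>>>     // the top bits. Because the mapping is deterministic, T2 needs
>>>     // no tombstones.
>>>     const fib64 = 0x9E3779B97F4A7C15 // 2^64 / golden ratio, rounded
>>>
>>>     func t2Index(t1Index uint64, t2Bits uint) uint64 {
>>>         return (t1Index * fib64) >> (64 - t2Bits)
>>>     }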
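>>>
>>> And the swap pop itself is just this (illustrative types):
>>>
>>>     type entry struct {
>>>         key string
>>>         val uint64
>>>     }
>>>
>>>     // swapPop deletes entries[i] by overwriting it with the last
>>>     // entry and shrinking the slice by one. No hole is left behind,
>>>     // but an item moves, which is what conflicts with go map's
>>>     // iteration semantics.
>>>     func swapPop(entries []entry, i int) []entry {
>>>         last := len(entries) - 1
>>>         entries[i] = entries[last]
>>>         return entries[:last]
>>>     }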
>>> In summary, the idea was to minimize the number of required probes to
>>> find a key.
>>>
>>> The table also uses an extensible hash, for which I have simpler code
>>> than go map, though it won't go faster. This directory indirection
>>> introduces a significant performance penalty. In a later benchmark,
>>> after the one I published here, I saw a significant performance
>>> increase by removing the directory. When the hash function is good,
>>> most tables have the same depth. (A small sketch of this directory
>>> lookup follows the benchmarks below.)
>>>
>>> Another optimization compared to go map is that my tables in the
>>> directory don't grow. Using fixed-size arrays gives some performance
>>> advantage, as boundaries and the hash mask are constants. There is a
>>> small space usage penalty, but when a table is split at 80% for
>>> instance, it becomes 40%. It is then very close to needing a grow,
>>> which I considered a waste of effort. For a small number of items
>>> fitting in one table, it may be worth using a growable table, which I
>>> didn't bother to implement.
>>>
>>> I also saw that function calls are expensive. I then used manual
>>> inlining for the Get method. It would be nice to avoid this for code
>>> readability and maintenance, though.
>>>
>>> I tried assembly on arm64, where I could experiment with SIMD
>>> instructions, but due to the assembly function call overhead it was
>>> slower than pure Go. Passing arguments by register wouldn't help. We
>>> would need assembly inlining, but I understand that this would open a
>>> can of worms and probably create more problems than it would solve.
>>>
>>> If you are still interested and want some specific numbers, I would be
>>> glad to give them.
>>>
>>> Here is the cache miss benchmark that shows the benefit of reducing the
>>> number of probes:
>>>
>>> goos: darwin
>>> goarch: arm64
>>> pkg: fastCache/map
>>> cpu: Apple M2
>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>>                         │          sec/op          │   sec/op     vs base       │
>>> Cache2Miss/_______1-8        4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
>>> Cache2Miss/______10-8        5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
>>> Cache2Miss/_____100-8        5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
>>> Cache2Miss/____1000-8        6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
>>> Cache2Miss/___10000-8        9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
>>> Cache2Miss/__100000-8        14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
>>> Cache2Miss/_1000000-8        29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
>>> Cache2Miss/10000000-8        40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
>>> geomean                      10.60n         17.80n         +67.98%
>>>
>>> Here is the cache hit benchmark comparing dirstr12 with flat8, which
>>> has no directory:
>>>
>>> goos: darwin
>>> goarch: arm64
>>> pkg: fastCache/map
>>> cpu: Apple M2
>>>                         │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
>>>                         │        sec/op         │   sec/op    vs base      │
>>> Cache2Hit/_______1-8         6.047n ± 0%    6.246n ± 5%    +3.28% (p=0.002 n=6)
>>> Cache2Hit/______10-8         8.172n ± 0%    8.499n ± 0%    +4.00% (p=0.002 n=6)
>>> Cache2Hit/_____100-8         7.893n ± 0%    8.256n ± 5%    +4.61% (p=0.002 n=6)
>>> Cache2Hit/____1000-8         7.585n ± 6%    8.228n ± 2%    +8.48% (p=0.002 n=6)
>>> Cache2Hit/___10000-8         9.191n ± 0%   10.375n ± 1%   +12.88% (p=0.002 n=6)
>>> Cache2Hit/__100000-8         10.03n ± 4%    12.14n ± 1%   +21.15% (p=0.002 n=6)
>>> Cache2Hit/_1000000-8         40.79n ± 1%    42.27n ± 4%    +3.64% (p=0.002 n=6)
>>> Cache2Hit/10000000-8         47.29n ± 1%    56.49n ± 9%   +19.47% (p=0.002 n=6)
>>> geomean                      12.31n         13.47n         +9.48%
>>>
>>> I currently don't have access to an x86 computer. Thank you for your
>>> mail.
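>>>
>>> As promised above, here is a sketch of the directory lookup of the
>>> extensible hash table (all types and field names are illustrative, not
>>> my actual code):
>>>
>>>     type group struct{ /* 8 top hashes plus 8 key/value slots */ }
>>>
>>>     type table struct {
>>>         depth  uint    // local depth of this table
>>>         groups []group // a fixed-size array in practice
>>>     }
>>>
>>>     type Map struct {
>>>         globalDepth uint     // log2 of the directory size
>>>         dir         []*table // several entries may share one table
>>>     }
>>>
>>>     // tableFor selects the table with the top globalDepth bits of
>>>     // the hash. This extra pointer chase is the directory
>>>     // indirection whose removal gave the speedup mentioned above.
>>>     func (m *Map) tableFor(hash uint64) *table {
>>>         return m.dir[hash>>(64-m.globalDepth)]
>>>     }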
>>> On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:
>>>
>>>> Certainly interesting. We'd be interested in incorporating some ideas
>>>> from your experiments if they translate over.
>>>>
>>>> One thing to caution: you showed no numbers for space utilization, map
>>>> miss speed, whether your map can do iteration, etc. It is possible to
>>>> trade off one of these for the others, so it is not immediately
>>>> obvious that your techniques would apply to the builtin map. For
>>>> instance, if you're using 10x the space to get a 10% speedup, we
>>>> probably wouldn't want to do that for the general map. (Not saying
>>>> that's what you did, just pointing out that raw speed is not the only
>>>> consideration.)
>>>>
>>>> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7, christoph...@gmail.com wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> trying to implement a fast cache, I inadvertently entered a rabbit
>>>>> hole that led me to implement my own map. In the process I tried to
>>>>> make it faster than the go map, just to see if it is possible. I
>>>>> worked on it for weeks, trying out various architectures and methods.
>>>>>
>>>>> On my MacBook Air M2, I get the following benchmarks for a Get
>>>>> operation. The numbers in the benchmark names are the number of items
>>>>> inserted in the table. My keys are 8-byte strings.
>>>>>
>>>>> goos: darwin
>>>>> goarch: arm64
>>>>> pkg: fastCache/map
>>>>> cpu: Apple M2
>>>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>>>>                         │          sec/op          │   sec/op     vs base       │
>>>>> Cache2Hit/_______1-8        6.151n ±  6%    7.087n ±  1%   +15.22% (p=0.002 n=6)
>>>>> Cache2Hit/______10-8        8.491n ±  0%    8.156n ± 29%         ~ (p=0.394 n=6)
>>>>> Cache2Hit/_____100-8        8.141n ±  7%   14.185n ± 13%   +74.24% (p=0.002 n=6)
>>>>> Cache2Hit/____1000-8        8.252n ±  3%   10.635n ± 39%   +28.89% (p=0.002 n=6)
>>>>> Cache2Hit/___10000-8        10.45n ±  2%    20.99n ±  4%  +100.81% (p=0.002 n=6)
>>>>> Cache2Hit/__100000-8        12.16n ±  1%    19.11n ± 10%   +57.05% (p=0.002 n=6)
>>>>> Cache2Hit/_1000000-8        42.28n ±  2%    47.90n ±  2%   +13.29% (p=0.002 n=6)
>>>>> Cache2Hit/10000000-8        56.38n ± 12%    61.82n ±  6%         ~ (p=0.065 n=6)
>>>>> geomean                     13.44n          17.86n         +32.91%
>>>>>
>>>>> On my amd64 i5 11th gen I get the following benchmarks:
>>>>>
>>>>> goos: linux
>>>>> goarch: amd64
>>>>> pkg: fastCache/map
>>>>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>>>>                          │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt │
>>>>>                          │          sec/op          │   sec/op     vs base       │
>>>>> Cache2Hit/_______1-12        9.207n ±  1%    7.506n ±  3%   -18.48% (p=0.002 n=6)
>>>>> Cache2Hit/______10-12        9.223n ±  0%    8.806n ±  6%         ~ (p=0.058 n=6)
>>>>> Cache2Hit/_____100-12        9.279n ±  2%   10.175n ±  3%    +9.66% (p=0.002 n=6)
>>>>> Cache2Hit/____1000-12        10.45n ±  2%    11.29n ±  3%    +8.04% (p=0.002 n=6)
>>>>> Cache2Hit/___10000-12        16.00n ±  2%    17.21n ±  5%    +7.59% (p=0.002 n=6)
>>>>> Cache2Hit/__100000-12        22.20n ± 17%    24.73n ± 22%   +11.42% (p=0.026 n=6)
>>>>> Cache2Hit/_1000000-12        87.75n ±  2%    91.05n ±  5%    +3.76% (p=0.009 n=6)
>>>>> Cache2Hit/10000000-12        104.2n ±  2%    105.6n ±  5%         ~ (p=0.558 n=6)
>>>>> geomean                      20.11n          20.49n          +1.90%
>>>>>
>>>>> On amd64 the performance is on par with go map, but go map uses
>>>>> inlined SIMD instructions, which I don't use because I don't have
>>>>> access to them in pure Go. I use xxh3 right out of the box as the
>>>>> hash function. The differences are not due to the hash function.
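>>>>>
>>>>> For reference, a group match can still be done in pure Go without
>>>>> SIMD using the classic SWAR zero-byte trick; this sketch is an
>>>>> assumption about how such a match looks, not necessarily what go map
>>>>> or my code does:
>>>>>
>>>>>     // matchH2 compares an 8-bit top hash against the 8 control
>>>>>     // bytes of a group packed into a uint64. The result has the
>>>>>     // high bit set in every byte lane that is a candidate match.
>>>>>     // Rare false positives are possible; the key comparison that
>>>>>     // follows filters them out anyway.
>>>>>     func matchH2(ctrl uint64, h2 uint8) uint64 {
>>>>>         const lsb = 0x0101010101010101
>>>>>         const msb = 0x8080808080808080
>>>>>         x := ctrl ^ (lsb * uint64(h2)) // zero byte where H2 matches
>>>>>         return (x - lsb) &^ x & msb
>>>>>     }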
>>>>> If there is interest in this, please let me know.