Hi Christoph,

You might have seen my deterministic map findings, which showed that if your
access pattern is repeated full range scans, then caching the key/value pairs
in a slice soundly trounces the built-in map due to better L1 utilization.

https://groups.google.com/g/golang-nuts/c/MeY_OKAhjQg

-- which is just to say: if you are mostly doing full range scans and a
repeatable, deterministic iteration order is okay, then caching can probably
let you go even faster than you are now; I give a point example of going 14x
faster than the Go builtin map.

Best,
Jason

On Friday, June 27, 2025 at 9:23:53 AM UTC+2 christoph...@gmail.com wrote:

Just benchmarked the fastmap with integer keys. It is exactly the same as 
the one I published, but with integer keys and an xxh3 function for integer 
values. 

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ cacheint6/stats_arm64.txt │ puremapint/stats_arm64.txt   │
                     │          sec/op           │    sec/op     vs base        │
Cache2Hit/_______1-8                 3.496n ± 0%   2.065n ±  0%  -40.93% (p=0.002 n=6)
Cache2Hit/______10-8                 4.327n ± 2%   5.052n ±  0%  +16.74% (p=0.002 n=6)
Cache2Hit/_____100-8                 4.364n ± 4%   5.731n ± 12%  +31.32% (p=0.002 n=6)
Cache2Hit/____1000-8                 4.248n ± 5%   5.935n ±  3%  +39.73% (p=0.002 n=6)
Cache2Hit/___10000-8                 5.167n ± 1%   7.553n ±  1%  +46.18% (p=0.002 n=6)
Cache2Hit/__100000-8                 6.763n ± 1%   9.300n ±  3%  +37.52% (p=0.002 n=6)
Cache2Hit/_1000000-8                 18.25n ± 1%   26.32n ±  1%  +44.26% (p=0.002 n=6)
Cache2Hit/10000000-8                 28.11n ± 0%   37.95n ±  0%  +35.05% (p=0.002 n=6)
geomean                              6.881n        8.405n        +22.15%

On Thursday, June 26, 2025 at 9:37:45 AM UTC+2, christoph wrote:

Hello, 

the code is published here https://github.com/chmike/fastmap.

On Wednesday, June 25, 2025 at 12:16:54 PM UTC+2, christoph wrote:

By swap pop I mean that a deleted item is replaced by the last item of the 
group or the chain. This will indeed move items.
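
Roughly, a swap-pop delete on a group could look like the following sketch 
(simplified layout and field names, not the actual fastmap code):

package fastmapsketch

const groupSize = 8

// group is a hypothetical dense group of key/value slots.
type group struct {
    tops   [groupSize]uint8 // 8-bit top hashes; 0 means free
    keys   [groupSize]uint64
    values [groupSize]uint64
    n      int // occupied slots, kept contiguous at the front
}

// swapPopDelete removes slot i by overwriting it with the last occupied
// slot ("swap pop"). The group stays dense, but an item is moved, which
// is what makes the scheme hard to reconcile with Go's map iteration
// semantics.
func (g *group) swapPopDelete(i int) {
    last := g.n - 1
    g.keys[i] = g.keys[last]
    g.values[i] = g.values[last]
    g.tops[i] = g.tops[last]
    g.tops[last] = 0 // the vacated slot is now free
    g.n = last
}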

I have implemented a new version using 8-bit top hashes, tombstones, perfect 
match, quadratic probing and an extensible hash table. It leaves items in 
place and might thus be compatible with the iteration requirements of go 
maps.

I plan to publish it on github but still need to verify deleting items, on 
which I didn't spend much time so far. The following benchmark is with 
tables of 256 groups. A table is split when reaching 90%. The filling rate 
is then far better than with my 3-table architecture. Using quadratic 
probing would then be a better choice. 

goos: darwin
goarch: arm64
pkg: fastmap/map
cpu: Apple M2
                      │ altmap/stats_arm64.txt │     stdmap/stats_arm64.txt     │
                      │         sec/op         │    sec/op      vs base         │
Cache2Hit/_______1-8               6.386n ± 0%    7.081n ±  0%   +10.88% (p=0.000 n=10)
Cache2Hit/______10-8               8.813n ± 0%    8.114n ± 30%         ~ (p=0.481 n=10)
Cache2Hit/_____100-8               8.473n ± 7%   15.720n ±  8%   +85.52% (p=0.000 n=10)
Cache2Hit/____1000-8               8.735n ± 2%   13.715n ± 23%   +57.01% (p=0.000 n=10)
Cache2Hit/___10000-8               11.04n ± 1%    20.38n ±  6%   +84.56% (p=0.000 n=10)
Cache2Hit/__100000-8               13.19n ± 1%    17.62n ± 17%   +33.59% (p=0.000 n=10)
Cache2Hit/_1000000-8               52.02n ± 1%    55.45n ±  0%    +6.59% (p=0.001 n=10)
Cache2Hit/10000000-8               64.91n ± 1%    68.12n ±  3%    +4.96% (p=0.002 n=10)
Cache2Miss/_______1-8              5.003n ± 0%    8.243n ±  0%   +64.73% (p=0.000 n=10)
Cache2Miss/______10-8              5.878n ± 3%    9.238n ± 11%   +57.18% (p=0.000 n=10)
Cache2Miss/_____100-8              6.037n ± 4%   12.600n ±  9%  +108.70% (p=0.000 n=10)
Cache2Miss/____1000-8              6.992n ± 7%   12.015n ± 17%   +71.83% (p=0.000 n=10)
Cache2Miss/___10000-8              10.49n ± 2%    17.66n ±  9%   +68.40% (p=0.000 n=10)
Cache2Miss/__100000-8              20.73n ± 1%    26.56n ±  2%   +28.09% (p=0.000 n=10)
Cache2Miss/_1000000-8              33.52n ± 1%    36.83n ±  1%    +9.89% (p=0.000 n=10)
Cache2Miss/10000000-8              48.62n ± 1%    50.84n ±  1%    +4.56% (p=0.000 n=10)
geomean                            13.25n         18.38n         +38.75%

To be honest I can't fully explain these numbers. Reducing false positives 
by using an exact match and 8-bit top hashes might be part of it. It could 
also be the hash function. xxh3 on arm64 is pure go only. It is fast for 
short strings, but bad for long strings as it doesn't use simd (yet). The 
8-byte string keys are fmt.Sprintf("%7d ", i), and fmt.Sprintf("%7d-", i) 
for misses. I hope that these numbers don't result from a bug in my code. 

For the top hash I use the H1, H2 terminology of swiss tables. H2 is the 8 
least significant bits of the hash. 0x00 is used for free slots, and 0x80 
for the tombstone. Computing H2 thus implies a conditional branch, but the 
cpu may predict its outcome right most of the time. I benchmarked 7-bit top 
hashes, but it ended up being slower. I assumed it was due to the increased 
number of false positives. Using 0x80 for the tombstone saves one op when 
looking for free slots in a group. If the speed-up is due to the 8-bit top 
hashes, I suspect that the C++ swiss table might also benefit from it.
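
For illustration, computing such an H2 could look roughly like this (same 
sketch package as above; the way the two reserved values are remapped is 
simplified and not necessarily what my code does):

const (
    slotFree      = 0x00 // reserved top-hash value for an empty slot
    slotTombstone = 0x80 // reserved top-hash value for a deleted slot
)

// top returns an 8-bit top hash (H2) from the full 64-bit hash.
// The two reserved values are remapped, which is the conditional branch
// mentioned above; only 2 of 256 values need it, so the cpu predicts it
// correctly almost every time.
func top(hash uint64) uint8 {
    h2 := uint8(hash)
    if h2 == slotFree || h2 == slotTombstone {
        h2 |= 1 // any non-reserved value will do
    }
    return h2
}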

On amd64, the performance is on par with go map, but I don't use simd 
instructions. 

On Monday, June 23, 2025 at 10:41:46 PM UTC+2, Keith Randall wrote:

> As key comparison is very fast (no secondary memory access) a simple 
sequential key comparison is fast.

We've looked at that, see CL 626699 
<https://go-review.googlesource.com/c/go/+/626699>. Maybe helps in some 
cases, seems to hurt for miss lookups. We decided not to move forward with 
it, but it is definitely something to think about.

> I use 8 bit top hashes instead of 7

How do you record a presence bit?

> We can use a swap pop strategy.

Not sure exactly what that is, but it sounds incompatible with iteration. 
Our required iteration semantics mean it is very hard to move keys.

> There is a small space usage penalty but when a table is split at 80% for 
instance, it becomes 40%. It is then very close to need a grow which I 
considered a waste of effort.

I think when we split a table we use the max-sized table for each part? I'm 
not sure what effect you're seeing here then.

> I also saw that function calls are expensive. I then used manual inlining 
for the Get method. It is nice to avoid for code readability and 
maintenance though.

We could theoretically have the compiler do this for builtin maps.
On Sunday, June 22, 2025 at 1:28:26 AM UTC-7 christoph  wrote:

Thank you very much for your response. I'll summarize here the main 
differences with go map. 

I very easily got faster than go map with integers and the like by dropping 
the swiss groups and reducing the group size to fit a cache line. As key 
comparison is very fast (no secondary memory access), a simple sequential 
key comparison is fast. The speed gain might be due to the hash function, as 
I rewrote the xxh3 hasher for uint64 values. I didn't spend much time 
exploring this further as it wasn't my use case. 
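
For illustration, with uint64 keys a group whose key array fills one 64-byte 
cache line looks roughly like this (a simplified sketch, not the actual 
code):

package fastmapsketch

// intGroup holds 8 uint64 keys, so the key array fills exactly one
// 64-byte cache line. Lookup is a plain sequential scan, which is cheap
// because comparing the keys needs no secondary memory access.
type intGroup struct {
    used   int
    keys   [8]uint64
    values [8]uint64
}

// get scans the keys sequentially and reports whether key is present.
func (g *intGroup) get(key uint64) (uint64, bool) {
    for i := 0; i < g.used; i++ {
        if g.keys[i] == key {
            return g.values[i], true
        }
    }
    return 0, false
}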

String keys definitely benefit from top hashes (swiss table groups). I use 
8-bit top hashes instead of 7, which reduces the number of false positives, 
and I use a perfect match, which also reduces false positives on arm64. 
amd64 uses a perfect match as it uses the simd instructions. 
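
By perfect match I mean a match over the group's top-hash bytes that cannot 
produce false positives. Without simd, one way to get that is a SWAR trick 
along these lines (a sketch, not necessarily the exact formulation I use):

// matchTopExact returns a mask with the high bit of each byte set where
// the corresponding byte of ctrl equals h2. For each byte y of x,
// ((y&0x7f)+0x7f)|y has its high bit set iff y != 0, and the addition
// cannot carry into the next byte, so unlike the classic
// (x-lsb) &^ x & msb trick there are no false positives.
func matchTopExact(ctrl uint64, h2 uint8) uint64 {
    const (
        lsb  = 0x0101010101010101
        low7 = 0x7f7f7f7f7f7f7f7f
        msb  = 0x8080808080808080
    )
    x := ctrl ^ (lsb * uint64(h2)) // byte becomes 0 where ctrl byte == h2
    return ^(((x & low7) + low7) | x) & msb
}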

Another difference is that go map uses quadratic probing and my map 
doesn't. The popular wisdom is that probing is faster because the 
probability that the data is in cache is slightly higher and the table 
filling is better. Unfortunately this increases the number of required 
tests, where false positives increase the penalty. The numbers I showed are 
for a different kind of table. It is composed of three arrays T1, T2 and T3. 
T1 is a conventional table of 8-item swiss groups. When a group is full, 
the item is stored in the overflow hash table T2. When the group in T2 is 
full, the item is appended to T3, which is treated as a sequence of groups. 
I do a linear search in this table containing the overflows of overflows. I 
implemented a table using quadratic probing and it was faster than go map 
on arm64. I assume it is due to the use of 8-bit top hashes reducing false 
positives. I'm not sure.
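
Reusing the intGroup sketch above (with string keys the groups also carry 
top hashes, but the shape of the lookup is the same), the cascade is 
roughly:

// triTable is the three-array layout described above: T1 is probed
// first, then the overflow table T2, and finally T3 is scanned
// linearly as the overflow of the overflow.
type triTable struct {
    t1 []intGroup
    t2 []intGroup
    t3 []intGroup
}

// t2Index is a fixed, deterministic mapping from a T1 group index to a
// T2 group index; one concrete choice is sketched further below.
func (t *triTable) t2Index(i1 uint64) uint64 {
    return i1 & uint64(len(t.t2)-1) // placeholder: T2 sizes are powers of two
}

func (t *triTable) get(hash, key uint64) (uint64, bool) {
    i1 := hash & uint64(len(t.t1)-1) // len(t.t1) is a power of two
    if v, ok := t.t1[i1].get(key); ok {
        return v, true
    }
    if v, ok := t.t2[t.t2Index(i1)].get(key); ok {
        return v, true
    }
    for i := range t.t3 { // linear search over the overflows of overflows
        if v, ok := t.t3[i].get(key); ok {
            return v, true
        }
    }
    return 0, false
}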

I tested independent hashes for T1 and T2, which then requires the use of 
tombstones, and related hashes. By related hash I mean that the index in T2 
is for instance the fibonacci hash of the index in T1. Best filling rates 
of T2 were obtained with a random permutation mapping. The size ratio 
between T1 and T2 should in theory be 2:1, but best performance was with 
16:1. I assumed that this is because T2 data is hotter (in cache) than with 
a bigger table. The benefit of such a deterministic relation between the 
indexes of T1 and T2 is that we don't need tombstones. We can use a swap 
pop strategy. In summary, the idea was to minimize the number of probes 
required to find a key. 
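
Such a related-hash mapping could for instance be a fibonacci 
(multiplicative) hash of the T1 index (a sketch; as said above, my best 
filling rates actually came from a random permutation):

// fibT2Index derives the T2 group index deterministically from the T1
// group index; t2Bits is log2 of the number of T2 groups. Because the
// mapping is fixed, an item can always be found again from its T1
// group, so no tombstones are needed and a swap-pop delete works.
func fibT2Index(i1 uint64, t2Bits uint) uint64 {
    const fib = 0x9E3779B97F4A7C15 // 2^64 divided by the golden ratio
    return (i1 * fib) >> (64 - t2Bits)
}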

The table also uses an extensible hash, for which I have simpler code than 
go map. It won't go faster, but this directory indirection introduces a 
significant performance penalty. In a later benchmark, after the one I 
published here, I saw a significant performance increase by removing the 
directory. When the hash function is good, most tables have the same 
depth. 
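
For reference, the directory indirection mentioned here is roughly the 
following (sketch, reusing the triTable type from above):

// directory is a hypothetical extendible-hashing directory: the top
// globalDepth bits of the hash select a table, and several entries may
// point to the same table until that table is split. This extra
// indirection is the penalty mentioned above.
type directory struct {
    globalDepth uint
    tables      []*triTable // len(tables) == 1 << globalDepth
}

func (d *directory) lookup(hash uint64) *triTable {
    return d.tables[hash>>(64-d.globalDepth)]
}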

Another optimization compared to go map is that my tables in the directory 
don't grow. Using fixed-size arrays gives some performance advantage as 
boundaries and the hash mask are constants. There is a small space usage 
penalty, but when a table is split at 80% for instance, it becomes 40%. It 
is then very close to needing a grow, which I considered a waste of effort. 
For small numbers of items fitting in one table, it may be worth using a 
growable table, but I didn't bother to implement that. 
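
To illustrate why the constants matter: with a fixed-size table the group 
mask is a compile-time constant and the groups live in a fixed-size array 
rather than a slice (sketch, sizes illustrative):

const (
    tableGroups = 256             // fixed number of groups per table
    groupMask   = tableGroups - 1 // constant mask; the size is a power of two
)

// fixedTable never grows: when it gets too full it is split into two
// tables instead, so the mask and bounds stay constants.
type fixedTable struct {
    groups [tableGroups]intGroup // fixed-size array, not a slice
}

func (t *fixedTable) get(hash, key uint64) (uint64, bool) {
    return t.groups[hash&groupMask].get(key)
}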

I also saw that function calls are expensive. I then used manual inlining 
for the Get method. It would be nicer to avoid that for code readability 
and maintenance, though. 

I tried assembly on arm64 where I could experiment with simd instructions, 
but due to the assembly function call overhead it was slower than pure go. 
Passing arguments by register wouldn't help. We would need assembly 
inlining, but I understand that this would open a can of worms and probably 
create more problems than it would solve.

If you are still interested and want some specific numbers I would be glad 
to give them. 

Here is the cache miss benchmark that shows the benefit of reducing the 
number of probes:

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ dirstr12/stats_arm64.txt │  puremapstr/stats_arm64.txt   │
                      │          sec/op          │    sec/op      vs base        │
Cache2Miss/_______1-8                4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
Cache2Miss/______10-8                5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
Cache2Miss/_____100-8                5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
Cache2Miss/____1000-8                6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
Cache2Miss/___10000-8                9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
Cache2Miss/__100000-8                14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
Cache2Miss/_1000000-8                29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
Cache2Miss/10000000-8                40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
geomean                              10.60n         17.80n         +67.98%

Here is the cache hit benchmark comparing dirstr12 with flat8, which has no 
directory:

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ flat8/stats_arm64.txt │  dirstr12/stats_arm64.txt   │
                     │        sec/op         │    sec/op     vs base       │
Cache2Hit/_______1-8             6.047n ± 0%    6.246n ± 5%   +3.28% (p=0.002 n=6)
Cache2Hit/______10-8             8.172n ± 0%    8.499n ± 0%   +4.00% (p=0.002 n=6)
Cache2Hit/_____100-8             7.893n ± 0%    8.256n ± 5%   +4.61% (p=0.002 n=6)
Cache2Hit/____1000-8             7.585n ± 6%    8.228n ± 2%   +8.48% (p=0.002 n=6)
Cache2Hit/___10000-8             9.191n ± 0%   10.375n ± 1%  +12.88% (p=0.002 n=6)
Cache2Hit/__100000-8             10.03n ± 4%    12.14n ± 1%  +21.15% (p=0.002 n=6)
Cache2Hit/_1000000-8             40.79n ± 1%    42.27n ± 4%   +3.64% (p=0.002 n=6)
Cache2Hit/10000000-8             47.29n ± 1%    56.49n ± 9%  +19.47% (p=0.002 n=6)
geomean                          12.31n         13.47n        +9.48%

I currently don't have access to an x86 computer. Thank you for your mail. 
 

On Sunday, June 22, 2025 at 12:12:25 AM UTC+2, Keith Randall wrote:

Certainly interesting. We'd be interested in incorporating some ideas from 
your experiments if they translate over.

One thing to caution, you showed no numbers for space utilization, map miss 
speed, whether your map can do iteration, etc. It is possible to trade off 
one of these for other ones, so it is not immediately obvious that your 
techniques would apply to the builtin map. For instance, if you're using 
10x the space to get 10% speedup, we probably wouldn't want to do that for 
the general map. (Not saying that's what you did, just pointing out raw 
speed is not the only consideration.)

On Thursday, June 19, 2025 at 4:11:28 AM UTC-7 christoph wrote:

Hello, 

trying to implement a fast cache, I inadvertently entered a rabbit hole 
that led me to implement my own map. In the process I tried to make it 
faster than the go map, just to see if it is possible. I worked on it for 
weeks, trying out various architectures and methods. 

On my MacBook Air M2, I get the following benchmarks for a Get operation. 
The numbers are the number of items inserted in the table. My keys are 
8-byte-long strings. 

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                     │ dirstr12/stats_arm64.txt │  puremapstr/stats_arm64.txt   │
                     │          sec/op          │    sec/op      vs base        │
Cache2Hit/_______1-8               6.151n ±  6%    7.087n ±  1%   +15.22% (p=0.002 n=6)
Cache2Hit/______10-8               8.491n ±  0%    8.156n ± 29%         ~ (p=0.394 n=6)
Cache2Hit/_____100-8               8.141n ±  7%   14.185n ± 13%   +74.24% (p=0.002 n=6)
Cache2Hit/____1000-8               8.252n ±  3%   10.635n ± 39%   +28.89% (p=0.002 n=6)
Cache2Hit/___10000-8               10.45n ±  2%    20.99n ±  4%  +100.81% (p=0.002 n=6)
Cache2Hit/__100000-8               12.16n ±  1%    19.11n ± 10%   +57.05% (p=0.002 n=6)
Cache2Hit/_1000000-8               42.28n ±  2%    47.90n ±  2%   +13.29% (p=0.002 n=6)
Cache2Hit/10000000-8               56.38n ± 12%    61.82n ±  6%         ~ (p=0.065 n=6)
geomean                            13.44n          17.86n         +32.91%

On my amd64 i5 11th gen I get the following benchmarks.

goos: linux
goarch: amd64
pkg: fastCache/map
cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
                      │ dirstr12/stats_amd64.txt │  puremapstr/stats_amd64.txt  │
                      │          sec/op          │    sec/op      vs base       │
Cache2Hit/_______1-12               9.207n ±  1%    7.506n ±  3%  -18.48% (p=0.002 n=6)
Cache2Hit/______10-12               9.223n ±  0%    8.806n ±  6%        ~ (p=0.058 n=6)
Cache2Hit/_____100-12               9.279n ±  2%   10.175n ±  3%   +9.66% (p=0.002 n=6)
Cache2Hit/____1000-12               10.45n ±  2%    11.29n ±  3%   +8.04% (p=0.002 n=6)
Cache2Hit/___10000-12               16.00n ±  2%    17.21n ±  5%   +7.59% (p=0.002 n=6)
Cache2Hit/__100000-12               22.20n ± 17%    24.73n ± 22%  +11.42% (p=0.026 n=6)
Cache2Hit/_1000000-12               87.75n ±  2%    91.05n ±  5%   +3.76% (p=0.009 n=6)
Cache2Hit/10000000-12               104.2n ±  2%    105.6n ±  5%        ~ (p=0.558 n=6)
geomean                             20.11n          20.49n         +1.90%

On amd64 the performance is on par with go map, but go map uses inlined 
simd instructions which I don't use because I don't have access to them in 
pure go. I use xxh3 right out of the box for the hash function. The 
differences are not due to the hash function.

If there is interest in this, please let me know.


