I just benchmarked fastmap with integer keys. It is exactly the same map as the one I published, but with integer keys and an xxh3 hash function for integer values.
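To make the idea of a dedicated integer hasher concrete, here is a minimal stand-in sketch (a generic splitmix64-style finalizer; the function name is hypothetical and this is not the xxh3 specialization actually used by fastmap). The benchmark results follow.

// hashUint64 mixes an 8-byte integer key into a well-distributed 64-bit hash
// using the splitmix64 finalizer constants. Illustration only.
func hashUint64(key, seed uint64) uint64 {
	x := key + seed
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}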
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                      │ cacheint6/stats_arm64.txt │ puremapint/stats_arm64.txt │
                      │          sec/op           │    sec/op       vs base    │
Cache2Hit/_______1-8            3.496n ± 0%         2.065n ± 0%    -40.93% (p=0.002 n=6)
Cache2Hit/______10-8            4.327n ± 2%         5.052n ± 0%    +16.74% (p=0.002 n=6)
Cache2Hit/_____100-8            4.364n ± 4%         5.731n ± 12%   +31.32% (p=0.002 n=6)
Cache2Hit/____1000-8            4.248n ± 5%         5.935n ± 3%    +39.73% (p=0.002 n=6)
Cache2Hit/___10000-8            5.167n ± 1%         7.553n ± 1%    +46.18% (p=0.002 n=6)
Cache2Hit/__100000-8            6.763n ± 1%         9.300n ± 3%    +37.52% (p=0.002 n=6)
Cache2Hit/_1000000-8            18.25n ± 1%         26.32n ± 1%    +44.26% (p=0.002 n=6)
Cache2Hit/10000000-8            28.11n ± 0%         37.95n ± 0%    +35.05% (p=0.002 n=6)
geomean                         6.881n              8.405n         +22.15%

On Thursday, June 26, 2025 at 09:37:45 UTC+2, christoph...@gmail.com wrote:

Hello,

the code is published here: https://github.com/chmike/fastmap.

On Wednesday, June 25, 2025 at 12:16:54 UTC+2, christoph...@gmail.com wrote:

By swap pop I mean that a deleted item is replaced by the last item of the group or the chain (see the sketch below). This does indeed move items.

I have implemented a new version using 8-bit top hashes, tombstones, a perfect match, quadratic probing and an extensible hash table. It leaves items in place and might thus be compatible with the iteration requirements of go maps.

I plan to publish it on GitHub, but I still need to verify item deletion, on which I haven't spent much time so far. The following benchmark uses tables of 256 groups. A table is split when it reaches 90%. The filling rate is then far better than with my 3-table architecture, so quadratic probing would be the better choice there.

goos: darwin
goarch: arm64
pkg: fastmap/map
cpu: Apple M2
                      │ altmap/stats_arm64.txt │ stdmap/stats_arm64.txt │
                      │         sec/op         │    sec/op     vs base  │
Cache2Hit/_______1-8            6.386n ± 0%      7.081n ± 0%    +10.88% (p=0.000 n=10)
Cache2Hit/______10-8            8.813n ± 0%      8.114n ± 30%         ~ (p=0.481 n=10)
Cache2Hit/_____100-8            8.473n ± 7%     15.720n ± 8%    +85.52% (p=0.000 n=10)
Cache2Hit/____1000-8            8.735n ± 2%     13.715n ± 23%   +57.01% (p=0.000 n=10)
Cache2Hit/___10000-8            11.04n ± 1%      20.38n ± 6%    +84.56% (p=0.000 n=10)
Cache2Hit/__100000-8            13.19n ± 1%      17.62n ± 17%   +33.59% (p=0.000 n=10)
Cache2Hit/_1000000-8            52.02n ± 1%      55.45n ± 0%     +6.59% (p=0.001 n=10)
Cache2Hit/10000000-8            64.91n ± 1%      68.12n ± 3%     +4.96% (p=0.002 n=10)
Cache2Miss/_______1-8           5.003n ± 0%      8.243n ± 0%    +64.73% (p=0.000 n=10)
Cache2Miss/______10-8           5.878n ± 3%      9.238n ± 11%   +57.18% (p=0.000 n=10)
Cache2Miss/_____100-8           6.037n ± 4%     12.600n ± 9%   +108.70% (p=0.000 n=10)
Cache2Miss/____1000-8           6.992n ± 7%     12.015n ± 17%   +71.83% (p=0.000 n=10)
Cache2Miss/___10000-8           10.49n ± 2%      17.66n ± 9%    +68.40% (p=0.000 n=10)
Cache2Miss/__100000-8           20.73n ± 1%      26.56n ± 2%    +28.09% (p=0.000 n=10)
Cache2Miss/_1000000-8           33.52n ± 1%      36.83n ± 1%     +9.89% (p=0.000 n=10)
Cache2Miss/10000000-8           48.62n ± 1%      50.84n ± 1%     +4.56% (p=0.000 n=10)
geomean                         13.25n           18.38n          +38.75%

To be honest, I can't fully explain these numbers. Reducing false positives by using an exact match and 8-bit top hashes might be part of it. It could also be the hash function: xxh3 on arm64 is pure Go only. It is fast for short strings but bad for long strings, as it doesn't use SIMD (yet). The 8-byte key strings are fmt.Sprintf("%7d ", i), and fmt.Sprintf("%7d-", i) for misses. I hope these numbers don't result from a bug in my code.
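For reference, a minimal sketch of the swap-pop deletion mentioned at the top of this message; the group layout, field names and integer keys are assumptions for illustration, not the published fastmap code.

type packedGroup struct {
	keys [8]uint64
	vals [8]uint64
	used int // occupied slots are packed at the front
}

// deleteAt removes slot i by overwriting it with the group's last occupied
// item, so no tombstone is left behind, at the cost of moving that item.
func (g *packedGroup) deleteAt(i int) {
	last := g.used - 1
	g.keys[i] = g.keys[last]
	g.vals[i] = g.vals[last]
	g.used = last
}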
For the top hash I use the H1, H2 terminology of swiss tables. H2 is the 8 least significant bits of the hash; 0x00 is used for free slots and 0x80 for the tombstone. Computing H2 thus requires a conditional branch, but the CPU may predict its outcome correctly most of the time. I benchmarked 7-bit top hashes, but they ended up being slower; I assumed this was due to the increased number of false positives. Using 0x80 for the tombstone saves one operation when looking for free slots in a group. If the speed-up is due to the 8-bit top hashes, I suspect that the C++ swiss table might also benefit from it.

On amd64, the performance is on par with go map, but I don't use SIMD instructions.

On Monday, June 23, 2025 at 22:41:46 UTC+2, Keith Randall wrote:

> As key comparison is very fast (no secondary memory access) a simple sequential key comparison is fast.

We've looked at that, see CL 626699 <https://go-review.googlesource.com/c/go/+/626699>. Maybe helps in some cases, seems to hurt for miss lookups. We decided not to move forward with it, but it is definitely something to think about.

> I use 8 bit top hashes instead of 7

How do you record a presence bit?

> We can use a swap pop strategy.

Not sure exactly what that is, but it sounds incompatible with iteration. Our required iteration semantics mean it is very hard to move keys.

> There is a small space usage penalty but when a table is split at 80% for instance, it becomes 40%. It is then very close to need a grow which I considered a waste of effort.

I think when we split a table we use the max-sized table for each part? I'm not sure what effect you're seeing here then.

> I also saw that function calls are expensive. I then used manual inlining for the Get method. It is nice to avoid for code readability and maintenance though.

We could theoretically have the compiler do this for builtin maps.

On Sunday, June 22, 2025 at 1:28:26 AM UTC-7, christoph...@gmail.com wrote:

Thank you very much for your response. I'll summarize here the main differences with go map.

I very easily got faster than go map with integers and the like by dropping the swiss groups and reducing the group size to fit a cache line. As key comparison is very fast (no secondary memory access), a simple sequential key comparison is fast. The speed gain might be due to the hash function, as I rewrote the xxh3 hasher for uint64 values. I didn't spend much time exploring this further as it wasn't my use case.

String keys definitely benefit from top hashes (swiss table groups). I use 8-bit top hashes instead of 7, which reduces the number of false positives, and I use a perfect match, which also reduces the number of false positives on arm64. amd64 gets a perfect match since it uses the SIMD instructions.

Another difference is that go map uses quadratic probing and my map doesn't. The popular wisdom is that probing is faster because the probability that the data is in cache is slightly higher and the table filling is better. Unfortunately this increases the number of required tests, and false positives increase the penalty. The numbers I showed are for a different kind of table. It is composed of three arrays, T1, T2 and T3. T1 is a conventional table of 8-item swiss groups.
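For concreteness, one plausible shape for such an 8-item group, using the 8-bit control-byte encoding described earlier in this thread (0x00 free, 0x80 tombstone, anything else the 8 low bits of the hash). The field names and the exact remapping of the reserved values are illustrative assumptions, not the published fastmap code.

type group8 struct {
	ctrl [8]byte   // 8-bit top hashes; 0x00 marks a free slot, 0x80 a tombstone
	keys [8]string // 8-byte string keys in these benchmarks
	vals [8]uint64
}

// topHash derives the 8-bit control byte from the hash, branching only when
// the low byte collides with one of the two reserved values.
func topHash(h uint64) byte {
	h2 := byte(h)
	if h2 == 0x00 || h2 == 0x80 {
		h2 |= 1 // nudge away from the reserved free/tombstone markers
	}
	return h2
}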
When a group is full, the item is stored in the overflow hash table T2. When the group in T2 is full, the item is appended to T3, which is treated as a sequence of groups; I do a linear search in this table containing the overflows of overflows. I also implemented a table using quadratic probing, and it was faster than go map on arm64. I assume this is due to the use of 8-bit hashes reducing the false positives, but I'm not sure.

I tested independent hashes for T1 and T2, which then require the use of tombstones, and related hashes. By related hash I mean that the index in T2 is, for instance, the Fibonacci hash of the index in T1 (sketched further below). The best filling rates of T2 were obtained with a random permutation mapping. The size ratio between T1 and T2 should in theory be 2:1, but the best performance was with 16:1. I assume this is because T2's data is hotter (in cache) than with a bigger table. The benefit of such a deterministic relation between the indices of T1 and T2 is that we don't need tombstones: we can use a swap pop strategy. In summary, the idea was to minimize the number of probes required to find a key.

The table also uses an extensible hash, for which my code is simpler than go map's, though it won't be faster. But this directory indirection introduces a significant performance penalty. In a later benchmark, after the one I published here, I saw a significant performance increase by removing the directory. When the hash function is good, most tables have the same depth.

Another optimization compared to go map is that my tables in the directory don't grow. Using fixed-size arrays gives some performance advantage, as the boundaries and the hash mask are constants. There is a small space-usage penalty, but when a table is split at 80% for instance, it drops to 40% and is then very close to needing a grow, which I considered a waste of effort. For a small number of items fitting in one table, it may be worth using a growable table, which I didn't bother to implement.

I also saw that function calls are expensive, so I used manual inlining for the Get method. It would be nice to avoid that for code readability and maintenance, though.

I tried assembly on arm64, where I could experiment with SIMD instructions, but due to the assembly function call overhead it was slower than pure Go. Passing arguments by register wouldn't help; we would need assembly inlining, but I understand that this would open a can of worms and probably create more problems than it would solve.

If you are still interested and want some specific numbers, I would be glad to give them.
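A minimal sketch of that "related hash" between T1 and T2: the T2 index derived deterministically from the T1 index with a Fibonacci multiplicative hash, so no second independent hash (and no tombstones) are needed. The constant, the function name and the parameters are assumptions for illustration, not the fastmap code.

func t2Index(t1Index uint64, t2Bits uint) uint64 {
	const fib = 0x9E3779B97F4A7C15 // roughly 2^64 divided by the golden ratio
	// Multiply and keep the top t2Bits bits as the index into T2.
	return (t1Index * fib) >> (64 - t2Bits)
}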
Here is the cache-miss benchmark that shows the benefit of reducing the number of probes:

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                       │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
                       │          sec/op          │    sec/op       vs base    │
Cache2Miss/_______1-8            4.721n ± 0%         8.298n ± 0%    +75.75% (p=0.002 n=6)
Cache2Miss/______10-8            5.452n ± 0%         9.094n ± 25%   +66.81% (p=0.002 n=6)
Cache2Miss/_____100-8            5.726n ± 6%        11.550n ± 5%   +101.71% (p=0.002 n=6)
Cache2Miss/____1000-8            6.572n ± 4%        12.550n ± 21%   +90.96% (p=0.002 n=6)
Cache2Miss/___10000-8            9.428n ± 4%        18.400n ± 7%    +95.17% (p=0.002 n=6)
Cache2Miss/__100000-8            14.79n ± 1%         26.60n ± 2%    +79.79% (p=0.002 n=6)
Cache2Miss/_1000000-8            29.11n ± 1%         36.83n ± 2%    +26.54% (p=0.002 n=6)
Cache2Miss/10000000-8            40.44n ± 5%         51.12n ± 1%    +26.41% (p=0.002 n=6)
geomean                          10.60n              17.80n         +67.98%

Here is the cache-hit benchmark comparing dirstr12 with flat8, which has no directory:

goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                       │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
                       │        sec/op         │    sec/op       vs base  │
Cache2Hit/_______1-8           6.047n ± 0%        6.246n ± 5%     +3.28% (p=0.002 n=6)
Cache2Hit/______10-8           8.172n ± 0%        8.499n ± 0%     +4.00% (p=0.002 n=6)
Cache2Hit/_____100-8           7.893n ± 0%        8.256n ± 5%     +4.61% (p=0.002 n=6)
Cache2Hit/____1000-8           7.585n ± 6%        8.228n ± 2%     +8.48% (p=0.002 n=6)
Cache2Hit/___10000-8           9.191n ± 0%       10.375n ± 1%    +12.88% (p=0.002 n=6)
Cache2Hit/__100000-8           10.03n ± 4%        12.14n ± 1%    +21.15% (p=0.002 n=6)
Cache2Hit/_1000000-8           40.79n ± 1%        42.27n ± 4%     +3.64% (p=0.002 n=6)
Cache2Hit/10000000-8           47.29n ± 1%        56.49n ± 9%    +19.47% (p=0.002 n=6)
geomean                        12.31n             13.47n          +9.48%

I currently don't have access to an x86 computer. Thank you for your mail.

On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:

Certainly interesting. We'd be interested in incorporating some ideas from your experiments if they translate over.

One thing to caution: you showed no numbers for space utilization, map miss speed, whether your map can do iteration, etc. It is possible to trade off one of these for the others, so it is not immediately obvious that your techniques would apply to the builtin map. For instance, if you're using 10x the space to get a 10% speedup, we probably wouldn't want to do that for the general map. (Not saying that's what you did, just pointing out raw speed is not the only consideration.)

On Thursday, June 19, 2025 at 4:11:28 AM UTC-7, christoph...@gmail.com wrote:

Hello,

trying to implement a fast cache, I inadvertently entered a rabbit hole that led me to implement my own map. In the process I tried to make it faster than the go map, just to see if it is possible. I worked weeks on it, trying out various architectures and methods.

On my MacBook Air M2, I get the following benchmarks for a Get operation. The numbers are the number of items inserted in the table. My keys are 8-byte-long strings.
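For readers puzzled by benchmark names like Cache2Hit/_______1-8: a rough, assumed reconstruction of such a harness (not the author's code; testing and fmt imports omitted) runs one sub-benchmark per table size with space-padded names, which the testing package rewrites to underscores. The keys follow the 8-byte shape given earlier in the thread.

var sinkGlobal uint64

func BenchmarkCache2Hit(b *testing.B) {
	for _, n := range []int{1, 10, 100, 1000, 10000, 100000, 1000000, 10000000} {
		m := make(map[string]uint64, n)
		keys := make([]string, n)
		for i := range keys {
			keys[i] = fmt.Sprintf("%7d ", i) // fixed-width 8-byte key
			m[keys[i]] = uint64(i)
		}
		b.Run(fmt.Sprintf("%8d", n), func(b *testing.B) {
			var sink uint64
			for i := 0; i < b.N; i++ {
				sink += m[keys[i%n]]
			}
			sinkGlobal = sink // keep the lookups from being optimized away
		})
	}
}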
goos: darwin
goarch: arm64
pkg: fastCache/map
cpu: Apple M2
                       │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
                       │          sec/op          │    sec/op       vs base    │
Cache2Hit/_______1-8             6.151n ± 6%         7.087n ± 1%    +15.22% (p=0.002 n=6)
Cache2Hit/______10-8             8.491n ± 0%         8.156n ± 29%         ~ (p=0.394 n=6)
Cache2Hit/_____100-8             8.141n ± 7%        14.185n ± 13%   +74.24% (p=0.002 n=6)
Cache2Hit/____1000-8             8.252n ± 3%        10.635n ± 39%   +28.89% (p=0.002 n=6)
Cache2Hit/___10000-8             10.45n ± 2%         20.99n ± 4%   +100.81% (p=0.002 n=6)
Cache2Hit/__100000-8             12.16n ± 1%         19.11n ± 10%   +57.05% (p=0.002 n=6)
Cache2Hit/_1000000-8             42.28n ± 2%         47.90n ± 2%    +13.29% (p=0.002 n=6)
Cache2Hit/10000000-8             56.38n ± 12%        61.82n ± 6%          ~ (p=0.065 n=6)
geomean                          13.44n              17.86n         +32.91%

On my amd64 i5 11th gen I get the following benchmarks:

goos: linux
goarch: amd64
pkg: fastCache/map
cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
                        │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt │
                        │          sec/op          │    sec/op       vs base    │
Cache2Hit/_______1-12             9.207n ± 1%         7.506n ± 3%    -18.48% (p=0.002 n=6)
Cache2Hit/______10-12             9.223n ± 0%         8.806n ± 6%          ~ (p=0.058 n=6)
Cache2Hit/_____100-12             9.279n ± 2%        10.175n ± 3%     +9.66% (p=0.002 n=6)
Cache2Hit/____1000-12             10.45n ± 2%         11.29n ± 3%     +8.04% (p=0.002 n=6)
Cache2Hit/___10000-12             16.00n ± 2%         17.21n ± 5%     +7.59% (p=0.002 n=6)
Cache2Hit/__100000-12             22.20n ± 17%        24.73n ± 22%   +11.42% (p=0.026 n=6)
Cache2Hit/_1000000-12             87.75n ± 2%         91.05n ± 5%     +3.76% (p=0.009 n=6)
Cache2Hit/10000000-12             104.2n ± 2%         105.6n ± 5%          ~ (p=0.558 n=6)
geomean                           20.11n              20.49n          +1.90%

On amd64 the performance is on par with go map, but go map uses inlined SIMD instructions, which I don't use because I don't have access to them in pure Go. I use xxh3 right out of the box for the hash function; the differences are not due to the hash function.

If there is interest in this, please let me know.
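As a closing aside on the "perfect match" without SIMD that comes up several times in this thread: pure Go can still compare all 8 control bytes of a group against a value in a handful of word-at-a-time operations, with no false positives. A generic sketch, not taken from the fastmap sources:

// matchByte returns a mask with the high bit set in exactly the byte
// positions where the corresponding byte of ctrl equals c.
func matchByte(ctrl uint64, c byte) uint64 {
	const ones = 0x0101010101010101
	const low7 = 0x7f7f7f7f7f7f7f7f
	x := ctrl ^ (ones * uint64(c))           // bytes equal to c become 0x00
	return ^(((x & low7) + low7) | x | low7) // high bit set only where a byte of x is 0x00
}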