Hello, the code is published here https://github.com/chmike/fastmap.
On Wednesday, June 25, 2025 at 12:16:54 UTC+2, christoph...@gmail.com wrote:

> By swap pop I mean that a deleted item is replaced by the last item of the
> group or the chain. This will indeed move items.
>
> I have implemented a new version using 8-bit top hashes, tombstones, a
> perfect match, quadratic probing and an extensible hash table. It leaves
> items in place and might thus be compatible with the iteration
> requirements of go maps.
>
> I plan to publish it on GitHub but still need to verify deleting items, on
> which I didn't spend much time so far. The following benchmark is with
> tables of 256 groups. A table is split when reaching 90%. The filling rate
> is then far better than with my 3-table architecture. Using quadratic
> probing would then be a better choice.
>
> goos: darwin
> goarch: arm64
> pkg: fastmap/map
> cpu: Apple M2
>                         │ altmap/stats_arm64.txt │   stdmap/stats_arm64.txt    │
>                         │         sec/op         │   sec/op     vs base        │
> Cache2Hit/_______1-8         6.386n ± 0%    7.081n ±  0%   +10.88% (p=0.000 n=10)
> Cache2Hit/______10-8         8.813n ± 0%    8.114n ± 30%         ~ (p=0.481 n=10)
> Cache2Hit/_____100-8         8.473n ± 7%   15.720n ±  8%   +85.52% (p=0.000 n=10)
> Cache2Hit/____1000-8         8.735n ± 2%   13.715n ± 23%   +57.01% (p=0.000 n=10)
> Cache2Hit/___10000-8         11.04n ± 1%    20.38n ±  6%   +84.56% (p=0.000 n=10)
> Cache2Hit/__100000-8         13.19n ± 1%    17.62n ± 17%   +33.59% (p=0.000 n=10)
> Cache2Hit/_1000000-8         52.02n ± 1%    55.45n ±  0%    +6.59% (p=0.001 n=10)
> Cache2Hit/10000000-8         64.91n ± 1%    68.12n ±  3%    +4.96% (p=0.002 n=10)
> Cache2Miss/_______1-8        5.003n ± 0%    8.243n ±  0%   +64.73% (p=0.000 n=10)
> Cache2Miss/______10-8        5.878n ± 3%    9.238n ± 11%   +57.18% (p=0.000 n=10)
> Cache2Miss/_____100-8        6.037n ± 4%   12.600n ±  9%  +108.70% (p=0.000 n=10)
> Cache2Miss/____1000-8        6.992n ± 7%   12.015n ± 17%   +71.83% (p=0.000 n=10)
> Cache2Miss/___10000-8        10.49n ± 2%    17.66n ±  9%   +68.40% (p=0.000 n=10)
> Cache2Miss/__100000-8        20.73n ± 1%    26.56n ±  2%   +28.09% (p=0.000 n=10)
> Cache2Miss/_1000000-8        33.52n ± 1%    36.83n ±  1%    +9.89% (p=0.000 n=10)
> Cache2Miss/10000000-8        48.62n ± 1%    50.84n ±  1%    +4.56% (p=0.000 n=10)
> geomean                      13.25n         18.38n          +38.75%
>
> To be honest I can't fully explain these numbers. Reducing false positives
> by using an exact match and 8-bit top hashes might be part of it. It could
> also be the hash function. xxh3 on arm64 is pure Go only. It is fast for
> short strings, but bad for long strings as it doesn't use SIMD (yet). The
> 8-byte key strings are fmt.Sprintf("%7d ", i), and fmt.Sprintf("%7d-", i)
> for misses. I hope that these numbers don't result from a bug in my code.
>
> For the top hash I use the H1, H2 terminology of swiss tables. H2 is the 8
> least significant bits of the hash. 0x00 is used for free slots, and 0x80
> for the tombstone. Computing H2 thus implies a conditional branch, but the
> CPU may predict its outcome correctly most of the time. I benchmarked
> 7-bit top hashes, but they ended up being slower. I assume that was due to
> the increased number of false positives. Using 0x80 for the tombstone
> saves one operation when searching for free slots in a group. If the
> speedup is due to the 8-bit top hashes, I suspect that the C++ swiss table
> might also benefit from it.
>
> On amd64, the performance is on par with go map, but I don't use SIMD
> instructions.
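>
> For illustration, a minimal sketch of this H2 computation (the helper name
> and the exact remapping of the two reserved values are illustrative; the
> real code may differ):
>
>     // topHash derives the 8-bit top hash (H2) from the full hash.
>     // 0x00 is reserved for free slots and 0x80 for tombstones, so a
>     // stored H2 must avoid both; the remapping is the conditional
>     // branch mentioned above, which the CPU predicts most of the time.
>     func topHash(hash uint64) uint8 {
>         h2 := uint8(hash) // the 8 least significant bits
>         if h2 == 0x00 || h2 == 0x80 {
>             h2 |= 1 // remap the two reserved control values
>         }
>         return h2
>     }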
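>
> The quadratic probing over groups can be sketched the same way, assuming a
> power-of-two group count so that the triangular steps visit every group
> exactly once (names are illustrative):
>
>     // nextGroup advances a triangular-number probe sequence over the
>     // groups: g, g+1, g+3, g+6, ... all masked to the group count.
>     // Start with g = H1 & mask and step = 0.
>     func nextGroup(g, step, mask uint64) (next, nextStep uint64) {
>         step++
>         return (g + step) & mask, step
>     }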
> On Monday, June 23, 2025 at 22:41:46 UTC+2, Keith Randall wrote:
>
>> > As key comparison is very fast (no secondary memory access), a simple
>> > sequential key comparison is fast.
>>
>> We've looked at that, see CL 626699
>> <https://go-review.googlesource.com/c/go/+/626699>. Maybe it helps in
>> some cases, but it seems to hurt miss lookups. We decided not to move
>> forward with it, but it is definitely something to think about.
>>
>> > I use 8-bit top hashes instead of 7
>>
>> How do you record a presence bit?
>>
>> > We can use a swap pop strategy.
>>
>> Not sure exactly what that is, but it sounds incompatible with
>> iteration. Our required iteration semantics mean it is very hard to
>> move keys.
>>
>> > There is a small space usage penalty, but when a table is split at 80%
>> > for instance, it becomes 40%. It is then very close to needing a grow,
>> > which I considered a waste of effort.
>>
>> I think when we split a table we use the max-sized table for each part?
>> I'm not sure what effect you're seeing here then.
>>
>> > I also saw that function calls are expensive. I then used manual
>> > inlining for the Get method. It would be nice to avoid this for code
>> > readability and maintenance, though.
>>
>> We could theoretically have the compiler do this for builtin maps.
>>
>> On Sunday, June 22, 2025 at 1:28:26 AM UTC-7, christoph...@gmail.com wrote:
>>
>>> Thank you very much for your response. I'll summarize here the main
>>> differences with go map.
>>>
>>> I very easily got faster than go map with integers and the like by
>>> dropping the swiss groups and reducing the group size to fit a cache
>>> line. As key comparison is very fast (no secondary memory access), a
>>> simple sequential key comparison is fast. The speed gain might be due
>>> to the hash function, as I rewrote the xxh3 hasher for uint64 values. I
>>> didn't spend much time exploring this further as it wasn't my use case.
>>>
>>> String keys definitely benefit from top hashes (swiss table groups). I
>>> use 8-bit top hashes instead of 7, which reduces the number of false
>>> positives, and I use a perfect match, which also reduces the number of
>>> false positives on arm64. amd64 uses a perfect match as it uses the
>>> SIMD instructions.
>>>
>>> Another difference is that go map uses quadratic probing and my map
>>> doesn't. The popular wisdom is that probing is faster because the
>>> probability that the data is in cache is slightly higher and the table
>>> filling is better. Unfortunately, this increases the number of required
>>> tests, where false positives increase the penalty. The numbers I showed
>>> are for a different kind of table. It is composed of three arrays: T1,
>>> T2 and T3. T1 is a conventional table of 8-item swiss groups. When a
>>> group is full, the item is stored in the overflow hash table T2. When
>>> the group in T2 is full, the item is appended to T3, which is treated
>>> as a sequence of groups. I do a linear search in this table containing
>>> the overflows of overflows. I implemented a table using quadratic
>>> probing and it was faster than go map on arm64. I assume this is due to
>>> the 8-bit top hashes reducing false positives, but I'm not sure.
>>>
>>> I tested independent hashes for T1 and T2, which then require the use
>>> of tombstones, and related hashes. By related hash I mean that the
>>> index in T2 is, for instance, the Fibonacci hash of the index in T1.
>>> The best filling rates of T2 were obtained with a random permutation
>>> mapping. The size ratio between T1 and T2 should in theory be 2:1, but
>>> the best performance was with 16:1. I assume that this is because T2
>>> data is hotter (in cache) than with a bigger table. The benefit of such
>>> a deterministic relation between the indices of T1 and T2 is that we
>>> don't need tombstones. We can use a swap pop strategy.
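>>>
>>> To make the "related hash" idea concrete, a minimal sketch (the
>>> constant and the names are illustrative, not my actual code):
>>>
>>>     // t2Index derives the T2 index deterministically from the T1
>>>     // index with Fibonacci hashing: multiply by 2^64/phi and keep
>>>     // the top bits. Because the mapping is deterministic, T2 needs
>>>     // no tombstones.
>>>     const fib64 = 0x9E3779B97F4A7C15 // 2^64 / golden ratio, rounded
>>>
>>>     func t2Index(t1Index uint64, t2Bits uint) uint64 {
>>>         return (t1Index * fib64) >> (64 - t2Bits)
>>>     }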
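>>>
>>> And the swap pop itself is just this (illustrative types):
>>>
>>>     type entry struct {
>>>         key string
>>>         val uint64
>>>     }
>>>
>>>     // swapPop deletes entries[i] by overwriting it with the last
>>>     // entry and shrinking the slice by one. No hole is left behind,
>>>     // but an item moves, which is what conflicts with go map's
>>>     // iteration semantics.
>>>     func swapPop(entries []entry, i int) []entry {
>>>         last := len(entries) - 1
>>>         entries[i] = entries[last]
>>>         return entries[:last]
>>>     }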
>>> In summary, the idea was to minimize the number of required probes to
>>> find a key.
>>>
>>> The table also uses an extensible hash, for which I have simpler code
>>> than go map, though it won't go faster. This directory indirection
>>> introduces a significant performance penalty. In a later benchmark,
>>> after the one I published here, I saw a significant performance
>>> increase by removing the directory. When the hash function is good,
>>> most tables have the same depth. (A small sketch of this directory
>>> lookup follows the benchmarks below.)
>>>
>>> Another optimization compared to go map is that my tables in the
>>> directory don't grow. Using fixed-size arrays gives some performance
>>> advantage, as boundaries and the hash mask are constants. There is a
>>> small space usage penalty, but when a table is split at 80% for
>>> instance, it becomes 40%. It is then very close to needing a grow,
>>> which I considered a waste of effort. For a small number of items
>>> fitting in one table, it may be worth using a growable table, which I
>>> didn't bother to implement.
>>>
>>> I also saw that function calls are expensive. I then used manual
>>> inlining for the Get method. It would be nice to avoid this for code
>>> readability and maintenance, though.
>>>
>>> I tried assembly on arm64, where I could experiment with SIMD
>>> instructions, but due to the assembly function call overhead it was
>>> slower than pure Go. Passing arguments by register wouldn't help. We
>>> would need assembly inlining, but I understand that this would open a
>>> can of worms and probably create more problems than it would solve.
>>>
>>> If you are still interested and want some specific numbers, I would be
>>> glad to give them.
>>>
>>> Here is the cache miss benchmark that shows the benefit of reducing the
>>> number of probes:
>>>
>>> goos: darwin
>>> goarch: arm64
>>> pkg: fastCache/map
>>> cpu: Apple M2
>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>>                         │          sec/op          │   sec/op     vs base       │
>>> Cache2Miss/_______1-8        4.721n ± 0%    8.298n ±  0%   +75.75% (p=0.002 n=6)
>>> Cache2Miss/______10-8        5.452n ± 0%    9.094n ± 25%   +66.81% (p=0.002 n=6)
>>> Cache2Miss/_____100-8        5.726n ± 6%   11.550n ±  5%  +101.71% (p=0.002 n=6)
>>> Cache2Miss/____1000-8        6.572n ± 4%   12.550n ± 21%   +90.96% (p=0.002 n=6)
>>> Cache2Miss/___10000-8        9.428n ± 4%   18.400n ±  7%   +95.17% (p=0.002 n=6)
>>> Cache2Miss/__100000-8        14.79n ± 1%    26.60n ±  2%   +79.79% (p=0.002 n=6)
>>> Cache2Miss/_1000000-8        29.11n ± 1%    36.83n ±  2%   +26.54% (p=0.002 n=6)
>>> Cache2Miss/10000000-8        40.44n ± 5%    51.12n ±  1%   +26.41% (p=0.002 n=6)
>>> geomean                      10.60n         17.80n         +67.98%
>>>
>>> Here is the cache hit benchmark comparing dirstr12 with flat8, which
>>> has no directory:
>>>
>>> goos: darwin
>>> goarch: arm64
>>> pkg: fastCache/map
>>> cpu: Apple M2
>>>                         │ flat8/stats_arm64.txt │ dirstr12/stats_arm64.txt │
>>>                         │        sec/op         │   sec/op    vs base      │
>>> Cache2Hit/_______1-8         6.047n ± 0%    6.246n ± 5%    +3.28% (p=0.002 n=6)
>>> Cache2Hit/______10-8         8.172n ± 0%    8.499n ± 0%    +4.00% (p=0.002 n=6)
>>> Cache2Hit/_____100-8         7.893n ± 0%    8.256n ± 5%    +4.61% (p=0.002 n=6)
>>> Cache2Hit/____1000-8         7.585n ± 6%    8.228n ± 2%    +8.48% (p=0.002 n=6)
>>> Cache2Hit/___10000-8         9.191n ± 0%   10.375n ± 1%   +12.88% (p=0.002 n=6)
>>> Cache2Hit/__100000-8         10.03n ± 4%    12.14n ± 1%   +21.15% (p=0.002 n=6)
>>> Cache2Hit/_1000000-8         40.79n ± 1%    42.27n ± 4%    +3.64% (p=0.002 n=6)
>>> Cache2Hit/10000000-8         47.29n ± 1%    56.49n ± 9%   +19.47% (p=0.002 n=6)
>>> geomean                      12.31n         13.47n         +9.48%
>>>
>>> I currently don't have access to an x86 computer. Thank you for your
>>> mail.
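>>>
>>> As promised above, here is a sketch of the directory lookup of the
>>> extensible hash table (all types and field names are illustrative, not
>>> my actual code):
>>>
>>>     type group struct{ /* 8 top hashes plus 8 key/value slots */ }
>>>
>>>     type table struct {
>>>         depth  uint    // local depth of this table
>>>         groups []group // a fixed-size array in practice
>>>     }
>>>
>>>     type Map struct {
>>>         globalDepth uint     // log2 of the directory size
>>>         dir         []*table // several entries may share one table
>>>     }
>>>
>>>     // tableFor selects the table with the top globalDepth bits of
>>>     // the hash. This extra pointer chase is the directory
>>>     // indirection whose removal gave the speedup mentioned above.
>>>     func (m *Map) tableFor(hash uint64) *table {
>>>         return m.dir[hash>>(64-m.globalDepth)]
>>>     }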
>>> On Sunday, June 22, 2025 at 00:12:25 UTC+2, Keith Randall wrote:
>>>
>>>> Certainly interesting. We'd be interested in incorporating some ideas
>>>> from your experiments if they translate over.
>>>>
>>>> One thing to caution: you showed no numbers for space utilization, map
>>>> miss speed, whether your map can do iteration, etc. It is possible to
>>>> trade off one of these for the others, so it is not immediately
>>>> obvious that your techniques would apply to the builtin map. For
>>>> instance, if you're using 10x the space to get a 10% speedup, we
>>>> probably wouldn't want to do that for the general map. (Not saying
>>>> that's what you did, just pointing out that raw speed is not the only
>>>> consideration.)
>>>>
>>>> On Thursday, June 19, 2025 at 4:11:28 AM UTC-7, christoph...@gmail.com wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> trying to implement a fast cache, I inadvertently entered a rabbit
>>>>> hole that led me to implement my own map. In the process I tried to
>>>>> make it faster than the go map, just to see if it is possible. I
>>>>> worked on it for weeks, trying out various architectures and methods.
>>>>>
>>>>> On my MacBook Air M2, I get the following benchmarks for a Get
>>>>> operation. The numbers in the benchmark names are the number of items
>>>>> inserted in the table. My keys are 8-byte strings.
>>>>>
>>>>> goos: darwin
>>>>> goarch: arm64
>>>>> pkg: fastCache/map
>>>>> cpu: Apple M2
>>>>>                         │ dirstr12/stats_arm64.txt │ puremapstr/stats_arm64.txt │
>>>>>                         │          sec/op          │   sec/op     vs base       │
>>>>> Cache2Hit/_______1-8        6.151n ±  6%    7.087n ±  1%   +15.22% (p=0.002 n=6)
>>>>> Cache2Hit/______10-8        8.491n ±  0%    8.156n ± 29%         ~ (p=0.394 n=6)
>>>>> Cache2Hit/_____100-8        8.141n ±  7%   14.185n ± 13%   +74.24% (p=0.002 n=6)
>>>>> Cache2Hit/____1000-8        8.252n ±  3%   10.635n ± 39%   +28.89% (p=0.002 n=6)
>>>>> Cache2Hit/___10000-8        10.45n ±  2%    20.99n ±  4%  +100.81% (p=0.002 n=6)
>>>>> Cache2Hit/__100000-8        12.16n ±  1%    19.11n ± 10%   +57.05% (p=0.002 n=6)
>>>>> Cache2Hit/_1000000-8        42.28n ±  2%    47.90n ±  2%   +13.29% (p=0.002 n=6)
>>>>> Cache2Hit/10000000-8        56.38n ± 12%    61.82n ±  6%         ~ (p=0.065 n=6)
>>>>> geomean                     13.44n          17.86n         +32.91%
>>>>>
>>>>> On my amd64 i5 11th gen I get the following benchmarks:
>>>>>
>>>>> goos: linux
>>>>> goarch: amd64
>>>>> pkg: fastCache/map
>>>>> cpu: 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
>>>>>                          │ dirstr12/stats_amd64.txt │ puremapstr/stats_amd64.txt │
>>>>>                          │          sec/op          │   sec/op     vs base       │
>>>>> Cache2Hit/_______1-12        9.207n ±  1%    7.506n ±  3%   -18.48% (p=0.002 n=6)
>>>>> Cache2Hit/______10-12        9.223n ±  0%    8.806n ±  6%         ~ (p=0.058 n=6)
>>>>> Cache2Hit/_____100-12        9.279n ±  2%   10.175n ±  3%    +9.66% (p=0.002 n=6)
>>>>> Cache2Hit/____1000-12        10.45n ±  2%    11.29n ±  3%    +8.04% (p=0.002 n=6)
>>>>> Cache2Hit/___10000-12        16.00n ±  2%    17.21n ±  5%    +7.59% (p=0.002 n=6)
>>>>> Cache2Hit/__100000-12        22.20n ± 17%    24.73n ± 22%   +11.42% (p=0.026 n=6)
>>>>> Cache2Hit/_1000000-12        87.75n ±  2%    91.05n ±  5%    +3.76% (p=0.009 n=6)
>>>>> Cache2Hit/10000000-12        104.2n ±  2%    105.6n ±  5%         ~ (p=0.558 n=6)
>>>>> geomean                      20.11n          20.49n          +1.90%
>>>>>
>>>>> On amd64 the performance is on par with go map, but go map uses
>>>>> inlined SIMD instructions, which I don't use because I don't have
>>>>> access to them in pure Go. I use xxh3 right out of the box as the
>>>>> hash function. The differences are not due to the hash function.
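>>>>>
>>>>> For reference, a group match can still be done in pure Go without
>>>>> SIMD using the classic SWAR zero-byte trick; this sketch is an
>>>>> assumption about how such a match looks, not necessarily what go map
>>>>> or my code does:
>>>>>
>>>>>     // matchH2 compares an 8-bit top hash against the 8 control
>>>>>     // bytes of a group packed into a uint64. The result has the
>>>>>     // high bit set in every byte lane that is a candidate match.
>>>>>     // Rare false positives are possible; the key comparison that
>>>>>     // follows filters them out anyway.
>>>>>     func matchH2(ctrl uint64, h2 uint8) uint64 {
>>>>>         const lsb = 0x0101010101010101
>>>>>         const msb = 0x8080808080808080
>>>>>         x := ctrl ^ (lsb * uint64(h2)) // zero byte where H2 matches
>>>>>         return (x - lsb) &^ x & msb
>>>>>     }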
>>>>> If there is interest in this, please let me know.