Hi,

Sorry for the slow reply.
I ran the numbers flipping the probabilities. Most of it was in the noise and
not statistically significant. However, there were some outliers:

+-----------+-----------+---------------+-------+------------+
| Benchmark | Container | Type          | Size  | Difference |
+-----------+-----------+---------------+-------+------------+
| find      | unordered | string        | 13    | 42.00%     |
| find      | unordered | string        | 13    | 41.30%     |
| find      | unordered | uint64_t      | 11253 | 21.40%     |
| find      | unordered | uint64_t      | 11253 | 20.90%     |
| find      | unordered | shared string | 11253 | 18.60%     |
| find      | unordered | uint64_t      | 11253 | 15.60%     |
| find      | unordered | uint64_t      | 13    | 13.50%     |
| find many | unordered | string        | 345   | 11.20%     |
| find many | unordered | string        | 345   | 8.60%      |
| find      | vector    | uint64_t      | 11253 | -23.00%    |
| find      | custom    | uint64_t      | 11253 | -18.40%    |
+-----------+-----------+---------------+-------+------------+

It looks like unordered maps regress when the number of entries is small or
when the element is placed in what I assume to be a crowded bucket. It does
seem to be beneficial for some user-defined datatypes, I assume due to some IPA
shenanigans. But overall there were more and larger wins using a probability of
0 rather than 1.

Kind regards,
Tamar

From: Tamar Christina <tamar.christ...@arm.com>
Sent: Thursday, January 2, 2025 11:02 PM
To: François Dumont <frs.dum...@gmail.com>; Jonathan Wakely <jwak...@redhat.com>
Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; libstd...@gcc.gnu.org
Subject: RE: [PATCH 2/2][libstdc++]: Adjust probabilities of hashmap loop conditions

Hi,

> It means that we consider that the hasher is not perfect, we have several
> entries in the same bucket. Shouldn't we reward those that are spending time
> on their hasher to make it as perfect as possible?

I don't think it makes much of a difference for a perfect hashtable, as you're
exiting on the first iteration anyway. So taking the branch on iteration 0
shouldn't be an issue.

> Said differently, does using 1 in the __builtin_expect change the produced
> figures a lot?

I expect it to be slower since the entire loop is no longer in a single fetch
block. But I'll run the numbers with this change.

Thanks,
Tamar

From: François Dumont <frs.dum...@gmail.com>
Sent: Monday, December 30, 2024 5:08 PM
To: Jonathan Wakely <jwak...@redhat.com>
Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; nd <n...@arm.com>; libstd...@gcc.gnu.org
Subject: Re: [PATCH 2/2][libstdc++]: Adjust probabilities of hashmap loop conditions

Sorry to react so late on this patch.

I'm only surprised by the expected result of the added __builtin_expect, which
is 0. It means that we consider that the hasher is not perfect, we have several
entries in the same bucket. Shouldn't we reward those that are spending time on
their hasher to make it as perfect as possible?

Said differently, does using 1 in the __builtin_expect change the produced
figures a lot?

François
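To make the 0-vs-1 question concrete, here is a minimal sketch of the bucket
scan being discussed. It is not the libstdc++ code: the Node type, the
find_before name, and the modulo bucket index are simplified placeholders.

    #include <cstddef>

    // Simplified stand-in for an _M_find_before_node-style bucket scan.
    // Assumes the bucket is known to be non-empty, as in the real function.
    struct Node { Node* next; unsigned long key; };

    Node* find_before(Node* before, unsigned long key,
                      std::size_t bkt, std::size_t bucket_count)
    {
      Node* prev = before;
      for (Node* p = prev->next;; p = p->next)
        {
          if (p->key == key)
            return prev;   // found: return the node *before* the match
          // Expecting 0 says "we usually stay in the loop", i.e. the hasher
          // is not perfect and buckets hold several entries; expecting 1
          // says "we usually leave after the first probe".
          if (__builtin_expect(!p->next || p->next->key % bucket_count != bkt, 0))
            break;
          prev = p;
        }
      return nullptr;      // no match in this bucket
    }

Roughly, with 0 GCC lays the loop out so that the fall-through path keeps
iterating and the break is moved out of line, which is what the "single fetch
block" remark above refers to; with 1 the early exit becomes the hot path.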
On Wed, Dec 18, 2024 at 5:01 PM Jonathan Wakely <jwak...@redhat.com> wrote:

On Wed, 18 Dec 2024 at 14:14, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > > > e791e52ec329277474f3218d8a44cd37ded14ac3..8101d868d0c5f7ac4f97931affcf71d826c88094 100644
> > > > --- a/libstdc++-v3/include/bits/hashtable.h
> > > > +++ b/libstdc++-v3/include/bits/hashtable.h
> > > > @@ -2171,7 +2171,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >            if (this->_M_equals(__k, __code, *__p))
> > > >              return __prev_p;
> > > >
> > > > -          if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > > > +          if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
> > > >              break;
> > > >            __prev_p = __p;
> > > >          }
> > > > @@ -2201,7 +2201,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >            if (this->_M_equals_tr(__k, __code, *__p))
> > > >              return __prev_p;
> > > >
> > > > -          if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > > > +          if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
> > > >              break;
> > > >            __prev_p = __p;
> > > >          }
> > > > @@ -2228,7 +2228,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >              pointer_to(const_cast<__node_base&>(_M_before_begin));
> > > >          while (__loc._M_before->_M_nxt)
> > > >            {
> > > > -            if (this->_M_key_equals(__k, *__loc._M_node()))
> > > > +            if (__builtin_expect (this->_M_key_equals(__k, *__loc._M_node()), 1))
> > > >                return __loc;
> > > >              __loc._M_before = __loc._M_before->_M_nxt;
> > > >            }
> > >
> > > The first two changes make sense to me: we're typically going to loop
> > > more than once when searching for an element, so the break is
> > > unlikely. The third change is a bit more surprising, because although
> > > it's likely we'll find an element *eventually* it is only likely once
> > > out of N iterations, not likely on every iteration. But if the numbers
> > > show it helps, then that just shows my intuition is probably wrong.
> > >
> > > If this is optimized for 'find', is it possible that this change
> > > pessimizes insertion (which also has to use _M_locate to find where to
> > > insert)?
> >
> > You're right, there seems to be a uniform slowdown for insertions in
> > small-sized hashtables of ~10 entries. They're about 10-14% slower with
> > that change.
> >
> > As expected, just the changes on _M_find_before_node do not cause any issues.
> >
> > Since the insertions into small hashtables don't run long enough, I've
> > also increased the number of iterations to ~1 million to even out the
> > score a bit more, but there are still 5 sizable regressions.
> >
> > I've kicked off a longer run removing the change on _M_locate and with
> > some more variable-size finds/inserts.
> >
> > So far it looks like the additional benefits gotten on _M_locate were
> > mainly on two tests. I'll look at the perf traces to figure out exactly
> > why, but I'll respin the patch without the _M_locate change and send it
> > tomorrow morning once benchmarks finish.
>
> Hi,
>
> Here's a new patch and new numbers showing the improvements over the
> previous one and the total improvement with the two together. This change
> seems to be mostly beneficial on larger-sized hashmaps, with otherwise no
> real losses on Insert or Find (around the region of ~1%, which looks like
> noise). Overall this results in ~0-10% extra on top of the previous patch.
>
> In table form:
>
> +-------------+---------------+-------+--------------------+-------------------+-----------------+
> | Benchmark   | Type          | Size  | Inline vs baseline | Final vs baseline | Final vs inline |
> +-------------+---------------+-------+--------------------+-------------------+-----------------+
> | find many   | uint64_t      | 11253 | -15.67%            | -22.96%           | -8.65%          |
> | find many   | uint64_t      | 11253 | -16.74%            | -23.37%           | -7.96%          |
> | find single | uint64_t      | 345   | -5.88%             | -11.54%           | -6.02%          |
> | find many   | string        | 11253 | -4.50%             | -9.56%            | -5.29%          |
> | find single | uint64_t      | 345   | -4.38%             | -9.41%            | -5.26%          |
> | find single | shared string | 11253 | -6.67%             | -11.00%           | -4.64%          |
> | find single | shared string | 11253 | -4.63%             | -9.03%            | -4.61%          |
> | find single | shared string | 345   | -10.41%            | -14.44%           | -4.50%          |
> | find many   | string        | 11253 | -3.41%             | -7.51%            | -4.24%          |
> | find many   | shared string | 11253 | -2.30%             | -5.72%            | -3.50%          |
> | find many   | string        | 13    | 2.86%              | -0.30%            | -3.07%          |
> | find single | string        | 11253 | 4.47%              | 1.34%             | -3.00%          |
> | find many   | custom string | 11253 | 0.25%              | -2.75%            | -2.99%          |
> | find single | uint64_t      | 345   | 2.99%              | 0.01%             | -2.90%          |
> | find single | shared string | 345   | -11.53%            | -13.67%           | -2.41%          |
> | find single | uint64_t      | 11253 | 0.49%              | -1.59%            | -2.07%          |
> +-------------+---------------+-------+--------------------+-------------------+-----------------+
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?

OK, thanks!

> Thanks,
> Tamar
>
> libstdc++-v3/ChangeLog:
>
>         * include/bits/hashtable.h
>         (_M_find_before_node): Make it likely that the map has at least one
>         entry and so we do at least one iteration.
>
> -- inline copy of patch --
>
> diff --git a/libstdc++-v3/include/bits/hashtable.h b/libstdc++-v3/include/bits/hashtable.h
> index e791e52ec329277474f3218d8a44cd37ded14ac3..51437619b44ce8b93be6a1d223828fda5a4a4c45 100644
> --- a/libstdc++-v3/include/bits/hashtable.h
> +++ b/libstdc++-v3/include/bits/hashtable.h
> @@ -2171,7 +2171,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>            if (this->_M_equals(__k, __code, *__p))
>              return __prev_p;
>
> -          if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> +          if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
>              break;
>            __prev_p = __p;
>          }
> @@ -2201,7 +2201,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>            if (this->_M_equals_tr(__k, __code, *__p))
>              return __prev_p;
>
> -          if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> +          if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
>              break;
>            __prev_p = __p;
>          }
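For context, the following is a hypothetical sketch of the kind of "find"
micro-benchmark such a table typically comes from. The 11253-entry size and
the ~1 million iterations echo the figures in the thread, but the harness
itself, including the multiplicative key spreading, is an assumption and not
the benchmark actually used.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    int main()
    {
      constexpr std::size_t size = 11253;           // one of the sizes in the table above
      constexpr std::uint64_t spread = 2654435761;  // arbitrary multiplier to spread the keys

      std::unordered_map<std::uint64_t, std::uint64_t> m;
      for (std::uint64_t i = 0; i < size; ++i)
        m.emplace(i * spread, i);

      constexpr int reps = 1'000'000;               // ~1M lookups, as mentioned in the thread
      std::uint64_t sink = 0;
      auto start = std::chrono::steady_clock::now();
      for (int rep = 0; rep < reps; ++rep)
        sink += m.find(std::uint64_t(rep % size) * spread)->second;  // every lookup hits
      auto stop = std::chrono::steady_clock::now();

      auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
      std::printf("%d finds in %lld ms (sink=%llu)\n",
                  reps, (long long) ms, (unsigned long long) sink);
    }

Running such a loop against a patched and an unpatched hashtable.h is the sort
of A/B measurement the percentages above summarise.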