> On Fri, 13 Dec 2024 at 17:13, Tamar Christina <tamar.christ...@arm.com> wrote:
> >
> > Hi All,
> >
> > We are currently generating a loop which has more comparisons than you'd
> > typically need, as the probabilities on the small-size loop assume that
> > the likely case is that an element is not found.
> >
> > This again generates a pattern that's harder for branch predictors to
> > follow, but also just generates more instructions for what one could say
> > is the typical case: that your hashtable contains the entry you are
> > looking for.
> >
> > This patch adds a __builtin_expect to indicate that it's likely that you'd
> > find the element that's being searched for.
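> >
> > As a reminder, __builtin_expect (expr, c) tells GCC that expr is expected
> > to evaluate to c, which weights the branch probabilities and block layout
> > accordingly.  A generic illustration with made-up names, not the patch
> > itself:
> >
> > int lookup (int found)
> > {
> >   /* Mark the branch as usually taken so the "found" case becomes the
> >      fall-through (hot) path.  */
> >   if (__builtin_expect (found, 1))
> >     return 1;  /* hot: element located */
> >   return 0;    /* cold: element missing */
> > }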
> >
> > The second change is in _M_find_before_node where at the moment the loop
> > is optimized for the case where we don't do any iterations.
> >
> > A simple testcase is:
> >
> > #include <stdbool.h>
> >
> > bool foo (int **a, int n, int val, int *tkn)
> > {
> >     for (int i = 0; i < n; i++)
> >     {
> >         if (!a[i] || a[i]==tkn)
> >           return false;
> >
> >         if (*a[i] == val)
> >           return true;
> >     }
> >     return false;
> > }
> >
> > which generates:
> >
> > foo:
> >         cmp     w1, 0
> >         ble     .L1
> >         add     x1, x0, w1, uxtw 3
> >         b       .L4
> > .L9:
> >         ldr     w4, [x4]
> >         cmp     w4, w2
> >         beq     .L6
> >         cmp     x0, x1
> >         beq     .L1
> > .L4:
> >         ldr     x4, [x0]
> >         add     x0, x0, 8
> >         cmp     x4, 0
> >         ccmp    x4, x3, 4, ne
> >         bne     .L9
> >         mov     w0, 0
> > .L1:
> >         ret
> > .L6:
> >         mov     w0, 1
> >         ret
> >
> > i.e. BB rotation makes it generate an unconditional branch to a
> > conditional branch.  However, this method is only called when the size is
> > above a certain threshold, and so it's likely that we have to do that
> > first iteration.
> >
> > Adding:
> >
> > #include <stdbool.h>
> >
> > bool foo (int **a, int n, int val, int *tkn)
> > {
> >     for (int i = 0; i < n; i++)
> >     {
> >         if (__builtin_expect(!a[i] || a[i]==tkn, 0))
> >           return false;
> >
> >         if (*a[i] == val)
> >           return true;
> >     }
> >     return false;
> > }
> >
> > to indicate that we will likely do at least one iteration generates:
> >
> > foo:
> >         cmp     w1, 0
> >         ble     .L1
> >         add     x1, x0, w1, uxtw 3
> > .L4:
> >         ldr     x4, [x0]
> >         add     x0, x0, 8
> >         cmp     x4, 0
> >         ccmp    x4, x3, 4, ne
> >         beq     .L5
> >         ldr     w4, [x4]
> >         cmp     w4, w2
> >         beq     .L6
> >         cmp     x0, x1
> >         bne     .L4
> > .L1:
> >         ret
> > .L5:
> >         mov     w0, 0
> >         ret
> > .L6:
> >         mov     w0, 1
> >         ret
> >
> > which results in an extra ~0-20% improvement on top of the previous patch.
> >
> > In table form:
> >
> > +-----------+--------------------+-------+----------+----------------+
> > | Group     | Benchmark          | Size  | % Inline | % Unlikely all |
> > +-----------+--------------------+-------+----------+----------------+
> > | Find      | unord<uint64_t     | 11264 | 53.52%   | 64.81%         |
> > | Find      | unord<uint64_t     | 11264 | 47.98%   | 62.21%         |
> > | Find Many | unord<uint64_t     | 11    | 47.62%   | 45.32%         |
> > | Find Many | unord<uint64_t     | 11    | 40.90%   | 41.99%         |
> > | Find Many | unord<std::string  | 11    | 44.94%   | 41.25%         |
> > | Find Many | unord<std::string  | 11    | 44.89%   | 41.19%         |
> > | Find      | unord<NonSSOString | 11    | 31.17%   | 35.17%         |
> > | Find      | unord<NonSSOString | 11    | 31.80%   | 33.77%         |
> > | Find      | unord<uint64_t     | 352   | 28.27%   | 30.79%         |
> > | Find Many | unord<uint64_t     | 352   | 30.57%   | 30.57%         |
> > | Find Many | unord<uint64_t     | 11264 | 20.71%   | 30.34%         |
> > | Find Many | unord<uint64_t     | 11264 | 21.33%   | 29.87%         |
> > | Find Many | unord<NonSSOString | 352   | 25.93%   | 29.09%         |
> > | Find Many | unord<NonSSOString | 352   | 26.59%   | 28.07%         |
> > | Find Many | unord<uint64_t     | 352   | 26.80%   | 26.67%         |
> > | Find      | unord<std::string  | 11    | 25.66%   | 24.44%         |
> > | Find Many | unord<std::string  | 352   | 23.12%   | 24.05%         |
> > | Find Many | unord<NonSSOString | 11    | 23.21%   | 23.08%         |
> > | Find Many | unord<std::string  | 352   | 19.23%   | 22.08%         |
> > | Find Many | unord<NonSSOString | 11    | 22.04%   | 22.04%         |
> > | Find      | unord<uint64_t     | 352   | 15.43%   | 19.37%         |
> > | Find      | unord<std::string  | 11    | 20.36%   | 17.48%         |
> > | Find      | unord<std::string  | 352   | 18.59%   | 14.94%         |
> > | Find      | unord<NonSSOString | 352   | 13.79%   | 13.76%         |
> > | Find      | unord<NonSSOString | 11264 | 7.04%    | 13.38%         |
> > | Find      | unord<NonSSOString | 11264 | 9.19%    | 13.24%         |
> > | Find      | unord<std::string  | 11264 | 11.80%   | 12.73%         |
> > | Find Many | unord<std::string  | 11264 | 8.78%    | 12.44%         |
> > | Find Many | unord<std::string  | 11264 | 5.30%    | 10.86%         |
> > | Find Many | unord<NonSSOString | 11264 | 6.15%    | 8.85%          |
> > | Find Many | unord<NonSSOString | 11264 | 5.90%    | 8.49%          |
> > | Find      | unord<NonSSOString | 352   | 9.69%    | 8.30%          |
> > | Find      | unord<std::string  | 11264 | 9.97%    | 6.30%          |
> > | Find      | vec<uint64_t       | 352   | 6.20%    | 6.13%          |
> > +-----------+--------------------+-------+----------+----------------+
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >
> > Ok for master? And possible backports?
> >
> > Thanks,
> > Tamar
> >
> > libstdc++-v3/ChangeLog:
> >
> >         * include/bits/hashtable.h (_M_locate): Make it likely that we
> >         find an element.
> >         (_M_find_before_node): Make it likely that the map has at least one
> >         entry and so we do at least one iteration.
> >
> > ---
> > diff --git a/libstdc++-v3/include/bits/hashtable.h b/libstdc++-v3/include/bits/hashtable.h
> > index e791e52ec329277474f3218d8a44cd37ded14ac3..8101d868d0c5f7ac4f97931affcf71d826c88094 100644
> > --- a/libstdc++-v3/include/bits/hashtable.h
> > +++ b/libstdc++-v3/include/bits/hashtable.h
> > @@ -2171,7 +2171,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >           if (this->_M_equals(__k, __code, *__p))
> >             return __prev_p;
> >
> > -         if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > +         if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
> >             break;
> >           __prev_p = __p;
> >         }
> > @@ -2201,7 +2201,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >             if (this->_M_equals_tr(__k, __code, *__p))
> >               return __prev_p;
> >
> > -           if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
> > +           if (__builtin_expect (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt, 0))
> >               break;
> >             __prev_p = __p;
> >           }
> > @@ -2228,7 +2228,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >                pointer_to(const_cast<__node_base&>(_M_before_begin));
> >           while (__loc._M_before->_M_nxt)
> >             {
> > -             if (this->_M_key_equals(__k, *__loc._M_node()))
> > +             if (__builtin_expect (this->_M_key_equals(__k, *__loc._M_node()), 1))
> >                 return __loc;
> >               __loc._M_before = __loc._M_before->_M_nxt;
> >             }
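> >
> > For illustration, the annotated chain walk in _M_find_before_node is
> > roughly equivalent to this standalone sketch (simplified stand-ins with
> > plain unsigned keys and key % nbkts as a hypothetical bucket function,
> > not the real hashtable internals):
> >
> > struct node { node* nxt; unsigned key; };
> >
> > node* find_before (node* before, unsigned key, unsigned bkt,
> >                    unsigned nbkts)
> > {
> >   // Caller guarantees the bucket is non-empty, so before->nxt is valid.
> >   for (node* p = before->nxt; ; p = p->nxt)
> >     {
> >       if (p->key == key)
> >         return before;
> >       // Leaving the chain is now marked unlikely: we expect to walk at
> >       // least one more node before the bucket runs out.
> >       if (__builtin_expect (!p->nxt || p->nxt->key % nbkts != bkt, 0))
> >         break;
> >       before = p;
> >     }
> >   return nullptr;
> > }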
> 
> The first two changes make sense to me: we're typically going to loop
> more than once when searching for an element, so the break is
> unlikely. The third change is a bit more surprising, because although
> it's likely we'll find an element *eventually* it is only likely once
> out of N iterations, not likely on every iteration. But if the numbers
> show it helps, then that just shows my intuition is probably wrong.
> 
> If this is optimized for 'find' is it possible that this change
> pessimizes insertion (which also has to use _M_locate to find where to
> insert)?

You're right, there seems to be a uniform slowdown for insertions into
small-sized hashtables of ~10 entries.  They're about 10-14% slower with
that change.
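
Which makes sense in hindsight: when inserting a new key, _M_key_equals is
false for every node in the chain, so the likely annotation predicts the
wrong way on every iteration.  As a minimal model (hypothetical simplified
code, not the real _M_locate):

struct node { node* nxt; int key; };

// Simplified stand-in for the _M_locate chain walk: inserting a new key
// scans the whole chain and only inserts when this returns null.
node* locate (node* head, int key)
{
  for (node* p = head; p; p = p->nxt)
    // For a brand-new key this comparison is false on every node, so
    // annotating it as likely biases the layout against the insert path.
    if (__builtin_expect (p->key == key, 1))
      return p;
  return nullptr;
}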

As expected, the changes to _M_find_before_node alone do not cause any issues.

Since the insertions into small hashtables don't run long enough, I've also
increased the number of iterations to ~1 million to even out the scores a
bit more, but 5 sizable regressions remain.

I've kicked off a longer run with the _M_locate change removed and with
some more variable-size finds/inserts.

So far it looks like the additional benefits from the _M_locate change were
mainly on two tests.  I'll look at the perf traces to figure out exactly
why, but I'll respin the patch without the _M_locate change and send it
tomorrow morning once the benchmarks finish.

Thanks!

Tamar
