On Mon, 2015-05-18 at 23:27 -0400, Nuno Diegues wrote:
> On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel <trie...@redhat.com> wrote:
> >
> > Are there better options for the utility function, or can we tune it to
> > be less affected by varying txn length and likelihood of txnal vs.
> > nontxnal code?  What are the things we want to avoid with the tuning?  I
> > can think of:
> > * Not needing to wait for serial txns, or being aborted by a serial txn.
> > * Not retrying using HTM too much so that the retry time is larger than
> > the scalability we actually gain by having other txns commit
> > concurrently.
>
> Yes, those are the key points we want to make sure do not happen.
>
> >
> > Anything else?  Did you consider other utility functions during your
> > research?
>
> The txnal vs. nontxnal issue is indeed a completely different story.  To
> account for it we would need extra book-keeping to count only the cycles
> spent inside txnal code.  So this would require every thread (or a sample
> of threads) to perform a rdtsc (or equivalent) on every begin/end call
> rather than the current approach of a single rdtsc per optimization round.
>
> With this type of online optimization we found that the algorithm had to
> be very simple and cheap to execute.  RDTSC was a good fit for this, and
> it enabled us to obtain gains.  Other time sources failed to do so.
>
> I do not have, out of the box, a good alternative to offer.  I suppose it
> would take some iterations of thinking and experimenting, just like with
> any research problem :)
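
Just to make sure we mean the same thing by "a single rdtsc per
optimization round", here is roughly how I read that scheme (a simplified
sketch with assumed names and a plain double instead of the patch's
fixed-point type; this is not the actual patch code):

  #include <stdint.h>

  extern uint64_t rdtsc ();   /* Arch-specific cycle counter.  */

  struct optimizer_state
  {
    uint64_t last_cycles;
    uint64_t last_total_txs_executed;
    double last_throughput;   /* Fixed-point in the real patch.  */
  };
  static optimizer_state optimizer;

  /* Called once per optimization round, so a single rdtsc per round.  */
  static void
  reoptimize_round (uint64_t total_txs_executed)
  {
    uint64_t now = rdtsc ();
    uint64_t cycles = now - optimizer.last_cycles;
    uint64_t txs = total_txs_executed - optimizer.last_total_txs_executed;
    /* Throughput over the whole epoch, which also includes cycles spent
       outside of transactions (hence the txnal vs. nontxnal problem).  */
    optimizer.last_throughput = cycles ? (double) txs / cycles : 0.0;
    optimizer.last_cycles = now;
    optimizer.last_total_txs_executed = total_txs_executed;
  }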
So let's iterate on this in parallel with the other changes that we need to
get in place.  I'd prefer to have some more confidence that measuring txn
throughput in epochs is the best way forward.  Here are a few thoughts:

Why do we actually need to measure succeeded transactions?  If a HW txn is
just getting committed without interfering with anything else, is this
really different from, say, two HW txns getting committed?  Aren't we
really just interested in wasted work, or delayed txns?  That may help take
care of the nontxnal vs. txnal problem.

Measuring committed txns during a time that might otherwise be spent by a
serial txn could be useful to figure out how much other useful work a
serial txn prevents.  But we'd see that as well if we'd just go serial
during the auto-tuning, because then concurrent txns will have to wait; and
in this case we could measure it in the slow path of those concurrent txns
(ie, during waiting, where measurement overhead wouldn't matter as much).
If a serial txn is running, concurrent txns (that wait) are free to sync
and tell the serial txn how much cost it creates for the concurrent txns.
There, txn length could matter, but we won't find out for real until after
the concurrent ones have run (they could be pretty short, so we can't
simply assume that the concurrent ones are as long as the serial one, so
that simply the number of concurrent ones can be used to calculate delayed
work).

> > Also, note that the mitigation strategy for rdtsc
> > short-comings that you mention in the paper is not applicable in
> > general, specifically not in libitm.
>
> I suppose you mean the preemption of threads inflating the cycles
> measured?

Yes, preemption and migration of threads (in case there's no globally
sync'ed TSC or similar) -- you mentioned in the paper that you pin threads
to cores...

> This would similarly be a problem for any time source that tries to
> measure the amount of work performed; not sure how we can avoid it in
> general.  Any thoughts?

Not really, as long as we keep depending on measuring time in a
light-weight way.  Measuring smaller time intervals could make it less
likely that preemption happens during such an interval, though.

> > Another issue is that we need to implement the tuning in a portable
> > way.  You currently make it depend on whether the assembler knows about
> > RTM, whereas the rest of the code makes this more portable and defers
> > to arch-specific files.  I'd prefer if we could use the tuning on other
> > archs too.  But for that, we need to cleanly separate generic from
> > arch-specific parts.  That affects magic numbers as well as things like
> > rdtsc().
>
> Yes, I refrained from adding new calls to the arch-specific files, mainly
> to contain the changes.  But that is possible, and that's part of the
> feedback I was hoping to get.

OK.  Let me know if you want further input regarding this.
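
As a rough illustration of the separation I have in mind (just a sketch,
with an assumed hook name, not code from the patch): the generic tuning
code would call an arch-provided cycle-counter hook instead of using rdtsc
directly, and config/x86/target.h would supply the x86 implementation:

  /* In config/x86/target.h: x86-specific cycle counter for the tuning.  */
  static inline uint64_t
  htm_cycles ()
  {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
  }

Other archs would provide their own htm_cycles() (or opt out of the
tuning), which keeps the timing details and the related magic numbers out
of the generic code in beginend.cc.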
> > I'm wondering about whether it really makes sense to treat XABORT like
> > conflicts and other abort reasons, instead of like capacity aborts.
> > Perhaps we need to differentiate between the explicit abort codes glibc
> > uses, so that we can distinguish between cases where an abort is
> > supposed to signal incompatibility with txnal execution and cases where
> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
> > in current glibc):
> > #define _ABORT_LOCK_BUSY 0xff
> > #define _ABORT_LOCK_IS_LOCKED 0xfe
> > #define _ABORT_NESTED_TRYLOCK 0xfd
>
> I am not sure I follow: are you suggesting we consider other aborts
> besides those of capacity?

We may want to do this with UCB, similar to capacity.  Another option would
be to, for now, make fixed choices regarding whether they are considered
permanent or not.  _ABORT_NESTED_TRYLOCK is an incompatibility with txnal
execution (similar to illegal instructions and such).  _ABORT_LOCK_BUSY is
failing lock elision due to the lock being already acquired; thus, this is
transient I'd say.  Andi, what was _ABORT_LOCK_IS_LOCKED used for again?

> > Finally, please pay attention to using the same leading tabs/spaces
> > style as current libitm code; it could be just your MUA, but it seems
> > that your patch uses just leading spaces.  libitm uses 8-char tabs for
> > leading space.
>
> Now that I know the style, I will enforce it ;)
> Previously I had trusted my editor to follow the current style, but maybe
> it was not consistent everywhere, or it just got confused.

Thanks :)

> > > +  optimizer.optimized_mode = STUBBORN;
> > > +  optimizer.optimized_attempts = htm_fastpath;
> > > +  optimizer.last_cycles = rdtsc();
> > > +  optimizer.last_total_txs_executed = 0;
> > > +  optimizer.last_throughput = fixed_const(0.0001);
> > > +  optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
> > > +  optimizer.best_ever_throughput = optimizer.last_throughput;
> > > +  optimizer.best_ever_attempts = htm_fastpath;
> > > +  optimizer.txns_while_stubborn = 1;
> > > +  optimizer.cycles_while_stubborn = 100;
> >
> > If the assumption is that transactions can't be faster than 100 cycles,
> > please document it briefly.
>
> No, it does not imply that.  This is an "accumulator" of cycles spent in
> the mode; it only needs to be positive, as far as I remember.

OK.  Please find out what the requirement is and document it (magic
numbers...).

> > > --- libitm/beginend.cc  (revision 219316)
> > > +++ libitm/beginend.cc  (working copy)
> > > @@ -25,7 +25,6 @@
> > >  #include "libitm_i.h"
> > >  #include <pthread.h>
> > >
> > > -
> >
> > Minor: just drop this chunk.
>
> What is this referring to?

This part of the patch where all you do is delete a line.

> > > @@ -190,7 +411,7 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
> > >  #ifndef HTM_CUSTOM_FASTPATH
> > >    if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
> > >      {
> > > -      for (uint32_t t = htm_fastpath; t; t--)
> > > +      for (uint32_t t = optimizer.optimized_attempts; t; t--)
> >
> > Why this change?  Couldn't you keep htm_fastpath as the globally set
> > number of attempts?
>
> The idea was to keep the tuned variables within the optimizer struct.
> Hence, optimized_attempts replaced the role of htm_fastpath whenever the
> tuning is enabled.  We can keep it in sync with the optimizer, though, by
> additionally assigning it the final value of optimized_attempts during
> re-optimization.

I'd prefer the latter.
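
Roughly like this (just a sketch of the option I mean; reoptimize() is a
placeholder name, not the patch's actual function):

  /* At the end of each re-optimization round, publish the tuned attempt
     count in the existing global, so the HTM fast path only ever reads
     htm_fastpath.  */
  static void
  reoptimize ()
  {
    /* ... decide optimizer.optimized_attempts for the next epoch ... */
    htm_fastpath = optimizer.optimized_attempts;
  }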
If we end up having other configurations that use HTM but don't use the
optimizer, it would be awkward to have the same logical setting in two
locations depending on the configuration (ie, htm_fastpath or
optimized_attempts).  Also, even though we don't do this yet, it may be
easier to co-locate the separate htm_fastpath value with something else
(e.g., the serial lock), so that we can save a cacheline of capacity in
case of nested txns.  htm_fastpath is one of the few things that a HW txn
must access.

> > > @@ -665,12 +917,21 @@ _ITM_commitTransaction(void)
> > >    if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
> > >      {
> > >        htm_commit();
> > > +      gtm_thread *tx = gtm_thr();
> > > +      if (unlikely(tx == NULL))
> > > +        {
> > > +          tx = new gtm_thread();
> > > +          set_gtm_thr(tx);
> > > +        }
> >
> > Are there actually situations in which we need to create a gtm_thread?
>
> I did not check at run-time; statically, I followed the checking approach
> that is quite omnipresent in the code.

Yeah, it's necessary.  Can you add a comment along the lines of: when using
the HTM fastpath, we might not have created a thread object yet.

> > > +      tx->number_executed_txns++;
> >
> > It might be nice if we can avoid this, so that we don't touch the
> > additional cacheline when we have nested txns.
>
> This is really important: we need some way to measure success/progress to
> guide the tuning.

I'm aware of that -- but is there a way to count differently (see the
utility function point above), or to just count failures so that the actual
counting happens on a slow path (eg, after abort)?

> > > --- libitm/config/x86/target.h (revision 219316)
> > > +++ libitm/config/x86/target.h (working copy)
> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
> > >    return begin_ret & _XABORT_RETRY;
> > >  }
> > >
> > > +static inline bool
> > > +htm_is_capacity_abort(uint32_t begin_ret)
> > > +{
> > > +  return begin_ret & _XABORT_CAPACITY;
> > > +}
> >
> > Is this indeed just about capacity, or do we actually mean a more
> > general situation?
>
> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was to
> detect "real" capacity aborts (i.e., permanent) vs. those that are
> "transient".  So in this case we really wanted to know whether the abort
> was a capacity one, so that we could direct the UCB algorithm and
> understand the "type" of capacity abort.

OK.  But do you want to single out capacity aborts as one case for which
you want to detect permanent vs. transient, or do you want to generally
find out whether aborts that are reported as permanent (e.g., capacity) are
indeed permanent?  In the latter case, a more generic name would be more
appropriate.  (And that's not just a naming thing, but guidance for other
archs regarding what they should put in the "capacity" bucket of abort
reasons...)
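
To make the generic direction concrete (a sketch only; the name and comment
are mine, not from the patch): the arch-specific header would answer the
generic question, and x86 would put capacity aborts into that bucket:

  /* Return true if this abort is reported as likely permanent; the UCB
     logic can then decide whether it really is.  On x86 this covers the
     capacity-abort case.  */
  static inline bool
  htm_abort_is_likely_permanent (uint32_t begin_ret)
  {
    return begin_ret & _XABORT_CAPACITY;
  }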