Thank you for the feedback. Comments inline.
On Wed, Apr 8, 2015 at 3:05 PM, Andi Kleen <a...@firstfloor.org> wrote:
>
> Nuno Diegues <n...@ist.utl.pt> writes:
>
> What workloads did you test this on?

On the STAMP suite of benchmarks for transactional memory (described in
[1]). I ran an unmodified GCC 5.0.0 against a GCC patched with these
modifications and obtained the following speedups in STAMP with 4
threads (on a Haswell with 4 cores, averaged over 10 runs):

benchmark: speedup
genome: 1.32
intruder: 1.66
labyrinth: 1.00
ssca2: 1.02
yada: 1.00
kmeans-high: 1.13
kmeans-low: 1.10
vacation-high: 2.27
vacation-low: 1.88

[1] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, Kunle Olukotun:
STAMP: Stanford Transactional Applications for Multi-Processing.
IISWC 2008: 35-46

> Are you sure you need floating point here? If the program does not
> use it in any other ways faulting in the floating point state can be
> quite expensive.
>
> I bet fixed point would work for such simple purposes too.

That is a good point. While I have never used fixed-point arithmetic, a
cursory inspection suggests that it makes sense and is applicable to
this case. Are you aware of some place where this is already done
within GCC that I could use as inspiration, or should I craft some
macros from scratch?

> > + serial_lock.read_unlock(tx);
> > +
> > + // Obtain the delta performance with respect to the last period.
> > + uint64_t current_cycles = rdtsc();
> > + uint64_t cycles_used = current_cycles - optimizer.last_cycles;
>
> It may be worth pointing out that rdtsc does not return cycles.
> In fact the ratio to real cycles is variable depending on the changing
> frequency.
> I hope your algorithms can handle that.

The intent here is to obtain a low-cost notion of elapsed time, and
RDTSC seemed to be the best choice available. It is not critical that
frequency scaling changes how the returned value relates to actual CPU
cycles.
> > +
> > + // Compute gradient descent for the number of retries.
> > + double change_for_better = current_throughput / optimizer.last_throughput;
> > + double change_for_worse = optimizer.last_throughput / current_throughput;
> > + int32_t last_attempts = optimizer.last_attempts;
> > + int32_t current_attempts = optimizer.optimized_attempts;
> > + int32_t new_attempts = current_attempts;
> > + if (unlikely(change_for_worse > 1.40))
> > +   {
> > +     optimizer.optimized_attempts = optimizer.best_ever_attempts;
> > +     optimizer.last_throughput = current_throughput;
> > +     optimizer.last_attempts = current_attempts;
> > +     return;
> > +   }
> > +
> > + if (unlikely(random() % 100 < 1))
> > +   {
>
> So where is the seed for that random stored? Could you corrupt some
> user's random state? Is the state per thread or global?
> If it's per thread how do you initialize so that the threads do
> start with different seeds.
> If it's global what synchronizes it?

As I do not specify any seed, I was under the impression that there
would be a default initialization. Furthermore, the POSIX documentation
specifies random() to be MT-Safe, so I assumed its internal state to be
per thread. Did I misinterpret this?

With regard to the self-tuning state, it is kept in the
"gtm_global_optimizer optimizer" struct, which is in essence
multi-reader (any thread running transactions can check the struct to
use the parameters optimized in it) and single-writer (notice that the
"void GTM::gtm_thread::reoptimize_htm_execution()" function is called
only by one thread, the one at the end of the list of threads, i.e.,
the one whose tx->next_thread == NULL).

> Overall the algorithm looks very complicated with many mysterious magic
> numbers. Are there simplifications possible? While the retry path is not
> extremely critical it should be at least somewhat optimized, otherwise
> it will dominate the cost of short transactions.
>
> One problem with so many magic numbers is that they may be good on one
> system, but bad on another.

Notice that the retry path is barely changed in the common case: only a
designated thread (the last one in the list of threads registered in
libitm) periodically executes the re-optimization. Hence, most of the
code you can see in this patch is executed only in that uncommon case.

I understand the concern with magic numbers: we could self-tune them as
well, but that would surely increase the complexity of the patch :)
In essence, we have the following numbers at the moment:

* how often we re-optimize (every 500 successful transactions of the
  designated thread)
* the maximum number of attempts in hardware (20)
* how much better or worse the performance must change for the gradient
  descent to move to a new configuration (5%)
* how badly the performance must degrade for the gradient descent to
  roll back to the best configuration known so far (40%)
* how often the gradient descent explores randomly to avoid local
  optima (1%)

Once again, thank you for your time on this matter. Looking forward to
your comments.

--
Nuno Diegues

> -Andi
> --
> a...@linux.intel.com -- Speaking for myself only