On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
>
> Interesting idea.
>
> FWIW, there are two other extant HTM implementations: ppc64 and s390x.  As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers.  Which suggests that we need special support within the TCG
> prologue.  Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong
atomicity' ["strong atomicity guarantees atomicity between transactions and
non-transactional code", see
http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ] then this
approach would work, in the sense that stores wouldn't have to be
instrumented. Of course architecture issues like saving the fp registers as
you mention for s390 would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm so I will make use
of it. The power8 machine in the farm has a lot of cores, so this is pretty
exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
>
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how (for
aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common (up to 50% abort rate for some SPEC workloads,
and I count these aborts as "retrying doesn't help" kind of aborts), so
bringing that down should definitely help.
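
To make the "retrying doesn't help" distinction concrete, here is a minimal
sketch of a begin helper that only retries when the abort status carries the
hardware's retry hint, and that counts the other aborts separately. It is not
the code from the patch above: struct tx_stats, tx_begin_with_policy() and the
stub_* functions are made up for illustration, and TX_STARTED /
TX_ABORT_RETRY are local stand-ins that I'm assuming match _XBEGIN_STARTED
and _XABORT_RETRY from <immintrin.h>. The begin function is passed in as a
pointer so that the sketch also builds on machines without TSX.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the RTM status values; assumed (not verified here)
 * to match _XBEGIN_STARTED and _XABORT_RETRY from <immintrin.h>. */
#define TX_STARTED      (~0u)
#define TX_ABORT_RETRY  (1u << 1)

/* Hypothetical counters, for illustration only. */
struct tx_stats {
    unsigned long started;    /* transactions that began successfully */
    unsigned long retried;    /* aborts followed by another attempt */
    unsigned long hopeless;   /* aborts without the retry hint */
    unsigned long exhausted;  /* aborts after the retry budget ran out */
};

/* Signature of _xbegin(); a pointer lets us plug in a stub. */
typedef unsigned (*tx_begin_fn)(void);

/* Returns true if a transaction is now running, false if the caller
 * should take the fallback path. */
static bool tx_begin_with_policy(tx_begin_fn begin, int max_retries,
                                 struct tx_stats *st)
{
    for (;;) {
        unsigned status = begin();

        if (status == TX_STARTED) {
            st->started++;
            return true;
        }
        if (!(status & TX_ABORT_RETRY)) {
            /* No retry hint: trying again is unlikely to help. */
            st->hopeless++;
            return false;
        }
        if (max_retries-- <= 0) {
            st->exhausted++;
            return false;
        }
        st->retried++;
    }
}

/* Stubs standing in for _xbegin() so the policy can be exercised
 * without HTM hardware. Bits other than the retry hint are arbitrary. */
static unsigned stub_no_retry_abort(void)
{
    return 1u << 3;  /* some abort cause, retry hint clear */
}

static int stub_calls;

static unsigned stub_abort_twice_then_start(void)
{
    return stub_calls++ < 2 ? (TX_ABORT_RETRY | (1u << 2)) : TX_STARTED;
}
```

On real hardware one would pass _xbegin itself (built with -mrtm) instead of
a stub, and the hopeless/exhausted counters would give the split between
aborts that are worth retrying and those that should go straight to the
fallback path.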

Another thing I found out is that abusing tcg_exec_step (as it is right now)
for the fallback path is a bad idea: when there are many failed transactions,
performance drops dramatically (up to 5x overall slowdown). It turns out that
all of this overhead comes from re-translating the code between ldrex/strex.
Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving the
CPU loop for the calling thread. I'm more comfortable doing the latter due
to my glaring lack of TCG competence.

Thanks,

		Emilio