Paolo Bonzini writes:

> On 14/07/2016 22:29, Pranith Kumar wrote:
>> +        } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                   prev_mb_type == TCG_BAR_LDAQ) {
>> +            /* Consecutive load-acquire and store-release barriers
>> +             * can be merged into one stronger SC barrier
>> +             * ldaq; strl => ld; mb; st
>> +             */
>> +            args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +            tcg_op_remove(s, prev_op);
>
> Is this really an optimization? For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions. In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
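For context, the bit manipulation in the quoted hunk follows QEMU's TCGBar
encoding, where the low nibble carries the TCG_MO_* access-ordering bits and
the high bits the barrier kind. Below is a minimal standalone sketch of the
merge, assuming the TCG_MO_*/TCG_BAR_* values from tcg/tcg.h; the
merge_barriers() helper and its argument words are simplified stand-ins for
the real optimize.c pass, not the actual patch code:

#include <stdio.h>

/* Mirrors QEMU's TCGBar (tcg/tcg.h): low nibble = access-ordering bits,
 * high bits = barrier kind. */
enum {
    TCG_MO_LD_LD = 0x01,
    TCG_MO_ST_LD = 0x02,
    TCG_MO_LD_ST = 0x04,
    TCG_MO_ST_ST = 0x08,
    TCG_MO_ALL   = 0x0F,    /* OR of the above */

    TCG_BAR_LDAQ = 0x10,    /* following ops will not come forward */
    TCG_BAR_STRL = 0x20,    /* previous ops will not be delayed */
    TCG_BAR_SC   = 0x30,    /* no ops cross barrier; OR of the two */
};

/* Hypothetical stand-in for the mb ops' argument words in the pass:
 * returns the argument of the surviving op, assuming the previous op
 * is then deleted (as tcg_op_remove() does in the quoted hunk). */
static int merge_barriers(int prev_arg, int curr_arg)
{
    int prev_bar = prev_arg & ~TCG_MO_ALL;
    int curr_bar = curr_arg & ~TCG_MO_ALL;

    if (prev_bar == TCG_BAR_LDAQ && curr_bar == TCG_BAR_STRL) {
        /* ldaq; strl => one SC barrier, keeping the access bits */
        return (curr_arg & TCG_MO_ALL) | TCG_BAR_SC;
    }
    return curr_arg;    /* no merge possible */
}

int main(void)
{
    int merged = merge_barriers(TCG_MO_ALL | TCG_BAR_LDAQ,
                                TCG_MO_ALL | TCG_BAR_STRL);
    printf("merged barrier arg = 0x%02x\n", merged);    /* 0x3f */
    return 0;
}

The same encoding is what makes Paolo's suggested lowering work: once ldaq
and strl are expanded to plain accesses plus barriers, adjacent redundant
barriers can be removed by the other optimizations he mentions.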
I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the
TCG backend does not generate them; TCG just generates plain loads and
stores. I guess we didn't need to while TCG was single-threaded, before
MTTCG. I am trying to add support for generating these instructions on
AArch64. Once this is done, we can disable the above optimization.

--
Pranith