On 04/02/2013 05:12 PM, Richard Henderson wrote:
On 2013-04-02 07:41, Alexander Graf wrote:
On 2013-04-01 23:34, Alexander Graf wrote:
Is this faster than a load/store with std/ldbrx?
Hmm. Almost certainly not. And since we've got stack space
allocated for function calls, we've got scratch space to do it in.
Probably similar for bswap32 too, eh?
Depends - memory load/store doesn't come for free and bswap32 is
quite short.
I'll do a tiny bit of benchmarking for power7.
Cool, thanks a bunch :)
Heh. "Almost certainly not" indeed. Unless I've made some silly mistake,
going through memory stalls badly. No store buffer forwarding on power7?
With the following test case, time reports:
f1 2.967s
f2 8.930s
f3 7.071s
f4 7.166s
And note that f4 is a normal store/load pair, trying to determine what the
store buffer forwarding delay might be.
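The test case itself is not reproduced in this message. As a rough
illustration only, the two approaches being compared might be sketched in
portable C as below. The function names are hypothetical, this is not the
benchmark that produced the timings above, and the memory variant only
approximates what a store followed by a byte-reversed load (ldbrx) would do
on PowerPC.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration; not the benchmark from this thread. */

/* Register-only swap: shifts and masks, no memory traffic. */
static uint64_t bswap64_reg(uint64_t x)
{
    /* Swap adjacent bytes, then adjacent 16-bit units, then 32-bit halves. */
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* Swap through a stack slot: store the value, then load its bytes back in
 * reverse order. The volatile qualifier keeps the compiler from folding the
 * round trip back into a register-only sequence. */
static uint64_t bswap64_mem(uint64_t x)
{
    volatile unsigned char slot[8];
    unsigned char tmp[8];
    uint64_t r;
    int i;

    memcpy(tmp, &x, sizeof(tmp));
    for (i = 0; i < 8; i++) {
        slot[i] = tmp[i];          /* store side */
    }
    for (i = 0; i < 8; i++) {
        tmp[i] = slot[7 - i];      /* byte-reversed load side */
    }
    memcpy(&r, tmp, sizeof(r));
    return r;
}
```

Both functions return the same value for any input; only the path the data
takes (registers versus a store followed by a dependent load) differs, and
that path is what the timings are probing.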
Yeah, doesn't look like it makes any sense at all to do a load/store
cycle then. What a shame :).
Keep in mind that this tests icache hot cycles. However, you might get
bad icache penalties due to the long bswap64 sequence. So all the memory
latency you see here might also affect the instruction stream when it
gets executed. But then again we only care about performance of cache
hot sequences in the first place....
Alex