On 04/18/2010 05:13 PM, Aurelien Jarno wrote:
> On Tue, Apr 13, 2010 at 04:33:59PM -0700, Richard Henderson wrote:
>> Define OPC_BSWAP. Factor opcode emission to separate functions.
>> Use bswap+shift to implement 16-bit swap instead of a rolw; this
>> gets the proper zero-extension required by INDEX_op_bswap16_i32.
>
> This is not required by INDEX_op_bswap16_i32. What is needed is that the
> value in the input register has the 16 upper bits set to 0.
Ah.
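To spell out the distinction, here is a plain C model of the two
sequences. It is only a sketch: the names swap16_rolw and
swap16_bswap_shr are made up for illustration and are not from the
patch, and the shift is assumed to be a logical right shift by 16.
The rolw form leaves bits 16..31 as they were, so its result is
zero-extended only if the input already was; the bswap+shift form
zero-extends unconditionally.

```c
#include <stdint.h>

/* Models "rolw $8, %w0": swaps the two low bytes and leaves
   bits 16..31 of the register untouched. (Illustrative name.) */
static uint32_t swap16_rolw(uint32_t x)
{
    uint32_t lo = x & 0xffffu;

    lo = ((lo << 8) | (lo >> 8)) & 0xffffu;
    return (x & 0xffff0000u) | lo;
}

/* Models "bswap %0; shr $16, %0": the swapped halfword ends up in
   the low 16 bits and the upper 16 bits are always zero.
   (Illustrative name.) */
static uint32_t swap16_bswap_shr(uint32_t x)
{
    uint32_t b = (x >> 24) | ((x >> 8) & 0xff00u)
               | ((x << 8) & 0xff0000u) | (x << 24);

    return b >> 16;
}
```

With a zero-extended input such as 0x00001234 both return 0x3412.
With garbage in the upper half, say 0xABCD1234, the rolw model
returns 0xABCD3412 while the bswap model still returns 0x3412.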
> Considering
> that, the rolw instruction is faster than bswap + shift.
Well, no, it isn't.

static inline int test_rolw(unsigned short *s)
{
    int i, start, end;

    /* start is written before *s is read, so it must be early-clobber.
       The addl reads the full 32-bit register after the 16-bit rolw. */
    asm volatile("rdtsc\n\t"
                 "movl %%eax, %1\n\t"
                 "movzwl %3,%2\n\t"
                 "rolw $8, %w2\n\t"
                 "addl $1,%2\n\t"
                 "rdtsc"
        : "=&a"(end), "=&r"(start), "=r"(i) : "m"(*s) : "edx");
    return end - start;
}
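static inline int test_bswap(unsigned short *s)
{
    int i, start, end;

    /* start is written before *s is read, so it must be early-clobber.
       bswap leaves the swapped halfword in bits 16..31; shr moves it
       down and zero-extends it. */
    asm volatile("rdtsc\n\t"
                 "movl %%eax, %1\n\t"
                 "movzwl %3,%2\n\t"
                 "bswap %2\n\t"
                 "shr $16,%2\n\t"
                 "addl $1,%2\n\t"
                 "rdtsc"
        : "=&a"(end), "=&r"(start), "=r"(i) : "m"(*s) : "edx");
    return end - start;
}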
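The driver is not shown. A minimal sketch of one, assuming it just
calls each test ten times and prints the deltas on one line, could
look like the following. run_trials, test_fn and fake_test are
hypothetical names; fake_test is only a stand-in so the sketch
builds on any target, and on x86 you would pass test_rolw or
test_bswap instead.

```c
#include <stdio.h>

#define TRIALS 10

typedef int (*test_fn)(unsigned short *);

/* Hypothetical helper: run one test TRIALS times, keep the deltas
   in out[], and print them as a single row prefixed by name. */
static void run_trials(const char *name, test_fn fn,
                       unsigned short *s, int out[TRIALS])
{
    int i;

    for (i = 0; i < TRIALS; i++)
        out[i] = fn(s);

    printf("%s", name);
    for (i = 0; i < TRIALS; i++)
        printf(" %d", out[i]);
    printf("\n");
}

/* Stand-in for a real timing function; it returns the low byte of
   the input rather than measuring anything. */
static int fake_test(unsigned short *s)
{
    return (int)(*s & 0xff);
}
```

Calling the tests back to back like this also lets the first one or
two samples absorb cache and branch-predictor warm-up, which is one
reason to look at the whole row rather than a single number.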
Results, in TSC ticks between the two rdtsc reads, ten runs each:

model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
rolw    60 60 72 60 60 72 60 60 72 60
bswap   60 60 60 60 60 60 60 60 60 60

model name : Dual-Core AMD Opteron(tm) Processor 1210
rolw     9 10  9  9  8  8  8  8  8  8
bswap    9  9  8  8  8  8  8  8  8  8
The rolw sequence is never faster in these runs, and its timings
are less stable, likely due to the partial register stall I mentioned.
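To put a number on "stable", one can take max minus min over each
row. A small sketch, with the Core 2 samples copied from the rows
above; spread is a hypothetical helper name, not anything from the
patch.

```c
#include <stddef.h>

/* Samples copied from the Core 2 rows above. */
static const int core2_rolw[10]  = {60, 60, 72, 60, 60, 72, 60, 60, 72, 60};
static const int core2_bswap[10] = {60, 60, 60, 60, 60, 60, 60, 60, 60, 60};

/* Hypothetical helper: max minus min of n samples, n >= 1. */
static int spread(const int *v, size_t n)
{
    int lo = v[0], hi = v[0];
    size_t i;

    for (i = 1; i < n; i++) {
        if (v[i] < lo)
            lo = v[i];
        if (v[i] > hi)
            hi = v[i];
    }
    return hi - lo;
}
```

On those samples the rolw row has a spread of 12 ticks and the
bswap row a spread of 0.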
I will grant that the rolw sequence is smaller, and I can
adjust this patch to use that sequence if you wish.
r~