On 22/12/2013 13:24, Aurelien Jarno wrote:
> On Sat, Dec 21, 2013 at 03:08:21PM +0100, Paolo Bonzini wrote:
>> On 21/12/2013 00:00, Richard Henderson wrote:
>>> +        if (real_bswap && have_movbe) {
>>> +            tcg_out_modrm_offset(s, OPC_MOVBE_GyMy + P_DATA16 + seg,
>>> +                                 datalo, base, ofs);
>>> +            tcg_out_ext16u(s, datalo, datalo);
>>
>> Do partial register stalls still exist on Atom and Haswell?  I don't
>> remember exactly what you had to do to prevent them, but IIRC you first
>> moved zero to the register and then overwrote the low 16 bits.
>
> Note that for unsigned 16-bit load you can do either movzw + bswap or
> movbe + movzw.

Yeah, I was asking whether xor + movbe would be faster.  Benchmarking
could tell, but in any case xor + movbe is likely the smallest code you
can produce.

Paolo
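
For reference, here is a rough C model of the three sequences mentioned in this thread for an unsigned 16-bit big-endian load. It is only a sketch of the register semantics on a little-endian host, not the TCG backend code: all function names are made up for illustration, "stale" stands for whatever the destination register held before, and the 16-bit swap in the movzw variant is modelled simply as exchanging the two low bytes. It shows that the three sequences yield the same value; it says nothing about speed or code size, which is what benchmarking would have to answer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable reference: unsigned 16-bit big-endian load, zero-extended. */
static uint32_t ld16u_be_ref(const uint8_t *p)
{
    return ((uint32_t)p[0] << 8) | p[1];
}

/* Value produced by a 16-bit movbe load (assumes a little-endian host). */
static uint16_t movbe16(const uint8_t *p)
{
    uint16_t raw;
    memcpy(&raw, p, sizeof(raw));               /* plain 16-bit load */
    return (uint16_t)((raw << 8) | (raw >> 8)); /* swap the two bytes */
}

/* A 16-bit destination write keeps bits 31..16 of the old register. */
static uint32_t write_low16(uint32_t reg, uint16_t val)
{
    return (reg & 0xffff0000u) | val;
}

/* movbe + movzw: partial write over stale contents, then zero-extend. */
static uint32_t ld16u_movbe_movzw(const uint8_t *p, uint32_t stale)
{
    uint32_t reg = write_low16(stale, movbe16(p)); /* movbe (mem), reg16 */
    return reg & 0xffffu;                          /* movzw reg16, reg   */
}

/* xor + movbe: clear the whole register first, then partial write. */
static uint32_t ld16u_xor_movbe(const uint8_t *p, uint32_t stale)
{
    uint32_t reg = stale ^ stale;                  /* xor reg, reg       */
    return write_low16(reg, movbe16(p));           /* movbe (mem), reg16 */
}

/* movzw + swap: zero-extending load, then exchange the two low bytes. */
static uint32_t ld16u_movzw_swap(const uint8_t *p)
{
    uint16_t raw;
    memcpy(&raw, p, sizeof(raw));
    uint32_t reg = raw;                            /* movzw (mem), reg   */
    return ((reg << 8) | (reg >> 8)) & 0xffffu;    /* 16-bit byte swap   */
}

/* Hypothetical self-check: all three sequences agree with the reference. */
void ld16u_selfcheck(void)
{
    const uint8_t mem[2] = { 0x12, 0x34 };
    const uint32_t stale = 0xdeadbeefu;            /* arbitrary old value */

    assert(ld16u_be_ref(mem) == 0x1234u);
    assert(ld16u_movbe_movzw(mem, stale) == ld16u_be_ref(mem));
    assert(ld16u_xor_movbe(mem, stale) == ld16u_be_ref(mem));
    assert(ld16u_movzw_swap(mem) == ld16u_be_ref(mem));
}
```

The only difference the model makes visible is where the upper 16 bits get cleared: before the partial write (xor + movbe) or after it (movbe + movzw).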