Apparently in contrast to similar instructions on other architectures, x86's mulx will store the lower half of the result first, and the upper half later. If the same register is to be used for both, it will contain the upper half of the result after the operation.
tcg_gen_mulu2_i64()'s default case (using the host's op_mulu2_i64) apparently does not actually overwrite the target at all in such a case; and the other two code paths in that function will write the result in the wrong order (upper half first, lower half second). Therefore, we should not use that function. Reported-by: Toni Nedialkov <farm...@gmail.com> Signed-off-by: Max Reitz <mre...@redhat.com> --- I did not encounter the crash Toni has reported, but I did see this. https://gist.github.com/473bad1b711fa61f010e is a reproducer (FASM code, you can fetch the assembled version from http://78.47.108.109:8080/mulx.bin ($qemu -cpu Haswell -fda mulx.bin)). That code simply sets up Long Mode, loads some value into rdx and executes "mulx rsi,rdi,rdx" and "mulx rax,rax,rdx". After it has started, "info registers" on the monitor should look like this: RAX=0000000001234567 RBX=0000000000000000 RCX=00000000c0000080 RDX=0000111111111111 RSI=0000000001234567 RDI=89abcba987654321 RBP=0000000000000000 RSP=00000000000a0000 [...] So rsi:rdi is the result of rdx*rdx, and rax contains the upper half of that result (i.e. rax == rsi). This is the correct result. Without this patch, it looks like this: RAX=0000000080000011 RBX=0000000000000000 RCX=00000000c0000080 RDX=0000111111111111 RSI=0000000001234567 RDI=89abcba987654321 RBP=0000000000000000 RSP=00000000000a0000 So while rsi:rdi is correct, rax is completely unmodified (which is wrong). Dropping the first branch in tcg_gen_mulu2_i64() yields the following: RAX=89abcba987654321 RBX=0000000000000000 RCX=00000000c0000080 RDX=0000111111111111 RSI=0000000001234567 RDI=89abcba987654321 RBP=0000000000000000 RSP=00000000000a0000 Here, rsi:rdi is still correct, but now rax contains the lower half of the result (i.e. rax == rdi). This is wrong, too. --- target-i386/translate.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/target-i386/translate.c b/target-i386/translate.c index fbe4f80..a5c790a 100644 --- a/target-i386/translate.c +++ b/target-i386/translate.c @@ -3848,8 +3848,15 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b, break; #ifdef TARGET_X86_64 case MO_64: - tcg_gen_mulu2_i64(cpu_regs[s->vex_v], cpu_regs[reg], - cpu_T[0], cpu_regs[R_EDX]); + tcg_gen_op3_i64(INDEX_op_mul_i64, cpu_regs[s->vex_v], + cpu_T[0], cpu_regs[R_EDX]); + if (TCG_TARGET_HAS_muluh_i64) { + tcg_gen_op3_i64(INDEX_op_muluh_i64, cpu_regs[reg], + cpu_T[0], cpu_regs[R_EDX]); + } else { + gen_helper_muluh_i64(cpu_regs[reg], + cpu_T[0], cpu_regs[R_EDX]); + } break; #endif } -- 2.6.2