Hi,
I am going to benchmark the following hunk separately tonight. It is an independent change.
Rth, Vladimir: there are obviously several options for making GCC use SSE for 64bit loads/stores in 32bit codegen (and 128bit loads/stores in 128bit codegen). Which variant do you think is best here? (An alternative would be to make the move patterns prefer the SSE variant in this case, or to change the RA order to iterate through SSE first, but at least with pre-IRA this used to lead to bad decisions, making RA place a value in SSE despite the fact that it is used in arithmetic that can't be done with SSE.)

Honza

@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+          && !TARGET_64BIT
+          && TARGET_SSE2
+          && MEM_P (op0)
+          && MEM_P (op1)
+          && !push_operand (op0, mode)
+          && can_create_pseudo_p ())
+        {
+          rtx temp = gen_reg_rtx (V2DImode);
+          emit_insn (gen_sse2_loadq (temp, op1));
+          emit_insn (gen_sse_storeq (op0, temp));
+          return;
+        }
+      if (mode == DImode
+          && !TARGET_64BIT
+          && TARGET_SSE
+          && !MEM_P (op1)
+          && GET_MODE (op1) == V2DImode)
+        {
+          emit_insn (gen_sse_storeq (op0, op1));
+          return;
+        }
+      if (mode == TImode
+          && TARGET_AVX2
+          && MEM_P (op0)
+          && !MEM_P (op1)
+          && GET_MODE (op1) == V4DImode)
+        {
+          op0 = convert_to_mode (V2DImode, op0, 1);
+          emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+          return;
+        }
       if (MEM_P (op0)
           && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
               || !push_operand (op0, mode))
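
For illustration, here is a minimal sketch of the kind of source-level copy the first case in the hunk is aimed at: a 64bit memory-to-memory move in 32bit code. The function name copy64 is hypothetical and not part of the patch. The flags -m32 -msse2 are an assumption. Whether the expander actually sees a MEM-to-MEM DImode move for this source depends on the optimization level and on earlier passes, so treat it as a starting point for a testcase, not a guaranteed trigger.

```c
#include <stdint.h>

/* Hypothetical testcase: copy one 64bit value from memory to memory.
   Without SSE, 32bit code typically needs two 32bit integer loads and
   two 32bit stores; the hunk is meant to let this go through a single
   SSE register as one 64bit load and one 64bit store instead.  */
void
copy64 (uint64_t *dst, const uint64_t *src)
{
  *dst = *src;
}
```

Comparing the assembly from something like `gcc -m32 -msse2 -O2 -S` before and after the patch should show whether the copy is done through an SSE register.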