This patch tweaks timode_scalar_chain::compute_convert_gain to better reflect the expansion of V1TImode arithmetic right shifts by the i386 backend. The comment "see ix86_expand_v1ti_ashiftrt" appears after "case ASHIFTRT" in compute_convert_gain, and the changes below attempt to better match the logic used there.
The original motivating example is:

__int128 m1;
void foo()
{
  m1 = (m1 << 8) >> 8;
}

which with -O2 -mavx2 we fail to convert to vector form due to the inappropriate cost of the arithmetic right shift.

  Instruction gain -16 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: -3
  Chain #1 conversion is not profitable

This is reporting that the ASHIFTRT is four instructions worse using vectors than in scalar form, which is incorrect as the AVX2 expansion of this shift only requires three instructions (and the scalar form requires two).

With more accurate costs in timode_scalar_chain::compute_convert_gain we now see (with -O2 -mavx2):

  Instruction gain -4 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: 9
  Converting chain #1...

which results in:

foo:
        vmovdqa   m1(%rip), %xmm0
        vpslldq   $1, %xmm0, %xmm0
        vpsrad    $8, %xmm0, %xmm1
        vpsrldq   $1, %xmm0, %xmm0
        vpblendd  $7, %xmm0, %xmm1, %xmm0
        vmovdqa   %xmm0, m1(%rip)
        ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. There is no new testcase (yet), as the code generated for both the vector and scalar forms of the above function is still suboptimal, so code generation is in flux, but this improvement should be a step in the right direction. Ok for mainline?

2024-08-24  Roger Sayle  <ro...@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-features.cc (compute_convert_gain)
        <case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.

Thanks in advance,
Roger
--
diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 7e80e7b..35856ef 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1650,23 +1650,29 @@ timode_scalar_chain::compute_convert_gain ()
               else if (op1val == 64)
                 vcost = COSTS_N_INSNS (3);
               else if (op1val == 96)
-                vcost = COSTS_N_INSNS (4);
+                vcost = COSTS_N_INSNS (3);
               else if (op1val >= 111)
                 vcost = COSTS_N_INSNS (3);
-              else if (TARGET_AVX2 && op1val == 32)
+              else if ((TARGET_AVX2 || TARGET_SSE4_1)
+                       && op1val == 32)
+                vcost = COSTS_N_INSNS (3);
+              else if ((TARGET_AVX2 || TARGET_SSE4_1)
+                       && (op1val == 8 || op1val == 16 || op1val == 24))
                 vcost = COSTS_N_INSNS (3);
-              else if (TARGET_SSE4_1 && op1val == 32)
-                vcost = COSTS_N_INSNS (4);
               else if (op1val >= 96)
-                vcost = COSTS_N_INSNS (5);
+                vcost = COSTS_N_INSNS (4);
+              else if (TARGET_SSE4_1 && (op1val == 28 || op1val == 80))
+                vcost = COSTS_N_INSNS (4);
               else if ((op1val & 7) == 0)
-                vcost = COSTS_N_INSNS (6);
+                vcost = COSTS_N_INSNS (5);
               else if (TARGET_AVX2 && op1val < 32)
                 vcost = COSTS_N_INSNS (6);
+              else if (TARGET_SSE4_1 && op1val < 15)
+                vcost = COSTS_N_INSNS (6);
               else if (op1val == 1 || op1val >= 64)
-                vcost = COSTS_N_INSNS (9);
+                vcost = COSTS_N_INSNS (8);
               else
-                vcost = COSTS_N_INSNS (10);
+                vcost = COSTS_N_INSNS (9);
             }
           igain = scost - vcost;
           break;
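
For readers who want to sanity-check the numbers in the dump, here is a minimal standalone sketch of the updated cost ladder. It is not GCC code: `Target`, `costs_n_insns`, `v1ti_ashiftrt_vcost` and `convert_gain` are hypothetical names introduced for illustration, only the branches visible in the hunk are modeled (any branches before `op1val == 64` in the real function are omitted), and the scalar cost of two instructions is taken from the description above rather than from the scalar-cost code. The one outside fact relied on is that GCC's rtl.h defines COSTS_N_INSNS (N) as (N) * 4, which is why a one-instruction difference shows up as a gain of 4.

```cpp
#include <cassert>

// Mirrors GCC's COSTS_N_INSNS (N), defined in rtl.h as (N) * 4.
constexpr int costs_n_insns(int n) { return n * 4; }

// Hypothetical stand-in for the TARGET_AVX2 / TARGET_SSE4_1 macros.
struct Target {
  bool avx2;
  bool sse4_1;
};

// Vector cost of a V1TImode arithmetic right shift by op1val, following
// only the branches shown in the hunk (post-patch values).  Branches that
// precede "op1val == 64" in the real function are NOT modeled here.
constexpr int v1ti_ashiftrt_vcost(Target t, int op1val) {
  if (op1val == 64) return costs_n_insns(3);
  if (op1val == 96) return costs_n_insns(3);
  if (op1val >= 111) return costs_n_insns(3);
  if ((t.avx2 || t.sse4_1) && op1val == 32) return costs_n_insns(3);
  if ((t.avx2 || t.sse4_1)
      && (op1val == 8 || op1val == 16 || op1val == 24))
    return costs_n_insns(3);
  if (op1val >= 96) return costs_n_insns(4);
  if (t.sse4_1 && (op1val == 28 || op1val == 80)) return costs_n_insns(4);
  if ((op1val & 7) == 0) return costs_n_insns(5);
  if (t.avx2 && op1val < 32) return costs_n_insns(6);
  if (t.sse4_1 && op1val < 15) return costs_n_insns(6);
  if (op1val == 1 || op1val >= 64) return costs_n_insns(8);
  return costs_n_insns(9);
}

// igain = scost - vcost, as in the last lines of the hunk.  The scalar
// instruction count is passed in because the scalar-cost logic is not
// part of this diff.
constexpr int convert_gain(Target t, int op1val, int scalar_insns) {
  return costs_n_insns(scalar_insns) - v1ti_ashiftrt_vcost(t, op1val);
}

// Motivating example: shift by 8 with AVX2, scalar form assumed to be two
// instructions.  New cost: 2*4 - 3*4 = -4, matching the second dump.
static_assert(convert_gain(Target{true, true}, 8, 2) == -4,
              "post-patch gain for >> 8 on AVX2");

// Pre-patch, shift by 8 fell through to the (op1val & 7) == 0 branch at
// six instructions: 2*4 - 6*4 = -16, matching the first dump.
static_assert(costs_n_insns(2) - costs_n_insns(6) == -16,
              "pre-patch gain for >> 8");
```

The functions are constexpr so the two dump values can be checked at compile time; building the file with any C++14 or later compiler is enough to confirm the arithmetic.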