My recent attempts to come up with a testcase for my patch to evaluate ss_plus in simplify-rtx.c identified a missed optimization opportunity (and potentially a long-standing regression): the RTL optimizers no longer place constants in the constant pool.
The motivating x86_64 example is the simple program:

    typedef char v8qi __attribute__ ((vector_size (8)));

    v8qi foo()
    {
      v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
      v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
      v8qi t = __builtin_ia32_paddsb(tx, ty);
      return t;
    }

which (with my previous patch) currently results in:

    foo:    movq    .LC0(%rip), %xmm0
            movq    .LC1(%rip), %xmm1
            paddsb  %xmm1, %xmm0
            ret

even though the RTL contains the result in a REG_EQUAL note:

    (insn 7 6 12 2 (set (reg:V8QI 83)
            (ss_plus:V8QI (reg:V8QI 84)
                (reg:V8QI 85))) "ssaddqi3.c":7:12 1419 {*mmx_ssaddv8qi3}
         (expr_list:REG_DEAD (reg:V8QI 85)
            (expr_list:REG_DEAD (reg:V8QI 84)
                (expr_list:REG_EQUAL (const_vector:V8QI [
                            (const_int 3 [0x3])
                            (const_int 0 [0]) repeated x7
                        ])
                    (nil)))))

Together with the patch below, GCC will now generate the much more sensible:

    foo:    movq    .LC2(%rip), %xmm0
            ret

My first approach was to look in cse.c (where the REG_EQUAL note gets added) and notice that the constant pool handling functionality has been unreachable for a while.  A quick search for constant_pool_entries_cost shows that it's initialized to zero but never set to a non-zero value, meaning that force_const_mem is never called.  This functionality used to work way back in 2003, but has been lost over time:
https://gcc.gnu.org/pipermail/gcc-patches/2003-October/116435.html

The changes to cse.c below restore this functionality (placing suitable constants in the constant pool) with two significant refinements: (i) it only attempts to do this if the function already uses a constant pool (thanks to the availability of crtl->uses_const_pool since 2003), and (ii) it allows different constants (i.e. modes) to have different costs, so that floating point "doubles" and 64-bit, 128-bit, 256-bit and 512-bit vectors don't all have to share the same cost.  Back in 2003, the assumption was that everything in a constant pool had the same cost, hence the global variable constant_pool_entries_cost.

Although this is a useful CSE fix, it turns out that it doesn't cure my motivating problem above.  CSE only considers a single instruction at a time, so it determines that it's cheaper to perform the ss_plus (COSTS_N_INSNS(1)) than to read the result from the constant pool (COSTS_N_INSNS(2)).  It's only when the other reads from the constant pool are also eliminated that this transformation is a win.  Hence a better place to perform this transformation is in combine, which, after failing to "recog" the load of a suitable constant, can retry after calling force_const_mem.  This achieves the desired transformation and allows the backend's insn_cost callback to control whether or not using the constant pool is preferable.
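To make the costs concrete, here's the back-of-the-envelope arithmetic (my own tally using the figures above, not actual compiler output):

    movq    .LC0(%rip), %xmm0     ; COSTS_N_INSNS(2)
    movq    .LC1(%rip), %xmm1     ; COSTS_N_INSNS(2)
    paddsb  %xmm1, %xmm0          ; COSTS_N_INSNS(1)
                                  ; total: COSTS_N_INSNS(5)
versus

    movq    .LC2(%rip), %xmm0     ; COSTS_N_INSNS(2)

Looking at insn 7 in isolation, CSE only compares the COSTS_N_INSNS(1) addition against a COSTS_N_INSNS(2) pool load and so keeps the addition; combine, having already substituted both operand loads, sees the whole sequence collapse from five units to two.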
Alas, it's rare to change code generation without affecting something in GCC's testsuite.  On x86_64-pc-linux-gnu there were two families of new failures (and I'd predict similar benign fallout on other platforms).

One failure was gcc.target/i386/387-12.c (aka PR target/26915), where the test is missing an explicit -m32 flag.  On i686, it's very reasonable to materialize -1.0 using "fld1; fchs", but on x86_64-pc-linux-gnu we currently generate the awkward:

    testm1: fld1
            fchs
            fstpl   -8(%rsp)
            movsd   -8(%rsp), %xmm0
            ret

which combine now very reasonably simplifies to just:

    testm1: movsd   .LC3(%rip), %xmm0
            ret

The other class of x86_64-pc-linux-gnu failure came from the materialization of vector constants using vpbroadcast (e.g. gcc.target/i386/pr90773-17.c), where the decision is finely balanced: loading an integer register with an immediate constant and then broadcasting it with vpbroadcast is deemed to cost COSTS_N_INSNS(2), but a load from the constant pool is also reported as COSTS_N_INSNS(2).  My solution is to tweak i386.c's rtx_costs so that, all other things being equal, an instruction (sequence) that accesses memory is fractionally more expensive than one that doesn't.
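For illustration, the finely-balanced case looks something like the following reduced example (my own sketch, not the actual pr90773-17.c testcase):

    typedef char v32qi __attribute__ ((vector_size (32)));

    v32qi
    splat5 (void)
    {
      /* With AVX2, this vector can be materialized either by loading
         an immediate into an integer register and broadcasting it
         with a vpbroadcast, or by a single 32-byte load from the
         constant pool; both alternatives score COSTS_N_INSNS(2).  */
      return (v32qi) { 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
                       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5 };
    }

With the rtx_costs change below, the constant pool alternative becomes fractionally more expensive, so the broadcast sequences these tests expect continue to be generated.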
Hopefully, this all makes sense.  If someone could benchmark this for me, that would be much appreciated.  This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap" and "make -k check" with no new failures.  Ok for mainline?

2021-10-03  Roger Sayle  <ro...@nextmovesoftware.com>

gcc/ChangeLog
	* combine.c (recog_for_combine): For an unrecognized move/set
	of a constant, try force_const_mem to place it in the constant pool.
	* cse.c (constant_pool_entries_cost, constant_pool_entries_regcost):
	Delete global variables (that are no longer assigned a cost value).
	(cse_insn): Simplify logic for deciding whether to place a
	folded constant in the constant pool using force_const_mem.
	(cse_main): Remove zero initialization of constant_pool_entries_cost
	and constant_pool_entries_regcost.
	* config/i386/i386.c (ix86_rtx_costs): Make memory accesses
	fractionally more expensive, when optimizing for speed.

gcc/testsuite/ChangeLog
	* gcc.target/i386/387-12.c: Add explicit -m32 option.

Roger
--

diff --git a/gcc/combine.c b/gcc/combine.c
index 892c834..03e9a78 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11567,7 +11567,27 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx *pnotes)
   bool changed = false;
 
   if (GET_CODE (pat) == SET)
-    changed = change_zero_ext (pat);
+    {
+      /* For an unrecognized single set of a constant, try placing it in
+	 the constant pool, if this function already uses one.  */
+      rtx src = SET_SRC (pat);
+      if (CONSTANT_P (src)
+	  && !CONST_INT_P (src)
+	  && crtl->uses_const_pool)
+	{
+	  machine_mode mode = GET_MODE (src);
+	  if (mode == VOIDmode)
+	    mode = GET_MODE (SET_DEST (pat));
+	  src = force_const_mem (mode, src);
+	  if (src)
+	    {
+	      SUBST (SET_SRC (pat), src);
+	      changed = true;
+	    }
+	}
+      else
+	changed = change_zero_ext (pat);
+    }
   else if (GET_CODE (pat) == PARALLEL)
     {
       int i;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ba89e11..cd0e38d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -20799,6 +20799,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
       *total = cost->sse_op;
       return true;
 
+    case MEM:
+      /* An insn that accesses memory is slightly more expensive
+	 than one that does not.  */
+      if (speed)
+	*total += 1;
+      return false;
+
     default:
       return false;
     }
diff --git a/gcc/cse.c b/gcc/cse.c
index 330c1e9..4c3988e 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -491,14 +491,6 @@ static struct table_elt *table[HASH_SIZE];
 
 static struct table_elt *free_element_chain;
 
-/* Set to the cost of a constant pool reference if one was found for a
-   symbolic constant.  If this was found, it means we should try to
-   convert constants into constant pool entries if they don't fit in
-   the insn.  */
-
-static int constant_pool_entries_cost;
-static int constant_pool_entries_regcost;
-
 /* Trace a patch through the CFG.  */
 
 struct branch_path
@@ -4609,9 +4601,6 @@ cse_insn (rtx_insn *insn)
       int src_folded_regcost = MAX_COST;
       int src_related_regcost = MAX_COST;
       int src_elt_regcost = MAX_COST;
-      /* Set nonzero if we need to call force_const_mem on with the
-	 contents of src_folded before using it.  */
-      int src_folded_force_flag = 0;
       scalar_int_mode int_mode;
 
       dest = SET_DEST (sets[i].rtl);
@@ -5166,15 +5155,7 @@ cse_insn (rtx_insn *insn)
			     src_related_cost, src_related_regcost) <= 0
	      && preferable (src_folded_cost, src_folded_regcost,
			     src_elt_cost, src_elt_regcost) <= 0)
-	    {
-	      trial = src_folded, src_folded_cost = MAX_COST;
-	      if (src_folded_force_flag)
-		{
-		  rtx forced = force_const_mem (mode, trial);
-		  if (forced)
-		    trial = forced;
-		}
-	    }
+	    trial = src_folded, src_folded_cost = MAX_COST;
	  else if (src
		   && preferable (src_cost, src_regcost,
				  src_eqv_cost, src_eqv_regcost) <= 0
@@ -5361,23 +5342,24 @@ cse_insn (rtx_insn *insn)
		  break;
		}
 
-	      /* If we previously found constant pool entries for
-		 constants and this is a constant, try making a
-		 pool entry.  Put it in src_folded unless we already have done
-		 this since that is where it likely came from.  */
+	      /* If the current function uses a constant pool and this is a
+		 constant, try making a pool entry.  Put it in src_folded
+		 unless we already have done this since that is where it
+		 likely came from.  */
 
-	      else if (constant_pool_entries_cost
+	      else if (crtl->uses_const_pool
		       && CONSTANT_P (trial)
-		       && (src_folded == 0
-			   || (!MEM_P (src_folded)
-			       && ! src_folded_force_flag))
+		       && !CONST_INT_P (trial)
+		       && (src_folded == 0 || !MEM_P (src_folded))
		       && GET_MODE_CLASS (mode) != MODE_CC
		       && mode != VOIDmode)
		{
-		  src_folded_force_flag = 1;
-		  src_folded = trial;
-		  src_folded_cost = constant_pool_entries_cost;
-		  src_folded_regcost = constant_pool_entries_regcost;
+		  src_folded = force_const_mem (mode, trial);
+		  if (src_folded)
+		    {
+		      src_folded_cost = COST (src_folded, mode);
+		      src_folded_regcost = approx_reg_cost (src_folded);
+		    }
		}
	    }
 
@@ -6630,8 +6612,6 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
   cse_cfg_altered = false;
   cse_jumps_altered = false;
   recorded_label_ref = false;
-  constant_pool_entries_cost = 0;
-  constant_pool_entries_regcost = 0;
   ebb_data.path_size = 0;
   ebb_data.nsets = 0;
   rtl_hooks = cse_rtl_hooks;
diff --git a/gcc/testsuite/gcc.target/i386/387-12.c b/gcc/testsuite/gcc.target/i386/387-12.c
index 62c1d48..7fe50a2 100644
--- a/gcc/testsuite/gcc.target/i386/387-12.c
+++ b/gcc/testsuite/gcc.target/i386/387-12.c
@@ -1,6 +1,6 @@
 /* PR target/26915 */
 /* { dg-do compile } */
-/* { dg-options "-O -mfpmath=387 -mfancy-math-387" } */
+/* { dg-options "-O -m32 -mfpmath=387 -mfancy-math-387" } */
 
 double testm0(void)
 {
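If it would help review, a testcase along these lines (my own sketch of what such a test might look like; the directives are illustrative and it is not part of the patch) exercises the motivating transformation:

    /* { dg-do compile } */
    /* { dg-options "-O2" } */
    typedef char v8qi __attribute__ ((vector_size (8)));

    v8qi foo (void)
    {
      v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
      v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
      /* With the patch, combine folds the saturating addition into a
         single constant pool load, so no paddsb should remain.  */
      return __builtin_ia32_paddsb (tx, ty);
    }
    /* { dg-final { scan-assembler-not "paddsb" } } */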