So, my first cut at the function to select reassociation width for Power was modeled after what I saw i386 and aarch64 doing, which is to return a width based on how many ops of that kind we can issue at the same time:

static int
rs6000_reassociation_width (unsigned int opc, enum machine_mode mode)
{
  switch (rs6000_cpu)
    {
    case PROCESSOR_POWER8:
    case PROCESSOR_POWER9:
      if (VECTOR_MODE_P (mode))
        return 2;
      if (INTEGRAL_MODE_P (mode))
        {
          if (opc == MULT_EXPR)
            return 2;
          return 6; /* correct for all integral modes? */
        }
      if (FLOAT_MODE_P (mode))
        return 2;
      /* decimal float gets default 1 */
      break;
    default:
      break;
    }
  return 1;
}

However, the reality of the situation is a bit more complicated, I think.

* If we want maximum parallelism, we should really base this on the
  number of units times the latency. I.e. for float on p8 we have 2
  units and 6 cycles of latency, so we would want to issue up to 12 fadd
  or fmul in parallel; by then the result from the first one would be
  ready for the next series of dependent ops.

* Of course this may cause massive register spills, so we can't really
  make things that wide. Reassociation ought to be aware of how much
  register pressure it is creating and how much has already been created
  by values that need to be live across this bb.

* Ideally we would also be aware of whether we are reassociating a tree
  of fp additions whose terms are fp multiplies, because now we have
  fused multiply-adds to consider. See PR 70912 for more on this.

Suggestions?

Thanks,
    Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
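
P.S. To make the units-times-latency idea and the register pressure concern concrete, here is a rough sketch in plain C. It is not GCC code: ideal_width, clamped_width, and the free_regs parameter are hypothetical names for illustration, and the "one live value per independent chain" cost model is purely an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: the width that keeps every
   unit busy is the number of units times the op latency in cycles.  */
static int
ideal_width (int units, int latency)
{
  assert (units > 0 && latency > 0);
  return units * latency;
}

/* Hypothetical helper: clamp the ideal width to a register budget.
   Assumes each independent chain keeps roughly one value live, which
   is an illustrative assumption, not a measured cost model.  */
static int
clamped_width (int units, int latency, int free_regs)
{
  int width = ideal_width (units, latency);

  if (width > free_regs)
    width = free_regs;
  /* Never go below 1, which means "do not reassociate".  */
  return width < 1 ? 1 : width;
}
```

With the p8 float numbers above, ideal_width (2, 6) gives 12, and clamped_width (2, 6, 8) would cut that back to 8 if only 8 registers were assumed to be free. How to get a trustworthy free_regs estimate at the point where reassociation runs is exactly the open question.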