So, my first cut at the function to select reassociation width for
Power was modeled on what I saw i386 and aarch64 doing, which is to
return a width based on how many ops of that kind we can issue at
the same time:

static int
rs6000_reassociation_width (unsigned int opc, enum machine_mode mode)
{
    switch (rs6000_cpu) {
    case PROCESSOR_POWER8:
    case PROCESSOR_POWER9:
        if (VECTOR_MODE_P (mode))
            return 2;
        if (INTEGRAL_MODE_P (mode)) {
            if (opc == MULT_EXPR)
                return 2;
            return 6; /* correct for all integral modes? */
        }
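        /* FLOAT_MODE_P also matches decimal float modes, so test for
           them first to keep decimal float at the default of 1 */
        if (DECIMAL_FLOAT_MODE_P (mode))
            break;
        if (FLOAT_MODE_P (mode))
            return 2;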
        break;
    default:
        break;
    }
        
    return 1;
}

However, I think the reality of the situation is a bit more
complicated.

* If we want maximum parallelism, we should really base this on the
number of units times the latency. E.g. for float on p8 we have 2 units
and 6 cycles latency, so we would want to issue up to 12 independent
fadd or fmul in parallel; by then the result from the first one would
be ready for the next series of dependent ops.
* Of course this may cause massive register spills, so we can't
really make things that wide. Reassociation ought to be aware of how
much register pressure it is creating and how much has already been
created by values that are live across this basic block.
* Ideally we would also be aware of whether we are reassociating a tree
of fp additions whose terms are fp multiplies because now we have
fused multiply-adds to consider. See PR 70912 for more on this.
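
To make the first two points concrete, here is a minimal sketch in
plain C. It is not GCC code: struct op_class, ideal_width and
clamped_width are hypothetical names made up for illustration, and
treating each parallel chain as costing exactly one free register is a
crude assumption standing in for real register-pressure information.

```c
#include <assert.h>

/* Hypothetical description of one class of operation on one CPU. */
struct op_class {
    int units;    /* execution units that can start this op each cycle */
    int latency;  /* cycles until the result is available */
};

/* Number of independent ops needed to keep every unit busy until the
   first result comes back: units times latency. */
static int
ideal_width (struct op_class oc)
{
    assert (oc.units > 0 && oc.latency > 0);
    return oc.units * oc.latency;
}

/* Clamp the ideal width to the number of registers assumed to be free
   (one per parallel chain, an assumption), never going below 1. */
static int
clamped_width (struct op_class oc, int free_regs)
{
    int width = ideal_width (oc);

    if (width > free_regs)
        width = free_regs;
    return width < 1 ? 1 : width;
}
```

With the p8 float numbers above (2 units, 6 cycles), ideal_width gives
12, and clamped_width with only 8 free registers would cut that back
to 8. Where the free register count would come from is exactly the
open question.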

Suggestions?

Thanks, 
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain

