My work on PR80101 is "motivating" me to modify the implementation of
store_data_bypass_p (in gcc/recog.c).

I have a patch that bootstraps with no regressions.  However, I think
"regression" testing may not be enough to prove I got this right.  If my
patched function returns the wrong value, the outcome will be poor
instruction scheduling decisions, which will hurt performance but
probably not "correctness".

So I'd like some help understanding the existing implementation of
store_data_bypass_p.  To establish some context, here is what I think I
understand about this function:

1. As input arguments, out_insn represents an rtl expression that
potentially "produces" a store to memory and in_insn represents an rtl
expression that potentially "consumes" a value recently stored to memory.

2. If the memory store produced matches the memory fetch consumed, this
function returns true to indicate that this sequence of two instructions
qualifies for a special "bypass" latency, reflecting the fact that the
fetch will obtain the value out of the write buffer.  So, whereas the
instruction scheduler might normally expect this sequence of two
instructions to incur the Load-Hit-Store penalties associated with
cache-coherency hardware costs, because these two instructions qualify
for the store_data_bypass optimization, the scheduler counts the latency
as only 1 or 2 cycles (potentially).  [This is what I understand, but I
may be wrong, so please correct me if so.]

3. Actually, what I described above is only the "simple" case.  It may
be that the rtl for either out_insn or in_insn is really a PARALLEL
expression with multiple rtl trees beneath it.  In this case, we compare
the subtrees in a "similar" way to see if the compound expressions
qualify for the store_data_bypass_p "optimization".  (I've got some
questions below about how this is done.)  As currently implemented,
special handling is given to a CLOBBER subtree within either PARALLEL
expression: we ignore it.  This is because a CLOBBER does not represent
any real machine instruction; it just conveys semantic information that
the compiler might use.
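
To make it easier to confirm or correct the above, here is roughly the
shape I believe the "simple" (single SET producer, single SET consumer)
case has.  This is my own paraphrase, not the code in recog.c: the
helper name and the comparison predicate are placeholders I made up, and
in real GCC source the usual config.h/system.h/coretypes.h/rtl.h
includes would come first.

/* Placeholder: whatever test the real code applies to decide that the
   value out_set stores is the value in_set fetches.  */
static bool produced_matches_consumed_p (rtx out_set, rtx in_set);

static bool
simple_case_as_i_read_it (rtx_insn *out_insn, rtx_insn *in_insn)
{
  rtx out_set = single_set (out_insn);  /* the pattern's sole SET, if any */
  rtx in_set = single_set (in_insn);

  if (!out_set || !in_set)
    return false;  /* one side is a PARALLEL of multiple SETs; see below */

  /* For example (register numbers made up for illustration):
       out_set = (set (mem:DI (reg:DI 9)) (reg:DI 10))    <- the store
       in_set  = (set (reg:DI 11) (mem:DI (reg:DI 9)))    <- the fetch  */
  return produced_matches_consumed_p (out_set, in_set);
}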

Besides asking for confirmation of my understanding as outlined above,
the specific questions I am seeking help with are:

1. In the current implementation (as I understand it), near the top of
the function body, we handle the case in which the consumer (in_insn)
rtl is a single SET expression and the producer (out_insn) rtl is a
PARALLEL expression containing multiple SETs.  The way I read this code,
we require that, in order to qualify this sequence as a "store data
bypass", every one of the producer's parallel SET expressions produce
the same value that is to be consumed.  That seems wrong to me.  I would
expect that only "one" of the produced values needs to match the
consumed value in order to qualify for the "store data bypass"
optimization.  Please explain.  (The same confusing behavior appears
later in the same function, in the case that the consumer rtl is a
PARALLEL expression of multiple SETs: we require that every producer's
stored value match every consumer's fetched value.)
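
To make the distinction I am drawing concrete, here are the two loop
shapes I am contrasting for a PARALLEL producer feeding a single
consumer SET.  Again, this is my own sketch with a placeholder
predicate, not the recog.c code; as I read the current implementation,
it behaves like the first loop, whereas I expected something like the
second.

/* Placeholder for the per-SET comparison, as before.  */
static bool produced_matches_consumed_p (rtx out_exp, rtx in_set);

/* (1) Every SET in the producer's PARALLEL must match the consumer.  */
static bool
all_producer_sets_match_p (rtx out_pat, rtx in_set)
{
  for (int i = 0; i < XVECLEN (out_pat, 0); i++)
    {
      rtx out_exp = XVECEXP (out_pat, 0, i);
      if (GET_CODE (out_exp) == CLOBBER)
        continue;                       /* CLOBBERs are ignored */
      if (!produced_matches_consumed_p (out_exp, in_set))
        return false;                   /* one mismatch disqualifies all */
    }
  return true;
}

/* (2) What I expected: one matching SET is enough to qualify.  */
static bool
any_producer_set_matches_p (rtx out_pat, rtx in_set)
{
  for (int i = 0; i < XVECLEN (out_pat, 0); i++)
    {
      rtx out_exp = XVECEXP (out_pat, 0, i);
      if (GET_CODE (out_exp) == CLOBBER)
        continue;
      if (produced_matches_consumed_p (out_exp, in_set))
        return true;                    /* found a producer/consumer pair */
    }
  return false;
}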

2. A "bigger" concern is that any time any SETs are buried within a
PARALLEL tree, I'm not sure the answer produced by this function, as
currently implemented, is at all reliable:

 a) PARALLEL does not necessarily mean all of its subtrees happen in
parallel on hardware.  It just means that there is no sequencing imposed
by the source code, so the final order in which the multiple subtrees
beneath the PARALLEL node will be emitted is not known at this stage of
compilation.

 b) It seems to me that it doesn't really make sense to speak of whether
a whole bunch of producers combined with a whole bunch of consumers
qualifies for an optimized store data bypass latency.  If we say that
they do qualify (as a group), which pair(s) of producer and consumer
machine instructions qualify?  It seems we need to know which producer
matches with which consumer in order to know where the bypass latencies
"fit" into the schedule.  (I sketch what I mean by per-pair information
after point (c) below.)

 c) Furthermore, suppose the "arbitrary" order in which the producer and
consumer instructions are emitted places too much "distance" between a
producer and its matching consumer.  Then it is possible that by the
time the hardware executes the consumer, the stored value is no longer
in the write buffer, so even though we "thought" the two PARALLEL rtl
expressions qualified for the store bypass optimization, we really
should have returned false.
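
To sharpen point (b): what I would expect the scheduler to need is
per-pair information, something with the shape below.  This is purely
illustrative (the predicate is the same placeholder as above, nothing
like this exists in recog.c as far as I know, and the fprintf, available
through GCC's system.h, just stands in for however the pairing would
actually be recorded).

/* Placeholder per-SET comparison, as before.  */
static bool produced_matches_consumed_p (rtx out_exp, rtx in_exp);

/* Walk two PARALLEL patterns and report which producer element feeds
   which consumer element, since a single yes/no answer for the whole
   pair of PARALLELs does not say where a bypass latency would apply.  */
static void
report_matching_pairs (rtx out_pat, rtx in_pat)
{
  for (int i = 0; i < XVECLEN (out_pat, 0); i++)
    {
      rtx out_exp = XVECEXP (out_pat, 0, i);
      if (GET_CODE (out_exp) != SET)
        continue;                       /* skip CLOBBERs and the like */
      for (int j = 0; j < XVECLEN (in_pat, 0); j++)
        {
          rtx in_exp = XVECEXP (in_pat, 0, j);
          if (GET_CODE (in_exp) != SET)
            continue;
          if (produced_matches_consumed_p (out_exp, in_exp))
            fprintf (stderr,
                     "producer element %d feeds consumer element %d\n",
                     i, j);
        }
    }
}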

Can someone help me understand this better?

Thanks much.


-- 
Kelvin Nilsen, Ph.D.  kdnil...@linux.vnet.ibm.com
home office: 801-756-4821, cell: 520-991-6727
IBM Linux Technology Center - PPC Toolchain
