My work on PR80101 is "motivating" me to modify the implementation of store_data_bypass_p (in gcc/recog.c).
I have a patch that bootstraps with no regressions. However, I think "regression" testing may not be enough to prove I got this right: if my new patch returns the wrong value, the outcome will be poor instruction scheduling decisions, which will hurt performance but probably not "correctness". So I'd like some help understanding the existing implementation of store_data_bypass_p.

To establish some context, here is what I think I understand about this function:

1. As input arguments, out_insn represents an rtl expression that potentially "produces" a store to memory, and in_insn represents an rtl expression that potentially "consumes" a value recently stored to memory.

2. If the memory store produced matches the memory fetch consumed, this function returns true to indicate that this sequence of two instructions qualifies for a special "bypass" latency, reflecting the fact that the fetch will obtain the value out of the write buffer. So, whereas the instruction scheduler might normally expect this sequence of two instructions to experience the Load-Hit-Store penalties associated with cache coherency hardware costs, because these two instructions qualify for the store_data_bypass optimization the scheduler counts the latency as only 1 or 2 cycles (potentially). [This is what I understand, but I may be wrong, so please correct me if so.] (Sketch 1 after my questions shows how I picture the scheduler using this answer.)

3. Actually, what I described above is only the "simple" case. The rtl for either out_insn or in_insn may really be a PARALLEL clause with multiple rtl trees beneath it. In that case, we compare the subtrees in a "similar" way to see whether the compound expressions qualify for the store_data_bypass_p "optimization". (I've got some questions about how this is done below.) As currently implemented, a CLOBBER subtree within either PARALLEL expression gets special handling: we ignore it. This is because a CLOBBER does not represent any real machine instruction; it just carries semantic information that might be used by the compiler. (Sketch 2 after my questions gives a small stand-alone model of the loop structure as I read it.)

In addition to seeking confirmation of my existing understanding of the code as outlined above, the specific questions that I am seeking help with are:

1. In the current implementation (as I understand it), near the top of the function body, we handle the case where the consumer (in_insn) rtl is a single SET expression and the producer (out_insn) rtl is a PARALLEL expression containing multiple SETs. The way I read this code, we require that every one of the producer's parallel SET instructions produce the same value that is to be consumed in order to qualify this sequence as a "store data bypass". That seems wrong to me: I would expect that only "one" of the produced values needs to match the consumed value in order to qualify for the "store data bypass" optimization. Please explain. (The same confusing behavior happens further down in the same function, in the case where the consumer rtl is a PARALLEL expression of multiple SETs: we require that every producer's stored value match every consumer's fetched value.) Sketch 2 below contrasts this with the behavior I expected.

2. A "bigger" concern is that any time any SETs are buried within a PARALLEL tree, I'm not sure the answer produced by this function, as currently implemented, is at all reliable:

a) PARALLEL does not necessarily mean that all of its subtrees happen in parallel on hardware. It just means that no sequencing is imposed by the source code, so the final order in which the multiple subtrees beneath the PARALLEL node will be emitted is not known at this stage of compilation.

b) It seems to me that it doesn't really make sense to speak of whether a whole bunch of producers combined with a whole bunch of consumers qualify for an optimized store data bypass latency. If we say that they do qualify (as a group), which pair(s) of producer and consumer machine instructions qualify? It seems we need to know which producer matches with which consumer in order to know where the bypass latencies "fit" into the schedule.

c) Furthermore, if it turns out that the "arbitrary" order in which the producer and consumer instructions are emitted places too much "distance" between a producer and its matching consumer, then it is possible that by the time the hardware executes the consumer, the stored value is no longer in the write buffer. So even though we might have "thought" two PARALLEL rtl expressions qualified for the store bypass optimization, we really should have returned false. (Sketch 3 below tries to make this concrete.)
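Sketch 1 - to make item 2 above concrete, here is how I picture the scheduler consuming the answer. This is just my own stand-alone illustration, not GCC code, and the cycle counts are numbers I made up rather than anything taken from a real pipeline description:

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical latencies, purely for illustration.  */
  #define BYPASS_LATENCY           2   /* the "1 or 2 cycles (potentially)" case */
  #define LOAD_HIT_STORE_PENALTY  12   /* made-up Load-Hit-Store cost            */

  /* As I picture it, the scheduler charges the cheap latency only when
     the producer/consumer pair qualifies for the store data bypass.  */
  static int
  effective_latency (bool qualifies_for_store_data_bypass)
  {
    return qualifies_for_store_data_bypass
           ? BYPASS_LATENCY : LOAD_HIT_STORE_PENALTY;
  }

  int
  main (void)
  {
    printf ("bypass: %d cycles, no bypass: %d cycles\n",
            effective_latency (true), effective_latency (false));
    return 0;
  }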
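Sketch 2 - a small stand-alone model of the loop structure as I read it, for item 3 and question 1. Again, this is not GCC code: the types and names below are hypothetical stand-ins. every_set_matches is the behavior I believe I am seeing (skip CLOBBER-like entries, and a single non-matching SET disqualifies the whole pair); any_set_matches is the behavior I would have expected instead:

  #include <stdbool.h>
  #include <stdio.h>

  /* Stand-ins for rtx codes; not the real GCC enums.  */
  enum toy_code { TOY_SET, TOY_CLOBBER };

  struct toy_exp
  {
    enum toy_code code;
    int value;               /* stand-in for the value a SET produces */
  };

  /* The behavior I believe I am seeing: skip CLOBBERs, and return false
     as soon as any SET fails to match the consumed value.  */
  static bool
  every_set_matches (const struct toy_exp *vec, int len, int consumed)
  {
    for (int i = 0; i < len; i++)
      {
        if (vec[i].code == TOY_CLOBBER)
          continue;                     /* CLOBBER: no real instruction */
        if (vec[i].value != consumed)
          return false;                 /* one mismatch disqualifies all */
      }
    return true;
  }

  /* The behavior I would have expected: one matching SET is enough.  */
  static bool
  any_set_matches (const struct toy_exp *vec, int len, int consumed)
  {
    for (int i = 0; i < len; i++)
      {
        if (vec[i].code == TOY_CLOBBER)
          continue;
        if (vec[i].value == consumed)
          return true;
      }
    return false;
  }

  int
  main (void)
  {
    /* A "PARALLEL" with two SETs and one CLOBBER; the consumer wants 10.  */
    struct toy_exp parallel[] =
      { { TOY_SET, 10 }, { TOY_CLOBBER, 0 }, { TOY_SET, 11 } };

    printf ("every_set_matches: %d\n", every_set_matches (parallel, 3, 10));
    printf ("any_set_matches:   %d\n", any_set_matches (parallel, 3, 10));
    return 0;
  }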
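Sketch 3 - my point (c) in toy form: if the emitted schedule ends up separating the matching producer and consumer by more than the number of slots for which a store lingers in the write buffer, the bypass assumption would be wrong. The residency number here is made up purely for illustration:

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical: how many instruction slots a store stays in the
     write buffer before it drains toward the cache.  */
  #define WRITE_BUFFER_RESIDENCY 4

  /* As I understand it, the bypass is only meaningful if the consumer
     executes while the producer's value is still in the write buffer.  */
  static bool
  bypass_still_valid (int producer_slot, int consumer_slot)
  {
    return consumer_slot > producer_slot
           && consumer_slot - producer_slot <= WRITE_BUFFER_RESIDENCY;
  }

  int
  main (void)
  {
    printf ("distance 2: %d\n", bypass_still_valid (10, 12));  /* still in buffer   */
    printf ("distance 7: %d\n", bypass_still_valid (10, 17));  /* already drained   */
    return 0;
  }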
Can someone help me understand this better? Thanks much.

-- 
Kelvin Nilsen, Ph.D.
kdnil...@linux.vnet.ibm.com
home office: 801-756-4821, cell: 520-991-6727
IBM Linux Technology Center - PPC Toolchain