Hi Yufeng,

On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
> Hi Bill,
>
> Many thanks for the review.
>
> I find your suggestion on using the next_interp field quite
> enlightening.  I prepared a patch which adds changes without modifying
> the framework.  With the patch, the slsr pass now tries to create a
> second candidate for each memory-accessing gimple statement, and chain
> it to the first one via the next_interp field.
>
> There are two implications in this approach though:
>
> 1) For each memory-accessing gimple statement, there can be two
> candidates, and these two candidates can be part of different
> dependency graphs (based on different base exprs).  Only one of the
> dependency graphs should be traversed to do replace_refs.  Most of the
> changes in the patch are there to handle this implication.
>
> I am aware that you suggested following the next_interp chain only
> when the search fails for the first interpretation.  However, that
> doesn't work very well, as it can result in worse code-gen.  Take a
> varied form of the added test slsr-41.c for example:
>
>   i1:  a2 [i] [j] = 1;
>   i2:  a2 [i] [j+1] = 2;
>   i3:  a2 [i+20] [j] = i;
>
> With the 2nd interpretation created conditionally, the following two
> dependency chains will be established:
>
>   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>   i1 --> i3  (base expr is a tree expression of (a2 + i * 200))
So it seems to me that really what needs to happen is to unify those two
base exprs.  We don't currently have logic in this pass to look up an
SSA name based on {base, index, stride, cand_type}, but that could be
done with a hash table.  For now, to save processing time, it would make
sense to do that only for MEM candidates, though the cand_type should be
included in the hash to allow this to be used for other candidate types
if necessary.  Of course, the SSA name definition must dominate the
candidate to be eligible as a basis, and that should be checked, but
this should generally be the case.

The goal should be for all of these references to have the same base
expr so that i3 can choose either i1 or i2 as a basis.  (For now the
logic in the pass chooses the most dominating basis, but eventually I
would like to add heuristics to make better choices.)  If all three of
these use the same base expr, that should eliminate your concerns,
right?
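To make the shape of that lookup concrete, here is a minimal sketch of the kind of hash-based unification I have in mind.  This is illustrative C++, not GCC's actual hash_table machinery; all type, field, and function names below are invented for the example:

```cpp
#include <map>
#include <string>
#include <tuple>

enum cand_kind { CAND_REF, CAND_MULT, CAND_ADD };

// Key for the lookup: candidates whose interpretations reduce to the
// same {base, index, stride, kind} should share one canonical base expr.
// Strings stand in for trees/SSA names purely for illustration.
using cand_key = std::tuple<std::string /*base*/, long /*index*/,
                            std::string /*stride*/, cand_kind>;

// Map from key to the SSA name recorded as the canonical definition.
static std::map<cand_key, std::string> base_cand_map;

// Return the canonical SSA name for KEY, registering NAME as the
// canonical definition if none is recorded yet.  A real implementation
// would also verify that the recorded definition dominates the
// candidate before using it.
std::string unify_base_expr(const cand_key &key, const std::string &name) {
  auto it = base_cand_map.find(key);
  if (it != base_cand_map.end())
    return it->second;          // reuse the existing SSA name
  base_cand_map.emplace(key, name);
  return name;
}
```

With something along these lines, i1's SSA name would be recorded first, and the tree-expression interpretation seen at i3 would be unified onto the same base expr, so i3 is free to choose either i1 or i2 as a basis.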
> Taking the new test slsr-41.c for example, the code-gen difference on
> x86_64 between the original patch and this patch is (-O2):
>
> -	leal	5(%rsi), %edx
> +	leal	5(%rsi), %eax
> 	movslq	%esi, %rsi
> -	salq	$2, %rsi
> -	movslq	%edx, %rax
> -	leaq	(%rax,%rax,4), %rax
> -	leaq	(%rax,%rax,4), %rcx
> -	salq	$3, %rcx
> -	leaq	(%rdi,%rcx), %rax
> -	addq	%rsi, %rax
> -	movl	$2, -1980(%rax)
> -	movl	%edx, 20(%rax)
> -	movl	%edx, 4024(%rax)
> -	leaq	-600(%rdi,%rcx), %rax
> -	addl	$1, 16(%rsi,%rax)
> +	imulq	$204, %rsi, %rsi
> +	addq	%rsi, %rdi
> +	movl	$2, -980(%rdi)
> +	movl	%eax, 1020(%rdi)
> +	movl	%eax, 5024(%rdi)
> +	addl	$1, 416(%rdi)
> 	ret
>
> As you can see, the larger offsets are produced because the affine
> expander is able to look deep into the tree expression.  This raises
> the concern that larger immediates can cause worse code-gen when the
> immediates are out of the supported range on a target.  On x86_64 it
> is not obvious (as it allows larger ranges), but on ARM Cortex-A15 the
> store with the immediate 5024 will be done by
>
>   movw	r2, #5024
>   str	r3, [r0, r2]
>
> which is not optimal.  Things can get worse when there are multiple
> loads/stores with large immediates, as each one may require an extra
> mov-immediate instruction.  One thing that could potentially be done
> is to reduce the strength of multiple large immediates later on in
> some RTL pass by doing an initial offsetting first?  What do you
> think?  Are you particularly concerned about this issue?

To me, this seems like something the middle end should not concern
itself with, but that may be oversimplifying.  I would think this is
not the only pass that can create such issues, and the overall code
generation should usually be improved anyway.  Richard, would you care
to weigh in here?

A couple of quick comments on the next_interp patch:

* You don't need num_of_dependents ().  You should be able to add a
  forward declaration for count_candidates () and use it.
* Your new test case is missing a final newline, so your patch doesn't
  apply cleanly.

Please look into unifying the base expressions, as I believe you should
not need the preferred_ref_cand logic if you do that.  I still prefer
the approach of using next_interp for its generality and expandability.

Thanks,
Bill
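P.S. To make the "initial offsetting" idea from your question concrete, here is a hypothetical C-level sketch.  This is not the RTL pass itself, and all names and offsets below are invented for illustration; the point is only that one large adjustment can be folded into the base register once, leaving small in-range displacements for the individual accesses:

```c
/* Each large offset may exceed the target's immediate offset range,
   costing an extra mov-immediate (e.g. movw) per access.  */
void stores_naive(char *p, int x) {
  *(int *)(p + 5024) = x;
  *(int *)(p + 5040) = x + 1;
  *(int *)(p + 5056) = x + 2;
}

/* One initial offset, then small displacements that fit in the
   addressing mode's immediate field.  */
void stores_offset(char *p, int x) {
  char *q = p + 5024;
  *(int *)(q + 0)  = x;
  *(int *)(q + 16) = x + 1;
  *(int *)(q + 32) = x + 2;
}
```

The idea would be to trade the per-access immediate materialization for a single add to the base, which is a win whenever two or more accesses share a large out-of-range offset region.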