Ok. Thanks, Richard.
On Tue, Sep 15, 2015 at 7:56 AM, Bin.Cheng <amker.ch...@gmail.com> wrote: > Just realized that I missed the updated patch before. Here it is... > > Thanks, > bin > > On Tue, Sep 8, 2015 at 6:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener >>> <richard.guent...@gmail.com> wrote: >>>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>> Hi, >>>>> This patch is a new approach to fix PR66388. IVO today computes iv_use >>>>> with >>>>> iv_cand which has at least same type precision as the use. On 64bit >>>>> platforms like AArch64, this results in different iv_cand created for each >>>>> address type iv_use, and register pressure increased. As a matter of >>>>> fact, >>>>> the BIV should be used for all iv_uses in some of these cases. It is a >>>>> latent bug but recently getting worse because of overflow changes. >>>>> >>>>> The original approach at >>>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue >>>>> except it conflict with IV elimination. Seems to me it is impossible to >>>>> mitigate the contradiction. >>>>> >>>>> This new approach fixes the issue by adding sizetype iv_cand for BIVs >>>>> directly. In cases if the original BIV is preferred, the sizetype iv_cand >>>>> will be chosen. As for code generation, the sizetype iv_cand has the same >>>>> effect as the original BIV. Actually, it's better because BIV needs to be >>>>> explicitly extended to sizetype to be used in address expression on most >>>>> targets. >>>>> >>>>> One shortage of this approach is it may introduce more iv candidates. To >>>>> minimize the impact, this patch does sophisticated code analysis and adds >>>>> sizetype candidate for BIV only if it is used as index. Moreover, it >>>>> avoids >>>>> to add candidate of the original type if the BIV is only used as index. >>>>> Statistics for compiling spec2k6 shows increase of candidate number is >>>>> modest and can be ignored. >>>>> >>>>> There are two more patches following to fix corner cases revealed by this >>>>> one. In together they bring obvious perf improvement for spec26k/int on >>>>> aarch64. >>>>> Spec2k6/int >>>>> 400.perlbench 3.44% >>>>> 445.gobmk -0.86% >>>>> 456.hmmer 14.83% >>>>> 458.sjeng 2.49% >>>>> 462.libquantum -0.79% >>>>> GEOMEAN 1.68% >>>>> >>>>> There is also about 0.36% improvement for spec2k6/fp, mostly because of >>>>> case >>>>> 436.cactusADM. I believe it can be further improved, but that should be >>>>> another patch. >>>>> >>>>> I also collected benchmark data for x86_64. Spec2k6/fp is not affected. >>>>> As >>>>> for spec2k6/int, though the geomean is improved slightly, 400.perlbench is >>>>> regressed by ~3%. I can see BIVs are chosen for some loops instead of >>>>> address candidates. Generally, the loop header will be simplified because >>>>> iv elimination with BIV is simpler; the number of instructions in loop >>>>> body >>>>> isn't changed. I suspect the regression comes from different addressing >>>>> modes. With BIV, complex addressing mode like [base + index << scale + >>>>> disp] is used, rather than [base + disp]. I guess the former has more >>>>> micro-ops, thus more expensive. This guess can be confirmed by manually >>>>> suppressing the complex addressing mode with higher address cost. >>>>> Now the problem becomes why overall cost of BIV is computed lower while >>>>> the >>>>> actual cost is higher. I noticed for most affected loops, loop header is >>>>> bloated because of iv elimination using the old address candidate. The >>>>> bloated loop header results in much higher cost than BIV. As a result, >>>>> BIV >>>>> is preferred. I also noticed the bloated loop header generally can be >>>>> simplified (I have a following patch for this). After applying the local >>>>> patch, the old address candidate is chosen, and most of regression is >>>>> recovered. >>>>> Conclusion is I think loop header bloated issue should be blamed for the >>>>> regression, and it can be resolved. >>>>> >>>>> Bootstrap and test on x64_64 and aarch64. It fixes failure of >>>>> gcc.target/i386/pr49781-1.c, without new breakage. >>>>> >>>>> So what do you think? >>>> >>>> The data above looks ok to me. >>>> >>>> +static struct iv * >>>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv) >>>> +{ >>>> + aff_tree aff; >>>> + struct expand_data exp_data; >>>> + >>>> + if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME) >>>> + return iv; >>>> + >>>> + /* Expand IV's ssa_name till the deriving biv is found. */ >>>> + exp_data.data = data; >>>> + exp_data.biv = NULL; >>>> + tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name), >>>> + &aff, &data->name_expansion_cache, >>>> + stop_expand, &exp_data); >>>> + return exp_data.biv; >>>> >>>> that's actually "abusing" tree_to_aff_combination_expand for simply walking >>>> SSA uses and their defs uses recursively until you hit "stop". ISTR past >>>> discussion to add a generic walk_ssa_use interface for that. Not sure if >>>> it >>>> materialized with a name I can't remember or whether it didn't. >>> Thanks for reviewing. I didn't found existing interface to walk up >>> definition chains of ssa vars. In this updated patch, I implemented a >>> simple function which meets the minimal requirement of walking up >>> definition chains of BIV variables. I also counted number of >>> no_overflow BIVs that are not used in address type use. Since >>> generally there are only two BIVs in a loop, this can prevent us from >>> visiting definition chains most of time. Statistics shows that number >>> of call to find_deriving_biv_for_expr plummet. >>> >>>> >>>> - add_candidate (data, iv->base, iv->step, true, NULL); >>>> + /* Check if this biv is used in address type use. */ >>>> + if (iv->no_overflow && iv->have_address_use >>>> + && INTEGRAL_TYPE_P (TREE_TYPE (iv->base)) >>>> + && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION >>>> (sizetype)) >>>> + { >>>> + tree type = unsigned_type_for (sizetype); >>>> >>>> sizetype is unsigned. >>> Fixed. >>> >>> Bootstrap and test on x86_64, is this OK? >>> >>> Thanks, >>> bin >> >> And here is the updated Changelog entry. >> >> >> 2015-09-08 Bin Cheng <bin.ch...@arm.com> >> >> PR tree-optimization/66388 >> * tree-ssa-loop-ivopts.c (struct iv, iv_cand, ivopts_data): New >> fields. >> (dump_iv): Dump no_overflow information. >> (alloc_iv): Initialize new field for struct iv. >> (mark_bivs): Count number of no_overflow bivs. >> (find_deriving_biv_for_expr, record_biv_for_address_use): New >> functions. >> (idx_find_step): Call new functions above. >> (add_candidate_1, add_candidate): New paramter. >> (add_iv_candidate_for_biv): Add sizetype cand for BIV. >> (get_computation_aff): Simplify convertion of cand for BIV. >> (get_computation_cost_at): Step cand's base if necessary.