On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This simple patch makes interchange even more conservative for small
>>>> loops with a constant initialized simple reduction.  The reason is
>>>> that undoing such a reduction introduces a new data reference and a
>>>> cond_expr, which can cost too much in a small loop.
>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK
>>>> if the test passes?
>>>
>>> Shouldn't we do this even for non-constant initialized simple
>>> reduction?  Because for any simple reduction we add two DRs that are
>>> not innermost; for constant initialized ones we add an additional
>>> cond-expr.  So ...
>>>
>>> +  /* Conservatively skip interchange in cases that have only a few data
>>> +     references and a constant initialized simple reduction, since it
>>> +     introduces a new data reference as well as a ?: operation.  */
>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>> +    return false;
>>> +
>>>
>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>> over m_reductions and classify them in this function accordingly?  I
>>> think we want to cost non-constant-init reductions as well.  The ?: can
>>> eventually count as another DR for cost purposes.
>> The number of non-constant-init reductions can still be carried in
>> struct loop_cand?  I am not sure what the advantage of an additional
>> loop over m_reductions computing the same information would be.
>> Perhaps the increase in stmts should be counted like:
>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>> The question is which number this should be compared against.  (We may
>> need to move num_new_inv_drs to the other side to avoid wrapping
>> issues.)
>>
>>>
>>> It looks like we do count the existing DRs for the reduction?  Is that
>>> why you arrive at the num_const_init_simple_reduc * 2 figure?  (one
>>> extra load plus one ?:)
>> Yes.
>>> But we don't really know whether the DR was invariant in the outer
>>> loop (well, I suppose
>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>> invariant references with regard to the inner loop, rather than the
>> outer loop.  The same goes for num_new_inv_drs, which counts references
>> that become invariant with regard to (the new) inner loop after the
>> interchange.  This invariance information is always known from the data
>> reference, right?  As for the DRs for the reduction, we know they are
>> invariant because we set their inner loop stride to zero.
>>
>>> we could remember the DR in m_reductions).
>>>
>>> Note that the good thing is that the ?: has an invariant condition and
>>> thus vectorization can hoist the mask generation out of the vectorized
>>> loop, which means it boils down to cheap operations.  My gut feeling is
>>> that just looking at the number of memory references isn't a good
>>> indicator of profitability, as the regular stmt workload has a big
>>> impact on the profitability of vectorization.
>> It's not specific to vectorization.  The generated new code also costs
>> too much in small loops without vectorization.  But yes, the number of
>> mem_refs may be too inaccurate; maybe we should check against
>> num_stmts.
>
> Not specific to vectorization, but the interchange may pay off only when
> vectorizing a loop.
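To make the cost concrete: at the source level, undoing a
constant-initialized simple reduction during interchange looks roughly
like the sketch below.  Names, bounds, and the exact rewritten form are
illustrative only, not the pass's actual GIMPLE output.

  #define N 32  /* illustrative bound */

  int a[N][N], c[N];

  /* Before interchange: a simple reduction with constant init.  */
  void
  before (void)
  {
    for (int i = 0; i < N; i++)
      {
        int sum = 0;  /* constant initialized */
        for (int j = 0; j < N; j++)
          sum += a[j][i];
        c[i] = sum;
      }
  }

  /* After interchange the scalar reduction is undone into memory: one
     new data reference (the load of c[i]) plus one cond_expr selecting
     the constant initializer on the first iteration of the new outer
     loop -- hence the num_const_init_simple_reduc * 2 figure.  */
  void
  after (void)
  {
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        c[i] = (j == 0 ? 0 : c[i]) + a[j][i];
  }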
> Would the loop in loop-interchange-5.c still be
> interchanged?  If we remove the multiplication and just keep
>   c[i][j] = c[i][j] + b[k][j];
Both loop-interchange-5.c and the modified version are interchanged,
because we check against the number of all data references (including
num_old_inv_drs):

  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
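For reference, the modified kernel would look something like the
following sketch (the dimensions and the surrounding nest are assumed,
not the actual testcase):

  #define M 256  /* illustrative bound */

  int b[M][M], c[M][M];

  void
  variant (void)
  {
    for (int i = 0; i < M; i++)
      for (int j = 0; j < M; j++)
        for (int k = 0; k < M; k++)
          c[i][j] = c[i][j] + b[k][j];
  }

Here there are three data references (the load and the store of
c[i][j], plus b[k][j]) and presumably num_old_inv_drs is 2 (the two
c[i][j] references are invariant in the innermost k loop), so
2 + 0 < 3 and the guard does not fire.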
> ?  That is, why is the constant init so special?  Even for non-constant
> init we're changing two outer loop DRs to two non-consecutive inner
> loop DRs.
No, the two outer loop DRs become consecutive with respect to the inner
loop.  So for a typical matrix multiplication case, the interchange
moves two outer loop DRs into the inner loop and moves one inner loop
DR out to the outer loop.  Overall it introduces one additional
instruction.  In terms of cache behavior it's even cheaper, because the
two new DRs access the same memory object.

As for pr62178.c, the assembly regressed from:

.L3:
        ldr     w3, [x0], 124
        ldr     w4, [x2, 4]!
        cmp     x0, x5
        madd    w1, w4, w3, w1
        bne     .L3

to:

.L3:
        ldr     w2, [x3, x0, lsl 2]
        cmp     w5, 1
        ldr     w1, [x4, x0, lsl 2]
        csel    w2, w2, wzr, ne
        madd    w1, w6, w1, w2
        str     w1, [x3, x0, lsl 2]
        add     x0, x0, 1
        cmp     x0, 31
        bne     .L3

without vectorization: two additional instructions for the cond_expr,
one additional memory reference for the interchange, and one additional
instruction because of ivopts/addressing mode choice.  (A source-level
sketch of this kernel follows after the ChangeLog below.)

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>>
>>> So no ack nor nack...
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-12-08  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>>         initialized with constant value.
>>>>         (should_interchange_loops): New parameter.  Skip interchange
>>>>         if loop has few data references and constant initialized
>>>>         simple reduction.
>>>>         (tree_loop_interchange::interchange): Update call to above
>>>>         function.
>>>>         (should_interchange_loop_nest): Ditto.
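P.S. A source-level sketch matching the shape of the pr62178.c kernel
discussed above; the bounds and array names are reconstructed from the
assembly, so they may not match the actual testcase exactly:

  int a[31][31], b[31], c[31];

  void
  foo (void)
  {
    for (int i = 0; i < 31; i++)
      {
        c[i] = 0;  /* constant initialized simple reduction */
        for (int j = 0; j < 31; j++)
          c[i] += a[j][i] * b[j];
      }
  }

After interchange the j loop becomes the outer loop, the c[i] = 0 init
becomes the csel against wzr seen above, and c[i] gains a load and a
store inside the inner loop.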