On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
> >
> > On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> > >
> > > Richard Biener <richard.guent...@gmail.com> writes:
> > > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> > > > <richard.guent...@gmail.com> wrote:
> > > >>
> > > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> > > >> <prathamesh.kulka...@linaro.org> wrote:
> > > >> >
> > > >> > Hi,
> > > >> > The attached patch tries to fix PR86753.
> > > >> >
> > > >> > For following test:
> > > >> > void
> > > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
> > > >> > {
> > > >> >   for (int i = 0; i < 100; ++i)
> > > >> >     x[i] = y[i] ? z[i] : 10;
> > > >> > }
> > > >> >
> > > >> > vect dump shows:
> > > >> >   vect_cst__42 = { 0, ... };
> > > >> >   vect_cst__48 = { 0, ... };
> > > >> >
> > > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> > > >> >   _4 = *_3;
> > > >> >   _5 = z_12(D) + _2;
> > > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> > > >> >   _35 = _4 != 0;
> > > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> > > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> > > >> >   iftmp.0_13 = 0;
> > > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > > >> > vect_iftmp.11_47, vect_cst__49>;
> > > >> >
> > > >> > and the following code-gen:
> > > >> > L2:
> > > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> > > >> >         cmpne   p1.s, p3/z, z0.s, #0
> > > >> >         cmpne   p0.s, p2/z, z0.s, #0
> > > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> > > >> >         sel     z0.s, p1, z0.s, z1.s
> > > >> >
> > > >> > We could reuse vec_mask_and_46 in the vec_cond_expr, since the conditions
> > > >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42 are
> > > >> > equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> > > >> >
> > > >> > I suppose that, in general, for vec_cond_expr <C, T, E>, if T comes from a
> > > >> > masked load that is conditional on C, then we could reuse the load's mask
> > > >> > in the vec_cond_expr?
> > > >> >
> > > >> > The patch maintains a hash_map, cond_to_vec_mask, from <cond, loop_mask>
> > > >> > to vec_mask (with the loop predicate applied).
> > > >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask &
> > > >> > loop_mask, and in vectorizable_condition we check whether <cond, loop_mask>
> > > >> > exists in cond_to_vec_mask; if found, the corresponding vec_mask is used
> > > >> > as the first operand of the vec_cond_expr.
> > > >> >
> > > >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch adds
> > > >> > tree_cond_ops to represent the condition's operator and operands, coming
> > > >> > either from a cond_expr or from a gimple comparison stmt. If the stmt is
> > > >> > not a comparison, it returns <ne_expr, lhs, 0>, and that is inserted into
> > > >> > cond_to_vec_mask.
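> > > >> >
> > > >> > For concreteness, a minimal sketch of what those types could look like
> > > >> > (field layout and names are my illustration of the description above,
> > > >> > not necessarily the patch's exact code; hashing traits for the key are
> > > >> > omitted):
> > > >> >
> > > >> > /* Condition operator and operands, taken either from a cond_expr
> > > >> >    or from a gimple comparison stmt (sketch only).  */
> > > >> > struct tree_cond_ops
> > > >> > {
> > > >> >   enum tree_code code;  /* Comparison code, e.g. NE_EXPR.  */
> > > >> >   tree op0, op1;        /* Comparison operands.  */
> > > >> > };
> > > >> >
> > > >> > /* Key pairing the scalar condition with the loop mask it was
> > > >> >    combined with (sketch only).  */
> > > >> > struct cond_vmask_key
> > > >> > {
> > > >> >   tree_cond_ops cond;
> > > >> >   tree loop_mask;
> > > >> > };
> > > >> >
> > > >> > /* Maps <cond, loop_mask> to vec_mask & loop_mask.  */
> > > >> > hash_map<cond_vmask_key, tree> *cond_to_vec_mask;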
> > > >> >
> > > >> > With the patch, the redundant p1 is eliminated and sel uses p0 for the
> > > >> > above test.
> > > >> >
> > > >> > For following test:
> > > >> > void
> > > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> > > >> > {
> > > >> >   for (int i = 0; i < 100; ++i)
> > > >> >     x[i] = y[i] ? z[i] : fallback;
> > > >> > }
> > > >> >
> > > >> > the input to the vectorizer has the operands swapped in the cond_expr:
> > > >> >   _36 = _4 != 0;
> > > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> > > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> > > >> >
> > > >> > So we also need to check cond_to_vec_mask for the inverted condition,
> > > >> > and swap the vec_cond_expr's operands if it is found.
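> > > >> >
> > > >> > Concretely, the lookup could end up roughly as below (a sketch only:
> > > >> > invert_tree_comparison is GCC's existing helper, while the variable
> > > >> > names and exact flow are illustrative, not the patch's code):
> > > >> >
> > > >> > /* Try the condition as recorded, then its inverse.  */
> > > >> > bool inverted = false;
> > > >> > tree *vec_mask = cond_to_vec_mask->get ({ { code, op0, op1 }, loop_mask });
> > > >> > if (!vec_mask)
> > > >> >   {
> > > >> >     enum tree_code inv_code = invert_tree_comparison (code, false);
> > > >> >     if (inv_code != ERROR_MARK)
> > > >> >       {
> > > >> >         vec_mask
> > > >> >           = cond_to_vec_mask->get ({ { inv_code, op0, op1 }, loop_mask });
> > > >> >         inverted = true;
> > > >> >       }
> > > >> >   }
> > > >> > if (vec_mask)
> > > >> >   {
> > > >> >     vec_compare = *vec_mask;  /* Reuse vec_mask & loop_mask.  */
> > > >> >     if (inverted)
> > > >> >       std::swap (then_clause, else_clause);
> > > >> >   }
> > > >> >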
> > > >> > Does the patch look OK so far?
> > > >> >
> > > >> > One major issue remaining with the patch is value numbering.
> > > >> > Currently, it value-numbers the entire function using sccvn at the start
> > > >> > of the vect pass, which is too expensive since we only need block-based
> > > >> > VN. I am looking into that.
> > > >>
> > > >> Why do you need it at all?  We run VN on the if-converted loop bodies 
> > > >> btw.
> > >
> > > This was my suggestion, but with the idea being to do the numbering
> > > per-statement as we vectorise.  We'll then see pattern statements too.
> > >
> > > That's important because we use pattern statements to set the right
> > > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
> > > So some of the masks we care about don't exist after if conversion.
> > >
> > > > Also I can't trivially see the equality of the masks and probably so
> > > > can't VN.  Is it that we just don't bother to apply loop_mask to
> > > > VEC_COND but there's no harm if we do?
> > >
> > > Yeah.  The idea of the optimisation is to decide when it's more profitable
> > > to apply the loop mask, even though doing so isn't necessary.  It would
> > > be hard to do after vectorisation because the masks aren't equivalent.
> > > We're relying on knowledge of how the vectoriser uses the result.
> > Hi,
> > Sorry for the late response. This is an updated patch that integrates
> > block-based VN into the vect pass.
> > The patch
> > (a) exports visit_stmt (renamed to vn_visit_stmt), plus vn_bb_init to
> > initialize the VN state and vn_bb_free to free it;
> > (b) calls vn_visit_stmt in vect_transform_stmt to value-number stmts.
> > We're only interested in obtaining value numbers, not in eliminating
> > redundancies.
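> >
> > The call site is essentially the following (a sketch: the exported
> > signature of vn_visit_stmt here is my assumption from the description
> > above, not necessarily the patch's exact interface):
> >
> > /* In vect_transform_stmt: value-number each vector stmt as it is
> >    emitted, so that vectorizable_condition can later look masks up by
> >    value.  Only done when cond_to_vec_mask has been allocated, i.e.
> >    during loop vectorization (sketch only).  */
> > if (cond_to_vec_mask)
> >   vn_visit_stmt (vec_stmt);
> >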
> > Does this look like the right direction?
>
> It looks a bit odd to me.  I'd have expected it to work by generating
> the stmts as before in the vectorizer and then, on the stmts we care
> about, invoking vn_visit_stmt so that it does both value-numbering and
> elimination.
> Alternatively you could ask the VN state to generate the stmt for
> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> work).  One complication might be availability if you don't value-number
> all stmts in the block, but well.  I'm not sure constraining it to a
> single block is necessary - I've thought about having a "CSE"ing
> gimple_build for some time (add & CSE new stmts onto a sequence; see
> the hypothetical interface sketched below), so one should keep that
> mode in mind when designing the one working on an existing BB.
> Note that, as you write it, it depends on visiting the stmts in proper
> order - is that guaranteed when, for example, vectorizing SLP?
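>
> (Purely to illustrate the "CSE"ing gimple_build idea - this interface
> does not exist today; the name and signature are made up:)
>
> /* Hypothetical: build CODE (OP0, OP1) of type TYPE onto SEQ, returning
>    an existing SSA name if an equal stmt was already value-numbered on
>    SEQ, and appending and value-numbering a new stmt otherwise.  */
> tree cseing_gimple_build (gimple_seq *seq, enum tree_code code,
>                           tree type, tree op0, tree op1);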
Hi,
Indeed, I wrote the function on the assumption that stmts would be
visited in proper order.
This doesn't affect SLP currently, because the call to vn_visit_stmt in
vect_transform_stmt is conditional on cond_to_vec_mask, which is only
allocated inside vect_transform_loop.
But I agree we could make it more general.
AFAIU, the idea of constraining VN to a single block was to avoid using
defs from non-dominating scalar stmts during outer-loop vectorization.

* fmla_2.c regression with the patch:
This happens because, with the patch, forwprop4 is able to convert all 3
vec_cond_exprs to .cond_fma (), which results in 3 calls to fmla and
regresses the test-case. If matching against the inverted condition is
disabled in vectorizable_condition, the old behavior is preserved.

Thanks,
Prathamesh
>
> > I am not sure if the initialization in vn_bb_init is entirely correct.
> >
> > PS: The patch seems to regress fmla_2.c. I am looking into it.
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard
