On Tue, Aug 16, 2011 at 6:35 PM, Artem Shinkarov
<artyom.shinkar...@gmail.com> wrote:
> On Tue, Aug 16, 2011 at 4:28 PM, Richard Guenther
> <richard.guent...@gmail.com> wrote:
>>>> Index: gcc/fold-const.c
>>>> ===================================================================
>>>> --- gcc/fold-const.c    (revision 177665)
>>>> +++ gcc/fold-const.c    (working copy)
>>>> @@ -9073,34 +9073,61 @@ fold_comparison (location_t loc, enum tr
>>>>      floating-point, we can only do some of these simplifications.)  */
>>>>    if (operand_equal_p (arg0, arg1, 0))
>>>>      {
>>>> -      switch (code)
>>>> +      if (TREE_CODE (TREE_TYPE (arg0)) == VECTOR_TYPE)
>>>>         {
>>>> -       case EQ_EXPR:
>>>> -         if (! FLOAT_TYPE_P (TREE_TYPE (arg0))
>>>> -             || ! HONOR_NANS (TYPE_MODE (TREE_TYPE (arg0))))
>>>> -           return constant_boolean_node (1, type);
>>>>
>>>> I think this change should go in a separate patch for improved
>>>> constant folding.  It shouldn't be necessary for enabling vector
>>>> compares, no?
>>>
>>> Unfortunately no, this case must be covered here, otherwise the
>>> x != x condition fails.
>>
>> How does it fail?
>
> When I have x > x, x == x, and so on, fold-const.c triggers
> operand_equal_p (arg0, arg1, 0), which returns true, and then it
> calls constant_boolean_node (<val>, type).  But the problem is that
> the result of the comparison is a vector, not a boolean.  So we have
> an assertion failure:
>
> test.c: In function ‘foo’:
> test.c:9:3: internal compiler error: in build_int_cst_wide, at tree.c:1222
> Please submit a full bug report,
> with preprocessed source if appropriate.

Ok, so we have to either avoid folding it (which would be a shame), or
define what true / false look like for vector typed comparison
results.

The documentation above the tree code definitions for comparisons in
tree.def needs updating then, with something like

  and the value is either the type used by the language for booleans
  or an integer vector type of the same size and with the same number
  of elements as the comparison operands.  True for a vector of
  comparison results has all bits set while false is equal to zero.

or some better wording.  Then changing constant_boolean_node to
return a proper true/false vector would be the fix for your problem.
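Something along these lines, I'd guess (an untested sketch only -- the
VECTOR_TYPE case is the new part, the rest is what fold-const.c
already does):

static tree
constant_boolean_node (int value, tree type)
{
  /* New: for a vector result build a {-1,...} / {0,...} constant,
     matching the semantics proposed for tree.def above.  */
  if (TREE_CODE (type) == VECTOR_TYPE)
    return build_vector_from_val (type,
                                  build_int_cst (TREE_TYPE (type),
                                                 value ? -1 : 0));
  if (type == integer_type_node)
    return value ? integer_one_node : integer_zero_node;
  else if (type == boolean_type_node)
    return value ? boolean_true_node : boolean_false_node;
  else
    return build_int_cst (type, value);
}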
>>>> +      /* Currently the expansion of VEC_COND_EXPR does not allow
>>>> +         expressions where the type of vectors you compare differs
>>>> +         from the type of vectors you select from.  For the time
>>>> +         being we insert implicit conversions.  */
>>>> +      if ((COMPARISON_CLASS_P (ifexp)
>>>>
>>>> Why only for comparison-class?
>>>
>>> Not only, there is || involved:
>>>
>>> (COMPARISON_CLASS_P (ifexp) && TREE_TYPE (TREE_OPERAND (ifexp, 0)) != type1)
>>> || TREE_TYPE (ifexp) != type1
>>>
>>> So if this is a comparison class, we check the first operand,
>>> because the result of the comparison fits, but the operands might
>>> not.  In case we have an expression of signed vector type, we know
>>> that we would transform it into exp != {0,0,...} in
>>> tree-vect-generic.c, but if the types of the operands do not match
>>> we convert them.
>>
>> Hm, ok ... let's hope we can sort out the backend issues before this
>> patch goes in so we can remove this converting stuff.
>
> Hm, I would hope that we could commit this patch even with this
> issue, because my feeling is that this case would produce errors on
> all the other architectures as well, as VEC_COND_EXPR is a feature
> heavily used by the auto-vectorizer.  So it means that all the
> backends must be fixed.  And another argument is that this conversion
> is harmless.

It shouldn't be hard to fix all the backends.  And if we don't do it
now it will never happen.  I would expect that the codegen part of the
backends doesn't need any adjustments, just the patterns that match
what is supported.  Uros, can you convert x86 as an example?

Thus, for

(define_expand "vcond<mode>"
  [(set (match_operand:VF 0 "register_operand" "")
        (if_then_else:VF
          (match_operator 3 ""
            [(match_operand:VF 4 "nonimmediate_operand" "")
             (match_operand:VF 5 "nonimmediate_operand" "")])
          (match_operand:VF 1 "general_operand" "")
          (match_operand:VF 2 "general_operand" "")))]
  "TARGET_SSE"
{
  bool ok = ix86_expand_fp_vcond (operands);
  gcc_assert (ok);

allow any vector mode of the same size (and same number of elements?)
for the vcond mode and operands 1 and 2?  Thus, only restrict the
embedded comparison to VF?
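For concreteness, the mixed-mode case this is all about would be
something like the following (a made-up example using the vector
extension syntax the patch enables, not from the testsuite):

#define vector(elcount, type) \
__attribute__((vector_size((elcount)*sizeof(type)))) type

/* The comparison is carried out on V4SF operands while the selection
   happens on V4SI values -- same size and number of elements, but
   different modes, which the current vcond<mode> pattern cannot
   express without the implicit conversions quoted above.  */
vector (4, int)
bar (vector (4, float) a, vector (4, float) b,
     vector (4, int) i0, vector (4, int) i1)
{
  return a < b ? i0 : i1;
}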
> So I really hope that someone could shed some light or help me with
> this issue, but even if not I think that the current conversion is
> ok.  However, I don't have any architectures different from x86.

[...]

>>>> +  /* Expand VEC_COND_EXPR into a vector of scalar COND_EXPRs.  */
>>>> +  v = VEC_alloc (constructor_elt, gc, nunits);
>>>> +  for (i = 0; i < nunits;
>>>> +       i += 1, index = int_const_binop (PLUS_EXPR, index, part_width))
>>>> +    {
>>>> +      tree tcond = tree_vec_extract (gsi, inner_type, var, part_width, index);
>>>> +      tree a = tree_vec_extract (gsi, inner_type, vec0, part_width, index);
>>>> +      tree b = tree_vec_extract (gsi, inner_type, vec1, part_width, index);
>>>> +      tree rcond = gimplify_build2 (gsi, NE_EXPR, boolean_type_node, tcond,
>>>> +                                    build_int_cst (inner_type, 0));
>>>>
>>>> I seriously doubt that when expanding this part piecewise expanding
>>>> the mask first in either way is going to be beneficial.  Instead I
>>>> would suggest to "inline" the comparison here.  Thus instead of
>>>
>>> Well, the thing is that if expand_vector_comparison would insert a
>>> builtin there rather than expanding the code piecewise, I'll have to
>>> do the comparison with 0 anyway, because true is expressed as -1
>>> there.
>>>
>>> Well, I would hope that in case we have:
>>>
>>> c_0 = a_0 > b_0;
>>> d_0 = c_0 != 0;
>>>
>>> {d_0, d_1,...}
>>>
>>> all the d_n should be constant-folded, or should I pull fold
>>> explicitly here?
>>>
>>> 1) I construct the mask
>>>>
>>>> mask =
>>>> = { mask[0] != 0 ? ... }
>>>>
>>>> do
>>>>
>>>> = { c0[0] < c1[0] ? ..., }
>>>>
>>>> or even expand the ? : using mask operations if we can efficiently
>>>> create that mask.
>>>
>>> I assume that if we cannot expand VEC_COND_EXPR, then masking the
>>> elements is a problem for us.  Otherwise VEC_COND_EXPR expansion has
>>> a bug somewhere.  Or am I wrong somewhere?
>>
>> I think we can always do bitwise operations, so if we can get at the
>> mask vector we are fine.
>>
>> I was thinking about how the case of explicitly computing the value
>> of v1 < v2 into a vector vs. a condition inside a VEC_COND_EXPR
>> should be handled.  If the target produces a mask of condition codes
>> for a comparison then it might be able to efficiently expand a
>> VEC_COND_EXPR.  It could as well generate a mask via expanding
>> v1 < v2 ? -1 : 0 then.  A similar case is AMD XOP, which can expand
>> mask ? v1 : v2 with a single instruction (so even without seeing a
>> comparison).
>>
>> Basically, if we can get at the mask we should use that to do the
>> vector selection in parallel via (v1 & mask) | (v2 & ~mask).
>>
>> If we cannot even get at the mask then we can build the result
>> vector piecewise as { v1[0] < v2[0] ? v1[0] : v2[0], .... } etc.
>
> Ok, I am perfectly fine with constructing (v1 & mask) | (v2 & ~mask);
> the question is whether I need to check (v1 & mask) and (v2 & mask)
> or whether I can just blindly insert it.  The problem is that we have
> a single veclower pass, so if I insert something that needs
> expansion, we would not have a second chance to expand it again.

I think you need to re-lower them, thus, insert a stmt and then
immediately lower it.
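The fallback itself is then just a handful of statements; a rough
sketch (the function name is made up, and an integer vector type is
assumed -- float element types would need a VIEW_CONVERT_EXPR to an
integer vector type around the bitwise operations):

/* Build (V0 & MASK) | (V1 & ~MASK) at GSI, assuming MASK is a vector
   of all-ones / all-zeros elements of the same integer vector TYPE as
   V0 and V1.  Each statement built here would still need to be
   re-lowered as described above.  */
static tree
expand_vec_cond_via_mask (gimple_stmt_iterator *gsi, tree type,
                          tree mask, tree v0, tree v1)
{
  tree sel0 = gimplify_build2 (gsi, BIT_AND_EXPR, type, v0, mask);
  tree notmask = gimplify_build1 (gsi, BIT_NOT_EXPR, type, mask);
  tree sel1 = gimplify_build2 (gsi, BIT_AND_EXPR, type, v1, notmask);
  return gimplify_build2 (gsi, BIT_IOR_EXPR, type, sel0, sel1);
}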
> I'll adjust the patch.
>
>>>> +  /* Check if VEC_COND_EXPR is supported in hardware within the
>>>> +     given types.  */
>>>> +  if (code == VEC_COND_EXPR)
>>>> +    {
>>>> +      tree exp = gimple_assign_rhs1 (stmt);
>>>> +      tree cond = TREE_OPERAND (exp, 0);
>>>> +
>>>> +      /* If VEC_COND_EXPR is presented as A ? V0 : V1, we
>>>> +         change it to A != {0,0,...} ? V0 : V1  */
>>>> +      if (!COMPARISON_CLASS_P (cond))
>>>> +        TREE_OPERAND (exp, 0) =
>>>> +          build2 (NE_EXPR, TREE_TYPE (cond), cond,
>>>> +                  build_vector_from_val (TREE_TYPE (cond),
>>>> +                    build_int_cst (TREE_TYPE (TREE_TYPE (cond)), 0)));
>>>>
>>>> That looks inefficient as well.  Iff we know that the mask is
>>>> always either {-1, -1 ..} or {0, 0 ...} then we can expand the ? :
>>>> using bitwise operations (see what the i?86 expander does, for
>>>> example).
>>>
>>> This is a requirement of VEC_COND_EXPR; I need to pass 4 parameters,
>>> not 3, that is why I introduce this fake {0,0,..} here.
>>
>> Sure, but if you look at expand_vec_cond_expr_p then you don't need
>> that, and this fake comparison should instead be produced by the
>> expander (or really avoided by maybe splitting up the named pattern
>> into two).
>>
>> It's for sure not necessary for earlyexpand_vec_cond_expr (but
>> instead makes it less efficient - with just the mask it can do the
>> bitwise fallback easily).
>
> Richard, let me give you an example:
>
> #define vector(elcount, type) \
> __attribute__((vector_size((elcount)*sizeof(type)))) type
>
> int
> foo (vector (4, int) i0, vector (4, int) i1, int x)
> {
>   i0 = i0 ? i1 : i0;
>   return i0[x];
> }
>
> When we optimize i0 ? i1 : i0, expand_vec_cond_expr_p happily accepts
> that and says that it can expand this expression.  Now after the
> veclowering is done, expand_vec_cond_expr calls vector_compare_rtx
> (op0, unsignedp, icode), which has an assertion:
>
> gcc_assert (COMPARISON_CLASS_P (cond));
>
> and of course it fails.

Yes, it's totally non-robust ;)  Specifically tailored for the
vectorizer so far.

> So someone needs to insert the != {0,0...} expression.  I do it in
> tree-vect-generic, but it could be done in expand_vec_cond_expr.  The
> question is where?

For this obvious case in expand_vec_cond_expr.  It should handle what
expand_vec_cond_expr_p claims to handle ;)

> I can agree with you that I don't need to put this mask in case I
> expand vcond piecewise, I will adjust that, but actually it does not
> make much of a difference in case the expansion uses (v0 & mask) |
> (v1 & ~mask).
>
> Am I wrong somewhere?

Just in the place that should need fixing.

>>>> +/* Find target specific sequence for vector comparison of
>>>> +   real-type vectors V0 and V1.  Returns variable containing
>>>> +   result of the comparison or NULL_TREE otherwise.  */
>>>> +static tree
>>>> +vector_fp_compare (gimple_stmt_iterator *gsi, tree rettype,
>>>> +                   enum machine_mode mode, tree v0, tree v1,
>>>> +                   enum tree_code code)
>>>> +{
>>>> +  enum ix86_builtins fcode;
>>>>
>>>> is there a reason we need this and cannot simply provide expanders
>>>> for the named patterns?  We'd need to give them semantics of
>>>> producing all-ones / all-zero masks of course.  Richard, do you
>>>> think that's sensible?  That way we'd avoid the new target hook and
>>>> could simply do optab queries.
>>>
>>> I think I don't really understand the idea.  How are we going to
>>> represent the fact that we need to convert a given node to the given
>>> machine instruction?  Maybe you could point to where a similar
>>> technique is already used.
>>
>> In all places we check optab_handler (op, mode) != CODE_FOR_nothing.
>> We have eq_optab for example, so optab_handler (eq_optab, V4SImode)
>> would get you the instruction sequence for a comparison of V4SImode
>> vectors.  What that should return isn't properly defined yet, though.
>>
>> Otherwise I'd say we should ask the target to expand v1 < v2 as
>> VEC_COND_EXPR (v1 < v2, -1, 0) instead.  That one could as well
>> special-case the -1 and 0 result vectors (and maybe it already does).
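Spelled out, the optab-based check would look something like this (a
sketch only; vector_compare_supported_p is a made-up name, the
tree-code-to-optab mapping is the obvious one, and it assumes the
comparison optabs get defined for vector modes in the first place):

/* Return true if the target can expand a CODE comparison of vectors
   of TYPE directly, using the existing comparison optabs instead of a
   new target hook.  */
static bool
vector_compare_supported_p (enum tree_code code, tree type)
{
  enum machine_mode mode = TYPE_MODE (type);
  optab op;

  switch (code)
    {
    case EQ_EXPR: op = eq_optab; break;
    case NE_EXPR: op = ne_optab; break;
    case LT_EXPR: op = lt_optab; break;
    case LE_EXPR: op = le_optab; break;
    case GT_EXPR: op = gt_optab; break;
    case GE_EXPR: op = ge_optab; break;
    default:
      return false;
    }

  return optab_handler (op, mode) != CODE_FOR_nothing;
}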
> Ok, I can adjust the optab checking for the mode, but I recall that
> we introduced the hook exactly because optabs did not return anything
> sensible.  It was your idea :)

Heh, I don't remember ...

Still it's probably easiest (for now) to handle

  vec = v1 < v2;

the same as we would handle

  vec = v1 < v2 ? {-1,...} : {0,...};

during lowering (and even for expanding).  I tried to convince the
vectorizer to create a vectorized stand-alone comparison but failed,
so it's probably un-tested territory anyway.

At least the above would reduce the patch size considerably.

> Also, I don't like the idea of expanding any comparison to
> VEC_COND_EXPR (v1 < v2, -1, 0).  Look at expand_vec_cond_expr: it
> would do the job only if there is a vcond instruction in the
> architecture; it checks for direct_optab_handler (vcond_optab, mode).
> But it is not necessarily the case that using vcond is as efficient
> as using comparison instructions.  Also, we could run into the
> situation when vcond is not supported, but comparison is, or can't
> we?

That's unlikely as the vcond pattern is required by the vectorizer and
all targets implement it if they can handle the comparison part.  The
rest is just emulated by the target using the bitwise operation trick.

> Anyhow, I would think that we want to keep vcond and comparison
> separate.

Sure, but I think as the vcond case is in place already we can
optimize the comparison case separately if needed.  I would expect
that the code generated with your patch for

  v = v1 < v2;
  v = v1 < v2 ? {-1,...} : {0,...};

should be the same?

Thanks,
Richard.
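P.S.: As a compilable statement of that last expectation, using the
vector extension syntax the patch enables (a made-up test, not from
the patch's testsuite):

#define vector(elcount, type) \
__attribute__((vector_size((elcount)*sizeof(type)))) type

/* Stand-alone comparison; each element becomes -1 where
   v1[i] < v2[i] holds and 0 elsewhere.  */
vector (4, int)
cmp_direct (vector (4, int) v1, vector (4, int) v2)
{
  return v1 < v2;
}

/* The same selection spelled as a conditional, i.e. a VEC_COND_EXPR
   with explicit {-1,...} / {0,...} arms.  */
vector (4, int)
cmp_via_cond (vector (4, int) v1, vector (4, int) v2)
{
  vector (4, int) m1 = {-1, -1, -1, -1};
  vector (4, int) zero = {0, 0, 0, 0};
  return v1 < v2 ? m1 : zero;
}

Both functions should compile to identical code if the lowering is
consistent.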