On Thu, 17 Oct 2013, Mike Stump wrote:

> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguent...@suse.de> wrote:
> > [This function shows another optimization issue:
> > 
> >    case BOOLEAN_TYPE:
> >      /* Cache false or true.  */
> >      limit = 2;
> >      if (wi::leu_p (cst, 1))
> >        ix = cst.to_uhwi ();
> > 
> > I would have expected cst <= 1 to be optimized to cst.len == 1
> > && cst.val[0] <= 1.
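> > 
> > A sketch of the expansion I have in mind (only for the
> > small-precision case, where the canonical len is always 1; it uses
> > the existing zext_hwi helper from hwint.h, and the names are
> > illustrative):
> > 
> >    /* Fast path for leu_p (x, c), with c a small non-negative
> >       constant: the unsigned value of x is its single element
> >       zero-extended from the precision.  */
> >    if (x.precision <= HOST_BITS_PER_WIDE_INT)
> >      return zext_hwi (x.val[0], x.precision)
> >             <= (unsigned HOST_WIDE_INT) c;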
> 
> So lts_p has the same problem, and is used just below.  If we do:
> 
> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>  inline bool
>  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>  {
> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
> -    return x.slow () < y.slow ();
> -  else
> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> -                       y.precision);
> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
> +    {
> +      // If x fits in a shwi, we can compare the values directly.
> +      if (wi::fits_shwi_p (x))
> +       return x.slow () < y.slow ();
> +      // If it doesn't fit, then it must be more negative than any
> +      // value in y, and hence smaller.
> +      if (neg_p (x, SIGNED))
> +       return true;
> +      // If it is positive, then it must be larger than any value in y,
> +      // and hence greater.
> +      return false;
> +    }
> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> +                     y.precision);
>  }
> 
> then for:
> 
> bool MGEN(wide_int cst) {
>   return wi::lts_p (cst, 100);
> }
> 
> we generate:
> 
>         movl    520(%rsp), %eax
>         cmpl    $1, %eax
>         je      .L5283
>         subl    $1, %eax
>         movq    8(%rsp,%rax,8), %rax
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5283:
>         cmpq    $99, 8(%rsp)
>         setle   %al
>         ret
> 
> which, as you can see, never calls lts_p_large; hence, if that was the 
> only reason the storage escaped, it should be able to optimize better.  
> The code above is minimal: no function calls, and the 100 is propagated 
> all the way down into the cmpq.  The reason we should do it this way is 
> that, most of the time, most types are 64 bits or fewer, and when that 
> is true, lts_p_large is never called.
> 
> If the above 100 is changed to l, taken from a parameter long l, then the 
> code is the same except that the last part is:
> 
>         cmpq    8(%rsp), %rdi
>         setg    %al
>         ret
> 
> If we have two wide_ints, then the code does this:
> 
>         .cfi_startproc
>         subq    $8, %rsp
>         .cfi_def_cfa_offset 16
>         movl    1052(%rsp), %r9d
>         movl    1048(%rsp), %r8d
>         movl    532(%rsp), %edx
>         movl    528(%rsp), %esi
>         cmpl    $64, %r9d
>         ja      .L5281
>         cmpl    $1, %esi
>         je      .L5284
>         subl    $1, %esi
>         movq    16(%rsp,%rsi,8), %rax
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5281:
>         .cfi_restore_state
>         leaq    536(%rsp), %rcx
>         leaq    16(%rsp), %rdi
>         call    _ZN2wi11lts_p_largeEPKljjS1_jj
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5284:
>         .cfi_restore_state
>         movq    536(%rsp), %rax
>         cmpq    %rax, 16(%rsp)
>         setl    %al
>         addq    $8, %rsp
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
> 
> which is still pretty nice.
> 
> Does this look OK?  Kenny, can you double-check the cases?  I think I 
> have them right, but a second look would be good.

That works for me.

> > the representation should guarantee the
> > compare with a low precision (32 bit) constant is evaluatable
> > at compile-time if len of the larger value is > 1, no?
> 
> I agree, though I didn't check the symmetric case, since constants and 
> smaller values can be put on the right by convention.
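> 
> For instance (a sketch, relying only on the canonical representation):
> in an equality test against a constant that fits in a single
> HOST_WIDE_INT, a len greater than 1 on the other side already decides
> the answer:
> 
>   /* x needs more than one element, so it cannot equal a value
>      whose canonical form is a single element.  */
>   if (x.len > 1)
>     return false;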
>
> > The question is whether we should try to optimize wide-int for
> > such cases or simply not use wi::leu_p (cst, 1) but rather
> > 
> > if (cst.fits_uhwi_p () && cst.to_uhwi () <= 1)
> > 
> > ?
> 
> Since we can do an excellent job with the simple interface, I'd argue 
> for the simple interface and the simple change to make the conditional 
> better.  I'd rather up the ante on metaprogramming to get any 
> performance we want while retaining the simple interfaces.
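> 
> For example, the kind of thing I mean (a hypothetical extra overload,
> not part of the patch, reusing only the predicates above): give
> primitive integer right-hand sides their own entry point, so the
> precision test folds away entirely:
> 
>   inline bool
>   wi::lts_p (const wide_int_ref &x, long y)
>   {
>     // A long always fits in HOST_BITS_PER_WIDE_INT bits, so the
>     // fast path is unconditional.
>     if (wi::fits_shwi_p (x))
>       return x.slow () < y;
>     // Otherwise x is either more negative or more positive than
>     // any long can be.
>     return neg_p (x, SIGNED);
>   }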

Agreed.  Note that I'd make heavy use of __builtin_constant_p
in the inline implementation, since if we cannot optimize the choice
at compile time we'll both slow things down and bloat the code
to the extent that the inliner will no longer consider it
for inlining.  That is,

> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)

should be something like

   if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
     ...

and evaluate to false if it doesn't evaluate to a compile-time
constant true.  The ultimate 'else' case then of course has
to work for all cases.

If I got it right then sth like

#define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)

should do the job (possibly with some statement-expression
magic to evaluate x only once).
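
Something like this sketch of the statement-expression variant (a GNU
extension; _x is just an illustrative temporary, and whether
__builtin_constant_p still folds through it depends on the optimizer):

#define CONSTANT(x) \
  ({ __typeof__ (x) _x = (x); \
     __builtin_constant_p (_x) ? _x : 0; })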

That said, the various tree predicates in tree.c should
ultimately just work on the tree rep - they are basic
enough that they don't need to go through wide-ints.  So,
instead of

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return wi::eq_p (expr, 0);

I'd put there

    case INTEGER_CST:
      return TREE_INT_CST_NUNITS (expr) == 1
             && TREE_INT_CST_ELT (expr, 0) == 0;
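
For reference, the whole predicate would then read like this sketch
(assembling the pieces above; the remaining cases are unchanged and
elided here):

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return TREE_INT_CST_NUNITS (expr) == 1
             && TREE_INT_CST_ELT (expr, 0) == 0;
    /* ... COMPLEX_CST etc. as before ...  */
    }
}

(This works for zero in particular because a zero of any precision is
canonically represented with a single zero element.)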

I just used the existing code as a convenient, simple user of
wide-ints to inspect code generation (originally I looked at
tree-dfa.c:get_ref_base_and_extent, which is timing-critical
but somewhat convoluted).

So before the merge we should adjust the tree.c changes on the
branch appropriately.

Richard.
