On 10/18/2013 04:45 AM, Richard Biener wrote:
On Thu, 17 Oct 2013, Mike Stump wrote:

On Oct 17, 2013, at 1:46 AM, Richard Biener <rguent...@suse.de> wrote:
[This function shows another optimization issue:

    case BOOLEAN_TYPE:
      /* Cache false or true.  */
      limit = 2;
      if (wi::leu_p (cst, 1))
        ix = cst.to_uhwi ();

I would have expected cst <= 1 to be optimized to cst.len == 1 &&
cst.val[0] <= 1.
So lts_p has the same problem, and is used just below.  If we do:

@@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
  inline bool
  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
  {
-  if (x.precision <= HOST_BITS_PER_WIDE_INT
-      && y.precision <= HOST_BITS_PER_WIDE_INT)
-    return x.slow () < y.slow ();
-  else
-    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
-                       y.precision);
+  // We optimize w < N, where N is 64 or fewer bits.
+  if (y.precision <= HOST_BITS_PER_WIDE_INT)
+    {
+      // If it fits directly into a shwi, we can compare directly.
+      if (wi::fits_shwi_p (x))
+       return x.slow () < y.slow ();
+      // If it doesn't fit, then it must be more negative than any
+      // value in y, and hence smaller.
+      if (neg_p (x, SIGNED))
+       return true;
+      // If it is positive, then it must be larger than any value in y,
+      // and hence greater.
+      return false;
+    }
+  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
+                     y.precision);
  }

then for:

bool MGEN(wide_int cst) {
   return wi::lts_p (cst, 100);
}

we generate:

         movl    520(%rsp), %eax
         cmpl    $1, %eax
         je      .L5283
         subl    $1, %eax
         movq    8(%rsp,%rax,8), %rax
         shrq    $63, %rax
         ret
         .p2align 4,,10
         .p2align 3
.L5283:
         cmpq    $99, 8(%rsp)
         setle   %al
         ret

which, as you can see, never calls lts_p_large, and hence, if that was the only 
reason the storage escapes, it should be able to optimize better.  The code 
above is minimal, has no function calls, and the 100 is propagated all the way 
down into the cmpq.  The reason to do it this way is that, most of the time, 
most types are 64 bits or less; when that is true, lts_p_large is never called.

If the above 100 is changed to l, where l is a parameter of type long, then the 
code is the same except that the last part is:

         cmpq    8(%rsp), %rdi
         setg    %al
         ret

If we have two wide_int values, then the code does this:

         .cfi_startproc
         subq    $8, %rsp
        .cfi_def_cfa_offset 16
         movl    1052(%rsp), %r9d
         movl    1048(%rsp), %r8d
         movl    532(%rsp), %edx
         movl    528(%rsp), %esi
         cmpl    $64, %r9d
         ja      .L5281
         cmpl    $1, %esi
        je      .L5284
         subl    $1, %esi
         movq    16(%rsp,%rsi,8), %rax
         addq    $8, %rsp
         .cfi_remember_state
         .cfi_def_cfa_offset 8
         shrq    $63, %rax
         ret
         .p2align 4,,10
         .p2align 3
.L5281:
         .cfi_restore_state
         leaq    536(%rsp), %rcx
         leaq    16(%rsp), %rdi
         call    _ZN2wi11lts_p_largeEPKljjS1_jj
        addq    $8, %rsp
         .cfi_remember_state
         .cfi_def_cfa_offset 8
         ret
         .p2align 4,,10
         .p2align 3
.L5284:
         .cfi_restore_state
         movq    536(%rsp), %rax
         cmpq    %rax, 16(%rsp)
         setl    %al
         addq    $8, %rsp
         .cfi_def_cfa_offset 8
         ret
         .cfi_endproc

which is still pretty nice.

Does this look ok?  Kenny, can you double check the cases?  I think I have them 
right, but a double check would be good.
That works for me.

I talked to Mike about this last night, but did not follow up with an email until now.  The problem is that this code is wrong!  He is working to fix it, so I would expect something from him later (tomorrow for those in Europe).  The problem is when the variable that comes in is an unsigned HWI with the top bit set.  In that case, the wide-int canonicalized version has two HWIs, the top one being a zero block, so you cannot take the fast path on it even though its precision fits.
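To make that failure mode concrete, here is a small self-contained sketch (a toy model written for this mail, not GCC's wide-int code; the struct and function names are invented) of the canonicalization rule being described: an unsigned HWI with the top bit set needs an extra zero block so the signed-canonical value reads as non-negative, and that second block is what defeats a fast path keyed on a single element:

#include <cstdint>
#include <cstdio>

/* Toy model: two blocks are enough for this illustration.  */
struct toy_wide_int
{
  int64_t val[2];   /* blocks, least significant first */
  unsigned len;     /* number of significant blocks */
};

/* Canonicalize an unsigned 64-bit value: if the top bit is set, append
   a zero block so the (signed) representation stays non-negative.  */
static toy_wide_int
canonicalize_uhwi (uint64_t x)
{
  toy_wide_int w;
  w.val[0] = (int64_t) x;
  w.len = 1;
  if ((int64_t) x < 0)
    {
      w.val[1] = 0;
      w.len = 2;
    }
  return w;
}

int
main ()
{
  toy_wide_int a = canonicalize_uhwi (100);
  toy_wide_int b = canonicalize_uhwi (0x8000000000000000ULL);
  /* a.len == 1, but b.len == 2 even though both values fit in 64 bits
     of precision, so a "len == 1" fast path misses b.  */
  printf ("len(100) = %u, len(1 << 63) = %u\n", a.len, b.len);
  return 0;
}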
the representation should guarantee that the compare with a
low-precision (32-bit) constant is evaluatable at compile-time
if the len of the larger value is > 1, no?
I agree, though I didn't check the symmetric case, as constants and
smaller values can be put on the right by convention.

The question is whether we should try to optimize wide-int for
such cases or simply not use wi::leu_p (cst, 1) but rather

if (cst.fits_uhwi_p () && cst.to_uhwi () <= 1)

?
Since we can do an excellent job with the simple interface, I'd argue
for the simple interface and the simple change to do the conditional
better.  I'd rather up the ante on meta programming to get any
performance we want, retaining the simple interfaces.
Agreed.  Note that I'd make heavy use of __builtin_constant_p
in the inline implementation, as if we cannot optimize the choice
at compile-time we'll both slow things down and make the code
bloated to the extent that the inliner will no longer consider
it for inlining.  That is,

+  // We optimize w < N, where N is 64 or fewer bits.
+  if (y.precision <= HOST_BITS_PER_WIDE_INT)
should be something like

    if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
     ...

and evaluate to false if it doesn't evaluate to a compile-time
constant true.  The ultimate 'else' case then of course has
to work for all cases.

If I got it right then something like

#define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)

should do the job (possibly with some statement-expression magic
to evaluate x only once).
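
For reference, one possible spelling of that statement-expression
variant (a sketch only: it assumes GNU C statement expressions and
__typeof__, and it relies on the optimizers propagating the constant
into the temporary before __builtin_constant_p is folded, so at -O0
it simply falls back to the generic path):

/* Sketch: evaluate X exactly once via a statement expression; when the
   value is not known at compile time this yields 0 (false).  */
#define CONSTANT(X) \
  ({ __typeof__ (X) constant_tmp_ = (X); \
     __builtin_constant_p (constant_tmp_) ? constant_tmp_ : 0; })

With that, the guard in lts_p could be written as
if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT)) and would
degrade to the out-of-line path whenever the precision is not known
at compile time.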

That said, the various tree predicates in tree.c should
ultimately just work on the tree rep - they are basic
enough that they don't need to go through wide-ints.  So,
instead of

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return wi::eq_p (expr, 0);

I'd put there

    case INTEGER_CST:
      return (TREE_INT_CST_NUNITS (expr) == 1
              && TREE_INT_CST_ELT (expr, 0) == 0);
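
Spelled out in full, a sketch of the predicate with that check folded
in might look as follows (this mirrors the structure of the existing
tree.c function quoted above; the COMPLEX_CST/VECTOR_CST cases are
elided, so treat it as an illustration rather than the actual patch):

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      /* Look at the tree representation directly instead of going
         through wide-int: zero is a single element whose value is 0.  */
      return (TREE_INT_CST_NUNITS (expr) == 1
              && TREE_INT_CST_ELT (expr, 0) == 0);

    /* ... COMPLEX_CST and VECTOR_CST handled as in the existing
       function ...  */

    default:
      return 0;
    }
}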

I just used the existing code as convenient simplest users
of wide-ints to inspect code generation (originally I looked
at tree-dfa.c:get_ref_base_and_extent which is timing critical
but somewhat convoluted).

So before the merge we should adjust the tree.c changes on the
branch appropriately.

Richard.
