On Thu, 17 Oct 2013, Mike Stump wrote:

> On Oct 17, 2013, at 1:46 AM, Richard Biener <rguent...@suse.de> wrote:
> > [This function shows another optimization issue:
> >
> >     case BOOLEAN_TYPE:
> >       /* Cache false or true.  */
> >       limit = 2;
> >       if (wi::leu_p (cst, 1))
> >         ix = cst.to_uhwi ();
> >
> > I would have expected cst <= 1 to be optimized to cst.len == 1 &&
> > cst.val[0] <= 1.
>
> So lts_p has the same problem, and is used just below.  If we do:
>
> @@ -1461,12 +1470,22 @@ wi::ne_p (const T1 &x, const T2 &y)
>  inline bool
>  wi::lts_p (const wide_int_ref &x, const wide_int_ref &y)
>  {
> -  if (x.precision <= HOST_BITS_PER_WIDE_INT
> -      && y.precision <= HOST_BITS_PER_WIDE_INT)
> -    return x.slow () < y.slow ();
> -  else
> -    return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> -                        y.precision);
> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)
> +    {
> +      // If it fits directly into a shwi, we can compare directly.
> +      if (wi::fits_shwi_p (x))
> +        return x.slow () < y.slow ();
> +      // If it doesn't fit, then it must be more negative than any
> +      // value in y, and hence smaller.
> +      if (neg_p (x, SIGNED))
> +        return true;
> +      // If it is positive, then it must be larger than any value in y,
> +      // and hence greater.
> +      return false;
> +    }
> +  return lts_p_large (x.val, x.len, x.precision, y.val, y.len,
> +                      y.precision);
>  }
>
> then for:
>
> bool MGEN(wide_int cst) {
>   return wi::lts_p (cst, 100);
> }
>
> we generate:
>
>         movl    520(%rsp), %eax
>         cmpl    $1, %eax
>         je      .L5283
>         subl    $1, %eax
>         movq    8(%rsp,%rax,8), %rax
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5283:
>         cmpq    $99, 8(%rsp)
>         setle   %al
>         ret
>
> which, as you can see, never calls lts_p_large, and hence, if that was the
> only reason the storage escapes, then it should be able to optimize better.
> In the above code: minimal, no function calls, and the 100 is propagated
> all the way down into the cmpq.  The reason we should do it this way is
> that most of the time, most types are 64 bits or less.  When that is true,
> it will never call into lts_p_large.
>
> If the above 100 is changed to l, from a parameter long l, then the code
> is the same except the last part is:
>
>         cmpq    8(%rsp), %rdi
>         setg    %al
>         ret
>
> If we have two wide_ints, then the code does this:
>
>         .cfi_startproc
>         subq    $8, %rsp
>         .cfi_def_cfa_offset 16
>         movl    1052(%rsp), %r9d
>         movl    1048(%rsp), %r8d
>         movl    532(%rsp), %edx
>         movl    528(%rsp), %esi
>         cmpl    $64, %r9d
>         ja      .L5281
>         cmpl    $1, %esi
>         je      .L5284
>         subl    $1, %esi
>         movq    16(%rsp,%rsi,8), %rax
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         shrq    $63, %rax
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5281:
>         .cfi_restore_state
>         leaq    536(%rsp), %rcx
>         leaq    16(%rsp), %rdi
>         call    _ZN2wi11lts_p_largeEPKljjS1_jj
>         addq    $8, %rsp
>         .cfi_remember_state
>         .cfi_def_cfa_offset 8
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L5284:
>         .cfi_restore_state
>         movq    536(%rsp), %rax
>         cmpq    %rax, 16(%rsp)
>         setl    %al
>         addq    $8, %rsp
>         .cfi_def_cfa_offset 8
>         ret
>         .cfi_endproc
>
> which is still pretty nice.
>
> Does this look ok?  Kenny, can you double check the cases?  I think I have
> them right, but a double check would be good.
That works for me.

> > the representation should guarantee the
> > compare with a low precision (32 bit) constant is evaluatable
> > at compile-time if len of the larger value is > 1, no?
>
> I agree, though I didn't check the symmetric case, as constants and
> smaller values can be put on the right by convention.
>
> > The question is whether we should try to optimize wide-int for
> > such cases or simply not use wi::leu_p (cst, 1) but rather
> >
> >   if (cst.fits_uhwi_p () == 1 && cst.to_uhwi () < 1)
> >
> > ?
>
> Since we can do an excellent job with the simple interface, I'd argue
> for the simple interface and the simple change to do the conditional
> better.  I'd rather up the ante on meta programming to get any
> performance we want, retaining the simple interfaces.

Agreed.  Note that I'd make heavy use of __builtin_constant_p in the
inline implementation: if we cannot optimize the choice at compile time,
we'll both slow things down and bloat the code to the extent that the
inliner will no longer consider it for inlining.  That is,

> +  // We optimize w < N, where N is 64 or fewer bits.
> +  if (y.precision <= HOST_BITS_PER_WIDE_INT)

should be something like

  if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
    ...

and evaluate to false if it doesn't evaluate to a compile-time constant
true.  The ultimate 'else' case then of course has to work for all cases.
If I got it right, then something like

#define CONSTANT(x) (__builtin_constant_p (x) ? (x) : 0)

should do the job (eventually with some stmt expression magic to evaluate
x only once; a rough sketch is at the end of this mail).

That said, the various tree predicates in tree.c should ultimately just
work on the tree representation - they are basic enough that they don't
need to go through wide-ints.  So, instead of

int
integer_zerop (const_tree expr)
{
  STRIP_NOPS (expr);

  switch (TREE_CODE (expr))
    {
    case INTEGER_CST:
      return wi::eq_p (expr, 0);

I'd put there

    case INTEGER_CST:
      return (TREE_INT_CST_NUNITS (expr) == 1
              && TREE_INT_CST_ELT (expr, 0) == 0);

I just used the existing code as the most convenient, simplest user of
wide-ints to inspect code generation (originally I looked at
tree-dfa.c:get_ref_base_and_extent, which is timing critical but somewhat
convoluted).  So before the merge we should adjust the tree.c changes on
the branch appropriately.

Richard.
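P.S. For the record, here is a rough sketch of the CONSTANT idea with the
stmt expression magic mentioned above.  The temporary name is just a
placeholder, and it assumes GNU statement expressions are acceptable at
this spot (an illustration only, not a claim about what the branch can
portably rely on):

/* Evaluate X exactly once; yield its value if GCC can prove it is a
   compile-time constant, and 0 (false) otherwise.  When the proof fails
   (e.g. at -O0), this conservatively selects the general code path,
   which has to handle all cases anyway.  */
#define CONSTANT(X) \
  __extension__ \
  ({ __typeof__ (X) cst_tmp_ = (X); \
     __builtin_constant_p (cst_tmp_) ? cst_tmp_ : 0; })

so that a use like

  if (CONSTANT (y.precision <= HOST_BITS_PER_WIDE_INT))
    /* fast path for single-HWI precisions */;

evaluates to false - and thus falls through to the out-of-line variant -
whenever the precision isn't known at compile time.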