On Tue, Nov 26, 2013 at 12:31:00PM -0500, Geert Bosch wrote:
> >> [...]
> >> A few things helped to make the cost small: the biggest one is that
> >> typically one of the operands is known to be negative or positive.
> >> Gigi will use Ada type information, and Natural or Positive integer
> >> variables are very common. So, if you compute X + C with C positive,
> >> you can write the conditional expression as:
> >
> > On x64 the effect of this analysis is small, the processor does
> > overflow detection for you.
>
> The problem as always is that of pass ordering. If we would encapsulate
> the overflow check in some kind of builtin that we'd expand directly,
> very late in the compilation process, to processor-specific
> instructions, then early optimizers cannot do their work.
>
That is a flaw in gcc's design; one can make these builtins transparent
to analysis by supplying an equivalent alternative implementation for
the early passes, like in the following implementation:

static inline uint64_t
mul_s (uint64_t x, uint64_t y)
{
  /* Both operands known to fit in 32 bits: the product cannot
     overflow.  */
  if (__builtin_constant_p (x < (1ULL << 32)) && x < (1ULL << 32)
      && __builtin_constant_p (y < (1ULL << 32)) && y < (1ULL << 32))
    return x * y;
  /* Both operands are compile-time constants: fold the check.  */
  if (__builtin_constant_p (x) && __builtin_constant_p (y))
    {
      if (x != 0 && UINT64_MAX / x < y)
        return -1;
      else
        return x * y;
    }
  /* Otherwise fall back to the late-expanded builtin.  */
  return builtin_mul_s (x, y);
}

> > By just expanding out to regular additions, comparisons and control
> > flow we can avoid this problem.
> >
> >> (if X < Integer'Last - C then X + C else raise Constraint_Error)
> >>
> >
> >> On my x86-64 this generates something like:
> >>   00 cmpl  $0x7fffffff,%edi
> >>   06 je    0x0000000c
> >>   08 leal  0x01(%rdi),%eax
> >>   0b ret
> >> [..]
> >
> > This has a redundant compare instruction that costs a cycle and 6
> > bytes.  You can just write
> >
> >   0: 83 c7 01  add  $0x1,%edi
> >   3: 71 03     jno  0x8
> > [...]
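In the sketch above, builtin_mul_s is a placeholder for the
late-expanded builtin.  As a self-contained illustration (the function
name here is mine, not gcc's), the runtime path can be written in plain
C with the same division-based check, saturating to UINT64_MAX on
overflow:

```c
#include <stdint.h>

/* mul_s_runtime: hypothetical plain-C fallback for the late-expanded
   builtin.  Returns x * y, or UINT64_MAX if the product does not fit
   in 64 bits.  The division runs only when x is nonzero, so there is
   no division by zero.  */
static uint64_t
mul_s_runtime (uint64_t x, uint64_t y)
{
  if (x != 0 && UINT64_MAX / x < y)
    return UINT64_MAX;          /* product would overflow: saturate */
  return x * y;
}
```

Because this version contains only ordinary arithmetic and control
flow, the early optimizers can see through it, which is exactly the
point being made above.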
> > When you know that one operand is positive, or you deal with
> > unsigned values, then you could replace jno with jnc, which is a bit
> > faster on Sandy Bridge processors and later, as the add, jnc pair is
> > macro-fused but add, jno is not.
>
> Indeed that is ideal, assuming condition flags are dead, but I think
> that the right way to do that is by late combine-like optimizations
> after instruction selection.
>
How would you do that for multiplication?
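One plain-C way to get a cheap multiplication check (my sketch, not
something proposed in the thread) is a widening multiply whose high
half signals overflow; with gcc's __uint128_t this maps on x86-64 to a
single mul followed by a test of the high register:

```c
#include <stdint.h>

/* mul_s_wide: hypothetical checked multiply via a widening multiply.
   Computes the full 128-bit product; a nonzero high half means the
   result does not fit in 64 bits, in which case we saturate to
   UINT64_MAX.  Requires a compiler with __uint128_t (gcc/clang on
   64-bit targets).  */
static uint64_t
mul_s_wide (uint64_t x, uint64_t y)
{
  __uint128_t p = (__uint128_t) x * y;
  if (p >> 64)
    return UINT64_MAX;          /* high half nonzero: overflow */
  return (uint64_t) p;
}
```

Whether the flags-based form or this widening form wins would still
have to be decided late, after instruction selection, as suggested
above.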
Addition is handled by gcc almost optimally; in the following function
gcc does eliminate the cmp, but it uses a cmov instead of a jump, both
at -Os and at -O3, leading to larger code size.

uint64_t
add_s (uint64_t x, uint64_t y)
{
  if (__builtin_expect (x + y < x, 0))
    return -1;
  return x + y;
}

The assembly is the following:

  4003c0: 48 89 f8              mov    %rdi,%rax
  4003c3: 48 c7 c2 ff ff ff ff  mov    $0xffffffffffffffff,%rdx
  4003ca: 48 01 f0              add    %rsi,%rax
  4003cd: 48 0f 42 c2           cmovb  %rdx,%rax
  4003d1: c3                    retq

Btw, also when I replace the -1 by x * y, the cmov stays, which is
suboptimal given the expect hint.

> In looking at generated code in actual programs, most checks are
> optimized away.  This is more important than saving a cycle here or
> there in the much smaller number of checks that remain.

That generalization is unwarranted: it holds in the general case, but
calculating an allocation size has a different use pattern.  A typical
use case is a size passed as an argument or read from IO, followed by
allocating an appropriately sized array.
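To make the allocation-size pattern concrete, here is a minimal sketch
(the helper name is mine): the element count arrives at runtime, so the
overflow check in the size computation cannot be optimized away and its
cost matters on every allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* alloc_array: hypothetical overflow-checked array allocation.
   The count n typically comes from an argument or from IO, so the
   check n > SIZE_MAX / elem must stay in the generated code.
   Returns NULL if n * elem would overflow size_t.  */
static void *
alloc_array (size_t n, size_t elem)
{
  if (elem != 0 && n > SIZE_MAX / elem)
    return NULL;                /* n * elem would overflow size_t */
  return malloc (n * elem);
}
```

This is the same check calloc is required to perform internally; the
point is that for such call sites "most checks are optimized away" does
not apply.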