https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67606
Bug ID: 67606 Summary: Missing optimization: load possible return value before early-out test Product: gcc Version: 5.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Sry if this is a duplicate; I couldn't think of any good search terms to describe this. In a function that might return a constant if a loop condition is false for the first iteration, gcc generates a separate basic block for the early-out return when it doesn't need to. int f(int a[], int length) { int count=0; for (int i = 0 ; i < length ; i++) if (a[i] > 5) count++; return count; } x86 gcc 5.2 -O3 -fno-tree-vectorize compiles it to: f(int*, int): testl %esi, %esi # length jle .L4 #, leal -1(%rsi), %eax #, D.2373 leaq 4(%rdi,%rax,4), %rcx #, D.2371 xorl %eax, %eax # count .L3: xorl %edx, %edx # D.2370 cmpl $5, (%rdi) #, MEM[base: _29, offset: 0B] setg %dl #, D.2370 addq $4, %rdi #, ivtmp.7 addl %edx, %eax # D.2370, count cmpq %rdi, %rcx # ivtmp.7, D.2371 jne .L3 #, rep ret .L4: xorl %eax, %eax # count ret (actually g++, since I used godbolt.org, but gcc 4.9.2 looked similar) A better sequence would be: f(int*, int): xorl %eax, %eax # count testl %esi, %esi # length jle .L4 #, ... unchanged ... cmpq %rdi, %rcx # ivtmp.7, D.2371 jne .L3 #, .L4: rep ret In this case, the ret was already a rep ret, so this change saves ~3 instruction bytes, (or more usually, none, because of padding) at zero speed cost in either the taken or non-taken early-out case. Also, looks like a separate bug, but what the crap is gcc doing with the two lea instructions? Defending against integer overflow when calculating a one-past-the end loop boundary pointer? It doesn't do that when f() takes a size_t arg. (In this case, nothing depends on the lea results soon enough for it to matter. They're dependent, anyway, and the first one is one of the first 4 insns. leal -1(%rsi), %eax #, D.2373 leaq 4(%rdi,%rax,4), %rcx #, D.2371 xorl %eax, %eax # count Unless I'm missing something, leaq (%rdi, %rsi, 4), %rcx should be just as safe. Is something adding the -1-before-scaling, correct after scaling thing for a reason? Is it something that doesn't realize it will be implemented with lea, which avoids any problems? It's an extra instruction, and 4(...) stores the offset as a 32bit constant in the insn encoding, not a sign-extended 8bit value.