From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> +	/* Main loop using 64byte blocks */
> +	for (; len > 64; len -= 64, buff += 64) {
> +		asm("addq 0*8(%[src]),%[res]\n\t"
> +		    "adcq 1*8(%[src]),%[res]\n\t"
> +		    "adcq 2*8(%[src]),%[res]\n\t"
> +		    "adcq 3*8(%[src]),%[res]\n\t"
> +		    "adcq 4*8(%[src]),%[res]\n\t"
> +		    "adcq 5*8(%[src]),%[res]\n\t"
> +		    "adcq 6*8(%[src]),%[res]\n\t"
> +		    "adcq 7*8(%[src]),%[res]\n\t"
> +		    "adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		      "[res]" (result));
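For reference, the adcq chain above is just a 64-bit sum of eight
quadwords with the accumulated carries folded back in at the end. A
rough C sketch of the same arithmetic (illustrative only, not from the
patch; equivalent for checksum purposes, i.e. modulo 2^64 - 1):

	/* Accumulate eight 64-bit words into a 128-bit value so the
	 * carries are kept, then fold them back into the low 64 bits,
	 * much as the trailing "adcq $0" does in the asm above. */
	static unsigned long sum64(const unsigned long *p, unsigned long sum)
	{
		unsigned __int128 acc = sum;
		int i;

		for (i = 0; i < 8; i++)
			acc += p[i];
		return (unsigned long)acc + (unsigned long)(acc >> 64);
	}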
Did you try the asm loop that used 'lea %rcx..., jcxz..., jmps...' without
any unrolling?

...
> +	/* Sum over any remaining bytes (< 8 of them) */
> +	if (len & 0x7) {
> +		unsigned long val;
> +		/*
> +		 * Since "len" is > 8 here we backtrack in the buffer to load
> +		 * the outstanding bytes into the low order bytes of a quad and
> +		 * then shift to extract the relevant bytes. By doing this we
> +		 * avoid additional calls to load_unaligned_zeropad.

That comment is wrong. Maybe:

		 * Read the last 8 bytes of the buffer then shift to extract
		 * the required bytes.
		 * This is safe because the original length was > 8 and avoids
		 * any problems reading beyond the end of the valid data.

	David
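P.S. For illustration, the tail handling that comment describes might
look like this in C (a sketch only; the names are mine, and it assumes
little-endian x86-64 and, as above, an original length > 8 so the
backtracking read stays inside the buffer):

	/* 'buff' points at the rem = len & 7 outstanding bytes. */
	static unsigned long tail_bytes(const unsigned char *buff,
					unsigned int len)
	{
		unsigned int rem = len & 0x7;
		unsigned long val;

		/* Read the 8 bytes that end at the end of the valid
		 * data; the low (8 - rem) bytes were already summed. */
		val = *(const unsigned long *)(buff + rem - 8);

		/* Shift the already-summed bytes out, leaving the
		 * outstanding bytes in the low-order positions. */
		return val >> (8 * (8 - rem));
	}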