On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert <t...@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight <david.lai...@aculab.com> wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed to have a loop buffer so that it can
>>> > basically loop on already decoded instructions. Normally it is only
>>> > something like 64 or 128 bytes in size though. You might find that
>>> > reducing this loop to that smaller size may improve the performance
>>> > for larger payloads.
>>>
>>> I saw 128 to be better in my testing. For large packets this loop does
>>> all the work. I see performance dependent on the amount of loop
>>> overhead, i.e. we got it down to two non-adcq instructions but it is
>>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>>> since we only need to do single call in the jump table and no trip
>>> through the loop.
>>
>> But one of your 'loop overhead' instructions is 'loop'.
>> Look at http://www.agner.org/optimize/instruction_tables.pdf
>> you don't want to be using 'loop' on intel cpus.
>>
> I'm not following. We can replace loop with decl %ecx and jg, but why
> is that better?

Because loop takes something like 7 cycles whereas the decl/jg approach
takes 2 or 3. It is probably one of the reasons things look so much
better with the loop unrolled.

- Alex
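
For illustration, here is a minimal sketch of an adcq loop that counts down with a decrement and a conditional branch instead of 'loop'. It is not the code from the patch under discussion: the function names csum_adc and csum_ref are made up for this example, it handles one 64-bit word per iteration with no unrolling, and it uses decq/jnz rather than the decl/jg pair mentioned above. The detail that makes the swap safe is that dec does not modify the carry flag, so the adcq chain still carries across iterations. On anything other than x86-64 with GCC or Clang the sketch falls back to the portable version.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference (hypothetical name): add 64-bit words and fold
 * each carry-out straight back into the sum. */
static uint64_t csum_ref(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        if (sum < p[i])   /* wrapped around: add the carry back in */
            sum++;
    }
    return sum;
}

/* Same sum with an adcq chain (hypothetical name). The loop overhead
 * is lea + dec + jnz; neither lea nor dec touches CF, so the carry
 * from one adcq is still there for the next iteration. */
static uint64_t csum_adc(const uint64_t *p, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t sum = 0;

    if (n == 0)
        return 0;

    __asm__ volatile(
        "clc\n\t"                    /* start with CF = 0 */
        "1:\n\t"
        "adcq (%[p]), %[sum]\n\t"    /* sum += *p + CF */
        "leaq 8(%[p]), %[p]\n\t"     /* next word, flags unchanged */
        "decq %[n]\n\t"              /* sets ZF, leaves CF alone */
        "jnz 1b\n\t"
        "adcq $0, %[sum]"            /* fold in the last carry */
        : [sum] "+r"(sum), [p] "+r"(p), [n] "+r"(n)
        :
        : "cc", "memory");
    return sum;
#else
    /* Assumption: no x86-64 inline asm available, use the C version. */
    return csum_ref(p, n);
#endif
}
```

Swapping the decq/jnz pair for a single 'loop' (with the count in %rcx) would give the same result, since 'loop' does not touch the flags either; the difference is only in how long the loop overhead takes.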