On Thu, Feb 4, 2016 at 9:09 AM, David Laight <david.lai...@aculab.com> wrote:
> From: Tom Herbert
> ...
>> > If nothing else reducing the size of this main loop may be desirable.
>> > I know the newer x86 is supposed to have a loop buffer so that it can
>> > basically loop on already decoded instructions. Normally it is only
>> > something like 64 or 128 bytes in size though. You might find that
>> > reducing this loop to that smaller size may improve the performance
>> > for larger payloads.
>>
>> I saw 128 to be better in my testing. For large packets this loop does
>> all the work. I see performance dependent on the amount of loop
>> overhead, i.e. we got it down to two non-adcq instructions but it is
>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>> since we only need to do single call in the jump table and no trip
>> through the loop.
>
> But one of your 'loop overhead' instructions is 'loop'.
> Look at http://www.agner.org/optimize/instruction_tables.pdf
> you don't want to be using 'loop' on intel cpus.
>
I'm not following. We can replace loop with decl %ecx and jg, but why
is that better?

Tom

> You might get some benefit from pipelining the loop (so you do
> a read to register in one iteration and a register-register adc
> the next).
>
> David
>
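
For reference, the pipelining idea quoted above could be sketched roughly as follows. This is plain C rather than the assembly under discussion, and it is not the code from the patch: the names add_carry, csum_plain and csum_pipelined are hypothetical, and a compiler is free to reorder the loads anyway, so treat it only as an illustration of the loop shape (load for the next iteration, register-register add of the previous load).

```c
#include <stddef.h>
#include <stdint.h>

/* One step of an adc chain in portable C: add, then fold the
 * carry-out back into the sum (end-around carry). */
static uint64_t add_carry(uint64_t sum, uint64_t val)
{
    sum += val;
    return sum + (sum < val);   /* sum < val only if the add wrapped */
}

/* Straightforward loop: each iteration loads and adds the same word. */
static uint64_t csum_plain(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    while (n--)
        sum = add_carry(sum, *p++);
    return sum;
}

/* Software-pipelined loop: each iteration issues the load for the
 * NEXT word, then adds the word loaded in the PREVIOUS iteration. */
static uint64_t csum_pipelined(const uint64_t *p, size_t n)
{
    uint64_t sum = 0, cur;

    if (n == 0)
        return 0;

    cur = *p++;                     /* prologue: first load */
    while (--n) {
        uint64_t next = *p++;       /* read to register */
        sum = add_carry(sum, cur);  /* register-register add */
        cur = next;
    }
    return add_carry(sum, cur);     /* epilogue: last pending word */
}
```

Both functions return the same 64-bit ones' complement sum for the same input; whether the pipelined form is actually faster would have to be measured on the CPUs in question.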