On Tue, Apr 14, 2015 at 7:53 PM, Stephen Hemminger <stephen at networkplumber.org> wrote:
> On Tue, 14 Apr 2015 14:31:53 -0700
> Ravi Kerur <rkerur at gmail.com> wrote:
>
> > +
> > +	for (i = 0; i < 2; i++)
> > +		rte_mov32(dst + i * 32, src + i * 32);
> >  }
>
> Unless you force compiler to unroll the loop, it will be slower.
>

I did the following:

1. Used sample code from Intel to make sure the CPU supports those
   instructions.
2. Checked the generated code with and without the loop using
   "gcc -O3 -m64 -S" (gcc version 4.8.2). There was no difference in the
   code generated between "loop" and "no-loop". I had expected at least
   some difference.
3. Ran "make test" and compared the "memcpy perf" results.
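
For anyone who wants to repeat the comparison from step 2 outside the DPDK tree, here is a minimal sketch. It is not the patch itself: mov32() is a hypothetical stand-in for rte_mov32() built on a fixed-size memcpy, and the function names and the file name used below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov32(): copies exactly 32 bytes.
 * The real DPDK routine uses SIMD loads/stores; a constant-size memcpy
 * is used here only to keep the sketch portable. */
static inline void mov32(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 32);
}

/* 64-byte copy written as a loop with a constant trip count of 2. */
void mov64_loop(uint8_t *dst, const uint8_t *src)
{
	int i;

	for (i = 0; i < 2; i++)
		mov32(dst + i * 32, src + i * 32);
}

/* The same 64-byte copy unrolled by hand. */
void mov64_unrolled(uint8_t *dst, const uint8_t *src)
{
	mov32(dst + 0 * 32, src + 0 * 32);
	mov32(dst + 1 * 32, src + 1 * 32);
}

/* Sanity check: both variants must copy all 64 bytes correctly. */
void check_mov64(void)
{
	uint8_t src[64], a[64], b[64];
	int i;

	for (i = 0; i < 64; i++)
		src[i] = (uint8_t)(i * 3 + 1);
	memset(a, 0, sizeof(a));
	memset(b, 0, sizeof(b));

	mov64_loop(a, src);
	mov64_unrolled(b, src);

	assert(memcmp(a, src, sizeof(src)) == 0);
	assert(memcmp(b, src, sizeof(src)) == 0);
}
```

Saving this as, say, loop_test.c and running "gcc -O3 -m64 -S loop_test.c" lets you compare the bodies of mov64_loop and mov64_unrolled in the resulting loop_test.s. Whether the two come out identical may depend on the compiler version and flags, so the assembly should be checked rather than assumed.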