[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Wang, Zhihong Mon, 26 Jan 2015 01:30:55 +0000

Hi Luke,

I'm very glad that you're interested in this work.


I never published any performance data, and haven't run cachebench.
We mainly use test_memcpy_perf.c in DPDK for testing, because that is the 
environment DPDK runs in. It also includes a performance comparison with 
glibc.
It can be launched from <target>/app/test with memcpy_perf_autotest.

Finally, in practice inlining does bring benefits, for example unrolling when 
the copy size is a compile-time constant, and for DPDK we need every possible 
optimization.


Thanks
John


From: lukego at gmail.com On Behalf Of Luke Gorrie
Sent: Sunday, January 25, 2015 10:50 PM
To: Wang, Zhihong
Cc: dev at dpdk.org; snabb-devel at googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Hi John,

On 19 January 2015 at 02:53, <zhihong.wang at intel.com> wrote:
This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

I am really interested in this work you are doing on memory copies optimized 
for packet data. I would like to understand it in more depth. I have a lot of 
questions and ideas but let me try to keep it simple for now :-)

How do you benchmark? Where does the "factor of 2-8" cited elsewhere in the 
thread come from? How can I reproduce it? What results are you seeing compared 
with libc?

I did a quick benchmark this weekend based on cachebench 
<http://icl.cs.utk.edu/projects/llcbench/cachebench.html>. This seems 
like a fairly weak benchmark (always L1 cache, always same alignment, always 
predictable branches). Do you think this is relevant? How does this compare 
with your results?
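
For contrast with those three limitations, a rough sketch of a copy loop that varies alignment, length, and working-set size might look like the following. Everything here is an assumption for illustration: the names lcg_next and bench_memcpy_varied are made up, it times plain memcpy with the standard C clock() function, and it is neither cachebench nor the DPDK test.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Small deterministic generator so runs are repeatable (hypothetical helper).
 * The low bits of an LCG are weak, so the low 8 bits are dropped. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Copies `iters` blocks of pseudo-random length (1..max_len) between
 * pseudo-random offsets in two buffers of roughly buf_bytes each.
 * Choosing buf_bytes larger than the cache of interest avoids the
 * "always L1" case; the random offsets and lengths vary alignment
 * and make the size-dependent branches harder to predict.
 * Returns MB/sec (10^6 bytes per second of CPU time), 0.0 if the run
 * was too short for clock() to measure, or -1.0 on bad arguments or
 * allocation failure. */
static double bench_memcpy_varied(size_t buf_bytes, size_t max_len, unsigned iters)
{
    if (buf_bytes == 0 || max_len == 0)
        return -1.0;

    uint8_t *src = malloc(buf_bytes + max_len);
    uint8_t *dst = malloc(buf_bytes + max_len);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, buf_bytes + max_len);
    memset(dst, 0, buf_bytes + max_len);

    uint32_t rng = 12345u;
    volatile uint8_t sink = 0;   /* keeps the copies from being optimized away */
    double total_bytes = 0.0;

    clock_t t0 = clock();
    for (unsigned i = 0; i < iters; i++) {
        size_t src_off = lcg_next(&rng) % buf_bytes;
        size_t dst_off = lcg_next(&rng) % buf_bytes;
        size_t len = 1 + lcg_next(&rng) % max_len;

        memcpy(dst + dst_off, src + src_off, len);
        sink ^= dst[dst_off];
        total_bytes += (double)len;
    }
    clock_t t1 = clock();

    free(src);
    free(dst);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? total_bytes / secs / 1e6 : 0.0;
}
```

A call such as bench_memcpy_varied(64u << 20, 1518, 10000000u) would exercise packet-sized copies over a 64 MiB working set; the time spent generating random numbers is included in the measurement, so the result understates the raw copy bandwidth.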

I compared:
  rte_memcpy (the new optimized one compiled with gcc-4.9 and -march=native and 
-O3)
  memcpy from glibc 2.19 (ubuntu 14.04)
  memcpy from glibc 2.20 (arch linux)

on hardware:
  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

./cachebench -p -e1 -x1 -m14

rte_memcpy.h on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            89191.88        1.00
384             0.01            96505.43        0.92
512             0.01            96509.19        1.00
768             0.01            91475.72        1.06
1024            0.01            96293.82        0.95
1536            0.01            96521.66        1.00
2048            0.01            96522.87        1.00
3072            0.01            96525.53        1.00
4096            0.01            96522.79        1.00
6144            0.01            96507.71        1.00
8192            0.01            94584.41        1.02
12288           0.01            95062.80        0.99
16384           0.01            80493.46        1.18

libc 2.20 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            65978.64        1.00
384             0.01            100249.01       0.66
512             0.01            123476.55       0.81
768             0.01            144699.86       0.85
1024            0.01            159459.88       0.91
1536            0.01            168001.92       0.95
2048            0.01            80738.31        2.08
3072            0.01            80270.02        1.01
4096            0.01            84239.84        0.95
6144            0.01            90600.13        0.93
8192            0.01            89767.94        1.01
12288           0.01            92085.98        0.97
16384           0.01            92719.95        0.99

libc 2.19 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            59871.69        1.00
384             0.01            68545.94        0.87
512             0.01            72674.23        0.94
768             0.01            79257.47        0.92
1024            0.01            79740.43        0.99
1536            0.01            85483.67        0.93
2048            0.01            87703.68        0.97
3072            0.01            86685.71        1.01
4096            0.01            87147.84        0.99
6144            0.01            68622.96        1.27
8192            0.01            70591.25        0.97
12288           0.01            72621.28        0.97
16384           0.01            67713.63        1.07

rte_memcpy on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            62158.19        1.00
384             0.01            73256.41        0.85
512             0.01            82032.16        0.89
768             0.01            73919.92        1.11
1024            0.01            75937.51        0.97
1536            0.01            78280.20        0.97
2048            0.01            79562.54        0.98
3072            0.01            80800.93        0.98
4096            0.01            81453.71        0.99
6144            0.01            81915.84        0.99
8192            0.01            82427.98        0.99
12288           0.01            82789.82        1.00
16384           0.01            67519.66        1.23


libc 2.20 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            48651.20        1.00
384             0.02            57653.91        0.84
512             0.01            67909.77        0.85
768             0.01            71177.75        0.95
1024            0.01            72519.48        0.98
1536            0.01            76686.24        0.95
2048            0.19            4975.55         15.41
3072            0.19            5091.97         0.98
4096            0.19            5152.38         0.99
6144            0.18            5211.26         0.99
8192            0.18            5245.27         0.99
12288           0.18            5276.50         0.99
16384           0.18            5209.80         1.01


libc 2.19 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            44970.51        1.00
384             0.02            51922.46        0.87
512             0.02            57230.56        0.91
768             0.02            63438.96        0.90
1024            0.01            67506.58        0.94
1536            0.01            72579.25        0.93
2048            0.01            75722.25        0.96
3072            0.01            71039.19        1.07
4096            0.01            73946.17        0.96
6144            0.02            40969.79        1.80
8192            0.02            41396.05        0.99
12288           0.02            41830.01        0.99
16384           0.02            42032.40        1.00
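
As a reading aid for the tables above: judging from the numbers, the "% Chnge" column appears to be the previous row's MB/sec divided by the current row's, so values above 1 mark a drop in bandwidth. The same division compares two libraries at one size. A small sketch, with bandwidth_ratio as a made-up helper name:

```c
/* Ratio of two bandwidth figures in MB/sec (hypothetical helper).
 * Returns 0.0 if the divisor is not positive. */
static double bandwidth_ratio(double numerator_mbs, double denominator_mbs)
{
    if (denominator_mbs <= 0.0)
        return 0.0;
    return numerator_mbs / denominator_mbs;
}

/* Values taken from the Sandy Bridge tables above:
 *
 * libc 2.20, 1536 -> 2048 rows:
 *   bandwidth_ratio(76686.24, 4975.55)   is about 15.41 (the "% Chnge" entry)
 *
 * rte_memcpy vs libc 2.20 at 2048:
 *   bandwidth_ratio(79562.54, 4975.55)   is about 15.99
 */
```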

Last question: Why is rte_memcpy inline? (Would making it a library function 
give you smaller code, comparable performance, and fast compiles?)

Cheers!
-Luke
