[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Luke Gorrie Sun, 25 Jan 2015 15:50:27 +0100

Hi John,

On 19 January 2015 at 02:53, <zhihong.wang at intel.com> wrote:


> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.
>

I am really interested in this work you are doing on memory copies
optimized for packet data. I would like to understand it in more depth. I
have a lot of questions and ideas but let me try to keep it simple for now
:-)

How do you benchmark? Where does the "factor of 2-8" cited elsewhere in the
thread come from? How can I reproduce it? What results are you seeing compared
with libc?

I did a quick benchmark this weekend based on cachebench
<http://icl.cs.utk.edu/projects/llcbench/cachebench.html>. This seems like
a fairly weak benchmark (always L1 cache, always same alignment, always
predictable branches). Do you think this is relevant? How does this compare
with your results?
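
To make the concern concrete, here is a rough sketch of a less friendly
microbenchmark: random copy lengths, random source and destination offsets,
and a working set that can be sized beyond L1. It is only an illustration
under my own assumptions, not what cachebench or the patch set does. The names
bench_copy, copy_fn and xorshift32 are made up for this example, and any
function with memcpy's signature can be passed in. Note that the timed loop
also includes the cost of the random number generator, so the result is only
useful for comparing copy functions against each other.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: small xorshift PRNG, cheap and reproducible. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Any copy routine with the same signature as memcpy. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Volatile sink so the compiler cannot discard the copied data. */
static volatile uint8_t sink;

/*
 * Copy `iters` blocks of random length in [1, max_len] between random
 * offsets inside two buffers of `buf_size` bytes each.
 * Returns bytes copied per nanosecond (numerically GB/s), or a negative
 * value on bad arguments, allocation failure, or an unusable timer reading.
 */
double bench_copy(copy_fn fn, size_t buf_size, size_t max_len,
                  size_t iters, uint32_t seed)
{
    if (max_len == 0 || max_len > buf_size || seed == 0)
        return -1.0;

    uint8_t *src = malloc(buf_size);
    uint8_t *dst = malloc(buf_size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, buf_size);
    memset(dst, 0, buf_size);

    uint32_t rng = seed;
    uint64_t total = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) {
        size_t len = 1 + xorshift32(&rng) % max_len;  /* unpredictable size */
        size_t room = buf_size - len;                 /* last valid offset  */
        size_t soff = xorshift32(&rng) % (room + 1);  /* random alignment   */
        size_t doff = xorshift32(&rng) % (room + 1);
        assert(soff + len <= buf_size && doff + len <= buf_size);
        fn(dst + doff, src + soff, len);
        total += len;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Read one unpredictable byte back so the copies stay observable. */
    sink = dst[xorshift32(&rng) % buf_size];

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    free(src);
    free(dst);
    return ns > 0.0 ? (double)total / ns : -1.0;
}
```

Calling bench_copy(memcpy, 1 << 20, 2048, 1000000, 1) would exercise copies of
up to 2048 bytes over 1 MiB buffers. Passing a thin wrapper around another
copy routine in place of memcpy gives a like-for-like comparison.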

I compared:
  rte_memcpy (the new optimized one, compiled with gcc-4.9, -march=native and -O3)
  memcpy from glibc 2.19 (Ubuntu 14.04)
  memcpy from glibc 2.20 (Arch Linux)

on hardware:
  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

./cachebench -p -e1 -x1 -m14


rte_memcpy on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            89191.88        1.00
384             0.01            96505.43        0.92
512             0.01            96509.19        1.00
768             0.01            91475.72        1.06
1024            0.01            96293.82        0.95
1536            0.01            96521.66        1.00
2048            0.01            96522.87        1.00
3072            0.01            96525.53        1.00
4096            0.01            96522.79        1.00
6144            0.01            96507.71        1.00
8192            0.01            94584.41        1.02
12288           0.01            95062.80        0.99
16384           0.01            80493.46        1.18


libc 2.20 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            65978.64        1.00
384             0.01            100249.01       0.66
512             0.01            123476.55       0.81
768             0.01            144699.86       0.85
1024            0.01            159459.88       0.91
1536            0.01            168001.92       0.95
2048            0.01            80738.31        2.08
3072            0.01            80270.02        1.01
4096            0.01            84239.84        0.95
6144            0.01            90600.13        0.93
8192            0.01            89767.94        1.01
12288           0.01            92085.98        0.97
16384           0.01            92719.95        0.99


libc 2.19 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            59871.69        1.00
384             0.01            68545.94        0.87
512             0.01            72674.23        0.94
768             0.01            79257.47        0.92
1024            0.01            79740.43        0.99
1536            0.01            85483.67        0.93
2048            0.01            87703.68        0.97
3072            0.01            86685.71        1.01
4096            0.01            87147.84        0.99
6144            0.01            68622.96        1.27
8192            0.01            70591.25        0.97
12288           0.01            72621.28        0.97
16384           0.01            67713.63        1.07


rte_memcpy on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            62158.19        1.00
384             0.01            73256.41        0.85
512             0.01            82032.16        0.89
768             0.01            73919.92        1.11
1024            0.01            75937.51        0.97
1536            0.01            78280.20        0.97
2048            0.01            79562.54        0.98
3072            0.01            80800.93        0.98
4096            0.01            81453.71        0.99
6144            0.01            81915.84        0.99
8192            0.01            82427.98        0.99
12288           0.01            82789.82        1.00
16384           0.01            67519.66        1.23



libc 2.20 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            48651.20        1.00
384             0.02            57653.91        0.84
512             0.01            67909.77        0.85
768             0.01            71177.75        0.95
1024            0.01            72519.48        0.98
1536            0.01            76686.24        0.95
2048            0.19            4975.55         15.41
3072            0.19            5091.97         0.98
4096            0.19            5152.38         0.99
6144            0.18            5211.26         0.99
8192            0.18            5245.27         0.99
12288           0.18            5276.50         0.99
16384           0.18            5209.80         1.01



libc 2.19 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            44970.51        1.00
384             0.02            51922.46        0.87
512             0.02            57230.56        0.91
768             0.02            63438.96        0.90
1024            0.01            67506.58        0.94
1536            0.01            72579.25        0.93
2048            0.01            75722.25        0.96
3072            0.01            71039.19        1.07
4096            0.01            73946.17        0.96
6144            0.02            40969.79        1.80
8192            0.02            41396.05        0.99
12288           0.02            41830.01        0.99
16384           0.02            42032.40        1.00


Last question: Why is rte_memcpy inline? (Would making it a library
function give you smaller code, comparable performance, and fast compiles?)

Cheers!
-Luke
