On Thu, 12 Jan 2017 10:30:58 +0800
Yuanhan Liu <yuanhan....@linux.intel.com> wrote:

> On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > 2017-01-11 12:27, Yuanhan Liu:  
> > > The fact that virtio net header is initiated to zero in PMD driver
> > > init stage means that these costly writes are unnecessary and could
> > > be avoided:
> > > 
> > >     if (hdr->csum_start != 0)
> > >         hdr->csum_start = 0;
> > > 
> > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > performance drop introduced by TSO enabling is recovered: it could
> > > be up to 20% in micro benchmarking.  
> > 
> > This patch is adding a condition to assignments.
> > We need a benchmark on other architectures like ARM. Please anyone?  
> 
> I think the cost of condition should be way lower than the cost from the
> penalty introduced by the cache issue, that I don't see it would perform
> bad on other platforms.
> 
> But, of course, testing is always welcome!
> 
>       --yliu

Hello,

we ran a synthetic measurement. In brief, the principle is:

== Without condition check ==

gettimeofday(&start, NULL);

for (i = 0; i < 1024*1024*128; ++i) {
        hdr->csum_start = 0;
        hdr->csum_offset = 0;
        hdr->flags = 0;
}

gettimeofday(&end, NULL);


== With condition check ==

gettimeofday(&start, NULL);

for (i = 0; i < 1024*1024*128; ++i) {
        ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
        ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
        ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
}

gettimeofday(&end, NULL);
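
For anyone who wants to repeat a single-thread run, the two loops above
can be sketched as a small C helper. This is a minimal, hypothetical
sketch rather than the exact code behind the numbers: struct fake_hdr,
now_ms() and time_loop_ms() are names made up for illustration, the
header is a stand-in with only the three fields touched here, and the
volatile qualifier is an assumption added so that the compiler does not
optimise the repeated accesses away.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Macro as proposed in the patch under discussion. */
#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
	if ((var) != (val))			\
		(var) = (val);			\
} while (0)

/* Hypothetical stand-in: only the three header fields used in the loops. */
struct fake_hdr {
	uint8_t  flags;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Wall-clock time in milliseconds, based on gettimeofday(). */
static double now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (double)tv.tv_sec * 1000.0 + (double)tv.tv_usec / 1000.0;
}

/*
 * Run one of the two loops and return the elapsed time in ms.
 * Assumption of this sketch: hdr is volatile so that every iteration
 * really performs its loads and stores.
 */
static double time_loop_ms(volatile struct fake_hdr *hdr, long iters,
			   int with_check)
{
	double start;
	long i;

	assert(hdr != NULL && iters >= 0);

	start = now_ms();
	if (with_check) {
		for (i = 0; i < iters; ++i) {
			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
		}
	} else {
		for (i = 0; i < iters; ++i) {
			hdr->csum_start = 0;
			hdr->csum_offset = 0;
			hdr->flags = 0;
		}
	}
	return now_ms() - start;
}

#ifdef BENCH_STANDALONE
int main(void)
{
	struct fake_hdr hdr = { 0, 0, 0 };
	long iters = 1024L * 1024 * 128;

	printf("without check: %.0f ms\n", time_loop_ms(&hdr, iters, 0));
	printf("with check:    %.0f ms\n", time_loop_ms(&hdr, iters, 1));
	return 0;
}
#endif
```

Building with -DBENCH_STANDALONE gives a small program that prints both
timings. For a multi-threaded run, each thread would call
time_loop_ms() and the per-thread results would be summed.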


== Results ==

Computed as total time of all threads:

for i = 1..THREAD_COUNT:
        result += end[i] - start[i]

cpu                threads  without-check (ms)  with-check (ms)
Xeon E5-2670             1                 516              529
Xeon E5-2670             2                1155              953
Xeon E5-2670             8                8947             5044
Xeon E5-2670            16               23335            16836
Zynq-7020 (armv7)        1                6735             7205
Zynq-7020 (armv7)        2               13753            14418

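On the Xeon, the check costs about 2.5% with a single thread, but it
pays off as the number of threads grows: the total time drops by about
17% with 2 threads, 44% with 8 and 28% with 16.

On the 32-bit ARM, however, the check is slower in both runs (about 7%
with 1 thread and about 5% with 2), so some performance drop is to be
expected there.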

Regards
Jan

> > 
> > 
> > [...]  
> > > +/* avoid write operation when necessary, to lessen cache issues */
> > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {       \
> > > + if ((var) != (val))                     \
> > > +         (var) = (val);                  \
> > > +} while (0)  
