On Thu, Jan 12, 2017 at 04:02:56PM +0100, Jan Viktorin wrote:
> On Thu, 12 Jan 2017 10:30:58 +0800
> Yuanhan Liu <yuanhan....@linux.intel.com> wrote:
>
> > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > > 2017-01-11 12:27, Yuanhan Liu:
> > > > The fact that virtio net header is initiated to zero in PMD driver
> > > > init stage means that these costly writes are unnecessary and could
> > > > be avoided:
> > > >
> > > >     if (hdr->csum_start != 0)
> > > >         hdr->csum_start = 0;
> > > >
> > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > > performance drop introduced by TSO enabling is recovered: it could
> > > > be up to 20% in micro benchmarking.
> > >
> > > This patch is adding a condition to assignments.
> > > We need a benchmark on other architectures like ARM. Please anyone?
> >
> > I think the cost of condition should be way lower than the cost from the
> > penalty introduced by the cache issue, that I don't see it would perform
> > bad on other platforms.
> >
> > But, of course, testing is always welcome!
> >
> >     --yliu
>
> Hello,
>
> we've done a synthetic measurement, principle briefly:

Thanks!

>
> == Without condition check ==
>
> start = gettimeofday();
>
> for (i = 0; i < 1024*1024*128; ++i) {
>     hdr->csum_start = 0;
>     hdr->csum_offset = 0;
>     hdr->flags = 0;
> }
>
> end = gettimeofday();
>
>
> == With condition check ==
>
> start = gettimeofday();
>
> for (i = 0; i < 1024*1024*128; ++i) {
>     ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
>     ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
>     ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
> }
>
> end = gettimeofday();

But it's not the test methodology I'd expect. You are purely testing the
instruction cycles. The drop on ARM is more like "the if instruction
takes more cycles than the simple assignment".

This macro is meant for the case where one process is heavily writing the
same value (0 here) again and again while another process is heavily
reading it, also again and again. That means cache invalidation always
happens. With this macro, however, this cache issue can be avoided, since
no write happens. For such a workload, I don't think it would behave
worse on ARM.

    --yliu

> == Results ==
>
> Computed as total time of all threads:
>
> for i = 1..THREAD_COUNT:
>     result += end[i] - start[i]
>
> cpu                 threads   without-check (ms)   with-check (ms)
> Xeon E5-2670              1                  516               529
> Xeon E5-2670              2                 1155               953
> Xeon E5-2670              8                 8947              5044
> Xeon E5-2670             16                23335             16836
> Zynq-7020 (armv7)         1                 6735              7205
> Zynq-7020 (armv7)         2                13753             14418
>
> The advantage for Intel is evident when increasing the number
> of threads.
>
> However, on 32-bit ARMs we might expect some performance drop.
>
> Regards
> Jan
>
>
> > > >
> > > > [...]
> > > > +/* avoid write operation when necessary, to lessen cache issues */
> > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do { \
> > > > +    if ((var) != (val)) \
> > > > +        (var) = (val); \
> > > > +} while (0)
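
To make the "no write happens" point concrete, here is a minimal sketch that counts how many stores the macro actually issues. It is an illustration only, not part of the patch: struct demo_hdr, COUNTED_ASSIGN and clear_hdr are hypothetical names, and the field widths are an assumption. It is single-threaded, so it shows only when a store is skipped; it does not measure the cross-core cache effect discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same macro as in the quoted patch. */
#define ASSIGN_UNLESS_EQUAL(var, val) do { \
    if ((var) != (val))                    \
        (var) = (val);                     \
} while (0)

/* Hypothetical stand-in for the three header fields touched in the
 * benchmark; the field widths are an assumption. */
struct demo_hdr {
    uint8_t  flags;
    uint16_t csum_start;
    uint16_t csum_offset;
};

/* Illustration-only helper: bump 'stores' when the macro is about to
 * write, then let the macro do the conditional assignment. */
#define COUNTED_ASSIGN(var, val, stores) do { \
    if ((var) != (val))                       \
        (stores)++;                           \
    ASSIGN_UNLESS_EQUAL(var, val);            \
} while (0)

/* Zero the three fields and return how many stores were really issued. */
unsigned clear_hdr(struct demo_hdr *hdr)
{
    unsigned stores = 0;

    assert(hdr != NULL);
    COUNTED_ASSIGN(hdr->csum_start, 0, stores);
    COUNTED_ASSIGN(hdr->csum_offset, 0, stores);
    COUNTED_ASSIGN(hdr->flags, 0, stores);
    return stores;
}
```

Calling clear_hdr() on a header that is already zeroed returns 0: the fields are only read, so the writer has no reason to take the cache line exclusively and a concurrent reader on another core can keep its copy.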