> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Friday, 29 July 2022 18.06
>
> On Fri, 29 Jul 2022 12:13:52 +0000
> Konstantin Ananyev <konstantin.anan...@huawei.com> wrote:
>
> > Sorry, missed that part.
> >
> > >
> > > > Another question - who will do 'sfence' after the copying?
> > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > it be another API function for that: memcpy_nt_flush() or so?
> > >
> > > Outside. Only the developer knows when it is required, so it
> > > wouldn't make any sense to add the cost inside memcpy_nt().
> > >
> > > I don't think we should add a flush function; it would just be
> > > another name for an already existing function. Referring to the
> > > required operation in the memcpy_nt() function documentation
> > > should suffice.
> > >
> >
> > Ok, but again wouldn't it be arch specific?
> > AFAIK for x86 it needs to boil down to sfence, for other
> > architectures - I don't know.
> > If you think there already is some generic one (rte_wmb?) that
> > would always produce correct instructions - sure let's use it.
> >
>
> It makes sense in a few select places to use non-temporal copy.
> But it would add unnecessary complexity to DPDK if every function in
> DPDK that could cause a copy had a non-temporal variant.

Agree. Packet capturing is one of those few places where it makes sense - the improvement scales with the number of packets, not just with the number of packet bursts.

> Maybe just having rte_memcpy have a threshold (config value?) that if
> copy is larger than a certain size, then it would automatically be
> non-temporal. Small copies wouldn't matter, the optimization is more
> about not stopping cache size issues with large streams of data.

Small copies matter too, if there are many of them. As shown in my previous response, a burst of 32 packets will save 6.25 % of a 64 KB L1 data cache, when copying 64 bytes or less from each packet. The saving is per packet, so it quickly adds up.

Copying a burst of 32 1518-byte packets trashes 2 * 32 * 1536 = 98 KB of data cache, i.e. more than the entire 64 KB L1 data cache.

The threshold in glibc's memcpy() is much higher than 1536 bytes. I don't think it will be possible to find a good threshold that works 99 % of the time. So we have to let the application developer make the choice.
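
For reference, the cache footprint arithmetic above can be reproduced with a few lines of C. This is only a rough sketch under stated assumptions: 64-byte cache lines, cache-line-aligned buffers, and a factor of 2 read as covering both the source and the destination of each copy. The names round_up_to_lines() and copy_footprint() are hypothetical helpers for illustration, not DPDK functions.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines, which is what turns 1518 into 1536. */
#define CACHE_LINE_SIZE 64u

/* Round a copy length up to a whole number of cache lines. */
static size_t round_up_to_lines(size_t len)
{
    return (len + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
}

/*
 * Bytes of data cache touched when copying `len` bytes from each of
 * `burst` packets with an ordinary (temporal) copy.
 * Assumption: the factor 2 accounts for source plus destination lines,
 * and both buffers start on a cache line boundary.
 */
static size_t copy_footprint(size_t burst, size_t len)
{
    return 2 * burst * round_up_to_lines(len);
}
```

With these assumptions, copy_footprint(32, 64) gives 4096 bytes, which is 6.25 % of a 64 KB cache, and copy_footprint(32, 1518) gives 98304 bytes, which is about 98 KB.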