> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Shailja Pandey
> Sent: Thursday, April 19, 2018 3:30 PM
> To: Wiles, Keith <keith.wi...@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Why packet replication is more efficient when done
> using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
>
> > The two code fragments work in two different ways: the first uses a
> > loop to create possibly more than one replica and the second one does
> > not, correct? The loop can cause performance hits, but they should be
> > small.
>
> Sorry for the confusion, for the memcpy version we are also using a loop
> outside of this function. Essentially, we are making the same number of
> copies in both cases.
>
> > The first one is using the hdr->next pointer, which is in the second
> > cacheline of the mbuf header; this can and will cause a cacheline miss
> > and degrade your performance. The second code does not touch hdr->next
> > and will not cause a cacheline miss. When the packet goes beyond
> > 64 bytes you hit the second cacheline. Are you starting to see the
> > problem here?
>
> We also performed the same experiment for different packet sizes (64B,
> 128B, 256B, 512B, 1024B, 1518B). The sharp drop in throughput is observed
> only when the packet size increases from 64B to 128B and not after that.
> So, a cacheline miss should happen for the other packet sizes also. I am
> not sure why this is the case. Why is the drop not sharp beyond 128B
> packets when replicating with rte_pktmbuf_refcnt_update()?
>
> > Every time you touch a new cacheline, performance will drop unless the
> > cacheline is prefetched into the cache first, but in this case that
> > really cannot be done easily. Count the cachelines you are touching and
> > make sure they are the same number in each case.
>
> I don't understand the complexity here, could you please explain it in
> detail?
>
> > Why did you use memcpy and not rte_memcpy here, as rte_memcpy should be
> > faster?
> >
> > I believe DPDK now has a rte_pktmbuf_alloc_bulk() function to reduce the
> > number of rte_pktmbuf_alloc() calls, which should help if you know the
> > number of packets you need to replicate up front.
>
> We are already using both of these functions; I used memcpy and
> rte_pktmbuf_alloc() just to simplify the pseudo-code.
>
> # pktsz 1 (64 bytes) | pktsz 2 (128 bytes) | pktsz 3 (256 bytes) | pktsz 4 (512 bytes) | pktsz 5 (1024 bytes) |
> # memcpy    refcnt   | memcpy    refcnt    | memcpy    refcnt    | memcpy    refcnt    | memcpy    refcnt     |
>   5949888   5806720  | 5831360   2890816   | 5640379   2886016   | 5107840   2863264   | 4510121   2692876    |
>
> Throughput is in MPPS.
>
I wonder which NIC and TX function you are using. Is there any chance that
multi-segment support is not enabled?

Konstantin
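
The advice quoted above, to count the cachelines each replication path
touches, can be illustrated with a small stand-alone sketch. Everything in
it is an assumption made for illustration: struct toy_mbuf is a hypothetical,
simplified stand-in and not the real struct rte_mbuf layout, the 64-byte
cacheline size is assumed, and the field lists for the two paths are guesses
at what such code might access. It counts header cachelines only and ignores
the payload cachelines that a copy also reads and writes.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cachelines */

/* Hypothetical, simplified mbuf-like header (NOT the real rte_mbuf).
 * Frequently used fields sit in the first cacheline; `next` is forced
 * onto the second cacheline, mirroring the point made in the thread. */
struct toy_mbuf {
    void *buf_addr;
    uint16_t data_off;
    uint16_t refcnt;
    uint16_t nb_segs;
    uint16_t port;
    uint32_t pkt_len;
    uint16_t data_len;
    alignas(CACHE_LINE_SIZE) struct toy_mbuf *next; /* second cacheline */
};

/* Index of the cacheline holding byte `offset` of a line-aligned object. */
unsigned cache_line_of(size_t offset)
{
    return (unsigned)(offset / CACHE_LINE_SIZE);
}

/* Count the distinct cachelines covered by a list of field offsets. */
unsigned count_lines_touched(const size_t *offsets, size_t n)
{
    uint64_t seen = 0; /* bitmap: one bit per cacheline index */
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned line = cache_line_of(offsets[i]);

        assert(line < 64); /* bitmap only tracks 64 lines */
        if (!(seen & (UINT64_C(1) << line))) {
            seen |= UINT64_C(1) << line;
            count++;
        }
    }
    return count;
}

/* Assumed header accesses of a refcnt-based clone that chains via next. */
unsigned refcnt_path_lines(void)
{
    const size_t offs[] = {
        offsetof(struct toy_mbuf, refcnt),
        offsetof(struct toy_mbuf, nb_segs),
        offsetof(struct toy_mbuf, next),
    };
    return count_lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}

/* Assumed header accesses of a flat copy into a single-segment mbuf. */
unsigned copy_path_lines(void)
{
    const size_t offs[] = {
        offsetof(struct toy_mbuf, data_off),
        offsetof(struct toy_mbuf, pkt_len),
        offsetof(struct toy_mbuf, data_len),
    };
    return count_lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}
```

In this toy model refcnt_path_lines() touches both header cachelines, while
copy_path_lines() stays within the first. The same offsetof() bookkeeping
could be applied to the real header to compare the two code paths.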