On Thursday 19 April 2018 09:38 PM, Wiles, Keith wrote:

On Apr 19, 2018, at 9:30 AM, Shailja Pandey <csz168...@iitd.ac.in> wrote:

The two code fragments work in two different ways: the first uses a loop to create possibly 
more than one replica and the second one does not, correct? The loop can cause a performance 
hit, but it should be small.
Sorry for the confusion; for the memcpy version we also use a loop outside of this function. 
Essentially, we make the same number of copies in both cases.
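
(For illustration, a minimal sketch of the two approaches being compared; this is not the 
authors' actual code, and the function names, the mempool parameter, and the single-segment 
assumption are mine:)

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* First approach: replicate by reference counting.  The same mbuf is
 * transmitted `copies` extra times, so only its refcnt is adjusted. */
static inline void
replicate_by_refcnt(struct rte_mbuf *m, uint16_t copies)
{
    /* Walks the segment chain through m->next. */
    rte_pktmbuf_refcnt_update(m, copies);
}

/* Second approach: replicate by copying the packet data into a fresh mbuf.
 * Assumes a single-segment packet (data_len == pkt_len). */
static inline struct rte_mbuf *
replicate_by_copy(struct rte_mbuf *m, struct rte_mempool *pool)
{
    struct rte_mbuf *clone = rte_pktmbuf_alloc(pool);

    if (clone == NULL)
        return NULL;
    rte_memcpy(rte_pktmbuf_mtod(clone, void *),
               rte_pktmbuf_mtod(m, void *),
               m->data_len);
    clone->data_len = m->data_len;
    clone->pkt_len = m->pkt_len;
    return clone;
}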
The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf 
header; this can and will cause a cacheline miss and degrade your performance. The second 
code does not touch hdr->next and will not cause a cacheline miss. When the packet goes 
beyond 64 bytes you hit the second cacheline; are you starting to see the problem here?
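
(For context, rte_pktmbuf_refcnt_update() is roughly the following, paraphrased from the 
18.02-era rte_mbuf.h; the exact body may differ between releases, but the walk over m->next, 
which sits in the mbuf's second cacheline, is where the extra cacheline access comes from:)

static inline void
rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
{
    do {
        rte_mbuf_refcnt_update(m, v);   /* refcnt itself is in the first cacheline */
    } while ((m = m->next) != NULL);    /* m->next is in the second cacheline */
}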
We also performed the same experiment for different packet sizes (64B, 128B, 256B, 512B, 
1024B, 1518B); the sharp drop in throughput is observed only when the packet size increases 
from 64B to 128B, not after that. A cacheline miss should happen for the other packet sizes 
as well, so I am not sure why this is the case. Why is the drop not sharp beyond 128B packets 
when replicating using rte_pktmbuf_refcnt_update()?

  Every time you touch a new cacheline, performance will drop unless the cacheline is 
prefetched into memory first, but in this case it really cannot be done easily. Count the 
cachelines you are touching and make sure they are the same number in each case.
I don't understand the complexity here; could you please explain it in detail?
In this case you cannot prefetch the other cachelines far enough in advance to avoid a CPU 
stall on a cacheline miss.
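
(To illustrate the point, a hedged sketch of what prefetching the second cacheline one packet 
ahead might look like in a replication loop; the function and variable names are illustrative, 
and as noted above the prefetch distance here is too short to actually hide the miss:)

#include <rte_mbuf.h>
#include <rte_prefetch.h>

static void
refcnt_replicate_burst(struct rte_mbuf **pkts, uint16_t n, uint16_t copies)
{
    uint16_t i;

    for (i = 0; i < n; i++) {
        /* Prefetch the cacheline holding the next mbuf's 'next' field. */
        if (i + 1 < n)
            rte_prefetch0(&pkts[i + 1]->next);
        rte_pktmbuf_refcnt_update(pkts[i], copies);
    }
}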

Why did you use memcpy and not rte_memcpy here, as rte_memcpy should be faster?
You still did not answer this question.

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of 
rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to 
replicate up front.
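
(For reference, a minimal sketch of bulk allocation; 'pool', 'clones', and 'n_copies' are 
illustrative names, not the authors' code:)

#include <rte_mbuf.h>

/* Allocate all replicas for one packet in a single mempool operation.
 * rte_pktmbuf_alloc_bulk() returns 0 on success and a negative value on
 * failure, in which case no mbufs are allocated. */
static int
alloc_replicas(struct rte_mempool *pool, struct rte_mbuf **clones,
               unsigned int n_copies)
{
    return rte_pktmbuf_alloc_bulk(pool, clones, n_copies);
}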
We are already using both of these functions; I used memcpy and rte_pktmbuf_alloc() just to 
simplify the pseudo-code.
Then please show the real code fragment, as your example was confusing.

In our experiments with packet replication using rte_pktmbuf_refcnt_update(), we observed a sharp drop in throughput when the packet size was changed from 64B to 128B because the replicated packets were not being sent. Only the original packets were sent, hence throughput roughly dropped to half compared to the 64B case, where both the replicated and the original packets were sent. The ether_type field was not being set appropriately for the replicated packets, so they were dropped at the hardware level.

We did not realize this because, for 64B packets, it was not a problem: the NIC was able to transmit both the original and the replicated packets despite the ether_type field not being set appropriately. For 128B and larger packets, the replicated packets were handed to the NIC by the driver but never transmitted on the wire, hence the drop in throughput.

After setting the ether_type field appropriately for 128B and larger packet sizes, the throughput is similar for all packet sizes.
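
(The fix amounts to something like the following sketch; it assumes IPv4 payloads and uses 
the 18.02-era names, which newer DPDK releases spell rte_ether_hdr / RTE_ETHER_TYPE_IPV4; 
this is not the authors' exact code:)

#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

/* Give each replica a valid EtherType so the NIC does not drop it. */
static inline void
fix_ether_type(struct rte_mbuf *m)
{
    struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);

    eth->ether_type = rte_cpu_to_be_16(ETHER_TYPE_IPv4);
}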


# pkt size      memcpy     refcnt
  64 bytes      5949888    5806720
  128 bytes     5831360    2890816
  256 bytes     5640379    2886016
  512 bytes     5107840    2863264
  1024 bytes    4510121    2692876

Refcnt also needs to adjust the value using an atomic update, and you still have not told me 
the type of system you are on: x86 or ???

Please describe your total system: host OS, DPDK version, NICs used, … A number of people 
have performed similar tests and do not see the problem you are suggesting. Maybe modify, 
say, l3fwd (which does something similar to your example code) and see if you still see the 
difference. Then you can post the patch to that example app and we can try to figure it out.

Throughput is in MPPS.

--

Thanks,
Shailja

Regards,
Keith

Thanks again!

--

Thanks,
Shailja
