On 03/17/2017 06:47 AM, Yuanhan Liu wrote:
On Thu, Mar 16, 2017 at 10:10:05AM +0000, Kevin Traynor wrote:
On 03/16/2017 06:21 AM, Yuanhan Liu wrote:
On Wed, Mar 15, 2017 at 07:10:49PM +0000, Kevin Traynor wrote:
The virtio_net structure is used in both enqueue and dequeue datapaths.
broadcast_rarp is checked with cmpset in the dequeue datapath regardless
of whether descriptors are available or not.
It is observed in some cases where dequeue and enqueue are performed by
different cores and no packets are available on the dequeue datapath
(i.e. uni-directional traffic), the frequent checking of broadcast_rarp
in dequeue causes performance degradation for the enqueue datapath.
In OVS the issue can cause a uni-directional performance drop of up to 15%.
Fix that by moving broadcast_rarp to a different cache line in
virtio_net struct.
Thanks, but I'm a bit confused. The drop looks like being caused by
cache false sharing, but I don't see anything would lead to a false
sharing. I mean, there is no write in the same cache line where the
broadcast_rarp belongs. Or, the "volatile" type is the culprit here?
Yes, the cmpset code uses cmpxchg and that performs a write regardless
of the result - it either writes the new value or back the old value.
Oh, right, I missed this part!
Talking about that, I had actually considered to turn "broadcast_rarp"
to a simple "int" or "uint16_t" type, to make it more light weight.
The reason I used atomic type is to exactly send one broadcast RARP
packet once SEND_RARP request is recieved. Otherwise, we may send more
than one RARP packet when MQ is invovled. But I think we don't have
to be that accurate: it's tolerable when more RARP are sent. I saw 4
SEND_RARP requests (aka 4 RARP packets) in the last time I tried
vhost-user live migration after all. I don't quite remember why
it was 4 though.
That said, I think it also would resolve the performance issue if you
change "rte_atomic16_t" to "uint16_t", without moving the place?
Yes, that should work fine, with the side effect you mentioned of
possibly some more rarps - no big deal.
I tested another solution also - as it is unlikely we would need to send
the broadcast_rarp, you can first read and only do the cmpset if it is
likely to succeed. This resolved the issue too.
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1057,7 +1057,8 @@ static inline bool __attribute__((always_inline))
*
* Check user_send_rarp() for more information.
*/
- if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
+ if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
+ rte_atomic16_cmpset((volatile uint16_t *)
&dev->broadcast_rarp.cnt, 1, 0))) {
rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
if (rarp_mbuf == NULL) {
I'm okay with this one. It's simple and clean enough, that it could
be picked to a stable release. Later, I'd like to send another patch
to turn it to "uint16_t". Since it changes the behaviour a bit, it
is not a good candidate for stable release.
BTW, would you please include the root cause (false sharing) into
your commit log?
And maybe also adds the info to the comment just above?
I will help people wondering why we read before cmpset.
Maxime