On Fri, Dec 07, 2018 at 05:58:24PM +0300, Ilya Maximets wrote:
> On 06.12.2018 16:48, Michael S. Tsirkin wrote:
> > On Thu, Dec 06, 2018 at 12:17:38PM +0800, Jason Wang wrote:
> >>
> >> On 2018/12/5 7:30 PM, Ilya Maximets wrote:
> >>> On 05.12.2018 12:49, Maxime Coquelin wrote:
> >>>> A read barrier is required to ensure that the ordering between
> >>>> the available index and the descriptor reads is enforced.
> >>>>
> >>>> Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
> >>>> Cc: sta...@dpdk.org
> >>>>
> >>>> Reported-by: Jason Wang <jasow...@redhat.com>
> >>>> Signed-off-by: Maxime Coquelin <maxime.coque...@redhat.com>
> >>>> ---
> >>>>  lib/librte_vhost/virtio_net.c | 12 ++++++++++++
> >>>>  1 file changed, 12 insertions(+)
> >>>>
> >>>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> >>>> index 5e1a1a727..f11ebb54f 100644
> >>>> --- a/lib/librte_vhost/virtio_net.c
> >>>> +++ b/lib/librte_vhost/virtio_net.c
> >>>> @@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >>>>  	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> >>>>  	avail_head = *((volatile uint16_t *)&vq->avail->idx);
> >>>> +	/*
> >>>> +	 * The ordering between avail index and
> >>>> +	 * desc reads needs to be enforced.
> >>>> +	 */
> >>>> +	rte_smp_rmb();
> >>>> +
> >>> Hmm. This looks weird to me.
> >>> Could you please describe the bad scenario here? (It would be good to
> >>> have it in the commit message too.)
> >>>
> >>> As I understand it, you're enforcing the read of avail->idx to happen
> >>> before reading avail->ring[avail_idx]. Is that correct?
> >>>
> >>> But we have the following code sequence:
> >>>
> >>> 1. read avail->idx (avail_head).
> >>> 2. check that last_avail_idx != avail_head.
> >>> 3. read from the ring using last_avail_idx.
> >>>
> >>> So, there is a strict dependency between all 3 steps and the memory
> >>> transaction will be finished at step #2 in any case. There is no
> >>> way to read the ring before reading avail->idx.
> >>>
> >>> Am I missing something?
> >>
> >> Nope, I kind of get what you mean now. And even if we add
> >>
> >> 4. read the descriptor from the descriptor ring using the id read in 3
> >>
> >> 5. read the descriptor content according to the address from 4
> >>
> >> these are still dependent memory accesses. So there's no need for an rmb.
> >
> > I am pretty sure that on some architectures a barrier is needed here.
> > This is an execution dependency, since avail_head is not used as an
> > index, and reads can be speculated. So the read from the ring can be
> > speculated and executed before the read of avail_head and the check.
> >
> > However, an SMP rmb is/should be free on x86.
>
> rte_smp_rmb() turns into a compiler barrier on x86. And compiler barriers
> could be harmful too in some cases.
>
> > So unless someone on this thread is actually testing performance on
> > non-x86, you are both wasting cycles discussing the removal of no-op
> > macros and risking pushing untested software on users.
>
> Since DPDK supports more than just x86, we have to consider possible
> performance issues on different architectures. Given that this patch makes
> no sense on x86, the only thing we need to consider is stability and
> performance on non-x86 architectures. If we don't pay attention to things
> like this, vhost-user could become completely unusable on non-x86
> architectures someday.
>
> It would be great if someone could test patches on ARM at least (an
> autotest would be nice too). But, unfortunately, testing of DPDK is still
> far from ideal, and the lack of hardware is the main issue. I run vhost
> with qemu on my ARMv8 platform from time to time, but that's definitely
> not enough, and I cannot test every patch on the list.
>
> However, I ran a few tests on ARMv8, and this patch shows no significant
> performance difference. But it does make the performance a bit more stable
> between runs, which is nice.
I'm sorry for being unclear. I think a barrier is required, so this patch is
good. What I was trying to say is that it doesn't make sense to split hairs
proving the barrier can be omitted without testing whether omitting it gives
a performance benefit. Since you observed that adding the barrier actually
helps performance stability, it's all good. (There is a minimal sketch of
the reordering scenario at the end of this mail.)

> >
> >>>>  	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> >>>>  		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
> >>>>  		uint16_t nr_vec = 0;
> >>>> @@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >>>>  	if (free_entries == 0)
> >>>>  		return 0;
> >>>> +	/*
> >>>> +	 * The ordering between avail index and
> >>>> +	 * desc reads needs to be enforced.
> >>>> +	 */
> >>>> +	rte_smp_rmb();
> >>>> +
> >>> This one is strange too.
> >>>
> >>> 	free_entries = *((volatile uint16_t *)&vq->avail->idx) -
> >>> 			vq->last_avail_idx;
> >>> 	if (free_entries == 0)
> >>> 		return 0;
> >>>
> >>> The code reads the value of avail->idx and uses the value on the next
> >>> line even with any compiler optimizations. There is no way for the CPU
> >>> to postpone the actual read.
> >>
> >> Yes.
> >>
> >> Thanks
> >>
> >>>
> >>>>  	VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> >>>>  	count = RTE_MIN(count, MAX_PKT_BURST);
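
To spell out the scenario the rte_smp_rmb() guards against, here is a
minimal sketch of the pattern. This is not the actual virtio_dev_rx_split()
code: the avail_sketch struct, the poll_avail() helper and the size_mask
parameter are simplified stand-ins for illustration, built against DPDK
only for rte_smp_rmb().

#include <stdint.h>
#include <rte_atomic.h>		/* rte_smp_rmb() */

/* Simplified stand-in for the split-ring avail structure. */
struct avail_sketch {
	uint16_t flags;
	uint16_t idx;		/* written last by the guest */
	uint16_t ring[];	/* vq->size entries, written before idx */
};

/* Return the next avail ring entry, or UINT16_MAX if nothing is pending. */
static inline uint16_t
poll_avail(const struct avail_sketch *avail, uint16_t last_avail_idx,
	   uint16_t size_mask)
{
	/* 1. Read the avail index published by the guest. */
	uint16_t avail_head = *(const volatile uint16_t *)&avail->idx;

	/* 2. Branch on it: nothing new to process. */
	if (avail_head == last_avail_idx)
		return UINT16_MAX;

	/*
	 * The branch above is only a control dependency: avail_head is
	 * compared, never used to form the address of the next load.
	 * A weakly ordered CPU may therefore issue the ring load below
	 * speculatively, before the load of avail->idx has completed,
	 * and observe a stale ring entry.  The explicit read barrier
	 * prevents that; on x86 it is only a compiler barrier.
	 */
	rte_smp_rmb();

	/* 3. Read the ring entry; its address depends only on last_avail_idx. */
	return avail->ring[last_avail_idx & size_mask];
}

The later descriptor reads (steps 4 and 5 in the discussion above) use the
id fetched from the ring, so they are address-dependent loads and should not
need an extra barrier on the architectures DPDK targets.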