On Thu, Aug 24, 2023 at 5:54 PM Tyler Retzlaff
<roret...@linux.microsoft.com> wrote:
>
> On Thu, Aug 24, 2023 at 04:18:06PM +0200, David Marchand wrote:
> > On Thu, Aug 24, 2023 at 2:47 PM Morten Brørup <m...@smartsharesystems.com> 
> > wrote:
> > > > From: David Marchand [mailto:david.march...@redhat.com]
> > > > However, I am a bit puzzled why the prefetch change makes the compiler
> > > > consider this loop differently.
> > > > We have the same constructs everywhere in this library and x86_64
> > > > builds are fine...
> > >
> > > That is indeed the relevant question here!
> > >
> > > Perhaps the compiler somehow ignores the "const" part of the parameter
> > > given to the "asm" (in rte_prefetch0()) on 64-bit arches, but not on
> > > 32-bit arches?
> >
> > It is possible to reproduce the issue with the current DPDK tree (with
> > the existing asm-based prefetch implementation) by removing all
> > rte_prefetch0() calls from the async rx path:
> >
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > index d7624d18c8..6f941cf27d 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -748,8 +748,6 @@ map_one_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >                 if (unlikely(!desc_addr))
> >                         return -1;
> >
> > -               rte_prefetch0((void *)(uintptr_t)desc_addr);
> > -
> >                 buf_vec[vec_id].buf_iova = desc_iova;
> >                 buf_vec[vec_id].buf_addr = desc_addr;
> >                 buf_vec[vec_id].buf_len  = desc_chunck_len;
> > @@ -1808,8 +1806,6 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue
> >          */
> >         avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
> >
> > -       rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> > -
> >         async_iter_reset(async);
> >
> >         for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> > @@ -1997,7 +1993,6 @@ virtio_dev_rx_async_packed_batch_enqueue(struct virtio_net *dev,
> >         uint16_t i;
> >
> >         vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > -               rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> >                 desc = vhost_iova_to_vva(dev, vq, desc_addrs[i], &lens[i], VHOST_ACCESS_RW);
> >                 hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc;
> >                 lens[i] = pkts[i]->pkt_len +
> > @@ -2106,8 +2101,6 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue
> >         uint16_t i;
> >
> >         do {
> > -               rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> > -
> >                 if (count - pkt_idx >= PACKED_BATCH_SIZE) {
> >                         if (!virtio_dev_rx_async_packed_batch(dev, vq, &pkts[pkt_idx],
> >                                         dma_id, vchan_id)) {
> >
> >
> > If any prefetch is left in place, the warning disappears.
> > So the asm usage probably inhibits some compiler analysis on this code
> > path.
>
> So, as an aside, one of the reasons often cited for preferring intrinsics
> over inline asm is that inline asm cannot be analyzed when the compiler
> performs optimization. It could be that moving to the intrinsics widens
> the scope of what can be optimized around the rte_prefetch0() calls,
> which would explain the observations here.
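
For illustration, here is a minimal sketch (not actual DPDK code; the
function names are made up for the example) of the two flavours the
compiler ends up seeing:

/* Sketch only. The builtin is known to the optimizer as a side-effect-free
 * prefetch hint, so the surrounding loop can still be analyzed freely. */
static inline void prefetch_builtin(const void *p)
{
	__builtin_prefetch(p, 0 /* read */, 3 /* keep in all cache levels */);
}

/* The inline asm flavour (what rte_prefetch0() uses today) is opaque: the
 * optimizer only sees an asm statement that reads *p, which limits what it
 * can prove about the code around it. */
static inline void prefetch_asm(const void *p)
{
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const char *)p));
}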

Agreed.
While we decide on the next steps for the vhost library, and to make
progress on this series, I decided to keep the previous asm
implementation and to put the intrinsics under an MSVC-only #ifdef.
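
For reference, a rough sketch of that arrangement (the RTE_TOOLCHAIN_MSVC
macro name and the exact intrinsic form below are assumptions, not the
actual patch contents):

#include <stdint.h>	/* uintptr_t */
#ifdef RTE_TOOLCHAIN_MSVC
#include <intrin.h>	/* _mm_prefetch(), _MM_HINT_T0 */
#endif

static inline void rte_prefetch0(const volatile void *p)
{
#ifdef RTE_TOOLCHAIN_MSVC
	/* Assumed MSVC path: no GNU inline asm, so use the SSE intrinsic. */
	_mm_prefetch((const char *)(uintptr_t)p, _MM_HINT_T0);
#else
	/* Existing GCC/Clang path: keep the inline asm implementation. */
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
#endif
}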


-- 
David Marchand
