On Thu, Aug 24, 2023 at 5:54 PM Tyler Retzlaff
<roret...@linux.microsoft.com> wrote:
>
> On Thu, Aug 24, 2023 at 04:18:06PM +0200, David Marchand wrote:
> > On Thu, Aug 24, 2023 at 2:47 PM Morten Brørup <m...@smartsharesystems.com> wrote:
> > > > From: David Marchand [mailto:david.march...@redhat.com]
> > > > However, I am a bit puzzled why the prefetch change makes the compiler
> > > > consider this loop differently.
> > > > We have the same constructs everywhere in this library and x86_64
> > > > builds are fine...
> > >
> > > That is indeed the relevant question here!
> > >
> > > Perhaps the compiler somehow ignores the "const" part of the parameter
> > > given to the "asm" (in rte_prefetch0()) for the 64-bit arch, but not
> > > for the 32-bit arch?
> >
> > It is possible to reproduce the issue with the current DPDK tree (with
> > the existing prefetch implementation in asm) by removing all
> > rte_prefetch0 calls from the async rx path:
> >
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > index d7624d18c8..6f941cf27d 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -748,8 +748,6 @@ map_one_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >         if (unlikely(!desc_addr))
> >                 return -1;
> >
> > -       rte_prefetch0((void *)(uintptr_t)desc_addr);
> > -
> >         buf_vec[vec_id].buf_iova = desc_iova;
> >         buf_vec[vec_id].buf_addr = desc_addr;
> >         buf_vec[vec_id].buf_len = desc_chunck_len;
> > @@ -1808,8 +1806,6 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue
> >          */
> >         avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
> >
> > -       rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> > -
> >         async_iter_reset(async);
> >
> >         for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> > @@ -1997,7 +1993,6 @@ virtio_dev_rx_async_packed_batch_enqueue(struct virtio_net *dev,
> >         uint16_t i;
> >
> >         vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > -               rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> >                 desc = vhost_iova_to_vva(dev, vq, desc_addrs[i],
> >                                 &lens[i], VHOST_ACCESS_RW);
> >                 hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc;
> >                 lens[i] = pkts[i]->pkt_len +
> > @@ -2106,8 +2101,6 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue
> >         uint16_t i;
> >
> >         do {
> > -               rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> > -
> >                 if (count - pkt_idx >= PACKED_BATCH_SIZE) {
> >                         if (!virtio_dev_rx_async_packed_batch(dev, vq, &pkts[pkt_idx],
> >                                         dma_id, vchan_id)) {
> >
> >
> > If any prefetch is left, the warning disappears.
> > So the asm usage probably impacts/disables some compiler feature for
> > this code path.
>
> So, as an aside, one of the reasons often cited for preferring
> intrinsics over inline asm is that inline asm cannot be analyzed when
> the compiler performs optimization. It could be that, after moving to
> the intrinsics, the compiler can now optimize across the prefetch0
> calls, leading to the observations here.
Agreed.

While we decide on the next steps for the vhost library, and to keep
this series moving, I decided to keep the previous asm implementation
and put the intrinsics under an MSVC-only #ifdef, roughly as sketched
below.
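A minimal sketch of that arrangement follows. The asm body matches the
existing x86 rte_prefetch0(); the RTE_TOOLCHAIN_MSVC guard name and the
exact intrinsic casts are my assumptions here, not the final patch.

/*
 * Sketch of rte_prefetch0() with the intrinsic limited to MSVC.
 * RTE_TOOLCHAIN_MSVC is assumed as the toolchain guard macro.
 */
#include <stdint.h>
#ifdef RTE_TOOLCHAIN_MSVC
#include <xmmintrin.h>  /* _mm_prefetch(), _MM_HINT_T0 */
#endif

static inline void
rte_prefetch0(const volatile void *p)
{
#ifdef RTE_TOOLCHAIN_MSVC
        /* MSVC has no GNU-style inline asm; the intrinsic also stays
         * visible to the optimizer, which is Tyler's point above. */
        _mm_prefetch((const char *)(uintptr_t)p, _MM_HINT_T0);
#else
        /* Pre-series behavior: opaque inline asm for GCC/Clang. */
        asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
#endif
}

-- 
David Marchand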