Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
>
> Hi, Jason
>
> Could you skip this patch?

I'm fine with either merging or dropping this.

> Let us review the other patches first?

I will be on vacation soon, and won't have time to do this until next week.

But I spotted two possible "issues":

1) the DMA metadata is stored in the headroom of the page; this breaks
frag coalescing, and we need to benchmark its impact
2) pre-mapped DMA addresses are not reused in the case of XDP_TX/XDP_REDIRECT

I see Michael has merged this series, so I'm fine with letting it go first.

Thanks

> Thanks.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 2/4] vdpa/mlx5: implement .reset_map driver op
On Mon, 2023-08-14 at 18:43 -0700, Si-Wei Liu wrote:
> This patch is based on top of the "vdpa/mlx5: Fixes
> for ASID handling" series [1].
>
> [1] vdpa/mlx5: Fixes for ASID handling
> https://lore.kernel.org/virtualization/20230802171231.11001-1-dtatu...@nvidia.com/
>
> Signed-off-by: Si-Wei Liu
> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
>  drivers/vdpa/mlx5/core/mr.c        | 72 +-
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 +++---
>  3 files changed, 54 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index b53420e..5c9a25a 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb,
>  			unsigned int asid);
>  void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev);
>  void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid);
> +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);
>
>  #define mlx5_vdpa_warn(__dev, format, ...) \
>  	dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \
>
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index 5a1971fc..c8d64fc 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr
>  	}
>  }
>
> -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
> +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev)
>  {
> -	if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> -		return;
> -
>  	prune_iotlb(mvdev);
>  }
>
> -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
> +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev)
>  {
>  	struct mlx5_vdpa_mr *mr = &mvdev->mr;
>
> -	if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid)
> -		return;
> -
>  	if (!mr->initialized)
>  		return;
>
> @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
>
>  	mutex_lock(&mr->mkey_mtx);
>
> -	_mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
> -	_mlx5_vdpa_destroy_cvq_mr(mvdev, asid);
> +	if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid)
> +		_mlx5_vdpa_destroy_dvq_mr(mvdev);
> +	if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid)
> +		_mlx5_vdpa_destroy_cvq_mr(mvdev);
>
>  	mutex_unlock(&mr->mkey_mtx);
>  }
> @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
>  }
>
>  static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev,
> -				    struct vhost_iotlb *iotlb,
> -				    unsigned int asid)
> +				    struct vhost_iotlb *iotlb)
>  {
> -	if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> -		return 0;
> -
>  	return dup_iotlb(mvdev, iotlb);
>  }
>
>  static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev,
> -				    struct vhost_iotlb *iotlb,
> -				    unsigned int asid)
> +				    struct vhost_iotlb *iotlb)
>  {
>  	struct mlx5_vdpa_mr *mr = &mvdev->mr;
>  	int err;
>
> -	if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid)
> -		return 0;
> -
>  	if (mr->initialized)
>  		return 0;
>
> @@ -574,20 +562,18 @@ static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
>  {
>  	int err;
>
> -	err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid);
> -	if (err)
> -		return err;
> -
> -	err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid);
> -	if (err)
> -		goto out_err;
> +	if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) {
> +		err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid);
> +		if (err)
> +			return err;
> +	}
> +	if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) {
> +		err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb);
> +		if (err)
> +			return err;

I think you still need the goto here: when CVQ and DVQ fall in the same
asid and there's a CVQ mr creation error, you are left stuck with the
DVQ mr.

> +	}
>
>  	return 0;
> -
> -out_err:
> -	_mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
> -
> -	return err;
>  }
>
>  int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb,
> @@ -601,6 +587,28 @@ int mlx5_vdpa_create_mr(stru
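For readers following along outside the kernel tree, the cleanup-ordering problem pointed out in the review above can be modeled as a small, self-contained sketch. This is a hedged illustration, not the mlx5 driver code: all `toy_*` names and the two-entry group table are invented stand-ins, and the only point is that when the DVQ and CVQ groups share an ASID and CVQ MR creation fails, the already-created DVQ MR must be unwound before returning.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the driver state; names are hypothetical. */
struct toy_mvdev {
	unsigned int group2asid[2];	/* [0] = data vq group, [1] = cvq group */
	bool dvq_mr;			/* data vq "memory region" created? */
	bool cvq_mr;
};

enum { DATAVQ = 0, CVQ = 1 };

static int create_dvq_mr(struct toy_mvdev *m) { m->dvq_mr = true; return 0; }
static void destroy_dvq_mr(struct toy_mvdev *m) { m->dvq_mr = false; }

/* Simulate a CVQ MR creation failure to exercise the error path. */
static int create_cvq_mr(struct toy_mvdev *m) { (void)m; return -1; }

/* Mirrors the shape of the per-group creation flow after the suggested
 * fix: the CVQ failure path unwinds the DVQ MR instead of returning. */
static int create_mr(struct toy_mvdev *m, unsigned int asid)
{
	int err;

	if (m->group2asid[DATAVQ] == asid) {
		err = create_dvq_mr(m);
		if (err)
			return err;
	}
	if (m->group2asid[CVQ] == asid) {
		err = create_cvq_mr(m);
		if (err)
			goto out_err;	/* the goto the review says is still needed */
	}
	return 0;

out_err:
	if (m->group2asid[DATAVQ] == asid)
		destroy_dvq_mr(m);
	return err;
}
```

With both groups in ASID 0, `create_mr()` fails in the CVQ step and the sketch verifies that the DVQ MR is not left dangling; without the `goto`, it would be.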
[PATCH] MAINTAINERS: Add myself as mlx5_vdpa driver
As Eli Cohen moved to other work, I'll be the contact point for mlx5_vdpa.

Signed-off-by: Dragos Tatulea
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a5863f1b016..c9a9259f4d37 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13555,6 +13555,12 @@ F: Documentation/leds/leds-mlxcpld.rst
 F:	drivers/leds/leds-mlxcpld.c
 F:	drivers/leds/leds-mlxreg.c

+MELLANOX MLX5_VDPA DRIVER
+M:	Dragos Tatulea
+L:	virtualization@lists.linux-foundation.org
+S:	Supported
+F:	drivers/vdpa/mlx5/
+
 MELLANOX PLATFORM DRIVER
 M:	Vadim Pasternak
 L:	platform-driver-...@vger.kernel.org
--
2.41.0
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
> >
> > Hi, Jason
> >
> > Could you skip this patch?
>
> I'm fine with either merging or dropping this.
>
> > Let us review the other patches first?
>
> I will be on vacation soon, and won't have time to do this until next week.

Have a happy vacation.

> But I spotted two possible "issues":
>
> 1) the DMA metadata is stored in the headroom of the page; this breaks
> frag coalescing, and we need to benchmark its impact

Not every page, just the first page of the COMP pages. So I think there
is no impact.

> 2) pre-mapped DMA addresses are not reused in the case of XDP_TX/XDP_REDIRECT

That is because the tx path is not in pre-mapped mode.

Thanks.

> I see Michael has merged this series, so I'm fine with letting it go first.
>
> Thanks
>
> > Thanks.
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Tue, Aug 15, 2023 at 03:50:23PM +0800, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
> >
> > Hi, Jason
> >
> > Could you skip this patch?
>
> I'm fine with either merging or dropping this.
>
> > Let us review the other patches first?
>
> I will be on vacation soon, and won't have time to do this until next week.
>
> But I spotted two possible "issues":
>
> 1) the DMA metadata is stored in the headroom of the page; this breaks
> frag coalescing, and we need to benchmark its impact
> 2) pre-mapped DMA addresses are not reused in the case of XDP_TX/XDP_REDIRECT
>
> I see Michael has merged this series, so I'm fine with letting it go first.
>
> Thanks

It's still queued for next, so it's not too late to drop it, or better,
to add a patch on top.

> > Thanks.
Re: [PATCH net v1] virtio_net: Introduce skb_vnet_common_hdr to avoid typecasting
On 2023-08-15 a.m. 6:51, Simon Horman wrote:
> On Mon, Aug 14, 2023 at 01:18:45PM -0400, Feng Liu wrote:
>
> + "David S. Miller"
>   Eric Dumazet
>   Jakub Kicinski
>   Paolo Abeni

Thanks for adding David S. Miller.

>> The virtio_net driver currently deals with different versions and types
>> of virtio net headers, such as virtio_net_hdr_mrg_rxbuf,
>> virtio_net_hdr_v1_hash, etc. Due to these variations, the code relies
>> on multiple type casts to convert memory between different structures,
>> potentially leading to bugs when there are changes in these structures.
>>
>> Introduce "struct virtio_net_common_hdr" as a unifying header structure
>> using a union. With this approach, the various virtio net header
>> structures can be converted by accessing different members of this
>> structure, thus eliminating the need for type casting and reducing the
>> risk of potential bugs.
>>
>> For example, the following code:
>>
>> static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>                                    struct receive_queue *rq,
>>                                    struct page *page, unsigned int offset,
>>                                    unsigned int len, unsigned int truesize,
>>                                    unsigned int headroom)
>> {
>> [...]
>>         struct virtio_net_hdr_mrg_rxbuf *hdr;
>> [...]
>>         hdr_len = vi->hdr_len;
>> [...]
>> ok:
>>         hdr = skb_vnet_hdr(skb);
>>         memcpy(hdr, hdr_p, hdr_len);
>> [...]
>> }
>>
>> When the VIRTIO_NET_F_HASH_REPORT feature is enabled, hdr_len = 20,
>> but sizeof(*hdr) is 12, so memcpy(hdr, hdr_p, hdr_len) will copy 20
>> bytes into hdr, which poses a potential risk of a bug. This risk can
>> be avoided by introducing struct virtio_net_common_hdr.
>>
>> Signed-off-by: Feng Liu
>> Reviewed-by: Jiri Pirko
>
> I'm unsure if this is 'net' material.

It is about the modification of the virtio_net driver. I think it should
be regarded as `net` material.
--- drivers/net/virtio_net.c| 29 - include/uapi/linux/virtio_net.h | 7 +++ 2 files changed, 23 insertions(+), 13 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 1270c8d23463..6ce0fbcabda9 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -344,9 +344,10 @@ static int rxq2vq(int rxq) return rxq * 2; } -static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb) +static inline struct virtio_net_common_hdr * +skb_vnet_common_hdr(struct sk_buff *skb) { - return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb; + return (struct virtio_net_common_hdr *)skb->cb; } /* @@ -469,7 +470,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, unsigned int headroom) { struct sk_buff *skb; - struct virtio_net_hdr_mrg_rxbuf *hdr; + struct virtio_net_common_hdr *hdr; unsigned int copy, hdr_len, hdr_padded_len; struct page *page_to_free = NULL; int tailroom, shinfo_size; @@ -554,7 +555,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, give_pages(rq, page); ok: - hdr = skb_vnet_hdr(skb); + hdr = skb_vnet_common_hdr(skb); memcpy(hdr, hdr_p, hdr_len); if (page_to_free) put_page(page_to_free); @@ -966,7 +967,7 @@ static struct sk_buff *receive_small_build_skb(struct virtnet_info *vi, return NULL; buf += header_offset; - memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len); + memcpy(skb_vnet_common_hdr(skb), buf, vi->hdr_len); return skb; } @@ -1577,7 +1578,8 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, { struct net_device *dev = vi->dev; struct sk_buff *skb; - struct virtio_net_hdr_mrg_rxbuf *hdr; + struct virtio_net_common_hdr *common_hdr; + struct virtio_net_hdr_mrg_rxbuf *mrg_hdr; if (unlikely(len < vi->hdr_len + ETH_HLEN)) { pr_debug("%s: short packet %i\n", dev->name, len); @@ -1597,18 +1599,19 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, if (unlikely(!skb)) return; - hdr = skb_vnet_hdr(skb); + common_hdr = 
skb_vnet_common_hdr(skb); if (dev->features & NETIF_F_RXHASH && vi->has_rss_hash_report) - virtio_skb_set_hash((const struct virtio_net_hdr_v1_hash *)hdr, skb); + virtio_skb_set_hash(&common_hdr->hash_v1_hdr, skb); - if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) + mrg_hdr = &common_hdr->mrg_hdr; + if (mrg_hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) skb->ip_summed = CHECKSUM_UNNECESSARY; - if (virtio_net_hdr_to_skb(skb, &hdr->hdr, + if (virtio_net_hdr_to_skb(skb, &mrg_hdr->hdr, virtio_is_little_endian(vi->vdev))) { net_warn_ratelimited("%s: bad gso: type: %u, size: %u\n", - dev->name, hdr->hdr.gso_type, - hdr->hdr.gso_size); +
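The size hazard described in the commit message above can be reproduced outside the kernel. The sketch below is a simplified, self-contained model: the struct layouts are abbreviated stand-ins, not the real `include/uapi/linux/virtio_net.h` definitions, and the sizes assume a typical ABI without exotic padding. The point it demonstrates is the one the patch makes: a union covering the largest header variant gives `memcpy()` a destination big enough for every negotiated header length, which the 12-byte `mrg_rxbuf`-style struct alone is not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Abbreviated stand-ins for the virtio net header variants. */
struct hdr_base {		/* ~ struct virtio_net_hdr_mrg_rxbuf: 12 bytes */
	uint8_t  flags, gso_type;
	uint16_t hdr_len, gso_size, csum_start, csum_offset, num_buffers;
};

struct hdr_hash {		/* ~ struct virtio_net_hdr_v1_hash: 20 bytes */
	struct hdr_base base;	/* simplified: hash fields follow */
	uint32_t hash_value;
	uint16_t hash_report, padding;
};

/* The unifying header, sized for the largest member, in the spirit of
 * struct virtio_net_common_hdr in the patch. */
union common_hdr {
	struct hdr_base mrg_hdr;
	struct hdr_hash hash_v1_hdr;
};

/* Copy a received header of hdr_len bytes; with the union as the
 * destination type, even a 20-byte copy stays in bounds. */
static void copy_hdr(union common_hdr *dst, const uint8_t *src, size_t hdr_len)
{
	memcpy(dst, src, hdr_len);
}
```

Copying 20 bytes into a `struct hdr_base *` would overflow by 8 bytes; copying into the union is always in bounds, and the caller then reads whichever member matches the negotiated features.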
Re: [PATCH net v1] virtio_net: Introduce skb_vnet_common_hdr to avoid typecasting
On Tue, Aug 15, 2023 at 12:29 PM Simon Horman wrote: > > On Tue, Aug 15, 2023 at 11:09:02AM -0400, Feng Liu wrote: > > > > > > On 2023-08-15 a.m.6:51, Simon Horman wrote: > > > External email: Use caution opening links or attachments > > > > > > > > > On Mon, Aug 14, 2023 at 01:18:45PM -0400, Feng Liu wrote: > > > > > > + "David S. Miller" > > >Eric Dumazet > > >Jakub Kicinski > > >Paolo Abeni > > > > > Thanks for adding David S. Miller. > > > > > > The virtio_net driver currently deals with different versions and types > > > > of virtio net headers, such as virtio_net_hdr_mrg_rxbuf, > > > > virtio_net_hdr_v1_hash, etc. Due to these variations, the code relies > > > > on multiple type casts to convert memory between different structures, > > > > potentially leading to bugs when there are changes in these structures. > > > > > > > > Introduces the "struct skb_vnet_common_hdr" as a unifying header > > > > structure using a union. With this approach, various virtio net header > > > > structures can be converted by accessing different members of this > > > > structure, thus eliminating the need for type casting and reducing the > > > > risk of potential bugs. > > > > > > > > For example following code: > > > > static struct sk_buff *page_to_skb(struct virtnet_info *vi, > > > >struct receive_queue *rq, > > > >struct page *page, unsigned int offset, > > > >unsigned int len, unsigned int truesize, > > > >unsigned int headroom) > > > > { > > > > [...] > > > >struct virtio_net_hdr_mrg_rxbuf *hdr; > > > > [...] > > > >hdr_len = vi->hdr_len; > > > > [...] > > > > ok: > > > >hdr = skb_vnet_hdr(skb); > > > >memcpy(hdr, hdr_p, hdr_len); > > > > [...] > > > > } > > > > > > > > When VIRTIO_NET_F_HASH_REPORT feature is enabled, hdr_len = 20 > > > > But the sizeof(*hdr) is 12, > > > > memcpy(hdr, hdr_p, hdr_len); will copy 20 bytes to the hdr, > > > > which make a potential risk of bug. And this risk can be avoided by > > > > introducing struct virtio_net_hdr_mrg_rxbuf. 
> > > > > > > > Signed-off-by: Feng Liu > > > > Reviewed-by: Jiri Pirko > > > > > > I'm unsure if this is 'net' material. > > > > > > > It is about the modification of the virtio_net driver. I think it should be > > regarded as `net` material. > > To clarify: In general new Networking features go via the net-next tree, > while bug fixes go via the net tree. I was suggesting this > is more appropriate for net-next, and that should be reflected in the > subject. > > Subject: [PATCH net-next] ... > > Sorry for not being clearer the first time around. Right, this should go to net-next. > > > > > > > --- > > > > drivers/net/virtio_net.c| 29 - > > > > include/uapi/linux/virtio_net.h | 7 +++ > > > > 2 files changed, 23 insertions(+), 13 deletions(-) > > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > > index 1270c8d23463..6ce0fbcabda9 100644 > > > > --- a/drivers/net/virtio_net.c > > > > +++ b/drivers/net/virtio_net.c > > > > @@ -344,9 +344,10 @@ static int rxq2vq(int rxq) > > > >return rxq * 2; > > > > } > > > > > > > > -static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct > > > > sk_buff *skb) > > > > +static inline struct virtio_net_common_hdr * > > > > +skb_vnet_common_hdr(struct sk_buff *skb) > > > > { > > > > - return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb; > > > > + return (struct virtio_net_common_hdr *)skb->cb; > > > > } > > > > > > > > /* > > > > @@ -469,7 +470,7 @@ static struct sk_buff *page_to_skb(struct > > > > virtnet_info *vi, > > > > unsigned int headroom) > > > > { > > > >struct sk_buff *skb; > > > > - struct virtio_net_hdr_mrg_rxbuf *hdr; > > > > + struct virtio_net_common_hdr *hdr; > > > >unsigned int copy, hdr_len, hdr_padded_len; > > > >struct page *page_to_free = NULL; > > > >int tailroom, shinfo_size; > > > > @@ -554,7 +555,7 @@ static struct sk_buff *page_to_skb(struct > > > > virtnet_info *vi, > > > >give_pages(rq, page); > > > > > > > > ok: > > > > - hdr = skb_vnet_hdr(skb); > > > > + hdr = 
skb_vnet_common_hdr(skb); > > > >memcpy(hdr, hdr_p, hdr_len); > > > >if (page_to_free) > > > >put_page(page_to_free); > > > > @@ -966,7 +967,7 @@ static struct sk_buff > > > > *receive_small_build_skb(struct virtnet_info *vi, > > > >return NULL; > > > > > > > >buf += header_offset; > > > > - memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len); > > > > + memcpy(skb_vnet_common_hdr(skb), buf, vi->hdr_len); > > > > > > > >return skb; > > > > } > > > > @@ -1577,7 +1578,8 @@ static void receive_buf(struct virtnet_info *vi, > > > > struct receive_queue *rq, > > > > { > > > >struct net_device *dev = vi->dev; > > > >struct sk_bu
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On 8/14/2023 7:21 PM, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote:
>>
>> Signed-off-by: Si-Wei Liu
>> ---
>>  include/linux/vdpa.h | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
>> index db1b0ea..3a3878d 100644
>> --- a/include/linux/vdpa.h
>> +++ b/include/linux/vdpa.h
>> @@ -314,6 +314,12 @@ struct vdpa_map_file {
>>   * @iova: iova to be unmapped
>>   * @size: size of the area
>>   * Returns integer: success (0) or error (< 0)
>> + * @reset_map: Reset device memory mapping (optional)
>> + * Needed for device that using device
>> + * specific DMA translation (on-chip IOMMU)
>
> This exposes the device internal to the upper layer which is not optimal.

Not sure what is meant by "device internal", but this op callback just
follows the existing convention for describing which vdpa parent this
API targets:

 * @set_map: Set device memory mapping (optional)
 * Needed for device that using device
 * specific DMA translation (on-chip IOMMU)
 :
 * @dma_map: Map an area of PA to IOVA (optional)
 * Needed for device that using device
 * specific DMA translation (on-chip IOMMU)
 * and preferring incremental map.
 :
 * @dma_unmap: Unmap an area of IOVA (optional but
 * must be implemented with dma_map)
 * Needed for device that using device
 * specific DMA translation (on-chip IOMMU)
 * and preferring incremental unmap.

> Btw, what's the difference between this and a simple set_map(NULL)?

I don't think parent drivers support this today - they can accept a
non-NULL iotlb containing an empty map entry, but not a NULL iotlb. The
behavior is undefined, or it may even cause a panic, when a NULL iotlb
is passed in. Further, this doesn't work with .dma_map parent drivers.
The reason a new op is needed (or better) is that it allows userspace
to tell the different reset behavior apart from an older kernel (via
the F_IOTLB_PERSIST feature bit in patch 4), while this behavior could
vary between parent drivers.

Regards,
-Siwei

> Thanks

>> + * @vdev: vdpa device
>> + * @asid: address space identifier
>> + * Returns integer: success (0) or error (< 0)
>>   * @get_vq_dma_dev: Get the dma device for a specific
>>   * virtqueue (optional)
>>   * @vdev: vdpa device
>> @@ -390,6 +396,7 @@ struct vdpa_config_ops {
>>  	u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
>>  	int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
>>  	u64 iova, u64 size);
>> +	int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
>>  	int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
>>  	unsigned int asid);
>>  	struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
>> --
>> 1.8.3.1
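The distinction being debated, that `.reset_map()` restores a parent-defined initial mapping rather than merely emptying the translation, can be sketched as a toy model. Everything here is illustrative (the `toy_*` names are invented and the state is a single range, nothing like a real IOTLB); it only shows why "reset to 1:1" is not expressible as clearing the map for a parent whose initial state is an identity mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy on-chip IOMMU state for one ASID: either an identity (1:1)
 * mapping or a single user-programmed range. Names are hypothetical. */
struct toy_iommu {
	bool identity;		/* true: IOVA == PA across the range */
	unsigned long long iova, pa, size;
};

static void toy_set_map(struct toy_iommu *m, unsigned long long iova,
			unsigned long long pa, unsigned long long size)
{
	m->identity = false;
	m->iova = iova; m->pa = pa; m->size = size;
}

/* Clearing the map (the "set_map(NULL)" idea) leaves no translation. */
static void toy_clear_map(struct toy_iommu *m)
{
	m->identity = false;
	m->iova = m->pa = m->size = 0;
}

/* ...whereas a .reset_map-style op restores the parent's initial 1:1
 * state, which is different from "no translation at all". */
static void toy_reset_map(struct toy_iommu *m)
{
	m->identity = true;
	m->iova = m->pa = 0;
	m->size = ~0ULL;
}
```

The two end states differ, which is the crux of the argument that an explicit op (plus a feature bit userspace can probe) carries information that an empty map cannot.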
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On 8/14/2023 7:25 PM, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote:
>>
>> Signed-off-by: Si-Wei Liu
>> ---
>>  drivers/vhost/vdpa.c | 16 +++-
>>  include/uapi/linux/vhost_types.h | 2 ++
>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 62b0a01..75092a7 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v)
>>  	return ops->resume;
>>  }
>>
>> +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
>> +{
>> +	struct vdpa_device *vdpa = v->vdpa;
>> +	const struct vdpa_config_ops *ops = vdpa->config;
>> +
>> +	return (!ops->set_map && !ops->dma_map) || ops->reset_map;
>
> So this means the IOTLB/IOMMU mappings have already been decoupled from
> the vdpa reset.

Not in the sense of the API; they have been coupled since day one in the
implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa
and vdpa_sim. Because of that, the later (improper) support for
virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for
running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA
ops"), misused the .reset() op to realize 1:1 mapping, rendering strong
coupling between device reset and the reset of iotlb mappings. This
series tries to rectify that implementation deficiency, while keeping
userspace working with the older kernel behavior.

> So it should have been noticed by the userspace.

Yes, userspace has noticed this on-chip IOMMU discrepancy since day one,
I suppose. Unfortunately there's already code in userspace, with this
assumption in mind, that proactively tears down and sets up iotlb
mappings around vdpa device reset...

> I guess we can just fix the simulator and mlx5 then we are fine?

Only IF we don't care about running new QEMU on older kernels with the
flawed on-chip IOMMU behavior around reset. But that's a big IF...

Regards,
-Siwei

> Thanks
Re: [PATCH RFC 3/4] vhost-vdpa: should restore 1:1 dma mapping before detaching driver
On 8/14/2023 7:32 PM, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote:
>>
>> Signed-off-by: Si-Wei Liu
>> ---
>>  drivers/vhost/vdpa.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index b43e868..62b0a01 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
>>  	return vhost_vdpa_alloc_as(v, asid);
>>  }
>>
>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
>> +{
>> +	struct vdpa_device *vdpa = v->vdpa;
>> +	const struct vdpa_config_ops *ops = vdpa->config;
>> +
>> +	if (ops->reset_map)
>> +		ops->reset_map(vdpa, asid);
>> +}
>> +
>>  static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>  {
>>  	struct vhost_vdpa_as *as = asid_to_as(v, asid);
>> @@ -140,6 +149,14 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
>>  	hlist_del(&as->hash_link);
>>  	vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
>> +	/*
>> +	 * Devices with on-chip IOMMU need to restore iotlb
>> +	 * to 1:1 identity mapping before vhost-vdpa is going
>> +	 * to be removed and detached from the device. Give
>> +	 * them a chance to do so, as this cannot be done
>> +	 * efficiently via the whole-range unmap call above.
>> +	 */
>
> Same question as before: if 1:1 is restored and the userspace doesn't
> do any IOTLB updating, it looks like a security issue? (Assuming IOVA
> is PA)

This is already flawed independently of this series. It was introduced
by the two commits I referenced earlier in the other thread. Today
userspace is already able to do so with a device reset while not doing
any IOTLB update. This series makes it neither worse nor better. FWIW,
as said earlier, to address this security issue properly we probably
should set up the 1:1 DMA mapping on demand in virtio_vdpa_probe() and
tear it down at virtio_vdpa_release_dev().

The question is whether virtio-vdpa is the only vdpa bus user that needs
1:1 DMA mapping, or, the other way around, whether vhost-vdpa is the only
exception among all vdpa bus drivers in not wanting to start with 1:1 by
default. This would help the parent vdpa implementation decide what kind
of mapping it should start with upon creation.

Regards,
-Siwei

> Thanks

>> +	vhost_vdpa_reset_map(v, asid);
>>  	kfree(as);
>>
>>  	return 0;
>> --
>> 1.8.3.1
Re: [PATCH RFC 2/4] vdpa/mlx5: implement .reset_map driver op
On 8/15/2023 1:26 AM, Dragos Tatulea wrote: On Mon, 2023-08-14 at 18:43 -0700, Si-Wei Liu wrote: This patch is based on top of the "vdpa/mlx5: Fixes for ASID handling" series [1]. [1] vdpa/mlx5: Fixes for ASID handling https://lore.kernel.org/virtualization/20230802171231.11001-1-dtatu...@nvidia.com/ Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c | 72 + - drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +++--- 3 files changed, 54 insertions(+), 37 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index b53420e..5c9a25a 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -123,6 +123,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev); void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) 
\ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 5a1971fc..c8d64fc 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -489,21 +489,15 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_mr *mr } } -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return; - prune_iotlb(mvdev); } -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev) { struct mlx5_vdpa_mr *mr = &mvdev->mr; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return; - if (!mr->initialized) return; @@ -521,8 +515,10 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int asid) mutex_lock(&mr->mkey_mtx); - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - _mlx5_vdpa_destroy_cvq_mr(mvdev, asid); + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) + _mlx5_vdpa_destroy_dvq_mr(mvdev); + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) + _mlx5_vdpa_destroy_cvq_mr(mvdev); mutex_unlock(&mr->mkey_mtx); } @@ -534,25 +530,17 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev) } static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { - if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid) - return 0; - return dup_iotlb(mvdev, iotlb); } static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev, - struct vhost_iotlb *iotlb, - unsigned int asid) + struct vhost_iotlb *iotlb) { struct mlx5_vdpa_mr *mr = &mvdev->mr; int err; - if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] != asid) - return 0; - if (mr->initialized) return 0; @@ -574,20 +562,18 @@ static int 
_mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, { int err; - err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); - if (err) - return err; - - err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid); - if (err) - goto out_err; + if (mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid); + if (err) + return err; + } + if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] == asid) { + err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb); + if (err) + return err; I think you still need the goto here, when CVQ and DVQ fall in same asid and there's a CVQ mr creation error, you are left stuck with the DVQ mr. Yes, you are right, I will fix this in v2. Thank you for spotting this! -Siwei + } return 0; - -out_err: - _mlx5_vdpa_destroy_dvq_mr(mvdev, asid); - - return err; } int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, @@ -601,6 +587,28 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, return err; }
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 7:49 AM, Eugenio Perez Martin wrote:
> On Wed, Aug 9, 2023 at 2:56 PM Si-Wei Liu wrote:
>>
>> The following patchset introduces a dedicated group for the descriptor
>> table to reduce live migration downtime when a passthrough VQ is being
>> switched to a shadow VQ. As this RFC set is to seek early feedback on
>> the uAPI and driver API part, for now there's no associated driver
>> patch consuming the API. As soon as the support is in place on both
>> hardware device and driver, performance data will be shown using a
>> real hardware device. The target goal of this series is to reduce the
>> SVQ switching overhead to less than 300ms on a ~100GB guest with 2
>> non-mq vhost-vdpa devices.
>
> I would expand the cover letter with something along the lines of:
>
> The reduction in the downtime is thanks to avoiding the full remap in
> the switching.

Sure, will add in the next revision.

>> The plan for the intended driver implementation is to use a dedicated
>> group (specifically, 2 in the table below) to host the descriptor
>> table for all data vqs, different from where buffer addresses are
>> contained (in group 0 as below). The cvq does not have to allocate a
>> dedicated group for its descriptor table, so its buffers and
>> descriptor table would always belong to the same group (1).
>>
>>               | data vq | ctrl vq
>> ==============+=========+========
>> vq_group      |    0    |    1
>> vq_desc_group |    2    |    1
>
> Acked-by: Eugenio Pérez

Thanks!
-Siwei

>> ---
>> Si-Wei Liu (3):
>>   vdpa: introduce dedicated descriptor group for virtqueue
>>   vhost-vdpa: introduce descriptor group backend feature
>>   vhost-vdpa: uAPI to get dedicated descriptor group id
>>
>>  drivers/vhost/vdpa.c             | 27 +++
>>  include/linux/vdpa.h             | 11 +++
>>  include/uapi/linux/vhost.h       |  8
>>  include/uapi/linux/vhost_types.h |  5 +
>>  4 files changed, 51 insertions(+)
>> --
>> 1.8.3.1
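The group layout in the cover letter's table can be condensed into the two lookups a parent driver would implement: one for the group holding a vq's buffer mappings, one for the group holding its descriptor table. This is an illustrative sketch only; the helper names and the "cvq is identified by index" convention are assumptions, not the proposed driver API.

```c
#include <assert.h>
#include <stdbool.h>

/* Per the cover letter: data vqs keep buffers in group 0 but place
 * their descriptor tables in a dedicated group 2; the control vq uses
 * group 1 for both. The cvq index convention here is hypothetical. */
static bool is_cvq(unsigned int vq_idx, unsigned int cvq_idx)
{
	return vq_idx == cvq_idx;
}

/* Group for the vq's buffer mappings (~ .get_vq_group). */
static unsigned int get_vq_group(unsigned int vq_idx, unsigned int cvq_idx)
{
	return is_cvq(vq_idx, cvq_idx) ? 1 : 0;
}

/* Group for the vq's descriptor table (~ .get_vq_desc_group). */
static unsigned int get_vq_desc_group(unsigned int vq_idx, unsigned int cvq_idx)
{
	return is_cvq(vq_idx, cvq_idx) ? 1 : 2;
}
```

Binding group 2 to a separate ASID is what lets the shadow-VQ switch remap only the descriptor tables instead of the whole guest memory, which is where the downtime saving comes from.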
Re: [PATCH RFC 2/3] vhost-vdpa: introduce descriptor group backend feature
On 8/9/2023 8:49 PM, Jason Wang wrote:
> On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote:
>> Userspace knows if the device has dedicated descriptor group or not
>> by checking this feature bit.
>>
>> It's only exposed if the vdpa driver backend implements the
>> .get_vq_desc_group() operation callback. Userspace trying to negotiate
>> this feature when it or the dependent _F_IOTLB_ASID feature hasn't
>> been exposed will result in an error.
>>
>> Signed-off-by: Si-Wei Liu
>> ---
>>  drivers/vhost/vdpa.c             | 17 +
>>  include/uapi/linux/vhost_types.h |  5 +
>>  2 files changed, 22 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index b43e868..f2e5dce 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v)
>>          return ops->resume;
>>  }
>>
>> +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v)
>> +{
>> +        struct vdpa_device *vdpa = v->vdpa;
>> +        const struct vdpa_config_ops *ops = vdpa->config;
>> +
>> +        return ops->get_vq_desc_group;
>> +}
>> +
>>  static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep)
>>  {
>>          struct vdpa_device *vdpa = v->vdpa;
>> @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>>          if (copy_from_user(&features, featurep, sizeof(features)))
>>                  return -EFAULT;
>>          if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
>> +                         BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
>>                           BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
>>                           BIT_ULL(VHOST_BACKEND_F_RESUME)))
>>                  return -EOPNOTSUPP;
>> @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>>          if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) &&
>>              !vhost_vdpa_can_resume(v))
>>                  return -EOPNOTSUPP;
>> +        if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
>> +            !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)))
>> +                return -EINVAL;
>> +        if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
>> +            !vhost_vdpa_has_desc_group(v))
>> +                return -EOPNOTSUPP;
>>          vhost_set_backend_features(&v->vdev, features);
>>          return 0;
>>  }
>> @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
>>                  features |= BIT_ULL(VHOST_BACKEND_F_SUSPEND);
>>          if (vhost_vdpa_can_resume(v))
>>                  features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
>> +        if (vhost_vdpa_has_desc_group(v))
>> +                features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
>>          if (copy_to_user(featurep, &features, sizeof(features)))
>>                  r = -EFAULT;
>>          break;
>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index d3aad12a..0856f84 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range {
>>  #define VHOST_BACKEND_F_SUSPEND  0x4
>>  /* Device can be resumed */
>>  #define VHOST_BACKEND_F_RESUME  0x5
>> +/* Device may expose the descriptor table, avail and used ring in a
>> + * different group for ASID binding than the buffers it contains.
> Nit:
>
> s/a different group/different groups/?

Yep, I will try to rephrase. Would below work?

"Device may expose virtqueue's descriptor table, avail and used ring in a
different group for ASID binding than where buffers it contains reside."

> Btw, not a native speaker but I think "descriptor" might be confusing
> since as you explained above, it contains more than just a descriptor
> table.

Yep. I chose "descriptor" because packed virtqueue doesn't have
"physical" avail and used ring other than descriptor table, but I think
I am open to a better name. I once thought of "descriptor ring" but that
might be too specific to packed virtqueue. Any suggestion?

Thanks,
-Siwei

> Thanks
>
>> + * Requires VHOST_BACKEND_F_IOTLB_ASID.
>> + */
>> +#define VHOST_BACKEND_F_DESC_ASID  0x6
>>
>>  #endif
>> --
>> 1.8.3.1
>>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
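The negotiation rules in the patch above (DESC_ASID requires IOTLB_ASID in the same feature set, and requires the backend callback) can be sketched as a small userspace model. The struct and helper names below are invented for illustration; only the bit semantics follow the patch (the IOTLB_ASID bit number is taken from vhost_types.h and is an assumption here):

```c
#include <errno.h>
#include <stdint.h>

#define F_IOTLB_ASID (1ULL << 3) /* VHOST_BACKEND_F_IOTLB_ASID (0x3) */
#define F_SUSPEND    (1ULL << 4)
#define F_RESUME     (1ULL << 5)
#define F_DESC_ASID  (1ULL << 6) /* new bit introduced by the patch */

/* stand-in for the vdpa_config_ops callback presence check */
struct fake_ops {
	int (*get_vq_desc_group)(int qidx);
};

/* mirrors the checks VHOST_SET_BACKEND_FEATURES performs for the new bit:
 * dependency first (-EINVAL), then backend capability (-EOPNOTSUPP) */
static int check_set_features(uint64_t features, const struct fake_ops *ops)
{
	if ((features & F_DESC_ASID) && !(features & F_IOTLB_ASID))
		return -EINVAL;      /* _F_DESC_ASID depends on _F_IOTLB_ASID */
	if ((features & F_DESC_ASID) && !ops->get_vq_desc_group)
		return -EOPNOTSUPP;  /* backend has no dedicated desc group */
	return 0;
}

/* dummy callback standing in for a parent driver that has desc groups */
static int dummy_desc_group(int qidx)
{
	(void)qidx;
	return 2;
}
```

In other words, userspace that sets F_DESC_ASID without F_IOTLB_ASID gets -EINVAL even on a capable backend, while a backend without .get_vq_desc_group rejects it with -EOPNOTSUPP.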
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On 8/9/2023 8:50 PM, Jason Wang wrote:
> On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote:
>> Following patchset introduces dedicated group for descriptor table to
>> reduce live migration downtime when passthrough VQ is being switched
>> to shadow VQ. As this RFC set is to seek early feedback on the uAPI
>> and driver API part, for now there's no associated driver patch
>> consuming the API. As soon as the support is in place on both hardware
>> device and driver, performance data will be shown using real hardware
>> device. The target goal of this series is to reduce the SVQ switching
>> overhead to less than 300ms on a ~100GB guest with 2 non-mq vhost-vdpa
>> devices.
>>
>> The plan of the intended driver implementation is to use a dedicated
>> group (specifically, 2 in below table) to host descriptor table for
>> all data vqs, different from where buffer addresses are contained (in
>> group 0 as below). cvq does not have to allocate dedicated group for
>> descriptor table, so its buffers and descriptor table would always
>> belong to a same group (1).
> I'm fine with this, but I think we need an implementation in the
> driver (e.g the simulator).

Yes. FWIW for the sake of time saving and getting this series accepted
promptly in the upcoming v6.6 merge window, the driver we're going to
support along with this series will be mlx5_vdpa in the formal
submission, and simulator support may come up later if I get spare
cycles. Do you foresee any issue without simulator change? We will have
the mlx5_vdpa driver consuming the API for sure, that's the target of
this work and it has to be proved working on real device at first.

Thanks,
-Siwei

> Thanks
>
>>               | data vq | ctrl vq
>> ==============+=========+========
>> vq_group      |    0    |    1
>> vq_desc_group |    2    |    1
>>
>>
>> ---
>>
>> Si-Wei Liu (3):
>>   vdpa: introduce dedicated descriptor group for virtqueue
>>   vhost-vdpa: introduce descriptor group backend feature
>>   vhost-vdpa: uAPI to get dedicated descriptor group id
>>
>>  drivers/vhost/vdpa.c             | 27 +++
>>  include/linux/vdpa.h             | 11 +++
>>  include/uapi/linux/vhost.h       |  8
>>  include/uapi/linux/vhost_types.h |  5 +
>>  4 files changed, 51 insertions(+)
>>
>> --
>> 1.8.3.1
>>
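The group layout table in the cover letter can be read as two lookup callbacks. Below is a hypothetical sketch of what a parent driver following this plan might return; the helper names and the "cvq is the last vq" convention are illustrative assumptions, not mlx5_vdpa's actual code:

```c
#include <stdbool.h>

#define DATA_VQ_GROUP   0U /* buffers of the data vqs */
#define CVQ_GROUP       1U /* cvq buffers and cvq descriptor table */
#define DATA_DESC_GROUP 2U /* descriptor tables of the data vqs */

/* assumption for this sketch: the control vq is the last vq index */
static bool is_cvq(unsigned int idx, unsigned int nvqs)
{
	return idx == nvqs - 1;
}

/* models .get_vq_group: which group the vq's buffers belong to */
static unsigned int get_vq_group(unsigned int idx, unsigned int nvqs)
{
	return is_cvq(idx, nvqs) ? CVQ_GROUP : DATA_VQ_GROUP;
}

/* models .get_vq_desc_group: which group the vq's descriptor table
 * (and, for split virtqueues, avail/used rings) belongs to */
static unsigned int get_vq_desc_group(unsigned int idx, unsigned int nvqs)
{
	return is_cvq(idx, nvqs) ? CVQ_GROUP : DATA_DESC_GROUP;
}
```

With this split, the shadow-VQ switch only has to rebind group 2 to a different ASID, leaving the buffer mappings in group 0 untouched.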
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo wrote:
>
> On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang wrote:
> > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
> > >
> > >
> > > Hi, Jason
> > >
> > > Could you skip this patch?
> >
> > I'm fine with either merging or dropping this.
> >
> > >
> > > Let's review the other patches first?
> >
> > I will be on vacation soon, and won't have time to do this until next week.
>
> Have a happy vacation.
>
> >
> > But I spot two possible "issues":
> >
> > 1) the DMA metadata were stored in the headroom of the page, this
> > breaks frags coalescing, we need to benchmark its impact
>
> Not every page, just the first page of the COMP pages.
>
> So I think there is no impact.

Nope, see this:

        if (SKB_FRAG_PAGE_ORDER &&
            !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
                /* Avoid direct reclaim but allow kswapd to wake */
                pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
                                          __GFP_COMP | __GFP_NOWARN |
                                          __GFP_NORETRY,
                                          SKB_FRAG_PAGE_ORDER);
                if (likely(pfrag->page)) {
                        pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
                        return true;
                }
        }

The comp page might be disabled due to the SKB_FRAG_PAGE_ORDER and
net_high_order_alloc_disable_key.

>
> > 2) pre mapped DMA addresses were not reused in the case of
> > XDP_TX/XDP_REDIRECT
>
> Because the tx is not the premapped mode.

Yes, we can optimize this on top.

Thanks

>
> Thanks.
>
> > I see Michael has merged this series so I'm fine to let it go first.
> >
> > Thanks
> >
> > >
> > > Thanks.
> > >
Re: [PATCH RFC 4/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
On Wed, Aug 16, 2023 at 6:31 AM Si-Wei Liu wrote:
>
>
> On 8/14/2023 7:25 PM, Jason Wang wrote:
> > On Tue, Aug 15, 2023 at 9:45 AM Si-Wei Liu wrote:
> >> Signed-off-by: Si-Wei Liu
> >> ---
> >>  drivers/vhost/vdpa.c             | 16 +++-
> >>  include/uapi/linux/vhost_types.h |  2 ++
> >>  2 files changed, 17 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 62b0a01..75092a7 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -406,6 +406,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v)
> >>          return ops->resume;
> >>  }
> >>
> >> +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
> >> +{
> >> +        struct vdpa_device *vdpa = v->vdpa;
> >> +        const struct vdpa_config_ops *ops = vdpa->config;
> >> +
> >> +        return (!ops->set_map && !ops->dma_map) || ops->reset_map;
> > So this means the IOTLB/IOMMU mappings have already been decoupled
> > from the vdpa reset.
> Not in the sense of API, it's been coupled since day one from the
> implementations of every on-chip IOMMU parent driver, namely mlx5_vdpa
> and vdpa_sim. Because of that, later on the (improper) support for
> virtio-vdpa, from commit 6f5312f80183 ("vdpa/mlx5: Add support for
> running with virtio_vdpa") and 6c3d329e6486 ("vdpa_sim: get rid of DMA
> ops") misused the .reset() op to realize 1:1 mapping, rendering strong
> coupling between device reset and reset of iotlb mappings. This series
> tries to rectify that implementation deficiency, while keeping
> userspace continuing to work with older kernel behavior.
>
> > So it should have been noticed by the userspace.
> Yes, userspace had noticed this on-chip IOMMU discrepancy since day one
> I suppose. Unfortunately there's already code in userspace with this
> assumption in mind that proactively tears down and sets up iotlb mapping
> around vdpa device reset...
>
> > I guess we can just fix the simulator and mlx5 then we are fine?
> Only IF we don't care about running new QEMU on older kernels with
> flawed on-chip iommu behavior around reset. But that's a big IF...

So what I meant is:

Userspace doesn't know whether the vendor specific mappings (set_map)
are required or not. And in the implementation of vhost_vdpa, if
platform IOMMU is used, the mappings are decoupled from the reset.

So if the Qemu works with parents with platform IOMMU it means Qemu can
work if we just decouple vendor specific mappings from the parents that
use set_map.

Thanks

> Regards,
> -Siwei
>
> > Thanks
>
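The predicate quoted in the hunk above — `(!ops->set_map && !ops->dma_map) || ops->reset_map` — says: mappings persist across reset either because the parent relies on the platform IOMMU (no vendor mapping ops at all), or because it implements `.reset_map`. A minimal standalone model of that check (the struct is a trimmed-down stand-in, not the real `vdpa_config_ops`):

```c
#include <stdbool.h>
#include <stddef.h>

/* trimmed-down stand-in for struct vdpa_config_ops: only the three
 * callbacks the predicate looks at, with simplified signatures */
struct ops_model {
	int (*set_map)(void);
	int (*dma_map)(void);
	int (*reset_map)(void);
};

/* mirrors vhost_vdpa_has_persistent_map() from the patch: either the
 * parent uses the platform IOMMU (no vendor translation ops), or it
 * provides reset_map to restore 1:1 mapping semantics explicitly */
static bool has_persistent_map(const struct ops_model *ops)
{
	return (!ops->set_map && !ops->dma_map) || ops->reset_map;
}

/* no-op callback used to mark "op implemented" in the tests below */
static int stub(void)
{
	return 0;
}
```

A platform-IOMMU parent reports persistent mappings unconditionally; a `set_map`-style parent only does so once it also implements `reset_map`, which is exactly what IOTLB_PERSIST lets userspace detect.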
Re: [PATCH RFC 0/3] vdpa: dedicated descriptor table group
On Wed, Aug 16, 2023 at 8:02 AM Si-Wei Liu wrote:
>
>
> On 8/9/2023 8:50 PM, Jason Wang wrote:
> > On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote:
> >> Following patchset introduces dedicated group for descriptor table to
> >> reduce live migration downtime when passthrough VQ is being switched
> >> to shadow VQ. As this RFC set is to seek early feedback on the uAPI
> >> and driver API part, for now there's no associated driver patch
> >> consuming the API. As soon as the support is in place on both
> >> hardware device and driver, performance data will be shown using real
> >> hardware device. The target goal of this series is to reduce the SVQ
> >> switching overhead to less than 300ms on a ~100GB guest with 2 non-mq
> >> vhost-vdpa devices.
> >>
> >> The plan of the intended driver implementation is to use a dedicated
> >> group (specifically, 2 in below table) to host descriptor table for
> >> all data vqs, different from where buffer addresses are contained (in
> >> group 0 as below). cvq does not have to allocate dedicated group for
> >> descriptor table, so its buffers and descriptor table would always
> >> belong to a same group (1).
> > I'm fine with this, but I think we need an implementation in the
> > driver (e.g the simulator).
> Yes. FWIW for the sake of time saving and getting this series accepted
> promptly in the upcoming v6.6 merge window, the driver we're going to
> support along with this series will be mlx5_vdpa in the formal
> submission, and simulator support may come up later if I get spare
> cycles.

That would be fine.

> Do you foresee any issue without simulator change?

No.

> We will have
> mlx5_vdpa driver consuming the API for sure, that's the target of this
> work and it has to be proved working on real device at first.

Yes, I mention the simulator just because it's simple, it would be also
great if it is proved on real vDPA parent.

Thanks

> Thanks,
> -Siwei
>
> >
> > Thanks
> >
> >>               | data vq | ctrl vq
> >> ==============+=========+========
> >> vq_group      |    0    |    1
> >> vq_desc_group |    2    |    1
> >>
> >>
> >> ---
> >>
> >> Si-Wei Liu (3):
> >>   vdpa: introduce dedicated descriptor group for virtqueue
> >>   vhost-vdpa: introduce descriptor group backend feature
> >>   vhost-vdpa: uAPI to get dedicated descriptor group id
> >>
> >>  drivers/vhost/vdpa.c             | 27 +++
> >>  include/linux/vdpa.h             | 11 +++
> >>  include/uapi/linux/vhost.h       |  8 ++++
> >>  include/uapi/linux/vhost_types.h |  5 +
> >>  4 files changed, 51 insertions(+)
> >>
> >> --
> >> 1.8.3.1
> >>
Re: [PATCH RFC 1/4] vdpa: introduce .reset_map operation callback
On Wed, Aug 16, 2023 at 3:49 AM Si-Wei Liu wrote:
>
>
> On 8/14/2023 7:21 PM, Jason Wang wrote:
> > On Tue, Aug 15, 2023 at 9:46 AM Si-Wei Liu wrote:
> >> Signed-off-by: Si-Wei Liu
> >> ---
> >>  include/linux/vdpa.h | 7 +++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> >> index db1b0ea..3a3878d 100644
> >> --- a/include/linux/vdpa.h
> >> +++ b/include/linux/vdpa.h
> >> @@ -314,6 +314,12 @@ struct vdpa_map_file {
> >>   *                              @iova: iova to be unmapped
> >>   *                              @size: size of the area
> >>   *                              Returns integer: success (0) or error (< 0)
> >> + * @reset_map:                 Reset device memory mapping (optional)
> >> + *                              Needed for device that using device
> >> + *                              specific DMA translation (on-chip IOMMU)
> > This exposes the device internal to the upper layer which is not optimal.
> Not sure what it means by "device internal", but this op callback
> just follows the existing convention to describe what vdpa parent this
> API targets.

I meant the bus tries to hide the differences among vendors. So it
needs to hide on-chip IOMMU stuff from the upper layer. We can expose
two-dimensional IO mapping models but it looks like over engineering
for this issue.

More below.

>
>   * @set_map:                     Set device memory mapping (optional)
>   *                              Needed for device that using device
>   *                              specific DMA translation (on-chip IOMMU)
>   :
>   :
>   * @dma_map:                     Map an area of PA to IOVA (optional)
>   *                              Needed for device that using device
>   *                              specific DMA translation (on-chip IOMMU)
>   *                              and preferring incremental map.
>   :
>   :
>   * @dma_unmap:                   Unmap an area of IOVA (optional but
>   *                              must be implemented with dma_map)
>   *                              Needed for device that using device
>   *                              specific DMA translation (on-chip IOMMU)
>   *                              and preferring incremental unmap.
>
> > Btw, what's the difference between this and a simple
> >
> > set_map(NULL)?
> I don't think parent drivers support this today - they can accept
> non-NULL iotlb containing empty map entry, but not a NULL iotlb. The
> behavior is undefined or it even causes panic when a NULL iotlb is
> passed in.

We can do this simple change if it can work.

> Further this doesn't work with .dma_map parent drivers.

Probably, but I'd remove dma_map as it doesn't have any real users
except for the simulator.

> The reason why a new op is needed or better is because it allows
> userspace to tell apart different reset behavior from the older kernel
> (via the F_IOTLB_PERSIST feature bit in patch 4), while this behavior
> could vary between parent drivers.

I'm ok with a new feature flag, but we need to first seek a way to
reuse the existing API.

Thanks

> Regards,
> -Siwei
>
> >
> > Thanks
> >
> >> + *                              @vdev: vdpa device
> >> + *                              @asid: address space identifier
> >> + *                              Returns integer: success (0) or error (< 0)
> >>   * @get_vq_dma_dev:            Get the dma device for a specific
> >>   *                              virtqueue (optional)
> >>   *                              @vdev: vdpa device
> >> @@ -390,6 +396,7 @@ struct vdpa_config_ops {
> >>                           u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
> >>          int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
> >>                           u64 iova, u64 size);
> >> +        int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
> >>          int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
> >>                                unsigned int asid);
> >>          struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
> >> --
> >> 1.8.3.1
> >>
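The `set_map(NULL)` point above can be made concrete with a sketch: a caller that wants "drop all mappings" cannot safely improvise it through `.set_map` today (a NULL iotlb is undefined or crashes in current parents), so it has to go through an explicit op. This is an illustrative model only — the signatures are simplified from the real `vdpa_config_ops`, and the helper name is made up:

```c
#include <errno.h>
#include <stddef.h>

struct iotlb; /* opaque in this sketch */

/* simplified stand-ins for the two relevant vdpa callbacks */
struct ops_model {
	int (*set_map)(struct iotlb *iotlb);
	int (*reset_map)(unsigned int asid);
};

/* hypothetical helper: wipe all mappings for an ASID. With no
 * reset_map there is no safe generic fallback, so refuse rather
 * than pass a NULL iotlb into set_map. */
static int wipe_mappings(const struct ops_model *ops, unsigned int asid)
{
	if (ops->reset_map)
		return ops->reset_map(asid);
	return -EOPNOTSUPP;
}

/* stubs used by the tests below */
static int stub_set_map(struct iotlb *iotlb)
{
	(void)iotlb;
	return 0;
}

static int stub_reset_map(unsigned int asid)
{
	(void)asid;
	return 0;
}
```

This is the shape of the argument in the thread: either `.set_map` grows well-defined NULL semantics in every parent, or a dedicated `.reset_map` op carries that meaning.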
Re: [PATCH RFC 2/3] vhost-vdpa: introduce descriptor group backend feature
On Wed, Aug 16, 2023 at 7:54 AM Si-Wei Liu wrote:
>
>
> On 8/9/2023 8:49 PM, Jason Wang wrote:
> > On Wed, Aug 9, 2023 at 8:56 PM Si-Wei Liu wrote:
> >> Userspace knows if the device has dedicated descriptor group or not
> >> by checking this feature bit.
> >>
> >> It's only exposed if the vdpa driver backend implements the
> >> .get_vq_desc_group() operation callback. Userspace trying to negotiate
> >> this feature when it or the dependent _F_IOTLB_ASID feature hasn't
> >> been exposed will result in an error.
> >>
> >> Signed-off-by: Si-Wei Liu
> >> ---
> >>  drivers/vhost/vdpa.c             | 17 +
> >>  include/uapi/linux/vhost_types.h |  5 +
> >>  2 files changed, 22 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index b43e868..f2e5dce 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -389,6 +389,14 @@ static bool vhost_vdpa_can_resume(const struct vhost_vdpa *v)
> >>          return ops->resume;
> >>  }
> >>
> >> +static bool vhost_vdpa_has_desc_group(const struct vhost_vdpa *v)
> >> +{
> >> +        struct vdpa_device *vdpa = v->vdpa;
> >> +        const struct vdpa_config_ops *ops = vdpa->config;
> >> +
> >> +        return ops->get_vq_desc_group;
> >> +}
> >> +
> >>  static long vhost_vdpa_get_features(struct vhost_vdpa *v, u64 __user *featurep)
> >>  {
> >>          struct vdpa_device *vdpa = v->vdpa;
> >> @@ -679,6 +687,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >>          if (copy_from_user(&features, featurep, sizeof(features)))
> >>                  return -EFAULT;
> >>          if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
> >> +                         BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
> >>                           BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
> >>                           BIT_ULL(VHOST_BACKEND_F_RESUME)))
> >>                  return -EOPNOTSUPP;
> >> @@ -688,6 +697,12 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >>          if ((features & BIT_ULL(VHOST_BACKEND_F_RESUME)) &&
> >>              !vhost_vdpa_can_resume(v))
> >>                  return -EOPNOTSUPP;
> >> +        if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
> >> +            !(features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)))
> >> +                return -EINVAL;
> >> +        if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
> >> +            !vhost_vdpa_has_desc_group(v))
> >> +                return -EOPNOTSUPP;
> >>          vhost_set_backend_features(&v->vdev, features);
> >>          return 0;
> >>  }
> >> @@ -741,6 +756,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
> >>                  features |= BIT_ULL(VHOST_BACKEND_F_SUSPEND);
> >>          if (vhost_vdpa_can_resume(v))
> >>                  features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
> >> +        if (vhost_vdpa_has_desc_group(v))
> >> +                features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
> >>          if (copy_to_user(featurep, &features, sizeof(features)))
> >>                  r = -EFAULT;
> >>          break;
> >> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >> index d3aad12a..0856f84 100644
> >> --- a/include/uapi/linux/vhost_types.h
> >> +++ b/include/uapi/linux/vhost_types.h
> >> @@ -181,5 +181,10 @@ struct vhost_vdpa_iova_range {
> >>  #define VHOST_BACKEND_F_SUSPEND  0x4
> >>  /* Device can be resumed */
> >>  #define VHOST_BACKEND_F_RESUME  0x5
> >> +/* Device may expose the descriptor table, avail and used ring in a
> >> + * different group for ASID binding than the buffers it contains.
> > Nit:
> >
> > s/a different group/different groups/?
> Yep, I will try to rephrase. Would below work?
>
> "Device may expose virtqueue's descriptor table, avail and used ring in a
> different group for ASID binding than where buffers it contains reside."
>
> > Btw, not a native speaker but I think "descriptor" might be confusing
> > since as you explained above, it contains more than just a descriptor
> > table.
> Yep. I chose "descriptor" because packed virtqueue doesn't have
> "physical" avail and used ring other than descriptor table, but I think
> I am open to a better name. I once thought of "descriptor ring" but that
> might be too specific to packed virtqueue. Any suggestion?

I don't have a better idea, probably "metadata" ?

Thanks

> Thanks,
> -Siwei
>
> >
> > Thanks
> >
> >> + * Requires VHOST_BACKEND_F_IOTLB_ASID.
> >> + */
> >> +#define VHOST_BACKEND_F_DESC_ASID  0x6
> >>
> >>  #endif
> >> --
> >> 1.8.3.1
> >>
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Wed, 16 Aug 2023 09:13:48 +0800, Jason Wang wrote:
> On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo wrote:
> >
> > On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang wrote:
> > > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
> > > >
> > > >
> > > > Hi, Jason
> > > >
> > > > Could you skip this patch?
> > >
> > > I'm fine with either merging or dropping this.
> > >
> > > >
> > > > Let's review the other patches first?
> > >
> > > I will be on vacation soon, and won't have time to do this until next week.
> >
> > Have a happy vacation.
> >
> > >
> > > But I spot two possible "issues":
> > >
> > > 1) the DMA metadata were stored in the headroom of the page, this
> > > breaks frags coalescing, we need to benchmark its impact
> >
> > Not every page, just the first page of the COMP pages.
> >
> > So I think there is no impact.
>
> Nope, see this:
>
>         if (SKB_FRAG_PAGE_ORDER &&
>             !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
>                 /* Avoid direct reclaim but allow kswapd to wake */
>                 pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>                                           __GFP_COMP | __GFP_NOWARN |
>                                           __GFP_NORETRY,
>                                           SKB_FRAG_PAGE_ORDER);
>                 if (likely(pfrag->page)) {
>                         pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
>                         return true;
>                 }
>         }
>
> The comp page might be disabled due to the SKB_FRAG_PAGE_ORDER and
> net_high_order_alloc_disable_key.

YES.

But if comp page is disabled, then we only get one page each time. The
pages are not contiguous, so we don't have frags coalescing.

If you mean the two pages got from alloc_page may be contiguous, the
coalescing may then be broken. It's a possibility, but I think the
impact will be small.

Thanks.

> > >
> > > 2) pre mapped DMA addresses were not reused in the case of
> > > XDP_TX/XDP_REDIRECT
> >
> > Because the tx is not the premapped mode.
>
> Yes, we can optimize this on top.
>
> Thanks
>
> > Thanks.
> >
> > > I see Michael has merged this series so I'm fine to let it go first.
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks.
> > > >
> > >
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Wed, Aug 16, 2023 at 10:16 AM Xuan Zhuo wrote: > > On Wed, 16 Aug 2023 09:13:48 +0800, Jason Wang wrote: > > On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo > > wrote: > > > > > > On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang > > > wrote: > > > > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo > > > > wrote: > > > > > > > > > > > > > > > Hi, Jason > > > > > > > > > > Could you skip this patch? > > > > > > > > I'm fine with either merging or dropping this. > > > > > > > > > > > > > > Let we review other patches firstly? > > > > > > > > I will be on vacation soon, and won't have time to do this until next > > > > week. > > > > > > Have a happly vacation. > > > > > > > > > > > But I spot two possible "issues": > > > > > > > > 1) the DMA metadata were stored in the headroom of the page, this > > > > breaks frags coalescing, we need to benchmark it's impact > > > > > > Not every page, just the first page of the COMP pages. > > > > > > So I think there is no impact. > > > > Nope, see this: > > > > if (SKB_FRAG_PAGE_ORDER && > > !static_branch_unlikely(&net_high_order_alloc_disable_key)) { > > /* Avoid direct reclaim but allow kswapd to wake */ > > pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | > > __GFP_COMP | __GFP_NOWARN | > > __GFP_NORETRY, > > SKB_FRAG_PAGE_ORDER); > > if (likely(pfrag->page)) { > > pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; > > return true; > > } > > } > > > > The comp page might be disabled due to the SKB_FRAG_PAGE_ORDER and > > net_high_order_alloc_disable_key. > > > YES. > > But if comp page is disabled. Then we only get one page each time. The pages > are > not contiguous, so we don't have frags coalescing. > > If you mean the two pages got from alloc_page may be contiguous. The > coalescing > may then be broken. It's a possibility, but I think the impact will be small. Let's have a simple benchmark and see? Thanks > > Thanks. 
> > > > > > > > > > > > > > 2) pre mapped DMA addresses were not reused in the case of > > > > XDP_TX/XDP_REDIRECT > > > > > > Because that the tx is not the premapped mode. > > > > Yes, we can optimize this on top. > > > > Thanks > > > > > > > > Thanks. > > > > > > > > > > > I see Michael has merge this series so I'm fine to let it go first. > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Wed, 16 Aug 2023 10:19:34 +0800, Jason Wang wrote: > On Wed, Aug 16, 2023 at 10:16 AM Xuan Zhuo wrote: > > > > On Wed, 16 Aug 2023 09:13:48 +0800, Jason Wang wrote: > > > On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo > > > wrote: > > > > > > > > On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang > > > > wrote: > > > > > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi, Jason > > > > > > > > > > > > Could you skip this patch? > > > > > > > > > > I'm fine with either merging or dropping this. > > > > > > > > > > > > > > > > > Let we review other patches firstly? > > > > > > > > > > I will be on vacation soon, and won't have time to do this until next > > > > > week. > > > > > > > > Have a happly vacation. > > > > > > > > > > > > > > But I spot two possible "issues": > > > > > > > > > > 1) the DMA metadata were stored in the headroom of the page, this > > > > > breaks frags coalescing, we need to benchmark it's impact > > > > > > > > Not every page, just the first page of the COMP pages. > > > > > > > > So I think there is no impact. > > > > > > Nope, see this: > > > > > > if (SKB_FRAG_PAGE_ORDER && > > > !static_branch_unlikely(&net_high_order_alloc_disable_key)) { > > > /* Avoid direct reclaim but allow kswapd to wake */ > > > pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | > > > __GFP_COMP | __GFP_NOWARN | > > > __GFP_NORETRY, > > > SKB_FRAG_PAGE_ORDER); > > > if (likely(pfrag->page)) { > > > pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; > > > return true; > > > } > > > } > > > > > > The comp page might be disabled due to the SKB_FRAG_PAGE_ORDER and > > > net_high_order_alloc_disable_key. > > > > > > YES. > > > > But if comp page is disabled. Then we only get one page each time. The > > pages are > > not contiguous, so we don't have frags coalescing. > > > > If you mean the two pages got from alloc_page may be contiguous. The > > coalescing > > may then be broken. 
It's a possibility, but I think the impact will be > > small. > > Let's have a simple benchmark and see? That is ok. I think you want to know the perf num with big traffic and the comp page disabled. Thanks. > > Thanks > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > 2) pre mapped DMA addresses were not reused in the case of > > > > > XDP_TX/XDP_REDIRECT > > > > > > > > Because that the tx is not the premapped mode. > > > > > > Yes, we can optimize this on top. > > > > > > Thanks > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > I see Michael has merge this series so I'm fine to let it go first. > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Wed, Aug 16, 2023 at 10:24 AM Xuan Zhuo wrote: > > On Wed, 16 Aug 2023 10:19:34 +0800, Jason Wang wrote: > > On Wed, Aug 16, 2023 at 10:16 AM Xuan Zhuo > > wrote: > > > > > > On Wed, 16 Aug 2023 09:13:48 +0800, Jason Wang > > > wrote: > > > > On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo > > > > wrote: > > > > > > > > > > On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang > > > > > wrote: > > > > > > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi, Jason > > > > > > > > > > > > > > Could you skip this patch? > > > > > > > > > > > > I'm fine with either merging or dropping this. > > > > > > > > > > > > > > > > > > > > Let we review other patches firstly? > > > > > > > > > > > > I will be on vacation soon, and won't have time to do this until > > > > > > next week. > > > > > > > > > > Have a happly vacation. > > > > > > > > > > > > > > > > > But I spot two possible "issues": > > > > > > > > > > > > 1) the DMA metadata were stored in the headroom of the page, this > > > > > > breaks frags coalescing, we need to benchmark it's impact > > > > > > > > > > Not every page, just the first page of the COMP pages. > > > > > > > > > > So I think there is no impact. > > > > > > > > Nope, see this: > > > > > > > > if (SKB_FRAG_PAGE_ORDER && > > > > !static_branch_unlikely(&net_high_order_alloc_disable_key)) > > > > { > > > > /* Avoid direct reclaim but allow kswapd to wake */ > > > > pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) > > > > | > > > > __GFP_COMP | __GFP_NOWARN | > > > > __GFP_NORETRY, > > > > SKB_FRAG_PAGE_ORDER); > > > > if (likely(pfrag->page)) { > > > > pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; > > > > return true; > > > > } > > > > } > > > > > > > > The comp page might be disabled due to the SKB_FRAG_PAGE_ORDER and > > > > net_high_order_alloc_disable_key. > > > > > > > > > YES. > > > > > > But if comp page is disabled. Then we only get one page each time. 
The > > > pages are > > > not contiguous, so we don't have frags coalescing. > > > > > > If you mean the two pages got from alloc_page may be contiguous. The > > > coalescing > > > may then be broken. It's a possibility, but I think the impact will be > > > small. > > > > Let's have a simple benchmark and see? > > > That is ok. > > I think you want to know the perf num with big traffic and the comp page > disabled. Yes. Thanks > > Thanks. > > > > > > Thanks > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > 2) pre mapped DMA addresses were not reused in the case of > > > > > > XDP_TX/XDP_REDIRECT > > > > > > > > > > Because that the tx is not the premapped mode. > > > > > > > > Yes, we can optimize this on top. > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > I see Michael has merge this series so I'm fine to let it go first. > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
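The coalescing condition debated in this thread follows the shape of `skb_can_coalesce()`: a new chunk merges into the previous frag only when it sits in the same (compound) page and starts exactly where the previous frag ends. A simplified userspace model of that test (the real kernel helper also checks zerocopy state and the frag index):

```c
#include <stdbool.h>
#include <stddef.h>

/* toy stand-in for the tail skb fragment */
struct frag_model {
	int page_id;  /* identifies the (compound) page backing the frag */
	size_t off;   /* offset of the frag inside that page */
	size_t len;   /* length of the frag */
};

/* simplified skb_can_coalesce(): same page, and the new data begins
 * right at the end of the existing tail frag */
static bool can_coalesce(const struct frag_model *tail,
			 int page_id, size_t off)
{
	return tail->page_id == page_id && off == tail->off + tail->len;
}
```

This is why the two sub-cases in the thread differ: with order-0 pages each buffer lands in its own page, so the same-page test already fails and coalescing never happens; within one compound page, the open question is how often successive buffers still satisfy the contiguous-offset test once metadata headroom is reserved — which is exactly what the proposed benchmark is meant to settle.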
Re: [PATCH net v1] virtio_net: Introduce skb_vnet_common_hdr to avoid typecasting
On 2023-08-15 p.m.2:13, Willem de Bruijn wrote: External email: Use caution opening links or attachments On Tue, Aug 15, 2023 at 12:29 PM Simon Horman wrote: On Tue, Aug 15, 2023 at 11:09:02AM -0400, Feng Liu wrote: To clarify: In general new Networking features go via the net-next tree, while bug fixes go via the net tree. I was suggesting this is more appropriate for net-next, and that should be reflected in the subject. Subject: [PATCH net-next] ... Sorry for not being clearer the first time around. Right, this should go to net-next. Will do, thanks diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index 12c1c9699935..db40f93ae8b3 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -201,6 +201,13 @@ struct virtio_net_hdr_mrg_rxbuf { struct virtio_net_hdr hdr; __virtio16 num_buffers; /* Number of merged rx buffers */ }; + +struct virtio_net_common_hdr { + union { + struct virtio_net_hdr_mrg_rxbuf mrg_hdr; + struct virtio_net_hdr_v1_hash hash_v1_hdr; + }; +}; Does this belong in the UAPI? I would have assumed it's a Kernel implementation detail. The existing codes, virtio_net.h is in uapi/linux/, I added the new structure and followed existing code. My modification is related to Kernel implementation detail now. The header you have modified forms part of the userspace API (UAPI). Perhaps there is something about virtio_net that makes this correct, but it seems to me that kernel-internal details don't belong there. FWIW, I ran into similar issues before in a draft that added timestamp support [1] If we're going to change this structure, we should do it in a way that is forward proof to future extensions to the virtio spec and with that the fields in this struct. Especially in UAPI. Is virtio_net_hdr_v1_hash the latest virtio-spec compliant header? And do we expect for v1.3 to just add some fields to this? 
The struct comment of virtio_net_hdr_v1 states "This is bitwise-equivalent to the legacy struct virtio_net_hdr_mrg_rxbuf, only flattened.". I don't quite understand what the flattening bought, vs having struct virtio_net_hdr as first member. Another difference may be the endianness between legacy (0.9) and v1.0+. Since legacy virtio will no longer be modified, I don't think there is much value is exposing this new union as UAPI. I do appreciate the benefit to the implementation. [1] https://patches.linaro.org/project/netdev/patch/20210208185558.995292-3-willemdebruijn.ker...@gmail.com/ Hi, William and Simon Thanks for the detailed explanation. I kept virtio_net_hdr_mrg_rxbuf and virtio_net_hdr_v1_hash structures in virtio_net.h, which can be forward compatible with existing user applications which use these structures. After checking kernel code, the virtio_net_hdr_v1_hash structure does only add new members to virtio_net_hdr_mrg_rxbuf, so the spec should only add new members, otherwise there will be compatibility problems in struct virtio_net_hdr_v1_hash structure. struct virtio_net_hdr_v1_hash { struct virtio_net_hdr_v1 hdr; /*same size as virtio_net_hdr*/ [...] __le32 hash_value; /*new member*/ __le16 hash_report; /*new member*/ __le16 padding; /*new member*/ }; virtio_net_hdr_v1_hash cannot use virtio_net_hdr as the first member, because in virtio_net_hdr_v1, csum_start and csum_offset are stored in union as a structure, and virtio_net_hdr cannot be used instead. struct virtio_net_hdr_v1 { [...] union { struct { __virtio16 csum_start; __virtio16 csum_offset; }; [...] }; __virtio16 num_buffers; /* Number of merged rx buffers */ }; struct virtio_net_hdr { [...] __virtio16 csum_start; __virtio16 csum_offset; }; In addition, I put this new structure virtio_net_common_hdr in uapi, hoping it could be used in future user space application to avoid potential risks caused by type coercion (such as the problems mentioned in the patch description ). 
So I think it should be in this header file. What do you think?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] MAINTAINERS: Add myself as mlx5_vdpa driver
On Tue, Aug 15, 2023 at 4:33 PM Dragos Tatulea wrote:
>
> As Eli Cohen moved to other work, I'll be the contact point for
> mlx5_vdpa.
>
> Signed-off-by: Dragos Tatulea

Acked-by: Jason Wang

Thanks

> ---
>  MAINTAINERS | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a5863f1b016..c9a9259f4d37 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13555,6 +13555,12 @@ F: Documentation/leds/leds-mlxcpld.rst
>  F: drivers/leds/leds-mlxcpld.c
>  F: drivers/leds/leds-mlxreg.c
>
> +MELLANOX MLX5_VDPA DRIVER
> +M: Dragos Tatulea
> +L: virtualization@lists.linux-foundation.org
> +S: Supported
> +F: drivers/vdpa/mlx5/
> +
>  MELLANOX PLATFORM DRIVER
>  M: Vadim Pasternak
>  L: platform-driver-...@vger.kernel.org
> --
> 2.41.0
Re: [PATCH vhost v13 05/12] virtio_ring: introduce virtqueue_dma_dev()
On Wed, 16 Aug 2023 10:33:34 +0800, Jason Wang wrote:
> On Wed, Aug 16, 2023 at 10:24 AM Xuan Zhuo wrote:
> >
> > On Wed, 16 Aug 2023 10:19:34 +0800, Jason Wang wrote:
> > > On Wed, Aug 16, 2023 at 10:16 AM Xuan Zhuo wrote:
> > > >
> > > > On Wed, 16 Aug 2023 09:13:48 +0800, Jason Wang wrote:
> > > > > On Tue, Aug 15, 2023 at 5:40 PM Xuan Zhuo wrote:
> > > > > >
> > > > > > On Tue, 15 Aug 2023 15:50:23 +0800, Jason Wang wrote:
> > > > > > > On Tue, Aug 15, 2023 at 2:32 PM Xuan Zhuo wrote:
> > > > > > > >
> > > > > > > > Hi, Jason
> > > > > > > >
> > > > > > > > Could you skip this patch?
> > > > > > >
> > > > > > > I'm fine with either merging or dropping this.
> > > > > > >
> > > > > > > > Can we review the other patches first?
> > > > > > >
> > > > > > > I will be on vacation soon, and won't have time to do this until
> > > > > > > next week.
> > > > > >
> > > > > > Have a happy vacation.
> > > > > >
> > > > > > > But I spot two possible "issues":
> > > > > > >
> > > > > > > 1) the DMA metadata is stored in the headroom of the page, which
> > > > > > > breaks frag coalescing; we need to benchmark its impact
> > > > > >
> > > > > > Not every page, just the first page of the COMP pages.
> > > > > >
> > > > > > So I think there is no impact.
> > > > >
> > > > > Nope, see this:
> > > > >
> > > > >         if (SKB_FRAG_PAGE_ORDER &&
> > > > >             !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> > > > >                 /* Avoid direct reclaim but allow kswapd to wake */
> > > > >                 pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> > > > >                                           __GFP_COMP | __GFP_NOWARN |
> > > > >                                           __GFP_NORETRY,
> > > > >                                           SKB_FRAG_PAGE_ORDER);
> > > > >                 if (likely(pfrag->page)) {
> > > > >                         pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> > > > >                         return true;
> > > > >                 }
> > > > >         }
> > > > >
> > > > > The comp page might be disabled due to SKB_FRAG_PAGE_ORDER and
> > > > > net_high_order_alloc_disable_key.
> > > >
> > > > YES.
> > > >
> > > > But if the comp page is disabled, then we only get one page each time.
> > > > The pages are not contiguous, so we don't have frag coalescing.
> > > >
> > > > If you mean the two pages got from alloc_page may be contiguous, the
> > > > coalescing may then be broken. That's a possibility, but I think the
> > > > impact will be small.
> > >
> > > Let's have a simple benchmark and see?
> >
> > That is ok.
> >
> > I think you want to know the perf numbers with big traffic and the comp
> > page disabled.
>
> Yes.

Hi,

Host:

    for ((i = 0; i < 10; ++i)); do
        sockperf tp -i 192.168.122.100 -t 1000 -m 64000 &
    done

Guest:

    03:23:12 AM     IFACE   rxpck/s  txpck/s      rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
    03:23:13 AM        lo      0.00     0.00        0.00    0.00     0.00     0.00      0.00     0.00
    03:23:13 AM      ens4  61848.00     1.00  3868036.73    0.58     0.00     0.00      0.00     0.00

tcpdump:

    03:25:01.741563 IP 192.168.122.1.29693 > 192.168.122.100.1: UDP, length 64000
    03:25:01.741580 IP 192.168.122.1.22239 > 192.168.122.100.1: UDP, length 64000
    03:25:01.741623 IP 192.168.122.1.22396 > 192.168.122.100.1: UDP, length 64000

The Guest CPU util is low and every packet is 64000 bytes, but the Host vhost process is at 100%. So we cannot judge by the traffic or the CPU of the Guest.
So I used the kernel without my patches (0635819decaf9d60e6cacfecfebfabe3cbdddafb). I want to count the frag coalescing occurrences when the comp page is disabled.

    $ sh -x test.sh
    + sysctl -w net.core.high_order_alloc_disable=1
    net.core.high_order_alloc_disable = 1
    + sysctl net.core.high_order_alloc_disable
    net.core.high_order_alloc_disable = 1
    + sleep 5
    + timeout 5 bpftrace -e 'kprobe:skb_coalesce_rx_frag {@[nsecs/1000/1000/1000] = count()}'
    Attaching 1 probe...
    + sysctl -w net.core.high_order_alloc_disable=0
    net.core.high_order_alloc_disable = 0
    + sysctl net.core.high_order_alloc_disable
    net.core.high_order_alloc_disable = 0
    + sleep 5
    + timeout 5 bpftrace -e 'kprobe:skb_coalesce_rx_frag {@[nsecs/1000/1000/1000] = count()}'
    Attaching 1 probe...
    @[356]: 167020
    @[361]: 673653
    @[359]: 900844
    @[360]: 912657
    @[358]: 915853
    @[357]: 932245

We can see that skb_coalesce_rx_frag is not called when the comp page is disabled. If the comp page is enabled, there is a lot of frag coalescing. So I think that my change will