On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelle...@gmail.com wrote:
> From: Michael Kelley <mhkli...@outlook.com>
>
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet, so there
> is no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
>
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs
> are required to describe the location of the rndis header, even
> though they are contiguous in guest physical address (GPA) space.
> Hyper-V requires that the two rndis header PFNs be in a single "GPA
> range" data structure, but current netvsc code puts each PFN in its
> own GPA range, which Hyper-V rejects as an error.
>
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot, because the pre-allocated slots don't
> have page crossing issues.
>
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs
> a lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory
> is allocated, making it much more likely that the rndis header will
> cross a page boundary when the vCPU count is 16 or more. The changes
> in commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
>
> Current code in init_page_array() creates a separate page buffer
> array entry for each PFN required to identify the data to be
> transmitted. Contiguous PFNs get separate entries in the page buffer
> array, and any information about contiguity is lost.
>
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying it
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to
> construct a contiguous range, since it has its own page buffer array.
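
To make the "contiguous range" idea concrete for other readers, here is
a small userspace sketch of my own (not code from the patch; the struct
mirrors hv_page_buffer, and HV_HYP_PAGE_SHIFT is the 4K Hyper-V page
size): a buffer that crosses a page boundary needs two PFNs, but its
offset/len pair already describes it as a single contiguous range,
which is what Hyper-V expects in one GPA range.

#include <stdint.h>
#include <stdio.h>

#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1UL << HV_HYP_PAGE_SHIFT)

/* Mirrors struct hv_page_buffer: one contiguous range in GPA space */
struct range {
	uint32_t len;    /* length of the range in bytes */
	uint32_t offset; /* byte offset into the first page */
	uint64_t pfn;    /* starting page frame number */
};

/* PFNs needed to cover the range; all of them belong in ONE GPA range */
static uint32_t pfns_in_range(const struct range *r)
{
	return (r->offset + r->len + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT;
}

int main(void)
{
	/* A 150-byte rndis header starting 4000 bytes into a page: it
	 * spills into the next page, so it covers two PFNs, yet it is
	 * still one contiguous range.
	 */
	struct range rndis_hdr = { .len = 150, .offset = 4000, .pfn = 0x1234 };

	printf("PFNs in range: %u\n", pfns_in_range(&rndis_hdr)); /* prints 2 */
	return 0;
}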
>
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single(), but on larger
> chunks of memory, which should reduce contention on the swiotlb.
>
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKB_FRAGS has the default value of 17.
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <sta...@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhkli...@outlook.com>
> ---
>  drivers/net/hyperv/hyperv_net.h   | 12 ++++++
>  drivers/net/hyperv/netvsc_drv.c   | 63 ++++++++-----------------------
>  drivers/net/hyperv/rndis_filter.c | 24 +++---------
>  3 files changed, 32 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
>  			sizeof(struct nvsp_message))
>  #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>  
> +/* Maximum # of contiguous data ranges that can make up a transmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
> +
>  /* Estimated requestor size:
>   * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
>   */
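
For anyone who wants to double-check the 208-byte figure: assuming the
usual MAX_PAGE_BUFFER_COUNT of 32 and a 16-byte struct hv_page_buffer
(u32 len, u32 offset, u64 pfn), with the default MAX_SKB_FRAGS of 17:

	old: pb[MAX_PAGE_BUFFER_COUNT] -> 32 * 16       = 512 bytes
	new: pb[MAX_DATA_RANGES]       -> (17 + 2) * 16 = 304 bytes
	saving:                           512 - 304     = 208 bytes

so the number in the commit message checks out under those assumptions.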
...

> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
>  	 * 2. skb linear data
>  	 * 3. skb fragment data
>  	 */
> -	slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> -				  offset_in_hvpage(hdr),
> -				  len,
> -				  &pb[slots_used]);
>  
> +	pb[0].offset = offset_in_hvpage(hdr);
> +	pb[0].len = len;
> +	pb[0].pfn = virt_to_hvpfn(hdr);
>  	packet->rmsg_size = len;
> -	packet->rmsg_pgcnt = slots_used;
> +	packet->rmsg_pgcnt = 1;
>  
> -	slots_used += fill_pg_buf(virt_to_hvpfn(data),
> -				  offset_in_hvpage(data),
> -				  skb_headlen(skb),
> -				  &pb[slots_used]);
> +	pb[1].offset = offset_in_hvpage(skb->data);
> +	pb[1].len = skb_headlen(skb);
> +	pb[1].pfn = virt_to_hvpfn(skb->data);
>  
>  	for (i = 0; i < frags; i++) {
>  		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> +		struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I got things right, pb is allocated on the stack in netvsc_xmit()
and has MAX_DATA_RANGES elements. If MAX_SKB_FRAGS is large, so that
MAX_DATA_RANGES has been limited to MAX_PAGE_BUFFER_COUNT, and frags
is also large, is it possible to overrun pb here? (A small sketch of
the scenario I have in mind is at the end of this mail.)

> +		u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> +		u32 offset = skb_frag_off(frag);
>  
> -		slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> -					  skb_frag_off(frag),
> -					  skb_frag_size(frag),
> -					  &pb[slots_used]);
> +		cur_pb->offset = offset_in_hvpage(offset);
> +		cur_pb->len = skb_frag_size(frag);
> +		cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
>  	}
> -	return slots_used;
> +	return frags + 2;
>  }
>  
>  static int count_skb_frag_slots(struct sk_buff *skb)

> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
>  	struct net_device *vf_netdev;
>  	u32 rndis_msg_size;
>  	u32 hash;
> -	struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> +	struct hv_page_buffer pb[MAX_DATA_RANGES];
>  
>  	/* If VF is present and up then redirect packets to it.
>  	 * Skip the VF if it is marked down or has no carrier.

...
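
To spell out the overrun scenario, here's a tiny sketch. All numbers in
it are my own assumptions, not from the patch: CONFIG_MAX_SKB_FRAGS
can, I believe, be raised as high as 45, and MAX_PAGE_BUFFER_COUNT is
normally 32.

#include <stdio.h>

/* Assumed values for illustration only */
#define MAX_SKB_FRAGS         45	/* hypothetical large Kconfig value */
#define MAX_PAGE_BUFFER_COUNT 32

/* Same construction as the patch's MAX_DATA_RANGES */
#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
#else
#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
#endif

int main(void)
{
	int frags = 40;	/* an skb using most of its allowed fragments */

	/* init_page_array() writes pb[0] .. pb[frags + 1] */
	printf("entries written: %d, pb[] size: %d\n",
	       frags + 2, MAX_DATA_RANGES);	/* 42 vs 32 */
	if (frags + 2 > MAX_DATA_RANGES)
		printf("pb[] would be overrun unless an earlier check "
		       "rules this out\n");
	return 0;
}

Whether such skbs can actually reach init_page_array(), or are caught
earlier and copied into the send buffer or linearized, is exactly what
I'm asking about.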