On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelle...@gmail.com wrote:
> From: Michael Kelley <mhkli...@outlook.com>
>
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet, so there
> is no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
>
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs
> are required to describe the location of the rndis header, even
> though they are contiguous in guest physical address (GPA) space.
> Hyper-V requires that the two rndis header PFNs be in a single "GPA
> range" data structure, but current netvsc code puts each PFN in its
> own GPA range, which Hyper-V rejects as an error.
>
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot, because the pre-allocated slots don't
> have page crossing issues.
>
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs
> a lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory
> is allocated, making it much more likely that the rndis header will
> cross a page boundary when the vCPU count is 16 or more. The changes
> in commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
>
> Current code in init_page_array() creates a separate page buffer
> array entry for each PFN required to identify the data to be
> transmitted. Contiguous PFNs get separate entries in the page buffer
> array, and any information about contiguity is lost.
>
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying it
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to
> construct a contiguous range, since it has its own page buffer array.
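
To make the "contiguous range" idea concrete for other readers, here is
a small userspace sketch of my own (not code from the patch; the struct
mirrors hv_page_buffer, and HV_HYP_PAGE_SHIFT is the 4K Hyper-V page
size): a buffer that crosses a page boundary needs two PFNs, but its
offset/len pair already describes it as a single contiguous range,
which is what Hyper-V expects in one GPA range.

#include <stdint.h>
#include <stdio.h>

#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1UL << HV_HYP_PAGE_SHIFT)

/* Mirrors struct hv_page_buffer: one contiguous range in GPA space */
struct range {
	uint32_t len;    /* length of the range in bytes */
	uint32_t offset; /* byte offset into the first page */
	uint64_t pfn;    /* starting page frame number */
};

/* PFNs needed to cover the range; all of them belong in ONE GPA range */
static uint32_t pfns_in_range(const struct range *r)
{
	return (r->offset + r->len + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT;
}

int main(void)
{
	/* A 150-byte rndis header starting 4000 bytes into a page: it
	 * spills into the next page, so it covers two PFNs, yet it is
	 * still one contiguous range.
	 */
	struct range rndis_hdr = { .len = 150, .offset = 4000, .pfn = 0x1234 };

	printf("PFNs in range: %u\n", pfns_in_range(&rndis_hdr)); /* prints 2 */
	return 0;
}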
>
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single(), but on larger
> chunks of memory, which should reduce contention on the swiotlb.
>
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKB_FRAGS has the default value of 17.
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <sta...@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhkli...@outlook.com>
> ---
>  drivers/net/hyperv/hyperv_net.h   | 12 ++++++
>  drivers/net/hyperv/netvsc_drv.c   | 63 ++++++++-----------------------
>  drivers/net/hyperv/rndis_filter.c | 24 +++---------
>  3 files changed, 32 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
>  			sizeof(struct nvsp_message))
>  #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>  
> +/* Maximum # of contiguous data ranges that can make up a transmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
> +
>  /* Estimated requestor size:
>   * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
>   */
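
For anyone who wants to double-check the 208-byte figure: assuming the
usual MAX_PAGE_BUFFER_COUNT of 32 and a 16-byte struct hv_page_buffer
(u32 len, u32 offset, u64 pfn), with the default MAX_SKB_FRAGS of 17:

	old: pb[MAX_PAGE_BUFFER_COUNT] -> 32 * 16       = 512 bytes
	new: pb[MAX_DATA_RANGES]       -> (17 + 2) * 16 = 304 bytes
	saving:                           512 - 304     = 208 bytes

so the number in the commit message checks out under those assumptions.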
...

> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
>  	 * 2. skb linear data
>  	 * 3. skb fragment data
>  	 */
> -	slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> -				  offset_in_hvpage(hdr),
> -				  len,
> -				  &pb[slots_used]);
>  
> +	pb[0].offset = offset_in_hvpage(hdr);
> +	pb[0].len = len;
> +	pb[0].pfn = virt_to_hvpfn(hdr);
>  	packet->rmsg_size = len;
> -	packet->rmsg_pgcnt = slots_used;
> +	packet->rmsg_pgcnt = 1;
>  
> -	slots_used += fill_pg_buf(virt_to_hvpfn(data),
> -				  offset_in_hvpage(data),
> -				  skb_headlen(skb),
> -				  &pb[slots_used]);
> +	pb[1].offset = offset_in_hvpage(skb->data);
> +	pb[1].len = skb_headlen(skb);
> +	pb[1].pfn = virt_to_hvpfn(skb->data);
>  
>  	for (i = 0; i < frags; i++) {
>  		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> +		struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I got things right, pb is allocated on the stack in netvsc_xmit()
and has MAX_DATA_RANGES elements. If MAX_SKB_FRAGS is large, so that
MAX_DATA_RANGES has been limited to MAX_PAGE_BUFFER_COUNT, and frags
is also large, is it possible to overrun pb here? (A small sketch of
the scenario I have in mind is at the end of this mail.)

> +		u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> +		u32 offset = skb_frag_off(frag);
>  
> -		slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> -					  skb_frag_off(frag),
> -					  skb_frag_size(frag),
> -					  &pb[slots_used]);
> +		cur_pb->offset = offset_in_hvpage(offset);
> +		cur_pb->len = skb_frag_size(frag);
> +		cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
>  	}
> -	return slots_used;
> +	return frags + 2;
>  }
>  
>  static int count_skb_frag_slots(struct sk_buff *skb)

> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
>  	struct net_device *vf_netdev;
>  	u32 rndis_msg_size;
>  	u32 hash;
> -	struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> +	struct hv_page_buffer pb[MAX_DATA_RANGES];
>  
>  	/* If VF is present and up then redirect packets to it.
>  	 * Skip the VF if it is marked down or has no carrier.

...
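
To spell out the overrun scenario, here's a tiny sketch. All numbers in
it are my own assumptions, not from the patch: CONFIG_MAX_SKB_FRAGS
can, I believe, be raised as high as 45, and MAX_PAGE_BUFFER_COUNT is
normally 32.

#include <stdio.h>

/* Assumed values for illustration only */
#define MAX_SKB_FRAGS         45	/* hypothetical large Kconfig value */
#define MAX_PAGE_BUFFER_COUNT 32

/* Same construction as the patch's MAX_DATA_RANGES */
#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
#else
#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
#endif

int main(void)
{
	int frags = 40;	/* an skb using most of its allowed fragments */

	/* init_page_array() writes pb[0] .. pb[frags + 1] */
	printf("entries written: %d, pb[] size: %d\n",
	       frags + 2, MAX_DATA_RANGES);	/* 42 vs 32 */
	if (frags + 2 > MAX_DATA_RANGES)
		printf("pb[] would be overrun unless an earlier check "
		       "rules this out\n");
	return 0;
}

Whether such skbs can actually reach init_page_array(), or are caught
earlier and copied into the send buffer or linearized, is exactly what
I'm asking about.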