On Sat, Jul 6, 2024 at 4:30 AM Hyman Huang <yong.hu...@smartx.com> wrote:
> Unexpected work by certain Windows guests equipped with the e1000 > interface can cause the network to go down and never come back up > again unless the guest's interface is reset. > > To reproduce the failure: > 1. Set up two guests with a Windows 2016 or 2019 server operating > system. > 2. Set up the e1000 interface for the guests. > 3. Pressurize the network slightly between two guests using the iPerf tool. > > The network goes down after a few days (2-5days), and the issue > is the result of not adhering to the e1000 specification. Refer > to the details of the specification at the following link: > > https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf > > Chapter 3.2.6 describe the Receive Descriptor Tail register(RDT) > as following: > This register holds a value that is an offset from the base, and > identifies the location beyond the last descriptor hardware can > process. Note that tail should still point to an area in the > descriptor ring (somewhere between RDBA and RDBA + RDLEN). > This is because tail points to the location where software writes > the first new descriptor. > > This means that if the provider—in this case, QEMU—has not yet > loaded the packet, RDT should never point to that place. When > implementing the emulation of the e1000 interface, QEMU evaluates > if the receive ring buffer is full once the RDT equals the RDH, > based on the assumption that guest drivers adhere to this > criterion strictly. > > We applied the following log patch to assist in analyzing the > issue and eventually obtained the unexpected information. > > Log patch: > ----------------------------------------------------------------- > |--- a/hw/net/e1000.c > |+++ b/hw/net/e1000.c > |@@ -836,6 +836,9 @@ e1000_set_link_status(NetClientState *nc) > | static bool e1000_has_rxbufs(E1000State *s, size_t total_size) > | { > | int bufs; > |+ DBGOUT(RX, "rxbuf_size = %u, s->mac_reg[RDLEN] = %u, s->mac_reg[RDH] > = %u, s->mac_reg[RDT] = %u\n", > |+ s->rxbuf_size, s->mac_reg[RDLEN], s->mac_reg[RDH], > s->mac_reg[RDT]); > |+ > | /* Fast-path short packets */ > | if (total_size <= s->rxbuf_size) { > | if (s->mac_reg[RDH] == s->mac_reg[RDT] && s->last_overrun) > |@@ -1022,6 +1025,9 @@ e1000_receive_iov(NetClientState *nc, const struct > iovec *iov, int iovcnt) > | s->rxbuf_min_shift) > | n |= E1000_ICS_RXDMT0; > | > |+ DBGOUT(RX, "rxbuf_size = %u, s->mac_reg[RDLEN] = %u, s->mac_reg[RDH] > = %u, s->mac_reg[RDT] = %u\n", > |+ s->rxbuf_size, s->mac_reg[RDLEN], s->mac_reg[RDH], > s->mac_reg[RDT]); > |+ > ----------------------------------------------------------------- > > The last few logs of information when the network is down: > > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 897, s->mac_reg[RDT] = 885 > <- the receive ring buffer is checked for fullness in the > e1000_has_rxbufs function, not full. > > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 898, s->mac_reg[RDT] = 885 > <- RDT stays the same, RDH updates to 898, and 1 descriptor > utilized after putting the packet to ring buffer. > > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 898, s->mac_reg[RDT] = 885 > <- the receive ring buffer is checked for fullness in the > e1000_has_rxbufs function, not full. > > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 899, s->mac_reg[RDT] = 885 > <- RDT stays the same, RDH updates to 899, and 1 descriptor > utilized after putting the packet to ring buffer. > > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 899, s->mac_reg[RDT] = 885 > <- the receive ring buffer is checked for fullness in the > e1000_has_rxbufs function, not full. > > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 900, s->mac_reg[RDT] = 885 > <- RDT stays the same, RDH updates to 900 , and 1 descriptor > utilized after putting the packet to ring buffer. > > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, > s->mac_reg[RDH] = 900, s->mac_reg[RDT] = 900 > <- The ring is full, according to e1000_has_rxbufs, because > of the RDT update to 900 and equals RDH ! But in reality, > the state of the ring buffer is empty because the producer > only used one descriptor the last time, and the ring buffer > was not full after that. > > To sum up, QEMU claims that the receive ring buffer is full > in the aforementioned scenario, placing the packet in the > self-maintained queue and unregistering the tap device's > readable fd handler and then waiting for the guest to consume > the receive ring buffer. This brings down the network since > guests have nothing to consume and never update the RDT > location. > > In the above scenario, QEMU assert that the ring is full, > put the packet on the queue, unregister the readable fd > handler of the tap device, waiting the guest to consume > the receive ring. While, guest have nothing to consume > on the receive ring and never update the RDT location, > this makes the network down. > > To get around this issue, just mark the overrun if RDH > equals RDT at the end of placing the packet on the ring > buffer for the producer. > > Signed-off-by: Hyman Huang <yong.hu...@smartx.com> > --- > hw/net/e1000.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/hw/net/e1000.c b/hw/net/e1000.c > index 5012b96464..f80cb70283 100644 > --- a/hw/net/e1000.c > +++ b/hw/net/e1000.c > @@ -126,6 +126,12 @@ struct E1000State_st { > > QEMUTimer *flush_queue_timer; > > + /* > + * Indicate that the receive circular buffer queue overrun > + * the last time hardware produced packets. > + */ > + bool last_overrun; > + > /* Compatibility flags for migration to/from qemu 1.3.0 and older */ > #define E1000_FLAG_MAC_BIT 2 > #define E1000_FLAG_TSO_BIT 3 > @@ -832,7 +838,12 @@ static bool e1000_has_rxbufs(E1000State *s, size_t > total_size) > int bufs; > /* Fast-path short packets */ > if (total_size <= s->rxbuf_size) { > - return s->mac_reg[RDH] != s->mac_reg[RDT]; > + if (s->mac_reg[RDH] == s->mac_reg[RDT] && s->last_overrun) { > + return false; > + } > + > + DBGOUT(RX, "Receive ring buffer is not full unexpectedly!\n"); > I'll drop this DBGOUT in the next version now that "treat RX ring with RDH == RDT as empty" is the expected and original conclusion in commit: commit e5b8b0d4ba29fe1268ba049519a1b0cf8552a21a Author: Dmitry Fleytman <dmi...@daynix.com> Date: Fri Oct 19 07:56:55 2012 +0200 e1000: drop check_rxov, always treat RX ring with RDH == RDT as empty Real HW always treats RX ring with RDH == RDT as empty. Emulation is supposed to behave the same. Reported-by: Chris Webb <chris.w...@elastichosts.com> Reported-by: Richard Davies <richard.dav...@elastichosts.com> Signed-off-by: Dmitry Fleytman <dmi...@daynix.com> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com> > + return true; > } > if (s->mac_reg[RDH] < s->mac_reg[RDT]) { > bufs = s->mac_reg[RDT] - s->mac_reg[RDH]; > @@ -840,7 +851,12 @@ static bool e1000_has_rxbufs(E1000State *s, size_t > total_size) > bufs = s->mac_reg[RDLEN] / sizeof(struct e1000_rx_desc) + > s->mac_reg[RDT] - s->mac_reg[RDH]; > } else { > - return false; > + if (s->last_overrun) { > + return false; > + } > + > + DBGOUT(RX, "Receive ring buffer is not full unexpectedly!\n"); > + return true; > } > return total_size <= bufs * s->rxbuf_size; > } > @@ -999,6 +1015,8 @@ e1000_receive_iov(NetClientState *nc, const struct > iovec *iov, int iovcnt) > > e1000x_update_rx_total_stats(s->mac_reg, pkt_type, size, total_size); > > + s->last_overrun = (s->mac_reg[RDH] == s->mac_reg[RDT]) ? true : false; > + > n = E1000_ICS_RXT0; > if ((rdt = s->mac_reg[RDT]) < s->mac_reg[RDH]) > rdt += s->mac_reg[RDLEN] / sizeof(desc); > -- > 2.39.1 > > -- Best regards