On 09.02.2019 00:09, Eric Dumazet wrote: > > > On 02/08/2019 01:50 PM, Heiner Kallweit wrote: >> On 08.02.2019 22:45, Sander Eikelenboom wrote: >>> On 08/02/2019 22:22, Heiner Kallweit wrote: >>>> On 08.02.2019 21:55, Sander Eikelenboom wrote: >>>>> On 08/02/2019 19:52, Heiner Kallweit wrote: >>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote: >>>>>>> L.S., >>>>>>> >>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they >>>>>>> don't seem related) under Xen i the nasty splat below, >>>>>>> that I haven encountered with Linux 4.20.x. >>>>>>> >>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting >>>>>>> could be nasty due to another (networking related) kernel bug. >>>>>>> >>>>>>> If you need more info, want me to run a debug patch etc., please feel >>>>>>> free to ask. >>>>>>> >>>>>> Thanks for the report. However I see no change in the r8169 driver >>>>>> between >>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause >>>>>> could >>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed. >>>>> >>>>> Hmm i did some diging and i think: >>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb >>>>> barriers >>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and >>>>> __netdev_sent_queue >>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add >>>>> __netdev_sent_queue as variant of __netdev_tx_sent_queue >>>>> >>>> You're right. Thought this was added in 4.20 already. >>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't >>>> heard about >>>> this issue from any user of physical hw. And due to the fact that a lot of >>>> mainboards >>>> have onboard Realtek network I have quite a few testers out there. >>>> Does the issue occur under specific circumstances like very high load? >>> >>> Yep, the box is already quite contented with the Xen VM's and if I remember >>> correctly it occurred while kernel compiling >>> on the host. >>> >>>> If indeed the xmit_more patch causes the issue, I think we have to involve >>>> Eric Dumazet >>>> as author of the underlying changes. >>> >>> It could also be the barriers weren't that unneeded as assumed. >> >> The barriers were removed after adding xmit_more handling. Therefore it >> would be good to >> test also with only >> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb >> barriers >> removed. >> >>> Since we are almost at RC6 i took the liberty to CC Eric now. >>> >> Sure, thanks. >> >>> BTW am i correct these patches are merely optimizations ? >> >> Yes >> >>> If so and concluding they revert cleanly, perhaps it should be considered >>> at this point in the RC's >>> to revert them for 5.0 and try again for 5.1 ? >>> >> Before removing both it would be good to test with only the barrier-removal >> removed. >> > > Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more > and __netdev_sent_queue > looks buggy to me, since the skb might have been freed already on another cpu > when you call > > You could try : > > diff --git a/drivers/net/ethernet/realtek/r8169.c > b/drivers/net/ethernet/realtek/r8169.c > index > 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 > 100644 > --- a/drivers/net/ethernet/realtek/r8169.c > +++ b/drivers/net/ethernet/realtek/r8169.c > @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff > *skb, > dma_addr_t mapping; > u32 opts[2], len; > bool stop_queue; > + bool door_bell; > int frags; > > if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) { > @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff > *skb, > /* Force memory writes to complete before releasing descriptor */ > dma_wmb(); > > + door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more); > + > txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry); > > /* Force all memory writes to complete before notifying device */ > @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff > *skb, > if (unlikely(stop_queue)) > netif_stop_queue(dev); > > - if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) { > + if (door_bell) { > RTL_W8(tp, TxPoll, NPQ); > mmiowb(); > } > Thanks a lot for checking and for the proposed fix. Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
> > . > Heiner