Hi,

On Wed, Jul 17, 2019 at 9:53 PM Thomas Gleixner <t...@linutronix.de> wrote:
>
> On Wed, 17 Jul 2019, Sudip Mukherjee wrote:
> > I am using v4.14.55 on an Intel Atom based board and I am seeing network
> > packet drops frequently on wireshark logs. After lots of debugging it
> > seems that when this happens softirq is taking huge time to start after
> > it has been raised. This is a small snippet from ftrace:
> >
> > <...>-2110 [001] dNH1 466.634916: irq_handler_entry: irq=126 name=eth0-TxRx-0
> > <...>-2110 [001] dNH1 466.634917: softirq_raise: vec=3 [action=NET_RX]
> > <...>-2110 [001] dNH1 466.634918: irq_handler_exit: irq=126 ret=handled
> > ksoftirqd/1-15 [001] ..s. 466.635826: softirq_entry: vec=3 [action=NET_RX]
> > ksoftirqd/1-15 [001] ..s. 466.635852: softirq_exit: vec=3 [action=NET_RX]
> > ksoftirqd/1-15 [001] d.H. 466.635856: irq_handler_entry: irq=126 name=eth0-TxRx-0
> > ksoftirqd/1-15 [001] d.H. 466.635857: softirq_raise: vec=3 [action=NET_RX]
> > ksoftirqd/1-15 [001] d.H. 466.635858: irq_handler_exit: irq=126 ret=handled
> > ksoftirqd/1-15 [001] ..s. 466.635860: softirq_entry: vec=3 [action=NET_RX]
> > ksoftirqd/1-15 [001] ..s. 466.635863: softirq_exit: vec=3 [action=NET_RX]
> >
> > So, softirq was raised at 466.634917 but it started at 466.635826 almost
> > 909 usec after it was raised.
>
> This is a situation where the network softirq decided to delegate softirq
> processing to ksoftirqd. That happens when too much work is available while
> processing softirqs on return from interrupt.
>
> That means that softirq processing happens under scheduler control. So if
> there are other runnable tasks on the same CPU ksoftirqd can be delayed
> until their time slice expired. As a consequence ksoftirqd might not be
> able to catch up with the incoming packet flood and the NIC starts to drop.

Yes, and I see in the ftrace that there are many other userspace processes
getting scheduled in that time.

>
> You can hack ksoftirq_running() to return always false to avoid this, but
> that might cause application starvation and a huge packet buffer backlog
> when the amount of incoming packets makes the CPU do nothing else than
> softirq processing.

I tried that now, it is better but still not as good as v3.8. Now I am
getting 375.9 usec as the maximum time between raising the softirq and it
starting to execute, and packet drops are still there.

And just a thought, do you think there should be a CONFIG_ option for
this feature of ksoftirqd_running() so that it can be disabled if needed
by users like us?

Can you please think of anything else that might have changed which I
still need to change to make the time comparable to v3.8?

--
Regards
Sudip
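
For reference, the raise-to-entry delay discussed above (909 usec in the quoted trace) can be extracted from ftrace text output with a few lines of Python. This is only a minimal sketch: the helper name softirq_latencies and the regular expression are hypothetical, not part of any existing tool. It assumes the default trace line format shown in the snippet, with six fractional digits in the timestamp, and it pairs the first unserviced softirq_raise on a CPU with the next softirq_entry for the same vector on that CPU.

```python
import re

# Matches lines in the format of the trace quoted above, e.g.
#   <...>-2110 [001] dNH1 466.634917: softirq_raise: vec=3 [action=NET_RX]
# Assumption: timestamps always carry exactly six fractional digits.
EVENT_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+\S+\s+(?P<sec>\d+)\.(?P<usec>\d{6}):\s+"
    r"(?P<event>softirq_raise|softirq_entry):\s+vec=(?P<vec>\d+)"
)


def softirq_latencies(lines, vec=3):
    """Return raise-to-entry delays in usec for one softirq vector.

    Hypothetical helper; vec=3 is NET_RX in the trace above.
    """
    pending = {}    # cpu -> timestamp (usec) of the first unserviced raise
    latencies = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None or int(m["vec"]) != vec:
            continue
        cpu = int(m["cpu"])
        # Integer microseconds avoid floating-point rounding in the difference.
        ts = int(m["sec"]) * 1_000_000 + int(m["usec"])
        if m["event"] == "softirq_raise":
            # Keep the earliest raise if several arrive before the entry.
            pending.setdefault(cpu, ts)
        elif cpu in pending:
            latencies.append(ts - pending.pop(cpu))
    return latencies


# The softirq events from the snippet quoted earlier in this mail.
trace = [
    "<...>-2110 [001] dNH1 466.634917: softirq_raise: vec=3 [action=NET_RX]",
    "ksoftirqd/1-15 [001] ..s. 466.635826: softirq_entry: vec=3 [action=NET_RX]",
    "ksoftirqd/1-15 [001] d.H. 466.635857: softirq_raise: vec=3 [action=NET_RX]",
    "ksoftirqd/1-15 [001] ..s. 466.635860: softirq_entry: vec=3 [action=NET_RX]",
]

delays = softirq_latencies(trace)
print("delays (usec):", delays, "max:", max(delays))
```

Running the same function over a full trace file (for example, the lines of a saved trace) and taking max() of the result gives a maximum-delay figure of the kind quoted above, under the stated assumptions.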