Re: Soft lockup in inet_put_port on 4.6

Josef Bacik Fri, 09 Dec 2016 18:01:22 -0800

On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jba...@fb.com> wrote:

On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
 On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

 We can reproduce the problem at will, still trying to run down the
 problem. I'll try and find one of the boxes that dumped a core and get
 a bt of everybody.  Thanks,
 OK, sounds good.

 I had a look and :
 - could not spot a fix that came after 4.6.
 - could not spot an obvious bug.

 Anything special in the program triggering the issue ?
 SO_REUSEPORT and/or special socket options ?
So they recently started using SO_REUSEPORT, that's what triggered it; if they don't use it then everything is fine.

I added some instrumentation for get_port to see if it was looping in there and none of my printk's triggered. The softlockup messages are always on the inet_bind_bucket lock, sometimes in the process context in get_port or in the softirq context either through inet_put_port or inet_kill_twsk. On the box that I have a coredump for there's only one processor in the inet code so I'm not sure what to make of that. That was a box from last week so I'll look at a more recent core and see if it's different. Thanks,


Ok more investigation today, a few bullet points

- With all the debugging turned on the boxes seem to recover after about a minute. I'd get the spam of the soft lockup messages all on the inet_bind_bucket, and then the box would be fine.

- I looked at a core I had from before I started investigating things and there's only one process trying to get the inet_bind_bucket of all the 48 cpus.

- I noticed that there were over 100k twsk's in that original core.

- I put a global counter of the twsk's (since most of the softlockup messages have the twsk timers in the stack) and noticed with the debugging kernel it started around 16k twsk's and once it recovered it was down to less than a thousand. There's a jump where it goes from 8k to 2k and then there's only one more softlockup message and the box is fine.

- This happens when we restart the service with the config option to start using SO_REUSEPORT.
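
As a rough userspace illustration of the twsk count mentioned above (not the in-kernel counter used in the debugging kernel), the number of TIME_WAIT sockets can be approximated by scanning /proc/net/tcp, where the fourth column is the socket state in hex and 06 is TIME_WAIT. The helper names parse_tcp_state and count_time_wait are made up for this sketch, and it only covers IPv4 unless it is also pointed at /proc/net/tcp6.

```c
#include <stdio.h>

/* Numeric value of TCP_TIME_WAIT as it appears in the "st" column. */
#define TCP_STATE_TIME_WAIT 0x06

/* Hypothetical helper: return the state field of one /proc/net/tcp line,
 * or -1 for the header line or anything that does not parse. */
int parse_tcp_state(const char *line)
{
    unsigned int st;

    /* Skip "sl:", local address and remote address, then read the state. */
    if (sscanf(line, "%*s %*s %*s %x", &st) != 1)
        return -1;
    return (int)st;
}

/* Hypothetical helper: count TIME_WAIT entries in an already opened
 * /proc/net/tcp style stream. */
long count_time_wait(FILE *f)
{
    char line[512];
    long n = 0;

    while (fgets(line, sizeof(line), f) != NULL)
        if (parse_tcp_state(line) == TCP_STATE_TIME_WAIT)
            n++;
    return n;
}
```

Calling count_time_wait() on fopen("/proc/net/tcp", "r") once a second while restarting the service would give a curve to compare against the 16k to under 1k drop described above, with the caveat that reading /proc on a box with 100k sockets is itself not free.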

The application is our load balancing app, so obviously has lots of connections opened at any given time. What I'm wondering and will test on Monday is if the SO_REUSEPORT change even matters, or if simply restarting the service is what triggers the problem. One thing I forgot to mention is that it's also using TCP_FASTOPEN in both the non-reuseport and reuseport variants.
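
For anyone trying to reproduce the two variants outside the real service, a minimal sketch of a listener that can toggle SO_REUSEPORT and TCP_FASTOPEN might look like the following. This is not the load balancer's code. The helper names make_listener and listener_port, the loopback bind address, and the backlog of 128 are all assumptions made for illustration. TCP_FASTOPEN is set after listen() since that order is accepted across kernel versions.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a loopback TCP listener on `port` (0 = any).
 * reuseport != 0 sets SO_REUSEPORT before bind();
 * fastopen_qlen > 0 sets TCP_FASTOPEN with that queue length after listen().
 * Returns the listening fd, or -1 on error. */
int make_listener(uint16_t port, int reuseport, int fastopen_qlen)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    if (reuseport) {
        int one = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
            goto fail;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* assumption: loopback only */
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;
    if (listen(fd, 128) < 0)
        goto fail;
    if (fastopen_qlen > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN,
                   &fastopen_qlen, sizeof(fastopen_qlen)) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}

/* Hypothetical helper: return the local port a socket is bound to, or -1. */
int listener_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return ntohs(addr.sin_port);
}
```

With reuseport set, several such listeners can share one port, which is the configuration that was switched on at restart. With it cleared, the same helper gives the non-reuseport variant for comparison.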

What I suspect is happening is the service stops, all of the sockets it had open go into TIMEWAIT with relatively the same timer period, and then suddenly all wake up at the same time, which, coupled with the massive amount of traffic that we see per box anyway, results in so much contention and ksoftirqd usage that the box livelocks for a while. With the lock debugging and stuff turned on we aren't able to service as much traffic so it recovers relatively quickly, whereas a normal production kernel never recovers.

Please keep in mind that I'm a file system developer so my conclusions may be completely insane, any guidance would be welcome. I'll continue hammering on this on Monday. Thanks,


Josef
