On Fri, 2016-12-09 at 20:59 -0500, Josef Bacik wrote:
> On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jba...@fb.com> wrote:
> >
> >> On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.duma...@gmail.com>
> >> wrote:
> >>
> >>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
> >>>
> >>> We can reproduce the problem at will, still trying to run down the
> >>> problem.  I'll try and find one of the boxes that dumped a core
> >>> and get a bt of everybody.  Thanks,
> >>
> >> OK, sounds good.
> >>
> >> I had a look and :
> >> - could not spot a fix that came after 4.6.
> >> - could not spot an obvious bug.
> >>
> >> Anything special in the program triggering the issue ?
> >> SO_REUSEPORT and/or special socket options ?
> >
> > So they recently started using SO_REUSEPORT; that's what triggered
> > it.  If they don't use it then everything is fine.
> >
> > I added some instrumentation for get_port to see if it was looping
> > in there, and none of my printk's triggered.  The softlockup
> > messages are always on the inet_bind_bucket lock, sometimes in the
> > process context in get_port or in the softirq context either
> > through inet_put_port or inet_kill_twsk.  On the box that I have a
> > coredump for there's only one processor in the inet code, so I'm
> > not sure what to make of that.  That was a box from last week, so
> > I'll look at a more recent core and see if it's different.  Thanks,
>
> Ok, more investigation today, a few bullet points:
>
> - With all the debugging turned on the boxes seem to recover after
>   about a minute.  I'd get the spam of the soft lockup messages, all
>   on the inet_bind_bucket, and then the box would be fine.
> - I looked at a core I had from before I started investigating
>   things, and there's only one process trying to get the
>   inet_bind_bucket out of all the 48 cpus.
> - I noticed that there were over 100k twsk's in that original core.
> - I put a global counter of the twsk's (since most of the softlockup
>   messages have the twsk timers in the stack) and noticed with the
>   debugging kernel it started around 16k twsk's, and once it
>   recovered it was down to less than a thousand.  There's a jump
>   where it goes from 8k to 2k, and then there's only one more
>   softlockup message and the box is fine.
> - This happens when we restart the service with the config option to
>   start using SO_REUSEPORT.
>
> The application is our load balancing app, so it obviously has lots
> of connections open at any given time.  What I'm wondering and will
> test on Monday is whether the SO_REUSEPORT change even matters, or
> whether simply restarting the service is what triggers the problem.
> One thing I forgot to mention is that it's also using TCP_FASTOPEN
> in both the non-reuseport and reuseport variants.
>
> What I suspect is happening is that the service stops, all of the
> sockets it had open go into TIMEWAIT with relatively the same timer
> period, and then they all suddenly wake up at the same time, which,
> coupled with the massive amount of traffic that we see per box
> anyway, results in so much contention and ksoftirqd usage that the
> box livelocks for a while.  With the lock debugging and stuff turned
> on we aren't able to service as much traffic, so it recovers
> relatively quickly, whereas a normal production kernel never
> recovers.
>
> Please keep in mind that I'm a file system developer, so my
> conclusions may be completely insane; any guidance would be welcome.
> I'll continue hammering on this on Monday.  Thanks,
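For reference, here is a minimal sketch of the kind of listener setup
described above, combining SO_REUSEPORT with TCP_FASTOPEN.  This is not
the actual service; the port number (8080) and the fastopen queue
length (256) are placeholders I picked for illustration:

/* Sketch: listener using SO_REUSEPORT + TCP_FASTOPEN.
 * Port 8080 and qlen 256 are placeholder values. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int one = 1, qlen = 256;
	struct sockaddr_in addr;

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Several processes may bind the same ip:port; the kernel
	 * spreads incoming connections across them. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
		perror("SO_REUSEPORT");

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(8080);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}

	/* Enable TCP Fast Open on the listener; qlen caps the number of
	 * pending TFO requests that have not finished the handshake. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)) < 0)
		perror("TCP_FASTOPEN");

	if (listen(fd, 128) < 0) {
		perror("listen");
		return 1;
	}

	pause();	/* accept loop omitted for brevity */
	close(fd);
	return 0;
}

When the service restarts, every such process rebinds the same port, so
the corresponding inet_bind_bucket gets hit by all of them at once on
top of the TIMEWAIT timer traffic described above.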
Hmm... Does your ephemeral port range include the port your load balancing app is using ?
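A quick way to check that, as a sketch: read
/proc/sys/net/ipv4/ip_local_port_range and compare it against the
service port.  The port 8080 below is only a placeholder default, not
the real service port.  If the port does fall inside the ephemeral
range, adding it to net.ipv4.ip_local_reserved_ports is the usual way
to keep autobind from handing it out for outgoing connections.

/* Sketch: does a given port fall inside the ephemeral port range? */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
	int lo, hi, port;

	if (!f || fscanf(f, "%d %d", &lo, &hi) != 2) {
		perror("ip_local_port_range");
		return 1;
	}
	fclose(f);

	/* Port to check; 8080 is a placeholder default. */
	port = argc > 1 ? atoi(argv[1]) : 8080;

	printf("ephemeral range: %d-%d\n", lo, hi);
	if (port >= lo && port <= hi)
		printf("port %d is inside the ephemeral range; consider "
		       "net.ipv4.ip_local_reserved_ports\n", port);
	else
		printf("port %d is outside the ephemeral range\n", port);
	return 0;
}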