Re: Soft lockup in inet_put_port on 4.6

Josef Bacik Fri, 16 Dec 2016 14:10:20 -0800

On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jba...@fb.com> wrote:

On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jba...@fb.com> wrote:
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa <han...@stressinduktion.org> wrote:
Hi Josef,
On 15.12.2016 19:53, Josef Bacik wrote:
On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> wrote:
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatg...@gmail.com> wrote:
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <t...@herbertland.com> wrote:
I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

                  if (((tb->fastreuse > 0 && reuse) ||
                       (tb->fastreuseport > 0 &&
                        !rcu_access_pointer(sk->sk_reuseport_cb) &&
                        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
                      smallest_size == -1)
                          goto success;
                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
                          if ((reuse ||
                               (tb->fastreuseport > 0 &&
                                sk->sk_reuseport &&
                                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                                uid_eq(tb->fastuid, uid))) &&
                              smallest_size != -1 && --attempts >= 0) {
                                  spin_unlock_bh(&head->lock);
                                  goto again;
                          }
                          goto fail_unlock;
                  }

AFAICT there is redundancy in these two conditionals. The same clause
is being checked in both: (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid)) && smallest_size == -1. If this is true the
first conditional should be hit, goto done, and the second will never
evaluate that part to true -- unless the sk is changed (do we need
READ_ONCE for sk->sk_reuseport_cb?).
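
To make the overlap between the two conditionals easier to see, here is
a minimal user-space sketch, not kernel code. The struct names, field
names, and helper functions (model_bucket, model_sock,
reuseport_fast_ok, and so on) are hypothetical stand-ins, and uid_eq()
is modeled as a plain integer comparison. It only mirrors the boolean
structure of the snippet quoted above.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_bucket {
	int fastreuse;
	int fastreuseport;
	unsigned int fastuid;
};

struct model_sock {
	bool reuse;            /* the local "reuse" flag in the quoted code */
	bool reuseport;        /* sk->sk_reuseport */
	bool has_reuseport_cb; /* rcu_access_pointer(sk->sk_reuseport_cb) != NULL */
	unsigned int uid;      /* the local "uid" in the quoted code */
};

/* The clause that appears in both conditionals at tb_found. */
static bool reuseport_fast_ok(const struct model_bucket *tb,
			      const struct model_sock *sk)
{
	return tb->fastreuseport > 0 &&
	       !sk->has_reuseport_cb &&
	       sk->reuseport &&
	       tb->fastuid == sk->uid;
}

/* First conditional: true means "goto success". */
static bool takes_fast_path(const struct model_bucket *tb,
			    const struct model_sock *sk, int smallest_size)
{
	return ((tb->fastreuse > 0 && sk->reuse) ||
		reuseport_fast_ok(tb, sk)) &&
	       smallest_size == -1;
}

/*
 * Retry test inside the second conditional: true means "goto again".
 * Only meaningful once bind_conflict() has reported a conflict.
 * The attempts counter is decremented only if the earlier terms hold,
 * matching the short-circuit order of the original expression.
 */
static bool retries_after_conflict(const struct model_bucket *tb,
				   const struct model_sock *sk,
				   int smallest_size, int *attempts)
{
	return (sk->reuse || reuseport_fast_ok(tb, sk)) &&
	       smallest_size != -1 &&
	       --*attempts >= 0;
}
```

In this model the two tests differ in the smallest_size check: for a
fixed set of inputs, the shared clause can only lead to "goto success"
when smallest_size == -1 and only to "goto again" when it is not.
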
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?
Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost in reading this function; the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simplify this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.
Ok first I have data for you Hannes, here's the time distributions
before, during, and after the lockup (with all the debugging in place
the box eventually recovers). I've attached it as a text file since it
is long.
Thanks a lot!
Second is I was thinking about why we would spend so much time doing
the ->owners list, and obviously it's because of the massive amount of
timewait sockets on the owners list. I wrote the following dumb patch
and tested it and the problem has disappeared completely. Now I don't
know if this is right at all, but I thought it was weird we weren't
copying the soreuseport option from the original socket onto the twsk.
Is there a reason we aren't doing this currently? Does this help
explain what is happening?  Thanks,
The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.
We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.
Yeah so I looked at the code some more and actually my patch is really
bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb,
which is outside of the timewait sock, so that's definitely bad.
But we should at least be setting it to 0 so that we don't do this
normally. Unfortunately simply setting it to 0 doesn't fix the problem.
So for some reason having ->sk_reuseport set to 1 on a timewait socket
makes this problem non-existent, which is strange.
So back to the drawing board I guess. I wonder if doing what Craig
suggested and batching the timewait timer expires so it hurts less
would accomplish the same results. Thanks,
Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This is
the code:

                       if ((!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) &&
                           (!reuseport || !sk2->sk_reuseport ||
                            rcu_access_pointer(sk->sk_reuseport_cb) ||
                            (sk2->sk_state != TCP_TIME_WAIT &&
                             !uid_eq(uid, sock_i_uid(sk2))))) {
                               if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
                                   sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
                                       break;
                       }

So in my patch's case we now have reuseport == 1, sk2->sk_reuseport
== 1. But now we are using reuseport, so sk->sk_reuseport_cb should be
non-NULL, right? So really setting the timewait sock's sk_reuseport
should have no bearing on how this loop plays out, right? Thanks,
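
For anyone who wants to poke at the condition above without a kernel
tree, the following is a rough user-space transcription of it as a pure
function. Everything here is an assumption made for illustration: the
model_sk struct, the model_state enum, and model_conflicts() are
hypothetical names, uid_eq()/sock_i_uid() are reduced to an integer
compare, and an address of 0 stands in for an unset sk_rcv_saddr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical socket states; only the two named in the quoted code matter. */
enum model_state { MODEL_ESTABLISHED, MODEL_LISTEN, MODEL_TIME_WAIT };

/* Hypothetical, flattened view of the fields the condition reads. */
struct model_sk {
	bool sk_reuse;
	bool sk_reuseport;
	bool has_reuseport_cb; /* rcu_access_pointer(sk_reuseport_cb) != NULL */
	enum model_state state;
	unsigned int uid;
	uint32_t rcv_saddr;    /* 0 models an unset address */
};

/*
 * Returns true where the quoted loop would "break", i.e. sk2 is treated
 * as conflicting with sk. "reuse" and "reuseport" are the locals of the
 * same name in the quoted code.
 */
static bool model_conflicts(const struct model_sk *sk,
			    const struct model_sk *sk2,
			    bool reuse, bool reuseport)
{
	bool reuse_blocks = !reuse || !sk2->sk_reuse ||
			    sk2->state == MODEL_LISTEN;
	bool reuseport_blocks = !reuseport || !sk2->sk_reuseport ||
				sk->has_reuseport_cb ||
				(sk2->state != MODEL_TIME_WAIT &&
				 sk->uid != sk2->uid);
	bool addr_overlaps = !sk2->rcv_saddr || !sk->rcv_saddr ||
			     sk2->rcv_saddr == sk->rcv_saddr;

	return reuse_blocks && reuseport_blocks && addr_overlaps;
}
```

The sketch does not settle whether sk->sk_reuseport_cb is non-NULL at
this point in the real code path; it only lets you flip that input and
sk2's sk_reuseport and see how the quoted expression responds.
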

So more messing around and I noticed that we basically don't do the
tb->fastreuseport logic at all if we've ended up with a
non-SO_REUSEPORT socket on that tb. So before I fully understood what I
was doing I fixed it so that after we go through ->bind_conflict() once
with a SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and set the
uid to match the uid of the socket. This made the problem go away.
Tom pointed out that if we bind to the same port on a different address
and we have a non-SO_REUSEPORT socket with the same address on this tb
then we'd be screwed with my code.

Which brings me to his proposed solution. We need another hash table
that is indexed based on the binding address. Then each node
corresponds to one address/port binding, with non-SO_REUSEPORT entries
having only one entry, and normal SO_REUSEPORT entries having many.
This cleans up the need to search all the possible sockets on any given
tb, we just go and look at the one we care about. Does this make
sense? Thanks,


Josef
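
As a rough illustration of the second-hash-table idea described above,
here is a small user-space sketch of a table keyed on (address, port),
where each node tracks how many sockets share the binding. This is not
the kernel implementation and not an actual patch from the thread: the
bind2_* names, the bucket count, and the hash function are all made up
for the example, and it deliberately ignores wildcard addresses, uid
checks, and locking.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BIND2_BUCKETS 64 /* arbitrary size chosen for the sketch */

/* One node per address/port binding (hypothetical layout). */
struct bind2_node {
	uint32_t addr;
	uint16_t port;
	bool reuseport;      /* all owners set SO_REUSEPORT */
	unsigned int owners; /* 1 for non-SO_REUSEPORT, possibly many otherwise */
	struct bind2_node *next;
};

static struct bind2_node *bind2_table[BIND2_BUCKETS];

/* Illustrative hash only; any reasonable mix of addr and port would do. */
static unsigned int bind2_hash(uint32_t addr, uint16_t port)
{
	return (addr * 2654435761u ^ port) % BIND2_BUCKETS;
}

/* Find the node for exactly this address/port pair, or NULL. */
static struct bind2_node *bind2_lookup(uint32_t addr, uint16_t port)
{
	struct bind2_node *n = bind2_table[bind2_hash(addr, port)];

	while (n && (n->addr != addr || n->port != port))
		n = n->next;
	return n;
}

/* Returns true if the bind is accepted, false on conflict or OOM. */
static bool bind2_add(uint32_t addr, uint16_t port, bool reuseport)
{
	struct bind2_node *n = bind2_lookup(addr, port);
	unsigned int h;

	if (n) {
		/* Sharing is only allowed when every owner uses reuseport. */
		if (!n->reuseport || !reuseport)
			return false;
		n->owners++;
		return true;
	}

	n = malloc(sizeof(*n));
	if (!n)
		return false;

	h = bind2_hash(addr, port);
	n->addr = addr;
	n->port = port;
	n->reuseport = reuseport;
	n->owners = 1;
	n->next = bind2_table[h];
	bind2_table[h] = n;
	return true;
}
```

With this shape, a bind only has to inspect the single node for its own
address/port pair rather than walk every socket hanging off the
per-port bucket.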
