On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa <han...@stressinduktion.org> wrote:
Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> wrote:
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatg...@gmail.com>
 wrote:
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <t...@herbertland.com>
 wrote:
I think there may be some suspicious code in inet_csk_get_port. At
  tb_found there is:

                  if (((tb->fastreuse > 0 && reuse) ||
                       (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) &&
                        sk->sk_reuseport && uid_eq(tb->fastuid,
 uid))) &&
                      smallest_size == -1)
                          goto success;
                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
 tb, true)) {
                          if ((reuse ||
                               (tb->fastreuseport > 0 &&
                                sk->sk_reuseport &&

 !rcu_access_pointer(sk->sk_reuseport_cb) &&
                                uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
                                  spin_unlock_bh(&head->lock);
                                  goto again;
                          }
                          goto fail_unlock;
                  }

AFAICT there is redundancy in these two conditionals. The same clause
  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the first conditional should be hit, goto done, and the second will never evaluate that part to true-- unless the sk is changed (do we need
  READ_ONCE for sk->sk_reuseport_cb?).
  That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
  with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh disable variant was preventing the timers in your stack trace from
  running interleaved with this function before?

Could be, although dropping the lock shouldn't be able to affect the
 search state. TBH, I'm a little lost in reading function, the
 SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that function and also in every call to inet_csk_bind_conflict. I wonder if
 we can simply this under the assumption that SO_REUSEPORT is only
 allowed if the port number (snum) is explicitly specified.

 Ok first I have data for you Hannes, here's the time distributions
before during and after the lockup (with all the debugging in place the box eventually recovers). I've attached it as a text file since it is
 long.

Thanks a lot!

Second is I was thinking about why we would spend so much time doing the
 ->owners list, and obviously it's because of the massive amount of
timewait sockets on the owners list. I wrote the following dumb patch and tested it and the problem has disappeared completely. Now I don't
 know if this is right at all, but I thought it was weird we weren't
copying the soreuseport option from the original socket onto the twsk.
 Is there are reason we aren't doing this currently?  Does this help
 explain what is happening?  Thanks,

The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.


Yeah so I looked at the code some more and actually my patch is really bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this normally. Unfortunately simply setting it to 0 doesn't fix the problem. So for some reason having ->sk_reuseport set to 1 on a timewait socket makes this problem non-existent, which is strange.

So back to the drawing board I guess. I wonder if doing what craig suggested and batching the timewait timer expires so it hurts less would accomplish the same results. Thanks,

Josef

Reply via email to