On 16.12.2016 23:50, Josef Bacik wrote: > On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <t...@herbertland.com> wrote: >> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jba...@fb.com> wrote: >>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jba...@fb.com> wrote: >>>> >>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jba...@fb.com> wrote: >>>>> >>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >>>>> <han...@stressinduktion.org> wrote: >>>>>> >>>>>> Hi Josef, >>>>>> >>>>>> On 15.12.2016 19:53, Josef Bacik wrote: >>>>>>> >>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>>>>>> <kraigatg...@gmail.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>>>>>> <t...@herbertland.com> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> I think there may be some suspicious code in >>>>>>>>>> inet_csk_get_port. At >>>>>>>>>> tb_found there is: >>>>>>>>>> >>>>>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>> >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>>>>>> uid))) && >>>>>>>>>> smallest_size == -1) >>>>>>>>>> goto success; >>>>>>>>>> if >>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>>>>> tb, true)) { >>>>>>>>>> if ((reuse || >>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>> sk->sk_reuseport && >>>>>>>>>> >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>>>>> smallest_size != -1 && >>>>>>>>>> --attempts >= >>>>>>>>>> 0) { >>>>>>>>>> spin_unlock_bh(&head->lock); >>>>>>>>>> goto again; >>>>>>>>>> } >>>>>>>>>> goto fail_unlock; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>>>>>> same >>>>>>>>>> clause >>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> sk->sk_reuseport && >>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this >>>>>>>>>> is true >>>>>>>>>> the >>>>>>>>>> first conditional should be hit, goto done, and the second >>>>>>>>>> will >>>>>>>>>> never >>>>>>>>>> evaluate that part to true-- unless the sk is changed (do >>>>>>>>>> we need >>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>>>>> >>>>>>>>> That's an interesting point... It looks like this function also >>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>>>>> beginning >>>>>>>>> with several spin_lock(&head->lock) to exclusively >>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps >>>>>>>>> the full >>>>>>>>> bh >>>>>>>>> disable variant was preventing the timers in your stack >>>>>>>>> trace from >>>>>>>>> running interleaved with this function before? >>>>>>>> >>>>>>>> >>>>>>>> Could be, although dropping the lock shouldn't be able to >>>>>>>> affect the >>>>>>>> search state. TBH, I'm a little lost in reading function, the >>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>>>>>> that >>>>>>>> function and also in every call to inet_csk_bind_conflict. I >>>>>>>> wonder >>>>>>>> if >>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only >>>>>>>> allowed if the port number (snum) is explicitly specified. >>>>>>> >>>>>>> >>>>>>> Ok first I have data for you Hannes, here's the time distributions >>>>>>> before during and after the lockup (with all the debugging in >>>>>>> place >>>>>>> the >>>>>>> box eventually recovers). I've attached it as a text file >>>>>>> since it is >>>>>>> long. >>>>>> >>>>>> >>>>>> Thanks a lot! >>>>>> >>>>>>> Second is I was thinking about why we would spend so much time >>>>>>> doing >>>>>>> the >>>>>>> ->owners list, and obviously it's because of the massive amount of >>>>>>> timewait sockets on the owners list. I wrote the following >>>>>>> dumb patch >>>>>>> and tested it and the problem has disappeared completely. Now >>>>>>> I don't >>>>>>> know if this is right at all, but I thought it was weird we >>>>>>> weren't >>>>>>> copying the soreuseport option from the original socket onto >>>>>>> the twsk. >>>>>>> Is there are reason we aren't doing this currently? Does this >>>>>>> help >>>>>>> explain what is happening? Thanks, >>>>>> >>>>>> >>>>>> The patch is interesting and a good clue, but I am immediately a bit >>>>>> concerned that we don't copy/tag the socket with the uid also to >>>>>> keep >>>>>> the security properties for SO_REUSEPORT. I have to think a bit more >>>>>> about this. >>>>>> >>>>>> We have seen hangs during connect. I am afraid this patch >>>>>> wouldn't help >>>>>> there while also guaranteeing uniqueness. >>>>> >>>>> >>>>> >>>>> Yeah so I looked at the code some more and actually my patch is >>>>> really >>>>> bad. If sk2->sk_reuseport is set we'll look at >>>>> sk2->sk_reuseport_cb, which >>>>> is outside of the timewait sock, so that's definitely bad. >>>>> >>>>> But we should at least be setting it to 0 so that we don't do this >>>>> normally. Unfortunately simply setting it to 0 doesn't fix the >>>>> problem. So >>>>> for some reason having ->sk_reuseport set to 1 on a timewait >>>>> socket makes >>>>> this problem non-existent, which is strange. >>>>> >>>>> So back to the drawing board I guess. I wonder if doing what craig >>>>> suggested and batching the timewait timer expires so it hurts less >>>>> would >>>>> accomplish the same results. Thanks, >>>> >>>> >>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This >>>> is the >>>> code >>>> >>>> if ((!reuse || !sk2->sk_reuse || >>>> sk2->sk_state == TCP_LISTEN) && >>>> (!reuseport || !sk2->sk_reuseport || >>>> rcu_access_pointer(sk->sk_reuseport_cb) || >>>> (sk2->sk_state != TCP_TIME_WAIT && >>>> !uid_eq(uid, sock_i_uid(sk2))))) { >>>> >>>> if (!sk2->sk_rcv_saddr || >>>> !sk->sk_rcv_saddr >>>> || >>>> sk2->sk_rcv_saddr == >>>> sk->sk_rcv_saddr) >>>> break; >>>> } >>>> >>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport >>>> == 1. >>>> But now we are using reuseport, so sk->sk_reuseport_cb should be >>>> non-NULL >>>> right? So really setting the timewait sock's sk_reuseport should >>>> have no >>>> bearing on how this loop plays out right? Thanks, >>> >>> >>> >>> So more messing around and I noticed that we basically don't do the >>> tb->fastreuseport logic at all if we've ended up with a non >>> SO_REUSEPORT >>> socket on that tb. So before I fully understood what I was doing I >>> fixed it >>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT >>> socket, we reset tb->fastreuseport to 1 and set the uid to match the >>> uid of >>> the socket. This made the problem go away. Tom pointed out that if >>> we bind >>> to the same port on a different address and we have a non SO_REUSEPORT >>> socket with the same address on this tb then we'd be screwed with my >>> code. >>> >>> Which brings me to his proposed solution. We need another hash >>> table that >>> is indexed based on the binding address. Then each node corresponds >>> to one >>> address/port binding, with non-SO_REUSEPORT entries having only one >>> entry, >>> and normal SO_REUSEPORT entries having many. This cleans up the >>> need to >>> search all the possible sockets on any given tb, we just go and look >>> at the >>> one we care about. Does this make sense? Thanks, >>> >> Hi Josef, >> >> Thinking about it some more the hash table won't work because of the >> rules of binding different addresses to the same port. What I think we >> can do is to change inet_bind_bucket to be structure that contains all >> the information used to detect conflicts (reuse*, if, address, uid, >> etc.) and a list of sockets that share that exact same information-- >> for instance all socket in timewait state create through some listener >> socket should wind up on single bucket. When we do the bind_conflict >> function we only should have to walk this buckets, not the full list >> of sockets. >> >> Thoughts on this? > > This sounds good, maybe tb->owners be a list of say > > struct inet_unique_shit { > struct sock_common sk; > struct hlist socks; > }; > > Then we make inet_unique_shit like twsks', just copy the relevant > information, then hang the real sockets off of the socks hlist. > Something like that? Thanks,
As a counter idea: can we push up a flag to the inet_bind_bucket that we check in the fast- way style that indicates that we have wildcarded non-reuse(port) in there, so we can skip the bind_bucket much more quickly? This wouldn't require a new data structure. Thanks, Hannes