On Wed, 9 Oct 2019 16:57:39 +0000, Pooja Trivedi wrote:
> On Fri, Sep 27, 2019 at 05:37:53PM -0700, Jakub Kicinski wrote:
> > On Tue, 24 Sep 2019 12:48:26 -0400, Pooja Trivedi wrote:
> > > On Mon, Sep 23, 2019 at 8:28 PM Jakub Kicinski wrote:
> > > > On Sat, 21 Sep 2019 23:19:20 -0400, Pooja Trivedi wrote:
> > > > > On Wed, Sep 18, 2019 at 5:45 PM Jakub Kicinski wrote:
> > > > > > On Wed, 18 Sep 2019 17:37:44 -0400, Pooja Trivedi wrote:
> > > > > > > Hi Jakub,
> > > > > > >
> > > > > > > I have explained one potential way for the race to happen in my
> > > > > > > original message to the netdev mailing list here:
> > > > > > > https://marc.info/?l=linux-netdev&m=156805120229554&w=2
> > > > > > >
> > > > > > > Here is the part of it that's relevant to your question:
> > > > > > >
> > > > > > > -----------------------------------------
> > > > > > >
> > > > > > > One potential way for the race condition to appear:
> > > > > > >
> > > > > > > When under TCP memory pressure, Thread 1 takes the following
> > > > > > > code path:
> > > > > > > do_sendfile ---> ... ---> tls_sw_sendpage --->
> > > > > > > tls_sw_do_sendpage ---> tls_tx_records ---> tls_push_sg --->
> > > > > > > do_tcp_sendpages ---> sk_stream_wait_memory ---> sk_wait_event
> > > > > >
> > > > > > Ugh, so do_tcp_sendpages() can also release the lock :/
> > > > > >
> > > > > > Since the problem occurs in tls_sw_do_sendpage() and
> > > > > > tls_sw_do_sendmsg() as well, should we perhaps fix it at that
> > > > > > level?
> > > > >
> > > > > That won't do, because tls_tx_records also gets called when
> > > > > completion callbacks schedule delayed work. That was the code path
> > > > > that caused the crash in my test: Cavium's nitrox crypto offload
> > > > > driver calls tls_encrypt_done, which calls schedule_delayed_work.
> > > > > The delayed work that was scheduled would then be processed by
> > > > > tx_work_handler.
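The shape of this race -- a critical section whose lock is dropped while waiting, the way sk_stream_wait_memory()/sk_wait_event() drop the socket lock -- can be reproduced in a small userspace sketch. This is not kernel code: a pthread mutex stands in for the socket lock, and fake_tx_records()/run_race_demo() are invented names for illustration only.

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;
static int writers_inside;   /* threads currently in the "locked" section */
static int max_writers_seen; /* peak concurrency observed */

static void fake_tx_records(void)
{
	pthread_mutex_lock(&sock_lock);

	writers_inside++;
	if (writers_inside > max_writers_seen)
		max_writers_seen = writers_inside;

	/* "Wait for memory": like sk_wait_event(), release the lock,
	 * sleep, then take it back.  This is the race window. */
	pthread_mutex_unlock(&sock_lock);
	usleep(100 * 1000);
	pthread_mutex_lock(&sock_lock);

	writers_inside--;
	pthread_mutex_unlock(&sock_lock);
}

static void *writer(void *arg)
{
	(void)arg;
	fake_tx_records();
	return NULL;
}

/* Returns the peak number of writers inside the section; 2 means both
 * threads were in the tls_tx_records-equivalent code at once. */
int run_race_demo(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, writer, NULL);
	pthread_create(&t2, NULL, writer, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return max_writers_seen;
}
```

If the lock were held across the whole section, run_race_demo() would return 1; dropping it inside the wait lets the second writer in, which is exactly the window in which tx_list can be corrupted.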
> > > > > Notice in my previous reply:
> > > > > "Thread 2 code path:
> > > > > tx_work_handler ---> tls_tx_records"
> > > >
> > > > Right, the work handler would obviously also have to obey the
> > > > exclusion mechanism of choice.
> > > >
> > > > Having said that, this really does feel like we are trying to lock
> > > > code, not data here :(
> > >
> > > Agreed, and that is exactly the thought process I went through. So
> > > what are some other options?
> > >
> > > 1) A lock member inside of ctx to protect tx_list
> > > We are load testing kTLS offload with nitrox, and performance was
> > > quite adversely affected by this. The approach can be explored more,
> > > but the original design of using the socket lock didn't follow this
> > > model either.
> > >
> > > 2) Tagging individual records inside tx_list to indicate whether
> > > they have been 'processed'
> > > This approach would likely protect the data without compromising
> > > performance. It would allow Thread 2 to proceed with the TX portion
> > > of tls_tx_records while Thread 1 sleeps waiting for memory. There
> > > would need to be careful cleanup and backtracking after the thread
> > > wakes up to ensure a consistent state of tx_list and record
> > > transmission. The approach has several problems, however: (a) it
> > > could cause out-of-order record TX; (b) if Thread 1 is waiting for
> > > memory, Thread 2 most likely will be too; (c) again, the socket lock
> > > wasn't designed to follow this model to begin with.
> > >
> > > Given that the socket lock was essentially working as a code
> > > protector -- an exclusion mechanism that allows only a single writer
> > > through tls_tx_records at a time -- what other clean ways do we have
> > > to fix the race without a significant refactor of the design and
> > > code?
> >
> > Very sorry about the delay.
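On option (1), just to make sure we mean the same thing: locking the data rather than the code would look roughly like the sketch below. This is a userspace stand-in, not the actual kernel structures -- tx_ctx, tx_record, tx_list_push() and tx_list_pop() are invented names, and a pthread mutex stands in for whatever kernel primitive we'd pick.

```c
#include <pthread.h>
#include <stdlib.h>

struct tx_record {
	struct tx_record *next;
	int seq;                      /* record sequence number */
};

struct tx_ctx {
	pthread_mutex_t tx_list_lock; /* protects tx_list only */
	struct tx_record *tx_list;
};

static void tx_list_push(struct tx_ctx *ctx, int seq)
{
	struct tx_record *rec = malloc(sizeof(*rec));

	rec->seq = seq;
	pthread_mutex_lock(&ctx->tx_list_lock);
	rec->next = ctx->tx_list;
	ctx->tx_list = rec;
	pthread_mutex_unlock(&ctx->tx_list_lock);
}

/* Pop the most recently pushed record; returns its seq, or -1 if the
 * list is empty.  Every tx_list access goes through tx_list_lock, so
 * no caller needs to hold the socket lock for list consistency. */
static int tx_list_pop(struct tx_ctx *ctx)
{
	struct tx_record *rec;
	int seq = -1;

	pthread_mutex_lock(&ctx->tx_list_lock);
	rec = ctx->tx_list;
	if (rec) {
		ctx->tx_list = rec->next;
		seq = rec->seq;
		free(rec);
	}
	pthread_mutex_unlock(&ctx->tx_list_lock);
	return seq;
}
```

Note this only keeps the list itself consistent; it says nothing about serializing whole record transmissions, and if the holder needs to sleep the real lock would have to be a mutex rather than a spinlock.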
> > I don't think we can maintain the correct semantics without
> > sleeping :( If we just bail in tls_tx_records() when there's already
> > another writer, the later writer will return from the system call
> > even though the data is not pushed into the TCP layer.
>
> Thanks for your response, and sorry about the delay!
>
> I am trying the following scenarios in my head to see how valid your
> concern is. Play along with me, please.
>
> The two main writers in the picture here are:
> Thread 1 -- enqueue thread (sendfile system call) -- pushes records to
> the card, and also performs completions (push to TCP) if records are
> ready
> Thread 2 -- work handler (tx_work_handler) -- bottom-half completion
> routine
>
> With the submitted patch:
>
> Case 1 (your concern): Thread 2 grabs the socket lock, calls
> tls_tx_records, runs into memory pressure, releases the socket lock,
> and waits for memory. Now Thread 1 grabs the socket lock, calls
> tls_tx_records, and bails.
> In this case, the sendfile system call will bail out without
> performing completions. Is that really a problem? When Thread 1
> ultimately proceeds, it will perform the completions anyway.
I think there is value in preserving normal socket semantics as much as
possible. If a writer pushes more data into a TCP connection than the
send buffer can hold, the call should block.

> Case 2: The threads grab the socket lock in the reverse sequence of
> Case 1. So Thread 1 grabs the socket lock first and ends up waiting
> for memory. Thread 2 comes in later and bails from tls_tx_records.
> In this case, I realized that the submitted patch does not reschedule
> the work before bailing. So I think an amendment to that patch, along
> the lines of what's shown below, would fare better.
>
> ********************************************************************************
>
> > What was the reason for the performance impact on (1)?
>
> While the performance impact still needs to be investigated, that
> effort has stopped short due to other issues with that approach, such
> as hard lockups. The basic problem is that the socket lock is touched
> at multiple layers (tls, tcp, etc.).
>
> Here are the two approaches we tried along the lines of using an
> additional lock:
>
> Approach 1 -- Protect tx_list with a spinlock_t tx_list_lock:
> Approach 2 -- Protect tx_list with a spinlock_t tx_list_lock, and move
> the lock/release of the socket lock to tls_push_sg:

You can't sleep with spin locks; we'd need a mutex...

> > My feeling is that we need to make writers wait to maintain socket
> > write semantics, and that implies putting writers to sleep, which is
> > indeed very costly..
> >
> > Perhaps something along the lines of:
> >
> > 	if (ctx->in_tcp_sendpages) {
> > 		rc = sk_stream_wait_memory(sk, &timeo);
> > 		...
> > 	}
> >
> > in tls_tx_records() would be the "most correct" solution? If we get
> > there and there is already a writer, that means the first writer has
> > to be waiting for memory, and so should the second..
> >
> > WDYT?
>
> Hmmm, I am not sure what the benefit would be of having two threads do
> the completions, as explained above with the Case 1 and Case 2
> scenarios.
> If we made the later thread also wait while the earlier one waits for
> memory, just so the later one can also perform the completions, it
> will either have no completions left to take care of, or minimal ones
> (a few records that may have sneaked in between the earlier thread
> returning and the second making it in).

Semantics matter.

> The only patch that we have been able to make work consistently,
> without crashing and also without compromising performance, is the
> previously submitted one where the later thread bails out of
> tls_tx_records. And as mentioned, it can perhaps be made more
> efficient by rescheduling the delayed work in the case where the work
> handler thread turns out to be the later thread that has to bail.

Let me try to find a way to repro this reliably without any funky
accelerators. The sleep in do_tcp_sendpages() should affect all cases.
I should have some time today and tomorrow to look into this; bear with
me..
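For reference while I dig, the bail-and-reschedule behavior described above reduces to a toy model like the one below. Userspace C, not a patch: tls_ctx_sim, tx_records_or_bail() and reschedule_count are invented names, with reschedule_count standing in for a schedule_delayed_work() call.

```c
#include <stdbool.h>

struct tls_ctx_sim {
	bool in_tcp_sendpages; /* a writer is already inside */
	int records_pushed;    /* completions performed so far */
	int reschedule_count;  /* times the work was re-queued */
};

/* Returns true if this caller performed the completions itself. */
static bool tx_records_or_bail(struct tls_ctx_sim *ctx, int pending)
{
	if (ctx->in_tcp_sendpages) {
		/* An earlier writer is waiting for memory; it will push
		 * the pending records when it wakes.  Re-queue the work
		 * so the handler gets another chance, then bail. */
		ctx->reschedule_count++;
		return false;
	}

	ctx->in_tcp_sendpages = true;
	ctx->records_pushed += pending; /* "push to TCP" */
	ctx->in_tcp_sendpages = false;
	return true;
}
```

The later caller never pushes records itself; it only re-queues the work, matching the "later thread bails out of tls_tx_records" behavior of the submitted patch.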