From: Josh Hay <joshua.a....@intel.com> Date: Tue, 11 Jun 2024 11:13:53 -0700
> > > On 6/11/2024 3:44 AM, Alexander Lobakin wrote: >> From: David Decotigny <ddeco...@gmail.com> >> Date: Tue, 4 Jun 2024 16:34:48 -0700 [...] >> Note that there are several patches fixing Tx (incl. timeouts) in my >> tree, including yours (Joshua's) which you somehow didn't send yet ._. >> Maybe start from them first? > > I believe it was you who specifically asked our team to defer pushing > any upstream patches while you were working on the XDP series "to avoid > having to rebase", which was a reasonable request at the time. We also It was only related to the virtchnl refactoring and later I cancelled that when I realized it will go earlier than our series. > had no reason to believe the existing upstream idpf implementation was > experiencing timeouts (it is being tested by numerous validation teams). > So there was no urgency to get those patches upstream. Which patches in > your tree do you believe fix specific timeout situations? It appears you [0][1][2] > pulled in some of the changes from the out-of-tree driver, but those > were all enhancements. It wasn't until the workload that David mentioned No, there are all fixes. [0] is your from the OOT, extended. [1] is mine and never was in the OOT. [2] is your from the OOT, extended by MichaĆ. They really do help. Note that there's one more Tx timeout patch from you in the OOT, but it actually broke Tx xD > was run on the current driver that we had any indication there were > timeout issues. > >> >> I don't buy 30 seconds, at least for now. Maybe I'm missing something. >> >> Nacked-by: Alexander Lobakin <aleksander.loba...@intel.com> > > > In the process of debugging the newly discovered timeout, our > architecture team made it clear that 5 seconds is too low for this type > of device, with a non deterministic pipeline where packets can take a > number of exception/slow paths. Admittedly, we don't know the exact Slowpath which takes 30 seconds to complete, seriously? > number, so the solution for the time being was to bump it up with a > comfortable buffer. As we tune things and debug with various workloads, > we can bring it back down. As David mentioned, there is precedent for an > extended timeout for smartnics. Why is it suddenly unacceptable for > Intel's device? I don't know where this "suddenly" comes from. Because even 5 seconds is too much. HW usually send packets in microseconds if not faster. Extending the timeout will hide real issues and make debugging more difficult. I suspect this all is for OOO packets with an explicit sending timestamp passed from the userspace, but as I said, you need to teach the kernel watchdog to account them. Otherwise, I can ask the driver to send a packet in 31 seconds, what then, there will be a timeout and you will send a patch to extend it to 60 seconds? > >> >> Thanks, >> Olek > > Thanks, > Josh [0] https://github.com/alobakin/linux/commit/aad547037598 [1] https://github.com/alobakin/linux/commit/50f4c27ba13e [2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8 Thanks, Olek