Re: [Intel-wired-lan] [PATCH iwl-net] idpf: extend tx watchdog timeout

Josh Hay Wed, 12 Jun 2024 23:36:46 -0700



On 6/12/2024 2:34 AM, Alexander Lobakin wrote:

From: Josh Hay <[email protected]>
Date: Tue, 11 Jun 2024 11:13:53 -0700



On 6/11/2024 3:44 AM, Alexander Lobakin wrote:

From: David Decotigny <[email protected]>
Date: Tue, 4 Jun 2024 16:34:48 -0700


[...]

Note that there are several patches fixing Tx (incl. timeouts) in my
tree, including yours (Joshua's) which you somehow didn't send yet ._.
Maybe start from them first?


I believe it was you who specifically asked our team to defer pushing
any upstream patches while you were working on the XDP series "to avoid
having to rebase", which was a reasonable request at the time. We also


It was only related to the virtchnl refactoring and later I cancelled
that when I realized it will go earlier than our series.

had no reason to believe the existing upstream idpf implementation was
experiencing timeouts (it is being tested by numerous validation teams).
So there was no urgency to get those patches upstream. Which patches in
your tree do you believe fix specific timeout situations? It appears you


[0][1][2]

pulled in some of the changes from the out-of-tree driver, but those
were all enhancements. It wasn't until the workload that David mentioned


No, these are all fixes.

[0] is yours from the OOT, extended.
[1] is mine and never was in the OOT.
[2] is yours from the OOT, extended by Michał.

My main point was since no other tx timeouts have been reported on the
upstream driver, none of these seem like critical fixes. This particular
timeout signature did not seem to match any of these in general. E.g. it
would have been obvious if the timeout was because of what [0] fixes.
It's also possible these timeouts did not manifest on the upstream
driver because it is missing other OOT changes.


They really do help.

No disagreement there. I would've loved to push these changes sooner,
but we already covered why that didn't happen.


Note that there's one more Tx timeout patch from you in the OOT, but it
actually broke Tx xD

If you are implying that our team would commit code that is knowingly
broken, that is absolutely not true. I believe what you're referring to
is a change that introduced a tx timeout that also took a very specific
workload to trigger it. That issue was fixed and not applicable to the
current upstream implementation, so I do not see how that is relevant to
this conversation.

was run on the current driver that we had any indication there were
timeout issues.


I don't buy 30 seconds, at least for now. Maybe I'm missing something.

Nacked-by: Alexander Lobakin <[email protected]>



In the process of debugging the newly discovered timeout, our
architecture team made it clear that 5 seconds is too low for this type
of device, with a non-deterministic pipeline where packets can take a
number of exception/slow paths. Admittedly, we don't know the exact


Slowpath which takes 30 seconds to complete, seriously?

The architecture team said 5s is too low. 30s was chosen to give ample
cushion to avoid changing the timeout should this situation come up again.

number, so the solution for the time being was to bump it up with a
comfortable buffer. As we tune things and debug with various workloads,
we can bring it back down. As David mentioned, there is precedent for an
extended timeout for smartnics. Why is it suddenly unacceptable for
Intel's device?


I don't know where this "suddenly" comes from.
Because even 5 seconds is too much.
HW usually sends packets in microseconds if not faster. Extending the
timeout will hide real issues and make debugging more difficult.

Can you please elaborate on exactly why it would be more difficult? If
something is actually wrong in HW, it seems unlikely extra time would
correct it. If something is functionally wrong in the driver, why does
it matter if it's 5s, 15s, or 30s? It will timeout just the same.
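
For reference, the stall check under discussion can be modeled in a few
lines of user-space C. This is only a sketch under stated assumptions:
MODEL_HZ and the model_* names are hypothetical, and the logic is a
simplified picture of a Tx watchdog (report a stopped queue once its last
transmit is older than the configured timeout). It is not code from idpf
or from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tick rate for this model (ticks per second). */
#define MODEL_HZ 1000u

/* Wraparound-safe "a is later than b" on a free-running tick counter. */
static bool model_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Returns true when a stopped queue has made no progress for longer
 * than timeout_ticks since last_tx_ticks. A running queue never
 * counts as timed out.
 */
static bool model_tx_timed_out(bool queue_stopped, uint32_t now_ticks,
			       uint32_t last_tx_ticks, uint32_t timeout_ticks)
{
	if (!queue_stopped)
		return false;
	return model_time_after(now_ticks, last_tx_ticks + timeout_ticks);
}
```

In this model a stall that never clears trips the check with either a
5 * MODEL_HZ or a 30 * MODEL_HZ budget; the value only changes how long
the queue has to sit idle before it is reported.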


I suspect this all is for OOO packets with an explicit sending timestamp
passed from the userspace, but as I said, you need to teach the kernel
watchdog to account them.

Out of order completions can happen for numerous reasons, some of which
the driver will know nothing about, i.e. the userspace timestamps are
not the only things that trigger OOO completions. We can detect that
we're processing completion B before A, but it's only at that time that
we can tell the stack to _maybe_ expect a delayed completion. I'm open
to discussing this further, but it does not seem like a simple solution
that can be implemented in the immediate future.
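
To make the "completion B before A" case concrete, here is a minimal
user-space sketch. Everything in it is hypothetical (struct model_txq,
MODEL_RING, per-packet sequence numbers) and it does not reflect how idpf
actually tracks completions. It assumes every completed sequence number
lies within MODEL_RING of the oldest outstanding packet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window of packets that may be in flight at once. */
#define MODEL_RING 8u

struct model_txq {
	uint32_t head;          /* sequence number of the oldest outstanding packet */
	bool done[MODEL_RING];  /* done[seq % MODEL_RING]: seq completed early */
};

/*
 * Process one completion. Returns true if it arrived out of order,
 * i.e. a later packet (B) completed while an older one (A) is still
 * outstanding. That is the first moment the driver could know that
 * A's completion may be delayed.
 */
static bool model_handle_completion(struct model_txq *q, uint32_t seq)
{
	if (seq != q->head) {
		q->done[seq % MODEL_RING] = true;
		return true;
	}

	/* In order: retire it, then any later packets that completed early. */
	q->head++;
	while (q->done[q->head % MODEL_RING]) {
		q->done[q->head % MODEL_RING] = false;
		q->head++;
	}
	return false;
}
```

Note that the out-of-order signal only appears once B's completion has
been processed; nothing in this model can predict ahead of time that A
will be late.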

Otherwise, I can ask the driver to send a packet in 31 seconds, what
then, there will be a timeout and you will send a patch to extend it to
60 seconds?

I hope there are checks in the stack itself that would not allow the
packet to be scheduled beyond the timeout window :)


Thanks,
Olek


Thanks,
Josh


[0] https://github.com/alobakin/linux/commit/aad547037598
[1] https://github.com/alobakin/linux/commit/50f4c27ba13e
[2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8

Thanks,
Olek


Thanks,
Josh
