Yes, I realize that, but can't the device still complete in a burst (of
unsuppressed completions)? I mean, it's not guaranteed that for every
txq_complete a signaled completion is pending, right? What happens if
the device has inconsistent completion pacing? Can't the SW accumulate a
backlog of completions if txq_complete processes only a single completion
unconditionally?
Speculation. First of all, the device doesn't delay completion notifications for
no reason. An ASIC is not SW running on top of an OS.

I'm sorry, but this statement is not correct. It might hold in a
lab environment, but in practice there are many things that can
affect device timing.

If a completion comes up late,
this means the device really can't keep up with the rate of descriptor posting.
If so, tx_burst() should generate back-pressure by returning a partial Tx count,
and the app can then decide between dropping and retrying. Retrying on Tx means
back-pressuring the Rx side if the app is forwarding packets.
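
For illustration, a minimal sketch of that drop-or-retry decision on the
application side, using the standard rte_eth_tx_burst() API; the retry budget
and the function name are made up:

/* Hedged sketch: one way an app can handle back-pressure from a partial
 * tx_burst() return -- retry a bounded number of times, then drop.
 * The retry budget and the function name are arbitrary. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
send_with_backpressure(uint16_t port, uint16_t queue,
		       struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t sent = 0;
	int retries = 3;	/* arbitrary retry budget */

	while (sent < nb_pkts && retries-- > 0)
		sent += rte_eth_tx_burst(port, queue, pkts + sent,
					 nb_pkts - sent);
	/* Drop whatever the device still cannot accept. */
	while (sent < nb_pkts)
		rte_pktmbuf_free(pkts[sent++]);
}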

I'm not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".

The more serious problem I expected was the case where THRESH is smaller than
the burst size. In that case, txq->elts[] would be short of free slots all the
time. Fortunately, in the MLX PMD, we request at most one completion per burst,
not one every THRESH packets.
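
Roughly, the per-burst request looks like the following simplified sketch; the
struct layout, names and threshold value are illustrative, not quoted from the
actual mlx5 code:

/* Hedged sketch: request a signaled completion at most once per burst.
 * Struct layout, names and the threshold value are illustrative only. */
#include <stdint.h>
#include <stdbool.h>

#define TX_COMP_THRESH 32	/* stand-in for the real threshold */

struct txq_sketch {
	uint16_t elts_comp;	/* packets posted since the last CQE request */
};

static inline bool
request_completion(struct txq_sketch *txq, uint16_t pkts_n)
{
	txq->elts_comp += pkts_n;
	if (txq->elts_comp < TX_COMP_THRESH)
		return false;	/* burst completes silently */
	txq->elts_comp = 0;
	return true;		/* caller marks the last WQE to generate a CQE */
}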

If there's some SW jitter in Tx processing, the Tx CQ can grow for sure. The
question I asked myself was "when does it shrink?". It shrinks when the Tx burst
is light (burst size smaller than THRESH), because mlx5_tx_complete() is called
every time tx_burst() is called. What if it keeps growing? Then dropping is
necessary and natural, as I mentioned above.

It doesn't make sense for SW to absorb every possible SW jitter; the cost is
high. That is usually done by increasing the queue depth. Keeping a steady state
is more important.

Again, I agree jitter is bad, but with proper heuristics in place mlx5
can still keep jitter low _and_ consume completions faster than one per
tx_burst invocation.

Rather, this patch helps reduce jitter. When I run a profiler, the most
cycle-consuming part on Tx is still freeing buffers. If we allow a loop over
valid CQEs, many buffers could be freed in a single call of mlx5_tx_complete()
at some point, which would cause a long delay. That would aggravate jitter.
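
To make the contrast concrete, a hedged sketch of the two strategies;
cqe_is_valid() and free_one_completion() are hypothetical stand-ins for the
real CQE check and buffer freeing:

/* Hedged sketch of the two completion-processing strategies. */
#include <stdbool.h>

struct txq;				/* opaque here */
bool cqe_is_valid(struct txq *txq);	/* hypothetical helper */
void free_one_completion(struct txq *txq);	/* hypothetical helper */

/* Unbounded: drain every valid CQE. After a hiccup this frees a large
 * batch of mbufs in one call, which shows up as a latency spike. */
static void
tx_complete_drain(struct txq *txq)
{
	while (cqe_is_valid(txq))
		free_one_completion(txq);
}

/* Bounded (what this patch does): at most one completion per call, so the
 * per-tx_burst() cost stays roughly constant. */
static void
tx_complete_single(struct txq *txq)
{
	if (cqe_is_valid(txq))
		free_one_completion(txq);
}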

I didn't dispute the fact that this patch addresses an issue, but mlx5 is
a driver designed to run under applications that can behave differently
from your test case.

Of course. I appreciate your time for the review. And keep in mind that nothing
is impossible in an open source community; I always like to discuss ideas with
anyone. But I was just asking for more details about your suggestion if you
wanted me to implement it, rather than a one-sentence question :-)

Good to know.

Does "budget" mean the
threshold? If so, calculation of stats for adaptive threshold can impact single
core performance. With multiple cores, adjusting threshold doesn't affect much.

If you look at the mlx5e driver in the kernel, it maintains online stats on
its RX and TX queues. It maintains these stats mostly for adaptive
interrupt moderation control (but not only).

I was suggesting maintaining per-TX-queue stats on the average number of
completions consumed per TX burst call, and adjusting the stopping condition
according to the calculated stat.
In the case of interrupt moderation, it could be beneficial because interrupt
handling is costly. But the beauty of DPDK is polling, isn't it?

If you read my comment again, I didn't suggest applying the stats to
interrupt moderation; I just gave an example of a use case. I was
suggesting maintaining online stats to adjust a threshold on how many
completions to process in a tx burst call (instead of
processing one unconditionally).
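
Something along these lines, as a rough sketch only -- none of this exists in
the driver, and the names and the 1/8 weight are arbitrary:

/* Hedged sketch of the suggested heuristic; nothing here exists in mlx5.
 * Keep an EWMA of completions consumed per tx_burst() and use it as the
 * budget (stopping condition) for the next call. */
#include <stdint.h>

struct txq_comp_stats {
	uint32_t avg_x8;	/* EWMA of completions per burst, scaled by 8 */
};

static inline uint16_t
tx_completion_budget(struct txq_comp_stats *s, uint16_t consumed_last)
{
	uint16_t budget;

	/* new_avg = 7/8 * old_avg + 1/8 * sample, kept in fixed point */
	s->avg_x8 = s->avg_x8 - (s->avg_x8 >> 3) + consumed_last;
	budget = (uint16_t)(s->avg_x8 >> 3);
	/* Always allow at least one completion so the CQ cannot stall. */
	return budget ? budget : 1;
}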

And please remember to ack at the end of this discussion if you are okay with
it, so that this patch can get merged. One data point: single-core performance
(fwd) of the vectorized PMD improves by more than 6% with this patch. 6% is
never small.

Yeah, I don't mind merging it, given that I don't have time to come
up with anything better (or worse :))

Reviewed-by: Sagi Grimberg <s...@grimberg.me>
