29/06/2022 21:44, Honnappa Nagarahalli wrote:
<snip>
04/06/2022 13:51, Andrew Rybchenko wrote:
On 6/4/22 15:19, Morten Brørup wrote:
From: Jerin Jacob [mailto:jerinjac...@gmail.com]
Sent: Saturday, 4 June 2022 13.10
On Sat, Jun 4, 2022 at 3:30 PM Andrew Rybchenko
<andrew.rybche...@oktetlabs.ru> wrote:
On 6/4/22 12:33, Jerin Jacob wrote:
On Sat, Jun 4, 2022 at 2:39 PM Morten Brørup
<m...@smartsharesystems.com> wrote:
I would like the DPDK community to change its view on compile time options. Here is why:
Application specific performance micro-optimizations like “fast
mbuf free” and “mbuf direct re-arm” are being added to DPDK and
presented as features.
They are not features, but optimizations, and I don’t understand
the need for them to be available at run-time!
Instead of adding a bunch of exotic exceptions to the fast path of the PMDs, they should be compile time options. This will improve performance by avoiding branches in the fast path, both for the applications using them, and for generic applications (where the exotic code is omitted).
Agree. I think, keeping the best of both worlds would be:
- Enable the feature/optimization at runtime
- Have a compile-time option to disable the feature/optimization as an override.
It is hard to find the right balance, but in general compile time options are a nightmare for maintenance. The number of required builds will grow exponentially.
Test combinations are exponential for N features, regardless of whether the N features are runtime or compile-time options.
But since I'm talking about build checks, I don't care about exponential growth in run time. Yes, testing should care, but that is a separate story.
Of course, we can limit the number of checked combinations, but it will result in a flow of patches to fix the build in the other cases.
The build breakage can be avoided if we use (2) instead of (1), because with (2) the feature code is still compiled (and then eliminated as dead code) even when the option is disabled:
1)
#ifdef ...
	/* my feature code */
#endif

2)
static __rte_always_inline int
rte_has_xyz_feature(void)
{
#ifdef RTE_LIBRTE_XYZ_FEATURE
	return RTE_LIBRTE_XYZ_FEATURE;
#else
	return 0;
#endif
}

if (rte_has_xyz_feature()) {
	/* my feature code */
}
Jerin, thanks, very good example.
I'm not sure all the features can be covered by that, e.g. added
fields in structures.
+1
Also, I would consider such features "opt in" at compile time only.
As such, they could be allowed to break the ABI/API.
Also, compile-time options tend to make code less readable, which makes all aspects of development harder.
Yes, compile time is nice for micro-optimizations, but I have serious concerns that it is the right way to go.
Please note that I am only talking about performance optimizations that are limited to application-specific use cases. I think it is reasonable that performance-optimizing an application also requires recompiling the performance-critical libraries it uses.
Allowing compile time options for application specific performance optimizations in DPDK would also open a path for other optimizations, which can only be achieved at compile time, such as “no fragmented packets”, “no attached mbufs” and “single mbuf pool”.
And even more exotic optimizations, such as the “indexed mempool
cache”, which was rejected due to ABI violations – they could be
marked as “risky and untested” or similar, but still be part of the DPDK main
repository.
Thanks Morten for bringing it up, it is an interesting topic.
Though I look at it from a different angle.
All the optimizations you mentioned above introduce new limitations:
- MBUF_FAST_FREE: no indirect mbufs and no multiple mempools;
- mempool object indexes: mempool size is limited to 4GB;
- direct re-arm: you lose the ability to stop/reconfigure a TX queue while the RX queue is still running;
etc.
Note that none of these limitations are forced by HW. All of them are pure SW limitations that developers forced in (or tried to) to get a bit of extra performance.
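
For reference, MBUF_FAST_FREE today is an opt-in TX offload requested at queue (or port) setup time. A minimal sketch, assuming a DPDK release with the RTE_ETH_-prefixed offload names; the port/queue IDs and descriptor count are placeholders chosen for illustration:

#include <rte_ethdev.h>

/* Sketch: request the fast-free TX offload when configuring a TX queue.
 * By setting it, the application promises that all mbufs sent on this
 * queue come from one mempool and have refcnt == 1 (no indirect mbufs). */
static int
setup_txq_with_fast_free(uint16_t port_id, uint16_t queue_id,
			 uint16_t nb_desc, unsigned int socket_id)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_txconf txconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	txconf = dev_info.default_txconf;
	/* Only set the flag if the PMD reports support for it. */
	if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
		txconf.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

	return rte_eth_tx_queue_setup(port_id, queue_id, nb_desc,
				      socket_id, &txconf);
}

Depending on the PMD, the offload may only be accepted at port level, in which case it would go into rte_eth_conf.txmode.offloads passed to rte_eth_dev_configure() rather than the per-queue txconf.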
That's a concerning tendency.
As more and more such 'optimizations via limitation' come in:
- the DPDK feature list will become more and more fragmented;
- it will cause more and more confusion for the users;
- unmet expectations: the difference in performance between the 'default' and 'optimized' versions of DPDK will become bigger and bigger;
- as Andrew already mentioned, maintaining all these 'sub-flavours' of DPDK will become more and more difficult.
The point that we need to remember is that these features/optimizations are introduced after seeing performance issues in practical use cases.
Sorry, I didn't get it: what performance issues are you talking about?
If, let's say, our mempool code is sub-optimal in some place for some architecture due to bad design or bad implementation, please point to it and let's try to fix it, instead of avoiding the mempool API.
If you are just saying that avoiding the mempool in some cases could buy us a bit of extra performance (a short-cut), then yes, it surely could.
Another question is whether it is really worth it.
Having all mbuf management covered by one SW abstraction helps a lot in terms of project maintainability, further extensions, introduction of new common optimizations, etc.
DPDK is not being used in just one use case; it is being used in several use cases, each with its own unique requirements. Is 4GB enough for packet buffers? Yes, it is enough in certain use cases. Are there NICs with a single port? Yes, there are.
Sure, there are NICs with one port.
But there are also NICs with 2 ports, 4 ports, etc.
Should we maintain specific DPDK sub-versions for all these cases?
From my perspective, no.
It would be an overwhelming effort for the DPDK community; plus, many customers use DPDK to build their own products that are supposed to work seamlessly across multiple use cases/platforms.
HW is being created because use cases and business cases exist. It is obvious that as DPDK gets adopted on more platforms that differ widely, the features will increase and it will become complex. Complexity should not be used as a criterion to reject patches.
Well, we do have plenty of HW-specific optimizations inside DPDK, and we put a lot of effort into making all this HW-specific stuff as transparent to the user as possible.
I don't see why it should be different for SW-specific optimizations.
There is a different perspective to what you are calling 'limitations'.
By 'limitations' I mean the situation when the user has to cut off existing functionality to enable these 'optimizations'.
I can argue that multiple mempools, and stopping/reconfiguring a TX queue while the RX queue is still running, are exotic. Just because those are currently allowed (probably accidentally) does not mean they are being used. Are there use cases that make use of these features?
If DPDK's examples/l3fwd doesn't use these features, it doesn't mean they are useless :)
I believe both multiple mempools (indirect mbufs) and the ability to start/stop queues separately are major DPDK features that are used across many real-world deployments.
The base/existing design for DPDK was done with one particular HW architecture in mind, where there was an abundance of resources. Unfortunately, that HW architecture is fast evolving, and DPDK is adopted in use cases where that kind of resources are not available. For example, efficiency cores are being introduced by every CPU vendor now. Soon enough, we will see big-little architectures in networking as well. The existing PMD design introduces 512B of stores (256B for copying to a stack variable and 256B to store to the lcore cache) plus a 256B load/store on the RX side, every 32 packets, back to back. It doesn't make sense to have that kind of memcopy on little/efficiency cores just for the driver code.
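
To make the numbers above concrete: with 64-bit pointers, a 32-packet burst means 32 * 8B = 256B of mbuf pointers per copy. A deliberately simplified sketch of where such copies happen in the generic mempool-based path (real PMDs add prefree checks, descriptor writes, etc.), just to illustrate the round-trip being described:

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define BURST 32

/* TX completion (simplified): the PMD returns transmitted mbufs to the
 * pool; rte_mempool_put_bulk() stores the 32 pointers (256B) into the
 * per-lcore mempool cache. */
static void
tx_free_burst(struct rte_mempool *mp, struct rte_mbuf *txed[BURST])
{
	rte_mempool_put_bulk(mp, (void **)txed, BURST);
}

/* RX refill (simplified): the PMD asks for 32 fresh mbufs;
 * rte_mempool_get_bulk() loads the 32 pointers from the lcore cache and
 * stores them (another 256B) into the caller's array, which the PMD then
 * writes into the RX descriptors. */
static int
rx_refill_burst(struct rte_mempool *mp, struct rte_mbuf *fresh[BURST])
{
	return rte_mempool_get_bulk(mp, (void **)fresh, BURST);
}

As I understand it, the direct re-arm proposal is essentially about skipping this pointer round-trip through the mempool.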
I don't object to specific use-case optimizations, especially if the use case is a common one.
But I think such changes have to be as transparent to the user as possible and shouldn't cause further DPDK code fragmentation (new CONFIG options, etc.).
I understand that it is not always possible, but for pure SW-based optimizations, I think it is a reasonable expectation.
So, probably instead of making such changes easier, we need somehow to persuade developers to think more about optimizations that would be generic and transparent to the user.
Or maybe we need to think of creating alternate ways of programming.
I do realize that this is not always possible for various reasons (HW limitations, external dependencies, etc.), but that's another story.
Let's take MBUF_FAST_FREE as an example.
In fact, I am not sure that we need it as a TX offload flag at all.
The PMD TX path has all the necessary information to decide at run time whether it can do fast_free() or not:
At tx_burst() the PMD can check whether all mbufs satisfy these conditions (same mempool, refcnt==1) and update some fields and/or counters inside the TXQ to reflect it.
Then, at tx_free() we can use this info to decide between fast_free() and normal_free().
As we read the mbuf fields at tx_burst() anyway, I guess the impact of this extra step would be minimal.
Yes, most likely it wouldn't be as fast as with the current TX offload flag or the conditional compilation approach.
But it might still be significantly faster than normal_free(), plus such an approach would be generic and transparent to the user.
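
Just to illustrate that idea, a minimal sketch; the txq_state fields and helper names are invented here, and a real PMD would fold this into its own TXQ structure and completion logic:

#include <stdbool.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical per-TXQ state. */
struct txq_state {
	struct rte_mempool *mp;	/* pool of the mbufs queued so far */
	bool fast_free_ok;	/* all queued mbufs: same pool, refcnt == 1 */
};

/* Reset tracking after the previously queued mbufs have been reclaimed. */
static inline void
txq_track_reset(struct txq_state *txq)
{
	txq->mp = NULL;
	txq->fast_free_ok = true;
}

/* Called from tx_burst() while descriptors are filled: note whether the
 * queued mbufs still qualify for the fast path (same mempool, refcnt == 1). */
static inline void
txq_track_mbuf(struct txq_state *txq, struct rte_mbuf *m)
{
	if (txq->mp == NULL)
		txq->mp = m->pool;
	if (m->pool != txq->mp || rte_mbuf_refcnt_read(m) != 1)
		txq->fast_free_ok = false;
}

/* Called when transmitted mbufs are reclaimed: bulk-put to the mempool
 * when the whole run qualified, otherwise fall back to the generic free. */
static inline void
txq_free_mbufs(struct txq_state *txq, struct rte_mbuf **mbufs, unsigned int n)
{
	unsigned int i;

	if (txq->fast_free_ok) {
		rte_mempool_put_bulk(txq->mp, (void **)mbufs, n);
	} else {
		for (i = 0; i < n; i++)
			rte_pktmbuf_free(mbufs[i]);
	}
	txq_track_reset(txq);
}

The per-mbuf comparison in txq_track_mbuf() is the extra run-time cost mentioned above; whether it is really negligible would have to be measured per driver.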
IMO, this depends on the philosophy that we want to adopt. I would prefer to make the control plane complex in exchange for performance gains on the data plane. Performance on the data plane has a multiplying effect due to the ratio of the number of cores assigned to the data plane vs the control plane.
I am not against evaluating alternatives, but the alternative approaches need to have similar (not the same) performance.
Konstantin