On 11/6/20 3:23 PM, Morten Brørup wrote:
From: Ananyev, Konstantin [mailto:konstantin.anan...@intel.com]
Sent: Friday, November 6, 2020 12:54 PM
Hi Olivier,
m->nb_seg must be reset on mbuf free
whatever
the
value
of m->next,
because it can happen that m->nb_seg is
!=
1.
For
instance in this
case:
m1 = rte_pktmbuf_alloc(mp);
rte_pktmbuf_append(m1, 500);
m2 = rte_pktmbuf_alloc(mp);
rte_pktmbuf_append(m2, 500);
rte_pktmbuf_chain(m1, m2);
m0 = rte_pktmbuf_alloc(mp);
rte_pktmbuf_append(m0, 500);
rte_pktmbuf_chain(m0, m1);
As rte_pktmbuf_chain() does not reset
nb_seg in
the
initial m1
segment (this is not required), after
this
code
the
mbuf chain
have 3 segments:
- m0: next=m1, nb_seg=3
- m1: next=m2, nb_seg=2
- m2: next=NULL, nb_seg=1
Freeing this mbuf chain will not
restore
nb_seg=1
in
the second
segment.
Hmm, not sure why is that?
You are talking about freeing m1, right?
rte_pktmbuf_prefree_seg(struct rte_mbuf
*m)
{
...
if (m->next != NULL) {
m->next = NULL;
m->nb_segs = 1;
}
m1->next != NULL, so it will enter the
if()
block,
and will reset both next and nb_segs.
What I am missing here?
Thinking in more generic way, that
change:
- if (m->next != NULL) {
- m->next = NULL;
- m->nb_segs = 1;
- }
+ m->next = NULL;
+ m->nb_segs = 1;
Ah, sorry. I oversimplified the example
and
now
it
does
not
show the issue...
The full example also adds a split() to
break
the
mbuf
chain
between m1 and m2. The kind of thing that
would
be
done
for
software TCP segmentation.
If so, may be the right solution is to care
about
nb_segs
when next is set to NULL on split? Any
place
when
next
is
set
to NULL. Just to keep the optimization in a
more
generic
place.
The problem with that approach is that there
are
already
several
existing split() or trim() implementations in
different
DPDK-
based
applications. For instance, we have some in
6WINDGate. If
we
force
applications to set nb_seg to 1 when
resetting
next,
it
has
to be
documented because it is not straightforward.
I think it is better to go that way.
From my perspective it seems natural to reset
nb_seg at
same
time
we reset next, otherwise inconsistency will
occur.
While it is not explicitly stated for nb_segs, to
me
it
was
clear
that
nb_segs is only valid in the first segment, like
for
many
fields
(port,
ol_flags, vlan, rss, ...).
If we say that nb_segs has to be valid in any
segments,
it
means
that
chain() or split() will have to update it in all
segments,
which
is not
efficient.
Why in all?
We can state that nb_segs on non-first segment
should
always
equal
1.
As I understand in that case, both split() and
chain()
have
to
update nb_segs
only for head mbufs, rest ones will remain
untouched.
Well, anyway, I think it's strange to have a
constraint
on m-
nb_segs
for
non-first segment. We don't have that kind of
constraints
for
other
fields.
True, we don't. But this is one of the fields we
consider
critical
for proper work of mbuf alloc/free mechanism.
I am not sure that requiring m->nb_segs == 1 on non-first
segments
will provide any benefits.
It would make this patch unneeded.
So, for direct, non-segmented mbufs pktmbuf_free() will
remain
write-
free.
I see. Then I agree with Konstantin that alternative
solutions
should
be considered.
The benefit regarding free()'ing non-segmented mbufs - which
is a
very common operation - certainly outweighs the cost of
requiring
split()/chain() operations to set the new head mbuf's nb_segs =
1.
Nonetheless, the bug needs to be fixed somehow.
If we can't come up with a better solution that doesn't break
the
ABI, we are forced to accept the patch.
Unless the techboard accepts to break the ABI in order to
avoid
the
performance cost of this patch.
Did someone notice a performance drop with this patch?
On my side, I don't see any regression on a L3 use case.
I am afraid that the DPDK performance regression tests are based
on
TX immediately following RX, so cache misses in TX may go by
unnoticed
because RX warmed up the cache for TX already. And similarly for RX
reusing mbufs that have been warmed up by the preceding free() at
TX.
Please consider testing the performance difference with the mbuf
being completely cold at TX, and going completely cold again before
being reused for RX.
Let's sumarize: splitting a mbuf chain and freeing it causes
subsequent
mbuf
allocation to return a mbuf which is not correctly initialized.
There
are 2
options to fix it:
1/ change the mbuf free function (this patch)
- m->nb_segs would behave like many other field: valid in
the
first
segment, ignored in other segments
- may impact performance (suspected)
2/ change all places where a mbuf chain is split, or trimmed
- m->nb_segs would have a specific behavior: count the
number of
segments in the first mbuf, should be 1 in the last
segment,
ignored in other ones.
- no code change in mbuf library, so no performance impact
- need to patch all places where we do a mbuf split or trim.
From
afar,
I see at least mbuf_cut_seg_ofs() in DPDK. Some external
applications
may have to be patched (for instance, I already found 3
places
in
6WIND code base without a deep search).
In my opinion, 1/ is better, except we notice a significant
performance,
because the (implicit) behavior is unchanged.
Whatever the solution, some documentation has to be added.
Olivier
Unfortunately, I don't think that anything but the first option
will
go into 20.11 and stable releases of older versions, so I stand by
my
acknowledgment of the patch.
If we are affraid about 20.11 performance (it is legitimate, few
days
before the release), we can target 21.02. After all, everybody
lives
with this bug since 2017, so there is no urgency. If accepted and
well
tested, it can be backported in stable branches.
+1
Good thinking, Olivier!
Looking at the changes once again, it probably can be reworked a bit:
- if (m->next != NULL) {
- m->next = NULL;
- m->nb_segs = 1;
- }
+ if (m->next != NULL)
+ m->next = NULL;
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
That way we add one more condition checking, but I suppose it
shouldn't be that perf critical.
That way for direct,non-segmented mbuf it still should be write-free.
Except cases as you described above: chain(), then split().
Of-course we still need to do perf testing for that approach too.
So if your preference it to postpone it till 21.02 - that's ok for me.
Konstantin
With this suggestion, I cannot imagine any performance drop for direct, non-segmented
mbufs: It now reads m->nb_segs, residing in the mbuf's first cache line, but the
function already reads m->refcnt in the first cache line; so no cache misses are
introduced.
+1