On 21.08.2013 21:59, Navdeep Parhar wrote:
On 08/21/13 12:41, Scott Long wrote:
On Aug 21, 2013, at 8:59 AM, Andre Oppermann <an...@freebsd.org> wrote:
On 19.08.2013 23:45, Navdeep Parhar wrote:
On 08/19/13 13:58, Andre Oppermann wrote:
On 19.08.2013 19:33, Navdeep Parhar wrote:
On 08/19/13 04:16, Andre Oppermann wrote:
Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520
Log:
Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users
for a very long time, if ever.
Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.
Discussed with: trociny, glebius
Modified:
head/sys/kern/kern_mbuf.c
head/sys/kern/uipc_mbuf.c
head/sys/sys/mbuf.h
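(For reference, the sort of thing that log message suggests would look
roughly like the untested sketch below. drv_rxbuf and friends are
made-up names, EXT_NET_DRV stands in for a dedicated EXT_SOMETHING type,
and the exact m_extadd()/ext_free signature differs between mbuf(9)
versions.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>

struct drv_rxbuf {			/* hypothetical driver-owned rx buffer */
	char		data[2048];
	volatile u_int	refcount;
};

/* Dedicated free routine; this follows the newer m_ext_free_t form. */
static void
drv_rxbuf_free(struct mbuf *m)
{
	struct drv_rxbuf *rb = m->m_ext.ext_arg1;

	if (atomic_fetchadd_int(&rb->refcount, -1) == 1)
		free(rb, M_DEVBUF);	/* or hand it back to the driver's ring */
}

/* Wrap the driver buffer as external storage of a freshly allocated mbuf. */
static struct mbuf *
drv_rxbuf_to_mbuf(struct drv_rxbuf *rb, int len)
{
	struct mbuf *m;

	m = m_gethdr(M_NOWAIT, MT_DATA);
	if (m == NULL)
		return (NULL);
	atomic_add_int(&rb->refcount, 1);
	m_extadd(m, rb->data, sizeof(rb->data), drv_rxbuf_free, rb, NULL,
	    0, EXT_NET_DRV);		/* or a dedicated EXT_SOMETHING */
	m->m_len = m->m_pkthdr.len = len;
	return (m);
}

Each consumer that holds on to the buffer owns one count on
rb->refcount; the normal m_freem() path ends up in drv_rxbuf_free() and
the last reference returns the buffer to wherever it came from.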
Hello Andre,
Is this just garbage collection or is there some other reason for this?
This is garbage collection and the removal of not-quite-right, rotten
functionality.
I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver. M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal. See
user/np/cxl_tuning. The experiment was quite successful and I was
planning to bring in most of those changes to HEAD. I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumbop, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster. But now M_NOFREE has vanished without a warning...
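(My rough reading of that mbuf-plus-refcount-in-the-cluster layout, as
an untested sketch using the pre-r254520 M_NOFREE flag; this is not the
code in user/np/cxl_tuning, and the m_ext field names and init details
differ between branches.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * A cluster whose first bytes carry the mbuf and the reference count, so
 * a filled rx buffer needs no separate mbuf or refcount allocation.
 * M_NOFREE keeps m_free() from returning the embedded mbuf to the mbuf
 * zone; EXT_EXTREF points the external refcount pointer at the counter
 * stored in the cluster itself.
 */
struct packed_cluster {
	struct mbuf	m;		/* embedded header mbuf, M_NOFREE */
	volatile u_int	refcount;	/* shared refcount, lives in the cluster */
	char		payload[];	/* frame data follows */
};

/* Last reference went away: recycle or free the whole cluster. */
static void
packed_cluster_free(struct mbuf *m)	/* free-routine signature is branch-dependent */
{
	struct packed_cluster *pc = m->m_ext.ext_arg1;

	(void)pc;			/* hand pc back to the driver's ring */
}

static void
packed_cluster_setup(struct packed_cluster *pc, u_int clsize)
{
	struct mbuf *m = &pc->m;

	pc->refcount = 1;
	/* pkthdr initialization omitted; M_PKTHDR would also be set. */
	m->m_flags = M_EXT | M_NOFREE;
	m->m_data = pc->payload;
	m->m_ext.ext_buf = pc->payload;
	m->m_ext.ext_size = clsize - offsetof(struct packed_cluster, payload);
	m->m_ext.ext_type = EXT_EXTREF;
	m->m_ext.ext_cnt = &pc->refcount;	/* ref_cnt on older branches */
	m->m_ext.ext_free = packed_cluster_free;
	m->m_ext.ext_arg1 = pc;
}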
I'm looking through your experimental code and those are some really
good numbers you're achieving there!
However, a couple of things don't feel quite right, hackish even, and
not fit for HEAD. This is a bit like the situation we had with some of
the first 1GigE cards quite a number of years back (mostly ti(4)).
There we ended up with a couple of just-good-enough hacks to make them
fast. Most of those remains are what I garbage-collected today.
If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520), then the changes are perfectly
legitimate. The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism. I did
this so as to use the same zone as m_uiotombuf and keep it "hot" for all
consumers (driver + network stack).
If you insist, I'll revert the commit removing M_NOFREE. EXT_EXTREF
isn't touched yet, but it should get better support.
The hackish part for me is that the driver again manages its own memory
pool. Windows works that way, and NetBSD is moving towards it, while
FreeBSD has had and keeps a central network memory pool. The latter (our
current) way of doing it seems more efficient overall, especially on
heavily loaded networked machines. Significant queues may build up
(think of an app blocked with many socket buffers filling up), delaying
the freeing and returning of network memory resources. Together with
fragmentation this can lead to very bad outcomes. Router applications
with many interfaces also greatly benefit from central memory pools.
So I'm really not sure that we should move back in the driver-owned
pool direction, with lots of code duplication and copy-pasting (see
NetBSD). Also it is kinda weird to have a kernel-based pool for data
going down the stack and another one in each driver for data going up.
Actually I'm of the opinion that we should stay with the central memory
pool and fix it so that it works just as well for those cases where a
driver pool currently performs better.
The central memory pool approach is too slow, unfortunately. There's a
reason that other OSes are moving to private pools. At Netflix we are
currently working on some approaches to private memory pools in order to
achieve better efficiency, and we're closely watching and anticipating
Navdeep's work.
I should point out that I went to great lengths to use the jumbop zone
in my experiments, and not create my own pool of memory for the rx
buffers. The hope was to share cache warmth (sounds very cosy :-) with
the likes of m_uiotombuf (which uses jumbop too) etc. So I'm actually
in the camp that prefers central pools. I'm just trying out ways to
reduce the trips we have to make to the pool(s) involved. Laying down
mbufs within clusters, and packing multiple frames per cluster clearly
helps. Careful cluster recycling within the NIC seems to work too.
What you describe does make a lot of sense. Jumbop is the optimal size
for the VM. We should really look at pushing forward a nicer
M_NOFREE+EXT_EXTREF API for 10 and HEAD going forward.
As always it seems to depend on the use case and on what is being
measured. Is it single-stream performance? Concurrent streams? Both
ways or only one way, in or out? Each makes very different use of many
parts of the stack and driver, leading to different bottlenecks.
The Netflix case is obviously a bit special in being heavily
send-oriented across a great many concurrent connections. Here re-use
of mbufs moving down the stack seems limited to non-existent, because
the clusters stay in the socket send buffer until acknowledged, which is
many milliseconds later. Only re-use of the mbuf header would be
possible. Additionally, with the use of sendfile there are no mbuf
clusters to be captured anyway.
On the way up the Netflix usage seems to be almost ACK-only, that is,
small packets. If the NIC is capable of splatting a number of them back
to back into the same jumbop cluster, without wasting a full cluster for
each, that looks to be quite a win as well. Then one only has to attach
the mbuf headers and send them up the stack.
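Roughly, as an untested sketch that reuses the made-up drv_rxbuf helper
from further up (frame offsets and lengths as the hypothetical NIC
reports them):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * The NIC has written nframes small frames back to back into one rx
 * buffer; give each frame its own header mbuf while all of them share
 * the underlying buffer through its refcount.
 */
static void
deliver_packed_frames(struct ifnet *ifp, struct drv_rxbuf *rb,
    const uint16_t *off, const uint16_t *len, int nframes)
{
	struct mbuf *m;
	int i;

	for (i = 0; i < nframes; i++) {
		m = drv_rxbuf_to_mbuf(rb, len[i]);	/* sketch further up */
		if (m == NULL)
			break;				/* drop on alloc failure */
		m->m_data += off[i];			/* point at this frame */
		m->m_pkthdr.rcvif = ifp;
		(*ifp->if_input)(ifp, m);
	}
}

m_freem() on each header eventually drops the shared refcount, so the
buffer only goes back to the ring once all frames packed into it have
been consumed.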
There used to be some Syskonnect/Marvell GigE chips that were able to
take different-sized mbufs on their RX rings. One could be configured
to splat the packet right into the small data portion of a header mbuf.
Larger packets would go into normal 2K clusters. Such an approach, if
the NIC is capable of it, would probably be beneficial as well, doing
away with all M_EXT handling for small (ACK) packets.
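In driver terms that boils down to something like these hypothetical
refill helpers, nothing chip-specific:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Small ring: a plain header mbuf, the packet lands in the mbuf's own
 * data area (up to MHLEN bytes) and no M_EXT is involved at all.
 */
static struct mbuf *
rx_small_buf(void)
{
	return (m_gethdr(M_NOWAIT, MT_DATA));
}

/* Large ring: an ordinary mbuf with a 2K cluster attached. */
static struct mbuf *
rx_large_buf(void)
{
	return (m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR));
}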
However, for the Netflix case I believe two things may provide
additional immediate gains: a) moving the routing and ARP tables to
rmlocks, while de-pointering them at the same time (see my other recent
email to net@); b) analyzing and tuning the interaction between LRO, TCP
ACK compression, and TCP ABC to reduce the chance of small TSO chains
getting emitted.
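For a), the pattern I have in mind is plain rmlock(9); a minimal
illustration with a made-up table lock, since the actual routing/ARP
table conversion is of course much more involved:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock tbl_lock;

static void
tbl_init(void)
{
	rm_init(&tbl_lock, "example table lock");
}

/* Lookups are the hot, read-mostly path and stay almost free. */
static void
tbl_lookup(void)
{
	struct rm_priotracker tracker;

	rm_rlock(&tbl_lock, &tracker);
	/* ... read-mostly lookup runs here ... */
	rm_runlock(&tbl_lock, &tracker);
}

/* Route/ARP updates are rare and take the expensive write path. */
static void
tbl_update(void)
{
	rm_wlock(&tbl_lock);
	/* ... infrequent modification ... */
	rm_wunlock(&tbl_lock);
}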
--
Andre