On 19.08.2013 23:45, Navdeep Parhar wrote:
On 08/19/13 13:58, Andre Oppermann wrote:
On 19.08.2013 19:33, Navdeep Parhar wrote:
On 08/19/13 04:16, Andre Oppermann wrote:
Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520

Log:
    Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
    users for a very long time, if ever.

    Should such a functionality ever be needed again the appropriate and
    much better way to do it is through a custom EXT_SOMETHING external
    mbuf type together with a dedicated *ext_free function.

    Discussed with:    trociny, glebius

Modified:
    head/sys/kern/kern_mbuf.c
    head/sys/kern/uipc_mbuf.c
    head/sys/sys/mbuf.h


Hello Andre,

Is this just garbage collection or is there some other reason for this?

This is garbage collection, and removal of not-quite-right, rotten
functionality.

I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumbop, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without warning...

I'm looking through your experimental code and those are some really good
numbers you're achieving there!

However, a couple of things don't feel quite right, hackish even, and not
fit for HEAD.  This is a bit like the situation we had with some of the
first 1GigE cards quite a number of years back (mostly ti(4)).  There
we ended up with a couple of just-good-enough hacks to make them fast.
Most of the remains I've collected today.

If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520) then the changes are perfectly
legitimate.  The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism.  I did
this to use the same zone as m_uiotombuf and keep it "hot" for all
consumers (driver + network stack).

If you insist I'll revert the commit removing M_NOFREE.  EXT_EXTREF isn't
touched yet, but should get better support.

The hackish part for me is that the driver again manages its own memory
pool.  Windows works that way and NetBSD is moving towards it, while FreeBSD
has and remains at a central network memory pool.  The latter (our current)
way of doing it seems more efficient overall, especially on heavily loaded
networked machines.  Significant queues may build up (think of an app
blocking while many of its socket buffers fill up), delaying the freeing
and returning of network memory resources.  Together with fragmentation
this can lead to very bad outcomes.  Router applications with many
interfaces also greatly benefit from central memory pools.

So I'm really not sure that we should move back in the driver-owned pool
direction with lots of code duplication and copy-pasting (see NetBSD).
Also it is kinda weird to have a kernel-based pool for data going down
the stack and another one in each driver for data going up.

Actually I'm of the opinion that we should stay with the central memory
pool and fix it so that it works just as well in those cases where a driver
pool currently performs better.

I believe most of the improvements you've shown can be implemented in
a more generic and safe way in the mbuf system.  Also a number of things
in your experimental code may have side effects in situations other than
netperf runs.

Agreed.  As I mentioned, my long term plan was to tweak the jumboX zones
to allow them to allocate clusters with an embedded mbuf + refcount.  The
M_NOFREE+EXT_EXTREF approach is the perfect bridge from here to there.
It is non-intrusive and lends itself well to experimentation.
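
Roughly, the single-allocation layout I have in mind looks like this.
This is only a sketch; the struct name and members are made up for the
mail, not code from the branch:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch of a page-size cluster that carries its own mbuf and refcount,
 * so rx needs a single zone allocation per packet.  The zone would hand
 * out MJUMPAGESIZE-sized items laid out like this; the embedded mbuf's
 * external storage points at data[] and its refcount pointer at refcount.
 */
struct embedded_cluster {
	struct mbuf	 m;		/* embedded mbuf (M_NOFREE today) */
	volatile u_int	 refcount;	/* replaces the zone_ext_refcnt entry */
	char		 data[];	/* payload, the rest of the 4K item */
};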

Agreed, full support on fixing the refcount issue.

To summarize what I get from your experimental branch commits:
- the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
   a single memory buffer, provided it is large enough.
- you make use of that feature and point multiple m_ext mbufs into that
   buffer to separate the packets and send them up the stack.
- you embed the m_ext refcount into the single memory buffer as well.

yes, yes, and yes.

- you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
   with the cxgbe code)

I recycle the cluster (and the embedded mbuf in it) when possible.  It all
depends on whether it's still in use by the time the rx queue needs it back.
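
To make sure I read the branch right, here is the rx path in sketch form.
The rx_buf layout, the carve_frame() name and the hand-rolled m_ext setup
are mine for illustration only, not the branch's code, and I'm assuming
the two-argument ext_free prototype:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <machine/atomic.h>

/*
 * One large DMA buffer holding several frames back to back, plus the
 * reference count shared by every mbuf carved out of it.  A real driver
 * also holds a reference of its own while the buffer is still being
 * filled by the hardware (not shown here).
 */
struct rx_buf {
	volatile u_int	 refcount;	/* shared by all carved mbufs */
	u_int		 size;		/* usable bytes in frames[] */
	char		 frames[];	/* frames DMA'd back to back */
};

/* Called by mb_free_ext() when the last carved mbuf goes away. */
static void
rx_buf_free(void *arg1, void *arg2 __unused)
{
	struct rx_buf *rxb = arg1;

	/*
	 * The branch takes the buffer from the jumbop zone and recycles it
	 * into the rx queue instead; plain free() keeps the sketch simple.
	 */
	free(rxb, M_DEVBUF);
}

/*
 * Hand one frame inside 'rxb' up the stack.  The m_ext fields are filled
 * in by hand here to keep the shared, driver-owned refcount intact;
 * otherwise this is the usual EXT_EXTREF external storage setup.  (The
 * branch goes further and embeds one mbuf in the buffer itself, which is
 * where M_NOFREE came in; that part is left out.)
 */
static struct mbuf *
carve_frame(struct rx_buf *rxb, char *frame, int len)
{
	struct mbuf *m;

	m = m_gethdr(M_NOWAIT, MT_DATA);
	if (m == NULL)
		return (NULL);

	atomic_add_int(&rxb->refcount, 1);
	m->m_flags |= M_EXT;
	m->m_ext.ext_buf = (caddr_t)rxb->frames;
	m->m_ext.ext_size = rxb->size;
	m->m_ext.ext_free = rx_buf_free;
	m->m_ext.ext_arg1 = rxb;
	m->m_ext.ext_arg2 = NULL;
	m->m_ext.ext_type = EXT_EXTREF;
	m->m_ext.ref_cnt = &rxb->refcount;
	m->m_data = frame;
	m->m_len = m->m_pkthdr.len = len;
	return (m);
}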

There's always a couple of problems with driver-managed pools.  Driver
shutdown/unloading requires all such managed mbufs to have been returned
before it can proceed.  This may take an arbitrarily long time as mbufs
sit in socket buffers or other queues.

Let's examine and discuss these parts:
- M_NOFREE wasn't really safe with bpf anyway, at least for packets going
   down the stack.

I see.  Can you point out the parts of bpf unable to deal with M_NOFREE?

Sorry, false alarm.

- Instead of M_NOFREE a custom *ext_free should be used, which has the
   same and even more functionality.

Yes, that's what I was planning too, with the jumboX zone changes.  It
would be faster than running the m_ext's free function (which is a
function dereference+call).

It would be a bit faster but how do you know when an mbuf itself has been
free'd and can be reused without notification?

- Recycling mbufs may cause imbalances in the mbuf pool with multiple cores
   and possibly NUMA in the future.  Not that we are very good at it at
   the moment, but bypassing the mbuf allocator shouldn't become the norm.
   If it is a problem/bottleneck again it should be fixed, not worked around
   and copy-pasted n times in so many drivers.

If/when a cluster is recycled, it is given back to the same rx ithread
that originally allocated it, and not to any other queue.  If the
ithread stays in the same NUMA domain (and it really should) then
recycling the cluster in the same queue really shouldn't cause imbalances.
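
In sketch form (the structures and locking here are simplified for the
mail, not the actual cxgbe code), the ext_free callback is both the
notification that the buffer is no longer in use and the point where it
goes back to its owning queue:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

/* Simplified per-rx-queue recycle list; not the real cxgbe structures. */
struct rxq {
	struct mtx		 recycle_lock;
	SLIST_HEAD(, rxbuf)	 recycle_list;
};

struct rxbuf {
	SLIST_ENTRY(rxbuf)	 link;
	struct rxq		*owner;		/* queue that allocated it */
	volatile u_int		 refcount;	/* shared by the carved mbufs */
};

/*
 * ext_free callback for the carved mbufs.  mb_free_ext() runs it only
 * when the last reference is dropped, so the buffer is guaranteed unused
 * here.  It always goes back onto the list of the queue that allocated
 * it, never to another queue (or NUMA domain).  The refill path pops from
 * this list when it needs buffers; anything still referenced simply isn't
 * on the list yet and a fresh buffer is allocated instead.
 */
static void
rxbuf_free(void *arg1, void *arg2 __unused)
{
	struct rxbuf *rxb = arg1;
	struct rxq *rxq = rxb->owner;

	mtx_lock(&rxq->recycle_lock);
	SLIST_INSERT_HEAD(&rxq->recycle_list, rxb, link);
	mtx_unlock(&rxq->recycle_lock);
}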

OK.  (Side note: at the moment the mbuf allocator isn't NUMA-aware at all.)

- jumbo9 and jumbo16 mbufs should not be used because they are more special
   than necessary, being both KVM and physically contiguous.  This probably
   isn't necessary for the T4/T5 cards or any other modern DMA engine.
   Under heavy, diverse network load the kernel memory becomes fragmented
   and can't satisfy both criteria anymore.  In fact both are an artifact
   of the early GigE hacks, when high speed DMA engines were in their
   infancy.  Both jumbo9 and jumbo16 should go away without direct
   replacement.

The devices I deal with will be able to cope, but I suspect this will be
disruptive (for not enough reason to justify the disruption, imho) to
others.  That's a separate discussion anyway.

Again, using jumbo9 or jumbo16 doesn't scale.  Wollman had it fail quite
badly on his heavily loaded NFS server.

The way to go should be 4K clusters as they are native to the architecture.
IIRC a PCIe DMA can't cross a 4K boundary anyway, so having a number of
them together shouldn't have much of an impact except for consuming
descriptors?
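
For example (sketch only, the helper name is made up), a receive buffer
of any size can be put together from page-size clusters with no
contiguity requirement at all:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Build a receive buffer of 'len' bytes out of MJUMPAGESIZE (4K) clusters
 * chained together instead of one physically contiguous jumbo9/jumbo16
 * cluster.  Each link needs only a single page, so this keeps working on
 * a fragmented machine.
 */
static struct mbuf *
alloc_paged_chain(int len)
{
	struct mbuf *m, *tail, *top;
	int total;

	if (len <= 0)
		return (NULL);
	top = tail = NULL;
	total = len;
	while (len > 0) {
		m = m_getjcl(M_NOWAIT, MT_DATA, top == NULL ? M_PKTHDR : 0,
		    MJUMPAGESIZE);
		if (m == NULL) {
			m_freem(top);
			return (NULL);
		}
		m->m_len = min(len, MJUMPAGESIZE);
		len -= m->m_len;
		if (top == NULL)
			top = m;
		else
			tail->m_next = m;
		tail = m;
	}
	top->m_pkthdr.len = total;
	return (top);
}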

On a well loaded system it's almost impossible to obtain physically
contiguous pages after some time, except through pre-allocation.  But then
the pre-allocation has to be properly sized in advance and takes memory
away from other uses.

   In your T4/T5 case the alternative would be either to a) allocate your
   own memory directly from KVM with the necessary properties (KVM and/or
   physically contiguous); or b) have such a generic kernel mbuf type
   fulfilling the same role.  There may be some cache line issues on
   non-x86 systems that have to be thought about and taken care of.
- Refcounts should have the option of being separately allocated.  It was
   mandatory to use the refcount zone so far because there wasn't any other
   type of refcount.  Now that we have one we should revisit the issue.
   Actually the entire mbuf allocation and initialization could be
   streamlined as a whole.

On a side note, a different mbuf handling for the RX DMA rings may give
some significant improvements as well: allocate just a cluster without an
mbuf through m_cljget() and put it into the RX ring.  Only when the DMA
has put a packet into it is the mbuf allocated and attached (in the
driver's RX function).  This avoids the cache pollution from touching the
mbuf fields during initial allocation, including the refcount.

Er, this is pretty much how cxgbe(4) has always done it.  A cluster is
allocated with m_cljget, and an mbuf is allocated with MT_NOINIT to avoid
touching any cache line in the mbuf.  m_init is called later in the rx
ithread.
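
In sketch form (the slot structure and descriptor programming are
simplified for the mail, not the actual t4_sge.c code, and the m_init()
prototype is what's in HEAD right now):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>

/* Simplified rx ring slot; the real driver keeps busdma state here too. */
struct rx_slot {
	void		*cl;	/* bare cluster, already on the hw ring */
	struct mbuf	*m;	/* spare mbuf, deliberately uninitialized */
};

/*
 * Refill path: stash a bare cluster and an uninitialized mbuf in the
 * slot.  MT_NOINIT makes the mbuf zone ctor skip all initialization, so
 * no mbuf cache line is dirtied until a packet actually arrives.
 */
static int
rx_slot_refill(struct rx_slot *slot)
{

	slot->cl = m_cljget(NULL, M_NOWAIT, MCLBYTES);
	if (slot->cl == NULL)
		return (ENOMEM);
	slot->m = m_gethdr(M_NOWAIT, MT_NOINIT);
	if (slot->m == NULL) {
		uma_zfree(zone_clust, slot->cl);
		slot->cl = NULL;
		return (ENOMEM);
	}
	/* ... program the cluster's bus address into the hw descriptor ... */
	return (0);
}

/* Completion path, run in the rx ithread once the NIC filled the cluster. */
static struct mbuf *
rx_slot_complete(struct rx_slot *slot, int len)
{
	struct mbuf *m = slot->m;

	/* Initialize the mbuf only now, then attach the filled cluster. */
	if (m_init(m, NULL, 0, M_NOWAIT, MT_DATA, M_PKTHDR) != 0)
		return (NULL);	/* error handling trimmed */
	m_cljset(m, slot->cl, EXT_CLUSTER);
	m->m_len = m->m_pkthdr.len = len;
	return (m);
}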

Excellent.  I haven't (yet) spent much time reading the cxgbe(4) code.

Another nice trick would be to shorten the external buffer by 4 bytes (in
ext_size) and put the refcount there.
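
Roughly like this (a sketch; the helper name is made up and the rest of
the setup is the usual EXT_EXTREF arrangement):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Steal the last 4 bytes of a standard 2K cluster for the refcount and
 * advertise only the remainder as external storage, so no separate
 * refcount allocation is needed at all.
 */
static void
cl_attach_tail_refcnt(struct mbuf *m, caddr_t cl,
    void (*extfree)(void *, void *), void *arg1)
{
	volatile u_int *refcnt;

	refcnt = (volatile u_int *)(cl + MCLBYTES - sizeof(u_int));
	*refcnt = 1;

	m->m_flags |= M_EXT;
	m->m_data = m->m_ext.ext_buf = cl;
	m->m_ext.ext_size = MCLBYTES - sizeof(u_int);	/* shortened by 4 */
	m->m_ext.ext_free = extfree;
	m->m_ext.ext_arg1 = arg1;
	m->m_ext.ext_arg2 = NULL;
	m->m_ext.ext_type = EXT_EXTREF;
	m->m_ext.ref_cnt = refcnt;
}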

Let's work on these together.

Yes.  I think we agree on the long term optimization, or are at least
headed in the same general direction.  But removing M_NOFREE removed a
dirt-cheap way to test and exercise the approach without implementing
the "real thing."  Why not just let it be?

I don't mind too much having M_NOFREE return but, while convenient, I'm
not convinced at all that it is the right and best way of handling that
use case.

--
Andre
