Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of clusters.  I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames...
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet-pushing side of
>things, but the last couple of times I was, it was at different points in that
>4d space, and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try to manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.
Here's my NFS guy perspective.
I do think 9K mbuf clusters should go away. I'll note that I once coded NFS so
that it would use 4K mbuf clusters for the big RPCs (write requests and read
replies), and I actually managed to fragment the mbuf cluster pool to the point
that it stopped working on a small machine, so it is possible (although not
likely) to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code
  will generate a list of 33 2K mbuf clusters. If the net interface doesn't do
  TSO, this is probably fine, since tcp_output() will end up busting the list
  up into a bunch of TCP segments, using the mbuf clusters with TCP/IP headers
  added for each segment, etc...
  - If the net interface does TSO, this long list goes down to the net driver
    and uses 34-35 transmit ring entries to send it (typically at least one
    extra segment is added for the MAC header). If the driver isn't buggy and
    the net chip supports lots of transmit ring entries, this works ok, but...
- If there were a 64K supercluster, the NFS code could easily use it for the
  64K of data, and a TSO-enabled net interface would use 2 transmit ring
  entries (one for the MAC/TCP/NFS header and one for the 64K of data). If the
  net interface can't handle a TSO segment over 65535 bytes, it will end up
  getting 2 TSO segments from tcp_output(), but that is still a lot fewer
  than 35.
I don't know enough about net hardware to know when/if this will help perf., but
it seems that it might, at least for some chipsets?
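
Just to make the ring-entry arithmetic concrete, here's a trivial
back-of-the-envelope comparison. It assumes roughly one transmit descriptor
per mbuf in the chain (real drivers and bus_dma can coalesce or split
segments, so treat the numbers as ballpark only):

/*
 * Back-of-the-envelope only, not driver code: how many transmit
 * descriptors a 64K NFS read reply costs with 2K clusters versus a
 * hypothetical 64K supercluster, assuming one descriptor per mbuf
 * plus one more once the MAC/TCP header mbuf is prepended.
 */
#include <stdio.h>

#define RPC_DATA	(64 * 1024)	/* 64K of read data */
#define CL_2K		2048		/* standard 2K mbuf cluster */

int
main(void)
{
	/* 32 clusters of data + 1 mbuf for the RPC header = 33 from NFS */
	int nmbufs_2k = RPC_DATA / CL_2K + 1;
	/* plus the prepended MAC/TCP header mbuf -> ~34-35 ring entries */
	int ndesc_2k = nmbufs_2k + 1;
	/* one 64K supercluster + one header mbuf -> 2 ring entries */
	int ndesc_64k = 2;

	printf("2K clusters:      %d mbufs, ~%d tx descriptors\n",
	    nmbufs_2k, ndesc_2k);
	printf("64K supercluster: ~%d tx descriptors\n", ndesc_64k);
	return (0);
}

Either way, the win is that the transmit chain gets an order of magnitude
shorter, which should matter most for chips with small transmit rings.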

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets,
but as others have noted, they won't be allocated for long unless packets
arrive out of order, at least for NFS. (Other apps might not read the socket
for a while to get the data, so the clusters could sit in the socket receive
queue for a while.)

I chose 64K, since that is what most net interfaces can handle for TSO these
days. (If that limit will soon be larger, I think this should be even larger
too, but all of them the same size, to avoid fragmentation.) For the send case
for NFS, it wouldn't even need to be a very large pool, since the clusters get
freed as soon as the net interface transmits the TSO segment.

For NFS, the send code could easily call mget_supercl() and fall back on the
current code using 2K mbuf clusters if mget_supercl() failed, so a small pool
would be fine for the NFS send side.
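
Roughly, I'm picturing something like the sketch below. To be clear,
mget_supercl() is just a name I'm making up in this mail, not an existing
API; the fallback uses the existing m_getm2() to build the 2K cluster chain,
which is more or less what the code does today:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch only: try the (hypothetical) 64K supercluster pool first and
 * fall back to the current behaviour of chaining 2K clusters.
 */
static struct mbuf *
nfs_alloc_reply_data(int len)		/* hypothetical helper, len <= 64K */
{
	struct mbuf *m;

	m = mget_supercl(M_NOWAIT);	/* hypothetical; NULL if pool is empty */
	if (m != NULL)
		return (m);

	/* Small pool exhausted: build the reply out of 2K clusters. */
	return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
}

Since the supercluster attempt is M_NOWAIT and the fallback always works,
the pool can stay small without ever blocking the send path.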

I'd like to see a pool for 64K or larger mbuf clusters for the send side.
For the receive side, I'll let others figure out the best solution (4K or
larger for jumbo clusters). I do think anything larger than 4K needs a
separate allocation pool to avoid fragmentation.
(I don't know, but I'd guess iSCSI could use them as well?)
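
By "separate allocation pool" I mean something along the lines of a dedicated
UMA zone, the way the existing 9K/16K jumbo zones are set up in
sys/kern/kern_mbuf.c. The sketch below is just to show the shape of it: the
names (zone_supercl, MSUPERCLBYTES) are made up, the 256 item cap is
arbitrary, and a real zone would need the same ctor/dtor and page allocation
glue the other cluster zones have:

#include <sys/param.h>
#include <sys/kernel.h>
#include <vm/uma.h>

#define	MSUPERCLBYTES	(64 * 1024)	/* made-up name for a 64K cluster */

static uma_zone_t zone_supercl;

static void
supercl_zone_init(void)
{
	/* Own backing store, so 64K clusters can't fragment the 2K/4K zones. */
	zone_supercl = uma_zcreate("mbuf_supercl_64k", MSUPERCLBYTES,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	/* Keep the pool small; send-side clusters are freed right after TSO. */
	uma_zone_set_max(zone_supercl, 256);
}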

rick
