On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>  A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS 
> write request
> or read reply will result in a 514 element mbuf chain. Each of these (mostly 
> 2K mbuf clusters)
> are non-contiguous data segments. (I suspect most NICs do not handle this 
> many segments well,
> if at all.)

Excellent point

> 
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the 
> ktls), but I do not
> know what it would take to make these work for non-KTLS TSO?


Sendfile already uses M_EXTPG mbufs.  When I was initially doing the M_EXTPG 
work for kTLS, I added support for using M_EXTPG mbufs in sendfile regardless 
of whether or not kTLS was in use.  That reduced CPU use marginally on 64-bit 
platforms (due to reducing socket buffer lengths, and hence reducing pointer 
chasing), and quite a bit more on 32-bit platforms (due to also not needing to 
map memory into the kernel map, and by reducing pointer chasing even more, as 
more pages fit into an M_EXTPG mbuf when a paddr_t is 32 bits).


> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?

No, it's fully aware of how to handle M_EXTPG mbufs.  Look at tcp_m_copym().  We 
added code in the segment counting part of that function to count the 
header/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page 
being misaligned.
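
To illustrate the kind of counting involved, here is a minimal userland sketch 
in C.  It is not the kernel code: struct extpg_sketch, extpg_count_segs() and 
PAGE_SZ are hypothetical names made up for this example, and it assumes a 4K 
page, one segment per page touched, and that the whole mbuf is sent (the real 
function also handles partial copies).

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumed page size, not taken from the kernel */

/* Hypothetical, simplified stand-in for an M_EXTPG mbuf. */
struct extpg_sketch {
	size_t hdr_len;		/* header bytes (e.g. TLS header), 0 if none */
	size_t trail_len;	/* trailer bytes, 0 if none */
	size_t first_pg_off;	/* offset of the data within the first page */
	size_t data_len;	/* payload bytes spread across the pages */
};

/* Count the data segments a NIC would see for the whole mbuf. */
static unsigned
extpg_count_segs(const struct extpg_sketch *m)
{
	unsigned segs = 0;

	assert(m->first_pg_off < PAGE_SZ);
	if (m->hdr_len > 0)
		segs++;		/* the header is its own segment */
	if (m->data_len > 0) {
		/* A misaligned start can push the data onto one more page. */
		size_t span = m->first_pg_off + m->data_len;

		segs += (unsigned)((span + PAGE_SZ - 1) / PAGE_SZ);
	}
	if (m->trail_len > 0)
		segs++;		/* the trailer is its own segment */
	return (segs);
}
```

With a header and a trailer present, 16K of page-aligned data counts as six 
segments in this sketch; if the same data starts 100 bytes into its first page, 
it counts as seven.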

> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does 
> not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can 
> become
> a single contiguous data segment for TSO or ???)

No, it just means that a NIC driver has been verified not to call mtod() on an 
M_EXTPG mbuf and dereference the resulting data pointer (which would make it go 
"boom").

But the page size is only 4K on most platforms.  So while an M_EXTPG mbuf can 
hold 5 pages (from memory; too lazy to do the math right now) and reduces 
socket buffer mbuf chain lengths by a factor of 10 or so (2K vs 20K per mbuf), 
the S/G list that a NIC will need to consume would likely decrease only by a 
factor of 2.  And even then only if the busdma code to map mbufs for DMA is not 
coalescing adjacent mbufs.  I know busdma does some coalescing, but I can't 
recall if it coalesces physically adjacent mbufs.
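
To put rough numbers on that, here is a back-of-the-envelope sketch in C.  The 
names (struct chain_estimate, estimate_chain(), div_round_up()) are made up for 
this example.  It assumes one S/G segment per 2K cluster or per 4K page, no 
busdma coalescing, and it ignores any header mbufs; the 5 pages per M_EXTPG 
mbuf is the from-memory figure above, passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division, rounding up. */
static size_t
div_round_up(size_t n, size_t d)
{
	assert(d > 0);
	return ((n + d - 1) / d);
}

/* Hypothetical result type for this estimate. */
struct chain_estimate {
	size_t cluster_mbufs;	/* chain length using mbuf clusters */
	size_t extpg_mbufs;	/* chain length using M_EXTPG mbufs */
	size_t cluster_segs;	/* S/G entries with clusters (one each) */
	size_t extpg_segs;	/* S/G entries with M_EXTPG (one per page) */
};

static struct chain_estimate
estimate_chain(size_t payload, size_t cluster_size, size_t page_size,
    size_t pages_per_extpg)
{
	struct chain_estimate e;

	e.cluster_mbufs = div_round_up(payload, cluster_size);
	e.extpg_mbufs = div_round_up(payload, page_size * pages_per_extpg);
	e.cluster_segs = e.cluster_mbufs;
	e.extpg_segs = div_round_up(payload, page_size);
	return (e);
}
```

For a 1Mbyte payload, estimate_chain(1048576, 2048, 4096, 5) gives 512 cluster 
mbufs versus 52 M_EXTPG mbufs (roughly a factor of 10), but 512 versus 256 S/G 
segments (a factor of 2).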

> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being 
> called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS 
> (non-TLS case).


It does.  You should enable it for at least TCP.

Drew
