On 11/1/25 17:26, Rick Macklem wrote:
On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov <[email protected]> wrote:

On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <[email protected]> wrote:

Added Slava Schwartsman.

On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
Hi,

I've had NFS over RDMA on my todo list for a very loonnnggg
time. I've avoided it because I haven't had a way to test it,
but I'm now going to start working on it. (A bunch of this work
is already done for NFS-over-TLS which added code for handling
M_EXTPG mbufs.)

From RFC-8166, there appear to be 4 operations the krpc
needs to do:
send-rdma - Send on the payload stream (sending messages that
                     are kept in order).
recv-rdma - Receive the above.
ddp-write - Do a write of DDP data.
ddp-read - Do a read of DDP data.

So, here is how I see the krpc doing this.
An NFS write RPC for example:
- The NFS client code packages the Write RPC XDR as follows:
   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
      that precede the write data.
   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
     written.
   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
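
To make that layout concrete, here is a minimal sketch of splicing in a
zero-length marker mbuf; the use of M_PROTO1 and the helper name are just
the idea floated above, not existing code:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Hypothetical helper: link the header XDR, a zero-length marker mbuf
 * flagged M_PROTO1 ("start of ddp-read") and the M_EXTPG data mbufs
 * into a single chain to hand to the krpc.
 */
static void
nfscl_mark_ddp_start(struct mbuf *hdr_tail, struct mbuf *extpg_data)
{
        struct mbuf *marker;

        marker = m_get(M_WAITOK, MT_DATA);
        marker->m_len = 0;
        marker->m_flags |= M_PROTO1;    /* "start of ddp-read" */

        /* ...header XDR... -> marker -> M_EXTPG write data -> ... */
        hdr_tail->m_next = marker;
        marker->m_next = extpg_data;
}

Any trailing request XDR would simply be linked after the M_EXTPG run
as usual.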

This would be passed to the krpc, which would...
  - send the mbufs up to "start of ddp" in the payload stream.
  - specify a ddp-read for the pages from the M_EXTPG mbufs
    and send that in the payload stream.
  - send the remaining mbufs/mbuf_clusters in the payload stream.
The NFS server end would process the received payload stream,
putting the non-ddp stuff in mbufs/mbuf_clusters.
It would do the ddp-read of the data into anonymous pages it allocates
and would associate these with M_EXTPG mbufs.
It would put any remaining payload stream stuff for the RPC message in
additional mbufs/mbuf_clusters.
--> Call the NFS server with the mbuf list for processing.
      - When the NFS server gets to the write data (in M_EXTPG mbufs)
        it would set up a uio/iovec for the pages and call VOP_WRITE().
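
Something along these lines seems plausible for that last step (the
function name is made up, the amd64 direct map is used for brevity, and
partial pages, vnode locking and iovec limits are glossed over):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/proc.h>
#include <sys/ucred.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_param.h>

/*
 * Hypothetical server-side write step: build a uio over the pages
 * carried in the M_EXTPG mbufs and hand it to VOP_WRITE().  Assumes
 * whole pages and an amd64 direct map; partial first/last pages and
 * chains larger than the iovec array would need real handling.
 */
static int
nfsrv_write_extpg(struct vnode *vp, struct mbuf *m, off_t off,
    struct ucred *cred, struct thread *td)
{
        struct iovec iov[32];           /* arbitrary cap for the sketch */
        struct uio uio;
        int i, niov;

        niov = 0;
        uio.uio_resid = 0;
        for (; m != NULL && (m->m_flags & M_EXTPG) != 0; m = m->m_next) {
                for (i = 0; i < m->m_epg_npgs && niov < nitems(iov);
                    i++, niov++) {
                        iov[niov].iov_base =
                            (void *)PHYS_TO_DMAP(m->m_epg_pa[i]);
                        iov[niov].iov_len = PAGE_SIZE;
                        uio.uio_resid += PAGE_SIZE;
                }
        }
        uio.uio_iov = iov;
        uio.uio_iovcnt = niov;
        uio.uio_offset = off;
        uio.uio_segflg = UIO_SYSSPACE;
        uio.uio_rw = UIO_WRITE;
        uio.uio_td = td;
        return (VOP_WRITE(vp, &uio, IO_UNIT, cred));
}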

Now, the above is straightforward for me, since I know the NFS and
krpc code fairly well.
But that is where my expertise ends.

So, what kind of calls do the drivers provide to send and receive
what RFC-8166 calls the payload stream?

And what kind of calls do the drivers provide to write and read DDP
chunks?

Also, if the above sounds way off the mark, please let me know.

What you need is, most likely, the InfiniBand API or KPI to handle
RDMA.  It is driver-independent, just as for NFS over IP you use the
system IP stack and do not call into the ethernet drivers.  In fact,
the transport used would most likely not be native IB, but IB over
UDP (RoCE v2).

IB verbs, which is the official interface for both kernel and user mode,
are not well documented.  An overview is provided by the document
titled "RDMA Aware Networks Programming User Manual", which should
be google-able.  Otherwise, the InfiniBand specification is the reference.
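
For a rough idea of what this looks like at the kernel verbs layer, a
ddp-read is essentially an RDMA READ work request posted via
ib_post_send().  A minimal, untested sketch follows; QP setup, memory
registration and completion handling are omitted, and the exact
ib_post_send() prototype (const-ness of the work-request arguments)
differs between OFED/Linux versions:

#include <rdma/ib_verbs.h>

/*
 * Hypothetical sketch: post an RDMA READ work request pulling the
 * chunk the peer advertised (rkey + remote address taken from the
 * RPC-over-RDMA header) into a locally registered buffer.
 */
static int
ddp_read_chunk(struct ib_qp *qp, u64 laddr, u32 lkey, u32 len,
    u64 raddr, u32 rkey)
{
        struct ib_sge sge = {
                .addr = laddr,
                .length = len,
                .lkey = lkey,
        };
        struct ib_rdma_wr wr = {
                .wr.opcode = IB_WR_RDMA_READ,
                .wr.sg_list = &sge,
                .wr.num_sge = 1,
                .wr.send_flags = IB_SEND_SIGNALED,
                .remote_addr = raddr,
                .rkey = rkey,
        };
        struct ib_send_wr *bad_wr;

        return (ib_post_send(qp, &wr.wr, &bad_wr));
}
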
Thanks. I'll look at that. (I notice that the Intel code references something
they call Linux-OpenIB. Hopefully that looks about the same and the
glue needed to support non-Mellanox drivers isn't too difficult?)
OpenIB is perhaps a reference to the IB code in the Linux kernel proper
plus the userspace libraries from rdma-core.  This is what was
forked/grown from OFED.

Intel put its efforts into iWARP, which is sort of an alternative to RoCEv2.
It has RFCs and works over TCP AFAIR, which causes problems for it.
Heh, heh. I'm trying to avoid the iWARP vs RoCE wars.;-)
(I did see a Mellanox white paper with graphs showing how RoCE outperforms
iWARP.)
Intel currently claims to support RoCE on its 810 and 820 NICs.
Broadcom also claims to support RoCE, but doesn't mention FreeBSD
drivers, and Chelsio does iWARP, afaik.

For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
might just be more convenient to set up the siw driver in Linux vs the
rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)

But it does look like a fun project for the next year. (I recall jhb@ mentioning
that NFS-over-TLS wouldn't be easy and it turned out to be a fun
little project.)

Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
that should let you do RDMA (both RoCEv2 and iWARP).  I'm hoping to use the APIs
in sys/ofed to support NVMe over RDMA (both RoCEv2 and iWARP) at some point as
well.
rick



Btw, if anyone is interested in taking a more active involvement in this,
they are more than welcome to do so. (I'm going to be starting where I
understand things in the krpc/nfs. I'm not looking forward to porting rxe,
but will probably end up there. I have already had one offer w.r.t. access
to a lab that includes Mellanox hardware, but I don't know if remote
debugging will be practical yet.)

rick


The IB implementation for us is still called OFED for historical reasons,
and it is located in sys/ofed.


As for testing, I am planning on hacking away at one of the RDMA
in software drivers in Linux to get it working well enough to use for
testing. Whatever seems to be easiest to get kinda working.
Yes, the rxe driver is the software RoCE v2 implementation.  We looked at the
amount of work to port it.  Its size is ~12 kLoC, which is comparable
to libibverbs (the userspace core InfiniBand interface).

Interesting.  I'm currently working on merging back several OFED commits from
Linux to sys/ofed (currently I have about 30 commits merged, some older than
Hans' last merge, and some newer, some of the newer ones should permit removing
compat stubs for some of the newer APIs that are duplicated in bnxt, irdma, and
mlx*).  When I get a bit further along I'll post the branch I have for more
testing (it is a bunch of individual cherry-picks rather than a giant merge).

Porting over rxe could be useful for me as well for some work I am doing.

--
John Baldwin

