On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastab...@gmail.com> wrote:
> This adds ndo ops for upper layer objects to request direct DMA from
> the network interface into memory "slots". The slots must be DMA'able
> memory given by a page/offset/size vector in a packet_ring_buffer
> structure.
>
> The PF_PACKET socket interface can use these ndo ops to do zerocopy
> RX from the network device into memory-mapped userspace memory. For
> this to work, drivers encode the correct descriptor blocks and headers
> so that existing PF_PACKET applications work without any modification.
> This only supports the V2 header formats for now, and works by mapping
> a ring of the network device to these slots. Originally I used V2
> header formats, but this does complicate the driver a bit.
>
> V3 header formats added bulk polling via socket calls and timers
> used in the polling interface to return every n milliseconds.
> Currently I don't see any way to support this in hardware, because we
> can't know whether the hardware is in the middle of a DMA operation
> on a slot. So when a timer fires I don't know how to advance the
> descriptor ring leaving empty descriptors, similar to how the
> software ring works. The easiest (best?) route is to simply not
> support this.

From a performance point of view, bulking is essential. Systems like
netmap, which also depend on transferring control between kernel and
userspace, report [1] that they need a bulking size of at least 8 to
amortize the overhead.

[1] Figure 7, page 10,
    http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf

> It might be worth creating a new v4 header that is simple for drivers
> to support direct DMA ops with. I can imagine using the xdp_buff
> structure as a header, for example. Thoughts?

Likely, but I would like us to take a measurement-based approach.
Let's benchmark with this V2 header format, see how far we are from
the target, and see what lights up in the perf report and whether it
is something we can address.

> The ndo operations and the new socket option PACKET_RX_DIRECT work by
> giving a queue_index to run the direct DMA operations over. Once
> setsockopt returns successfully, the indicated queue is mapped
> directly to the requesting application and cannot be used for
> other purposes. Also, any kernel layers such as tc will be bypassed
> and need to be implemented in the hardware via some other mechanism,
> such as tc offload or other offload interfaces.

Will this also need to bypass XDP? E.g. how will you support XDP_TX?
AFAIK you cannot remove/detach a packet with this solution (and place
it on a TX queue and wait for DMA TX completion).

> Users steer traffic to the selected queue using flow director,
> the tc offload infrastructure, or via macvlan offload.
>
> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
> It takes a single unsigned int value specifying the queue index:
>
>     setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
>                &queue_index, sizeof(queue_index));
>
> Implementing busy_poll support will allow userspace to kick the
> driver's receive routine if needed. This work is TBD.
>
> To test this I hacked a hardcoded test into the tool psock_tpacket
> in the selftests kernel directory here:
>
>     ./tools/testing/selftests/net/psock_tpacket.c
>
> Running this tool opens a socket and listens for packets over
> the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
> reworked to enable all the older tests and not hardcode my
> interface before it actually gets released.
>
> In general this is a rough patch to explore the interface and
> put something concrete up for debate.
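To make sure I read the above correctly: userspace keeps the normal
TPACKET_V2 mmap ring setup and only adds the new setsockopt, roughly
like the untested sketch below? (The PACKET_RX_DIRECT define is just a
placeholder here, the real value comes from your patch; everything
else is the existing af_packet API.)

#include <stdio.h>
#include <poll.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

#ifndef PACKET_RX_DIRECT
#define PACKET_RX_DIRECT 32	/* placeholder, not the real value */
#endif

int main(void)
{
	unsigned int queue_index = 0;	/* queue steered via flow director/tc */
	struct tpacket_req req = {
		.tp_block_size = 4096,
		.tp_block_nr   = 64,
		.tp_frame_size = 2048,
		.tp_frame_nr   = 128,	/* would have to match the HW ring size */
	};
	int ver = TPACKET_V2;
	unsigned int i;
	char *ring;
	int fd;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* bind() to a specific interface omitted for brevity */

	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

	/* New in this patch: map the device RX queue to this ring */
	setsockopt(fd, SOL_PACKET, PACKET_RX_DIRECT,
		   &queue_index, sizeof(queue_index));

	ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ring == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; ; i = (i + 1) % req.tp_frame_nr) {
		struct tpacket2_hdr *hdr =
			(void *)(ring + (size_t)i * req.tp_frame_size);
		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		/* wait for the slot to be handed to userspace */
		while (!(hdr->tp_status & TP_STATUS_USER))
			poll(&pfd, 1, -1);

		printf("frame %u: len %u\n", i, hdr->tp_len);

		/* hand the slot back so it can be reused */
		hdr->tp_status = TP_STATUS_KERNEL;
	}
}

If so, the tp_status handshake at the end of the loop is exactly the
part I ask about further down: in the software path the kernel flips
the status bit after copying the frame in, but with the NIC DMA'ing
straight into the slot it is not obvious to me who writes
TP_STATUS_USER and when.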
> The patch does not handle all the error cases correctly and needs
> to be cleaned up.
>
> Known Limitations (TBD):
>
>   (1) Users are required to match the number of rx ring slots
>       configured with ethtool to the number requested by the
>       setsockopt PF_PACKET layout. In the future we could possibly
>       do this automatically.
>
>   (2) Users need to configure flow director or setup_tc to steer
>       traffic to the correct queues. I don't believe this needs to
>       be changed; it seems to be a good mechanism for driving
>       directed DMA.
>
>   (3) Not supporting timestamps or priv space yet; pushing a v4
>       packet header would resolve this nicely.
>
>   (4) Only RX is supported so far. TX already supports a direct DMA
>       interface, but uses skbs, which is really not needed. In the
>       TX_RING case we can optimize this path as well.
>
> To support the TX case we can do a similar "slots" mechanism and
> kick operation. The kick could be a busy_poll-like operation, but on
> the TX side. The flow would be: user space loads up n slots with
> packets, kicks the tx busy-poll bit, the driver sends the packets,
> and finally, when xmit is complete, clears the header bits to give
> the slots back. When we have qdisc bypass set today we already
> bypass the entire stack, so there is no particular reason to use
> skbs in this case. Using xdp_buff as a v4 packet header would also
> allow us to consolidate driver code.
>
> To be done:
>
>   (1) More testing and performance analysis
>   (2) Busy polling sockets
>   (3) Implement v4 xdp_buff headers for analysis
>   (4) Performance testing :/ hopefully it looks good.

I guess I don't understand the details of the af_packet versions well
enough, but can you explain to me how userspace knows which slots it
can read/fetch, and how it marks a slot as complete/finished so the
kernel knows it can reuse that slot?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer