On Wed, Feb 22, 2017 at 1:43 PM, Jesper Dangaard Brouer
<bro...@redhat.com> wrote:
> On Wed, 22 Feb 2017 09:22:53 -0800
> Tom Herbert <t...@herbertland.com> wrote:
>
>> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
>> <bro...@redhat.com> wrote:
>> >
>> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert <t...@herbertland.com>
>> > wrote:
>> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed
>> >> <sae...@dev.mellanox.co.il> wrote:
>> > [...]
>> >> > The only complexity XDP is adding to the drivers is the constraints
>> >> > on RX memory management and the memory model; calling the XDP
>> >> > program itself and handling the action is really a simple thing once
>> >> > you have the correct memory model.
>> >
>> > Exactly, that is why I've been looking at introducing a generic
>> > facility for a memory model for drivers. This should help simplify
>> > drivers. Due to performance needs this needs to be a very thin API
>> > layer on top of the page allocator. (That's why I'm working with Mel
>> > Gorman to get closer integration with the page allocator, e.g. a
>> > bulking facility.)
>> >
>> >> > Who knows! Maybe someday XDP will define one unified RX API for all
>> >> > drivers and it will even handle normal stack delivery itself :).
>> >> >
>> >> That's exactly the point and what we need for TXDP. I'm missing why
>> >> doing this is such rocket science other than the fact that all these
>> >> drivers are vastly different and changing the existing API is
>> >> unpleasant. The only functional complexity I see in creating a generic
>> >> batching interface is handling return codes asynchronously. This is
>> >> entirely feasible though...
>> >
>> > I'll be happy as long as we get a batching interface, then we can
>> > incrementally do the optimizations later.
>> >
>> > In the future, I do hope (like Saeed) this RX API will evolve into
>> > delivering (a bulk of) raw-packet-pages into the netstack; this should
>> > simplify drivers, and we can keep the complexity and SKB allocations
>> > out of the drivers.
>> > To start with, we can play with delivering (a bulk of)
>> > raw-packet-pages into Tom's TXDP engine/system?
>> >
>> Hi Jesper,
>>
>> Maybe we can start to narrow in on what a batching API might look like.
>>
>> Looking at mlx5 (as a model of how XDP is implemented), the main RX
>> loop in mlx5e_poll_rx_cq calls the backend handler in one indirect
>> function call. The XDP path goes through mlx5e_handle_rx_cqe,
>> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
>> building the skbuff. As a prerequisite to RX batching it would be
>> helpful if this could be flattened so that most of the logic is obvious
>> in the main RX loop.
>
> I fully agree here, it would be helpful to flatten this out. The mlx5
> driver is a bit hard to follow in that respect. Saeed has already
> sent me some offlist patches, where some of this code gets
> restructured. In one of the patches the RX stages do get flattened out
> some more. We are currently benchmarking this patchset, and depending
> on CPU it is either a small win or a small (7ns) regression (on the
> newest CPUs).
>

Cool!
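While we are at it, let me throw out a very rough sketch (all names made
up, not actual mlx5 code) of the shape I'd hope a flattened RX loop with
a batching hook could take, just so we have something concrete to point
at:

    /* Rough sketch only -- hypothetical driver, hypothetical helpers.
     * The main RX loop builds an xdp_buff per frame directly, collects
     * them in a small vector, and makes one call into the stack per
     * batch instead of one indirect call per packet.
     */
    static int hypothetical_poll_rx(struct rx_ring *ring, int budget)
    {
            struct xdp_buff vec[RX_BATCH_MAX];      /* packet vector */
            int actions[RX_BATCH_MAX];              /* return codes from stack */
            int n = 0, done = 0;

            while (done < budget) {
                    void *data = ring_next_frame(ring);     /* driver specific */

                    if (!data)
                            break;

                    vec[n].data = data;
                    vec[n].data_end = data + ring_frame_len(ring);
                    n++;
                    done++;

                    if (n == RX_BATCH_MAX) {
                            rx_flush_batch(ring, vec, actions, n);
                            n = 0;
                    }
            }

            /* Flush any remainder before exiting NAPI. */
            if (n)
                    rx_flush_batch(ring, vec, actions, n);

            return done;
    }

rx_flush_batch() would hand the vector to the stack and then act on the
per-packet return codes; a sketch of that part follows further down.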
>
>> The model of RX batching seems straightforward enough -- pull packets
>> from the ring, save xdp_data information in a vector, periodically
>> call into the stack to handle a batch where one argument is the vector
>> of packets and another argument is an output vector that gives return
>> codes (XDP actions), and process each return code for each packet in
>> the driver accordingly.
>
> Yes, exactly. I did imagine that (maybe) the input vector of packets
> could have room for the return codes (XDP actions) next to the packet
> pointer?
>

Whichever way is more efficient, I suppose (rough sketch of your variant
below). The important point is that the return code should be the only
thing returned to the driver.

>
>> Presumably, there is a maximum allowed batch
>> that may or may not be the same as the NAPI budget, so the batching
>> call needs to be done when the limit is reached and also before
>> exiting NAPI.
>
> In my PoC code that Saeed is working on, we have a smaller batch
> size (10), and prefetch to L2 cache (like DPDK does), based on the
> theory that we don't want to stress the L2 cache usage, and that these
> CPUs usually have a Line Fill Buffer (LFB) that is limited to 10
> outstanding cache-lines.
>
> I don't know if this artificially smaller batch size is the right
> thing, as DPDK always prefetches all 32 packets on RX to L2 cache. And
> snabb uses batches of 100 packets per "breath".
>

Maybe make it configurable :-)

>
>> For each packet the stack can return an XDP code;
>> XDP_PASS in this case could be interpreted as the packet being
>> consumed by the stack; this would be used in the case where the stack
>> creates an skbuff for the packet. The stack on its part can process
>> the batch however it sees fit: it can process each packet individually
>> in the canonical model, or we can continue processing a batch in a
>> VPP-like fashion.
>
> Agree.
>
>> The batching API could be transparent to the stack or not. In the
>> transparent case, the driver calls what looks like a receive function
>> but the stack may defer processing for batching. A callback function
>> (that can be inlined) is used to process return codes as I mentioned
>> previously. In the non-transparent model, the driver knowingly creates
>> the packet vector and then explicitly calls another function to
>> process the vector. Personally, I lean towards the transparent API;
>> this may mean less complexity in drivers and gives the stack more
>> control over the parameters of batching (for instance it may choose
>> some batch size to optimize its processing instead of the driver
>> guessing the best size).
>
> I cannot make up my mind on which model... I have to think some more
> about this. Thanks for bringing this up! :-) This is something we
> need to think about.
>
>
>> Btw, the logic for RX batching is very similar to how we batch packets
>> for RPS (I think you already mentioned an skb-less RPS, and that
>> should hopefully be something that falls out from this design).
>
> Yes, I've mentioned skb-less RPS before, because it seems wasteful for
> RPS to allocate the fat skb on one CPU (and memset it to zero), forcing
> all cache-lines hot, just to transfer it to another CPU.
> The tricky part is how we transfer the info on HW-offload fields from
> the NIC-specific descriptor (the fields that we usually update the SKB
> with before the netstack gets the SKB).
>
> The question is also whether XDP should be part of skb-less RPS
> steering, or whether it should be something more generic in the stack
> (after we get a bulk of raw packets delivered to the stack)?
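To make that a bit more concrete, here is the "return code next to the
packet pointer" variant of the rx_flush_batch() step from my earlier
sketch. Again, every name here is made up for the sake of discussion;
netstack_receive_batch() is just a stand-in for whatever the real stack
entry point would end up being:

    /* Hypothetical layout: the action code lives next to the packet
     * data, so the stack fills it in place and it is the only thing
     * handed back to the driver. */
    struct rx_batch_entry {
            struct xdp_buff xdp;
            int action;     /* XDP_PASS, XDP_DROP, XDP_TX, ...
                             * XDP_PASS meaning "consumed by the stack",
                             * e.g. an skb was built from it. */
    };

    static void rx_flush_batch(struct rx_ring *ring,
                               struct rx_batch_entry *batch, int n)
    {
            int i;

            /* One call per batch; the stack may process the vector
             * however it sees fit (per packet, or VPP-style stage by
             * stage). */
            netstack_receive_batch(batch, n);

            for (i = 0; i < n; i++) {
                    switch (batch[i].action) {
                    case XDP_PASS:
                            break;          /* stack consumed the packet */
                    case XDP_TX:
                            driver_xmit_back(ring, &batch[i].xdp);
                            break;
                    case XDP_DROP:
                    default:
                            driver_recycle_page(ring, &batch[i].xdp);
                            break;
                    }
            }
    }

In the transparent model the driver would not call rx_flush_batch()
directly; the stack would invoke something like it as an inlined
callback when it decides to flush. Either way, the driver-side handling
of the action codes should look about the same.

Back to your question about skb-less RPS and XDP: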
I'd probably keep them separate for now since the mechanisms have
pretty different goals.

Tom

> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer