On 17-02-21 09:44 AM, Alexander Duyck wrote:
> On Mon, Feb 20, 2017 at 11:55 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
>> On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
>>>
>>> I assumed "toy Tx" since I wasn't aware that they were actually
>>> allowing writing to the page. I think that might work for the XDP_TX
>>> case,
>>
>> Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
>> It's a close enough approximation of a load balancer.
>> The packet header is rewritten by the bpf program.
>> That's where the dma_bidirectional requirement came from.
>
> Thanks. I will take a look at it.
>
>>> but the case where encap/decap is done and then passed up to the
>>> stack runs the risk of causing data corruption on some architectures
>>> if they unmap the page before the stack is done with the skb. I
>>> already pointed out the issue to the Mellanox guys and that will
>>> hopefully be addressed shortly.
>>
>> sure. the path where the xdp program does decap and passes it to the
>> stack is not finished. To make it work properly we need to expose at
>> least the csum complete field to the program.
>
> I would think the checksum is something that could be validated after
> the frame has been modified. In the case of encapsulating or
> decapsulating a TCP frame you could probably assume the inner TCP
> checksum is valid, and then you only have to deal with the checksum if
> it is present in the outer tunnel header. Basically deal with it like
> we do local checksum offload, only you would have to compute the
> pseudo header checksum for the inner and outer headers since you can't
> use the partial checksum of the inner header.
>
>>> As far as the Tx goes, I need to work with John since his current
>>> solution doesn't have any batching support that I saw, and that is a
>>> major requirement if we want to get above 7 Mpps for a single core.
>>
>> I think we need to focus on both Mpps and 'perf report' together.
>
> Agreed, I usually look over both, as one tells you how fast you are
> going and the other tells you where the bottlenecks are.
>
>> A single core doing 7 Mpps and scaling linearly to 40Gbps line rate
>> is much better than a single core doing 20 Mpps and not scaling at
>> all. There could be sw inefficiencies and hw limits, hence 'perf
>> report' is a must-have when discussing numbers.
>
> Agreed.
>
>> I think long term we will be able to agree on a set of real life
>> use cases and a corresponding set of 'blessed' bpf programs, and
>> create a table of nic, driver, use case 1, 2, 3, single core, multi.
>> Making a level playing field for all nic vendors is one of the goals.
>>
>> Right now we have the xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
>> They are approximations of the ddos, router and load balancer
>> use cases. They obviously need work to get to 'blessed' shape,
>> but imo they are quite good for vendor vs vendor comparison on
>> the use cases that we care about.
>> Eventually nic->vm and vm->vm use cases via xdp_redirect should
>> be added to such a set of 'blessed' benchmarks too.
>> I think so far we have avoided falling into the trap of
>> microbenchmarking wars.
>
> I'll keep this in mind for upcoming patches.
>

Yep, agreed, although having some larger examples in the wild, even if
not in the kernel source, would be great. I think we will see these
soon.
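On the checksum point above, the pseudo-header part Alexander mentions
is just a one's-complement sum over a handful of fields, computed once
for the outer header and once for the inner one. A rough sketch in
plain C (tcp_pseudo_csum() is a made-up name, not a kernel helper, and
the inputs are assumed to already be in host byte order, e.g.
ntohl(iph->saddr)):

    #include <stdint.h>

    /* tcp_pseudo_csum() is a made-up name, not a kernel helper.  It
     * returns the one's-complement sum of the IPv4/TCP pseudo header
     * (source address, destination address, protocol, TCP length),
     * with all inputs in host byte order.
     */
    static inline uint16_t tcp_pseudo_csum(uint32_t saddr, uint32_t daddr,
                                           uint16_t tcp_len)
    {
            uint32_t sum = 0;

            sum += (saddr >> 16) + (saddr & 0xffff);   /* source address      */
            sum += (daddr >> 16) + (daddr & 0xffff);   /* destination address */
            sum += 6;                                  /* IPPROTO_TCP         */
            sum += tcp_len;                            /* TCP header + data   */

            while (sum >> 16)                          /* fold the carries    */
                    sum = (sum & 0xffff) + (sum >> 16);

            return (uint16_t)sum;
    }

The result gets added into the one's-complement sum over the TCP header
and payload before the final fold and complement.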
>>>>>> 3. Should we support scatter-gather to support 9K jumbo frames
>>>>>> instead of allocating order 2 pages?
>>>>>
>>>>> we can, if the main use case of mtu < 4k doesn't suffer.
>>>>
>>>> Agreed, I don't think it should degrade <4k performance. That said,
>>>> for VM traffic this is absolutely needed. Without TSO enabled, VM
>>>> traffic is 50% slower in my tests :/.
>>>>
>>>> With tap/vhost support for XDP this becomes necessary. vhost/tap
>>>> support for XDP is on my list directly behind ixgbe and redirect
>>>> support.
>>>
>>> I'm thinking we just need to turn XDP into something like a
>>> scatterlist for such cases. It wouldn't take much to just convert the
>>> single xdp_buff into an array of xdp_buffs.
>>
>> The datapath has to be fast. If the xdp program needs to look at all
>> bytes of the packet, the performance is gone. Therefore I don't see
>> a need to expose an array of xdp_buffs to the program.
>
> The program itself may not care, but if we are going to deal with
> things like Tx and Drop we need to make sure we drop all the parts of
> the frame. An alternate idea I have been playing around with is just
> having the driver repeat the last action until it hits the end of a
> frame. So XDP would analyze the first 1.5K or 3K of the frame, and
> then tell us to either drop it, pass it, or xmit it. After that we
> would just repeat that action until we hit the end of the frame. The
> only limitation is that it means XDP is limited to only accessing the
> first 1514 bytes.
>
>> The alternative would be to add a hidden field to xdp_buff that keeps
>> SG in some form, and data_end would point to the end of the linear
>> chunk. But you cannot put only headers into the linear part. If the
>> program needs to see something that is after data_end, it will drop
>> the packet. So it's not at all a split-header model. The
>> 'data-data_end' chunk should cover as much as possible. We cannot get
>> into sg games while parsing the packet inside the program. Everything
>> that the program needs to see has to be in the linear part.
>
> Agreed, I wouldn't want to do header split. My thought is to break
> things up so that you have at least 1.5K. The rest should hopefully
> just be data that we wouldn't need to look at.
>
>> I think such a model will work well for the jumbo packet case,
>> but I don't think it will work for the VM's tso.
>> For the xdp program to pass a csum_partial packet from a vm into the
>> nic in a meaningful way, it needs to gain knowledge of ip, l4, csum
>> and a bunch of other meta-data fields that the nic needs to do TSO.
>> I'm not sure it's worth exposing all that to xdp. Instead, can we
>> make the VM do segmentation, so that the xdp program doesn't need to
>> deal with gso packets?
>
> GSO/TSO is getting into advanced stuff I would rather not have to get
> into right now. I figure we need to take this portion one step at a
> time. To support GSO we need more information, like the mss.
>

Agreed, let's get the driver support for the basic things first. But
this is on my list. I'm just repeating myself, but VM to VM performance
uses TSO/LRO heavily.

> I think for now, if we can get xmit bulking working, that should get
> us behavior close enough to see some of the benefits of GRO, as we can
> avoid having vhost transition between user space and kernel space
> quite so much.
>

I'll try this out on ixgbe.
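The shape would be roughly the following (just a sketch to show where
the batching goes; apart from struct xdp_buff and the XDP_* action
codes, every name here is a placeholder rather than existing ixgbe
code):

    /* Placeholder sketch, not actual driver code: struct rx_ring, struct
     * tx_ring and all of the helpers called below are made-up names.  The
     * point is the shape of the loop: queue a Tx descriptor for every
     * XDP_TX verdict, but write the Tx tail register only once per poll.
     */
    static int xdp_rx_poll(struct rx_ring *rxr, struct tx_ring *txr, int budget)
    {
            int done = 0, queued = 0;

            while (done < budget && rx_frame_available(rxr)) {
                    struct xdp_buff xdp;

                    rx_fill_xdp_buff(rxr, &xdp);          /* set data/data_end */

                    switch (run_xdp_prog(rxr, &xdp)) {
                    case XDP_TX:
                            if (xdp_queue_tx_desc(txr, &xdp))  /* descriptor only */
                                    queued++;
                            break;
                    case XDP_PASS:
                            rx_pass_to_stack(rxr, &xdp);
                            break;
                    default:                              /* XDP_DROP/XDP_ABORTED */
                            rx_recycle_page(rxr, &xdp);
                            break;
                    }
                    done++;
            }

            if (queued)
                    xdp_tx_tail_bump(txr);   /* one MMIO write for the whole batch */

            return done;
    }

Batching the tail writes matters because the MMIO doorbell write is
expensive; doing it per packet eats a large share of a single core's
budget at multi-Mpps rates.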
>> I think the main cost is the packet csum, and for this something
>> like an xdp_tx_with_pseudo_header() helper can work.
>> The xdp program will see individual packets with a pseudo header,
>> and the hw nic will do the final csum over the packet.
>> The program will see the csum field as part of xdp_buff, and if it's
>> csum_partial it will use xdp_tx_with_pseudo_header() to transmit the
>> packet instead of xdp_redirect or xdp_tx.
>> The call may look like:
>> xdp_tx_with_pseudo_header(xdp_buff, ip_off, tcp_off);
>> and the program will compute these two offsets based on the packet
>> itself and not metadata that came from the VM.
>> In other words, I'd like the xdp program to deal with raw packets as
>> much as possible. The pseudo header is part of the packet, so the
>> only metadata the program needs is whether the packet has a pseudo
>> header or not.
>> Saying it differently: either the packet came from the physical nic,
>> and in xdp_buff we have a csum field (from the hw rx descriptor) with
>> csum complete meaning, or the packet came from a VM, the pseudo
>> header is populated, and xdp_buff->csum is empty.
>> From the physical nic the packet will travel through the xdp program
>> into the VM, and csum complete nicely covers all encap/decap cases,
>> whether they're done by the xdp program or by the stack inside the VM.
>> From the VM the packet similarly travels through xdp programs, and
>> when it's about to hit the physical nic the last program calls
>> xdp_tx_with_pseudo_header(). Any packet manipulations done in between
>> are done cleanly, without worrying about gso and adjustments to
>> metadata.
>>
>>> The ixgbe driver has been doing page recycling for years. I believe
>>> Dave just pulled the bits from Jeff to enable ixgbe to use build_skb,
>>> update the DMA API, and bulk the page count additions. There are
>>> still a few tweaks I plan to make to increase the headroom, since it
>>> is currently only NET_SKB_PAD + NET_IP_ALIGN and I think we have
>>> enough room for 192 bytes of headroom as I recall.
>>
>> Nice. Why keep the old ixgbe_construct_skb code around?
>> With split pages and build_skb, perf should be great for both small
>> and large packets?
>
> I kept it around mostly just as a fallback in case something goes
> wrong. The last time I pushed the build_skb code I had to revert it
> shortly afterwards, as we found the DMA API didn't support the use of
> build_skb on DMA-mapped pages. This way, if we find there is an
> architecture that doesn't play well with this, we have a fallback we
> can enable until we can get it fixed.
>
>> Unless I wasn't clear earlier:
>> Please release your ixgbe and i40e xdp patches in whatever shape they
>> are right now. I'm ready to test with xdp1+xdp2+xdp_tx_iptunnel :)
>
> The i40e patches are going to be redone since they need to be rebased
> on top of the build_skb changes for our out-of-tree driver and then
> resubmitted once those changes are upstream. So it will probably be a
> few weeks before we have them ready.
>
> I'll let John speak to when the ixgbe changes can be submitted.

I'm addressing Alex's comments now and should have a v2 out soon,
certainly in the next day or two.

Thanks,
John
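P.S. For anyone following the build_skb discussion above, the
extra-headroom idea amounts to roughly the following (a simplified
sketch, not the actual ixgbe receive path; RX_HEADROOM and the function
name are assumptions, while build_skb(), skb_reserve() and __skb_put()
are the kernel APIs being discussed):

    #define RX_HEADROOM 192         /* the larger headroom discussed above */

    /* Simplified sketch: wrap a DMA-synced page fragment in an skb without
     * copying, keeping RX_HEADROOM bytes free in front of the frame so XDP
     * or later encapsulation can push headers without a reallocation.
     */
    static struct sk_buff *rx_frag_to_skb(void *buf, unsigned int pkt_len,
                                          unsigned int truesize)
    {
            struct sk_buff *skb;

            /* 'buf' points RX_HEADROOM bytes before the received frame */
            skb = build_skb(buf, truesize);
            if (unlikely(!skb))
                    return NULL;

            skb_reserve(skb, RX_HEADROOM);  /* leave the headroom in place  */
            __skb_put(skb, pkt_len);        /* frame data becomes skb->data */

            return skb;
    }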