Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support

Björn Töpel Fri, 18 May 2018 09:32:52 -0700

2018-05-18 18:17 GMT+02:00 Daniel Borkmann <dan...@iogearbox.net>:
> On 05/18/2018 05:18 PM, Björn Töpel wrote:
>> 2018-05-18 15:43 GMT+02:00 Daniel Borkmann <dan...@iogearbox.net>:
>>> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>>>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov 
>>>>> <alexei.starovoi...@gmail.com>:
>>>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>>>> From: Björn Töpel <bjorn.to...@intel.com>
>>>>>>>>
>>>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>>>
>>>>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>>>>> and there but nothing to hold it up, for the series:
>>>>>>>
>>>>>>> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
>>>>>>>
>>>>>>> Thanks everyone!
>>>>>>
>>>>>> Great stuff!
>>>>>>
>>>>>> Applied to bpf-next, with one condition.
>>>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>>>> and reviewed within this release window.
>>>>>> If netdev community as a whole won't be able to agree on the zero-copy
>>>>>> bits we'd need to revert this feature before the next merge window.
>>>>>>
>>>>>> Few other minor nits:
>>>>>> patch 3:
>>>>>> +struct xdp_ring {
>>>>>> +       __u32 producer __attribute__((aligned(64)));
>>>>>> +       __u32 consumer __attribute__((aligned(64)));
>>>>>> +};
>>>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi 
>>>>>> headers.
>>>>>
>>>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>>>> both 64B and 128B according to the specs. Contemporary IA processors
>>>>> have 64B.
>>>>>
>>>>> The simplest, and maybe most future-proof, would be 128B aligned for
>>>>> all. Another is having 128B for ARM and 64B for all IA. A third option
>>>>> is having a hand-shaking API (I think virtio has that) for determine
>>>>> the cache line size, but I'd rather not go down that route.
>>>>>
>>>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>>>>> would look like?
>>>>
>>>> I suspect i40e+arm combination wasn't tested anyway.
>>>> The api may have endianness issues too on something like sparc.
>>>> I think the way to be backwards compatible in this area
>>>> is to make the api usable on x86 only by adding
>>>> to include/uapi/linux/if_xdp.h
>>>> #if defined(__x86_64__)
>>>> #define AF_XDP_CACHE_BYTES 64
>>>> #else
>>>> #error "AF_XDP support is not yet available for this architecture"
>>>> #endif
>>>> and doing:
>>>>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>>>
>>>> And progressively add to this for arm64 and few other archs.
>>>> Eventually removing #error and adding some generic define
>>>> that's good enough for long tail of architectures that
>>>> we really cannot test.
>>>
>>> Been looking into this yesterday as well a bit, and it's a bit of a mess 
>>> what
>>> uapi headers do on this regard (though there are just a handful of such 
>>> headers).
>>> Some of the kernel uapi headers hard-code generally 64 bytes regardless of 
>>> the
>>> underlying arch. In general, the kernel does expose it to user space via 
>>> sysfs
>>> (coherency_line_size). Here's what perf does to retrieve it:
>>>
>>> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
>>> #define cache_line_size(cacheline_sizep) *cacheline_sizep = 
>>> sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
>>> #else
>>> static void cache_line_size(int *cacheline_sizep)
>>> {
>>>         if 
>>> (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size",
>>>  cacheline_sizep))
>>>                 pr_debug("cannot determine cache line size");
>>> }
>>> #endif
>>>
>>> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only
>>> available for x86, arm64, s390 and ppc on a cursory glance in the glibc 
>>> code.
>>> In the x86 case it retrieves the info from cpuid insn. In order to 
>>> generically
>>> use it in combination with the header you'd have some probe which would then
>>> set this as a define before including the header.
>>
>> But as a uapi we cannot depend on the L1 cache line size for what's
>> currently running on the system, right? So, either a "one cache line
>> size for all flavors of an arch", e.g. for ARMv8 that would be 128,
>> even though there can be 64 flavors out there.
>>
>> Another way would be to remove the ring structure completely, and
>> leave that to the user-space application to figure out. So there's a
>> runtime interface (getsockopt) to probe the offsets the head and tail
>> pointer is after/before the mmap call, and only expose the descriptor
>> format explicitly in if_xdp.h. Don't know if that is too unorthodox or
>> not...
>
> Good point, I think that may not be too unreasonable thing to do, imho,
> at least this doesn't get us into the issue discussed here in the first
> place, and it would work for other archs more seamlessly rather than ugly
> build error or single per arch 'catch-all' define.
>


I'll try that route (runtime-based offset to prod/cons), and see where
it ends up!


Enjoy the weekend,
Björn

> Thanks,
> Daniel

Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support

Reply via email to