Re: [PATCH v1 1/1] bitops: Share BYTES_TO_BITS() for everyone
From: Yury Norov Date: Sun, 10 Sep 2023 07:07:16 -0700 > On Wed, Sep 06, 2023 at 05:54:26PM +0300, Andy Shevchenko wrote: >> On Wed, Sep 06, 2023 at 04:40:39PM +0200, Alexander Lobakin wrote: >>> From: Andy Shevchenko >>> Date: Thu, 31 Aug 2023 16:21:30 +0300 >>>> On Fri, Aug 25, 2023 at 04:49:07PM +0200, Alexander Lobakin wrote: >>>>> From: Andy Shevchenko >>>>> Date: Thu, 24 Aug 2023 15:37:28 +0300 >>>>> >>>>>> It may be new callers for the same macro, share it. >>>>>> >>>>>> Note, it's unknown why it's represented in the current form instead of >>>>>> simple multiplication and commit 1ff511e35ed8 ("tracing/kprobes: Add >>>>>> bitfield type") doesn't explain that neither. Let leave it as is and >>>>>> we may improve it in the future. >>>>> >>>>> Maybe symmetrical change in tools/ like I did[0] an aeon ago? >>>> >>>> Hmm... Why can't you simply upstream your version? It seems better than >>>> mine. >>> >>> It was a part of the Netlink bigint API which is a bit on hold for now >>> (I needed this macro available treewide). >>> But I can send it as standalone if you're fine with that. >> >> I'm fine. Yury? > > Do we have opencoded BYTES_TO_BITS() somewhere else? If so, it should be > fixed in the same series. Treewide -- a ton. We could add it so that devs could start using it and stop open-coding :D > > Regarding implementation, the current: > > #define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long)) > > looks weird. Maybe there are some special considerations in a tracing > subsystem to make it like this, but as per Masami's email - there's > not. > > For a general purpose I'd suggest a simpler: > #define BYTES_TO_BITS(nb) ((nb) * BITS_PER_BYTE) I also didn't notice anything that would require using logic more complex than this one. It would probably make more sense to define it that way when moving. > > Thanks, > Yury Thanks, Olek
Re: [PATCH v3] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms
From: Alessandro Carminati (Red Hat) Date: Mon, 28 Aug 2023 08:04:23 + > From: Alessandro Carminati > > It is not uncommon for drivers or modules related to similar peripherals > to have symbols with the exact same name. [...] > Changes from v2: > - Alias tags are created by querying DWARF information from the vmlinux. > - The filename + line number is normalized and appended to the original name. > - The tag begins with '@' to indicate the symbol source. > - Not a change, but worth mentioning, since the alias is added to the existing > list, the old duplicated name is preserved, and the livepatch way of dealing > with duplicates is maintained. > - Acknowledging the existence of scenarios where inlined functions declared in > header files may result in multiple copies due to compiler behavior, though >it is not actionable as it does not pose an operational issue. > - Highlighting a single exception where the same name refers to different > functions: the case of "compat_binfmt_elf.c," which directly includes > "binfmt_elf.c" producing identical function copies in two separate > modules. Oh, I thought you managed to handle this in v3 since you didn't reply in the previous thread... > > sample from new v3 > > ~ # cat /proc/kallsyms | grep gic_mask_irq > d0b03c04dae4 t gic_mask_irq > d0b03c04dae4 t gic_mask_irq@_drivers_irqchip_irq-gic_c_167 > d0b03c050960 t gic_mask_irq > d0b03c050960 t gic_mask_irq@_drivers_irqchip_irq-gic-v3_c_404 BTW, why normalize them? Why not just gic_mask_irq@drivers/irqchip/... And why line number? Line numbers break reproducible builds and also would make it harder to refer to a particular symbol by its path and name since we also have to pass its line number which may change once you add a debug print there, for example. OTOH there can't be 2 symbols with the same name within one file, so just path + name would be enough. Or not? (sorry if some of this was already discussed previously) [...] Thanks, Olek
[PATCH mips-next] vmlinux.lds.h: catch more UBSAN symbols into .data
LKP triggered lots of LD orphan warnings [0]: mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data299' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data299' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data183' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data183' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type3' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type3' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type2' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type2' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type0' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type0' [...] Seems like "unnamed data" isn't the only type of symbols that UBSAN instrumentation can emit. Catch these into .data with the wildcard as well. [0] https://lore.kernel.org/linux-mm/202102160741.k57gcnsr-...@intel.com Fixes: f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into data") Reported-by: kernel test robot Signed-off-by: Alexander Lobakin --- include/asm-generic/vmlinux.lds.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index cc659e77fcb0..83537e5ee78f 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -95,7 +95,7 @@ */ #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* -- 2.30.1
[PATCH v4 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v4 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. 
Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 113 ++- 3 files changed, 173 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v4 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v4 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index fa4ab77ce81e..86e19f62f978 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE <= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v4 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
[PATCH v4 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..fa4ab77ce81e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FAILOVER IFF_FAILOVER -#define IFF_FAILOVER_SLAV
[PATCH v4 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin --- net/xdp/xsk.c | 119 -- 1 file changed, 95 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..ff7bd06e1241 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; + + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +544,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN; goto out; } -
Re: [PATCH v4 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Magnus Karlsson Date: Tue, 16 Feb 2021 15:08:26 +0100 > On Tue, Feb 16, 2021 at 12:44 PM Alexander Lobakin wrote: > > > > From: Xuan Zhuo > > > > This patch is used to construct skb based on page to save memory copy > > overhead. > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > directly construct skb. If this feature is not supported, it is still > > necessary to copy data to construct skb. > > > > Performance Testing > > > > The test environment is Aliyun ECS server. > > Test cmd: > > ``` > > xdpsock -i eth0 -t -S -s > > ``` > > > > Test result data: > > > > size64 512 10241500 > > copy1916747 1775988 1600203 1440054 > > page1974058 1953655 1945463 1904478 > > percent 3.0%10.0% 21.58% 32.3% > > > > Signed-off-by: Xuan Zhuo > > Reviewed-by: Dust Li > > [ alobakin: > > - expand subject to make it clearer; > > - improve skb->truesize calculation; > > - reserve some headroom in skb for drivers; > > - tailroom is not needed as skb is non-linear ] > > Signed-off-by: Alexander Lobakin > > Thank you Alexander! > > Acked-by: Magnus Karlsson Thanks! I have one more generic zerocopy to offer (inspired by this series) that wouldn't require IFF_TX_SKB_NO_LINEAR, only a capability to xmit S/G packets that almost every NIC has. I'll publish an RFC once this and your upcoming changes get merged. > > --- > > net/xdp/xsk.c | 119 -- > > 1 file changed, 95 insertions(+), 24 deletions(-) > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > index 143979ea4165..ff7bd06e1241 100644 > > --- a/net/xdp/xsk.c > > +++ b/net/xdp/xsk.c > > @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > sock_wfree(skb); > > } > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > + struct xdp_desc *desc) > > +{ > > + struct xsk_buff_pool *pool = xs->pool; > > + u32 hr, len, offset, copy, copied; > > + struct sk_buff *skb; > > + struct page *page; > > + void *buffer; > > + int err, i; > > + u64 addr; > > + > > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); > > + > > + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); > > + if (unlikely(!skb)) > > + return ERR_PTR(err); > > + > > + skb_reserve(skb, hr); > > + > > + addr = desc->addr; > > + len = desc->len; > > + > > + buffer = xsk_buff_raw_get_data(pool, addr); > > + offset = offset_in_page(buffer); > > + addr = buffer - pool->addrs; > > + > > + for (copied = 0, i = 0; copied < len; i++) { > > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; > > + get_page(page); > > + > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > + skb_fill_page_desc(skb, i, page, offset, copy); > > + > > + copied += copy; > > + addr += copy; > > + offset = 0; > > + } > > + > > + skb->len += len; > > + skb->data_len += len; > > + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; > > + > > + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); > > + > > + return skb; > > +} > > + > > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > > +struct xdp_desc *desc) > > +{ > > + struct net_device *dev = xs->dev; > > + struct sk_buff *skb; > > + > > + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { > > + skb = xsk_build_skb_zerocopy(xs, desc); > > + if (IS_ERR(skb)) > > + return skb; > > + } else { > > + u32 hr, tr, len; > > + void *buffer; > > + int err; > > + > > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); > > + tr = dev->needed_tailroom; > > + len = desc->len; > > + > > + skb = sock_alloc_send_skb
[PATCH v5 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v4 [1]: - fix 0002 build error due to inverted static_assert() condition (0day bot); - collect two Acked-bys (Magnus). 
>From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com [1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 113 ++- 3 files changed, 173 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v5 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..0a9b2b31f411 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FA
[PATCH v5 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v5 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0a9b2b31f411..ecaf67efab5b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v5 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
[PATCH v5 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v5 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 119 -- 1 file changed, 95 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..ff7bd06e1241 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; + + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +544,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN; goto out;
Re: [PATCH v5 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Alexander Lobakin Date: Tue, 16 Feb 2021 14:35:02 + > From: Xuan Zhuo > > This patch is used to construct skb based on page to save memory copy > overhead. > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > directly construct skb. If this feature is not supported, it is still > necessary to copy data to construct skb. > > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data: > > size64 512 10241500 > copy1916747 1775988 1600203 1440054 > page1974058 1953655 1945463 1904478 > percent 3.0%10.0% 21.58% 32.3% > > Signed-off-by: Xuan Zhuo > Reviewed-by: Dust Li > [ alobakin: > - expand subject to make it clearer; > - improve skb->truesize calculation; > - reserve some headroom in skb for drivers; > - tailroom is not needed as skb is non-linear ] > Signed-off-by: Alexander Lobakin > Acked-by: Magnus Karlsson > --- > net/xdp/xsk.c | 119 -- > 1 file changed, 95 insertions(+), 24 deletions(-) > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 143979ea4165..ff7bd06e1241 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) > sock_wfree(skb); > } > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + struct xsk_buff_pool *pool = xs->pool; > + u32 hr, len, offset, copy, copied; > + struct sk_buff *skb; > + struct page *page; > + void *buffer; > + int err, i; > + u64 addr; > + > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); > + > + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + skb_reserve(skb, hr); > + > + addr = desc->addr; > + len = desc->len; > + > + buffer = xsk_buff_raw_get_data(pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - pool->addrs; > + > + for (copied = 0, i = 0; copied < len; i++) { > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; > + get_page(page); > + > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += pool->unaligned ? len : pool->chunk_size; > + > + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); Meh, there's a refcount leak here I accidentally introduced in v4. Sorry for that, I'll upload v6 in just a moment. 
> + return skb; > +} > + > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + struct net_device *dev = xs->dev; > + struct sk_buff *skb; > + > + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { > + skb = xsk_build_skb_zerocopy(xs, desc); > + if (IS_ERR(skb)) > + return skb; > + } else { > + u32 hr, tr, len; > + void *buffer; > + int err; > + > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); > + tr = dev->needed_tailroom; > + len = desc->len; > + > + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + skb_reserve(skb, hr); > + skb_put(skb, len); > + > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); > + err = skb_store_bits(skb, 0, buffer, len); > + if (unlikely(err)) { > + kfree_skb(skb); > + return ERR_PTR(err); > + } > + } > + > + skb->dev = dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; > +} > + > static int xsk_generic_xmit(struct sock *sk)
[PATCH v6 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. >From v4 [1]: - fix 0002 build error due to inverted static_assert() condition (0day bot); - collect two Acked-bys (Magnus). 
>From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com [1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 114 ++- 3 files changed, 174 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v6 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v6 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..0a9b2b31f411 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FA
[PATCH v6 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0a9b2b31f411..ecaf67efab5b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
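The contract this flag establishes can be sketched as follows; the helpers below are hypothetical and only illustrate the intent, they are not part of the patch. A capable driver advertises the flag at probe time, and a transmit path that built a head-less skb falls back to copying when the flag is absent.

/* Hedged sketch of both sides of the contract (hypothetical helpers) */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void example_probe_advertise(struct net_device *dev)
{
	dev->priv_flags |= IFF_TX_SKB_NO_LINEAR;	/* "I can xmit skbs with empty linear space" */
}

static bool example_needs_linear_copy(const struct net_device *dev,
				      const struct sk_buff *skb)
{
	return skb_headlen(skb) == 0 &&
	       !(dev->priv_flags & IFF_TX_SKB_NO_LINEAR);
}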
[PATCH v6 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v6 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The new skb is allocated with desc->len bytes only, so it comes to the driver/device with no reserved headroom and/or tailroom.
Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and in case of no available space skb_cow_head() will reallocate the skb.

Reallocations are unwanted on the fast path, especially when it comes to XDP, so generic XSK xmit should reserve the space declared in dev->needed_headroom and dev->needed_tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16.
However, on XSK xmit the hard header is already in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to a cacheline, while reserving no less than the driver requests for headroom.
NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact needs it (not such a rare case).

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
---
 net/xdp/xsk.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
+	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+	tr = xs->dev->needed_tailroom;
+
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		char *buffer;
 		u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		}
 
 		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
+		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
 		if (unlikely(!skb))
 			goto out;
 
+		skb_reserve(skb, hr);
 		skb_put(skb, len);
+
 		addr = desc.addr;
 		buffer = xsk_buff_raw_get_data(xs->pool, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
-- 
2.30.1
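For illustration, a typical driver xmit prologue looks roughly like the sketch below (hypothetical helper, not code from this series); without the reservation above, skb_cow_head() would reallocate skb->head right here on the fast path.

/* Hypothetical driver prologue: prepend a device-specific header,
 * reallocating only if the skb arrived without enough headroom.
 */
#include <linux/skbuff.h>
#include <linux/string.h>

static int example_prepend_hw_header(struct sk_buff *skb, unsigned int hdr_len)
{
	if (skb_cow_head(skb, hdr_len))		/* reallocates if headroom < hdr_len */
		return -ENOMEM;

	memset(__skb_push(skb, hdr_len), 0, hdr_len);	/* prepend and fill the header */
	return 0;
}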
[PATCH v6 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 120 -- 1 file changed, 96 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..a71ed664da0a 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, ts, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + ts = pool->unaligned ? len : pool->chunk_size; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += ts; + + refcount_add(ts, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, 
L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN;
Re: [PATCH mips-next] vmlinux.lds.h: catch more UBSAN symbols into .data
From: Nick Desaulniers Date: Tue, 16 Feb 2021 09:56:32 -0800 > On Tue, Feb 16, 2021 at 12:55 AM Alexander Lobakin wrote: > > > > LKP triggered lots of LD orphan warnings [0]: > > Thanks for the patch, just some questions. > > With which linker? Was there a particular config from the bot's > report that triggered this? All the info can be found by going through the link from the commit message. Compiler was GCC 9.3, so I suppose BFD was used as a linker. I mentioned CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y in the attached dotconfig, the warnings and the fix are relevant only for this case. > > > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data299' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data299' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data183' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data183' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type3' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type3' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type2' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type2' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type0' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type0' > > > > [...] > > > > Seems like "unnamed data" isn't the only type of symbols that UBSAN > > instrumentation can emit. > > Catch these into .data with the wildcard as well. > > > > [0] https://lore.kernel.org/linux-mm/202102160741.k57gcnsr-...@intel.com > > > > Fixes: f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into > > data") > > Reported-by: kernel test robot > > Signed-off-by: Alexander Lobakin > > --- > > include/asm-generic/vmlinux.lds.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/include/asm-generic/vmlinux.lds.h > > b/include/asm-generic/vmlinux.lds.h > > index cc659e77fcb0..83537e5ee78f 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -95,7 +95,7 @@ > > */ > > #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION > > #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* > > -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > > .data..compoundliteral* .data.$__unnamed_* > > +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > > .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* > > Are these sections only created when > CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected? (Same with > .data.$__unnamed_*) > > > #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* > > #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* > > #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* > > -- > > 2.30.1 > > > > > > > -- > Thanks, > ~Nick Desaulniers Al
Re: [GIT PULL] clang-lto for v5.12-rc1
From: Kees Cook Date: Tue, 16 Feb 2021 12:34:37 -0800 > Hi Linus, > > Please pull this Clang Link Time Optimization series for v5.12-rc1. This > has been in linux-next for the entire last development cycle, and is > built on the work done preparing[0] for LTO by arm64 folks, tracing folks, > etc. This series includes the core changes as well as the remaining pieces > for arm64 (LTO has been the default build method on Android for about > 3 years now, as it is the prerequisite for the Control Flow Integrity > protections). While x86 LTO support is done[1], there is still some > on-going clean-up work happening for objtool[2] that should hopefully > land by the v5.13 merge window. > > For merge log posterity, and as detailed in commit dc5723b02e52 ("kbuild: > add support for Clang LTO"), here is the lt;dr to do an LTO build: > > make LLVM=1 LLVM_IAS=1 defconfig > scripts/config -e LTO_CLANG_THIN > make LLVM=1 LLVM_IAS=1 > > (To do a cross-compile of arm64, add "CROSS_COMPILE=aarch64-linux-gnu-" > and "ARCH=arm64" to the "make" command lines.) > > Thanks! > > -Kees > > [0] https://git.kernel.org/linus/3c09ec59cdea5b132212d97154d625fd34e436dd > [1] https://github.com/samitolvanen/linux/commits/clang-lto > [2] https://lore.kernel.org/lkml/cover.1611263461.git.jpoim...@redhat.com/ > > The following changes since commit e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62: > > Linux 5.11-rc2 (2021-01-03 15:55:30 -0800) > > are available in the Git repository at: > > https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git > tags/clang-lto-v5.12-rc1 > > for you to fetch changes up to 112b6a8e038d793d016e330f53acb9383ac504b3: > > arm64: allow LTO to be selected (2021-01-14 08:21:10 -0800) > > > clang-lto for v5.12-rc1 > > Provide build infrastructure for arm64 Clang LTO. > > > Sami Tolvanen (16): > tracing: move function tracer options to Kconfig > kbuild: add support for Clang LTO > kbuild: lto: fix module versioning > kbuild: lto: limit inlining > kbuild: lto: merge module sections > kbuild: lto: add a default list of used symbols > init: lto: ensure initcall ordering > init: lto: fix PREL32 relocations > PCI: Fix PREL32 relocations for LTO > modpost: lto: strip .lto from module names > scripts/mod: disable LTO for empty.c > efi/libstub: disable LTO > drivers/misc/lkdtm: disable LTO for rodata.o > arm64: vdso: disable LTO > arm64: disable recordmcount with DYNAMIC_FTRACE_WITH_REGS > arm64: allow LTO to be selected Seems like you forgot the fix from [0], didn't you? > .gitignore| 1 + > Makefile | 45 -- > arch/Kconfig | 90 > arch/arm64/Kconfig| 4 + > arch/arm64/kernel/vdso/Makefile | 3 +- > drivers/firmware/efi/libstub/Makefile | 2 + > drivers/misc/lkdtm/Makefile | 1 + > include/asm-generic/vmlinux.lds.h | 11 +- > include/linux/init.h | 79 -- > include/linux/pci.h | 27 +++- > init/Kconfig | 1 + > kernel/trace/Kconfig | 16 ++ > scripts/Makefile.build| 48 +- > scripts/Makefile.lib | 6 +- > scripts/Makefile.modfinal | 9 +- > scripts/Makefile.modpost | 25 +++- > scripts/generate_initcall_order.pl| 270 > ++ > scripts/link-vmlinux.sh | 70 +++-- > scripts/lto-used-symbollist.txt | 5 + > scripts/mod/Makefile | 1 + > scripts/mod/modpost.c | 16 +- > scripts/mod/modpost.h | 9 ++ > scripts/mod/sumversion.c | 6 +- > scripts/module.lds.S | 24 +++ > 24 files changed, 707 insertions(+), 62 deletions(-) > create mode 100755 scripts/generate_initcall_order.pl > create mode 100644 scripts/lto-used-symbollist.txt > > -- > Kees Cook > [0] https://lore.kernel.org/lkml/20210121184544.659998-1-aloba...@pm.me Al
[BUG] net: core: netif_receive_skb_list() crash on non-standard ptypes forwarding
Hi Edward, Seems like I've found another poisoned skb->next crash with netif_receive_skb_list(). This is similar to the one than has been already fixed in 22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists"). This one however applies only to non-standard ptypes (in my case -- ETH_P_XDSA). I use simple VLAN NAT setup through nft. After switching my in-dev driver to netif_receive_skb_list(), system started to crash on forwarding: [ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 000e, epc == 80687078, ra == 8052cc7c [ 88.618666] Oops[#1]: [ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473 [ 88.630885] $ 0 : 1400 0002 864d7850 [ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 [ 88.642526] $ 8 : 4960 0001 0001 [ 88.648342] $12 : c288617b dadbee27 25d17c41 [ 88.654159] $16 : 87c0ddf0 85cff080 8079 fffd [ 88.659975] $20 : 80797b20 0001 864d7800 [ 88.665793] $24 : 8011e658 [ 88.671609] $28 : 8079 87c0dbc0 87cabf00 8052cc7c [ 88.677427] Hi : 0003 [ 88.680622] Lo : 7b5b4220 [ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188 [ 88.696734] Status: 1404 IEp [ 88.700422] Cause : 5008 (ExcCode 02) [ 88.704874] BadVA : 000e [ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi)) [ 88.713005] Modules linked in: [ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=) [ 88.725219] Stack : 85f61c28 000e 8078 87c0ddf0 85cff080 8079 8052cc7c [ 88.734529] 87cabf00 0001 85f5fb40 807b 864d7850 87cabf00 807d [ 88.743839] 864d7800 8655f600 85cff080 87c1c000 006a 8052d96c [ 88.753149] 807a 8057adb8 87c0dcc8 87c0dc50 85cfff08 0558 87cabf00 85f58c50 [ 88.762460] 0002 85f58c00 864d7800 80543308 fff4 0001 85f58c00 864d7800 [ 88.771770] ... [ 88.774483] Call Trace: [ 88.777199] [<80687078>] vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.783504] [<8052cc7c>] dev_hard_start_xmit+0xac/0x188 [ 88.789326] [<8052d96c>] __dev_queue_xmit+0x6e8/0x7d4 [ 88.794955] [<805a8640>] ip_finish_output2+0x238/0x4d0 [ 88.800677] [<805ab6a0>] ip_output+0xc8/0x140 [ 88.805526] [<805a68f4>] ip_forward+0x364/0x560 [ 88.810567] [<805a4ff8>] ip_rcv+0x48/0xe4 [ 88.815030] [<80528d44>] __netif_receive_skb_one_core+0x44/0x58 [ 88.821635] [<8067f220>] dsa_switch_rcv+0x108/0x1ac [ 88.827067] [<80528f80>] __netif_receive_skb_list_core+0x228/0x26c [ 88.833951] [<8052ed84>] netif_receive_skb_list+0x1d4/0x394 [ 88.840160] [<80355a88>] lunar_rx_poll+0x38c/0x828 [ 88.845496] [<8052fa78>] net_rx_action+0x14c/0x3cc [ 88.850835] [<806ad300>] __do_softirq+0x178/0x338 [ 88.856077] [<8012a2d4>] irq_exit+0xbc/0x100 [ 88.860846] [<802f8b70>] plat_irq_dispatch+0xc0/0x144 [ 88.866477] [<80105974>] handle_int+0x14c/0x158 [ 88.871516] [<806acfb0>] r4k_wait+0x30/0x40 [ 88.876462] Code: afb10014 8c8200a0 00803025 <9443000c> 94a20468 10620042 00a08025 9605046a [ 88.887332] [ 88.888982] ---[ end trace eb863d007da11cf1 ]--- [ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt [ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Some additional debug have showed that skb->next is poisoned on dsa_switch_rcv() -- ETH_P_XDSA ptype .func() callback. So when skb enters dev_hard_start_xmit(), function tries to "schedule" backpointer to list_head for transmitting. Here's a working possible fix for that, not sure if it can break anything though. 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2b67f2aa59dd..fdcff29df915 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5014,8 +5014,10 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 	if (pt_prev->list_func != NULL)
 		pt_prev->list_func(head, pt_prev, orig_dev);
 	else
-		list_for_each_entry_safe(skb, next, head, list)
+		list_for_each_entry_safe(skb, next, head, list) {
+			skb_list_del_init(skb);
 			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		}
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)

Maybe you could look into this and find another/better solution (or I could submit this one if that's good enough).

BTW, great work with netif_receive_skb_list() -- I've got a 70 Mbps gain (~15%) on my setup in comparison to napi_gro_receive().

Thanks,
Alexander.

Regards,
ᚷ ᛖ ᚢ ᚦ ᚠᚱ
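For context, skb_list_del_init() used in the hunk above roughly amounts to the following paraphrased sketch (not the verbatim include/linux/skbuff.h code): since skb->next and skb->list share storage, an skb pulled off the Rx sublist must have its list linkage cleared before it reaches code that treats skb->next as a plain pointer, such as dev_hard_start_xmit() in the trace above.

/* Paraphrased sketch of the helper's effect */
#include <linux/skbuff.h>

static inline void skb_list_del_init_sketch(struct sk_buff *skb)
{
	__list_del_entry(&skb->list);	/* unlink from the per-ptype sublist */
	skb->next = NULL;		/* don't leave a stale list pointer in skb->next */
}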
Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Paolo Abeni Date: Thu, 11 Feb 2021 15:55:04 +0100 > On Thu, 2021-02-11 at 14:28 +0000, Alexander Lobakin wrote: > > From: Paolo Abeni on Thu, 11 Feb 2021 11:16:40 +0100 > > wrote: > > > What about changing __napi_alloc_skb() to always use > > > the __napi_build_skb(), for both kmalloc and page backed skbs? That is, > > > always doing the 'data' allocation in __napi_alloc_skb() - either via > > > page_frag or via kmalloc() - and than call __napi_build_skb(). > > > > > > I think that should avoid adding more checks in __alloc_skb() and > > > should probably reduce the number of conditional used > > > by __napi_alloc_skb(). > > > > I thought of this too. But this will introduce conditional branch > > to set or not skb->head_frag. So one branch less in __alloc_skb(), > > one branch more here, and we also lose the ability to __alloc_skb() > > with decached head. > > Just to try to be clear, I mean something alike the following (not even > build tested). In the fast path it has less branches than the current > code - for both kmalloc and page_frag allocation. > > --- > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 785daff48030..a242fbe4730e 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -506,23 +506,12 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct > *napi, unsigned int len, >gfp_t gfp_mask) > { > struct napi_alloc_cache *nc; > + bool head_frag, pfmemalloc; > struct sk_buff *skb; > void *data; > > len += NET_SKB_PAD + NET_IP_ALIGN; > > - /* If requested length is either too small or too big, > - * we use kmalloc() for skb->head allocation. > - */ > - if (len <= SKB_WITH_OVERHEAD(1024) || > - len > SKB_WITH_OVERHEAD(PAGE_SIZE) || > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); > - if (!skb) > - goto skb_fail; > - goto skb_success; > - } > - > nc = this_cpu_ptr(&napi_alloc_cache); > len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > len = SKB_DATA_ALIGN(len); > @@ -530,25 +519,34 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct > *napi, unsigned int len, > if (sk_memalloc_socks()) > gfp_mask |= __GFP_MEMALLOC; > > - data = page_frag_alloc(&nc->page, len, gfp_mask); > + if (len <= SKB_WITH_OVERHEAD(1024) || > +len > SKB_WITH_OVERHEAD(PAGE_SIZE) || > +(gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > + data = kmalloc_reserve(len, gfp_mask, NUMA_NO_NODE, > &pfmemalloc); > + head_frag = 0; > + len = 0; > + } else { > + data = page_frag_alloc(&nc->page, len, gfp_mask); > + pfmemalloc = nc->page.pfmemalloc; > + head_frag = 1; > + } > if (unlikely(!data)) > return NULL; Sure. I have a separate WIP series that reworks all three *alloc_skb() functions, as there's a nice room for optimization, especially after that tiny skbs now fall back to __alloc_skb(). It will likely hit mailing lists after the merge window and next net-next season, not now. And it's not really connected with NAPI cache reusing. > skb = __build_skb(data, len); > if (unlikely(!skb)) { > - skb_free_frag(data); > + if (head_frag) > + skb_free_frag(data); > + else > + kfree(data); > return NULL; > } > > - if (nc->page.pfmemalloc) > - skb->pfmemalloc = 1; > - skb->head_frag = 1; > + skb->pfmemalloc = pfmemalloc; > + skb->head_frag = head_frag; > > -skb_success: > skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); > skb->dev = napi->dev; > - > -skb_fail: > return skb; > } > EXPORT_SYMBOL(__napi_alloc_skb); Al
[PATCH v5 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have a percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks.

We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and the veth driver already do).

As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via:
 - a new napi_build_skb() function (as a replacement for build_skb());
 - existing {,__}napi_alloc_skb() and napi_get_frags() functions;
 - __alloc_skb() with passing SKB_ALLOC_NAPI in flags.

iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps.

Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
 - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, a skbuff_head from a remote node is an OK case;
 - The easiest way to check if the slab of a skbuff_head is remote or pfmemalloc'ed is:

	if (!dev_page_is_reusable(virt_to_head_page(skb)))
		/* drop it */;

   ...*but*, since most slabs are built of compound pages, virt_to_head_page() will hit its unlikely branch on every single call. This check cost at least 20 Mbps in test scenarios and it seems like it'd be better to _not_ do this.

Since v4 [3]:
 - rebase on top of net-next and address kernel build robot issue;
 - reorder checks a bit in __alloc_skb() to make the new condition even more harmless.

Since v3 [2]:
 - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb();
 - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough;
 - don't waste cycles on an explicit in_serving_softirq() check.

Since v2 [1]:
 - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skb requests to the kmalloc layer);
 - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov);
 - completely drop redundant __kfree_skb_flush() (also Eric);
 - lots of code cleanups;
 - expand the commit message with NUMA and pfmemalloc points (Jakub).

Since v1 [0]:
 - use one unified cache instead of two separate ones to greatly simplify the logic and reduce hotpath overhead (Edward Cree);
 - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing;
 - correct performance numbers after optimizations and performing lots of tests for different use cases.
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me
[1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210210162732.80467-1-aloba...@pm.me

Alexander Lobakin (11):
  skbuff: move __alloc_skb() next to the other skb allocation functions
  skbuff: simplify kmalloc_reserve()
  skbuff: make __build_skb_around() return void
  skbuff: simplify __alloc_skb() a bit
  skbuff: use __build_skb_around() in __alloc_skb()
  skbuff: remove __kfree_skb_flush()
  skbuff: move NAPI cache declarations upper in the file
  skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
  skbuff: allow to optionally use NAPI cache from __alloc_skb()
  skbuff: allow to use NAPI cache from __napi_alloc_skb()
  skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing

 include/linux/skbuff.h | 4 +-
 net/core/dev.c         | 16 +-
 net/core/skbuff.c      | 429 +++--
 3 files changed, 243 insertions(+), 206 deletions(-)

-- 
2.30.1
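From a driver's perspective, opting in looks roughly like the sketch below; the helper and its parameters are hypothetical and not a patch from this series, it only shows that napi_build_skb() is a drop-in for build_skb() inside NAPI poll context.

/* Hypothetical Rx-path helper: swap build_skb() for napi_build_skb(),
 * keep the rest of the buffer handling unchanged.
 */
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *example_rx_build(struct napi_struct *napi, void *data,
					unsigned int truesize, unsigned int len,
					unsigned int headroom)
{
	struct sk_buff *skb;

	/* was: skb = build_skb(data, truesize); */
	skb = napi_build_skb(data, truesize);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, headroom);
	skb_put(skb, len);
	skb->protocol = eth_type_trans(skb, napi->dev);

	return skb;
}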
[PATCH v5 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v5 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller().
Remove the redundant macro and rename the function after it.

Signed-off-by: Alexander Lobakin
---
 net/core/skbuff.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0f846872d19..70289f22a6f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align);
  * may be used. Otherwise, the packet data may be discarded until enough
  * memory is free
  */
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
-__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
-			       unsigned long ip, bool *pfmemalloc)
+static void *kmalloc_reserve(size_t size, gfp_t flags, int node,
+			     bool *pfmemalloc)
 {
 	void *obj;
 	bool ret_pfmemalloc = false;
-- 
2.30.1
[PATCH v5 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v5 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v5 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 7 +-- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 19 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 321d41a110e7..4154d4683bb9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) __napi_poll(napi, &repoll); netpoll_poll_unlock(have); - __kfree_skb_flush(); local_bh_enable(); if (!repoll) @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v5 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v5 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v5 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
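The matching put side of the cache is truncated in the hunk above. Consistent with the commit message (poison on put, bulk-wipe the second half of the 64-entry cache once it fills up), it looks roughly like the sketch below; this is a reconstruction for readability, not guaranteed to match the upstream hunk byte for byte.

/* Sketch of the put side described above: poison the returned head and,
 * when the cache is full, unpoison and bulk-free its second half (32
 * entries) back to the slab.
 */
static void napi_skb_cache_put_sketch(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}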
[PATCH v5 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..a0b457ae87c2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, struct sk_buff *skb; u8 *data; bool pfmemalloc; + bool clone; - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; + clone = !!(flags & SKB_ALLOC_FCLONE); + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if ((flags & SKB_ALLOC_NAPI) && !clone && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - if (flags & SKB_ALLOC_FCLONE) { + if (clone) { struct sk_buff_fclones *fclones; fclones = container_of(skb, struct sk_buff_fclones, skb1); -- 2.30.1
[PATCH v5 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a0b457ae87c2..c8f3ea1d9fbb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -563,7 +563,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -580,7 +581,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
[PATCH v5 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 4154d4683bb9..6d2c7ae90a23 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6095,13 +6095,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6115,7 +6108,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c8f3ea1d9fbb..85f0768a1144 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -917,9 +917,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -936,6 +933,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -961,6 +966,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
Re: [PATCH v5 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Alexander Duyck Date: Thu, 11 Feb 2021 19:18:45 -0800 > On Thu, Feb 11, 2021 at 11:00 AM Alexander Lobakin wrote: > > > > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get > > an skbuff_head from the NAPI cache instead of inplace allocation > > inside __alloc_skb(). > > This implies that the function is called from softirq or BH-off > > context, not for allocating a clone or from a distant node. > > > > Signed-off-by: Alexander Lobakin > > --- > > net/core/skbuff.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 9e1a8ded4acc..a0b457ae87c2 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > struct sk_buff *skb; > > u8 *data; > > bool pfmemalloc; > > + bool clone; > > > > - cache = (flags & SKB_ALLOC_FCLONE) > > - ? skbuff_fclone_cache : skbuff_head_cache; > > + clone = !!(flags & SKB_ALLOC_FCLONE); > > The boolean conversion here is probably unnecessary. I would make > clone an int like flags and work with that. I suspect the compiler is > doing it already, but it is better to be explicit. > > > + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; > > > > if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) > > gfp_mask |= __GFP_MEMALLOC; > > > > /* Get the HEAD */ > > - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); > > + if ((flags & SKB_ALLOC_NAPI) && !clone && > > Rather than having to do two checks you could just check for > SKB_ALLOC_NAPI and SKB_ALLOC_FCLONE in a single check. You could just > do something like: > if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI) == SKB_ALLOC_NAPI) > > That way you can avoid the extra conditional jumps and can start > computing the flags value sooner. I thought about combined check for two flags yesterday, so yeah, that probably should be better than the current version. > > + likely(node == NUMA_NO_NODE || node == numa_mem_id())) > > + skb = napi_skb_cache_get(); > > + else > > + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, > > node); > > if (unlikely(!skb)) > > return NULL; > > prefetchw(skb); > > @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > __build_skb_around(skb, data, 0); > > skb->pfmemalloc = pfmemalloc; > > > > - if (flags & SKB_ALLOC_FCLONE) { > > + if (clone) { > > struct sk_buff_fclones *fclones; > > > > fclones = container_of(skb, struct sk_buff_fclones, skb1); > > -- > > 2.30.1 Thanks, Al
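Written out with balanced parentheses (the one-liner quoted above drops a closing paren), the suggested combined check would read something like the snippet below, kept in the shape of the patched __alloc_skb() hunk:

	/* Sketch: take the NAPI cache path only when SKB_ALLOC_NAPI is set
	 * and SKB_ALLOC_FCLONE is not, in a single flags test.
	 */
	if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI &&
	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
		skb = napi_skb_cache_get();
	else
		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);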
Re: [PATCH v5 net-next 06/11] skbuff: remove __kfree_skb_flush()
From: Alexander Duyck Date: Thu, 11 Feb 2021 19:28:52 -0800 > On Thu, Feb 11, 2021 at 10:57 AM Alexander Lobakin wrote: > > > > This function isn't much needed as NAPI skb queue gets bulk-freed > > anyway when there's no more room, and even may reduce the efficiency > > of bulk operations. > > It will be even less needed after reusing skb cache on allocation path, > > so remove it and this way lighten network softirqs a bit. > > > > Suggested-by: Eric Dumazet > > Signed-off-by: Alexander Lobakin > > I'm wondering if you have any actual gains to show from this patch? > > The reason why I ask is because the flushing was happening at the end > of the softirq before the system basically gave control back over to > something else. As such there is a good chance for the memory to be > dropped from the cache by the time we come back to it. So it may be > just as expensive if not more so than accessing memory that was just > freed elsewhere and placed in the slab cache. Just retested after readding this function (and changing the logics so it would drop the second half of the cache, like napi_skb_cache_put() does) and got 10 Mbps drawback with napi_build_skb() + napi_gro_receive(). So seems like getting a pointer from an array instead of calling kmem_cache_alloc() is cheaper even if the given object was pulled out of CPU caches. > > --- > > include/linux/skbuff.h | 1 - > > net/core/dev.c | 7 +-- > > net/core/skbuff.c | 12 > > 3 files changed, 1 insertion(+), 19 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 0a4e91a2f873..0e0707296098 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct > > napi_struct *napi, > > } > > void napi_consume_skb(struct sk_buff *skb, int budget); > > > > -void __kfree_skb_flush(void); > > void __kfree_skb_defer(struct sk_buff *skb); > > > > /** > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 321d41a110e7..4154d4683bb9 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct > > softirq_action *h) > > else > > __kfree_skb_defer(skb); > > } > > - > > - __kfree_skb_flush(); > > } > > > > if (sd->output_queue) { > > @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) > > __napi_poll(napi, &repoll); > > netpoll_poll_unlock(have); > > > > - __kfree_skb_flush(); > > local_bh_enable(); > > > > if (!repoll) > > So it looks like this is the one exception to my comment above. Here > we should probably be adding a "if (!repoll)" before calling > __kfree_skb_flush(). 
> > > @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct > > softirq_action *h) > > > > if (list_empty(&list)) { > > if (!sd_has_rps_ipi_waiting(sd) && > > list_empty(&repoll)) > > - goto out; > > + return; > > break; > > } > > > > @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct > > softirq_action *h) > > __raise_softirq_irqoff(NET_RX_SOFTIRQ); > > > > net_rps_action_and_irq_enable(sd); > > -out: > > - __kfree_skb_flush(); > > } > > > > struct netdev_adjacent { > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 1c6f6ef70339..4be2bb969535 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) > > kfree_skbmem(skb); > > } > > > > -void __kfree_skb_flush(void) > > -{ > > - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > - > > - /* flush skb_cache if containing objects */ > > - if (nc->skb_count) { > > - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, > > -nc->skb_cache); > > - nc->skb_count = 0; > > - } > > -} > > - > > static inline void _kfree_skb_defer(struct sk_buff *skb) > > { > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > -- > > 2.30.1 Al
[PATCH v6 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always do allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and veth driver already do). As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via: - a new napi_build_skb() function (as a replacement for build_skb()); - existing {,__}napi_alloc_skb() and napi_get_frags() functions; - __alloc_skb() with passing SKB_ALLOC_NAPI in flags. iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps. Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs: - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from the remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, skbuff_head from a remote node is an OK case; - The easiest way to check if the slab of skbuff_head is remote or pfmemalloc'ed is: if (!dev_page_is_reusable(virt_to_head_page(skb))) /* drop it */; ...*but*, regarding that most slabs are built of compound pages, virt_to_head_page() will hit unlikely-branch every single call. This check costed at least 20 Mbps in test scenarios and seems like it'd be better to _not_ do this. Since v5 [4]: - revert flags-to-bool conversion and simplify flags testing in __alloc_skb() (Alexander Duyck). Since v4 [3]: - rebase on top of net-next and address kernel build robot issue; - reorder checks a bit in __alloc_skb() to make new condition even more harmless. Since v3 [2]: - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb(); - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals to the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough; - don't waste cycles on explicit in_serving_softirq() check. Since v2 [1]: - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skbs requests to kmalloc layer); - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov); - completely drop redundant __kfree_skb_flush() (also Eric); - lots of code cleanups; - expand the commit message with NUMA and pfmemalloc points (Jakub). Since v1 [0]: - use one unified cache instead of two separate to greatly simplify the logics and reduce hotpath overhead (Edward Cree); - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing; - correct performance numbers after optimizations and performing lots of tests for different use cases. 
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me [3] https://lore.kernel.org/netdev/20210210162732.80467-1-aloba...@pm.me [4] https://lore.kernel.org/netdev/20210211185220.9753-1-aloba...@pm.me Alexander Lobakin (11): skbuff: move __alloc_skb() next to the other skb allocation functions skbuff: simplify kmalloc_reserve() skbuff: make __build_skb_around() return void skbuff: simplify __alloc_skb() a bit skbuff: use __build_skb_around() in __alloc_skb() skbuff: remove __kfree_skb_flush() skbuff: move NAPI cache declarations upper in the file skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads skbuff: allow to optionally use NAPI cache from __alloc_skb() skbuff: allow to use NAPI cache from __napi_alloc_skb() skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing include/linux/skbuff.h | 4 +- net/core/dev.c | 16 +- net/core/skbuff.c | 428 +++-- 3 files changed, 242 insertions(+), 206 deletions(-) -- 2.30.1
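As a side note, here is a sketch of the per-head check the cover letter above measures and rejects; it is what the series deliberately does not do, since virt_to_head_page() on compound pages cost about 20 Mbps in the test scenarios:

/* Sketch of the per-head check discussed (and rejected) in the cover
 * letter: recycle the head only when its backing page is neither
 * pfmemalloc'ed nor from a remote NUMA node. The series skips this
 * because virt_to_head_page() on compound pages cost ~20 Mbps in tests.
 */
static bool skb_head_ok_for_cache(const struct sk_buff *skb)
{
	/* dev_page_is_reusable(): local node && !pfmemalloc */
	return dev_page_is_reusable(virt_to_head_page(skb));
}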
[PATCH v6 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller(). Remove the redundant macro and rename the function after it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a0f846872d19..70289f22a6f4 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align); * may be used. Otherwise, the packet data may be discarded until enough * memory is free */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) +static void *kmalloc_reserve(size_t size, gfp_t flags, int node, +bool *pfmemalloc) { void *obj; bool ret_pfmemalloc = false; -- 2.30.1
[PATCH v6 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v6 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v6 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v6 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v6 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v6 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 7 +-- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 19 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index ce6291bc2e16..631807c196ad 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) __napi_poll(napi, &repoll); netpoll_poll_unlock(have); - __kfree_skb_flush(); local_bh_enable(); if (!repoll) @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v6 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
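A hedged sketch of how a driver Rx path could opt in, swapping build_skb() for the new napi_build_skb(); the rx_buf structure and its fields are hypothetical:

/* Hedged sketch: a driver Rx path opting in by replacing build_skb()
 * with napi_build_skb(). The rx_buf structure and its fields are
 * hypothetical, not taken from any real driver.
 */
static struct sk_buff *my_build_rx_skb(struct my_rx_buf *rx_buf,
				       unsigned int len)
{
	struct sk_buff *skb;

	/* data points to a page fragment, truesize covers headroom + frame */
	skb = napi_build_skb(rx_buf->data, rx_buf->truesize);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, rx_buf->headroom);
	skb_put(skb, len);

	return skb;
}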
[PATCH v6 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Cc: Alexander Duyck # Simplified flags check Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..a80581eed7fc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -405,7 +405,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); -- 2.30.1
[PATCH v6 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 631807c196ad..ea9b46318d23 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6095,13 +6095,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6115,7 +6108,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 875e1a453f7e..545a472273a5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -916,9 +916,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -935,6 +932,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -960,6 +965,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
[PATCH v6 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a80581eed7fc..875e1a453f7e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -562,7 +562,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -579,7 +580,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
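A minimal sketch of the Rx copybreak pattern this patch targets, with hypothetical driver parameters; the only visible change for a driver is that the head behind napi_alloc_skb() may now come from the NAPI cache:

/* Sketch of the Rx copybreak pattern this patch targets: small frames
 * are copied into a fresh skb whose head may now come from the NAPI
 * cache. The frame/len parameters are placeholders for what the driver's
 * descriptor provides.
 */
static struct sk_buff *my_rx_copybreak(struct napi_struct *napi,
				       const void *frame, unsigned int len)
{
	struct sk_buff *skb;

	skb = napi_alloc_skb(napi, len);	/* head may come from the cache */
	if (unlikely(!skb))
		return NULL;

	skb_put_data(skb, frame, len);		/* copy the small frame */

	return skb;
}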
Re: linux-next: manual merge of the kspp tree with the mips tree
From: Stephen Rothwell Date: Tue, 23 Feb 2021 10:49:50 +1100 > Hi all, Hi, > On Mon, 15 Feb 2021 07:47:26 +1100 Stephen Rothwell > wrote: > > > > On Mon, 18 Jan 2021 15:08:04 +1100 Stephen Rothwell > > wrote: > > > > > > Today's linux-next merge of the kspp tree got a conflict in: > > > > > > include/asm-generic/vmlinux.lds.h > > > > > > between commits: > > > > > > 9a427556fb8e ("vmlinux.lds.h: catch compound literals into > > > data and BSS") > > > f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into data") > > > > > > from the mips tree and commit: > > > > > > dc5723b02e52 ("kbuild: add support for Clang LTO") > > > > > > from the kspp tree. > > > > > > I fixed it up (9a427556fb8e and dc5723b02e52 made the same change to > > > DATA_MAIN, which conflicted with the change in f41b233de0ae) and can > > > carry the fix as necessary. This is now fixed as far as linux-next is > > > concerned, but any non trivial conflicts should be mentioned to your > > > upstream maintainer when your tree is submitted for merging. You may > > > also want to consider cooperating with the maintainer of the > > > conflicting tree to minimise any particularly complex conflicts. > > > > With the merge window about to open, this is a reminder that this > > conflict still exists. > > This is now a conflict between the kspp tree and Linus' tree. Kees prepared a Git pull of the kspp tree for Linus, so this will be resolved soon. > -- > Cheers, > Stephen Rothwell Al
[PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
LKP caught another bunch of orphaned instrumentation symbols [0]: mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/main.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/main.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/initramfs.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/initramfs.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/calibrate.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/calibrate.o' being placed in section `.data.$LPBX0' [...] Soften the wildcard to .data.$L* to grab these ones into .data too. [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com Reported-by: kernel test robot Signed-off-by: Alexander Lobakin --- include/asm-generic/vmlinux.lds.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 01a3fd6a64d2..c887ac36c1b4 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -95,7 +95,7 @@ */ #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$L* #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* -- 2.30.1
Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
> LKP caught another bunch of orphaned instrumentation symbols [0]: > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/main.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/main.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/do_mounts.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/do_mounts.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/initramfs.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/initramfs.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/calibrate.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/calibrate.o' being placed in section `.data.$LPBX0' > > [...] > > Soften the wildcard to .data.$L* to grab these ones into .data too. > > [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com > > Reported-by: kernel test robot > Signed-off-by: Alexander Lobakin > --- > include/asm-generic/vmlinux.lds.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Hi Thomas, This applies on top of mips-next or Linus' tree, so you may need to rebase mips-fixes before taking it. It's not for mips-next as it should go into this cycle as a [hot]fix. I haven't added any "Fixes:" tag since these warnings is a result of merging several sets and of certain build configurations that almost couldn't be tested separately. > diff --git a/include/asm-generic/vmlinux.lds.h > b/include/asm-generic/vmlinux.lds.h > index 01a3fd6a64d2..c887ac36c1b4 100644 > --- a/include/asm-generic/vmlinux.lds.h > +++ b/include/asm-generic/vmlinux.lds.h > @@ -95,7 +95,7 @@ > */ > #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION > #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* > -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* > +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > .data..compoundliteral* .data.$__unnamed_* .data.$L* > #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* > #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* > #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* > -- > 2.30.1 Thanks, Al
Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
From: Thomas Bogendoerfer Date: Tue, 23 Feb 2021 13:21:44 +0100 > On Tue, Feb 23, 2021 at 11:36:41AM +0000, Alexander Lobakin wrote: > > > LKP caught another bunch of orphaned instrumentation symbols [0]: > > > > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/main.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/main.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/do_mounts.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/do_mounts.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/initramfs.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/initramfs.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/calibrate.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/calibrate.o' being placed in section `.data.$LPBX0' > > > > > > [...] > > > > > > Soften the wildcard to .data.$L* to grab these ones into .data too. > > > > > > [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com > > > > > > Reported-by: kernel test robot > > > Signed-off-by: Alexander Lobakin > > > --- > > > include/asm-generic/vmlinux.lds.h | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > Hi Thomas, > > > > This applies on top of mips-next or Linus' tree, so you may need to > > rebase mips-fixes before taking it. > > It's not for mips-next as it should go into this cycle as a [hot]fix. > > I haven't added any "Fixes:" tag since these warnings is a result > > of merging several sets and of certain build configurations that > > almost couldn't be tested separately. > > no worries, mips-fixes is defunct during merge windows. I'll send another > pull request to Linus and will add this patch to it. Ah, thank you! > Thomas. Al > -- > Crap can work. Given enough thrust pigs will fly, but it's not necessarily a > good idea.[ RFC1925, 2.3 ]
Re: [GIT PULL v2] clang-lto for v5.12-rc1
From: Linus Torvalds Date: Tue, 23 Feb 2021 12:33:05 -0800 > On Tue, Feb 23, 2021 at 9:49 AM Linus Torvalds > wrote: > > > > On Mon, Feb 22, 2021 at 3:11 PM Kees Cook wrote: > > > > > > While x86 LTO enablement is done[1], it depends on some objtool > > > clean-ups[2], though it appears those actually have been in linux-next > > > (via tip/objtool/core), so it's possible that if that tree lands [..] > > > > That tree is actually next on my list of things to merge after this > > one, so it should be out soonish. > > "soonish" turned out to be later than I thought, because my "build > changes" set of pulls included the module change that I then wasted a > lot of time on trying to figure out why it slowed down my build so > much. I guess it's about CONFIG_TRIM_UNUSED_KSYMS you disabled in your tree. Well, it's actually widely used, mostly in the embedded world where there are often no out-of-tree modules, but a need to save as much space as possible. For full-blown systems and distributions it's almost needless, right. > But it's out now, as pr-tracker-bot already noted. > > Linus Thanks, Al
Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT
From: Yury Norov Date: Sat, 5 Dec 2020 08:54:06 -0800 Hi, > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't > enable it in config. It leads to using find_next_bit() which is less > efficient: > > : >0: aa0003e4mov x4, x0 >4: aa0103e0mov x0, x1 >8: b4000181cbz x1, 38 >c: f9400083ldr x3, [x4] > 10: d2800802mov x2, #0x40 // #64 > 14: 91002084add x4, x4, #0x8 > 18: b4c3cbz x3, 30 > 1c: 1408b 3c > 20: f8408483ldr x3, [x4], #8 > 24: 91010045add x5, x2, #0x40 > 28: b5c3cbnzx3, 40 > 2c: aa0503e2mov x2, x5 > 30: eb02001fcmp x0, x2 > 34: 5468b.hi20 // b.pmore > 38: d65f03c0ret > 3c: d282mov x2, #0x0// #0 > 40: dac00063rbitx3, x3 > 44: dac01063clz x3, x3 > 48: 8b020062add x2, x3, x2 > 4c: eb02001fcmp x0, x2 > 50: 9a829000cselx0, x0, x2, ls // ls = plast > 54: d65f03c0ret > > ... > > 0118 <_find_next_bit.constprop.1>: > 118: eb02007fcmp x3, x2 > 11c: 540002e2b.cs178 <_find_next_bit.constprop.1+0x60> // b.hs, > b.nlast > 120: d346fc66lsr x6, x3, #6 > 124: f8667805ldr x5, [x0, x6, lsl #3] > 128: b461cbz x1, 134 <_find_next_bit.constprop.1+0x1c> > 12c: f8667826ldr x6, [x1, x6, lsl #3] > 130: 8a0600a5and x5, x5, x6 > 134: ca0400a6eor x6, x5, x4 > 138: 9285mov x5, #0x // #-1 > 13c: 9ac320a5lsl x5, x5, x3 > 140: 927ae463and x3, x3, #0xffc0 > 144: ea0600a5andsx5, x5, x6 > 148: 54000120b.eq16c <_find_next_bit.constprop.1+0x54> // b.none > 14c: 140eb 184 <_find_next_bit.constprop.1+0x6c> > 150: d346fc66lsr x6, x3, #6 > 154: f8667805ldr x5, [x0, x6, lsl #3] > 158: b461cbz x1, 164 <_find_next_bit.constprop.1+0x4c> > 15c: f8667826ldr x6, [x1, x6, lsl #3] > 160: 8a0600a5and x5, x5, x6 > 164: eb05009fcmp x4, x5 > 168: 54c1b.ne180 <_find_next_bit.constprop.1+0x68> // b.any > 16c: 91010063add x3, x3, #0x40 > 170: eb03005fcmp x2, x3 > 174: 54fffee8b.hi150 <_find_next_bit.constprop.1+0x38> // > b.pmore > 178: aa0203e0mov x0, x2 > 17c: d65f03c0ret > 180: ca050085eor x5, x4, x5 > 184: dac000a5rbitx5, x5 > 188: dac010a5clz x5, x5 > 18c: 8b0300a3add x3, x5, x3 > 190: eb03005fcmp x2, x3 > 194: 9a839042cselx2, x2, x3, ls // ls = plast > 198: aa0203e0mov x0, x2 > 19c: d65f03c0ret > > ... > > 0238 : > 238: a9bf7bfdstp x29, x30, [sp, #-16]! > 23c: aa0203e3mov x3, x2 > 240: d284mov x4, #0x0// #0 > 244: aa0103e2mov x2, x1 > 248: 910003fdmov x29, sp > 24c: d281mov x1, #0x0// #0 > 250: 97b2bl 118 <_find_next_bit.constprop.1> > 254: a8c17bfdldp x29, x30, [sp], #16 > 258: d65f03c0ret > > Enabling this functions would also benefit for_each_{set,clear}_bit(). > Would it make sense to enable this config for all such architectures by > default? I confirm that GENERIC_FIND_FIRST_BIT also produces more optimized and fast code on MIPS (32 R2) where there is also no architecture-specific bitsearching routines. So, if it's okay for other folks, I'd suggest to go for it and enable for all similar arches. (otherwise, I'll publish a separate entry for mips-next after 5.12-rc1 release and mention you in "Suggested-by:") > Signed-off-by: Yury Norov > > --- > arch/arm64/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 1515f6f153a0..2b90ef1f548e 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -106,6 +106,7 @@ config ARM64 > select GENERIC_CPU_AUTOPROBE > select GENERIC_CPU_VULNERABILITIES > select GENERIC_EARLY_IOREMAP > + select GENERIC_FIND_FIRST_BIT > select GENERIC_IDLE_POLL_SETUP > select GENERIC_IRQ_IPI > select GENERIC_IRQ_MULTI_HANDLER > -- > 2.25.1 Thanks, Al
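For comparison, a user-space sketch (not the kernel's implementation) of what a dedicated find_first_bit() boils down to; the generic _find_next_bit() shown in the disassembly above additionally has to handle an arbitrary start offset and an optional second bitmap, which is where the extra code comes from:

/* User-space sketch, not the kernel's implementation: what a dedicated
 * find_first_bit() boils down to. Without GENERIC_FIND_FIRST_BIT the same
 * query goes through the heavier _find_next_bit(..., start = 0) shown in
 * the disassembly above, which also handles an arbitrary start offset and
 * an optional second (masking) bitmap.
 */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static unsigned long find_first_bit_sketch(const unsigned long *addr,
					    unsigned long size)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			unsigned long bit = idx * BITS_PER_LONG +
					    __builtin_ctzl(addr[idx]);

			return bit < size ? bit : size;
		}
	}

	return size;	/* no bits set */
}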
Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())
From: Paolo Abeni Date: Wed, 10 Feb 2021 11:21:06 +0100 > Hello, Hi! > I'm sorry for the late feedback, I could not step-in before. > > Also adding Jesper for awareness, as he introduced the bulk free > infrastructure. > > On Tue, 2021-02-09 at 20:48 +, Alexander Lobakin wrote: > > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int > > frag_size) > > */ > > struct sk_buff *build_skb(void *data, unsigned int frag_size) > > { > > - struct sk_buff *skb = __build_skb(data, frag_size); > > + struct sk_buff *skb = __build_skb(data, frag_size, true); > > I must admit I'm a bit scared of this. There are several high speed > device drivers that will move to bulk allocation, and we don't have any > performance figure for them. > > In my experience with (low end) MIPS board, cache misses cost tend to > be much less visible there compared to reasonably recent server H/W, > because the CPU/memory access time difference is much lower. > > When moving to higher end H/W the performance gain you measured could > be completely countered by less optimal cache usage. > > I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a > single skb would be visible e.g. in a round-robin test. Generally > speaking bulk allocating 32 skbs looks a bit too much. IIRC, when > Edward added listification to GRO, he did several measures with > different list size and found 8 to be the optimal value (for the tested > workload). Above such number the list become too big and the pressure > on the cache outweighted the bulking benefits. I can change the logic so that it allocates only the first 8. I think I've already seen this batch value somewhere in XDP code, so this might be a balanced one. Regarding bulk-freeing: does the batch size also matter when freeing, or is it okay to wipe 32 (currently 64 in baseline) in a row? > Perhaps giving the device drivers the ability to opt-in on this infra > via a new helper - as done back then with napi_consume_skb() - would > make this change safer? That's actually a very nice idea. There's only a little to change in the code to make taking heads from the cache optional. This way developers could switch to it when needed. Thanks for the suggestions! I'll definitely absorb them into the code and give it a test. > > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb) > > kfree_skbmem(skb); > > } > > > > -static inline void _kfree_skb_defer(struct sk_buff *skb) > > +static void napi_skb_cache_put(struct sk_buff *skb) > > { > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > + u32 i; > > > > /* drop skb->head and call any destructors for packet */ > > skb_release_all(skb); > > > > - /* record skb to CPU local list */ > > + kasan_poison_object_data(skbuff_head_cache, skb); > > nc->skb_cache[nc->skb_count++] = skb; > > > > -#ifdef CONFIG_SLUB > > - /* SLUB writes into objects when freeing */ > > - prefetchw(skb); > > -#endif > > It looks like this chunk has been lost. Is that intentional? Yep. This prefetchw() assumed that skbuff_heads would be wiped immediately or at the end of the network softirq. Reusing this cache means that heads can be reused later or kept in the cache for some time, so prefetching makes no sense anymore. > Thanks! > > Paolo Al
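For reference, a condensed sketch of the freeing side discussed here, as it ends up in patch 08/11 of the later revisions: heads are KASAN-poisoned while they sit in the cache, and the second half (32 entries) is bulk-freed once the 64-entry cache fills up. Simplified, not the literal kernel code:

/* Condensed sketch of the freeing side as it lands in patch 08/11 of the
 * later revisions; simplified, not the literal kernel code. Heads are
 * KASAN-poisoned while cached, and the second half of the 64-entry cache
 * is bulk-freed back to the slab once it fills up.
 */
static void napi_skb_cache_put_sketch(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		/* unpoison the batch before returning it to the slab */
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}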
[PATCH v4 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always do allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and veth driver already do). As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via: - a new napi_build_skb() function (as a replacement for build_skb()); - existing {,__}napi_alloc_skb() and napi_get_frags() functions; - __alloc_skb() with passing SKB_ALLOC_NAPI in flags. iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps. Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs: - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from the remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, skbuff_head from a remote node is an OK case; - The easiest way to check if the slab of skbuff_head is remote or pfmemalloc'ed is: if (!dev_page_is_reusable(virt_to_head_page(skb))) /* drop it */; ...*but*, regarding that most slabs are built of compound pages, virt_to_head_page() will hit unlikely-branch every single call. This check costed at least 20 Mbps in test scenarios and seems like it'd be better to _not_ do this. Since v3 [2]: - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb(); - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals to the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough; - don't waste cycles on explicit in_serving_softirq() check. Since v2 [1]: - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skbs requests to kmalloc layer); - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov); - completely drop redundant __kfree_skb_flush() (also Eric); - lots of code cleanups; - expand the commit message with NUMA and pfmemalloc points (Jakub). Since v1 [0]: - use one unified cache instead of two separate to greatly simplify the logics and reduce hotpath overhead (Edward Cree); - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing; - correct performance numbers after optimizations and performing lots of tests for different use cases. 
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me Alexander Lobakin (11): skbuff: move __alloc_skb() next to the other skb allocation functions skbuff: simplify kmalloc_reserve() skbuff: make __build_skb_around() return void skbuff: simplify __alloc_skb() a bit skbuff: use __build_skb_around() in __alloc_skb() skbuff: remove __kfree_skb_flush() skbuff: move NAPI cache declarations upper in the file skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads skbuff: allow to optionally use NAPI cache from __alloc_skb() skbuff: allow to use NAPI cache from __napi_alloc_skb() skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing include/linux/skbuff.h | 4 +- net/core/dev.c | 15 +- net/core/skbuff.c | 429 +++-- 3 files changed, 243 insertions(+), 205 deletions(-) -- 2.30.1
[PATCH v4 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v4 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't
been used. _RET_IP_ is embedded inside kmalloc_node_track_caller().
Remove the redundant macro and rename the function after it.

Signed-off-by: Alexander Lobakin
---
 net/core/skbuff.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0f846872d19..70289f22a6f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align);
  * may be used. Otherwise, the packet data may be discarded until enough
  * memory is free
  */
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
-	__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
-			       unsigned long ip, bool *pfmemalloc)
+static void *kmalloc_reserve(size_t size, gfp_t flags, int node,
+			     bool *pfmemalloc)
 {
 	void *obj;
 	bool ret_pfmemalloc = false;
-- 
2.30.1
[PATCH v4 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v4 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 6 +- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 18 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 7647278e46f0..7134ae2fc0db 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7041,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7068,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v4 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v4 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
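The diff above is cut off before the freeing side of the cache. Going by the commit message (poison the head, and bulk-wipe the upper half once all 64 slots are used), the put path presumably looks roughly like the sketch below; the exact code in the applied patch may differ.

```
/* Sketch of the freeing side described in the commit message (the hunk
 * above is truncated before it): poison the head, stash it in the
 * per-CPU cache and, once the cache is full, unpoison and bulk-free the
 * upper half back to the slab cache.
 */
static void napi_skb_cache_put(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}
```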
[PATCH v4 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..750fa1825b28 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, struct sk_buff *skb; u8 *data; bool pfmemalloc; + bool clone; - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; + clone = !!(flags & SKB_ALLOC_FCLONE); + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if (!clone && (flags & SKB_ALLOC_NAPI) && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - if (flags & SKB_ALLOC_FCLONE) { + if (clone) { struct sk_buff_fclones *fclones; fclones = container_of(skb, struct sk_buff_fclones, skb1); -- 2.30.1
[PATCH v4 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v4 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 7134ae2fc0db..f04877295b4f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6094,13 +6094,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6114,7 +6107,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index ac6e0172f206..9ff701afa837 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -917,9 +917,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -936,6 +933,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -961,6 +966,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
[PATCH v4 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 750fa1825b28..ac6e0172f206 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -563,7 +563,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -580,7 +581,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Paolo Abeni Date: Thu, 11 Feb 2021 11:16:40 +0100 > On Wed, 2021-02-10 at 16:30 +0000, Alexander Lobakin wrote: > > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get > > an skbuff_head from the NAPI cache instead of inplace allocation > > inside __alloc_skb(). > > This implies that the function is called from softirq or BH-off > > context, not for allocating a clone or from a distant node. > > > > Signed-off-by: Alexander Lobakin > > --- > > net/core/skbuff.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 9e1a8ded4acc..750fa1825b28 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > struct sk_buff *skb; > > u8 *data; > > bool pfmemalloc; > > + bool clone; > > > > - cache = (flags & SKB_ALLOC_FCLONE) > > - ? skbuff_fclone_cache : skbuff_head_cache; > > + clone = !!(flags & SKB_ALLOC_FCLONE); > > + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; > > > > if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) > > gfp_mask |= __GFP_MEMALLOC; > > > > /* Get the HEAD */ > > - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); > > + if (!clone && (flags & SKB_ALLOC_NAPI) && > > + likely(node == NUMA_NO_NODE || node == numa_mem_id())) > > + skb = napi_skb_cache_get(); > > + else > > + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); > > if (unlikely(!skb)) > > return NULL; > > prefetchw(skb); > > I hope the opt-in thing would have allowed leaving this code unchanged. > I see it's not trivial avoid touching this code path. > Still I think it would be nice if you would be able to let the device > driver use the cache without touching the above, which is also used > e.g. by the TCP xmit path, which in turn will not leverage the cache > (as it requires FCLONE skbs). > > If I read correctly, the above chunk is needed to > allow __napi_alloc_skb() access the cache even for small skb > allocation. Not only. I wanted to give an ability to access the new feature through __alloc_skb() too, not only through napi_build_skb() or napi_alloc_skb(). And not only for drivers. As you may remember, firstly napi_consume_skb()'s batching system landed for drivers, but then it got used in network core code. I think that some core parts may benefit from reusing the NAPI caches. We'll only see it later. It's not as complex as it may seem. NUMA check is cheap and tends to be true for the vast majority of cases. Check for fclone is already present in baseline code, even two times through the function. So it's mostly about (flags & SKB_ALLOC_NAPI). > Good device drivers should not call alloc_skb() in the fast > path. Not really. Several enterprise NIC drivers use __alloc_skb() and alloc_skb(): ChelsIO and Mellanox for inline TLS, Netronome etc. Lots of RDMA and wireless drivers (not the legacy ones), too. __alloc_skb() gives you more control on NUMA node and needed skb headroom, so it's still sometimes useful in drivers. > What about changing __napi_alloc_skb() to always use > the __napi_build_skb(), for both kmalloc and page backed skbs? That is, > always doing the 'data' allocation in __napi_alloc_skb() - either via > page_frag or via kmalloc() - and than call __napi_build_skb(). > > I think that should avoid adding more checks in __alloc_skb() and > should probably reduce the number of conditional used > by __napi_alloc_skb(). I thought of this too. 
But this will introduce a conditional branch to decide whether to set
skb->head_frag. So it's one branch less in __alloc_skb() but one branch
more here, and we also lose the ability to call __alloc_skb() with a
decached head.

> Thanks!
>
> Paolo

Thanks,
Al
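For illustration only (hypothetical helper, not from the series): __alloc_skb() is the variant that exposes the NUMA node and lets the caller carve out its own headroom, which napi_alloc_skb() doesn't, and with this patch SKB_ALLOC_NAPI lets such callers opt in to the skbuff_head cache as well.

```
/* Hypothetical driver helper (names made up) showing why some drivers
 * stick to __alloc_skb(): explicit NUMA node plus caller-managed
 * headroom. SKB_ALLOC_NAPI additionally opts in to the NAPI head cache.
 */
static struct sk_buff *foo_alloc_rx_skb(struct device *dma_dev,
					unsigned int len,
					unsigned int headroom)
{
	struct sk_buff *skb;

	skb = __alloc_skb(len + headroom, GFP_ATOMIC,
			  SKB_ALLOC_RX | SKB_ALLOC_NAPI,
			  dev_to_node(dma_dev));
	if (likely(skb))
		skb_reserve(skb, headroom);

	return skb;
}
```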
Re: [PATCH v4 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
From: Jesper Dangaard Brouer Date: Thu, 11 Feb 2021 13:54:59 +0100 > On Wed, 10 Feb 2021 16:30:23 + > Alexander Lobakin wrote: > > > Instead of just bulk-flushing skbuff_heads queued up through > > napi_consume_skb() or __kfree_skb_defer(), try to reuse them > > on allocation path. > > Maybe you are already aware of this dynamics, but high speed NICs will > usually run the TX "cleanup" (opportunistic DMA-completion) in the napi > poll function call, and often before processing RX packets. Like > ixgbe_poll[1] calls ixgbe_clean_tx_irq() before ixgbe_clean_rx_irq(). Sure. 1G MIPS is my home project (I'll likely migrate to ARM64 cluster in 2-3 months). I mostly work with 10-100G NICs at work. > If traffic is symmetric (or is routed-back same interface) then this > SKB recycle scheme will be highly efficient. (I had this part of my > initial patchset and tested it on ixgbe). > > [1] > https://elixir.bootlin.com/linux/v5.11-rc7/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L3149 That's exactly why I introduced this feature. Firstly driver enriches the cache with the consumed skbs from Tx completion queue, and then it just decaches them back on Rx completion cycle. That's how things worked most of the time on my test setup. The reason why Paolo proposed this as an option, and why I agreed it's safer to do instead of unconditional switching, is that different platforms and setup may react differently on this. We don't have an ability to test the entire zoo, so we propose an option for driver and network core developers to test and use "on demand". As I wrote in reply to Paolo, there might be cases when even the core networking code may benefit from this. > > If the cache is empty on allocation, bulk-allocate the first > > 16 elements, which is more efficient than per-skb allocation. > > If the cache is full on freeing, bulk-wipe the second half of > > the cache (32 elements). > > This also includes custom KASAN poisoning/unpoisoning to be > > double sure there are no use-after-free cases. > > > > To not change current behaviour, introduce a new function, > > napi_build_skb(), to optionally use a new approach later > > in drivers. > > > > Note on selected bulk size, 16: > > - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE > >and especially VETH_XDP_BATCH, which is also used to > >bulk-allocate skbuff_heads and was tested on powerful > >setups; > > - this also showed the best performance in the actual > >test series (from the array of {8, 16, 32}). 
> > > > Suggested-by: Edward Cree # Divide on two halves > > Suggested-by: Eric Dumazet# KASAN poisoning > > Cc: Dmitry Vyukov # Help with KASAN > > Cc: Paolo Abeni # Reduced batch size > > Signed-off-by: Alexander Lobakin > > --- > > include/linux/skbuff.h | 2 + > > net/core/skbuff.c | 94 -- > > 2 files changed, 83 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 0e0707296098..906122eac82a 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int > > frag_size); > > struct sk_buff *build_skb_around(struct sk_buff *skb, > > void *data, unsigned int frag_size); > > > > +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); > > + > > /** > > * alloc_skb - allocate a network buffer > > * @size: size to allocate > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 860a9d4f752f..9e1a8ded4acc 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, > > unsigned int sz, void *addr) > > } > > > > #define NAPI_SKB_CACHE_SIZE64 > > +#define NAPI_SKB_CACHE_BULK16 > > +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) > > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer Thanks, Al
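As an aside, a sketch of the Tx-before-Rx poll ordering Jesper describes (the foo_* driver helpers are made up): Tx completion feeds skbuff_heads into the per-CPU NAPI cache via napi_consume_skb(), and the Rx processing that follows in the same poll pulls them straight back out through napi_alloc_skb()/napi_build_skb().

```
/* Illustration only, foo_* helpers are hypothetical. */
static int foo_napi_poll(struct napi_struct *napi, int budget)
{
	struct foo_ring *ring = container_of(napi, struct foo_ring, napi);
	int work_done;

	/* completes Tx descriptors; calls napi_consume_skb(skb, budget),
	 * which refills the per-CPU skbuff_head cache
	 */
	foo_clean_tx_irq(ring);

	/* builds Rx skbs; napi_alloc_skb()/napi_build_skb() reuse the
	 * heads that were just cached by the Tx cleanup above
	 */
	work_done = foo_clean_rx_irq(ring, budget);

	if (work_done < budget && napi_complete_done(napi, work_done))
		foo_enable_irq(ring);

	return work_done;
}
```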
[PATCH v7 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v6 [3]: - rebase ontop of bpf-next after merge with net-next; - address kdoc warnings. >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. 
From v4 [1]:
 - fix 0002 build error due to inverted static_assert() condition (0day bot);
 - collect two Acked-bys (Magnus).

From v3 [0]:
 - refactor netdev_priv_flags to make it easier to add new ones and prevent
   bitwidth overflow;
 - add headroom (both standard and zerocopy) and tailroom (standard)
   reservation in skb for drivers to avoid potential reallocations;
 - fix skb->truesize accounting;
 - misc comment rewords.

[0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com
[1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210216172640.374487-1-aloba...@pm.me

Alexander Lobakin (3):
  netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
  netdevice: check for net_device::priv_flags bitfield overflow
  xsk: respect device's headroom and tailroom on generic xmit path

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page (aka generic zerocopy xmit)

 drivers/net/virtio_net.c | 3 +-
 include/linux/netdevice.h | 202 --
 net/xdp/xsk.c | 114
[PATCH v7 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but becomes fatal for the subsequent patch.

Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom")
Signed-off-by: Alexander Lobakin
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..3b6f82c2c271 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1577,6 +1577,7 @@ enum netdev_priv_flags {
 #define IFF_L3MDEV_SLAVE	IFF_L3MDEV_SLAVE
 #define IFF_TEAM		IFF_TEAM
 #define IFF_RXFH_CONFIGURED	IFF_RXFH_CONFIGURED
+#define IFF_PHONY_HEADROOM	IFF_PHONY_HEADROOM
 #define IFF_MACSEC		IFF_MACSEC
 #define IFF_NO_RX_HANDLER	IFF_NO_RX_HANDLER
 #define IFF_FAILOVER		IFF_FAILOVER
-- 
2.30.1
[PATCH v7 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v7 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2c1a642ecdc0..1186ba901ad3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1518,6 +1518,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE_BIT: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER_BIT: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK_BIT: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR_BIT: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) * * @NETDEV_PRIV_FLAG_COUNT: total priv flags count */ @@ -1553,6 +1555,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1595,6 +1598,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v7 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 199 -- 1 file changed, 105 insertions(+), 94 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3b6f82c2c271..2c1a642ecdc0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1483,107 +1483,118 @@ struct net_device_ops { * * You should have a pretty good reason to be extending these flags. * - * @IFF_802_1Q_VLAN: 802.1Q VLAN device - * @IFF_EBRIDGE: Ethernet bridging device - * @IFF_BONDING: bonding master or slave - * @IFF_ISATAP: ISATAP interface (RFC4214) - * @IFF_WAN_HDLC: WAN HDLC device - * @IFF_XMIT_DST_RELEASE: dev_hard_start_xmit() is allowed to + * @IFF_802_1Q_VLAN_BIT: 802.1Q VLAN device + * @IFF_EBRIDGE_BIT: Ethernet bridging device + * @IFF_BONDING_BIT: bonding master or slave + * @IFF_ISATAP_BIT: ISATAP interface (RFC4214) + * @IFF_WAN_HDLC_BIT: WAN HDLC device + * @IFF_XMIT_DST_RELEASE_BIT: dev_hard_start_xmit() is allowed to * release skb->dst - * @IFF_DONT_BRIDGE: disallow bridging this ether dev - * @IFF_DISABLE_NETPOLL: disable netpoll at run-time - * @IFF_MACVLAN_PORT: device used as macvlan port - * @IFF_BRIDGE_PORT: device used as bridge port - * @IFF_OVS_DATAPATH: device used as Open vSwitch datapath port - * @IFF_TX_SKB_SHARING: The interface supports sharing skbs on transmit - * @IFF_UNICAST_FLT: Supports unicast filtering - * @IFF_TEAM_PORT: device used as team port - * @IFF_SUPP_NOFCS: device supports sending custom FCS - * @IFF_LIVE_ADDR_CHANGE: device supports hardware address + * @IFF_DONT_BRIDGE_BIT: disallow bridging this ether dev + * @IFF_DISABLE_NETPOLL_BIT: disable netpoll at run-time + * @IFF_MACVLAN_PORT_BIT: device used as macvlan port + * @IFF_BRIDGE_PORT_BIT: device used as bridge port + * @IFF_OVS_DATAPATH_BIT: device used as Open vSwitch datapath port + * @IFF_TX_SKB_SHARING_BIT: The interface supports sharing skbs on transmit + * @IFF_UNICAST_FLT_BIT: Supports unicast filtering + * @IFF_TEAM_PORT_BIT: device used as team port + * @IFF_SUPP_NOFCS_BIT: device supports sending custom FCS + * @IFF_LIVE_ADDR_CHANGE_BIT: device supports hardware address * change when it's running - * @IFF_MACVLAN: Macvlan device - * @IFF_XMIT_DST_RELEASE_PERM: IFF_XMIT_DST_RELEASE not taking into account + * @IFF_MACVLAN_BIT: Macvlan device + * @IFF_XMIT_DST_RELEASE_PERM_BIT: IFF_XMIT_DST_RELEASE not taking into account * underlying stacked devices - * @IFF_L3MDEV_MASTER: device is an L3 master device - * @IFF_NO_QUEUE: device can run without qdisc attached - * @IFF_OPENVSWITCH: device is a Open vSwitch master - * @IFF_L3MDEV_SLAVE: device is enslaved to an L3 master device - * @IFF_TEAM: device is a team device - * @IFF_RXFH_CONFIGURED: device has had Rx Flow indirection table configured - * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external + * @IFF_L3MDEV_MASTER_BIT: device is an L3 master device + * @IFF_NO_QUEUE_BIT: device can run without qdisc attached + * @IFF_OPENVSWITCH_BIT: device is a Open vSwitch master + * @IFF_L3MDEV_SLAVE_BIT: device is enslaved to an L3 master device + * @IFF_TEAM_BIT: device is a team device + * @IFF_RXFH_CONFIGURED_BIT: device has had Rx Flow indirection table configured + * @IFF_PHONY_HEADROOM_BIT: the headroom value 
is controlled by an external * entity (i.e. the master device for bridged veth) - * @IFF_MACSEC: device is a MACsec device - * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook - * @IFF_FAILOVER: device is a failover master device - * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device - * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device - * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_MACSEC_BIT: device is a MACsec device + * @IFF_NO_RX_HANDLER_BIT: device doesn't support the rx_handler hook + * @IFF_FAILOVER_BIT: device is a failover master device + * @IFF_FAILOVER_SLAVE_BIT: device is lower dev of a failover master device + * @IFF_L3MDEV_RX_HANDLER_BIT: only invoke the rx handler of L3 master device + * @IFF_LIVE_RENAME_OK_BIT: rename is allowed while device is up and running + * + * @NETDEV_PRIV_FLAG_COUNT: total priv flags count */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_
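The hunk above is truncated, but from the pieces visible elsewhere in the series (the IFF_*_BIT enum, the __IFF() helper and the static_assert() on NETDEV_PRIV_FLAG_COUNT), the overflow guard described in the commit message presumably boils down to something like the following reconstruction:

```
/* Presumed shape of the guard (reconstructed, details may differ from
 * the actual patch): flag values are generated from the IFF_*_BIT enum,
 * and the build breaks as soon as NETDEV_PRIV_FLAG_COUNT no longer fits
 * into the flags type.
 */
typedef unsigned int netdev_priv_flags_t;

#define __IFF(name)	((netdev_priv_flags_t)BIT(IFF_##name##_BIT))

static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >=
	      NETDEV_PRIV_FLAG_COUNT);
```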
[PATCH v7 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 120 -- 1 file changed, 96 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..a71ed664da0a 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, ts, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + ts = pool->unaligned ? len : pool->chunk_size; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += ts; + + refcount_add(ts, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, 
L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN;
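The hunk ends before the rest of the xmit loop; per the changelog note about switching to ERR_PTR()/PTR_ERR() for error handling, the loop presumably ends up consuming the new helper roughly as sketched below (not the literal remainder of the diff):

```
/* Sketch of the refactored xmit loop (reconstructed from the changelog,
 * not the actual remainder of the hunk).
 */
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
	if (max_batch-- == 0) {
		err = -EAGAIN;
		break;
	}

	skb = xsk_build_skb(xs, &desc);
	if (IS_ERR(skb)) {
		err = PTR_ERR(skb);
		break;
	}

	/* ... charge the completion ring and transmit the skb as before ... */
}
```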
[PATCH v7 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
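For a concrete feel of what gets reserved, a worked example (assuming 64-byte cachelines, so NET_SKB_PAD is 64; actual values are arch-dependent):

```
/* Worked example of the reservation above, assuming L1_CACHE_BYTES == 64
 * and therefore NET_SKB_PAD == 64:
 *
 *   needed_headroom == 0   -> hr = max(64, L1_CACHE_ALIGN(0))   == 64
 *   needed_headroom == 18  -> hr = max(64, L1_CACHE_ALIGN(18))  == 64
 *   needed_headroom == 128 -> hr = max(64, L1_CACHE_ALIGN(128)) == 128
 *
 * needed_tailroom is taken as-is, so each skb is allocated with
 * hr + len + tr bytes, hr of which are reserved in front of the data.
 */
hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
tr = xs->dev->needed_tailroom;
```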
[PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first bit adds missing IFF self-definition. It's a bit out, but "while we are here". The fourth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v7 [4]: - drop netdev priv flags rework (will be issued separately); - pick up Acks from John. >From v6 [3]: - rebase ontop of bpf-next after merge with net-next; - address kdoc warnings. >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. 
From v4 [1]:
 - fix 0002 build error due to inverted static_assert() condition (0day bot);
 - collect two Acked-bys (Magnus).

From v3 [0]:
 - refactor netdev_priv_flags to make it easier to add new ones and prevent
   bitwidth overflow;
 - add headroom (both standard and zerocopy) and tailroom (standard)
   reservation in skb for drivers to avoid potential reallocations;
 - fix skb->truesize accounting;
 - misc comment rewords.

[0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com
[1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210216172640.374487-1-aloba...@pm.me
[4] https://lore.kernel.org/netdev/20210217120003.7938-1-aloba...@pm.me

Alexander Lobakin (2):
  netdevice: add missing IFF_PHONY_HEADROOM self-definition
  xsk: respect device's headroom and tailroom on generic xmit path

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page (aka generic zerocopy xmit)

 drivers/net/virtio_net.c | 3 +-
 include/linux/netdevice.h | 5 ++
 net/xdp/xsk.c
[PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but can be fatal for future refactors. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin Acked-by: John Fastabend --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ddf4cfc12615..3b6f82c2c271 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1577,6 +1577,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin Acked-by: John Fastabend --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3b6f82c2c271..6cef47b76cc6 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1518,6 +1518,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1551,6 +1553,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE = 1<<28, IFF_L3MDEV_RX_HANDLER = 1<<29, IFF_LIVE_RENAME_OK = 1<<30, + IFF_TX_SKB_NO_LINEAR= 1<<31, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1584,6 +1587,7 @@ enum netdev_priv_flags { #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE #define IFF_L3MDEV_RX_HANDLER IFF_L3MDEV_RX_HANDLER #define IFF_LIVE_RENAME_OK IFF_LIVE_RENAME_OK +#define IFF_TX_SKB_NO_LINEAR IFF_TX_SKB_NO_LINEAR /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo

Virtio-net supports the case where the skb linear space is empty, so
advertise this by setting the IFF_TX_SKB_NO_LINEAR priv_flag.

Signed-off-by: Xuan Zhuo
Acked-by: Michael S. Tsirkin
Signed-off-by: Alexander Lobakin
Acked-by: John Fastabend
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..f2ff6c3906c1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		return -ENOMEM;
 
 	/* Set up network device as normal. */
-	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+			   IFF_TX_SKB_NO_LINEAR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
-- 
2.30.1
[PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The
skb is allocated with a size of just desc->len, so it comes to the
driver/device with no reserved headroom and/or tailroom. Lots of drivers
need some headroom (and sometimes tailroom) to prepend (and/or append) some
headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO,
TLS etc.), and in case of no available space skb_cow_head() will reallocate
the skb. Reallocations are unwanted on the fast-path, especially when it
comes to XDP, so generic XSK xmit should reserve the space declared in
dev->needed_headroom and dev->needed_tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of
dev->hard_header_len + dev->needed_headroom, aligned by 16.
However, on XSK xmit the hard header is already in the chunk, so
hard_header_len is not needed. But it'd still be better to align data up to
a cacheline, while reserving no less than the driver requests for headroom.
NET_SKB_PAD here is to double-insure there will be no reallocations even
when the driver advertises no needed_headroom, but in fact needs it (not
such a rare case).

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
Acked-by: John Fastabend
---
 net/xdp/xsk.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
+	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+	tr = xs->dev->needed_tailroom;
+
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		char *buffer;
 		u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		}
 
 		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
+		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
 		if (unlikely(!skb))
 			goto out;
 
+		skb_reserve(skb, hr);
 		skb_put(skb, len);
+
 		addr = desc.addr;
 		buffer = xsk_buff_raw_get_data(xs->pool, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
-- 
2.30.1
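A worked example of the headroom formula above (values are illustrative and
assume NET_SKB_PAD == 64, i.e. a 64-byte L1 cache line):

	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));

	/* needed_headroom == 0   -> hr = max(64, 0)   = 64  (NET_SKB_PAD insurance)   */
	/* needed_headroom == 14  -> hr = max(64, 64)  = 64  (rounded up to cacheline) */
	/* needed_headroom == 128 -> hr = max(64, 128) = 128 (already aligned)         */

	/* the skb is then allocated as hr + desc.len + needed_tailroom, and
	 * skb_reserve(skb, hr) leaves the whole hr as headroom for the driver */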
[PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo

This patch constructs the skb directly from the pool pages to save memory
copy overhead. It is based on IFF_TX_SKB_NO_LINEAR: only when the network
card's priv_flags advertise IFF_TX_SKB_NO_LINEAR will the pages be used to
build the skb directly. If the feature is not supported, the data still has
to be copied into a linear skb.

Performance Testing

The test environment is an Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t -S -s
```
Test result data:
size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

Signed-off-by: Xuan Zhuo
Reviewed-by: Dust Li
[ alobakin:
 - expand subject to make it clearer;
 - improve skb->truesize calculation;
 - reserve some headroom in skb for drivers;
 - tailroom is not needed as skb is non-linear ]
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
Acked-by: John Fastabend
---
 net/xdp/xsk.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 96 insertions(+), 24 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 143979ea4165..a71ed664da0a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+					      struct xdp_desc *desc)
+{
+	struct xsk_buff_pool *pool = xs->pool;
+	u32 hr, len, ts, offset, copy, copied;
+	struct sk_buff *skb;
+	struct page *page;
+	void *buffer;
+	int err, i;
+	u64 addr;
+
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+
+	skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+	if (unlikely(!skb))
+		return ERR_PTR(err);
+
+	skb_reserve(skb, hr);
+
+	addr = desc->addr;
+	len = desc->len;
+	ts = pool->unaligned ? len : pool->chunk_size;
+
+	buffer = xsk_buff_raw_get_data(pool, addr);
+	offset = offset_in_page(buffer);
+	addr = buffer - pool->addrs;
+
+	for (copied = 0, i = 0; copied < len; i++) {
+		page = pool->umem->pgs[addr >> PAGE_SHIFT];
+		get_page(page);
+
+		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+		skb_fill_page_desc(skb, i, page, offset, copy);
+
+		copied += copy;
+		addr += copy;
+		offset = 0;
+	}
+
+	skb->len += len;
+	skb->data_len += len;
+	skb->truesize += ts;
+
+	refcount_add(ts, &xs->sk.sk_wmem_alloc);
+
+	return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+				     struct xdp_desc *desc)
+{
+	struct net_device *dev = xs->dev;
+	struct sk_buff *skb;
+
+	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+		skb = xsk_build_skb_zerocopy(xs, desc);
+		if (IS_ERR(skb))
+			return skb;
+	} else {
+		u32 hr, tr, len;
+		void *buffer;
+		int err;
+
+		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+		tr = dev->needed_tailroom;
+		len = desc->len;
+
+		skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+		if (unlikely(!skb))
+			return ERR_PTR(err);
+
+		skb_reserve(skb, hr);
+		skb_put(skb, len);
+
+		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			return ERR_PTR(err);
+		}
+	}
+
+	skb->dev = dev;
+	skb->priority = xs->sk.sk_priority;
+	skb->mark = xs->sk.sk_mark;
+	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+	skb->destructor = xsk_destruct_skb;
+
+	return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
@@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
-	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
-	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
-	tr = xs->dev->needed_tailroom;
-
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		char *buffer;
-		u64 addr;
-		u32 len;
-
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
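A worked trace of the frag loop in xsk_build_skb_zerocopy() above (the
numbers are made up for illustration; PAGE_SIZE is assumed to be 4096).
Take an unaligned-mode descriptor of 3000 bytes whose buffer starts 3840
bytes into its first page:

	/* i = 0: copy = min(4096 - 3840, 3000 - 0)   = 256  */
	/* i = 1: copy = min(4096 - 0,    3000 - 256) = 2744 */
	/*
	 * Two frags are attached via skb_fill_page_desc(); no payload is
	 * memcpy'ed, only page references are taken with get_page().
	 */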
Re: [RESEND PATCH net v4] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
From: Dongseok Yi Date: Sat, 30 Jan 2021 08:13:27 +0900 > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > forwarding. Only the header of head_skb from ip_finish_output_gso -> > skb_gso_segment is updated but following frag_skbs are not updated. > > A call path skb_mac_gso_segment -> inet_gso_segment -> > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > does not try to update UDP/IP header of the segment list but copy > only the MAC header. > > Update port, addr and check of each skb of the segment list in > __udp_gso_segment_list. It covers both SNAT and DNAT. > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > Signed-off-by: Dongseok Yi > Acked-by: Steffen Klassert > --- > v1: > Steffen Klassert said, there could be 2 options. > https://lore.kernel.org/patchwork/patch/1362257/ > I was trying to write a quick fix, but it was not easy to forward > segmented list. Currently, assuming DNAT only. > > v2: > Per Steffen Klassert request, moved the procedure from > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > v3: > Per Steffen Klassert request, applied fast return by comparing seg > and seg->next at the beginning of __udpv4_gso_segment_list_csum. > > Fixed uh->dest = *newport and iph->daddr = *newip to > *oldport = *newport and *oldip = *newip. > > v4: > Clear "Changes Requested" mark in > https://patchwork.kernel.org/project/netdevbpf > > Simplified the return statement in __udp_gso_segment_list. > > include/net/udp.h | 2 +- > net/ipv4/udp_offload.c | 69 > ++ > net/ipv6/udp_offload.c | 2 +- > 3 files changed, 66 insertions(+), 7 deletions(-) > > diff --git a/include/net/udp.h b/include/net/udp.h > index 877832b..01351ba 100644 > --- a/include/net/udp.h > +++ b/include/net/udp.h > @@ -178,7 +178,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, > struct sk_buff *skb, > int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup); > > struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, > - netdev_features_t features); > + netdev_features_t features, bool is_ipv6); > > static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb) > { > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index ff39e94..cfc8726 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -187,8 +187,67 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff > *skb, > } > EXPORT_SYMBOL(skb_udp_tunnel_segment); > > +static void __udpv4_gso_segment_csum(struct sk_buff *seg, > + __be32 *oldip, __be32 *newip, > + __be16 *oldport, __be16 *newport) > +{ > + struct udphdr *uh; > + struct iphdr *iph; > + > + if (*oldip == *newip && *oldport == *newport) > + return; > + > + uh = udp_hdr(seg); > + iph = ip_hdr(seg); > + > + if (uh->check) { > + inet_proto_csum_replace4(&uh->check, seg, *oldip, *newip, > + true); > + inet_proto_csum_replace2(&uh->check, seg, *oldport, *newport, > + false); > + if (!uh->check) > + uh->check = CSUM_MANGLED_0; > + } > + *oldport = *newport; > + > + csum_replace4(&iph->check, *oldip, *newip); > + *oldip = *newip; > +} > + > +static struct sk_buff *__udpv4_gso_segment_list_csum(struct sk_buff *segs) > +{ > + struct sk_buff *seg; > + struct udphdr *uh, *uh2; > + struct iphdr *iph, *iph2; > + > + seg = segs; > + uh = udp_hdr(seg); > + iph = ip_hdr(seg); > + > + if ((udp_hdr(seg)->dest == udp_hdr(seg->next)->dest) && > + (udp_hdr(seg)->source == udp_hdr(seg->next)->source) && > + (ip_hdr(seg)->daddr == ip_hdr(seg->next)->daddr) && > + (ip_hdr(seg)->saddr == 
ip_hdr(seg->next)->saddr)) > + return segs; > + > + while ((seg = seg->next)) { > + uh2 = udp_hdr(seg); > + iph2 = ip_hdr(seg); > + > + __udpv4_gso_segment_csum(seg, > + &iph2->saddr, &iph->saddr, > + &uh2->source, &uh->source); > + __udpv4_gso_segment_csum(seg, > + &iph2->daddr, &iph->daddr, > + &uh2->dest, &uh->dest); > + } > + > + return segs; > +} > + > static struct sk_buff *__udp_gso_segment_list(struct sk_buff *skb, > - netdev_features_t features) > + netdev_features_t features, > + bool is_ipv6) > { > unsigned int mss = skb_shinfo(skb)->gso_size; > > @@ -198,11 +257,11 @@ static struct sk_buff *__udp_gso_segment_list(struct > sk_buff *skb, > >
Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()
From: Jakub Kicinski
Date: Fri, 29 Jan 2021 18:39:07 -0800

> On Wed, 27 Jan 2021 20:11:23 +0000 Alexander Lobakin wrote:
> > + * dev_page_is_reserved - check whether a page can be reused for network Rx
> > + * @page: the page to test
> > + *
> > + * A page shouldn't be considered for reusing/recycling if it was allocated
> > + * under memory pressure or at a distant memory node.
> > + *
> > + * Returns true if this page should be returned to page allocator, false
> > + * otherwise.
> > + */
> > +static inline bool dev_page_is_reserved(const struct page *page)
>
> Am I the only one who feels like "reusable" is a better term than
> "reserved".

I thought about it, but this would require inverting the conditions in
most of the drivers. I decided to keep it as it is. I can redo it if
"reusable" is preferred.

Regarding "no objections to take patch 1 through net-next": patches 2-3
depend on it, so I can't put it in a separate series.

Thanks,
Al
Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()
From: Jakub Kicinski Date: Sat, 30 Jan 2021 11:07:07 -0800 > On Sat, 30 Jan 2021 15:42:29 +0000 Alexander Lobakin wrote: > > > On Wed, 27 Jan 2021 20:11:23 +0000 Alexander Lobakin wrote: > > > > + * dev_page_is_reserved - check whether a page can be reused for > > > > network Rx > > > > + * @page: the page to test > > > > + * > > > > + * A page shouldn't be considered for reusing/recycling if it was > > > > allocated > > > > + * under memory pressure or at a distant memory node. > > > > + * > > > > + * Returns true if this page should be returned to page allocator, > > > > false > > > > + * otherwise. > > > > + */ > > > > +static inline bool dev_page_is_reserved(const struct page *page) > > > > > > Am I the only one who feels like "reusable" is a better term than > > > "reserved". > > > > I thought about it, but this will need to inverse the conditions in > > most of the drivers. I decided to keep it as it is. > > I can redo if "reusable" is preferred. > > Naming is hard. As long as the condition is not a double negative it > reads fine to me, but that's probably personal preference. > The thing that doesn't sit well is the fact that there is nothing > "reserved" about a page from another NUMA node.. But again, if nobody > +1s this it's whatever... Agree on NUMA and naming. I'm a bit surprised that 95% of drivers have this helper called "reserved" (one of the reasons why I finished with this variant). Let's say, if anybody else will vote for "reusable", I'll pick it for v3. > That said can we move the likely()/unlikely() into the helper itself? > People on the internet may say otherwise but according to my tests > using __builtin_expect() on a return value of a static inline helper > works just fine. Sounds fine, this will make code more elegant. Will publish v3 soon. Thanks, Al
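A minimal standalone sketch of the point Jakub makes above about
__builtin_expect() surviving inlining (the function and variable names are
made up; this is not kernel code):

	#include <stdbool.h>

	#define likely(x)	__builtin_expect(!!(x), 1)

	/* branch hint folded into the helper, as proposed for v3 */
	static inline bool page_ok(int nid, bool pfmemalloc)
	{
		return likely(nid == 0 && !pfmemalloc);
	}

	bool can_recycle(int nid, bool pfmemalloc)
	{
		/* no extra likely()/unlikely() needed at the call site */
		return page_ok(nid, pfmemalloc);
	}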
[PATCH v3 net-next 2/5] skbuff: constify skb_propagate_pfmemalloc() "page" argument
The function doesn't write anything to the page struct itself, so this argument can be const. Misc: align second argument to the brace while at it. Signed-off-by: Alexander Lobakin Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- include/linux/skbuff.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9313b5aaf45b..b027526da4f9 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2943,8 +2943,8 @@ static inline struct page *dev_alloc_page(void) * @page: The page that was allocated from skb_alloc_page * @skb: The skb that may need pfmemalloc set */ -static inline void skb_propagate_pfmemalloc(struct page *page, -struct sk_buff *skb) +static inline void skb_propagate_pfmemalloc(const struct page *page, + struct sk_buff *skb) { if (page_is_pfmemalloc(page)) skb->pfmemalloc = true; -- 2.30.0
[PATCH v3 net-next 4/5] net: use the new dev_page_is_reusable() instead of private versions
Now we can remove a bunch of identical functions from the drivers and make them use common dev_page_is_reusable(). All {,un}likely() checks are omitted since it's already present in this helper. Also update some comments near the call sites. Suggested-by: David Rientjes Suggested-by: Jakub Kicinski Cc: John Hubbard Signed-off-by: Alexander Lobakin --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 17 ++--- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 13 - drivers/net/ethernet/intel/i40e/i40e_txrx.c | 15 +-- drivers/net/ethernet/intel/iavf/iavf_txrx.c | 15 +-- drivers/net/ethernet/intel/ice/ice_txrx.c | 13 ++--- drivers/net/ethernet/intel/igb/igb_main.c | 9 ++--- drivers/net/ethernet/intel/igc/igc_main.c | 9 ++--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++--- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 9 ++--- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 +-- 10 files changed, 23 insertions(+), 93 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 512080640cbc..f39f5b1c4cec 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2800,12 +2800,6 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, writel(i, ring->tqp->io_base + HNS3_RING_RX_RING_HEAD_REG); } -static bool hns3_page_is_reusable(struct page *page) -{ - return page_to_nid(page) == numa_mem_id() && - !page_is_pfmemalloc(page); -} - static bool hns3_can_reuse_page(struct hns3_desc_cb *cb) { return (page_count(cb->priv) - cb->pagecnt_bias) == 1; @@ -2823,10 +2817,11 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i, skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len, size - pull_len, truesize); - /* Avoid re-using remote pages, or the stack is still using the page -* when page_offset rollback to zero, flag default unreuse + /* Avoid re-using remote and pfmemalloc pages, or the stack is still +* using the page when page_offset rollback to zero, flag default +* unreuse */ - if (unlikely(!hns3_page_is_reusable(desc_cb->priv)) || + if (!dev_page_is_reusable(desc_cb->priv) || (!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) { __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias); return; @@ -3083,8 +3078,8 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, if (length <= HNS3_RX_HEAD_SIZE) { memcpy(__skb_put(skb, length), va, ALIGN(length, sizeof(long))); - /* We can reuse buffer as-is, just make sure it is local */ - if (likely(hns3_page_is_reusable(desc_cb->priv))) + /* We can reuse buffer as-is, just make sure it is reusable */ + if (dev_page_is_reusable(desc_cb->priv)) desc_cb->reuse_flag = 1; else /* This page cannot be reused so discard it */ __page_frag_cache_drain(desc_cb->priv, diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 99b8252eb969..247f44f4cb30 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -194,17 +194,12 @@ static void fm10k_reuse_rx_page(struct fm10k_ring *rx_ring, DMA_FROM_DEVICE); } -static inline bool fm10k_page_is_reserved(struct page *page) -{ - return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); -} - static bool fm10k_can_reuse_rx_page(struct fm10k_rx_buffer *rx_buffer, struct page *page, unsigned int __maybe_unused truesize) { - /* avoid re-using remote pages */ - if 
(unlikely(fm10k_page_is_reserved(page))) + /* avoid re-using remote and pfmemalloc pages */ + if (!dev_page_is_reusable(page)) return false; #if (PAGE_SIZE < 8192) @@ -265,8 +260,8 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer, if (likely(size <= FM10K_RX_HDR_LEN)) { memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long))); - /* page is not reserved, we can reuse buffer as-is */ - if (likely(!fm10k_page_is_reserved(page))) + /* page is reusable, we can reuse buffer as-is */ + if (dev_page_is_reusable(page)) return true; /* this page cannot be reused so discard it */ diff --git a/drivers/net/ethernet/intel/i40e/i4
[PATCH v3 net-next 5/5] net: page_pool: simplify page recycling condition tests
pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, this function is just a redundant wrapper over page_is_pfmemalloc(), so inline it into its sole call site. Signed-off-by: Alexander Lobakin Acked-by: Jesper Dangaard Brouer Reviewed-by: Ilias Apalodimas Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- net/core/page_pool.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f3c690b8c8e3..ad8b0707af04 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -350,14 +350,6 @@ static bool page_pool_recycle_in_cache(struct page *page, return true; } -/* page is NOT reusable when: - * 1) allocated when system is under some pressure. (page_is_pfmemalloc) - */ -static bool pool_page_reusable(struct page_pool *pool, struct page *page) -{ - return !page_is_pfmemalloc(page); -} - /* If the page refcnt == 1, this will try to recycle the page. * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for * the configured size min(dma_sync_size, pool->max_len). @@ -373,9 +365,11 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * regular page allocator APIs. * * refcnt == 1 means page_pool owns page, and can recycle it. +* +* page is NOT reusable when allocated when system is under +* some pressure. (page_is_pfmemalloc) */ - if (likely(page_ref_count(page) == 1 && - pool_page_reusable(pool, page))) { + if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) { /* Read barrier done in page_ref_count / READ_ONCE */ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) -- 2.30.0
Re: [PATCH v3 net-next 5/5] net: page_pool: simplify page recycling condition tests
From: Matthew Wilcox Date: Sun, 31 Jan 2021 12:23:48 + > On Sun, Jan 31, 2021 at 12:12:11PM +0000, Alexander Lobakin wrote: > > pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, > > this function is just a redundant wrapper over page_is_pfmemalloc(), > > so inline it into its sole call site. > > Why doesn't this want to use {dev_}page_is_reusable()? Page Pool handles NUMA on its own. Replacing plain page_is_pfmemalloc() with dev_page_is_reusable() will only add a completely redundant and always-false check on the fastpath. Al
Re: [PATCH v3 net-next 3/5] net: introduce common dev_page_is_reusable()
From: Matthew Wilcox Date: Sun, 31 Jan 2021 12:22:05 + > On Sun, Jan 31, 2021 at 12:11:52PM +0000, Alexander Lobakin wrote: > > A bunch of drivers test the page before reusing/recycling for two > > common conditions: > > - if a page was allocated under memory pressure (pfmemalloc page); > > - if a page was allocated at a distant memory node (to exclude > >slowdowns). > > > > Introduce a new common inline for doing this, with likely() already > > folded inside to make driver code a bit simpler. > > I don't see the need for the 'dev_' prefix. That actually confuses me > because it makes me think this is tied to ZONE_DEVICE or some such. Several functions right above this one also use 'dev_' prefix. It's a rather old mark that it's about network devices. > So how about calling it just 'page_is_reusable' and putting it in mm.h > with page_is_pfmemalloc() and making the comment a little less > network-centric? This pair of conditions (!pfmemalloc + local memory node) is really specific to network drivers. I didn't see any other instances of such tests, so I don't see a reason to place it in a more common mm.h. > Or call it something like skb_page_is_recyclable() since it's only used > by networking today. But I bet it could/should be used more widely. There's nothing about skb. Tested page is just a memory chunk for DMA transaction. It can be used as skb head/frag, for XDP buffer/frame or for XSK umem. > > +/** > > + * dev_page_is_reusable - check whether a page can be reused for network Rx > > + * @page: the page to test > > + * > > + * A page shouldn't be considered for reusing/recycling if it was allocated > > + * under memory pressure or at a distant memory node. > > + * > > + * Returns false if this page should be returned to page allocator, true > > + * otherwise. > > + */ > > +static inline bool dev_page_is_reusable(const struct page *page) > > +{ > > + return likely(page_to_nid(page) == numa_mem_id() && > > + !page_is_pfmemalloc(page)); > > +} > > + Al
[PATCH v3 net-next 0/5] net: consolidate page_is_pfmemalloc() usage
page_is_pfmemalloc() is used mostly by networking drivers to test if a page
can be considered for reusing/recycling.

It doesn't write anything to the struct page itself, so its sole argument
can be constified, as well as the first argument of
skb_propagate_pfmemalloc(). In Page Pool core code, it can be simply
inlined instead.

Most of the callers from NIC drivers were just doppelgangers of the same
condition tests. Factor them out into a new common function to deduplicate
the code.

Since v2 [1]:
- use more intuitive name for the new inline function since there's nothing
  "reserved" in remote pages (Jakub Kicinski, John Hubbard);
- fold likely() inside the helper itself to make driver code a bit fancier
  (Jakub Kicinski);
- split function introduction and using into two separate commits;
- collect some more tags (Jesse Brandeburg, David Rientjes).

Since v1 [0]:
- new: reduce code duplication by introducing a new common function to test
  if a page can be reused/recycled (David Rientjes);
- collect autographs for Page Pool bits (Jesper Dangaard Brouer,
  Ilias Apalodimas).

[0] https://lore.kernel.org/netdev/20210125164612.243838-1-aloba...@pm.me
[1] https://lore.kernel.org/netdev/20210127201031.98544-1-aloba...@pm.me

Alexander Lobakin (5):
  mm: constify page_is_pfmemalloc() argument
  skbuff: constify skb_propagate_pfmemalloc() "page" argument
  net: introduce common dev_page_is_reusable()
  net: use the new dev_page_is_reusable() instead of private versions
  net: page_pool: simplify page recycling condition tests

 .../net/ethernet/hisilicon/hns3/hns3_enet.c   | 17 ++--
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 13
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 15 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 15 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 13 ++--
 drivers/net/ethernet/intel/igb/igb_main.c     |  9 ++---
 drivers/net/ethernet/intel/igc/igc_main.c     |  9 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  9 ++---
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  9 ++---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  7 +--
 include/linux/mm.h                            |  2 +-
 include/linux/skbuff.h                        | 20 +--
 net/core/page_pool.c                          | 14 -
 13 files changed, 46 insertions(+), 106 deletions(-)

-- 
2.30.0
[PATCH v3 net-next 1/5] mm: constify page_is_pfmemalloc() argument
The function only tests for page->index, so its argument should be const. Signed-off-by: Alexander Lobakin Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..078633d43af9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1584,7 +1584,7 @@ struct address_space *page_mapping_file(struct page *page); * ALLOC_NO_WATERMARKS and the low watermark was not * met implying that the system is under some pressure. */ -static inline bool page_is_pfmemalloc(struct page *page) +static inline bool page_is_pfmemalloc(const struct page *page) { /* * Page index cannot be this large so this must be -- 2.30.0