Re: [PATCH v1 1/1] bitops: Share BYTES_TO_BITS() for everyone
From: Yury Norov Date: Sun, 10 Sep 2023 07:07:16 -0700 > On Wed, Sep 06, 2023 at 05:54:26PM +0300, Andy Shevchenko wrote: >> On Wed, Sep 06, 2023 at 04:40:39PM +0200, Alexander Lobakin wrote: >>> From: Andy Shevchenko >>> Date: Thu, 31 Aug 2023 16:21:30 +0300 >>>> On Fri, Aug 25, 2023 at 04:49:07PM +0200, Alexander Lobakin wrote: >>>>> From: Andy Shevchenko >>>>> Date: Thu, 24 Aug 2023 15:37:28 +0300 >>>>> >>>>>> It may be new callers for the same macro, share it. >>>>>> >>>>>> Note, it's unknown why it's represented in the current form instead of >>>>>> simple multiplication and commit 1ff511e35ed8 ("tracing/kprobes: Add >>>>>> bitfield type") doesn't explain that neither. Let leave it as is and >>>>>> we may improve it in the future. >>>>> >>>>> Maybe symmetrical change in tools/ like I did[0] an aeon ago? >>>> >>>> Hmm... Why can't you simply upstream your version? It seems better than >>>> mine. >>> >>> It was a part of the Netlink bigint API which is a bit on hold for now >>> (I needed this macro available treewide). >>> But I can send it as standalone if you're fine with that. >> >> I'm fine. Yury? > > Do we have opencoded BYTES_TO_BITS() somewhere else? If so, it should be > fixed in the same series. Treewide -- a ton. We could add it so that devs could start using it and stop open-coding :D > > Regarding implementation, the current: > > #define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long)) > > looks weird. Maybe there are some special considerations in a tracing > subsystem to make it like this, but as per Masami's email - there's > not. > > For a general purpose I'd suggest a simpler: > #define BYTES_TO_BITS(nb) ((nb) * BITS_PER_BYTE) I also didn't notice anything that would require using logic more complex than this one. It would probably make more sense to define it that way when moving. > > Thanks, > Yury Thanks, Olek
Re: [PATCH v3] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms
From: Alessandro Carminati (Red Hat) Date: Mon, 28 Aug 2023 08:04:23 + > From: Alessandro Carminati > > It is not uncommon for drivers or modules related to similar peripherals > to have symbols with the exact same name. [...] > Changes from v2: > - Alias tags are created by querying DWARF information from the vmlinux. > - The filename + line number is normalized and appended to the original name. > - The tag begins with '@' to indicate the symbol source. > - Not a change, but worth mentioning, since the alias is added to the existing > list, the old duplicated name is preserved, and the livepatch way of dealing > with duplicates is maintained. > - Acknowledging the existence of scenarios where inlined functions declared in > header files may result in multiple copies due to compiler behavior, though >it is not actionable as it does not pose an operational issue. > - Highlighting a single exception where the same name refers to different > functions: the case of "compat_binfmt_elf.c," which directly includes > "binfmt_elf.c" producing identical function copies in two separate > modules. Oh, I thought you managed to handle this in v3 since you didn't reply in the previous thread... > > sample from new v3 > > ~ # cat /proc/kallsyms | grep gic_mask_irq > d0b03c04dae4 t gic_mask_irq > d0b03c04dae4 t gic_mask_irq@_drivers_irqchip_irq-gic_c_167 > d0b03c050960 t gic_mask_irq > d0b03c050960 t gic_mask_irq@_drivers_irqchip_irq-gic-v3_c_404 BTW, why normalize them? Why not just gic_mask_irq@drivers/irqchip/... And why line number? Line numbers break reproducible builds and also would make it harder to refer to a particular symbol by its path and name since we also have to pass its line number which may change once you add a debug print there, for example. OTOH there can't be 2 symbols with the same name within one file, so just path + name would be enough. Or not? (sorry if some of this was already discussed previously) [...] Thanks, Olek
[PATCH mips-next] vmlinux.lds.h: catch more UBSAN symbols into .data
LKP triggered lots of LD orphan warnings [0]: mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data299' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data299' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data183' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data183' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type3' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type3' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type2' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type2' mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type0' from `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type0' [...] Seems like "unnamed data" isn't the only type of symbols that UBSAN instrumentation can emit. Catch these into .data with the wildcard as well. [0] https://lore.kernel.org/linux-mm/202102160741.k57gcnsr-...@intel.com Fixes: f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into data") Reported-by: kernel test robot Signed-off-by: Alexander Lobakin --- include/asm-generic/vmlinux.lds.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index cc659e77fcb0..83537e5ee78f 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -95,7 +95,7 @@ */ #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* -- 2.30.1
[PATCH v4 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v4 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. 
Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 113 ++- 3 files changed, 173 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v4 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v4 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index fa4ab77ce81e..86e19f62f978 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE <= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v4 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
[PATCH v4 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..fa4ab77ce81e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FAILOVER IFF_FAILOVER -#define IFF_FAILOVER_SLAV
[PATCH v4 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin --- net/xdp/xsk.c | 119 -- 1 file changed, 95 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..ff7bd06e1241 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; + + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +544,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN; goto out; } -
Re: [PATCH v4 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Magnus Karlsson Date: Tue, 16 Feb 2021 15:08:26 +0100 > On Tue, Feb 16, 2021 at 12:44 PM Alexander Lobakin wrote: > > > > From: Xuan Zhuo > > > > This patch is used to construct skb based on page to save memory copy > > overhead. > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > directly construct skb. If this feature is not supported, it is still > > necessary to copy data to construct skb. > > > > Performance Testing > > > > The test environment is Aliyun ECS server. > > Test cmd: > > ``` > > xdpsock -i eth0 -t -S -s > > ``` > > > > Test result data: > > > > size64 512 10241500 > > copy1916747 1775988 1600203 1440054 > > page1974058 1953655 1945463 1904478 > > percent 3.0%10.0% 21.58% 32.3% > > > > Signed-off-by: Xuan Zhuo > > Reviewed-by: Dust Li > > [ alobakin: > > - expand subject to make it clearer; > > - improve skb->truesize calculation; > > - reserve some headroom in skb for drivers; > > - tailroom is not needed as skb is non-linear ] > > Signed-off-by: Alexander Lobakin > > Thank you Alexander! > > Acked-by: Magnus Karlsson Thanks! I have one more generic zerocopy to offer (inspired by this series) that wouldn't require IFF_TX_SKB_NO_LINEAR, only a capability to xmit S/G packets that almost every NIC has. I'll publish an RFC once this and your upcoming changes get merged. > > --- > > net/xdp/xsk.c | 119 -- > > 1 file changed, 95 insertions(+), 24 deletions(-) > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > index 143979ea4165..ff7bd06e1241 100644 > > --- a/net/xdp/xsk.c > > +++ b/net/xdp/xsk.c > > @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > sock_wfree(skb); > > } > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > + struct xdp_desc *desc) > > +{ > > + struct xsk_buff_pool *pool = xs->pool; > > + u32 hr, len, offset, copy, copied; > > + struct sk_buff *skb; > > + struct page *page; > > + void *buffer; > > + int err, i; > > + u64 addr; > > + > > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); > > + > > + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); > > + if (unlikely(!skb)) > > + return ERR_PTR(err); > > + > > + skb_reserve(skb, hr); > > + > > + addr = desc->addr; > > + len = desc->len; > > + > > + buffer = xsk_buff_raw_get_data(pool, addr); > > + offset = offset_in_page(buffer); > > + addr = buffer - pool->addrs; > > + > > + for (copied = 0, i = 0; copied < len; i++) { > > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; > > + get_page(page); > > + > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > + skb_fill_page_desc(skb, i, page, offset, copy); > > + > > + copied += copy; > > + addr += copy; > > + offset = 0; > > + } > > + > > + skb->len += len; > > + skb->data_len += len; > > + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; > > + > > + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); > > + > > + return skb; > > +} > > + > > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > > +struct xdp_desc *desc) > > +{ > > + struct net_device *dev = xs->dev; > > + struct sk_buff *skb; > > + > > + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { > > + skb = xsk_build_skb_zerocopy(xs, desc); > > + if (IS_ERR(skb)) > > + return skb; > > + } else { > > + u32 hr, tr, len; > > + void *buffer; > > + int err; > > + > > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); > > + tr = dev->needed_tailroom; > > + len = desc->len; > > + > > + skb = sock_alloc_send_skb
[PATCH v5 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v4 [1]: - fix 0002 build error due to inverted static_assert() condition (0day bot); - collect two Acked-bys (Magnus). 
>From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com [1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 113 ++- 3 files changed, 173 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v5 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..0a9b2b31f411 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FA
[PATCH v5 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v5 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0a9b2b31f411..ecaf67efab5b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v5 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
[PATCH v5 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v5 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 119 -- 1 file changed, 95 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..ff7bd06e1241 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += pool->unaligned ? 
len : pool->chunk_size; + + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +544,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN; goto out;
Re: [PATCH v5 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Alexander Lobakin Date: Tue, 16 Feb 2021 14:35:02 + > From: Xuan Zhuo > > This patch is used to construct skb based on page to save memory copy > overhead. > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > directly construct skb. If this feature is not supported, it is still > necessary to copy data to construct skb. > > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data: > > size64 512 10241500 > copy1916747 1775988 1600203 1440054 > page1974058 1953655 1945463 1904478 > percent 3.0%10.0% 21.58% 32.3% > > Signed-off-by: Xuan Zhuo > Reviewed-by: Dust Li > [ alobakin: > - expand subject to make it clearer; > - improve skb->truesize calculation; > - reserve some headroom in skb for drivers; > - tailroom is not needed as skb is non-linear ] > Signed-off-by: Alexander Lobakin > Acked-by: Magnus Karlsson > --- > net/xdp/xsk.c | 119 -- > 1 file changed, 95 insertions(+), 24 deletions(-) > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 143979ea4165..ff7bd06e1241 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb) > sock_wfree(skb); > } > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + struct xsk_buff_pool *pool = xs->pool; > + u32 hr, len, offset, copy, copied; > + struct sk_buff *skb; > + struct page *page; > + void *buffer; > + int err, i; > + u64 addr; > + > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); > + > + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + skb_reserve(skb, hr); > + > + addr = desc->addr; > + len = desc->len; > + > + buffer = xsk_buff_raw_get_data(pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - pool->addrs; > + > + for (copied = 0, i = 0; copied < len; i++) { > + page = pool->umem->pgs[addr >> PAGE_SHIFT]; > + get_page(page); > + > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += pool->unaligned ? len : pool->chunk_size; > + > + refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc); Meh, there's a refcount leak here I accidentally introduced in v4. Sorry for that, I'll upload v6 in just a moment. 
> + return skb; > +} > + > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + struct net_device *dev = xs->dev; > + struct sk_buff *skb; > + > + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { > + skb = xsk_build_skb_zerocopy(xs, desc); > + if (IS_ERR(skb)) > + return skb; > + } else { > + u32 hr, tr, len; > + void *buffer; > + int err; > + > + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); > + tr = dev->needed_tailroom; > + len = desc->len; > + > + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + skb_reserve(skb, hr); > + skb_put(skb, len); > + > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); > + err = skb_store_bits(skb, 0, buffer, len); > + if (unlikely(err)) { > + kfree_skb(skb); > + return ERR_PTR(err); > + } > + } > + > + skb->dev = dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; > +} > + > static int xsk_generic_xmit(struct sock *sk)
[PATCH v6 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. >From v4 [1]: - fix 0002 build error due to inverted static_assert() condition (0day bot); - collect two Acked-bys (Magnus). 
>From v3 [0]: - refactor netdev_priv_flags to make it easier to add new ones and prevent bitwidth overflow; - add headroom (both standard and zerocopy) and tailroom (standard) reservation in skb for drivers to avoid potential reallocations; - fix skb->truesize accounting; - misc comment rewords. [0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com [1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me Alexander Lobakin (3): netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition netdevice: check for net_device::priv_flags bitfield overflow xsk: respect device's headroom and tailroom on generic xmit path Xuan Zhuo (3): net: add priv_flags for allow tx skb without linear virtio-net: support IFF_TX_SKB_NO_LINEAR xsk: build skb by page (aka generic zerocopy xmit) drivers/net/virtio_net.c | 3 +- include/linux/netdevice.h | 138 +- net/xdp/xsk.c | 114 ++- 3 files changed, 174 insertions(+), 81 deletions(-) -- 2.30.1
[PATCH v6 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but comes fatal for the subsequent patch. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b9bcbfde7849..b895973390ee 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1584,6 +1584,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v6 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 135 -- 1 file changed, 72 insertions(+), 63 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b895973390ee..0a9b2b31f411 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1527,70 +1527,79 @@ struct net_device_ops { * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + + NETDEV_PRIV_FLAG_COUNT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM -#define IFF_L3MDEV_MASTER IFF_L3MDEV_MASTER -#define IFF_NO_QUEUE IFF_NO_QUEUE -#define IFF_OPENVSWITCHIFF_OPENVSWITCH -#define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE -#define IFF_TEAM IFF_TEAM -#define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED -#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM -#define IFF_MACSEC IFF_MACSEC -#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER -#define IFF_FA
[PATCH v6 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0a9b2b31f411..ecaf67efab5b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN_BIT, @@ -1558,6 +1560,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1600,6 +1603,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
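The contract this flag establishes can be sketched as follows; the helpers below are hypothetical and only illustrate the intent, they are not part of the patch. A capable driver advertises the flag at probe time, and a transmit path that built a head-less skb falls back to copying when the flag is absent.

/* Hedged sketch of both sides of the contract (hypothetical helpers) */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void example_probe_advertise(struct net_device *dev)
{
	dev->priv_flags |= IFF_TX_SKB_NO_LINEAR;	/* "I can xmit skbs with empty linear space" */
}

static bool example_needs_linear_copy(const struct net_device *dev,
				      const struct sk_buff *skb)
{
	return skb_headlen(skb) == 0 &&
	       !(dev->priv_flags & IFF_TX_SKB_NO_LINEAR);
}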
[PATCH v6 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v6 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The new skb is allocated with desc->len bytes only, so it comes to the driver/device with no reserved headroom and/or tailroom.
Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and in case of no available space skb_cow_head() will reallocate the skb.

Reallocations are unwanted on the fast path, especially when it comes to XDP, so generic XSK xmit should reserve the space declared in dev->needed_headroom and dev->needed_tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16.
However, on XSK xmit the hard header is already in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to a cacheline, while reserving no less than the driver requests for headroom.
NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact needs it (not such a rare case).

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
---
 net/xdp/xsk.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
+	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+	tr = xs->dev->needed_tailroom;
+
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		char *buffer;
 		u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		}
 
 		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
+		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
 		if (unlikely(!skb))
 			goto out;
 
+		skb_reserve(skb, hr);
 		skb_put(skb, len);
+
 		addr = desc.addr;
 		buffer = xsk_buff_raw_get_data(xs->pool, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
-- 
2.30.1
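For illustration, a typical driver xmit prologue looks roughly like the sketch below (hypothetical helper, not code from this series); without the reservation above, skb_cow_head() would reallocate skb->head right here on the fast path.

/* Hypothetical driver prologue: prepend a device-specific header,
 * reallocating only if the skb arrived without enough headroom.
 */
#include <linux/skbuff.h>
#include <linux/string.h>

static int example_prepend_hw_header(struct sk_buff *skb, unsigned int hdr_len)
{
	if (skb_cow_head(skb, hdr_len))		/* reallocates if headroom < hdr_len */
		return -ENOMEM;

	memset(__skb_push(skb, hdr_len), 0, hdr_len);	/* prepend and fill the header */
	return 0;
}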
[PATCH v6 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 120 -- 1 file changed, 96 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..a71ed664da0a 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, ts, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + ts = pool->unaligned ? len : pool->chunk_size; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += ts; + + refcount_add(ts, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, 
L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN;
Re: [PATCH mips-next] vmlinux.lds.h: catch more UBSAN symbols into .data
From: Nick Desaulniers Date: Tue, 16 Feb 2021 09:56:32 -0800 > On Tue, Feb 16, 2021 at 12:55 AM Alexander Lobakin wrote: > > > > LKP triggered lots of LD orphan warnings [0]: > > Thanks for the patch, just some questions. > > With which linker? Was there a particular config from the bot's > report that triggered this? All the info can be found by going through the link from the commit message. Compiler was GCC 9.3, so I suppose BFD was used as a linker. I mentioned CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y in the attached dotconfig, the warnings and the fix are relevant only for this case. > > > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data299' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data299' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_data183' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_data183' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type3' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type3' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type2' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type2' > > mipsel-linux-ld: warning: orphan section `.data.$Lubsan_type0' from > > `init/do_mounts_rd.o' being placed in section `.data.$Lubsan_type0' > > > > [...] > > > > Seems like "unnamed data" isn't the only type of symbols that UBSAN > > instrumentation can emit. > > Catch these into .data with the wildcard as well. > > > > [0] https://lore.kernel.org/linux-mm/202102160741.k57gcnsr-...@intel.com > > > > Fixes: f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into > > data") > > Reported-by: kernel test robot > > Signed-off-by: Alexander Lobakin > > --- > > include/asm-generic/vmlinux.lds.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/include/asm-generic/vmlinux.lds.h > > b/include/asm-generic/vmlinux.lds.h > > index cc659e77fcb0..83537e5ee78f 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -95,7 +95,7 @@ > > */ > > #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION > > #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* > > -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > > .data..compoundliteral* .data.$__unnamed_* > > +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > > .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* > > Are these sections only created when > CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected? (Same with > .data.$__unnamed_*) > > > #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* > > #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* > > #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* > > -- > > 2.30.1 > > > > > > > -- > Thanks, > ~Nick Desaulniers Al
Re: [GIT PULL] clang-lto for v5.12-rc1
From: Kees Cook Date: Tue, 16 Feb 2021 12:34:37 -0800 > Hi Linus, > > Please pull this Clang Link Time Optimization series for v5.12-rc1. This > has been in linux-next for the entire last development cycle, and is > built on the work done preparing[0] for LTO by arm64 folks, tracing folks, > etc. This series includes the core changes as well as the remaining pieces > for arm64 (LTO has been the default build method on Android for about > 3 years now, as it is the prerequisite for the Control Flow Integrity > protections). While x86 LTO support is done[1], there is still some > on-going clean-up work happening for objtool[2] that should hopefully > land by the v5.13 merge window. > > For merge log posterity, and as detailed in commit dc5723b02e52 ("kbuild: > add support for Clang LTO"), here is the lt;dr to do an LTO build: > > make LLVM=1 LLVM_IAS=1 defconfig > scripts/config -e LTO_CLANG_THIN > make LLVM=1 LLVM_IAS=1 > > (To do a cross-compile of arm64, add "CROSS_COMPILE=aarch64-linux-gnu-" > and "ARCH=arm64" to the "make" command lines.) > > Thanks! > > -Kees > > [0] https://git.kernel.org/linus/3c09ec59cdea5b132212d97154d625fd34e436dd > [1] https://github.com/samitolvanen/linux/commits/clang-lto > [2] https://lore.kernel.org/lkml/cover.1611263461.git.jpoim...@redhat.com/ > > The following changes since commit e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62: > > Linux 5.11-rc2 (2021-01-03 15:55:30 -0800) > > are available in the Git repository at: > > https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git > tags/clang-lto-v5.12-rc1 > > for you to fetch changes up to 112b6a8e038d793d016e330f53acb9383ac504b3: > > arm64: allow LTO to be selected (2021-01-14 08:21:10 -0800) > > > clang-lto for v5.12-rc1 > > Provide build infrastructure for arm64 Clang LTO. > > > Sami Tolvanen (16): > tracing: move function tracer options to Kconfig > kbuild: add support for Clang LTO > kbuild: lto: fix module versioning > kbuild: lto: limit inlining > kbuild: lto: merge module sections > kbuild: lto: add a default list of used symbols > init: lto: ensure initcall ordering > init: lto: fix PREL32 relocations > PCI: Fix PREL32 relocations for LTO > modpost: lto: strip .lto from module names > scripts/mod: disable LTO for empty.c > efi/libstub: disable LTO > drivers/misc/lkdtm: disable LTO for rodata.o > arm64: vdso: disable LTO > arm64: disable recordmcount with DYNAMIC_FTRACE_WITH_REGS > arm64: allow LTO to be selected Seems like you forgot the fix from [0], didn't you? > .gitignore| 1 + > Makefile | 45 -- > arch/Kconfig | 90 > arch/arm64/Kconfig| 4 + > arch/arm64/kernel/vdso/Makefile | 3 +- > drivers/firmware/efi/libstub/Makefile | 2 + > drivers/misc/lkdtm/Makefile | 1 + > include/asm-generic/vmlinux.lds.h | 11 +- > include/linux/init.h | 79 -- > include/linux/pci.h | 27 +++- > init/Kconfig | 1 + > kernel/trace/Kconfig | 16 ++ > scripts/Makefile.build| 48 +- > scripts/Makefile.lib | 6 +- > scripts/Makefile.modfinal | 9 +- > scripts/Makefile.modpost | 25 +++- > scripts/generate_initcall_order.pl| 270 > ++ > scripts/link-vmlinux.sh | 70 +++-- > scripts/lto-used-symbollist.txt | 5 + > scripts/mod/Makefile | 1 + > scripts/mod/modpost.c | 16 +- > scripts/mod/modpost.h | 9 ++ > scripts/mod/sumversion.c | 6 +- > scripts/module.lds.S | 24 +++ > 24 files changed, 707 insertions(+), 62 deletions(-) > create mode 100755 scripts/generate_initcall_order.pl > create mode 100644 scripts/lto-used-symbollist.txt > > -- > Kees Cook > [0] https://lore.kernel.org/lkml/20210121184544.659998-1-aloba...@pm.me Al
[BUG] net: core: netif_receive_skb_list() crash on non-standard ptypes forwarding
Hi Edward, Seems like I've found another poisoned skb->next crash with netif_receive_skb_list(). This is similar to the one than has been already fixed in 22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists"). This one however applies only to non-standard ptypes (in my case -- ETH_P_XDSA). I use simple VLAN NAT setup through nft. After switching my in-dev driver to netif_receive_skb_list(), system started to crash on forwarding: [ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 000e, epc == 80687078, ra == 8052cc7c [ 88.618666] Oops[#1]: [ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473 [ 88.630885] $ 0 : 1400 0002 864d7850 [ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 [ 88.642526] $ 8 : 4960 0001 0001 [ 88.648342] $12 : c288617b dadbee27 25d17c41 [ 88.654159] $16 : 87c0ddf0 85cff080 8079 fffd [ 88.659975] $20 : 80797b20 0001 864d7800 [ 88.665793] $24 : 8011e658 [ 88.671609] $28 : 8079 87c0dbc0 87cabf00 8052cc7c [ 88.677427] Hi : 0003 [ 88.680622] Lo : 7b5b4220 [ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188 [ 88.696734] Status: 1404 IEp [ 88.700422] Cause : 5008 (ExcCode 02) [ 88.704874] BadVA : 000e [ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi)) [ 88.713005] Modules linked in: [ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=) [ 88.725219] Stack : 85f61c28 000e 8078 87c0ddf0 85cff080 8079 8052cc7c [ 88.734529] 87cabf00 0001 85f5fb40 807b 864d7850 87cabf00 807d [ 88.743839] 864d7800 8655f600 85cff080 87c1c000 006a 8052d96c [ 88.753149] 807a 8057adb8 87c0dcc8 87c0dc50 85cfff08 0558 87cabf00 85f58c50 [ 88.762460] 0002 85f58c00 864d7800 80543308 fff4 0001 85f58c00 864d7800 [ 88.771770] ... [ 88.774483] Call Trace: [ 88.777199] [<80687078>] vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.783504] [<8052cc7c>] dev_hard_start_xmit+0xac/0x188 [ 88.789326] [<8052d96c>] __dev_queue_xmit+0x6e8/0x7d4 [ 88.794955] [<805a8640>] ip_finish_output2+0x238/0x4d0 [ 88.800677] [<805ab6a0>] ip_output+0xc8/0x140 [ 88.805526] [<805a68f4>] ip_forward+0x364/0x560 [ 88.810567] [<805a4ff8>] ip_rcv+0x48/0xe4 [ 88.815030] [<80528d44>] __netif_receive_skb_one_core+0x44/0x58 [ 88.821635] [<8067f220>] dsa_switch_rcv+0x108/0x1ac [ 88.827067] [<80528f80>] __netif_receive_skb_list_core+0x228/0x26c [ 88.833951] [<8052ed84>] netif_receive_skb_list+0x1d4/0x394 [ 88.840160] [<80355a88>] lunar_rx_poll+0x38c/0x828 [ 88.845496] [<8052fa78>] net_rx_action+0x14c/0x3cc [ 88.850835] [<806ad300>] __do_softirq+0x178/0x338 [ 88.856077] [<8012a2d4>] irq_exit+0xbc/0x100 [ 88.860846] [<802f8b70>] plat_irq_dispatch+0xc0/0x144 [ 88.866477] [<80105974>] handle_int+0x14c/0x158 [ 88.871516] [<806acfb0>] r4k_wait+0x30/0x40 [ 88.876462] Code: afb10014 8c8200a0 00803025 <9443000c> 94a20468 10620042 00a08025 9605046a [ 88.887332] [ 88.888982] ---[ end trace eb863d007da11cf1 ]--- [ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt [ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Some additional debug have showed that skb->next is poisoned on dsa_switch_rcv() -- ETH_P_XDSA ptype .func() callback. So when skb enters dev_hard_start_xmit(), function tries to "schedule" backpointer to list_head for transmitting. Here's a working possible fix for that, not sure if it can break anything though. 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2b67f2aa59dd..fdcff29df915 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5014,8 +5014,10 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 	if (pt_prev->list_func != NULL)
 		pt_prev->list_func(head, pt_prev, orig_dev);
 	else
-		list_for_each_entry_safe(skb, next, head, list)
+		list_for_each_entry_safe(skb, next, head, list) {
+			skb_list_del_init(skb);
 			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		}
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)

Maybe you could look into this and find another/better solution (or I could submit this one if that's good enough).

BTW, great work with netif_receive_skb_list() -- I've got a 70 Mbps gain (~15%) on my setup in comparison to napi_gro_receive().

Thanks,
Alexander.

Regards,
ᚷ ᛖ ᚢ ᚦ ᚠᚱ
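For context, skb_list_del_init() used in the hunk above roughly amounts to the following paraphrased sketch (not the verbatim include/linux/skbuff.h code): since skb->next and skb->list share storage, an skb pulled off the Rx sublist must have its list linkage cleared before it reaches code that treats skb->next as a plain pointer, such as dev_hard_start_xmit() in the trace above.

/* Paraphrased sketch of the helper's effect */
#include <linux/skbuff.h>

static inline void skb_list_del_init_sketch(struct sk_buff *skb)
{
	__list_del_entry(&skb->list);	/* unlink from the per-ptype sublist */
	skb->next = NULL;		/* don't leave a stale list pointer in skb->next */
}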
Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Paolo Abeni Date: Thu, 11 Feb 2021 15:55:04 +0100 > On Thu, 2021-02-11 at 14:28 +0000, Alexander Lobakin wrote: > > From: Paolo Abeni on Thu, 11 Feb 2021 11:16:40 +0100 > > wrote: > > > What about changing __napi_alloc_skb() to always use > > > the __napi_build_skb(), for both kmalloc and page backed skbs? That is, > > > always doing the 'data' allocation in __napi_alloc_skb() - either via > > > page_frag or via kmalloc() - and than call __napi_build_skb(). > > > > > > I think that should avoid adding more checks in __alloc_skb() and > > > should probably reduce the number of conditional used > > > by __napi_alloc_skb(). > > > > I thought of this too. But this will introduce conditional branch > > to set or not skb->head_frag. So one branch less in __alloc_skb(), > > one branch more here, and we also lose the ability to __alloc_skb() > > with decached head. > > Just to try to be clear, I mean something alike the following (not even > build tested). In the fast path it has less branches than the current > code - for both kmalloc and page_frag allocation. > > --- > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 785daff48030..a242fbe4730e 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -506,23 +506,12 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct > *napi, unsigned int len, >gfp_t gfp_mask) > { > struct napi_alloc_cache *nc; > + bool head_frag, pfmemalloc; > struct sk_buff *skb; > void *data; > > len += NET_SKB_PAD + NET_IP_ALIGN; > > - /* If requested length is either too small or too big, > - * we use kmalloc() for skb->head allocation. > - */ > - if (len <= SKB_WITH_OVERHEAD(1024) || > - len > SKB_WITH_OVERHEAD(PAGE_SIZE) || > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); > - if (!skb) > - goto skb_fail; > - goto skb_success; > - } > - > nc = this_cpu_ptr(&napi_alloc_cache); > len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > len = SKB_DATA_ALIGN(len); > @@ -530,25 +519,34 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct > *napi, unsigned int len, > if (sk_memalloc_socks()) > gfp_mask |= __GFP_MEMALLOC; > > - data = page_frag_alloc(&nc->page, len, gfp_mask); > + if (len <= SKB_WITH_OVERHEAD(1024) || > +len > SKB_WITH_OVERHEAD(PAGE_SIZE) || > +(gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > + data = kmalloc_reserve(len, gfp_mask, NUMA_NO_NODE, > &pfmemalloc); > + head_frag = 0; > + len = 0; > + } else { > + data = page_frag_alloc(&nc->page, len, gfp_mask); > + pfmemalloc = nc->page.pfmemalloc; > + head_frag = 1; > + } > if (unlikely(!data)) > return NULL; Sure. I have a separate WIP series that reworks all three *alloc_skb() functions, as there's a nice room for optimization, especially after that tiny skbs now fall back to __alloc_skb(). It will likely hit mailing lists after the merge window and next net-next season, not now. And it's not really connected with NAPI cache reusing. > skb = __build_skb(data, len); > if (unlikely(!skb)) { > - skb_free_frag(data); > + if (head_frag) > + skb_free_frag(data); > + else > + kfree(data); > return NULL; > } > > - if (nc->page.pfmemalloc) > - skb->pfmemalloc = 1; > - skb->head_frag = 1; > + skb->pfmemalloc = pfmemalloc; > + skb->head_frag = head_frag; > > -skb_success: > skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); > skb->dev = napi->dev; > - > -skb_fail: > return skb; > } > EXPORT_SYMBOL(__napi_alloc_skb); Al
[PATCH v5 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have a percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks.

We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and the veth driver already do).

As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via:
 - a new napi_build_skb() function (as a replacement for build_skb());
 - existing {,__}napi_alloc_skb() and napi_get_frags() functions;
 - __alloc_skb() with passing SKB_ALLOC_NAPI in flags.

iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps.

Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
 - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, a skbuff_head from a remote node is an OK case;
 - The easiest way to check if the slab of a skbuff_head is remote or pfmemalloc'ed is:

	if (!dev_page_is_reusable(virt_to_head_page(skb)))
		/* drop it */;

   ...*but*, since most slabs are built of compound pages, virt_to_head_page() will hit its unlikely branch on every single call. This check cost at least 20 Mbps in test scenarios and it seems like it'd be better to _not_ do this.

Since v4 [3]:
 - rebase on top of net-next and address kernel build robot issue;
 - reorder checks a bit in __alloc_skb() to make the new condition even more harmless.

Since v3 [2]:
 - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb();
 - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough;
 - don't waste cycles on an explicit in_serving_softirq() check.

Since v2 [1]:
 - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skb requests to the kmalloc layer);
 - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov);
 - completely drop redundant __kfree_skb_flush() (also Eric);
 - lots of code cleanups;
 - expand the commit message with NUMA and pfmemalloc points (Jakub).

Since v1 [0]:
 - use one unified cache instead of two separate ones to greatly simplify the logic and reduce hotpath overhead (Edward Cree);
 - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing;
 - correct performance numbers after optimizations and performing lots of tests for different use cases.
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me
[1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210210162732.80467-1-aloba...@pm.me

Alexander Lobakin (11):
  skbuff: move __alloc_skb() next to the other skb allocation functions
  skbuff: simplify kmalloc_reserve()
  skbuff: make __build_skb_around() return void
  skbuff: simplify __alloc_skb() a bit
  skbuff: use __build_skb_around() in __alloc_skb()
  skbuff: remove __kfree_skb_flush()
  skbuff: move NAPI cache declarations upper in the file
  skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
  skbuff: allow to optionally use NAPI cache from __alloc_skb()
  skbuff: allow to use NAPI cache from __napi_alloc_skb()
  skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing

 include/linux/skbuff.h | 4 +-
 net/core/dev.c         | 16 +-
 net/core/skbuff.c      | 429 +++--
 3 files changed, 243 insertions(+), 206 deletions(-)

-- 
2.30.1
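From a driver's perspective, opting in looks roughly like the sketch below; the helper and its parameters are hypothetical and not a patch from this series, it only shows that napi_build_skb() is a drop-in for build_skb() inside NAPI poll context.

/* Hypothetical Rx-path helper: swap build_skb() for napi_build_skb(),
 * keep the rest of the buffer handling unchanged.
 */
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *example_rx_build(struct napi_struct *napi, void *data,
					unsigned int truesize, unsigned int len,
					unsigned int headroom)
{
	struct sk_buff *skb;

	/* was: skb = build_skb(data, truesize); */
	skb = napi_build_skb(data, truesize);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, headroom);
	skb_put(skb, len);
	skb->protocol = eth_type_trans(skb, napi->dev);

	return skb;
}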
[PATCH v5 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v5 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller().
Remove the redundant macro and rename the function after it.

Signed-off-by: Alexander Lobakin
---
 net/core/skbuff.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0f846872d19..70289f22a6f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align);
  * may be used. Otherwise, the packet data may be discarded until enough
  * memory is free
  */
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
-__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
-			       unsigned long ip, bool *pfmemalloc)
+static void *kmalloc_reserve(size_t size, gfp_t flags, int node,
+			     bool *pfmemalloc)
 {
 	void *obj;
 	bool ret_pfmemalloc = false;
-- 
2.30.1
[PATCH v5 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v5 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v5 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 7 +-- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 19 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 321d41a110e7..4154d4683bb9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) __napi_poll(napi, &repoll); netpoll_poll_unlock(have); - __kfree_skb_flush(); local_bh_enable(); if (!repoll) @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v5 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v5 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v5 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
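The matching put side of the cache is truncated in the hunk above. Consistent with the commit message (poison on put, bulk-wipe the second half of the 64-entry cache once it fills up), it looks roughly like the sketch below; this is a reconstruction for readability, not guaranteed to match the upstream hunk byte for byte.

/* Sketch of the put side described above: poison the returned head and,
 * when the cache is full, unpoison and bulk-free its second half (32
 * entries) back to the slab.
 */
static void napi_skb_cache_put_sketch(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}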
[PATCH v5 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..a0b457ae87c2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, struct sk_buff *skb; u8 *data; bool pfmemalloc; + bool clone; - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; + clone = !!(flags & SKB_ALLOC_FCLONE); + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if ((flags & SKB_ALLOC_NAPI) && !clone && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - if (flags & SKB_ALLOC_FCLONE) { + if (clone) { struct sk_buff_fclones *fclones; fclones = container_of(skb, struct sk_buff_fclones, skb1); -- 2.30.1
[PATCH v5 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a0b457ae87c2..c8f3ea1d9fbb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -563,7 +563,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -580,7 +581,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
[PATCH v5 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 4154d4683bb9..6d2c7ae90a23 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6095,13 +6095,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6115,7 +6108,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c8f3ea1d9fbb..85f0768a1144 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -917,9 +917,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -936,6 +933,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -961,6 +966,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
Re: [PATCH v5 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Alexander Duyck Date: Thu, 11 Feb 2021 19:18:45 -0800 > On Thu, Feb 11, 2021 at 11:00 AM Alexander Lobakin wrote: > > > > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get > > an skbuff_head from the NAPI cache instead of inplace allocation > > inside __alloc_skb(). > > This implies that the function is called from softirq or BH-off > > context, not for allocating a clone or from a distant node. > > > > Signed-off-by: Alexander Lobakin > > --- > > net/core/skbuff.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 9e1a8ded4acc..a0b457ae87c2 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > struct sk_buff *skb; > > u8 *data; > > bool pfmemalloc; > > + bool clone; > > > > - cache = (flags & SKB_ALLOC_FCLONE) > > - ? skbuff_fclone_cache : skbuff_head_cache; > > + clone = !!(flags & SKB_ALLOC_FCLONE); > > The boolean conversion here is probably unnecessary. I would make > clone an int like flags and work with that. I suspect the compiler is > doing it already, but it is better to be explicit. > > > + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; > > > > if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) > > gfp_mask |= __GFP_MEMALLOC; > > > > /* Get the HEAD */ > > - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); > > + if ((flags & SKB_ALLOC_NAPI) && !clone && > > Rather than having to do two checks you could just check for > SKB_ALLOC_NAPI and SKB_ALLOC_FCLONE in a single check. You could just > do something like: > if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI) == SKB_ALLOC_NAPI) > > That way you can avoid the extra conditional jumps and can start > computing the flags value sooner. I thought about combined check for two flags yesterday, so yeah, that probably should be better than the current version. > > + likely(node == NUMA_NO_NODE || node == numa_mem_id())) > > + skb = napi_skb_cache_get(); > > + else > > + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, > > node); > > if (unlikely(!skb)) > > return NULL; > > prefetchw(skb); > > @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > __build_skb_around(skb, data, 0); > > skb->pfmemalloc = pfmemalloc; > > > > - if (flags & SKB_ALLOC_FCLONE) { > > + if (clone) { > > struct sk_buff_fclones *fclones; > > > > fclones = container_of(skb, struct sk_buff_fclones, skb1); > > -- > > 2.30.1 Thanks, Al
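Written out with balanced parentheses (the one-liner quoted above drops a closing paren), the suggested combined check would read something like the snippet below, kept in the shape of the patched __alloc_skb() hunk:

	/* Sketch: take the NAPI cache path only when SKB_ALLOC_NAPI is set
	 * and SKB_ALLOC_FCLONE is not, in a single flags test.
	 */
	if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI &&
	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
		skb = napi_skb_cache_get();
	else
		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);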
Re: [PATCH v5 net-next 06/11] skbuff: remove __kfree_skb_flush()
From: Alexander Duyck Date: Thu, 11 Feb 2021 19:28:52 -0800 > On Thu, Feb 11, 2021 at 10:57 AM Alexander Lobakin wrote: > > > > This function isn't much needed as NAPI skb queue gets bulk-freed > > anyway when there's no more room, and even may reduce the efficiency > > of bulk operations. > > It will be even less needed after reusing skb cache on allocation path, > > so remove it and this way lighten network softirqs a bit. > > > > Suggested-by: Eric Dumazet > > Signed-off-by: Alexander Lobakin > > I'm wondering if you have any actual gains to show from this patch? > > The reason why I ask is because the flushing was happening at the end > of the softirq before the system basically gave control back over to > something else. As such there is a good chance for the memory to be > dropped from the cache by the time we come back to it. So it may be > just as expensive if not more so than accessing memory that was just > freed elsewhere and placed in the slab cache. Just retested after readding this function (and changing the logics so it would drop the second half of the cache, like napi_skb_cache_put() does) and got 10 Mbps drawback with napi_build_skb() + napi_gro_receive(). So seems like getting a pointer from an array instead of calling kmem_cache_alloc() is cheaper even if the given object was pulled out of CPU caches. > > --- > > include/linux/skbuff.h | 1 - > > net/core/dev.c | 7 +-- > > net/core/skbuff.c | 12 > > 3 files changed, 1 insertion(+), 19 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 0a4e91a2f873..0e0707296098 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct > > napi_struct *napi, > > } > > void napi_consume_skb(struct sk_buff *skb, int budget); > > > > -void __kfree_skb_flush(void); > > void __kfree_skb_defer(struct sk_buff *skb); > > > > /** > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 321d41a110e7..4154d4683bb9 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct > > softirq_action *h) > > else > > __kfree_skb_defer(skb); > > } > > - > > - __kfree_skb_flush(); > > } > > > > if (sd->output_queue) { > > @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) > > __napi_poll(napi, &repoll); > > netpoll_poll_unlock(have); > > > > - __kfree_skb_flush(); > > local_bh_enable(); > > > > if (!repoll) > > So it looks like this is the one exception to my comment above. Here > we should probably be adding a "if (!repoll)" before calling > __kfree_skb_flush(). 
> > > @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct > > softirq_action *h) > > > > if (list_empty(&list)) { > > if (!sd_has_rps_ipi_waiting(sd) && > > list_empty(&repoll)) > > - goto out; > > + return; > > break; > > } > > > > @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct > > softirq_action *h) > > __raise_softirq_irqoff(NET_RX_SOFTIRQ); > > > > net_rps_action_and_irq_enable(sd); > > -out: > > - __kfree_skb_flush(); > > } > > > > struct netdev_adjacent { > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 1c6f6ef70339..4be2bb969535 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) > > kfree_skbmem(skb); > > } > > > > -void __kfree_skb_flush(void) > > -{ > > - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > - > > - /* flush skb_cache if containing objects */ > > - if (nc->skb_count) { > > - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, > > -nc->skb_cache); > > - nc->skb_count = 0; > > - } > > -} > > - > > static inline void _kfree_skb_defer(struct sk_buff *skb) > > { > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > -- > > 2.30.1 Al
[PATCH v6 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always do allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and veth driver already do). As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via: - a new napi_build_skb() function (as a replacement for build_skb()); - existing {,__}napi_alloc_skb() and napi_get_frags() functions; - __alloc_skb() with passing SKB_ALLOC_NAPI in flags. iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps. Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs: - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from the remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, skbuff_head from a remote node is an OK case; - The easiest way to check if the slab of skbuff_head is remote or pfmemalloc'ed is: if (!dev_page_is_reusable(virt_to_head_page(skb))) /* drop it */; ...*but*, regarding that most slabs are built of compound pages, virt_to_head_page() will hit unlikely-branch every single call. This check costed at least 20 Mbps in test scenarios and seems like it'd be better to _not_ do this. Since v5 [4]: - revert flags-to-bool conversion and simplify flags testing in __alloc_skb() (Alexander Duyck). Since v4 [3]: - rebase on top of net-next and address kernel build robot issue; - reorder checks a bit in __alloc_skb() to make new condition even more harmless. Since v3 [2]: - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb(); - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals to the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough; - don't waste cycles on explicit in_serving_softirq() check. Since v2 [1]: - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skbs requests to kmalloc layer); - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov); - completely drop redundant __kfree_skb_flush() (also Eric); - lots of code cleanups; - expand the commit message with NUMA and pfmemalloc points (Jakub). Since v1 [0]: - use one unified cache instead of two separate to greatly simplify the logics and reduce hotpath overhead (Edward Cree); - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing; - correct performance numbers after optimizations and performing lots of tests for different use cases. 
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me [3] https://lore.kernel.org/netdev/20210210162732.80467-1-aloba...@pm.me [4] https://lore.kernel.org/netdev/20210211185220.9753-1-aloba...@pm.me Alexander Lobakin (11): skbuff: move __alloc_skb() next to the other skb allocation functions skbuff: simplify kmalloc_reserve() skbuff: make __build_skb_around() return void skbuff: simplify __alloc_skb() a bit skbuff: use __build_skb_around() in __alloc_skb() skbuff: remove __kfree_skb_flush() skbuff: move NAPI cache declarations upper in the file skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads skbuff: allow to optionally use NAPI cache from __alloc_skb() skbuff: allow to use NAPI cache from __napi_alloc_skb() skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing include/linux/skbuff.h | 4 +- net/core/dev.c | 16 +- net/core/skbuff.c | 428 +++-- 3 files changed, 242 insertions(+), 206 deletions(-) -- 2.30.1
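As a side note, here is a sketch of the per-head check the cover letter above measures and rejects; it is what the series deliberately does not do, since virt_to_head_page() on compound pages cost about 20 Mbps in the test scenarios:

/* Sketch of the per-head check discussed (and rejected) in the cover
 * letter: recycle the head only when its backing page is neither
 * pfmemalloc'ed nor from a remote NUMA node. The series skips this
 * because virt_to_head_page() on compound pages cost ~20 Mbps in tests.
 */
static bool skb_head_ok_for_cache(const struct sk_buff *skb)
{
	/* dev_page_is_reusable(): local node && !pfmemalloc */
	return dev_page_is_reusable(virt_to_head_page(skb));
}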
[PATCH v6 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller(). Remove the redundant macro and rename the function after it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a0f846872d19..70289f22a6f4 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align); * may be used. Otherwise, the packet data may be discarded until enough * memory is free */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) +static void *kmalloc_reserve(size_t size, gfp_t flags, int node, +bool *pfmemalloc) { void *obj; bool ret_pfmemalloc = false; -- 2.30.1
[PATCH v6 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v6 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v6 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v6 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v6 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v6 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 7 +-- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 19 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index ce6291bc2e16..631807c196ad 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7012,7 +7010,6 @@ static int napi_threaded_poll(void *data) __napi_poll(napi, &repoll); netpoll_poll_unlock(have); - __kfree_skb_flush(); local_bh_enable(); if (!repoll) @@ -7042,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7069,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v6 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
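A hedged sketch of how a driver Rx path could opt in, swapping build_skb() for the new napi_build_skb(); the rx_buf structure and its fields are hypothetical:

/* Hedged sketch: a driver Rx path opting in by replacing build_skb()
 * with napi_build_skb(). The rx_buf structure and its fields are
 * hypothetical, not taken from any real driver.
 */
static struct sk_buff *my_build_rx_skb(struct my_rx_buf *rx_buf,
				       unsigned int len)
{
	struct sk_buff *skb;

	/* data points to a page fragment, truesize covers headroom + frame */
	skb = napi_build_skb(rx_buf->data, rx_buf->truesize);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, rx_buf->headroom);
	skb_put(skb, len);

	return skb;
}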
[PATCH v6 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Cc: Alexander Duyck # Simplified flags check Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..a80581eed7fc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -405,7 +405,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); -- 2.30.1
[PATCH v6 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 631807c196ad..ea9b46318d23 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6095,13 +6095,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6115,7 +6108,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 875e1a453f7e..545a472273a5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -916,9 +916,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -935,6 +932,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -960,6 +965,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
[PATCH v6 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a80581eed7fc..875e1a453f7e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -562,7 +562,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -579,7 +580,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
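A minimal sketch of the Rx copybreak pattern this patch targets, with hypothetical driver parameters; the only visible change for a driver is that the head behind napi_alloc_skb() may now come from the NAPI cache:

/* Sketch of the Rx copybreak pattern this patch targets: small frames
 * are copied into a fresh skb whose head may now come from the NAPI
 * cache. The frame/len parameters are placeholders for what the driver's
 * descriptor provides.
 */
static struct sk_buff *my_rx_copybreak(struct napi_struct *napi,
				       const void *frame, unsigned int len)
{
	struct sk_buff *skb;

	skb = napi_alloc_skb(napi, len);	/* head may come from the cache */
	if (unlikely(!skb))
		return NULL;

	skb_put_data(skb, frame, len);		/* copy the small frame */

	return skb;
}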
Re: linux-next: manual merge of the kspp tree with the mips tree
From: Stephen Rothwell Date: Tue, 23 Feb 2021 10:49:50 +1100 > Hi all, Hi, > On Mon, 15 Feb 2021 07:47:26 +1100 Stephen Rothwell > wrote: > > > > On Mon, 18 Jan 2021 15:08:04 +1100 Stephen Rothwell > > wrote: > > > > > > Today's linux-next merge of the kspp tree got a conflict in: > > > > > > include/asm-generic/vmlinux.lds.h > > > > > > between commits: > > > > > > 9a427556fb8e ("vmlinux.lds.h: catch compound literals into > > > data and BSS") > > > f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into data") > > > > > > from the mips tree and commit: > > > > > > dc5723b02e52 ("kbuild: add support for Clang LTO") > > > > > > from the kspp tree. > > > > > > I fixed it up (9a427556fb8e and dc5723b02e52 made the same change to > > > DATA_MAIN, which conflicted with the change in f41b233de0ae) and can > > > carry the fix as necessary. This is now fixed as far as linux-next is > > > concerned, but any non trivial conflicts should be mentioned to your > > > upstream maintainer when your tree is submitted for merging. You may > > > also want to consider cooperating with the maintainer of the > > > conflicting tree to minimise any particularly complex conflicts. > > > > With the merge window about to open, this is a reminder that this > > conflict still exists. > > This is now a conflict between the kspp tree and Linus' tree. Kees prepared a Git pull of the kspp tree for Linus, so this will be resolved soon. > -- > Cheers, > Stephen Rothwell Al
[PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
LKP caught another bunch of orphaned instrumentation symbols [0]: mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/main.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/main.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/initramfs.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/initramfs.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/calibrate.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/calibrate.o' being placed in section `.data.$LPBX0' [...] Soften the wildcard to .data.$L* to grab these ones into .data too. [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com Reported-by: kernel test robot Signed-off-by: Alexander Lobakin --- include/asm-generic/vmlinux.lds.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 01a3fd6a64d2..c887ac36c1b4 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -95,7 +95,7 @@ */ #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$L* #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* -- 2.30.1
Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
> LKP caught another bunch of orphaned instrumentation symbols [0]: > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/main.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/main.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/do_mounts.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/do_mounts.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/initramfs.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/initramfs.o' being placed in section `.data.$LPBX0' > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > `init/calibrate.o' being placed in section `.data.$LPBX1' > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > `init/calibrate.o' being placed in section `.data.$LPBX0' > > [...] > > Soften the wildcard to .data.$L* to grab these ones into .data too. > > [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com > > Reported-by: kernel test robot > Signed-off-by: Alexander Lobakin > --- > include/asm-generic/vmlinux.lds.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Hi Thomas, This applies on top of mips-next or Linus' tree, so you may need to rebase mips-fixes before taking it. It's not for mips-next as it should go into this cycle as a [hot]fix. I haven't added any "Fixes:" tag since these warnings is a result of merging several sets and of certain build configurations that almost couldn't be tested separately. > diff --git a/include/asm-generic/vmlinux.lds.h > b/include/asm-generic/vmlinux.lds.h > index 01a3fd6a64d2..c887ac36c1b4 100644 > --- a/include/asm-generic/vmlinux.lds.h > +++ b/include/asm-generic/vmlinux.lds.h > @@ -95,7 +95,7 @@ > */ > #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION > #define TEXT_MAIN .text .text.[0-9a-zA-Z_]* > -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_* > +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* > .data..compoundliteral* .data.$__unnamed_* .data.$L* > #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]* > #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L* > #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral* > -- > 2.30.1 Thanks, Al
Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data
From: Thomas Bogendoerfer Date: Tue, 23 Feb 2021 13:21:44 +0100 > On Tue, Feb 23, 2021 at 11:36:41AM +0000, Alexander Lobakin wrote: > > > LKP caught another bunch of orphaned instrumentation symbols [0]: > > > > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/main.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/main.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/do_mounts.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/do_mounts.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/initramfs.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/initramfs.o' being placed in section `.data.$LPBX0' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from > > > `init/calibrate.o' being placed in section `.data.$LPBX1' > > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from > > > `init/calibrate.o' being placed in section `.data.$LPBX0' > > > > > > [...] > > > > > > Soften the wildcard to .data.$L* to grab these ones into .data too. > > > > > > [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com > > > > > > Reported-by: kernel test robot > > > Signed-off-by: Alexander Lobakin > > > --- > > > include/asm-generic/vmlinux.lds.h | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > Hi Thomas, > > > > This applies on top of mips-next or Linus' tree, so you may need to > > rebase mips-fixes before taking it. > > It's not for mips-next as it should go into this cycle as a [hot]fix. > > I haven't added any "Fixes:" tag since these warnings is a result > > of merging several sets and of certain build configurations that > > almost couldn't be tested separately. > > no worries, mips-fixes is defunct during merge windows. I'll send another > pull request to Linus and will add this patch to it. Ah, thank you! > Thomas. Al > -- > Crap can work. Given enough thrust pigs will fly, but it's not necessarily a > good idea.[ RFC1925, 2.3 ]
Re: [GIT PULL v2] clang-lto for v5.12-rc1
From: Linus Torvalds Date: Tue, 23 Feb 2021 12:33:05 -0800 > On Tue, Feb 23, 2021 at 9:49 AM Linus Torvalds > wrote: > > > > On Mon, Feb 22, 2021 at 3:11 PM Kees Cook wrote: > > > > > > While x86 LTO enablement is done[1], it depends on some objtool > > > clean-ups[2], though it appears those actually have been in linux-next > > > (via tip/objtool/core), so it's possible that if that tree lands [..] > > > > That tree is actually next on my list of things to merge after this > > one, so it should be out soonish. > > "soonish" turned out to be later than I thought, because my "build > changes" set of pulls included the module change that I then wasted a > lot of time on trying to figure out why it slowed down my build so > much. I guess it's about CONFIG_TRIM_UNUSED_KSYMS you disabled in your tree. Well, it's actually widely used, mostly in the embedded world where there are often no out-of-tree modules, but a need to save as much space as possible. For full-blown systems and distributions it's almost needless, right. > But it's out now, as pr-tracker-bot already noted. > > Linus Thanks, Al
Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT
From: Yury Norov Date: Sat, 5 Dec 2020 08:54:06 -0800 Hi, > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't > enable it in config. It leads to using find_next_bit() which is less > efficient: > > : >0: aa0003e4mov x4, x0 >4: aa0103e0mov x0, x1 >8: b4000181cbz x1, 38 >c: f9400083ldr x3, [x4] > 10: d2800802mov x2, #0x40 // #64 > 14: 91002084add x4, x4, #0x8 > 18: b4c3cbz x3, 30 > 1c: 1408b 3c > 20: f8408483ldr x3, [x4], #8 > 24: 91010045add x5, x2, #0x40 > 28: b5c3cbnzx3, 40 > 2c: aa0503e2mov x2, x5 > 30: eb02001fcmp x0, x2 > 34: 5468b.hi20 // b.pmore > 38: d65f03c0ret > 3c: d282mov x2, #0x0// #0 > 40: dac00063rbitx3, x3 > 44: dac01063clz x3, x3 > 48: 8b020062add x2, x3, x2 > 4c: eb02001fcmp x0, x2 > 50: 9a829000cselx0, x0, x2, ls // ls = plast > 54: d65f03c0ret > > ... > > 0118 <_find_next_bit.constprop.1>: > 118: eb02007fcmp x3, x2 > 11c: 540002e2b.cs178 <_find_next_bit.constprop.1+0x60> // b.hs, > b.nlast > 120: d346fc66lsr x6, x3, #6 > 124: f8667805ldr x5, [x0, x6, lsl #3] > 128: b461cbz x1, 134 <_find_next_bit.constprop.1+0x1c> > 12c: f8667826ldr x6, [x1, x6, lsl #3] > 130: 8a0600a5and x5, x5, x6 > 134: ca0400a6eor x6, x5, x4 > 138: 9285mov x5, #0x // #-1 > 13c: 9ac320a5lsl x5, x5, x3 > 140: 927ae463and x3, x3, #0xffc0 > 144: ea0600a5andsx5, x5, x6 > 148: 54000120b.eq16c <_find_next_bit.constprop.1+0x54> // b.none > 14c: 140eb 184 <_find_next_bit.constprop.1+0x6c> > 150: d346fc66lsr x6, x3, #6 > 154: f8667805ldr x5, [x0, x6, lsl #3] > 158: b461cbz x1, 164 <_find_next_bit.constprop.1+0x4c> > 15c: f8667826ldr x6, [x1, x6, lsl #3] > 160: 8a0600a5and x5, x5, x6 > 164: eb05009fcmp x4, x5 > 168: 54c1b.ne180 <_find_next_bit.constprop.1+0x68> // b.any > 16c: 91010063add x3, x3, #0x40 > 170: eb03005fcmp x2, x3 > 174: 54fffee8b.hi150 <_find_next_bit.constprop.1+0x38> // > b.pmore > 178: aa0203e0mov x0, x2 > 17c: d65f03c0ret > 180: ca050085eor x5, x4, x5 > 184: dac000a5rbitx5, x5 > 188: dac010a5clz x5, x5 > 18c: 8b0300a3add x3, x5, x3 > 190: eb03005fcmp x2, x3 > 194: 9a839042cselx2, x2, x3, ls // ls = plast > 198: aa0203e0mov x0, x2 > 19c: d65f03c0ret > > ... > > 0238 : > 238: a9bf7bfdstp x29, x30, [sp, #-16]! > 23c: aa0203e3mov x3, x2 > 240: d284mov x4, #0x0// #0 > 244: aa0103e2mov x2, x1 > 248: 910003fdmov x29, sp > 24c: d281mov x1, #0x0// #0 > 250: 97b2bl 118 <_find_next_bit.constprop.1> > 254: a8c17bfdldp x29, x30, [sp], #16 > 258: d65f03c0ret > > Enabling this functions would also benefit for_each_{set,clear}_bit(). > Would it make sense to enable this config for all such architectures by > default? I confirm that GENERIC_FIND_FIRST_BIT also produces more optimized and fast code on MIPS (32 R2) where there is also no architecture-specific bitsearching routines. So, if it's okay for other folks, I'd suggest to go for it and enable for all similar arches. (otherwise, I'll publish a separate entry for mips-next after 5.12-rc1 release and mention you in "Suggested-by:") > Signed-off-by: Yury Norov > > --- > arch/arm64/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 1515f6f153a0..2b90ef1f548e 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -106,6 +106,7 @@ config ARM64 > select GENERIC_CPU_AUTOPROBE > select GENERIC_CPU_VULNERABILITIES > select GENERIC_EARLY_IOREMAP > + select GENERIC_FIND_FIRST_BIT > select GENERIC_IDLE_POLL_SETUP > select GENERIC_IRQ_IPI > select GENERIC_IRQ_MULTI_HANDLER > -- > 2.25.1 Thanks, Al
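For comparison, a user-space sketch (not the kernel's implementation) of what a dedicated find_first_bit() boils down to; the generic _find_next_bit() shown in the disassembly above additionally has to handle an arbitrary start offset and an optional second bitmap, which is where the extra code comes from:

/* User-space sketch, not the kernel's implementation: what a dedicated
 * find_first_bit() boils down to. Without GENERIC_FIND_FIRST_BIT the same
 * query goes through the heavier _find_next_bit(..., start = 0) shown in
 * the disassembly above, which also handles an arbitrary start offset and
 * an optional second (masking) bitmap.
 */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static unsigned long find_first_bit_sketch(const unsigned long *addr,
					    unsigned long size)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			unsigned long bit = idx * BITS_PER_LONG +
					    __builtin_ctzl(addr[idx]);

			return bit < size ? bit : size;
		}
	}

	return size;	/* no bits set */
}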
Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())
From: Paolo Abeni Date: Wed, 10 Feb 2021 11:21:06 +0100 > Hello, Hi! > I'm sorry for the late feedback, I could not step-in before. > > Also adding Jesper for awareness, as he introduced the bulk free > infrastructure. > > On Tue, 2021-02-09 at 20:48 +, Alexander Lobakin wrote: > > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int > > frag_size) > > */ > > struct sk_buff *build_skb(void *data, unsigned int frag_size) > > { > > - struct sk_buff *skb = __build_skb(data, frag_size); > > + struct sk_buff *skb = __build_skb(data, frag_size, true); > > I must admit I'm a bit scared of this. There are several high speed > device drivers that will move to bulk allocation, and we don't have any > performance figure for them. > > In my experience with (low end) MIPS board, cache misses cost tend to > be much less visible there compared to reasonably recent server H/W, > because the CPU/memory access time difference is much lower. > > When moving to higher end H/W the performance gain you measured could > be completely countered by less optimal cache usage. > > I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a > single skb would be visible e.g. in a round-robin test. Generally > speaking bulk allocating 32 skbs looks a bit too much. IIRC, when > Edward added listification to GRO, he did several measures with > different list size and found 8 to be the optimal value (for the tested > workload). Above such number the list become too big and the pressure > on the cache outweighted the bulking benefits. I can change the logic so that it allocates only the first 8. I think I've already seen this batch value somewhere in XDP code, so this might be a balanced one. Regarding bulk-freeing: does the batch size also matter when freeing, or is it okay to wipe 32 (currently 64 in baseline) in a row? > Perhaps giving the device drivers the ability to opt-in on this infra > via a new helper - as done back then with napi_consume_skb() - would > make this change safer? That's actually a very nice idea. There's only a little to change in the code to make taking heads from the cache optional. This way developers could switch to it when needed. Thanks for the suggestions! I'll definitely absorb them into the code and give it a test. > > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb) > > kfree_skbmem(skb); > > } > > > > -static inline void _kfree_skb_defer(struct sk_buff *skb) > > +static void napi_skb_cache_put(struct sk_buff *skb) > > { > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); > > + u32 i; > > > > /* drop skb->head and call any destructors for packet */ > > skb_release_all(skb); > > > > - /* record skb to CPU local list */ > > + kasan_poison_object_data(skbuff_head_cache, skb); > > nc->skb_cache[nc->skb_count++] = skb; > > > > -#ifdef CONFIG_SLUB > > - /* SLUB writes into objects when freeing */ > > - prefetchw(skb); > > -#endif > > It looks like this chunk has been lost. Is that intentional? Yep. This prefetchw() assumed that skbuff_heads would be wiped immediately or at the end of the network softirq. Reusing this cache means that heads can be reused later or kept in the cache for some time, so prefetching makes no sense anymore. > Thanks! > > Paolo Al
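For reference, a condensed sketch of the freeing side discussed here, as it ends up in patch 08/11 of the later revisions: heads are KASAN-poisoned while they sit in the cache, and the second half (32 entries) is bulk-freed once the 64-entry cache fills up. Simplified, not the literal kernel code:

/* Condensed sketch of the freeing side as it lands in patch 08/11 of the
 * later revisions; simplified, not the literal kernel code. Heads are
 * KASAN-poisoned while cached, and the second half of the 64-entry cache
 * is bulk-freed back to the slab once it fills up.
 */
static void napi_skb_cache_put_sketch(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		/* unpoison the batch before returning it to the slab */
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}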
[PATCH v4 net-next 00/11] skbuff: introduce skbuff_heads bulking and reusing
Currently, all sorts of skb allocation always do allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them by bulks. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (like XDP's cpumap code and veth driver already do). As this might affect latencies, cache pressure and lots of hardware and driver-dependent stuff, this new feature is mostly optional and can be issued via: - a new napi_build_skb() function (as a replacement for build_skb()); - existing {,__}napi_alloc_skb() and napi_get_frags() functions; - __alloc_skb() with passing SKB_ALLOC_NAPI in flags. iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps. Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs: - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from the remote nodes to defragment their slabs. This is controlled by sysctl, but according to this, skbuff_head from a remote node is an OK case; - The easiest way to check if the slab of skbuff_head is remote or pfmemalloc'ed is: if (!dev_page_is_reusable(virt_to_head_page(skb))) /* drop it */; ...*but*, regarding that most slabs are built of compound pages, virt_to_head_page() will hit unlikely-branch every single call. This check costed at least 20 Mbps in test scenarios and seems like it'd be better to _not_ do this. Since v3 [2]: - make the feature mostly optional, so driver developers could decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb(); - reduce bulk-allocation size from 32 to 16 elements (also Paolo). This equals to the value of XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough; - don't waste cycles on explicit in_serving_softirq() check. Since v2 [1]: - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skbs requests to kmalloc layer); - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help of Dmitry Vyukov); - completely drop redundant __kfree_skb_flush() (also Eric); - lots of code cleanups; - expand the commit message with NUMA and pfmemalloc points (Jakub). Since v1 [0]: - use one unified cache instead of two separate to greatly simplify the logics and reduce hotpath overhead (Edward Cree); - new: recycle also GRO_MERGED_FREE skbs instead of immediate freeing; - correct performance numbers after optimizations and performing lots of tests for different use cases. 
[0] https://lore.kernel.org/netdev/2021082655.12159-1-aloba...@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-aloba...@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-aloba...@pm.me Alexander Lobakin (11): skbuff: move __alloc_skb() next to the other skb allocation functions skbuff: simplify kmalloc_reserve() skbuff: make __build_skb_around() return void skbuff: simplify __alloc_skb() a bit skbuff: use __build_skb_around() in __alloc_skb() skbuff: remove __kfree_skb_flush() skbuff: move NAPI cache declarations upper in the file skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads skbuff: allow to optionally use NAPI cache from __alloc_skb() skbuff: allow to use NAPI cache from __napi_alloc_skb() skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing include/linux/skbuff.h | 4 +- net/core/dev.c | 15 +- net/core/skbuff.c | 429 +++-- 3 files changed, 243 insertions(+), 205 deletions(-) -- 2.30.1
[PATCH v4 net-next 01/11] skbuff: move __alloc_skb() next to the other skb allocation functions
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 284 +++--- 1 file changed, 142 insertions(+), 142 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d380c7b5a12d..a0f846872d19 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } -/* - * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells - * the caller if emergency pfmemalloc reserves are being used. If it is and - * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves - * may be used. Otherwise, the packet data may be discarded until enough - * memory is free - */ -#define kmalloc_reserve(size, gfp, node, pfmemalloc) \ -__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) - -static void *__kmalloc_reserve(size_t size, gfp_t flags, int node, - unsigned long ip, bool *pfmemalloc) -{ - void *obj; - bool ret_pfmemalloc = false; - - /* -* Try a regular allocation, when that fails and we're not entitled -* to the reserves, fail. -*/ - obj = kmalloc_node_track_caller(size, - flags | __GFP_NOMEMALLOC | __GFP_NOWARN, - node); - if (obj || !(gfp_pfmemalloc_allowed(flags))) - goto out; - - /* Try again but now we are using pfmemalloc reserves */ - ret_pfmemalloc = true; - obj = kmalloc_node_track_caller(size, flags, node); - -out: - if (pfmemalloc) - *pfmemalloc = ret_pfmemalloc; - - return obj; -} - -/* Allocate a new skbuff. We do this ourselves so we can fill in a few - * 'private' fields and also do memory statistics to find all the - * [BEEP] leaks. - * - */ - -/** - * __alloc_skb - allocate a network buffer - * @size: size to allocate - * @gfp_mask: allocation mask - * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache - * instead of head cache and allocate a cloned (child) skb. - * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for - * allocations in case the data is required for writeback - * @node: numa node to allocate memory on - * - * Allocate a new &sk_buff. The returned buffer has no headroom and a - * tail room of at least size bytes. The object has a reference count - * of one. The return is the buffer. On a failure the return is %NULL. - * - * Buffers may only be allocated from interrupts using a @gfp_mask of - * %GFP_ATOMIC. - */ -struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int flags, int node) -{ - struct kmem_cache *cache; - struct skb_shared_info *shinfo; - struct sk_buff *skb; - u8 *data; - bool pfmemalloc; - - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; - - if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) - gfp_mask |= __GFP_MEMALLOC; - - /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; - prefetchw(skb); - - /* We do our best to align skb_shared_info on a separate cache -* line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives -* aligned memory blocks, unless SLUB/SLAB debug is enabled. -* Both skb->head and skb_shared_info are cache line aligned. -*/ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) - goto nodata; - /* kmalloc(size) might give us more room than requested. 
-* Put skb_shared_info exactly at the end of allocated zone, -* to allow max possible filling before reallocation. -*/ - size = SKB_WITH_OVERHEAD(ksize(data)); - prefetchw(data + size); - - /* -* Only clear those fields we need to clear, not those that we will -* actually initialise below. Hence, don't put any more fields after -* the tail pointer in struct sk_buff! -*/ - memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); - skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb)
[PATCH v4 net-next 02/11] skbuff: simplify kmalloc_reserve()
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't
been used. _RET_IP_ is embedded inside kmalloc_node_track_caller().
Remove the redundant macro and rename the function after it.

Signed-off-by: Alexander Lobakin
---
 net/core/skbuff.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0f846872d19..70289f22a6f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align);
  * may be used. Otherwise, the packet data may be discarded until enough
  * memory is free
  */
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
-	__kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
-			       unsigned long ip, bool *pfmemalloc)
+static void *kmalloc_reserve(size_t size, gfp_t flags, int node,
+			     bool *pfmemalloc)
 {
 	void *obj;
 	bool ret_pfmemalloc = false;
-- 
2.30.1
[PATCH v4 net-next 03/11] skbuff: make __build_skb_around() return void
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 70289f22a6f4..c7d184e11547 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } /* Caller must provide SKB that is memset cleared */ -static struct sk_buff *__build_skb_around(struct sk_buff *skb, - void *data, unsigned int frag_size) +static void __build_skb_around(struct sk_buff *skb, void *data, + unsigned int frag_size) { struct skb_shared_info *shinfo; unsigned int size = frag_size ? : ksize(data); @@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb, atomic_set(&shinfo->dataref, 1); skb_set_kcov_handle(skb, kcov_common_handle()); - - return skb; } /** @@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); - return __build_skb_around(skb, data, frag_size); + return skb; } /* build_skb() is wrapper over __build_skb(), that specifically @@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, if (unlikely(!skb)) return NULL; - skb = __build_skb_around(skb, data, frag_size); + __build_skb_around(skb, data, frag_size); - if (skb && frag_size) { + if (frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; -- 2.30.1
[PATCH v4 net-next 06/11] skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 - net/core/dev.c | 6 +- net/core/skbuff.c | 12 3 files changed, 1 insertion(+), 18 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0a4e91a2f873..0e0707296098 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); -void __kfree_skb_flush(void); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 7647278e46f0..7134ae2fc0db 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4944,8 +4944,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) else __kfree_skb_defer(skb); } - - __kfree_skb_flush(); } if (sd->output_queue) { @@ -7041,7 +7039,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) if (list_empty(&list)) { if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll)) - goto out; + return; break; } @@ -7068,8 +7066,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) __raise_softirq_irqoff(NET_RX_SOFTIRQ); net_rps_action_and_irq_enable(sd); -out: - __kfree_skb_flush(); } struct netdev_adjacent { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1c6f6ef70339..4be2bb969535 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -void __kfree_skb_flush(void) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - /* flush skb_cache if containing objects */ - if (nc->skb_count) { - kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count, -nc->skb_cache); - nc->skb_count = 0; - } -} - static inline void _kfree_skb_defer(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); -- 2.30.1
[PATCH v4 net-next 04/11] skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c7d184e11547..88566de26cd1 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); - if (!skb) - goto out; + if (unlikely(!skb)) + return NULL; prefetchw(skb); /* We do our best to align skb_shared_info on a separate cache @@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, size = SKB_DATA_ALIGN(size); size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); - if (!data) + if (unlikely(!data)) goto nodata; /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, @@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, skb_set_kcov_handle(skb, kcov_common_handle()); -out: return skb; + nodata: kmem_cache_free(cache, skb); - skb = NULL; - goto out; + return NULL; } EXPORT_SYMBOL(__alloc_skb); -- 2.30.1
[PATCH v4 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree # Divide on two halves Suggested-by: Eric Dumazet# KASAN poisoning Cc: Dmitry Vyukov # Help with KASAN Cc: Paolo Abeni # Reduced batch size Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 -- 2 files changed, 83 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE64 +#define NAPI_SKB_CACHE_BULK16 +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. 
+ */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31
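The diff above is cut off before the freeing side of the cache. Going by the commit message (poison the head, and bulk-wipe the upper half once all 64 slots are used), the put path presumably looks roughly like the sketch below; the exact code in the applied patch may differ.

```
/* Sketch of the freeing side described in the commit message (the hunk
 * above is truncated before it): poison the head, stash it in the
 * per-CPU cache and, once the cache is full, unpoison and bulk-free the
 * upper half back to the slab cache.
 */
static void napi_skb_cache_put(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	u32 i;

	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
			kasan_unpoison_object_data(skbuff_head_cache,
						   nc->skb_cache[i]);

		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}
```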
[PATCH v4 net-next 07/11] skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 90 +++ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4be2bb969535..860a9d4f752f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) skb_panic(skb, sz, addr, __func__); } +#define NAPI_SKB_CACHE_SIZE64 + +struct napi_alloc_cache { + struct page_frag_cache page; + unsigned int skb_count; + void *skb_cache[NAPI_SKB_CACHE_SIZE]; +}; + +static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); +static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); + +static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + + return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); +} + +void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + fragsz = SKB_DATA_ALIGN(fragsz); + + return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); +} +EXPORT_SYMBOL(__napi_alloc_frag_align); + +void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) +{ + struct page_frag_cache *nc; + void *data; + + fragsz = SKB_DATA_ALIGN(fragsz); + if (in_irq() || irqs_disabled()) { + nc = this_cpu_ptr(&netdev_alloc_cache); + data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); + } else { + local_bh_disable(); + data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); + local_bh_enable(); + } + return data; +} +EXPORT_SYMBOL(__netdev_alloc_frag_align); + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); -#define NAPI_SKB_CACHE_SIZE64 - -struct napi_alloc_cache { - struct page_frag_cache page; - unsigned int skb_count; - void *skb_cache[NAPI_SKB_CACHE_SIZE]; -}; - -static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache); -static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache); - -static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); - - return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask); -} - -void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - fragsz = SKB_DATA_ALIGN(fragsz); - - return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); -} -EXPORT_SYMBOL(__napi_alloc_frag_align); - -void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) -{ - struct page_frag_cache *nc; - void *data; - - fragsz = SKB_DATA_ALIGN(fragsz); - if (in_irq() || irqs_disabled()) { - nc = this_cpu_ptr(&netdev_alloc_cache); - data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask); - } else { - local_bh_disable(); - data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask); - local_bh_enable(); - } - return data; -} -EXPORT_SYMBOL(__netdev_alloc_frag_align); - /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and -- 2.30.1
[PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9e1a8ded4acc..750fa1825b28 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, struct sk_buff *skb; u8 *data; bool pfmemalloc; + bool clone; - cache = (flags & SKB_ALLOC_FCLONE) - ? skbuff_fclone_cache : skbuff_head_cache; + clone = !!(flags & SKB_ALLOC_FCLONE); + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) gfp_mask |= __GFP_MEMALLOC; /* Get the HEAD */ - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); + if (!clone && (flags & SKB_ALLOC_NAPI) && + likely(node == NUMA_NO_NODE || node == numa_mem_id())) + skb = napi_skb_cache_get(); + else + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); if (unlikely(!skb)) return NULL; prefetchw(skb); @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - if (flags & SKB_ALLOC_FCLONE) { + if (clone) { struct sk_buff_fclones *fclones; fclones = container_of(skb, struct sk_buff_fclones, skb1); -- 2.30.1
[PATCH v4 net-next 05/11] skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 18 +- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 88566de26cd1..1c6f6ef70339 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags, int node) { struct kmem_cache *cache; - struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; bool pfmemalloc; @@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); - /* Account for allocated memory : skb + skb->head */ - skb->truesize = SKB_TRUESIZE(size); + __build_skb_around(skb, data, 0); skb->pfmemalloc = pfmemalloc; - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); if (flags & SKB_ALLOC_FCLONE) { struct sk_buff_fclones *fclones; @@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, fclones->skb2.fclone = SKB_FCLONE_CLONE; } - skb_set_kcov_handle(skb, kcov_common_handle()); - return skb; nodata: -- 2.30.1
[PATCH v4 net-next 11/11] skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin --- include/linux/skbuff.h | 1 + net/core/dev.c | 9 + net/core/skbuff.c | 12 +--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 906122eac82a..6d0a33d1c0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2921,6 +2921,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi, } void napi_consume_skb(struct sk_buff *skb, int budget); +void napi_skb_free_stolen_head(struct sk_buff *skb); void __kfree_skb_defer(struct sk_buff *skb); /** diff --git a/net/core/dev.c b/net/core/dev.c index 7134ae2fc0db..f04877295b4f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6094,13 +6094,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type) } EXPORT_SYMBOL(gro_find_complete_by_type); -static void napi_skb_free_stolen_head(struct sk_buff *skb) -{ - skb_dst_drop(skb); - skb_ext_put(skb); - kmem_cache_free(skbuff_head_cache, skb); -} - static gro_result_t napi_skb_finish(struct napi_struct *napi, struct sk_buff *skb, gro_result_t ret) @@ -6114,7 +6107,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi, if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) napi_skb_free_stolen_head(skb); else - __kfree_skb(skb); + __kfree_skb_defer(skb); break; case GRO_HELD: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index ac6e0172f206..9ff701afa837 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -917,9 +917,6 @@ static void napi_skb_cache_put(struct sk_buff *skb) struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); u32 i; - /* drop skb->head and call any destructors for packet */ - skb_release_all(skb); - kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; @@ -936,6 +933,14 @@ static void napi_skb_cache_put(struct sk_buff *skb) void __kfree_skb_defer(struct sk_buff *skb) { + skb_release_all(skb); + napi_skb_cache_put(skb); +} + +void napi_skb_free_stolen_head(struct sk_buff *skb) +{ + skb_dst_drop(skb); + skb_ext_put(skb); napi_skb_cache_put(skb); } @@ -961,6 +966,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } + skb_release_all(skb); napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb); -- 2.30.1
[PATCH v4 net-next 10/11] skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 750fa1825b28..ac6e0172f206 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -563,7 +563,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (len <= SKB_WITH_OVERHEAD(1024) || len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { - skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); + skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI, + NUMA_NO_NODE); if (!skb) goto skb_fail; goto skb_success; @@ -580,7 +581,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); + skb = __napi_build_skb(data, len); if (unlikely(!skb)) { skb_free_frag(data); return NULL; -- 2.30.1
Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()
From: Paolo Abeni Date: Thu, 11 Feb 2021 11:16:40 +0100 > On Wed, 2021-02-10 at 16:30 +0000, Alexander Lobakin wrote: > > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get > > an skbuff_head from the NAPI cache instead of inplace allocation > > inside __alloc_skb(). > > This implies that the function is called from softirq or BH-off > > context, not for allocating a clone or from a distant node. > > > > Signed-off-by: Alexander Lobakin > > --- > > net/core/skbuff.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 9e1a8ded4acc..750fa1825b28 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > struct sk_buff *skb; > > u8 *data; > > bool pfmemalloc; > > + bool clone; > > > > - cache = (flags & SKB_ALLOC_FCLONE) > > - ? skbuff_fclone_cache : skbuff_head_cache; > > + clone = !!(flags & SKB_ALLOC_FCLONE); > > + cache = clone ? skbuff_fclone_cache : skbuff_head_cache; > > > > if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) > > gfp_mask |= __GFP_MEMALLOC; > > > > /* Get the HEAD */ > > - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); > > + if (!clone && (flags & SKB_ALLOC_NAPI) && > > + likely(node == NUMA_NO_NODE || node == numa_mem_id())) > > + skb = napi_skb_cache_get(); > > + else > > + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node); > > if (unlikely(!skb)) > > return NULL; > > prefetchw(skb); > > I hope the opt-in thing would have allowed leaving this code unchanged. > I see it's not trivial avoid touching this code path. > Still I think it would be nice if you would be able to let the device > driver use the cache without touching the above, which is also used > e.g. by the TCP xmit path, which in turn will not leverage the cache > (as it requires FCLONE skbs). > > If I read correctly, the above chunk is needed to > allow __napi_alloc_skb() access the cache even for small skb > allocation. Not only. I wanted to give an ability to access the new feature through __alloc_skb() too, not only through napi_build_skb() or napi_alloc_skb(). And not only for drivers. As you may remember, firstly napi_consume_skb()'s batching system landed for drivers, but then it got used in network core code. I think that some core parts may benefit from reusing the NAPI caches. We'll only see it later. It's not as complex as it may seem. NUMA check is cheap and tends to be true for the vast majority of cases. Check for fclone is already present in baseline code, even two times through the function. So it's mostly about (flags & SKB_ALLOC_NAPI). > Good device drivers should not call alloc_skb() in the fast > path. Not really. Several enterprise NIC drivers use __alloc_skb() and alloc_skb(): ChelsIO and Mellanox for inline TLS, Netronome etc. Lots of RDMA and wireless drivers (not the legacy ones), too. __alloc_skb() gives you more control on NUMA node and needed skb headroom, so it's still sometimes useful in drivers. > What about changing __napi_alloc_skb() to always use > the __napi_build_skb(), for both kmalloc and page backed skbs? That is, > always doing the 'data' allocation in __napi_alloc_skb() - either via > page_frag or via kmalloc() - and than call __napi_build_skb(). > > I think that should avoid adding more checks in __alloc_skb() and > should probably reduce the number of conditional used > by __napi_alloc_skb(). I thought of this too. 
But this will introduce a conditional branch to decide whether to set
skb->head_frag. So it's one branch less in __alloc_skb() but one branch
more here, and we also lose the ability to call __alloc_skb() with a
decached head.

> Thanks!
>
> Paolo

Thanks,
Al
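For illustration only (hypothetical helper, not from the series): __alloc_skb() is the variant that exposes the NUMA node and lets the caller carve out its own headroom, which napi_alloc_skb() doesn't, and with this patch SKB_ALLOC_NAPI lets such callers opt in to the skbuff_head cache as well.

```
/* Hypothetical driver helper (names made up) showing why some drivers
 * stick to __alloc_skb(): explicit NUMA node plus caller-managed
 * headroom. SKB_ALLOC_NAPI additionally opts in to the NAPI head cache.
 */
static struct sk_buff *foo_alloc_rx_skb(struct device *dma_dev,
					unsigned int len,
					unsigned int headroom)
{
	struct sk_buff *skb;

	skb = __alloc_skb(len + headroom, GFP_ATOMIC,
			  SKB_ALLOC_RX | SKB_ALLOC_NAPI,
			  dev_to_node(dma_dev));
	if (likely(skb))
		skb_reserve(skb, headroom);

	return skb;
}
```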
Re: [PATCH v4 net-next 08/11] skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
From: Jesper Dangaard Brouer Date: Thu, 11 Feb 2021 13:54:59 +0100 > On Wed, 10 Feb 2021 16:30:23 + > Alexander Lobakin wrote: > > > Instead of just bulk-flushing skbuff_heads queued up through > > napi_consume_skb() or __kfree_skb_defer(), try to reuse them > > on allocation path. > > Maybe you are already aware of this dynamics, but high speed NICs will > usually run the TX "cleanup" (opportunistic DMA-completion) in the napi > poll function call, and often before processing RX packets. Like > ixgbe_poll[1] calls ixgbe_clean_tx_irq() before ixgbe_clean_rx_irq(). Sure. 1G MIPS is my home project (I'll likely migrate to ARM64 cluster in 2-3 months). I mostly work with 10-100G NICs at work. > If traffic is symmetric (or is routed-back same interface) then this > SKB recycle scheme will be highly efficient. (I had this part of my > initial patchset and tested it on ixgbe). > > [1] > https://elixir.bootlin.com/linux/v5.11-rc7/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L3149 That's exactly why I introduced this feature. Firstly driver enriches the cache with the consumed skbs from Tx completion queue, and then it just decaches them back on Rx completion cycle. That's how things worked most of the time on my test setup. The reason why Paolo proposed this as an option, and why I agreed it's safer to do instead of unconditional switching, is that different platforms and setup may react differently on this. We don't have an ability to test the entire zoo, so we propose an option for driver and network core developers to test and use "on demand". As I wrote in reply to Paolo, there might be cases when even the core networking code may benefit from this. > > If the cache is empty on allocation, bulk-allocate the first > > 16 elements, which is more efficient than per-skb allocation. > > If the cache is full on freeing, bulk-wipe the second half of > > the cache (32 elements). > > This also includes custom KASAN poisoning/unpoisoning to be > > double sure there are no use-after-free cases. > > > > To not change current behaviour, introduce a new function, > > napi_build_skb(), to optionally use a new approach later > > in drivers. > > > > Note on selected bulk size, 16: > > - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE > >and especially VETH_XDP_BATCH, which is also used to > >bulk-allocate skbuff_heads and was tested on powerful > >setups; > > - this also showed the best performance in the actual > >test series (from the array of {8, 16, 32}). 
> > > > Suggested-by: Edward Cree # Divide on two halves > > Suggested-by: Eric Dumazet# KASAN poisoning > > Cc: Dmitry Vyukov # Help with KASAN > > Cc: Paolo Abeni # Reduced batch size > > Signed-off-by: Alexander Lobakin > > --- > > include/linux/skbuff.h | 2 + > > net/core/skbuff.c | 94 -- > > 2 files changed, 83 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 0e0707296098..906122eac82a 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int > > frag_size); > > struct sk_buff *build_skb_around(struct sk_buff *skb, > > void *data, unsigned int frag_size); > > > > +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); > > + > > /** > > * alloc_skb - allocate a network buffer > > * @size: size to allocate > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 860a9d4f752f..9e1a8ded4acc 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, > > unsigned int sz, void *addr) > > } > > > > #define NAPI_SKB_CACHE_SIZE64 > > +#define NAPI_SKB_CACHE_BULK16 > > +#define NAPI_SKB_CACHE_HALF(NAPI_SKB_CACHE_SIZE / 2) > > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer Thanks, Al
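As an aside, a sketch of the Tx-before-Rx poll ordering Jesper describes (the foo_* driver helpers are made up): Tx completion feeds skbuff_heads into the per-CPU NAPI cache via napi_consume_skb(), and the Rx processing that follows in the same poll pulls them straight back out through napi_alloc_skb()/napi_build_skb().

```
/* Illustration only, foo_* helpers are hypothetical. */
static int foo_napi_poll(struct napi_struct *napi, int budget)
{
	struct foo_ring *ring = container_of(napi, struct foo_ring, napi);
	int work_done;

	/* completes Tx descriptors; calls napi_consume_skb(skb, budget),
	 * which refills the per-CPU skbuff_head cache
	 */
	foo_clean_tx_irq(ring);

	/* builds Rx skbs; napi_alloc_skb()/napi_build_skb() reuse the
	 * heads that were just cached by the Tx cleanup above
	 */
	work_done = foo_clean_rx_irq(ring, budget);

	if (work_done < budget && napi_complete_done(napi, work_done))
		foo_enable_irq(ring);

	return work_done;
}
```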
[PATCH v7 bpf-next 0/6] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first two bits refactor netdev_priv_flags a bit to harden them in terms of bitfield overflow, as IFF_TX_SKB_NO_LINEAR is the last one that fits into unsigned int. The fifth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v6 [3]: - rebase ontop of bpf-next after merge with net-next; - address kdoc warnings. >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. 
From v4 [1]:
 - fix 0002 build error due to inverted static_assert() condition (0day bot);
 - collect two Acked-bys (Magnus).

From v3 [0]:
 - refactor netdev_priv_flags to make it easier to add new ones and prevent
   bitwidth overflow;
 - add headroom (both standard and zerocopy) and tailroom (standard)
   reservation in skb for drivers to avoid potential reallocations;
 - fix skb->truesize accounting;
 - misc comment rewords.

[0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com
[1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210216172640.374487-1-aloba...@pm.me

Alexander Lobakin (3):
  netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
  netdevice: check for net_device::priv_flags bitfield overflow
  xsk: respect device's headroom and tailroom on generic xmit path

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page (aka generic zerocopy xmit)

 drivers/net/virtio_net.c | 3 +-
 include/linux/netdevice.h | 202 --
 net/xdp/xsk.c | 114
[PATCH v7 bpf-next 1/6] netdev_priv_flags: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but becomes fatal for the subsequent patch.

Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom")
Signed-off-by: Alexander Lobakin
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..3b6f82c2c271 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1577,6 +1577,7 @@ enum netdev_priv_flags {
 #define IFF_L3MDEV_SLAVE	IFF_L3MDEV_SLAVE
 #define IFF_TEAM		IFF_TEAM
 #define IFF_RXFH_CONFIGURED	IFF_RXFH_CONFIGURED
+#define IFF_PHONY_HEADROOM	IFF_PHONY_HEADROOM
 #define IFF_MACSEC		IFF_MACSEC
 #define IFF_NO_RX_HANDLER	IFF_NO_RX_HANDLER
 #define IFF_FAILOVER		IFF_FAILOVER
-- 
2.30.1
[PATCH v7 bpf-next 4/6] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo Acked-by: Michael S. Tsirkin Signed-off-by: Alexander Lobakin --- drivers/net/virtio_net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e63792549..f2ff6c3906c1 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev) return -ENOMEM; /* Set up network device as normal. */ - dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE; + dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE | + IFF_TX_SKB_NO_LINEAR; dev->netdev_ops = &virtnet_netdev; dev->features = NETIF_F_HIGHDMA; -- 2.30.1
[PATCH v7 bpf-next 3/6] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2c1a642ecdc0..1186ba901ad3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1518,6 +1518,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE_BIT: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER_BIT: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK_BIT: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR_BIT: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) * * @NETDEV_PRIV_FLAG_COUNT: total priv flags count */ @@ -1553,6 +1555,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE_BIT, IFF_L3MDEV_RX_HANDLER_BIT, IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, NETDEV_PRIV_FLAG_COUNT, }; @@ -1595,6 +1598,7 @@ static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >= #define IFF_FAILOVER_SLAVE __IFF(FAILOVER_SLAVE) #define IFF_L3MDEV_RX_HANDLER __IFF(L3MDEV_RX_HANDLER) #define IFF_LIVE_RENAME_OK __IFF(LIVE_RENAME_OK) +#define IFF_TX_SKB_NO_LINEAR __IFF(TX_SKB_NO_LINEAR) /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v7 bpf-next 2/6] netdevice: check for net_device::priv_flags bitfield overflow
We almost ran out of unsigned int bitwidth. Define priv flags and check for potential overflow in the fashion of netdev_features_t. Defined this way, priv_flags can be easily expanded later with just changing its typedef. Signed-off-by: Alexander Lobakin Reported-by: kernel test robot # Inverted assert condition --- include/linux/netdevice.h | 199 -- 1 file changed, 105 insertions(+), 94 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3b6f82c2c271..2c1a642ecdc0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1483,107 +1483,118 @@ struct net_device_ops { * * You should have a pretty good reason to be extending these flags. * - * @IFF_802_1Q_VLAN: 802.1Q VLAN device - * @IFF_EBRIDGE: Ethernet bridging device - * @IFF_BONDING: bonding master or slave - * @IFF_ISATAP: ISATAP interface (RFC4214) - * @IFF_WAN_HDLC: WAN HDLC device - * @IFF_XMIT_DST_RELEASE: dev_hard_start_xmit() is allowed to + * @IFF_802_1Q_VLAN_BIT: 802.1Q VLAN device + * @IFF_EBRIDGE_BIT: Ethernet bridging device + * @IFF_BONDING_BIT: bonding master or slave + * @IFF_ISATAP_BIT: ISATAP interface (RFC4214) + * @IFF_WAN_HDLC_BIT: WAN HDLC device + * @IFF_XMIT_DST_RELEASE_BIT: dev_hard_start_xmit() is allowed to * release skb->dst - * @IFF_DONT_BRIDGE: disallow bridging this ether dev - * @IFF_DISABLE_NETPOLL: disable netpoll at run-time - * @IFF_MACVLAN_PORT: device used as macvlan port - * @IFF_BRIDGE_PORT: device used as bridge port - * @IFF_OVS_DATAPATH: device used as Open vSwitch datapath port - * @IFF_TX_SKB_SHARING: The interface supports sharing skbs on transmit - * @IFF_UNICAST_FLT: Supports unicast filtering - * @IFF_TEAM_PORT: device used as team port - * @IFF_SUPP_NOFCS: device supports sending custom FCS - * @IFF_LIVE_ADDR_CHANGE: device supports hardware address + * @IFF_DONT_BRIDGE_BIT: disallow bridging this ether dev + * @IFF_DISABLE_NETPOLL_BIT: disable netpoll at run-time + * @IFF_MACVLAN_PORT_BIT: device used as macvlan port + * @IFF_BRIDGE_PORT_BIT: device used as bridge port + * @IFF_OVS_DATAPATH_BIT: device used as Open vSwitch datapath port + * @IFF_TX_SKB_SHARING_BIT: The interface supports sharing skbs on transmit + * @IFF_UNICAST_FLT_BIT: Supports unicast filtering + * @IFF_TEAM_PORT_BIT: device used as team port + * @IFF_SUPP_NOFCS_BIT: device supports sending custom FCS + * @IFF_LIVE_ADDR_CHANGE_BIT: device supports hardware address * change when it's running - * @IFF_MACVLAN: Macvlan device - * @IFF_XMIT_DST_RELEASE_PERM: IFF_XMIT_DST_RELEASE not taking into account + * @IFF_MACVLAN_BIT: Macvlan device + * @IFF_XMIT_DST_RELEASE_PERM_BIT: IFF_XMIT_DST_RELEASE not taking into account * underlying stacked devices - * @IFF_L3MDEV_MASTER: device is an L3 master device - * @IFF_NO_QUEUE: device can run without qdisc attached - * @IFF_OPENVSWITCH: device is a Open vSwitch master - * @IFF_L3MDEV_SLAVE: device is enslaved to an L3 master device - * @IFF_TEAM: device is a team device - * @IFF_RXFH_CONFIGURED: device has had Rx Flow indirection table configured - * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external + * @IFF_L3MDEV_MASTER_BIT: device is an L3 master device + * @IFF_NO_QUEUE_BIT: device can run without qdisc attached + * @IFF_OPENVSWITCH_BIT: device is a Open vSwitch master + * @IFF_L3MDEV_SLAVE_BIT: device is enslaved to an L3 master device + * @IFF_TEAM_BIT: device is a team device + * @IFF_RXFH_CONFIGURED_BIT: device has had Rx Flow indirection table configured + * @IFF_PHONY_HEADROOM_BIT: the headroom value 
is controlled by an external * entity (i.e. the master device for bridged veth) - * @IFF_MACSEC: device is a MACsec device - * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook - * @IFF_FAILOVER: device is a failover master device - * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device - * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device - * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_MACSEC_BIT: device is a MACsec device + * @IFF_NO_RX_HANDLER_BIT: device doesn't support the rx_handler hook + * @IFF_FAILOVER_BIT: device is a failover master device + * @IFF_FAILOVER_SLAVE_BIT: device is lower dev of a failover master device + * @IFF_L3MDEV_RX_HANDLER_BIT: only invoke the rx handler of L3 master device + * @IFF_LIVE_RENAME_OK_BIT: rename is allowed while device is up and running + * + * @NETDEV_PRIV_FLAG_COUNT: total priv flags count */ enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_
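The hunk above is truncated, but from the pieces visible elsewhere in the series (the IFF_*_BIT enum, the __IFF() helper and the static_assert() on NETDEV_PRIV_FLAG_COUNT), the overflow guard described in the commit message presumably boils down to something like the following reconstruction:

```
/* Presumed shape of the guard (reconstructed, details may differ from
 * the actual patch): flag values are generated from the IFF_*_BIT enum,
 * and the build breaks as soon as NETDEV_PRIV_FLAG_COUNT no longer fits
 * into the flags type.
 */
typedef unsigned int netdev_priv_flags_t;

#define __IFF(name)	((netdev_priv_flags_t)BIT(IFF_##name##_BIT))

static_assert(sizeof(netdev_priv_flags_t) * BITS_PER_BYTE >=
	      NETDEV_PRIV_FLAG_COUNT);
```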
[PATCH v7 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li [ alobakin: - expand subject to make it clearer; - improve skb->truesize calculation; - reserve some headroom in skb for drivers; - tailroom is not needed as skb is non-linear ] Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 120 -- 1 file changed, 96 insertions(+), 24 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 143979ea4165..a71ed664da0a 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct xsk_buff_pool *pool = xs->pool; + u32 hr, len, ts, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + + skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + + addr = desc->addr; + len = desc->len; + ts = pool->unaligned ? len : pool->chunk_size; + + buffer = xsk_buff_raw_get_data(pool, addr); + offset = offset_in_page(buffer); + addr = buffer - pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = pool->umem->pgs[addr >> PAGE_SHIFT]; + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += ts; + + refcount_add(ts, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, +struct xdp_desc *desc) +{ + struct net_device *dev = xs->dev; + struct sk_buff *skb; + + if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + u32 hr, tr, len; + void *buffer; + int err; + + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)); + tr = dev->needed_tailroom; + len = desc->len; + + skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_reserve(skb, hr); + skb_put(skb, len); + + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; - u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - hr = max(NET_SKB_PAD, 
L1_CACHE_ALIGN(xs->dev->needed_headroom)); - tr = xs->dev->needed_tailroom; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN;
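The hunk ends before the rest of the xmit loop; per the changelog note about switching to ERR_PTR()/PTR_ERR() for error handling, the loop presumably ends up consuming the new helper roughly as sketched below (not the literal remainder of the diff):

```
/* Sketch of the refactored xmit loop (reconstructed from the changelog,
 * not the actual remainder of the hunk).
 */
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
	if (max_batch-- == 0) {
		err = -EAGAIN;
		break;
	}

	skb = xsk_build_skb(xs, &desc);
	if (IS_ERR(skb)) {
		err = PTR_ERR(skb);
		break;
	}

	/* ... charge the completion ring and transmit the skb as before ... */
}
```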
[PATCH v7 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin Acked-by: Magnus Karlsson --- net/xdp/xsk.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4faabd1ecfd1..143979ea4165 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk) struct sk_buff *skb; unsigned long flags; int err = 0; + u32 hr, tr; mutex_lock(&xs->mutex); if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; + hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom)); + tr = xs->dev->needed_tailroom; + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk) } len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); + skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err); if (unlikely(!skb)) goto out; + skb_reserve(skb, hr); skb_put(skb, len); + addr = desc.addr; buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); -- 2.30.1
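For a concrete feel of what gets reserved, a worked example (assuming 64-byte cachelines, so NET_SKB_PAD is 64; actual values are arch-dependent):

```
/* Worked example of the reservation above, assuming L1_CACHE_BYTES == 64
 * and therefore NET_SKB_PAD == 64:
 *
 *   needed_headroom == 0   -> hr = max(64, L1_CACHE_ALIGN(0))   == 64
 *   needed_headroom == 18  -> hr = max(64, L1_CACHE_ALIGN(18))  == 64
 *   needed_headroom == 128 -> hr = max(64, L1_CACHE_ALIGN(128)) == 128
 *
 * needed_tailroom is taken as-is, so each skb is allocated with
 * hr + len + tr bytes, hr of which are reserved in front of the data.
 */
hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
tr = xs->dev->needed_tailroom;
```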
[PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)
This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. The first bit adds missing IFF self-definition. It's a bit out, but "while we are here". The fourth patch adds headroom and tailroom reservations for the allocated skbs on XSK generic xmit path. This ensures there won't be any unwanted skb reallocations on fast-path due to headroom and/or tailroom driver/device requirements (own headers/descriptors etc.). The other three add a new private flag, declare it in virtio_net driver and introduce generic XSK zerocopy xmit itself. The main body of work is created and done by Xuan Zhuo. His original cover letter: v3: Optimized code v2: 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature 2. split the patch to three: a. add priv_flags IFF_TX_SKB_NO_LINEAR b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR c. When there is support this flag, construct skb without linear space 3. use ERR_PTR() and PTR_ERR() to handle the err v1 message log: --- This patch is used to construct skb based on page to save memory copy overhead. This has one problem: We construct the skb by fill the data page as a frag into the skb. In this way, the linear space is empty, and the header information is also in the frag, not in the linear space, which is not allowed for some network cards. For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb I also tried to use build_skb to construct skb, but because of the existence of skb_shinfo, it must be behind the linear space, so this method is not working. We can't put skb_shinfo on desc->addr, it will be exposed to users, this is not safe. Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the network card supports the header information of the packet in the frag and not in the linear space. Performance Testing The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s ``` Test result data: size64 512 10241500 copy1916747 1775988 1600203 1440054 page1974058 1953655 1945463 1904478 percent 3.0%10.0% 21.58% 32.3% >From v7 [4]: - drop netdev priv flags rework (will be issued separately); - pick up Acks from John. >From v6 [3]: - rebase ontop of bpf-next after merge with net-next; - address kdoc warnings. >From v5 [2]: - fix a refcount leak in 0006 introduced in v4. 
From v4 [1]:
 - fix 0002 build error due to inverted static_assert() condition (0day bot);
 - collect two Acked-bys (Magnus).

From v3 [0]:
 - refactor netdev_priv_flags to make it easier to add new ones and prevent
   bitwidth overflow;
 - add headroom (both standard and zerocopy) and tailroom (standard)
   reservation in skb for drivers to avoid potential reallocations;
 - fix skb->truesize accounting;
 - misc comment rewords.

[0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanz...@linux.alibaba.com
[1] https://lore.kernel.org/netdev/20210216113740.62041-1-aloba...@pm.me
[2] https://lore.kernel.org/netdev/2021021614.5861-1-aloba...@pm.me
[3] https://lore.kernel.org/netdev/20210216172640.374487-1-aloba...@pm.me
[4] https://lore.kernel.org/netdev/20210217120003.7938-1-aloba...@pm.me

Alexander Lobakin (2):
  netdevice: add missing IFF_PHONY_HEADROOM self-definition
  xsk: respect device's headroom and tailroom on generic xmit path

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page (aka generic zerocopy xmit)

 drivers/net/virtio_net.c | 3 +-
 include/linux/netdevice.h | 5 ++
 net/xdp/xsk.c
[PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition
This is harmless for now, but can be fatal for future refactors. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin Acked-by: John Fastabend --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ddf4cfc12615..3b6f82c2c271 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1577,6 +1577,7 @@ enum netdev_priv_flags { #define IFF_L3MDEV_SLAVE IFF_L3MDEV_SLAVE #define IFF_TEAM IFF_TEAM #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED +#define IFF_PHONY_HEADROOM IFF_PHONY_HEADROOM #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER -- 2.30.1
[PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear
From: Xuan Zhuo In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Signed-off-by: Xuan Zhuo Suggested-by: Alexander Lobakin [ alobakin: give a new flag more detailed description ] Signed-off-by: Alexander Lobakin Acked-by: John Fastabend --- include/linux/netdevice.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3b6f82c2c271..6cef47b76cc6 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1518,6 +1518,8 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with + * skb_headlen(skb) == 0 (data starts from frag0) */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1551,6 +1553,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE = 1<<28, IFF_L3MDEV_RX_HANDLER = 1<<29, IFF_LIVE_RENAME_OK = 1<<30, + IFF_TX_SKB_NO_LINEAR= 1<<31, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1584,6 +1587,7 @@ enum netdev_priv_flags { #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE #define IFF_L3MDEV_RX_HANDLER IFF_L3MDEV_RX_HANDLER #define IFF_LIVE_RENAME_OK IFF_LIVE_RENAME_OK +#define IFF_TX_SKB_NO_LINEAR IFF_TX_SKB_NO_LINEAR /** * struct net_device - The DEVICE structure. -- 2.30.1
[PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR
From: Xuan Zhuo

Virtio-net supports the case where the skb linear space is empty, so
advertise this by setting the IFF_TX_SKB_NO_LINEAR priv_flag.

Signed-off-by: Xuan Zhuo
Acked-by: Michael S. Tsirkin
Signed-off-by: Alexander Lobakin
Acked-by: John Fastabend
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..f2ff6c3906c1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		return -ENOMEM;
 
 	/* Set up network device as normal. */
-	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+			   IFF_TX_SKB_NO_LINEAR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
-- 
2.30.1
[PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The
skb is allocated with a size of just desc->len, so it comes to the
driver/device with no reserved headroom and/or tailroom. Lots of drivers
need some headroom (and sometimes tailroom) to prepend (and/or append) some
headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO,
TLS etc.), and in case of no available space skb_cow_head() will reallocate
the skb. Reallocations are unwanted on the fast-path, especially when it
comes to XDP, so generic XSK xmit should reserve the space declared in
dev->needed_headroom and dev->needed_tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of
dev->hard_header_len + dev->needed_headroom, aligned by 16.
However, on XSK xmit the hard header is already in the chunk, so
hard_header_len is not needed. But it'd still be better to align data up to
a cacheline, while reserving no less than the driver requests for headroom.
NET_SKB_PAD here is to double-insure there will be no reallocations even
when the driver advertises no needed_headroom, but in fact needs it (not
such a rare case).

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
Acked-by: John Fastabend
---
 net/xdp/xsk.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
+	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+	tr = xs->dev->needed_tailroom;
+
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		char *buffer;
 		u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		}
 
 		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
+		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
 		if (unlikely(!skb))
 			goto out;
 
+		skb_reserve(skb, hr);
 		skb_put(skb, len);
+
 		addr = desc.addr;
 		buffer = xsk_buff_raw_get_data(xs->pool, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
-- 
2.30.1
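A worked example of the headroom formula above (values are illustrative and
assume NET_SKB_PAD == 64, i.e. a 64-byte L1 cache line):

	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));

	/* needed_headroom == 0   -> hr = max(64, 0)   = 64  (NET_SKB_PAD insurance)   */
	/* needed_headroom == 14  -> hr = max(64, 64)  = 64  (rounded up to cacheline) */
	/* needed_headroom == 128 -> hr = max(64, 128) = 128 (already aligned)         */

	/* the skb is then allocated as hr + desc.len + needed_tailroom, and
	 * skb_reserve(skb, hr) leaves the whole hr as headroom for the driver */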
[PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit)
From: Xuan Zhuo

This patch constructs the skb directly from the pool pages to save memory
copy overhead. It is based on IFF_TX_SKB_NO_LINEAR: only when the network
card's priv_flags advertise IFF_TX_SKB_NO_LINEAR will the pages be used to
build the skb directly. If the feature is not supported, the data still has
to be copied into a linear skb.

Performance Testing

The test environment is an Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t -S -s
```
Test result data:
size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

Signed-off-by: Xuan Zhuo
Reviewed-by: Dust Li
[ alobakin:
 - expand subject to make it clearer;
 - improve skb->truesize calculation;
 - reserve some headroom in skb for drivers;
 - tailroom is not needed as skb is non-linear ]
Signed-off-by: Alexander Lobakin
Acked-by: Magnus Karlsson
Acked-by: John Fastabend
---
 net/xdp/xsk.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 96 insertions(+), 24 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 143979ea4165..a71ed664da0a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+					      struct xdp_desc *desc)
+{
+	struct xsk_buff_pool *pool = xs->pool;
+	u32 hr, len, ts, offset, copy, copied;
+	struct sk_buff *skb;
+	struct page *page;
+	void *buffer;
+	int err, i;
+	u64 addr;
+
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+
+	skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+	if (unlikely(!skb))
+		return ERR_PTR(err);
+
+	skb_reserve(skb, hr);
+
+	addr = desc->addr;
+	len = desc->len;
+	ts = pool->unaligned ? len : pool->chunk_size;
+
+	buffer = xsk_buff_raw_get_data(pool, addr);
+	offset = offset_in_page(buffer);
+	addr = buffer - pool->addrs;
+
+	for (copied = 0, i = 0; copied < len; i++) {
+		page = pool->umem->pgs[addr >> PAGE_SHIFT];
+		get_page(page);
+
+		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+		skb_fill_page_desc(skb, i, page, offset, copy);
+
+		copied += copy;
+		addr += copy;
+		offset = 0;
+	}
+
+	skb->len += len;
+	skb->data_len += len;
+	skb->truesize += ts;
+
+	refcount_add(ts, &xs->sk.sk_wmem_alloc);
+
+	return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+				     struct xdp_desc *desc)
+{
+	struct net_device *dev = xs->dev;
+	struct sk_buff *skb;
+
+	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+		skb = xsk_build_skb_zerocopy(xs, desc);
+		if (IS_ERR(skb))
+			return skb;
+	} else {
+		u32 hr, tr, len;
+		void *buffer;
+		int err;
+
+		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+		tr = dev->needed_tailroom;
+		len = desc->len;
+
+		skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+		if (unlikely(!skb))
+			return ERR_PTR(err);
+
+		skb_reserve(skb, hr);
+		skb_put(skb, len);
+
+		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			return ERR_PTR(err);
+		}
+	}
+
+	skb->dev = dev;
+	skb->priority = xs->sk.sk_priority;
+	skb->mark = xs->sk.sk_mark;
+	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+	skb->destructor = xsk_destruct_skb;
+
+	return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
@@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
-	u32 hr, tr;
 
 	mutex_lock(&xs->mutex);
 
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
-	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
-	tr = xs->dev->needed_tailroom;
-
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		char *buffer;
-		u64 addr;
-		u32 len;
-
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
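A worked trace of the frag loop in xsk_build_skb_zerocopy() above (the
numbers are made up for illustration; PAGE_SIZE is assumed to be 4096).
Take an unaligned-mode descriptor of 3000 bytes whose buffer starts 3840
bytes into its first page:

	/* i = 0: copy = min(4096 - 3840, 3000 - 0)   = 256  */
	/* i = 1: copy = min(4096 - 0,    3000 - 256) = 2744 */
	/*
	 * Two frags are attached via skb_fill_page_desc(); no payload is
	 * memcpy'ed, only page references are taken with get_page().
	 */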
Re: [RESEND PATCH net v4] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
From: Dongseok Yi Date: Sat, 30 Jan 2021 08:13:27 +0900 > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > forwarding. Only the header of head_skb from ip_finish_output_gso -> > skb_gso_segment is updated but following frag_skbs are not updated. > > A call path skb_mac_gso_segment -> inet_gso_segment -> > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > does not try to update UDP/IP header of the segment list but copy > only the MAC header. > > Update port, addr and check of each skb of the segment list in > __udp_gso_segment_list. It covers both SNAT and DNAT. > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > Signed-off-by: Dongseok Yi > Acked-by: Steffen Klassert > --- > v1: > Steffen Klassert said, there could be 2 options. > https://lore.kernel.org/patchwork/patch/1362257/ > I was trying to write a quick fix, but it was not easy to forward > segmented list. Currently, assuming DNAT only. > > v2: > Per Steffen Klassert request, moved the procedure from > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > v3: > Per Steffen Klassert request, applied fast return by comparing seg > and seg->next at the beginning of __udpv4_gso_segment_list_csum. > > Fixed uh->dest = *newport and iph->daddr = *newip to > *oldport = *newport and *oldip = *newip. > > v4: > Clear "Changes Requested" mark in > https://patchwork.kernel.org/project/netdevbpf > > Simplified the return statement in __udp_gso_segment_list. > > include/net/udp.h | 2 +- > net/ipv4/udp_offload.c | 69 > ++ > net/ipv6/udp_offload.c | 2 +- > 3 files changed, 66 insertions(+), 7 deletions(-) > > diff --git a/include/net/udp.h b/include/net/udp.h > index 877832b..01351ba 100644 > --- a/include/net/udp.h > +++ b/include/net/udp.h > @@ -178,7 +178,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, > struct sk_buff *skb, > int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup); > > struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, > - netdev_features_t features); > + netdev_features_t features, bool is_ipv6); > > static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb) > { > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index ff39e94..cfc8726 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -187,8 +187,67 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff > *skb, > } > EXPORT_SYMBOL(skb_udp_tunnel_segment); > > +static void __udpv4_gso_segment_csum(struct sk_buff *seg, > + __be32 *oldip, __be32 *newip, > + __be16 *oldport, __be16 *newport) > +{ > + struct udphdr *uh; > + struct iphdr *iph; > + > + if (*oldip == *newip && *oldport == *newport) > + return; > + > + uh = udp_hdr(seg); > + iph = ip_hdr(seg); > + > + if (uh->check) { > + inet_proto_csum_replace4(&uh->check, seg, *oldip, *newip, > + true); > + inet_proto_csum_replace2(&uh->check, seg, *oldport, *newport, > + false); > + if (!uh->check) > + uh->check = CSUM_MANGLED_0; > + } > + *oldport = *newport; > + > + csum_replace4(&iph->check, *oldip, *newip); > + *oldip = *newip; > +} > + > +static struct sk_buff *__udpv4_gso_segment_list_csum(struct sk_buff *segs) > +{ > + struct sk_buff *seg; > + struct udphdr *uh, *uh2; > + struct iphdr *iph, *iph2; > + > + seg = segs; > + uh = udp_hdr(seg); > + iph = ip_hdr(seg); > + > + if ((udp_hdr(seg)->dest == udp_hdr(seg->next)->dest) && > + (udp_hdr(seg)->source == udp_hdr(seg->next)->source) && > + (ip_hdr(seg)->daddr == ip_hdr(seg->next)->daddr) && > + (ip_hdr(seg)->saddr == 
ip_hdr(seg->next)->saddr)) > + return segs; > + > + while ((seg = seg->next)) { > + uh2 = udp_hdr(seg); > + iph2 = ip_hdr(seg); > + > + __udpv4_gso_segment_csum(seg, > + &iph2->saddr, &iph->saddr, > + &uh2->source, &uh->source); > + __udpv4_gso_segment_csum(seg, > + &iph2->daddr, &iph->daddr, > + &uh2->dest, &uh->dest); > + } > + > + return segs; > +} > + > static struct sk_buff *__udp_gso_segment_list(struct sk_buff *skb, > - netdev_features_t features) > + netdev_features_t features, > + bool is_ipv6) > { > unsigned int mss = skb_shinfo(skb)->gso_size; > > @@ -198,11 +257,11 @@ static struct sk_buff *__udp_gso_segment_list(struct > sk_buff *skb, > >
Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()
From: Jakub Kicinski
Date: Fri, 29 Jan 2021 18:39:07 -0800

> On Wed, 27 Jan 2021 20:11:23 +0000 Alexander Lobakin wrote:
> > + * dev_page_is_reserved - check whether a page can be reused for network Rx
> > + * @page: the page to test
> > + *
> > + * A page shouldn't be considered for reusing/recycling if it was allocated
> > + * under memory pressure or at a distant memory node.
> > + *
> > + * Returns true if this page should be returned to page allocator, false
> > + * otherwise.
> > + */
> > +static inline bool dev_page_is_reserved(const struct page *page)
>
> Am I the only one who feels like "reusable" is a better term than
> "reserved".

I thought about it, but this would require inverting the conditions in
most of the drivers. I decided to keep it as it is. I can redo it if
"reusable" is preferred.

Regarding "no objections to take patch 1 through net-next": patches 2-3
depend on it, so I can't put it in a separate series.

Thanks,
Al
Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()
From: Jakub Kicinski Date: Sat, 30 Jan 2021 11:07:07 -0800 > On Sat, 30 Jan 2021 15:42:29 +0000 Alexander Lobakin wrote: > > > On Wed, 27 Jan 2021 20:11:23 +0000 Alexander Lobakin wrote: > > > > + * dev_page_is_reserved - check whether a page can be reused for > > > > network Rx > > > > + * @page: the page to test > > > > + * > > > > + * A page shouldn't be considered for reusing/recycling if it was > > > > allocated > > > > + * under memory pressure or at a distant memory node. > > > > + * > > > > + * Returns true if this page should be returned to page allocator, > > > > false > > > > + * otherwise. > > > > + */ > > > > +static inline bool dev_page_is_reserved(const struct page *page) > > > > > > Am I the only one who feels like "reusable" is a better term than > > > "reserved". > > > > I thought about it, but this will need to inverse the conditions in > > most of the drivers. I decided to keep it as it is. > > I can redo if "reusable" is preferred. > > Naming is hard. As long as the condition is not a double negative it > reads fine to me, but that's probably personal preference. > The thing that doesn't sit well is the fact that there is nothing > "reserved" about a page from another NUMA node.. But again, if nobody > +1s this it's whatever... Agree on NUMA and naming. I'm a bit surprised that 95% of drivers have this helper called "reserved" (one of the reasons why I finished with this variant). Let's say, if anybody else will vote for "reusable", I'll pick it for v3. > That said can we move the likely()/unlikely() into the helper itself? > People on the internet may say otherwise but according to my tests > using __builtin_expect() on a return value of a static inline helper > works just fine. Sounds fine, this will make code more elegant. Will publish v3 soon. Thanks, Al
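A minimal standalone sketch of the point Jakub makes above about
__builtin_expect() surviving inlining (the function and variable names are
made up; this is not kernel code):

	#include <stdbool.h>

	#define likely(x)	__builtin_expect(!!(x), 1)

	/* branch hint folded into the helper, as proposed for v3 */
	static inline bool page_ok(int nid, bool pfmemalloc)
	{
		return likely(nid == 0 && !pfmemalloc);
	}

	bool can_recycle(int nid, bool pfmemalloc)
	{
		/* no extra likely()/unlikely() needed at the call site */
		return page_ok(nid, pfmemalloc);
	}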
[PATCH v3 net-next 2/5] skbuff: constify skb_propagate_pfmemalloc() "page" argument
The function doesn't write anything to the page struct itself, so this argument can be const. Misc: align second argument to the brace while at it. Signed-off-by: Alexander Lobakin Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- include/linux/skbuff.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9313b5aaf45b..b027526da4f9 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2943,8 +2943,8 @@ static inline struct page *dev_alloc_page(void) * @page: The page that was allocated from skb_alloc_page * @skb: The skb that may need pfmemalloc set */ -static inline void skb_propagate_pfmemalloc(struct page *page, -struct sk_buff *skb) +static inline void skb_propagate_pfmemalloc(const struct page *page, + struct sk_buff *skb) { if (page_is_pfmemalloc(page)) skb->pfmemalloc = true; -- 2.30.0
[PATCH v3 net-next 4/5] net: use the new dev_page_is_reusable() instead of private versions
Now we can remove a bunch of identical functions from the drivers and make them use common dev_page_is_reusable(). All {,un}likely() checks are omitted since it's already present in this helper. Also update some comments near the call sites. Suggested-by: David Rientjes Suggested-by: Jakub Kicinski Cc: John Hubbard Signed-off-by: Alexander Lobakin --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 17 ++--- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 13 - drivers/net/ethernet/intel/i40e/i40e_txrx.c | 15 +-- drivers/net/ethernet/intel/iavf/iavf_txrx.c | 15 +-- drivers/net/ethernet/intel/ice/ice_txrx.c | 13 ++--- drivers/net/ethernet/intel/igb/igb_main.c | 9 ++--- drivers/net/ethernet/intel/igc/igc_main.c | 9 ++--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++--- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 9 ++--- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 +-- 10 files changed, 23 insertions(+), 93 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 512080640cbc..f39f5b1c4cec 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2800,12 +2800,6 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, writel(i, ring->tqp->io_base + HNS3_RING_RX_RING_HEAD_REG); } -static bool hns3_page_is_reusable(struct page *page) -{ - return page_to_nid(page) == numa_mem_id() && - !page_is_pfmemalloc(page); -} - static bool hns3_can_reuse_page(struct hns3_desc_cb *cb) { return (page_count(cb->priv) - cb->pagecnt_bias) == 1; @@ -2823,10 +2817,11 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i, skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len, size - pull_len, truesize); - /* Avoid re-using remote pages, or the stack is still using the page -* when page_offset rollback to zero, flag default unreuse + /* Avoid re-using remote and pfmemalloc pages, or the stack is still +* using the page when page_offset rollback to zero, flag default +* unreuse */ - if (unlikely(!hns3_page_is_reusable(desc_cb->priv)) || + if (!dev_page_is_reusable(desc_cb->priv) || (!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) { __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias); return; @@ -3083,8 +3078,8 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, if (length <= HNS3_RX_HEAD_SIZE) { memcpy(__skb_put(skb, length), va, ALIGN(length, sizeof(long))); - /* We can reuse buffer as-is, just make sure it is local */ - if (likely(hns3_page_is_reusable(desc_cb->priv))) + /* We can reuse buffer as-is, just make sure it is reusable */ + if (dev_page_is_reusable(desc_cb->priv)) desc_cb->reuse_flag = 1; else /* This page cannot be reused so discard it */ __page_frag_cache_drain(desc_cb->priv, diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 99b8252eb969..247f44f4cb30 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -194,17 +194,12 @@ static void fm10k_reuse_rx_page(struct fm10k_ring *rx_ring, DMA_FROM_DEVICE); } -static inline bool fm10k_page_is_reserved(struct page *page) -{ - return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); -} - static bool fm10k_can_reuse_rx_page(struct fm10k_rx_buffer *rx_buffer, struct page *page, unsigned int __maybe_unused truesize) { - /* avoid re-using remote pages */ - if 
(unlikely(fm10k_page_is_reserved(page))) + /* avoid re-using remote and pfmemalloc pages */ + if (!dev_page_is_reusable(page)) return false; #if (PAGE_SIZE < 8192) @@ -265,8 +260,8 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer, if (likely(size <= FM10K_RX_HDR_LEN)) { memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long))); - /* page is not reserved, we can reuse buffer as-is */ - if (likely(!fm10k_page_is_reserved(page))) + /* page is reusable, we can reuse buffer as-is */ + if (dev_page_is_reusable(page)) return true; /* this page cannot be reused so discard it */ diff --git a/drivers/net/ethernet/intel/i40e/i4
[PATCH v3 net-next 5/5] net: page_pool: simplify page recycling condition tests
pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, this function is just a redundant wrapper over page_is_pfmemalloc(), so inline it into its sole call site. Signed-off-by: Alexander Lobakin Acked-by: Jesper Dangaard Brouer Reviewed-by: Ilias Apalodimas Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- net/core/page_pool.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f3c690b8c8e3..ad8b0707af04 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -350,14 +350,6 @@ static bool page_pool_recycle_in_cache(struct page *page, return true; } -/* page is NOT reusable when: - * 1) allocated when system is under some pressure. (page_is_pfmemalloc) - */ -static bool pool_page_reusable(struct page_pool *pool, struct page *page) -{ - return !page_is_pfmemalloc(page); -} - /* If the page refcnt == 1, this will try to recycle the page. * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for * the configured size min(dma_sync_size, pool->max_len). @@ -373,9 +365,11 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * regular page allocator APIs. * * refcnt == 1 means page_pool owns page, and can recycle it. +* +* page is NOT reusable when allocated when system is under +* some pressure. (page_is_pfmemalloc) */ - if (likely(page_ref_count(page) == 1 && - pool_page_reusable(pool, page))) { + if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) { /* Read barrier done in page_ref_count / READ_ONCE */ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) -- 2.30.0
Re: [PATCH v3 net-next 5/5] net: page_pool: simplify page recycling condition tests
From: Matthew Wilcox Date: Sun, 31 Jan 2021 12:23:48 + > On Sun, Jan 31, 2021 at 12:12:11PM +0000, Alexander Lobakin wrote: > > pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, > > this function is just a redundant wrapper over page_is_pfmemalloc(), > > so inline it into its sole call site. > > Why doesn't this want to use {dev_}page_is_reusable()? Page Pool handles NUMA on its own. Replacing plain page_is_pfmemalloc() with dev_page_is_reusable() will only add a completely redundant and always-false check on the fastpath. Al
Re: [PATCH v3 net-next 3/5] net: introduce common dev_page_is_reusable()
From: Matthew Wilcox Date: Sun, 31 Jan 2021 12:22:05 + > On Sun, Jan 31, 2021 at 12:11:52PM +0000, Alexander Lobakin wrote: > > A bunch of drivers test the page before reusing/recycling for two > > common conditions: > > - if a page was allocated under memory pressure (pfmemalloc page); > > - if a page was allocated at a distant memory node (to exclude > >slowdowns). > > > > Introduce a new common inline for doing this, with likely() already > > folded inside to make driver code a bit simpler. > > I don't see the need for the 'dev_' prefix. That actually confuses me > because it makes me think this is tied to ZONE_DEVICE or some such. Several functions right above this one also use 'dev_' prefix. It's a rather old mark that it's about network devices. > So how about calling it just 'page_is_reusable' and putting it in mm.h > with page_is_pfmemalloc() and making the comment a little less > network-centric? This pair of conditions (!pfmemalloc + local memory node) is really specific to network drivers. I didn't see any other instances of such tests, so I don't see a reason to place it in a more common mm.h. > Or call it something like skb_page_is_recyclable() since it's only used > by networking today. But I bet it could/should be used more widely. There's nothing about skb. Tested page is just a memory chunk for DMA transaction. It can be used as skb head/frag, for XDP buffer/frame or for XSK umem. > > +/** > > + * dev_page_is_reusable - check whether a page can be reused for network Rx > > + * @page: the page to test > > + * > > + * A page shouldn't be considered for reusing/recycling if it was allocated > > + * under memory pressure or at a distant memory node. > > + * > > + * Returns false if this page should be returned to page allocator, true > > + * otherwise. > > + */ > > +static inline bool dev_page_is_reusable(const struct page *page) > > +{ > > + return likely(page_to_nid(page) == numa_mem_id() && > > + !page_is_pfmemalloc(page)); > > +} > > + Al
[PATCH v3 net-next 0/5] net: consolidate page_is_pfmemalloc() usage
page_is_pfmemalloc() is used mostly by networking drivers to test if a page
can be considered for reusing/recycling.

It doesn't write anything to the struct page itself, so its sole argument
can be constified, as well as the first argument of
skb_propagate_pfmemalloc(). In Page Pool core code, it can be simply
inlined instead.

Most of the callers from NIC drivers were just doppelgangers of the same
condition tests. Factor them out into a new common function to deduplicate
the code.

Since v2 [1]:
- use more intuitive name for the new inline function since there's nothing
  "reserved" in remote pages (Jakub Kicinski, John Hubbard);
- fold likely() inside the helper itself to make driver code a bit fancier
  (Jakub Kicinski);
- split function introduction and using into two separate commits;
- collect some more tags (Jesse Brandeburg, David Rientjes).

Since v1 [0]:
- new: reduce code duplication by introducing a new common function to test
  if a page can be reused/recycled (David Rientjes);
- collect autographs for Page Pool bits (Jesper Dangaard Brouer,
  Ilias Apalodimas).

[0] https://lore.kernel.org/netdev/20210125164612.243838-1-aloba...@pm.me
[1] https://lore.kernel.org/netdev/20210127201031.98544-1-aloba...@pm.me

Alexander Lobakin (5):
  mm: constify page_is_pfmemalloc() argument
  skbuff: constify skb_propagate_pfmemalloc() "page" argument
  net: introduce common dev_page_is_reusable()
  net: use the new dev_page_is_reusable() instead of private versions
  net: page_pool: simplify page recycling condition tests

 .../net/ethernet/hisilicon/hns3/hns3_enet.c   | 17 ++--
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 13
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 15 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 15 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 13 ++--
 drivers/net/ethernet/intel/igb/igb_main.c     |  9 ++---
 drivers/net/ethernet/intel/igc/igc_main.c     |  9 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  9 ++---
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  9 ++---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  7 +--
 include/linux/mm.h                            |  2 +-
 include/linux/skbuff.h                        | 20 +--
 net/core/page_pool.c                          | 14 -
 13 files changed, 46 insertions(+), 106 deletions(-)

-- 
2.30.0
[PATCH v3 net-next 1/5] mm: constify page_is_pfmemalloc() argument
The function only tests for page->index, so its argument should be const. Signed-off-by: Alexander Lobakin Reviewed-by: Jesse Brandeburg Acked-by: David Rientjes --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..078633d43af9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1584,7 +1584,7 @@ struct address_space *page_mapping_file(struct page *page); * ALLOC_NO_WATERMARKS and the low watermark was not * met implying that the system is under some pressure. */ -static inline bool page_is_pfmemalloc(struct page *page) +static inline bool page_is_pfmemalloc(const struct page *page) { /* * Page index cannot be this large so this must be -- 2.30.0