[PATCH net] drivers/net/wan/hdlc_cisco: Add hard_header_len
This driver didn't set hard_header_len. This patch sets hard_header_len for it
according to its header_ops->create function.

This driver's header_ops->create function (cisco_hard_header) creates a header
of type struct hdlc_header, so hard_header_len should be set to
sizeof(struct hdlc_header).

Cc: Martin Schiller
Signed-off-by: Xie He
---
 drivers/net/wan/hdlc_cisco.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wan/hdlc_cisco.c b/drivers/net/wan/hdlc_cisco.c
index d8cba3625c18..444130655d8e 100644
--- a/drivers/net/wan/hdlc_cisco.c
+++ b/drivers/net/wan/hdlc_cisco.c
@@ -370,6 +370,7 @@ static int cisco_ioctl(struct net_device *dev, struct ifreq *ifr)
 		memcpy(&state(hdlc)->settings, &new_settings, size);
 		spin_lock_init(&state(hdlc)->lock);
 		dev->header_ops = &cisco_header_ops;
+		dev->hard_header_len = sizeof(struct hdlc_header);
 		dev->type = ARPHRD_CISCO;
 		call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev);
 		netif_dormant_on(dev);
-- 
2.25.1
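For reviewers' reference (not part of the patch), the header that
cisco_hard_header() builds is, as I read it in drivers/net/wan/hdlc_cisco.c --
please double-check against your tree:

struct hdlc_header {
	u8 address;	/* CISCO_UNICAST or CISCO_MULTICAST */
	u8 control;	/* always 0 */
	__be16 protocol;
}__packed;

so the patch reserves sizeof(struct hdlc_header) == 4 bytes of header room.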
Re: [PATCH nf-next v3 0/3] Netfilter egress hook
Hi Lukas,

On 8/27/20 10:55 AM, Lukas Wunner wrote:
> Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs
> such as DHCP and to prepare for in-kernel NAT64/NAT46.

Thinking more about this, how will this allow AF_PACKET traffic to be filtered
sufficiently? It won't. Any AF_PACKET application can freely set
PACKET_QDISC_BYPASS without additional privileges, and then dev_queue_xmit()
is bypassed in the host ns. This is therefore ineffective and not sufficient.
(From the container side these can be caught w/ host veth on ingress, but not
in the host ns, of course, so the hook won't be invoked.)

Thanks,
Daniel
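For readers following along, a minimal user-space sketch (mine, not from the
series) of the bypass Daniel is describing; CAP_NET_RAW is the only privilege
involved:

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int one = 1;
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Frames sent on this socket are handed straight to the driver
	 * (packet_direct_xmit()), skipping dev_queue_xmit() and with it
	 * any egress hook placed on that path.
	 */
	if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)))
		perror("setsockopt(PACKET_QDISC_BYPASS)");

	/* ... bind() to an ifindex and send raw frames as usual ... */
	return 0;
}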
Re: [PATCH] Revert "wlcore: Adding suppoprt for IGTK key in wlcore driver"
Steve deRosier writes:

> On Tue, Aug 25, 2020 at 10:49 PM Mauro Carvalho Chehab
> wrote:
>>
>> This patch causes a regression between Kernel 5.7 and 5.8 at wlcore:
>> with it applied, WiFi stops working, and the Kernel starts printing
>> this message every second:
>>
>>    wlcore: PHY firmware version: Rev 8.2.0.0.242
>>    wlcore: firmware booted (Rev 8.9.0.0.79)
>>    wlcore: ERROR command execute failure 14
>
> Only if NO firmware for the device in question supports the `KEY_IGTK`
> value, then this revert is appropriate. Otherwise, it likely isn't.
> My suspicion is that the feature that `KEY_IGTK` is enabling is
> specific to a newer firmware that Mauro hasn't upgraded to. What the
> OP should do is find the updated firmware and give it a try.
>
> AND - since there's some firmware the feature doesn't work with, the
> driver should be fixed to detect the running firmware version and not
> do things that the firmware doesn't support. AND the firmware writer
> should also make it so the firmware doesn't barf on bad input and
> instead rejects it politely.
>
> But I will say I'm making an educated guess; while I have played with
> the TI devices in the past, it was years ago and I won't claim to be
> an expert. I also am unable to fix it myself at this time.
>
> I'd just rather see it fixed properly instead of a knee-jerk reaction
> of reverting it simply because the OP doesn't have current firmware.

Yeah, a proper fix for this is of course better, but if there's no fix, say
within the next week or so, let's revert this. A new version of the patch
implementing IGTK, with proper feature detection, can always be added later.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
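On the feature-detection point, a rough sketch in plain C of the kind of
gating being suggested -- the helper and the version cutoff are invented for
illustration and are not the wlcore API; the real cutoff would have to come
from TI or the firmware changelog:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative only: decide whether to program an IGTK key based on the
 * firmware version string the chip reported at boot (e.g. "8.9.0.0.79").
 */
static bool fw_at_least(const char *fw_str, const int min[5])
{
	int v[5] = { 0 };

	if (sscanf(fw_str, "%d.%d.%d.%d.%d", &v[0], &v[1], &v[2], &v[3], &v[4]) != 5)
		return false;	/* unparseable: treat as unsupported */

	for (int i = 0; i < 5; i++)
		if (v[i] != min[i])
			return v[i] > min[i];
	return true;
}

int main(void)
{
	const int igtk_min[5] = { 8, 9, 0, 0, 81 };	/* hypothetical cutoff */

	printf("8.9.0.0.79 -> %d\n", fw_at_least("8.9.0.0.79", igtk_min));
	printf("8.9.0.0.81 -> %d\n", fw_at_least("8.9.0.0.81", igtk_min));
	return 0;
}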
Re: [PATCH] Revert "wlcore: Adding suppoprt for IGTK key in wlcore driver"
On Thu, 27 Aug 2020 13:36:28 -0700, Steve deRosier wrote:
> > > And let's revisit the discussion of having a kernel splat because an
> > > unrelated piece of code fails yet the driver does exactly what it is
> > > supposed to do. We shouldn't be dumping registers and stack-trace when
> > > the code that crashed has nothing to do with the registers and
> > > stack-trace outputted. It is a false positive. A simple printk WARN
> > > or ERROR should output notifying us that the chip firmware has crashed
> > > and why. IMHO.

Yeah, that WARN_ON() is disturbing. Sometimes, it prints it here out of the
blue at the first time it tries to use WiFi:

[4.502250] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[4.542376] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[4.678228] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[4.719082] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[4.830243] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[4.870524] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[5.088650] wlcore: wl18xx HW: 183x or 180x, PG 2.2 (ROM 0x11)
[5.095260] wlcore: WARNING Detected unconfigured mac address in nvs, derive from fuse instead.
[5.104030] wlcore: WARNING This default nvs file can be removed from the file system
[5.114699] wlcore: loaded
[5.270777] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[5.310835] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[5.414725] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[5.454684] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[5.751078] wlcore: PHY firmware version: Rev 8.2.0.0.243
[5.799065] wlcore: firmware booted (Rev 8.9.0.0.81)
[5.804035] wlcore: ERROR Couldn't parse firmware version string
[5.821800] wlcore: down
[5.946770] mmc_host mmc0: Bus speed (slot 0) = 40Hz (slot req 40Hz, actual 40HZ div = 0)
[5.986777] mmc_host mmc0: Bus speed (slot 0) = 2500Hz (slot req 2500Hz, actual 2500HZ div = 0)
[6.878108] [ cut here ]
[6.882741] WARNING: CPU: 3 PID: 297 at drivers/net/wireless/ti/wlcore/sdio.c:78 wl12xx_sdio_raw_read+0x11c/0x1c0 [wlcore_sdio]
[6.894220] Modules linked in: wl18xx wlcore mac80211 libarc4 cfg80211 rfkill snd_soc_hdmi_codec wlcore_sdio adv7511 cec kirin9xx_drm(C) crct10dif_ce kirin9xx_dw_drm_dsi(C) drm_kms_helper drm ip_tables x_tables ipv6 nf_defrag_ipv6
[6.914682] CPU: 3 PID: 297 Comm: NetworkManager Tainted: G C 5.8.0+ #197
[6.922771] Hardware name: HiKey970 (DT)
[6.926693] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
[6.932263] pc : wl12xx_sdio_raw_read+0x11c/0x1c0 [wlcore_sdio]
[6.938181] lr : wl12xx_sdio_raw_read+0x8c/0x1c0 [wlcore_sdio]
[6.944009] sp : 800012793140
[6.947318] x29: 800012793140 x28:
[6.952626] x27: 0001a79ba258 x26: 0001a79b9e80
[6.957935] x25: x24: 0001ac5db100
[6.963243] x23: 0004 x22: 0001b19b1810
[6.968552] x21: 0001b17a x20: 00013738
[6.973859] x19: 0001b6988800 x18: 0002e3eebd18
[6.979168] x17: 0001 x16: 0001
[6.984476] x15: 0008291e1896a232 x14: 0008290c7b540ede
[6.989784] x13: 03c4 x12: fa83b2da
[6.995093] x11: 03c4 x10: 09c0
[7.000401] x9 : 800012792d20 x8 : 0001b17a0a20
[7.005708] x7 : 0001 x6 : 01fb396f
[7.011017] x5 : 00ff x4 :
[7.016325] x3 : 0001b760e104 x2 :
[7.021633] x1 : 0001b17a x0 : ff92
[7.026943] Call trace:
[7.029386]  wl12xx_sdio_raw_read+0x11c/0x1c0 [wlcore_sdio]
[7.034978]  wl18xx_boot+0x414/0x890 [wl18xx]
[7.039373]  wl1271_op_add_interface+0x784/0xa60 [wlcore]
[7.044843]  drv_add_interface+0x38/0x84 [mac80211]
[7.049750]  ieee80211_do_open+0x59c/0x8cc [mac80211]
[7.054829]  ieee80211_open+0x48/0x70 [mac80211]
[7.059453]  __dev_open+0xe4/0x190
[7.062852]  __dev_change_flags+0x180/0x1f0
[7.067031]  dev_change_flags+0x24/0x64
[7.070866]  do_setlink+0x20c/0xc40
[7.074349]  __rtnl_newlink+0x500/0x820
[7.078180]  rtnl_newlink+0x4c/0x80
[7.081664]  rtnetlink_rcv_msg+0x11c/0x340
[7.085759]  netlink_rcv_skb+0x58/0x11c
[7.089591]  rtnetlink_rcv+0x18/0x2c
[7.093163]  netlink_unicast+0x25c/0x320
[7.097080]  netlink_sendmsg+0x190/0x3a0
[7.101003
Re: packet deadline and process scheduling
On 8/27/20 11:45 PM, S.V.R.Anand wrote:
> Hi,
>
> In the control loop application I am trying to build, an incoming message from
> the network will have a deadline before which it should be delivered to the
> receiver process. This essentially calls for a way of scheduling this process
> based on the deadline information contained in the message.
>
> If not already available, I wish to write code for such run-time ordering of
> processes in the earliest deadline first fashion. The assumption, however
> futuristic it may be, is that deadline information is contained as part of the
> packet header, something like an inband-OAM.
>
> Your feedback on the above will be very helpful.
>
> Hope the above objective will be of general interest to netdev as well.
>
> My apologies if this is not the appropriate mailing list for posting this kind
> of mail.
>
> Anand
>

Is this described in some RFC? If not, I guess you might have to code this in
user space.
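If this does get prototyped in user space first, a toy sketch of the
earliest-deadline-first ordering described above -- the packet carrying an
absolute deadline is the poster's stated assumption, and the struct layout
here is invented for illustration:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct job {
	uint64_t deadline_ns;	/* absolute deadline parsed from the packet */
	char payload[64];
};

/* Tiny binary min-heap ordered by deadline (earliest first). */
static struct job heap[1024];
static size_t heap_len;

static void heap_push(const struct job *j)
{
	size_t i = heap_len++;

	while (i && heap[(i - 1) / 2].deadline_ns > j->deadline_ns) {
		heap[i] = heap[(i - 1) / 2];
		i = (i - 1) / 2;
	}
	heap[i] = *j;
}

static struct job heap_pop(void)
{
	struct job top = heap[0], last = heap[--heap_len];
	size_t i = 0;

	for (;;) {
		size_t c = 2 * i + 1;

		if (c >= heap_len)
			break;
		if (c + 1 < heap_len && heap[c + 1].deadline_ns < heap[c].deadline_ns)
			c++;
		if (last.deadline_ns <= heap[c].deadline_ns)
			break;
		heap[i] = heap[c];
		i = c;
	}
	heap[i] = last;
	return top;
}

int main(void)
{
	/* In a real receiver these would come from recvmsg() on the socket. */
	struct job a = { 3000, "late" }, b = { 1000, "urgent" }, c = { 2000, "soon" };

	heap_push(&a);
	heap_push(&b);
	heap_push(&c);
	while (heap_len) {
		struct job j = heap_pop();

		printf("dispatch %-6s (deadline %llu ns)\n", j.payload,
		       (unsigned long long)j.deadline_ns);
	}
	return 0;
}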
[PATCH bpf-next v5 00/15] xsk: support shared umems between devices and queues
This patch set adds support to share a umem between AF_XDP sockets bound to
different queue ids on the same device or even between devices. It has already
been possible to do this by registering the umem multiple times, but this
wastes a lot of memory. Just imagine having 10 threads each having 10 sockets
open sharing a single umem. This means that you would have to register the
umem 100 times, consuming large quantities of memory. Instead, we extend the
existing XDP_SHARED_UMEM flag to also work when sharing a umem between
different queue ids as well as devices.

If you would like to share a umem between two sockets, just create the first
one as you would do normally. For the second socket, do not register the same
umem with the XDP_UMEM_REG setsockopt. Instead, attach one new fill ring and
one new completion ring to this second socket and then use the XDP_SHARED_UMEM
bind flag, supplying the file descriptor of the first socket in the
sxdp_shared_umem_fd field to signify that it is the umem of the first socket
you would like to share.

One important thing to note in this example is that there needs to be one fill
ring and one completion ring per unique device and queue id bound to. This so
that the single-producer and single-consumer semantics of the rings can be
upheld. To recap, if you bind multiple sockets to the same device and queue id
(already supported without this patch set), you only need one pair of fill and
completion rings. If you bind multiple sockets to multiple different queues or
devices, you need one fill and completion ring pair per unique device,queue_id
tuple.

The implementation is based around extending the buffer pool in the core xsk
code. This is a structure that exists on a per unique device and queue id
basis. So, a number of entities that can now be shared are moved from the umem
to the buffer pool. Information about the DMA mappings is also moved out of
the buffer pool, but as these are per device and independent of the queue id,
they now hang off the umem in a list. However, the pool is set up to point
directly to the dma_addr_t array that it needs.

In summary, after this patch set there is one xdp_sock struct per socket
created. This points to an xsk_buff_pool, for which there is one per unique
device and queue id. The buffer pool points to a DMA mapping structure, for
which there is one per device that a umem has been bound to. And finally, the
buffer pool also points to an xdp_umem struct, for which there is only one per
umem registration.

Before: XSK -> UMEM -> POOL

Now:    XSK -> POOL -> DMA
                    \
                     > UMEM

Patches 1-8 only rearrange internal structures to support the buffer pool
carrying this new information, while patches 9 and 10 improve performance.
Finally, patches 11-15 introduce the new functionality together with libbpf
support, samples, and documentation.

Libbpf has also been extended to support sharing of umems between sockets
bound to different devices and queue ids by introducing a new function called
xsk_socket__create_shared(). The difference between this and the existing
xsk_socket__create() is that the former takes a reference to a fill ring and a
completion ring as these need to be created. This new function needs to be
used for the second and following sockets that bind to the same umem. The
first socket can be created by either function as it will also have called
xsk_umem__create(). There is also a new sample xsk_fwd that demonstrates this
new interface and capability.
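To make the flow above concrete, a condensed sketch of the second-socket setup
using the raw socket API (error handling, the RX/TX rings and the ring mmaps
are omitted; the ring size is arbitrary, and the libbpf helper introduced
later in the series wraps these same steps):

#include <linux/if_xdp.h>
#include <sys/socket.h>

#ifndef AF_XDP			/* provided by recent glibc/kernel headers */
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

static int attach_shared(int fd1, unsigned int ifindex, unsigned int queue_id)
{
	struct sockaddr_xdp sxdp = { 0 };
	int ring_sz = 2048;
	int fd2 = socket(AF_XDP, SOCK_RAW, 0);

	if (fd2 < 0)
		return -1;

	/* No XDP_UMEM_REG here: the umem itself is reused from fd1. But
	 * every unique (netdev, queue_id) needs its own fill/completion
	 * pair, so create a fresh one on this socket.
	 */
	setsockopt(fd2, SOL_XDP, XDP_UMEM_FILL_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd2, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_sz, sizeof(ring_sz));

	/* RX and/or TX rings would be set up here as usual (omitted), then
	 * bind with XDP_SHARED_UMEM pointing back at the socket that owns
	 * the umem registration.
	 */
	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = ifindex;
	sxdp.sxdp_queue_id = queue_id;
	sxdp.sxdp_flags = XDP_SHARED_UMEM;
	sxdp.sxdp_shared_umem_fd = fd1;

	if (bind(fd2, (struct sockaddr *)&sxdp, sizeof(sxdp)))
		return -1;

	return fd2;
}

The first socket keeps the XDP_UMEM_REG registration; every further unique
netdev/queue_id combination just brings its own fill and completion rings and
points sxdp_shared_umem_fd back at that first socket.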
Performance for the non-shared umem case is up 3% for the l2fwd xdpsock sample
application with this patch set. For workloads that share a umem, this patch
set can give rise to added performance benefits due to the decrease in memory
usage.

v4 -> v5:
* Fixed performance problem with sharing a umem between different queues on
  the same netdev. Sharing the dma_pages array between buffer pool instances
  was a bad idea. It led to many cross-core snoop traffic messages that
  degraded performance. The solution: only map the dma mappings once as
  before, but copy the dma_addr_t to a per buffer pool array so that this
  sharing disappears.
* Added patch 10 that improves performance by 3% for l2fwd with a simple fix
  that is now possible, as we pass the buffer pool to the driver.
* xp_dma_unmap() did not honor the refcount. Fixed. [Maxim]
* Fixed bisectability problem in patch 5 [Maxim]

v3 -> v4:
* Fixed compilation error when CONFIG_XDP_SOCKETS_DIAG is set [lkp robot]

v2 -> v3:
* Clean up of fq_tmp and cq_tmp in xsk_release [Maxim]
* Fixed bug when bind failed that caused pool to be freed twice [Ciara]

v1 -> v2:
* Tx need_wakeup init bug fixed. Missed to set the cached_need_wakeup flag
  for Tx.
* Need wakeup turned on for xsk_fwd sample [Cristian]
* Commit messages cleaned up
* Moved dma mapping list from netdev to umem [Maxim]
* Now the buffer pool is only created once. Fill ring and completion ring
  pointers are stored in the socket during initialization (befor
[PATCH bpf-next v5 01/15] xsk: i40e: ice: ixgbe: mlx5: pass buffer pool to driver instead of umem
Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only an umem reference has been added to the buffer pool struct. But later commits will add other entities to it. These are going to be entities that are different between different queue ids and netdevs even though the umem is shared between them. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_main.c| 29 +-- drivers/net/ethernet/intel/i40e/i40e_txrx.c| 10 +- drivers/net/ethernet/intel/i40e/i40e_txrx.h| 2 +- drivers/net/ethernet/intel/i40e/i40e_xsk.c | 81 drivers/net/ethernet/intel/i40e/i40e_xsk.h | 4 +- drivers/net/ethernet/intel/ice/ice.h | 18 +- drivers/net/ethernet/intel/ice/ice_base.c | 16 +- drivers/net/ethernet/intel/ice/ice_lib.c | 2 +- drivers/net/ethernet/intel/ice/ice_main.c | 10 +- drivers/net/ethernet/intel/ice/ice_txrx.c | 8 +- drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +- drivers/net/ethernet/intel/ice/ice_xsk.c | 136 ++--- drivers/net/ethernet/intel/ice/ice_xsk.h | 7 +- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 34 ++-- .../net/ethernet/intel/ixgbe/ixgbe_txrx_common.h | 7 +- drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 61 +++--- drivers/net/ethernet/mellanox/mlx5/core/Makefile | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en.h | 19 +- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 5 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/pool.c | 217 + .../net/ethernet/mellanox/mlx5/core/en/xsk/pool.h | 27 +++ .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h| 10 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c | 12 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.h | 2 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c| 14 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.h| 6 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c | 217 - .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h | 29 --- .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 2 +- .../ethernet/mellanox/mlx5/core/en_fs_ethtool.c| 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 49 ++--- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 16 +- include/linux/netdevice.h | 10 +- include/net/xdp_sock_drv.h | 7 +- include/net/xsk_buff_pool.h| 4 +- net/ethtool/channels.c | 2 +- net/ethtool/ioctl.c| 2 +- net/xdp/xdp_umem.c | 45 ++--- net/xdp/xsk_buff_pool.c| 5 +- 41 files changed, 575 insertions(+), 560 deletions(-) create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.h delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 825c104..dc15771 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -1967,7 +1967,7 @@ static int i40e_set_ringparam(struct net_device *netdev, (new_rx_count == vsi->rx_rings[0]->count)) return 0; - /* If there is a AF_XDP UMEM attached to any of Rx rings, + /* If there is a AF_XDP page pool attached to any of Rx rings, * disallow changing the number of descriptors -- regardless * if the 
netdev is running or not. */ diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 2e433fd..cbf2a44 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3122,12 +3122,12 @@ static void i40e_config_xps_tx_ring(struct i40e_ring *ring) } /** - * i40e_xsk_umem - Retrieve the AF_XDP ZC if XDP and ZC is enabled + * i40e_xsk_pool - Retrieve the AF_XDP buffer pool if XDP and ZC is enabled * @ring: The Tx or Rx ring * - * Returns the UMEM or NULL. + * Returns the AF_XDP buffer pool or NULL. **/ -static struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring) +static struct xsk_buff_pool *i40e_xsk_pool(struct i40e_ring *ring) { bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi); int qid = ring->queue_i
[PATCH bpf-next v5 03/15] xsk: create and free buffer pool independently from umem
Create and free the buffer pool independently from the umem. Move these operations that are performed on the buffer pool from the umem create and destroy functions to new create and destroy functions just for the buffer pool. This so that in later commits we can instantiate multiple buffer pools per umem when sharing a umem between HW queues and/or devices. We also erradicate the back pointer from the umem to the buffer pool as this will not work when we introduce the possibility to have multiple buffer pools per umem. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 3 +- include/net/xsk_buff_pool.h | 13 +++- net/xdp/xdp_umem.c | 164 net/xdp/xdp_umem.h | 4 +- net/xdp/xsk.c | 74 +--- net/xdp/xsk.h | 3 + net/xdp/xsk_buff_pool.c | 150 net/xdp/xsk_queue.h | 12 ++-- 8 files changed, 236 insertions(+), 187 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index ccf6cb5..ea2b020 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -20,13 +20,12 @@ struct xdp_buff; struct xdp_umem { struct xsk_queue *fq; struct xsk_queue *cq; - struct xsk_buff_pool *pool; u64 size; u32 headroom; u32 chunk_size; + u32 chunks; struct user_struct *user; refcount_t users; - struct work_struct work; struct page **pgs; u32 npgs; u16 queue_id; diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index f851b0a..4025486 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -14,6 +14,7 @@ struct xdp_rxq_info; struct xsk_queue; struct xdp_desc; struct xdp_umem; +struct xdp_sock; struct device; struct page; @@ -46,16 +47,22 @@ struct xsk_buff_pool { struct xdp_umem *umem; void *addrs; struct device *dev; + refcount_t users; + struct work_struct work; struct xdp_buff_xsk *free_heads[]; }; /* AF_XDP core. */ -struct xsk_buff_pool *xp_create(struct xdp_umem *umem, u32 chunks, - u32 chunk_size, u32 headroom, u64 size, - bool unaligned); +struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, + struct xdp_umem *umem); +int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, + u16 queue_id, u16 flags); void xp_set_fq(struct xsk_buff_pool *pool, struct xsk_queue *fq); void xp_destroy(struct xsk_buff_pool *pool); void xp_release(struct xdp_buff_xsk *xskb); +void xp_get_pool(struct xsk_buff_pool *pool); +void xp_put_pool(struct xsk_buff_pool *pool); +void xp_clear_dev(struct xsk_buff_pool *pool); /* AF_XDP, and XDP core. */ void xp_free(struct xdp_buff_xsk *xskb); diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index adde4d5..f290345 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -47,160 +47,41 @@ void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) spin_unlock_irqrestore(&umem->xsk_tx_list_lock, flags); } -/* The umem is stored both in the _rx struct and the _tx struct as we do - * not know if the device has more tx queues than rx, or the opposite. - * This might also change during run time. 
- */ -static int xsk_reg_pool_at_qid(struct net_device *dev, - struct xsk_buff_pool *pool, - u16 queue_id) -{ - if (queue_id >= max_t(unsigned int, - dev->real_num_rx_queues, - dev->real_num_tx_queues)) - return -EINVAL; - - if (queue_id < dev->real_num_rx_queues) - dev->_rx[queue_id].pool = pool; - if (queue_id < dev->real_num_tx_queues) - dev->_tx[queue_id].pool = pool; - - return 0; -} - -struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev, - u16 queue_id) +static void xdp_umem_unpin_pages(struct xdp_umem *umem) { - if (queue_id < dev->real_num_rx_queues) - return dev->_rx[queue_id].pool; - if (queue_id < dev->real_num_tx_queues) - return dev->_tx[queue_id].pool; + unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true); - return NULL; + kfree(umem->pgs); + umem->pgs = NULL; } -EXPORT_SYMBOL(xsk_get_pool_from_qid); -static void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id) +static void xdp_umem_unaccount_pages(struct xdp_umem *umem) { - if (queue_id < dev->real_num_rx_queues) - dev->_rx[queue_id].pool = NULL; - if (queue_id < dev->real_num_tx_queues) - dev->_tx[queue_id].pool = NULL; + if (umem->user) { + atomic_long_sub(umem->npgs, &umem->use
[PATCH bpf-next v5 11/15] xsk: add shared umem support between queue ids
Add support to share a umem between queue ids on the same device. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same queue id and device, and you shared one set of fill and completion rings. However, note that when sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket before you do the bind with the XDP_SHARED_UMEM flag. This so that the single-producer single-consumer semantics can be upheld. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xsk_buff_pool.h | 2 ++ net/xdp/xsk.c | 44 ++-- net/xdp/xsk_buff_pool.c | 26 -- 3 files changed, 56 insertions(+), 16 deletions(-) diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 907537d..0140d08 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -81,6 +81,8 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, struct xdp_umem *umem); int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, u16 queue_id, u16 flags); +int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem, +struct net_device *dev, u16 queue_id); void xp_destroy(struct xsk_buff_pool *pool); void xp_release(struct xdp_buff_xsk *xskb); void xp_get_pool(struct xsk_buff_pool *pool); diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 067e854..ea8d2ec 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -689,12 +689,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) goto out_unlock; } - if (xs->fq_tmp || xs->cq_tmp) { - /* Do not allow setting your own fq or cq. */ - err = -EINVAL; - goto out_unlock; - } - sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd); if (IS_ERR(sock)) { err = PTR_ERR(sock); @@ -707,15 +701,41 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) sockfd_put(sock); goto out_unlock; } - if (umem_xs->dev != dev || umem_xs->queue_id != qid) { + if (umem_xs->dev != dev) { err = -EINVAL; sockfd_put(sock); goto out_unlock; } - /* Share the buffer pool with the other socket. */ - xp_get_pool(umem_xs->pool); - xs->pool = umem_xs->pool; + if (umem_xs->queue_id != qid) { + /* Share the umem with another socket on another qid */ + xs->pool = xp_create_and_assign_umem(xs, +umem_xs->umem); + if (!xs->pool) { + sockfd_put(sock); + goto out_unlock; + } + + err = xp_assign_dev_shared(xs->pool, umem_xs->umem, + dev, qid); + if (err) { + xp_destroy(xs->pool); + sockfd_put(sock); + goto out_unlock; + } + } else { + /* Share the buffer pool with the other socket. */ + if (xs->fq_tmp || xs->cq_tmp) { + /* Do not allow setting your own fq or cq. */ + err = -EINVAL; + sockfd_put(sock); + goto out_unlock; + } + + xp_get_pool(umem_xs->pool); + xs->pool = umem_xs->pool; + } + xdp_get_umem(umem_xs->umem); WRITE_ONCE(xs->umem, umem_xs->umem); sockfd_put(sock); @@ -847,10 +867,6 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname, mutex_unlock(&xs->mutex); return -EBUSY; } - if (!xs->umem) { - mutex_unlock(&xs->mutex); - return -EINVAL; - } q = (optname == XDP_UMEM_FILL_RING) ? &xs->fq_tmp : &xs->cq_tmp; diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index 547eb41..795d7c8 100644 --- a/net/xdp/xsk_buff_pool.c +++ b/net/xdp/xsk_buff_pool.c @@ -123,8 +123,8 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool) } } -int xp_assign_dev(struct xsk_buff_pool *pool, struct net_de
[PATCH bpf-next v5 08/15] xsk: enable sharing of dma mappings
Enable the sharing of dma mappings by moving them out from the buffer pool. Instead we put each dma mapped umem region in a list in the umem structure. If dma has already been mapped for this umem and device, it is not mapped again and the existing dma mappings are reused. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 1 + include/net/xsk_buff_pool.h | 13 net/xdp/xdp_umem.c | 1 + net/xdp/xsk_buff_pool.c | 183 ++-- 4 files changed, 156 insertions(+), 42 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 126d243..282aeba 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -30,6 +30,7 @@ struct xdp_umem { u8 flags; int id; bool zc; + struct list_head xsk_dma_list; }; struct xsk_map { diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 83f100c..356d0ac 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -28,10 +28,23 @@ struct xdp_buff_xsk { struct list_head free_list_node; }; +struct xsk_dma_map { + dma_addr_t *dma_pages; + struct device *dev; + struct net_device *netdev; + refcount_t users; + struct list_head list; /* Protected by the RTNL_LOCK */ + u32 dma_pages_cnt; + bool dma_need_sync; +}; + struct xsk_buff_pool { struct xsk_queue *fq; struct xsk_queue *cq; struct list_head free_list; + /* For performance reasons, each buff pool has its own array of dma_pages +* even when they are identical. +*/ dma_addr_t *dma_pages; struct xdp_buff_xsk *heads; u64 chunk_mask; diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 77604c3..a7227b4 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -198,6 +198,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) umem->user = NULL; umem->flags = mr->flags; + INIT_LIST_HEAD(&umem->xsk_dma_list); refcount_set(&umem->users, 1); err = xdp_umem_account_pages(umem); diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index c563874..547eb41 100644 --- a/net/xdp/xsk_buff_pool.c +++ b/net/xdp/xsk_buff_pool.c @@ -104,6 +104,25 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) } EXPORT_SYMBOL(xp_set_rxq_info); +static void xp_disable_drv_zc(struct xsk_buff_pool *pool) +{ + struct netdev_bpf bpf; + int err; + + ASSERT_RTNL(); + + if (pool->umem->zc) { + bpf.command = XDP_SETUP_XSK_POOL; + bpf.xsk.pool = NULL; + bpf.xsk.queue_id = pool->queue_id; + + err = pool->netdev->netdev_ops->ndo_bpf(pool->netdev, &bpf); + + if (err) + WARN(1, "Failed to disable zero-copy!\n"); + } +} + int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *netdev, u16 queue_id, u16 flags) { @@ -122,6 +141,8 @@ int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *netdev, if (xsk_get_pool_from_qid(netdev, queue_id)) return -EBUSY; + pool->netdev = netdev; + pool->queue_id = queue_id; err = xsk_reg_pool_at_qid(netdev, pool, queue_id); if (err) return err; @@ -155,11 +176,15 @@ int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *netdev, if (err) goto err_unreg_pool; - pool->netdev = netdev; - pool->queue_id = queue_id; + if (!pool->dma_pages) { + WARN(1, "Driver did not DMA map zero-copy buffers"); + goto err_unreg_xsk; + } pool->umem->zc = true; return 0; +err_unreg_xsk: + xp_disable_drv_zc(pool); err_unreg_pool: if (!force_zc) err = 0; /* fallback to copy mode */ @@ -170,25 +195,10 @@ int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *netdev, void xp_clear_dev(struct xsk_buff_pool *pool) { - struct netdev_bpf bpf; - int err; - - ASSERT_RTNL(); - 
if (!pool->netdev) return; - if (pool->umem->zc) { - bpf.command = XDP_SETUP_XSK_POOL; - bpf.xsk.pool = NULL; - bpf.xsk.queue_id = pool->queue_id; - - err = pool->netdev->netdev_ops->ndo_bpf(pool->netdev, &bpf); - - if (err) - WARN(1, "Failed to disable zero-copy!\n"); - } - + xp_disable_drv_zc(pool); xsk_clear_pool_at_qid(pool->netdev, pool->queue_id); dev_put(pool->netdev); pool->netdev = NULL; @@ -233,70 +243,159 @@ void xp_put_pool(struct xsk_buff_pool *pool) } } -void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) +static struct xsk_dma_map *xp_find_dma_map(struct xsk_buff_pool *pool) +{ + st
[PATCH bpf-next v5 04/15] xsk: move fill and completion rings to buffer pool
Move the fill and completion rings from the umem to the buffer pool. This so that we in a later commit can share the umem between multiple HW queue ids. In this case, we need one fill and completion ring per queue id. As the buffer pool is per queue id and napi id this is a natural place for it and one umem struture can be shared between these buffer pools. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 4 ++-- include/net/xsk_buff_pool.h | 2 +- net/xdp/xdp_umem.c | 15 -- net/xdp/xsk.c | 48 + net/xdp/xsk_buff_pool.c | 20 ++- net/xdp/xsk_diag.c | 12 +++- 6 files changed, 52 insertions(+), 49 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index ea2b020..2a284e1 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -18,8 +18,6 @@ struct xsk_queue; struct xdp_buff; struct xdp_umem { - struct xsk_queue *fq; - struct xsk_queue *cq; u64 size; u32 headroom; u32 chunk_size; @@ -77,6 +75,8 @@ struct xdp_sock { struct list_head map_list; /* Protects map_list */ spinlock_t map_list_lock; + struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ + struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */ }; #ifdef CONFIG_XDP_SOCKETS diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 4025486..380d9ae 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -30,6 +30,7 @@ struct xdp_buff_xsk { struct xsk_buff_pool { struct xsk_queue *fq; + struct xsk_queue *cq; struct list_head free_list; dma_addr_t *dma_pages; struct xdp_buff_xsk *heads; @@ -57,7 +58,6 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, struct xdp_umem *umem); int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, u16 queue_id, u16 flags); -void xp_set_fq(struct xsk_buff_pool *pool, struct xsk_queue *fq); void xp_destroy(struct xsk_buff_pool *pool); void xp_release(struct xdp_buff_xsk *xskb); void xp_get_pool(struct xsk_buff_pool *pool); diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index f290345..7d86a63 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -85,16 +85,6 @@ static void xdp_umem_release(struct xdp_umem *umem) ida_simple_remove(&umem_ida, umem->id); - if (umem->fq) { - xskq_destroy(umem->fq); - umem->fq = NULL; - } - - if (umem->cq) { - xskq_destroy(umem->cq); - umem->cq = NULL; - } - xdp_umem_unpin_pages(umem); xdp_umem_unaccount_pages(umem); @@ -278,8 +268,3 @@ struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr) return umem; } - -bool xdp_umem_validate_queues(struct xdp_umem *umem) -{ - return umem->fq && umem->cq; -} diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 5739f19..dacd340 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -36,7 +36,7 @@ static DEFINE_PER_CPU(struct list_head, xskmap_flush_list); bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) { return READ_ONCE(xs->rx) && READ_ONCE(xs->umem) && - READ_ONCE(xs->umem->fq); + (xs->pool->fq || READ_ONCE(xs->fq_tmp)); } void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) @@ -46,7 +46,7 @@ void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) if (umem->need_wakeup & XDP_WAKEUP_RX) return; - umem->fq->ring->flags |= XDP_RING_NEED_WAKEUP; + pool->fq->ring->flags |= XDP_RING_NEED_WAKEUP; umem->need_wakeup |= XDP_WAKEUP_RX; } EXPORT_SYMBOL(xsk_set_rx_need_wakeup); @@ -76,7 +76,7 @@ void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool) if (!(umem->need_wakeup & XDP_WAKEUP_RX)) return; - umem->fq->ring->flags &= ~XDP_RING_NEED_WAKEUP; + pool->fq->ring->flags &= 
~XDP_RING_NEED_WAKEUP; umem->need_wakeup &= ~XDP_WAKEUP_RX; } EXPORT_SYMBOL(xsk_clear_rx_need_wakeup); @@ -254,7 +254,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, static void xsk_flush(struct xdp_sock *xs) { xskq_prod_submit(xs->rx); - __xskq_cons_release(xs->umem->fq); + __xskq_cons_release(xs->pool->fq); sock_def_readable(&xs->sk); } @@ -297,7 +297,7 @@ void __xsk_map_flush(void) void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries) { - xskq_prod_submit_n(pool->umem->cq, nb_entries); + xskq_prod_submit_n(pool->cq, nb_entries); } EXPORT_SYMBOL(xsk_tx_completed); @@ -331,7 +331,7 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc) * if there is space in it. This avoids having to implement * any buffering in
[PATCH bpf-next v5 10/15] xsk: i40e: ice: ixgbe: mlx5: test for dma_need_sync earlier for better performance
Test for dma_need_sync earlier to increase performance. xsk_buff_dma_sync_for_cpu() takes an xdp_buff as parameter and from that the xsk_buff_pool reference is dug out. Perf shows that this dereference causes a lot of cache misses. But as the buffer pool is now sent down to the driver at zero-copy initialization time, we might as well use this pointer directly, instead of going via the xsk_buff and we can do so already in xsk_buff_dma_sync_for_cpu() instead of in xp_dma_sync_for_cpu. This gets rid of these cache misses. Throughput increases with 3% for the xdpsock l2fwd sample application on my machine. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- drivers/net/ethernet/intel/i40e/i40e_xsk.c | 2 +- drivers/net/ethernet/intel/ice/ice_xsk.c| 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c| 2 +- drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 4 ++-- include/net/xdp_sock_drv.h | 7 +-- include/net/xsk_buff_pool.h | 3 --- 6 files changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c index 95b9a7e..2a1153d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c @@ -314,7 +314,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget) bi = i40e_rx_bi(rx_ring, rx_ring->next_to_clean); (*bi)->data_end = (*bi)->data + size; - xsk_buff_dma_sync_for_cpu(*bi); + xsk_buff_dma_sync_for_cpu(*bi, rx_ring->xsk_pool); xdp_res = i40e_run_xdp_zc(rx_ring, *bi); if (xdp_res) { diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c index dffef37..7978865 100644 --- a/drivers/net/ethernet/intel/ice/ice_xsk.c +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c @@ -595,7 +595,7 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget) rx_buf = &rx_ring->rx_buf[rx_ring->next_to_clean]; rx_buf->xdp->data_end = rx_buf->xdp->data + size; - xsk_buff_dma_sync_for_cpu(rx_buf->xdp); + xsk_buff_dma_sync_for_cpu(rx_buf->xdp, rx_ring->xsk_pool); xdp_res = ice_run_xdp_zc(rx_ring, rx_buf->xdp); if (xdp_res) { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c index 6af34da..3771857 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c @@ -287,7 +287,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector, } bi->xdp->data_end = bi->xdp->data + size; - xsk_buff_dma_sync_for_cpu(bi->xdp); + xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool); xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp); if (xdp_res) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c index a33a1f7..902ce77 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c @@ -48,7 +48,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, xdp->data_end = xdp->data + cqe_bcnt32; xdp_set_data_meta_invalid(xdp); - xsk_buff_dma_sync_for_cpu(xdp); + xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool); prefetch(xdp->data); rcu_read_lock(); @@ -99,7 +99,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq, xdp->data_end = xdp->data + cqe_bcnt; xdp_set_data_meta_invalid(xdp); - xsk_buff_dma_sync_for_cpu(xdp); + xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool); prefetch(xdp->data); if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) { diff --git a/include/net/xdp_sock_drv.h 
b/include/net/xdp_sock_drv.h index a7c7d2e..5b1ee8a 100644 --- a/include/net/xdp_sock_drv.h +++ b/include/net/xdp_sock_drv.h @@ -99,10 +99,13 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) return xp_raw_get_data(pool, addr); } -static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) +static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); + if (!pool->dma_need_sync) + return; + xp_dma_sync_for_cpu(xskb); } @@ -222,7 +225,7 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) return NULL; } -static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) +static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool) { } diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_
[PATCH bpf-next v5 05/15] xsk: move queue_id, dev and need_wakeup to buffer pool
Move queue_id, dev, and need_wakeup from the umem to the buffer pool. This so that we in a later commit can share the umem between multiple HW queues. There is one buffer pool per dev and queue id, so these variables should belong to the buffer pool, not the umem. Need_wakeup is also something that is set on a per napi level, so there is usually one per device and queue id. So move this to the buffer pool too. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 3 --- include/net/xsk_buff_pool.h | 4 net/xdp/xdp_umem.c | 22 ++ net/xdp/xdp_umem.h | 4 net/xdp/xsk.c | 34 +- net/xdp/xsk.h | 7 --- net/xdp/xsk_buff_pool.c | 39 ++- net/xdp/xsk_diag.c | 4 ++-- 8 files changed, 43 insertions(+), 74 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 2a284e1..b052f1c 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -26,11 +26,8 @@ struct xdp_umem { refcount_t users; struct page **pgs; u32 npgs; - u16 queue_id; - u8 need_wakeup; u8 flags; int id; - struct net_device *dev; bool zc; spinlock_t xsk_tx_list_lock; struct list_head xsk_tx_list; diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 380d9ae..2d94890 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -43,11 +43,15 @@ struct xsk_buff_pool { u32 headroom; u32 chunk_size; u32 frame_len; + u16 queue_id; + u8 cached_need_wakeup; + bool uses_need_wakeup; bool dma_need_sync; bool unaligned; struct xdp_umem *umem; void *addrs; struct device *dev; + struct net_device *netdev; refcount_t users; struct work_struct work; struct xdp_buff_xsk *free_heads[]; diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 7d86a63..3e612fc 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -63,26 +63,9 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem) } } -void xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, -u16 queue_id) -{ - umem->dev = dev; - umem->queue_id = queue_id; - - dev_hold(dev); -} - -void xdp_umem_clear_dev(struct xdp_umem *umem) -{ - dev_put(umem->dev); - umem->dev = NULL; - umem->zc = false; -} - static void xdp_umem_release(struct xdp_umem *umem) { - xdp_umem_clear_dev(umem); - + umem->zc = false; ida_simple_remove(&umem_ida, umem->id); xdp_umem_unpin_pages(umem); @@ -181,8 +164,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) return -EINVAL; } - if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNK_FLAG | - XDP_UMEM_USES_NEED_WAKEUP)) + if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG) return -EINVAL; if (!unaligned_chunks && !is_power_of_2(chunk_size)) diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h index 93e96be..67bf3f3 100644 --- a/net/xdp/xdp_umem.h +++ b/net/xdp/xdp_umem.h @@ -8,10 +8,6 @@ #include -void xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, -u16 queue_id); -void xdp_umem_clear_dev(struct xdp_umem *umem); -bool xdp_umem_validate_queues(struct xdp_umem *umem); void xdp_get_umem(struct xdp_umem *umem); void xdp_put_umem(struct xdp_umem *umem); void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index dacd340..9f1b906e 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -41,13 +41,11 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) { - struct xdp_umem *umem = pool->umem; - - if (umem->need_wakeup & XDP_WAKEUP_RX) + if (pool->cached_need_wakeup & XDP_WAKEUP_RX) return; pool->fq->ring->flags |= XDP_RING_NEED_WAKEUP; - 
umem->need_wakeup |= XDP_WAKEUP_RX; + pool->cached_need_wakeup |= XDP_WAKEUP_RX; } EXPORT_SYMBOL(xsk_set_rx_need_wakeup); @@ -56,7 +54,7 @@ void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) struct xdp_umem *umem = pool->umem; struct xdp_sock *xs; - if (umem->need_wakeup & XDP_WAKEUP_TX) + if (pool->cached_need_wakeup & XDP_WAKEUP_TX) return; rcu_read_lock(); @@ -65,19 +63,17 @@ void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) } rcu_read_unlock(); - umem->need_wakeup |= XDP_WAKEUP_TX; + pool->cached_need_wakeup |= XDP_WAKEUP_TX; } EXPORT_SYMBOL(xsk_set_tx_need_wakeup); void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool) { - struct xdp_umem *umem = pool->umem
[PATCH bpf-next v5 02/15] xsk: i40e: ice: ixgbe: mlx5: rename xsk zero-copy driver interfaces
Rename the AF_XDP zero-copy driver interface functions to better reflect what they do after the replacement of umems with buffer pools in the previous commit. Mostly it is about replacing the umem name from the function names with xsk_buff and also have them take the a buffer pool pointer instead of a umem. The various ring functions have also been renamed in the process so that they have the same naming convention as the internal functions in xsk_queue.h. This so that it will be clearer what they do and also for consistency. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- drivers/net/ethernet/intel/i40e/i40e_main.c| 6 +- drivers/net/ethernet/intel/i40e/i40e_xsk.c | 34 +++--- drivers/net/ethernet/intel/ice/ice_base.c | 6 +- drivers/net/ethernet/intel/ice/ice_xsk.c | 28 ++--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +- drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 32 +++--- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/pool.c | 12 +-- .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h| 8 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c| 10 +- .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.h| 6 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 4 +- include/net/xdp_sock.h | 1 + include/net/xdp_sock_drv.h | 114 +++-- net/ethtool/channels.c | 2 +- net/ethtool/ioctl.c| 2 +- net/xdp/xdp_umem.c | 24 ++--- net/xdp/xsk.c | 45 19 files changed, 179 insertions(+), 167 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index cbf2a44..05c6d3e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3138,7 +3138,7 @@ static struct xsk_buff_pool *i40e_xsk_pool(struct i40e_ring *ring) if (!xdp_on || !test_bit(qid, ring->vsi->af_xdp_zc_qps)) return NULL; - return xdp_get_xsk_pool_from_qid(ring->vsi->netdev, qid); + return xsk_get_pool_from_qid(ring->vsi->netdev, qid); } /** @@ -3286,7 +3286,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring) if (ret) return ret; ring->rx_buf_len = - xsk_umem_get_rx_frame_size(ring->xsk_pool->umem); + xsk_pool_get_rx_frame_size(ring->xsk_pool); /* For AF_XDP ZC, we disallow packets to span on * multiple buffers, thus letting us skip that * handling in the fast-path. 
@@ -3370,7 +3370,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring) writel(0, ring->tail); if (ring->xsk_pool) { - xsk_buff_set_rxq_info(ring->xsk_pool->umem, &ring->xdp_rxq); + xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq); ok = i40e_alloc_rx_buffers_zc(ring, I40E_DESC_UNUSED(ring)); } else { ok = !i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring)); diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c index 00e9fe6..95b9a7e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c @@ -55,8 +55,7 @@ static int i40e_xsk_pool_enable(struct i40e_vsi *vsi, qid >= netdev->real_num_tx_queues) return -EINVAL; - err = xsk_buff_dma_map(pool->umem, &vsi->back->pdev->dev, - I40E_RX_DMA_ATTR); + err = xsk_pool_dma_map(pool, &vsi->back->pdev->dev, I40E_RX_DMA_ATTR); if (err) return err; @@ -97,7 +96,7 @@ static int i40e_xsk_pool_disable(struct i40e_vsi *vsi, u16 qid) bool if_running; int err; - pool = xdp_get_xsk_pool_from_qid(netdev, qid); + pool = xsk_get_pool_from_qid(netdev, qid); if (!pool) return -EINVAL; @@ -110,7 +109,7 @@ static int i40e_xsk_pool_disable(struct i40e_vsi *vsi, u16 qid) } clear_bit(qid, vsi->af_xdp_zc_qps); - xsk_buff_dma_unmap(pool->umem, I40E_RX_DMA_ATTR); + xsk_pool_dma_unmap(pool, I40E_RX_DMA_ATTR); if (if_running) { err = i40e_queue_pair_enable(vsi, qid); @@ -196,7 +195,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count) rx_desc = I40E_RX_DESC(rx_ring, ntu); bi = i40e_rx_bi(rx_ring, ntu); do { - xdp = xsk_buff_alloc(rx_ring->xsk_pool->umem); + xdp = xsk_buff_alloc(rx_ring->xsk_pool); if (!xdp) { ok = false; goto no_buffer
[PATCH bpf-next v5 06/15] xsk: move xsk_tx_list and its lock to buffer pool
Move the xsk_tx_list and the xsk_tx_list_lock from the umem to the buffer pool. This so that we in a later commit can share the umem between multiple HW queues. There is one xsk_tx_list per device and queue id, so it should be located in the buffer pool. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 4 +--- include/net/xsk_buff_pool.h | 5 + net/xdp/xdp_umem.c | 26 -- net/xdp/xdp_umem.h | 2 -- net/xdp/xsk.c | 15 ++- net/xdp/xsk_buff_pool.c | 26 ++ 6 files changed, 38 insertions(+), 40 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index b052f1c..9a61d05 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -29,8 +29,6 @@ struct xdp_umem { u8 flags; int id; bool zc; - spinlock_t xsk_tx_list_lock; - struct list_head xsk_tx_list; }; struct xsk_map { @@ -57,7 +55,7 @@ struct xdp_sock { /* Protects multiple processes in the control path */ struct mutex mutex; struct xsk_queue *tx cacheline_aligned_in_smp; - struct list_head list; + struct list_head tx_list; /* Mutual exclusion of NAPI TX thread and sendmsg error paths * in the SKB destructor callback. */ diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 2d94890..83f100c 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -52,6 +52,9 @@ struct xsk_buff_pool { void *addrs; struct device *dev; struct net_device *netdev; + struct list_head xsk_tx_list; + /* Protects modifications to the xsk_tx_list */ + spinlock_t xsk_tx_list_lock; refcount_t users; struct work_struct work; struct xdp_buff_xsk *free_heads[]; @@ -67,6 +70,8 @@ void xp_release(struct xdp_buff_xsk *xskb); void xp_get_pool(struct xsk_buff_pool *pool); void xp_put_pool(struct xsk_buff_pool *pool); void xp_clear_dev(struct xsk_buff_pool *pool); +void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs); +void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs); /* AF_XDP, and XDP core. 
*/ void xp_free(struct xdp_buff_xsk *xskb); diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 3e612fc..7751592 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -23,30 +23,6 @@ static DEFINE_IDA(umem_ida); -void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) -{ - unsigned long flags; - - if (!xs->tx) - return; - - spin_lock_irqsave(&umem->xsk_tx_list_lock, flags); - list_add_rcu(&xs->list, &umem->xsk_tx_list); - spin_unlock_irqrestore(&umem->xsk_tx_list_lock, flags); -} - -void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) -{ - unsigned long flags; - - if (!xs->tx) - return; - - spin_lock_irqsave(&umem->xsk_tx_list_lock, flags); - list_del_rcu(&xs->list); - spin_unlock_irqrestore(&umem->xsk_tx_list_lock, flags); -} - static void xdp_umem_unpin_pages(struct xdp_umem *umem) { unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true); @@ -205,8 +181,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) umem->pgs = NULL; umem->user = NULL; umem->flags = mr->flags; - INIT_LIST_HEAD(&umem->xsk_tx_list); - spin_lock_init(&umem->xsk_tx_list_lock); refcount_set(&umem->users, 1); diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h index 67bf3f3..181fdda 100644 --- a/net/xdp/xdp_umem.h +++ b/net/xdp/xdp_umem.h @@ -10,8 +10,6 @@ void xdp_get_umem(struct xdp_umem *umem); void xdp_put_umem(struct xdp_umem *umem); -void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); -void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr); #endif /* XDP_UMEM_H_ */ diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 9f1b906e..067e854 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -51,14 +51,13 @@ EXPORT_SYMBOL(xsk_set_rx_need_wakeup); void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) { - struct xdp_umem *umem = pool->umem; struct xdp_sock *xs; if (pool->cached_need_wakeup & XDP_WAKEUP_TX) return; rcu_read_lock(); - list_for_each_entry_rcu(xs, &umem->xsk_tx_list, list) { + list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) { xs->tx->ring->flags |= XDP_RING_NEED_WAKEUP; } rcu_read_unlock(); @@ -79,14 +78,13 @@ EXPORT_SYMBOL(xsk_clear_rx_need_wakeup); void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool) { - struct xdp_umem *umem = pool->umem; struct xdp_sock *xs; if (!(pool->cached_need_wakeup & XDP_WAKEUP_TX)) return; rcu_read_lock(); - list_for_ea
[PATCH bpf-next v5 15/15] xsk: documentation for XDP_SHARED_UMEM between queues and netdevs
Add documentation for the XDP_SHARED_UMEM feature when a UMEM is shared between different queues and/or netdevs. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- Documentation/networking/af_xdp.rst | 68 +++-- 1 file changed, 58 insertions(+), 10 deletions(-) diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 5bc55a4..2ccc564 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -258,14 +258,21 @@ socket into zero-copy mode or fail. XDP_SHARED_UMEM bind flag - -This flag enables you to bind multiple sockets to the same UMEM, but -only if they share the same queue id. In this mode, each socket has -their own RX and TX rings, but the UMEM (tied to the fist socket -created) only has a single FILL ring and a single COMPLETION -ring. To use this mode, create the first socket and bind it in the normal -way. Create a second socket and create an RX and a TX ring, or at -least one of them, but no FILL or COMPLETION rings as the ones from -the first socket will be used. In the bind call, set he +This flag enables you to bind multiple sockets to the same UMEM. It +works on the same queue id, between queue ids and between +netdevs/devices. In this mode, each socket has their own RX and TX +rings as usual, but you are going to have one or more FILL and +COMPLETION ring pairs. You have to create one of these pairs per +unique netdev and queue id tuple that you bind to. + +Starting with the case were we would like to share a UMEM between +sockets bound to the same netdev and queue id. The UMEM (tied to the +fist socket created) will only have a single FILL ring and a single +COMPLETION ring as there is only on unique netdev,queue_id tuple that +we have bound to. To use this mode, create the first socket and bind +it in the normal way. Create a second socket and create an RX and a TX +ring, or at least one of them, but no FILL or COMPLETION rings as the +ones from the first socket will be used. In the bind call, set he XDP_SHARED_UMEM option and provide the initial socket's fd in the sxdp_shared_umem_fd field. You can attach an arbitrary number of extra sockets this way. @@ -305,11 +312,41 @@ concurrently. There are no synchronization primitives in the libbpf code that protects multiple users at this point in time. Libbpf uses this mode if you create more than one socket tied to the -same umem. However, note that you need to supply the +same UMEM. However, note that you need to supply the XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the xsk_socket__create calls and load your own XDP program as there is no built in one in libbpf that will route the traffic for you. +The second case is when you share a UMEM between sockets that are +bound to different queue ids and/or netdevs. In this case you have to +create one FILL ring and one COMPLETION ring for each unique +netdev,queue_id pair. Let us say you want to create two sockets bound +to two different queue ids on the same netdev. Create the first socket +and bind it in the normal way. Create a second socket and create an RX +and a TX ring, or at least one of them, and then one FILL and +COMPLETION ring for this socket. Then in the bind call, set he +XDP_SHARED_UMEM option and provide the initial socket's fd in the +sxdp_shared_umem_fd field as you registered the UMEM on that +socket. These two sockets will now share one and the same UMEM. 
+ +There is no need to supply an XDP program like the one in the previous +case where sockets were bound to the same queue id and +device. Instead, use the NIC's packet steering capabilities to steer +the packets to the right queue. In the previous example, there is only +one queue shared among sockets, so the NIC cannot do this steering. It +can only steer between queues. + +In libbpf, you need to use the xsk_socket__create_shared() API as it +takes a reference to a FILL ring and a COMPLETION ring that will be +created for you and bound to the shared UMEM. You can use this +function for all the sockets you create, or you can use it for the +second and following ones and use xsk_socket__create() for the first +one. Both methods yield the same result. + +Note that a UMEM can be shared between sockets on the same queue id +and device, as well as between queues on the same device and between +devices at the same time. + XDP_USE_NEED_WAKEUP bind flag - @@ -364,7 +401,7 @@ resources by only setting up one of them. Both the FILL ring and the COMPLETION ring are mandatory as you need to have a UMEM tied to your socket. But if the XDP_SHARED_UMEM flag is used, any socket after the first one does not have a UMEM and should in that case not have any -FILL or COMPLETION rings created as the ones from the shared umem will +FILL or COMPLETION rings created as the ones from the shared UMEM will be used. Note, that the rings are single-producer
[PATCH bpf-next v5 13/15] libbpf: support shared umems between queues and devices
Add support for shared umems between hardware queues and devices to the AF_XDP part of libbpf. This so that zero-copy can be achieved in applications that want to send and receive packets between HW queues on one device or between different devices/netdevs. In order to create sockets that share a umem between hardware queues and devices, a new function has been added called xsk_socket__create_shared(). It takes the same arguments as xsk_socket_create() plus references to a fill ring and a completion ring. So for every socket that share a umem, you need to have one more set of fill and completion rings. This in order to maintain the single-producer single-consumer semantics of the rings. You can create all the sockets via the new xsk_socket__create_shared() call, or create the first one with xsk_socket__create() and the rest with xsk_socket__create_shared(). Both methods work. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- tools/lib/bpf/libbpf.map | 1 + tools/lib/bpf/xsk.c | 376 ++- tools/lib/bpf/xsk.h | 9 ++ 3 files changed, 254 insertions(+), 132 deletions(-) diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 66a6286..3fedcdc 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -306,4 +306,5 @@ LIBBPF_0.2.0 { perf_buffer__buffer_fd; perf_buffer__epoll_fd; perf_buffer__consume_buffer; + xsk_socket__create_shared; } LIBBPF_0.1.0; diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c index a9b0210..49c3245 100644 --- a/tools/lib/bpf/xsk.c +++ b/tools/lib/bpf/xsk.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -45,26 +46,35 @@ #endif struct xsk_umem { - struct xsk_ring_prod *fill; - struct xsk_ring_cons *comp; + struct xsk_ring_prod *fill_save; + struct xsk_ring_cons *comp_save; char *umem_area; struct xsk_umem_config config; int fd; int refcount; + struct list_head ctx_list; +}; + +struct xsk_ctx { + struct xsk_ring_prod *fill; + struct xsk_ring_cons *comp; + __u32 queue_id; + struct xsk_umem *umem; + int refcount; + int ifindex; + struct list_head list; + int prog_fd; + int xsks_map_fd; + char ifname[IFNAMSIZ]; }; struct xsk_socket { struct xsk_ring_cons *rx; struct xsk_ring_prod *tx; __u64 outstanding_tx; - struct xsk_umem *umem; + struct xsk_ctx *ctx; struct xsk_socket_config config; int fd; - int ifindex; - int prog_fd; - int xsks_map_fd; - __u32 queue_id; - char ifname[IFNAMSIZ]; }; struct xsk_nl_info { @@ -200,15 +210,73 @@ static int xsk_get_mmap_offsets(int fd, struct xdp_mmap_offsets *off) return -EINVAL; } +static int xsk_create_umem_rings(struct xsk_umem *umem, int fd, +struct xsk_ring_prod *fill, +struct xsk_ring_cons *comp) +{ + struct xdp_mmap_offsets off; + void *map; + int err; + + err = setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, +&umem->config.fill_size, +sizeof(umem->config.fill_size)); + if (err) + return -errno; + + err = setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, +&umem->config.comp_size, +sizeof(umem->config.comp_size)); + if (err) + return -errno; + + err = xsk_get_mmap_offsets(fd, &off); + if (err) + return -errno; + + map = mmap(NULL, off.fr.desc + umem->config.fill_size * sizeof(__u64), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + XDP_UMEM_PGOFF_FILL_RING); + if (map == MAP_FAILED) + return -errno; + + fill->mask = umem->config.fill_size - 1; + fill->size = umem->config.fill_size; + fill->producer = map + off.fr.producer; + fill->consumer = map + off.fr.consumer; + fill->flags = map + off.fr.flags; + fill->ring = map + off.fr.desc; + fill->cached_cons = 
umem->config.fill_size; + + map = mmap(NULL, off.cr.desc + umem->config.comp_size * sizeof(__u64), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + XDP_UMEM_PGOFF_COMPLETION_RING); + if (map == MAP_FAILED) { + err = -errno; + goto out_mmap; + } + + comp->mask = umem->config.comp_size - 1; + comp->size = umem->config.comp_size; + comp->producer = map + off.cr.producer; + comp->consumer = map + off.cr.consumer; + comp->flags = map + off.cr.flags; + comp->ring = map + off.cr.desc; + + return 0; + +out_mmap: + munmap(map, off.fr.desc + umem->config.fill_size * sizeof(__u64)); +
[PATCH bpf-next v5 09/15] xsk: rearrange internal structs for better performance
Rearrange the xdp_sock, xdp_umem and xsk_buff_pool structures so that they get smaller and align better to the cache lines. In the previous commits of this patch set, these structs have been reordered with the focus on functionality and simplicity, not performance. This patch improves throughput performance by around 3%. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 13 +++-- include/net/xsk_buff_pool.h | 27 +++ 2 files changed, 22 insertions(+), 18 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 282aeba..1a9559c 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -23,13 +23,13 @@ struct xdp_umem { u32 headroom; u32 chunk_size; u32 chunks; + u32 npgs; struct user_struct *user; refcount_t users; - struct page **pgs; - u32 npgs; u8 flags; - int id; bool zc; + struct page **pgs; + int id; struct list_head xsk_dma_list; }; @@ -42,7 +42,7 @@ struct xsk_map { struct xdp_sock { /* struct sock must be the first member of struct xdp_sock */ struct sock sk; - struct xsk_queue *rx; + struct xsk_queue *rx cacheline_aligned_in_smp; struct net_device *dev; struct xdp_umem *umem; struct list_head flush_node; @@ -54,8 +54,7 @@ struct xdp_sock { XSK_BOUND, XSK_UNBOUND, } state; - /* Protects multiple processes in the control path */ - struct mutex mutex; + struct xsk_queue *tx cacheline_aligned_in_smp; struct list_head tx_list; /* Mutual exclusion of NAPI TX thread and sendmsg error paths @@ -72,6 +71,8 @@ struct xdp_sock { struct list_head map_list; /* Protects map_list */ spinlock_t map_list_lock; + /* Protects multiple processes in the control path */ + struct mutex mutex; struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */ }; diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 356d0ac..38d03a6 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -39,9 +39,22 @@ struct xsk_dma_map { }; struct xsk_buff_pool { - struct xsk_queue *fq; - struct xsk_queue *cq; + /* Members only used in the control path first. */ + struct device *dev; + struct net_device *netdev; + struct list_head xsk_tx_list; + /* Protects modifications to the xsk_tx_list */ + spinlock_t xsk_tx_list_lock; + refcount_t users; + struct xdp_umem *umem; + struct work_struct work; struct list_head free_list; + u32 heads_cnt; + u16 queue_id; + + /* Data path members as close to free_heads at the end as possible. */ + struct xsk_queue *fq cacheline_aligned_in_smp; + struct xsk_queue *cq; /* For performance reasons, each buff pool has its own array of dma_pages * even when they are identical. */ @@ -51,25 +64,15 @@ struct xsk_buff_pool { u64 addrs_cnt; u32 free_list_cnt; u32 dma_pages_cnt; - u32 heads_cnt; u32 free_heads_cnt; u32 headroom; u32 chunk_size; u32 frame_len; - u16 queue_id; u8 cached_need_wakeup; bool uses_need_wakeup; bool dma_need_sync; bool unaligned; - struct xdp_umem *umem; void *addrs; - struct device *dev; - struct net_device *netdev; - struct list_head xsk_tx_list; - /* Protects modifications to the xsk_tx_list */ - spinlock_t xsk_tx_list_lock; - refcount_t users; - struct work_struct work; struct xdp_buff_xsk *free_heads[]; }; -- 2.7.4
[PATCH bpf-next v5 12/15] xsk: add shared umem support between devices
Add support to share a umem between different devices. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same device. Note that when sharing a umem between devices, just as in the case of sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket (with two setsockopts, one for each ring) before you do the bind with the XDP_SHARED_UMEM flag. This so that the single-producer single-consumer semantics of the rings can be upheld. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- net/xdp/xsk.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index ea8d2ec..5eb6662 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -701,14 +701,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) sockfd_put(sock); goto out_unlock; } - if (umem_xs->dev != dev) { - err = -EINVAL; - sockfd_put(sock); - goto out_unlock; - } - if (umem_xs->queue_id != qid) { - /* Share the umem with another socket on another qid */ + if (umem_xs->queue_id != qid || umem_xs->dev != dev) { + /* Share the umem with another socket on another qid +* and/or device. +*/ xs->pool = xp_create_and_assign_umem(xs, umem_xs->umem); if (!xs->pool) { -- 2.7.4
[PATCH bpf-next v5 07/15] xsk: move addrs from buffer pool to umem
Replicate the addrs pointer in the buffer pool to the umem. This mapping will be the same for all buffer pools sharing the same umem. In the buffer pool we leave the addrs pointer for performance reasons. Signed-off-by: Magnus Karlsson Acked-by: Björn Töpel --- include/net/xdp_sock.h | 1 + net/xdp/xdp_umem.c | 22 ++ net/xdp/xsk_buff_pool.c | 21 ++--- 3 files changed, 25 insertions(+), 19 deletions(-) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 9a61d05..126d243 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -18,6 +18,7 @@ struct xsk_queue; struct xdp_buff; struct xdp_umem { + void *addrs; u64 size; u32 headroom; u32 chunk_size; diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 7751592..77604c3 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -39,11 +39,27 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem) } } +static void xdp_umem_addr_unmap(struct xdp_umem *umem) +{ + vunmap(umem->addrs); + umem->addrs = NULL; +} + +static int xdp_umem_addr_map(struct xdp_umem *umem, struct page **pages, +u32 nr_pages) +{ + umem->addrs = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + if (!umem->addrs) + return -ENOMEM; + return 0; +} + static void xdp_umem_release(struct xdp_umem *umem) { umem->zc = false; ida_simple_remove(&umem_ida, umem->id); + xdp_umem_addr_unmap(umem); xdp_umem_unpin_pages(umem); xdp_umem_unaccount_pages(umem); @@ -192,8 +208,14 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) if (err) goto out_account; + err = xdp_umem_addr_map(umem, umem->pgs, umem->npgs); + if (err) + goto out_unpin; + return 0; +out_unpin: + xdp_umem_unpin_pages(umem); out_account: xdp_umem_unaccount_pages(umem); return err; diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index dbd913e..c563874 100644 --- a/net/xdp/xsk_buff_pool.c +++ b/net/xdp/xsk_buff_pool.c @@ -35,26 +35,11 @@ void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs) spin_unlock_irqrestore(&pool->xsk_tx_list_lock, flags); } -static void xp_addr_unmap(struct xsk_buff_pool *pool) -{ - vunmap(pool->addrs); -} - -static int xp_addr_map(struct xsk_buff_pool *pool, - struct page **pages, u32 nr_pages) -{ - pool->addrs = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - if (!pool->addrs) - return -ENOMEM; - return 0; -} - void xp_destroy(struct xsk_buff_pool *pool) { if (!pool) return; - xp_addr_unmap(pool); kvfree(pool->heads); kvfree(pool); } @@ -64,7 +49,6 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, { struct xsk_buff_pool *pool; struct xdp_buff_xsk *xskb; - int err; u32 i; pool = kvzalloc(struct_size(pool, free_heads, umem->chunks), @@ -86,6 +70,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, pool->frame_len = umem->chunk_size - umem->headroom - XDP_PACKET_HEADROOM; pool->umem = umem; + pool->addrs = umem->addrs; INIT_LIST_HEAD(&pool->free_list); INIT_LIST_HEAD(&pool->xsk_tx_list); spin_lock_init(&pool->xsk_tx_list_lock); @@ -103,9 +88,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, pool->free_heads[i] = xskb; } - err = xp_addr_map(pool, umem->pgs, umem->npgs); - if (!err) - return pool; + return pool; out: xp_destroy(pool); -- 2.7.4
[PATCH bpf-next v5 14/15] samples/bpf: add new sample xsk_fwd.c
From: Cristian Dumitrescu This sample code illustrates the packet forwarding between multiple AF_XDP sockets in multi-threading environment. All the threads and sockets are sharing a common buffer pool, with each socket having its own private buffer cache. The sockets are created with the xsk_socket__create_shared() function, which allows multiple AF_XDP sockets to share the same UMEM object. Example 1: Single thread handling two sockets. Packets received from socket A (on top of interface IFA, queue QA) are forwarded to socket B (on top of interface IFB, queue QB) and vice-versa. The thread is affinitized to CPU core C: ./xsk_fwd -i IFA -q QA -i IFB -q QB -c C Example 2: Two threads, each handling two sockets. Packets from socket A are sent to socket B (by thread X), packets from socket B are sent to socket A (by thread X); packets from socket C are sent to socket D (by thread Y), packets from socket D are sent to socket C (by thread Y). The two threads are bound to CPU cores CX and CY: ./xdp_fwd -i IFA -q QA -i IFB -q QB -i IFC -q QC -i IFD -q QD -c CX -c CY Signed-off-by: Cristian Dumitrescu Acked-by: Björn Töpel --- samples/bpf/Makefile |3 + samples/bpf/xsk_fwd.c | 1085 + 2 files changed, 1088 insertions(+) create mode 100644 samples/bpf/xsk_fwd.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index a6d3646..4f1ed0e 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -48,6 +48,7 @@ tprogs-y += syscall_tp tprogs-y += cpustat tprogs-y += xdp_adjust_tail tprogs-y += xdpsock +tprogs-y += xsk_fwd tprogs-y += xdp_fwd tprogs-y += task_fd_query tprogs-y += xdp_sample_pkts @@ -104,6 +105,7 @@ syscall_tp-objs := syscall_tp_user.o cpustat-objs := cpustat_user.o xdp_adjust_tail-objs := xdp_adjust_tail_user.o xdpsock-objs := xdpsock_user.o +xsk_fwd-objs := xsk_fwd.o xdp_fwd-objs := xdp_fwd_user.o task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS) xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS) @@ -203,6 +205,7 @@ TPROGLDLIBS_trace_output+= -lrt TPROGLDLIBS_map_perf_test += -lrt TPROGLDLIBS_test_overhead += -lrt TPROGLDLIBS_xdpsock+= -pthread +TPROGLDLIBS_xsk_fwd+= -pthread # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make M=samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/xsk_fwd.c b/samples/bpf/xsk_fwd.c new file mode 100644 index 000..1cd97c8 --- /dev/null +++ b/samples/bpf/xsk_fwd.c @@ -0,0 +1,1085 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2020 Intel Corporation. */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) + +typedef __u64 u64; +typedef __u32 u32; +typedef __u16 u16; +typedef __u8 u8; + +/* This program illustrates the packet forwarding between multiple AF_XDP + * sockets in multi-threaded environment. All threads are sharing a common + * buffer pool, with each socket having its own private buffer cache. + * + * Example 1: Single thread handling two sockets. The packets received by socket + * A (interface IFA, queue QA) are forwarded to socket B (interface IFB, queue + * QB), while the packets received by socket B are forwarded to socket A. 
The + * thread is running on CPU core X: + * + * ./xsk_fwd -i IFA -q QA -i IFB -q QB -c X + * + * Example 2: Two threads, each handling two sockets. The thread running on CPU + * core X forwards all the packets received by socket A to socket B, and all the + * packets received by socket B to socket A. The thread running on CPU core Y is + * performing the same packet forwarding between sockets C and D: + * + * ./xsk_fwd -i IFA -q QA -i IFB -q QB -i IFC -q QC -i IFD -q QD + * -c CX -c CY + */ + +/* + * Buffer pool and buffer cache + * + * For packet forwarding, the packet buffers are typically allocated from the + * pool for packet reception and freed back to the pool for further reuse once + * the packet transmission is completed. + * + * The buffer pool is shared between multiple threads. In order to minimize the + * access latency to the shared buffer pool, each thread creates one (or + * several) buffer caches, which, unlike the buffer pool, are private to the + * thread that creates them and therefore cannot be shared with other threads. + * The access to the shared pool is only needed either (A) when the cache gets + * empty due to repeated buffer allocations and it needs to be replenished from + * the pool, or (B) when the cache gets full due to repeated buffer free and it + * needs to be flushed back to the pull. + * + * In a packet forwarding system, a packet received on a
Re: [PATCH] Revert "wlcore: Adding suppoprt for IGTK key in wlcore driver"
Em Thu, 27 Aug 2020 13:36:28 -0700 Steve deRosier escreveu: > Hi Mauro, > > On Thu, Aug 27, 2020 at 10:42 AM Mauro Carvalho Chehab > wrote: > > > > Em Thu, 27 Aug 2020 08:48:30 -0700 > > Steve deRosier escreveu: > > > > > On Tue, Aug 25, 2020 at 10:49 PM Mauro Carvalho Chehab > > > wrote: > > > > > > > > This patch causes a regression betwen Kernel 5.7 and 5.8 at wlcore: > > > > with it applied, WiFi stops working, and the Kernel starts printing > > > > this message every second: > > > > > > > >wlcore: PHY firmware version: Rev 8.2.0.0.242 > > > >wlcore: firmware booted (Rev 8.9.0.0.79) > > > >wlcore: ERROR command execute failure 14 > > > > > > Only if NO firmware for the device in question supports the `KEY_IGTK` > > > value, then this revert is appropriate. Otherwise, it likely isn't. > > > > Yeah, that's what I suspect too: some specific firmware is required > > for KEY_IGTK to work. > > > > > My suspicion is that the feature that `KEY_IGTK` is enabling is > > > specific to a newer firmware that Mauro hasn't upgraded to. What the > > > OP should do is find the updated firmware and give it a try. > > > > I didn't try checking if linux-firmware tree has a newer version on > > it. I'm using Debian Bullseye on this device. So, I suspect that > > it may have a relatively new firmware. > > > > Btw, that's also the version that came together with Fedora 32: > > > > $ strings /lib/firmware/ti-connectivity/wl18xx-fw-4.bin |grep FRev > > FRev 8.9.0.0.79 > > FRev 8.2.0.0.242 > > > > Looking at: > > https://git.ti.com/cgit/wilink8-wlan/wl18xx_fw/ > > > > It sounds that there's a newer version released this year: > > > > 2020-05-28 Updated to FW 8.9.0.0.81 > > 2018-07-29 Updated to FW 8.9.0.0.79 > > > > However, it doesn't reached linux-firmware upstream yet: > > > > $ git log --pretty=oneline ti-connectivity/wl18xx-fw-4.bin > > 3a5103fc3c29 wl18xx: update firmware file 8.9.0.0.79 > > 65b1c68c63f9 wl18xx: update firmware file 8.9.0.0.76 > > dbb85a5154a5 wl18xx: update firmware file > > 69a250dd556b wl18xx: update firmware file > > dbe3f134bb69 wl18xx: update firmware file, remove conf file > > dab4b79b3fbc wl18xx: add version 4 of the wl18xx firmware > > > > > AND - since there's some firmware the feature doesn't work with, the > > > driver should be fixed to detect the running firmware version and not > > > do things that the firmware doesn't support. AND the firmware writer > > > should also make it so the firmware doesn't barf on bad input and > > > instead rejects it politely. > > > > Agreed. The main issue here seems to be that the current patch > > assumes that this feature is available. A proper approach would > > be to check if this feature is available before trying to use it. > > > > Now, I dunno if version 8.9.0.0.81 has what's required for it to > > work - or if KEY_IGTK require some custom firmware version. > > > > If it works with such version, one way would be to add a check > > for this specific version, disabling KEY_IGTK otherwise. > > > > Also, someone from TI should be sending the newer version to > > be added at linux-firmware. > > > > I'll try to do a test maybe tomorrow. > > > > I think we're totally agreed on all of the above points. > Fundamentally: the orig patch should've been coded defensively and > tested properly since clearly it causes certain firmwares to break. > Be nice if TI would both update the firmware and also update the > driver to detect the relevant version for features. 
I don't know > about this one, but I do know the QCA firmwares (and others) have a > set of feature flags that are detected by the drivers to determine > what is supported. > > I look forward to hearing the results of your test. This whole thing > has gotten me interested. I'd be tempted to pull out the relevant dev > boards and play with them myself, but IIRC they got sent back to a > previous employer and I don't have access to them anymore. I upgraded to the newest firmware available at the TI firmware site: https://git.ti.com/cgit/wilink8-wlan/wl18xx_fw/log/ No joy. It worked once, but I guess it just selected some different cipher, as the access point it is logging into is set to WPA2-PSK with auto cipher. I even wrote a patch that checks the version, enabling the AES-CMAC algorithm only with newer firmware (see below). Maybe this feature will only be available on some future firmware or it requires some custom-made one. It may also require some newer version of the chipset. So, for now, I suggest reverting this patch, c/c stable. A later patch can re-enable it, once some additional logic gets added in order to validate if this algorithm is properly supported by the hardware/firmware. Thanks, Mauro [PATCH] net: wireless: wlcore: fix support for IGTK key Changeset 2b7aadd3b9e1 ("wlcore: Adding suppoprt for IGTK key in wlcore driver") adde
Re: packet deadline and process scheduling
There is an active Internet draft "Packet Delivery Deadline time in 6LoWPAN Routing Header" (https://datatracker.ietf.org/doc/draft-ietf-6lo-deadline-time/) which is presently in the RFC Editor queue and is expected to become an RFC in the near future. I happened to be one of the co-authors of this draft. The main objective of the draft is to support time-sensitive industrial applications such as industrial process control and automation over IP networks. While the current draft caters to 6LoWPAN networks, I would assume that it can be extended to carry deadline information in other encapsulations including IPv6. Once the packet reaches the destination at the network stack in the kernel, it has to be passed on to the receiver application within the deadline carried in the packet, because it is the receiver application running in user space that is the eventual consumer of the data. My mail below is about ensuring that a packet sitting at the socket interface is passed on to the receiving user-space application process in a timely fashion with the help of the OS scheduler. Since the incoming packet experiences variable delay, the remaining time left before the deadline also varies. There should be a mechanism within the kernel where the network stack communicates with the OS scheduler, letting the scheduler know the deadline by which the user application's socket recv call is expected to return. Anand On 20-08-28 10:14:13, Eric Dumazet wrote: > > > On 8/27/20 11:45 PM, S.V.R.Anand wrote: > > Hi, > > > > In the control loop application I am trying to build, an incoming message > > from > > the network will have a deadline before which it should be delivered to the > > receiver process. This essentially calls for a way of scheduling this > > process > > based on the deadline information contained in the message. > > > > If not already available, I wish to write code for such run-time ordering > > of > > processes in the earlist deadline first fashion. The assumption, however > > futuristic it may be, is that deadline information is contained as part of > > the > > packet header something like an inband-OAM. > > > > Your feedback on the above will be very helpful. > > > > Hope the above objective will be of general interest to netdev as well. > > > > My apologies if this is not the appropriate mailing list for posting this > > kind > > of mails. > > > > Anand > > > > Is this described in some RFC ? > > If not, I guess you might have to code this in user space. > >
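Until such an in-kernel mechanism exists, the closest existing building block is SCHED_DEADLINE, the kernel's EDF scheduling class, which already lets a receiving process declare runtime/deadline/period to the scheduler from user space. The sketch below shows only that declaration; the numbers are placeholders and sched_setattr() is invoked through syscall() because glibc provides no wrapper.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>        /* SCHED_DEADLINE */

/* struct sched_attr as defined by the kernel ABI, see sched_setattr(2). */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* ns */
        uint64_t sched_deadline;        /* ns */
        uint64_t sched_period;          /* ns */
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
                         unsigned int flags)
{
        return syscall(SYS_sched_setattr, pid, attr, flags);
}

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = SCHED_DEADLINE,
                .sched_runtime  =  2 * 1000 * 1000,     /* 2 ms of CPU time */
                .sched_deadline = 10 * 1000 * 1000,     /* within 10 ms */
                .sched_period   = 10 * 1000 * 1000,     /* every 10 ms */
        };

        if (sched_setattr(0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }
        /* ... recvmsg() loop of the control application goes here ... */
        return 0;
}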
Re: [PATCH 12/30] net: wireless: cisco: airo: Fix a myriad of coding style issues
Ondrej Zary writes: > On Thursday 27 August 2020 09:49:12 Kalle Valo wrote: >> Ondrej Zary writes: >> >> > On Monday 17 August 2020 20:27:06 Jesse Brandeburg wrote: >> >> On Mon, 17 Aug 2020 16:27:01 +0300 >> >> Kalle Valo wrote: >> >> >> >> > I was surprised to see that someone was using this driver in 2015, so >> >> > I'm not sure anymore what to do. Of course we could still just remove >> >> > it and later revert if someone steps up and claims the driver is still >> >> > usable. Hmm. Does anyone any users of this driver? >> >> >> >> What about moving the driver over into staging, which is generally the >> >> way I understood to move a driver slowly out of the kernel? >> > >> > Please don't remove random drivers. >> >> We don't want to waste time on obsolete drivers and instead prefer to >> use our time on more productive tasks. For us wireless maintainers it's >> really hard to know if old drivers are still in use or if they are just >> broken. >> >> > I still have the Aironet PCMCIA card and can test the driver. >> >> Great. Do you know if the airo driver still works with recent kernels? > > Yes, it does. Nice, I'm very surprised that so old and unmaintained driver still works. Thanks for testing. -- https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH nf-next v3 0/3] Netfilter egress hook
On 8/28/20 12:14 AM, Daniel Borkmann wrote: > Hi Lukas, > > On 8/27/20 10:55 AM, Lukas Wunner wrote: >> Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs >> such as DHCP and to prepare for in-kernel NAT64/NAT46. > > Thinking more about this, how will this allow to sufficiently filter > AF_PACKET? > It won't. Any AF_PACKET application can freely set PACKET_QDISC_BYPASS without > additional privileges and then dev_queue_xmit() is being bypassed in the host > ns. > This is therefore ineffective and not sufficient. (From container side these > can > be caught w/ host veth on ingress, but not in host ns, of course, so hook > won't > be invoked.) Presumably dev_direct_xmit() could be augmented to support the hook. dev_direct_xmit() (packet_direct_xmit()) was introduced to bypass qdisc, not to bypass everything.
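For reference, the bypass being discussed is a single setsockopt on the packet socket and needs nothing beyond the CAP_NET_RAW the application already holds; a minimal sketch:

#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

/* Sketch: any AF_PACKET sender can skip dev_queue_xmit(), and with it an
 * egress hook placed there, by turning on PACKET_QDISC_BYPASS. Error
 * handling omitted.
 */
int open_bypass_socket(void)
{
        int one = 1;
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0)
                return -1;
        setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
        return fd;      /* transmits now go through packet_direct_xmit() */
}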
pull-request: mac80211 2020-08-28
Hi Dave, We have a number of fixes for the current release cycle, one is for a syzbot reported warning (the sanity check) but most are more wifi protocol related. Please pull and let me know if there's any problem. Thanks, johannes The following changes since commit cf96d977381d4a23957bade2ddf1c420b74a26b6: net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() (2020-08-19 16:37:18 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2020-08-28 for you to fetch changes up to 2d9b55508556ccee6410310fb9ea2482fd3328eb: cfg80211: Adjust 6 GHz frequency to channel conversion (2020-08-27 10:53:21 +0200) We have: * fixes for AQL (airtime queue limits) * reduce packet loss detection false positives * a small channel number fix for the 6 GHz band * a fix for 80+80/160 MHz negotiation * an nl80211 attribute (NL80211_ATTR_HE_6GHZ_CAPABILITY) fix * add a missing sanity check for the regulatory code Amar Singhal (1): cfg80211: Adjust 6 GHz frequency to channel conversion Felix Fietkau (4): mac80211: use rate provided via status->rate on ieee80211_tx_status_ext for AQL mac80211: factor out code to look up the average packet length duration for a rate mac80211: improve AQL aggregation estimation for low data rates mac80211: reduce packet loss event false positives Johannes Berg (2): nl80211: fix NL80211_ATTR_HE_6GHZ_CAPABILITY usage cfg80211: regulatory: reject invalid hints Shay Bar (1): wireless: fix wrong 160/80+80 MHz setting net/mac80211/airtime.c | 202 ++-- net/mac80211/sta_info.h | 5 +- net/mac80211/status.c | 43 ++- net/wireless/chan.c | 15 +++- net/wireless/nl80211.c | 2 +- net/wireless/reg.c | 3 + net/wireless/util.c | 8 +- 7 files changed, 192 insertions(+), 86 deletions(-)
Re: [PATCH 12/30] net: wireless: cisco: airo: Fix a myriad of coding style issues
On Fri, 28 Aug 2020, Kalle Valo wrote: > Ondrej Zary writes: > > > On Thursday 27 August 2020 09:49:12 Kalle Valo wrote: > >> Ondrej Zary writes: > >> > >> > On Monday 17 August 2020 20:27:06 Jesse Brandeburg wrote: > >> >> On Mon, 17 Aug 2020 16:27:01 +0300 > >> >> Kalle Valo wrote: > >> >> > >> >> > I was surprised to see that someone was using this driver in 2015, so > >> >> > I'm not sure anymore what to do. Of course we could still just remove > >> >> > it and later revert if someone steps up and claims the driver is still > >> >> > usable. Hmm. Does anyone any users of this driver? > >> >> > >> >> What about moving the driver over into staging, which is generally the > >> >> way I understood to move a driver slowly out of the kernel? > >> > > >> > Please don't remove random drivers. > >> > >> We don't want to waste time on obsolete drivers and instead prefer to > >> use our time on more productive tasks. For us wireless maintainers it's > >> really hard to know if old drivers are still in use or if they are just > >> broken. > >> > >> > I still have the Aironet PCMCIA card and can test the driver. > >> > >> Great. Do you know if the airo driver still works with recent kernels? > > > > Yes, it does. > > Nice, I'm very surprised that so old and unmaintained driver still > works. Thanks for testing. That's awesome. Go Linux! So where does this leave us from a Maintainership perspective? Are you still treating the driver as obsolete? After this revelation, I suggest not. So let's make it better. :) -- Lee Jones [李琼斯] Senior Technical Lead - Developer Services Linaro.org │ Open source software for Arm SoCs Follow Linaro: Facebook | Twitter | Blog
pull-request: mac80211-next 2020-08-28
Hi Dave, Here also nothing stands out, though perhaps you'd be interested in the fact that we now use the new netlink range length validation for some binary attributes. Please pull and let me know if there's any problem. Thanks, johannes The following changes since commit f09665811b142cbf1eb36641ca42cee42c463b3f: Merge branch 'drivers-net-constify-static-ops-variables' (2020-08-26 16:21:17 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git tags/mac80211-next-for-davem-2020-08-28 for you to fetch changes up to 2831a631022eed6e3f800f08892132c6edde652c: nl80211: support SAE authentication offload in AP mode (2020-08-27 15:19:44 +0200) This time we have: * some code to support SAE (WPA3) offload in AP mode * many documentation (wording) fixes/updates * netlink policy updates, including the use of NLA_RANGE with binary attributes * regulatory improvements for adjacent frequency bands * and a few other small additions/refactorings/cleanups Chung-Hsien Hsu (1): nl80211: support SAE authentication offload in AP mode James Prestwood (1): nl80211: fix PORT_AUTHORIZED wording to reflect behavior Johannes Berg (2): nl80211: clean up code/policy a bit nl80211: use NLA_POLICY_RANGE(NLA_BINARY, ...) for a few attributes John Crispin (2): nl80211: rename csa counter attributes countdown counters mac80211: rename csa counters to countdown counters Markus Theil (2): cfg80211: add helper fn for single rule channels cfg80211: add helper fn for adjacent rule channels Miaohe Lin (1): net: wireless: Convert to use the preferred fallthrough macro Miles Hu (1): nl80211: add support for setting fixed HE rate/gi/ltf Randy Dunlap (7): net: mac80211: agg-rx.c: fix duplicated words net: mac80211: mesh.h: delete duplicated word net: wireless: delete duplicated word + fix grammar net: wireless: reg.c: delete duplicated words + fix punctuation net: wireless: scan.c: delete or fix duplicated words net: wireless: sme.c: delete duplicated word net: wireless: wext_compat.c: delete duplicated word drivers/net/wireless/ath/ath10k/mac.c | 4 +- drivers/net/wireless/ath/ath10k/wmi.c | 2 +- drivers/net/wireless/ath/ath11k/wmi.c | 4 +- drivers/net/wireless/ath/ath9k/beacon.c| 2 +- drivers/net/wireless/ath/ath9k/htc_drv_beacon.c| 2 +- drivers/net/wireless/intel/iwlwifi/mvm/mac-ctxt.c | 6 +- .../net/wireless/intel/iwlwifi/mvm/time-event.c| 2 +- drivers/net/wireless/mac80211_hwsim.c | 2 +- drivers/net/wireless/mediatek/mt76/mac80211.c | 4 +- drivers/net/wireless/mediatek/mt76/mt7615/mcu.c| 10 +- drivers/net/wireless/mediatek/mt76/mt7915/mcu.c| 8 +- include/net/cfg80211.h | 3 + include/net/mac80211.h | 35 ++- include/uapi/linux/nl80211.h | 76 -- net/mac80211/agg-rx.c | 2 +- net/mac80211/cfg.c | 14 +- net/mac80211/ibss.c| 4 +- net/mac80211/ieee80211_i.h | 6 +- net/mac80211/main.c| 2 +- net/mac80211/mesh.c| 6 +- net/mac80211/offchannel.c | 2 +- net/mac80211/tx.c | 73 +++--- net/wireless/chan.c| 4 +- net/wireless/core.h| 4 +- net/wireless/mlme.c| 2 +- net/wireless/nl80211.c | 278 ++--- net/wireless/reg.c | 257 +++ net/wireless/scan.c| 6 +- net/wireless/sme.c | 6 +- net/wireless/util.c| 4 +- net/wireless/wext-compat.c | 6 +- 31 files changed, 561 insertions(+), 275 deletions(-)
Re: [RFC PATCH 00/22] Enhance VHOST to enable SoC-to-SoC communication
On Thu, 9 Jul 2020 14:26:53 +0800 Jason Wang wrote: [Let me note right at the beginning that I first noted this while listening to Kishon's talk at LPC on Wednesday. I might be very confused about the background here, so let me apologize beforehand for any confusion I might spread.] > On 2020/7/8 下午9:13, Kishon Vijay Abraham I wrote: > > Hi Jason, > > > > On 7/8/2020 4:52 PM, Jason Wang wrote: > >> On 2020/7/7 下午10:45, Kishon Vijay Abraham I wrote: > >>> Hi Jason, > >>> > >>> On 7/7/2020 3:17 PM, Jason Wang wrote: > On 2020/7/6 下午5:32, Kishon Vijay Abraham I wrote: > > Hi Jason, > > > > On 7/3/2020 12:46 PM, Jason Wang wrote: > >> On 2020/7/2 下午9:35, Kishon Vijay Abraham I wrote: > >>> Hi Jason, > >>> > >>> On 7/2/2020 3:40 PM, Jason Wang wrote: > On 2020/7/2 下午5:51, Michael S. Tsirkin wrote: > > On Thu, Jul 02, 2020 at 01:51:21PM +0530, Kishon Vijay Abraham I > > wrote: > >> This series enhances Linux Vhost support to enable SoC-to-SoC > >> communication over MMIO. This series enables rpmsg communication > >> between > >> two SoCs using both PCIe RC<->EP and HOST1-NTB-HOST2 > >> > >> 1) Modify vhost to use standard Linux driver model > >> 2) Add support in vring to access virtqueue over MMIO > >> 3) Add vhost client driver for rpmsg > >> 4) Add PCIe RC driver (uses virtio) and PCIe EP driver (uses > >> vhost) for > >> rpmsg communication between two SoCs connected to each > >> other > >> 5) Add NTB Virtio driver and NTB Vhost driver for rpmsg > >> communication > >> between two SoCs connected via NTB > >> 6) Add configfs to configure the components > >> > >> UseCase1 : > >> > >> VHOST RPMSG VIRTIO RPMSG > >> + + > >> | | > >> | | > >> | | > >> | | > >> +-v--+ +--v---+ > >> | Linux | | Linux | > >> | Endpoint | | Root Complex | > >> | <-> | > >> | | | | > >> | SOC1 | | SOC2 | > >> ++ +--+ > >> > >> UseCase 2: > >> > >> VHOST RPMSG VIRTIO > >> RPMSG > >> + + > >> | | > >> | | > >> | | > >> | | > >> +--v--+ > >> +--v--+ > >> | | | > >> | > >> | HOST1 | | > >> HOST2 | > >> | | | > >> | > >> +--^--+ > >> +--^--+ > >> | | > >> | | > >> +-+ > >> | +--v--+ > >> +--v--+ | > >> | | | | > >> | | > >> | | EP | | EP > >> | | > >> | | CONTROLLER1 | | CONTROLLER2 > >> | | > >> | | <---> > >> | | > >> | | | | > >> | | > >> | | | | > >> | | > >> | | | SoC With Multiple EP Instances | > >> | | > >> | | | (Configured using NTB Function) | > >> | | > >> | +-+ > >> +-+ | > >> +-
Re: [PATCH net] drivers/net/wan/hdlc_cisco: Add hard_header_len
Hello Xie, Xie He writes: > This driver didn't set hard_header_len. This patch sets hard_header_len > for it according to its header_ops->create function. BTW it's 4 bytes long: struct hdlc_header { u8 address; u8 control; __be16 protocol; }__packed; OTOH hdlc_setup_dev() initializes hard_header_len to 16, but in this case I guess 4 bytes are better. Acked-by: Krzysztof Halasa > Cc: Martin Schiller > Signed-off-by: Xie He > --- > --- a/drivers/net/wan/hdlc_cisco.c > +++ b/drivers/net/wan/hdlc_cisco.c > @@ -370,6 +370,7 @@ static int cisco_ioctl(struct net_device *dev, struct > ifreq *ifr) > memcpy(&state(hdlc)->settings, &new_settings, size); > spin_lock_init(&state(hdlc)->lock); > dev->header_ops = &cisco_header_ops; > + dev->hard_header_len = sizeof(struct hdlc_header); > dev->type = ARPHRD_CISCO; > call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev); > netif_dormant_on(dev); -- Krzysztof Halasa Sieć Badawcza Łukasiewicz Przemysłowy Instytut Automatyki i Pomiarów PIAP Al. Jerozolimskie 202, 02-486 Warszawa
[PATCH net v2] net: dsa: mt7530: fix advertising unsupported 1000baseT_Half
Remove 1000baseT_Half to advertise correct hardware capability in phylink_validate() callback function. Fixes: 38f790a80560 ("net: dsa: mt7530: Add support for port 5") Signed-off-by: Landen Chao Reviewed-by: Andrew Lunn Reviewed-by: Florian Fainelli --- v1->v2 - fix the commit subject spilled into the commit message --- drivers/net/dsa/mt7530.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 8dcb8a49ab67..238417db26f9 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -1501,7 +1501,7 @@ static void mt7530_phylink_validate(struct dsa_switch *ds, int port, phylink_set(mask, 100baseT_Full); if (state->interface != PHY_INTERFACE_MODE_MII) { - phylink_set(mask, 1000baseT_Half); + /* This switch only supports 1G full-duplex. */ phylink_set(mask, 1000baseT_Full); if (port == 5) phylink_set(mask, 1000baseX_Full); -- 2.17.1
[PATCH net-next] net: phylink: avoid oops during initialisation
If we intend to use PCS operations, mac_pcs_get_state() will not be implemented, so will be NULL. If we also intend to register the PCS operations in mac_prepare() or mac_config(), then this leads to an attempt to call NULL function pointer during phylink_start(). Avoid this, but we must report the link is down. Signed-off-by: Russell King --- There are no users of the new split PCS support currently, so this does not require backporting, but if people think it should have a fixes tag, that would be: Fixes: 7137e18f6f88 ("net: phylink: add struct phylink_pcs") drivers/net/phy/phylink.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 32b4bd6a5b55..5e4cb12972eb 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -535,8 +535,10 @@ static void phylink_mac_pcs_get_state(struct phylink *pl, if (pl->pcs_ops) pl->pcs_ops->pcs_get_state(pl->pcs, state); - else + else if (pl->mac_ops->mac_pcs_get_state) pl->mac_ops->mac_pcs_get_state(pl->config, state); + else + state->link = 0; } /* The fixed state is... fixed except for the link state, -- 2.20.1
Re: [PATCH net v2] net: dsa: mt7530: fix advertising unsupported 1000baseT_Half
On Fri, Aug 28, 2020 at 06:52:44PM +0800, Landen Chao wrote: > Remove 1000baseT_Half to advertise correct hardware capability in > phylink_validate() callback function. > > Fixes: 38f790a80560 ("net: dsa: mt7530: Add support for port 5") > Signed-off-by: Landen Chao > Reviewed-by: Andrew Lunn > Reviewed-by: Florian Fainelli Reviewed-by: Russell King Thanks. > --- > v1->v2 > - fix the commit subject spilled into the commit message > --- > drivers/net/dsa/mt7530.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c > index 8dcb8a49ab67..238417db26f9 100644 > --- a/drivers/net/dsa/mt7530.c > +++ b/drivers/net/dsa/mt7530.c > @@ -1501,7 +1501,7 @@ static void mt7530_phylink_validate(struct dsa_switch > *ds, int port, > phylink_set(mask, 100baseT_Full); > > if (state->interface != PHY_INTERFACE_MODE_MII) { > - phylink_set(mask, 1000baseT_Half); > + /* This switch only supports 1G full-duplex. */ > phylink_set(mask, 1000baseT_Full); > if (port == 5) > phylink_set(mask, 1000baseX_Full); > -- > 2.17.1 > ___ > linux-arm-kernel mailing list > linux-arm-ker...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: [PATCH] net: usb: Fix uninit-was-stored issue in asix_read_phy_addr()
On Thu, Aug 27, 2020 at 1:28 PM Sergei Shtylyov wrote: > > Hello! > > On 27.08.2020 9:53, Himadri Pandya wrote: > > > The buffer size is 2 Bytes and we expect to receive the same amount of > > data. But sometimes we receive less data and run into uninit-was-stored > > issue upon read. Hence modify the error check on the return value to match > > with the buffer size as a prevention. > > > > Reported-and-tested by: > > syzbot+a7e220df5a81d1ab4...@syzkaller.appspotmail.com > > Signed-off-by: Himadri Pandya > > --- > > drivers/net/usb/asix_common.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c > > index e39f41efda3e..7bc6e8f856fe 100644 > > --- a/drivers/net/usb/asix_common.c > > +++ b/drivers/net/usb/asix_common.c > > @@ -296,7 +296,7 @@ int asix_read_phy_addr(struct usbnet *dev, int internal) > > > > netdev_dbg(dev->net, "asix_get_phy_addr()\n"); > > > > - if (ret < 0) { > > + if (ret < 2) { > > netdev_err(dev->net, "Error reading PHYID register: %02x\n", > > ret); > > Hm... printing possibly negative values as hex? > Yeah. That's odd! Fixing it. Thanks, Himadri > [...] > > MBR, Sergei
Re: [PATCH RFC net-next] net/tls: Implement getsockopt SOL_TLS TLS_RX
Hello, is there any chance that this patch gets reviewed? Thanks, Yutaro 2020年8月18日(火) 23:12 Yutaro Hayakawa : > > Implement the getsockopt SOL_TLS TLS_RX which is currently missing. The > primary usecase is to use it in conjunction with TCP_REPAIR to > checkpoint/restore the TLS record layer state. > > TLS connection state usually exists on the user space library. So > basically we can easily extract it from there, but when the TLS > connections are delegated to the kTLS, it is not the case. We need to > have a way to extract the TLS state from the kernel for both of TX and > RX side. > > The new TLS_RX getsockopt copies the crypto_info to user in the same > way as TLS_TX does. > > We have described use cases in our research work in Netdev 0x14 > Transport Workshop [1]. > > Also, there is an TLS implementation called tlse [2] which supports > TLS connection migration. They have support of kTLS and their code > shows that they are expecting the future support of this option. > > [1] https://speakerdeck.com/yutarohayakawa/prism-proxies-without-the-pain > [2] https://github.com/eduardsui/tlse > > Signed-off-by: Yutaro Hayakawa > --- > net/tls/tls_main.c | 50 +- > 1 file changed, 36 insertions(+), 14 deletions(-) > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c > index bbc52b088d29..ea66cac2cd84 100644 > --- a/net/tls/tls_main.c > +++ b/net/tls/tls_main.c > @@ -330,8 +330,8 @@ static void tls_sk_proto_close(struct sock *sk, long > timeout) > tls_ctx_free(sk, ctx); > } > > -static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval, > - int __user *optlen) > +static int do_tls_getsockopt_conf(struct sock *sk, char __user *optval, > + int __user *optlen, int tx) > { > int rc = 0; > struct tls_context *ctx = tls_get_ctx(sk); > @@ -352,7 +352,11 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > } > > /* get user crypto info */ > - crypto_info = &ctx->crypto_send.info; > + if (tx) { > + crypto_info = &ctx->crypto_send.info; > + } else { > + crypto_info = &ctx->crypto_recv.info; > + } > > if (!TLS_CRYPTO_INFO_READY(crypto_info)) { > rc = -EBUSY; > @@ -378,11 +382,19 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > goto out; > } > lock_sock(sk); > - memcpy(crypto_info_aes_gcm_128->iv, > - ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > - TLS_CIPHER_AES_GCM_128_IV_SIZE); > - memcpy(crypto_info_aes_gcm_128->rec_seq, ctx->tx.rec_seq, > - TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + if (tx) { > + memcpy(crypto_info_aes_gcm_128->iv, > + ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > + TLS_CIPHER_AES_GCM_128_IV_SIZE); > + memcpy(crypto_info_aes_gcm_128->rec_seq, > ctx->tx.rec_seq, > + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + } else { > + memcpy(crypto_info_aes_gcm_128->iv, > + ctx->rx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > + TLS_CIPHER_AES_GCM_128_IV_SIZE); > + memcpy(crypto_info_aes_gcm_128->rec_seq, > ctx->rx.rec_seq, > + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + } > release_sock(sk); > if (copy_to_user(optval, > crypto_info_aes_gcm_128, > @@ -402,11 +414,19 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > goto out; > } > lock_sock(sk); > - memcpy(crypto_info_aes_gcm_256->iv, > - ctx->tx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > - TLS_CIPHER_AES_GCM_256_IV_SIZE); > - memcpy(crypto_info_aes_gcm_256->rec_seq, ctx->tx.rec_seq, > - TLS_CIPHER_AES_GCM_256_REC_SEQ_SIZE); > + if (tx) { > + memcpy(crypto_info_aes_gcm_256->iv, > + ctx->tx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > + 
TLS_CIPHER_AES_GCM_256_IV_SIZE); > + memcpy(crypto_info_aes_gcm_256->rec_seq, > ctx->tx.rec_seq, > + TLS_CIPHER_AES_GCM_256_REC_SEQ_SIZE); > + } else { > + memcpy(crypto_info_aes_gcm_256->iv, > + ctx->rx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > + TLS_CIPHER_AES_GCM_256_IV_SIZE); > + memcpy(crypto_info_aes
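For context, this is roughly how a checkpointing tool would consume the new option once the patch above is applied; a minimal userspace sketch, assuming AES-GCM-128 and a socket on which kTLS RX is already set up (SOL_TLS is defined locally in case the libc headers lack it):

#include <string.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282     /* matches include/linux/socket.h */
#endif

/* Sketch: read back the kTLS RX record-layer state (key, IV, sequence
 * number) so it can be restored later, e.g. together with TCP_REPAIR.
 */
static int dump_tls_rx_state(int sk, struct tls12_crypto_info_aes_gcm_128 *out)
{
        socklen_t len = sizeof(*out);

        memset(out, 0, sizeof(*out));
        return getsockopt(sk, SOL_TLS, TLS_RX, out, &len);
}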
[PATCH RFC] xfrm: fail to create ixgbe offload of IPsec tunnel mode sa
Based on talks and indirect references, the ixgbe driver does not support offloading IPsec in tunnel mode; it only supports transport mode. Now explicitly fail when an attempt is made to offload a tunnel mode SA. Fixes: 63a67fe229ea ("ixgbe: add ipsec offload add and remove SA") Signed-off-by: Antony Antony --- I haven't tested this fix as I have no access to the hardware. This patch is based on a libreswan bug report. https://github.com/libreswan/libreswan/issues/252 Is it useful to reference this bug report in the kernel commit message? drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 5 + drivers/net/ethernet/intel/ixgbevf/ipsec.c | 5 + 2 files changed, 10 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c index eca73526ac86..e2b978efcc5a 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c @@ -575,6 +575,11 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs) return -EINVAL; } + if (xs->props.mode != XFRM_MODE_TRANSPORT) { + netdev_err(dev, "Unsupported mode for ipsec offload\n"); + return -EINVAL; + } + if (ixgbe_ipsec_check_mgmt_ip(xs)) { netdev_err(dev, "IPsec IP addr clash with mgmt filters\n"); return -EINVAL; } diff --git a/drivers/net/ethernet/intel/ixgbevf/ipsec.c b/drivers/net/ethernet/intel/ixgbevf/ipsec.c index 5170dd9d8705..d11b3f3414ea 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ipsec.c +++ b/drivers/net/ethernet/intel/ixgbevf/ipsec.c @@ -272,6 +272,11 @@ static int ixgbevf_ipsec_add_sa(struct xfrm_state *xs) return -EINVAL; } + if (xs->props.mode != XFRM_MODE_TRANSPORT) { + netdev_err(dev, "Unsupported mode for ipsec offload\n"); + return -EINVAL; + } + if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) { struct rx_sa rsa; -- 2.21.3
Re: [Patch net] net_sched: fix error path in red_init()
Cong Wang writes: > When ->init() fails, ->destroy() is called to clean up. > So it is unnecessary to clean up in red_init(), and it > would cause some refcount underflow. Hmm, yeah, qdisc_put() would get called twice. A surprising API: the init needs to make sure to always bring the qdisc into a destroyable state. But qevents are like that after kzalloc, so the fix looks correct. Reviewed-by: Petr Machata Thanks!
Re: [PATCH v3 5/8] net: dsa: hellcreek: Add TAPRIO offloading support
Hi Richard, On Thu Aug 27 2020, Richard Cochran wrote: > On Tue, Aug 25, 2020 at 11:55:37AM +0200, Kurt Kanzenbach wrote: >> >> I get your point. But how to do it? We would need a timer based on the >> PTP clock in the switch. > > Can't you use an hrtimer based on CLOCK_MONOTONIC? When the switch and the Linux machine aren't synchronized, we would calculate the difference between both systems and could arm the Linux timer based on CLOCK_MONOTONIC. Given the fact that we have eight seconds, it would *probably* work when the ptp offset adjustments are in that range. > > I would expect the driver to work based solely on the device's clock. Understood. Thanks, Kurt
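As a sketch of the CLOCK_MONOTONIC variant discussed here (the names are made up, the offset handling is only indicated, and a real driver would track clock adjustments and defer the switch reprogramming to a worker, since the hrtimer callback runs in interrupt context):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

struct taprio_timer {
        struct hrtimer timer;
        ktime_t base_time;      /* admin base time, in switch PTP time */
};

static enum hrtimer_restart taprio_fire(struct hrtimer *t)
{
        /* kick a worker that flips to the new gate control list */
        return HRTIMER_NORESTART;
}

/* Arm a monotonic timer to expire when the switch clock reaches base_time,
 * assuming both clocks advance at (nearly) the same rate.
 */
static void taprio_arm(struct taprio_timer *tt, ktime_t ptp_now)
{
        ktime_t delay = ktime_sub(tt->base_time, ptp_now);

        hrtimer_init(&tt->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        tt->timer.function = taprio_fire;
        hrtimer_start(&tt->timer, delay, HRTIMER_MODE_REL);
}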
Re: [PATCH bpf-next] bpf: make bpf_link_info.iter similar to bpf_iter_link_info
On 8/28/20 7:19 AM, Yonghong Song wrote: bpf_link_info.iter is used by link_query to return bpf_iter_link_info to user space. Fields may be different ,e.g., map_fd vs. map_id, so we cannot reuse the exact structure. But make them similar, e.g., struct bpf_link_info { /* common fields */ union { struct { ... } raw_tracepoint; struct { ... } tracing; ... struct { /* common fields for iter */ union { struct { __u32 map_id; } map; /* other structs for other targets */ }; }; }; }; so the structure is extensible the same way as bpf_iter_link_info. Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators") Acked-by: Andrii Nakryiko Signed-off-by: Yonghong Song Applied, thanks!
Re: [PATCH 12/30] net: wireless: cisco: airo: Fix a myriad of coding style issues
Lee Jones writes: > On Fri, 28 Aug 2020, Kalle Valo wrote: > >> Ondrej Zary writes: >> >> > On Thursday 27 August 2020 09:49:12 Kalle Valo wrote: >> >> Ondrej Zary writes: >> >> >> >> > On Monday 17 August 2020 20:27:06 Jesse Brandeburg wrote: >> >> >> On Mon, 17 Aug 2020 16:27:01 +0300 >> >> >> Kalle Valo wrote: >> >> >> >> >> >> > I was surprised to see that someone was using this driver in 2015, so >> >> >> > I'm not sure anymore what to do. Of course we could still just remove >> >> >> > it and later revert if someone steps up and claims the driver is >> >> >> > still >> >> >> > usable. Hmm. Does anyone any users of this driver? >> >> >> >> >> >> What about moving the driver over into staging, which is generally the >> >> >> way I understood to move a driver slowly out of the kernel? >> >> > >> >> > Please don't remove random drivers. >> >> >> >> We don't want to waste time on obsolete drivers and instead prefer to >> >> use our time on more productive tasks. For us wireless maintainers it's >> >> really hard to know if old drivers are still in use or if they are just >> >> broken. >> >> >> >> > I still have the Aironet PCMCIA card and can test the driver. >> >> >> >> Great. Do you know if the airo driver still works with recent kernels? >> > >> > Yes, it does. >> >> Nice, I'm very surprised that so old and unmaintained driver still >> works. Thanks for testing. > > That's awesome. Go Linux! > > So where does this leave us from a Maintainership perspective? Are > you still treating the driver as obsolete? After this revelation, I > suggest not. So let's make it better. :) Yeah, I can take patches to airo now. I already applied this one. -- https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
[PATCH bpf-next] samples/bpf: optimize l2fwd performance in xdpsock
Optimize the throughput performance of the l2fwd sub-app in the xdpsock sample application by removing a duplicate syscall and increasing the size of the fill ring. The latter needs some further explanation. We recommend that you set the fill ring size >= HW RX ring size + AF_XDP RX ring size. Make sure you fill up the fill ring with buffers at regular intervals, and you will with this setting avoid allocation failures in the driver. These are usually quite expensive since drivers have not been written to assume that allocation failures are common. For regular sockets, kernel allocated memory is used that only runs out in OOM situations that should be rare. These two performance optimizations together lead to a 6% percent improvement for the l2fwd app on my machine. Signed-off-by: Magnus Karlsson --- samples/bpf/xdpsock_user.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c index 19c6794..9f54df7 100644 --- a/samples/bpf/xdpsock_user.c +++ b/samples/bpf/xdpsock_user.c @@ -613,7 +613,16 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size) { struct xsk_umem_info *umem; struct xsk_umem_config cfg = { - .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, + /* We recommend that you set the fill ring size >= HW RX ring size + +* AF_XDP RX ring size. Make sure you fill up the fill ring +* with buffers at regular intervals, and you will with this setting +* avoid allocation failures in the driver. These are usually quite +* expensive since drivers have not been written to assume that +* allocation failures are common. For regular sockets, kernel +* allocated memory is used that only runs out in OOM situations +* that should be rare. +*/ + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, .frame_size = opt_xsk_frame_size, .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM, @@ -640,13 +649,13 @@ static void xsk_populate_fill_ring(struct xsk_umem_info *umem) u32 idx; ret = xsk_ring_prod__reserve(&umem->fq, -XSK_RING_PROD__DEFAULT_NUM_DESCS, &idx); - if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS) +XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, &idx); + if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS * 2) exit_with_error(-ret); - for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++) + for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS * 2; i++) *xsk_ring_prod__fill_addr(&umem->fq, idx++) = i * opt_xsk_frame_size; - xsk_ring_prod__submit(&umem->fq, XSK_RING_PROD__DEFAULT_NUM_DESCS); + xsk_ring_prod__submit(&umem->fq, XSK_RING_PROD__DEFAULT_NUM_DESCS * 2); } static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem, @@ -888,9 +897,6 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk, if (!xsk->outstanding_tx) return; - if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx)) - kick_tx(xsk); - ndescs = (xsk->outstanding_tx > opt_batch_size) ? opt_batch_size : xsk->outstanding_tx; -- 2.7.4
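The sizing rule from the commit message can also be applied programmatically. Below is a minimal sketch, not part of the sample, that queries the HW RX ring size via ethtool's ETHTOOL_GRINGPARAM and adds the AF_XDP RX ring size; the result still has to be rounded up to a power of two before being used as the fill ring size, as the kernel requires.

#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* fd is any socket (e.g. AF_INET/SOCK_DGRAM); ifname and rx_ring_size are
 * placeholders supplied by the caller.
 */
static unsigned int pick_fill_size(int fd, const char *ifname,
                                   unsigned int rx_ring_size)
{
        struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
        struct ifreq ifr = {};

        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&ring;
        if (ioctl(fd, SIOCETHTOOL, &ifr))
                return 2 * rx_ring_size;        /* fall back to a safe guess */

        /* round the sum up to a power of two before using it as fill_size */
        return ring.rx_pending + rx_ring_size;
}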
Re: pull-request: mac80211 2020-08-28
From: Johannes Berg Date: Fri, 28 Aug 2020 12:08:04 +0200 > We have a number of fixes for the current release cycle, > one is for a syzbot reported warning (the sanity check) > but most are more wifi protocol related. > > Please pull and let me know if there's any problem. Pulled, thanks Johannes.
Re: pull-request: mac80211-next 2020-08-28
From: Johannes Berg Date: Fri, 28 Aug 2020 12:12:37 +0200 > Here also nothing stands out, though perhaps you'd be > interested in the fact that we now use the new netlink > range length validation for some binary attributes. Awesome! > Please pull and let me know if there's any problem. Pulled, thanks.
[PATCH net-next 1/2] gtp: remove useless rcu_read_lock()
The rtnl lock is taken just the line above, no need to take the rcu also. Fixes: 1788b8569f5d ("gtp: fix use-after-free in gtp_encap_destroy()") Signed-off-by: Nicolas Dichtel --- drivers/net/gtp.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index c84a10569388..6f871ec31393 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -1071,7 +1071,6 @@ static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info) } rtnl_lock(); - rcu_read_lock(); gtp = gtp_find_dev(sock_net(skb->sk), info->attrs); if (!gtp) { @@ -1100,7 +1099,6 @@ static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info) } out_unlock: - rcu_read_unlock(); rtnl_unlock(); return err; } -- 2.26.2
[PATCH net-next 2/2] gtp: relax alloc constraint when adding a pdp
When a PDP context is added, the rtnl lock is held, thus no need to force a GFP_ATOMIC. Signed-off-by: Nicolas Dichtel --- drivers/net/gtp.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 6f871ec31393..2ed1e82a8ad8 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -1036,7 +1036,7 @@ static void pdp_context_delete(struct pdp_ctx *pctx) call_rcu(&pctx->rcu_head, pdp_context_free); } -static int gtp_tunnel_notify(struct pdp_ctx *pctx, u8 cmd); +static int gtp_tunnel_notify(struct pdp_ctx *pctx, u8 cmd, gfp_t allocation); static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info) { @@ -1094,7 +1094,7 @@ static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info) if (IS_ERR(pctx)) { err = PTR_ERR(pctx); } else { - gtp_tunnel_notify(pctx, GTP_CMD_NEWPDP); + gtp_tunnel_notify(pctx, GTP_CMD_NEWPDP, GFP_KERNEL); err = 0; } @@ -1166,7 +1166,7 @@ static int gtp_genl_del_pdp(struct sk_buff *skb, struct genl_info *info) netdev_dbg(pctx->dev, "GTPv1-U: deleting tunnel id = %x/%x (pdp %p)\n", pctx->u.v1.i_tei, pctx->u.v1.o_tei, pctx); - gtp_tunnel_notify(pctx, GTP_CMD_DELPDP); + gtp_tunnel_notify(pctx, GTP_CMD_DELPDP, GFP_ATOMIC); pdp_context_delete(pctx); out_unlock: @@ -1220,12 +1220,12 @@ static int gtp_genl_fill_info(struct sk_buff *skb, u32 snd_portid, u32 snd_seq, return -EMSGSIZE; } -static int gtp_tunnel_notify(struct pdp_ctx *pctx, u8 cmd) +static int gtp_tunnel_notify(struct pdp_ctx *pctx, u8 cmd, gfp_t allocation) { struct sk_buff *msg; int ret; - msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); + msg = nlmsg_new(NLMSG_DEFAULT_SIZE, allocation); if (!msg) return -ENOMEM; -- 2.26.2
[PATCH net-next 0/2] gtp: minor enhancements
The first patch removes a useless RCU lock and the second relaxes allocation constraints when a PDP context is added. drivers/net/gtp.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) Comments are welcome, Nicolas
Re: [PATCH net-next v5 0/3] Add phylib support to smsc95xx
From: Andre Edich Date: Wed, 26 Aug 2020 13:17:14 +0200 > To allow to probe external PHY drivers, this patch series adds use of > phylib to the smsc95xx driver. ... Series applied, thank you.
Re: [PATCH] netlink: fix a data race in netlink_rcv_wake()
From: zhudi Date: Wed, 26 Aug 2020 20:01:13 +0800 > The data races were reported by KCSAN: > BUG: KCSAN: data-race in netlink_recvmsg / skb_queue_tail ... > Since the write is under sk_receive_queue->lock but the read > is done as lockless. so fix it by using skb_queue_empty_lockless() > instead of skb_queue_empty() for the read in netlink_rcv_wake() > > Signed-off-by: zhudi Applied, thank you.
Re: [PATCHv3 next] net: add option to not create fall-back tunnels in root-ns as well
From: Mahesh Bandewar Date: Wed, 26 Aug 2020 09:05:35 -0700 > The sysctl that was added earlier by commit 79134e6ce2c ("net: do > not create fallback tunnels for non-default namespaces") to create > fall-back only in root-ns. This patch enhances that behavior to provide > option not to create fallback tunnels in root-ns as well. Since modules > that create fallback tunnels could be built-in and setting the sysctl > value after booting is pointless, so added a kernel cmdline options to > change this default. The default setting is preseved for backward > compatibility. The kernel command line option of fb_tunnels=initns will > set the sysctl value to 1 and will create fallback tunnels only in initns > while kernel cmdline fb_tunnels=none will set the sysctl value to 2 and > fallback tunnels are skipped in every netns. > > Signed-off-by: Mahesh Bandewar Applied to net-next, thank you.
Re: [PATCH] net: dsa: mt7530: fix advertising unsupported
From: Landen Chao Date: Thu, 27 Aug 2020 17:15:47 +0800 > 1000baseT_Half > > Remove 1000baseT_Half to advertise correct hardware capability in > phylink_validate() callback function. > > Fixes: 38f790a80560 ("net: dsa: mt7530: Add support for port 5") > Signed-off-by: Landen Chao Applied and queued up for -stable, with Subject line spillage fixed, thank you.
[PATCH v2] netlabel: remove unused param from audit_log_format()
Commit d3b990b7f327 ("netlabel: fix problems with mapping removal") added a check to return an error if ret_val != 0, before ret_val is later used in a log message. Now it will unconditionally print "... res=1". So just drop the check. Addresses-Coverity: ("Dead code") Fixes: d3b990b7f327 ("netlabel: fix problems with mapping removal") Signed-off-by: Alex Dewar --- v2: Still print the res field, because it's useful (Paul) net/netlabel/netlabel_domainhash.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/netlabel/netlabel_domainhash.c b/net/netlabel/netlabel_domainhash.c index f73a8382c275e..dc8c39f51f7d3 100644 --- a/net/netlabel/netlabel_domainhash.c +++ b/net/netlabel/netlabel_domainhash.c @@ -612,9 +612,8 @@ int netlbl_domhsh_remove_entry(struct netlbl_dom_map *entry, audit_buf = netlbl_audit_start_common(AUDIT_MAC_MAP_DEL, audit_info); if (audit_buf != NULL) { audit_log_format(audit_buf, -" nlbl_domain=%s res=%u", -entry->domain ? entry->domain : "(default)", -ret_val == 0 ? 1 : 0); +" nlbl_domain=%s res=1", +entry->domain ? entry->domain : "(default)"); audit_log_end(audit_buf); } -- 2.28.0
Re: packet deadline and process scheduling
On Fri, Aug 28, 2020 at 10:51 AM S.V.R.Anand wrote: > > There is an active Internet draft "Packet Delivery Deadline time in > 6LoWPAN Routing Header" > (https://datatracker.ietf.org/doc/draft-ietf-6lo-deadline-time/) which > is presently in the RFC Editor queue and is expected to become an RFC in > the near future. I happened to be one of the co-authors of this draft. > The main objective of the draft is to support time sensitive industrial > applications such as Industrial process control and automation over IP > networks. While the current draft caters to 6LoWPAN networks, I would > assume that it can be extended to carry deadline information in other > encapsulations including IPv6. > > Once the packet reaches the destination at the network stack in the > kernel, it has to be passed on to the receiver application within the > deadline carried in the packet because it is the receiver application > running in user space is the eventual consumer of the data. My mail below is > for > ensuring passing on the packet sitting in the socket interface to the > user receiver application process in a timely fashion with the help of > OS scheduler. Since the incoming packet experieces variable delay, the > remaining time left before deadline approaches too varies. There should > be a mechanism within the kernel, where network stack needs to > communicate with the OS scheduler by letting the scheduler know the > deadline before user application socket recv call is expected to return. > > Anand > > > On 20-08-28 10:14:13, Eric Dumazet wrote: > > > > > > On 8/27/20 11:45 PM, S.V.R.Anand wrote: > > > Hi, > > > > > > In the control loop application I am trying to build, an incoming message > > > from > > > the network will have a deadline before which it should be delivered to > > > the > > > receiver process. This essentially calls for a way of scheduling this > > > process > > > based on the deadline information contained in the message. > > > > > > If not already available, I wish to write code for such run-time > > > ordering of > > > processes in the earlist deadline first fashion. The assumption, however > > > futuristic it may be, is that deadline information is contained as part > > > of the > > > packet header something like an inband-OAM. > > > > > > Your feedback on the above will be very helpful. > > > > > > Hope the above objective will be of general interest to netdev as well. > > > > > > My apologies if this is not the appropriate mailing list for posting this > > > kind > > > of mails. > > > > > > Anand > > > > > > > Is this described in some RFC ? > > > > If not, I guess you might have to code this in user space. Could ingress redirect to an IFB device with FQ scheduler work for ingress EDT? With a BPF program at ifb device egress hook to read the header and write skb->tstamp.
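For what it's worth, a minimal sketch of that last suggestion (a clsact program on the ifb egress writing the earliest departure time so FQ paces by it), assuming a purely hypothetical deadline field at a fixed offset; the struct, the offset and the field name are illustrative only and not taken from the draft.

	#include <linux/bpf.h>
	#include <linux/if_ether.h>
	#include <linux/pkt_cls.h>
	#include <bpf/bpf_helpers.h>

	/* Hypothetical on-wire deadline field; the real placement would come
	 * from the 6lo routing header described in the draft.
	 */
	struct deadline_hdr {
		__u64 deadline_ns;	/* assumed: absolute deadline in nanoseconds */
	};

	SEC("classifier")
	int set_edt_from_deadline(struct __sk_buff *skb)
	{
		struct deadline_hdr dl;

		/* assumed fixed offset right after a plain IPv6 header */
		if (bpf_skb_load_bytes(skb, ETH_HLEN + 40, &dl, sizeof(dl)) < 0)
			return TC_ACT_OK;

		/* FQ uses skb->tstamp as the earliest departure time */
		skb->tstamp = dl.deadline_ns;
		return TC_ACT_OK;
	}

	char _license[] SEC("license") = "GPL";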
Re: [PATCH net-next v3 0/2] Enable Fiber on DP83822 PHY
From: Dan Murphy Date: Thu, 27 Aug 2020 08:45:07 -0500 > The DP83822 Ethernet PHY has the ability to connect via a Fiber port. The > derivative PHYs DP83825 and DP83826 do not have this ability. In fiber mode > the DP83822 disables auto negotiation and has a fixed 100Mbps speed with > support for full or half duplex modes. > > A devicetree binding was added to set the signal polarity for the fiber > connection. This property is only applicable if the FX_EN strap is set in > hardware other wise the signal loss detection is disabled on the PHY. > > If the FX_EN is not strapped the device can be configured to run in fiber mode > via the device tree. All be it the PHY will not perform signal loss detection. > > v2 review from a long time ago can be found here - > https://lore.kernel.org/patchwork/patch/1270958/ Series applied, thank you.
Re: [PATCHi v2] net: mdiobus: fix device unregistering in mdiobus_register
On Thu, Aug 27, 2020 at 10:48:48AM +0200, Heiner Kallweit wrote: > On 27.08.2020 09:06, Sascha Hauer wrote: > > After device_register has been called the device structure may not be > > freed anymore, put_device() has to be called instead. This gets violated > > when device_register() or any of the following steps before the mdio > > bus is fully registered fails. In this case the caller will call > > mdiobus_free() which then directly frees the mdio bus structure. > > > > Set bus->state to MDIOBUS_UNREGISTERED right before calling > > device_register(). With this mdiobus_free() calls put_device() instead > > as it ought to be. > > > > Signed-off-by: Sascha Hauer > > --- > > > > Changes since v1: > > - set bus->state before calling device_register(), not afterwards > > > > drivers/net/phy/mdio_bus.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c > > index 0af20faad69d..9434b04a11c8 100644 > > --- a/drivers/net/phy/mdio_bus.c > > +++ b/drivers/net/phy/mdio_bus.c > > @@ -534,6 +534,8 @@ int __mdiobus_register(struct mii_bus *bus, struct > > module *owner) > > bus->dev.groups = NULL; > > dev_set_name(&bus->dev, "%s", bus->id); > > > > + bus->state = MDIOBUS_UNREGISTERED; > > + > > err = device_register(&bus->dev); > > if (err) { > > pr_err("mii_bus %s failed to register\n", bus->id); > > > LGTM. Just two points: > 1. Subject has a typo (PATCHi). And it should be [PATCH net v2], because it's >something for the stable branch. > 2. A "Fixes" tag is needed. Uh, AFAICT this fixes a patch from 2008, this makes for quite some stable updates :) Sascha | commit 161c8d2f50109b44b664eaf23831ea1587979a61 | Author: Krzysztof Halasa | Date: Thu Dec 25 16:50:41 2008 -0800 | | net: PHYLIB mdio fixes #2 -- Pengutronix e.K. | | Steuerwalder Str. 21 | http://www.pengutronix.de/ | 31137 Hildesheim, Germany | Phone: +49-5121-206917-0| Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917- |
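For context, the reason setting the state earlier changes the cleanup path can be seen in mdiobus_free(), which (roughly, as of this thread) only takes the plain kfree() path for a bus that never reached device_register(); for anything else it drops the reference instead:

	void mdiobus_free(struct mii_bus *bus)
	{
		/* For compatibility with error handling in drivers. */
		if (bus->state == MDIOBUS_ALLOCATED) {
			kfree(bus);
			return;
		}

		BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
		bus->state = MDIOBUS_RELEASED;

		put_device(&bus->dev);
	}

So with bus->state set to MDIOBUS_UNREGISTERED before device_register(), a failure inside registration makes mdiobus_free() call put_device() and let the release callback free the structure, as required once device_register() has been called.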
Re: [Patch net] net_sched: fix error path in red_init()
From: Cong Wang Date: Thu, 27 Aug 2020 10:40:41 -0700 > When ->init() fails, ->destroy() is called to clean up. > So it is unnecessary to clean up in red_init(), and it > would cause some refcount underflow. > > Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and > "mark"") > Reported-and-tested-by: syzbot+b33c1cb0a30ebdc8a...@syzkaller.appspotmail.com > Reported-and-tested-by: syzbot+e5ea5f8a3ecfd4427...@syzkaller.appspotmail.com > Cc: Petr Machata > Signed-off-by: Cong Wang Applied, thank you.
RE: [PATCH net-next] Add Mellanox BlueField Gigabit Ethernet driver
> > + The second generation BlueField SoC from Mellanox Technologies > > + supports an out-of-band Gigabit Ethernet management port to the > > + Arm subsystem. > > You might want to additionally select the PHY driver you are using. > It is preferable to not set a specific PHY driver here because it is susceptible to change. And even customers might use a different PHY device.
RE: [PATCH net-next] Add Mellanox BlueField Gigabit Ethernet driver
> > +static int mlxbf_gige_get_link_ksettings(struct net_device *netdev, > > +struct ethtool_link_ksettings > *link_ksettings) { > > + struct phy_device *phydev = netdev->phydev; > > + u32 supported, advertising; > phy_ethtool_ksettings_get() and maybe phy_ethtool_ksettings_set(). Sounds good for phy_ethtool_ksettings_get. However, there is no use for phy_ethtool_ksettings_set because our HW only supports 1G full duplex speed. (and consequently aneg is always supported). Thanks. Asmaa
Re: [PATCH net-next] Add Mellanox BlueField Gigabit Ethernet driver
On Fri, Aug 28, 2020 at 02:24:28PM +, Asmaa Mnebhi wrote: > > > + The second generation BlueField SoC from Mellanox Technologies > > > + supports an out-of-band Gigabit Ethernet management port to the > > > + Arm subsystem. > > > > You might want to additionally select the PHY driver you are using. > > > It is preferable to not set a specific PHY driver here because it is > susceptible to change. And even customers might use a different PHY > device. O.K. Not a problem. Andrew
Re: [PATCH v2] netlabel: remove unused param from audit_log_format()
On Fri, Aug 28, 2020 at 9:56 AM Alex Dewar wrote: > > Commit d3b990b7f327 ("netlabel: fix problems with mapping removal") > added a check to return an error if ret_val != 0, before ret_val is > later used in a log message. Now it will unconditionally print "... > res=1". So just drop the check. > > Addresses-Coverity: ("Dead code") > Fixes: d3b990b7f327 ("netlabel: fix problems with mapping removal") > Signed-off-by: Alex Dewar > --- > v2: Still print the res field, because it's useful (Paul) > > net/netlabel/netlabel_domainhash.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) Thanks Alex. Acked-by: Paul Moore > diff --git a/net/netlabel/netlabel_domainhash.c > b/net/netlabel/netlabel_domainhash.c > index f73a8382c275e..dc8c39f51f7d3 100644 > --- a/net/netlabel/netlabel_domainhash.c > +++ b/net/netlabel/netlabel_domainhash.c > @@ -612,9 +612,8 @@ int netlbl_domhsh_remove_entry(struct netlbl_dom_map > *entry, > audit_buf = netlbl_audit_start_common(AUDIT_MAC_MAP_DEL, audit_info); > if (audit_buf != NULL) { > audit_log_format(audit_buf, > -" nlbl_domain=%s res=%u", > -entry->domain ? entry->domain : "(default)", > -ret_val == 0 ? 1 : 0); > +" nlbl_domain=%s res=1", > +entry->domain ? entry->domain : "(default)"); > audit_log_end(audit_buf); > } > > -- > 2.28.0 > -- paul moore www.paul-moore.com
[PATCH iproute2-next] ip xfrm: support printing XFRMA_SET_MARK_MASK attribute in states
The XFRMA_SET_MARK_MASK attribute is set in states (4.19+). It is the mask of XFRMA_SET_MARK(a.k.a. XFRMA_OUTPUT_MARK in 4.18) sample output: note the output-mark mask ip xfrm state src 192.1.2.23 dst 192.1.3.33 proto esp spi 0xSPISPI reqid REQID mode tunnel replay-window 32 flag af-unspec output-mark 0x3/0xff aead rfc4106(gcm(aes)) 0xENCAUTHKEY 128 if_id 0x1 Signed-off-by: Antony Antony --- ip/ipxfrm.c | 4 1 file changed, 4 insertions(+) diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c index cac8ba25..e4a72bd0 100644 --- a/ip/ipxfrm.c +++ b/ip/ipxfrm.c @@ -649,6 +649,10 @@ static void xfrm_output_mark_print(struct rtattr *tb[], FILE *fp) __u32 output_mark = rta_getattr_u32(tb[XFRMA_OUTPUT_MARK]); fprintf(fp, "output-mark 0x%x", output_mark); + if (tb[XFRMA_SET_MARK_MASK]) { + __u32 mask = rta_getattr_u32(tb[XFRMA_SET_MARK_MASK]); + fprintf(fp, "/0x%x", mask); + } } int xfrm_parse_mark(struct xfrm_mark *mark, int *argcp, char ***argvp) -- 2.21.3
Re: [PATCH v3 net-next 00/12] ionic memory usage rework
From: Shannon Nelson Date: Thu, 27 Aug 2020 16:00:18 -0700 > Previous review comments have suggested [1],[2] that this driver > needs to rework how queue resources are managed and reconfigured > so that we don't do a full driver reset and to better handle > potential allocation failures. This patchset is intended to > address those comments. > > The first few patches clean some general issues and > simplify some of the memory structures. The last 4 patches > specifically address queue parameter changes without a full > ionic_stop()/ionic_open(). > > [1] > https://lore.kernel.org/netdev/20200706103305.182bd...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ > [2] > https://lore.kernel.org/netdev/20200724.194417.2151242753657227232.da...@davemloft.net/ ... Series applied, thanks for doing this work as this is an area where many drivers have poor behavior.
Re: [PATCH] Remove ipvs v6 dependency on iptables
Le 28/08/2020 à 00:07, Lach a écrit : > This dependency was added in 63dca2c0b0e7a92cb39d1b1ecefa32ffda201975, > because this commit had dependency on > ipv6_find_hdr, which was located in iptables-specific code > > But it is no longer required, because > f8f626754ebeca613cf1af2e6f890cfde0e74d5b moved them to a more common location > --- Your 'Signed-off-by' is missing, the commit log lines are too long, a commit should not be referenced like this. Please run checkpatch on your submissions. Regards, Nicolas
Re: linux-next: build failure after merge of the net-next tree
On 2020-08-27 11:12 -0700, Brian Vazquez wrote: > I've been trying to reproduce it with your config but I didn't > succeed. I also looked at the file after the preprocessor and it > looked good: > > ret = ({ __builtin_expect(!!(ops->match == fib6_rule_match), 1) ? > fib6_rule_match(rule, fl, flags) : ops->match(rule, fl, flags); }) However, in my configuration I have CONFIG_IPV6=m, and so fib6_rule_match is not available as a builtin. I think that's why ld is complaining about the undefined reference. Changing the configuration to CONFIG_IPV6=y helps, FWIW. > Note that fib4_rule_match doesn't appear as the > CONFIG_IP_MULTIPLE_TABLES is not there. > > Could you share more details on how you're compiling it and what > compiler you're using?? Tried with both gcc 9 and gcc 10 under Debian unstable, binutils 2.35. I usually use "make bindebpkg", but just running "make" is sufficient to reproduce the problem, as it happens when linking vmlinux. Cheers, Sven > On Mon, Aug 24, 2020 at 1:08 AM Sven Joachim wrote: >> >> On 2020-08-22 08:16 +0200, Sven Joachim wrote: >> >> > On 2020-08-21 09:23 -0700, Brian Vazquez wrote: >> > >> >> Hi Sven, >> >> >> >> Sorry for the late reply, did you still see this after: >> >> https://patchwork.ozlabs.org/project/netdev/patch/20200803131948.41736-1-yuehaib...@huawei.com/ >> >> ?? >> > >> > That patch is apparently already in 5.9-rc1 as commit 80fbbb1672e7, so >> > yes I'm still seeing it. >> >> Still present in 5.9-rc2 as of today, I have attached my .config for >> reference. Note that I have CONFIG_IPV6_MULTIPLE_TABLES=y, but >> CONFIG_IP_MULTIPLE_TABLES is not mentioned at all there. >> >> To build the kernel, I have now deselected IPV6_MULTIPLE_TABLES. Not >> sure why this was enabled in my .config which has grown organically over >> many years. >> >> Cheers, >>Sven >> >> >> >> On Mon, Aug 17, 2020 at 12:21 AM Sven Joachim wrote: >> >> >> >>> On 2020-07-29 21:27 +1000, Stephen Rothwell wrote: >> >>> >> >>> > Hi all, >> >>> > >> >>> > After merging the net-next tree, today's linux-next build (i386 >> >>> defconfig) >> >>> > failed like this: >> >>> > >> >>> > x86_64-linux-gnu-ld: net/core/fib_rules.o: in function >> >>> `fib_rules_lookup': >> >>> > fib_rules.c:(.text+0x5c6): undefined reference to `fib6_rule_match' >> >>> > x86_64-linux-gnu-ld: fib_rules.c:(.text+0x5d8): undefined reference to >> >>> `fib6_rule_match' >> >>> > x86_64-linux-gnu-ld: fib_rules.c:(.text+0x64d): undefined reference to >> >>> `fib6_rule_action' >> >>> > x86_64-linux-gnu-ld: fib_rules.c:(.text+0x662): undefined reference to >> >>> `fib6_rule_action' >> >>> > x86_64-linux-gnu-ld: fib_rules.c:(.text+0x67a): undefined reference to >> >>> `fib6_rule_suppress' >> >>> > x86_64-linux-gnu-ld: fib_rules.c:(.text+0x68d): undefined reference to >> >>> `fib6_rule_suppress' >> >>> >> >>> FWIW, I saw these errors in 5.9-rc1 today, so the fix in commit >> >>> 41d707b7332f ("fib: fix fib_rules_ops indirect calls wrappers") was >> >>> apparently not sufficient. 
>> >>> >> >>> , >> >>> | $ grep IPV6 .config >> >>> | CONFIG_IPV6=m >> >>> | # CONFIG_IPV6_ROUTER_PREF is not set >> >>> | # CONFIG_IPV6_OPTIMISTIC_DAD is not set >> >>> | # CONFIG_IPV6_MIP6 is not set >> >>> | # CONFIG_IPV6_ILA is not set >> >>> | # CONFIG_IPV6_VTI is not set >> >>> | CONFIG_IPV6_SIT=m >> >>> | # CONFIG_IPV6_SIT_6RD is not set >> >>> | CONFIG_IPV6_NDISC_NODETYPE=y >> >>> | CONFIG_IPV6_TUNNEL=m >> >>> | CONFIG_IPV6_MULTIPLE_TABLES=y >> >>> | # CONFIG_IPV6_SUBTREES is not set >> >>> | # CONFIG_IPV6_MROUTE is not set >> >>> | # CONFIG_IPV6_SEG6_LWTUNNEL is not set >> >>> | # CONFIG_IPV6_SEG6_HMAC is not set >> >>> | # CONFIG_IPV6_RPL_LWTUNNEL is not set >> >>> | # CONFIG_NF_SOCKET_IPV6 is not set >> >>> | # CONFIG_NF_TPROXY_IPV6 is not set >> >>> | # CONFIG_NF_DUP_IPV6 is not set >> >>> | # CONFIG_NF_REJECT_IPV6 is not set >> >>> | # CONFIG_NF_LOG_IPV6 is not set >> >>> | CONFIG_NF_DEFRAG_IPV6=m >> >>> ` >> >>> >> >>> > Caused by commit >> >>> > >> >>> > b9aaec8f0be5 ("fib: use indirect call wrappers in the most common >> >>> fib_rules_ops") >> >>> > >> >>> > # CONFIG_IPV6_MULTIPLE_TABLES is not set >> >>> > >> >>> > I have reverted that commit for today. >> >>> >> >>> Cheers, >> >>>Sven >> >>>
Re: linux-next: build failure after merge of the net-next tree
On 8/28/20 8:09 AM, Sven Joachim wrote: > On 2020-08-27 11:12 -0700, Brian Vazquez wrote: > >> I've been trying to reproduce it with your config but I didn't >> succeed. I also looked at the file after the preprocessor and it >> looked good: >> >> ret = ({ __builtin_expect(!!(ops->match == fib6_rule_match), 1) ? >> fib6_rule_match(rule, fl, flags) : ops->match(rule, fl, flags); }) > > However, in my configuration I have CONFIG_IPV6=m, and so > fib6_rule_match is not available as a builtin. I think that's why ld is > complaining about the undefined reference. Same here FWIW. CONFIG_IPV6=m. > Changing the configuration to CONFIG_IPV6=y helps, FWIW. > >> Note that fib4_rule_match doesn't appear as the >> CONFIG_IP_MULTIPLE_TABLES is not there. >> >> Could you share more details on how you're compiling it and what >> compiler you're using?? > > Tried with both gcc 9 and gcc 10 under Debian unstable, binutils 2.35. > I usually use "make bindebpkg", but just running "make" is sufficient to > reproduce the problem, as it happens when linking vmlinux. > > Cheers, >Sven > > >> On Mon, Aug 24, 2020 at 1:08 AM Sven Joachim wrote: >>> >>> On 2020-08-22 08:16 +0200, Sven Joachim wrote: >>> On 2020-08-21 09:23 -0700, Brian Vazquez wrote: > Hi Sven, > > Sorry for the late reply, did you still see this after: > https://patchwork.ozlabs.org/project/netdev/patch/20200803131948.41736-1-yuehaib...@huawei.com/ > ?? That patch is apparently already in 5.9-rc1 as commit 80fbbb1672e7, so yes I'm still seeing it. >>> >>> Still present in 5.9-rc2 as of today, I have attached my .config for >>> reference. Note that I have CONFIG_IPV6_MULTIPLE_TABLES=y, but >>> CONFIG_IP_MULTIPLE_TABLES is not mentioned at all there. >>> >>> To build the kernel, I have now deselected IPV6_MULTIPLE_TABLES. Not >>> sure why this was enabled in my .config which has grown organically over >>> many years. >>> >>> Cheers, >>>Sven >>> >>> > On Mon, Aug 17, 2020 at 12:21 AM Sven Joachim wrote: > >> On 2020-07-29 21:27 +1000, Stephen Rothwell wrote: >> >>> Hi all, >>> >>> After merging the net-next tree, today's linux-next build (i386 >> defconfig) >>> failed like this: >>> >>> x86_64-linux-gnu-ld: net/core/fib_rules.o: in function >> `fib_rules_lookup': >>> fib_rules.c:(.text+0x5c6): undefined reference to `fib6_rule_match' >>> x86_64-linux-gnu-ld: fib_rules.c:(.text+0x5d8): undefined reference to >> `fib6_rule_match' >>> x86_64-linux-gnu-ld: fib_rules.c:(.text+0x64d): undefined reference to >> `fib6_rule_action' >>> x86_64-linux-gnu-ld: fib_rules.c:(.text+0x662): undefined reference to >> `fib6_rule_action' >>> x86_64-linux-gnu-ld: fib_rules.c:(.text+0x67a): undefined reference to >> `fib6_rule_suppress' >>> x86_64-linux-gnu-ld: fib_rules.c:(.text+0x68d): undefined reference to >> `fib6_rule_suppress' >> >> FWIW, I saw these errors in 5.9-rc1 today, so the fix in commit >> 41d707b7332f ("fib: fix fib_rules_ops indirect calls wrappers") was >> apparently not sufficient. 
>> >> , >> | $ grep IPV6 .config >> | CONFIG_IPV6=m >> | # CONFIG_IPV6_ROUTER_PREF is not set >> | # CONFIG_IPV6_OPTIMISTIC_DAD is not set >> | # CONFIG_IPV6_MIP6 is not set >> | # CONFIG_IPV6_ILA is not set >> | # CONFIG_IPV6_VTI is not set >> | CONFIG_IPV6_SIT=m >> | # CONFIG_IPV6_SIT_6RD is not set >> | CONFIG_IPV6_NDISC_NODETYPE=y >> | CONFIG_IPV6_TUNNEL=m >> | CONFIG_IPV6_MULTIPLE_TABLES=y >> | # CONFIG_IPV6_SUBTREES is not set >> | # CONFIG_IPV6_MROUTE is not set >> | # CONFIG_IPV6_SEG6_LWTUNNEL is not set >> | # CONFIG_IPV6_SEG6_HMAC is not set >> | # CONFIG_IPV6_RPL_LWTUNNEL is not set >> | # CONFIG_NF_SOCKET_IPV6 is not set >> | # CONFIG_NF_TPROXY_IPV6 is not set >> | # CONFIG_NF_DUP_IPV6 is not set >> | # CONFIG_NF_REJECT_IPV6 is not set >> | # CONFIG_NF_LOG_IPV6 is not set >> | CONFIG_NF_DEFRAG_IPV6=m >> ` >> >>> Caused by commit >>> >>> b9aaec8f0be5 ("fib: use indirect call wrappers in the most common >> fib_rules_ops") >>> >>> # CONFIG_IPV6_MULTIPLE_TABLES is not set >>> >>> I have reverted that commit for today. >> >> Cheers, >>Sven -- ~Randy
[PATCH net-next v2 0/2] Add ip6_fragment in ipv6_stub
From: wenxu Add ip6_fragment to ipv6_stub and use it in openvswitch. This version adds the default function eafnosupport_ipv6_fragment. wenxu (2): ipv6: add ipv6_fragment hook in ipv6_stub openvswitch: using ip6_fragment in ipv6_stub include/net/ipv6_stubs.h | 3 +++ net/ipv6/addrconf_core.c | 8 net/ipv6/af_inet6.c | 1 + net/openvswitch/actions.c | 7 +-- 4 files changed, 13 insertions(+), 6 deletions(-) -- 1.8.3.1
[PATCH net-next v2 1/2] ipv6: add ipv6_fragment hook in ipv6_stub
From: wenxu Add ipv6_fragment to ipv6_stub to avoid calling netfilter when access ip6_fragment. Signed-off-by: wenxu --- v2: add default one eafnosupport_ipv6_fragment include/net/ipv6_stubs.h | 3 +++ net/ipv6/addrconf_core.c | 8 net/ipv6/af_inet6.c | 1 + 3 files changed, 12 insertions(+) diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h index d7a7f7c..8fce558 100644 --- a/include/net/ipv6_stubs.h +++ b/include/net/ipv6_stubs.h @@ -63,6 +63,9 @@ struct ipv6_stub { int encap_type); #endif struct neigh_table *nd_tbl; + + int (*ipv6_fragment)(struct net *net, struct sock *sk, struct sk_buff *skb, +int (*output)(struct net *, struct sock *, struct sk_buff *)); }; extern const struct ipv6_stub *ipv6_stub __read_mostly; diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c index 9ebf3fe..c70c192 100644 --- a/net/ipv6/addrconf_core.c +++ b/net/ipv6/addrconf_core.c @@ -191,6 +191,13 @@ static int eafnosupport_ip6_del_rt(struct net *net, struct fib6_info *rt, return -EAFNOSUPPORT; } +static int eafnosupport_ipv6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, + int (*output)(struct net *, struct sock *, struct sk_buff *)) +{ + kfree_skb(skb); + return -EAFNOSUPPORT; +} + const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) { .ipv6_dst_lookup_flow = eafnosupport_ipv6_dst_lookup_flow, .ipv6_route_input = eafnosupport_ipv6_route_input, @@ -201,6 +208,7 @@ static int eafnosupport_ip6_del_rt(struct net *net, struct fib6_info *rt, .ip6_mtu_from_fib6 = eafnosupport_ip6_mtu_from_fib6, .fib6_nh_init = eafnosupport_fib6_nh_init, .ip6_del_rt= eafnosupport_ip6_del_rt, + .ipv6_fragment = eafnosupport_ipv6_fragment, }; EXPORT_SYMBOL_GPL(ipv6_stub); diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index d9a1493..e648fbe 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -1027,6 +1027,7 @@ static int ipv6_route_input(struct sk_buff *skb) .xfrm6_rcv_encap = xfrm6_rcv_encap, #endif .nd_tbl = &nd_tbl, + .ipv6_fragment = ip6_fragment, }; static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = { -- 1.8.3.1
[PATCH net-next v2 2/2] openvswitch: using ip6_fragment in ipv6_stub
From: wenxu Using ipv6_stub->ipv6_fragment to avoid the netfilter dependency Signed-off-by: wenxu --- net/openvswitch/actions.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 2611657..fd34089 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include #include @@ -848,13 +847,9 @@ static void ovs_fragment(struct net *net, struct vport *vport, ip_do_fragment(net, skb->sk, skb, ovs_vport_output); refdst_drop(orig_dst); } else if (key->eth.type == htons(ETH_P_IPV6)) { - const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops(); unsigned long orig_dst; struct rt6_info ovs_rt; - if (!v6ops) - goto err; - prepare_frag(vport, skb, orig_network_offset, ovs_key_mac_proto(key)); memset(&ovs_rt, 0, sizeof(ovs_rt)); @@ -866,7 +861,7 @@ static void ovs_fragment(struct net *net, struct vport *vport, skb_dst_set_noref(skb, &ovs_rt.dst); IP6CB(skb)->frag_max_size = mru; - v6ops->fragment(net, skb->sk, skb, ovs_vport_output); + ipv6_stub->ipv6_fragment(net, skb->sk, skb, ovs_vport_output); refdst_drop(orig_dst); } else { WARN_ONCE(1, "Failed fragment ->%s: eth=%04x, MRU=%d, MTU=%d.", -- 1.8.3.1
[PATCH bpf-next] 0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch
--- ...to-xdpsock-to-avoid-recycling-frames.patch | 62 +++ 1 file changed, 62 insertions(+) create mode 100644 0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch diff --git a/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch b/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch new file mode 100644 index ..ae3b99b335e2 --- /dev/null +++ b/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch @@ -0,0 +1,62 @@ +From df0a23a79c9dca96c0059b4d766a613eba57200e Mon Sep 17 00:00:00 2001 +From: Weqaar Janjua +Date: Fri, 28 Aug 2020 13:36:32 +0100 +Subject: [PATCH bpf-next] samples/bpf: fix to xdpsock to avoid recycling + frames +To: magnus.karls...@intel.com + +The txpush program in the xdpsock sample application is supposed +to send out all packets in the umem in a round-robin fashion. +The problem is that it only cycled through the first BATCH_SIZE +worth of packets. Fixed this so that it cycles through all buffers +in the umem as intended. + +Fixes: 248c7f9c0e21 ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access") +Signed-off-by: Weqaar Janjua +--- + samples/bpf/xdpsock_user.c | 10 +- + 1 file changed, 5 insertions(+), 5 deletions(-) + +diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c +index 19c679456a0e..c821e9867139 100644 +--- a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c +@@ -1004,7 +1004,7 @@ static void rx_drop_all(void) + } + } + +-static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) ++static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int batch_size) + { + u32 idx; + unsigned int i; +@@ -1017,14 +1017,14 @@ static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) + for (i = 0; i < batch_size; i++) { + struct xdp_desc *tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, + idx + i); +- tx_desc->addr = (frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; ++ tx_desc->addr = (*frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; + tx_desc->len = PKT_SIZE; + } + + xsk_ring_prod__submit(&xsk->tx, batch_size); + xsk->outstanding_tx += batch_size; +- frame_nb += batch_size; +- frame_nb %= NUM_FRAMES; ++ *frame_nb += batch_size; ++ *frame_nb %= NUM_FRAMES; + complete_tx_only(xsk, batch_size); + } + +@@ -1080,7 +1080,7 @@ static void tx_only_all(void) + } + + for (i = 0; i < num_socks; i++) +- tx_only(xsks[i], frame_nb[i], batch_size); ++ tx_only(xsks[i], &frame_nb[i], batch_size); + + pkt_cnt += batch_size; + +-- +2.20.1 + -- 2.20.1 -- Intel Research and Development Ireland Limited Registered in Ireland Registered Office: Collinstown Industrial Park, Leixlip, County Kildare Registered Number: 308263 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Re: [PATCH v3 bpf-next 0/3] bpf: Relax the max_entries check for inner map
On 8/28/20 3:18 AM, Martin KaFai Lau wrote: v3: - Add map_meta_equal to bpf_map_ops and use it as an explict opt-in support for map-in-map v2: - New BPF_MAP_TYPE_FL to minimize code churns (Alexei) - s/capabilities/properties/ (Andrii) - Describe WHY in commit log (Andrii) People has a use case that starts with a smaller inner map first and then replaces it with a larger inner map later when it is needed. This series allows the outer map to be updated with inner map in different size as long as it is safe (meaning the max_entries is not used in the verification time during prog load). Please see individual patch for details. Martin KaFai Lau (3): bpf: Add map_meta_equal map ops bpf: Relax max_entries check for most of the inner map types bpf: selftests: Add test for different inner map size include/linux/bpf.h | 16 + kernel/bpf/arraymap.c | 16 + kernel/bpf/bpf_inode_storage.c| 1 + kernel/bpf/cpumap.c | 1 + kernel/bpf/devmap.c | 2 ++ kernel/bpf/hashtab.c | 4 +++ kernel/bpf/lpm_trie.c | 1 + kernel/bpf/map_in_map.c | 24 + kernel/bpf/map_in_map.h | 2 -- kernel/bpf/queue_stack_maps.c | 2 ++ kernel/bpf/reuseport_array.c | 1 + kernel/bpf/ringbuf.c | 1 + kernel/bpf/stackmap.c | 1 + kernel/bpf/syscall.c | 1 + net/core/bpf_sk_storage.c | 1 + net/core/sock_map.c | 2 ++ net/xdp/xskmap.c | 8 + .../selftests/bpf/prog_tests/btf_map_in_map.c | 35 ++- .../selftests/bpf/progs/test_btf_map_in_map.c | 31 19 files changed, 132 insertions(+), 18 deletions(-) Looks good to me, applied thanks!
[PATCH net] cxgb4: fix thermal zone device registration
When multiple adapters are present in the system, pci hot-removing second adapter leads to the following warning as both the adapters registered thermal zone device with same thermal zone name/type. Therefore, use unique thermal zone name during thermal zone device initialization. Also mark thermal zone dev NULL once unregistered. [ 414.370143] [ cut here ] [ 414.370944] sysfs group 'power' not found for kobject 'hwmon0' [ 414.371747] WARNING: CPU: 9 PID: 2661 at fs/sysfs/group.c:281 sysfs_remove_group+0x76/0x80 [ 414.382550] CPU: 9 PID: 2661 Comm: bash Not tainted 5.8.0-rc6+ #33 [ 414.383593] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0a 06/23/2016 [ 414.384669] RIP: 0010:sysfs_remove_group+0x76/0x80 [ 414.385738] Code: 48 89 df 5b 5d 41 5c e9 d8 b5 ff ff 48 89 df e8 60 b0 ff ff eb cb 49 8b 14 24 48 8b 75 00 48 c7 c7 90 ae 13 bb e8 6a 27 d0 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 00 0f 1f 44 00 00 48 85 f6 74 31 41 54 [ 414.388404] RSP: 0018:a22bc080fcb0 EFLAGS: 00010286 [ 414.389638] RAX: RBX: RCX: [ 414.390829] RDX: 0001 RSI: 8ee2de3e9510 RDI: 8ee2de3e9510 [ 414.392064] RBP: baef2ee0 R08: R09: [ 414.393224] R10: R11: 2b30006c R12: 8ee260720008 [ 414.394388] R13: 8ee25e0a40e8 R14: a22bc080ff08 R15: 8ee2c3be5020 [ 414.395661] FS: 7fd2a7171740() GS:8ee2de20() knlGS: [ 414.396825] CS: 0010 DS: ES: CR0: 80050033 [ 414.398011] CR2: 7f178ffe5020 CR3: 00084c5cc003 CR4: 003606e0 [ 414.399172] DR0: DR1: DR2: [ 414.400352] DR3: DR6: fffe0ff0 DR7: 0400 [ 414.401473] Call Trace: [ 414.402685] device_del+0x89/0x400 [ 414.403819] device_unregister+0x16/0x60 [ 414.405024] hwmon_device_unregister+0x44/0xa0 [ 414.406112] thermal_remove_hwmon_sysfs+0x196/0x200 [ 414.407256] thermal_zone_device_unregister+0x1b5/0x1f0 [ 414.408415] cxgb4_thermal_remove+0x3c/0x4f [cxgb4] [ 414.409668] remove_one+0x212/0x290 [cxgb4] [ 414.410875] pci_device_remove+0x36/0xb0 [ 414.412004] device_release_driver_internal+0xe2/0x1c0 [ 414.413276] pci_stop_bus_device+0x64/0x90 [ 414.414433] pci_stop_and_remove_bus_device_locked+0x16/0x30 [ 414.415609] remove_store+0x75/0x90 [ 414.416790] kernfs_fop_write+0x114/0x1b0 [ 414.417930] vfs_write+0xcf/0x210 [ 414.419059] ksys_write+0xa7/0xe0 [ 414.420120] do_syscall_64+0x4c/0xa0 [ 414.421278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 414.422335] RIP: 0033:0x7fd2a686afd0 [ 414.423396] Code: Bad RIP value. 
[ 414.424549] RSP: 002b:7fffc1446148 EFLAGS: 0246 ORIG_RAX: 0001 [ 414.425638] RAX: ffda RBX: 0002 RCX: 7fd2a686afd0 [ 414.426830] RDX: 0002 RSI: 7fd2a7196000 RDI: 0001 [ 414.427927] RBP: 7fd2a7196000 R08: 000a R09: 7fd2a7171740 [ 414.428923] R10: 7fd2a7171740 R11: 0246 R12: 7fd2a6b43400 [ 414.430082] R13: 0002 R14: 0001 R15: [ 414.431027] irq event stamp: 76300 [ 414.435678] ---[ end trace 13865acb4d5ab00f ]--- Fixes: b18719157762 ("cxgb4: Add thermal zone support") Signed-off-by: Potnuri Bharat Teja --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_thermal.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_thermal.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_thermal.c index 3de8a5e83b6c..d7fefdbf3e57 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_thermal.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_thermal.c @@ -62,6 +62,7 @@ static struct thermal_zone_device_ops cxgb4_thermal_ops = { int cxgb4_thermal_init(struct adapter *adap) { struct ch_thermal *ch_thermal = &adap->ch_thermal; + char ch_tz_name[THERMAL_NAME_LENGTH]; int num_trip = CXGB4_NUM_TRIPS; u32 param, val; int ret; @@ -82,7 +83,8 @@ int cxgb4_thermal_init(struct adapter *adap) ch_thermal->trip_type = THERMAL_TRIP_CRITICAL; } - ch_thermal->tzdev = thermal_zone_device_register("cxgb4", num_trip, + snprintf(ch_tz_name, sizeof(ch_tz_name), "cxgb4_%s", adap->name); + ch_thermal->tzdev = thermal_zone_device_register(ch_tz_name, num_trip, 0, adap, &cxgb4_thermal_ops, NULL, 0, 0); @@ -97,7 +99,9 @@ int cxgb4_thermal_init(struct adapter *adap) int cxgb4_thermal_remove(struct adapter *adap) { - if (adap->ch_thermal.tzdev) + if (adap->ch_thermal.tzdev) { thermal_zone_device_unregister(adap->ch_thermal.tzdev); + adap->ch_t
Re: VRRP not working on i40e X722 S2600WFT
On Thu, Aug 27, 2020 at 02:30:39PM -0400, Lennart Sorensen wrote: > I have hit a new problem with the X722 chipset (Intel R1304WFT server). > VRRP simply does not work. > > When keepalived registers a vmac interface, and starts transmitting > multicast packets with the vrp message, it never receives those packets > from the peers, so all nodes think they are the master. tcpdump shows > transmits, but no receives. If I stop keepalived, which deletes the > vmac interface, then I start to receive the multicast packets from the > other nodes. Even in promisc mode, tcpdump can't see those packets. > > So it seems the hardware is dropping all packets with a source mac that > matches the source mac of the vmac interface, even when the destination > is a multicast address that was subcribed to. This is clearly not > proper behaviour. > > I tried a stock 5.8 kernel to check if a driver update helped, and updated > the nvm firware to the latest 4.10 (which appears to be over a year old), > and nothing changes the behaviour at all. > > Seems other people have hit this problem too: > http://mails.dpdk.org/archives/users/2018-May/003128.html > > Unless someone has a way to fix this, we will have to change away from > this hardware very quickly. The IPsec NAT RSS defect we could tolerate > although didn't like, while this is just unworkable. > > Quite frustrated by this. Intel network hardware was always great, > how did the X722 make it out in this state. Another case with the same problem on an X710: https://www.talkend.net/post/13256.html -- Len Sorensen
Re: [PATCH v2] netlabel: remove unused param from audit_log_format()
From: Alex Dewar Date: Fri, 28 Aug 2020 14:55:23 +0100 > Commit d3b990b7f327 ("netlabel: fix problems with mapping removal") > added a check to return an error if ret_val != 0, before ret_val is > later used in a log message. Now it will unconditionally print "... > res=1". So just drop the check. > > Addresses-Coverity: ("Dead code") > Fixes: d3b990b7f327 ("netlabel: fix problems with mapping removal") > Signed-off-by: Alex Dewar > --- > v2: Still print the res field, because it's useful (Paul) Applied to net-next, thank you.
RE: [PATCH bpf-next] 0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch
> -Original Message- > From: Janjua, Weqaar A > Sent: Friday, August 28, 2020 4:20 PM > To: Karlsson, Magnus ; Topel, Bjorn > ; a...@kernel.org; dan...@iogearbox.net; > netdev@vger.kernel.org; jonathan.le...@gmail.com > Cc: Janjua, Weqaar A ; b...@vger.kernel.org > Subject: [PATCH bpf-next] 0001-samples-bpf-fix-to-xdpsock-to-avoid- > recycling-frames.patch > > --- > ...to-xdpsock-to-avoid-recycling-frames.patch | 62 +++ > 1 file changed, 62 insertions(+) > create mode 100644 0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling- > frames.patch > > diff --git a/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch > b/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch > new file mode 100644 > index ..ae3b99b335e2 > --- /dev/null > +++ b/0001-samples-bpf-fix-to-xdpsock-to-avoid-recycling-frames.patch > @@ -0,0 +1,62 @@ > +From df0a23a79c9dca96c0059b4d766a613eba57200e Mon Sep 17 > 00:00:00 2001 > +From: Weqaar Janjua > +Date: Fri, 28 Aug 2020 13:36:32 +0100 > +Subject: [PATCH bpf-next] samples/bpf: fix to xdpsock to avoid > +recycling frames > +To: magnus.karls...@intel.com > + > +The txpush program in the xdpsock sample application is supposed to > +send out all packets in the umem in a round-robin fashion. > +The problem is that it only cycled through the first BATCH_SIZE worth > +of packets. Fixed this so that it cycles through all buffers in the > +umem as intended. > + > +Fixes: 248c7f9c0e21 ("samples/bpf: convert xdpsock to use libbpf for > +AF_XDP access") > +Signed-off-by: Weqaar Janjua > +--- > + samples/bpf/xdpsock_user.c | 10 +- > + 1 file changed, 5 insertions(+), 5 deletions(-) > + > +diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c > +index 19c679456a0e..c821e9867139 100644 > +--- a/samples/bpf/xdpsock_user.c > b/samples/bpf/xdpsock_user.c > +@@ -1004,7 +1004,7 @@ static void rx_drop_all(void) > + } > + } > + > +-static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int > +batch_size) > ++static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int > ++batch_size) > + { > + u32 idx; > + unsigned int i; > +@@ -1017,14 +1017,14 @@ static void tx_only(struct xsk_socket_info > *xsk, u32 frame_nb, int batch_size) > + for (i = 0; i < batch_size; i++) { > + struct xdp_desc *tx_desc = xsk_ring_prod__tx_desc(&xsk- > >tx, > + idx + i); > +-tx_desc->addr = (frame_nb + i) << > XSK_UMEM__DEFAULT_FRAME_SHIFT; > ++tx_desc->addr = (*frame_nb + i) << > XSK_UMEM__DEFAULT_FRAME_SHIFT; > + tx_desc->len = PKT_SIZE; > + } > + > + xsk_ring_prod__submit(&xsk->tx, batch_size); > + xsk->outstanding_tx += batch_size; > +-frame_nb += batch_size; > +-frame_nb %= NUM_FRAMES; > ++*frame_nb += batch_size; > ++*frame_nb %= NUM_FRAMES; > + complete_tx_only(xsk, batch_size); > + } > + > +@@ -1080,7 +1080,7 @@ static void tx_only_all(void) > + } > + > + for (i = 0; i < num_socks; i++) > +-tx_only(xsks[i], frame_nb[i], batch_size); > ++tx_only(xsks[i], &frame_nb[i], batch_size); > + > + pkt_cnt += batch_size; > + > +-- > +2.20.1 > + > -- > 2.20.1 [Janjua, Weqaar A] Apologies, please discard this patch, will resubmit. -- Intel Research and Development Ireland Limited Registered in Ireland Registered Office: Collinstown Industrial Park, Leixlip, County Kildare Registered Number: 308263 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
[PATCH bpf-next] samples/bpf: fix to xdpsock to avoid recycling frames
The txpush program in the xdpsock sample application is supposed to send out all packets in the umem in a round-robin fashion. The problem is that it only cycled through the first BATCH_SIZE worth of packets. Fixed this so that it cycles through all buffers in the umem as intended. Fixes: 248c7f9c0e21 ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access") Signed-off-by: Weqaar Janjua --- samples/bpf/xdpsock_user.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c index 19c679456a0e..c821e9867139 100644 --- a/samples/bpf/xdpsock_user.c +++ b/samples/bpf/xdpsock_user.c @@ -1004,7 +1004,7 @@ static void rx_drop_all(void) } } -static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) +static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int batch_size) { u32 idx; unsigned int i; @@ -1017,14 +1017,14 @@ static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) for (i = 0; i < batch_size; i++) { struct xdp_desc *tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i); - tx_desc->addr = (frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; + tx_desc->addr = (*frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; tx_desc->len = PKT_SIZE; } xsk_ring_prod__submit(&xsk->tx, batch_size); xsk->outstanding_tx += batch_size; - frame_nb += batch_size; - frame_nb %= NUM_FRAMES; + *frame_nb += batch_size; + *frame_nb %= NUM_FRAMES; complete_tx_only(xsk, batch_size); } @@ -1080,7 +1080,7 @@ static void tx_only_all(void) } for (i = 0; i < num_socks; i++) - tx_only(xsks[i], frame_nb[i], batch_size); + tx_only(xsks[i], &frame_nb[i], batch_size); pkt_cnt += batch_size; -- 2.20.1 -- Intel Research and Development Ireland Limited Registered in Ireland Registered Office: Collinstown Industrial Park, Leixlip, County Kildare Registered Number: 308263 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Re: [PATCH] net: netfilter: delete repeated words
On Sat, Aug 22, 2020 at 06:07:27PM -0700, Randy Dunlap wrote: > Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/. Applied, thanks.
Re: [PATCH V2 1/5 nf] selftests: netfilter: fix header example
On Sun, Aug 23, 2020 at 08:15:37PM +0200, Fabian Frederick wrote: > nft_flowtable.sh is made for bash not sh. > Also give values which not return "RTNETLINK answers: Invalid > argument" Series from 1 to 5 is applied.
Re: [PATCH v3 1/1] netfilter: nat: add a range check for l3/l4 protonum
Hi Will, Given this is for -stable maintainers only, I'd suggest: 1) Specify what -stable kernel versions this patch applies to. Explain that this problem is gone since what kernel version. 2) Maybe clarify that this is only for stable in the patch subject, e.g. [PATCH -stable v3] netfilter: nat: add a range check for l3/l4 Otherwise, this -stable maintainers might not identify this patch as something that is targetted to them. Thanks. On Mon, Aug 24, 2020 at 07:38:32PM +, Will McVicker wrote: > The indexes to the nf_nat_l[34]protos arrays come from userspace. So > check the tuple's family, e.g. l3num, when creating the conntrack in > order to prevent an OOB memory access during setup. Here is an example > kernel panic on 4.14.180 when userspace passes in an index greater than > NFPROTO_NUMPROTO. > > Internal error: Oops - BUG: 0 [#1] PREEMPT SMP > Modules linked in:... > Process poc (pid: 5614, stack limit = 0xa3933121) > CPU: 4 PID: 5614 Comm: poc Tainted: G S W O4.14.180-g051355490483 > Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM > task: 2a3dfffe task.stack: a3933121 > pc : __cfi_check_fail+0x1c/0x24 > lr : __cfi_check_fail+0x1c/0x24 > ... > Call trace: > __cfi_check_fail+0x1c/0x24 > name_to_dev_t+0x0/0x468 > nfnetlink_parse_nat_setup+0x234/0x258 > ctnetlink_parse_nat_setup+0x4c/0x228 > ctnetlink_new_conntrack+0x590/0xc40 > nfnetlink_rcv_msg+0x31c/0x4d4 > netlink_rcv_skb+0x100/0x184 > nfnetlink_rcv+0xf4/0x180 > netlink_unicast+0x360/0x770 > netlink_sendmsg+0x5a0/0x6a4 > ___sys_sendmsg+0x314/0x46c > SyS_sendmsg+0xb4/0x108 > el0_svc_naked+0x34/0x38 > > Fixes: c1d10adb4a521 ("[NETFILTER]: Add ctnetlink port for nf_conntrack") > Signed-off-by: Will McVicker > --- > net/netfilter/nf_conntrack_netlink.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/netfilter/nf_conntrack_netlink.c > b/net/netfilter/nf_conntrack_netlink.c > index 31fa94064a62..0b89609a6e9d 100644 > --- a/net/netfilter/nf_conntrack_netlink.c > +++ b/net/netfilter/nf_conntrack_netlink.c > @@ -1129,6 +1129,8 @@ ctnetlink_parse_tuple(const struct nlattr * const cda[], > if (!tb[CTA_TUPLE_IP]) > return -EINVAL; > > + if (l3num != NFPROTO_IPV4 && l3num != NFPROTO_IPV6) > + return -EOPNOTSUPP; > tuple->src.l3num = l3num; > > err = ctnetlink_parse_tuple_ip(tb[CTA_TUPLE_IP], tuple); > -- > 2.28.0.297.g1956fa8f8d-goog >
Re: [PATCH net-next 2/3] devlink: Consider other controller while building phys_port_name
On Fri, 28 Aug 2020 04:27:19 + Parav Pandit wrote: > > From: Jakub Kicinski > > Sent: Friday, August 28, 2020 3:12 AM > > > > On Thu, 27 Aug 2020 20:15:01 + Parav Pandit wrote: > > > > From: Jakub Kicinski > > > > > > > > I find it strange that you have pfnum 0 everywhere but then > > > > different controllers. > > > There are multiple PFs, connected to different PCI RC. So device has > > > same pfnum for both the PFs. > > > > > > > For MultiHost at Netronome we've used pfnum to distinguish between > > > > the hosts. ASIC must have some unique identifiers for each PF. > > > Yes. there is. It is identified by a unique controller number; > > > internally it is called host_number. But internal host_number is > > > misleading term as multiple cables of same physical card can be > > > plugged into single host. So identifying based on a unique > > > (controller) number and matching that up on external cable is desired. > > > > > > > I'm not aware of any practical reason for creating PFs on one RC > > > > without reinitializing all the others. > > > I may be misunderstanding, but how is initialization is related > > > multiple PFs? > > > > If the number of PFs is static it should be possible to understand which > > one is on > > which system. > > How? How do we tell that pfnum A means external system. > Want to avoid such 'implicit' notion. How do you tell that controller A means external system? > > > > I can see how having multiple controllers may make things clearer, > > > > but adding another layer of IDs while the one under it is unused > > > > (pfnum=0) feels very unnecessary. > > > pfnum=0 is used today. not sure I understand your comment about being > > > unused. Can you please explain? > > > > You examples only ever have pfnum 0: > > > Because both controllers have pfnum 0. > > > From patch 2: > > > > $ devlink port show pci/:00:08.0/2 > > pci/:00:08.0/2: type eth netdev eth7 controller 0 flavour pcivf pfnum 0 > > vfnum 1 splittable false > > function: > > hw_addr 00:00:00:00:00:00 > > > > $ devlink port show -jp pci/:00:08.0/2 { > > "port": { > > "pci/:00:08.0/1": { > > "type": "eth", > > "netdev": "eth7", > > "controller": 0, > > "flavour": "pcivf", > > "pfnum": 0, > > "vfnum": 1, > > "splittable": false, > > "function": { > > "hw_addr": "00:00:00:00:00:00" > > } > > } > > } > > } > > > > From earlier email: > > > > pci/:00:08.0/1: type eth netdev eth6 flavour pcipf pfnum 0 > > pci/:00:08.0/2: type eth netdev eth7 flavour pcipf pfnum 0 > > > > If you never use pfnum, you can just put the controller ID there, like > > Netronome. > > > It likely not going to work for us. Because pfnum is not some randomly > generated number. > It is linked to the underlying PCI pf number. {pf0, pf1...} > Orchestration sw uses this to identify representor of a PF-VF pair. For orchestration software which is unaware of controllers ports will still alias on pf/vf nums. Besides you have one devlink instance per port currently so I'm guessing there is no pf1 ever, in your case... > Replacing pfnum with controller number breaks this; and it still doesn't tell > user that it's the pf on other_host. Neither does the opaque controller id. Maybe now you understand better why I wanted peer objects :/ > So it is used, and would like to continue to use even if there are multiple > PFs port (that has same pfnum) under the same eswitch. > > In an alternative, > Currently we have pcipf, pcivf (and pcisf) flavours. May be if we introduce > new flavour say 'epcipf' to indicate external pci PF/VF/SF ports? 
> There can be better name than epcipf. I just put epcipf to differentiate it. > However these ports have same attributes as pcipf, pcivf, pcisf flavours. I don't think the controllers are a terrible idea. Seems like a fairly reasonable extension. But MLX don't seem to need them. And you have a history of trying to make the Linux APIs look like your FW API. Jiri, would you mind chiming in? What's your take?
Re: [PATCH v3 1/1] netfilter: nat: add a range check for l3/l4 protonum
Pablo Neira Ayuso wrote: > Hi Will, > > Given this is for -stable maintainers only, I'd suggest: > > 1) Specify what -stable kernel versions this patch applies to. >Explain that this problem is gone since what kernel version. > > 2) Maybe clarify that this is only for stable in the patch subject, >e.g. [PATCH -stable v3] netfilter: nat: add a range check for l3/l4 Hmm, we silently accept a tuple that we can't really deal with, no? > > + if (l3num != NFPROTO_IPV4 && l3num != NFPROTO_IPV6) > > + return -EOPNOTSUPP; I vote to apply this to nf.git
Re: [PATCH RFC net-next] net/tls: Implement getsockopt SOL_TLS TLS_RX
On Tue, 18 Aug 2020 14:12:24 + Yutaro Hayakawa wrote: > @@ -352,7 +352,11 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > } > > /* get user crypto info */ > - crypto_info = &ctx->crypto_send.info; > + if (tx) { > + crypto_info = &ctx->crypto_send.info; > + } else { > + crypto_info = &ctx->crypto_recv.info; > + } No need for parenthesis, if both branches have one line. > > if (!TLS_CRYPTO_INFO_READY(crypto_info)) { > rc = -EBUSY; > @@ -378,11 +382,19 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > goto out; > } > lock_sock(sk); > - memcpy(crypto_info_aes_gcm_128->iv, > -ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > -TLS_CIPHER_AES_GCM_128_IV_SIZE); > - memcpy(crypto_info_aes_gcm_128->rec_seq, ctx->tx.rec_seq, > -TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + if (tx) { > + memcpy(crypto_info_aes_gcm_128->iv, > +ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > +TLS_CIPHER_AES_GCM_128_IV_SIZE); > + memcpy(crypto_info_aes_gcm_128->rec_seq, > ctx->tx.rec_seq, > +TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + } else { > + memcpy(crypto_info_aes_gcm_128->iv, > +ctx->rx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, > +TLS_CIPHER_AES_GCM_128_IV_SIZE); > + memcpy(crypto_info_aes_gcm_128->rec_seq, > ctx->rx.rec_seq, > +TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); > + } Instead of all the duplication choose the right struct cipher_context above, like we do for crypto_info. > release_sock(sk); > if (copy_to_user(optval, >crypto_info_aes_gcm_128, > @@ -402,11 +414,19 @@ static int do_tls_getsockopt_tx(struct sock *sk, char > __user *optval, > goto out; > } > lock_sock(sk); > - memcpy(crypto_info_aes_gcm_256->iv, > -ctx->tx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > -TLS_CIPHER_AES_GCM_256_IV_SIZE); > - memcpy(crypto_info_aes_gcm_256->rec_seq, ctx->tx.rec_seq, > -TLS_CIPHER_AES_GCM_256_REC_SEQ_SIZE); > + if (tx) { > + memcpy(crypto_info_aes_gcm_256->iv, > +ctx->tx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > +TLS_CIPHER_AES_GCM_256_IV_SIZE); > + memcpy(crypto_info_aes_gcm_256->rec_seq, > ctx->tx.rec_seq, > +TLS_CIPHER_AES_GCM_256_REC_SEQ_SIZE); > + } else { > + memcpy(crypto_info_aes_gcm_256->iv, > +ctx->rx.iv + TLS_CIPHER_AES_GCM_256_SALT_SIZE, > +TLS_CIPHER_AES_GCM_256_IV_SIZE); > + memcpy(crypto_info_aes_gcm_256->rec_seq, > ctx->rx.rec_seq, > +TLS_CIPHER_AES_GCM_256_REC_SEQ_SIZE); > + } ditto. > release_sock(sk); > if (copy_to_user(optval, >crypto_info_aes_gcm_256,
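Concretely, the deduplication being asked for could look something like this (just a sketch of the suggestion, not the actual respin of the patch):

	/* pick the right cipher_context once, next to the crypto_info choice */
	struct cipher_context *cctx = tx ? &ctx->tx : &ctx->rx;

	/* ... then the AES-GCM-128 branch keeps a single copy path: */
	memcpy(crypto_info_aes_gcm_128->iv,
	       cctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
	memcpy(crypto_info_aes_gcm_128->rec_seq, cctx->rec_seq,
	       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

and likewise for the AES-GCM-256 case, so the tx/rx distinction is made in one place instead of being repeated in every cipher branch.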
Re: [PATCH v3 1/1] netfilter: nat: add a range check for l3/l4 protonum
On Fri, Aug 28, 2020 at 06:45:51PM +0200, Florian Westphal wrote: > Pablo Neira Ayuso wrote: > > Hi Will, > > > > Given this is for -stable maintainers only, I'd suggest: > > > > 1) Specify what -stable kernel versions this patch applies to. > >Explain that this problem is gone since what kernel version. > > > > 2) Maybe clarify that this is only for stable in the patch subject, > >e.g. [PATCH -stable v3] netfilter: nat: add a range check for l3/l4 > > Hmm, we silently accept a tuple that we can't really deal with, no? Oh, I overlook, existing kernels are affected. You're right. > > > + if (l3num != NFPROTO_IPV4 && l3num != NFPROTO_IPV6) > > > + return -EOPNOTSUPP; > > I vote to apply this to nf.git I have rebased this patch on top of nf.git, attached what I'll apply to nf.git. >From 4d3426b91eba6eb28f38a2b06ee024aff861aa16 Mon Sep 17 00:00:00 2001 From: Will McVicker Date: Mon, 24 Aug 2020 19:38:32 + Subject: [PATCH] netfilter: ctnetlink: add a range check for l3/l4 protonum The indexes to the nf_nat_l[34]protos arrays come from userspace. So check the tuple's family, e.g. l3num, when creating the conntrack in order to prevent an OOB memory access during setup. Here is an example kernel panic on 4.14.180 when userspace passes in an index greater than NFPROTO_NUMPROTO. Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in:... Process poc (pid: 5614, stack limit = 0xa3933121) CPU: 4 PID: 5614 Comm: poc Tainted: G S W O4.14.180-g051355490483 Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM task: 2a3dfffe task.stack: a3933121 pc : __cfi_check_fail+0x1c/0x24 lr : __cfi_check_fail+0x1c/0x24 ... Call trace: __cfi_check_fail+0x1c/0x24 name_to_dev_t+0x0/0x468 nfnetlink_parse_nat_setup+0x234/0x258 ctnetlink_parse_nat_setup+0x4c/0x228 ctnetlink_new_conntrack+0x590/0xc40 nfnetlink_rcv_msg+0x31c/0x4d4 netlink_rcv_skb+0x100/0x184 nfnetlink_rcv+0xf4/0x180 netlink_unicast+0x360/0x770 netlink_sendmsg+0x5a0/0x6a4 ___sys_sendmsg+0x314/0x46c SyS_sendmsg+0xb4/0x108 el0_svc_naked+0x34/0x38 Fixes: c1d10adb4a521 ("[NETFILTER]: Add ctnetlink port for nf_conntrack") Signed-off-by: Will McVicker [pa...@netfilter.org: rebased original patch on top of nf.git] Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 832eabecfbdd..d65846aa8059 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -1404,7 +1404,8 @@ ctnetlink_parse_tuple_filter(const struct nlattr * const cda[], if (err < 0) return err; - + if (l3num != NFPROTO_IPV4 && l3num != NFPROTO_IPV6) + return -EOPNOTSUPP; tuple->src.l3num = l3num; if (flags & CTA_FILTER_FLAG(CTA_IP_DST) || -- 2.20.1
Re: [PATCH net-next v1 3/3] hinic: add support to query function table
On Fri, 28 Aug 2020 11:16:22 +0800 luobin (L) wrote: > On 2020/8/28 3:44, Jakub Kicinski wrote: > > On Thu, 27 Aug 2020 19:13:21 +0800 Luo bin wrote: > >> + switch (idx) { > >> + case VALID: > >> + return funcfg_table_elem->dw0.bs.valid; > >> + case RX_MODE: > >> + return funcfg_table_elem->dw0.bs.nic_rx_mode; > >> + case MTU: > >> + return funcfg_table_elem->dw1.bs.mtu; > >> + case VLAN_MODE: > >> + return funcfg_table_elem->dw1.bs.vlan_mode; > >> + case VLAN_ID: > >> + return funcfg_table_elem->dw1.bs.vlan_id; > >> + case RQ_DEPTH: > >> + return funcfg_table_elem->dw13.bs.cfg_rq_depth; > >> + case QUEUE_NUM: > >> + return funcfg_table_elem->dw13.bs.cfg_q_num; > > > > The first two patches look fairly unobjectionable to me, but here the > > information does not seem that driver-specific. What's vlan_mode, and > > vlan_id in the context of PF? Why expose mtu, is it different than > > netdev mtu? What's valid? rq_depth? > > . > > > The vlan_mode and vlan_id in function table are provided for VF in QinQ > scenario > and they are useless for PF. Querying VF's function table is unsupported now, > so > there is no need to expose vlan_id and vlan mode and I'll remove them in my > next > patchset. The function table is saved in hw and we expose the mtu to ensure > the > mtu saved in hw is same with netdev mtu. The valid filed indicates whether > this > function is enabled or not and the hw can judge whether the RQ buffer in host > is > sufficient by comparing the values of rq depth, pi and ci. Queue depth is definitely something we can add to the ethtool API. You already expose raw producer and consumer indexes so the calculation can be done, anyway.
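As a pointer for the direction suggested above: configured queue depths are normally exposed through the standard ethtool ring parameters (ethtool -g) rather than a driver-private dump. A generic sketch, using hypothetical foo_* names rather than actual hinic code:

	/* Sketch: report configured/maximum queue depths via ethtool -g.
	 * foo_nic and its fields are made up for illustration. */
	static void foo_get_ringparam(struct net_device *netdev,
				      struct ethtool_ringparam *ring)
	{
		struct foo_nic *nic = netdev_priv(netdev);

		ring->rx_max_pending = FOO_MAX_RQ_DEPTH;
		ring->tx_max_pending = FOO_MAX_SQ_DEPTH;
		ring->rx_pending = nic->rq_depth;	/* what the fw table calls cfg_rq_depth */
		ring->tx_pending = nic->sq_depth;
	}

	static const struct ethtool_ops foo_ethtool_ops = {
		.get_ringparam	= foo_get_ringparam,
		/* ... */
	};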
Re: [Linux-kernel-mentees] [PATCH net-next v2] ipvs: Fix uninit-value in do_ip_vs_set_ctl()
On Tue, Aug 11, 2020 at 03:46:40AM -0400, Peilin Ye wrote: > do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is > zero. Fix it. Applied to nf-next, thanks.
[PATCH net-next 0/4] sfc: clean up some W=1 build warnings
A collection of minor fixes to issues flagged up by W=1. After this series, the only remaining warnings in the sfc driver are some 'member missing in kerneldoc' warnings from ptp.c. Tested by building on x86_64 and running 'ethtool -p' on an EF10 NIC; there was no error, but I couldn't observe the actual LED as I'm working remotely. [ Incidentally, ethtool_phys_id()'s behaviour on an error return looks strange — if I'm reading it right, it will break out of the inner loop but not the outer one, and eventually return the rc from the last run of the inner loop. Is this intended? ] Edward Cree (4): sfc: fix W=1 warnings in efx_farch_handle_rx_not_ok sfc: fix unused-but-set-variable warning in efx_farch_filter_remove_safe sfc: fix kernel-doc on struct efx_loopback_state sfc: return errors from efx_mcdi_set_id_led, and de-indirect drivers/net/ethernet/sfc/ef10.c | 2 -- drivers/net/ethernet/sfc/ethtool.c| 3 +-- drivers/net/ethernet/sfc/farch.c | 9 ++--- drivers/net/ethernet/sfc/mcdi.c | 6 ++ drivers/net/ethernet/sfc/mcdi.h | 2 +- drivers/net/ethernet/sfc/net_driver.h | 2 -- drivers/net/ethernet/sfc/selftest.c | 2 +- drivers/net/ethernet/sfc/siena.c | 1 - 8 files changed, 7 insertions(+), 20 deletions(-)
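Restated in code, the ethtool_phys_id() behaviour questioned above is roughly the following. This is only an illustration of the described control flow, not a quote of the actual net/ethtool source: an error from set_phys_id() breaks the inner loop only, and the rc left by the last inner run is what eventually gets returned.

	/* Illustration of the described flow; not the verbatim source. */
	do {
		i = n;			/* n toggles per second */
		do {
			rc = ops->set_phys_id(dev, (i & 1) ? ETHTOOL_ID_OFF
							   : ETHTOOL_ID_ON);
			if (rc)
				break;	/* exits the inner loop only */
			schedule_timeout_interruptible(interval);
		} while (--i);
	} while (id.data == 0 || --id.data);	/* outer loop keeps going */

	return rc;			/* rc from the last inner run */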
[PATCH net-next 1/4] sfc: fix W=1 warnings in efx_farch_handle_rx_not_ok
Some of these RX-event flags aren't used at all, so remove them. Others are used only #ifdef DEBUG to log a message; suppress the unused-var warnings #ifndef DEBUG with a void cast. Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/farch.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c index d07eeaad9bdf..aff2974e66df 100644 --- a/drivers/net/ethernet/sfc/farch.c +++ b/drivers/net/ethernet/sfc/farch.c @@ -863,13 +863,8 @@ static u16 efx_farch_handle_rx_not_ok(struct efx_rx_queue *rx_queue, bool rx_ev_tcp_udp_chksum_err, rx_ev_eth_crc_err; bool rx_ev_frm_trunc, rx_ev_tobe_disc; bool rx_ev_other_err, rx_ev_pause_frm; - bool rx_ev_hdr_type, rx_ev_mcast_pkt; - unsigned rx_ev_pkt_type; - rx_ev_hdr_type = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_HDR_TYPE); - rx_ev_mcast_pkt = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_MCAST_PKT); rx_ev_tobe_disc = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_TOBE_DISC); - rx_ev_pkt_type = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_PKT_TYPE); rx_ev_buf_owner_id_err = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_BUF_OWNER_ID_ERR); rx_ev_ip_hdr_chksum_err = EFX_QWORD_FIELD(*event, @@ -918,6 +913,8 @@ static u16 efx_farch_handle_rx_not_ok(struct efx_rx_queue *rx_queue, rx_ev_tobe_disc ? " [TOBE_DISC]" : "", rx_ev_pause_frm ? " [PAUSE]" : ""); } +#else + (void) rx_ev_other_err; #endif if (efx->net_dev->features & NETIF_F_RXALL)
[PATCH net-next 2/4] sfc: fix unused-but-set-variable warning in efx_farch_filter_remove_safe
Thanks to some past refactor, 'spec' is not actually used in this function; the code using it moved to the callee efx_farch_filter_remove. Remove the variable to fix a W=1 warning. Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/farch.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c index aff2974e66df..0d9795fb9356 100644 --- a/drivers/net/ethernet/sfc/farch.c +++ b/drivers/net/ethernet/sfc/farch.c @@ -2589,7 +2589,6 @@ int efx_farch_filter_remove_safe(struct efx_nic *efx, enum efx_farch_filter_table_id table_id; struct efx_farch_filter_table *table; unsigned int filter_idx; - struct efx_farch_filter_spec *spec; int rc; table_id = efx_farch_filter_id_table_id(filter_id); @@ -2601,7 +2600,6 @@ int efx_farch_filter_remove_safe(struct efx_nic *efx, if (filter_idx >= table->size) return -ENOENT; down_write(&state->lock); - spec = &table->spec[filter_idx]; rc = efx_farch_filter_remove(efx, table, filter_idx, priority); up_write(&state->lock);
[PATCH net-next 4/4] sfc: return errors from efx_mcdi_set_id_led, and de-indirect
W=1 warnings indicated that 'rc' was unused in efx_mcdi_set_id_led(); change the function to return int instead of void and plumb the rc through the caller efx_ethtool_phys_id(). Since (post-Falcon) all sfc NICs use MCDI for this, there's no point in indirecting through a nic_type method, so remove that and just call efx_mcdi_set_id_led() directly. Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/ef10.c | 2 -- drivers/net/ethernet/sfc/ethtool.c| 3 +-- drivers/net/ethernet/sfc/mcdi.c | 6 ++ drivers/net/ethernet/sfc/mcdi.h | 2 +- drivers/net/ethernet/sfc/net_driver.h | 2 -- drivers/net/ethernet/sfc/siena.c | 1 - 6 files changed, 4 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index 4b0b2cf026a5..0b4bcac53f18 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -3955,7 +3955,6 @@ const struct efx_nic_type efx_hunt_a0_vf_nic_type = { .start_stats = efx_port_dummy_op_void, .pull_stats = efx_port_dummy_op_void, .stop_stats = efx_port_dummy_op_void, - .set_id_led = efx_mcdi_set_id_led, .push_irq_moderation = efx_ef10_push_irq_moderation, .reconfigure_mac = efx_ef10_mac_reconfigure, .check_mac_fault = efx_mcdi_mac_check_fault, @@ -4066,7 +4065,6 @@ const struct efx_nic_type efx_hunt_a0_nic_type = { .start_stats = efx_mcdi_mac_start_stats, .pull_stats = efx_mcdi_mac_pull_stats, .stop_stats = efx_mcdi_mac_stop_stats, - .set_id_led = efx_mcdi_set_id_led, .push_irq_moderation = efx_ef10_push_irq_moderation, .reconfigure_mac = efx_ef10_mac_reconfigure, .check_mac_fault = efx_mcdi_mac_check_fault, diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c index 4ffda7782f68..12a91c559aa2 100644 --- a/drivers/net/ethernet/sfc/ethtool.c +++ b/drivers/net/ethernet/sfc/ethtool.c @@ -50,8 +50,7 @@ static int efx_ethtool_phys_id(struct net_device *net_dev, return 1; /* cycle on/off once per second */ } - efx->type->set_id_led(efx, mode); - return 0; + return efx_mcdi_set_id_led(efx, mode); } static int efx_ethtool_get_regs_len(struct net_device *net_dev) diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c index 5467819aef6e..be6bfd6b7ec7 100644 --- a/drivers/net/ethernet/sfc/mcdi.c +++ b/drivers/net/ethernet/sfc/mcdi.c @@ -1868,10 +1868,9 @@ int efx_mcdi_handle_assertion(struct efx_nic *efx) return efx_mcdi_exit_assertion(efx); } -void efx_mcdi_set_id_led(struct efx_nic *efx, enum efx_led_mode mode) +int efx_mcdi_set_id_led(struct efx_nic *efx, enum efx_led_mode mode) { MCDI_DECLARE_BUF(inbuf, MC_CMD_SET_ID_LED_IN_LEN); - int rc; BUILD_BUG_ON(EFX_LED_OFF != MC_CMD_LED_OFF); BUILD_BUG_ON(EFX_LED_ON != MC_CMD_LED_ON); @@ -1881,8 +1880,7 @@ void efx_mcdi_set_id_led(struct efx_nic *efx, enum efx_led_mode mode) MCDI_SET_DWORD(inbuf, SET_ID_LED_IN_STATE, mode); - rc = efx_mcdi_rpc(efx, MC_CMD_SET_ID_LED, inbuf, sizeof(inbuf), - NULL, 0, NULL); + return efx_mcdi_rpc(efx, MC_CMD_SET_ID_LED, inbuf, sizeof(inbuf), NULL, 0, NULL); } static int efx_mcdi_reset_func(struct efx_nic *efx) diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h index 658cf345420d..8aed65018964 100644 --- a/drivers/net/ethernet/sfc/mcdi.h +++ b/drivers/net/ethernet/sfc/mcdi.h @@ -348,7 +348,7 @@ int efx_mcdi_nvram_info(struct efx_nic *efx, unsigned int type, int efx_new_mcdi_nvram_test_all(struct efx_nic *efx); int efx_mcdi_nvram_test_all(struct efx_nic *efx); int efx_mcdi_handle_assertion(struct efx_nic *efx); -void efx_mcdi_set_id_led(struct efx_nic *efx, enum 
efx_led_mode mode); +int efx_mcdi_set_id_led(struct efx_nic *efx, enum efx_led_mode mode); int efx_mcdi_wol_filter_set_magic(struct efx_nic *efx, const u8 *mac, int *id_out); int efx_mcdi_wol_filter_get_magic(struct efx_nic *efx, int *id_out); diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h index 062462a13847..338ebb0402be 100644 --- a/drivers/net/ethernet/sfc/net_driver.h +++ b/drivers/net/ethernet/sfc/net_driver.h @@ -1217,7 +1217,6 @@ struct efx_udp_tunnel { * @start_stats: Start the regular fetching of statistics * @pull_stats: Pull stats from the NIC and wait until they arrive. * @stop_stats: Stop the regular fetching of statistics - * @set_id_led: Set state of identifying LED or revert to automatic function * @push_irq_moderation: Apply interrupt moderation value * @reconfigure_port: Push loopback/power/txdis changes to the MAC and PHY * @prepare_enable_fc_tx: Prepare MAC to enable pause frame TX (may be %NULL) @@ -1362,7 +1361,6 @@ struct efx_nic_type { void (*start_stats)(struct efx_nic *efx); void (*pull_st
[PATCH net-next 3/4] sfc: fix kernel-doc on struct efx_loopback_state
Missing 'struct' keyword caused "cannot understand function prototype" warnings. Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/selftest.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/sfc/selftest.c b/drivers/net/ethernet/sfc/selftest.c index e71d6d37a317..34b9c7d50c4e 100644 --- a/drivers/net/ethernet/sfc/selftest.c +++ b/drivers/net/ethernet/sfc/selftest.c @@ -67,7 +67,7 @@ static const char *const efx_interrupt_mode_names[] = { STRING_TABLE_LOOKUP(efx->interrupt_mode, efx_interrupt_mode) /** - * efx_loopback_state - persistent state during a loopback selftest + * struct efx_loopback_state - persistent state during a loopback selftest * @flush: Drop all packets in efx_loopback_rx_packet * @packet_count: Number of packets being used in this test * @skbs: An array of skbs transmitted
Re: [PATCH net-next] netfilter: xt_HMARK: Use ip_is_fragment() helper
On Thu, Aug 27, 2020 at 10:08:13PM +0800, YueHaibing wrote: > Use ip_is_fragment() to simplify code. Applied.
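For context, the helper does nothing more than test the MF flag and the fragment offset; its definition in include/net/ip.h is, quoted from memory, essentially:

	static inline bool ip_is_fragment(const struct iphdr *iph)
	{
		return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
	}

So any open-coded frag_off test in xt_HMARK can be replaced one-for-one.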
Re: [PATCH] netfilter: nf_conntrack_sip: fix parsing error
On Sat, Aug 15, 2020 at 12:50:30PM -0400, Tong Zhang wrote: > ct_sip_parse_numerical_param can only return 0 or 1, but the caller is > checking parsing error using < 0 Is this a real issue in your setup, or is it something a static analysis tool is reporting? You are right that ct_sip_parse_numerical_param() never returns < 0. However, looking at https://tools.ietf.org/html/rfc3261 (see page 161), expires is optional; my understanding is that your patch is making this option mandatory.
Re: [PATCH] netfilter: nf_conntrack_sip: fix parsing error
Hi Pablo, I'm not an expert in this networking stuff, but from my point of view there's no point in checking a condition that is always true. There's also no need for ct_sip_parse_numerical_param() to return anything if all of its return values are ignored like this. On Fri, Aug 28, 2020 at 2:07 PM Pablo Neira Ayuso wrote: > Is this a real issue in your setup, or is it something a static analysis > tool is reporting? > > You are right that ct_sip_parse_numerical_param() never returns < 0. > However, looking at https://tools.ietf.org/html/rfc3261 (see page 161), > expires is optional; my understanding is that your patch is making > this option mandatory.
Re: [PATCH net-next] net/sched: add act_ct_output support
On Tue, Aug 25, 2020 at 8:33 AM Marcelo Ricardo Leitner wrote: > I still don't understand Cong's argument for not having this on > act_mirred because TC is L2. That's actually not right. TC hooks at L2 You miss a very important point: it is already too late to rename act_mirred to reflect whatever new features are added to it. > but deals with L3 and L4 (after all, it does static NAT, mungles L4 > headers and classifies based on virtually anything) since beginning, > and this is just another case. So eventually you want TC to deal with all L3 stuff? I think you are exaggerating; modifying L3/L4 headers does not mean TC handles the L3 protocol. But doing IP-layer fragmentation is clearly doing something that belongs to the IP protocol. Look at the code: you never need to call into IP-layer code (except some trivial helpers) until you do CT or fragmentation. This is why I do not like act_ct either; it fits oddly into TC. Why not just do segmentation instead of fragmentation? GSO is already performed at L2 by software. Thanks.
Re: [PATCH] netfilter: nf_conntrack_sip: fix parsing error
On Fri, Aug 28, 2020 at 02:14:48PM -0400, Tong Zhang wrote: > Hi Pablo, I'm not an expert in this networking stuff, > but from my point of view there's no point in checking a > condition that is always true. Understood. > There's also no need for ct_sip_parse_numerical_param() to return > anything if all of its return values are ignored like this. Then perhaps update this code to ignore the return value?
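For illustration, the "ignore the return value" option would look roughly like the sketch below. The helper's signature and the surrounding names are written from memory and may not match the current tree exactly; treat this as a sketch of the idea, not a patch.

	/* Illustrative only: "expires=" is optional per RFC 3261, so a miss
	 * is not an error.  Keep the default of 0 and let later code decide
	 * what an absent Expires means. */
	unsigned int expires = 0;

	/* Returns 1 if "expires=" was found and parsed, 0 if it is absent. */
	ct_sip_parse_numerical_param(ct, *dptr, matchoff + matchlen,
				     *datalen, "expires=", &expires);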
Re: [PATCH v3 00/11] Fix PM hibernation in Xen guests
On Fri, Aug 21, 2020 at 10:22:43PM +, Anchal Agarwal wrote: > Hello, > This series fixes PM hibernation for hvm guests running on xen hypervisor. > The running guest could now be hibernated and resumed successfully at a > later time. The fixes for PM hibernation are added to block and > network device drivers i.e xen-blkfront and xen-netfront. Any other driver > that needs to add S4 support if not already, can follow same method of > introducing freeze/thaw/restore callbacks. > The patches had been tested against upstream kernel and xen4.11. Large > scale testing is also done on Xen based Amazon EC2 instances. All this testing > involved running memory exhausting workload in the background. > > Doing guest hibernation does not involve any support from hypervisor and > this way guest has complete control over its state. Infrastructure > restrictions for saving up guest state can be overcome by guest initiated > hibernation. > > These patches were send out as RFC before and all the feedback had been > incorporated in the patches. The last v1 & v2 could be found here: > > [v1]: https://lkml.org/lkml/2020/5/19/1312 > [v2]: https://lkml.org/lkml/2020/7/2/995 > All comments and feedback from v2 had been incorporated in v3 series. > > Known issues: > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and > has to be restarted. I will investigate this issue separately and shouldn't > be a blocker for this patch series. > 2. During hibernation, I observed sometimes that freezing of tasks fails due > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1 > out of 200 runs and hibernation is aborted in this case. Re-trying hibernation > may work. Also, this is a known issue with hibernation and some > filesystems like XFS has been discussed by the community for years with not an > effectve resolution at this point. > > Testing How to: > --- > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream > xen-4.11] > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem > images]. > 3. Create a swap file size=RAM size > 4. Update grub parameters and reboot > 5. Trigger pm-hibernation from within the VM > > Example: > Set up a file-backed swap space. 
Swap file size>=Total memory on the system > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB > sudo chmod 600 /swap > sudo mkswap /swap > sudo swapon /swap > > Update resume device/resume offset in grub if using swap file: > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1 > > Execute: > > sudo pm-hibernate > OR > echo disk > /sys/power/state && echo reboot > /sys/power/disk > > Compute resume offset code: > " > #!/usr/bin/env python > import sys > import array > import fcntl > > #swap file > f = open(sys.argv[1], 'r') > buf = array.array('L', [0]) > > #FIBMAP > ret = fcntl.ioctl(f.fileno(), 0x01, buf) > print buf[0] > " > > Aleksei Besogonov (1): > PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA > > Anchal Agarwal (4): > x86/xen: Introduce new function to map HYPERVISOR_shared_info on > Resume > x86/xen: save and restore steal clock during PM hibernation > xen: Introduce wrapper for save/restore sched clock offset > xen: Update sched clock offset to avoid system instability in > hibernation > > Munehisa Kamata (5): > xen/manage: keep track of the on-going suspend mode > xenbus: add freeze/thaw/restore callbacks support > x86/xen: add system core suspend and resume callbacks > xen-blkfront: add callbacks for PM suspend and hibernation > xen-netfront: add callbacks for PM suspend and hibernation > > Thomas Gleixner (1): > genirq: Shutdown irq chips in suspend/resume during hibernation > > arch/x86/xen/enlighten_hvm.c | 7 +++ > arch/x86/xen/suspend.c| 63 > arch/x86/xen/time.c | 15 - > arch/x86/xen/xen-ops.h| 3 + > drivers/block/xen-blkfront.c | 122 > -- > drivers/net/xen-netfront.c| 96 +- > drivers/xen/events/events_base.c | 1 + > drivers/xen/manage.c | 46 ++ > drivers/xen/xenbus/xenbus_probe.c | 96 +- > include/linux/irq.h | 2 + > include/xen/xen-ops.h | 3 + > include/xen/xenbus.h | 3 + > kernel/irq/chip.c | 2 +- > kernel/irq/internals.h| 1 + > kernel/irq/pm.c | 31 +++--- > kernel/power/user.c | 7 ++- > 16 files changed, 464 insertions(+), 34 deletions(-) > > -- > 2.16.6 > A gentle ping on the series in case there is any more feedback or can we plan to merge this? I can then send the series with minor fixes pointed by tglx@ Thanks, Anchal
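The resume-offset helper quoted in the testing instructions above is Python 2 (note the print statement without parentheses). For anyone who prefers a compiled tool, a hypothetical stand-alone C equivalent using the same FIBMAP ioctl could look like this; it is an illustration, not part of the series:

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FIBMAP */

	int main(int argc, char **argv)
	{
		int block = 0;		/* logical block 0 of the swap file */
		int fd;

		if (argc != 2)
			return 1;
		fd = open(argv[1], O_RDONLY);
		/* FIBMAP needs root (CAP_SYS_RAWIO), same as the Python version. */
		if (fd < 0 || ioctl(fd, FIBMAP, &block) < 0) {
			perror("FIBMAP");
			return 1;
		}
		printf("%d\n", block);	/* use as resume_offset= */
		close(fd);
		return 0;
	}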
Re: [PATCH v3 00/11] Fix PM hibernation in Xen guests
On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal wrote: > > On Fri, Aug 21, 2020 at 10:22:43PM +, Anchal Agarwal wrote: > > Hello, > > This series fixes PM hibernation for hvm guests running on xen hypervisor. > > The running guest could now be hibernated and resumed successfully at a > > later time. The fixes for PM hibernation are added to block and > > network device drivers i.e xen-blkfront and xen-netfront. Any other driver > > that needs to add S4 support if not already, can follow same method of > > introducing freeze/thaw/restore callbacks. > > The patches had been tested against upstream kernel and xen4.11. Large > > scale testing is also done on Xen based Amazon EC2 instances. All this > > testing > > involved running memory exhausting workload in the background. > > > > Doing guest hibernation does not involve any support from hypervisor and > > this way guest has complete control over its state. Infrastructure > > restrictions for saving up guest state can be overcome by guest initiated > > hibernation. > > > > These patches were send out as RFC before and all the feedback had been > > incorporated in the patches. The last v1 & v2 could be found here: > > > > [v1]: https://lkml.org/lkml/2020/5/19/1312 > > [v2]: https://lkml.org/lkml/2020/7/2/995 > > All comments and feedback from v2 had been incorporated in v3 series. > > > > Known issues: > > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and > > has to be restarted. I will investigate this issue separately and shouldn't > > be a blocker for this patch series. > > 2. During hibernation, I observed sometimes that freezing of tasks fails due > > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1 > > out of 200 runs and hibernation is aborted in this case. Re-trying > > hibernation > > may work. Also, this is a known issue with hibernation and some > > filesystems like XFS has been discussed by the community for years with not > > an > > effectve resolution at this point. > > > > Testing How to: > > --- > > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream > > xen-4.11] > > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches > > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem > > images]. > > 3. Create a swap file size=RAM size > > 4. Update grub parameters and reboot > > 5. Trigger pm-hibernation from within the VM > > > > Example: > > Set up a file-backed swap space. 
Swap file size>=Total memory on the system > > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB > > sudo chmod 600 /swap > > sudo mkswap /swap > > sudo swapon /swap > > > > Update resume device/resume offset in grub if using swap file: > > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1 > > > > Execute: > > > > sudo pm-hibernate > > OR > > echo disk > /sys/power/state && echo reboot > /sys/power/disk > > > > Compute resume offset code: > > " > > #!/usr/bin/env python > > import sys > > import array > > import fcntl > > > > #swap file > > f = open(sys.argv[1], 'r') > > buf = array.array('L', [0]) > > > > #FIBMAP > > ret = fcntl.ioctl(f.fileno(), 0x01, buf) > > print buf[0] > > " > > > > Aleksei Besogonov (1): > > PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA > > > > Anchal Agarwal (4): > > x86/xen: Introduce new function to map HYPERVISOR_shared_info on > > Resume > > x86/xen: save and restore steal clock during PM hibernation > > xen: Introduce wrapper for save/restore sched clock offset > > xen: Update sched clock offset to avoid system instability in > > hibernation > > > > Munehisa Kamata (5): > > xen/manage: keep track of the on-going suspend mode > > xenbus: add freeze/thaw/restore callbacks support > > x86/xen: add system core suspend and resume callbacks > > xen-blkfront: add callbacks for PM suspend and hibernation > > xen-netfront: add callbacks for PM suspend and hibernation > > > > Thomas Gleixner (1): > > genirq: Shutdown irq chips in suspend/resume during hibernation > > > > arch/x86/xen/enlighten_hvm.c | 7 +++ > > arch/x86/xen/suspend.c| 63 > > arch/x86/xen/time.c | 15 - > > arch/x86/xen/xen-ops.h| 3 + > > drivers/block/xen-blkfront.c | 122 > > -- > > drivers/net/xen-netfront.c| 96 +- > > drivers/xen/events/events_base.c | 1 + > > drivers/xen/manage.c | 46 ++ > > drivers/xen/xenbus/xenbus_probe.c | 96 +- > > include/linux/irq.h | 2 + > > include/xen/xen-ops.h | 3 + > > include/xen/xenbus.h | 3 + > > kernel/irq/chip.c | 2 +- > > kernel/irq/internals.h| 1 + > > kernel/irq/pm.c | 31 +++--- > > kernel/power/user.c
Re: [PATCH] netfilter: nf_conntrack_sip: fix parsing error
I think the original code complaining about a parsing error is there for a reason. A better way might be to modify ct_sip_parse_numerical_param() so that it returns a real parsing error code, instead of only FOUND (1) and NOT FOUND (0), if that is deemed necessary. Once again, I'm not an expert and may be suggesting something stupid, so please pardon my ignorance. -- - Tong On Fri, Aug 28, 2020 at 2:19 PM Pablo Neira Ayuso wrote: > > Then perhaps update this code to ignore the return value?
Re: [PATCH net-next] net/sched: add act_ct_output support
On Tue, Aug 25, 2020 at 1:45 AM wrote: > > From: wenxu > > Fragmented packets are defragmented in the act_ct module. If the reassembled > packet has to be sent out to another net device and is bigger than that > device's MTU, it should be fragmented again before being sent. This patch > adds act ct_output to achieve this. There are a lot of things missing in your changelog. For example: why do we need a new action here? Why is segmentation not done on the target device? At least for the egress side, dev_queue_xmit() is called by act_mirred, and it will perform segmentation with skb_gso_segment() if needed. So why can't bigger packets be segmented here? Please add all these necessary details to your changelog. Thanks.
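To make the segmentation point concrete: on the transmit path, oversized GSO packets are already split in software before they reach the driver, roughly along these lines. This is a simplified sketch of the validate_xmit path, not a verbatim quote:

	/* Simplified: oversized GSO skbs are split in software at L2. */
	if (netif_needs_gso(skb, features)) {
		struct sk_buff *segs;

		segs = skb_gso_segment(skb, features);
		if (IS_ERR(segs))
			goto drop;
		if (segs)
			skb = segs;	/* list of MTU-sized segments */
	}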
Re: [PATCH v3 00/11] Fix PM hibernation in Xen guests
On Fri, Aug 28, 2020 at 08:29:24PM +0200, Rafael J. Wysocki wrote: > CAUTION: This email originated from outside of the organization. Do not click > links or open attachments unless you can confirm the sender and know the > content is safe. > > > > On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal wrote: > > > > On Fri, Aug 21, 2020 at 10:22:43PM +, Anchal Agarwal wrote: > > > Hello, > > > This series fixes PM hibernation for hvm guests running on xen hypervisor. > > > The running guest could now be hibernated and resumed successfully at a > > > later time. The fixes for PM hibernation are added to block and > > > network device drivers i.e xen-blkfront and xen-netfront. Any other driver > > > that needs to add S4 support if not already, can follow same method of > > > introducing freeze/thaw/restore callbacks. > > > The patches had been tested against upstream kernel and xen4.11. Large > > > scale testing is also done on Xen based Amazon EC2 instances. All this > > > testing > > > involved running memory exhausting workload in the background. > > > > > > Doing guest hibernation does not involve any support from hypervisor and > > > this way guest has complete control over its state. Infrastructure > > > restrictions for saving up guest state can be overcome by guest initiated > > > hibernation. > > > > > > These patches were send out as RFC before and all the feedback had been > > > incorporated in the patches. The last v1 & v2 could be found here: > > > > > > [v1]: https://lkml.org/lkml/2020/5/19/1312 > > > [v2]: https://lkml.org/lkml/2020/7/2/995 > > > All comments and feedback from v2 had been incorporated in v3 series. > > > > > > Known issues: > > > 1.KASLR causes intermittent hibernation failures. VM fails to resumes and > > > has to be restarted. I will investigate this issue separately and > > > shouldn't > > > be a blocker for this patch series. > > > 2. During hibernation, I observed sometimes that freezing of tasks fails > > > due > > > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may > > > be 1 > > > out of 200 runs and hibernation is aborted in this case. Re-trying > > > hibernation > > > may work. Also, this is a known issue with hibernation and some > > > filesystems like XFS has been discussed by the community for years with > > > not an > > > effectve resolution at this point. > > > > > > Testing How to: > > > --- > > > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 > > > +upstream > > > xen-4.11] > > > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches > > > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem > > > images]. > > > 3. Create a swap file size=RAM size > > > 4. Update grub parameters and reboot > > > 5. Trigger pm-hibernation from within the VM > > > > > > Example: > > > Set up a file-backed swap space. 
Swap file size>=Total memory on the > > > system > > > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB > > > sudo chmod 600 /swap > > > sudo mkswap /swap > > > sudo swapon /swap > > > > > > Update resume device/resume offset in grub if using swap file: > > > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1 > > > > > > Execute: > > > > > > sudo pm-hibernate > > > OR > > > echo disk > /sys/power/state && echo reboot > /sys/power/disk > > > > > > Compute resume offset code: > > > " > > > #!/usr/bin/env python > > > import sys > > > import array > > > import fcntl > > > > > > #swap file > > > f = open(sys.argv[1], 'r') > > > buf = array.array('L', [0]) > > > > > > #FIBMAP > > > ret = fcntl.ioctl(f.fileno(), 0x01, buf) > > > print buf[0] > > > " > > > > > > Aleksei Besogonov (1): > > > PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA > > > > > > Anchal Agarwal (4): > > > x86/xen: Introduce new function to map HYPERVISOR_shared_info on > > > Resume > > > x86/xen: save and restore steal clock during PM hibernation > > > xen: Introduce wrapper for save/restore sched clock offset > > > xen: Update sched clock offset to avoid system instability in > > > hibernation > > > > > > Munehisa Kamata (5): > > > xen/manage: keep track of the on-going suspend mode > > > xenbus: add freeze/thaw/restore callbacks support > > > x86/xen: add system core suspend and resume callbacks > > > xen-blkfront: add callbacks for PM suspend and hibernation > > > xen-netfront: add callbacks for PM suspend and hibernation > > > > > > Thomas Gleixner (1): > > > genirq: Shutdown irq chips in suspend/resume during hibernation > > > > > > arch/x86/xen/enlighten_hvm.c | 7 +++ > > > arch/x86/xen/suspend.c| 63 > > > arch/x86/xen/time.c | 15 - > > > arch/x86/xen/xen-ops.h| 3 + > > > drivers/block/xen-blkfront.c | 122 > > > -- > > > drivers/net/xen-netfront.c| 96 +-
RE: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
Lukas Wunner wrote: > Commit e687ad60af09 ("netfilter: add netfilter ingress hook after > handle_ing() under unique static key") introduced the ability to > classify packets on ingress. > > Support the same on egress. This allows filtering locally generated > traffic such as DHCP, or outbound AF_PACKETs in general. It will also > allow introducing in-kernel NAT64 and NAT46. A patch for nftables to > hook up egress rules from user space has been submitted separately. > > Position the hook immediately before a packet is handed to traffic > control and then sent out on an interface, thereby mirroring the ingress > order. This order allows marking packets in the netfilter egress hook > and subsequently using the mark in tc. Another benefit of this order is > consistency with a lot of existing documentation which says that egress > tc is performed after netfilter hooks. > > To avoid a performance degradation in the default case (with neither > netfilter nor traffic control used), Daniel Borkmann suggests "a single > static_key which wraps an empty function call entry which can then be > patched by the kernel at runtime. Inside that trampoline we can still > keep the ordering [between netfilter and traffic control] intact": > > https://lore.kernel.org/netdev/20200318123315.gi...@breakpoint.cc/ > > To this end, introduce nf_sch_egress() which is dynamically patched into > __dev_queue_xmit(), contingent on egress_needed_key. Inside that > function, nf_egress() and sch_handle_egress() is called, each contingent > on its own separate static_key. > > nf_sch_egress() is declared noinline per Florian Westphal's suggestion. > This change alone causes a speedup if neither netfilter nor traffic > control is used, apparently because it reduces instruction cache > pressure. The same effect was previously observed by Eric Dumazet for > the ingress path: > > https://lore.kernel.org/netdev/1431387038.566.47.ca...@edumazet-glaptop2.roam.corp.google.com/ > > Overall, performance improves with this commit if neither netfilter nor > traffic control is used. However it degrades a little if only traffic > control is used, due to the "noinline", the additional outer static key > and the added netfilter code: > > * Before: 4730418pps 2270Mb/sec (2270600640bps) > * After:4759206pps 2284Mb/sec (2284418880bps) These baseline numbers seem low to me. > > * Before + tc: 4063912pps 1950Mb/sec (1950677760bps) > * After + tc: 4007728pps 1923Mb/sec (1923709440bps) > > * After + nft: 3714546pps 1782Mb/sec (1782982080bps) > > Measured on a bare-metal Core i7-3615QM. OK I have some server class systems here I would like to run these benchmarks again on to be sure we don't have any performance regressions on that side. I'll try to get to it asap, but likely will be Monday morning by the time I get to it. I assume that should be no problem seeing we are only on rc2. Thanks. > > Commands to perform a measurement: > ip link add dev foo type dummy > ip link set dev foo up > modprobe pktgen > echo "add_device foo" > /proc/net/pktgen/kpktgend_3 > samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 4 -m > "11:11:11:11:11:11" -d 1.1.1.1 Thats a single thread correct? -t option if I recall correctly. I think we should also try with many threads to see if that makes a difference. I guess probably not, but lets see. 
> > Commands to enable egress traffic control: > tc qdisc add dev foo clsact > tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' > > Commands to enable egress netfilter: > nft add table netdev t > nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; > \} > nft add rule netdev t co ip daddr 4.3.2.1/32 drop > I'll give the above a try.
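For reviewers who have not read the series yet, the trampoline described in the changelog has roughly the following shape. The key and function names are taken from the description above and may not match the patch exactly:

	/* Rough shape of the patched-in trampoline: one outer static key
	 * guards the call from __dev_queue_xmit(); netfilter and tc each
	 * keep their own key inside, preserving the nf-before-tc order. */
	static noinline struct sk_buff *
	nf_sch_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
	{
		if (static_branch_unlikely(&nf_egress_needed_key)) {
			skb = nf_egress(skb, dev);	/* netfilter egress hook */
			if (!skb)
				return NULL;
		}

		if (static_branch_unlikely(&tc_egress_needed_key))
			skb = sch_handle_egress(skb, ret, dev);	/* tc clsact egress */

		return skb;
	}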
Re: [PATCHi v2] net: mdiobus: fix device unregistering in mdiobus_register
On 28.08.2020 16:15, Sascha Hauer wrote: > On Thu, Aug 27, 2020 at 10:48:48AM +0200, Heiner Kallweit wrote: >> On 27.08.2020 09:06, Sascha Hauer wrote: >>> After device_register has been called the device structure may not be >>> freed anymore, put_device() has to be called instead. This gets violated >>> when device_register() or any of the following steps before the mdio >>> bus is fully registered fails. In this case the caller will call >>> mdiobus_free() which then directly frees the mdio bus structure. >>> >>> Set bus->state to MDIOBUS_UNREGISTERED right before calling >>> device_register(). With this mdiobus_free() calls put_device() instead >>> as it ought to be. >>> >>> Signed-off-by: Sascha Hauer >>> --- >>> >>> Changes since v1: >>> - set bus->state before calling device_register(), not afterwards >>> >>> drivers/net/phy/mdio_bus.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c >>> index 0af20faad69d..9434b04a11c8 100644 >>> --- a/drivers/net/phy/mdio_bus.c >>> +++ b/drivers/net/phy/mdio_bus.c >>> @@ -534,6 +534,8 @@ int __mdiobus_register(struct mii_bus *bus, struct >>> module *owner) >>> bus->dev.groups = NULL; >>> dev_set_name(&bus->dev, "%s", bus->id); >>> >>> + bus->state = MDIOBUS_UNREGISTERED; >>> + >>> err = device_register(&bus->dev); >>> if (err) { >>> pr_err("mii_bus %s failed to register\n", bus->id); >>> >> LGTM. Just two points: >> 1. Subject has a typo (PATCHi). And it should be [PATCH net v2], because it's >>something for the stable branch. >> 2. A "Fixes" tag is needed. > > Uh, AFAICT this fixes a patch from 2008, this makes for quite some > stable updates :) > There's just a handful of LTS kernel versions (oldest is 4.4), therefore it shouldn't be that bad. But right, for things that have always been like they are now, sometimes it's tricky to find a proper Fixes tag. > Sascha > > | commit 161c8d2f50109b44b664eaf23831ea1587979a61 > | Author: Krzysztof Halasa > | Date: Thu Dec 25 16:50:41 2008 -0800 > | > | net: PHYLIB mdio fixes #2 >
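For context on why the one-line state change is sufficient: mdiobus_free() picks between kfree() and put_device() based on bus->state, roughly as follows (paraphrased from memory, not a verbatim quote):

	void mdiobus_free(struct mii_bus *bus)
	{
		/* For compatibility with error handling in drivers. */
		if (bus->state == MDIOBUS_ALLOCATED) {
			kfree(bus);
			return;
		}

		BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
		bus->state = MDIOBUS_RELEASED;

		put_device(&bus->dev);
	}

With the state set to MDIOBUS_UNREGISTERED before device_register() is called, any failure path that ends in mdiobus_free() now goes through put_device(), which is what the device core requires once device_register() has run.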