general protection fault in find_match (2)
Hello,

syzbot found the following issue on:

HEAD commit:    3db1a3fa Merge tag 'staging-5.11-rc1' of git://git.kernel...
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1103eadf50
kernel config:  https://syzkaller.appspot.com/x/.config?x=2764fc28a92339f9
dashboard link: https://syzkaller.appspot.com/bug?extid=b08cdcfff539328e6c32
compiler:       gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b08cdcfff539328e6...@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 0xdc4b: [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0258-0x025f]
CPU: 1 PID: 13682 Comm: syz-executor.5 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:407 [inline]
RIP: 0010:find_match.part.0+0xcc/0xc70 net/ipv6/route.c:753
Code: f9 0f b6 45 c0 84 c0 0f 84 39 04 00 00 e8 ec a0 c4 f9 49 8d bf 5c 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 91
RSP: 0018:c900022cf000 EFLAGS: 00010207
RAX: dc00 RBX: 88801c2cc2a0 RCX: c90013d33000
RDX: 004b RSI: 87abfc74 RDI: 025c
RBP: c900022cf070 R08: 0001 R09: c900022cf250
R10: R11: R12: 0003
R13: 0001 R14: R15:
FS:  7f94614f2700() GS:8880b9f0() knlGS:
CS:  0010 DS: ES: CR0: 80050033
CR2: 7f75a1fccdb8 CR3: 2d8c6000 CR4: 001506e0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400
Call Trace:
 find_match net/ipv6/route.c:840 [inline]
 __find_rr_leaf+0x17f/0xd10 net/ipv6/route.c:841
 find_rr_leaf net/ipv6/route.c:862 [inline]
 rt6_select net/ipv6/route.c:906 [inline]
 fib6_table_lookup+0x5b3/0xa20 net/ipv6/route.c:2193
 ip6_pol_route+0x1e1/0x11c0 net/ipv6/route.c:2229
 pol_lookup_func include/net/ip6_fib.h:583 [inline]
 fib6_rule_lookup+0x111/0x6f0 net/ipv6/fib6_rules.c:115
 ip6_route_output_flags_noref+0x2c2/0x360 net/ipv6/route.c:2510
 ip6_route_output_flags+0x8b/0x310 net/ipv6/route.c:2523
 ip6_route_output include/net/ip6_route.h:98 [inline]
 ip6_dst_lookup_tail+0xb3a/0x1700 net/ipv6/ip6_output.c:1024
 ip6_dst_lookup_flow+0x8c/0x1d0 net/ipv6/ip6_output.c:1154
 ip6_sk_dst_lookup_flow+0x55c/0x990 net/ipv6/ip6_output.c:1192
 udpv6_sendmsg+0x18a5/0x2bd0 net/ipv6/udp.c:1508
 inet6_sendmsg+0x99/0xe0 net/ipv6/af_inet6.c:638
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 sys_sendmsg+0x6e8/0x810 net/socket.c:2336
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e149
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f94614f1c68 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 0003 RCX: 0045e149
RDX: RSI: 2040 RDI: 0005
RBP: 0119bfc0 R08: R09:
R10: R11: 0246 R12: 0119bf8c
R13: 7ffd9413345f R14: 7f94614f29c0 R15: 0119bf8c
Modules linked in:
---[ end trace a9799337710952e8 ]---
RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:407 [inline]
RIP: 0010:find_match.part.0+0xcc/0xc70 net/ipv6/route.c:753
Code: f9 0f b6 45 c0 84 c0 0f 84 39 04 00 00 e8 ec a0 c4 f9 49 8d bf 5c 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 91
RSP: 0018:c900022cf000 EFLAGS: 00010207
RAX: dc00 RBX: 88801c2cc2a0 RCX: c90013d33000
RDX: 004b RSI: 87abfc74 RDI: 025c
RBP: c900022cf070 R08: 0001 R09: c900022cf250
R10: R11: R12: 0003
R13: 0001 R14: R15:
FS:  7f94614f2700() GS:8880b9f0() knlGS:
CS:  0010 DS: ES: CR0: 80050033
CR2: CR3: 2d8c6000 CR4: 001506e0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
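For readers without a tree handy, the inlined helper the RIP points at looked roughly like this around v5.10 (quoted from memory, so treat it as a sketch rather than the exact source):

	static inline bool ip6_ignore_linkdown(const struct net_device *dev)
	{
		const struct inet6_dev *idev = __in6_dev_get(dev);

		/* idev is not checked for NULL here; the KASAN range
		 * 0x0258-0x025f is consistent with reading ->cnf at a
		 * small offset from a NULL idev.
		 */
		return !!idev->cnf.ignore_routes_with_linkdown;
	}

So the obvious direction for a fix is to check the __in6_dev_get() result before dereferencing it.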
[RFC PATCH 0/1] net: arcnet: Fix RESET sequence
Folks,

At drivers/net/arcnet/arcnet.c, there is:

  irqreturn_t arcnet_interrupt(int irq, void *dev_id)
  {
	...
	if (status & RESETflag) {
		arcnet_close(dev);
		arcnet_open(dev);
	}
	...
  }

  struct net_device_ops arcnet_netdev_ops = {
	.ndo_open  = arcnet_open,
	.ndo_stop  = arcnet_close,
	...
  };

which is wrong, in many ways:

  1) In general, interrupt handlers should never call ->ndo_stop() and
     ->ndo_open() functions. They are usually full of blocking calls and
     other methods that are expected to be called only from drivers
     init/exit code paths.

  2) arcnet_close() contains a del_timer_sync(). If the irq handler
     interrupts the to-be-deleted timer and then calls del_timer_sync(),
     it will just loop forever.

  3) arcnet_close() also calls tasklet_kill(), which has a warning if
     called from irq context.

  4) For device reset, the sequence "arcnet_close(); arcnet_open();" is
     not complete. Some children arcnet drivers have special init/exit
     code sequences, which then embed a call to arcnet_open() and
     arcnet_close() accordingly. Check drivers/net/arcnet/com20020.c.

Included is an RFC patch to fix the points above: if the RESET flag is
encountered, a workqueue is scheduled to run the generic reset sequence.

Note: Only compile-tested, as I do not have the hardware in question.

Thanks,

8<--

Ahmed S. Darwish (1):
  net: arcnet: Fix RESET flag handling

 drivers/net/arcnet/arc-rimi.c     |  4 +-
 drivers/net/arcnet/arcdevice.h    |  6 +++
 drivers/net/arcnet/arcnet.c       | 69 +--
 drivers/net/arcnet/com20020-isa.c |  4 +-
 drivers/net/arcnet/com20020-pci.c |  2 +-
 drivers/net/arcnet/com20020_cs.c  |  2 +-
 drivers/net/arcnet/com90io.c      |  4 +-
 drivers/net/arcnet/com90xx.c      |  4 +-
 8 files changed, 81 insertions(+), 14 deletions(-)

base-commit: 2c85ebc57b3e1817b6ce1a6b703928e113a90442
--
2.29.2
[RFC PATCH 1/1] net: arcnet: Fix RESET flag handling
The main arcnet interrupt handler calls arcnet_close() then
arcnet_open(), if the RESET status flag is encountered. This is invalid:

  1) In general, interrupt handlers should never call ->ndo_stop() and
     ->ndo_open() functions. They are usually full of blocking calls and
     other methods that are expected to be called only from drivers init
     and exit code paths.

  2) arcnet_close() contains a del_timer_sync(). If the irq handler
     interrupts the to-be-deleted timer, del_timer_sync() will just loop
     forever.

  3) arcnet_close() also calls tasklet_kill(), which has a warning if
     called from irq context.

  4) For device reset, the sequence "arcnet_close(); arcnet_open();" is
     not complete. Some children arcnet drivers have special init/exit
     code sequences, which then embed a call to arcnet_open() and
     arcnet_close() accordingly. Check drivers/net/arcnet/com20020.c.

Run the device RESET sequence from a scheduled workqueue instead.

Signed-off-by: Ahmed S. Darwish
---
 drivers/net/arcnet/arc-rimi.c     |  4 +-
 drivers/net/arcnet/arcdevice.h    |  6 +++
 drivers/net/arcnet/arcnet.c       | 69 +--
 drivers/net/arcnet/com20020-isa.c |  4 +-
 drivers/net/arcnet/com20020-pci.c |  2 +-
 drivers/net/arcnet/com20020_cs.c  |  2 +-
 drivers/net/arcnet/com90io.c      |  4 +-
 drivers/net/arcnet/com90xx.c      |  4 +-
 8 files changed, 81 insertions(+), 14 deletions(-)

diff --git a/drivers/net/arcnet/arc-rimi.c b/drivers/net/arcnet/arc-rimi.c
index 98df38fe553c..12d085405bd0 100644
--- a/drivers/net/arcnet/arc-rimi.c
+++ b/drivers/net/arcnet/arc-rimi.c
@@ -332,7 +332,7 @@ static int __init arc_rimi_init(void)
 	dev->irq = 9;
 
 	if (arcrimi_probe(dev)) {
-		free_netdev(dev);
+		free_arcdev(dev);
 		return -EIO;
 	}
 
@@ -349,7 +349,7 @@ static void __exit arc_rimi_exit(void)
 	iounmap(lp->mem_start);
 	release_mem_region(dev->mem_start, dev->mem_end - dev->mem_start + 1);
 	free_irq(dev->irq, dev);
-	free_netdev(dev);
+	free_arcdev(dev);
 }
 
 #ifndef MODULE
diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
index 22a49c6d7ae6..5d4a4c7efbbf 100644
--- a/drivers/net/arcnet/arcdevice.h
+++ b/drivers/net/arcnet/arcdevice.h
@@ -298,6 +298,10 @@ struct arcnet_local {
 
 	int excnak_pending;    /* We just got an excesive nak interrupt */
 
+	/* RESET flag handling */
+	int reset_in_progress;
+	struct work_struct reset_work;
+
 	struct {
 		uint16_t sequence;	/* sequence number (incs with each packet) */
 		__be16 aborted_seq;
@@ -350,7 +354,9 @@ void arcnet_dump_skb(struct net_device *dev, struct sk_buff *skb, char *desc)
 
 void arcnet_unregister_proto(struct ArcProto *proto);
 irqreturn_t arcnet_interrupt(int irq, void *dev_id);
+
 struct net_device *alloc_arcdev(const char *name);
+void free_arcdev(struct net_device *dev);
 
 int arcnet_open(struct net_device *dev);
 int arcnet_close(struct net_device *dev);
diff --git a/drivers/net/arcnet/arcnet.c b/drivers/net/arcnet/arcnet.c
index e04efc0a5c97..563c43ae5cce 100644
--- a/drivers/net/arcnet/arcnet.c
+++ b/drivers/net/arcnet/arcnet.c
@@ -387,10 +387,46 @@ static void arcnet_timer(struct timer_list *t)
 	struct arcnet_local *lp = from_timer(lp, t, timer);
 	struct net_device *dev = lp->dev;
 
-	if (!netif_carrier_ok(dev)) {
+	spin_lock_irq(&lp->lock);
+
+	if (!lp->reset_in_progress && !netif_carrier_ok(dev)) {
 		netif_carrier_on(dev);
 		netdev_info(dev, "link up\n");
 	}
+
+	spin_unlock_irq(&lp->lock);
+}
+
+static void reset_device_work(struct work_struct *work)
+{
+	struct arcnet_local *lp;
+	struct net_device *dev;
+
+	lp = container_of(work, struct arcnet_local, reset_work);
+	dev = lp->dev;
+
+	/*
+	 * Do not bring the network interface back up if an ifdown
+	 * was already done.
+	 */
+	if (!netif_running(dev) || !lp->reset_in_progress)
+		return;
+
+	rtnl_lock();
+
+	/*
+	 * Do another check, in case of an ifdown that was triggered in
+	 * the small race window between the exit condition above and
+	 * acquiring RTNL.
+	 */
+	if (!netif_running(dev) || !lp->reset_in_progress)
+		goto out;
+
+	dev_close(dev);
+	dev_open(dev, NULL);
+
+out:
+	rtnl_unlock();
 }
 
 static void arcnet_reply_tasklet(unsigned long data)
@@ -452,12 +488,26 @@ struct net_device *alloc_arcdev(const char *name)
 		lp->dev = dev;
 		spin_lock_init(&lp->lock);
 		timer_setup(&lp->timer, arcnet_timer, 0);
+		INIT_WORK(&lp->reset_work, reset_device_work);
 	}
 
 	return dev;
 }
 EXPORT_SYMBOL(alloc_arcdev);
 
+void free_arcdev(struct net_device *dev)
+{
+	struct arcnet_local
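The hunk that wires this into arcnet_interrupt() falls outside the quoted portion above. Conceptually, the RESET branch stops calling arcnet_close()/arcnet_open() directly and instead defers to the new work item, along these lines (an illustrative sketch of the idea, not a quote from the patch):

	if (status & RESETflag) {
		/* Defer the reset: ->ndo_open()/->ndo_stop(),
		 * del_timer_sync() and tasklet_kill() must not run
		 * from hard IRQ context.
		 */
		lp->reset_in_progress = 1;
		schedule_work(&lp->reset_work);
	}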
Re: [PATCH net v2 2/3] net: move the xps cpus retrieval out of net-sysfs
Hi Alexander,

Quoting Alexander Duyck (2020-12-21 23:33:15)
>
> One thing I might change is to actually bump this patch up in the
> patch set as I think it would probably make things a bit cleaner to
> read as you are going to be moving the other functions to this pattern
> as well.

Right. If it were not for net (vs net-next), I would have split the
patches a bit differently to make things easier to review. But those
patches are fixes and can be backported to older kernel versions.
They're fixing 2 commits that were introduced in different versions, so
this patch has to be made before the next one, as it is fixing older
kernels.

(I also did not give a hint in the commit message about what is done in
patch 3, for the same reason. But I agree that's arguable.)

Thanks,
Antoine
Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num
Hello Alexander, Jakub,

Quoting Alexander Duyck (2020-12-22 00:21:57)
>
> Looking over this patch it seems kind of obvious that extending the
> xps_map_mutex is making things far more complex then they need to be.
>
> Applying the rtnl_mutex would probably be much simpler. Although as I
> think you have already discovered we need to apply it to the store,
> and show for this interface. In addition we probably need to perform
> similar locking around traffic_class_show in order to prevent it from
> generating a similar error.

I don't think we have the same kind of issues with traffic_class_show:
dev->num_tc is used, but not for navigating through the map. Protecting
only a single read wouldn't change much. We can still think about what
could go wrong here without the lock, but that is not related to this
series of fixes.

If I understood correctly, as things are a bit too complex now, you
would prefer that we go for the solution proposed in v1? I can still do
the code factoring for the 2 sysfs show operations, but that would then
target net-next and would be in a different series. So I believe we'll
use the patches of v1, unmodified.

Jakub, should I send a v3 then?

Thanks!
Antoine
Re: [PATCH v3 4/5] Bluetooth: advmon offload MSFT handle controller reset
Hi Archie,

>>> When the controller is powered off, the registered advertising monitor
>>> is removed from the controller. This patch handles the re-registration
>>> of those monitors when the power is on.
>>>
>>> Signed-off-by: Archie Pusaka
>>> Reviewed-by: Miao-chen Chou
>>> Reviewed-by: Yun-Hao Chung
>>>
>>> ---
>>>
>>> (no changes since v1)
>>>
>>>  net/bluetooth/msft.c | 79 +---
>>>  1 file changed, 74 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/bluetooth/msft.c b/net/bluetooth/msft.c
>>> index f5aa0e3b1b9b..7e33a85c3f1c 100644
>>> --- a/net/bluetooth/msft.c
>>> +++ b/net/bluetooth/msft.c
>>> @@ -82,8 +82,15 @@ struct msft_data {
>>>  	struct list_head handle_map;
>>>  	__u16 pending_add_handle;
>>>  	__u16 pending_remove_handle;
>>> +
>>> +	struct {
>>> +		u8 reregistering:1;
>>> +	} flags;
>>>  };
>>
>> hmmm. Do you have bigger plans with this struct? I would just skip it.
>>
> This struct is also used in patch 5/5 to store the "enabled" status of
> the filter.
> Suspend/resume would need to enable/disable the filter, but it is not
> yet implemented in this patch series.

just do it without the nested structs. I think you are overdoing it here.

Regards

Marcel
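To make the suggestion concrete: the ask is for the flag to live directly in struct msft_data rather than inside a nested anonymous struct. A minimal sketch of that shape (illustrative only, not the final patch) would be:

	struct msft_data {
		...
		__u16 pending_remove_handle;
		u8 reregistering:1;	/* flat bitfield, no nested "flags" wrapper */
	};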
Re: [RFC PATCH v2 1/8] dt-bindings: net: sparx5: Add sparx5-switch bindings
Hi Rob,

On Mon, 2020-12-21 at 14:40 -0700, Rob Herring wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
>
> On Thu, Dec 17, 2020 at 08:51:27AM +0100, Steen Hegelund wrote:
> > Document the Sparx5 switch device driver bindings
> >
> > Signed-off-by: Steen Hegelund
> > Signed-off-by: Lars Povlsen
> > ---
> >  .../bindings/net/microchip,sparx5-switch.yaml | 178 ++
> >  1 file changed, 178 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml b/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> > new file mode 100644
> > index ..6e3ef8285e9a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> > @@ -0,0 +1,178 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/net/microchip,sparx5-switch.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Microchip Sparx5 Ethernet switch controller
> > +
> > +maintainers:
> > +  - Lars Povlsen
> > +  - Steen Hegelund
> > +
> > +description: |
> > +  The SparX-5 Enterprise Ethernet switch family provides a rich set of
> > +  Enterprise switching features such as advanced TCAM-based VLAN and
> > +  QoS processing enabling delivery of differentiated services, and
> > +  security through TCAM-based frame processing using versatile content
> > +  aware processor (VCAP).
> > +
> > +  IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
> > +  with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K
> > +  IPv6 (S,G) multicast groups.
> > +
> > +  L3 security features include source guard and reverse path
> > +  forwarding (uRPF) tasks. Additional L3 features include VRF-Lite and
> > +  IP tunnels (IP over GRE/IP).
> > +
> > +  The SparX-5 switch family targets managed Layer 2 and Layer 3
> > +  equipment in SMB, SME, and Enterprise where high port count
> > +  1G/2.5G/5G/10G switching with 10G/25G aggregation links is required.
> > +
> > +properties:
> > +  $nodename:
> > +    pattern: "^switch@[0-9a-f]+$"
> > +
> > +  compatible:
> > +    const: microchip,sparx5-switch
> > +
> > +  reg:
> > +    minItems: 2
> > +
> > +  reg-names:
> > +    minItems: 2
>
> This is the default based on 'items' length.

Does that mean that I should omit minItems here?

> > +    items:
> > +      - const: devices
> > +      - const: gcb
> > +
> > +  interrupts:
> > +    maxItems: 1
> > +    description: Interrupt used for reception of packets to the CPU
> > +
> > +  ethernet-ports:
> > +    type: object
> > +    properties:
> > +      '#address-cells':
> > +        const: 1
> > +      '#size-cells':
> > +        const: 0
> > +
> > +    patternProperties:
> > +      "^port@[0-9]+$":
> > +        type: object
> > +        description: Switch ports
> > +
> > +        allOf:
> > +          - $ref: ethernet-controller.yaml#
> > +
> > +        properties:
> > +          reg:
> > +            description: Switch port number
> > +
> > +          max-speed:
> > +            maxItems: 1
>
> Is that an array?

No it is just a single value.

> > +            description: Bandwidth allocated to this port
> > +
> > +          phys:
>
> How many? (maxItems)

I will add "maxItems: 1"

> > +            description: phandle of a Ethernet Serdes PHY
> > +
> > +          phy-handle:
> > +            description: phandle of a Ethernet PHY
> > +
> > +          phy-mode:
> > +            description: Interface between the serdes and the phy
>
> The whole set of modes defined is supported?

This driver does not impose any limits on phy-mode. It is passed on to
the phy, so all modes are supported as I see it.

> > +
> > +          sfp:
> > +            description: phandle of an SFP
> > +
> > +          managed:
> > +            maxItems: 1
>
> An array?

No just a single item.

Thanks for your comments.

BR
Steen

> > +            description: SFP management
> > +
> > +        required:
> > +          - reg
> > +          - max-speed
> > +          - phys
> > +
> > +        oneOf:
> > +          - required:
> > +              - phy-handle
> > +              - phy-mode
> > +          - required:
> > +              - sfp
> > +              - managed
> > +
> > +        additionalProperties: false
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +  - reg-names
> > +  - interrupts
> > +  - ethernet-ports
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    #include
> > +    switch: switch@6 {
> > +      compatible = "microchip,sparx5-switch";
> > +      reg = <0x1000 0x80>,
> > +            <0x1101 0x1b0>;
> > +      reg-names = "devices", "gcb";
> > +
> > +      in
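For clarity, the change agreed to above would presumably make the phys property look something like this in the next revision (a sketch of the follow-up, not the applied binding):

	phys:
	  maxItems: 1
	  description: phandle of a Ethernet Serdes PHY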
RE: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
> -----Original Message-----
> From: Marc Kleine-Budde
> Sent: November 6, 2020 19:33
> To: Joakim Zhang ; robh...@kernel.org; shawn...@kernel.org; s.ha...@pengutronix.de
> Cc: ker...@pengutronix.de; dl-linux-imx ; linux-...@vger.kernel.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
>
> On 11/6/20 11:56 AM, Joakim Zhang wrote:
> > Add stop mode support for i.MX8QM.
> >
> > ChangeLogs:
> > V4->V5:
> >	* remove patch:firmware: imx: always export SCU symbols, since
> >	  it done by commit: 95de5094f5ac firmware: imx: add dummy functions
> >	* rebase to fsl,flexcan.yaml
> >
> > V3->V4:
> >	* can_idx->scu_idx.
> >	* return imx_scu_get_handle(&priv->sc_ipc_handle);
> >	* failed_canregister->failed_setup_stop_mode.
> >
> > V2->V3:
> >	* define IMX_SC_R_CAN(x) in rsrc.h
> >	* remove error message on -EPROBE_DEFER.
> >	* split disable wakeup patch into separate one.
> >
> > V1->V2:
> >	* split ECC fix patches into separate patches.
> >	* free can dev if failed to setup stop mode.
> >	* disable wakeup on flexcan_remove.
> >	* add FLEXCAN_IMX_SC_R_CAN macro helper.
> >	* fsl,can-index->fsl,scu-index.
> >	* move fsl,scu-index and priv->can_idx into
> >	* flexcan_setup_stop_mode_scfw()
> >	* prove failed if failed to setup stop mode.
> >
> > Joakim Zhang (5):
> >   dt-bindings: can: flexcan: fix fsl,clk-source property
>
> added to linux-can/testing
>
> >   dt-bindings: can: flexcan: add fsl,scu-index property to indicate a
> >     resource
> >   can: flexcan: rename macro FLEXCAN_QUIRK_SETUP_STOP_MODE ->
> >     FLEXCAN_QUIRK_SETUP_STOP_MODE_GPR
> >   dt-bindings: firmware: add IMX_SC_R_CAN(x) macro for CAN
> >   can: flexcan: add CAN wakeup function for i.MX8QM
>
> The others go via linux-can-next/testing, once net/master is merged back to
> net-next/master to have the yaml bindings.

Hi Marc,

How about below patches? I even can't see it in your linux-can-next/testing
branch. Are these missed?

  dt-bindings: can: flexcan: add fsl,scu-index property to indicate a resource
  can: flexcan: add CAN wakeup function for i.MX8QM

Best Regards,
Joakim Zhang
Re: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
On 12/22/20 12:37 PM, Joakim Zhang wrote:
> How about below patches? I even can't see it in your linux-can-next/testing
> branch. Are these missed?
>   dt-bindings: can: flexcan: add fsl,scu-index property to indicate a resource
>   can: flexcan: add CAN wakeup function for i.MX8QM

The patch "firmware: imx: always export SCU symbols" is not yet in
net-next/master, so the flexcan patch will not compile where the SCU is
switched off.

Marc

--
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-     |

signature.asc
Description: OpenPGP digital signature
[PATCH] rtw88: coex: remove useless if and else
Fix the following coccinelle report:

drivers/net/wireless/realtek/rtw88/coex.c:1619:3-5:
WARNING: possible condition with no effect (if == else)

Signed-off-by: Tian Tao
---
 drivers/net/wireless/realtek/rtw88/coex.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtw88/coex.c b/drivers/net/wireless/realtek/rtw88/coex.c
index 24530ca..df6676a 100644
--- a/drivers/net/wireless/realtek/rtw88/coex.c
+++ b/drivers/net/wireless/realtek/rtw88/coex.c
@@ -1616,12 +1616,7 @@ static void rtw_coex_action_bt_relink(struct rtw_dev *rtwdev)
 	if (efuse->share_ant) {
 		/* Shared-Ant */
 		if (coex_stat->wl_gl_busy) {
 			table_case = 26;
-			if (coex_stat->bt_hid_exist &&
-			    coex_stat->bt_profile_num == 1) {
-				tdma_case = 20;
-			} else {
-				tdma_case = 20;
-			}
+			tdma_case = 20;
 		} else {
 			table_case = 1;
 			tdma_case = 0;
--
2.7.4
[PATCH v3] net: neighbor: fix a crash caused by mod zero
pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY) might be
zero at that point, because someone could write zero to
/proc/sys/net/ipv4/neigh/[device]/proxy_delay after the callers check it.

This patch makes pneigh_enqueue() get a delay time passed in by the
callers, and the callers guarantee it is not zero.

Signed-off-by: weichenchen
---
V3:
    - Callers need to pass the delay time to pneigh_enqueue() now and
      they should guarantee it is not zero.
    - Use READ_ONCE() to read NEIGH_VAR(p, PROXY_DELAY) in both of the
      existing callers of pneigh_enqueue() and then pass it to
      pneigh_enqueue().
V2:
    - Use READ_ONCE() to prevent the compiler from re-reading
      NEIGH_VAR(p, PROXY_DELAY).
    - Give a hint to the compiler that delay <= 0 is unlikely to happen.
---
 include/net/neighbour.h | 2 +-
 net/core/neighbour.c    | 5 ++---
 net/ipv4/arp.c          | 8 +---
 net/ipv6/ndisc.c        | 6 +++---
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 22ced1381ede..f7564dc5304d 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -352,7 +352,7 @@ struct net *neigh_parms_net(const struct neigh_parms *parms)
 unsigned long neigh_rand_reach_time(unsigned long base);
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
-		    struct sk_buff *skb);
+		    struct sk_buff *skb, int delay);
 struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net,
 				   const void *key, struct net_device *dev,
 				   int creat);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 9500d28a43b0..b440f966d109 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1567,12 +1567,11 @@ static void neigh_proxy_process(struct timer_list *t)
 }
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
-		    struct sk_buff *skb)
+		    struct sk_buff *skb, int delay)
 {
 	unsigned long now = jiffies;
 
-	unsigned long sched_next = now + (prandom_u32() %
-					  NEIGH_VAR(p, PROXY_DELAY));
+	unsigned long sched_next = now + (prandom_u32() % delay);
 
 	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 922dd73e5740..6ddce6e0a648 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -841,20 +841,22 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 		    arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
 		    (rt->dst.dev != dev &&
 		     pneigh_lookup(&arp_tbl, net, &tip, dev, 0 {
+			int delay;
+
 			n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
 			if (n)
 				neigh_release(n);
 
+			delay = READ_ONCE(NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY));
 			if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
-			    skb->pkt_type == PACKET_HOST ||
-			    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) {
+			    skb->pkt_type == PACKET_HOST || delay == 0) {
 				arp_send_dst(ARPOP_REPLY, ETH_P_ARP, sip,
 					     dev, tip, sha, dev->dev_addr,
 					     sha, reply_dst);
 			} else {
 				pneigh_enqueue(&arp_tbl,
-					       in_dev->arp_parms, skb);
+					       in_dev->arp_parms, skb, delay);
 				goto out_free_dst;
 			}
 			goto out_consume_skb;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 76717478f173..efdaaab47535 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -892,10 +892,10 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 	    (idev->cnf.forwarding &&
 	     (net->ipv6.devconf_all->proxy_ndp || idev->cnf.proxy_ndp) &&
 	     (is_router = pndisc_is_router(&msg->target, dev)) >= 0)) {
+		int delay = READ_ONCE(NEIGH_VAR(idev->nd_parms, PROXY_DELAY));
 		if (!(NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED) &&
 		    skb->pkt_type != PACKET_HOST &&
 		    inc &&
-		    NEIGH_VAR(idev->nd_parms, PROXY_DELAY) != 0) {
+
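The net/ipv6/ndisc.c hunk is cut off above. Purely to illustrate the calling convention the commit message describes (the caller reads PROXY_DELAY once with READ_ONCE() and only enqueues with a non-zero delay), a caller follows a pattern along these lines; tbl, parms, skb and reply_directly() are placeholders, and this is a sketch rather than the missing hunk:

	int delay = READ_ONCE(NEIGH_VAR(parms, PROXY_DELAY));

	if (delay == 0) {
		/* answer immediately: never pass 0 down, because
		 * pneigh_enqueue() computes prandom_u32() % delay
		 */
		reply_directly(skb);
	} else {
		pneigh_enqueue(tbl, parms, skb, delay);
	}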
Re: [PATCH] dt-bindings: net: qcom,ipa: Drop unnecessary type ref on 'memory-region'
On 12/21/20 10:01 PM, Rob Herring wrote:
> 'memory-region' is a common property, so it doesn't need a type ref here.

Acked-by: Alex Elder

> Cc: "David S. Miller"
> Cc: Jakub Kicinski
> Cc: Alex Elder
> Cc: netdev@vger.kernel.org
> Signed-off-by: Rob Herring
> ---
> I'll take this via the DT tree.
>
>  Documentation/devicetree/bindings/net/qcom,ipa.yaml | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/Documentation/devicetree/bindings/net/qcom,ipa.yaml b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> index d0cbbcf1b0e5..8a2d12644675 100644
> --- a/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> +++ b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> @@ -121,7 +121,6 @@ properties:
>        receive and act on notifications of modem up/down events.
>
>    memory-region:
> -    $ref: /schemas/types.yaml#/definitions/phandle-array
>      maxItems: 1
>      description:
>        If present, a phandle for a reserved memory area that holds
[PATCH net-next] net: tipc: Replace expression with offsetof()
Use the existing offsetof() macro instead of duplicating code.

Signed-off-by: Zheng Yongjun
---
 net/tipc/monitor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 6dce2abf436e..48fac3b17e40 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -108,7 +108,7 @@ const int tipc_max_domain_size = sizeof(struct tipc_mon_domain);
  */
 static int dom_rec_len(struct tipc_mon_domain *dom, u16 mcnt)
 {
-	return ((void *)&dom->members - (void *)dom) + (mcnt * sizeof(u32));
+	return (offsetof(struct tipc_mon_domain, members)) + (mcnt * sizeof(u32));
 }
 
 /* dom_size() : calculate size of own domain based on number of peers
--
2.22.0
[PATCH -next] scsi: megaraid: Remove unnecessary memset
The memcpy operation is next to the memset code, and the size to copy is
equal to the size to memset, so the memset operation is unnecessary;
remove it.

Signed-off-by: Zheng Yongjun
---
 drivers/net/wireless/ath/wcn36xx/smd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index 766400f7b61c..273fed22711f 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -484,7 +484,6 @@ static void init_hal_msg(struct wcn36xx_hal_msg_header *hdr,
 
 #define PREPARE_HAL_PTT_MSG_BUF(send_buf, p_msg_body) \
 	do {							\
-		memset(send_buf, 0, p_msg_body->header.len);	\
 		memcpy(send_buf, p_msg_body, p_msg_body->header.len); \
 	} while (0)
--
2.22.0
[RFC PATCH v2 net-next 01/15] net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN
The sja1105 implementation can be blind about this, but the felix driver doesn't do exactly what it's being told, so it needs to know whether it is a TX or an RX VLAN, so it can install the appropriate type of TCAM rule. Signed-off-by: Vladimir Oltean --- Changes in v2: None. include/linux/dsa/8021q.h | 14 ++ net/dsa/tag_8021q.c | 15 +-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/include/linux/dsa/8021q.h b/include/linux/dsa/8021q.h index 88cd72dfa4e0..b12b05f1c8b4 100644 --- a/include/linux/dsa/8021q.h +++ b/include/linux/dsa/8021q.h @@ -64,6 +64,10 @@ int dsa_8021q_rx_source_port(u16 vid); u16 dsa_8021q_rx_subvlan(u16 vid); +bool vid_is_dsa_8021q_rxvlan(u16 vid); + +bool vid_is_dsa_8021q_txvlan(u16 vid); + bool vid_is_dsa_8021q(u16 vid); #else @@ -123,6 +127,16 @@ u16 dsa_8021q_rx_subvlan(u16 vid) return 0; } +bool vid_is_dsa_8021q_rxvlan(u16 vid) +{ + return false; +} + +bool vid_is_dsa_8021q_txvlan(u16 vid) +{ + return false; +} + bool vid_is_dsa_8021q(u16 vid) { return false; diff --git a/net/dsa/tag_8021q.c b/net/dsa/tag_8021q.c index 8e3e8a5b8559..008c1ec6e20c 100644 --- a/net/dsa/tag_8021q.c +++ b/net/dsa/tag_8021q.c @@ -133,10 +133,21 @@ u16 dsa_8021q_rx_subvlan(u16 vid) } EXPORT_SYMBOL_GPL(dsa_8021q_rx_subvlan); +bool vid_is_dsa_8021q_rxvlan(u16 vid) +{ + return (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_RX; +} +EXPORT_SYMBOL_GPL(vid_is_dsa_8021q_rxvlan); + +bool vid_is_dsa_8021q_txvlan(u16 vid) +{ + return (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_TX; +} +EXPORT_SYMBOL_GPL(vid_is_dsa_8021q_txvlan); + bool vid_is_dsa_8021q(u16 vid) { - return ((vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_RX || - (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_TX); + return vid_is_dsa_8021q_rxvlan(vid) || vid_is_dsa_8021q_txvlan(vid); } EXPORT_SYMBOL_GPL(vid_is_dsa_8021q); -- 2.25.1
[RFC PATCH v2 net-next 00/15] tag_8021q for Ocelot switches
The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port, and
the external ports, in the majority of use cases, run at 1G. This means
that, when the CPU injects traffic into the switch, it is very easy to run
into congestion. This is not to say that it is impossible to enter
congestion even with all ports running at the same speed, just that the
default configuration is already very prone to that by design.

Normally, the way to deal with that is using Ethernet flow control (PAUSE
frames). However, this functionality is not working today with the
ENETC - Felix switch pair. The hardware issue is undergoing documentation
right now as an erratum within NXP, but several customers have been
requesting a reasonable workaround for it.

In truth, the LS1028A has 2 internal port pairs. The lack of flow control
is an issue only when NPI mode (Node Processor Interface, aka the mode
where the "CPU port module", which carries DSA-style tagged packets, is
connected to a regular Ethernet port) is used, and NPI mode is supported
by Felix on a single port. In past BSPs, we have had setups where both
internal port pairs were enabled. We were advertising the following setup:

         "data port"    "control port"
           (2.5G)            (1G)
            eno2             eno3
              ^                ^
              |                |
              | regular        | DSA-tagged
              | frames         | frames
              |                |
              v                v
            swp4             swp5

This works but is highly impractical, due to NXP shifting the task of
designing a functional system (choosing which port to use, depending on
type of traffic required) up to the end user. The swpN interfaces would
have to be bridged with swp4, in order for the eno2 "data port" to have
access to the outside network. And the swpN interfaces would still be
capable of IP networking. So running a DHCP client would give us two IP
interfaces from the same subnet, one assigned to eno2, and the other to
swpN (0, 1, 2, 3).

Also, the dual port design doesn't scale. When attaching another DSA
switch to a Felix port, the end result is that the "data port" cannot
carry any meaningful data to the external world, since it lacks the DSA
tags required to traverse the sja1105 switches below. All that traffic
needs to go through the "control port". So in newer BSPs there was a
desire to simplify that setup, and only have one internal port pair:

            eno2             eno3
              ^                |
              |                |
              | DSA-tagged     x disabled
              | frames         |
              |                |
              v                v
            swp4             swp5

However, this setup only exacerbates the issue of not having flow control
on the NPI port, since that is the only port now. Also, there are use
cases that still require the "data port", such as IEEE 802.1CB (TSN
stream identification doesn't work over an NPI port), source MAC address
learning over NPI, etc.

Again, there is a desire to keep the simplicity of the single internal
port setup, while regaining the benefits of having a dedicated data port
as well. And this series attempts to deliver just that.

So the NPI functionality is disabled conditionally. Its purpose was:

- To ensure individually addressable ports on TX. This can be replaced by
  using some designated VLAN tags which are pushed by the DSA tagger
  code, then removed by the switch (so they are invisible to the outside
  world and to the user).

- To ensure source port identification on RX. Again, this can be replaced
  by using some designated VLAN tags to encapsulate all RX traffic (each
  VLAN uniquely identifies a source port). The DSA tagger determines
  which port it was based on the VLAN number, then removes that header.
  A sketch of this decode step follows at the end of this cover letter.

- To deliver PTP timestamps. This cannot be obtained through VLAN
  headers, so we need to take a step back and see how else we can do
  that.

The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
injection/extraction from the CPU port module using register-based MMIO,
and not over Ethernet. We will need to do the same from DSA, which makes
this tagger a sort of hybrid between DSA and pure switchdev.

I determined that a Kconfig option would be a sufficiently good
configuration interface for selecting between the existing NPI-based
tagger and the tag_8021q software-defined tagger. However, this is one of
the things that is up for debate today.

Changes in v2:
Posted the entire rework necessary for PTP support using tag_8021q.c.
Added a larger audience to the series.

Vladimir Oltean (15):
  net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or
    TX VLAN
  net: mscc: ocelot: export VCAP structures to include/soc/mscc
  net: mscc: ocelot: store a namespaced VCAP filter ID
  net: dsa: felix: add new VLAN-based tagger
  net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handler
  net: mscc: ocelot: only drain extraction queue on error
  net: mscc: ocelot: just flush the CPU extraction group on error
  net: mscc: ocelot: better error handling in ocelot_xtr_irq_handler
  net: mscc: ocelot: use DIV_ROUND_UP helper in
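For readers unfamiliar with tag_8021q, the RX side of the second bullet above can be sketched in a few lines. This reuses the helpers from patch 01/15 and existing DSA infrastructure, but it is only an illustration of the idea, not a quote from the tagger added in patch 04/15 (the function name is made up, and the real code also strips the VLAN header before handing the skb up):

	static struct sk_buff *example_8021q_rcv(struct sk_buff *skb,
						 struct net_device *netdev)
	{
		u16 tci, vid;

		if (__vlan_get_tag(skb, &tci))
			return NULL;		/* no outer tag: not ours */

		vid = tci & VLAN_VID_MASK;
		if (!vid_is_dsa_8021q_rxvlan(vid))
			return NULL;

		/* The RX VLAN uniquely identifies the ingress switch port */
		skb->dev = dsa_master_find_slave(netdev, 0,
						 dsa_8021q_rx_source_port(vid));
		if (!skb->dev)
			return NULL;

		return skb;
	}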
[RFC PATCH v2 net-next 02/15] net: mscc: ocelot: export VCAP structures to include/soc/mscc
The Felix driver will need to preinstall some VCAP filters for its tag_8021q implementation (outside of the tc-flower offload logic), so these need to be exported to the common includes. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_net.c | 1 + drivers/net/ethernet/mscc/ocelot_vcap.h | 293 +--- include/soc/mscc/ocelot_vcap.h | 289 +++ 3 files changed, 292 insertions(+), 291 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index 2bd2840d88bd..3a3bbd5e7883 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -5,6 +5,7 @@ */ #include +#include #include "ocelot.h" #include "ocelot_vcap.h" diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.h b/drivers/net/ethernet/mscc/ocelot_vcap.h index 82fd10581a14..cfc8b976d1de 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.h +++ b/drivers/net/ethernet/mscc/ocelot_vcap.h @@ -7,300 +7,11 @@ #define _MSCC_OCELOT_VCAP_H_ #include "ocelot.h" -#include "ocelot_police.h" -#include -#include +#include +#include #define OCELOT_POLICER_DISCARD 0x17f -struct ocelot_ipv4 { - u8 addr[4]; -}; - -enum ocelot_vcap_bit { - OCELOT_VCAP_BIT_ANY, - OCELOT_VCAP_BIT_0, - OCELOT_VCAP_BIT_1 -}; - -struct ocelot_vcap_u8 { - u8 value[1]; - u8 mask[1]; -}; - -struct ocelot_vcap_u16 { - u8 value[2]; - u8 mask[2]; -}; - -struct ocelot_vcap_u24 { - u8 value[3]; - u8 mask[3]; -}; - -struct ocelot_vcap_u32 { - u8 value[4]; - u8 mask[4]; -}; - -struct ocelot_vcap_u40 { - u8 value[5]; - u8 mask[5]; -}; - -struct ocelot_vcap_u48 { - u8 value[6]; - u8 mask[6]; -}; - -struct ocelot_vcap_u64 { - u8 value[8]; - u8 mask[8]; -}; - -struct ocelot_vcap_u128 { - u8 value[16]; - u8 mask[16]; -}; - -struct ocelot_vcap_vid { - u16 value; - u16 mask; -}; - -struct ocelot_vcap_ipv4 { - struct ocelot_ipv4 value; - struct ocelot_ipv4 mask; -}; - -struct ocelot_vcap_udp_tcp { - u16 value; - u16 mask; -}; - -struct ocelot_vcap_port { - u8 value; - u8 mask; -}; - -enum ocelot_vcap_key_type { - OCELOT_VCAP_KEY_ANY, - OCELOT_VCAP_KEY_ETYPE, - OCELOT_VCAP_KEY_LLC, - OCELOT_VCAP_KEY_SNAP, - OCELOT_VCAP_KEY_ARP, - OCELOT_VCAP_KEY_IPV4, - OCELOT_VCAP_KEY_IPV6 -}; - -struct ocelot_vcap_key_vlan { - struct ocelot_vcap_vid vid;/* VLAN ID (12 bit) */ - struct ocelot_vcap_u8 pcp;/* PCP (3 bit) */ - enum ocelot_vcap_bit dei;/* DEI */ - enum ocelot_vcap_bit tagged; /* Tagged/untagged frame */ -}; - -struct ocelot_vcap_key_etype { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - struct ocelot_vcap_u16 etype; - struct ocelot_vcap_u16 data; /* MAC data */ -}; - -struct ocelot_vcap_key_llc { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - - /* LLC header: DSAP at byte 0, SSAP at byte 1, Control at byte 2 */ - struct ocelot_vcap_u32 llc; -}; - -struct ocelot_vcap_key_snap { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - - /* SNAP header: Organization Code at byte 0, Type at byte 3 */ - struct ocelot_vcap_u40 snap; -}; - -struct ocelot_vcap_key_arp { - struct ocelot_vcap_u48 smac; - enum ocelot_vcap_bit arp; /* Opcode ARP/RARP */ - enum ocelot_vcap_bit req; /* Opcode request/reply */ - enum ocelot_vcap_bit unknown;/* Opcode unknown */ - enum ocelot_vcap_bit smac_match; /* Sender MAC matches SMAC */ - enum ocelot_vcap_bit dmac_match; /* Target MAC matches DMAC */ - - /**< Protocol addr. 
length 4, hardware length 6 */ - enum ocelot_vcap_bit length; - - enum ocelot_vcap_bit ip; /* Protocol address type IP */ - enum ocelot_vcap_bit ethernet; /* Hardware address type Ethernet */ - struct ocelot_vcap_ipv4 sip; /* Sender IP address */ - struct ocelot_vcap_ipv4 dip; /* Target IP address */ -}; - -struct ocelot_vcap_key_ipv4 { - enum ocelot_vcap_bit ttl; /* TTL zero */ - enum ocelot_vcap_bit fragment; /* Fragment */ - enum ocelot_vcap_bit options; /* Header options */ - struct ocelot_vcap_u8 ds; - struct ocelot_vcap_u8 proto; /* Protocol */ - struct ocelot_vcap_ipv4 sip; /* Source IP address */ - struct ocelot_vcap_ipv4 dip; /* Destination IP address */ - struct ocelot_vcap_u48 data; /* Not UDP/TCP: IP data */ - struct ocelot_vcap_udp_tcp sport; /* UDP/TCP: Source port */ - struct ocelot_vcap_udp_tcp dport; /* UDP/TCP: Destination port */ - enum ocelot_vcap_bit tcp_fin; - enum ocelot_vcap_bit tcp_syn; - enum ocelot_vcap_bit tcp_rst; - enum ocel
[RFC PATCH v2 net-next 06/15] net: mscc: ocelot: only drain extraction queue on error
It appears that the intention of this snippet of code is to not exit
ocelot_xtr_irq_handler() while in the middle of extracting a frame. The
problem in extracting it word by word is that future extraction attempts
are really easy to get desynchronized, since the IRQ handler assumes that
the first 16 bytes are the IFH, which give further information about the
frame, such as frame length.

But during normal operation, "err" will not be 0, but 4, set from here:

		for (i = 0; i < OCELOT_TAG_LEN / 4; i++) {
			err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]);
			if (err != 4)
				break;
		}

		if (err != 4)
			break;

In that case, draining the extraction queue is a no-op. So explicitly
make this code execute only on negative err.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index 00c6d9838970..ed632dd79245 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -702,7 +702,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		dev->stats.rx_packets++;
 	}
 
-	if (err)
+	if (err < 0)
 		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
 			ocelot_read_rix(ocelot, QS_XTR_RD, grp);
 
--
2.25.1
[RFC PATCH v2 net-next 05/15] net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handler
Since the xtr (extraction) IRQ of the ocelot switch is not shared, then if
it fired, it means that some data must be present in the queues of the CPU
port module. So simplify the code.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index 1e7729421a82..00c6d9838970 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -605,10 +605,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 	int i = 0, grp = 0;
 	int err = 0;
 
-	if (!(ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)))
-		return IRQ_NONE;
-
-	do {
+	while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) {
 		struct skb_shared_hwtstamps *shhwtstamps;
 		struct ocelot_port_private *priv;
 		struct ocelot_port *ocelot_port;
@@ -703,7 +700,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		netif_rx(skb);
 		dev->stats.rx_bytes += len;
 		dev->stats.rx_packets++;
-	} while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp));
+	}
 
 	if (err)
 		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
--
2.25.1
[RFC PATCH v2 net-next 08/15] net: mscc: ocelot: better error handling in ocelot_xtr_irq_handler
The ocelot_rx_frame_word() function can return a negative error code, however this isn't being checked for consistently. Also, some constructs can be simplified by using "goto" instead of repeated "break" statements. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_vsc7514.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c index 52ebc69a52cc..ea66b372d63b 100644 --- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c +++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c @@ -620,12 +620,9 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) for (i = 0; i < OCELOT_TAG_LEN / 4; i++) { err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]); if (err != 4) - break; + goto out; } - if (err != 4) - break; - /* At this point the IFH was read correctly, so it is safe to * presume that there is no error. The err needs to be reset * otherwise a frame could come in CPU queue between the while @@ -646,7 +643,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) if (unlikely(!skb)) { netdev_err(dev, "Unable to allocate sk_buff\n"); err = -ENOMEM; - break; + goto out; } buf_len = info.len - ETH_FCS_LEN; buf = (u32 *)skb_put(skb, buf_len); @@ -654,12 +651,21 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) len = 0; do { sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } *buf++ = val; len += sz; } while (len < buf_len); /* Read the FCS */ sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } + /* Update the statistics if part of the FCS was read before */ len -= ETH_FCS_LEN - sz; @@ -668,11 +674,6 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) *buf = val; } - if (sz < 0) { - err = sz; - break; - } - if (ocelot->ptp) { ocelot_ptp_gettime64(&ocelot->ptp_info, &ts); @@ -702,6 +703,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) dev->stats.rx_packets++; } +out: if (err < 0) { ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp)); ocelot_write(ocelot, QS_XTR_FLUSH, 0); -- 2.25.1
[RFC PATCH v2 net-next 07/15] net: mscc: ocelot: just flush the CPU extraction group on error
This procedure should yield the same effect as manually reading out the
extraction data just to discard it.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index ed632dd79245..52ebc69a52cc 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -702,9 +702,10 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		dev->stats.rx_packets++;
 	}
 
-	if (err < 0)
-		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
-			ocelot_read_rix(ocelot, QS_XTR_RD, grp);
+	if (err < 0) {
+		ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp));
+		ocelot_write(ocelot, QS_XTR_FLUSH, 0);
+	}
 
 	return IRQ_HANDLED;
 }
--
2.25.1
[RFC PATCH v2 net-next 04/15] net: dsa: felix: add new VLAN-based tagger
There are use cases for which the existing tagger, based on the NPI (Node Processor Interface) functionality, is insufficient. Namely: - Frames injected through the NPI port bypass the frame analyzer, so no source address learning is performed, no TSN stream classification, etc. - Flow control is not functional over an NPI port (PAUSE frames are encapsulated in the same Extraction Frame Header as all other frames) - There can be at most one NPI port configured for an Ocelot switch. But in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI port is currently either disabled, or operated as a plain user port (albeit an internally-facing one). Having the ability to configure the two CPU ports symmetrically could pave the way for e.g. creating a LAG between them, to increase bandwidth seamlessly for the system. So, there is a desire to have an alternative to the NPI mode. This patch brings an implementation of the software-defined tag_8021q.c tagger format, which also preserves full functionality under a vlan_filtering bridge (unlike sja1105, the only other user of tag_8021q). It does this by using the TCAM engines for: - pushing the RX VLAN as a second, outer tag, on egress towards the CPU port - redirecting towards the correct front port based on TX VLAN and popping that on egress Signed-off-by: Vladimir Oltean --- Changes in v2: Clean up the hardcoding of random VCAP filter IDs and the inclusion of a private ocelot header. MAINTAINERS | 1 + drivers/net/dsa/ocelot/Kconfig | 4 +- drivers/net/dsa/ocelot/Makefile | 5 + drivers/net/dsa/ocelot/felix.c | 108 +-- drivers/net/dsa/ocelot/felix.h | 1 + drivers/net/dsa/ocelot/felix_tag_8021q.c | 166 +++ drivers/net/dsa/ocelot/felix_tag_8021q.h | 20 +++ drivers/net/ethernet/mscc/ocelot.c | 18 ++- include/soc/mscc/ocelot.h| 1 + net/dsa/Kconfig | 34 + net/dsa/Makefile | 3 +- net/dsa/tag_ocelot_8021q.c | 61 + 12 files changed, 399 insertions(+), 23 deletions(-) create mode 100644 drivers/net/dsa/ocelot/felix_tag_8021q.c create mode 100644 drivers/net/dsa/ocelot/felix_tag_8021q.h create mode 100644 net/dsa/tag_ocelot_8021q.c diff --git a/MAINTAINERS b/MAINTAINERS index a355db292486..a9cb0b659c00 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12736,6 +12736,7 @@ F: drivers/net/dsa/ocelot/* F: drivers/net/ethernet/mscc/ F: include/soc/mscc/ocelot* F: net/dsa/tag_ocelot.c +F: net/dsa/tag_ocelot_8021q.c F: tools/testing/selftests/drivers/net/ocelot/* OCXL (Open Coherent Accelerator Processor Interface OpenCAPI) DRIVER diff --git a/drivers/net/dsa/ocelot/Kconfig b/drivers/net/dsa/ocelot/Kconfig index c110e82a7973..ab8de14c4dae 100644 --- a/drivers/net/dsa/ocelot/Kconfig +++ b/drivers/net/dsa/ocelot/Kconfig @@ -2,11 +2,11 @@ config NET_DSA_MSCC_FELIX tristate "Ocelot / Felix Ethernet switch support" depends on NET_DSA && PCI + depends on NET_DSA_TAG_OCELOT_NPI || NET_DSA_TAG_OCELOT_8021Q depends on NET_VENDOR_MICROSEMI depends on NET_VENDOR_FREESCALE depends on HAS_IOMEM select MSCC_OCELOT_SWITCH_LIB - select NET_DSA_TAG_OCELOT select FSL_ENETC_MDIO select PCS_LYNX help @@ -16,10 +16,10 @@ config NET_DSA_MSCC_FELIX config NET_DSA_MSCC_SEVILLE tristate "Ocelot / Seville Ethernet switch support" depends on NET_DSA + depends on NET_DSA_TAG_OCELOT_NPI || NET_DSA_TAG_OCELOT_8021Q depends on NET_VENDOR_MICROSEMI depends on HAS_IOMEM select MSCC_OCELOT_SWITCH_LIB - select NET_DSA_TAG_OCELOT select PCS_LYNX help This driver supports the VSC9953 (Seville) switch, which is embedded diff --git a/drivers/net/dsa/ocelot/Makefile 
b/drivers/net/dsa/ocelot/Makefile index f6dd131e7491..e9ea8c0331d8 100644 --- a/drivers/net/dsa/ocelot/Makefile +++ b/drivers/net/dsa/ocelot/Makefile @@ -9,3 +9,8 @@ mscc_felix-objs := \ mscc_seville-objs := \ felix.o \ seville_vsc9953.o + +ifdef CONFIG_NET_DSA_TAG_OCELOT_8021Q +mscc_felix-objs += felix_tag_8021q.o +mscc_seville-objs += felix_tag_8021q.o +endif diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c index 7dc230677b78..77f73c6bad0b 100644 --- a/drivers/net/dsa/ocelot/felix.c +++ b/drivers/net/dsa/ocelot/felix.c @@ -23,6 +23,7 @@ #include #include #include "felix.h" +#include "felix_tag_8021q.h" static enum dsa_tag_protocol felix_get_tag_protocol(struct dsa_switch *ds, int port, @@ -439,6 +440,7 @@ static int felix_init_structs(struct felix *felix, int num_phys_ports) { struct ocelot *ocelot = &felix->ocelot; phy_interface_t *port_phy_modes; + enum ocelot_tag_prefix prefix
[RFC PATCH v2 net-next 09/15] net: mscc: ocelot: use DIV_ROUND_UP helper in ocelot_port_inject_frame
This looks a bit nicer than the open-coded "(x + 3) / 4" idiom.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 3a3bbd5e7883..7cc0fcd1df8d 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -386,7 +386,7 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 		ocelot_write_rix(ocelot, (__force u32)cpu_to_be32(ifh[i]),
 				 QS_INJ_WR, grp);
 
-	count = (skb->len + 3) / 4;
+	count = DIV_ROUND_UP(skb->len, 4);
 	last = skb->len % 4;
 	for (i = 0; i < count; i++)
 		ocelot_write_rix(ocelot, ((u32 *)skb->data)[i], QS_INJ_WR, grp);
--
2.25.1
[RFC PATCH v2 net-next 10/15] net: mscc: ocelot: refactor ocelot_port_inject_frame out of ocelot_port_xmit
The felix DSA driver will inject some frames through register MMIO, same as ocelot switchdev currently does. So we need to be able to reuse the common code. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 78 + drivers/net/ethernet/mscc/ocelot.h | 4 ++ drivers/net/ethernet/mscc/ocelot_net.c | 81 +++--- 3 files changed, 89 insertions(+), 74 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index be4671bfe95f..52f6c986aef0 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -566,6 +566,84 @@ void ocelot_get_txtstamp(struct ocelot *ocelot) } EXPORT_SYMBOL(ocelot_get_txtstamp); +/* Generate the IFH for frame injection + * + * The IFH is a 128bit-value + * bit 127: bypass the analyzer processing + * bit 56-67: destination mask + * bit 28-29: pop_cnt: 3 disables all rewriting of the frame + * bit 20-27: cpu extraction queue mask + * bit 16: tag type 0: C-tag, 1: S-tag + * bit 0-11: VID + */ +static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) +{ + ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); + ifh[1] = (0xf00 & info->port) >> 8; + ifh[2] = (0xff & info->port) << 24; + ifh[3] = (info->tag_type << 16) | info->vid; + + return 0; +} + +bool ocelot_can_inject(struct ocelot *ocelot, int grp) +{ + u32 val = ocelot_read(ocelot, QS_INJ_STATUS); + + if (!(val & QS_INJ_STATUS_FIFO_RDY(BIT(grp + return false; + if (val & QS_INJ_STATUS_WMARK_REACHED(BIT(grp))) + return false; + + return true; +} + +void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, + u32 rew_op, struct sk_buff *skb) +{ + struct frame_info info = {}; + u32 ifh[OCELOT_TAG_LEN / 4]; + unsigned int i, count, last; + + ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) | +QS_INJ_CTRL_SOF, QS_INJ_CTRL, grp); + + info.port = BIT(port); + info.tag_type = IFH_TAG_TYPE_C; + info.vid = skb_vlan_tag_get(skb); + info.rew_op = rew_op; + + ocelot_gen_ifh(ifh, &info); + + for (i = 0; i < OCELOT_TAG_LEN / 4; i++) + ocelot_write_rix(ocelot, (__force u32)cpu_to_be32(ifh[i]), +QS_INJ_WR, grp); + + count = DIV_ROUND_UP(skb->len, 4); + last = skb->len % 4; + for (i = 0; i < count; i++) + ocelot_write_rix(ocelot, ((u32 *)skb->data)[i], QS_INJ_WR, grp); + + /* Add padding */ + while (i < (OCELOT_BUFFER_CELL_SZ / 4)) { + ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp); + i++; + } + + /* Indicate EOF and valid bytes in last word */ + ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) | +QS_INJ_CTRL_VLD_BYTES(skb->len < OCELOT_BUFFER_CELL_SZ ? 
0 : last) | +QS_INJ_CTRL_EOF, +QS_INJ_CTRL, grp); + + /* Add dummy CRC */ + ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp); + skb_tx_timestamp(skb); + + skb->dev->stats.tx_packets++; + skb->dev->stats.tx_bytes += skb->len; +} + int ocelot_fdb_add(struct ocelot *ocelot, int port, const unsigned char *addr, u16 vid) { diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h index 291d39d49c4e..e7685a58b7e2 100644 --- a/drivers/net/ethernet/mscc/ocelot.h +++ b/drivers/net/ethernet/mscc/ocelot.h @@ -126,6 +126,10 @@ void ocelot_set_cpu_port(struct ocelot *ocelot, int cpu, enum ocelot_tag_prefix injection, enum ocelot_tag_prefix extraction); +bool ocelot_can_inject(struct ocelot *ocelot, int grp); +void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, + u32 rew_op, struct sk_buff *skb); + extern struct notifier_block ocelot_netdevice_nb; extern struct notifier_block ocelot_switchdev_nb; extern struct notifier_block ocelot_switchdev_blocking_nb; diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index 7cc0fcd1df8d..942eb56535b7 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -311,53 +311,20 @@ static int ocelot_port_stop(struct net_device *dev) return 0; } -/* Generate the IFH for frame injection - * - * The IFH is a 128bit-value - * bit 127: bypass the analyzer processing - * bit 56-67: destination mask - * bit 28-29: pop_cnt: 3 disables all rewriting of the frame - * bit 20-27: cpu extraction queue mask - * bit 16: tag type 0: C-tag, 1: S-tag - * bit 0-11: VID - */ -static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) -{ - ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); - ifh[1] = (0xf00 & info->port) >> 8
[RFC PATCH v2 net-next 03/15] net: mscc: ocelot: store a namespaced VCAP filter ID
We will be adding some private VCAP filters that should not interfere in any way with the filters added using tc-flower. So we need to allocate some IDs which will not be used by tc. Currently ocelot uses an u32 id derived from the flow cookie, which in itself is an unsigned long. This is a problem in itself, since on 64 bit systems, sizeof(unsigned long)=8, so the driver is already truncating these. Create a struct ocelot_vcap_id which contains the full unsigned long cookie from tc, as well as a boolean that is supposed to namespace the filters added by tc with the ones that aren't. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_flower.c | 7 --- drivers/net/ethernet/mscc/ocelot_vcap.c | 16 drivers/net/ethernet/mscc/ocelot_vcap.h | 3 ++- include/soc/mscc/ocelot_vcap.h| 7 ++- 4 files changed, 24 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c index 729495a1a77e..c3ac026f6aea 100644 --- a/drivers/net/ethernet/mscc/ocelot_flower.c +++ b/drivers/net/ethernet/mscc/ocelot_flower.c @@ -622,7 +622,8 @@ static int ocelot_flower_parse(struct ocelot *ocelot, int port, bool ingress, int ret; filter->prio = f->common.prio; - filter->id = f->cookie; + filter->id.cookie = f->cookie; + filter->id.tc_offload = true; ret = ocelot_flower_parse_action(ocelot, port, ingress, f, filter); if (ret) @@ -717,7 +718,7 @@ int ocelot_cls_flower_destroy(struct ocelot *ocelot, int port, block = &ocelot->block[block_id]; - filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie); + filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie, true); if (!filter) return 0; @@ -741,7 +742,7 @@ int ocelot_cls_flower_stats(struct ocelot *ocelot, int port, block = &ocelot->block[block_id]; - filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie); + filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie, true); if (!filter || filter->type == OCELOT_VCAP_FILTER_DUMMY) return 0; diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.c b/drivers/net/ethernet/mscc/ocelot_vcap.c index d8c778ee6f1b..2f588dfdc9a2 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.c +++ b/drivers/net/ethernet/mscc/ocelot_vcap.c @@ -959,6 +959,12 @@ static void ocelot_vcap_filter_add_to_block(struct ocelot *ocelot, list_add(&filter->list, pos->prev); } +static bool ocelot_vcap_filter_equal(const struct ocelot_vcap_filter *a, +const struct ocelot_vcap_filter *b) +{ + return !memcmp(&a->id, &b->id, sizeof(struct ocelot_vcap_id)); +} + static int ocelot_vcap_block_get_filter_index(struct ocelot_vcap_block *block, struct ocelot_vcap_filter *filter) { @@ -966,7 +972,7 @@ static int ocelot_vcap_block_get_filter_index(struct ocelot_vcap_block *block, int index = 0; list_for_each_entry(tmp, &block->rules, list) { - if (filter->id == tmp->id) + if (ocelot_vcap_filter_equal(filter, tmp)) return index; index++; } @@ -991,12 +997,14 @@ ocelot_vcap_block_find_filter_by_index(struct ocelot_vcap_block *block, } struct ocelot_vcap_filter * -ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int id) +ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int cookie, + bool tc_offload) { struct ocelot_vcap_filter *filter; list_for_each_entry(filter, &block->rules, list) - if (filter->id == id) + if (filter->id.tc_offload == tc_offload && + filter->id.cookie == cookie) return filter; return NULL; @@ -1160,7 +1168,7 @@ static void ocelot_vcap_block_remove_filter(struct ocelot *ocelot, 
list_for_each_safe(pos, q, &block->rules) { tmp = list_entry(pos, struct ocelot_vcap_filter, list); - if (tmp->id == filter->id) { + if (ocelot_vcap_filter_equal(filter, tmp)) { if (tmp->block_id == VCAP_IS2 && tmp->action.police_ena) ocelot_vcap_policer_del(ocelot, block, diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.h b/drivers/net/ethernet/mscc/ocelot_vcap.h index cfc8b976d1de..3b0c7916056e 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.h +++ b/drivers/net/ethernet/mscc/ocelot_vcap.h @@ -15,7 +15,8 @@ int ocelot_vcap_filter_stats_update(struct ocelot *ocelot, struct ocelot_vcap_filter *rule); struct ocelot_vcap_filter * -ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int id); +ocelot_
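The header hunk is cut off above, but from the way the new members are used in ocelot_flower.c and ocelot_vcap.c, the namespaced ID in include/soc/mscc/ocelot_vcap.h presumably looks something like this (inferred from the usage in the hunks, not copied from the truncated diff), together with the widened lookup prototype:

        struct ocelot_vcap_id {
                unsigned long cookie;
                bool tc_offload;
        };

        struct ocelot_vcap_filter *
        ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block,
                                            int cookie, bool tc_offload);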
[RFC PATCH v2 net-next 11/15] net: mscc: ocelot: export struct ocelot_frame_info
Because felix DSA must now be able to extract a frame in 2 stages over MMIO (first the XFH then the frame data), it needs access to this internal ocelot structure that holds the unpacked information from the Extraction Frame Header. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 4 ++-- drivers/net/ethernet/mscc/ocelot.h | 9 - drivers/net/ethernet/mscc/ocelot_vsc7514.c | 4 ++-- include/soc/mscc/ocelot.h | 9 + 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 52f6c986aef0..7d73c3251dfb 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -576,7 +576,7 @@ EXPORT_SYMBOL(ocelot_get_txtstamp); * bit 16: tag type 0: C-tag, 1: S-tag * bit 0-11: VID */ -static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) +static int ocelot_gen_ifh(u32 *ifh, struct ocelot_frame_info *info) { ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); ifh[1] = (0xf00 & info->port) >> 8; @@ -601,7 +601,7 @@ bool ocelot_can_inject(struct ocelot *ocelot, int grp) void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, u32 rew_op, struct sk_buff *skb) { - struct frame_info info = {}; + struct ocelot_frame_info info = {}; u32 ifh[OCELOT_TAG_LEN / 4]; unsigned int i, count, last; diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h index e7685a58b7e2..7dac0edd7767 100644 --- a/drivers/net/ethernet/mscc/ocelot.h +++ b/drivers/net/ethernet/mscc/ocelot.h @@ -32,15 +32,6 @@ #define OCELOT_PTP_QUEUE_SZ128 -struct frame_info { - u32 len; - u16 port; - u16 vid; - u8 tag_type; - u16 rew_op; - u32 timestamp; /* rew_val */ -}; - struct ocelot_port_tc { bool block_shared; unsigned long offload_cnt; diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c index ea66b372d63b..504881d531e5 100644 --- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c +++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c @@ -533,7 +533,7 @@ static int ocelot_chip_init(struct ocelot *ocelot, const struct ocelot_ops *ops) return 0; } -static int ocelot_parse_ifh(u32 *_ifh, struct frame_info *info) +static int ocelot_parse_ifh(u32 *_ifh, struct ocelot_frame_info *info) { u8 llen, wlen; u64 ifh[2]; @@ -607,10 +607,10 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) { struct skb_shared_hwtstamps *shhwtstamps; + struct ocelot_frame_info info = {}; struct ocelot_port_private *priv; struct ocelot_port *ocelot_port; u64 tod_in_ns, full_ts_in_ns; - struct frame_info info = {}; struct net_device *dev; u32 ifh[4], val, *buf; struct timespec64 ts; diff --git a/include/soc/mscc/ocelot.h b/include/soc/mscc/ocelot.h index 4cbb7655ef0c..25a93bcc6afe 100644 --- a/include/soc/mscc/ocelot.h +++ b/include/soc/mscc/ocelot.h @@ -669,6 +669,15 @@ struct ocelot_policer { u32 burst; /* bytes */ }; +struct ocelot_frame_info { + u32 len; + u16 port; + u16 vid; + u8 tag_type; + u16 rew_op; + u32 timestamp; /* rew_val */ +}; + #define ocelot_read_ix(ocelot, reg, gi, ri) __ocelot_read_ix(ocelot, reg, reg##_GSZ * (gi) + reg##_RSZ * (ri)) #define ocelot_read_gix(ocelot, reg, gi) __ocelot_read_ix(ocelot, reg, reg##_GSZ * (gi)) #define ocelot_read_rix(ocelot, reg, ri) __ocelot_read_ix(ocelot, reg, reg##_RSZ * (ri)) -- 2.25.1
[RFC PATCH v2 net-next 12/15] net: mscc: ocelot: refactor ocelot_xtr_irq_handler into ocelot_xtr_poll
Since the felix DSA driver will need to poll the CPU port module for extracted frames as well, let's create some common functions that read an Extraction Frame Header, and then an skb, from a CPU extraction group. This is so complicated, because the procedure to retrieve a struct net_device pointer based on the source port is different for DSA and switchdev. So this is the reason why the polling function is split in the middle. The ocelot_xtr_poll_xfh() permits the caller to get a struct net_device pointer based on the XFH port field, then pass this to the ocelot_xtr_poll_frame() function. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 163 + drivers/net/ethernet/mscc/ocelot.h | 6 + drivers/net/ethernet/mscc/ocelot_vsc7514.c | 158 ++-- 3 files changed, 179 insertions(+), 148 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 7d73c3251dfb..b91d4c31d3d7 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -12,6 +12,9 @@ #define TABLE_UPDATE_SLEEP_US 10 #define TABLE_UPDATE_TIMEOUT_US 10 +#define IFH_EXTRACT_BITFIELD64(x, o, w) \ + (((x) >> (o)) & GENMASK_ULL((w) - 1, 0)) + struct ocelot_mact_entry { u8 mac[ETH_ALEN]; u16 vid; @@ -566,6 +569,166 @@ void ocelot_get_txtstamp(struct ocelot *ocelot) } EXPORT_SYMBOL(ocelot_get_txtstamp); +static int ocelot_parse_xfh(u32 *_ifh, struct ocelot_frame_info *info) +{ + u8 llen, wlen; + u64 ifh[2]; + + ifh[0] = be64_to_cpu(((__force __be64 *)_ifh)[0]); + ifh[1] = be64_to_cpu(((__force __be64 *)_ifh)[1]); + + wlen = IFH_EXTRACT_BITFIELD64(ifh[0], 7, 8); + llen = IFH_EXTRACT_BITFIELD64(ifh[0], 15, 6); + + info->len = OCELOT_BUFFER_CELL_SZ * wlen + llen - 80; + + info->timestamp = IFH_EXTRACT_BITFIELD64(ifh[0], 21, 32); + + info->port = IFH_EXTRACT_BITFIELD64(ifh[1], 43, 4); + + info->tag_type = IFH_EXTRACT_BITFIELD64(ifh[1], 16, 1); + info->vid = IFH_EXTRACT_BITFIELD64(ifh[1], 0, 12); + + return 0; +} + +static int ocelot_rx_frame_word(struct ocelot *ocelot, u8 grp, bool ifh, + u32 *rval) +{ + u32 val; + u32 bytes_valid; + + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + if (val == XTR_NOT_READY) { + if (ifh) + return -EIO; + + do { + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + } while (val == XTR_NOT_READY); + } + + switch (val) { + case XTR_ABORT: + return -EIO; + case XTR_EOF_0: + case XTR_EOF_1: + case XTR_EOF_2: + case XTR_EOF_3: + case XTR_PRUNED: + bytes_valid = XTR_VALID_BYTES(val); + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + if (val == XTR_ESCAPE) + *rval = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + else + *rval = val; + + return bytes_valid; + case XTR_ESCAPE: + *rval = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + + return 4; + default: + *rval = val; + + return 4; + } +} + +int ocelot_xtr_poll_xfh(struct ocelot *ocelot, int grp, + struct ocelot_frame_info *info) +{ + u32 ifh[OCELOT_TAG_LEN / 4]; + int i, err = 0; + + for (i = 0; i < OCELOT_TAG_LEN / 4; i++) { + err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]); + if (err != 4) + return (err < 0) ? 
err : -EIO; + } + + ocelot_parse_xfh(ifh, info); + + return 0; +} + +int ocelot_xtr_poll_frame(struct ocelot *ocelot, int grp, + struct net_device *dev, + struct ocelot_frame_info *info, + struct sk_buff **nskb) +{ + struct skb_shared_hwtstamps *shhwtstamps; + u64 tod_in_ns, full_ts_in_ns; + struct timespec64 ts; + int sz, len, buf_len; + struct sk_buff *skb; + u32 val, *buf; + int err = 0; + + skb = netdev_alloc_skb(dev, info->len); + if (unlikely(!skb)) { + netdev_err(dev, "Unable to allocate sk_buff\n"); + err = -ENOMEM; + goto out; + } + + buf_len = info->len - ETH_FCS_LEN; + buf = (u32 *)skb_put(skb, buf_len); + + len = 0; + do { + sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } + *buf++ = val; + len += sz; + } while (len < buf_len); + + /* Read the FCS */ + sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (s
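The second helper is cut short above; for clarity, here is a rough sketch of how a caller is expected to combine the two halves. lookup_netdev() is a stand-in for the bus-specific port-to-netdevice translation and is not a real function; patch 13 in this series uses the same pattern from felix_check_xtr_pkt().

        while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) {
                struct ocelot_frame_info info = {};
                struct net_device *dev;
                struct sk_buff *skb;

                err = ocelot_xtr_poll_xfh(ocelot, grp, &info);
                if (err)
                        break;

                /* Only the caller knows how to turn the XFH source port
                 * into a net_device (switchdev vs DSA).
                 */
                dev = lookup_netdev(ocelot, info.port);

                err = ocelot_xtr_poll_frame(ocelot, grp, dev, &info, &skb);
                if (err)
                        break;

                netif_rx(skb);
        }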
[RFC PATCH v2 net-next 13/15] net: dsa: felix: setup MMIO filtering rules for PTP when using tag_8021q
Since the tag_8021q tagger is software-defined, it has no means by itself for retrieving hardware timestamps of PTP event messages. Because we do want to support PTP on ocelot even with tag_8021q, we need to use the CPU port module for that. The RX timestamp is present in the Extraction Frame Header. And because we can't use NPI mode which redirects the CPU queues to an "external CPU" (meaning the ARM CPU), then we need to poll the CPU port module through the MMIO registers to retrieve TX and RX timestamps. Sadly, on NXP LS1028A, the Felix switch was integrated into the SoC without wiring the extraction IRQ line to the ARM GIC. So, if we want to be notified of any PTP packets received on the CPU port module, we have a problem. There is a possible workaround, which is to use the Ethernet CPU port as a notification channel that packets are available on the CPU port module as well. When a PTP packet is received by the DSA tagger (without timestamp, of course), we go to the CPU extraction queues, poll for it there, then we drop the original Ethernet packet and masquerade the packet retrieved over MMIO (plus the timestamp) as the original when we inject it up the stack. Create a quirk in struct felix is selected by the Felix driver (but not by Seville, since that doesn't support PTP at all). We want to do this such that the workaround is minimally invasive for future switches that don't require this workaround. The only traffic for which we need timestamps is PTP traffic, so add a redirection rule to the CPU port module for this. Currently we only have the need for PTP over L2, so redirection rules for UDP ports 319 and 320 are TBD for now. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/dsa/ocelot/felix.h | 13 drivers/net/dsa/ocelot/felix_tag_8021q.c | 83 +++- drivers/net/dsa/ocelot/felix_vsc9959.c | 1 + 3 files changed, 96 insertions(+), 1 deletion(-) diff --git a/drivers/net/dsa/ocelot/felix.h b/drivers/net/dsa/ocelot/felix.h index 71f343326c00..c75a934c3e0b 100644 --- a/drivers/net/dsa/ocelot/felix.h +++ b/drivers/net/dsa/ocelot/felix.h @@ -25,6 +25,19 @@ struct felix_info { int switch_pci_bar; int imdio_pci_bar; const struct ptp_clock_info *ptp_caps; + + /* Some Ocelot switches are integrated into the SoC without the +* extraction IRQ line connected to the ARM GIC. By enabling this +* workaround, the few packets that are delivered to the CPU port +* module (currently only PTP) are copied not only to the hardware CPU +* port module, but also to the 802.1Q Ethernet CPU port, and polling +* the extraction registers is triggered once the DSA tagger sees a PTP +* frame. The Ethernet frame is only used as a notification: it is +* dropped, and the original frame is extracted over MMIO and annotated +* with the RX timestamp. +*/ + boolquirk_no_xtr_irq; + int (*mdio_bus_alloc)(struct ocelot *ocelot); void(*mdio_bus_free)(struct ocelot *ocelot); void(*phylink_validate)(struct ocelot *ocelot, int port, diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.c b/drivers/net/dsa/ocelot/felix_tag_8021q.c index f209273bbf69..5243e55a8054 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.c +++ b/drivers/net/dsa/ocelot/felix_tag_8021q.c @@ -142,6 +142,85 @@ static const struct dsa_8021q_ops felix_tag_8021q_ops = { .vlan_add = felix_tag_8021q_vlan_add, }; +/* Set up a VCAP IS2 rule for delivering PTP frames to the CPU port module. + * If the NET_DSA_TAG_OCELOT_QUIRK_NO_XTR_IRQ is in place, then also copy those + * PTP frames to the tag_8021q CPU port. 
+ */ +static int felix_setup_mmio_filtering(struct felix *felix) +{ + struct ocelot_vcap_filter *redirect_rule; + struct ocelot_vcap_filter *tagging_rule; + struct ocelot *ocelot = &felix->ocelot; + unsigned long ingress_port_mask; + int cpu = ocelot->dsa_8021q_cpu; + int ret; + + tagging_rule = kzalloc(sizeof(struct ocelot_vcap_filter), GFP_KERNEL); + if (!tagging_rule) + return -ENOMEM; + + redirect_rule = kzalloc(sizeof(struct ocelot_vcap_filter), GFP_KERNEL); + if (!redirect_rule) { + kfree(tagging_rule); + return -ENOMEM; + } + + ingress_port_mask = GENMASK(ocelot->num_phys_ports - 1, 0) & ~BIT(cpu); + + tagging_rule->key_type = OCELOT_VCAP_KEY_ETYPE; + *(u16 *)tagging_rule->key.etype.etype.value = htons(ETH_P_1588); + *(u16 *)tagging_rule->key.etype.etype.mask = 0x; + tagging_rule->ingress_port_mask = ingress_port_mask; + tagging_rule->prio = 1; + tagging_rule->id.cookie = ocelot->num_phys_ports; + tagging_rule->id.tc_offload = false; + tagging_rule->block_id = VCAP_IS1; + tagging_
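The function body is truncated right after the tagging rule starts being filled in. Based on the changelog, the remainder presumably sets up the redirect rule and installs both filters, roughly along these lines (my reconstruction of the described intent, not the actual hunk; details such as the cookie value are assumptions):

        redirect_rule->key_type = OCELOT_VCAP_KEY_ETYPE;
        *(u16 *)redirect_rule->key.etype.etype.value = htons(ETH_P_1588);
        *(u16 *)redirect_rule->key.etype.etype.mask = 0xffff;
        redirect_rule->ingress_port_mask = ingress_port_mask;
        redirect_rule->prio = 1;
        redirect_rule->id.cookie = ocelot->num_phys_ports + 1;
        redirect_rule->id.tc_offload = false;
        redirect_rule->block_id = VCAP_IS2;
        redirect_rule->action.cpu_copy_ena = true;
        if (felix->info->quirk_no_xtr_irq) {
                /* Deliver to the CPU port module and also redirect to the
                 * tag_8021q CPU port, which acts as the notification channel.
                 */
                redirect_rule->action.mask_mode = OCELOT_MASK_MODE_REDIRECT;
                redirect_rule->action.port_mask = BIT(cpu);
        } else {
                /* Deliver to the CPU port module only. */
                redirect_rule->action.mask_mode = OCELOT_MASK_MODE_PERMIT_DENY;
                redirect_rule->action.port_mask = 0;
        }

        ret = ocelot_vcap_filter_add(ocelot, tagging_rule, NULL);
        if (!ret)
                ret = ocelot_vcap_filter_add(ocelot, redirect_rule, NULL);
        /* (error unwinding / freeing of unused rules omitted in this sketch) */

        return ret;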
[RFC PATCH v2 net-next 15/15] net: dsa: tag_ocelot_8021q: add support for PTP timestamping
On TX, use the result of the ptp_classify_raw() BPF classifier from dsa_skb_tx_timestamp() to divert some frames over to the MMIO-based injection registers. On RX, set up a VCAP IS2 rule that redirects the frames with an EtherType for 1588 to the CPU port module (for MMIO based extraction) and, if the "no XTR IRQ" workaround is in place, copies them to the dsa_8021q CPU port as well (for notification). There is a conflict between the VCAP IS2 trapping rule and the semantics of the BPF classifier. Namely, ptp_classify_raw() deems general messages as non-timestampable, but still, those are trapped to the CPU port module since they have an EtherType of ETH_P_1588. So, if the "no XTR IRQ" workaround is in place, we need to run another BPF classifier on the frames extracted over MMIO, to avoid duplicates being sent to the stack (once over Ethernet, once over MMIO). It doesn't look like it's possible to install VCAP IS2 rules based on keys extracted from the 1588 frame headers. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/dsa/ocelot/felix.c | 12 + drivers/net/dsa/ocelot/felix_tag_8021q.c | 61 drivers/net/dsa/ocelot/felix_tag_8021q.h | 7 +++ drivers/net/ethernet/mscc/ocelot.c | 3 ++ drivers/net/ethernet/mscc/ocelot.h | 8 include/soc/mscc/ocelot.h| 9 net/dsa/tag_ocelot_8021q.c | 24 ++ 7 files changed, 116 insertions(+), 8 deletions(-) diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c index 77f73c6bad0b..a1ee6884f11c 100644 --- a/drivers/net/dsa/ocelot/felix.c +++ b/drivers/net/dsa/ocelot/felix.c @@ -757,6 +757,18 @@ static bool felix_rxtstamp(struct dsa_switch *ds, int port, struct timespec64 ts; u64 tstamp, val; + /* If the "no XTR IRQ" workaround is in use, tell DSA to defer this skb +* for RX timestamping. Then free it, and poll for its copy through +* MMIO in the CPU port module, and inject that into the stack from +* ocelot_xtr_poll(). +* If the "no XTR IRQ" workaround isn't in use, this is a no-op and +* should be eliminated by the compiler as dead code. +*/ + if (felix_check_xtr_pkt(ocelot, type)) { + kfree_skb(skb); + return true; + } + ocelot_ptp_gettime64(&ocelot->ptp_info, &ts); tstamp = ktime_set(ts.tv_sec, ts.tv_nsec); diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.c b/drivers/net/dsa/ocelot/felix_tag_8021q.c index 5243e55a8054..2a7ab3ddb12b 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.c +++ b/drivers/net/dsa/ocelot/felix_tag_8021q.c @@ -11,9 +11,70 @@ #include #include #include +#include #include "felix.h" #include "felix_tag_8021q.h" +bool felix_check_xtr_pkt(struct ocelot *ocelot, unsigned int ptp_type) +{ + struct felix *felix = ocelot_to_felix(ocelot); + int err, grp = 0; + + if (!felix->info->quirk_no_xtr_irq) + return false; + + if (ptp_type == PTP_CLASS_NONE) + return false; + + while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) { + struct ocelot_frame_info info = {}; + struct dsa_port *dp; + struct sk_buff *skb; + unsigned int type; + + err = ocelot_xtr_poll_xfh(ocelot, grp, &info); + if (err) + break; + + if (WARN_ON(info.port >= ocelot->num_phys_ports)) + goto out; + + dp = dsa_to_port(felix->ds, info.port); + + err = ocelot_xtr_poll_frame(ocelot, grp, dp->slave, + &info, &skb); + if (err) + break; + + /* We trap to the CPU port module all PTP frames, but +* felix_rxtstamp() only gets called for event frames. +* So we need to avoid sending duplicate general +* message frames by running a second BPF classifier +* here and dropping those. 
+*/ + __skb_push(skb, ETH_HLEN); + + type = ptp_classify_raw(skb); + + __skb_pull(skb, ETH_HLEN); + + if (type == PTP_CLASS_NONE) { + kfree_skb(skb); + continue; + } + + netif_rx(skb); + } + +out: + if (err < 0) { + ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp)); + ocelot_write(ocelot, QS_XTR_FLUSH, 0); + } + + return true; +} + static int felix_tag_8021q_rxvlan_add(struct ocelot *ocelot, int port, u16 vid, bool pvid, bool untagged) { diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.h b/drivers/net/dsa/ocelot/felix_tag_8021q.h index 47684a18c96e..1bf92d5fc4c4 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.h +++
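The tagger-side TX hunk is cut off above; what the changelog describes boils down to something like the following (illustrative only; the surrounding queueing, rew_op computation and skb ownership handling are elided):

        /* On the transmit path, once ptp_classify_raw() has identified the
         * skb as a PTP message, inject it through the CPU port module
         * registers instead of sending it via the DSA master.
         */
        if (type != PTP_CLASS_NONE && ocelot_can_inject(ocelot, 0))
                ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);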
[RFC PATCH v2 net-next 14/15] net: dsa: felix: use promisc on master for PTP with tag_8021q
This is for the "no extraction IRQ" workaround, where the DSA master on LS1028A (enetc) serves as a de-facto irqchip. It needs to be promiscuous so that it will never drop a PTP frame (sent to the 01-80-c2-00-00-0e multicast MAC address), otherwise the tagger will get confused about which Ethernet PTP frame corresponds to which PTP frame over the MMIO registers. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. net/dsa/tag_ocelot_8021q.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c index fa89c6a5bb7d..1b5102e74369 100644 --- a/net/dsa/tag_ocelot_8021q.c +++ b/net/dsa/tag_ocelot_8021q.c @@ -53,6 +53,7 @@ static struct dsa_device_ops ocelot_netdev_ops = { .xmit = ocelot_xmit, .rcv= ocelot_rcv, .overhead = VLAN_HLEN, + .promisc_on_master = true, }; MODULE_LICENSE("GPL v2"); -- 2.25.1
Re: [RFC PATCH v2 3/8] net: sparx5: add hostmode with phylink support
Hi Andrew, On Sat, 2020-12-19 at 20:51 +0100, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you > know the content is safe > > > + /* Create a phylink for PHY management. Also handles SFPs */ > > + spx5_port->phylink_config.dev = &spx5_port->ndev->dev; > > + spx5_port->phylink_co > > nfig.type = PHYLINK_NETDEV; > > + spx5_port->phylink_config.pcs_poll = true; > > + > > + /* phylink needs a valid interface mode to parse dt node */ > > + if (phy_mode == PHY_INTERFACE_MODE_NA) > > + phy_mode = PHY_INTERFACE_MODE_10GBASER; > > Maybe just enforce a valid value in DT? Maybe I need to clarify that you must choose between an Ethernet cuPHY or an SFP, so it is optional. > > > +/* Configuration */ > > +static inline bool sparx5_use_cu_phy(struct sparx5_port *port) > > +{ > > + return port->conf.phy_mode != PHY_INTERFACE_MODE_NA; > > +} > > That is a rather odd definition of copper. Should I rather use a bool property to select between the two options (cuPHY or SFP)? > > > diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > b/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > new file mode 100644 > > index ..6f9282e9d3f4 > > --- /dev/null > > +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > @@ -0,0 +1,203 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > +/* Microchip Sparx5 Switch driver > > + * > > + * Copyright (c) 2020 Microchip Technology Inc. and its > > subsidiaries. > > + */ > > + > > +#include "sparx5_main.h" > > I don't actually know what is preferred here, but very few drivers > i've reviewed put all the required headers into another header > file. They normally list them in each .c file. I will look at reworking this. > > > +static int sparx5_port_open(struct net_device *ndev) > > +{ > > + struct sparx5_port *port = netdev_priv(ndev); > > + int err = 0; > > + > > + err = phylink_of_phy_connect(port->phylink, port->of_node, > > 0); > > + if (err) { > > + netdev_err(ndev, "Could not attach to PHY\n"); > > + return err; > > + } > > + > > + phylink_start(port->phylink); > > + > > + if (!ndev->phydev) { > > Humm. When is ndev->phydev set? I don't think phylink ever sets it. Indirectly: phylink_of_phy_connect uses phy_attach_direct and that sets the phydev. > > > + /* power up serdes */ > > + port->conf.power_down = false; > > + err = phy_power_on(port->serdes); > > + if (err) > > + netdev_err(ndev, "%s failed\n", __func__); > > + } > > + > > + return err; > > +} > > > +struct net_device *sparx5_create_netdev(struct sparx5 *sparx5, u32 > > portno) > > +{ > > + struct net_device *ndev; > > + struct sparx5_port *spx5_port; > > + > > + ndev = devm_alloc_etherdev(sparx5->dev, sizeof(struct > > sparx5_port)); > > + if (!ndev) > > + return ERR_PTR(-ENOMEM); > > + > > + SET_NETDEV_DEV(ndev, sparx5->dev); > > + spx5_port = netdev_priv(ndev); > > + spx5_port->ndev = ndev; > > + spx5_port->sparx5 = sparx5; > > + spx5_port->portno = portno; > > + sparx5_set_port_ifh(spx5_port->ifh, portno); > > + snprintf(ndev->name, IFNAMSIZ, "eth%d", portno); > > + > > + ether_setup(ndev); > > devm_alloc_etherdev() should of already called ether_setup(). Ah - yes it is the setup(dev) call in alloc_netdev_mqs. I will remove that then. > > > + ndev->netdev_ops = &sparx5_port_netdev_ops; > > + ndev->features |= NETIF_F_LLTX; /* software tx */ > > + > > + ether_addr_copy(ndev->dev_addr, sparx5->base_mac); > > + ndev->dev_addr[ETH_ALEN - 1] += portno + 1; > > That will cause some surprises with wrap around. Use eth_addr_inc() OK - will do. 
> > > +static void sparx5_xtr_grp(struct sparx5 *sparx5, u8 grp, bool > > byte_swap) > > +{ > > + int i, byte_cnt = 0; > > + bool eof_flag = false, pruned_flag = false, abort_flag = > > false; > > + u32 ifh[IFH_LEN]; > > + struct sk_buff *skb; > > + struct frame_info fi; > > + struct sparx5_port *port; > > + struct net_device *netdev; > > + u32 *rxbuf; > > + > > + /* Get IFH */ > > + for (i = 0; i < IFH_LEN; i++) > > + ifh[i] = spx5_rd(sparx5, QS_XTR_RD(grp)); > > + > > + /* Decode IFH (whats needed) */ > > + sparx5_ifh_parse(ifh, &fi); > > + > > + /* Map to port netdev */ > > + port = fi.src_port < SPX5_PORTS ? > > + sparx5->ports[fi.src_port] : NULL; > > + if (!port || !port->ndev) { > > + dev_err(sparx5->dev, "Data on inactive port %d\n", > > fi.src_port); > > + sparx5_xtr_flush(sparx5, grp); > > + return; > > + } > > + > > + /* Have netdev, get skb */ > > + netdev = port->ndev; > > + skb = netdev_alloc_skb(netdev, netdev->mtu + ETH_HLEN); > > + if (!skb) { > > + sparx5_xtr_flush(sparx5, grp); > > +
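On the dev_addr wrap-around point raised earlier in this reply, a carry-safe way to derive the per-port MAC is something along these lines (a sketch; calling eth_addr_inc() in a loop or doing the 64-bit math directly are both fine):

        u64 mac = ether_addr_to_u64(sparx5->base_mac) + portno + 1;

        u64_to_ether_addr(mac, ndev->dev_addr);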
[PATCH wireless v3 -next] brcmfmac: Delete useless kfree code
A null pointer will be passed to a kfree() call after a kzalloc() call failed. This code is useless. Thus delete the extra function call. A goto statement is also no longer needed. Thus adjust an if branch. Signed-off-by: Zheng Yongjun --- .../wireless/broadcom/brcm80211/brcmfmac/firmware.c| 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c index d821a4758f8c..d40104b8df55 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c @@ -319,8 +319,10 @@ static void brcmf_fw_strip_multi_v2(struct nvram_parser *nvp, u16 domain_nr, u8 *nvram; nvram = kzalloc(nvp->nvram_len + 1 + 3 + sizeof(u32), GFP_KERNEL); - if (!nvram) - goto fail; + if (!nvram) { + nvp->nvram_len = 0; + return; + } /* Copy all valid entries, release old nvram and assign new one. * Valid entries are of type pcie/X/Y/ where X = domain_nr and @@ -350,10 +352,6 @@ static void brcmf_fw_strip_multi_v2(struct nvram_parser *nvp, u16 domain_nr, kfree(nvp->nvram); nvp->nvram = nvram; nvp->nvram_len = j; - return; -fail: - kfree(nvram); - nvp->nvram_len = 0; } static void brcmf_fw_add_defaults(struct nvram_parser *nvp) -- 2.22.0
RE: [PATCH v3] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
From: Andrea Parri (Microsoft) Sent: Monday, December 7, 2020 8:53 PM > > From: Andres Beltran > > Pointers to ring-buffer packets sent by Hyper-V are used within the > guest VM. Hyper-V can send packets with erroneous values or modify > packet fields after they are processed by the guest. To defend > against these scenarios, return a copy of the incoming VMBus packet > after validating its length and offset fields in hv_pkt_iter_first(). > In this way, the packet can no longer be modified by the host. > > Signed-off-by: Andres Beltran > Co-developed-by: Andrea Parri (Microsoft) > Signed-off-by: Andrea Parri (Microsoft) > Cc: "David S. Miller" > Cc: Jakub Kicinski > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: netdev@vger.kernel.org > Cc: linux-s...@vger.kernel.org > --- > drivers/hv/channel.c | 9 ++-- > drivers/hv/hv_fcopy.c | 1 + > drivers/hv/hv_kvp.c | 1 + > drivers/hv/hyperv_vmbus.h | 2 +- > drivers/hv/ring_buffer.c | 82 ++- > drivers/net/hyperv/hyperv_net.h | 3 ++ > drivers/net/hyperv/netvsc.c | 2 + > drivers/net/hyperv/rndis_filter.c | 2 + > drivers/scsi/storvsc_drv.c| 10 > include/linux/hyperv.h| 48 +++--- > net/vmw_vsock/hyperv_transport.c | 4 +- > 11 files changed, 139 insertions(+), 25 deletions(-) > Reviewed-by: Michael Kelley
Re: Reporting SFP presence status
On Tue, Dec 22, 2020 at 07:28:10AM +0100, Martin Hundebøll wrote: > Hi Andrew, > > On 21/12/2020 16.22, Andrew Lunn wrote: > > On Mon, Dec 21, 2020 at 11:37:55AM +0100, Martin Hundebøll wrote: > > > Hi Andrew, > > > > > > I've browsed the code in drivers/net/phy, but haven't found a place where > > > the SFP module status/change is reported to user-space. Is there a > > > "standard" way to report insert/remove events for SFP modules, or should > > > we > > > just add a custom sysfs attribute to our driver? > > > > Hi Martin > > > > There is currently no standard way of notifying user space. But it is > > something which could be added. But it should not be systfs. This > > should be a netlink notification, probably as part of ethtool netlink > > API. Or maybe the extended link info. > > > > What is your intended use case? Why do you need to know when a module > > has been inserted? It seems like you cannot do too much on such a > > notification. It seems lots of modules don't conform to the standard, > > will not immediately respond on the i2c bus. So ethtool -m is probably > > not going to be useful. You probably need to poll until it does > > respond, which defeats the purpose of having a notification. > > You're right; a notification isn't what I need. But a way to query the > current state of the module would be nice, i.e. using ethtool. What do you mean by state? ethtool -m gives you some state information. ENODEV gives you an idea that there is no module inserted. Lots of data suggests there is a module. You can decode the data to get a lot of information. There was also a patchset from Russell King a few weeks ago exposing some information in debugfs. But since it is debugfs, you cannot rely on it. Back to, what is you real use cases here? Andrew
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On Mon, Dec 21, 2020 at 11:41 PM Jason Wang wrote: > > > On 2020/12/22 上午7:07, Willem de Bruijn wrote: > > On Wed, Dec 16, 2020 at 3:20 AM wangyunjian wrote: > >> From: Yunjian Wang > >> > >> Currently we break the loop and wake up the vhost_worker when > >> sendmsg fails. When the worker wakes up again, we'll meet the > >> same error. > > The patch is based on the assumption that such error cases always > > return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb? > > > >> This will cause high CPU load. To fix this issue, > >> we can skip this description by ignoring the error. When we > >> exceeds sndbuf, the return value of sendmsg is -EAGAIN. In > >> the case we don't skip the description and don't drop packet. > > the -> that > > > > here and above: description -> descriptor > > > > Perhaps slightly revise to more explicitly state that > > > > 1. in the case of persistent failure (i.e., bad packet), the driver > > drops the packet > > 2. in the case of transient failure (e.g,. memory pressure) the driver > > schedules the worker to try again later > > > If we want to go with this way, we need a better time to wakeup the > worker. Otherwise it just produces more stress on the cpu that is what > this patch tries to avoid. Perhaps I misunderstood the purpose of the patch: is it to drop everything, regardless of transient or persistent failure, until the ring runs out of descriptors? I can understand both a blocking and drop strategy during memory pressure. But partial drop strategy until exceeding ring capacity seems like a peculiar hybrid?
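To make the two cases in the summary concrete, the split being discussed looks roughly like this in the tx loop (a sketch built from the error codes mentioned in the thread, not the actual patch):

                err = sock->ops->sendmsg(sock, &msg, len);
                if (err == -EAGAIN || err == -ENOMEM) {
                        /* Transient failure (sndbuf exhausted, memory
                         * pressure): keep the descriptor and let the worker
                         * retry later.
                         */
                        vhost_discard_vq_desc(vq, 1);
                        break;
                } else if (err < 0) {
                        /* Persistent failure (malformed packet): drop it,
                         * consume the descriptor and move on.
                         */
                        vq_err(vq, "sendmsg err %d, dropping packet\n", err);
                }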
Re: [PATCH -next] scsi: megaraid: Remove unnecessary memset
Zheng Yongjun writes: > memcpy operation is next to memset code, and the size to copy is equals to > the size to > memset, so the memset operation is unnecessary, remove it. > > Signed-off-by: Zheng Yongjun > --- > drivers/net/wireless/ath/wcn36xx/smd.c | 1 - > 1 file changed, 1 deletion(-) The title is wrong, this is not about scsi. Please resend as v2. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 3/8] net: sparx5: add hostmode with phylink support
On Tue, Dec 22, 2020 at 10:46:12AM +0100, Steen Hegelund wrote: > Hi Andrew, > > On Sat, 2020-12-19 at 20:51 +0100, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you > > know the content is safe > > > > > + /* Create a phylink for PHY management. Also handles SFPs */ > > > + spx5_port->phylink_config.dev = &spx5_port->ndev->dev; > > > + spx5_port->phylink_co > > > nfig.type = PHYLINK_NETDEV; > > > + spx5_port->phylink_config.pcs_poll = true; > > > + > > > + /* phylink needs a valid interface mode to parse dt node */ > > > + if (phy_mode == PHY_INTERFACE_MODE_NA) > > > + phy_mode = PHY_INTERFACE_MODE_10GBASER; > > > > Maybe just enforce a valid value in DT? > > Maybe I need to clarify that you must choose between an Ethernet cuPHY > or an SFP, so it is optional. But you also need to watch out for somebody putting a copper modules in an SFP port. phylink will then set the mode to SGMII for a 1G copper module, etc. > > > +/* Configuration */ > > > +static inline bool sparx5_use_cu_phy(struct sparx5_port *port) > > > +{ > > > + return port->conf.phy_mode != PHY_INTERFACE_MODE_NA; > > > +} > > > > That is a rather odd definition of copper. > > Should I rather use a bool property to select between the two options > (cuPHY or SFP)? I guess what you are trying to indicate is between a hard wired Copper PHY and an SFP cage? You have some sort of MII switch which allows the MAC to be connected to either the QSGMII PHY, or an SFP cage? But since the SFP cage could be populated with a copper PHY, and PHYLINK will then instantiate a phylib copper PHY driver for it, looking at phy_mode is not reliable. You need a property which selects the port, not the technology. > > > +static int sparx5_port_open(struct net_device *ndev) > > > +{ > > > + struct sparx5_port *port = netdev_priv(ndev); > > > + int err = 0; > > > + > > > + err = phylink_of_phy_connect(port->phylink, port->of_node, > > > 0); > > > + if (err) { > > > + netdev_err(ndev, "Could not attach to PHY\n"); > > > + return err; > > > + } > > > + > > > + phylink_start(port->phylink); > > > + > > > + if (!ndev->phydev) { > > > > Humm. When is ndev->phydev set? I don't think phylink ever sets it. > > Indirectly: phylink_of_phy_connect uses phy_attach_direct and that sets > the phydev. Ah, O.K. But watch out for a copper SFP module! > > > +static void sparx5_xtr_grp(struct sparx5 *sparx5, u8 grp, bool > > > byte_swap) > > > +{ > > > + int i, byte_cnt = 0; > > > + bool eof_flag = false, pruned_flag = false, abort_flag = > > > false; > > > + u32 ifh[IFH_LEN]; > > > + struct sk_buff *skb; > > > + struct frame_info fi; > > > + struct sparx5_port *port; > > > + struct net_device *netdev; > > > + u32 *rxbuf; > > > + > > > + /* Get IFH */ > > > + for (i = 0; i < IFH_LEN; i++) > > > + ifh[i] = spx5_rd(sparx5, QS_XTR_RD(grp)); > > > + > > > + /* Decode IFH (whats needed) */ > > > + sparx5_ifh_parse(ifh, &fi); > > > + > > > + /* Map to port netdev */ > > > + port = fi.src_port < SPX5_PORTS ? 
> > > + sparx5->ports[fi.src_port] : NULL; > > > + if (!port || !port->ndev) { > > > + dev_err(sparx5->dev, "Data on inactive port %d\n", > > > fi.src_port); > > > + sparx5_xtr_flush(sparx5, grp); > > > + return; > > > + } > > > + > > > + /* Have netdev, get skb */ > > > + netdev = port->ndev; > > > + skb = netdev_alloc_skb(netdev, netdev->mtu + ETH_HLEN); > > > + if (!skb) { > > > + sparx5_xtr_flush(sparx5, grp); > > > + dev_err(sparx5->dev, "No skb allocated\n"); > > > + return; > > > + } > > > + rxbuf = (u32 *)skb->data; > > > + > > > + /* Now, pull frame data */ > > > + while (!eof_flag) { > > > + u32 val = spx5_rd(sparx5, QS_XTR_RD(grp)); > > > + u32 cmp = val; > > > + > > > + if (byte_swap) > > > + cmp = ntohl((__force __be32)val); > > > + > > > + switch (cmp) { > > > + case XTR_NOT_READY: > > > + break; > > > + case XTR_ABORT: > > > + /* No accompanying data */ > > > + abort_flag = true; > > > + eof_flag = true; > > > + break; > > > + case XTR_EOF_0: > > > + case XTR_EOF_1: > > > + case XTR_EOF_2: > > > + case XTR_EOF_3: > > > + /* This assumes STATUS_WORD_POS == 1, Status > > > + * just after last data > > > + */ > > > + byte_cnt -= (4 - XTR_VALID_BYTES(val)); > > > + eof_flag = true; > > > + break; > > > + case XTR_PRUNED: > > > + /* But get the last 4 bytes as well
Re: [PATCH 05/12 v2 RFC] skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Replace sock_zerocopy_put with the generic skb_zcopy_put() > function. Pass 'true' as the success argument, as this > is identical to no change. > > Signed-off-by: Jonathan Lemon uarg->zerocopy may be false if sock_zerocopy_put_abort is called from tcp_sendmsg_locked
Re: [PATCH 01/12 v2 RFC] net: group skb_shinfo zerocopy related bits together.
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > In preparation for expanded zerocopy (TX and RX), move > the ZC related bits out of tx_flags into their own flag > word. > > Signed-off-by: Jonathan Lemon I think it's better to expand tx_flags to a u16 and add the two new flags that you need. That allows the additional 7 bits to be used for arbitrary flags, not stranding 8 bits exactly only for zerocopy features. Moving around a few u8's in the same cacheline won't be a problem. I also prefer not to rename flags that some of us are familiar with, if it's not needed.
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Before this change, the caller of sock_zerocopy_callback would > need to save the zerocopy status, decrement and check the refcount, > and then call the callback function - the callback was only invoked > when the refcount reached zero. > > Now, the caller just passes the status into the callback function, > which saves the status and handles its own refcounts. > > This makes the behavior of the sock_zerocopy_callback identical > to the tpacket and vhost callbacks. > > Signed-off-by: Jonathan Lemon > --- > include/linux/skbuff.h | 3 --- > net/core/skbuff.c | 14 +++--- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index fb6dd6af0f82..c9d7de9d624d 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -1482,9 +1482,6 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, > bool zerocopy) > if (uarg) { > if (skb_zcopy_is_nouarg(skb)) { > /* no notification callback */ > - } else if (uarg->callback == sock_zerocopy_callback) { > - uarg->zerocopy = uarg->zerocopy && zerocopy; > - sock_zerocopy_put(uarg); > } else { > uarg->callback(uarg, zerocopy); > } > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index ea32b3414ad6..73699dbdc4a1 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1194,7 +1194,7 @@ static bool skb_zerocopy_notify_extend(struct sk_buff > *skb, u32 lo, u16 len) > return true; > } > > -void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > +static void __sock_zerocopy_callback(struct ubuf_info *uarg) > { > struct sk_buff *tail, *skb = skb_from_uarg(uarg); > struct sock_exterr_skb *serr; > @@ -1222,7 +1222,7 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > bool success) > serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY; > serr->ee.ee_data = hi; > serr->ee.ee_info = lo; > - if (!success) > + if (!uarg->zerocopy) > serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED; > > q = &sk->sk_error_queue; > @@ -1241,11 +1241,19 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > bool success) > consume_skb(skb); > sock_put(sk); > } > + > +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > +{ > + uarg->zerocopy = uarg->zerocopy & success; > + > + if (refcount_dec_and_test(&uarg->refcnt)) > + __sock_zerocopy_callback(uarg); > +} > EXPORT_SYMBOL_GPL(sock_zerocopy_callback); I still think this helper is unnecessary. Just return immediately in existing sock_zerocopy_callback if refcount is not zero. > void sock_zerocopy_put(struct ubuf_info *uarg) > { > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) > + if (uarg) > uarg->callback(uarg, uarg->zerocopy); > } > EXPORT_SYMBOL_GPL(sock_zerocopy_put); This does increase the number of indirect function calls. Which are not cheap post spectre. In the common case for msg_zerocopy we only have two clones, one sent out and one on the retransmit queue. So I guess the cost will be acceptable.
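For reference, the no-helper variant suggested above would keep a single function and just bail out early (a sketch, with the unchanged notification body elided):

void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
{
        uarg->zerocopy = uarg->zerocopy & success;

        if (!refcount_dec_and_test(&uarg->refcnt))
                return;

        /* ... existing error-queue notification body ... */
}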
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > All 'struct ubuf_info' users should have a callback defined. > Remove the dead code path to consume_skb(), which makes > unwarranted assumptions about how the structure was allocated. > > Signed-off-by: Jonathan Lemon Please link to the commit I shared that made the consume_skb path obsolete. Before that this would have been unsafe. should have a callback defined -> have a callback defined as of commit 0a4a060bb204 ("sock: fix zerocopy_success regression with msg_zerocopy"). With that explanation why this is correct Acked-by: Willem de Bruijn
kernel BUG at lib/string.c:LINE! (6)
Hello, syzbot found the following issue on: HEAD commit:d64c6f96 Merge tag 'net-5.11-rc1' of git://git.kernel.org/.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=10bc561350 kernel config: https://syzkaller.appspot.com/x/.config?x=aca0dc5c721fe9e5 dashboard link: https://syzkaller.appspot.com/bug?extid=e86f7c428c8c50db65b4 compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=169378a750 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=144692cb50 The issue was bisected to: commit 2f78788b55baa3410b1ec91a576286abe1ad4d6a Author: Jakub Jelinek Date: Wed Dec 16 04:43:37 2020 + ilog2: improve ilog2 for constant arguments bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1584f13750 final oops: https://syzkaller.appspot.com/x/report.txt?x=1784f13750 console output: https://syzkaller.appspot.com/x/log.txt?x=1384f13750 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+e86f7c428c8c50db6...@syzkaller.appspotmail.com Fixes: 2f78788b55ba ("ilog2: improve ilog2 for constant arguments") detected buffer overflow in strlen [ cut here ] kernel BUG at lib/string.c:1149! invalid opcode: [#1] PREEMPT SMP KASAN CPU: 0 PID: 8713 Comm: syz-executor731 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:fortify_panic+0xf/0x11 lib/string.c:1149 Code: b5 78 a3 04 48 c7 c7 c0 8f c2 89 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 30 ba ee ff 48 89 fe 48 c7 c7 80 90 c2 89 e8 21 ba ee ff <0f> 0b e8 90 f9 97 f8 0f b6 f3 48 c7 c7 20 f4 10 8c e8 41 e8 fc fa RSP: 0018:c900020af500 EFLAGS: 00010282 RAX: 0022 RBX: 888011c26768 RCX: RDX: 88801bad RSI: 815a6925 RDI: f52000415e92 RBP: 88801be7c220 R08: 0022 R09: R10: 815a4d7b R11: R12: 88801180ec00 R13: 888011c26700 R14: 192000415ea2 R15: 0010 FS: 00812880() GS:8880b9c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 006dcf60 CR3: 141ee000 CR4: 001506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: strlen include/linux/string.h:325 [inline] strlcpy include/linux/string.h:348 [inline] xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143 xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1019 check_target net/ipv6/netfilter/ip6_tables.c:529 [inline] find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:572 translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:734 do_replace net/ipv6/netfilter/ip6_tables.c:1152 [inline] do_ip6t_set_ctl+0x553/0xb70 net/ipv6/netfilter/ip6_tables.c:1636 nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101 ipv6_setsockopt+0x122/0x180 net/ipv6/ipv6_sockglue.c:1008 tcp_setsockopt+0x136/0x2440 net/ipv4/tcp.c:3597 __sys_setsockopt+0x2db/0x610 net/socket.c:2115 __do_sys_setsockopt net/socket.c:2126 [inline] __se_sys_setsockopt net/socket.c:2123 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2123 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4493d9 Code: e8 0c ca 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 9b cb fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fff679a3898 EFLAGS: 0246 ORIG_RAX: 0036 RAX: ffda RBX: 22c0 RCX: 004493d9 RDX: 0040 RSI: 0029 RDI: 0006 RBP: 7fff679a38b0 R08: 0470 R09: 00c2 R10: 2080 R11: 0246 R12: 000112d5 R13: 006d7dc8 R14: R15: Modules linked in: ---[ end trace e17a915ca7e8b666 ]--- RIP: 0010:fortify_panic+0xf/0x11 
lib/string.c:1149 Code: b5 78 a3 04 48 c7 c7 c0 8f c2 89 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 30 ba ee ff 48 89 fe 48 c7 c7 80 90 c2 89 e8 21 ba ee ff <0f> 0b e8 90 f9 97 f8 0f b6 f3 48 c7 c7 20 f4 10 8c e8 41 e8 fc fa RSP: 0018:c900020af500 EFLAGS: 00010282 RAX: 0022 RBX: 888011c26768 RCX: RDX: 88801bad RSI: 815a6925 RDI: f52000415e92 RBP: 88801be7c220 R08: 0022 R09: R10: 815a4d7b R11: R12: 88801180ec00 R13: 888011c26700 R14: 192000415ea2 R15: 0010 FS: 00812880() GS:8880b9c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 006dcf60 CR3: 141ee000 CR4: 001506f0 DR0: DR1: 000
[RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
This series introduces a framework that can be used to implement vDPA devices in a userspace program. The work consists of two parts: control path forwarding and data path offloading.

In the control path, the VDUSE driver uses a message mechanism to forward config operations from the vdpa bus driver to userspace. Userspace can use read()/write() to receive and reply to those control messages.

In the data path, the core job is mapping the DMA buffer into the VDUSE daemon's address space, which can be implemented in different ways depending on the vdpa bus to which the vDPA device is attached. In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with a bounce-buffering mechanism to achieve that. In the vhost-vdpa case, the DMA buffer resides in a userspace memory region which can be shared with the VDUSE userspace process by transferring the shmfd.

The details and our use case are shown below:

[Architecture diagram: a container's /dev/vdx (virtio-blk driver) and a QEMU VM's block device (vhost-vdpa driver via /dev/vhost-vdpa-x) both attach, through the virtio/vdpa bus and the vduse driver, to a vDPA device emulated by the VDUSE daemon, whose block driver connects over TCP/IP through the NIC to the remote storages.]

We make use of it to implement a block device connecting to our distributed storage, which can be used both in containers and VMs. Thus, we can have a unified technology stack in these two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
- Improve perf
[RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases
Now we have a global percpu counter to limit the recursion depth of eventfd_signal(). This can avoid deadlock or stack overflow. But in stack overflow case, it should be OK to increase the recursion depth if needed. So we add a percpu counter in eventfd_ctx to limit the recursion depth for deadlock case. Then it could be fine to increase the global percpu counter later. Signed-off-by: Xie Yongji --- fs/aio.c| 3 ++- fs/eventfd.c| 20 +++- include/linux/eventfd.h | 5 + 3 files changed, 22 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 1f32da13d39e..5d82903161f5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, list_del(&iocb->ki_list); iocb->ki_res.res = mangle_poll(mask); req->done = true; - if (iocb->ki_eventfd && eventfd_signal_count()) { + if (iocb->ki_eventfd && + eventfd_signal_count(iocb->ki_eventfd)) { iocb = NULL; INIT_WORK(&req->work, aio_poll_put_work); schedule_work(&req->work); diff --git a/fs/eventfd.c b/fs/eventfd.c index e265b6dd4f34..2df24f9bada3 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -25,6 +25,8 @@ #include #include +#define EVENTFD_WAKE_DEPTH 0 + DEFINE_PER_CPU(int, eventfd_wake_count); static DEFINE_IDA(eventfd_ida); @@ -42,9 +44,17 @@ struct eventfd_ctx { */ __u64 count; unsigned int flags; + int __percpu *wake_count; int id; }; +bool eventfd_signal_count(struct eventfd_ctx *ctx) +{ + return (this_cpu_read(*ctx->wake_count) || + this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH); +} +EXPORT_SYMBOL_GPL(eventfd_signal_count); + /** * eventfd_signal - Adds @n to the eventfd counter. * @ctx: [in] Pointer to the eventfd context. @@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) * it returns true, the eventfd_signal() call should be deferred to a * safe context. */ - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count))) + if (WARN_ON_ONCE(eventfd_signal_count(ctx))) return 0; spin_lock_irqsave(&ctx->wqh.lock, flags); this_cpu_inc(eventfd_wake_count); + this_cpu_inc(*ctx->wake_count); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; ctx->count += n; if (waitqueue_active(&ctx->wqh)) wake_up_locked_poll(&ctx->wqh, EPOLLIN); this_cpu_dec(eventfd_wake_count); + this_cpu_dec(*ctx->wake_count); spin_unlock_irqrestore(&ctx->wqh.lock, flags); return n; @@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx) { if (ctx->id >= 0) ida_simple_remove(&eventfd_ida, ctx->id); + free_percpu(ctx->wake_count); kfree(ctx); } @@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags) kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); + ctx->wake_count = alloc_percpu(int); + if (!ctx->wake_count) { + kfree(ctx); + return -ENOMEM; + } ctx->count = count; ctx->flags = flags; ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL); diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index fa0a524baed0..1a11ebbd74a9 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt); DECLARE_PER_CPU(int, eventfd_wake_count); -static inline bool eventfd_signal_count(void) -{ - return this_cpu_read(eventfd_wake_count); -} +bool eventfd_signal_count(struct eventfd_ctx *ctx); #else /* CONFIG_EVENTFD */ -- 2.11.0
[RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal()
Increase the recursion depth of eventfd_signal() to 1. This will be used in VDUSE case later. Signed-off-by: Xie Yongji --- fs/eventfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 2df24f9bada3..478cdc175949 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -25,7 +25,7 @@ #include #include -#define EVENTFD_WAKE_DEPTH 0 +#define EVENTFD_WAKE_DEPTH 1 DEFINE_PER_CPU(int, eventfd_wake_count); -- 2.11.0
[RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices
With VDUSE, we should be able to support all kinds of virtio devices. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 29 +++-- 1 file changed, 3 insertions(+), 26 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 29ed4173f04e..448be7875b6d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "vhost.h" @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return 0; } -static int vhost_vdpa_config_validate(struct vhost_vdpa *v, - struct vhost_vdpa_config *c) -{ - long size = 0; - - switch (v->virtio_id) { - case VIRTIO_ID_NET: - size = sizeof(struct virtio_net_config); - break; - } - - if (c->len == 0) - return -EINVAL; - - if (c->len > size - c->off) - return -E2BIG; - - return 0; -} - static long vhost_vdpa_get_config(struct vhost_vdpa *v, struct vhost_vdpa_config __user *c) { @@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa) int minor; int r; - /* Currently, we only accept the network devices. */ - if (ops->get_device_id(vdpa) != VIRTIO_ID_NET) - return -ENOTSUPP; - v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!v) return -ENOMEM; -- 2.11.0
[RFC v2 01/13] mm: export zap_page_range() for driver use
Export zap_page_range() for use in VDUSE. Signed-off-by: Xie Yongji --- mm/memory.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory.c b/mm/memory.c index 7d608765932b..edd2d6497bb3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1542,6 +1542,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, start, range.end); } +EXPORT_SYMBOL(zap_page_range); /** * zap_page_range_single - remove user pages in a given range -- 2.11.0
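For context, the reason VDUSE needs this export is to tear down the daemon's userspace mapping of the bounce buffer when the underlying pages are recycled, so the next access faults and gets remapped to fresh pages. A minimal sketch of that kind of call site (the function name and vma bookkeeping are illustrative, not taken from the series):

static void vduse_invalidate_user_mapping(struct vm_area_struct *vma)
{
        /* Drop every PTE in the daemon's mapping of the bounce buffer;
         * the next access will fault and be remapped to new pages.
         */
        zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}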
[RFC v2 13/13] vduse: Introduce a workqueue for irq injection
This patch introduces a dedicated workqueue for irq injection so that we are able to do some performance tuning for it. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/eventfd.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c index dbffddb08908..caf7d8d68ac0 100644 --- a/drivers/vdpa/vdpa_user/eventfd.c +++ b/drivers/vdpa/vdpa_user/eventfd.c @@ -18,6 +18,7 @@ #include "eventfd.h" static struct workqueue_struct *vduse_irqfd_cleanup_wq; +static struct workqueue_struct *vduse_irq_wq; static void vduse_virqfd_shutdown(struct work_struct *work) { @@ -57,7 +58,7 @@ static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode, __poll_t flags = key_to_poll(key); if (flags & EPOLLIN) - schedule_work(&virqfd->inject); + queue_work(vduse_irq_wq, &virqfd->inject); if (flags & EPOLLHUP) { spin_lock(&vq->irq_lock); @@ -165,11 +166,18 @@ int vduse_virqfd_init(void) if (!vduse_irqfd_cleanup_wq) return -ENOMEM; + vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_SYSFS | WQ_UNBOUND, 0); + if (!vduse_irq_wq) { + destroy_workqueue(vduse_irqfd_cleanup_wq); + return -ENOMEM; + } + return 0; } void vduse_virqfd_exit(void) { + destroy_workqueue(vduse_irq_wq); destroy_workqueue(vduse_irqfd_cleanup_wq); } -- 2.11.0
[RFC v2 07/13] vduse: support get/set virtqueue state
This patch makes vhost-vdpa bus driver can get/set virtqueue state from userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 4 +++ drivers/vdpa/vdpa_user/vduse_dev.c | 54 ++ include/uapi/linux/vduse.h | 9 +++ 3 files changed, 67 insertions(+) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index da9b3040f20a..623f7b040ccf 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -30,6 +30,10 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_VQ_READY: Get ready status of virtqueue +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue + +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue + - VDUSE_SET_FEATURES: Set virtio features supported by the driver - VDUSE_GET_FEATURES: Get virtio features supported by the device diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index 4a869b9698ef..b974333ed4e9 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -291,6 +291,40 @@ static bool vduse_dev_get_vq_ready(struct vduse_dev *dev, return ready; } +static int vduse_dev_get_vq_state(struct vduse_dev *dev, + struct vduse_virtqueue *vq, + struct vdpa_vq_state *state) +{ + struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE); + int ret; + + msg->req.size = sizeof(struct vduse_vq_state); + msg->req.vq_state.index = vq->index; + + ret = vduse_dev_msg_sync(dev, msg); + state->avail_index = msg->resp.vq_state.avail_idx; + vduse_dev_msg_put(msg); + + return ret; +} + +static int vduse_dev_set_vq_state(struct vduse_dev *dev, + struct vduse_virtqueue *vq, + const struct vdpa_vq_state *state) +{ + struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE); + int ret; + + msg->req.size = sizeof(struct vduse_vq_state); + msg->req.vq_state.index = vq->index; + msg->req.vq_state.avail_idx = state->avail_index; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; @@ -431,6 +465,24 @@ static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx) return vq->ready; } +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx, + const struct vdpa_vq_state *state) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + struct vduse_virtqueue *vq = &dev->vqs[idx]; + + return vduse_dev_set_vq_state(dev, vq, state); +} + +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx, + struct vdpa_vq_state *state) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + struct vduse_virtqueue *vq = &dev->vqs[idx]; + + return vduse_dev_get_vq_state(dev, vq, state); +} + static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa) { struct vduse_dev *dev = vdpa_to_vduse(vdpa); @@ -532,6 +584,8 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = { .set_vq_num = vduse_vdpa_set_vq_num, .set_vq_ready = vduse_vdpa_set_vq_ready, .get_vq_ready = vduse_vdpa_get_vq_ready, + .set_vq_state = vduse_vdpa_set_vq_state, + .get_vq_state = vduse_vdpa_get_vq_state, .get_vq_align = vduse_vdpa_get_vq_align, .get_features = vduse_vdpa_get_features, .set_features = vduse_vdpa_set_features, diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h index f8579abdaa3b..873305dfd93f 100644 --- a/include/uapi/linux/vduse.h +++ b/include/uapi/linux/vduse.h @@ -13,6 +13,8 @@ enum 
vduse_req_type { VDUSE_SET_VQ_ADDR, VDUSE_SET_VQ_READY, VDUSE_GET_VQ_READY, + VDUSE_SET_VQ_STATE, + VDUSE_GET_VQ_STATE, VDUSE_SET_FEATURES, VDUSE_GET_FEATURES, VDUSE_SET_STATUS, @@ -38,6 +40,11 @@ struct vduse_vq_ready { __u8 ready; }; +struct vduse_vq_state { + __u32 index; + __u16 avail_idx; +}; + struct vduse_dev_config_data { __u32 offset; __u32 len; @@ -53,6 +60,7 @@ struct vduse_dev_request { struct vduse_vq_num vq_num; /* virtqueue num */ struct vduse_vq_addr vq_addr; /* virtqueue address */ struct vduse_vq_ready vq_ready; /* virtqueue ready status */ + struct vduse_vq_state vq_state; /* virtqueue state */ struct vduse_dev_config_data config; /* virtio device config space */ __u
[RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
This patch introduces a new method in the vdpa_config_ops to support processing the raw vhost memory mapping message in the vDPA device driver. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 5 - include/linux/vdpa.h | 7 +++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 448be7875b6d..ccbb391e38be 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, if (r) return r; + if (ops->process_iotlb_msg) + return ops->process_iotlb_msg(vdpa, msg); + switch (msg->type) { case VHOST_IOTLB_UPDATE: r = vhost_vdpa_process_iotlb_update(v, msg); @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) int ret; /* Device want to do DMA by itself */ - if (ops->set_map || ops->dma_map) + if (ops->set_map || ops->dma_map || ops->process_iotlb_msg) return 0; bus = dma_dev->bus; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 656fe264234e..7bccedf22f4b 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -172,6 +173,10 @@ struct vdpa_iova_range { * @vdev: vdpa device * Returns the iova range supported by * the device. + * @process_iotlb_msg: Process vhost memory mapping message (optional) + * Only used for VDUSE device now + * @vdev: vdpa device + * @msg: vhost memory mapping message * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -240,6 +245,8 @@ struct vdpa_config_ops { struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); /* DMA ops */ + int (*process_iotlb_msg)(struct vdpa_device *vdev, +struct vhost_iotlb_msg *msg); int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size, u64 pa, u32 perm); -- 2.11.0
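A rough sketch of how a parent driver could wire up the new optional callback. Only the op signature and the vhost_iotlb_msg fields are real; my_vdpa_dev, my_map_range() and my_unmap_range() are invented names for illustration.

#include <linux/vdpa.h>
#include <linux/vhost_types.h>

static int my_process_iotlb_msg(struct vdpa_device *vdev,
				struct vhost_iotlb_msg *msg)
{
	struct my_vdpa_dev *dev = container_of(vdev, struct my_vdpa_dev, vdpa);

	switch (msg->type) {
	case VHOST_IOTLB_UPDATE:
		/* msg->uaddr is a VA in the vhost-vdpa owner process */
		return my_map_range(dev, msg->iova, msg->uaddr,
				    msg->size, msg->perm);
	case VHOST_IOTLB_INVALIDATE:
		my_unmap_range(dev, msg->iova, msg->size);
		return 0;
	default:
		return -EINVAL;
	}
}

static const struct vdpa_config_ops my_config_ops = {
	/* ... the usual vq/feature/status ops ... */
	.process_iotlb_msg = my_process_iotlb_msg,
};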
[RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add()
Pass the netlink attributes to ops.dev_add() so that we could get some device specific attributes when creating a vdpa device. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa.c | 2 +- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- include/linux/vdpa.h | 4 +++- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 32bd48baffab..f6ff81927694 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -440,7 +440,7 @@ static int vdpa_nl_cmd_dev_add_set_doit(struct sk_buff *skb, struct genl_info *i goto err; } - vdev = pdev->ops->dev_add(pdev, name, device_id); + vdev = pdev->ops->dev_add(pdev, name, device_id, info->attrs); if (IS_ERR(vdev)) goto err; diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 85776e4e6749..cfc314f5403a 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -726,7 +726,8 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = { }; static struct vdpa_device * -vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, u32 device_id) +vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, + u32 device_id, struct nlattr **attrs) { struct vdpasim *simdev; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index cb5a3d847af3..656fe264234e 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -6,6 +6,7 @@ #include #include #include +#include /** * vDPA callback definition. @@ -349,6 +350,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, * @pdev: parent device to use for device addition * @name: name of the new vdpa device * @device_id: device id of the new vdpa device + * @attrs: device specific attributes * Driver need to add a new device using vdpa_register_device() after * fully initializing the vdpa device. On successful addition driver * must return a valid pointer of vdpa device or ERR_PTR for the error. @@ -359,7 +361,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, */ struct vdpa_dev_ops { struct vdpa_device* (*dev_add)(struct vdpa_parent_dev *pdev, const char *name, - u32 device_id); + u32 device_id, struct nlattr **attrs); void (*dev_del)(struct vdpa_parent_dev *pdev, struct vdpa_device *dev); }; -- 2.11.0
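For illustration, a dev_add() implementation that consumes one of the forwarded attributes. VDPA_ATTR_DEV_EXAMPLE_CFG is a made-up attribute name; the only point is that the generic netlink attributes from the request are now visible to the parent driver.

#include <linux/vdpa.h>
#include <net/netlink.h>

static struct vdpa_device *
my_dev_add(struct vdpa_parent_dev *pdev, const char *name,
	   u32 device_id, struct nlattr **attrs)
{
	u16 example_cfg = 0;

	if (attrs && attrs[VDPA_ATTR_DEV_EXAMPLE_CFG])
		example_cfg = nla_get_u16(attrs[VDPA_ATTR_DEV_EXAMPLE_CFG]);

	/*
	 * ... allocate the device, apply @example_cfg, then register it
	 * with vdpa_register_device() as before ...
	 */
	return ERR_PTR(-EOPNOTSUPP);	/* placeholder */
}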
[RFC v2 09/13] vduse: Add support for processing vhost iotlb message
To support the vhost-vdpa bus driver, we need a way to share the vhost-vdpa backend process's memory with the userspace VDUSE process. This patch tries to make use of the vhost iotlb message to achieve that. We will get the shm file from the iotlb message and pass it to the userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 15 +++- drivers/vdpa/vdpa_user/vduse_dev.c | 147 - include/uapi/linux/vduse.h | 11 +++ 3 files changed, 171 insertions(+), 2 deletions(-) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index 623f7b040ccf..48e4b1ba353f 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_CONFIG: Read from device specific configuration space +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB + +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB + Please see include/linux/vdpa.h for details. -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +The data path of userspace vDPA device is implemented in different ways +depending on the vdpa bus to which it is attached. + +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. The userspace iova region can be created by passing the userspace vDPA device fd to mmap(2). +In vhost-vdpa case, the dma buffer resides in a userspace memory region +which will be shared with the VDUSE userspace process via the file +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address +mapping (IOVA of dma buffer <-> VA of the memory region) is also included +in this message. + Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. 
The following ioctls on the userspace vDPA device fd are provided to support that: diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index b974333ed4e9..d24aaacb6008 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -34,6 +34,7 @@ struct vduse_dev_msg { struct vduse_dev_request req; + struct file *iotlb_file; struct vduse_dev_response resp; struct list_head list; wait_queue_head_t waitq; @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev, return ret; } +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file, + u64 offset, u64 iova, u64 size, u8 perm) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.offset = offset; + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + msg->req.iotlb.perm = perm; + msg->req.iotlb.fd = -1; + msg->iotlb_file = get_file(file); + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + fput(file); + + return ret; +} + +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev, + u64 iova, u64 size) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + +static unsigned int perm_to_file_flags(u8 perm) +{ + unsigned int flags = 0; + + switch (perm) { + case VHOST_ACCESS_WO: + flags |= O_WRONLY; + break; + case VHOST_ACCESS_RO: + flags |= O_RDONLY; + break; + case VHOST_ACCESS_RW: + flags |= O_RDWR; + break; + default: + WARN(1, "invalidate vhost IOTLB permission\n"); + break; + } + + return flags; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct vduse_dev *dev = file->private_data; struct vduse_dev_msg *msg; - int size = sizeof(struct vduse_dev_request); + unsigned int flags; + int fd, size = sizeof(struct vduse_dev_request); ssize_t ret = 0; if (iov_iter_count(to) < size) @@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) if (ret) return ret; } + + if (msg->req.type == VDUSE_UPD
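On the userspace side, consuming a VDUSE_UPDATE_IOTLB message then reduces to mmap()-ing the received descriptor. The field names follow struct vduse_iotlb from this patch; record_mapping() and the assumption that the kernel has already installed the shared file as req->iotlb.fd in the reading process are illustrative only.

#include <sys/mman.h>

static void handle_iotlb_update(struct vduse_dev_request *req)
{
	struct vduse_iotlb *iotlb = &req->iotlb;
	void *va;

	/* iotlb->perm carries a VHOST_ACCESS_* value; map RW here for brevity */
	va = mmap(NULL, iotlb->size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  iotlb->fd, iotlb->offset);
	if (va == MAP_FAILED)
		return;

	/* remember IOVA -> VA so the datapath can translate descriptors */
	record_mapping(iotlb->iova, va, iotlb->size);
}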
[RFC v2 10/13] vduse: grab the module's references until there is no vduse device
The module should not be unloaded if any vduse device exists. So increase the module's reference count when creating vduse device. And the reference count is kept until the device is destroyed. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index d24aaacb6008..c29b24a7e7e9 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -1052,6 +1052,7 @@ static int vduse_destroy_dev(u32 id) kfree(dev->vqs); vduse_iova_domain_destroy(dev->domain); vduse_dev_destroy(dev); + module_put(THIS_MODULE); return 0; } @@ -1096,6 +1097,7 @@ static int vduse_create_dev(struct vduse_dev_config *config) refcount_inc(&dev->refcnt); list_add(&dev->list, &vduse_devs); + __module_get(THIS_MODULE); return fd; err_fd: -- 2.11.0
Re: [PATCH 12/12 v2 RFC] skbuff: rename sock_zerocopy_* to msg_zerocopy_*
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > At Willem's suggestion, rename the sock_zerocopy_* functions > so that they match the MSG_ZEROCOPY flag, which makes it clear > they are specific to this zerocopy implementation. > > Signed-off-by: Jonathan Lemon Acked-by: Willem de Bruijn
[RFC v2 12/13] vduse: Add memory shrinker to reclaim bounce pages
Add a shrinker to reclaim several pages used by bounce buffer in order to avoid memory pressures. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/vduse_dev.c | 51 ++ 1 file changed, 51 insertions(+) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index c29b24a7e7e9..1bc2e627c476 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -1142,6 +1142,43 @@ static long vduse_ioctl(struct file *file, unsigned int cmd, return ret; } +static unsigned long vduse_shrink_scan(struct shrinker *shrinker, + struct shrink_control *sc) +{ + unsigned long freed = 0; + struct vduse_dev *dev; + + if (!mutex_trylock(&vduse_lock)) + return SHRINK_STOP; + + list_for_each_entry(dev, &vduse_devs, list) { + if (!dev->domain) + continue; + + freed = vduse_domain_reclaim(dev->domain); + if (!freed) + continue; + + list_move_tail(&dev->list, &vduse_devs); + break; + } + mutex_unlock(&vduse_lock); + + return freed ? freed : SHRINK_STOP; +} + +static unsigned long vduse_shrink_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + return percpu_counter_read_positive(&vduse_total_bounce_pages); +} + +static struct shrinker vduse_bounce_pages_shrinker = { + .count_objects = vduse_shrink_count, + .scan_objects = vduse_shrink_scan, + .seeks = DEFAULT_SEEKS, +}; + static const struct file_operations vduse_fops = { .owner = THIS_MODULE, .unlocked_ioctl = vduse_ioctl, @@ -1292,12 +1329,24 @@ static int vduse_init(void) if (ret) goto err_irqfd; + ret = vduse_domain_init(); + if (ret) + goto err_domain; + + ret = register_shrinker(&vduse_bounce_pages_shrinker); + if (ret) + goto err_shrinker; + ret = vduse_parentdev_init(); if (ret) goto err_parentdev; return 0; err_parentdev: + unregister_shrinker(&vduse_bounce_pages_shrinker); +err_shrinker: + vduse_domain_exit(); +err_domain: vduse_virqfd_exit(); err_irqfd: destroy_workqueue(vduse_vdpa_wq); @@ -1309,8 +1358,10 @@ module_init(vduse_init); static void vduse_exit(void) { + unregister_shrinker(&vduse_bounce_pages_shrinker); misc_deregister(&vduse_misc); destroy_workqueue(vduse_vdpa_wq); + vduse_domain_exit(); vduse_virqfd_exit(); vduse_parentdev_exit(); } -- 2.11.0
[RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
This VDUSE driver enables implementing vDPA devices in userspace. Both control path and data path of vDPA devices will be able to be handled in userspace. In the control path, the VDUSE driver will make use of message mechnism to forward the config operation from vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply those control messages. In the data path, the VDUSE driver implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. Userspace can access those iova region via mmap(). Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace Now we only support virtio-vdpa bus driver with this patch applied. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 74 ++ Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig |8 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/eventfd.c | 221 drivers/vdpa/vdpa_user/eventfd.h | 48 + drivers/vdpa/vdpa_user/iova_domain.c | 442 drivers/vdpa/vdpa_user/iova_domain.h | 93 ++ drivers/vdpa/vdpa_user/vduse.h | 59 ++ drivers/vdpa/vdpa_user/vduse_dev.c | 1121 include/uapi/linux/vdpa.h |1 + include/uapi/linux/vduse.h | 99 ++ 13 files changed, 2173 insertions(+) create mode 100644 Documentation/driver-api/vduse.rst create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/eventfd.c create mode 100644 drivers/vdpa/vdpa_user/eventfd.h create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h create mode 100644 drivers/vdpa/vdpa_user/vduse.h create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst new file mode 100644 index ..da9b3040f20a --- /dev/null +++ b/Documentation/driver-api/vduse.rst @@ -0,0 +1,74 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. + +How VDUSE works + +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on +the VDUSE character device (/dev/vduse). Then a file descriptor pointing +to the new resources will be returned, which can be used to implement the +userspace vDPA device's control path and data path. + +To implement control path, the read/write operations to the file descriptor +will be used to receive/reply the control messages from/to VDUSE driver. +Those control messages are based on the vdpa_config_ops which defines a +unified interface to control different types of vDPA device. + +The following types of messages are provided by the VDUSE framework now: + +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue. 
+ +- VDUSE_SET_VQ_NUM: Set the size of virtqueue + +- VDUSE_SET_VQ_READY: Set ready status of virtqueue + +- VDUSE_GET_VQ_READY: Get ready status of virtqueue + +- VDUSE_SET_FEATURES: Set virtio features supported by the driver + +- VDUSE_GET_FEATURES: Get virtio features supported by the device + +- VDUSE_SET_STATUS: Set the device status + +- VDUSE_GET_STATUS: Get the device status + +- VDUSE_SET_CONFIG: Write to device specific configuration space + +- VDUSE_GET_CONFIG: Read from device specific configuration space + +Please see include/linux/vdpa.h for details. + +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +driver which supports mapping the kernel dma buffer to a userspace iova +region dynamically. The userspace iova region can be created by passing +the userspace vDPA device fd to mmap(2). + +Besides, the eventfd mechanism is used to trigger interrupt callbacks and +receive virtqueue kicks in userspace. The following ioctls on the userspace +vDPA device fd are provided to support that: + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE driver to notify userspace to consume the vring. + +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used + by userspace to notify VDUSE driver to trigger interrupt callbacks. + +MMU-based IOMMU Driver +-- +The basic idea behind the IOMMU driver is treating MMU (VA->
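A bare-bones userspace skeleton of the flow described in the documentation above. The members of struct vduse_dev_config and the exact kickfd/irqfd ioctl arguments are not spelled out here, so treat everything beyond the ioctl names and the read() loop as an assumption.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vduse.h>

static int run_vduse_device(struct vduse_dev_config *config)
{
	struct vduse_dev_request req;
	int ctrl_fd, dev_fd;

	ctrl_fd = open("/dev/vduse", O_RDWR);
	if (ctrl_fd < 0)
		return -1;

	/* returns an fd for the new userspace vDPA device */
	dev_fd = ioctl(ctrl_fd, VDUSE_CREATE_DEV, config);
	if (dev_fd < 0)
		return -1;

	/*
	 * Data path setup would follow here: mmap(dev_fd) for the iova
	 * region, plus VDUSE_VQ_SETUP_KICKFD / VDUSE_VQ_SETUP_IRQFD with
	 * an eventfd per virtqueue.
	 */

	/* control path: receive requests, dispatch on req.type, reply */
	while (read(dev_fd, &req, sizeof(req)) > 0) {
		/* VDUSE_SET_FEATURES, VDUSE_SET_STATUS, VDUSE_SET_VQ_ADDR, ... */
	}
	return 0;
}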
[RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages
Introduce vduse_domain_reclaim() to support reclaiming bounce page when necessary. We will do reclaiming chunk by chunk. And only reclaim the iova chunk that no one used. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/iova_domain.c | 83 ++-- drivers/vdpa/vdpa_user/iova_domain.h | 10 + 2 files changed, 89 insertions(+), 4 deletions(-) diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 27022157abc6..c438cc85d33d 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -29,6 +29,8 @@ struct vduse_mmap_vma { struct list_head list; }; +struct percpu_counter vduse_total_bounce_pages; + static inline struct page * vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, unsigned long iova) @@ -48,6 +50,13 @@ vduse_domain_set_bounce_page(struct vduse_iova_domain *domain, unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; unsigned long pgindex = chunkoff >> PAGE_SHIFT; + if (page) { + domain->chunks[index].used_bounce_pages++; + percpu_counter_inc(&vduse_total_bounce_pages); + } else { + domain->chunks[index].used_bounce_pages--; + percpu_counter_dec(&vduse_total_bounce_pages); + } domain->chunks[index].bounce_pages[pgindex] = page; } @@ -175,6 +184,29 @@ void vduse_domain_remove_mapping(struct vduse_iova_domain *domain, } } +static bool vduse_domain_try_unmap(struct vduse_iova_domain *domain, + unsigned long iova, size_t size) +{ + struct vduse_mmap_vma *mmap_vma; + unsigned long uaddr; + bool unmap = true; + + mutex_lock(&domain->vma_lock); + list_for_each_entry(mmap_vma, &domain->vma_list, list) { + if (!mmap_read_trylock(mmap_vma->vma->vm_mm)) { + unmap = false; + break; + } + + uaddr = iova + mmap_vma->vma->vm_start; + zap_page_range(mmap_vma->vma, uaddr, size); + mmap_read_unlock(mmap_vma->vma->vm_mm); + } + mutex_unlock(&domain->vma_lock); + + return unmap; +} + void vduse_domain_unmap(struct vduse_iova_domain *domain, unsigned long iova, size_t size) { @@ -302,6 +334,32 @@ bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain, return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP; } +int vduse_domain_reclaim(struct vduse_iova_domain *domain) +{ + struct vduse_iova_chunk *chunk; + int i, freed = 0; + + for (i = domain->chunk_num - 1; i >= 0; i--) { + chunk = &domain->chunks[i]; + if (!chunk->used_bounce_pages) + continue; + + if (atomic_cmpxchg(&chunk->state, 0, INT_MIN) != 0) + continue; + + if (!vduse_domain_try_unmap(domain, + chunk->start, IOVA_CHUNK_SIZE)) { + atomic_sub(INT_MIN, &chunk->state); + break; + } + freed += vduse_domain_free_bounce_pages(domain, + chunk->start, IOVA_CHUNK_SIZE); + atomic_sub(INT_MIN, &chunk->state); + } + + return freed; +} + unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain, size_t size, enum iova_map_type type) { @@ -319,10 +377,13 @@ unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain, if (atomic_read(&chunk->map_type) != type) continue; - iova = gen_pool_alloc_algo(chunk->pool, size, + if (atomic_fetch_inc(&chunk->state) >= 0) { + iova = gen_pool_alloc_algo(chunk->pool, size, gen_pool_first_fit_align, &data); - if (iova) - break; + if (iova) + break; + } + atomic_dec(&chunk->state); } return iova; @@ -335,6 +396,7 @@ void vduse_domain_free_iova(struct vduse_iova_domain *domain, struct vduse_iova_chunk *chunk = &domain->chunks[index]; gen_pool_free(chunk->pool, iova, size); + atomic_dec(&chunk->state); } static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk) @@ -351,7 +413,8 @@ void 
vduse_iova_domain_destroy(struct vduse_iova_domain *domain) for (i = 0; i < domain->chunk_num; i++) { chunk = &domain->chunks[i]; - vduse_domain_free_bounce_pages(domain, + if (chunk->used_bounce_pages) + vduse_domain_free_bounce_pages(domain, chunk->start, IOVA_CHUNK_SIZE); vd
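A stand-alone illustration of the gating scheme used above: each chunk carries an atomic counter of in-flight users, allocators take a reference only while the counter is non-negative, and the reclaimer claims an idle chunk exclusively by biasing the counter with INT_MIN. The helper names are invented; only the idea mirrors the patch.

#include <linux/atomic.h>

static bool chunk_get(atomic_t *state)
{
	if (atomic_fetch_inc(state) >= 0)
		return true;		/* chunk usable, reference held */
	atomic_dec(state);		/* reclaim in progress, back off */
	return false;
}

static void chunk_put(atomic_t *state)
{
	atomic_dec(state);
}

static bool chunk_try_reclaim(atomic_t *state)
{
	/* claim the chunk only if nobody holds a reference right now */
	if (atomic_cmpxchg(state, 0, INT_MIN) != 0)
		return false;
	/* ... zap the userspace mappings and free the bounce pages ... */
	atomic_sub(INT_MIN, state);	/* drop the exclusive bias */
	return true;
}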
Re: [PATCH 02/12 v2 RFC] skbuff: remove unused skb_zcopy_abort function
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > skb_zcopy_abort() has no in-tree consumers, remove it. > > Signed-off-by: Jonathan Lemon Acked-by: Willem de Bruijn Indeed unused. This was used in the packet and raw zerocopy patches in the original patchset. But we never merged those, for lack of copy avoidance benefit at small packet sizes.
Re: [PATCH 09/12 v2 RFC] skbuff: add zc_flags to ubuf_info for ubuf setup
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Currently, an ubuf is attached to a new skb, the skb zc_flags > is initialized to a fixed value. Instead of doing this, set > the default zc_flags in the ubuf, and have new skb's inherit > from this default. > > This is needed when setting up different zerocopy types. > > Signed-off-by: Jonathan Lemon > --- > include/linux/skbuff.h | 3 ++- > net/core/skbuff.c | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index da0c10da..b90be4b0b2de 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -478,6 +478,7 @@ struct ubuf_info { > }; > }; > refcount_t refcnt; > + u8 zc_flags; > > struct mmpin { > struct user_struct *user; When allocating ubuf_info for msg_zerocopy, we actually allocate the notification skb, to be sure that notifications won't be dropped due to memory pressure at notification time. We actually allocate the skb and place ubuf_info in skb->cb[]. The struct is exactly 48 bytes on 64-bit platforms, filling all of cb. This new field fills a 4B hole, so it should still be fine. Just being very explicit, as this is a fragile bit of code. Come to think of it, this probably deserves a BUILD_BUG_ON.
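The suggested assertion could be as simple as the following, placed next to the allocation that stuffs ubuf_info into skb->cb (the exact location is a suggestion, not part of the posted series):

	/* e.g. at the top of the ubuf_info allocation path */
	BUILD_BUG_ON(sizeof(struct ubuf_info) > sizeof_field(struct sk_buff, cb));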
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
> > > +static void sparx5_board_init(struct sparx5 *sparx5) > > > +{ > > > + int idx; > > > + > > > + if (!sparx5->sd_sgpio_remapping) > > > + return; > > > + > > > + /* Enable SGPIO Signal Detect remapping */ > > > + spx5_rmw(GCB_HW_SGPIO_SD_CFG_SD_MAP_SEL, > > > + GCB_HW_SGPIO_SD_CFG_SD_MAP_SEL, > > > + sparx5, > > > + GCB_HW_SGPIO_SD_CFG); > > > + > > > + /* Refer to LOS SGPIO */ > > > + for (idx = 0; idx < SPX5_PORTS; idx++) { > > > + if (sparx5->ports[idx]) { > > > + if (sparx5->ports[idx]->conf.sd_sgpio != ~0) > > > { > > > + spx5_wr(sparx5->ports[idx]- > > > >conf.sd_sgpio, > > > + sparx5, > > > + > > > GCB_HW_SGPIO_TO_SD_MAP_CFG(idx)); > > > + } > > > + } > > > + } > > > +} > > > > I've not looked at how you do SFP integration yet. Is this the LOS > > from the SFP socket? Is there a Linux GPIO controller exported by > > this > > driver, so the SFP driver can use the GPIOs? > > Yes the SFP driver (used by the Sparx5 SerDes driver) will use the > SGPIO LOS, Module Detect etc, and the Port Modules are aware of the > location of the LOS, and use this by default without any driver > configuration. > But on the PCB134 the SGPIOs are shifted one bit by a mistake, and they > are not located in the expected position, so we have this board > remapping function to handle that aspect. Is it possible to turn this off in the hardware? It might be less confusing if LOS it determined by phylink, not phylink and the switch itself. Especially when we get into race conditions between PHYLINK polling the GPIO and the hardware taking the short cut? > > > +static int mchp_sparx5_probe(struct platform_device *pdev) > > > +{ > > > + struct device_node *np = pdev->dev.of_node; > > > + struct sparx5 *sparx5; > > > + struct device_node *ports, *portnp; > > > + const u8 *mac_addr; > > > + int err = 0; > > > + > > > + if (!np && !pdev->dev.platform_data) > > > + return -ENODEV; > > > + > > > + sparx5 = devm_kzalloc(&pdev->dev, sizeof(*sparx5), > > > GFP_KERNEL); > > > + if (!sparx5) > > > + return -ENOMEM; > > > + > > > + platform_set_drvdata(pdev, sparx5); > > > + sparx5->pdev = pdev; > > > + sparx5->dev = &pdev->dev; > > > + > > > + /* Default values, some from DT */ > > > + sparx5->coreclock = SPX5_CORE_CLOCK_DEFAULT; > > > + > > > + mac_addr = of_get_mac_address(np); > > > + if (IS_ERR_OR_NULL(mac_addr)) { > > > + dev_info(sparx5->dev, "MAC addr was not set, use > > > random MAC\n"); > > > + eth_random_addr(sparx5->base_mac); > > > + sparx5->base_mac[5] = 0; > > > + } else { > > > + ether_addr_copy(sparx5->base_mac, mac_addr); > > > + } > > > > The binding document does not say anything about a MAC address at the > > top level. What is this used for? > > This the base MAC address used for generating the the switch NI's MAC > addresses. Yes, that is obvious from the code. But all DT properties must be in the binding Documentation. The DT verifier is going to complain when it finds a mac-address property which is not described in the yaml file. > > > + config.media_type = ETH_MEDIA_DAC; > > > + config.serdes_reset = true; > > > + config.portmode = config.phy_mode; > > > + err = sparx5_probe_port(sparx5, portnp, serdes, > > > portno, &config); > > > + if (err) { > > > + dev_err(sparx5->dev, "port probe error\n"); > > > + goto cleanup_ports; > > > + } > > > + } > > > + sparx5_board_init(sparx5); > > > + > > > +cleanup_ports: > > > + return err; > > > > Seems missed named, no cleanup. > > Ah - this comes later (as the driver was split in functional groups for > reviewing). 
I hope this is OK, as it is only temporary - I could add a > comment to that effect. Yes, this is fine. Here, and in other places, a comment like: /* More code to be added in later patches */ would have been nice, just as a heads up. That is the problem with linear patch review. > > > +static int __init sparx5_switch_reset(void) > > > +{ > > > + const char *syscon_cpu = "microchip,sparx5-cpu-syscon", > > > + *syscon_gcb = "microchip,sparx5-gcb-syscon"; > > > + struct regmap *cpu_ctrl, *gcb_ctrl; > > > + u32 val; > > > + > > > + cpu_ctrl = syscon_regmap_lookup_by_compatible(syscon_cpu); > > > + if (IS_ERR(cpu_ctrl)) { > > > + pr_err("No '%s' syscon map\n", syscon_cpu); > > > + return PTR_ERR(cpu_ctrl); > > > + } > > > + > > > + gcb_ctrl = syscon_regmap_lookup_by_compatible(syscon_gcb); > > > + if (IS_ERR(gcb_ctrl)) { > > > + pr_err("No '%s' syscon map\n", syscon_gcb); > >
Re: [PATCH v3 03/24] wfx: add Makefile/Kconfig
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller [...] > +wfx-$(CONFIG_SPI) += bus_spi.o > +wfx-$(subst m,y,$(CONFIG_MMC)) += bus_sdio.o Why this subst? And why only for MMC? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 01/24] mmc: sdio: add SDIO IDs for Silabs WF200 chip
Jerome Pouiller writes: > From: Jérôme Pouiller > > Add Silabs SDIO ID to sdio_ids.h. > > Note that the values used by Silabs are uncommon. A driver cannot fully > rely on the SDIO PnP. It should also check if the device is declared in > the DT. > > Signed-off-by: Jérôme Pouiller > --- > include/linux/mmc/sdio_ids.h | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/include/linux/mmc/sdio_ids.h b/include/linux/mmc/sdio_ids.h > index 12036619346c..20a48162f7fc 100644 > --- a/include/linux/mmc/sdio_ids.h > +++ b/include/linux/mmc/sdio_ids.h > @@ -25,6 +25,11 @@ > * Vendors and devices. Sort key: vendor first, device next. > */ > > +// Silabs does not use a reliable vendor ID. To avoid conflicts, the driver > +// won't probe the device if it is not also declared in the DT. C++ comments? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 03/24] wfx: add Makefile/Kconfig
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller > --- > drivers/net/wireless/silabs/wfx/Kconfig | 8 > drivers/net/wireless/silabs/wfx/Makefile | 25 > 2 files changed, 33 insertions(+) > create mode 100644 drivers/net/wireless/silabs/wfx/Kconfig > create mode 100644 drivers/net/wireless/silabs/wfx/Makefile > > diff --git a/drivers/net/wireless/silabs/wfx/Kconfig > b/drivers/net/wireless/silabs/wfx/Kconfig > new file mode 100644 > index ..83ee4d0ca8c6 > --- /dev/null > +++ b/drivers/net/wireless/silabs/wfx/Kconfig > @@ -0,0 +1,8 @@ > +config WFX > + tristate "Silicon Labs wireless chips WF200 and further" > + depends on MAC80211 > + depends on MMC || !MMC # do not allow WFX=y if MMC=m > + depends on (SPI || MMC) > + help > + This is a driver for Silicons Labs WFxxx series (WF200 and further) > + chipsets. This chip can be found on SPI or SDIO buses. Kconfig should mention about the SDIO id snafu and that Device Tree is required. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 05/24] wfx: add main.c/main.h
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller [...] > +static const struct ieee80211_supported_band wfx_band_2ghz = { > + .channels = wfx_2ghz_chantable, > + .n_channels = ARRAY_SIZE(wfx_2ghz_chantable), > + .bitrates = wfx_rates, > + .n_bitrates = ARRAY_SIZE(wfx_rates), > + .ht_cap = { > + // Receive caps This driver is full of C++ comments, please convert them to network style comments. More info in the coding style doc. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 4/8] net: sparx5: add port module support
> > > +static int sparx5_port_verify_speed(struct sparx5 *sparx5, > > > + struct sparx5_port *port, > > > + struct sparx5_port_config *conf) > > > +{ > > > + case PHY_INTERFACE_MODE_SGMII: > > > + if (conf->speed != SPEED_1000 && > > > + conf->speed != SPEED_100 && > > > + conf->speed != SPEED_10 && > > > + conf->speed != SPEED_2500) > > > + return sparx5_port_error(port, conf, > > > SPX5_PERR_SPEED); > > > > Is it really SGMII over clocked at 2500? Or 2500BaseX? > > Yes the SGMII mode in the serdes driver is overclocked. > Nothing in the switch driver needs changing when changing between > speeds 1G/2G5. So it continues to use the SGMII inband signalling? There is a lot of confusion in this area, but SGMII inband signalling overclocked does not make much sense. So it is more likely to be using 2500BaseX. Andrew
Re: [PATCH v3 09/24] wfx: add hwio.c/hwio.h
Jerome Pouiller writes: > +/* > + * Internal helpers. > + * > + * About CONFIG_VMAP_STACK: > + * When CONFIG_VMAP_STACK is enabled, it is not possible to run DMA on stack > + * allocated data. Functions below that work with registers (aka functions > + * ending with "32") automatically reallocate buffers with kmalloc. However, > + * functions that work with arbitrary length buffers let's caller to handle > + * memory location. In doubt, enable CONFIG_DEBUG_SG to detect badly located > + * buffer. > + */ This sounds very hacky to me, I have understood that you should never use stack with DMA. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
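For reference, the pattern under discussion looks roughly like this: small register accesses are bounced through kmalloc() memory instead of handing a stack address to the bus layer. Names are invented; this is not the driver's actual code.

static int my_read_reg32(struct my_bus *bus, u32 addr, u32 *val)
{
	__le32 *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);	/* DMA-safe buffer */
	int ret;

	if (!tmp)
		return -ENOMEM;
	ret = my_bus_copy_from_io(bus, addr, tmp, sizeof(*tmp));
	if (!ret)
		*val = le32_to_cpu(*tmp);
	kfree(tmp);
	return ret;
}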
[PATCH net] net: ipa: fix interconnect enable bug
When the core clock rate and interconnect bandwidth specifications were moved into configuration data, a copy/paste bug was introduced, causing the memory interconnect bandwidth to be set three times rather than enabling the three different interconnects. Fix this bug. Fixes: 91d02f9551501 ("net: ipa: use config data for clocking") Signed-off-by: Alex Elder --- drivers/net/ipa/ipa_clock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ipa/ipa_clock.c b/drivers/net/ipa/ipa_clock.c index 9dcf16f399b7a..135c393437f12 100644 --- a/drivers/net/ipa/ipa_clock.c +++ b/drivers/net/ipa/ipa_clock.c @@ -115,13 +115,13 @@ static int ipa_interconnect_enable(struct ipa *ipa) return ret; data = &clock->interconnect_data[IPA_INTERCONNECT_IMEM]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->imem_path, data->average_rate, data->peak_rate); if (ret) goto err_memory_path_disable; data = &clock->interconnect_data[IPA_INTERCONNECT_CONFIG]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->config_path, data->average_rate, data->peak_rate); if (ret) goto err_imem_path_disable; -- 2.20.1
Re: [PATCH v3 12/24] wfx: add hif_api_*.h
Jerome Pouiller writes: > --- /dev/null > +++ b/drivers/net/wireless/silabs/wfx/hif_api_general.h > @@ -0,0 +1,267 @@ > +/* SPDX-License-Identifier: Apache-2.0 */ > +/* > + * WFx hardware interface definitions > + * > + * Copyright (c) 2018-2020, Silicon Laboratories Inc. > + */ > + > +#ifndef WFX_HIF_API_GENERAL_H > +#define WFX_HIF_API_GENERAL_H > + > +#ifdef __KERNEL__ > +#include > +#include > +#else > +#include > +#include > +#define __packed __attribute__((__packed__)) > +#endif Why check for __KERNEL__ and redefined __packed? These don't belong to a wireless driver. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 00/24] wfx: get out from the staging area
Jerome Pouiller writes: > From: Jérôme Pouiller > > I think the wfx driver is now mature enough to be accepted in the > drivers/net/wireless directory. What's the status with firmware images? Can anyone take the latest kernel and linux-firmware and use this driver normally? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 09/24] wfx: add hwio.c/hwio.h
On Tue, Dec 22, 2020 at 05:10:11PM +0200, Kalle Valo wrote: > Jerome Pouiller writes: > > > +/* > > + * Internal helpers. > > + * > > + * About CONFIG_VMAP_STACK: > > + * When CONFIG_VMAP_STACK is enabled, it is not possible to run DMA on > > stack > > + * allocated data. Functions below that work with registers (aka functions > > + * ending with "32") automatically reallocate buffers with kmalloc. > > However, > > + * functions that work with arbitrary length buffers let's caller to handle > > + * memory location. In doubt, enable CONFIG_DEBUG_SG to detect badly > > located > > + * buffer. > > + */ > > This sounds very hacky to me, I have understood that you should never > use stack with DMA. You should never do that because some platforms do not support it, so no driver should ever try to do that as they do not know what platform they are running on. thanks, greg k-h
Re: [PATCH v3 18/24] wfx: add data_tx.c/data_tx.h
Jerome Pouiller writes: > +static bool ieee80211_is_action_back(struct ieee80211_hdr *hdr) > +{ > + struct ieee80211_mgmt *mgmt = (struct ieee80211_mgmt *)hdr; > + > + if (!ieee80211_is_action(mgmt->frame_control)) > + return false; > + if (mgmt->u.action.category != WLAN_CATEGORY_BACK) > + return false; > + return true; > +} Don't use ieee80211_ prefix within a driver, it's reserved for mac80211. Use wfx_ instead. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 0/8] Adding the Sparx5 Switch Driver
Florian Fainelli writes: > On 12/16/2020 11:51 PM, Steen Hegelund wrote: >> This series provides the Microchip Sparx5 Switch Driver >> >> The Sparx5 Carrier Ethernet and Industrial switch family delivers 64 >> Ethernet ports and up to 200 Gbps of switching bandwidth. >> >> It provides a rich set of Ethernet switching features such as hierarchical >> QoS, hardware-based OAM and service activation testing, protection >> switching, IEEE 1588, and Synchronous Ethernet. >> >> Using provider bridging (Q-in-Q) and MPLS/MPLS-TP technology, it delivers >> MEF CE >> 2.0 Ethernet virtual connections (EVCs) and features advanced TCAM >> classification in both ingress and egress. >> >> Per-EVC features include advanced L3-aware classification, a rich set of >> statistics, OAM for end-to-end performance monitoring, and dual-rate >> policing and shaping. >> >> Time sensitive networking (TSN) is supported through a comprehensive set of >> features including frame preemption, cut-through, frame replication and >> elimination for reliability, enhanced scheduling: credit-based shaping, >> time-aware shaping, cyclic queuing, and forwarding, and per-stream policing >> and filtering. >> >> Together with IEEE 1588 and IEEE 802.1AS support, this guarantees >> low-latency deterministic networking for Fronthaul, Carrier, and Industrial >> Ethernet. >> >> The Sparx5 switch family consists of following SKUs: >> >> - VSC7546 Sparx5-64 up to 64 Gbps of bandwidth with the following primary >> port configurations: >> - 6 *10G >> - 16 * 2.5G + 2 * 10G >> - 24 * 1G + 4 * 10G >> >> - VSC7549 Sparx5-90 up to 90 Gbps of bandwidth with the following primary >> port configurations: >> - 9 * 10G >> - 16 * 2.5G + 4 * 10G >> - 48 * 1G + 4 * 10G >> >> - VSC7552 Sparx5-128 up to 128 Gbps of bandwidth with the following primary >> port configurations: >> - 12 * 10G >> - 16 * 2.5G + 8 * 10G >> - 48 * 1G + 8 * 10G >> >> - VSC7556 Sparx5-160 up to 160 Gbps of bandwidth with the following primary >> port configurations: >> - 16 * 10G >> - 10 * 10G + 2 * 25G >> - 16 * 2.5G + 10 * 10G >> - 48 * 1G + 10 * 10G >> >> - VSC7558 Sparx5-200 up to 200 Gbps of bandwidth with the following primary >> port configurations: >> - 20 * 10G >> - 8 * 25G >> >> In addition, the device supports one 10/100/1000/2500/5000 Mbps >> SGMII/SerDes node processor interface (NPI) Ethernet port. >> >> The Sparx5 support is developed on the PCB134 and PCB135 evaluation boards. >> >> - PCB134 main networking features: >> - 12x SFP+ front 10G module slots (connected to Sparx5 through SFI). >> - 8x SFP28 front 25G module slots (connected to Sparx5 through SFI high >> speed). >> - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port >> (on-board VSC8211 PHY connected to Sparx5 through SGMII). >> >> - PCB135 main networking features: >> - 48x1G (10/100/1000M) RJ45 front ports using 12xVSC8514 QuadPHY’s each >> connected to VSC7558 through QSGMII. >> - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY >> each port connects to VSC7558 through SFI. >> - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high >> speed. >> - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board >> VSC8211 PHY, which can be connected to VSC7558 NPI port through SGMII >> using a loopback add-on PCB) >> >> This series provides support for: >> - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G >> devices and media types. 
>> - Port module configuration for 10M to 25G speeds with SGMII, QSGMII, >> 1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes. >> - SerDes configuration via the Sparx5 SerDes driver (see below). >> - Host mode providing register based injection and extraction. >> - Switch mode providing MAC/VLAN table learning and Layer2 switching >> offloaded to the Sparx5 switch. >> - STP state, VLAN support, host/bridge port mode, Forwarding DB, and >> configuration and statistics via ethtool. >> >> More support will be added at a later stage. >> >> The Sparx5 Switch chip register model can be browsed here: >> Link: https://microchip-ung.github.io/sparx-5_reginfo/reginfo_sparx-5.html > > Out of curiosity, what tool was used to generate the register > information page? It looks really neat and well organized. Florian, It is an in-house tool. The input data is in a proprietary XML-like format. We're pleased that you like it, we do too. We are also pleased that being a Microchip entity, we can actually make this kind of information public. I'll pass your praise on. ---Lars -- Lars Povlsen, Microchip
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Fri, 18 Dec 2020 12:13:45 -0800 Andrii Nakryiko wrote: > On Thu, Dec 17, 2020 at 9:30 AM Jesper Dangaard Brouer > wrote: > > > > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > > it can be used from both XDP and TC. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 > > +++ > > 2 files changed, 400 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > new file mode 100644 > > index ..b5d0c3a9abe8 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > @@ -0,0 +1,204 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* Copyright (c) 2020 Jesper Dangaard Brouer */ > > + > > +#include /* before test_progs.h, avoid bpf_util.h > > redefines */ > > + > > +#include > > +#include "test_check_mtu.skel.h" > > +#include > > + > > +#include > > +#include > > + > > +#define IFINDEX_LO 1 > > + > > +static __u32 duration; /* Hint: needed for CHECK macro */ > > + > > +static int read_mtu_device_lo(void) > > +{ > > + const char *filename = "/sys/devices/virtual/net/lo/mtu"; I will change this to: /sys/class/net/lo/mtu > > + char buf[11] = {}; > > + int value; > > + int fd; > > + > > + fd = open(filename, 0, O_RDONLY); > > + if (fd == -1) > > + return -1; > > + > > + if (read(fd, buf, sizeof(buf)) == -1) > > close fd missing here? ack, fixed. > > + return -2; > > + close(fd); > > + > > + value = strtoimax(buf, NULL, 10); > > + if (errno == ERANGE) > > + return -3; > > + > > + return value; > > +} > > + > > +static void test_check_mtu_xdp_attach(struct bpf_program *prog) > > +{ > > + int err = 0; > > + int fd; > > + > > + fd = bpf_program__fd(prog); > > + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); > > + if (CHECK(err, "XDP-attach", "failed")) > > + return; > > + > > + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); > > can you please use bpf_link-based bpf_program__attach_xdp() which will > provide auto-cleanup in case of crash? Sure, that will be good for me to learn. > also check that it succeeded? > > > +} > > + > > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > > + struct bpf_program *prog, > > + __u32 mtu_expect) > > +{ > > + const char *prog_name = bpf_program__name(prog); > > + int retval_expect = XDP_PASS; > > + __u32 mtu_result = 0; > > + char buf[256]; > > + int err; > > + > > + struct bpf_prog_test_run_attr tattr = { > > + .repeat = 1, > > + .data_in = &pkt_v4, > > + .data_size_in = sizeof(pkt_v4), > > + .data_out = buf, > > + .data_size_out = sizeof(buf), > > + .prog_fd = bpf_program__fd(prog), > > + }; > > nit: it's a variable declaration, so keep it all in one block. There > is also opts-based variant, which might be good to use here instead. > > > + > > + memset(buf, 0, sizeof(buf)); > > char buf[256] = {}; would make this unnecessary ok. > > > + > > + err = bpf_prog_test_run_xattr(&tattr); > > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > > + "prog_name:%s (err %d errno %d retval %d)\n", > > + prog_name, err, errno, tattr.retval); > > + > > +CHECK(tattr.retval != retval_expect, "retval", > > whitespaces are off? Yes, I noticed with scripts/checkpatch.pl. And there are a couple more, that I've already fixed. 
> > + "progname:%s unexpected retval=%d expected=%d\n", > > + prog_name, tattr.retval, retval_expect); > > + > > + /* Extract MTU that BPF-prog got */ > > + mtu_result = skel->bss->global_bpf_mtu_xdp; > > + CHECK(mtu_result != mtu_expect, "MTU-compare-user", > > + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); > > There is nicer ASSERT_EQ() macro for such cases: > > ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); it will format > sensible error message automatically Nice simplification :-) > > > +} > > + > > [...] [... same ...] > [...] > > > + > > +void test_check_mtu(void) > > +{ > > + struct test_check_mtu *skel; > > + __u32 mtu_lo; > > + > > + skel = test_check_mtu__open_and_load(); > > + if (CHECK(!skel, "open and load skel", "failed")) > > + return; /* Exit if e.g. helper unknown to kernel */ > > + > > + if (test__start_subte
Re: [PATCH v3 18/24] wfx: add data_tx.c/data_tx.h
Jerome Pouiller writes: > +void wfx_tx(struct ieee80211_hw *hw, struct ieee80211_tx_control *control, > + struct sk_buff *skb) > +{ > + struct wfx_dev *wdev = hw->priv; > + struct wfx_vif *wvif; > + struct ieee80211_sta *sta = control ? control->sta : NULL; > + struct ieee80211_tx_info *tx_info = IEEE80211_SKB_CB(skb); > + struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data; > + size_t driver_data_room = sizeof_field(struct ieee80211_tx_info, > +rate_driver_data); > + > + compiletime_assert(sizeof(struct wfx_tx_priv) <= driver_data_room, > +"struct tx_priv is too large"); Interesting, never seen compiletime_assert() before. But I suspect BUILD_BUG_ON_MSG() is preferred, at least there are more users. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
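The BUILD_BUG_ON_MSG() equivalent of the assert above would be something like this (untested sketch):

	BUILD_BUG_ON_MSG(sizeof(struct wfx_tx_priv) >
			 sizeof_field(struct ieee80211_tx_info, rate_driver_data),
			 "struct tx_priv is too large");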
Re: [PATCH v3 05/24] wfx: add main.c/main.h
Jerome Pouiller writes: > +/* NOTE: wfx_send_pds() destroy buf */ > +int wfx_send_pds(struct wfx_dev *wdev, u8 *buf, size_t len) > +{ > + int ret; > + int start, brace_level, i; > + > + start = 0; > + brace_level = 0; > + if (buf[0] != '{') { > + dev_err(wdev->dev, "valid PDS start with '{'. Did you forget to > compress it?\n"); > + return -EINVAL; > + } > + for (i = 1; i < len - 1; i++) { > + if (buf[i] == '{') > + brace_level++; > + if (buf[i] == '}') > + brace_level--; > + if (buf[i] == '}' && !brace_level) { > + i++; > + if (i - start + 1 > WFX_PDS_MAX_SIZE) > + return -EFBIG; > + buf[start] = '{'; > + buf[i] = 0; > + dev_dbg(wdev->dev, "send PDS '%s}'\n", buf + start); > + buf[i] = '}'; > + ret = hif_configuration(wdev, buf + start, > + i - start + 1); > + if (ret > 0) { > + dev_err(wdev->dev, "PDS bytes %d to %d: invalid data (unsupported > options?)\n", > + start, i); > + return -EINVAL; > + } > + if (ret == -ETIMEDOUT) { > + dev_err(wdev->dev, "PDS bytes %d to %d: chip didn't reply (corrupted > file?)\n", > + start, i); > + return ret; > + } > + if (ret) { > + dev_err(wdev->dev, "PDS bytes %d to %d: chip returned an unknown > error\n", > + start, i); > + return -EIO; > + } > + buf[i] = ','; > + start = i; > + } > + } > + return 0; > +} What does this function do? Looks very strange. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC v2 01/13] mm: export zap_page_range() for driver use
On Tue, Dec 22, 2020 at 10:52:09PM +0800, Xie Yongji wrote: > Export zap_page_range() for use in VDUSE. Err, no. This has absolutely no business being used by drivers.
[PATCH net-next v1] stmmac: intel: Add PCI IDs for TGL-H platform
From: Noor Azura Ahmad Tarmizi Add TGL-H PCI info and PCI IDs for the new TSN Controller to the list of supported devices. Signed-off-by: Noor Azura Ahmad Tarmizi Signed-off-by: Voon Weifeng Signed-off-by: Muhammad Husaini Zulkifli --- drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c index 55dbb1a930da..d3608db576f7 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c @@ -723,6 +723,8 @@ static SIMPLE_DEV_PM_OPS(intel_eth_pm_ops, intel_eth_pci_suspend, #define PCI_DEVICE_ID_INTEL_EHL_PSE1_RGMII1G_ID0x4bb0 #define PCI_DEVICE_ID_INTEL_EHL_PSE1_SGMII1G_ID0x4bb1 #define PCI_DEVICE_ID_INTEL_EHL_PSE1_SGMII2G5_ID 0x4bb2 +#define PCI_DEVICE_ID_INTEL_TGLH_SGMII1G_0_ID 0x43ac +#define PCI_DEVICE_ID_INTEL_TGLH_SGMII1G_1_ID 0x43a2 #define PCI_DEVICE_ID_INTEL_TGL_SGMII1G_ID 0xa0ac static const struct pci_device_id intel_eth_pci_id_table[] = { @@ -737,6 +739,8 @@ static const struct pci_device_id intel_eth_pci_id_table[] = { { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII1G_ID, &ehl_pse1_sgmii1g_info) }, { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII2G5_ID, &ehl_pse1_sgmii1g_info) }, { PCI_DEVICE_DATA(INTEL, TGL_SGMII1G_ID, &tgl_sgmii1g_info) }, + { PCI_DEVICE_DATA(INTEL, TGLH_SGMII1G_0_ID, &tgl_sgmii1g_info) }, + { PCI_DEVICE_DATA(INTEL, TGLH_SGMII1G_1_ID, &tgl_sgmii1g_info) }, {} }; MODULE_DEVICE_TABLE(pci, intel_eth_pci_id_table); -- 2.17.1
Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num
On Tue, Dec 22, 2020 at 1:21 AM Antoine Tenart wrote: > > Hello Alexander, Jakub, > > Quoting Alexander Duyck (2020-12-22 00:21:57) > > > > Looking over this patch it seems kind of obvious that extending the > > xps_map_mutex is making things far more complex then they need to be. > > > > Applying the rtnl_mutex would probably be much simpler. Although as I > > think you have already discovered we need to apply it to the store, > > and show for this interface. In addition we probably need to perform > > similar locking around traffic_class_show in order to prevent it from > > generating a similar error. > > I don't think we have the same kind of issues with traffic_class_show: > dev->num_tc is used, but not for navigating through the map. Protecting > only a single read wouldn't change much. We can still think about what > could go wrong here without the lock, but that is not related to this > series of fixes. The problem is we are actually reading the netdev, tx queue, and tc_to_txq mapping. Basically we have several different items that we are accessing at the same time. If any one is updated while we are doing it then it will throw things off. > If I understood correctly, as things are a bit too complex now, you > would prefer that we go for the solution proposed in v1? Yeah, that is what I am thinking. Basically we just need to make sure the num_tc cannot be updated while we are reading the other values. > I can still do the code factoring for the 2 sysfs show operations, but > that would then target net-next and would be in a different series. So I > believe we'll use the patches of v1, unmodified. I agree the code factoring would be better targeted to net-next. The rtnl_lock approach from v1 would work for net and for backports. > Jakub, should I send a v3 then? > > Thanks! > Antoine
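For reference, the v1 approach boils down to holding the RTNL around the whole read so that num_tc and the map cannot change mid-walk. A sketch, not the exact patch; do_xps_cpus_show() stands in for the existing function body.

static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
{
	ssize_t len;

	if (!rtnl_trylock())
		return restart_syscall();

	/* dev->num_tc, tc_to_txq and the xps map are all read under RTNL,
	 * so a concurrent setup_tc/xps update cannot change them mid-walk */
	len = do_xps_cpus_show(queue, buf);

	rtnl_unlock();
	return len;
}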
Re: [PATCH net] net: ipa: fix interconnect enable bug
On 12/22/20 17:16, Alex Elder wrote: When the core clock rate and interconnect bandwidth specifications were moved into configuration data, a copy/paste bug was introduced, causing the memory interconnect bandwidth to be set three times rather than enabling the three different interconnects. Fix this bug. Fixes: 91d02f9551501 ("net: ipa: use config data for clocking") Signed-off-by: Alex Elder Reviewed-by: Georgi Djakov --- drivers/net/ipa/ipa_clock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ipa/ipa_clock.c b/drivers/net/ipa/ipa_clock.c index 9dcf16f399b7a..135c393437f12 100644 --- a/drivers/net/ipa/ipa_clock.c +++ b/drivers/net/ipa/ipa_clock.c @@ -115,13 +115,13 @@ static int ipa_interconnect_enable(struct ipa *ipa) return ret; data = &clock->interconnect_data[IPA_INTERCONNECT_IMEM]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->imem_path, data->average_rate, data->peak_rate); if (ret) goto err_memory_path_disable; data = &clock->interconnect_data[IPA_INTERCONNECT_CONFIG]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->config_path, data->average_rate, data->peak_rate); if (ret) goto err_imem_path_disable;
[PATCH bpf-next V10 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limits growing the packet size. This is done because these BPF-helpers doesn't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If BPF-prog doesn't comply with the MTU then the packet will eventually get dropped as some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically doing TC-redirect from ingress step (sch_handle_ingress) into egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kind of MTU violations. One advantage of this approach is that it ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in skb_do_redirect() call, this allows for an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as egress BPF-prog remove this prior to transmitting packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also works for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result. V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected V10: Fix selftests and 'tot_len' MTU check like XDP --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++ net/core/dev.c | 19 -- net/core/filter.c | 183 +++-- tools/include/uapi/linux/bpf.h | 78 +++ tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 769 insertions(+), 44 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
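For context, the intended TC-BPF usage of the new helper looks roughly like this, assuming the series lands with the bpf_check_mtu() signature from patch 4/7; ENCAP_LEN and the section name are illustrative.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define ENCAP_LEN 20	/* planned header growth, made-up value */

SEC("classifier")
int tc_encap_mtu_check(struct __sk_buff *ctx)
{
	__u32 mtu_len = 0;

	/* ifindex 0: check against the current netdev, prior to any redirect */
	if (bpf_check_mtu(ctx, 0, &mtu_len, ENCAP_LEN, 0))
		return TC_ACT_SHOT;	/* packet would exceed MTU after encap */

	/* ... adjust room, push the encap header, bpf_redirect(), ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";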
[PATCH bpf-next V10 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB uses __bpf_skb_max_len() as the max-length. This function limit size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grow the packet size, then it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Further more, current MTU check in __bpf_skb_max_len uses the MTU from ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is in-effect due to this being called from softirq context see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V10 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
A BPF end-user on the Cilium Slack channel (Carlo Carraro) wants to use
bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet
size, by adjusting fib_params 'tot_len' to the packet length plus the expected
encap size (just like the bpf_check_mtu helper supports). He discovered that
for the SKB ctx, params->tot_len was not used; instead skb->len was used (via
the MTU check in is_skb_forwardable(), which checks against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If it is not
provided (i.e. zero), keep the existing TC behaviour intact. Notice that the
'tot_len' MTU check is done like the XDP code-path, which checks against the
FIB-dst MTU.

V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro
Signed-off-by: Jesper Dangaard Brouer
---
 net/core/filter.c |   13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index f8f198252ff2..c1e460193bae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5548,6 +5548,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 {
 	struct net *net = dev_net(skb->dev);
 	int rc = -EAFNOSUPPORT;
+	bool check_mtu = false;
 
 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -,22 +5556,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
 		return -EINVAL;
 
+	if (params->tot_len)
+		check_mtu = true;
+
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 	}
 
-	if (!rc) {
+	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
 		struct net_device *dev;
 
+		/* When tot_len isn't provided by user,
+		 * check skb against net_device MTU
+		 */
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
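[Editor's note] For illustration (my sketch, not part of the patch; field setup abbreviated), this is the usage pattern the fix enables: fill in 'tot_len' with the planned post-encap L3 length so the lookup does the route-level MTU check, instead of relying on skb->len versus the netdev MTU:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("classifier")
int tc_fib_mtu_precheck(struct __sk_buff *skb)
{
	struct bpf_fib_lookup params = {};
	const __u32 encap = 80;		/* hypothetical planned encap size */
	int rc;

	params.family  = AF_INET;
	params.ifindex = skb->ingress_ifindex;
	/* ... fill ipv4_src/ipv4_dst/tos from the packet headers ... */

	/* Non-zero tot_len requests the MTU check against the FIB result;
	 * the L3 length is derived from skb->len here only for brevity.
	 */
	params.tot_len = skb->len + encap;

	rc = bpf_fib_lookup(skb, &params, sizeof(params), 0);
	if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED)
		return TC_ACT_SHOT;	/* too big for the route: don't grow it */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";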
[PATCH bpf-next V10 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The SKB object is complex and the skb->len value (accessible from the
BPF-prog) also includes the length of any extra GRO/GSO segments, but without
taking into account that these GRO/GSO segments get transport (L4) and network
(L3) headers added before being transmitted. Thus, this BPF-helper is created
such that the BPF-programmer doesn't need to handle these details in the
BPF-prog.

The API is designed to help the BPF-programmer who wants to do packet context
size changes, which involve other helpers. These other helpers usually do a
delta size adjustment. This helper also supports a delta size (len_diff),
which allows the BPF-programmer to reuse arguments needed by these other
helpers, and to perform the MTU check prior to doing any actual size
adjustment of the packet context.

It is on purpose that we allow the len adjustment to become a negative result
that will pass the MTU check. This might seem weird, but it is not this
helper's responsibility to "catch" wrong len_diff adjustments. Other helpers
will take care of these checks, if the BPF-programmer chooses to do the actual
size adjustment.

V9:
- Use dev->hard_header_len (instead of ETH_HLEN)
- Annotate with unlikely req from Daniel
- Fix logic error using skb_gso_validate_network_len from Daniel

V6:
- Took John's advice and dropped BPF_MTU_CHK_RELAX
- Returned MTU is kept at L3-level (like fib_lookup)

V4: Lot of changes
- ifindex 0 now use current netdev for MTU lookup
- rename helper from bpf_mtu_check to bpf_check_mtu
- fix bug for GSO pkt length (as skb->len is total len)
- remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       |   67 ++
 net/core/filter.c              |  122 
 tools/include/uapi/linux/bpf.h |   67 ++
 3 files changed, 256 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*). This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size. The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device. This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features). This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device. Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code. On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populat
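[Editor's note] A minimal XDP usage sketch of the helper described above (my illustration, not from the patch), assuming the documented semantics: ifindex 0 checks against the current device, mtu_len must point at an initialized value, and any non-zero return is treated here as "do not grow":

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_mtu_precheck(struct xdp_md *ctx)
{
	__u32 mtu_len = 0;		/* must be valid and initialized */
	const __s32 delta = 40;		/* hypothetical planned growth (e.g. an IPv6 hdr) */

	/* ifindex 0: check against the current net device */
	if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0))
		return XDP_DROP;	/* >0: MTU violation, <0: bad argument */

	/* safe to call bpf_xdp_adjust_head()/_tail() by delta here */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";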
[PATCH bpf-next V10 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can
perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog
doesn't know the MTU value that caused this rejection. If the BPF-prog wants
to implement PMTU (Path MTU Discovery) (rfc1191), it needs to know this MTU
value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain this
MTU value as output via a union with 'tot_len', as this is the value used for
the MTU lookup.

V5:
- Fixed uninit value spotted by Dan Carpenter.
- Name struct output member mtu_result

Reported-by: kernel test robot
Reported-by: Dan Carpenter
Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       |   11 +--
 net/core/filter.c              |   22 +++---
 tools/include/uapi/linux/bpf.h |   11 +--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77d7c1bb2923..649586d656b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2225,6 +2225,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
+		/* output: MTU value */
+		__u16	mtu_result;
+	};
 
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index c1e460193bae..db59ab55572c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */
 
 	return 0;
 }
@@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5409,7 +5413,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5505,7 +5511,7 @@ static int bpf_ipv6_fib
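[Editor's note] To illustrate what the new output enables (my sketch, not from the patch; the global and field setup are hypothetical): after a FRAG_NEEDED result the program can read the MTU back through the union, e.g. to feed an ICMP "fragmentation needed" / Packet Too Big reply:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* hypothetical global, read by userspace or a later program stage */
__u32 global_route_mtu;

SEC("classifier")
int tc_pmtu_probe(struct __sk_buff *skb)
{
	struct bpf_fib_lookup params = {};
	int rc;

	/* ... fill family/ifindex/addresses and a non-zero tot_len ... */
	rc = bpf_fib_lookup(skb, &params, sizeof(params), 0);
	if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED) {
		global_route_mtu = params.mtu_result; /* union with tot_len */
		return TC_ACT_SHOT;	/* then build the ICMP PTB reply */
	}
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";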
[PATCH bpf-next V10 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress
is described by Eyal Birger in email[0]. The summary is the ability to
increase the packet size (e.g. with IPv6 headers for NAT64), redirect the
packet to ingress, and let the normal netstack fragment the packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/

V9:
- Make net_device "up" (IFF_UP) check explicit in skb_do_redirect

V4:
- Keep net_device "up" (IFF_UP) check.
- Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer
---
 include/linux/netdevice.h |   31 +--
 net/core/dev.c            |   19 ++-
 net/core/filter.c         |   14 +++---
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf167993c05..4c78f73db1ec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+						 const struct sk_buff *skb,
+						 const bool check_mtu)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int len;
+
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	if (!check_mtu)
+		return true;
+
+	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+	if (skb->len <= len)
+		return true;
+
+	/* if TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before
+	 */
+	if (skb_is_gso(skb))
+		return true;
+
+	return false;
+}
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool check_mtu)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index a46334906c94..ffead13c158f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
+	return __is_skb_forwardable(dev, skb, true);
 }
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index 3f2e593244ca..1908800b671c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		dev = ops->ndo_get_peer_dev(dev);
 		if (unlikely(!dev ||
-			     !is_skb_forwardable(dev, skb) ||
+			     !(dev->flags & IFF_UP) ||
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
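[Editor's note] A sketch of the NAT64-style flow this enables (my illustration, not from the patch; header rewriting omitted): grow the packet past the device MTU, redirect it to ingress, and let the stack fragment it:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int tc_grow_and_reinject(struct __sk_buff *skb)
{
	const __s32 delta = 20;	/* hypothetical IPv4 -> IPv6 header growth */

	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, 0))
		return TC_ACT_SHOT;
	/* ... rewrite the network header here ... */

	/* With this patch the oversized skb is accepted on ingress redirect
	 * (the device only needs to be IFF_UP) and the netstack fragments it.
	 */
	return bpf_redirect(skb->ifindex, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";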
[PATCH bpf-next V10 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrates how the bpf_check_mtu() helper can easily be used together
with the bpf_skb_adjust_room() helper, prior to doing the size adjustment, as
the delta argument is already set up.

Hint: This specific test can be selected like this:
 ./test_progs -t cls_redirect

Signed-off-by: Jesper Dangaard Brouer
---
 .../selftests/bpf/progs/test_cls_redirect.c |    7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
index c9f8464996ea..3c1e042962e6 100644
--- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
@@ -70,6 +70,7 @@ typedef struct {
 	uint64_t errors_total_encap_adjust_failed;
 	uint64_t errors_total_encap_buffer_too_small;
 	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
 } metrics_t;
 
 typedef enum {
@@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
 	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
 	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;
 
 	/* Loop protection: the inner packet's TTL is decremented as a safeguard
 	 * against any forwarding loop. As the only interesting field is the TTL
@@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		}
 	}
 
+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
 	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
				BPF_F_ADJ_ROOM_FIXED_GSO |
				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V10 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. V10: - Remove errno non-zero test in CHECK_ATTR() - Addresse comments from Andrii Nakryiko Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ 2 files changed, 417 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..63f01c9e08d8 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,218 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ + +#include +#include "test_check_mtu.skel.h" +#include + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/class/net/lo/mtu"; + char buf[11] = {}; + int value; + int fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + if (read(fd, buf, sizeof(buf)) == -1) { + close(fd); + return -2; + } + close(fd); + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach() +{ + struct bpf_link_info link_info; + __u32 link_info_len = sizeof(link_info); + struct test_check_mtu *skel; + struct bpf_program *prog; + struct bpf_link *link; + int err = 0; + int fd; + + skel = test_check_mtu__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; /* Exit if e.g. 
helper unknown to kernel */ + + prog = skel->progs.xdp_use_helper_basic; + + link = bpf_program__attach_xdp(prog, IFINDEX_LO); + if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link))) + goto out; + skel->links.xdp_use_helper_basic = link; + + memset(&link_info, 0, sizeof(link_info)); + fd = bpf_link__fd(link); + err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + if (CHECK(err, "link_info", "failed: %d\n", err)) + goto out; + + CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type", + "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP); + CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex", + "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO); + + err = bpf_link__detach(link); + CHECK(err, "link_detach", "failed %d\n", err); + +out: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta
Re: [PATCH v3] net: neighbor: fix a crash caused by mod zero
On 12/22/20 1:38 PM, weichenchen wrote: > pneigh_enqueue() tries to obtain a random delay by mod > NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY) > migth be zero at that point because someone could write zero > to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the > callers check it. > > This patch makes pneigh_enqueue() get a delay time passed in > by the callers and the callers guarantee it is not zero. > > Signed-off-by: weichenchen > --- > V3: > - Callers need to pass the delay time to pneigh_enqueue() > now and they should guarantee it is not zero. > - Use READ_ONCE() to read NEIGH_VAR(p, PROXY_DELAY) in both > of the existing callers of pneigh_enqueue() and then pass > it to pneigh_enqueue(). > V2: > - Use READ_ONCE() to prevent the complier from re-reading > NEIGH_VAR(p, PROXY_DELAY). > - Give a hint to the complier that delay <= 0 is unlikely > to happen. > --- > include/net/neighbour.h | 2 +- > net/core/neighbour.c| 5 ++--- > net/ipv4/arp.c | 8 +--- > net/ipv6/ndisc.c| 6 +++--- > 4 files changed, 11 insertions(+), 10 deletions(-) > > diff --git a/include/net/neighbour.h b/include/net/neighbour.h > index 22ced1381ede..f7564dc5304d 100644 > --- a/include/net/neighbour.h > +++ b/include/net/neighbour.h > @@ -352,7 +352,7 @@ struct net *neigh_parms_net(const struct neigh_parms > *parms) > unsigned long neigh_rand_reach_time(unsigned long base); > > void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, > - struct sk_buff *skb); > + struct sk_buff *skb, int delay); > struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net, > const void *key, struct net_device *dev, > int creat); > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index 9500d28a43b0..b440f966d109 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -1567,12 +1567,11 @@ static void neigh_proxy_process(struct timer_list *t) > } > > void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, > - struct sk_buff *skb) > + struct sk_buff *skb, int delay) > { > unsigned long now = jiffies; > > - unsigned long sched_next = now + (prandom_u32() % > - NEIGH_VAR(p, PROXY_DELAY)); > + unsigned long sched_next = now + (prandom_u32() % delay); > > if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) { > kfree_skb(skb); This seems rather complex, what about not using a divide in the first place ? : diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 9500d28a43b0e1a390382912b6fb59db935e727b..745bc89acc87c2a4802fb6f301c11edd2f0096da 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1569,10 +1569,7 @@ static void neigh_proxy_process(struct timer_list *t) void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, struct sk_buff *skb) { - unsigned long now = jiffies; - - unsigned long sched_next = now + (prandom_u32() % - NEIGH_VAR(p, PROXY_DELAY)); + unsigned long sched_next = jiffies + prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY)); if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) { kfree_skb(skb);
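[Editor's note] For reference (my note, not from the thread): the suggested prandom_u32_max() computes the bounded value with a multiply and a shift rather than a modulo, so a bound of zero simply yields zero instead of a divide-by-zero fault. An illustrative stand-in, not the kernel implementation:

#include <stdint.h>

/* Map a 32-bit pseudo-random value into [0, bound) with multiply + shift.
 * bound == 0 gives 0 - no modulo, hence no divide-by-zero.
 */
static inline uint32_t bounded_rand(uint32_t prandom_val, uint32_t bound)
{
	return (uint32_t)(((uint64_t)prandom_val * bound) >> 32);
}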
pull-request: wireless-drivers-2020-12-22
Hi, here's a pull request to net tree, more info below. Please let me know if there are any problems. Kalle The following changes since commit 13458ffe0a953e17587f172a8e5059c243e6850a: net: x25: Remove unimplemented X.25-over-LLC code stubs (2020-12-12 17:15:33 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-2020-12-22 for you to fetch changes up to bfe55584713b4d4d518ffe9cf2dab1129eba6321: MAINTAINERS: switch to different email address (2020-12-21 19:07:39 +0200) wireless-drivers fixes for v5.11 First set of fixes for v5.11, more fixes than usual this time. For ath11k we have several fixes for QCA6390 PCI support and mt76 has several. Also one build fix for mt76. mt76 * fix two NULL pointer dereference * fix build error when CONFIG_MAC80211_MESH is disabled rtlwifi * fix use-after-free in firmware handling code ath11k * error handling fixes * fix crash found during connect and disconnect test * handle HT disable better * avoid printing qmi memory failure during firmware bootup * disable ASPM during firmware bootup Arend van Spriel (1): MAINTAINERS: switch to different email address Carl Huang (4): ath11k: fix crash caused by NULL rx_channel ath11k: start vdev if a bss peer is already created ath11k: qmi: try to allocate a big block of DMA memory first ath11k: pci: disable ASPM L0sLs before downloading firmware Colin Ian King (1): ath11k: add missing null check on allocated skb Dan Carpenter (2): ath11k: Fix error code in ath11k_core_suspend() ath11k: Fix ath11k_pci_fix_l1ss() Kalle Valo (1): Merge ath-current from git://git.kernel.org/.../kvalo/ath.git Lorenzo Bianconi (4): mt76: mt76u: fix NULL pointer dereference in mt76u_status_worker mt76: usb: remove wake logic in mt76u_status_worker mt76: sdio: remove wake logic in mt76s_process_tx_queue mt76: mt76s: fix NULL pointer dereference in mt76s_process_tx_queue Ping-Ke Shih (1): rtlwifi: rise completion at the last step of firmware callback Randy Dunlap (1): mt76: mt7915: fix MESH ifdef block MAINTAINERS | 2 +- drivers/net/wireless/ath/ath11k/core.c | 2 +- drivers/net/wireless/ath/ath11k/dp_rx.c | 10 -- drivers/net/wireless/ath/ath11k/mac.c| 8 +++-- drivers/net/wireless/ath/ath11k/pci.c| 44 +--- drivers/net/wireless/ath/ath11k/pci.h| 2 ++ drivers/net/wireless/ath/ath11k/peer.c | 17 + drivers/net/wireless/ath/ath11k/peer.h | 2 ++ drivers/net/wireless/ath/ath11k/qmi.c| 24 +++-- drivers/net/wireless/ath/ath11k/qmi.h| 1 + drivers/net/wireless/ath/ath11k/wmi.c| 3 ++ drivers/net/wireless/mediatek/mt76/mt7915/init.c | 4 +-- drivers/net/wireless/mediatek/mt76/sdio.c| 19 -- drivers/net/wireless/mediatek/mt76/usb.c | 9 ++--- drivers/net/wireless/realtek/rtlwifi/core.c | 8 +++-- 15 files changed, 118 insertions(+), 37 deletions(-)
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On 12/22/20 7:43 AM, Willem de Bruijn wrote:
>
>>  void sock_zerocopy_put(struct ubuf_info *uarg)
>>  {
>> -	if (uarg && refcount_dec_and_test(&uarg->refcnt))
>> +	if (uarg)
>>  		uarg->callback(uarg, uarg->zerocopy);
>>  }
>>  EXPORT_SYMBOL_GPL(sock_zerocopy_put);
>
> This does increase the number of indirect function calls. Which are
> not cheap post spectre.
>
> In the common case for msg_zerocopy we only have two clones, one sent
> out and one on the retransmit queue. So I guess the cost will be
> acceptable.

sock_zerocopy_callback seems to be the only one at the moment - or the
most dominant if I goofed my search. Could use the INDIRECT_CALL macros
for it.
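[Editor's note] A sketch of what that could look like (my illustration, assuming the INDIRECT_CALL_1() wrapper from <linux/indirect_call_wrapper.h>; not a tested patch):

#include <linux/indirect_call_wrapper.h>

void sock_zerocopy_put(struct ubuf_info *uarg)
{
	if (uarg)
		/* compare against the expected target and call it directly,
		 * falling back to the indirect call otherwise
		 */
		INDIRECT_CALL_1(uarg->callback, sock_zerocopy_callback,
				uarg, uarg->zerocopy);
}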
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On 12/21/20 5:09 PM, Jonathan Lemon wrote: > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 327ee8938f78..ea32b3414ad6 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1245,12 +1245,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback); > > void sock_zerocopy_put(struct ubuf_info *uarg) > { > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) { > - if (uarg->callback) > - uarg->callback(uarg, uarg->zerocopy); > - else > - consume_skb(skb_from_uarg(uarg)); > - } > + if (uarg && refcount_dec_and_test(&uarg->refcnt)) > + uarg->callback(uarg, uarg->zerocopy); > } > EXPORT_SYMBOL_GPL(sock_zerocopy_put); > > since it is down to 2 lines, move to skbuff.h as an inline?
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On 12/22/20 9:52 AM, David Ahern wrote: > On 12/21/20 5:09 PM, Jonathan Lemon wrote: >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> index 327ee8938f78..ea32b3414ad6 100644 >> --- a/net/core/skbuff.c >> +++ b/net/core/skbuff.c >> @@ -1245,12 +1245,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback); >> >> void sock_zerocopy_put(struct ubuf_info *uarg) >> { >> -if (uarg && refcount_dec_and_test(&uarg->refcnt)) { >> -if (uarg->callback) >> -uarg->callback(uarg, uarg->zerocopy); >> -else >> -consume_skb(skb_from_uarg(uarg)); >> -} >> +if (uarg && refcount_dec_and_test(&uarg->refcnt)) >> +uarg->callback(uarg, uarg->zerocopy); >> } >> EXPORT_SYMBOL_GPL(sock_zerocopy_put); >> >> > > since it is down to 2 lines, move to skbuff.h as an inline? > nm. that is done in patch 5.
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
On 22/12/2020 16:01:22+0100, Andrew Lunn wrote: > > The problem is that the switch core reset also affects (reset) the > > SGPIO controller. > > > > We tried to put this in the reset driver, but it was rejected. If the > > reset is done at probe time, the SGPIO driver may already have > > initialized state. > > > > The switch core reset will then reset all SGPIO registers. > > Ah, O.K. Dumb question. Why is the SGPIO driver a separate driver? It > sounds like it should be embedded inside this driver if it is sharing > hardware. > > Another option would be to look at the reset subsystem, and have this > driver export a reset controller, which the SGPIO driver can bind to. > Given that the GPIO driver has been merged, if this will work, it is > probably a better solution. > That was my suggestion. Then you can ensure from the reset controller driver that this is done exactly once, either from the sgpio driver or from the switchdev driver. IIRC, the sgpio from the other SoCs are not affected by the reset. -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
Re: [PATCH 01/12 v2 RFC] net: group skb_shinfo zerocopy related bits together.
On Tue, Dec 22, 2020 at 09:43:15AM -0500, Willem de Bruijn wrote: > On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon > wrote: > > > > From: Jonathan Lemon > > > > In preparation for expanded zerocopy (TX and RX), move > > the ZC related bits out of tx_flags into their own flag > > word. > > > > Signed-off-by: Jonathan Lemon > > I think it's better to expand tx_flags to a u16 and add the two new > flags that you need. Okay, but in that case, tx_flags is now wrong, since some of these flags will also be checked on the rx path. > That allows the additional 7 bits to be used for arbitrary flags, not > stranding 8 bits exactly only for zerocopy features. > > Moving around a few u8's in the same cacheline won't be a problem. How about rearranging them to form 16 bits, and just calling it 'flags'? > I also prefer not to rename flags that some of us are familiar with, > if it's not needed. Ack. -- Jonathan
Re: [PATCH 05/12 v2 RFC] skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
On Tue, Dec 22, 2020 at 09:42:40AM -0500, Willem de Bruijn wrote:
> On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote:
> >
> > From: Jonathan Lemon
> >
> > Replace sock_zerocopy_put with the generic skb_zcopy_put()
> > function. Pass 'true' as the success argument, as this
> > is identical to no change.
> >
> > Signed-off-by: Jonathan Lemon
>
> uarg->zerocopy may be false if sock_zerocopy_put_abort is called from
> tcp_sendmsg_locked

Yes, it may well be false. The original logic goes:

    sock_zerocopy_put_abort()
      sock_zerocopy_put()
        sock_zerocopy_callback(..., success = uarg->zerocopy)
          if (success)

The new logic is now:

    sock_zerocopy_put_abort()
      sock_zerocopy_callback(..., success = true)
        uarg->zerocopy = uarg->zerocopy & success
        if (uarg->zerocopy)

The success value is latched into uarg->zerocopy, and any failure is
persistent. Hence my comment about passing 'true' not changing the logic.
-- 
Jonathan
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On Tue, Dec 22, 2020 at 09:43:39AM -0500, Willem de Bruijn wrote: > On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon > wrote: > > > > From: Jonathan Lemon > > > > Before this change, the caller of sock_zerocopy_callback would > > need to save the zerocopy status, decrement and check the refcount, > > and then call the callback function - the callback was only invoked > > when the refcount reached zero. > > > > Now, the caller just passes the status into the callback function, > > which saves the status and handles its own refcounts. > > > > This makes the behavior of the sock_zerocopy_callback identical > > to the tpacket and vhost callbacks. > > > > Signed-off-by: Jonathan Lemon > > --- > > include/linux/skbuff.h | 3 --- > > net/core/skbuff.c | 14 +++--- > > 2 files changed, 11 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index fb6dd6af0f82..c9d7de9d624d 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -1482,9 +1482,6 @@ static inline void skb_zcopy_clear(struct sk_buff > > *skb, bool zerocopy) > > if (uarg) { > > if (skb_zcopy_is_nouarg(skb)) { > > /* no notification callback */ > > - } else if (uarg->callback == sock_zerocopy_callback) { > > - uarg->zerocopy = uarg->zerocopy && zerocopy; > > - sock_zerocopy_put(uarg); > > } else { > > uarg->callback(uarg, zerocopy); > > } > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index ea32b3414ad6..73699dbdc4a1 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -1194,7 +1194,7 @@ static bool skb_zerocopy_notify_extend(struct sk_buff > > *skb, u32 lo, u16 len) > > return true; > > } > > > > -void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > > +static void __sock_zerocopy_callback(struct ubuf_info *uarg) > > { > > struct sk_buff *tail, *skb = skb_from_uarg(uarg); > > struct sock_exterr_skb *serr; > > @@ -1222,7 +1222,7 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > > bool success) > > serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY; > > serr->ee.ee_data = hi; > > serr->ee.ee_info = lo; > > - if (!success) > > + if (!uarg->zerocopy) > > serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED; > > > > q = &sk->sk_error_queue; > > @@ -1241,11 +1241,19 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > > bool success) > > consume_skb(skb); > > sock_put(sk); > > } > > + > > +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > > +{ > > + uarg->zerocopy = uarg->zerocopy & success; > > + > > + if (refcount_dec_and_test(&uarg->refcnt)) > > + __sock_zerocopy_callback(uarg); > > +} > > EXPORT_SYMBOL_GPL(sock_zerocopy_callback); > > I still think this helper is unnecessary. Just return immediately in > existing sock_zerocopy_callback if refcount is not zero. I think the helper makes the logic clearer, and prevents misuse of the success variable in the main function (use of last value vs the latched value). If you really feel that strongly about it, I'll fold it into one function. > > void sock_zerocopy_put(struct ubuf_info *uarg) > > { > > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) > > + if (uarg) > > uarg->callback(uarg, uarg->zerocopy); > > } > > EXPORT_SYMBOL_GPL(sock_zerocopy_put); > > This does increase the number of indirect function calls. Which are > not cheap post spectre. > > In the common case for msg_zerocopy we only have two clones, one sent > out and one on the retransmit queue. So I guess the cost will be > acceptable. 
Yes, this was the source of my original comment about this being a slight pessimization - moving the refcount into the function. I briefly considered adding a flag like 'use_refcnt', so the logic becomes: if (uarg && (!uarg->use_refcnt || refcount_dec_and_test(&uarg->refcnt))) But thought this might be too much micro-optimization. But if the call overhead is significant, I can put this back in. -- Jonathan
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
Hi Andrew, On Sat, 2020-12-19 at 20:11 +0100, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you > know the content is safe > > On Thu, Dec 17, 2020 at 08:51:28AM +0100, Steen Hegelund wrote: > > > +static struct sparx5_io_resource sparx5_iomap[] = { > > This could be made const i think,. Yes > > > + { TARGET_DEV2G5, 0, 0 }, /* 0x610004000: > > dev2g5_0 */ > > + { TARGET_DEV5G, 0x4000, 0 }, /* 0x610008000: > > dev5g_0 */ > > + { TARGET_PCS5G_BR, 0x8000, 0 }, /* 0x61000c000: > > pcs5g_br_0 */ > > + { TARGET_DEV2G5 + 1, 0xc000, 0 }, /* 0x61001: > > dev2g5_1 */ > > > +static int sparx5_create_targets(struct sparx5 *sparx5) > > +{ > > + int idx, jdx; > > + struct resource *iores[IO_RANGES]; > > + void __iomem *iomem[IO_RANGES]; > > + void __iomem *begin[IO_RANGES]; > > + int range_id[IO_RANGES]; > > Reverse Christmas tree. idx, jdx need to come last. Yes - I will check the entire file for RCT... > > > + > > + /* Check if done previously (deferred by serdes load) */ > > + if (sparx5->regs[sparx5_iomap[0].id]) > > + return 0; > > Could you explain this a bit more. Do you mean -EPROBE_DEFER? Yes that was the intended usage. I will change the startup flow to try to avoid checking this- > > > +static int sparx5_probe_port(struct sparx5 *sparx5, > > + struct device_node *portnp, > > + struct phy *serdes, > > + u32 portno, > > + struct sparx5_port_config *conf) > > +{ > > + struct sparx5_port *spx5_port; > > + struct net_device *ndev; > > + int err; > > + > > + err = sparx5_create_targets(sparx5); > > + if (err) > > + return err; > > This sees odd here. Don't sparx5_create_targets() create all the > targets, where as this creates one specific port? Seems like > sparx5_create_targets() should be in the devices as a whole probe, > not > the port probe. You are right - the name does not really fit (anymore). I will rework this. > > > + spx5_port = netdev_priv(ndev); > > + spx5_port->of_node = portnp; > > + spx5_port->serdes = serdes; > > + spx5_port->pvid = NULL_VID; > > + spx5_port->signd_internal = true; > > + spx5_port->signd_active_high = true; > > + spx5_port->signd_enable = true; > > + spx5_port->flow_control = false; > > + spx5_port->max_vlan_tags = SPX5_PORT_MAX_TAGS_NONE; > > + spx5_port->vlan_type = SPX5_VLAN_PORT_TYPE_UNAWARE; > > + spx5_port->custom_etype = 0x8880; /* Vitesse */ > > + conf->portmode = conf->phy_mode; > > + spx5_port->conf.speed = SPEED_UNKNOWN; > > + spx5_port->conf.power_down = true; > > + sparx5->ports[portno] = spx5_port; > > + return 0; > > I'm also not sure this has the correct name. This does not look like > a > typical probe function. Agree. > > > > +} > > + > > +static int sparx5_init_switchcore(struct sparx5 *sparx5) > > +{ > > + u32 value, pending, jdx, idx; > > + struct { > > + bool gazwrap; > > + void __iomem *init_reg; > > + u32 init_val; > > + } ram, ram_init_list[] = { > > + {false, spx5_reg_get(sparx5, ANA_AC_STAT_RESET), > > + ANA_AC_STAT_RESET_RESET}, > > + {false, spx5_reg_get(sparx5, ASM_STAT_CFG), > > + ASM_STAT_CFG_STAT_CNT_CLR_SHOT}, > > + {true, spx5_reg_get(sparx5, QSYS_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, REW_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, VOP_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, ANA_AC_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, ASM_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, EACL_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, VCAP_SUPER_RAM_INIT), > > 0}, > > + {true, spx5_reg_get(sparx5, DSM_RAM_INIT), 0} > > + }; > > Looks like this could be const as well. 
And this does not really fit > reverse christmas tree. I will update this. > > > + > > + spx5_rmw(EACL_POL_EACL_CFG_EACL_FORCE_INIT_SET(1), > > + EACL_POL_EACL_CFG_EACL_FORCE_INIT, > > + sparx5, > > + EACL_POL_EACL_CFG); > > + > > + spx5_rmw(EACL_POL_EACL_CFG_EACL_FORCE_INIT_SET(0), > > + EACL_POL_EACL_CFG_EACL_FORCE_INIT, > > + sparx5, > > + EACL_POL_EACL_CFG); > > + > > + /* Initialize memories, if not done already */ > > + value = spx5_rd(sparx5, HSCH_RESET_CFG); > > + > > + if (!(value & HSCH_RESET_CFG_CORE_ENA)) { > > + for (idx = 0; idx < 10; idx++) { > > + pending = ARRAY_SIZE(ram_init_list); > > + for (jdx = 0; jdx < > > ARRAY_SIZE(ram_init_list); jdx++) { > > + ram = ram_init_list[jdx]; > > +
[PATCH net 0/3] net: ipa: GSI interrupt handling fixes
This series implements fixes for some issues related to handling
interrupts when GSI channel and event ring commands complete.

The first issue is that the completion condition for an event ring or
channel command could occur while the associated interrupt is disabled.
This would cause the interrupt to fire when it is subsequently enabled,
even if the condition it signals had already been handled. The fix is
to clear any pending interrupt conditions before re-enabling the
interrupt.

The second and third patches change how the success of an event ring or
channel command is determined. These commands change the state of an
event ring or channel. Previously the receipt of a completion interrupt
was required to consider a command successful. Instead, a command is
successful if it changes the state of the target event ring or channel
in the way expected. This way the command can succeed even if the
completion interrupt did not arrive while it was enabled.

					-Alex

Alex Elder (3):
  net: ipa: clear pending interrupts before enabling
  net: ipa: use state to determine channel command success
  net: ipa: use state to determine event ring command success

 drivers/net/ipa/gsi.c | 89 +++
 1 file changed, 56 insertions(+), 33 deletions(-)

-- 
2.20.1
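[Editor's note] Illustrative only (the register names are placeholders, not the real GSI register map): the shape of the first fix is to ack any completion that latched while the interrupt was masked before unmasking it, so a stale condition cannot fire the moment the interrupt is re-enabled:

#include <linux/io.h>
#include <linux/types.h>

/* hypothetical register offsets, for illustration only */
#define IRQ_CLR_OFFSET	0x0	/* write-1-to-clear pending status */
#define IRQ_MSK_OFFSET	0x4	/* interrupt enable/mask bits */

static void irq_enable_after_ack(void __iomem *base, u32 bit)
{
	/* drop any completion that arrived while the interrupt was masked */
	iowrite32(bit, base + IRQ_CLR_OFFSET);
	/* only now unmask, so only *new* completions will fire */
	iowrite32(ioread32(base + IRQ_MSK_OFFSET) | bit, base + IRQ_MSK_OFFSET);
}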
[PATCH net 3/3] net: ipa: use state to determine event ring command success
This patch implements the same basic fix for event rings as the previous one
does for channels.

The result of issuing an event ring control command should be that the event
ring changes state. If enabled, a completion interrupt signals that the event
ring state has changed. This interrupt is enabled by gsi_evt_ring_command()
and disabled again after the command has completed (or we time out).

There is a window of time during which the command could complete successfully
without interrupting. This would cause the event ring to transition to the
desired new state.

So whether an event ring command ends via completion interrupt or timeout, we
can consider the command successful if the event ring has entered the desired
state (and a failure if it has not, regardless of the cause).

Fixes: b4175f8731f78 ("net: ipa: only enable GSI event control IRQs when needed")
Signed-off-by: Alex Elder
---
 drivers/net/ipa/gsi.c |   34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 4f0e791764237..579cc3e516775 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -384,13 +384,15 @@ static int gsi_evt_ring_alloc_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_ALLOCATE);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_ALLOCATED) {
-		dev_err(gsi->dev, "event ring %u bad state %u after alloc\n",
-			evt_ring_id, evt_ring->state);
-		ret = -EIO;
-	}
 
-	return ret;
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_ALLOCATED)
+		return 0;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after alloc\n",
+		evt_ring_id, evt_ring->state);
+
+	return -EIO;
 }
 
 /* Reset a GSI event ring in ALLOCATED or ERROR state. */
@@ -408,9 +410,13 @@ static void gsi_evt_ring_reset_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_RESET);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_ALLOCATED)
-		dev_err(gsi->dev, "event ring %u bad state %u after reset\n",
-			evt_ring_id, evt_ring->state);
+
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_ALLOCATED)
+		return;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after reset\n",
+		evt_ring_id, evt_ring->state);
 }
 
 /* Issue a hardware de-allocation request for an allocated event ring */
@@ -426,9 +432,13 @@ static void gsi_evt_ring_de_alloc_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_DE_ALLOC);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_NOT_ALLOCATED)
-		dev_err(gsi->dev, "event ring %u bad state %u after dealloc\n",
-			evt_ring_id, evt_ring->state);
+
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_NOT_ALLOCATED)
+		return;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after dealloc\n",
+		evt_ring_id, evt_ring->state);
 }
 
 /* Fetch the current state of a channel from hardware */
-- 
2.20.1