general protection fault in find_match (2)
Hello,

syzbot found the following issue on:

HEAD commit:    3db1a3fa Merge tag 'staging-5.11-rc1' of git://git.kernel...
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1103eadf50
kernel config:  https://syzkaller.appspot.com/x/.config?x=2764fc28a92339f9
dashboard link: https://syzkaller.appspot.com/bug?extid=b08cdcfff539328e6c32
compiler:       gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b08cdcfff539328e6...@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 0xdc4b: [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0258-0x025f]
CPU: 1 PID: 13682 Comm: syz-executor.5 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:407 [inline]
RIP: 0010:find_match.part.0+0xcc/0xc70 net/ipv6/route.c:753
Code: f9 0f b6 45 c0 84 c0 0f 84 39 04 00 00 e8 ec a0 c4 f9 49 8d bf 5c 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 91
RSP: 0018:c900022cf000 EFLAGS: 00010207
RAX: dc00 RBX: 88801c2cc2a0 RCX: c90013d33000
RDX: 004b RSI: 87abfc74 RDI: 025c
RBP: c900022cf070 R08: 0001 R09: c900022cf250
R10: R11: R12: 0003
R13: 0001 R14: R15:
FS:  7f94614f2700() GS:8880b9f0() knlGS:
CS:  0010 DS: ES: CR0: 80050033
CR2: 7f75a1fccdb8 CR3: 2d8c6000 CR4: 001506e0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400
Call Trace:
 find_match net/ipv6/route.c:840 [inline]
 __find_rr_leaf+0x17f/0xd10 net/ipv6/route.c:841
 find_rr_leaf net/ipv6/route.c:862 [inline]
 rt6_select net/ipv6/route.c:906 [inline]
 fib6_table_lookup+0x5b3/0xa20 net/ipv6/route.c:2193
 ip6_pol_route+0x1e1/0x11c0 net/ipv6/route.c:2229
 pol_lookup_func include/net/ip6_fib.h:583 [inline]
 fib6_rule_lookup+0x111/0x6f0 net/ipv6/fib6_rules.c:115
 ip6_route_output_flags_noref+0x2c2/0x360 net/ipv6/route.c:2510
 ip6_route_output_flags+0x8b/0x310 net/ipv6/route.c:2523
 ip6_route_output include/net/ip6_route.h:98 [inline]
 ip6_dst_lookup_tail+0xb3a/0x1700 net/ipv6/ip6_output.c:1024
 ip6_dst_lookup_flow+0x8c/0x1d0 net/ipv6/ip6_output.c:1154
 ip6_sk_dst_lookup_flow+0x55c/0x990 net/ipv6/ip6_output.c:1192
 udpv6_sendmsg+0x18a5/0x2bd0 net/ipv6/udp.c:1508
 inet6_sendmsg+0x99/0xe0 net/ipv6/af_inet6.c:638
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 sys_sendmsg+0x6e8/0x810 net/socket.c:2336
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e149
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f94614f1c68 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 0003 RCX: 0045e149
RDX: RSI: 2040 RDI: 0005
RBP: 0119bfc0 R08: R09:
R10: R11: 0246 R12: 0119bf8c
R13: 7ffd9413345f R14: 7f94614f29c0 R15: 0119bf8c
Modules linked in:
---[ end trace a9799337710952e8 ]---
RIP: 0010:ip6_ignore_linkdown include/net/addrconf.h:407 [inline]
RIP: 0010:find_match.part.0+0xcc/0xc70 net/ipv6/route.c:753
Code: f9 0f b6 45 c0 84 c0 0f 84 39 04 00 00 e8 ec a0 c4 f9 49 8d bf 5c 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 91
RSP: 0018:c900022cf000 EFLAGS: 00010207
RAX: dc00 RBX: 88801c2cc2a0 RCX: c90013d33000
RDX: 004b RSI: 87abfc74 RDI: 025c
RBP: c900022cf070 R08: 0001 R09: c900022cf250
R10: R11: R12: 0003
R13: 0001 R14: R15:
FS:  7f94614f2700() GS:8880b9f0() knlGS:
CS:  0010 DS: ES: CR0: 80050033
CR2: CR3: 2d8c6000 CR4: 001506e0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
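For readers without a tree handy, the inlined helper the RIP points at looked roughly like this around v5.10 (quoted from memory, so treat it as a sketch rather than the exact source):

	static inline bool ip6_ignore_linkdown(const struct net_device *dev)
	{
		const struct inet6_dev *idev = __in6_dev_get(dev);

		/* idev is not checked for NULL here; the KASAN range
		 * 0x0258-0x025f is consistent with reading ->cnf at a
		 * small offset from a NULL idev.
		 */
		return !!idev->cnf.ignore_routes_with_linkdown;
	}

So the obvious direction for a fix is to check the __in6_dev_get() result before dereferencing it.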
[RFC PATCH 0/1] net: arcnet: Fix RESET sequence
Folks,

At drivers/net/arcnet/arcnet.c, there is:

  irqreturn_t arcnet_interrupt(int irq, void *dev_id)
  {
	...
	if (status & RESETflag) {
		arcnet_close(dev);
		arcnet_open(dev);
	}
	...
  }

  struct net_device_ops arcnet_netdev_ops = {
	.ndo_open  = arcnet_open,
	.ndo_stop  = arcnet_close,
	...
  };

which is wrong, in many ways:

  1) In general, interrupt handlers should never call ->ndo_stop() and
     ->ndo_open() functions. They are usually full of blocking calls and
     other methods that are expected to be called only from drivers
     init/exit code paths.

  2) arcnet_close() contains a del_timer_sync(). If the irq handler
     interrupts the to-be-deleted timer and then calls del_timer_sync(),
     it will just loop forever.

  3) arcnet_close() also calls tasklet_kill(), which has a warning if
     called from irq context.

  4) For device reset, the sequence "arcnet_close(); arcnet_open();" is
     not complete. Some children arcnet drivers have special init/exit
     code sequences, which then embed a call to arcnet_open() and
     arcnet_close() accordingly. Check drivers/net/arcnet/com20020.c.

Included is an RFC patch to fix the points above: if the RESET flag is
encountered, a workqueue is scheduled to run the generic reset sequence.

Note: Only compile-tested, as I do not have the hardware in question.

Thanks,

8<--

Ahmed S. Darwish (1):
  net: arcnet: Fix RESET flag handling

 drivers/net/arcnet/arc-rimi.c     |  4 +-
 drivers/net/arcnet/arcdevice.h    |  6 +++
 drivers/net/arcnet/arcnet.c       | 69 +--
 drivers/net/arcnet/com20020-isa.c |  4 +-
 drivers/net/arcnet/com20020-pci.c |  2 +-
 drivers/net/arcnet/com20020_cs.c  |  2 +-
 drivers/net/arcnet/com90io.c      |  4 +-
 drivers/net/arcnet/com90xx.c      |  4 +-
 8 files changed, 81 insertions(+), 14 deletions(-)

base-commit: 2c85ebc57b3e1817b6ce1a6b703928e113a90442
--
2.29.2
[RFC PATCH 1/1] net: arcnet: Fix RESET flag handling
The main arcnet interrupt handler calls arcnet_close() then
arcnet_open(), if the RESET status flag is encountered. This is invalid:

  1) In general, interrupt handlers should never call ->ndo_stop() and
     ->ndo_open() functions. They are usually full of blocking calls and
     other methods that are expected to be called only from drivers init
     and exit code paths.

  2) arcnet_close() contains a del_timer_sync(). If the irq handler
     interrupts the to-be-deleted timer, del_timer_sync() will just loop
     forever.

  3) arcnet_close() also calls tasklet_kill(), which has a warning if
     called from irq context.

  4) For device reset, the sequence "arcnet_close(); arcnet_open();" is
     not complete. Some children arcnet drivers have special init/exit
     code sequences, which then embed a call to arcnet_open() and
     arcnet_close() accordingly. Check drivers/net/arcnet/com20020.c.

Run the device RESET sequence from a scheduled workqueue instead.

Signed-off-by: Ahmed S. Darwish
---
 drivers/net/arcnet/arc-rimi.c     |  4 +-
 drivers/net/arcnet/arcdevice.h    |  6 +++
 drivers/net/arcnet/arcnet.c       | 69 +--
 drivers/net/arcnet/com20020-isa.c |  4 +-
 drivers/net/arcnet/com20020-pci.c |  2 +-
 drivers/net/arcnet/com20020_cs.c  |  2 +-
 drivers/net/arcnet/com90io.c      |  4 +-
 drivers/net/arcnet/com90xx.c      |  4 +-
 8 files changed, 81 insertions(+), 14 deletions(-)

diff --git a/drivers/net/arcnet/arc-rimi.c b/drivers/net/arcnet/arc-rimi.c
index 98df38fe553c..12d085405bd0 100644
--- a/drivers/net/arcnet/arc-rimi.c
+++ b/drivers/net/arcnet/arc-rimi.c
@@ -332,7 +332,7 @@ static int __init arc_rimi_init(void)
 	dev->irq = 9;
 
 	if (arcrimi_probe(dev)) {
-		free_netdev(dev);
+		free_arcdev(dev);
 		return -EIO;
 	}
 
@@ -349,7 +349,7 @@ static void __exit arc_rimi_exit(void)
 	iounmap(lp->mem_start);
 	release_mem_region(dev->mem_start, dev->mem_end - dev->mem_start + 1);
 	free_irq(dev->irq, dev);
-	free_netdev(dev);
+	free_arcdev(dev);
 }
 
 #ifndef MODULE
diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
index 22a49c6d7ae6..5d4a4c7efbbf 100644
--- a/drivers/net/arcnet/arcdevice.h
+++ b/drivers/net/arcnet/arcdevice.h
@@ -298,6 +298,10 @@ struct arcnet_local {
 
 	int excnak_pending;    /* We just got an excesive nak interrupt */
 
+	/* RESET flag handling */
+	int reset_in_progress;
+	struct work_struct reset_work;
+
 	struct {
 		uint16_t sequence;	/* sequence number (incs with each packet) */
 		__be16 aborted_seq;
@@ -350,7 +354,9 @@ void arcnet_dump_skb(struct net_device *dev, struct sk_buff *skb, char *desc)
 
 void arcnet_unregister_proto(struct ArcProto *proto);
 irqreturn_t arcnet_interrupt(int irq, void *dev_id);
+
 struct net_device *alloc_arcdev(const char *name);
+void free_arcdev(struct net_device *dev);
 
 int arcnet_open(struct net_device *dev);
 int arcnet_close(struct net_device *dev);
diff --git a/drivers/net/arcnet/arcnet.c b/drivers/net/arcnet/arcnet.c
index e04efc0a5c97..563c43ae5cce 100644
--- a/drivers/net/arcnet/arcnet.c
+++ b/drivers/net/arcnet/arcnet.c
@@ -387,10 +387,46 @@ static void arcnet_timer(struct timer_list *t)
 	struct arcnet_local *lp = from_timer(lp, t, timer);
 	struct net_device *dev = lp->dev;
 
-	if (!netif_carrier_ok(dev)) {
+	spin_lock_irq(&lp->lock);
+
+	if (!lp->reset_in_progress && !netif_carrier_ok(dev)) {
 		netif_carrier_on(dev);
 		netdev_info(dev, "link up\n");
 	}
+
+	spin_unlock_irq(&lp->lock);
+}
+
+static void reset_device_work(struct work_struct *work)
+{
+	struct arcnet_local *lp;
+	struct net_device *dev;
+
+	lp = container_of(work, struct arcnet_local, reset_work);
+	dev = lp->dev;
+
+	/*
+	 * Do not bring the network interface back up if an ifdown
+	 * was already done.
+	 */
+	if (!netif_running(dev) || !lp->reset_in_progress)
+		return;
+
+	rtnl_lock();
+
+	/*
+	 * Do another check, in case of an ifdown that was triggered in
+	 * the small race window between the exit condition above and
+	 * acquiring RTNL.
+	 */
+	if (!netif_running(dev) || !lp->reset_in_progress)
+		goto out;
+
+	dev_close(dev);
+	dev_open(dev, NULL);
+
+out:
+	rtnl_unlock();
 }
 
 static void arcnet_reply_tasklet(unsigned long data)
@@ -452,12 +488,26 @@ struct net_device *alloc_arcdev(const char *name)
 		lp->dev = dev;
 		spin_lock_init(&lp->lock);
 		timer_setup(&lp->timer, arcnet_timer, 0);
+		INIT_WORK(&lp->reset_work, reset_device_work);
 	}
 
 	return dev;
 }
 EXPORT_SYMBOL(alloc_arcdev);
 
+void free_arcdev(struct net_device *dev)
+{
+	struct arcnet_local
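The hunk that wires this into arcnet_interrupt() falls outside the quoted portion above. Conceptually, the RESET branch stops calling arcnet_close()/arcnet_open() directly and instead defers to the new work item, along these lines (an illustrative sketch of the idea, not a quote from the patch):

	if (status & RESETflag) {
		/* Defer the reset: ->ndo_open()/->ndo_stop(),
		 * del_timer_sync() and tasklet_kill() must not run
		 * from hard IRQ context.
		 */
		lp->reset_in_progress = 1;
		schedule_work(&lp->reset_work);
	}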
Re: [PATCH net v2 2/3] net: move the xps cpus retrieval out of net-sysfs
Hi Alexander,

Quoting Alexander Duyck (2020-12-21 23:33:15)
>
> One thing I might change is to actually bump this patch up in the
> patch set as I think it would probably make things a bit cleaner to
> read as you are going to be moving the other functions to this pattern
> as well.

Right. If it were not for net (vs net-next), I would have split the
patches a bit differently to make things easier to review. But those
patches are fixes and can be backported to older kernel versions.
They're fixing 2 commits that were introduced in different versions, so
this patch has to be made before the next one, as it is fixing older
kernels.

(I also did not give a hint in the commit message about what is done in
patch 3, for the same reason. But I agree that's arguable.)

Thanks,
Antoine
Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num
Hello Alexander, Jakub,

Quoting Alexander Duyck (2020-12-22 00:21:57)
>
> Looking over this patch it seems kind of obvious that extending the
> xps_map_mutex is making things far more complex then they need to be.
>
> Applying the rtnl_mutex would probably be much simpler. Although as I
> think you have already discovered we need to apply it to the store,
> and show for this interface. In addition we probably need to perform
> similar locking around traffic_class_show in order to prevent it from
> generating a similar error.

I don't think we have the same kind of issues with traffic_class_show:
dev->num_tc is used, but not for navigating through the map. Protecting
only a single read wouldn't change much. We can still think about what
could go wrong here without the lock, but that is not related to this
series of fixes.

If I understood correctly, as things are a bit too complex now, you
would prefer that we go for the solution proposed in v1? I can still do
the code factoring for the 2 sysfs show operations, but that would then
target net-next and would be in a different series. So I believe we'll
use the patches of v1, unmodified.

Jakub, should I send a v3 then?

Thanks!
Antoine
Re: [PATCH v3 4/5] Bluetooth: advmon offload MSFT handle controller reset
Hi Archie,

>>> When the controller is powered off, the registered advertising monitor
>>> is removed from the controller. This patch handles the re-registration
>>> of those monitors when the power is on.
>>>
>>> Signed-off-by: Archie Pusaka
>>> Reviewed-by: Miao-chen Chou
>>> Reviewed-by: Yun-Hao Chung
>>>
>>> ---
>>>
>>> (no changes since v1)
>>>
>>>  net/bluetooth/msft.c | 79 +---
>>>  1 file changed, 74 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/bluetooth/msft.c b/net/bluetooth/msft.c
>>> index f5aa0e3b1b9b..7e33a85c3f1c 100644
>>> --- a/net/bluetooth/msft.c
>>> +++ b/net/bluetooth/msft.c
>>> @@ -82,8 +82,15 @@ struct msft_data {
>>>  	struct list_head handle_map;
>>>  	__u16 pending_add_handle;
>>>  	__u16 pending_remove_handle;
>>> +
>>> +	struct {
>>> +		u8 reregistering:1;
>>> +	} flags;
>>>  };
>>
>> hmmm. Do you have bigger plans with this struct? I would just skip it.
>>
> This struct is also used in patch 5/5 to store the "enabled" status of
> the filter.
> Suspend/resume would need to enable/disable the filter, but it is not
> yet implemented in this patch series.

just do it without the nested structs. I think you are overdoing it here.

Regards

Marcel
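To make the suggestion concrete: the ask is for the flag to live directly in struct msft_data rather than inside a nested anonymous struct. A minimal sketch of that shape (illustrative only, not the final patch) would be:

	struct msft_data {
		...
		__u16 pending_remove_handle;
		u8 reregistering:1;	/* flat bitfield, no nested "flags" wrapper */
	};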
Re: [RFC PATCH v2 1/8] dt-bindings: net: sparx5: Add sparx5-switch bindings
Hi Rob,

On Mon, 2020-12-21 at 14:40 -0700, Rob Herring wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
>
> On Thu, Dec 17, 2020 at 08:51:27AM +0100, Steen Hegelund wrote:
> > Document the Sparx5 switch device driver bindings
> >
> > Signed-off-by: Steen Hegelund
> > Signed-off-by: Lars Povlsen
> > ---
> >  .../bindings/net/microchip,sparx5-switch.yaml | 178 ++
> >  1 file changed, 178 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml b/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> > new file mode 100644
> > index ..6e3ef8285e9a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml
> > @@ -0,0 +1,178 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/net/microchip,sparx5-switch.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Microchip Sparx5 Ethernet switch controller
> > +
> > +maintainers:
> > +  - Lars Povlsen
> > +  - Steen Hegelund
> > +
> > +description: |
> > +  The SparX-5 Enterprise Ethernet switch family provides a rich set of
> > +  Enterprise switching features such as advanced TCAM-based VLAN and
> > +  QoS processing enabling delivery of differentiated services, and
> > +  security through TCAM-based frame processing using versatile content
> > +  aware processor (VCAP).
> > +
> > +  IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
> > +  with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K
> > +  IPv6 (S,G) multicast groups.
> > +
> > +  L3 security features include source guard and reverse path
> > +  forwarding (uRPF) tasks. Additional L3 features include VRF-Lite and
> > +  IP tunnels (IP over GRE/IP).
> > +
> > +  The SparX-5 switch family targets managed Layer 2 and Layer 3
> > +  equipment in SMB, SME, and Enterprise where high port count
> > +  1G/2.5G/5G/10G switching with 10G/25G aggregation links is required.
> > +
> > +properties:
> > +  $nodename:
> > +    pattern: "^switch@[0-9a-f]+$"
> > +
> > +  compatible:
> > +    const: microchip,sparx5-switch
> > +
> > +  reg:
> > +    minItems: 2
> > +
> > +  reg-names:
> > +    minItems: 2
>
> This is the default based on 'items' length.

Does that mean that I should omit minItems here?

> > +    items:
> > +      - const: devices
> > +      - const: gcb
> > +
> > +  interrupts:
> > +    maxItems: 1
> > +    description: Interrupt used for reception of packets to the CPU
> > +
> > +  ethernet-ports:
> > +    type: object
> > +    properties:
> > +      '#address-cells':
> > +        const: 1
> > +      '#size-cells':
> > +        const: 0
> > +
> > +    patternProperties:
> > +      "^port@[0-9]+$":
> > +        type: object
> > +        description: Switch ports
> > +
> > +        allOf:
> > +          - $ref: ethernet-controller.yaml#
> > +
> > +        properties:
> > +          reg:
> > +            description: Switch port number
> > +
> > +          max-speed:
> > +            maxItems: 1
>
> Is that an array?

No it is just a single value.

> > +            description: Bandwidth allocated to this port
> > +
> > +          phys:
>
> How many? (maxItems)

I will add "maxItems: 1"

> > +            description: phandle of a Ethernet Serdes PHY
> > +
> > +          phy-handle:
> > +            description: phandle of a Ethernet PHY
> > +
> > +          phy-mode:
> > +            description: Interface between the serdes and the phy
>
> The whole set of modes defined is supported?

This driver does not impose any limits on phy-mode. It is passed on to
the phy, so all modes are supported as I see it.

> > +
> > +          sfp:
> > +            description: phandle of an SFP
> > +
> > +          managed:
> > +            maxItems: 1
>
> An array?

No just a single item.

Thanks for your comments.

BR
Steen

> > +            description: SFP management
> > +
> > +        required:
> > +          - reg
> > +          - max-speed
> > +          - phys
> > +
> > +        oneOf:
> > +          - required:
> > +              - phy-handle
> > +              - phy-mode
> > +          - required:
> > +              - sfp
> > +              - managed
> > +
> > +        additionalProperties: false
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +  - reg-names
> > +  - interrupts
> > +  - ethernet-ports
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    #include
> > +    switch: switch@6 {
> > +      compatible = "microchip,sparx5-switch";
> > +      reg = <0x1000 0x80>,
> > +            <0x1101 0x1b0>;
> > +      reg-names = "devices", "gcb";
> > +
> > +      in
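For clarity, the change agreed to above would presumably make the phys property look something like this in the next revision (a sketch of the follow-up, not the applied binding):

	phys:
	  maxItems: 1
	  description: phandle of a Ethernet Serdes PHY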
RE: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
> -----Original Message-----
> From: Marc Kleine-Budde
> Sent: November 6, 2020 19:33
> To: Joakim Zhang ; robh...@kernel.org; shawn...@kernel.org; s.ha...@pengutronix.de
> Cc: ker...@pengutronix.de; dl-linux-imx ; linux-...@vger.kernel.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
>
> On 11/6/20 11:56 AM, Joakim Zhang wrote:
> > Add stop mode support for i.MX8QM.
> >
> > ChangeLogs:
> > V4->V5:
> >	* remove patch:firmware: imx: always export SCU symbols, since
> >	  it done by commit: 95de5094f5ac firmware: imx: add dummy functions
> >	* rebase to fsl,flexcan.yaml
> >
> > V3->V4:
> >	* can_idx->scu_idx.
> >	* return imx_scu_get_handle(&priv->sc_ipc_handle);
> >	* failed_canregister->failed_setup_stop_mode.
> >
> > V2->V3:
> >	* define IMX_SC_R_CAN(x) in rsrc.h
> >	* remove error message on -EPROBE_DEFER.
> >	* split disable wakeup patch into separate one.
> >
> > V1->V2:
> >	* split ECC fix patches into separate patches.
> >	* free can dev if failed to setup stop mode.
> >	* disable wakeup on flexcan_remove.
> >	* add FLEXCAN_IMX_SC_R_CAN macro helper.
> >	* fsl,can-index->fsl,scu-index.
> >	* move fsl,scu-index and priv->can_idx into
> >	* flexcan_setup_stop_mode_scfw()
> >	* prove failed if failed to setup stop mode.
> >
> > Joakim Zhang (5):
> >   dt-bindings: can: flexcan: fix fsl,clk-source property
>
> added to linux-can/testing
>
> >   dt-bindings: can: flexcan: add fsl,scu-index property to indicate a
> >     resource
> >   can: flexcan: rename macro FLEXCAN_QUIRK_SETUP_STOP_MODE ->
> >     FLEXCAN_QUIRK_SETUP_STOP_MODE_GPR
> >   dt-bindings: firmware: add IMX_SC_R_CAN(x) macro for CAN
> >   can: flexcan: add CAN wakeup function for i.MX8QM
>
> The others go via linux-can-next/testing, once net/master is merged back to
> net-next/master to have the yaml bindings.

Hi Marc,

How about below patches? I even can't see it in your linux-can-next/testing
branch. Are these missed?

  dt-bindings: can: flexcan: add fsl,scu-index property to indicate a resource
  can: flexcan: add CAN wakeup function for i.MX8QM

Best Regards,
Joakim Zhang
Re: [PATCH V5 0/5] can: flexcan: add stop mode support for i.MX8QM
On 12/22/20 12:37 PM, Joakim Zhang wrote:
> How about below patches? I even can't see it in your linux-can-next/testing
> branch. Are these missed?
>   dt-bindings: can: flexcan: add fsl,scu-index property to indicate a resource
>   can: flexcan: add CAN wakeup function for i.MX8QM

The patch "firmware: imx: always export SCU symbols" is not yet in
net-next/master, so the flexcan patch will not compile where the SCU is
switched off.

Marc

--
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-     |

signature.asc
Description: OpenPGP digital signature
[PATCH] rtw88: coex: remove useless if and else
Fix the following coccinelle report:

drivers/net/wireless/realtek/rtw88/coex.c:1619:3-5:
WARNING: possible condition with no effect (if == else)

Signed-off-by: Tian Tao
---
 drivers/net/wireless/realtek/rtw88/coex.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtw88/coex.c b/drivers/net/wireless/realtek/rtw88/coex.c
index 24530ca..df6676a 100644
--- a/drivers/net/wireless/realtek/rtw88/coex.c
+++ b/drivers/net/wireless/realtek/rtw88/coex.c
@@ -1616,12 +1616,7 @@ static void rtw_coex_action_bt_relink(struct rtw_dev *rtwdev)
 	if (efuse->share_ant) {
 		/* Shared-Ant */
 		if (coex_stat->wl_gl_busy) {
 			table_case = 26;
-			if (coex_stat->bt_hid_exist &&
-			    coex_stat->bt_profile_num == 1) {
-				tdma_case = 20;
-			} else {
-				tdma_case = 20;
-			}
+			tdma_case = 20;
 		} else {
 			table_case = 1;
 			tdma_case = 0;
--
2.7.4
[PATCH v3] net: neighbor: fix a crash caused by mod zero
pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY) might be
zero at that point, because someone could write zero to
/proc/sys/net/ipv4/neigh/[device]/proxy_delay after the callers check it.

This patch makes pneigh_enqueue() get a delay time passed in by the
callers, and the callers guarantee it is not zero.

Signed-off-by: weichenchen
---
V3:
    - Callers need to pass the delay time to pneigh_enqueue() now and
      they should guarantee it is not zero.
    - Use READ_ONCE() to read NEIGH_VAR(p, PROXY_DELAY) in both of the
      existing callers of pneigh_enqueue() and then pass it to
      pneigh_enqueue().
V2:
    - Use READ_ONCE() to prevent the compiler from re-reading
      NEIGH_VAR(p, PROXY_DELAY).
    - Give a hint to the compiler that delay <= 0 is unlikely to happen.
---
 include/net/neighbour.h | 2 +-
 net/core/neighbour.c    | 5 ++---
 net/ipv4/arp.c          | 8 +---
 net/ipv6/ndisc.c        | 6 +++---
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 22ced1381ede..f7564dc5304d 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -352,7 +352,7 @@ struct net *neigh_parms_net(const struct neigh_parms *parms)
 unsigned long neigh_rand_reach_time(unsigned long base);
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
-		    struct sk_buff *skb);
+		    struct sk_buff *skb, int delay);
 struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net,
 				   const void *key, struct net_device *dev,
 				   int creat);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 9500d28a43b0..b440f966d109 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1567,12 +1567,11 @@ static void neigh_proxy_process(struct timer_list *t)
 }
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
-		    struct sk_buff *skb)
+		    struct sk_buff *skb, int delay)
 {
 	unsigned long now = jiffies;
 
-	unsigned long sched_next = now + (prandom_u32() %
-					  NEIGH_VAR(p, PROXY_DELAY));
+	unsigned long sched_next = now + (prandom_u32() % delay);
 
 	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 922dd73e5740..6ddce6e0a648 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -841,20 +841,22 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 		    arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
 		    (rt->dst.dev != dev &&
 		     pneigh_lookup(&arp_tbl, net, &tip, dev, 0 {
+			int delay;
+
 			n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
 			if (n)
 				neigh_release(n);
 
+			delay = READ_ONCE(NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY));
 			if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
-			    skb->pkt_type == PACKET_HOST ||
-			    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) {
+			    skb->pkt_type == PACKET_HOST || delay == 0) {
 				arp_send_dst(ARPOP_REPLY, ETH_P_ARP, sip,
 					     dev, tip, sha, dev->dev_addr,
 					     sha, reply_dst);
 			} else {
 				pneigh_enqueue(&arp_tbl,
-					       in_dev->arp_parms, skb);
+					       in_dev->arp_parms, skb, delay);
 				goto out_free_dst;
 			}
 			goto out_consume_skb;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 76717478f173..efdaaab47535 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -892,10 +892,10 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 	    (idev->cnf.forwarding &&
 	     (net->ipv6.devconf_all->proxy_ndp || idev->cnf.proxy_ndp) &&
 	     (is_router = pndisc_is_router(&msg->target, dev)) >= 0)) {
+		int delay = READ_ONCE(NEIGH_VAR(idev->nd_parms, PROXY_DELAY));
 		if (!(NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED) &&
 		    skb->pkt_type != PACKET_HOST &&
 		    inc &&
-		    NEIGH_VAR(idev->nd_parms, PROXY_DELAY) != 0) {
+
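The net/ipv6/ndisc.c hunk is cut off above. Purely to illustrate the calling convention the commit message describes (the caller reads PROXY_DELAY once with READ_ONCE() and only enqueues with a non-zero delay), a caller follows a pattern along these lines; tbl, parms, skb and reply_directly() are placeholders, and this is a sketch rather than the missing hunk:

	int delay = READ_ONCE(NEIGH_VAR(parms, PROXY_DELAY));

	if (delay == 0) {
		/* answer immediately: never pass 0 down, because
		 * pneigh_enqueue() computes prandom_u32() % delay
		 */
		reply_directly(skb);
	} else {
		pneigh_enqueue(tbl, parms, skb, delay);
	}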
Re: [PATCH] dt-bindings: net: qcom,ipa: Drop unnecessary type ref on 'memory-region'
On 12/21/20 10:01 PM, Rob Herring wrote:
> 'memory-region' is a common property, so it doesn't need a type ref here.

Acked-by: Alex Elder

> Cc: "David S. Miller"
> Cc: Jakub Kicinski
> Cc: Alex Elder
> Cc: netdev@vger.kernel.org
> Signed-off-by: Rob Herring
> ---
> I'll take this via the DT tree.
>
>  Documentation/devicetree/bindings/net/qcom,ipa.yaml | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/Documentation/devicetree/bindings/net/qcom,ipa.yaml b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> index d0cbbcf1b0e5..8a2d12644675 100644
> --- a/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> +++ b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> @@ -121,7 +121,6 @@ properties:
>        receive and act on notifications of modem up/down events.
>
>    memory-region:
> -    $ref: /schemas/types.yaml#/definitions/phandle-array
>      maxItems: 1
>      description:
>        If present, a phandle for a reserved memory area that holds
[PATCH net-next] net: tipc: Replace expression with offsetof()
Use the existing offsetof() macro instead of duplicating code.

Signed-off-by: Zheng Yongjun
---
 net/tipc/monitor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 6dce2abf436e..48fac3b17e40 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -108,7 +108,7 @@ const int tipc_max_domain_size = sizeof(struct tipc_mon_domain);
  */
 static int dom_rec_len(struct tipc_mon_domain *dom, u16 mcnt)
 {
-	return ((void *)&dom->members - (void *)dom) + (mcnt * sizeof(u32));
+	return (offsetof(struct tipc_mon_domain, members)) + (mcnt * sizeof(u32));
 }
 
 /* dom_size() : calculate size of own domain based on number of peers
--
2.22.0
[PATCH -next] scsi: megaraid: Remove unnecessary memset
The memcpy operation is next to the memset code, and the size to copy is
equal to the size to memset, so the memset operation is unnecessary;
remove it.

Signed-off-by: Zheng Yongjun
---
 drivers/net/wireless/ath/wcn36xx/smd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index 766400f7b61c..273fed22711f 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -484,7 +484,6 @@ static void init_hal_msg(struct wcn36xx_hal_msg_header *hdr,
 
 #define PREPARE_HAL_PTT_MSG_BUF(send_buf, p_msg_body) \
 	do {							\
-		memset(send_buf, 0, p_msg_body->header.len);	\
 		memcpy(send_buf, p_msg_body, p_msg_body->header.len); \
 	} while (0)
--
2.22.0
[RFC PATCH v2 net-next 01/15] net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN
The sja1105 implementation can be blind about this, but the felix driver doesn't do exactly what it's being told, so it needs to know whether it is a TX or an RX VLAN, so it can install the appropriate type of TCAM rule. Signed-off-by: Vladimir Oltean --- Changes in v2: None. include/linux/dsa/8021q.h | 14 ++ net/dsa/tag_8021q.c | 15 +-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/include/linux/dsa/8021q.h b/include/linux/dsa/8021q.h index 88cd72dfa4e0..b12b05f1c8b4 100644 --- a/include/linux/dsa/8021q.h +++ b/include/linux/dsa/8021q.h @@ -64,6 +64,10 @@ int dsa_8021q_rx_source_port(u16 vid); u16 dsa_8021q_rx_subvlan(u16 vid); +bool vid_is_dsa_8021q_rxvlan(u16 vid); + +bool vid_is_dsa_8021q_txvlan(u16 vid); + bool vid_is_dsa_8021q(u16 vid); #else @@ -123,6 +127,16 @@ u16 dsa_8021q_rx_subvlan(u16 vid) return 0; } +bool vid_is_dsa_8021q_rxvlan(u16 vid) +{ + return false; +} + +bool vid_is_dsa_8021q_txvlan(u16 vid) +{ + return false; +} + bool vid_is_dsa_8021q(u16 vid) { return false; diff --git a/net/dsa/tag_8021q.c b/net/dsa/tag_8021q.c index 8e3e8a5b8559..008c1ec6e20c 100644 --- a/net/dsa/tag_8021q.c +++ b/net/dsa/tag_8021q.c @@ -133,10 +133,21 @@ u16 dsa_8021q_rx_subvlan(u16 vid) } EXPORT_SYMBOL_GPL(dsa_8021q_rx_subvlan); +bool vid_is_dsa_8021q_rxvlan(u16 vid) +{ + return (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_RX; +} +EXPORT_SYMBOL_GPL(vid_is_dsa_8021q_rxvlan); + +bool vid_is_dsa_8021q_txvlan(u16 vid) +{ + return (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_TX; +} +EXPORT_SYMBOL_GPL(vid_is_dsa_8021q_txvlan); + bool vid_is_dsa_8021q(u16 vid) { - return ((vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_RX || - (vid & DSA_8021Q_DIR_MASK) == DSA_8021Q_DIR_TX); + return vid_is_dsa_8021q_rxvlan(vid) || vid_is_dsa_8021q_txvlan(vid); } EXPORT_SYMBOL_GPL(vid_is_dsa_8021q); -- 2.25.1
[RFC PATCH v2 net-next 00/15] tag_8021q for Ocelot switches
The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port, and
the external ports, in the majority of use cases, run at 1G. This means
that, when the CPU injects traffic into the switch, it is very easy to run
into congestion. This is not to say that it is impossible to enter
congestion even with all ports running at the same speed, just that the
default configuration is already very prone to that by design.

Normally, the way to deal with that is using Ethernet flow control (PAUSE
frames). However, this functionality is not working today with the
ENETC - Felix switch pair. The hardware issue is undergoing documentation
right now as an erratum within NXP, but several customers have been
requesting a reasonable workaround for it.

In truth, the LS1028A has 2 internal port pairs. The lack of flow control
is an issue only when NPI mode (Node Processor Interface, aka the mode
where the "CPU port module", which carries DSA-style tagged packets, is
connected to a regular Ethernet port) is used, and NPI mode is supported
by Felix on a single port. In past BSPs, we have had setups where both
internal port pairs were enabled. We were advertising the following setup:

         "data port"    "control port"
           (2.5G)            (1G)
            eno2             eno3
              ^                ^
              |                |
              | regular        | DSA-tagged
              | frames         | frames
              |                |
              v                v
            swp4             swp5

This works but is highly impractical, due to NXP shifting the task of
designing a functional system (choosing which port to use, depending on
type of traffic required) up to the end user. The swpN interfaces would
have to be bridged with swp4, in order for the eno2 "data port" to have
access to the outside network. And the swpN interfaces would still be
capable of IP networking. So running a DHCP client would give us two IP
interfaces from the same subnet, one assigned to eno2, and the other to
swpN (0, 1, 2, 3).

Also, the dual port design doesn't scale. When attaching another DSA
switch to a Felix port, the end result is that the "data port" cannot
carry any meaningful data to the external world, since it lacks the DSA
tags required to traverse the sja1105 switches below. All that traffic
needs to go through the "control port". So in newer BSPs there was a
desire to simplify that setup, and only have one internal port pair:

            eno2             eno3
              ^                |
              |                |
              | DSA-tagged     x disabled
              | frames         |
              |                |
              v                v
            swp4             swp5

However, this setup only exacerbates the issue of not having flow control
on the NPI port, since that is the only port now. Also, there are use
cases that still require the "data port", such as IEEE 802.1CB (TSN
stream identification doesn't work over an NPI port), source MAC address
learning over NPI, etc.

Again, there is a desire to keep the simplicity of the single internal
port setup, while regaining the benefits of having a dedicated data port
as well. And this series attempts to deliver just that.

So the NPI functionality is disabled conditionally. Its purpose was:

- To ensure individually addressable ports on TX. This can be replaced by
  using some designated VLAN tags which are pushed by the DSA tagger
  code, then removed by the switch (so they are invisible to the outside
  world and to the user).

- To ensure source port identification on RX. Again, this can be replaced
  by using some designated VLAN tags to encapsulate all RX traffic (each
  VLAN uniquely identifies a source port). The DSA tagger determines
  which port it was based on the VLAN number, then removes that header.
  A sketch of this decode step follows at the end of this cover letter.

- To deliver PTP timestamps. This cannot be obtained through VLAN
  headers, so we need to take a step back and see how else we can do
  that.

The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
injection/extraction from the CPU port module using register-based MMIO,
and not over Ethernet. We will need to do the same from DSA, which makes
this tagger a sort of hybrid between DSA and pure switchdev.

I determined that a Kconfig option would be a sufficiently good
configuration interface for selecting between the existing NPI-based
tagger and the tag_8021q software-defined tagger. However, this is one of
the things that is up for debate today.

Changes in v2:
Posted the entire rework necessary for PTP support using tag_8021q.c.
Added a larger audience to the series.

Vladimir Oltean (15):
  net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or
    TX VLAN
  net: mscc: ocelot: export VCAP structures to include/soc/mscc
  net: mscc: ocelot: store a namespaced VCAP filter ID
  net: dsa: felix: add new VLAN-based tagger
  net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handler
  net: mscc: ocelot: only drain extraction queue on error
  net: mscc: ocelot: just flush the CPU extraction group on error
  net: mscc: ocelot: better error handling in ocelot_xtr_irq_handler
  net: mscc: ocelot: use DIV_ROUND_UP helper in
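For readers unfamiliar with tag_8021q, the RX side of the second bullet above can be sketched in a few lines. This reuses the helpers from patch 01/15 and existing DSA infrastructure, but it is only an illustration of the idea, not a quote from the tagger added in patch 04/15 (the function name is made up, and the real code also strips the VLAN header before handing the skb up):

	static struct sk_buff *example_8021q_rcv(struct sk_buff *skb,
						 struct net_device *netdev)
	{
		u16 tci, vid;

		if (__vlan_get_tag(skb, &tci))
			return NULL;		/* no outer tag: not ours */

		vid = tci & VLAN_VID_MASK;
		if (!vid_is_dsa_8021q_rxvlan(vid))
			return NULL;

		/* The RX VLAN uniquely identifies the ingress switch port */
		skb->dev = dsa_master_find_slave(netdev, 0,
						 dsa_8021q_rx_source_port(vid));
		if (!skb->dev)
			return NULL;

		return skb;
	}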
[RFC PATCH v2 net-next 02/15] net: mscc: ocelot: export VCAP structures to include/soc/mscc
The Felix driver will need to preinstall some VCAP filters for its tag_8021q implementation (outside of the tc-flower offload logic), so these need to be exported to the common includes. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_net.c | 1 + drivers/net/ethernet/mscc/ocelot_vcap.h | 293 +--- include/soc/mscc/ocelot_vcap.h | 289 +++ 3 files changed, 292 insertions(+), 291 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index 2bd2840d88bd..3a3bbd5e7883 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -5,6 +5,7 @@ */ #include +#include #include "ocelot.h" #include "ocelot_vcap.h" diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.h b/drivers/net/ethernet/mscc/ocelot_vcap.h index 82fd10581a14..cfc8b976d1de 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.h +++ b/drivers/net/ethernet/mscc/ocelot_vcap.h @@ -7,300 +7,11 @@ #define _MSCC_OCELOT_VCAP_H_ #include "ocelot.h" -#include "ocelot_police.h" -#include -#include +#include +#include #define OCELOT_POLICER_DISCARD 0x17f -struct ocelot_ipv4 { - u8 addr[4]; -}; - -enum ocelot_vcap_bit { - OCELOT_VCAP_BIT_ANY, - OCELOT_VCAP_BIT_0, - OCELOT_VCAP_BIT_1 -}; - -struct ocelot_vcap_u8 { - u8 value[1]; - u8 mask[1]; -}; - -struct ocelot_vcap_u16 { - u8 value[2]; - u8 mask[2]; -}; - -struct ocelot_vcap_u24 { - u8 value[3]; - u8 mask[3]; -}; - -struct ocelot_vcap_u32 { - u8 value[4]; - u8 mask[4]; -}; - -struct ocelot_vcap_u40 { - u8 value[5]; - u8 mask[5]; -}; - -struct ocelot_vcap_u48 { - u8 value[6]; - u8 mask[6]; -}; - -struct ocelot_vcap_u64 { - u8 value[8]; - u8 mask[8]; -}; - -struct ocelot_vcap_u128 { - u8 value[16]; - u8 mask[16]; -}; - -struct ocelot_vcap_vid { - u16 value; - u16 mask; -}; - -struct ocelot_vcap_ipv4 { - struct ocelot_ipv4 value; - struct ocelot_ipv4 mask; -}; - -struct ocelot_vcap_udp_tcp { - u16 value; - u16 mask; -}; - -struct ocelot_vcap_port { - u8 value; - u8 mask; -}; - -enum ocelot_vcap_key_type { - OCELOT_VCAP_KEY_ANY, - OCELOT_VCAP_KEY_ETYPE, - OCELOT_VCAP_KEY_LLC, - OCELOT_VCAP_KEY_SNAP, - OCELOT_VCAP_KEY_ARP, - OCELOT_VCAP_KEY_IPV4, - OCELOT_VCAP_KEY_IPV6 -}; - -struct ocelot_vcap_key_vlan { - struct ocelot_vcap_vid vid;/* VLAN ID (12 bit) */ - struct ocelot_vcap_u8 pcp;/* PCP (3 bit) */ - enum ocelot_vcap_bit dei;/* DEI */ - enum ocelot_vcap_bit tagged; /* Tagged/untagged frame */ -}; - -struct ocelot_vcap_key_etype { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - struct ocelot_vcap_u16 etype; - struct ocelot_vcap_u16 data; /* MAC data */ -}; - -struct ocelot_vcap_key_llc { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - - /* LLC header: DSAP at byte 0, SSAP at byte 1, Control at byte 2 */ - struct ocelot_vcap_u32 llc; -}; - -struct ocelot_vcap_key_snap { - struct ocelot_vcap_u48 dmac; - struct ocelot_vcap_u48 smac; - - /* SNAP header: Organization Code at byte 0, Type at byte 3 */ - struct ocelot_vcap_u40 snap; -}; - -struct ocelot_vcap_key_arp { - struct ocelot_vcap_u48 smac; - enum ocelot_vcap_bit arp; /* Opcode ARP/RARP */ - enum ocelot_vcap_bit req; /* Opcode request/reply */ - enum ocelot_vcap_bit unknown;/* Opcode unknown */ - enum ocelot_vcap_bit smac_match; /* Sender MAC matches SMAC */ - enum ocelot_vcap_bit dmac_match; /* Target MAC matches DMAC */ - - /**< Protocol addr. 
length 4, hardware length 6 */ - enum ocelot_vcap_bit length; - - enum ocelot_vcap_bit ip; /* Protocol address type IP */ - enum ocelot_vcap_bit ethernet; /* Hardware address type Ethernet */ - struct ocelot_vcap_ipv4 sip; /* Sender IP address */ - struct ocelot_vcap_ipv4 dip; /* Target IP address */ -}; - -struct ocelot_vcap_key_ipv4 { - enum ocelot_vcap_bit ttl; /* TTL zero */ - enum ocelot_vcap_bit fragment; /* Fragment */ - enum ocelot_vcap_bit options; /* Header options */ - struct ocelot_vcap_u8 ds; - struct ocelot_vcap_u8 proto; /* Protocol */ - struct ocelot_vcap_ipv4 sip; /* Source IP address */ - struct ocelot_vcap_ipv4 dip; /* Destination IP address */ - struct ocelot_vcap_u48 data; /* Not UDP/TCP: IP data */ - struct ocelot_vcap_udp_tcp sport; /* UDP/TCP: Source port */ - struct ocelot_vcap_udp_tcp dport; /* UDP/TCP: Destination port */ - enum ocelot_vcap_bit tcp_fin; - enum ocelot_vcap_bit tcp_syn; - enum ocelot_vcap_bit tcp_rst; - enum ocel
[RFC PATCH v2 net-next 06/15] net: mscc: ocelot: only drain extraction queue on error
It appears that the intention of this snippet of code is to not exit
ocelot_xtr_irq_handler() while in the middle of extracting a frame. The
problem in extracting it word by word is that future extraction attempts
are really easy to get desynchronized, since the IRQ handler assumes that
the first 16 bytes are the IFH, which give further information about the
frame, such as frame length.

But during normal operation, "err" will not be 0, but 4, set from here:

		for (i = 0; i < OCELOT_TAG_LEN / 4; i++) {
			err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]);
			if (err != 4)
				break;
		}

		if (err != 4)
			break;

In that case, draining the extraction queue is a no-op. So explicitly
make this code execute only on negative err.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index 00c6d9838970..ed632dd79245 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -702,7 +702,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		dev->stats.rx_packets++;
 	}
 
-	if (err)
+	if (err < 0)
 		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
 			ocelot_read_rix(ocelot, QS_XTR_RD, grp);
 
--
2.25.1
[RFC PATCH v2 net-next 05/15] net: mscc: ocelot: stop returning IRQ_NONE in ocelot_xtr_irq_handler
Since the xtr (extraction) IRQ of the ocelot switch is not shared, then if
it fired, it means that some data must be present in the queues of the CPU
port module. So simplify the code.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index 1e7729421a82..00c6d9838970 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -605,10 +605,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 	int i = 0, grp = 0;
 	int err = 0;
 
-	if (!(ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)))
-		return IRQ_NONE;
-
-	do {
+	while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) {
 		struct skb_shared_hwtstamps *shhwtstamps;
 		struct ocelot_port_private *priv;
 		struct ocelot_port *ocelot_port;
@@ -703,7 +700,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		netif_rx(skb);
 		dev->stats.rx_bytes += len;
 		dev->stats.rx_packets++;
-	} while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp));
+	}
 
 	if (err)
 		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
--
2.25.1
[RFC PATCH v2 net-next 08/15] net: mscc: ocelot: better error handling in ocelot_xtr_irq_handler
The ocelot_rx_frame_word() function can return a negative error code, however this isn't being checked for consistently. Also, some constructs can be simplified by using "goto" instead of repeated "break" statements. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_vsc7514.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c index 52ebc69a52cc..ea66b372d63b 100644 --- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c +++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c @@ -620,12 +620,9 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) for (i = 0; i < OCELOT_TAG_LEN / 4; i++) { err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]); if (err != 4) - break; + goto out; } - if (err != 4) - break; - /* At this point the IFH was read correctly, so it is safe to * presume that there is no error. The err needs to be reset * otherwise a frame could come in CPU queue between the while @@ -646,7 +643,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) if (unlikely(!skb)) { netdev_err(dev, "Unable to allocate sk_buff\n"); err = -ENOMEM; - break; + goto out; } buf_len = info.len - ETH_FCS_LEN; buf = (u32 *)skb_put(skb, buf_len); @@ -654,12 +651,21 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) len = 0; do { sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } *buf++ = val; len += sz; } while (len < buf_len); /* Read the FCS */ sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } + /* Update the statistics if part of the FCS was read before */ len -= ETH_FCS_LEN - sz; @@ -668,11 +674,6 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) *buf = val; } - if (sz < 0) { - err = sz; - break; - } - if (ocelot->ptp) { ocelot_ptp_gettime64(&ocelot->ptp_info, &ts); @@ -702,6 +703,7 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) dev->stats.rx_packets++; } +out: if (err < 0) { ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp)); ocelot_write(ocelot, QS_XTR_FLUSH, 0); -- 2.25.1
[RFC PATCH v2 net-next 07/15] net: mscc: ocelot: just flush the CPU extraction group on error
This procedure should yield the same effect as manually reading out the
extraction data just to discard it.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_vsc7514.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
index ed632dd79245..52ebc69a52cc 100644
--- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c
+++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c
@@ -702,9 +702,10 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
 		dev->stats.rx_packets++;
 	}
 
-	if (err < 0)
-		while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp))
-			ocelot_read_rix(ocelot, QS_XTR_RD, grp);
+	if (err < 0) {
+		ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp));
+		ocelot_write(ocelot, QS_XTR_FLUSH, 0);
+	}
 
 	return IRQ_HANDLED;
 }
--
2.25.1
[RFC PATCH v2 net-next 04/15] net: dsa: felix: add new VLAN-based tagger
There are use cases for which the existing tagger, based on the NPI (Node Processor Interface) functionality, is insufficient. Namely: - Frames injected through the NPI port bypass the frame analyzer, so no source address learning is performed, no TSN stream classification, etc. - Flow control is not functional over an NPI port (PAUSE frames are encapsulated in the same Extraction Frame Header as all other frames) - There can be at most one NPI port configured for an Ocelot switch. But in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI port is currently either disabled, or operated as a plain user port (albeit an internally-facing one). Having the ability to configure the two CPU ports symmetrically could pave the way for e.g. creating a LAG between them, to increase bandwidth seamlessly for the system. So, there is a desire to have an alternative to the NPI mode. This patch brings an implementation of the software-defined tag_8021q.c tagger format, which also preserves full functionality under a vlan_filtering bridge (unlike sja1105, the only other user of tag_8021q). It does this by using the TCAM engines for: - pushing the RX VLAN as a second, outer tag, on egress towards the CPU port - redirecting towards the correct front port based on TX VLAN and popping that on egress Signed-off-by: Vladimir Oltean --- Changes in v2: Clean up the hardcoding of random VCAP filter IDs and the inclusion of a private ocelot header. MAINTAINERS | 1 + drivers/net/dsa/ocelot/Kconfig | 4 +- drivers/net/dsa/ocelot/Makefile | 5 + drivers/net/dsa/ocelot/felix.c | 108 +-- drivers/net/dsa/ocelot/felix.h | 1 + drivers/net/dsa/ocelot/felix_tag_8021q.c | 166 +++ drivers/net/dsa/ocelot/felix_tag_8021q.h | 20 +++ drivers/net/ethernet/mscc/ocelot.c | 18 ++- include/soc/mscc/ocelot.h| 1 + net/dsa/Kconfig | 34 + net/dsa/Makefile | 3 +- net/dsa/tag_ocelot_8021q.c | 61 + 12 files changed, 399 insertions(+), 23 deletions(-) create mode 100644 drivers/net/dsa/ocelot/felix_tag_8021q.c create mode 100644 drivers/net/dsa/ocelot/felix_tag_8021q.h create mode 100644 net/dsa/tag_ocelot_8021q.c diff --git a/MAINTAINERS b/MAINTAINERS index a355db292486..a9cb0b659c00 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12736,6 +12736,7 @@ F: drivers/net/dsa/ocelot/* F: drivers/net/ethernet/mscc/ F: include/soc/mscc/ocelot* F: net/dsa/tag_ocelot.c +F: net/dsa/tag_ocelot_8021q.c F: tools/testing/selftests/drivers/net/ocelot/* OCXL (Open Coherent Accelerator Processor Interface OpenCAPI) DRIVER diff --git a/drivers/net/dsa/ocelot/Kconfig b/drivers/net/dsa/ocelot/Kconfig index c110e82a7973..ab8de14c4dae 100644 --- a/drivers/net/dsa/ocelot/Kconfig +++ b/drivers/net/dsa/ocelot/Kconfig @@ -2,11 +2,11 @@ config NET_DSA_MSCC_FELIX tristate "Ocelot / Felix Ethernet switch support" depends on NET_DSA && PCI + depends on NET_DSA_TAG_OCELOT_NPI || NET_DSA_TAG_OCELOT_8021Q depends on NET_VENDOR_MICROSEMI depends on NET_VENDOR_FREESCALE depends on HAS_IOMEM select MSCC_OCELOT_SWITCH_LIB - select NET_DSA_TAG_OCELOT select FSL_ENETC_MDIO select PCS_LYNX help @@ -16,10 +16,10 @@ config NET_DSA_MSCC_FELIX config NET_DSA_MSCC_SEVILLE tristate "Ocelot / Seville Ethernet switch support" depends on NET_DSA + depends on NET_DSA_TAG_OCELOT_NPI || NET_DSA_TAG_OCELOT_8021Q depends on NET_VENDOR_MICROSEMI depends on HAS_IOMEM select MSCC_OCELOT_SWITCH_LIB - select NET_DSA_TAG_OCELOT select PCS_LYNX help This driver supports the VSC9953 (Seville) switch, which is embedded diff --git a/drivers/net/dsa/ocelot/Makefile 
b/drivers/net/dsa/ocelot/Makefile index f6dd131e7491..e9ea8c0331d8 100644 --- a/drivers/net/dsa/ocelot/Makefile +++ b/drivers/net/dsa/ocelot/Makefile @@ -9,3 +9,8 @@ mscc_felix-objs := \ mscc_seville-objs := \ felix.o \ seville_vsc9953.o + +ifdef CONFIG_NET_DSA_TAG_OCELOT_8021Q +mscc_felix-objs += felix_tag_8021q.o +mscc_seville-objs += felix_tag_8021q.o +endif diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c index 7dc230677b78..77f73c6bad0b 100644 --- a/drivers/net/dsa/ocelot/felix.c +++ b/drivers/net/dsa/ocelot/felix.c @@ -23,6 +23,7 @@ #include #include #include "felix.h" +#include "felix_tag_8021q.h" static enum dsa_tag_protocol felix_get_tag_protocol(struct dsa_switch *ds, int port, @@ -439,6 +440,7 @@ static int felix_init_structs(struct felix *felix, int num_phys_ports) { struct ocelot *ocelot = &felix->ocelot; phy_interface_t *port_phy_modes; + enum ocelot_tag_prefix prefix
[RFC PATCH v2 net-next 09/15] net: mscc: ocelot: use DIV_ROUND_UP helper in ocelot_port_inject_frame
This looks a bit nicer than the open-coded "(x + 3) / 4" idiom.

Signed-off-by: Vladimir Oltean
---
Changes in v2:
Patch is new.

 drivers/net/ethernet/mscc/ocelot_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 3a3bbd5e7883..7cc0fcd1df8d 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -386,7 +386,7 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 		ocelot_write_rix(ocelot, (__force u32)cpu_to_be32(ifh[i]),
 				 QS_INJ_WR, grp);
 
-	count = (skb->len + 3) / 4;
+	count = DIV_ROUND_UP(skb->len, 4);
 	last = skb->len % 4;
 	for (i = 0; i < count; i++)
 		ocelot_write_rix(ocelot, ((u32 *)skb->data)[i], QS_INJ_WR, grp);
--
2.25.1
[RFC PATCH v2 net-next 10/15] net: mscc: ocelot: refactor ocelot_port_inject_frame out of ocelot_port_xmit
The felix DSA driver will inject some frames through register MMIO, same as ocelot switchdev currently does. So we need to be able to reuse the common code. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 78 + drivers/net/ethernet/mscc/ocelot.h | 4 ++ drivers/net/ethernet/mscc/ocelot_net.c | 81 +++--- 3 files changed, 89 insertions(+), 74 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index be4671bfe95f..52f6c986aef0 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -566,6 +566,84 @@ void ocelot_get_txtstamp(struct ocelot *ocelot) } EXPORT_SYMBOL(ocelot_get_txtstamp); +/* Generate the IFH for frame injection + * + * The IFH is a 128bit-value + * bit 127: bypass the analyzer processing + * bit 56-67: destination mask + * bit 28-29: pop_cnt: 3 disables all rewriting of the frame + * bit 20-27: cpu extraction queue mask + * bit 16: tag type 0: C-tag, 1: S-tag + * bit 0-11: VID + */ +static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) +{ + ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); + ifh[1] = (0xf00 & info->port) >> 8; + ifh[2] = (0xff & info->port) << 24; + ifh[3] = (info->tag_type << 16) | info->vid; + + return 0; +} + +bool ocelot_can_inject(struct ocelot *ocelot, int grp) +{ + u32 val = ocelot_read(ocelot, QS_INJ_STATUS); + + if (!(val & QS_INJ_STATUS_FIFO_RDY(BIT(grp + return false; + if (val & QS_INJ_STATUS_WMARK_REACHED(BIT(grp))) + return false; + + return true; +} + +void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, + u32 rew_op, struct sk_buff *skb) +{ + struct frame_info info = {}; + u32 ifh[OCELOT_TAG_LEN / 4]; + unsigned int i, count, last; + + ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) | +QS_INJ_CTRL_SOF, QS_INJ_CTRL, grp); + + info.port = BIT(port); + info.tag_type = IFH_TAG_TYPE_C; + info.vid = skb_vlan_tag_get(skb); + info.rew_op = rew_op; + + ocelot_gen_ifh(ifh, &info); + + for (i = 0; i < OCELOT_TAG_LEN / 4; i++) + ocelot_write_rix(ocelot, (__force u32)cpu_to_be32(ifh[i]), +QS_INJ_WR, grp); + + count = DIV_ROUND_UP(skb->len, 4); + last = skb->len % 4; + for (i = 0; i < count; i++) + ocelot_write_rix(ocelot, ((u32 *)skb->data)[i], QS_INJ_WR, grp); + + /* Add padding */ + while (i < (OCELOT_BUFFER_CELL_SZ / 4)) { + ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp); + i++; + } + + /* Indicate EOF and valid bytes in last word */ + ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) | +QS_INJ_CTRL_VLD_BYTES(skb->len < OCELOT_BUFFER_CELL_SZ ? 
0 : last) | +QS_INJ_CTRL_EOF, +QS_INJ_CTRL, grp); + + /* Add dummy CRC */ + ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp); + skb_tx_timestamp(skb); + + skb->dev->stats.tx_packets++; + skb->dev->stats.tx_bytes += skb->len; +} + int ocelot_fdb_add(struct ocelot *ocelot, int port, const unsigned char *addr, u16 vid) { diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h index 291d39d49c4e..e7685a58b7e2 100644 --- a/drivers/net/ethernet/mscc/ocelot.h +++ b/drivers/net/ethernet/mscc/ocelot.h @@ -126,6 +126,10 @@ void ocelot_set_cpu_port(struct ocelot *ocelot, int cpu, enum ocelot_tag_prefix injection, enum ocelot_tag_prefix extraction); +bool ocelot_can_inject(struct ocelot *ocelot, int grp); +void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, + u32 rew_op, struct sk_buff *skb); + extern struct notifier_block ocelot_netdevice_nb; extern struct notifier_block ocelot_switchdev_nb; extern struct notifier_block ocelot_switchdev_blocking_nb; diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index 7cc0fcd1df8d..942eb56535b7 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -311,53 +311,20 @@ static int ocelot_port_stop(struct net_device *dev) return 0; } -/* Generate the IFH for frame injection - * - * The IFH is a 128bit-value - * bit 127: bypass the analyzer processing - * bit 56-67: destination mask - * bit 28-29: pop_cnt: 3 disables all rewriting of the frame - * bit 20-27: cpu extraction queue mask - * bit 16: tag type 0: C-tag, 1: S-tag - * bit 0-11: VID - */ -static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) -{ - ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); - ifh[1] = (0xf00 & info->port) >> 8
[RFC PATCH v2 net-next 03/15] net: mscc: ocelot: store a namespaced VCAP filter ID
We will be adding some private VCAP filters that should not interfere in any way with the filters added using tc-flower. So we need to allocate some IDs which will not be used by tc. Currently ocelot uses an u32 id derived from the flow cookie, which in itself is an unsigned long. This is a problem in itself, since on 64 bit systems, sizeof(unsigned long)=8, so the driver is already truncating these. Create a struct ocelot_vcap_id which contains the full unsigned long cookie from tc, as well as a boolean that is supposed to namespace the filters added by tc with the ones that aren't. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot_flower.c | 7 --- drivers/net/ethernet/mscc/ocelot_vcap.c | 16 drivers/net/ethernet/mscc/ocelot_vcap.h | 3 ++- include/soc/mscc/ocelot_vcap.h| 7 ++- 4 files changed, 24 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c index 729495a1a77e..c3ac026f6aea 100644 --- a/drivers/net/ethernet/mscc/ocelot_flower.c +++ b/drivers/net/ethernet/mscc/ocelot_flower.c @@ -622,7 +622,8 @@ static int ocelot_flower_parse(struct ocelot *ocelot, int port, bool ingress, int ret; filter->prio = f->common.prio; - filter->id = f->cookie; + filter->id.cookie = f->cookie; + filter->id.tc_offload = true; ret = ocelot_flower_parse_action(ocelot, port, ingress, f, filter); if (ret) @@ -717,7 +718,7 @@ int ocelot_cls_flower_destroy(struct ocelot *ocelot, int port, block = &ocelot->block[block_id]; - filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie); + filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie, true); if (!filter) return 0; @@ -741,7 +742,7 @@ int ocelot_cls_flower_stats(struct ocelot *ocelot, int port, block = &ocelot->block[block_id]; - filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie); + filter = ocelot_vcap_block_find_filter_by_id(block, f->cookie, true); if (!filter || filter->type == OCELOT_VCAP_FILTER_DUMMY) return 0; diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.c b/drivers/net/ethernet/mscc/ocelot_vcap.c index d8c778ee6f1b..2f588dfdc9a2 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.c +++ b/drivers/net/ethernet/mscc/ocelot_vcap.c @@ -959,6 +959,12 @@ static void ocelot_vcap_filter_add_to_block(struct ocelot *ocelot, list_add(&filter->list, pos->prev); } +static bool ocelot_vcap_filter_equal(const struct ocelot_vcap_filter *a, +const struct ocelot_vcap_filter *b) +{ + return !memcmp(&a->id, &b->id, sizeof(struct ocelot_vcap_id)); +} + static int ocelot_vcap_block_get_filter_index(struct ocelot_vcap_block *block, struct ocelot_vcap_filter *filter) { @@ -966,7 +972,7 @@ static int ocelot_vcap_block_get_filter_index(struct ocelot_vcap_block *block, int index = 0; list_for_each_entry(tmp, &block->rules, list) { - if (filter->id == tmp->id) + if (ocelot_vcap_filter_equal(filter, tmp)) return index; index++; } @@ -991,12 +997,14 @@ ocelot_vcap_block_find_filter_by_index(struct ocelot_vcap_block *block, } struct ocelot_vcap_filter * -ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int id) +ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int cookie, + bool tc_offload) { struct ocelot_vcap_filter *filter; list_for_each_entry(filter, &block->rules, list) - if (filter->id == id) + if (filter->id.tc_offload == tc_offload && + filter->id.cookie == cookie) return filter; return NULL; @@ -1160,7 +1168,7 @@ static void ocelot_vcap_block_remove_filter(struct ocelot *ocelot, 
list_for_each_safe(pos, q, &block->rules) { tmp = list_entry(pos, struct ocelot_vcap_filter, list); - if (tmp->id == filter->id) { + if (ocelot_vcap_filter_equal(filter, tmp)) { if (tmp->block_id == VCAP_IS2 && tmp->action.police_ena) ocelot_vcap_policer_del(ocelot, block, diff --git a/drivers/net/ethernet/mscc/ocelot_vcap.h b/drivers/net/ethernet/mscc/ocelot_vcap.h index cfc8b976d1de..3b0c7916056e 100644 --- a/drivers/net/ethernet/mscc/ocelot_vcap.h +++ b/drivers/net/ethernet/mscc/ocelot_vcap.h @@ -15,7 +15,8 @@ int ocelot_vcap_filter_stats_update(struct ocelot *ocelot, struct ocelot_vcap_filter *rule); struct ocelot_vcap_filter * -ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block, int id); +ocelot_
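The header hunk is cut off above, but from the way the new members are used in ocelot_flower.c and ocelot_vcap.c, the namespaced ID in include/soc/mscc/ocelot_vcap.h presumably looks something like this (inferred from the usage in the hunks, not copied from the truncated diff), together with the widened lookup prototype:

        struct ocelot_vcap_id {
                unsigned long cookie;
                bool tc_offload;
        };

        struct ocelot_vcap_filter *
        ocelot_vcap_block_find_filter_by_id(struct ocelot_vcap_block *block,
                                            int cookie, bool tc_offload);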
[RFC PATCH v2 net-next 11/15] net: mscc: ocelot: export struct ocelot_frame_info
Because felix DSA must now be able to extract a frame in 2 stages over MMIO (first the XFH then the frame data), it needs access to this internal ocelot structure that holds the unpacked information from the Extraction Frame Header. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 4 ++-- drivers/net/ethernet/mscc/ocelot.h | 9 - drivers/net/ethernet/mscc/ocelot_vsc7514.c | 4 ++-- include/soc/mscc/ocelot.h | 9 + 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 52f6c986aef0..7d73c3251dfb 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -576,7 +576,7 @@ EXPORT_SYMBOL(ocelot_get_txtstamp); * bit 16: tag type 0: C-tag, 1: S-tag * bit 0-11: VID */ -static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info) +static int ocelot_gen_ifh(u32 *ifh, struct ocelot_frame_info *info) { ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21); ifh[1] = (0xf00 & info->port) >> 8; @@ -601,7 +601,7 @@ bool ocelot_can_inject(struct ocelot *ocelot, int grp) void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp, u32 rew_op, struct sk_buff *skb) { - struct frame_info info = {}; + struct ocelot_frame_info info = {}; u32 ifh[OCELOT_TAG_LEN / 4]; unsigned int i, count, last; diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h index e7685a58b7e2..7dac0edd7767 100644 --- a/drivers/net/ethernet/mscc/ocelot.h +++ b/drivers/net/ethernet/mscc/ocelot.h @@ -32,15 +32,6 @@ #define OCELOT_PTP_QUEUE_SZ128 -struct frame_info { - u32 len; - u16 port; - u16 vid; - u8 tag_type; - u16 rew_op; - u32 timestamp; /* rew_val */ -}; - struct ocelot_port_tc { bool block_shared; unsigned long offload_cnt; diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c index ea66b372d63b..504881d531e5 100644 --- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c +++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c @@ -533,7 +533,7 @@ static int ocelot_chip_init(struct ocelot *ocelot, const struct ocelot_ops *ops) return 0; } -static int ocelot_parse_ifh(u32 *_ifh, struct frame_info *info) +static int ocelot_parse_ifh(u32 *_ifh, struct ocelot_frame_info *info) { u8 llen, wlen; u64 ifh[2]; @@ -607,10 +607,10 @@ static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg) while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) { struct skb_shared_hwtstamps *shhwtstamps; + struct ocelot_frame_info info = {}; struct ocelot_port_private *priv; struct ocelot_port *ocelot_port; u64 tod_in_ns, full_ts_in_ns; - struct frame_info info = {}; struct net_device *dev; u32 ifh[4], val, *buf; struct timespec64 ts; diff --git a/include/soc/mscc/ocelot.h b/include/soc/mscc/ocelot.h index 4cbb7655ef0c..25a93bcc6afe 100644 --- a/include/soc/mscc/ocelot.h +++ b/include/soc/mscc/ocelot.h @@ -669,6 +669,15 @@ struct ocelot_policer { u32 burst; /* bytes */ }; +struct ocelot_frame_info { + u32 len; + u16 port; + u16 vid; + u8 tag_type; + u16 rew_op; + u32 timestamp; /* rew_val */ +}; + #define ocelot_read_ix(ocelot, reg, gi, ri) __ocelot_read_ix(ocelot, reg, reg##_GSZ * (gi) + reg##_RSZ * (ri)) #define ocelot_read_gix(ocelot, reg, gi) __ocelot_read_ix(ocelot, reg, reg##_GSZ * (gi)) #define ocelot_read_rix(ocelot, reg, ri) __ocelot_read_ix(ocelot, reg, reg##_RSZ * (ri)) -- 2.25.1
[RFC PATCH v2 net-next 12/15] net: mscc: ocelot: refactor ocelot_xtr_irq_handler into ocelot_xtr_poll
Since the felix DSA driver will need to poll the CPU port module for extracted frames as well, let's create some common functions that read an Extraction Frame Header, and then an skb, from a CPU extraction group. This is so complicated, because the procedure to retrieve a struct net_device pointer based on the source port is different for DSA and switchdev. So this is the reason why the polling function is split in the middle. The ocelot_xtr_poll_xfh() permits the caller to get a struct net_device pointer based on the XFH port field, then pass this to the ocelot_xtr_poll_frame() function. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/ethernet/mscc/ocelot.c | 163 + drivers/net/ethernet/mscc/ocelot.h | 6 + drivers/net/ethernet/mscc/ocelot_vsc7514.c | 158 ++-- 3 files changed, 179 insertions(+), 148 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 7d73c3251dfb..b91d4c31d3d7 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -12,6 +12,9 @@ #define TABLE_UPDATE_SLEEP_US 10 #define TABLE_UPDATE_TIMEOUT_US 10 +#define IFH_EXTRACT_BITFIELD64(x, o, w) \ + (((x) >> (o)) & GENMASK_ULL((w) - 1, 0)) + struct ocelot_mact_entry { u8 mac[ETH_ALEN]; u16 vid; @@ -566,6 +569,166 @@ void ocelot_get_txtstamp(struct ocelot *ocelot) } EXPORT_SYMBOL(ocelot_get_txtstamp); +static int ocelot_parse_xfh(u32 *_ifh, struct ocelot_frame_info *info) +{ + u8 llen, wlen; + u64 ifh[2]; + + ifh[0] = be64_to_cpu(((__force __be64 *)_ifh)[0]); + ifh[1] = be64_to_cpu(((__force __be64 *)_ifh)[1]); + + wlen = IFH_EXTRACT_BITFIELD64(ifh[0], 7, 8); + llen = IFH_EXTRACT_BITFIELD64(ifh[0], 15, 6); + + info->len = OCELOT_BUFFER_CELL_SZ * wlen + llen - 80; + + info->timestamp = IFH_EXTRACT_BITFIELD64(ifh[0], 21, 32); + + info->port = IFH_EXTRACT_BITFIELD64(ifh[1], 43, 4); + + info->tag_type = IFH_EXTRACT_BITFIELD64(ifh[1], 16, 1); + info->vid = IFH_EXTRACT_BITFIELD64(ifh[1], 0, 12); + + return 0; +} + +static int ocelot_rx_frame_word(struct ocelot *ocelot, u8 grp, bool ifh, + u32 *rval) +{ + u32 val; + u32 bytes_valid; + + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + if (val == XTR_NOT_READY) { + if (ifh) + return -EIO; + + do { + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + } while (val == XTR_NOT_READY); + } + + switch (val) { + case XTR_ABORT: + return -EIO; + case XTR_EOF_0: + case XTR_EOF_1: + case XTR_EOF_2: + case XTR_EOF_3: + case XTR_PRUNED: + bytes_valid = XTR_VALID_BYTES(val); + val = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + if (val == XTR_ESCAPE) + *rval = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + else + *rval = val; + + return bytes_valid; + case XTR_ESCAPE: + *rval = ocelot_read_rix(ocelot, QS_XTR_RD, grp); + + return 4; + default: + *rval = val; + + return 4; + } +} + +int ocelot_xtr_poll_xfh(struct ocelot *ocelot, int grp, + struct ocelot_frame_info *info) +{ + u32 ifh[OCELOT_TAG_LEN / 4]; + int i, err = 0; + + for (i = 0; i < OCELOT_TAG_LEN / 4; i++) { + err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]); + if (err != 4) + return (err < 0) ? 
err : -EIO; + } + + ocelot_parse_xfh(ifh, info); + + return 0; +} + +int ocelot_xtr_poll_frame(struct ocelot *ocelot, int grp, + struct net_device *dev, + struct ocelot_frame_info *info, + struct sk_buff **nskb) +{ + struct skb_shared_hwtstamps *shhwtstamps; + u64 tod_in_ns, full_ts_in_ns; + struct timespec64 ts; + int sz, len, buf_len; + struct sk_buff *skb; + u32 val, *buf; + int err = 0; + + skb = netdev_alloc_skb(dev, info->len); + if (unlikely(!skb)) { + netdev_err(dev, "Unable to allocate sk_buff\n"); + err = -ENOMEM; + goto out; + } + + buf_len = info->len - ETH_FCS_LEN; + buf = (u32 *)skb_put(skb, buf_len); + + len = 0; + do { + sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (sz < 0) { + err = sz; + goto out; + } + *buf++ = val; + len += sz; + } while (len < buf_len); + + /* Read the FCS */ + sz = ocelot_rx_frame_word(ocelot, grp, false, &val); + if (s
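The second helper is cut short above; for clarity, here is a rough sketch of how a caller is expected to combine the two halves. lookup_netdev() is a stand-in for the bus-specific port-to-netdevice translation and is not a real function; patch 13 in this series uses the same pattern from felix_check_xtr_pkt().

        while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) {
                struct ocelot_frame_info info = {};
                struct net_device *dev;
                struct sk_buff *skb;

                err = ocelot_xtr_poll_xfh(ocelot, grp, &info);
                if (err)
                        break;

                /* Only the caller knows how to turn the XFH source port
                 * into a net_device (switchdev vs DSA).
                 */
                dev = lookup_netdev(ocelot, info.port);

                err = ocelot_xtr_poll_frame(ocelot, grp, dev, &info, &skb);
                if (err)
                        break;

                netif_rx(skb);
        }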
[RFC PATCH v2 net-next 13/15] net: dsa: felix: setup MMIO filtering rules for PTP when using tag_8021q
Since the tag_8021q tagger is software-defined, it has no means by itself for retrieving hardware timestamps of PTP event messages. Because we do want to support PTP on ocelot even with tag_8021q, we need to use the CPU port module for that. The RX timestamp is present in the Extraction Frame Header. And because we can't use NPI mode which redirects the CPU queues to an "external CPU" (meaning the ARM CPU), then we need to poll the CPU port module through the MMIO registers to retrieve TX and RX timestamps. Sadly, on NXP LS1028A, the Felix switch was integrated into the SoC without wiring the extraction IRQ line to the ARM GIC. So, if we want to be notified of any PTP packets received on the CPU port module, we have a problem. There is a possible workaround, which is to use the Ethernet CPU port as a notification channel that packets are available on the CPU port module as well. When a PTP packet is received by the DSA tagger (without timestamp, of course), we go to the CPU extraction queues, poll for it there, then we drop the original Ethernet packet and masquerade the packet retrieved over MMIO (plus the timestamp) as the original when we inject it up the stack. Create a quirk in struct felix is selected by the Felix driver (but not by Seville, since that doesn't support PTP at all). We want to do this such that the workaround is minimally invasive for future switches that don't require this workaround. The only traffic for which we need timestamps is PTP traffic, so add a redirection rule to the CPU port module for this. Currently we only have the need for PTP over L2, so redirection rules for UDP ports 319 and 320 are TBD for now. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/dsa/ocelot/felix.h | 13 drivers/net/dsa/ocelot/felix_tag_8021q.c | 83 +++- drivers/net/dsa/ocelot/felix_vsc9959.c | 1 + 3 files changed, 96 insertions(+), 1 deletion(-) diff --git a/drivers/net/dsa/ocelot/felix.h b/drivers/net/dsa/ocelot/felix.h index 71f343326c00..c75a934c3e0b 100644 --- a/drivers/net/dsa/ocelot/felix.h +++ b/drivers/net/dsa/ocelot/felix.h @@ -25,6 +25,19 @@ struct felix_info { int switch_pci_bar; int imdio_pci_bar; const struct ptp_clock_info *ptp_caps; + + /* Some Ocelot switches are integrated into the SoC without the +* extraction IRQ line connected to the ARM GIC. By enabling this +* workaround, the few packets that are delivered to the CPU port +* module (currently only PTP) are copied not only to the hardware CPU +* port module, but also to the 802.1Q Ethernet CPU port, and polling +* the extraction registers is triggered once the DSA tagger sees a PTP +* frame. The Ethernet frame is only used as a notification: it is +* dropped, and the original frame is extracted over MMIO and annotated +* with the RX timestamp. +*/ + boolquirk_no_xtr_irq; + int (*mdio_bus_alloc)(struct ocelot *ocelot); void(*mdio_bus_free)(struct ocelot *ocelot); void(*phylink_validate)(struct ocelot *ocelot, int port, diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.c b/drivers/net/dsa/ocelot/felix_tag_8021q.c index f209273bbf69..5243e55a8054 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.c +++ b/drivers/net/dsa/ocelot/felix_tag_8021q.c @@ -142,6 +142,85 @@ static const struct dsa_8021q_ops felix_tag_8021q_ops = { .vlan_add = felix_tag_8021q_vlan_add, }; +/* Set up a VCAP IS2 rule for delivering PTP frames to the CPU port module. + * If the NET_DSA_TAG_OCELOT_QUIRK_NO_XTR_IRQ is in place, then also copy those + * PTP frames to the tag_8021q CPU port. 
+ */ +static int felix_setup_mmio_filtering(struct felix *felix) +{ + struct ocelot_vcap_filter *redirect_rule; + struct ocelot_vcap_filter *tagging_rule; + struct ocelot *ocelot = &felix->ocelot; + unsigned long ingress_port_mask; + int cpu = ocelot->dsa_8021q_cpu; + int ret; + + tagging_rule = kzalloc(sizeof(struct ocelot_vcap_filter), GFP_KERNEL); + if (!tagging_rule) + return -ENOMEM; + + redirect_rule = kzalloc(sizeof(struct ocelot_vcap_filter), GFP_KERNEL); + if (!redirect_rule) { + kfree(tagging_rule); + return -ENOMEM; + } + + ingress_port_mask = GENMASK(ocelot->num_phys_ports - 1, 0) & ~BIT(cpu); + + tagging_rule->key_type = OCELOT_VCAP_KEY_ETYPE; + *(u16 *)tagging_rule->key.etype.etype.value = htons(ETH_P_1588); + *(u16 *)tagging_rule->key.etype.etype.mask = 0x; + tagging_rule->ingress_port_mask = ingress_port_mask; + tagging_rule->prio = 1; + tagging_rule->id.cookie = ocelot->num_phys_ports; + tagging_rule->id.tc_offload = false; + tagging_rule->block_id = VCAP_IS1; + tagging_
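The function body is truncated right after the tagging rule starts being filled in. Based on the changelog, the remainder presumably sets up the redirect rule and installs both filters, roughly along these lines (my reconstruction of the described intent, not the actual hunk; details such as the cookie value are assumptions):

        redirect_rule->key_type = OCELOT_VCAP_KEY_ETYPE;
        *(u16 *)redirect_rule->key.etype.etype.value = htons(ETH_P_1588);
        *(u16 *)redirect_rule->key.etype.etype.mask = 0xffff;
        redirect_rule->ingress_port_mask = ingress_port_mask;
        redirect_rule->prio = 1;
        redirect_rule->id.cookie = ocelot->num_phys_ports + 1;
        redirect_rule->id.tc_offload = false;
        redirect_rule->block_id = VCAP_IS2;
        redirect_rule->action.cpu_copy_ena = true;
        if (felix->info->quirk_no_xtr_irq) {
                /* Deliver to the CPU port module and also redirect to the
                 * tag_8021q CPU port, which acts as the notification channel.
                 */
                redirect_rule->action.mask_mode = OCELOT_MASK_MODE_REDIRECT;
                redirect_rule->action.port_mask = BIT(cpu);
        } else {
                /* Deliver to the CPU port module only. */
                redirect_rule->action.mask_mode = OCELOT_MASK_MODE_PERMIT_DENY;
                redirect_rule->action.port_mask = 0;
        }

        ret = ocelot_vcap_filter_add(ocelot, tagging_rule, NULL);
        if (!ret)
                ret = ocelot_vcap_filter_add(ocelot, redirect_rule, NULL);
        /* (error unwinding / freeing of unused rules omitted in this sketch) */

        return ret;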
[RFC PATCH v2 net-next 15/15] net: dsa: tag_ocelot_8021q: add support for PTP timestamping
On TX, use the result of the ptp_classify_raw() BPF classifier from dsa_skb_tx_timestamp() to divert some frames over to the MMIO-based injection registers. On RX, set up a VCAP IS2 rule that redirects the frames with an EtherType for 1588 to the CPU port module (for MMIO based extraction) and, if the "no XTR IRQ" workaround is in place, copies them to the dsa_8021q CPU port as well (for notification). There is a conflict between the VCAP IS2 trapping rule and the semantics of the BPF classifier. Namely, ptp_classify_raw() deems general messages as non-timestampable, but still, those are trapped to the CPU port module since they have an EtherType of ETH_P_1588. So, if the "no XTR IRQ" workaround is in place, we need to run another BPF classifier on the frames extracted over MMIO, to avoid duplicates being sent to the stack (once over Ethernet, once over MMIO). It doesn't look like it's possible to install VCAP IS2 rules based on keys extracted from the 1588 frame headers. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. drivers/net/dsa/ocelot/felix.c | 12 + drivers/net/dsa/ocelot/felix_tag_8021q.c | 61 drivers/net/dsa/ocelot/felix_tag_8021q.h | 7 +++ drivers/net/ethernet/mscc/ocelot.c | 3 ++ drivers/net/ethernet/mscc/ocelot.h | 8 include/soc/mscc/ocelot.h| 9 net/dsa/tag_ocelot_8021q.c | 24 ++ 7 files changed, 116 insertions(+), 8 deletions(-) diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c index 77f73c6bad0b..a1ee6884f11c 100644 --- a/drivers/net/dsa/ocelot/felix.c +++ b/drivers/net/dsa/ocelot/felix.c @@ -757,6 +757,18 @@ static bool felix_rxtstamp(struct dsa_switch *ds, int port, struct timespec64 ts; u64 tstamp, val; + /* If the "no XTR IRQ" workaround is in use, tell DSA to defer this skb +* for RX timestamping. Then free it, and poll for its copy through +* MMIO in the CPU port module, and inject that into the stack from +* ocelot_xtr_poll(). +* If the "no XTR IRQ" workaround isn't in use, this is a no-op and +* should be eliminated by the compiler as dead code. +*/ + if (felix_check_xtr_pkt(ocelot, type)) { + kfree_skb(skb); + return true; + } + ocelot_ptp_gettime64(&ocelot->ptp_info, &ts); tstamp = ktime_set(ts.tv_sec, ts.tv_nsec); diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.c b/drivers/net/dsa/ocelot/felix_tag_8021q.c index 5243e55a8054..2a7ab3ddb12b 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.c +++ b/drivers/net/dsa/ocelot/felix_tag_8021q.c @@ -11,9 +11,70 @@ #include #include #include +#include #include "felix.h" #include "felix_tag_8021q.h" +bool felix_check_xtr_pkt(struct ocelot *ocelot, unsigned int ptp_type) +{ + struct felix *felix = ocelot_to_felix(ocelot); + int err, grp = 0; + + if (!felix->info->quirk_no_xtr_irq) + return false; + + if (ptp_type == PTP_CLASS_NONE) + return false; + + while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) { + struct ocelot_frame_info info = {}; + struct dsa_port *dp; + struct sk_buff *skb; + unsigned int type; + + err = ocelot_xtr_poll_xfh(ocelot, grp, &info); + if (err) + break; + + if (WARN_ON(info.port >= ocelot->num_phys_ports)) + goto out; + + dp = dsa_to_port(felix->ds, info.port); + + err = ocelot_xtr_poll_frame(ocelot, grp, dp->slave, + &info, &skb); + if (err) + break; + + /* We trap to the CPU port module all PTP frames, but +* felix_rxtstamp() only gets called for event frames. +* So we need to avoid sending duplicate general +* message frames by running a second BPF classifier +* here and dropping those. 
+*/ + __skb_push(skb, ETH_HLEN); + + type = ptp_classify_raw(skb); + + __skb_pull(skb, ETH_HLEN); + + if (type == PTP_CLASS_NONE) { + kfree_skb(skb); + continue; + } + + netif_rx(skb); + } + +out: + if (err < 0) { + ocelot_write(ocelot, QS_XTR_FLUSH, BIT(grp)); + ocelot_write(ocelot, QS_XTR_FLUSH, 0); + } + + return true; +} + static int felix_tag_8021q_rxvlan_add(struct ocelot *ocelot, int port, u16 vid, bool pvid, bool untagged) { diff --git a/drivers/net/dsa/ocelot/felix_tag_8021q.h b/drivers/net/dsa/ocelot/felix_tag_8021q.h index 47684a18c96e..1bf92d5fc4c4 100644 --- a/drivers/net/dsa/ocelot/felix_tag_8021q.h +++
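The tagger-side TX hunk is cut off above; what the changelog describes boils down to something like the following (illustrative only; the surrounding queueing, rew_op computation and skb ownership handling are elided):

        /* On the transmit path, once ptp_classify_raw() has identified the
         * skb as a PTP message, inject it through the CPU port module
         * registers instead of sending it via the DSA master.
         */
        if (type != PTP_CLASS_NONE && ocelot_can_inject(ocelot, 0))
                ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);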
[RFC PATCH v2 net-next 14/15] net: dsa: felix: use promisc on master for PTP with tag_8021q
This is for the "no extraction IRQ" workaround, where the DSA master on LS1028A (enetc) serves as a de-facto irqchip. It needs to be promiscuous so that it will never drop a PTP frame (sent to the 01-80-c2-00-00-0e multicast MAC address), otherwise the tagger will get confused about which Ethernet PTP frame corresponds to which PTP frame over the MMIO registers. Signed-off-by: Vladimir Oltean --- Changes in v2: Patch is new. net/dsa/tag_ocelot_8021q.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c index fa89c6a5bb7d..1b5102e74369 100644 --- a/net/dsa/tag_ocelot_8021q.c +++ b/net/dsa/tag_ocelot_8021q.c @@ -53,6 +53,7 @@ static struct dsa_device_ops ocelot_netdev_ops = { .xmit = ocelot_xmit, .rcv= ocelot_rcv, .overhead = VLAN_HLEN, + .promisc_on_master = true, }; MODULE_LICENSE("GPL v2"); -- 2.25.1
Re: [RFC PATCH v2 3/8] net: sparx5: add hostmode with phylink support
Hi Andrew, On Sat, 2020-12-19 at 20:51 +0100, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you > know the content is safe > > > + /* Create a phylink for PHY management. Also handles SFPs */ > > + spx5_port->phylink_config.dev = &spx5_port->ndev->dev; > > + spx5_port->phylink_co > > nfig.type = PHYLINK_NETDEV; > > + spx5_port->phylink_config.pcs_poll = true; > > + > > + /* phylink needs a valid interface mode to parse dt node */ > > + if (phy_mode == PHY_INTERFACE_MODE_NA) > > + phy_mode = PHY_INTERFACE_MODE_10GBASER; > > Maybe just enforce a valid value in DT? Maybe I need to clarify that you must choose between an Ethernet cuPHY or an SFP, so it is optional. > > > +/* Configuration */ > > +static inline bool sparx5_use_cu_phy(struct sparx5_port *port) > > +{ > > + return port->conf.phy_mode != PHY_INTERFACE_MODE_NA; > > +} > > That is a rather odd definition of copper. Should I rather use a bool property to select between the two options (cuPHY or SFP)? > > > diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > b/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > new file mode 100644 > > index ..6f9282e9d3f4 > > --- /dev/null > > +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_netdev.c > > @@ -0,0 +1,203 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > +/* Microchip Sparx5 Switch driver > > + * > > + * Copyright (c) 2020 Microchip Technology Inc. and its > > subsidiaries. > > + */ > > + > > +#include "sparx5_main.h" > > I don't actually know what is preferred here, but very few drivers > i've reviewed put all the required headers into another header > file. They normally list them in each .c file. I will look at reworking this. > > > +static int sparx5_port_open(struct net_device *ndev) > > +{ > > + struct sparx5_port *port = netdev_priv(ndev); > > + int err = 0; > > + > > + err = phylink_of_phy_connect(port->phylink, port->of_node, > > 0); > > + if (err) { > > + netdev_err(ndev, "Could not attach to PHY\n"); > > + return err; > > + } > > + > > + phylink_start(port->phylink); > > + > > + if (!ndev->phydev) { > > Humm. When is ndev->phydev set? I don't think phylink ever sets it. Indirectly: phylink_of_phy_connect uses phy_attach_direct and that sets the phydev. > > > + /* power up serdes */ > > + port->conf.power_down = false; > > + err = phy_power_on(port->serdes); > > + if (err) > > + netdev_err(ndev, "%s failed\n", __func__); > > + } > > + > > + return err; > > +} > > > +struct net_device *sparx5_create_netdev(struct sparx5 *sparx5, u32 > > portno) > > +{ > > + struct net_device *ndev; > > + struct sparx5_port *spx5_port; > > + > > + ndev = devm_alloc_etherdev(sparx5->dev, sizeof(struct > > sparx5_port)); > > + if (!ndev) > > + return ERR_PTR(-ENOMEM); > > + > > + SET_NETDEV_DEV(ndev, sparx5->dev); > > + spx5_port = netdev_priv(ndev); > > + spx5_port->ndev = ndev; > > + spx5_port->sparx5 = sparx5; > > + spx5_port->portno = portno; > > + sparx5_set_port_ifh(spx5_port->ifh, portno); > > + snprintf(ndev->name, IFNAMSIZ, "eth%d", portno); > > + > > + ether_setup(ndev); > > devm_alloc_etherdev() should of already called ether_setup(). Ah - yes it is the setup(dev) call in alloc_netdev_mqs. I will remove that then. > > > + ndev->netdev_ops = &sparx5_port_netdev_ops; > > + ndev->features |= NETIF_F_LLTX; /* software tx */ > > + > > + ether_addr_copy(ndev->dev_addr, sparx5->base_mac); > > + ndev->dev_addr[ETH_ALEN - 1] += portno + 1; > > That will cause some surprises with wrap around. Use eth_addr_inc() OK - will do. 
> > > +static void sparx5_xtr_grp(struct sparx5 *sparx5, u8 grp, bool > > byte_swap) > > +{ > > + int i, byte_cnt = 0; > > + bool eof_flag = false, pruned_flag = false, abort_flag = > > false; > > + u32 ifh[IFH_LEN]; > > + struct sk_buff *skb; > > + struct frame_info fi; > > + struct sparx5_port *port; > > + struct net_device *netdev; > > + u32 *rxbuf; > > + > > + /* Get IFH */ > > + for (i = 0; i < IFH_LEN; i++) > > + ifh[i] = spx5_rd(sparx5, QS_XTR_RD(grp)); > > + > > + /* Decode IFH (whats needed) */ > > + sparx5_ifh_parse(ifh, &fi); > > + > > + /* Map to port netdev */ > > + port = fi.src_port < SPX5_PORTS ? > > + sparx5->ports[fi.src_port] : NULL; > > + if (!port || !port->ndev) { > > + dev_err(sparx5->dev, "Data on inactive port %d\n", > > fi.src_port); > > + sparx5_xtr_flush(sparx5, grp); > > + return; > > + } > > + > > + /* Have netdev, get skb */ > > + netdev = port->ndev; > > + skb = netdev_alloc_skb(netdev, netdev->mtu + ETH_HLEN); > > + if (!skb) { > > + sparx5_xtr_flush(sparx5, grp); > > +
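On the dev_addr wrap-around point raised earlier in this reply, a carry-safe way to derive the per-port MAC is something along these lines (a sketch; calling eth_addr_inc() in a loop or doing the 64-bit math directly are both fine):

        u64 mac = ether_addr_to_u64(sparx5->base_mac) + portno + 1;

        u64_to_ether_addr(mac, ndev->dev_addr);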
[PATCH wireless v3 -next] brcmfmac: Delete useless kfree code
A null pointer will be passed to a kfree() call after a kzalloc() call failed. This code is useless. Thus delete the extra function call. A goto statement is also no longer needed. Thus adjust an if branch. Signed-off-by: Zheng Yongjun --- .../wireless/broadcom/brcm80211/brcmfmac/firmware.c| 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c index d821a4758f8c..d40104b8df55 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c @@ -319,8 +319,10 @@ static void brcmf_fw_strip_multi_v2(struct nvram_parser *nvp, u16 domain_nr, u8 *nvram; nvram = kzalloc(nvp->nvram_len + 1 + 3 + sizeof(u32), GFP_KERNEL); - if (!nvram) - goto fail; + if (!nvram) { + nvp->nvram_len = 0; + return; + } /* Copy all valid entries, release old nvram and assign new one. * Valid entries are of type pcie/X/Y/ where X = domain_nr and @@ -350,10 +352,6 @@ static void brcmf_fw_strip_multi_v2(struct nvram_parser *nvp, u16 domain_nr, kfree(nvp->nvram); nvp->nvram = nvram; nvp->nvram_len = j; - return; -fail: - kfree(nvram); - nvp->nvram_len = 0; } static void brcmf_fw_add_defaults(struct nvram_parser *nvp) -- 2.22.0
RE: [PATCH v3] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
From: Andrea Parri (Microsoft) Sent: Monday, December 7, 2020 8:53 PM > > From: Andres Beltran > > Pointers to ring-buffer packets sent by Hyper-V are used within the > guest VM. Hyper-V can send packets with erroneous values or modify > packet fields after they are processed by the guest. To defend > against these scenarios, return a copy of the incoming VMBus packet > after validating its length and offset fields in hv_pkt_iter_first(). > In this way, the packet can no longer be modified by the host. > > Signed-off-by: Andres Beltran > Co-developed-by: Andrea Parri (Microsoft) > Signed-off-by: Andrea Parri (Microsoft) > Cc: "David S. Miller" > Cc: Jakub Kicinski > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: netdev@vger.kernel.org > Cc: linux-s...@vger.kernel.org > --- > drivers/hv/channel.c | 9 ++-- > drivers/hv/hv_fcopy.c | 1 + > drivers/hv/hv_kvp.c | 1 + > drivers/hv/hyperv_vmbus.h | 2 +- > drivers/hv/ring_buffer.c | 82 ++- > drivers/net/hyperv/hyperv_net.h | 3 ++ > drivers/net/hyperv/netvsc.c | 2 + > drivers/net/hyperv/rndis_filter.c | 2 + > drivers/scsi/storvsc_drv.c| 10 > include/linux/hyperv.h| 48 +++--- > net/vmw_vsock/hyperv_transport.c | 4 +- > 11 files changed, 139 insertions(+), 25 deletions(-) > Reviewed-by: Michael Kelley
Re: Reporting SFP presence status
On Tue, Dec 22, 2020 at 07:28:10AM +0100, Martin Hundebøll wrote: > Hi Andrew, > > On 21/12/2020 16.22, Andrew Lunn wrote: > > On Mon, Dec 21, 2020 at 11:37:55AM +0100, Martin Hundebøll wrote: > > > Hi Andrew, > > > > > > I've browsed the code in drivers/net/phy, but haven't found a place where > > > the SFP module status/change is reported to user-space. Is there a > > > "standard" way to report insert/remove events for SFP modules, or should > > > we > > > just add a custom sysfs attribute to our driver? > > > > Hi Martin > > > > There is currently no standard way of notifying user space. But it is > > something which could be added. But it should not be systfs. This > > should be a netlink notification, probably as part of ethtool netlink > > API. Or maybe the extended link info. > > > > What is your intended use case? Why do you need to know when a module > > has been inserted? It seems like you cannot do too much on such a > > notification. It seems lots of modules don't conform to the standard, > > will not immediately respond on the i2c bus. So ethtool -m is probably > > not going to be useful. You probably need to poll until it does > > respond, which defeats the purpose of having a notification. > > You're right; a notification isn't what I need. But a way to query the > current state of the module would be nice, i.e. using ethtool. What do you mean by state? ethtool -m gives you some state information. ENODEV gives you an idea that there is no module inserted. Lots of data suggests there is a module. You can decode the data to get a lot of information. There was also a patchset from Russell King a few weeks ago exposing some information in debugfs. But since it is debugfs, you cannot rely on it. Back to, what is you real use cases here? Andrew
Re: [PATCH net v2 2/2] vhost_net: fix high cpu load when sendmsg fails
On Mon, Dec 21, 2020 at 11:41 PM Jason Wang wrote: > > > On 2020/12/22 上午7:07, Willem de Bruijn wrote: > > On Wed, Dec 16, 2020 at 3:20 AM wangyunjian wrote: > >> From: Yunjian Wang > >> > >> Currently we break the loop and wake up the vhost_worker when > >> sendmsg fails. When the worker wakes up again, we'll meet the > >> same error. > > The patch is based on the assumption that such error cases always > > return EAGAIN. Can it not also be ENOMEM, such as from tun_build_skb? > > > >> This will cause high CPU load. To fix this issue, > >> we can skip this description by ignoring the error. When we > >> exceeds sndbuf, the return value of sendmsg is -EAGAIN. In > >> the case we don't skip the description and don't drop packet. > > the -> that > > > > here and above: description -> descriptor > > > > Perhaps slightly revise to more explicitly state that > > > > 1. in the case of persistent failure (i.e., bad packet), the driver > > drops the packet > > 2. in the case of transient failure (e.g,. memory pressure) the driver > > schedules the worker to try again later > > > If we want to go with this way, we need a better time to wakeup the > worker. Otherwise it just produces more stress on the cpu that is what > this patch tries to avoid. Perhaps I misunderstood the purpose of the patch: is it to drop everything, regardless of transient or persistent failure, until the ring runs out of descriptors? I can understand both a blocking and drop strategy during memory pressure. But partial drop strategy until exceeding ring capacity seems like a peculiar hybrid?
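To make the two cases in the summary concrete, the split being discussed looks roughly like this in the tx loop (a sketch built from the error codes mentioned in the thread, not the actual patch):

                err = sock->ops->sendmsg(sock, &msg, len);
                if (err == -EAGAIN || err == -ENOMEM) {
                        /* Transient failure (sndbuf exhausted, memory
                         * pressure): keep the descriptor and let the worker
                         * retry later.
                         */
                        vhost_discard_vq_desc(vq, 1);
                        break;
                } else if (err < 0) {
                        /* Persistent failure (malformed packet): drop it,
                         * consume the descriptor and move on.
                         */
                        vq_err(vq, "sendmsg err %d, dropping packet\n", err);
                }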
Re: [PATCH -next] scsi: megaraid: Remove unnecessary memset
Zheng Yongjun writes: > memcpy operation is next to memset code, and the size to copy is equals to > the size to > memset, so the memset operation is unnecessary, remove it. > > Signed-off-by: Zheng Yongjun > --- > drivers/net/wireless/ath/wcn36xx/smd.c | 1 - > 1 file changed, 1 deletion(-) The title is wrong, this is not about scsi. Please resend as v2. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 3/8] net: sparx5: add hostmode with phylink support
On Tue, Dec 22, 2020 at 10:46:12AM +0100, Steen Hegelund wrote: > Hi Andrew, > > On Sat, 2020-12-19 at 20:51 +0100, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you > > know the content is safe > > > > > + /* Create a phylink for PHY management. Also handles SFPs */ > > > + spx5_port->phylink_config.dev = &spx5_port->ndev->dev; > > > + spx5_port->phylink_co > > > nfig.type = PHYLINK_NETDEV; > > > + spx5_port->phylink_config.pcs_poll = true; > > > + > > > + /* phylink needs a valid interface mode to parse dt node */ > > > + if (phy_mode == PHY_INTERFACE_MODE_NA) > > > + phy_mode = PHY_INTERFACE_MODE_10GBASER; > > > > Maybe just enforce a valid value in DT? > > Maybe I need to clarify that you must choose between an Ethernet cuPHY > or an SFP, so it is optional. But you also need to watch out for somebody putting a copper modules in an SFP port. phylink will then set the mode to SGMII for a 1G copper module, etc. > > > +/* Configuration */ > > > +static inline bool sparx5_use_cu_phy(struct sparx5_port *port) > > > +{ > > > + return port->conf.phy_mode != PHY_INTERFACE_MODE_NA; > > > +} > > > > That is a rather odd definition of copper. > > Should I rather use a bool property to select between the two options > (cuPHY or SFP)? I guess what you are trying to indicate is between a hard wired Copper PHY and an SFP cage? You have some sort of MII switch which allows the MAC to be connected to either the QSGMII PHY, or an SFP cage? But since the SFP cage could be populated with a copper PHY, and PHYLINK will then instantiate a phylib copper PHY driver for it, looking at phy_mode is not reliable. You need a property which selects the port, not the technology. > > > +static int sparx5_port_open(struct net_device *ndev) > > > +{ > > > + struct sparx5_port *port = netdev_priv(ndev); > > > + int err = 0; > > > + > > > + err = phylink_of_phy_connect(port->phylink, port->of_node, > > > 0); > > > + if (err) { > > > + netdev_err(ndev, "Could not attach to PHY\n"); > > > + return err; > > > + } > > > + > > > + phylink_start(port->phylink); > > > + > > > + if (!ndev->phydev) { > > > > Humm. When is ndev->phydev set? I don't think phylink ever sets it. > > Indirectly: phylink_of_phy_connect uses phy_attach_direct and that sets > the phydev. Ah, O.K. But watch out for a copper SFP module! > > > +static void sparx5_xtr_grp(struct sparx5 *sparx5, u8 grp, bool > > > byte_swap) > > > +{ > > > + int i, byte_cnt = 0; > > > + bool eof_flag = false, pruned_flag = false, abort_flag = > > > false; > > > + u32 ifh[IFH_LEN]; > > > + struct sk_buff *skb; > > > + struct frame_info fi; > > > + struct sparx5_port *port; > > > + struct net_device *netdev; > > > + u32 *rxbuf; > > > + > > > + /* Get IFH */ > > > + for (i = 0; i < IFH_LEN; i++) > > > + ifh[i] = spx5_rd(sparx5, QS_XTR_RD(grp)); > > > + > > > + /* Decode IFH (whats needed) */ > > > + sparx5_ifh_parse(ifh, &fi); > > > + > > > + /* Map to port netdev */ > > > + port = fi.src_port < SPX5_PORTS ? 
> > > + sparx5->ports[fi.src_port] : NULL; > > > + if (!port || !port->ndev) { > > > + dev_err(sparx5->dev, "Data on inactive port %d\n", > > > fi.src_port); > > > + sparx5_xtr_flush(sparx5, grp); > > > + return; > > > + } > > > + > > > + /* Have netdev, get skb */ > > > + netdev = port->ndev; > > > + skb = netdev_alloc_skb(netdev, netdev->mtu + ETH_HLEN); > > > + if (!skb) { > > > + sparx5_xtr_flush(sparx5, grp); > > > + dev_err(sparx5->dev, "No skb allocated\n"); > > > + return; > > > + } > > > + rxbuf = (u32 *)skb->data; > > > + > > > + /* Now, pull frame data */ > > > + while (!eof_flag) { > > > + u32 val = spx5_rd(sparx5, QS_XTR_RD(grp)); > > > + u32 cmp = val; > > > + > > > + if (byte_swap) > > > + cmp = ntohl((__force __be32)val); > > > + > > > + switch (cmp) { > > > + case XTR_NOT_READY: > > > + break; > > > + case XTR_ABORT: > > > + /* No accompanying data */ > > > + abort_flag = true; > > > + eof_flag = true; > > > + break; > > > + case XTR_EOF_0: > > > + case XTR_EOF_1: > > > + case XTR_EOF_2: > > > + case XTR_EOF_3: > > > + /* This assumes STATUS_WORD_POS == 1, Status > > > + * just after last data > > > + */ > > > + byte_cnt -= (4 - XTR_VALID_BYTES(val)); > > > + eof_flag = true; > > > + break; > > > + case XTR_PRUNED: > > > + /* But get the last 4 bytes as well
Re: [PATCH 05/12 v2 RFC] skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Replace sock_zerocopy_put with the generic skb_zcopy_put() > function. Pass 'true' as the success argument, as this > is identical to no change. > > Signed-off-by: Jonathan Lemon uarg->zerocopy may be false if sock_zerocopy_put_abort is called from tcp_sendmsg_locked
Re: [PATCH 01/12 v2 RFC] net: group skb_shinfo zerocopy related bits together.
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > In preparation for expanded zerocopy (TX and RX), move > the ZC related bits out of tx_flags into their own flag > word. > > Signed-off-by: Jonathan Lemon I think it's better to expand tx_flags to a u16 and add the two new flags that you need. That allows the additional 7 bits to be used for arbitrary flags, not stranding 8 bits exactly only for zerocopy features. Moving around a few u8's in the same cacheline won't be a problem. I also prefer not to rename flags that some of us are familiar with, if it's not needed.
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Before this change, the caller of sock_zerocopy_callback would > need to save the zerocopy status, decrement and check the refcount, > and then call the callback function - the callback was only invoked > when the refcount reached zero. > > Now, the caller just passes the status into the callback function, > which saves the status and handles its own refcounts. > > This makes the behavior of the sock_zerocopy_callback identical > to the tpacket and vhost callbacks. > > Signed-off-by: Jonathan Lemon > --- > include/linux/skbuff.h | 3 --- > net/core/skbuff.c | 14 +++--- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index fb6dd6af0f82..c9d7de9d624d 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -1482,9 +1482,6 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, > bool zerocopy) > if (uarg) { > if (skb_zcopy_is_nouarg(skb)) { > /* no notification callback */ > - } else if (uarg->callback == sock_zerocopy_callback) { > - uarg->zerocopy = uarg->zerocopy && zerocopy; > - sock_zerocopy_put(uarg); > } else { > uarg->callback(uarg, zerocopy); > } > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index ea32b3414ad6..73699dbdc4a1 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1194,7 +1194,7 @@ static bool skb_zerocopy_notify_extend(struct sk_buff > *skb, u32 lo, u16 len) > return true; > } > > -void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > +static void __sock_zerocopy_callback(struct ubuf_info *uarg) > { > struct sk_buff *tail, *skb = skb_from_uarg(uarg); > struct sock_exterr_skb *serr; > @@ -1222,7 +1222,7 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > bool success) > serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY; > serr->ee.ee_data = hi; > serr->ee.ee_info = lo; > - if (!success) > + if (!uarg->zerocopy) > serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED; > > q = &sk->sk_error_queue; > @@ -1241,11 +1241,19 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > bool success) > consume_skb(skb); > sock_put(sk); > } > + > +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > +{ > + uarg->zerocopy = uarg->zerocopy & success; > + > + if (refcount_dec_and_test(&uarg->refcnt)) > + __sock_zerocopy_callback(uarg); > +} > EXPORT_SYMBOL_GPL(sock_zerocopy_callback); I still think this helper is unnecessary. Just return immediately in existing sock_zerocopy_callback if refcount is not zero. > void sock_zerocopy_put(struct ubuf_info *uarg) > { > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) > + if (uarg) > uarg->callback(uarg, uarg->zerocopy); > } > EXPORT_SYMBOL_GPL(sock_zerocopy_put); This does increase the number of indirect function calls. Which are not cheap post spectre. In the common case for msg_zerocopy we only have two clones, one sent out and one on the retransmit queue. So I guess the cost will be acceptable.
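For reference, the no-helper variant suggested above would keep a single function and just bail out early (a sketch, with the unchanged notification body elided):

void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
{
        uarg->zerocopy = uarg->zerocopy & success;

        if (!refcount_dec_and_test(&uarg->refcnt))
                return;

        /* ... existing error-queue notification body ... */
}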
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > All 'struct ubuf_info' users should have a callback defined. > Remove the dead code path to consume_skb(), which makes > unwarranted assumptions about how the structure was allocated. > > Signed-off-by: Jonathan Lemon Please link to the commit I shared that made the consume_skb path obsolete. Before that this would have been unsafe. should have a callback defined -> have a callback defined as of commit 0a4a060bb204 ("sock: fix zerocopy_success regression with msg_zerocopy"). With that explanation why this is correct Acked-by: Willem de Bruijn
kernel BUG at lib/string.c:LINE! (6)
Hello, syzbot found the following issue on: HEAD commit:d64c6f96 Merge tag 'net-5.11-rc1' of git://git.kernel.org/.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=10bc561350 kernel config: https://syzkaller.appspot.com/x/.config?x=aca0dc5c721fe9e5 dashboard link: https://syzkaller.appspot.com/bug?extid=e86f7c428c8c50db65b4 compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=169378a750 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=144692cb50 The issue was bisected to: commit 2f78788b55baa3410b1ec91a576286abe1ad4d6a Author: Jakub Jelinek Date: Wed Dec 16 04:43:37 2020 + ilog2: improve ilog2 for constant arguments bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1584f13750 final oops: https://syzkaller.appspot.com/x/report.txt?x=1784f13750 console output: https://syzkaller.appspot.com/x/log.txt?x=1384f13750 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+e86f7c428c8c50db6...@syzkaller.appspotmail.com Fixes: 2f78788b55ba ("ilog2: improve ilog2 for constant arguments") detected buffer overflow in strlen [ cut here ] kernel BUG at lib/string.c:1149! invalid opcode: [#1] PREEMPT SMP KASAN CPU: 0 PID: 8713 Comm: syz-executor731 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:fortify_panic+0xf/0x11 lib/string.c:1149 Code: b5 78 a3 04 48 c7 c7 c0 8f c2 89 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 30 ba ee ff 48 89 fe 48 c7 c7 80 90 c2 89 e8 21 ba ee ff <0f> 0b e8 90 f9 97 f8 0f b6 f3 48 c7 c7 20 f4 10 8c e8 41 e8 fc fa RSP: 0018:c900020af500 EFLAGS: 00010282 RAX: 0022 RBX: 888011c26768 RCX: RDX: 88801bad RSI: 815a6925 RDI: f52000415e92 RBP: 88801be7c220 R08: 0022 R09: R10: 815a4d7b R11: R12: 88801180ec00 R13: 888011c26700 R14: 192000415ea2 R15: 0010 FS: 00812880() GS:8880b9c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 006dcf60 CR3: 141ee000 CR4: 001506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: strlen include/linux/string.h:325 [inline] strlcpy include/linux/string.h:348 [inline] xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143 xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1019 check_target net/ipv6/netfilter/ip6_tables.c:529 [inline] find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:572 translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:734 do_replace net/ipv6/netfilter/ip6_tables.c:1152 [inline] do_ip6t_set_ctl+0x553/0xb70 net/ipv6/netfilter/ip6_tables.c:1636 nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101 ipv6_setsockopt+0x122/0x180 net/ipv6/ipv6_sockglue.c:1008 tcp_setsockopt+0x136/0x2440 net/ipv4/tcp.c:3597 __sys_setsockopt+0x2db/0x610 net/socket.c:2115 __do_sys_setsockopt net/socket.c:2126 [inline] __se_sys_setsockopt net/socket.c:2123 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2123 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4493d9 Code: e8 0c ca 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 9b cb fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fff679a3898 EFLAGS: 0246 ORIG_RAX: 0036 RAX: ffda RBX: 22c0 RCX: 004493d9 RDX: 0040 RSI: 0029 RDI: 0006 RBP: 7fff679a38b0 R08: 0470 R09: 00c2 R10: 2080 R11: 0246 R12: 000112d5 R13: 006d7dc8 R14: R15: Modules linked in: ---[ end trace e17a915ca7e8b666 ]--- RIP: 0010:fortify_panic+0xf/0x11 
lib/string.c:1149 Code: b5 78 a3 04 48 c7 c7 c0 8f c2 89 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 30 ba ee ff 48 89 fe 48 c7 c7 80 90 c2 89 e8 21 ba ee ff <0f> 0b e8 90 f9 97 f8 0f b6 f3 48 c7 c7 20 f4 10 8c e8 41 e8 fc fa RSP: 0018:c900020af500 EFLAGS: 00010282 RAX: 0022 RBX: 888011c26768 RCX: RDX: 88801bad RSI: 815a6925 RDI: f52000415e92 RBP: 88801be7c220 R08: 0022 R09: R10: 815a4d7b R11: R12: 88801180ec00 R13: 888011c26700 R14: 192000415ea2 R15: 0010 FS: 00812880() GS:8880b9c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 006dcf60 CR3: 141ee000 CR4: 001506f0 DR0: DR1: 000
[RFC v2 00/13] Introduce VDUSE - vDPA Device in Userspace
This series introduces a framework that can be used to implement vDPA devices in a userspace program. The work consists of two parts: control path forwarding and data path offloading.

In the control path, the VDUSE driver uses a message mechanism to forward config operations from the vdpa bus driver to userspace. Userspace can use read()/write() to receive and reply to those control messages.

In the data path, the core job is mapping the DMA buffer into the VDUSE daemon's address space, which can be implemented in different ways depending on the vdpa bus to which the vDPA device is attached. In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with a bounce-buffering mechanism to achieve that. In the vhost-vdpa case, the DMA buffer resides in a userspace memory region which can be shared with the VDUSE userspace process by transferring the shmfd.

The details and our use case are shown below:

[Architecture diagram: a container's /dev/vdx (virtio-blk driver) and a QEMU VM's block device (vhost-vdpa driver via /dev/vhost-vdpa-x) both attach, through the virtio/vdpa bus and the vduse driver, to a vDPA device emulated by the VDUSE daemon, whose block driver connects over TCP/IP through the NIC to the remote storages.]

We make use of it to implement a block device connecting to our distributed storage, which can be used both in containers and VMs. Thus, we can have a unified technology stack in these two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse

Future work:
- Improve perf
[RFC v2 02/13] eventfd: track eventfd_signal() recursion depth separately in different cases
Now we have a global percpu counter to limit the recursion depth of eventfd_signal(). This can avoid deadlock or stack overflow. But in stack overflow case, it should be OK to increase the recursion depth if needed. So we add a percpu counter in eventfd_ctx to limit the recursion depth for deadlock case. Then it could be fine to increase the global percpu counter later. Signed-off-by: Xie Yongji --- fs/aio.c| 3 ++- fs/eventfd.c| 20 +++- include/linux/eventfd.h | 5 + 3 files changed, 22 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 1f32da13d39e..5d82903161f5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1698,7 +1698,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, list_del(&iocb->ki_list); iocb->ki_res.res = mangle_poll(mask); req->done = true; - if (iocb->ki_eventfd && eventfd_signal_count()) { + if (iocb->ki_eventfd && + eventfd_signal_count(iocb->ki_eventfd)) { iocb = NULL; INIT_WORK(&req->work, aio_poll_put_work); schedule_work(&req->work); diff --git a/fs/eventfd.c b/fs/eventfd.c index e265b6dd4f34..2df24f9bada3 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -25,6 +25,8 @@ #include #include +#define EVENTFD_WAKE_DEPTH 0 + DEFINE_PER_CPU(int, eventfd_wake_count); static DEFINE_IDA(eventfd_ida); @@ -42,9 +44,17 @@ struct eventfd_ctx { */ __u64 count; unsigned int flags; + int __percpu *wake_count; int id; }; +bool eventfd_signal_count(struct eventfd_ctx *ctx) +{ + return (this_cpu_read(*ctx->wake_count) || + this_cpu_read(eventfd_wake_count) > EVENTFD_WAKE_DEPTH); +} +EXPORT_SYMBOL_GPL(eventfd_signal_count); + /** * eventfd_signal - Adds @n to the eventfd counter. * @ctx: [in] Pointer to the eventfd context. @@ -71,17 +81,19 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) * it returns true, the eventfd_signal() call should be deferred to a * safe context. */ - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count))) + if (WARN_ON_ONCE(eventfd_signal_count(ctx))) return 0; spin_lock_irqsave(&ctx->wqh.lock, flags); this_cpu_inc(eventfd_wake_count); + this_cpu_inc(*ctx->wake_count); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; ctx->count += n; if (waitqueue_active(&ctx->wqh)) wake_up_locked_poll(&ctx->wqh, EPOLLIN); this_cpu_dec(eventfd_wake_count); + this_cpu_dec(*ctx->wake_count); spin_unlock_irqrestore(&ctx->wqh.lock, flags); return n; @@ -92,6 +104,7 @@ static void eventfd_free_ctx(struct eventfd_ctx *ctx) { if (ctx->id >= 0) ida_simple_remove(&eventfd_ida, ctx->id); + free_percpu(ctx->wake_count); kfree(ctx); } @@ -423,6 +436,11 @@ static int do_eventfd(unsigned int count, int flags) kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); + ctx->wake_count = alloc_percpu(int); + if (!ctx->wake_count) { + kfree(ctx); + return -ENOMEM; + } ctx->count = count; ctx->flags = flags; ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL); diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index fa0a524baed0..1a11ebbd74a9 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -45,10 +45,7 @@ void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt); DECLARE_PER_CPU(int, eventfd_wake_count); -static inline bool eventfd_signal_count(void) -{ - return this_cpu_read(eventfd_wake_count); -} +bool eventfd_signal_count(struct eventfd_ctx *ctx); #else /* CONFIG_EVENTFD */ -- 2.11.0
[RFC v2 03/13] eventfd: Increase the recursion depth of eventfd_signal()
Increase the recursion depth of eventfd_signal() to 1. This will be used in VDUSE case later. Signed-off-by: Xie Yongji --- fs/eventfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 2df24f9bada3..478cdc175949 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -25,7 +25,7 @@ #include #include -#define EVENTFD_WAKE_DEPTH 0 +#define EVENTFD_WAKE_DEPTH 1 DEFINE_PER_CPU(int, eventfd_wake_count); -- 2.11.0
[RFC v2 04/13] vdpa: Remove the restriction that only supports virtio-net devices
With VDUSE, we should be able to support all kinds of virtio devices. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 29 +++-- 1 file changed, 3 insertions(+), 26 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 29ed4173f04e..448be7875b6d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "vhost.h" @@ -185,26 +186,6 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 __user *statusp) return 0; } -static int vhost_vdpa_config_validate(struct vhost_vdpa *v, - struct vhost_vdpa_config *c) -{ - long size = 0; - - switch (v->virtio_id) { - case VIRTIO_ID_NET: - size = sizeof(struct virtio_net_config); - break; - } - - if (c->len == 0) - return -EINVAL; - - if (c->len > size - c->off) - return -E2BIG; - - return 0; -} - static long vhost_vdpa_get_config(struct vhost_vdpa *v, struct vhost_vdpa_config __user *c) { @@ -215,7 +196,7 @@ static long vhost_vdpa_get_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -243,7 +224,7 @@ static long vhost_vdpa_set_config(struct vhost_vdpa *v, if (copy_from_user(&config, c, size)) return -EFAULT; - if (vhost_vdpa_config_validate(v, &config)) + if (config.len == 0) return -EINVAL; buf = kvzalloc(config.len, GFP_KERNEL); if (!buf) @@ -1025,10 +1006,6 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa) int minor; int r; - /* Currently, we only accept the network devices. */ - if (ops->get_device_id(vdpa) != VIRTIO_ID_NET) - return -ENOTSUPP; - v = kzalloc(sizeof(*v), GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!v) return -ENOMEM; -- 2.11.0
[RFC v2 01/13] mm: export zap_page_range() for driver use
Export zap_page_range() for use in VDUSE. Signed-off-by: Xie Yongji --- mm/memory.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory.c b/mm/memory.c index 7d608765932b..edd2d6497bb3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1542,6 +1542,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, start, range.end); } +EXPORT_SYMBOL(zap_page_range); /** * zap_page_range_single - remove user pages in a given range -- 2.11.0
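For context, the reason VDUSE needs this export is to tear down the daemon's userspace mapping of the bounce buffer when the underlying pages are recycled, so the next access faults and gets remapped to fresh pages. A minimal sketch of that kind of call site (the function name and vma bookkeeping are illustrative, not taken from the series):

static void vduse_invalidate_user_mapping(struct vm_area_struct *vma)
{
        /* Drop every PTE in the daemon's mapping of the bounce buffer;
         * the next access will fault and be remapped to new pages.
         */
        zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}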
[RFC v2 13/13] vduse: Introduce a workqueue for irq injection
This patch introduces a dedicated workqueue for irq injection so that we are able to do some performance tuning for it. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/eventfd.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c index dbffddb08908..caf7d8d68ac0 100644 --- a/drivers/vdpa/vdpa_user/eventfd.c +++ b/drivers/vdpa/vdpa_user/eventfd.c @@ -18,6 +18,7 @@ #include "eventfd.h" static struct workqueue_struct *vduse_irqfd_cleanup_wq; +static struct workqueue_struct *vduse_irq_wq; static void vduse_virqfd_shutdown(struct work_struct *work) { @@ -57,7 +58,7 @@ static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode, __poll_t flags = key_to_poll(key); if (flags & EPOLLIN) - schedule_work(&virqfd->inject); + queue_work(vduse_irq_wq, &virqfd->inject); if (flags & EPOLLHUP) { spin_lock(&vq->irq_lock); @@ -165,11 +166,18 @@ int vduse_virqfd_init(void) if (!vduse_irqfd_cleanup_wq) return -ENOMEM; + vduse_irq_wq = alloc_workqueue("vduse-irq", WQ_SYSFS | WQ_UNBOUND, 0); + if (!vduse_irq_wq) { + destroy_workqueue(vduse_irqfd_cleanup_wq); + return -ENOMEM; + } + return 0; } void vduse_virqfd_exit(void) { + destroy_workqueue(vduse_irq_wq); destroy_workqueue(vduse_irqfd_cleanup_wq); } -- 2.11.0
[RFC v2 07/13] vduse: support get/set virtqueue state
This patch makes vhost-vdpa bus driver can get/set virtqueue state from userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 4 +++ drivers/vdpa/vdpa_user/vduse_dev.c | 54 ++ include/uapi/linux/vduse.h | 9 +++ 3 files changed, 67 insertions(+) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index da9b3040f20a..623f7b040ccf 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -30,6 +30,10 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_VQ_READY: Get ready status of virtqueue +- VDUSE_SET_VQ_STATE: Set the state (last_avail_idx) for virtqueue + +- VDUSE_GET_VQ_STATE: Get the state (last_avail_idx) for virtqueue + - VDUSE_SET_FEATURES: Set virtio features supported by the driver - VDUSE_GET_FEATURES: Get virtio features supported by the device diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index 4a869b9698ef..b974333ed4e9 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -291,6 +291,40 @@ static bool vduse_dev_get_vq_ready(struct vduse_dev *dev, return ready; } +static int vduse_dev_get_vq_state(struct vduse_dev *dev, + struct vduse_virtqueue *vq, + struct vdpa_vq_state *state) +{ + struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_GET_VQ_STATE); + int ret; + + msg->req.size = sizeof(struct vduse_vq_state); + msg->req.vq_state.index = vq->index; + + ret = vduse_dev_msg_sync(dev, msg); + state->avail_index = msg->resp.vq_state.avail_idx; + vduse_dev_msg_put(msg); + + return ret; +} + +static int vduse_dev_set_vq_state(struct vduse_dev *dev, + struct vduse_virtqueue *vq, + const struct vdpa_vq_state *state) +{ + struct vduse_dev_msg *msg = vduse_dev_new_msg(dev, VDUSE_SET_VQ_STATE); + int ret; + + msg->req.size = sizeof(struct vduse_vq_state); + msg->req.vq_state.index = vq->index; + msg->req.vq_state.avail_idx = state->avail_index; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; @@ -431,6 +465,24 @@ static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx) return vq->ready; } +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx, + const struct vdpa_vq_state *state) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + struct vduse_virtqueue *vq = &dev->vqs[idx]; + + return vduse_dev_set_vq_state(dev, vq, state); +} + +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx, + struct vdpa_vq_state *state) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + struct vduse_virtqueue *vq = &dev->vqs[idx]; + + return vduse_dev_get_vq_state(dev, vq, state); +} + static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa) { struct vduse_dev *dev = vdpa_to_vduse(vdpa); @@ -532,6 +584,8 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = { .set_vq_num = vduse_vdpa_set_vq_num, .set_vq_ready = vduse_vdpa_set_vq_ready, .get_vq_ready = vduse_vdpa_get_vq_ready, + .set_vq_state = vduse_vdpa_set_vq_state, + .get_vq_state = vduse_vdpa_get_vq_state, .get_vq_align = vduse_vdpa_get_vq_align, .get_features = vduse_vdpa_get_features, .set_features = vduse_vdpa_set_features, diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h index f8579abdaa3b..873305dfd93f 100644 --- a/include/uapi/linux/vduse.h +++ b/include/uapi/linux/vduse.h @@ -13,6 +13,8 @@ enum 
vduse_req_type { VDUSE_SET_VQ_ADDR, VDUSE_SET_VQ_READY, VDUSE_GET_VQ_READY, + VDUSE_SET_VQ_STATE, + VDUSE_GET_VQ_STATE, VDUSE_SET_FEATURES, VDUSE_GET_FEATURES, VDUSE_SET_STATUS, @@ -38,6 +40,11 @@ struct vduse_vq_ready { __u8 ready; }; +struct vduse_vq_state { + __u32 index; + __u16 avail_idx; +}; + struct vduse_dev_config_data { __u32 offset; __u32 len; @@ -53,6 +60,7 @@ struct vduse_dev_request { struct vduse_vq_num vq_num; /* virtqueue num */ struct vduse_vq_addr vq_addr; /* virtqueue address */ struct vduse_vq_ready vq_ready; /* virtqueue ready status */ + struct vduse_vq_state vq_state; /* virtqueue state */ struct vduse_dev_config_data config; /* virtio device config space */ __u
[RFC v2 08/13] vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
This patch introduces a new method in the vdpa_config_ops to support processing the raw vhost memory mapping message in the vDPA device driver. Signed-off-by: Xie Yongji --- drivers/vhost/vdpa.c | 5 - include/linux/vdpa.h | 7 +++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 448be7875b6d..ccbb391e38be 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -728,6 +728,9 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, if (r) return r; + if (ops->process_iotlb_msg) + return ops->process_iotlb_msg(vdpa, msg); + switch (msg->type) { case VHOST_IOTLB_UPDATE: r = vhost_vdpa_process_iotlb_update(v, msg); @@ -770,7 +773,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) int ret; /* Device want to do DMA by itself */ - if (ops->set_map || ops->dma_map) + if (ops->set_map || ops->dma_map || ops->process_iotlb_msg) return 0; bus = dma_dev->bus; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 656fe264234e..7bccedf22f4b 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -172,6 +173,10 @@ struct vdpa_iova_range { * @vdev: vdpa device * Returns the iova range supported by * the device. + * @process_iotlb_msg: Process vhost memory mapping message (optional) + * Only used for VDUSE device now + * @vdev: vdpa device + * @msg: vhost memory mapping message * @set_map: Set device memory mapping (optional) * Needed for device that using device * specific DMA translation (on-chip IOMMU) @@ -240,6 +245,8 @@ struct vdpa_config_ops { struct vdpa_iova_range (*get_iova_range)(struct vdpa_device *vdev); /* DMA ops */ + int (*process_iotlb_msg)(struct vdpa_device *vdev, +struct vhost_iotlb_msg *msg); int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb); int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size, u64 pa, u32 perm); -- 2.11.0
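A rough sketch of how a parent driver could wire up the new optional callback. Only the op signature and the vhost_iotlb_msg fields are real; my_vdpa_dev, my_map_range() and my_unmap_range() are invented names for illustration.

#include <linux/vdpa.h>
#include <linux/vhost_types.h>

static int my_process_iotlb_msg(struct vdpa_device *vdev,
				struct vhost_iotlb_msg *msg)
{
	struct my_vdpa_dev *dev = container_of(vdev, struct my_vdpa_dev, vdpa);

	switch (msg->type) {
	case VHOST_IOTLB_UPDATE:
		/* msg->uaddr is a VA in the vhost-vdpa owner process */
		return my_map_range(dev, msg->iova, msg->uaddr,
				    msg->size, msg->perm);
	case VHOST_IOTLB_INVALIDATE:
		my_unmap_range(dev, msg->iova, msg->size);
		return 0;
	default:
		return -EINVAL;
	}
}

static const struct vdpa_config_ops my_config_ops = {
	/* ... the usual vq/feature/status ops ... */
	.process_iotlb_msg = my_process_iotlb_msg,
};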
[RFC v2 05/13] vdpa: Pass the netlink attributes to ops.dev_add()
Pass the netlink attributes to ops.dev_add() so that we could get some device specific attributes when creating a vdpa device. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa.c | 2 +- drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++- include/linux/vdpa.h | 4 +++- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c index 32bd48baffab..f6ff81927694 100644 --- a/drivers/vdpa/vdpa.c +++ b/drivers/vdpa/vdpa.c @@ -440,7 +440,7 @@ static int vdpa_nl_cmd_dev_add_set_doit(struct sk_buff *skb, struct genl_info *i goto err; } - vdev = pdev->ops->dev_add(pdev, name, device_id); + vdev = pdev->ops->dev_add(pdev, name, device_id, info->attrs); if (IS_ERR(vdev)) goto err; diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c index 85776e4e6749..cfc314f5403a 100644 --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c @@ -726,7 +726,8 @@ static const struct vdpa_config_ops vdpasim_net_batch_config_ops = { }; static struct vdpa_device * -vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, u32 device_id) +vdpa_dev_add(struct vdpa_parent_dev *pdev, const char *name, + u32 device_id, struct nlattr **attrs) { struct vdpasim *simdev; diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index cb5a3d847af3..656fe264234e 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -6,6 +6,7 @@ #include #include #include +#include /** * vDPA callback definition. @@ -349,6 +350,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, * @pdev: parent device to use for device addition * @name: name of the new vdpa device * @device_id: device id of the new vdpa device + * @attrs: device specific attributes * Driver need to add a new device using vdpa_register_device() after * fully initializing the vdpa device. On successful addition driver * must return a valid pointer of vdpa device or ERR_PTR for the error. @@ -359,7 +361,7 @@ static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset, */ struct vdpa_dev_ops { struct vdpa_device* (*dev_add)(struct vdpa_parent_dev *pdev, const char *name, - u32 device_id); + u32 device_id, struct nlattr **attrs); void (*dev_del)(struct vdpa_parent_dev *pdev, struct vdpa_device *dev); }; -- 2.11.0
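For illustration, a dev_add() implementation that consumes one of the forwarded attributes. VDPA_ATTR_DEV_EXAMPLE_CFG is a made-up attribute name; the only point is that the generic netlink attributes from the request are now visible to the parent driver.

#include <linux/vdpa.h>
#include <net/netlink.h>

static struct vdpa_device *
my_dev_add(struct vdpa_parent_dev *pdev, const char *name,
	   u32 device_id, struct nlattr **attrs)
{
	u16 example_cfg = 0;

	if (attrs && attrs[VDPA_ATTR_DEV_EXAMPLE_CFG])
		example_cfg = nla_get_u16(attrs[VDPA_ATTR_DEV_EXAMPLE_CFG]);

	/*
	 * ... allocate the device, apply @example_cfg, then register it
	 * with vdpa_register_device() as before ...
	 */
	return ERR_PTR(-EOPNOTSUPP);	/* placeholder */
}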
[RFC v2 09/13] vduse: Add support for processing vhost iotlb message
To support the vhost-vdpa bus driver, we need a way to share the vhost-vdpa backend process's memory with the userspace VDUSE process. This patch tries to make use of the vhost iotlb message to achieve that. We will get the shm file from the iotlb message and pass it to the userspace VDUSE process. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 15 +++- drivers/vdpa/vdpa_user/vduse_dev.c | 147 - include/uapi/linux/vduse.h | 11 +++ 3 files changed, 171 insertions(+), 2 deletions(-) diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst index 623f7b040ccf..48e4b1ba353f 100644 --- a/Documentation/driver-api/vduse.rst +++ b/Documentation/driver-api/vduse.rst @@ -46,13 +46,26 @@ The following types of messages are provided by the VDUSE framework now: - VDUSE_GET_CONFIG: Read from device specific configuration space +- VDUSE_UPDATE_IOTLB: Update the memory mapping in device IOTLB + +- VDUSE_INVALIDATE_IOTLB: Invalidate the memory mapping in device IOTLB + Please see include/linux/vdpa.h for details. -In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +The data path of userspace vDPA device is implemented in different ways +depending on the vdpa bus to which it is attached. + +In virtio-vdpa case, VDUSE framework implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. The userspace iova region can be created by passing the userspace vDPA device fd to mmap(2). +In vhost-vdpa case, the dma buffer resides in a userspace memory region +which will be shared with the VDUSE userspace process via the file +descriptor in VDUSE_UPDATE_IOTLB message. And the corresponding address +mapping (IOVA of dma buffer <-> VA of the memory region) is also included +in this message. + Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace. 
The following ioctls on the userspace vDPA device fd are provided to support that: diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index b974333ed4e9..d24aaacb6008 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -34,6 +34,7 @@ struct vduse_dev_msg { struct vduse_dev_request req; + struct file *iotlb_file; struct vduse_dev_response resp; struct list_head list; wait_queue_head_t waitq; @@ -325,12 +326,80 @@ static int vduse_dev_set_vq_state(struct vduse_dev *dev, return ret; } +static int vduse_dev_update_iotlb(struct vduse_dev *dev, struct file *file, + u64 offset, u64 iova, u64 size, u8 perm) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_UPDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.offset = offset; + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + msg->req.iotlb.perm = perm; + msg->req.iotlb.fd = -1; + msg->iotlb_file = get_file(file); + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + fput(file); + + return ret; +} + +static int vduse_dev_invalidate_iotlb(struct vduse_dev *dev, + u64 iova, u64 size) +{ + struct vduse_dev_msg *msg; + int ret; + + if (!size) + return -EINVAL; + + msg = vduse_dev_new_msg(dev, VDUSE_INVALIDATE_IOTLB); + msg->req.size = sizeof(struct vduse_iotlb); + msg->req.iotlb.iova = iova; + msg->req.iotlb.size = size; + + ret = vduse_dev_msg_sync(dev, msg); + vduse_dev_msg_put(msg); + + return ret; +} + +static unsigned int perm_to_file_flags(u8 perm) +{ + unsigned int flags = 0; + + switch (perm) { + case VHOST_ACCESS_WO: + flags |= O_WRONLY; + break; + case VHOST_ACCESS_RO: + flags |= O_RDONLY; + break; + case VHOST_ACCESS_RW: + flags |= O_RDWR; + break; + default: + WARN(1, "invalidate vhost IOTLB permission\n"); + break; + } + + return flags; +} + static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct vduse_dev *dev = file->private_data; struct vduse_dev_msg *msg; - int size = sizeof(struct vduse_dev_request); + unsigned int flags; + int fd, size = sizeof(struct vduse_dev_request); ssize_t ret = 0; if (iov_iter_count(to) < size) @@ -349,6 +418,18 @@ static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to) if (ret) return ret; } + + if (msg->req.type == VDUSE_UPD
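On the userspace side, consuming a VDUSE_UPDATE_IOTLB message then reduces to mmap()-ing the received descriptor. The field names follow struct vduse_iotlb from this patch; record_mapping() and the assumption that the kernel has already installed the shared file as req->iotlb.fd in the reading process are illustrative only.

#include <sys/mman.h>

static void handle_iotlb_update(struct vduse_dev_request *req)
{
	struct vduse_iotlb *iotlb = &req->iotlb;
	void *va;

	/* iotlb->perm carries a VHOST_ACCESS_* value; map RW here for brevity */
	va = mmap(NULL, iotlb->size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  iotlb->fd, iotlb->offset);
	if (va == MAP_FAILED)
		return;

	/* remember IOVA -> VA so the datapath can translate descriptors */
	record_mapping(iotlb->iova, va, iotlb->size);
}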
[RFC v2 10/13] vduse: grab the module's references until there is no vduse device
The module should not be unloaded if any vduse device exists. So increase the module's reference count when creating vduse device. And the reference count is kept until the device is destroyed. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index d24aaacb6008..c29b24a7e7e9 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -1052,6 +1052,7 @@ static int vduse_destroy_dev(u32 id) kfree(dev->vqs); vduse_iova_domain_destroy(dev->domain); vduse_dev_destroy(dev); + module_put(THIS_MODULE); return 0; } @@ -1096,6 +1097,7 @@ static int vduse_create_dev(struct vduse_dev_config *config) refcount_inc(&dev->refcnt); list_add(&dev->list, &vduse_devs); + __module_get(THIS_MODULE); return fd; err_fd: -- 2.11.0
Re: [PATCH 12/12 v2 RFC] skbuff: rename sock_zerocopy_* to msg_zerocopy_*
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > At Willem's suggestion, rename the sock_zerocopy_* functions > so that they match the MSG_ZEROCOPY flag, which makes it clear > they are specific to this zerocopy implementation. > > Signed-off-by: Jonathan Lemon Acked-by: Willem de Bruijn
[RFC v2 12/13] vduse: Add memory shrinker to reclaim bounce pages
Add a shrinker to reclaim several pages used by bounce buffer in order to avoid memory pressures. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/vduse_dev.c | 51 ++ 1 file changed, 51 insertions(+) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index c29b24a7e7e9..1bc2e627c476 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -1142,6 +1142,43 @@ static long vduse_ioctl(struct file *file, unsigned int cmd, return ret; } +static unsigned long vduse_shrink_scan(struct shrinker *shrinker, + struct shrink_control *sc) +{ + unsigned long freed = 0; + struct vduse_dev *dev; + + if (!mutex_trylock(&vduse_lock)) + return SHRINK_STOP; + + list_for_each_entry(dev, &vduse_devs, list) { + if (!dev->domain) + continue; + + freed = vduse_domain_reclaim(dev->domain); + if (!freed) + continue; + + list_move_tail(&dev->list, &vduse_devs); + break; + } + mutex_unlock(&vduse_lock); + + return freed ? freed : SHRINK_STOP; +} + +static unsigned long vduse_shrink_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + return percpu_counter_read_positive(&vduse_total_bounce_pages); +} + +static struct shrinker vduse_bounce_pages_shrinker = { + .count_objects = vduse_shrink_count, + .scan_objects = vduse_shrink_scan, + .seeks = DEFAULT_SEEKS, +}; + static const struct file_operations vduse_fops = { .owner = THIS_MODULE, .unlocked_ioctl = vduse_ioctl, @@ -1292,12 +1329,24 @@ static int vduse_init(void) if (ret) goto err_irqfd; + ret = vduse_domain_init(); + if (ret) + goto err_domain; + + ret = register_shrinker(&vduse_bounce_pages_shrinker); + if (ret) + goto err_shrinker; + ret = vduse_parentdev_init(); if (ret) goto err_parentdev; return 0; err_parentdev: + unregister_shrinker(&vduse_bounce_pages_shrinker); +err_shrinker: + vduse_domain_exit(); +err_domain: vduse_virqfd_exit(); err_irqfd: destroy_workqueue(vduse_vdpa_wq); @@ -1309,8 +1358,10 @@ module_init(vduse_init); static void vduse_exit(void) { + unregister_shrinker(&vduse_bounce_pages_shrinker); misc_deregister(&vduse_misc); destroy_workqueue(vduse_vdpa_wq); + vduse_domain_exit(); vduse_virqfd_exit(); vduse_parentdev_exit(); } -- 2.11.0
[RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
This VDUSE driver enables implementing vDPA devices in userspace. Both control path and data path of vDPA devices will be able to be handled in userspace. In the control path, the VDUSE driver will make use of message mechnism to forward the config operation from vdpa bus driver to userspace. Userspace can use read()/write() to receive/reply those control messages. In the data path, the VDUSE driver implements a MMU-based on-chip IOMMU driver which supports mapping the kernel dma buffer to a userspace iova region dynamically. Userspace can access those iova region via mmap(). Besides, the eventfd mechanism is used to trigger interrupt callbacks and receive virtqueue kicks in userspace Now we only support virtio-vdpa bus driver with this patch applied. Signed-off-by: Xie Yongji --- Documentation/driver-api/vduse.rst | 74 ++ Documentation/userspace-api/ioctl/ioctl-number.rst |1 + drivers/vdpa/Kconfig |8 + drivers/vdpa/Makefile |1 + drivers/vdpa/vdpa_user/Makefile|5 + drivers/vdpa/vdpa_user/eventfd.c | 221 drivers/vdpa/vdpa_user/eventfd.h | 48 + drivers/vdpa/vdpa_user/iova_domain.c | 442 drivers/vdpa/vdpa_user/iova_domain.h | 93 ++ drivers/vdpa/vdpa_user/vduse.h | 59 ++ drivers/vdpa/vdpa_user/vduse_dev.c | 1121 include/uapi/linux/vdpa.h |1 + include/uapi/linux/vduse.h | 99 ++ 13 files changed, 2173 insertions(+) create mode 100644 Documentation/driver-api/vduse.rst create mode 100644 drivers/vdpa/vdpa_user/Makefile create mode 100644 drivers/vdpa/vdpa_user/eventfd.c create mode 100644 drivers/vdpa/vdpa_user/eventfd.h create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h create mode 100644 drivers/vdpa/vdpa_user/vduse.h create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c create mode 100644 include/uapi/linux/vduse.h diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst new file mode 100644 index ..da9b3040f20a --- /dev/null +++ b/Documentation/driver-api/vduse.rst @@ -0,0 +1,74 @@ +== +VDUSE - "vDPA Device in Userspace" +== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. + +How VDUSE works + +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on +the VDUSE character device (/dev/vduse). Then a file descriptor pointing +to the new resources will be returned, which can be used to implement the +userspace vDPA device's control path and data path. + +To implement control path, the read/write operations to the file descriptor +will be used to receive/reply the control messages from/to VDUSE driver. +Those control messages are based on the vdpa_config_ops which defines a +unified interface to control different types of vDPA device. + +The following types of messages are provided by the VDUSE framework now: + +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue. 
+ +- VDUSE_SET_VQ_NUM: Set the size of virtqueue + +- VDUSE_SET_VQ_READY: Set ready status of virtqueue + +- VDUSE_GET_VQ_READY: Get ready status of virtqueue + +- VDUSE_SET_FEATURES: Set virtio features supported by the driver + +- VDUSE_GET_FEATURES: Get virtio features supported by the device + +- VDUSE_SET_STATUS: Set the device status + +- VDUSE_GET_STATUS: Get the device status + +- VDUSE_SET_CONFIG: Write to device specific configuration space + +- VDUSE_GET_CONFIG: Read from device specific configuration space + +Please see include/linux/vdpa.h for details. + +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU +driver which supports mapping the kernel dma buffer to a userspace iova +region dynamically. The userspace iova region can be created by passing +the userspace vDPA device fd to mmap(2). + +Besides, the eventfd mechanism is used to trigger interrupt callbacks and +receive virtqueue kicks in userspace. The following ioctls on the userspace +vDPA device fd are provided to support that: + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE driver to notify userspace to consume the vring. + +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used + by userspace to notify VDUSE driver to trigger interrupt callbacks. + +MMU-based IOMMU Driver +-- +The basic idea behind the IOMMU driver is treating MMU (VA->
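A bare-bones userspace skeleton of the flow described in the documentation above. The members of struct vduse_dev_config and the exact kickfd/irqfd ioctl arguments are not spelled out here, so treat everything beyond the ioctl names and the read() loop as an assumption.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vduse.h>

static int run_vduse_device(struct vduse_dev_config *config)
{
	struct vduse_dev_request req;
	int ctrl_fd, dev_fd;

	ctrl_fd = open("/dev/vduse", O_RDWR);
	if (ctrl_fd < 0)
		return -1;

	/* returns an fd for the new userspace vDPA device */
	dev_fd = ioctl(ctrl_fd, VDUSE_CREATE_DEV, config);
	if (dev_fd < 0)
		return -1;

	/*
	 * Data path setup would follow here: mmap(dev_fd) for the iova
	 * region, plus VDUSE_VQ_SETUP_KICKFD / VDUSE_VQ_SETUP_IRQFD with
	 * an eventfd per virtqueue.
	 */

	/* control path: receive requests, dispatch on req.type, reply */
	while (read(dev_fd, &req, sizeof(req)) > 0) {
		/* VDUSE_SET_FEATURES, VDUSE_SET_STATUS, VDUSE_SET_VQ_ADDR, ... */
	}
	return 0;
}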
[RFC v2 11/13] vduse/iova_domain: Support reclaiming bounce pages
Introduce vduse_domain_reclaim() to support reclaiming bounce page when necessary. We will do reclaiming chunk by chunk. And only reclaim the iova chunk that no one used. Signed-off-by: Xie Yongji --- drivers/vdpa/vdpa_user/iova_domain.c | 83 ++-- drivers/vdpa/vdpa_user/iova_domain.h | 10 + 2 files changed, 89 insertions(+), 4 deletions(-) diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 27022157abc6..c438cc85d33d 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -29,6 +29,8 @@ struct vduse_mmap_vma { struct list_head list; }; +struct percpu_counter vduse_total_bounce_pages; + static inline struct page * vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, unsigned long iova) @@ -48,6 +50,13 @@ vduse_domain_set_bounce_page(struct vduse_iova_domain *domain, unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; unsigned long pgindex = chunkoff >> PAGE_SHIFT; + if (page) { + domain->chunks[index].used_bounce_pages++; + percpu_counter_inc(&vduse_total_bounce_pages); + } else { + domain->chunks[index].used_bounce_pages--; + percpu_counter_dec(&vduse_total_bounce_pages); + } domain->chunks[index].bounce_pages[pgindex] = page; } @@ -175,6 +184,29 @@ void vduse_domain_remove_mapping(struct vduse_iova_domain *domain, } } +static bool vduse_domain_try_unmap(struct vduse_iova_domain *domain, + unsigned long iova, size_t size) +{ + struct vduse_mmap_vma *mmap_vma; + unsigned long uaddr; + bool unmap = true; + + mutex_lock(&domain->vma_lock); + list_for_each_entry(mmap_vma, &domain->vma_list, list) { + if (!mmap_read_trylock(mmap_vma->vma->vm_mm)) { + unmap = false; + break; + } + + uaddr = iova + mmap_vma->vma->vm_start; + zap_page_range(mmap_vma->vma, uaddr, size); + mmap_read_unlock(mmap_vma->vma->vm_mm); + } + mutex_unlock(&domain->vma_lock); + + return unmap; +} + void vduse_domain_unmap(struct vduse_iova_domain *domain, unsigned long iova, size_t size) { @@ -302,6 +334,32 @@ bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain, return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP; } +int vduse_domain_reclaim(struct vduse_iova_domain *domain) +{ + struct vduse_iova_chunk *chunk; + int i, freed = 0; + + for (i = domain->chunk_num - 1; i >= 0; i--) { + chunk = &domain->chunks[i]; + if (!chunk->used_bounce_pages) + continue; + + if (atomic_cmpxchg(&chunk->state, 0, INT_MIN) != 0) + continue; + + if (!vduse_domain_try_unmap(domain, + chunk->start, IOVA_CHUNK_SIZE)) { + atomic_sub(INT_MIN, &chunk->state); + break; + } + freed += vduse_domain_free_bounce_pages(domain, + chunk->start, IOVA_CHUNK_SIZE); + atomic_sub(INT_MIN, &chunk->state); + } + + return freed; +} + unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain, size_t size, enum iova_map_type type) { @@ -319,10 +377,13 @@ unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain, if (atomic_read(&chunk->map_type) != type) continue; - iova = gen_pool_alloc_algo(chunk->pool, size, + if (atomic_fetch_inc(&chunk->state) >= 0) { + iova = gen_pool_alloc_algo(chunk->pool, size, gen_pool_first_fit_align, &data); - if (iova) - break; + if (iova) + break; + } + atomic_dec(&chunk->state); } return iova; @@ -335,6 +396,7 @@ void vduse_domain_free_iova(struct vduse_iova_domain *domain, struct vduse_iova_chunk *chunk = &domain->chunks[index]; gen_pool_free(chunk->pool, iova, size); + atomic_dec(&chunk->state); } static void vduse_iova_chunk_cleanup(struct vduse_iova_chunk *chunk) @@ -351,7 +413,8 @@ void 
vduse_iova_domain_destroy(struct vduse_iova_domain *domain) for (i = 0; i < domain->chunk_num; i++) { chunk = &domain->chunks[i]; - vduse_domain_free_bounce_pages(domain, + if (chunk->used_bounce_pages) + vduse_domain_free_bounce_pages(domain, chunk->start, IOVA_CHUNK_SIZE); vd
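A stand-alone illustration of the gating scheme used above: each chunk carries an atomic counter of in-flight users, allocators take a reference only while the counter is non-negative, and the reclaimer claims an idle chunk exclusively by biasing the counter with INT_MIN. The helper names are invented; only the idea mirrors the patch.

#include <linux/atomic.h>

static bool chunk_get(atomic_t *state)
{
	if (atomic_fetch_inc(state) >= 0)
		return true;		/* chunk usable, reference held */
	atomic_dec(state);		/* reclaim in progress, back off */
	return false;
}

static void chunk_put(atomic_t *state)
{
	atomic_dec(state);
}

static bool chunk_try_reclaim(atomic_t *state)
{
	/* claim the chunk only if nobody holds a reference right now */
	if (atomic_cmpxchg(state, 0, INT_MIN) != 0)
		return false;
	/* ... zap the userspace mappings and free the bounce pages ... */
	atomic_sub(INT_MIN, state);	/* drop the exclusive bias */
	return true;
}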
Re: [PATCH 02/12 v2 RFC] skbuff: remove unused skb_zcopy_abort function
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > skb_zcopy_abort() has no in-tree consumers, remove it. > > Signed-off-by: Jonathan Lemon Acked-by: Willem de Bruijn Indeed unused. This was used in the packet and raw zerocopy patches in the original patchset. But we never merged those, for lack of copy avoidance benefit at small packet sizes.
Re: [PATCH 09/12 v2 RFC] skbuff: add zc_flags to ubuf_info for ubuf setup
On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote: > > From: Jonathan Lemon > > Currently, an ubuf is attached to a new skb, the skb zc_flags > is initialized to a fixed value. Instead of doing this, set > the default zc_flags in the ubuf, and have new skb's inherit > from this default. > > This is needed when setting up different zerocopy types. > > Signed-off-by: Jonathan Lemon > --- > include/linux/skbuff.h | 3 ++- > net/core/skbuff.c | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index da0c10da..b90be4b0b2de 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -478,6 +478,7 @@ struct ubuf_info { > }; > }; > refcount_t refcnt; > + u8 zc_flags; > > struct mmpin { > struct user_struct *user; When allocating ubuf_info for msg_zerocopy, we actually allocate the notification skb, to be sure that notifications won't be dropped due to memory pressure at notification time. We actually allocate the skb and place ubuf_info in skb->cb[]. The struct is exactly 48 bytes on 64-bit platforms, filling all of cb. This new field fills a 4B hole, so it should still be fine. Just being very explicit, as this is a fragile bit of code. Come to think of it, this probably deserves a BUILD_BUG_ON.
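The suggested assertion could be as simple as the following, placed next to the allocation that stuffs ubuf_info into skb->cb (the exact location is a suggestion, not part of the posted series):

	/* e.g. at the top of the ubuf_info allocation path */
	BUILD_BUG_ON(sizeof(struct ubuf_info) > sizeof_field(struct sk_buff, cb));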
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
> > > +static void sparx5_board_init(struct sparx5 *sparx5) > > > +{ > > > + int idx; > > > + > > > + if (!sparx5->sd_sgpio_remapping) > > > + return; > > > + > > > + /* Enable SGPIO Signal Detect remapping */ > > > + spx5_rmw(GCB_HW_SGPIO_SD_CFG_SD_MAP_SEL, > > > + GCB_HW_SGPIO_SD_CFG_SD_MAP_SEL, > > > + sparx5, > > > + GCB_HW_SGPIO_SD_CFG); > > > + > > > + /* Refer to LOS SGPIO */ > > > + for (idx = 0; idx < SPX5_PORTS; idx++) { > > > + if (sparx5->ports[idx]) { > > > + if (sparx5->ports[idx]->conf.sd_sgpio != ~0) > > > { > > > + spx5_wr(sparx5->ports[idx]- > > > >conf.sd_sgpio, > > > + sparx5, > > > + > > > GCB_HW_SGPIO_TO_SD_MAP_CFG(idx)); > > > + } > > > + } > > > + } > > > +} > > > > I've not looked at how you do SFP integration yet. Is this the LOS > > from the SFP socket? Is there a Linux GPIO controller exported by > > this > > driver, so the SFP driver can use the GPIOs? > > Yes the SFP driver (used by the Sparx5 SerDes driver) will use the > SGPIO LOS, Module Detect etc, and the Port Modules are aware of the > location of the LOS, and use this by default without any driver > configuration. > But on the PCB134 the SGPIOs are shifted one bit by a mistake, and they > are not located in the expected position, so we have this board > remapping function to handle that aspect. Is it possible to turn this off in the hardware? It might be less confusing if LOS it determined by phylink, not phylink and the switch itself. Especially when we get into race conditions between PHYLINK polling the GPIO and the hardware taking the short cut? > > > +static int mchp_sparx5_probe(struct platform_device *pdev) > > > +{ > > > + struct device_node *np = pdev->dev.of_node; > > > + struct sparx5 *sparx5; > > > + struct device_node *ports, *portnp; > > > + const u8 *mac_addr; > > > + int err = 0; > > > + > > > + if (!np && !pdev->dev.platform_data) > > > + return -ENODEV; > > > + > > > + sparx5 = devm_kzalloc(&pdev->dev, sizeof(*sparx5), > > > GFP_KERNEL); > > > + if (!sparx5) > > > + return -ENOMEM; > > > + > > > + platform_set_drvdata(pdev, sparx5); > > > + sparx5->pdev = pdev; > > > + sparx5->dev = &pdev->dev; > > > + > > > + /* Default values, some from DT */ > > > + sparx5->coreclock = SPX5_CORE_CLOCK_DEFAULT; > > > + > > > + mac_addr = of_get_mac_address(np); > > > + if (IS_ERR_OR_NULL(mac_addr)) { > > > + dev_info(sparx5->dev, "MAC addr was not set, use > > > random MAC\n"); > > > + eth_random_addr(sparx5->base_mac); > > > + sparx5->base_mac[5] = 0; > > > + } else { > > > + ether_addr_copy(sparx5->base_mac, mac_addr); > > > + } > > > > The binding document does not say anything about a MAC address at the > > top level. What is this used for? > > This the base MAC address used for generating the the switch NI's MAC > addresses. Yes, that is obvious from the code. But all DT properties must be in the binding Documentation. The DT verifier is going to complain when it finds a mac-address property which is not described in the yaml file. > > > + config.media_type = ETH_MEDIA_DAC; > > > + config.serdes_reset = true; > > > + config.portmode = config.phy_mode; > > > + err = sparx5_probe_port(sparx5, portnp, serdes, > > > portno, &config); > > > + if (err) { > > > + dev_err(sparx5->dev, "port probe error\n"); > > > + goto cleanup_ports; > > > + } > > > + } > > > + sparx5_board_init(sparx5); > > > + > > > +cleanup_ports: > > > + return err; > > > > Seems missed named, no cleanup. > > Ah - this comes later (as the driver was split in functional groups for > reviewing). 
I hope this is OK, as it is only temporary - I could add a > comment to that effect. Yes, this is fine. Here, and in other places, a comment like: /* More code to be added in later patches */ would have been nice, just as a heads up. That is the problem with linear patch review. > > > +static int __init sparx5_switch_reset(void) > > > +{ > > > + const char *syscon_cpu = "microchip,sparx5-cpu-syscon", > > > + *syscon_gcb = "microchip,sparx5-gcb-syscon"; > > > + struct regmap *cpu_ctrl, *gcb_ctrl; > > > + u32 val; > > > + > > > + cpu_ctrl = syscon_regmap_lookup_by_compatible(syscon_cpu); > > > + if (IS_ERR(cpu_ctrl)) { > > > + pr_err("No '%s' syscon map\n", syscon_cpu); > > > + return PTR_ERR(cpu_ctrl); > > > + } > > > + > > > + gcb_ctrl = syscon_regmap_lookup_by_compatible(syscon_gcb); > > > + if (IS_ERR(gcb_ctrl)) { > > > + pr_err("No '%s' syscon map\n", syscon_gcb); > >
Re: [PATCH v3 03/24] wfx: add Makefile/Kconfig
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller [...] > +wfx-$(CONFIG_SPI) += bus_spi.o > +wfx-$(subst m,y,$(CONFIG_MMC)) += bus_sdio.o Why this subst? And why only for MMC? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 01/24] mmc: sdio: add SDIO IDs for Silabs WF200 chip
Jerome Pouiller writes: > From: Jérôme Pouiller > > Add Silabs SDIO ID to sdio_ids.h. > > Note that the values used by Silabs are uncommon. A driver cannot fully > rely on the SDIO PnP. It should also check if the device is declared in > the DT. > > Signed-off-by: Jérôme Pouiller > --- > include/linux/mmc/sdio_ids.h | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/include/linux/mmc/sdio_ids.h b/include/linux/mmc/sdio_ids.h > index 12036619346c..20a48162f7fc 100644 > --- a/include/linux/mmc/sdio_ids.h > +++ b/include/linux/mmc/sdio_ids.h > @@ -25,6 +25,11 @@ > * Vendors and devices. Sort key: vendor first, device next. > */ > > +// Silabs does not use a reliable vendor ID. To avoid conflicts, the driver > +// won't probe the device if it is not also declared in the DT. C++ comments? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 03/24] wfx: add Makefile/Kconfig
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller > --- > drivers/net/wireless/silabs/wfx/Kconfig | 8 > drivers/net/wireless/silabs/wfx/Makefile | 25 > 2 files changed, 33 insertions(+) > create mode 100644 drivers/net/wireless/silabs/wfx/Kconfig > create mode 100644 drivers/net/wireless/silabs/wfx/Makefile > > diff --git a/drivers/net/wireless/silabs/wfx/Kconfig > b/drivers/net/wireless/silabs/wfx/Kconfig > new file mode 100644 > index ..83ee4d0ca8c6 > --- /dev/null > +++ b/drivers/net/wireless/silabs/wfx/Kconfig > @@ -0,0 +1,8 @@ > +config WFX > + tristate "Silicon Labs wireless chips WF200 and further" > + depends on MAC80211 > + depends on MMC || !MMC # do not allow WFX=y if MMC=m > + depends on (SPI || MMC) > + help > + This is a driver for Silicons Labs WFxxx series (WF200 and further) > + chipsets. This chip can be found on SPI or SDIO buses. Kconfig should mention about the SDIO id snafu and that Device Tree is required. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 05/24] wfx: add main.c/main.h
Jerome Pouiller writes: > From: Jérôme Pouiller > > Signed-off-by: Jérôme Pouiller [...] > +static const struct ieee80211_supported_band wfx_band_2ghz = { > + .channels = wfx_2ghz_chantable, > + .n_channels = ARRAY_SIZE(wfx_2ghz_chantable), > + .bitrates = wfx_rates, > + .n_bitrates = ARRAY_SIZE(wfx_rates), > + .ht_cap = { > + // Receive caps This driver is full of C++ comments, please convert them to network style comments. More info in the coding style doc. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 4/8] net: sparx5: add port module support
> > > +static int sparx5_port_verify_speed(struct sparx5 *sparx5, > > > + struct sparx5_port *port, > > > + struct sparx5_port_config *conf) > > > +{ > > > + case PHY_INTERFACE_MODE_SGMII: > > > + if (conf->speed != SPEED_1000 && > > > + conf->speed != SPEED_100 && > > > + conf->speed != SPEED_10 && > > > + conf->speed != SPEED_2500) > > > + return sparx5_port_error(port, conf, > > > SPX5_PERR_SPEED); > > > > Is it really SGMII over clocked at 2500? Or 2500BaseX? > > Yes the SGMII mode in the serdes driver is overclocked. > Nothing in the switch driver needs changing when changing between > speeds 1G/2G5. So it continues to use the SGMII inband signalling? There is a lot of confusion in this area, but SGMII inband signalling overclocked does not make much sense. So it is more likely to be using 2500BaseX. Andrew
Re: [PATCH v3 09/24] wfx: add hwio.c/hwio.h
Jerome Pouiller writes: > +/* > + * Internal helpers. > + * > + * About CONFIG_VMAP_STACK: > + * When CONFIG_VMAP_STACK is enabled, it is not possible to run DMA on stack > + * allocated data. Functions below that work with registers (aka functions > + * ending with "32") automatically reallocate buffers with kmalloc. However, > + * functions that work with arbitrary length buffers let's caller to handle > + * memory location. In doubt, enable CONFIG_DEBUG_SG to detect badly located > + * buffer. > + */ This sounds very hacky to me, I have understood that you should never use stack with DMA. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
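For reference, the pattern under discussion looks roughly like this: small register accesses are bounced through kmalloc() memory instead of handing a stack address to the bus layer. Names are invented; this is not the driver's actual code.

static int my_read_reg32(struct my_bus *bus, u32 addr, u32 *val)
{
	__le32 *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);	/* DMA-safe buffer */
	int ret;

	if (!tmp)
		return -ENOMEM;
	ret = my_bus_copy_from_io(bus, addr, tmp, sizeof(*tmp));
	if (!ret)
		*val = le32_to_cpu(*tmp);
	kfree(tmp);
	return ret;
}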
[PATCH net] net: ipa: fix interconnect enable bug
When the core clock rate and interconnect bandwidth specifications were moved into configuration data, a copy/paste bug was introduced, causing the memory interconnect bandwidth to be set three times rather than enabling the three different interconnects. Fix this bug. Fixes: 91d02f9551501 ("net: ipa: use config data for clocking") Signed-off-by: Alex Elder --- drivers/net/ipa/ipa_clock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ipa/ipa_clock.c b/drivers/net/ipa/ipa_clock.c index 9dcf16f399b7a..135c393437f12 100644 --- a/drivers/net/ipa/ipa_clock.c +++ b/drivers/net/ipa/ipa_clock.c @@ -115,13 +115,13 @@ static int ipa_interconnect_enable(struct ipa *ipa) return ret; data = &clock->interconnect_data[IPA_INTERCONNECT_IMEM]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->imem_path, data->average_rate, data->peak_rate); if (ret) goto err_memory_path_disable; data = &clock->interconnect_data[IPA_INTERCONNECT_CONFIG]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->config_path, data->average_rate, data->peak_rate); if (ret) goto err_imem_path_disable; -- 2.20.1
Re: [PATCH v3 12/24] wfx: add hif_api_*.h
Jerome Pouiller writes: > --- /dev/null > +++ b/drivers/net/wireless/silabs/wfx/hif_api_general.h > @@ -0,0 +1,267 @@ > +/* SPDX-License-Identifier: Apache-2.0 */ > +/* > + * WFx hardware interface definitions > + * > + * Copyright (c) 2018-2020, Silicon Laboratories Inc. > + */ > + > +#ifndef WFX_HIF_API_GENERAL_H > +#define WFX_HIF_API_GENERAL_H > + > +#ifdef __KERNEL__ > +#include > +#include > +#else > +#include > +#include > +#define __packed __attribute__((__packed__)) > +#endif Why check for __KERNEL__ and redefined __packed? These don't belong to a wireless driver. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 00/24] wfx: get out from the staging area
Jerome Pouiller writes: > From: Jérôme Pouiller > > I think the wfx driver is now mature enough to be accepted in the > drivers/net/wireless directory. What's the status with firmware images? Can anyone take the latest kernel and linux-firmware and use this driver normally? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v3 09/24] wfx: add hwio.c/hwio.h
On Tue, Dec 22, 2020 at 05:10:11PM +0200, Kalle Valo wrote: > Jerome Pouiller writes: > > > +/* > > + * Internal helpers. > > + * > > + * About CONFIG_VMAP_STACK: > > + * When CONFIG_VMAP_STACK is enabled, it is not possible to run DMA on > > stack > > + * allocated data. Functions below that work with registers (aka functions > > + * ending with "32") automatically reallocate buffers with kmalloc. > > However, > > + * functions that work with arbitrary length buffers let's caller to handle > > + * memory location. In doubt, enable CONFIG_DEBUG_SG to detect badly > > located > > + * buffer. > > + */ > > This sounds very hacky to me, I have understood that you should never > use stack with DMA. You should never do that because some platforms do not support it, so no driver should ever try to do that as they do not know what platform they are running on. thanks, greg k-h
Re: [PATCH v3 18/24] wfx: add data_tx.c/data_tx.h
Jerome Pouiller writes: > +static bool ieee80211_is_action_back(struct ieee80211_hdr *hdr) > +{ > + struct ieee80211_mgmt *mgmt = (struct ieee80211_mgmt *)hdr; > + > + if (!ieee80211_is_action(mgmt->frame_control)) > + return false; > + if (mgmt->u.action.category != WLAN_CATEGORY_BACK) > + return false; > + return true; > +} Don't use ieee80211_ prefix within a driver, it's reserved for mac80211. Use wfx_ instead. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC PATCH v2 0/8] Adding the Sparx5 Switch Driver
Florian Fainelli writes: > On 12/16/2020 11:51 PM, Steen Hegelund wrote: >> This series provides the Microchip Sparx5 Switch Driver >> >> The Sparx5 Carrier Ethernet and Industrial switch family delivers 64 >> Ethernet ports and up to 200 Gbps of switching bandwidth. >> >> It provides a rich set of Ethernet switching features such as hierarchical >> QoS, hardware-based OAM and service activation testing, protection >> switching, IEEE 1588, and Synchronous Ethernet. >> >> Using provider bridging (Q-in-Q) and MPLS/MPLS-TP technology, it delivers >> MEF CE >> 2.0 Ethernet virtual connections (EVCs) and features advanced TCAM >> classification in both ingress and egress. >> >> Per-EVC features include advanced L3-aware classification, a rich set of >> statistics, OAM for end-to-end performance monitoring, and dual-rate >> policing and shaping. >> >> Time sensitive networking (TSN) is supported through a comprehensive set of >> features including frame preemption, cut-through, frame replication and >> elimination for reliability, enhanced scheduling: credit-based shaping, >> time-aware shaping, cyclic queuing, and forwarding, and per-stream policing >> and filtering. >> >> Together with IEEE 1588 and IEEE 802.1AS support, this guarantees >> low-latency deterministic networking for Fronthaul, Carrier, and Industrial >> Ethernet. >> >> The Sparx5 switch family consists of following SKUs: >> >> - VSC7546 Sparx5-64 up to 64 Gbps of bandwidth with the following primary >> port configurations: >> - 6 *10G >> - 16 * 2.5G + 2 * 10G >> - 24 * 1G + 4 * 10G >> >> - VSC7549 Sparx5-90 up to 90 Gbps of bandwidth with the following primary >> port configurations: >> - 9 * 10G >> - 16 * 2.5G + 4 * 10G >> - 48 * 1G + 4 * 10G >> >> - VSC7552 Sparx5-128 up to 128 Gbps of bandwidth with the following primary >> port configurations: >> - 12 * 10G >> - 16 * 2.5G + 8 * 10G >> - 48 * 1G + 8 * 10G >> >> - VSC7556 Sparx5-160 up to 160 Gbps of bandwidth with the following primary >> port configurations: >> - 16 * 10G >> - 10 * 10G + 2 * 25G >> - 16 * 2.5G + 10 * 10G >> - 48 * 1G + 10 * 10G >> >> - VSC7558 Sparx5-200 up to 200 Gbps of bandwidth with the following primary >> port configurations: >> - 20 * 10G >> - 8 * 25G >> >> In addition, the device supports one 10/100/1000/2500/5000 Mbps >> SGMII/SerDes node processor interface (NPI) Ethernet port. >> >> The Sparx5 support is developed on the PCB134 and PCB135 evaluation boards. >> >> - PCB134 main networking features: >> - 12x SFP+ front 10G module slots (connected to Sparx5 through SFI). >> - 8x SFP28 front 25G module slots (connected to Sparx5 through SFI high >> speed). >> - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port >> (on-board VSC8211 PHY connected to Sparx5 through SGMII). >> >> - PCB135 main networking features: >> - 48x1G (10/100/1000M) RJ45 front ports using 12xVSC8514 QuadPHY’s each >> connected to VSC7558 through QSGMII. >> - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY >> each port connects to VSC7558 through SFI. >> - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high >> speed. >> - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board >> VSC8211 PHY, which can be connected to VSC7558 NPI port through SGMII >> using a loopback add-on PCB) >> >> This series provides support for: >> - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G >> devices and media types. 
>> - Port module configuration for 10M to 25G speeds with SGMII, QSGMII, >> 1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes. >> - SerDes configuration via the Sparx5 SerDes driver (see below). >> - Host mode providing register based injection and extraction. >> - Switch mode providing MAC/VLAN table learning and Layer2 switching >> offloaded to the Sparx5 switch. >> - STP state, VLAN support, host/bridge port mode, Forwarding DB, and >> configuration and statistics via ethtool. >> >> More support will be added at a later stage. >> >> The Sparx5 Switch chip register model can be browsed here: >> Link: https://microchip-ung.github.io/sparx-5_reginfo/reginfo_sparx-5.html > > Out of curiosity, what tool was used to generate the register > information page? It looks really neat and well organized. Florian, It is an in-house tool. The input data is in a proprietary XML-like format. We're pleased that you like it, we do too. We are also pleased that being a Microchip entity, we can actually make this kind of information public. I'll pass your praise on. ---Lars -- Lars Povlsen, Microchip
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Fri, 18 Dec 2020 12:13:45 -0800 Andrii Nakryiko wrote: > On Thu, Dec 17, 2020 at 9:30 AM Jesper Dangaard Brouer > wrote: > > > > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > > it can be used from both XDP and TC. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 > > +++ > > 2 files changed, 400 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > new file mode 100644 > > index ..b5d0c3a9abe8 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > @@ -0,0 +1,204 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* Copyright (c) 2020 Jesper Dangaard Brouer */ > > + > > +#include /* before test_progs.h, avoid bpf_util.h > > redefines */ > > + > > +#include > > +#include "test_check_mtu.skel.h" > > +#include > > + > > +#include > > +#include > > + > > +#define IFINDEX_LO 1 > > + > > +static __u32 duration; /* Hint: needed for CHECK macro */ > > + > > +static int read_mtu_device_lo(void) > > +{ > > + const char *filename = "/sys/devices/virtual/net/lo/mtu"; I will change this to: /sys/class/net/lo/mtu > > + char buf[11] = {}; > > + int value; > > + int fd; > > + > > + fd = open(filename, 0, O_RDONLY); > > + if (fd == -1) > > + return -1; > > + > > + if (read(fd, buf, sizeof(buf)) == -1) > > close fd missing here? ack, fixed. > > + return -2; > > + close(fd); > > + > > + value = strtoimax(buf, NULL, 10); > > + if (errno == ERANGE) > > + return -3; > > + > > + return value; > > +} > > + > > +static void test_check_mtu_xdp_attach(struct bpf_program *prog) > > +{ > > + int err = 0; > > + int fd; > > + > > + fd = bpf_program__fd(prog); > > + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); > > + if (CHECK(err, "XDP-attach", "failed")) > > + return; > > + > > + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); > > can you please use bpf_link-based bpf_program__attach_xdp() which will > provide auto-cleanup in case of crash? Sure, that will be good for me to learn. > also check that it succeeded? > > > +} > > + > > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > > + struct bpf_program *prog, > > + __u32 mtu_expect) > > +{ > > + const char *prog_name = bpf_program__name(prog); > > + int retval_expect = XDP_PASS; > > + __u32 mtu_result = 0; > > + char buf[256]; > > + int err; > > + > > + struct bpf_prog_test_run_attr tattr = { > > + .repeat = 1, > > + .data_in = &pkt_v4, > > + .data_size_in = sizeof(pkt_v4), > > + .data_out = buf, > > + .data_size_out = sizeof(buf), > > + .prog_fd = bpf_program__fd(prog), > > + }; > > nit: it's a variable declaration, so keep it all in one block. There > is also opts-based variant, which might be good to use here instead. > > > + > > + memset(buf, 0, sizeof(buf)); > > char buf[256] = {}; would make this unnecessary ok. > > > + > > + err = bpf_prog_test_run_xattr(&tattr); > > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > > + "prog_name:%s (err %d errno %d retval %d)\n", > > + prog_name, err, errno, tattr.retval); > > + > > +CHECK(tattr.retval != retval_expect, "retval", > > whitespaces are off? Yes, I noticed with scripts/checkpatch.pl. And there are a couple more, that I've already fixed. 
> > + "progname:%s unexpected retval=%d expected=%d\n", > > + prog_name, tattr.retval, retval_expect); > > + > > + /* Extract MTU that BPF-prog got */ > > + mtu_result = skel->bss->global_bpf_mtu_xdp; > > + CHECK(mtu_result != mtu_expect, "MTU-compare-user", > > + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); > > There is nicer ASSERT_EQ() macro for such cases: > > ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); it will format > sensible error message automatically Nice simplification :-) > > > +} > > + > > [...] [... same ...] > [...] > > > + > > +void test_check_mtu(void) > > +{ > > + struct test_check_mtu *skel; > > + __u32 mtu_lo; > > + > > + skel = test_check_mtu__open_and_load(); > > + if (CHECK(!skel, "open and load skel", "failed")) > > + return; /* Exit if e.g. helper unknown to kernel */ > > + > > + if (test__start_subte
Re: [PATCH v3 18/24] wfx: add data_tx.c/data_tx.h
Jerome Pouiller writes: > +void wfx_tx(struct ieee80211_hw *hw, struct ieee80211_tx_control *control, > + struct sk_buff *skb) > +{ > + struct wfx_dev *wdev = hw->priv; > + struct wfx_vif *wvif; > + struct ieee80211_sta *sta = control ? control->sta : NULL; > + struct ieee80211_tx_info *tx_info = IEEE80211_SKB_CB(skb); > + struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data; > + size_t driver_data_room = sizeof_field(struct ieee80211_tx_info, > +rate_driver_data); > + > + compiletime_assert(sizeof(struct wfx_tx_priv) <= driver_data_room, > +"struct tx_priv is too large"); Interesting, never seen compiletime_assert() before. But I suspect BUILD_BUG_ON_MSG() is preferred, at least there are more users. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
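The BUILD_BUG_ON_MSG() equivalent of the assert above would be something like this (untested sketch):

	BUILD_BUG_ON_MSG(sizeof(struct wfx_tx_priv) >
			 sizeof_field(struct ieee80211_tx_info, rate_driver_data),
			 "struct tx_priv is too large");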
Re: [PATCH v3 05/24] wfx: add main.c/main.h
Jerome Pouiller writes: > +/* NOTE: wfx_send_pds() destroy buf */ > +int wfx_send_pds(struct wfx_dev *wdev, u8 *buf, size_t len) > +{ > + int ret; > + int start, brace_level, i; > + > + start = 0; > + brace_level = 0; > + if (buf[0] != '{') { > + dev_err(wdev->dev, "valid PDS start with '{'. Did you forget to > compress it?\n"); > + return -EINVAL; > + } > + for (i = 1; i < len - 1; i++) { > + if (buf[i] == '{') > + brace_level++; > + if (buf[i] == '}') > + brace_level--; > + if (buf[i] == '}' && !brace_level) { > + i++; > + if (i - start + 1 > WFX_PDS_MAX_SIZE) > + return -EFBIG; > + buf[start] = '{'; > + buf[i] = 0; > + dev_dbg(wdev->dev, "send PDS '%s}'\n", buf + start); > + buf[i] = '}'; > + ret = hif_configuration(wdev, buf + start, > + i - start + 1); > + if (ret > 0) { > + dev_err(wdev->dev, "PDS bytes %d to %d: invalid data (unsupported > options?)\n", > + start, i); > + return -EINVAL; > + } > + if (ret == -ETIMEDOUT) { > + dev_err(wdev->dev, "PDS bytes %d to %d: chip didn't reply (corrupted > file?)\n", > + start, i); > + return ret; > + } > + if (ret) { > + dev_err(wdev->dev, "PDS bytes %d to %d: chip returned an unknown > error\n", > + start, i); > + return -EIO; > + } > + buf[i] = ','; > + start = i; > + } > + } > + return 0; > +} What does this function do? Looks very strange. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [RFC v2 01/13] mm: export zap_page_range() for driver use
On Tue, Dec 22, 2020 at 10:52:09PM +0800, Xie Yongji wrote: > Export zap_page_range() for use in VDUSE. Err, no. This has absolutely no business being used by drivers.
[PATCH net-next v1] stmmac: intel: Add PCI IDs for TGL-H platform
From: Noor Azura Ahmad Tarmizi Add TGL-H PCI info and PCI IDs for the new TSN Controller to the list of supported devices. Signed-off-by: Noor Azura Ahmad Tarmizi Signed-off-by: Voon Weifeng Signed-off-by: Muhammad Husaini Zulkifli --- drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c index 55dbb1a930da..d3608db576f7 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c @@ -723,6 +723,8 @@ static SIMPLE_DEV_PM_OPS(intel_eth_pm_ops, intel_eth_pci_suspend, #define PCI_DEVICE_ID_INTEL_EHL_PSE1_RGMII1G_ID0x4bb0 #define PCI_DEVICE_ID_INTEL_EHL_PSE1_SGMII1G_ID0x4bb1 #define PCI_DEVICE_ID_INTEL_EHL_PSE1_SGMII2G5_ID 0x4bb2 +#define PCI_DEVICE_ID_INTEL_TGLH_SGMII1G_0_ID 0x43ac +#define PCI_DEVICE_ID_INTEL_TGLH_SGMII1G_1_ID 0x43a2 #define PCI_DEVICE_ID_INTEL_TGL_SGMII1G_ID 0xa0ac static const struct pci_device_id intel_eth_pci_id_table[] = { @@ -737,6 +739,8 @@ static const struct pci_device_id intel_eth_pci_id_table[] = { { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII1G_ID, &ehl_pse1_sgmii1g_info) }, { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII2G5_ID, &ehl_pse1_sgmii1g_info) }, { PCI_DEVICE_DATA(INTEL, TGL_SGMII1G_ID, &tgl_sgmii1g_info) }, + { PCI_DEVICE_DATA(INTEL, TGLH_SGMII1G_0_ID, &tgl_sgmii1g_info) }, + { PCI_DEVICE_DATA(INTEL, TGLH_SGMII1G_1_ID, &tgl_sgmii1g_info) }, {} }; MODULE_DEVICE_TABLE(pci, intel_eth_pci_id_table); -- 2.17.1
Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num
On Tue, Dec 22, 2020 at 1:21 AM Antoine Tenart wrote: > > Hello Alexander, Jakub, > > Quoting Alexander Duyck (2020-12-22 00:21:57) > > > > Looking over this patch it seems kind of obvious that extending the > > xps_map_mutex is making things far more complex then they need to be. > > > > Applying the rtnl_mutex would probably be much simpler. Although as I > > think you have already discovered we need to apply it to the store, > > and show for this interface. In addition we probably need to perform > > similar locking around traffic_class_show in order to prevent it from > > generating a similar error. > > I don't think we have the same kind of issues with traffic_class_show: > dev->num_tc is used, but not for navigating through the map. Protecting > only a single read wouldn't change much. We can still think about what > could go wrong here without the lock, but that is not related to this > series of fixes. The problem is we are actually reading the netdev, tx queue, and tc_to_txq mapping. Basically we have several different items that we are accessing at the same time. If any one is updated while we are doing it then it will throw things off. > If I understood correctly, as things are a bit too complex now, you > would prefer that we go for the solution proposed in v1? Yeah, that is what I am thinking. Basically we just need to make sure the num_tc cannot be updated while we are reading the other values. > I can still do the code factoring for the 2 sysfs show operations, but > that would then target net-next and would be in a different series. So I > believe we'll use the patches of v1, unmodified. I agree the code factoring would be better targeted to net-next. The rtnl_lock approach from v1 would work for net and for backports. > Jakub, should I send a v3 then? > > Thanks! > Antoine
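For reference, the v1 approach boils down to holding the RTNL around the whole read so that num_tc and the map cannot change mid-walk. A sketch, not the exact patch; do_xps_cpus_show() stands in for the existing function body.

static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
{
	ssize_t len;

	if (!rtnl_trylock())
		return restart_syscall();

	/* dev->num_tc, tc_to_txq and the xps map are all read under RTNL,
	 * so a concurrent setup_tc/xps update cannot change them mid-walk */
	len = do_xps_cpus_show(queue, buf);

	rtnl_unlock();
	return len;
}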
Re: [PATCH net] net: ipa: fix interconnect enable bug
On 12/22/20 17:16, Alex Elder wrote: When the core clock rate and interconnect bandwidth specifications were moved into configuration data, a copy/paste bug was introduced, causing the memory interconnect bandwidth to be set three times rather than enabling the three different interconnects. Fix this bug. Fixes: 91d02f9551501 ("net: ipa: use config data for clocking") Signed-off-by: Alex Elder Reviewed-by: Georgi Djakov --- drivers/net/ipa/ipa_clock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ipa/ipa_clock.c b/drivers/net/ipa/ipa_clock.c index 9dcf16f399b7a..135c393437f12 100644 --- a/drivers/net/ipa/ipa_clock.c +++ b/drivers/net/ipa/ipa_clock.c @@ -115,13 +115,13 @@ static int ipa_interconnect_enable(struct ipa *ipa) return ret; data = &clock->interconnect_data[IPA_INTERCONNECT_IMEM]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->imem_path, data->average_rate, data->peak_rate); if (ret) goto err_memory_path_disable; data = &clock->interconnect_data[IPA_INTERCONNECT_CONFIG]; - ret = icc_set_bw(clock->memory_path, data->average_rate, + ret = icc_set_bw(clock->config_path, data->average_rate, data->peak_rate); if (ret) goto err_imem_path_disable;
[PATCH bpf-next V10 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limits growing the packet size. This is done because these BPF-helpers doesn't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If BPF-prog doesn't comply with the MTU then the packet will eventually get dropped as some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically doing TC-redirect from ingress step (sch_handle_ingress) into egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kind of MTU violations. One advantage of this approach is that it ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in skb_do_redirect() call, this allows for an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as egress BPF-prog remove this prior to transmitting packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also works for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result. V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected V10: Fix selftests and 'tot_len' MTU check like XDP --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++ net/core/dev.c | 19 -- net/core/filter.c | 183 +++-- tools/include/uapi/linux/bpf.h | 78 +++ tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 769 insertions(+), 44 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
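For context, the intended TC-BPF usage of the new helper looks roughly like this, assuming the series lands with the bpf_check_mtu() signature from patch 4/7; ENCAP_LEN and the section name are illustrative.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define ENCAP_LEN 20	/* planned header growth, made-up value */

SEC("classifier")
int tc_encap_mtu_check(struct __sk_buff *ctx)
{
	__u32 mtu_len = 0;

	/* ifindex 0: check against the current netdev, prior to any redirect */
	if (bpf_check_mtu(ctx, 0, &mtu_len, ENCAP_LEN, 0))
		return TC_ACT_SHOT;	/* packet would exceed MTU after encap */

	/* ... adjust room, push the encap header, bpf_redirect(), ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";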
[PATCH bpf-next V10 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB uses __bpf_skb_max_len() as the max-length. This function limit size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grow the packet size, then it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Further more, current MTU check in __bpf_skb_max_len uses the MTU from ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is in-effect due to this being called from softirq context see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V10 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
A BPF end-user on the Cilium Slack channel (Carlo Carraro) wants to use
bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet
size, by adjusting fib_params 'tot_len' to the packet length plus the expected
encap size (just like the bpf_check_mtu helper supports). He discovered that
for the SKB ctx, params->tot_len was not used; instead skb->len was used (via
the MTU check in is_skb_forwardable(), which checks against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If it is not
provided (i.e. zero), keep the existing TC behaviour intact. Notice that the
'tot_len' MTU check is done like the XDP code-path, which checks against the
FIB-dst MTU.

V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro
Signed-off-by: Jesper Dangaard Brouer
---
 net/core/filter.c |   13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index f8f198252ff2..c1e460193bae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5548,6 +5548,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 {
 	struct net *net = dev_net(skb->dev);
 	int rc = -EAFNOSUPPORT;
+	bool check_mtu = false;
 
 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -,22 +5556,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
 		return -EINVAL;
 
+	if (params->tot_len)
+		check_mtu = true;
+
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 	}
 
-	if (!rc) {
+	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
 		struct net_device *dev;
 
+		/* When tot_len isn't provided by user,
+		 * check skb against net_device MTU
+		 */
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
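[Editor's note] For illustration (my sketch, not part of the patch; field setup abbreviated), this is the usage pattern the fix enables: fill in 'tot_len' with the planned post-encap L3 length so the lookup does the route-level MTU check, instead of relying on skb->len versus the netdev MTU:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("classifier")
int tc_fib_mtu_precheck(struct __sk_buff *skb)
{
	struct bpf_fib_lookup params = {};
	const __u32 encap = 80;		/* hypothetical planned encap size */
	int rc;

	params.family  = AF_INET;
	params.ifindex = skb->ingress_ifindex;
	/* ... fill ipv4_src/ipv4_dst/tos from the packet headers ... */

	/* Non-zero tot_len requests the MTU check against the FIB result;
	 * the L3 length is derived from skb->len here only for brevity.
	 */
	params.tot_len = skb->len + encap;

	rc = bpf_fib_lookup(skb, &params, sizeof(params), 0);
	if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED)
		return TC_ACT_SHOT;	/* too big for the route: don't grow it */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";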
[PATCH bpf-next V10 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The SKB object is complex and the skb->len value (accessible from the
BPF-prog) also includes the length of any extra GRO/GSO segments, but without
taking into account that these GRO/GSO segments get transport (L4) and network
(L3) headers added before being transmitted. Thus, this BPF-helper is created
such that the BPF-programmer doesn't need to handle these details in the
BPF-prog.

The API is designed to help the BPF-programmer who wants to do packet context
size changes, which involve other helpers. These other helpers usually do a
delta size adjustment. This helper also supports a delta size (len_diff),
which allows the BPF-programmer to reuse arguments needed by these other
helpers, and to perform the MTU check prior to doing any actual size
adjustment of the packet context.

It is on purpose that we allow the len adjustment to become a negative result
that will pass the MTU check. This might seem weird, but it is not this
helper's responsibility to "catch" wrong len_diff adjustments. Other helpers
will take care of these checks, if the BPF-programmer chooses to do the actual
size adjustment.

V9:
- Use dev->hard_header_len (instead of ETH_HLEN)
- Annotate with unlikely req from Daniel
- Fix logic error using skb_gso_validate_network_len from Daniel

V6:
- Took John's advice and dropped BPF_MTU_CHK_RELAX
- Returned MTU is kept at L3-level (like fib_lookup)

V4: Lot of changes
- ifindex 0 now use current netdev for MTU lookup
- rename helper from bpf_mtu_check to bpf_check_mtu
- fix bug for GSO pkt length (as skb->len is total len)
- remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       |   67 ++
 net/core/filter.c              |  122 
 tools/include/uapi/linux/bpf.h |   67 ++
 3 files changed, 256 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*). This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size. The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device. This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features). This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device. Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code. On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populat
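[Editor's note] A minimal XDP usage sketch of the helper described above (my illustration, not from the patch), assuming the documented semantics: ifindex 0 checks against the current device, mtu_len must point at an initialized value, and any non-zero return is treated here as "do not grow":

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_mtu_precheck(struct xdp_md *ctx)
{
	__u32 mtu_len = 0;		/* must be valid and initialized */
	const __s32 delta = 40;		/* hypothetical planned growth (e.g. an IPv6 hdr) */

	/* ifindex 0: check against the current net device */
	if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0))
		return XDP_DROP;	/* >0: MTU violation, <0: bad argument */

	/* safe to call bpf_xdp_adjust_head()/_tail() by delta here */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";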
[PATCH bpf-next V10 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can
perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog
doesn't know the MTU value that caused this rejection. If the BPF-prog wants
to implement PMTU (Path MTU Discovery) (rfc1191), it needs to know this MTU
value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain this
MTU value as output via a union with 'tot_len', as this is the value used for
the MTU lookup.

V5:
- Fixed uninit value spotted by Dan Carpenter.
- Name struct output member mtu_result

Reported-by: kernel test robot
Reported-by: Dan Carpenter
Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       |   11 +--
 net/core/filter.c              |   22 +++---
 tools/include/uapi/linux/bpf.h |   11 +--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77d7c1bb2923..649586d656b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2225,6 +2225,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
+		/* output: MTU value */
+		__u16	mtu_result;
+	};
 
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index c1e460193bae..db59ab55572c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */
 
 	return 0;
 }
@@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5409,7 +5413,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5505,7 +5511,7 @@ static int bpf_ipv6_fib
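[Editor's note] To illustrate what the new output enables (my sketch, not from the patch; the global and field setup are hypothetical): after a FRAG_NEEDED result the program can read the MTU back through the union, e.g. to feed an ICMP "fragmentation needed" / Packet Too Big reply:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* hypothetical global, read by userspace or a later program stage */
__u32 global_route_mtu;

SEC("classifier")
int tc_pmtu_probe(struct __sk_buff *skb)
{
	struct bpf_fib_lookup params = {};
	int rc;

	/* ... fill family/ifindex/addresses and a non-zero tot_len ... */
	rc = bpf_fib_lookup(skb, &params, sizeof(params), 0);
	if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED) {
		global_route_mtu = params.mtu_result; /* union with tot_len */
		return TC_ACT_SHOT;	/* then build the ICMP PTB reply */
	}
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";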
[PATCH bpf-next V10 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress
is described by Eyal Birger in email[0]. The summary is the ability to
increase the packet size (e.g. with IPv6 headers for NAT64), redirect the
packet to ingress, and let the normal netstack fragment the packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/

V9:
- Make net_device "up" (IFF_UP) check explicit in skb_do_redirect

V4:
- Keep net_device "up" (IFF_UP) check.
- Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer
---
 include/linux/netdevice.h |   31 +--
 net/core/dev.c            |   19 ++-
 net/core/filter.c         |   14 +++---
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf167993c05..4c78f73db1ec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+						 const struct sk_buff *skb,
+						 const bool check_mtu)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int len;
+
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	if (!check_mtu)
+		return true;
+
+	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+	if (skb->len <= len)
+		return true;
+
+	/* if TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before
+	 */
+	if (skb_is_gso(skb))
+		return true;
+
+	return false;
+}
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool check_mtu)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index a46334906c94..ffead13c158f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
+	return __is_skb_forwardable(dev, skb, true);
 }
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index 3f2e593244ca..1908800b671c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		dev = ops->ndo_get_peer_dev(dev);
 		if (unlikely(!dev ||
-			     !is_skb_forwardable(dev, skb) ||
+			     !(dev->flags & IFF_UP) ||
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
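[Editor's note] A sketch of the NAT64-style flow this enables (my illustration, not from the patch; header rewriting omitted): grow the packet past the device MTU, redirect it to ingress, and let the stack fragment it:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int tc_grow_and_reinject(struct __sk_buff *skb)
{
	const __s32 delta = 20;	/* hypothetical IPv4 -> IPv6 header growth */

	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, 0))
		return TC_ACT_SHOT;
	/* ... rewrite the network header here ... */

	/* With this patch the oversized skb is accepted on ingress redirect
	 * (the device only needs to be IFF_UP) and the netstack fragments it.
	 */
	return bpf_redirect(skb->ifindex, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";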
[PATCH bpf-next V10 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrates how the bpf_check_mtu() helper can easily be used together
with the bpf_skb_adjust_room() helper, prior to doing the size adjustment, as
the delta argument is already set up.

Hint: This specific test can be selected like this:
 ./test_progs -t cls_redirect

Signed-off-by: Jesper Dangaard Brouer
---
 .../selftests/bpf/progs/test_cls_redirect.c |    7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
index c9f8464996ea..3c1e042962e6 100644
--- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
@@ -70,6 +70,7 @@ typedef struct {
 	uint64_t errors_total_encap_adjust_failed;
 	uint64_t errors_total_encap_buffer_too_small;
 	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
 } metrics_t;
 
 typedef enum {
@@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
 	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
 	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;
 
 	/* Loop protection: the inner packet's TTL is decremented as a safeguard
 	 * against any forwarding loop. As the only interesting field is the TTL
@@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		}
 	}
 
+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
 	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
				BPF_F_ADJ_ROOM_FIXED_GSO |
				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V10 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. V10: - Remove errno non-zero test in CHECK_ATTR() - Addresse comments from Andrii Nakryiko Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ 2 files changed, 417 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..63f01c9e08d8 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,218 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ + +#include +#include "test_check_mtu.skel.h" +#include + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/class/net/lo/mtu"; + char buf[11] = {}; + int value; + int fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + if (read(fd, buf, sizeof(buf)) == -1) { + close(fd); + return -2; + } + close(fd); + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach() +{ + struct bpf_link_info link_info; + __u32 link_info_len = sizeof(link_info); + struct test_check_mtu *skel; + struct bpf_program *prog; + struct bpf_link *link; + int err = 0; + int fd; + + skel = test_check_mtu__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; /* Exit if e.g. 
helper unknown to kernel */ + + prog = skel->progs.xdp_use_helper_basic; + + link = bpf_program__attach_xdp(prog, IFINDEX_LO); + if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link))) + goto out; + skel->links.xdp_use_helper_basic = link; + + memset(&link_info, 0, sizeof(link_info)); + fd = bpf_link__fd(link); + err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + if (CHECK(err, "link_info", "failed: %d\n", err)) + goto out; + + CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type", + "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP); + CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex", + "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO); + + err = bpf_link__detach(link); + CHECK(err, "link_detach", "failed %d\n", err); + +out: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta
Re: [PATCH v3] net: neighbor: fix a crash caused by mod zero
On 12/22/20 1:38 PM, weichenchen wrote: > pneigh_enqueue() tries to obtain a random delay by mod > NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY) > migth be zero at that point because someone could write zero > to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the > callers check it. > > This patch makes pneigh_enqueue() get a delay time passed in > by the callers and the callers guarantee it is not zero. > > Signed-off-by: weichenchen > --- > V3: > - Callers need to pass the delay time to pneigh_enqueue() > now and they should guarantee it is not zero. > - Use READ_ONCE() to read NEIGH_VAR(p, PROXY_DELAY) in both > of the existing callers of pneigh_enqueue() and then pass > it to pneigh_enqueue(). > V2: > - Use READ_ONCE() to prevent the complier from re-reading > NEIGH_VAR(p, PROXY_DELAY). > - Give a hint to the complier that delay <= 0 is unlikely > to happen. > --- > include/net/neighbour.h | 2 +- > net/core/neighbour.c| 5 ++--- > net/ipv4/arp.c | 8 +--- > net/ipv6/ndisc.c| 6 +++--- > 4 files changed, 11 insertions(+), 10 deletions(-) > > diff --git a/include/net/neighbour.h b/include/net/neighbour.h > index 22ced1381ede..f7564dc5304d 100644 > --- a/include/net/neighbour.h > +++ b/include/net/neighbour.h > @@ -352,7 +352,7 @@ struct net *neigh_parms_net(const struct neigh_parms > *parms) > unsigned long neigh_rand_reach_time(unsigned long base); > > void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, > - struct sk_buff *skb); > + struct sk_buff *skb, int delay); > struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net, > const void *key, struct net_device *dev, > int creat); > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index 9500d28a43b0..b440f966d109 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -1567,12 +1567,11 @@ static void neigh_proxy_process(struct timer_list *t) > } > > void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, > - struct sk_buff *skb) > + struct sk_buff *skb, int delay) > { > unsigned long now = jiffies; > > - unsigned long sched_next = now + (prandom_u32() % > - NEIGH_VAR(p, PROXY_DELAY)); > + unsigned long sched_next = now + (prandom_u32() % delay); > > if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) { > kfree_skb(skb); This seems rather complex, what about not using a divide in the first place ? : diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 9500d28a43b0e1a390382912b6fb59db935e727b..745bc89acc87c2a4802fb6f301c11edd2f0096da 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1569,10 +1569,7 @@ static void neigh_proxy_process(struct timer_list *t) void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p, struct sk_buff *skb) { - unsigned long now = jiffies; - - unsigned long sched_next = now + (prandom_u32() % - NEIGH_VAR(p, PROXY_DELAY)); + unsigned long sched_next = jiffies + prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY)); if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) { kfree_skb(skb);
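[Editor's note] For reference (my note, not from the thread): the suggested prandom_u32_max() computes the bounded value with a multiply and a shift rather than a modulo, so a bound of zero simply yields zero instead of a divide-by-zero fault. An illustrative stand-in, not the kernel implementation:

#include <stdint.h>

/* Map a 32-bit pseudo-random value into [0, bound) with multiply + shift.
 * bound == 0 gives 0 - no modulo, hence no divide-by-zero.
 */
static inline uint32_t bounded_rand(uint32_t prandom_val, uint32_t bound)
{
	return (uint32_t)(((uint64_t)prandom_val * bound) >> 32);
}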
pull-request: wireless-drivers-2020-12-22
Hi, here's a pull request to net tree, more info below. Please let me know if there are any problems. Kalle The following changes since commit 13458ffe0a953e17587f172a8e5059c243e6850a: net: x25: Remove unimplemented X.25-over-LLC code stubs (2020-12-12 17:15:33 -0800) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-2020-12-22 for you to fetch changes up to bfe55584713b4d4d518ffe9cf2dab1129eba6321: MAINTAINERS: switch to different email address (2020-12-21 19:07:39 +0200) wireless-drivers fixes for v5.11 First set of fixes for v5.11, more fixes than usual this time. For ath11k we have several fixes for QCA6390 PCI support and mt76 has several. Also one build fix for mt76. mt76 * fix two NULL pointer dereference * fix build error when CONFIG_MAC80211_MESH is disabled rtlwifi * fix use-after-free in firmware handling code ath11k * error handling fixes * fix crash found during connect and disconnect test * handle HT disable better * avoid printing qmi memory failure during firmware bootup * disable ASPM during firmware bootup Arend van Spriel (1): MAINTAINERS: switch to different email address Carl Huang (4): ath11k: fix crash caused by NULL rx_channel ath11k: start vdev if a bss peer is already created ath11k: qmi: try to allocate a big block of DMA memory first ath11k: pci: disable ASPM L0sLs before downloading firmware Colin Ian King (1): ath11k: add missing null check on allocated skb Dan Carpenter (2): ath11k: Fix error code in ath11k_core_suspend() ath11k: Fix ath11k_pci_fix_l1ss() Kalle Valo (1): Merge ath-current from git://git.kernel.org/.../kvalo/ath.git Lorenzo Bianconi (4): mt76: mt76u: fix NULL pointer dereference in mt76u_status_worker mt76: usb: remove wake logic in mt76u_status_worker mt76: sdio: remove wake logic in mt76s_process_tx_queue mt76: mt76s: fix NULL pointer dereference in mt76s_process_tx_queue Ping-Ke Shih (1): rtlwifi: rise completion at the last step of firmware callback Randy Dunlap (1): mt76: mt7915: fix MESH ifdef block MAINTAINERS | 2 +- drivers/net/wireless/ath/ath11k/core.c | 2 +- drivers/net/wireless/ath/ath11k/dp_rx.c | 10 -- drivers/net/wireless/ath/ath11k/mac.c| 8 +++-- drivers/net/wireless/ath/ath11k/pci.c| 44 +--- drivers/net/wireless/ath/ath11k/pci.h| 2 ++ drivers/net/wireless/ath/ath11k/peer.c | 17 + drivers/net/wireless/ath/ath11k/peer.h | 2 ++ drivers/net/wireless/ath/ath11k/qmi.c| 24 +++-- drivers/net/wireless/ath/ath11k/qmi.h| 1 + drivers/net/wireless/ath/ath11k/wmi.c| 3 ++ drivers/net/wireless/mediatek/mt76/mt7915/init.c | 4 +-- drivers/net/wireless/mediatek/mt76/sdio.c| 19 -- drivers/net/wireless/mediatek/mt76/usb.c | 9 ++--- drivers/net/wireless/realtek/rtlwifi/core.c | 8 +++-- 15 files changed, 118 insertions(+), 37 deletions(-)
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On 12/22/20 7:43 AM, Willem de Bruijn wrote:
>
>>  void sock_zerocopy_put(struct ubuf_info *uarg)
>>  {
>> -	if (uarg && refcount_dec_and_test(&uarg->refcnt))
>> +	if (uarg)
>>  		uarg->callback(uarg, uarg->zerocopy);
>>  }
>>  EXPORT_SYMBOL_GPL(sock_zerocopy_put);
>
> This does increase the number of indirect function calls. Which are
> not cheap post spectre.
>
> In the common case for msg_zerocopy we only have two clones, one sent
> out and one on the retransmit queue. So I guess the cost will be
> acceptable.

sock_zerocopy_callback seems to be the only one at the moment - or the
most dominant if I goofed my search. Could use the INDIRECT_CALL macros
for it.
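[Editor's note] A sketch of what that could look like (my illustration, assuming the INDIRECT_CALL_1() wrapper from <linux/indirect_call_wrapper.h>; not a tested patch):

#include <linux/indirect_call_wrapper.h>

void sock_zerocopy_put(struct ubuf_info *uarg)
{
	if (uarg)
		/* compare against the expected target and call it directly,
		 * falling back to the indirect call otherwise
		 */
		INDIRECT_CALL_1(uarg->callback, sock_zerocopy_callback,
				uarg, uarg->zerocopy);
}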
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On 12/21/20 5:09 PM, Jonathan Lemon wrote: > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 327ee8938f78..ea32b3414ad6 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1245,12 +1245,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback); > > void sock_zerocopy_put(struct ubuf_info *uarg) > { > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) { > - if (uarg->callback) > - uarg->callback(uarg, uarg->zerocopy); > - else > - consume_skb(skb_from_uarg(uarg)); > - } > + if (uarg && refcount_dec_and_test(&uarg->refcnt)) > + uarg->callback(uarg, uarg->zerocopy); > } > EXPORT_SYMBOL_GPL(sock_zerocopy_put); > > since it is down to 2 lines, move to skbuff.h as an inline?
Re: [PATCH 03/12 v2 RFC] skbuff: simplify sock_zerocopy_put
On 12/22/20 9:52 AM, David Ahern wrote: > On 12/21/20 5:09 PM, Jonathan Lemon wrote: >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> index 327ee8938f78..ea32b3414ad6 100644 >> --- a/net/core/skbuff.c >> +++ b/net/core/skbuff.c >> @@ -1245,12 +1245,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback); >> >> void sock_zerocopy_put(struct ubuf_info *uarg) >> { >> -if (uarg && refcount_dec_and_test(&uarg->refcnt)) { >> -if (uarg->callback) >> -uarg->callback(uarg, uarg->zerocopy); >> -else >> -consume_skb(skb_from_uarg(uarg)); >> -} >> +if (uarg && refcount_dec_and_test(&uarg->refcnt)) >> +uarg->callback(uarg, uarg->zerocopy); >> } >> EXPORT_SYMBOL_GPL(sock_zerocopy_put); >> >> > > since it is down to 2 lines, move to skbuff.h as an inline? > nm. that is done in patch 5.
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
On 22/12/2020 16:01:22+0100, Andrew Lunn wrote: > > The problem is that the switch core reset also affects (reset) the > > SGPIO controller. > > > > We tried to put this in the reset driver, but it was rejected. If the > > reset is done at probe time, the SGPIO driver may already have > > initialized state. > > > > The switch core reset will then reset all SGPIO registers. > > Ah, O.K. Dumb question. Why is the SGPIO driver a separate driver? It > sounds like it should be embedded inside this driver if it is sharing > hardware. > > Another option would be to look at the reset subsystem, and have this > driver export a reset controller, which the SGPIO driver can bind to. > Given that the GPIO driver has been merged, if this will work, it is > probably a better solution. > That was my suggestion. Then you can ensure from the reset controller driver that this is done exactly once, either from the sgpio driver or from the switchdev driver. IIRC, the sgpio from the other SoCs are not affected by the reset. -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
Re: [PATCH 01/12 v2 RFC] net: group skb_shinfo zerocopy related bits together.
On Tue, Dec 22, 2020 at 09:43:15AM -0500, Willem de Bruijn wrote: > On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon > wrote: > > > > From: Jonathan Lemon > > > > In preparation for expanded zerocopy (TX and RX), move > > the ZC related bits out of tx_flags into their own flag > > word. > > > > Signed-off-by: Jonathan Lemon > > I think it's better to expand tx_flags to a u16 and add the two new > flags that you need. Okay, but in that case, tx_flags is now wrong, since some of these flags will also be checked on the rx path. > That allows the additional 7 bits to be used for arbitrary flags, not > stranding 8 bits exactly only for zerocopy features. > > Moving around a few u8's in the same cacheline won't be a problem. How about rearranging them to form 16 bits, and just calling it 'flags'? > I also prefer not to rename flags that some of us are familiar with, > if it's not needed. Ack. -- Jonathan
Re: [PATCH 05/12 v2 RFC] skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
On Tue, Dec 22, 2020 at 09:42:40AM -0500, Willem de Bruijn wrote:
> On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon wrote:
> >
> > From: Jonathan Lemon
> >
> > Replace sock_zerocopy_put with the generic skb_zcopy_put()
> > function. Pass 'true' as the success argument, as this
> > is identical to no change.
> >
> > Signed-off-by: Jonathan Lemon
>
> uarg->zerocopy may be false if sock_zerocopy_put_abort is called from
> tcp_sendmsg_locked

Yes, it may well be false. The original logic goes:

    sock_zerocopy_put_abort()
      sock_zerocopy_put()
        sock_zerocopy_callback(..., success = uarg->zerocopy)
          if (success)

The new logic is now:

    sock_zerocopy_put_abort()
      sock_zerocopy_callback(..., success = true)
        uarg->zerocopy = uarg->zerocopy & success
        if (uarg->zerocopy)

The success value is latched into uarg->zerocopy, and any failure is
persistent. Hence my comment about passing 'true' not changing the logic.
-- 
Jonathan
Re: [PATCH 04/12 v2 RFC] skbuff: Push status and refcounts into sock_zerocopy_callback
On Tue, Dec 22, 2020 at 09:43:39AM -0500, Willem de Bruijn wrote: > On Mon, Dec 21, 2020 at 7:09 PM Jonathan Lemon > wrote: > > > > From: Jonathan Lemon > > > > Before this change, the caller of sock_zerocopy_callback would > > need to save the zerocopy status, decrement and check the refcount, > > and then call the callback function - the callback was only invoked > > when the refcount reached zero. > > > > Now, the caller just passes the status into the callback function, > > which saves the status and handles its own refcounts. > > > > This makes the behavior of the sock_zerocopy_callback identical > > to the tpacket and vhost callbacks. > > > > Signed-off-by: Jonathan Lemon > > --- > > include/linux/skbuff.h | 3 --- > > net/core/skbuff.c | 14 +++--- > > 2 files changed, 11 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index fb6dd6af0f82..c9d7de9d624d 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -1482,9 +1482,6 @@ static inline void skb_zcopy_clear(struct sk_buff > > *skb, bool zerocopy) > > if (uarg) { > > if (skb_zcopy_is_nouarg(skb)) { > > /* no notification callback */ > > - } else if (uarg->callback == sock_zerocopy_callback) { > > - uarg->zerocopy = uarg->zerocopy && zerocopy; > > - sock_zerocopy_put(uarg); > > } else { > > uarg->callback(uarg, zerocopy); > > } > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index ea32b3414ad6..73699dbdc4a1 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -1194,7 +1194,7 @@ static bool skb_zerocopy_notify_extend(struct sk_buff > > *skb, u32 lo, u16 len) > > return true; > > } > > > > -void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > > +static void __sock_zerocopy_callback(struct ubuf_info *uarg) > > { > > struct sk_buff *tail, *skb = skb_from_uarg(uarg); > > struct sock_exterr_skb *serr; > > @@ -1222,7 +1222,7 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > > bool success) > > serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY; > > serr->ee.ee_data = hi; > > serr->ee.ee_info = lo; > > - if (!success) > > + if (!uarg->zerocopy) > > serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED; > > > > q = &sk->sk_error_queue; > > @@ -1241,11 +1241,19 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, > > bool success) > > consume_skb(skb); > > sock_put(sk); > > } > > + > > +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success) > > +{ > > + uarg->zerocopy = uarg->zerocopy & success; > > + > > + if (refcount_dec_and_test(&uarg->refcnt)) > > + __sock_zerocopy_callback(uarg); > > +} > > EXPORT_SYMBOL_GPL(sock_zerocopy_callback); > > I still think this helper is unnecessary. Just return immediately in > existing sock_zerocopy_callback if refcount is not zero. I think the helper makes the logic clearer, and prevents misuse of the success variable in the main function (use of last value vs the latched value). If you really feel that strongly about it, I'll fold it into one function. > > void sock_zerocopy_put(struct ubuf_info *uarg) > > { > > - if (uarg && refcount_dec_and_test(&uarg->refcnt)) > > + if (uarg) > > uarg->callback(uarg, uarg->zerocopy); > > } > > EXPORT_SYMBOL_GPL(sock_zerocopy_put); > > This does increase the number of indirect function calls. Which are > not cheap post spectre. > > In the common case for msg_zerocopy we only have two clones, one sent > out and one on the retransmit queue. So I guess the cost will be > acceptable. 
Yes, this was the source of my original comment about this being a slight pessimization - moving the refcount into the function. I briefly considered adding a flag like 'use_refcnt', so the logic becomes: if (uarg && (!uarg->use_refcnt || refcount_dec_and_test(&uarg->refcnt))) But thought this might be too much micro-optimization. But if the call overhead is significant, I can put this back in. -- Jonathan
Re: [RFC PATCH v2 2/8] net: sparx5: add the basic sparx5 driver
Hi Andrew, On Sat, 2020-12-19 at 20:11 +0100, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you > know the content is safe > > On Thu, Dec 17, 2020 at 08:51:28AM +0100, Steen Hegelund wrote: > > > +static struct sparx5_io_resource sparx5_iomap[] = { > > This could be made const i think,. Yes > > > + { TARGET_DEV2G5, 0, 0 }, /* 0x610004000: > > dev2g5_0 */ > > + { TARGET_DEV5G, 0x4000, 0 }, /* 0x610008000: > > dev5g_0 */ > > + { TARGET_PCS5G_BR, 0x8000, 0 }, /* 0x61000c000: > > pcs5g_br_0 */ > > + { TARGET_DEV2G5 + 1, 0xc000, 0 }, /* 0x61001: > > dev2g5_1 */ > > > +static int sparx5_create_targets(struct sparx5 *sparx5) > > +{ > > + int idx, jdx; > > + struct resource *iores[IO_RANGES]; > > + void __iomem *iomem[IO_RANGES]; > > + void __iomem *begin[IO_RANGES]; > > + int range_id[IO_RANGES]; > > Reverse Christmas tree. idx, jdx need to come last. Yes - I will check the entire file for RCT... > > > + > > + /* Check if done previously (deferred by serdes load) */ > > + if (sparx5->regs[sparx5_iomap[0].id]) > > + return 0; > > Could you explain this a bit more. Do you mean -EPROBE_DEFER? Yes that was the intended usage. I will change the startup flow to try to avoid checking this- > > > +static int sparx5_probe_port(struct sparx5 *sparx5, > > + struct device_node *portnp, > > + struct phy *serdes, > > + u32 portno, > > + struct sparx5_port_config *conf) > > +{ > > + struct sparx5_port *spx5_port; > > + struct net_device *ndev; > > + int err; > > + > > + err = sparx5_create_targets(sparx5); > > + if (err) > > + return err; > > This sees odd here. Don't sparx5_create_targets() create all the > targets, where as this creates one specific port? Seems like > sparx5_create_targets() should be in the devices as a whole probe, > not > the port probe. You are right - the name does not really fit (anymore). I will rework this. > > > + spx5_port = netdev_priv(ndev); > > + spx5_port->of_node = portnp; > > + spx5_port->serdes = serdes; > > + spx5_port->pvid = NULL_VID; > > + spx5_port->signd_internal = true; > > + spx5_port->signd_active_high = true; > > + spx5_port->signd_enable = true; > > + spx5_port->flow_control = false; > > + spx5_port->max_vlan_tags = SPX5_PORT_MAX_TAGS_NONE; > > + spx5_port->vlan_type = SPX5_VLAN_PORT_TYPE_UNAWARE; > > + spx5_port->custom_etype = 0x8880; /* Vitesse */ > > + conf->portmode = conf->phy_mode; > > + spx5_port->conf.speed = SPEED_UNKNOWN; > > + spx5_port->conf.power_down = true; > > + sparx5->ports[portno] = spx5_port; > > + return 0; > > I'm also not sure this has the correct name. This does not look like > a > typical probe function. Agree. > > > > +} > > + > > +static int sparx5_init_switchcore(struct sparx5 *sparx5) > > +{ > > + u32 value, pending, jdx, idx; > > + struct { > > + bool gazwrap; > > + void __iomem *init_reg; > > + u32 init_val; > > + } ram, ram_init_list[] = { > > + {false, spx5_reg_get(sparx5, ANA_AC_STAT_RESET), > > + ANA_AC_STAT_RESET_RESET}, > > + {false, spx5_reg_get(sparx5, ASM_STAT_CFG), > > + ASM_STAT_CFG_STAT_CNT_CLR_SHOT}, > > + {true, spx5_reg_get(sparx5, QSYS_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, REW_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, VOP_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, ANA_AC_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, ASM_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, EACL_RAM_INIT), 0}, > > + {true, spx5_reg_get(sparx5, VCAP_SUPER_RAM_INIT), > > 0}, > > + {true, spx5_reg_get(sparx5, DSM_RAM_INIT), 0} > > + }; > > Looks like this could be const as well. 
And this does not really fit > reverse christmas tree. I will update this. > > > + > > + spx5_rmw(EACL_POL_EACL_CFG_EACL_FORCE_INIT_SET(1), > > + EACL_POL_EACL_CFG_EACL_FORCE_INIT, > > + sparx5, > > + EACL_POL_EACL_CFG); > > + > > + spx5_rmw(EACL_POL_EACL_CFG_EACL_FORCE_INIT_SET(0), > > + EACL_POL_EACL_CFG_EACL_FORCE_INIT, > > + sparx5, > > + EACL_POL_EACL_CFG); > > + > > + /* Initialize memories, if not done already */ > > + value = spx5_rd(sparx5, HSCH_RESET_CFG); > > + > > + if (!(value & HSCH_RESET_CFG_CORE_ENA)) { > > + for (idx = 0; idx < 10; idx++) { > > + pending = ARRAY_SIZE(ram_init_list); > > + for (jdx = 0; jdx < > > ARRAY_SIZE(ram_init_list); jdx++) { > > + ram = ram_init_list[jdx]; > > +
[PATCH net 0/3] net: ipa: GSI interrupt handling fixes
This series implements fixes for some issues related to handling
interrupts when GSI channel and event ring commands complete.

The first issue is that the completion condition for an event ring or
channel command could occur while the associated interrupt is disabled.
This would cause the interrupt to fire when it is subsequently enabled,
even if the condition it signals had already been handled. The fix is
to clear any pending interrupt conditions before re-enabling the
interrupt.

The second and third patches change how the success of an event ring or
channel command is determined. These commands change the state of an
event ring or channel. Previously the receipt of a completion interrupt
was required to consider a command successful. Instead, a command is
successful if it changes the state of the target event ring or channel
in the way expected. This way the command can succeed even if the
completion interrupt did not arrive while it was enabled.

					-Alex

Alex Elder (3):
  net: ipa: clear pending interrupts before enabling
  net: ipa: use state to determine channel command success
  net: ipa: use state to determine event ring command success

 drivers/net/ipa/gsi.c | 89 +++
 1 file changed, 56 insertions(+), 33 deletions(-)

-- 
2.20.1
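[Editor's note] Illustrative only (the register names are placeholders, not the real GSI register map): the shape of the first fix is to ack any completion that latched while the interrupt was masked before unmasking it, so a stale condition cannot fire the moment the interrupt is re-enabled:

#include <linux/io.h>
#include <linux/types.h>

/* hypothetical register offsets, for illustration only */
#define IRQ_CLR_OFFSET	0x0	/* write-1-to-clear pending status */
#define IRQ_MSK_OFFSET	0x4	/* interrupt enable/mask bits */

static void irq_enable_after_ack(void __iomem *base, u32 bit)
{
	/* drop any completion that arrived while the interrupt was masked */
	iowrite32(bit, base + IRQ_CLR_OFFSET);
	/* only now unmask, so only *new* completions will fire */
	iowrite32(ioread32(base + IRQ_MSK_OFFSET) | bit, base + IRQ_MSK_OFFSET);
}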
[PATCH net 3/3] net: ipa: use state to determine event ring command success
This patch implements the same basic fix for event rings as the previous one
does for channels.

The result of issuing an event ring control command should be that the event
ring changes state. If enabled, a completion interrupt signals that the event
ring state has changed. This interrupt is enabled by gsi_evt_ring_command()
and disabled again after the command has completed (or we time out).

There is a window of time during which the command could complete successfully
without interrupting. This would cause the event ring to transition to the
desired new state.

So whether an event ring command ends via completion interrupt or timeout, we
can consider the command successful if the event ring has entered the desired
state (and a failure if it has not, regardless of the cause).

Fixes: b4175f8731f78 ("net: ipa: only enable GSI event control IRQs when needed")
Signed-off-by: Alex Elder
---
 drivers/net/ipa/gsi.c |   34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
index 4f0e791764237..579cc3e516775 100644
--- a/drivers/net/ipa/gsi.c
+++ b/drivers/net/ipa/gsi.c
@@ -384,13 +384,15 @@ static int gsi_evt_ring_alloc_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_ALLOCATE);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_ALLOCATED) {
-		dev_err(gsi->dev, "event ring %u bad state %u after alloc\n",
-			evt_ring_id, evt_ring->state);
-		ret = -EIO;
-	}
 
-	return ret;
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_ALLOCATED)
+		return 0;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after alloc\n",
+		evt_ring_id, evt_ring->state);
+
+	return -EIO;
 }
 
 /* Reset a GSI event ring in ALLOCATED or ERROR state. */
@@ -408,9 +410,13 @@ static void gsi_evt_ring_reset_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_RESET);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_ALLOCATED)
-		dev_err(gsi->dev, "event ring %u bad state %u after reset\n",
-			evt_ring_id, evt_ring->state);
+
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_ALLOCATED)
+		return;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after reset\n",
+		evt_ring_id, evt_ring->state);
 }
 
 /* Issue a hardware de-allocation request for an allocated event ring */
@@ -426,9 +432,13 @@ static void gsi_evt_ring_de_alloc_command(struct gsi *gsi, u32 evt_ring_id)
 	}
 
 	ret = evt_ring_command(gsi, evt_ring_id, GSI_EVT_DE_ALLOC);
-	if (!ret && evt_ring->state != GSI_EVT_RING_STATE_NOT_ALLOCATED)
-		dev_err(gsi->dev, "event ring %u bad state %u after dealloc\n",
-			evt_ring_id, evt_ring->state);
+
+	/* If successful the event ring state will have changed */
+	if (evt_ring->state == GSI_EVT_RING_STATE_NOT_ALLOCATED)
+		return;
+
+	dev_err(gsi->dev, "event ring %u bad state %u after dealloc\n",
+		evt_ring_id, evt_ring->state);
 }
 
 /* Fetch the current state of a channel from hardware */
-- 
2.20.1