Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-03-06 Thread Stefano Garzarella
Hi Adalbert,
thanks for catching this issue, I have a comment below.

On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:
> Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic after
> device hot-unplug"), vsock_core_init() was called from
> virtio_vsock_probe(). Now, virtio_transport_reset_no_sock() can be called
> before vsock_core_init() has the chance to run.
> 
> [Wed Feb 27 14:17:09 2019] BUG: unable to handle kernel NULL pointer 
> dereference at 0110
> [Wed Feb 27 14:17:09 2019] #PF error: [normal kernel read fault]
> [Wed Feb 27 14:17:09 2019] PGD 0 P4D 0
> [Wed Feb 27 14:17:09 2019] Oops:  [#1] SMP PTI
> [Wed Feb 27 14:17:09 2019] CPU: 3 PID: 59 Comm: kworker/3:1 Not tainted 
> 5.0.0-rc7-390-generic-hvi #390
> [Wed Feb 27 14:17:09 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 
> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [Wed Feb 27 14:17:09 2019] Workqueue: virtio_vsock virtio_transport_rx_work 
> [vmw_vsock_virtio_transport]
> [Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
> [vmw_vsock_virtio_transport_common]
> [Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
> 8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
> df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
> [Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
> [Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
> 0003
> [Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
> 9d79637ee080
> [Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
> 9d796f403500
> [Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
> 9d7969d09240
> [Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
> 9d796d48ff80
> [Wed Feb 27 14:17:09 2019] FS:  () 
> GS:9d796fac() knlGS:
> [Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
> [Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
> 06e0
> [Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 
> 
> [Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [Wed Feb 27 14:17:09 2019] Call Trace:
> [Wed Feb 27 14:17:09 2019]  virtio_transport_recv_pkt+0x63/0x820 
> [vmw_vsock_virtio_transport_common]
> [Wed Feb 27 14:17:09 2019]  ? kfree+0x17e/0x190
> [Wed Feb 27 14:17:09 2019]  ? detach_buf_split+0x145/0x160
> [Wed Feb 27 14:17:09 2019]  ? __switch_to_asm+0x40/0x70
> [Wed Feb 27 14:17:09 2019]  virtio_transport_rx_work+0xa0/0x106 
> [vmw_vsock_virtio_transport]
> [Wed Feb 27 14:17:09 2019] NET: Registered protocol family 40
> [Wed Feb 27 14:17:09 2019]  process_one_work+0x167/0x410
> [Wed Feb 27 14:17:09 2019]  worker_thread+0x4d/0x460
> [Wed Feb 27 14:17:09 2019]  kthread+0x105/0x140
> [Wed Feb 27 14:17:09 2019]  ? rescuer_thread+0x360/0x360
> [Wed Feb 27 14:17:09 2019]  ? kthread_destroy_worker+0x50/0x50
> [Wed Feb 27 14:17:09 2019]  ret_from_fork+0x35/0x40
> [Wed Feb 27 14:17:09 2019] Modules linked in: vmw_vsock_virtio_transport 
> vmw_vsock_virtio_transport_common input_leds vsock serio_raw i2c_piix4 
> mac_hid qemu_fw_cfg autofs4 cirrus ttm drm_kms_helper syscopyarea sysfillrect 
> sysimgblt fb_sys_fops virtio_net psmouse drm net_failover pata_acpi 
> virtio_blk failover floppy
> [Wed Feb 27 14:17:09 2019] CR2: 0110
> [Wed Feb 27 14:17:09 2019] ---[ end trace baa35abd2e040fe5 ]---
> [Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
> [vmw_vsock_virtio_transport_common]
> [Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
> 8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
> df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
> [Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
> [Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
> 0003
> [Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
> 9d79637ee080
> [Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
> 9d796f403500
> [Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
> 9d7969d09240
> [Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
> 9d796d48ff80
> [Wed Feb 27 14:17:09 2019] FS:  () 
> GS:9d796fac() knlGS:
> [Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
> [Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
> 06e0
> [Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 
> 
> [Wed Feb 27 14:17:09 20

[PATCH net] net: hns3: Fix a logical vs bitwise typo

2019-03-06 Thread Dan Carpenter
There were a couple logical ORs accidentally mixed in with the bitwise
ORs.

Fixes: e8149933b1fa ("net: hns3: remove hnae3_get_bit in data path")
Signed-off-by: Dan Carpenter 
---
Very recent bug.

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 3cb43b1f1c2e..1e4efc47c7a5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2321,8 +2321,8 @@ static void hns3_rx_checksum(struct hns3_enet_ring *ring, 
struct sk_buff *skb,
if (!(bd_base_info & BIT(HNS3_RXD_L3L4P_B)))
return;
 
-   if (unlikely(l234info & (BIT(HNS3_RXD_L3E_B) | BIT(HNS3_RXD_L4E_B) ||
-BIT(HNS3_RXD_OL3E_B) ||
+   if (unlikely(l234info & (BIT(HNS3_RXD_L3E_B) | BIT(HNS3_RXD_L4E_B) |
+BIT(HNS3_RXD_OL3E_B) |
 BIT(HNS3_RXD_OL4E_B)))) {
u64_stats_update_begin(&ring->syncp);
ring->stats.l3l4_csum_err++;
-- 
2.17.1
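
To see why the mix-up in the patch above matters, here is a minimal userspace sketch (not the driver code). The DEMO_* bit positions and the helper names are hypothetical; only the operator pattern mirrors the patch. Because `|` binds tighter than `||`, the buggy mask collapses to the constant 1, so the test only ever looks at bit 0.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1U << (n))

/* Hypothetical bit positions, chosen only for illustration. */
#define DEMO_L3E_B  1
#define DEMO_L4E_B  2
#define DEMO_OL3E_B 3
#define DEMO_OL4E_B 4

/* Compile-time proof that the logical ORs reduce the mask to 1. */
static_assert((BIT(DEMO_L3E_B) | BIT(DEMO_L4E_B) ||
	       BIT(DEMO_OL3E_B) || BIT(DEMO_OL4E_B)) == 1,
	      "logical OR collapses the mask to 1");

/* Buggy form: effectively tests (info & 1). */
static int demo_csum_err_buggy(uint32_t info)
{
	return (info & (BIT(DEMO_L3E_B) | BIT(DEMO_L4E_B) ||
			BIT(DEMO_OL3E_B) ||
			BIT(DEMO_OL4E_B))) != 0;
}

/* Fixed form: every error bit takes part in the mask. */
static int demo_csum_err_fixed(uint32_t info)
{
	return (info & (BIT(DEMO_L3E_B) | BIT(DEMO_L4E_B) |
			BIT(DEMO_OL3E_B) |
			BIT(DEMO_OL4E_B))) != 0;
}
```

With these definitions, an input that has only an error bit set is missed by the buggy variant and caught by the fixed one, while an input with only bit 0 set is wrongly reported by the buggy variant.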



Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread si-wei liu




On 3/5/2019 11:23 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:15:06PM -0800, si-wei liu wrote:


On 3/5/2019 10:43 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:51:00PM -0800, si-wei liu wrote:

On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote:

On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote:

On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I did consider this option, but it appears more invasive than
converting existing scripts would be, quite apart from the
controversy of introducing internal netdev state that differs from the
user-visible state. Whether we disallow the slave from being brought up by
the user, or leave IFF_UP unset and use an internal flag instead, we could
end up with a substantial behavioral change that breaks scripts. Consider an
admin script that runs `ip link set dev ... up' successfully and then simply
assumes the link is up, so that subsequent operations can proceed as usual.

How would it work when carrier is off?


While it *may* work for dracut (yet to be verified), I'm a bit concerned
that there are more scripts to be converted than those that don't follow
volatile failover slave names. It's technically doable, but may not be
worth the effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP

True, that's what I said about introducing internal state for those drivers
and other kernel components. A very invasive change indeed.

-Siwei

Well I did say it's vague.
How about hiding IFF_UP from dev_get_flags (and probably
__dev_change_flags)?


Is that any different? The kernel change certainly has a small footprint,
but the discrepancy is still there. Anyone who writes code for IFF_UP will
not notice IFF_FAILOVER_SLAVE.

Not to mention more userspace "fixup" work has to be done due to this
change.

-Siwei



Point is it's ok since most userspace should just ignore slaves
- hopefully it will just ignore it since it already
ignores interfaces that are down.

An admin script may assume the interface can be brought up and carry on
with further operations without checking the UP flag.

These scripts then would be broken on any box with multiple interfaces
since not all of these would have carrier.

Consider a script that executes `ifconfig ... up' and, once that succeeds,
runs tcpdump or some other command that relies on the interface being UP.
It's quite common for such scripts not to check the UP flag and instead rely
on the well-known fact that the command exiting with 0 means the interface
should be UP. This change might well break scripts of that kind.

I am sorry I don't get it. Could you give an example
of a script that works now but would be broken?


https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/netdevice.sh#L27
https://github.com/WPO-Foundation/wptagent/blob/master/internal/adb.py#L443
https://github.com/openstack/steth/blob/master/steth/agent/api.py#L134

There are more if you keep searching.

-Siwei







It doesn't look like a reliable
way of prohibiting userspace from operating on slaves.

-Siwei



This does not mean we shouldn't make an effort to disable broken
configurations.

I am not arguing against your patch. Not at all. I see better
hiding of slaves as a separate enhancement.

I understand, but my point is we should try to minimize unnecessary side
impact to the current usage for whatever "hiding" effort we can make. It's
hard to find a tradeoff sometimes.

Yes if some userspace made an assumption and it worked, we should keep
it working I think. I don't necessarily agree we should worry too much
about theoretical issues. In half a year since the feature got merged
it's unlikely there are millions of slightly different scripts using it.



Acked-by: Michael S. Tsirkin 



Thank you.

-Siwei




Re: [PATCH] net/sched: avoid unused-label warning

2019-03-06 Thread Arnd Bergmann
On Wed, Mar 6, 2019 at 1:28 AM Cong Wang  wrote:
> On Mon, Mar 4, 2019 at 12:40 PM Arnd Bergmann  wrote:
> > diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
> > index 2a5f215ae876..3beb4717d3b7 100644
> > --- a/net/sched/act_tunnel_key.c
> > +++ b/net/sched/act_tunnel_key.c
> > @@ -392,8 +392,8 @@ static int tunnel_key_init(struct net *net, struct 
> > nlattr *nla,
> >  #ifdef CONFIG_DST_CACHE
> > if (metadata)
> > dst_cache_destroy(&metadata->u.tun_info.dst_cache);
> > -#endif
> >  release_tun_meta:
> > +#endif
>
> These #ifdef's are ugly, either we should select DST_CACHE
> or provide a nop for these dst_cache_*() APIs when it is not
> enabled.

I agree that would be nicer, or alternatively convert the preprocessor
conditionals to C conditionals like

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 3beb4717d3b7..586343a5accc 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -327,11 +327,11 @@ static int tunnel_key_init(struct net *net,
struct nlattr *nla,
goto err_out;
}

-#ifdef CONFIG_DST_CACHE
-   ret = dst_cache_init(&metadata->u.tun_info.dst_cache,
GFP_KERNEL);
-   if (ret)
-   goto release_tun_meta;
-#endif
+   if (IS_ENABLED(CONFIG_DST_CACHE)) {
+   ret =
dst_cache_init(&metadata->u.tun_info.dst_cache, GFP_KERNEL);
+   if (ret)
+   goto release_tun_meta;
+   }

if (opts_len) {
ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
@@ -389,11 +389,9 @@ static int tunnel_key_init(struct net *net,
struct nlattr *nla,
return ret;

 release_dst_cache:
-#ifdef CONFIG_DST_CACHE
if (metadata)
dst_cache_destroy(&metadata->u.tun_info.dst_cache);
 release_tun_meta:
-#endif
if (metadata)
dst_release(&metadata->dst);

Usually, you'd want to do that consistently though, and change all the
related checks at the same time, so I would keep that separate from
the trivial bugfix.
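
As a rough illustration of why the C-conditional form avoids the unused-label warning, here is a self-contained userspace sketch. It is not the act_tunnel_key code: DEMO_DST_CACHE_ENABLED is a hypothetical stand-in for IS_ENABLED(CONFIG_DST_CACHE), and all demo_* names are made up. Both branches are always compiled, so every label stays referenced, and the compiler can drop the dead branch when the constant is 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_DST_CACHE): constant 0 or 1. */
#define DEMO_DST_CACHE_ENABLED 0

struct demo_meta {
	int cache_ready;
};

/* Hypothetical init step that only matters when the feature is on. */
static int demo_cache_init(struct demo_meta *m)
{
	m->cache_ready = 1;
	return 0;
}

static int demo_tunnel_init(int fail_late)
{
	struct demo_meta *m = calloc(1, sizeof(*m));
	int ret;

	if (!m)
		return -ENOMEM;

	/* Always compiled and type-checked; dead code when the constant is 0. */
	if (DEMO_DST_CACHE_ENABLED) {
		ret = demo_cache_init(m);
		if (ret)
			goto release_meta;
	}

	if (fail_late) {
		ret = -EINVAL;
		goto release_cache;
	}

	assert(m->cache_ready == DEMO_DST_CACHE_ENABLED);
	free(m);
	return 0;

release_cache:
	if (DEMO_DST_CACHE_ENABLED)
		m->cache_ready = 0;
release_meta:	/* still referenced above, so no unused-label warning */
	free(m);
	return ret;
}
```

Flipping DEMO_DST_CACHE_ENABLED to 1 exercises the other branch without touching any preprocessor conditionals around the labels.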

 Arnd


Re: [PATCH] net/rds: Accept peer connection reject messages due to incompatible version

2019-03-06 Thread Yanjun Zhu



On 2019/3/6 15:04, Gerd Rausch wrote:

Prior to
commit d021fabf525ff ("rds: rdma: add consumer reject")

function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
connection attempt by issuing a "rds_conn_drop".

The commit mentioned above added a "break", eliminating
the "fallthrough" case and made the "rds_conn_drop" rather conditional:

Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
carries an integer-value of "1" inside "private_data":


if (!conn)
+   break;
+   err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
+   if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+   pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping 
connection\n",
+   &conn->c_laddr, &conn->c_faddr);
+   conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+   rds_conn_drop(conn);
+   }
 rdsdebug("Connection rejected: %s\n",
  rdma_reject_msg(cm_id, event->status));
+   break;
 /* FALLTHROUGH */

A number of issues are worth mentioning here:
   #1) Previous versions of the RDS code simply rejected a connection
   by calling "rdma_reject(cm_id, NULL, 0);"
   So the value of the payload in "private_data" will not be "1",
   but "0".

   #2) Now the code has become dependent on host byte order and sizing.
   If one peer is big-endian, the other is little-endian,
   or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
   the *err check does not work as intended.

   #3) There is no check for "len" to see if the data behind *err is even valid.
   Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
   carry 148 bytes of zeroized payload.
   But that should probably not be relied upon here.

   #4) With the added "break;",
   we might as well drop the misleading "/* FALLTHROUGH */" comment.

This commit does _not_ address issue #2, as the sender would have to
agree on a byte order as well.

Here is the sequence of messages in this observed error-scenario:
   Host-A is pre-QoS changes (excluding the commit mentioned above)
   Host-B is post-QoS changes (including the commit mentioned above)

   #1 Host-B
  issues a connection request via function "rds_conn_path_transition"
  connection state transitions to "RDS_CONN_CONNECTING"

   #2 Host-A
  rejects the incompatible connection request (from #1)
  It does so by calling "rdma_reject(cm_id, NULL, 0);"

   #3 Host-B
  receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
  But since the code is changed in the way described above,
  it won't drop the connection here, simply because "*err == 0".

   #4 Host-A
  issues a connection request

   #5 Host-B
  receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
  and ends up calling "rds_ib_cm_handle_connect".
  But since the state is already in "RDS_CONN_CONNECTING"
  (as of #1) it will end up issuing a "rdma_reject" without
  dropping the connection:
 if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
 /* Wait and see - our connect may still be succeeding */
 rds_ib_stats_inc(s_ib_connect_raced);
 }
 goto out;

   #6 Host-A
  receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
  drops the connection and tries again (goto #4) until it gives up.

Orabug: 29444532


This is an internal bug number. It should be removed.

Zhu Yanjun



Signed-off-by: Gerd Rausch 
---
  net/rds/rdma_transport.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 46bce8389066..f628e7fda66d 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -112,7 +112,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id 
*cm_id,
if (!conn)
break;
err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
-   if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+   if (!err || (err && len >= sizeof(*err) && ((*err) <= 
RDS_RDMA_REJ_INCOMPAT))) {
pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping 
connection\n",
&conn->c_laddr, &conn->c_faddr);
conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
@@ -122,7 +122,6 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id 
*cm_id,
rdsdebug("Connection rejected: %s\n",
 rdma_reject_msg(cm_id, event->status));
break;
-   /* FALLTHROUGH */
case RDMA_CM_EVENT_ADDR_ERROR:
case RDMA_CM_EVENT_ROUTE_ERROR:
case RDMA_CM_EVENT_CONNECT_ERROR:
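
The decision logic of the patched condition can be sketched in plain C as follows. This is only an illustration under stated assumptions: demo_should_drop() and DEMO_REJ_INCOMPAT are hypothetical names, the value 1 is taken from the description above, and the payload is still interpreted in host byte order (issue #2 is not addressed here either).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value taken from the description above: "1" means incompatible. */
#define DEMO_REJ_INCOMPAT 1

/* Returns 1 if the connection should be dropped, 0 otherwise. */
static int demo_should_drop(const void *private_data, uint8_t len)
{
	int err;

	if (!private_data)
		return 1;	/* no consumer data at all */
	if (len < sizeof(err))
		return 0;	/* too short to hold an int */

	memcpy(&err, private_data, sizeof(err));	/* host byte order */
	return err <= DEMO_REJ_INCOMPAT;	/* 0 from old peers, or 1 */
}

/* Usage: a zeroed payload from an old peer must still trigger the drop. */
static void demo_should_drop_example(void)
{
	unsigned char zeroed[148] = { 0 };

	assert(demo_should_drop(zeroed, sizeof(zeroed)));
	assert(!demo_should_drop(zeroed, 2));
}
```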


Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-03-06 Thread Stefan Hajnoczi
On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:

Thanks for the patch, Adalbert!  Please add a Signed-off-by tag so your
patch can be merged (see Documentation/process/submitting-patches.rst
Chapter 11 for details on the Developer's Certificate of Origin).

>  static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
>  {
> + const struct virtio_transport *t;
>   struct virtio_vsock_pkt_info info = {
>   .op = VIRTIO_VSOCK_OP_RST,
>   .type = le16_to_cpu(pkt->hdr.type),
> @@ -680,7 +681,11 @@ static int virtio_transport_reset_no_sock(struct 
> virtio_vsock_pkt *pkt)
>   if (!pkt)
>   return -ENOMEM;
>  
> - return virtio_transport_get_ops()->send_pkt(pkt);
> + t = virtio_transport_get_ops();
> + if (!t)
> + return -ENOTCONN;

pkt is leaked here.  This is an easy mistake to make because the code is
unclear.  The pkt argument is the received packet that we must reply to.
The reply packet is allocated just before line 680 and must be freed
explicitly before returning -ENOTCONN.

You can avoid the leak and make the code easier to read like this:

  struct virtio_vsock_pkt *reply;

  ...

  /* use a separate variable; avoid reusing 'pkt' */
  reply = virtio_transport_alloc_pkt(&info, 0, ...);
  if (!reply)
  return -ENOMEM;

  t = virtio_transport_get_ops();
  if (!t) {
  virtio_transport_free_pkt(reply); <-- prevent memory leak
  return -ENOTCONN;
  }
  return t->send_pkt(reply);
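
The ownership rule in the sketch above can be tried out in userspace with a small mock. Everything here is hypothetical (the demo_* types, the allocation counter, the opcode value); it only models the pattern "free the reply on the early-return path", not the real virtio transport API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_pkt {
	int op;
};

struct demo_transport {
	int (*send_pkt)(struct demo_pkt *pkt);	/* takes ownership of pkt */
};

static int demo_live_pkts;			/* outstanding allocations */
static const struct demo_transport *demo_ops;	/* NULL until registered */

static struct demo_pkt *demo_alloc_pkt(int op)
{
	struct demo_pkt *pkt = malloc(sizeof(*pkt));

	if (pkt) {
		pkt->op = op;
		demo_live_pkts++;
	}
	return pkt;
}

static void demo_free_pkt(struct demo_pkt *pkt)
{
	free(pkt);
	demo_live_pkts--;
}

/* Reply to a received packet with a reset; 'pkt' stays owned by the caller. */
static int demo_reset_no_sock(const struct demo_pkt *pkt)
{
	const struct demo_transport *t;
	struct demo_pkt *reply;

	assert(pkt != NULL);

	reply = demo_alloc_pkt(1 /* hypothetical RST opcode */);
	if (!reply)
		return -ENOMEM;

	t = demo_ops;
	if (!t) {
		demo_free_pkt(reply);	/* nobody else will free it */
		return -ENOTCONN;
	}
	return t->send_pkt(reply);
}
```

With no transport registered, the call fails with -ENOTCONN and the allocation counter returns to zero, which is the property the review asks for.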

Stefan




Re: [PATCH bpf] bpf: fix sanitation rewrite in case of non-pointers

2019-03-06 Thread Jakub Sitnicki
On Tue, Mar 05, 2019 at 03:30 PM CET, Daniel Borkmann wrote:
> On 03/05/2019 03:12 PM, Jakub Sitnicki wrote:
> [...]
>> Could you please queue it for -stable which has d3bd7413e0ca ("bpf: fix
>> sanitation of alu op with pointer / scalar type from different paths")?
>
> Already done here yesterday morning:
>
> https://lore.kernel.org/stable/40b25ec1c31e234cf7eee75d62083a9a4bcbdbfe.1551702973.git.dan...@iogearbox.net/T/#u

Ah, you're one step ahead of me. Thank you.


Re: [PATCH net] net: hns3: Fix a logical vs bitwise typo

2019-03-06 Thread Yunsheng Lin
On 2019/3/6 16:12, Dan Carpenter wrote:
> There were a couple logical ORs accidentally mixed in with the bitwise
> ORs.

Thanks for the fix.

Reviewed-by: Yunsheng Lin 

> 
> Fixes: e8149933b1fa ("net: hns3: remove hnae3_get_bit in data path")
> Signed-off-by: Dan Carpenter 
> ---
> Very recent bug.
> 
>  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
> b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> index 3cb43b1f1c2e..1e4efc47c7a5 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> @@ -2321,8 +2321,8 @@ static void hns3_rx_checksum(struct hns3_enet_ring 
> *ring, struct sk_buff *skb,
>   if (!(bd_base_info & BIT(HNS3_RXD_L3L4P_B)))
>   return;
>  
> - if (unlikely(l234info & (BIT(HNS3_RXD_L3E_B) | BIT(HNS3_RXD_L4E_B) ||
> -  BIT(HNS3_RXD_OL3E_B) ||
> + if (unlikely(l234info & (BIT(HNS3_RXD_L3E_B) | BIT(HNS3_RXD_L4E_B) |
> +  BIT(HNS3_RXD_OL3E_B) |
>BIT(HNS3_RXD_OL4E_B)))) {
>   u64_stats_update_begin(&ring->syncp);
>   ring->stats.l3l4_csum_err++;
> 



[PATCH net v2] ipv4/route: fail early when inet dev is missing

2019-03-06 Thread Paolo Abeni
If a non-local multicast packet reaches ip_route_input_rcu() while
the ingress device IPv4 private data (in_dev) is NULL, we end up
doing a NULL pointer dereference in IN_DEV_MFORWARD().

Since the later call to ip_route_input_mc() is going to fail if
!in_dev, we can fail early in such a scenario and avoid the dangerous
code path.

v1 -> v2:
 - clarified the commit message, no code changes

Reported-by: Tianhao Zhao 
Fixes: e58e41596811 ("net: Enable support for VRF with ipv4 multicast")
Signed-off-by: Paolo Abeni 
Reviewed-by: David Ahern 
---
 net/ipv4/route.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7bb9128c8363..e40e56e014a0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2144,12 +2144,13 @@ int ip_route_input_rcu(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
int our = 0;
int err = -EINVAL;
 
-   if (in_dev)
-   our = ip_check_mc_rcu(in_dev, daddr, saddr,
- ip_hdr(skb)->protocol);
+   if (!in_dev)
+   return err;
+   our = ip_check_mc_rcu(in_dev, daddr, saddr,
+ ip_hdr(skb)->protocol);
 
/* check l3 master if no match yet */
-   if ((!in_dev || !our) && netif_is_l3_slave(dev)) {
+   if (!our && netif_is_l3_slave(dev)) {
struct in_device *l3_in_dev;
 
l3_in_dev = __in_dev_get_rcu(skb->dev);
-- 
2.20.1
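
The fail-early control flow of the patch above can be modelled with a simplified userspace sketch. The demo_* names and the struct are hypothetical and the routing decision is reduced to a single flag; the point is only that the NULL check now precedes every dereference of in_dev.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_in_dev {
	int mc_forwarding;
};

/* Hypothetical stand-in for IN_DEV_MFORWARD(): dereferences in_dev. */
static int demo_mforward(const struct demo_in_dev *in_dev)
{
	assert(in_dev != NULL);	/* a NULL here models the crash being avoided */
	return in_dev->mc_forwarding;
}

/* Hypothetical multicast input step following the patched control flow. */
static int demo_route_input_mc(const struct demo_in_dev *in_dev, int our)
{
	int err = -EINVAL;

	if (!in_dev)
		return err;	/* fail early: nothing below can succeed */

	if (our || demo_mforward(in_dev))
		err = 0;
	return err;
}
```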



[PATCH 1/2] net: phy: mscc: add support for VSC8514 PHY

2019-03-06 Thread Kavyasree.Kotagiri
From: Kavya Sree Kotagiri 

The VSC8514 is a 4-port PHY that supports 10/100/1000BASE-T, 100BASE-FX and
1000BASE-X, and can communicate with the MAC via QSGMII.
The MAC interface protocol for each port within QSGMII can
be either 1000BASE-X or SGMII, if the QSGMII MAC that the VSC8514 is
connecting to supports this functionality.
The VSC8514 also supports SGMII MAC-side autonegotiation on each individual
port, downshifting, a configurable blinking pattern for each of its 4 LEDs,
SyncE, 1000BASE-T Ring Resiliency and HP Auto-MDIX detection.

This adds support for 10BASE-T, 100BASE-TX, and 1000BASE-T,
QSGMII link with the MAC, downshifting, HP Auto-MDIX detection
and blinking pattern for its 4 LEDs.

The GPIO register bank is a set of registers that are common to all PHYs
in the package. So any modification in any register of this bank affects
all PHYs of the package.

If the PHYs haven't been reset before booting the Linux kernel and were
configured to use interrupts for e.g. link status updates, it is
required to clear the interrupt mask register of all PHYs before being
able to use interrupts with any PHY. The first PHY of the package to be
initialized takes care of clearing the interrupt mask registers of all
PHYs. Thus, we need to keep track of whether the init sequence has
already been done for the package.

Most of the init sequence of a PHY of the package is common to all PHYs
in the package, thus we use the SMI broadcast feature which enables us
to propagate a write in one register of one PHY to all PHYs in the same
package.

Signed-off-by: Kavya Sree Kotagiri 
Signed-off-by: Quentin Schulz 
Co-developed-by: Quentin Schulz 
---
 drivers/net/phy/Kconfig |   2 +-
 drivers/net/phy/mscc.c  | 381 
 2 files changed, 382 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 3d187cd..17d022e 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -382,7 +382,7 @@ config MICROCHIP_T1_PHY
 config MICROSEMI_PHY
tristate "Microsemi PHYs"
---help---
- Currently supports VSC8530, VSC8531, VSC8540 and VSC8541 PHYs
+ Currently supports VSC8514, VSC8530, VSC8531, VSC8540 and VSC8541 PHYs
 
 config NATIONAL_PHY
tristate "National Semiconductor PHYs"
diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index db50efb..43141e6 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -85,12 +85,29 @@ enum rgmii_rx_clock_delay {
 #define LED_MODE_SEL_MASK(x) (GENMASK(3, 0) << LED_MODE_SEL_POS(x))
 #define LED_MODE_SEL(x, mode)(((mode) << LED_MODE_SEL_POS(x)) & 
LED_MODE_SEL_MASK(x))
 
+#define MSCC_EXT_PAGE_CSR_CNTL_17   17
+#define MSCC_EXT_PAGE_CSR_CNTL_18   18
+
+#define MSCC_EXT_PAGE_CSR_CNTL_19   19
+#define MSCC_PHY_CSR_CNTL_19_REG_ADDR(x)(x)
+#define MSCC_PHY_CSR_CNTL_19_TARGET(x)  ((x) << 12)
+#define MSCC_PHY_CSR_CNTL_19_READ   BIT(14)
+#define MSCC_PHY_CSR_CNTL_19_CMD    BIT(15)
+
+#define MSCC_EXT_PAGE_CSR_CNTL_20   20
+#define MSCC_PHY_CSR_CNTL_20_TARGET(x)  (x)
+
+#define PHY_MCB_TARGET    0x07
+#define PHY_MCB_S6G_WRITE 0x8000
+#define PHY_MCB_S6G_READ  0x4000
+
 #define MSCC_EXT_PAGE_ACCESS 31
 #define MSCC_PHY_PAGE_STANDARD   0x0000 /* Standard registers */
 #define MSCC_PHY_PAGE_EXTENDED   0x0001 /* Extended registers */
 #define MSCC_PHY_PAGE_EXTENDED_2 0x0002 /* Extended reg - page 2 */
 #define MSCC_PHY_PAGE_EXTENDED_3 0x0003 /* Extended reg - page 3 */
 #define MSCC_PHY_PAGE_EXTENDED_4 0x0004 /* Extended reg - page 4 */
+#define MSCC_PHY_PAGE_CSR_CNTL   MSCC_PHY_PAGE_EXTENDED_4
 /* Extended reg - GPIO; this is a bank of registers that are shared for all 
PHYs
  * in the same package.
  */
@@ -216,6 +233,7 @@ enum rgmii_rx_clock_delay {
 #define MSCC_PHY_TR_MSB  18
 
 /* Microsemi PHY ID's */
+#define PHY_ID_VSC8514   0x00070670
 #define PHY_ID_VSC8530   0x00070560
 #define PHY_ID_VSC8531   0x00070570
 #define PHY_ID_VSC8540   0x00070760
@@ -1742,6 +1760,319 @@ static int vsc8584_did_interrupt(struct phy_device 
*phydev)
return (rc < 0) ? 0 : rc & MII_VSC85XX_INT_MASK_MASK;
 }
 
+static int vsc8514_config_pre_init(struct phy_device *phydev)
+{
+   unsigned int i;
+   u16 reg;
+   const struct reg_val pre_init1[] = {
+   {0x0f90, 0x00688980},
+   {0x0786, 0x0003},
+   {0x07fa, 0x0050100f},
+   {0x0f82, 0x0012b002},
+   {0x1686, 0x0004},
+   {0x168c, 0x00d2c46f},
+   {0x17a2, 0x0620},
+   {0x16a0, 0x00eeffdd},
+   {0x16a6, 0x00071448},
+   {0x16a4, 0x0013132f},
+   {0x16a8, 0x},
+  

[PATCH 2/2] net: phy: vitesse: Remove support for VSC8514

2019-03-06 Thread Kavyasree.Kotagiri
From: Kavya Sree Kotagiri 

Support for the VSC8514 was recently added to the Microsemi driver (mscc.c)
with more features. The features supported by the Vitesse driver are also
supported by the Microsemi driver.

Signed-off-by: Kavya Sree Kotagiri 
---
 drivers/net/phy/vitesse.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/drivers/net/phy/vitesse.c b/drivers/net/phy/vitesse.c
index dc0dd87..e5eb98e 100644
--- a/drivers/net/phy/vitesse.c
+++ b/drivers/net/phy/vitesse.c
@@ -61,7 +61,6 @@
 
 #define PHY_ID_VSC8234 0x000fc620
 #define PHY_ID_VSC8244 0x000fc6c0
-#define PHY_ID_VSC8514 0x00070670
 #define PHY_ID_VSC8572 0x000704d0
 #define PHY_ID_VSC8601 0x00070420
 #define PHY_ID_VSC7385 0x00070450
@@ -293,7 +292,6 @@ static int vsc82xx_config_intr(struct phy_device *phydev)
err = phy_write(phydev, MII_VSC8244_IMASK,
(phydev->drv->phy_id == PHY_ID_VSC8234 ||
 phydev->drv->phy_id == PHY_ID_VSC8244 ||
-phydev->drv->phy_id == PHY_ID_VSC8514 ||
 phydev->drv->phy_id == PHY_ID_VSC8572 ||
 phydev->drv->phy_id == PHY_ID_VSC8601) ?
MII_VSC8244_IMASK_MASK :
@@ -404,15 +402,6 @@ static int vsc82x4_config_aneg(struct phy_device *phydev)
.ack_interrupt  = &vsc824x_ack_interrupt,
.config_intr= &vsc82xx_config_intr,
 }, {
-   .phy_id = PHY_ID_VSC8514,
-   .name   = "Vitesse VSC8514",
-   .phy_id_mask= 0x0000,
-   .features   = PHY_GBIT_FEATURES,
-   .config_init= &vsc824x_config_init,
-   .config_aneg= &vsc82x4_config_aneg,
-   .ack_interrupt  = &vsc824x_ack_interrupt,
-   .config_intr= &vsc82xx_config_intr,
-}, {
.phy_id = PHY_ID_VSC8572,
.name   = "Vitesse VSC8572",
.phy_id_mask= 0x0000,
@@ -499,7 +488,6 @@ static int vsc82x4_config_aneg(struct phy_device *phydev)
 static struct mdio_device_id __maybe_unused vitesse_tbl[] = {
{ PHY_ID_VSC8234, 0x0000 },
{ PHY_ID_VSC8244, 0x000fffc0 },
-   { PHY_ID_VSC8514, 0x0000 },
{ PHY_ID_VSC8572, 0x0000 },
{ PHY_ID_VSC7385, 0x0000 },
{ PHY_ID_VSC7388, 0x0000 },
-- 
1.9.1



[PATCH RESEND 2/5] can: flexcan: add CAN FD mode support

2019-03-06 Thread Joakim Zhang
From: Dong Aisheng 

This patch adds CAN FD mode support to the driver, which means that the
payload size can extend up to 64 bytes.

NOTE: Bit rate switch (BRS) is enabled by system reset when CAN FD mode
is enabled (explicitly set BRS again in driver). So the CAN hardware
supports BRS, but the driver does not support it yet because the bit timing
must be set in the CBT register rather than the CTRL1 register. That support
will be added in the next patch.

Signed-off-by: Dong Aisheng 
Signed-off-by: Joakim Zhang 
---
 drivers/net/can/flexcan.c | 109 ++
 1 file changed, 99 insertions(+), 10 deletions(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index e35083ff31ee..eee0c23bb805 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -52,6 +52,7 @@
 #define FLEXCAN_MCR_IRMQ   BIT(16)
 #define FLEXCAN_MCR_LPRIO_EN   BIT(13)
 #define FLEXCAN_MCR_AEN    BIT(12)
+#define FLEXCAN_MCR_FDEN   BIT(11)
 /* MCR_MAXMB: maximum used MBs is MAXMB + 1 */
 #define FLEXCAN_MCR_MAXMB(x)   ((x) & 0x7f)
 #define FLEXCAN_MCR_IDAM_A (0x0 << 8)
@@ -137,6 +138,20 @@
 FLEXCAN_ESR_BOFF_INT | FLEXCAN_ESR_ERR_INT | \
 FLEXCAN_ESR_WAK_INT)
 
+/* FLEXCAN FD control register (FDCTRL) bits */
+#define FLEXCAN_FDCTRL_FDRATE  BIT(31)
+#define FLEXCAN_FDCTRL_MBDSR3(x)   (((x) & 0x3) << 25)
+#define FLEXCAN_FDCTRL_MBDSR2(x)   (((x) & 0x3) << 22)
+#define FLEXCAN_FDCTRL_MBDSR1(x)   (((x) & 0x3) << 19)
+#define FLEXCAN_FDCTRL_MBDSR0(x)   (((x) & 0x3) << 16)
+
+/* FLEXCAN FD Bit Timing register (FDCBT) bits */
+#define FLEXCAN_FDCBT_FPRESDIV(x)  (((x) & 0x3ff) << 20)
+#define FLEXCAN_FDCBT_FRJW(x)  (((x) & 0x07) << 16)
+#define FLEXCAN_FDCBT_FPROPSEG(x)  (((x) & 0x1f) << 10)
+#define FLEXCAN_FDCBT_FPSEG1(x)    (((x) & 0x07) << 5)
+#define FLEXCAN_FDCBT_FPSEG2(x)    ((x) & 0x07)
+
 /* FLEXCAN interrupt flag register (IFLAG) bits */
 /* Errata ERR005829 step7: Reserve first valid MB */
 #define FLEXCAN_TX_MB_RESERVED_OFF_FIFO    8
@@ -148,6 +163,10 @@
 #define FLEXCAN_IFLAG_RX_FIFO_AVAILABLE    BIT(5)
 
 /* FLEXCAN message buffers */
+#define FLEXCAN_MB_CNT_EDL BIT(31)
+#define FLEXCAN_MB_CNT_BRS BIT(30)
+#define FLEXCAN_MB_CNT_ESI BIT(29)
+
 #define FLEXCAN_MB_CODE_MASK   (0xf << 24)
 #define FLEXCAN_MB_CODE_RX_BUSY_BIT    (0x1 << 24)
 #define FLEXCAN_MB_CODE_RX_INACTIVE    (0x0 << 24)
@@ -192,6 +211,7 @@
 #define FLEXCAN_QUIRK_BROKEN_PERR_STATE    BIT(6) /* No interrupt for error passive */
 #define FLEXCAN_QUIRK_DEFAULT_BIG_ENDIAN   BIT(7) /* default to BE register access */
 #define FLEXCAN_QUIRK_SETUP_STOP_MODE  BIT(8) /* Setup stop mode to support wakeup */
+#define FLEXCAN_QUIRK_TIMESTAMP_SUPPORT_FD BIT(9) /* Use timestamp then support can fd mode */
 
 /* Structure of the message buffer */
 struct flexcan_mb {
@@ -250,6 +270,9 @@ struct flexcan_regs {
u32 rerrdr; /* 0xaf4 */
u32 rerrsynr;   /* 0xaf8 */
u32 errsr;  /* 0xafc */
+   u32 _reserved7[64]; /* 0xb00 */
+   u32 fdctrl; /* 0xc00 */
+   u32 fdcbt;  /* 0xc04 */
 };
 
 struct flexcan_devtype_data {
@@ -337,6 +360,18 @@ static const struct can_bittiming_const 
flexcan_bittiming_const = {
.brp_inc = 1,
 };
 
+static const struct can_bittiming_const flexcan_fd_data_bittiming_const = {
+   .name = DRV_NAME,
+   .tseg1_min = 1,
+   .tseg1_max = 39,
+   .tseg2_min = 1,
+   .tseg2_max = 8,
+   .sjw_max = 8,
+   .brp_min = 1,
+   .brp_max = 1024,
+   .brp_inc = 1,
+};
+
 /* FlexCAN module is essentially modelled as a little-endian IP in most
  * SoCs, i.e the registers as well as the message buffer areas are
  * implemented in a little-endian fashion.
@@ -609,10 +644,10 @@ static int flexcan_get_berr_counter(const struct 
net_device *dev,
 static netdev_tx_t flexcan_start_xmit(struct sk_buff *skb, struct net_device 
*dev)
 {
const struct flexcan_priv *priv = netdev_priv(dev);
-   struct can_frame *cf = (struct can_frame *)skb->data;
+   struct canfd_frame *cf = (struct canfd_frame *)skb->data;
u32 can_id;
u32 data;
-   u32 ctrl = FLEXCAN_MB_CODE_TX_DATA | (cf->can_dlc << 16);
+   u32 ctrl = FLEXCAN_MB_CODE_TX_DATA | ((can_len2dlc(cf->len)) << 16);
int i;
 
if (can_dropped_invalid_skb(dev, skb))
@@ -630,7 +665,10 @@ static netdev_tx_t flexcan_start_xmit(struct sk_buff *skb, 
struct net_device *de
if (cf->can_id & CAN_RTR_FLAG)
ctrl |= FLEXCAN_MB_CNT_RTR;
 
-   for (i = 0; i < cf->can_dlc; i += sizeof(u32)) {
+   if (can_is_canfd_skb(skb))
+   ctrl |= FLEXCAN_MB_CNT_EDL;
+
+   for (i = 0; i < cf->len; i += sizeof(u32)) {
data = be32_to_cpup((__be32 *)&cf->data[i]);
  

[PATCH RESEND 0/5] can: flexcan: add CAN FD support on i.MX8QM

2019-03-06 Thread Joakim Zhang
Hi Marc,

This patch set integrates two patch sets which have been sent before.
[PATCH 0/3] can: flexcan: add imx8qm support
[PATCH 0/2] can: flexcan: add CANFD BRS and ISO FD support

I have rebased them into one series to make your review easier.

Thanks a lot.

Joakim Zhang.

Dong Aisheng (4):
  can: rx-offload: add CANFD support based on offload
  can: flexcan: add CAN FD mode support
  can: flexcan: add CANFD BRS support and improve bittiming setting
  can: flexcan: add imx8qm support

Joakim Zhang (1):
  can: flexcan: add ISO CAN FD feature support

 drivers/net/can/flexcan.c  | 226 -
 drivers/net/can/rx-offload.c   |  16 ++-
 include/linux/can/rx-offload.h |   4 +-
 3 files changed, 210 insertions(+), 36 deletions(-)

-- 
2.17.1



[PATCH RESEND 5/5] can: flexcan: add imx8qm support

2019-03-06 Thread Joakim Zhang
From: Dong Aisheng 

The FlexCAN on i.MX8QM supports the CAN FD protocol.

Signed-off-by: Dong Aisheng 
Signed-off-by: Joakim Zhang 
---
 drivers/net/can/flexcan.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index fca014bc530a..49615cc42436 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -346,6 +346,12 @@ static const struct flexcan_devtype_data 
fsl_imx6q_devtype_data = {
FLEXCAN_QUIRK_SETUP_STOP_MODE,
 };
 
+static struct flexcan_devtype_data fsl_imx8qm_devtype_data = {
+   .quirks = FLEXCAN_QUIRK_DISABLE_RXFG | FLEXCAN_QUIRK_ENABLE_EACEN_RRS |
+   FLEXCAN_QUIRK_USE_OFF_TIMESTAMP | 
FLEXCAN_QUIRK_BROKEN_PERR_STATE |
+   FLEXCAN_QUIRK_TIMESTAMP_SUPPORT_FD,
+};
+
 static const struct flexcan_devtype_data fsl_vf610_devtype_data = {
.quirks = FLEXCAN_QUIRK_DISABLE_RXFG | FLEXCAN_QUIRK_ENABLE_EACEN_RRS |
FLEXCAN_QUIRK_DISABLE_MECR | FLEXCAN_QUIRK_USE_OFF_TIMESTAMP |
@@ -1633,6 +1639,7 @@ static int flexcan_setup_stop_mode(struct platform_device 
*pdev)
 }
 
 static const struct of_device_id flexcan_of_match[] = {
+   { .compatible = "fsl,imx8qm-flexcan", .data = &fsl_imx8qm_devtype_data, 
},
{ .compatible = "fsl,imx6q-flexcan", .data = &fsl_imx6q_devtype_data, },
{ .compatible = "fsl,imx28-flexcan", .data = &fsl_imx28_devtype_data, },
{ .compatible = "fsl,imx53-flexcan", .data = &fsl_imx25_devtype_data, },
-- 
2.17.1



[PATCH RESEND 1/5] can: rx-offload: add CANFD support based on offload

2019-03-06 Thread Joakim Zhang
From: Dong Aisheng 

Use struct canfd_frame instead of struct can_frame to add CAN FD support
to rx-offload. The FlexCAN controller driver sets the is_canfd flag when
it supports CAN FD mode.

Signed-off-by: Dong Aisheng 
Signed-off-by: Joakim Zhang 
---
 drivers/net/can/rx-offload.c   | 16 ++--
 include/linux/can/rx-offload.h |  4 +++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/can/rx-offload.c b/drivers/net/can/rx-offload.c
index 2ce4fa8698c7..131fe600deb3 100644
--- a/drivers/net/can/rx-offload.c
+++ b/drivers/net/can/rx-offload.c
@@ -55,11 +55,11 @@ static int can_rx_offload_napi_poll(struct napi_struct 
*napi, int quota)
 
while ((work_done < quota) &&
   (skb = skb_dequeue(&offload->skb_queue))) {
-   struct can_frame *cf = (struct can_frame *)skb->data;
+   struct canfd_frame *cf = (struct canfd_frame *)skb->data;
 
work_done++;
stats->rx_packets++;
-   stats->rx_bytes += cf->can_dlc;
+   stats->rx_bytes += cf->len;
netif_receive_skb(skb);
}
 
@@ -122,16 +122,20 @@ static struct sk_buff *can_rx_offload_offload_one(struct 
can_rx_offload *offload
 {
struct sk_buff *skb = NULL;
struct can_rx_offload_cb *cb;
-   struct can_frame *cf;
+   struct canfd_frame *cf;
int ret;
 
/* If queue is full or skb not available, read to discard mailbox */
if (likely(skb_queue_len(&offload->skb_queue) <=
-  offload->skb_queue_len_max))
-   skb = alloc_can_skb(offload->dev, &cf);
+  offload->skb_queue_len_max)) {
+   if (offload->is_canfd)
+   skb = alloc_canfd_skb(offload->dev, &cf);
+   else
+   skb = alloc_can_skb(offload->dev, (struct can_frame 
**)&cf);
+   }
 
if (!skb) {
-   struct can_frame cf_overflow;
+   struct canfd_frame cf_overflow;
u32 timestamp;
 
ret = offload->mailbox_read(offload, &cf_overflow,
diff --git a/include/linux/can/rx-offload.h b/include/linux/can/rx-offload.h
index 8268811a697e..6448e7dfc170 100644
--- a/include/linux/can/rx-offload.h
+++ b/include/linux/can/rx-offload.h
@@ -23,7 +23,7 @@
 struct can_rx_offload {
struct net_device *dev;
 
-   unsigned int (*mailbox_read)(struct can_rx_offload *offload, struct 
can_frame *cf,
+   unsigned int (*mailbox_read)(struct can_rx_offload *offload, struct 
canfd_frame *cf,
 u32 *timestamp, unsigned int mb);
 
struct sk_buff_head skb_queue;
@@ -35,6 +35,8 @@ struct can_rx_offload {
struct napi_struct napi;
 
bool inc;
+
+   bool is_canfd;
 };
 
 int can_rx_offload_add_timestamp(struct net_device *dev, struct can_rx_offload 
*offload);
-- 
2.17.1



[PATCH RESEND 3/5] can: flexcan: add CANFD BRS support and improve bittiming setting

2019-03-06 Thread Joakim Zhang
From: Dong Aisheng 

This patch adds CAN FD Bit Rate Switch (BRS) support. When CAN FD with
BRS is used, the bit timing must be set in the CBT register rather than
the CTRL1 register. The CBT register extends the range of all CAN bit
timing variables (PRESDIV, PROPSEG, PSEG1, PSEG2 and RJW), which improves
the bit timing accuracy.
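
As a rough illustration of the CBT layout described above, the following
userspace sketch packs the five timing values into one 32-bit word using
the field macros from this patch. struct bt_sketch and cbt_pack() are
hypothetical names used only for illustration; BIT() is redefined locally
and the uint32_t casts are added because this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout taken from the patch; BIT() redefined for userspace. */
#define BIT(n)                  (1u << (n))
#define FLEXCAN_CBT_BTF         BIT(31)
#define FLEXCAN_CBT_EPRESDIV(x) (((uint32_t)(x) & 0x3ff) << 21)
#define FLEXCAN_CBT_ERJW(x)     (((uint32_t)(x) & 0x1f) << 16)
#define FLEXCAN_CBT_EPROPSEG(x) (((uint32_t)(x) & 0x3f) << 10)
#define FLEXCAN_CBT_EPSEG1(x)   (((uint32_t)(x) & 0x1f) << 5)
#define FLEXCAN_CBT_EPSEG2(x)   ((uint32_t)(x) & 0x1f)

/* Hypothetical stand-in for the struct can_bittiming fields used here. */
struct bt_sketch {
	uint32_t brp;
	uint32_t prop_seg;
	uint32_t phase_seg1;
	uint32_t phase_seg2;
	uint32_t sjw;
};

/* Each value is stored minus one, as in flexcan_set_bittiming(). */
static uint32_t cbt_pack(const struct bt_sketch *bt)
{
	return FLEXCAN_CBT_EPRESDIV(bt->brp - 1) |
	       FLEXCAN_CBT_EPSEG1(bt->phase_seg1 - 1) |
	       FLEXCAN_CBT_EPSEG2(bt->phase_seg2 - 1) |
	       FLEXCAN_CBT_ERJW(bt->sjw - 1) |
	       FLEXCAN_CBT_EPROPSEG(bt->prop_seg - 1) |
	       FLEXCAN_CBT_BTF;
}
```

With every value at its minimum of 1, only the BTF bit is set; the
largest values the masks allow fill the whole register.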

Signed-off-by: Joakim Zhang 
Signed-off-by: Dong Aisheng 
---
 drivers/net/can/flexcan.c | 107 ++
 1 file changed, 86 insertions(+), 21 deletions(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index eee0c23bb805..688bb09b8123 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -138,6 +138,14 @@
 FLEXCAN_ESR_BOFF_INT | FLEXCAN_ESR_ERR_INT | \
 FLEXCAN_ESR_WAK_INT)
 
+/* FLEXCAN Bit Timing register (CBT) bits */
+#define FLEXCAN_CBT_BTF         BIT(31)
+#define FLEXCAN_CBT_EPRESDIV(x) (((x) & 0x3ff) << 21)
+#define FLEXCAN_CBT_ERJW(x)     (((x) & 0x1f) << 16)
+#define FLEXCAN_CBT_EPROPSEG(x) (((x) & 0x3f) << 10)
+#define FLEXCAN_CBT_EPSEG1(x)   (((x) & 0x1f) << 5)
+#define FLEXCAN_CBT_EPSEG2(x)   ((x) & 0x1f)
+
 /* FLEXCAN FD control register (FDCTRL) bits */
 #define FLEXCAN_FDCTRL_FDRATE  BIT(31)
 #define FLEXCAN_FDCTRL_MBDSR3(x)   (((x) & 0x3) << 25)
@@ -245,7 +253,8 @@ struct flexcan_regs {
u32 crcr;   /* 0x44 */
u32 rxfgmask;   /* 0x48 */
u32 rxfir;  /* 0x4c */
-   u32 _reserved3[12]; /* 0x50 */
+   u32 cbt;/* 0x50 */
+   u32 _reserved3[11]; /* 0x54 */
u8 mb[2][512];  /* 0x80 */
/* FIFO-mode:
 *  MB
@@ -360,6 +369,18 @@ static const struct can_bittiming_const 
flexcan_bittiming_const = {
.brp_inc = 1,
 };
 
+static const struct can_bittiming_const flexcan_fd_bittiming_const = {
+   .name = DRV_NAME,
+   .tseg1_min = 2,
+   .tseg1_max = 64,
+   .tseg2_min = 1,
+   .tseg2_max = 32,
+   .sjw_max = 32,
+   .brp_min = 1,
+   .brp_max = 1024,
+   .brp_inc = 1,
+};
+
 static const struct can_bittiming_const flexcan_fd_data_bittiming_const = {
.name = DRV_NAME,
.tseg1_min = 1,
@@ -665,9 +686,13 @@ static netdev_tx_t flexcan_start_xmit(struct sk_buff *skb, 
struct net_device *de
if (cf->can_id & CAN_RTR_FLAG)
ctrl |= FLEXCAN_MB_CNT_RTR;
 
-   if (can_is_canfd_skb(skb))
+   if (can_is_canfd_skb(skb)) {
ctrl |= FLEXCAN_MB_CNT_EDL;
 
+   if (cf->flags & CANFD_BRS)
+   ctrl |= FLEXCAN_MB_CNT_BRS;
+   }
+
for (i = 0; i < cf->len; i += sizeof(u32)) {
data = be32_to_cpup((__be32 *)&cf->data[i]);
priv->write(data, &priv->tx_mb->data[i / sizeof(u32)]);
@@ -876,6 +901,9 @@ static unsigned int flexcan_mailbox_read(struct 
can_rx_offload *offload,
 
if (reg_ctrl & FLEXCAN_MB_CNT_EDL) {
cf->len = can_dlc2len((reg_ctrl >> 16) & 0x0F);
+
+   if (reg_ctrl & FLEXCAN_MB_CNT_BRS)
+   cf->flags |= CANFD_BRS;
} else {
cf->len = get_can_dlc((reg_ctrl >> 16) & 0x0F);
 
@@ -1038,21 +1066,7 @@ static void flexcan_set_bittiming(struct net_device *dev)
u32 reg;
 
reg = priv->read(&regs->ctrl);
-   reg &= ~(FLEXCAN_CTRL_PRESDIV(0xff) |
-FLEXCAN_CTRL_RJW(0x3) |
-FLEXCAN_CTRL_PSEG1(0x7) |
-FLEXCAN_CTRL_PSEG2(0x7) |
-FLEXCAN_CTRL_PROPSEG(0x7) |
-FLEXCAN_CTRL_LPB |
-FLEXCAN_CTRL_SMP |
-FLEXCAN_CTRL_LOM);
-
-   reg |= FLEXCAN_CTRL_PRESDIV(bt->brp - 1) |
-   FLEXCAN_CTRL_PSEG1(bt->phase_seg1 - 1) |
-   FLEXCAN_CTRL_PSEG2(bt->phase_seg2 - 1) |
-   FLEXCAN_CTRL_RJW(bt->sjw - 1) |
-   FLEXCAN_CTRL_PROPSEG(bt->prop_seg - 1);
-
+   reg &= ~(FLEXCAN_CTRL_LPB | FLEXCAN_CTRL_SMP | FLEXCAN_CTRL_LOM);
if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK)
reg |= FLEXCAN_CTRL_LPB;
if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
@@ -1064,17 +1078,60 @@ static void flexcan_set_bittiming(struct net_device 
*dev)
priv->write(reg, &regs->ctrl);
 
if (priv->can.ctrlmode & CAN_CTRLMODE_FD) {
+   reg = FLEXCAN_CBT_EPRESDIV(bt->brp - 1) |
+   FLEXCAN_CBT_EPSEG1(bt->phase_seg1 - 1) |
+   FLEXCAN_CBT_EPSEG2(bt->phase_seg2 - 1) |
+   FLEXCAN_CBT_ERJW(bt->sjw - 1) |
+   FLEXCAN_CBT_EPROPSEG(bt->prop_seg - 1) |
+   FLEXCAN_CBT_BTF;
+   priv->write(reg, &regs->cbt);
+
+   netdev_dbg(dev, "bt: prediv %d seg1 %d seg2 %d rjw %d propseg 
%d\n",
+  bt->brp - 1, bt->phase_seg1 - 1, bt->phase_seg2 - 1,
+  bt->sjw - 1, bt

[PATCH RESEND 4/5] can: flexcan: add ISO CAN FD feature support

2019-03-06 Thread Joakim Zhang
ISO CAN FD was introduced to improve failure detection capability over
non-ISO CAN FD. Non-ISO CAN FD is still supported by FlexCAN so that it
can be used during an intermediate phase, mainly for evaluation and
development purposes.

Therefore, it is strongly recommended to configure FlexCAN to the ISO
CAN FD protocol by setting the ISOCANFDEN field in the CTRL2 register.

NOTE: if you only set "fd on", the driver uses ISO CAN FD mode by default.
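
The default described in the note can be sketched in userspace as
follows. ctrl2_apply_iso() and SKETCH_CTRLMODE_FD_NON_ISO are hypothetical
names; the 0x80 value is assumed to mirror CAN_CTRLMODE_FD_NON_ISO, and
only the ISOCANFDEN bit position is taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position taken from the patch. */
#define FLEXCAN_CTRL2_ISOCANFDEN   (1u << 12)
/* Assumption: stands in for CAN_CTRLMODE_FD_NON_ISO. */
#define SKETCH_CTRLMODE_FD_NON_ISO 0x80u

/* Return the CTRL2 value to write back for the requested control mode. */
static uint32_t ctrl2_apply_iso(uint32_t reg_ctrl2, uint32_t ctrlmode)
{
	/* ISO CAN FD unless the user asked for non-ISO explicitly. */
	if (!(ctrlmode & SKETCH_CTRLMODE_FD_NON_ISO))
		return reg_ctrl2 | FLEXCAN_CTRL2_ISOCANFDEN;

	return reg_ctrl2 & ~FLEXCAN_CTRL2_ISOCANFDEN;
}
```

Other CTRL2 bits pass through unchanged in both branches, which matches
the read-modify-write done in flexcan_chip_start().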

Signed-off-by: Joakim Zhang 
---
 drivers/net/can/flexcan.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index 688bb09b8123..fca014bc530a 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -92,6 +92,7 @@
 #define FLEXCAN_CTRL2_MRP  BIT(18)
 #define FLEXCAN_CTRL2_RRS  BIT(17)
 #define FLEXCAN_CTRL2_EACEN  BIT(16)
+#define FLEXCAN_CTRL2_ISOCANFDEN   BIT(12)
 
 /* FLEXCAN memory error control register (MECR) bits */
 #define FLEXCAN_MECR_ECRWRDIS  BIT(31)
@@ -1254,6 +1255,12 @@ static int flexcan_chip_start(struct net_device *dev)
priv->write(reg_fdctrl, &regs->fdctrl);
reg_mcr = priv->read(&regs->mcr);
priv->write(reg_mcr | FLEXCAN_MCR_FDEN, &regs->mcr);
+
+   reg_ctrl2 = flexcan_read(&regs->ctrl2);
+   if (!(priv->can.ctrlmode & CAN_CTRLMODE_FD_NON_ISO))
+   flexcan_write(reg_ctrl2 | FLEXCAN_CTRL2_ISOCANFDEN, 
&regs->ctrl2);
+   else
+   flexcan_write(reg_ctrl2 & ~FLEXCAN_CTRL2_ISOCANFDEN, 
&regs->ctrl2);
}
 
if ((priv->can.ctrlmode_supported & CAN_CTRLMODE_FD) &&
@@ -1746,7 +1753,7 @@ static int flexcan_probe(struct platform_device *pdev)
if (priv->devtype_data->quirks & FLEXCAN_QUIRK_TIMESTAMP_SUPPORT_FD) {
if (priv->devtype_data->quirks & 
FLEXCAN_QUIRK_USE_OFF_TIMESTAMP) {
priv->offload.is_canfd = true;
-   priv->can.ctrlmode_supported |= CAN_CTRLMODE_FD;
+   priv->can.ctrlmode_supported |= CAN_CTRLMODE_FD | 
CAN_CTRLMODE_FD_NON_ISO;
priv->can.bittiming_const = &flexcan_fd_bittiming_const;
priv->can.data_bittiming_const = 
&flexcan_fd_data_bittiming_const;
} else {
-- 
2.17.1



[PATCH v2] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-03-06 Thread Adalbert Lazăr
Previous to commit 22b5c0b63f32 ("vsock/virtio: fix kernel panic
after device hot-unplug"), vsock_core_init() was called from
virtio_vsock_probe(). Now, virtio_transport_reset_no_sock() can be called
before vsock_core_init() has the chance to run.

[Wed Feb 27 14:17:09 2019] BUG: unable to handle kernel NULL pointer 
dereference at 0110
[Wed Feb 27 14:17:09 2019] #PF error: [normal kernel read fault]
[Wed Feb 27 14:17:09 2019] PGD 0 P4D 0
[Wed Feb 27 14:17:09 2019] Oops:  [#1] SMP PTI
[Wed Feb 27 14:17:09 2019] CPU: 3 PID: 59 Comm: kworker/3:1 Not tainted 
5.0.0-rc7-390-generic-hvi #390
[Wed Feb 27 14:17:09 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[Wed Feb 27 14:17:09 2019] Workqueue: virtio_vsock virtio_transport_rx_work 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] RIP: 0010:virtio_transport_reset_no_sock+0x8c/0xc0 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019] Code: 35 8b 4f 14 48 8b 57 08 31 f6 44 8b 4f 10 44 
8b 07 48 8d 7d c8 e8 84 f8 ff ff 48 85 c0 48 89 c3 74 2a e8 f7 31 03 00 48 89 
df <48> 8b 80 10 01 00 00 e8 68 fb 69 ed 48 8b 75 f0 65 48 33 34 25 28
[Wed Feb 27 14:17:09 2019] RSP: 0018:b42701ab7d40 EFLAGS: 00010282
[Wed Feb 27 14:17:09 2019] RAX:  RBX: 9d79637ee080 RCX: 
0003
[Wed Feb 27 14:17:09 2019] RDX: 0001 RSI: 0002 RDI: 
9d79637ee080
[Wed Feb 27 14:17:09 2019] RBP: b42701ab7d78 R08: 9d796fae70e0 R09: 
9d796f403500
[Wed Feb 27 14:17:09 2019] R10: b42701ab7d90 R11:  R12: 
9d7969d09240
[Wed Feb 27 14:17:09 2019] R13: 9d79624e6840 R14: 9d7969d09318 R15: 
9d796d48ff80
[Wed Feb 27 14:17:09 2019] FS:  () 
GS:9d796fac() knlGS:
[Wed Feb 27 14:17:09 2019] CS:  0010 DS:  ES:  CR0: 80050033
[Wed Feb 27 14:17:09 2019] CR2: 0110 CR3: 000427f22000 CR4: 
06e0
[Wed Feb 27 14:17:09 2019] DR0:  DR1:  DR2: 

[Wed Feb 27 14:17:09 2019] DR3:  DR6: fffe0ff0 DR7: 
0400
[Wed Feb 27 14:17:09 2019] Call Trace:
[Wed Feb 27 14:17:09 2019]  virtio_transport_recv_pkt+0x63/0x820 
[vmw_vsock_virtio_transport_common]
[Wed Feb 27 14:17:09 2019]  ? kfree+0x17e/0x190
[Wed Feb 27 14:17:09 2019]  ? detach_buf_split+0x145/0x160
[Wed Feb 27 14:17:09 2019]  ? __switch_to_asm+0x40/0x70
[Wed Feb 27 14:17:09 2019]  virtio_transport_rx_work+0xa0/0x106 
[vmw_vsock_virtio_transport]
[Wed Feb 27 14:17:09 2019] NET: Registered protocol family 40
[Wed Feb 27 14:17:09 2019]  process_one_work+0x167/0x410
[Wed Feb 27 14:17:09 2019]  worker_thread+0x4d/0x460
[Wed Feb 27 14:17:09 2019]  kthread+0x105/0x140
[Wed Feb 27 14:17:09 2019]  ? rescuer_thread+0x360/0x360
[Wed Feb 27 14:17:09 2019]  ? kthread_destroy_worker+0x50/0x50
[Wed Feb 27 14:17:09 2019]  ret_from_fork+0x35/0x40
[Wed Feb 27 14:17:09 2019] Modules linked in: vmw_vsock_virtio_transport 
vmw_vsock_virtio_transport_common input_leds vsock serio_raw i2c_piix4 mac_hid 
qemu_fw_cfg autofs4 cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops virtio_net psmouse drm net_failover pata_acpi virtio_blk failover 
floppy

Fixes: 22b5c0b63f32 ("vsock/virtio: fix kernel panic after device hot-unplug")
Reported-by: Alexandru Herghelegiu 
Signed-off-by: Adalbert Lazăr 
Co-developed-by: Stefan Hajnoczi 
---
 net/vmw_vsock/virtio_transport_common.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 3ae3a33da70b..602715fc9a75 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -662,6 +662,8 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
  */
 static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
 {
+   const struct virtio_transport *t;
+   struct virtio_vsock_pkt *reply;
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RST,
.type = le16_to_cpu(pkt->hdr.type),
@@ -672,15 +674,21 @@ static int virtio_transport_reset_no_sock(struct 
virtio_vsock_pkt *pkt)
if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
return 0;

-   pkt = virtio_transport_alloc_pkt(&info, 0,
-le64_to_cpu(pkt->hdr.dst_cid),
-le32_to_cpu(pkt->hdr.dst_port),
-le64_to_cpu(pkt->hdr.src_cid),
-le32_to_cpu(pkt->hdr.src_port));
-   if (!pkt)
+   reply = virtio_transport_alloc_pkt(&info, 0,
+  le64_to_cpu(pkt->hdr.dst_cid),
+  le32_to_cpu(pkt->hdr.dst_port),
+ 

[PATCH net] iptunnel: NULL pointer deref for ip_md_tunnel_xmit

2019-03-06 Thread Alan Maguire
Naresh Kamboju noted the following oops during execution of selftest
tools/testing/selftests/bpf/test_tunnel.sh on x86_64:

[  274.120445] BUG: unable to handle kernel NULL pointer dereference
at 
[  274.128285] #PF error: [INSTR]
[  274.131351] PGD 800414a0e067 P4D 800414a0e067 PUD 3b6334067 PMD 0
[  274.138241] Oops: 0010 [#1] SMP PTI
[  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
5.0.0-rc4-next-20190129 #1
[  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[  274.156526] RIP: 0010:  (null)
[  274.160280] Code: Bad RIP value.
[  274.163509] RSP: 0018:bc9681f83540 EFLAGS: 00010286
[  274.168726] RAX:  RBX: dc967fa80a18 RCX: 
[  274.175851] RDX: 9db2ee08b540 RSI: 000e RDI: dc967fa809a0
[  274.182974] RBP: bc9681f83580 R08: 9db2c4d62690 R09: 000c
[  274.190098] R10:  R11: 9db2ee08b540 R12: 9db31ce7c000
[  274.197222] R13: 0001 R14: 000c R15: 9db3179cf400
[  274.204346] FS:  7ff4ae7c5740() GS:9db31fa8()
knlGS:
[  274.212424] CS:  0010 DS:  ES:  CR0: 80050033
[  274.218162] CR2: ffd6 CR3: 0004574da004 CR4: 003606e0
[  274.225292] DR0:  DR1:  DR2: 
[  274.232416] DR3:  DR6: fffe0ff0 DR7: 0400
[  274.239541] Call Trace:
[  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
[  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
[  274.250176]  gre_fb_xmit+0x330/0x390
[  274.253754]  gre_tap_xmit+0x128/0x180
[  274.257414]  dev_hard_start_xmit+0xb7/0x300
[  274.261598]  sch_direct_xmit+0xf6/0x290
[  274.265430]  __qdisc_run+0x15d/0x5e0
[  274.269007]  __dev_queue_xmit+0x2c5/0xc00
[  274.273011]  ? dev_queue_xmit+0x10/0x20
[  274.276842]  ? eth_header+0x2b/0xc0
[  274.280326]  dev_queue_xmit+0x10/0x20
[  274.283984]  ? dev_queue_xmit+0x10/0x20
[  274.287813]  arp_xmit+0x1a/0xf0
[  274.290952]  arp_send_dst.part.19+0x46/0x60
[  274.295138]  arp_solicit+0x177/0x6b0
[  274.298708]  ? mod_timer+0x18e/0x440
[  274.302281]  neigh_probe+0x57/0x70
[  274.305684]  __neigh_event_send+0x197/0x2d0
[  274.309862]  neigh_resolve_output+0x18c/0x210
[  274.314212]  ip_finish_output2+0x257/0x690
[  274.318304]  ip_finish_output+0x219/0x340
[  274.322314]  ? ip_finish_output+0x219/0x340
[  274.326493]  ip_output+0x76/0x240
[  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
[  274.334510]  ip_local_out+0x3f/0x70
[  274.337992]  ip_send_skb+0x19/0x40
[  274.341391]  ip_push_pending_frames+0x33/0x40
[  274.345740]  raw_sendmsg+0xc15/0x11d0
[  274.349403]  ? __might_fault+0x85/0x90
[  274.353151]  ? _copy_from_user+0x6b/0xa0
[  274.357070]  ? rw_copy_check_uvector+0x54/0x130
[  274.361604]  inet_sendmsg+0x42/0x1c0
[  274.365179]  ? inet_sendmsg+0x42/0x1c0
[  274.368937]  sock_sendmsg+0x3e/0x50
[  274.372460]  ___sys_sendmsg+0x26f/0x2d0
[  274.376293]  ? lock_acquire+0x95/0x190
[  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
[  274.384307]  ? lock_acquire+0x95/0x190
[  274.388053]  ? __audit_syscall_entry+0xdd/0x130
[  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
[  274.397461]  ? __audit_syscall_entry+0xdd/0x130
[  274.401989]  ? trace_hardirqs_on+0x4c/0x100
[  274.406173]  __sys_sendmsg+0x63/0xa0
[  274.409744]  ? __sys_sendmsg+0x63/0xa0
[  274.413488]  __x64_sys_sendmsg+0x1f/0x30
[  274.417405]  do_syscall_64+0x55/0x190
[  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  274.426113] RIP: 0033:0x7ff4ae0e6e87
[  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
24 08
[  274.448422] RSP: 002b:7ffcd9b76db8 EFLAGS: 0246 ORIG_RAX:
002e
[  274.455978] RAX: ffda RBX: 0040 RCX: 7ff4ae0e6e87
[  274.463104] RDX:  RSI: 006092e0 RDI: 0003
[  274.470228] RBP:  R08: 7ffcd9bc40a0 R09: 7ffcd9bc4080
[  274.477349] R10: 060a R11: 0246 R12: 0003
[  274.484475] R13: 0016 R14: 7ffcd9b77fa0 R15: 7ffcd9b78da4
[  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
test_bpf]
[  274.504634] CR2: 
[  274.507976] ---[ end trace 196d18386545eae1 ]---
[  274.512588] RIP: 0010:  (null)
[  274.516334] Code: Bad RIP value.
[  274.519557] RSP: 0018:bc9681f83540 EFLAGS: 00010286
[  274.524775] RAX:  RBX: dc967fa80a18 RCX: 
[  274.531921] RDX: 9db2ee08b540 RSI: 000e RDI: dc967fa809a0
[  274.539082] RBP: bc9681f83580 R08: 9db2c4d62690 R09: 000c
[  274.546205] R10:  R11: 9db2ee08b540 R12: 9

[PATCH net] gre: fix kernel panic when using lw tunnel

2019-03-06 Thread Nicolas Dichtel
There were several problems:
 - skb_dst(skb) can be NULL when the packet comes from a gretap tunnel;
 - skb_dst(skb)->ops may point to md_dst_ops, which doesn't set ->mtu
   handler, thus calling dst_mtu() leads to a panic.

I also wonder if ->cow_metrics may be called if skb_dst(skb)->ops points
to ovs_dst_ops.

Don't try to do anything in one of those cases.
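
The three conditions can be modelled outside the kernel roughly like
this. All names ending in _sketch, the metrics_read_only field and
pmtu_update_is_safe() are hypothetical stand-ins for struct dst_entry,
dst_metrics_read_only() and the check added to tnl_update_pmtu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of struct dst_ops. */
struct dst_ops_sketch {
	unsigned int (*mtu)(const void *dst);
	unsigned int *(*cow_metrics)(void *dst, unsigned long old);
};

/* Hypothetical subset of struct dst_entry. */
struct dst_sketch {
	const struct dst_ops_sketch *ops;
	bool metrics_read_only;
};

/* Dummy handlers so a fully populated ops table can be built. */
static unsigned int sketch_mtu(const void *dst)
{
	(void)dst;
	return 1500;
}

static unsigned int *sketch_cow(void *dst, unsigned long old)
{
	(void)dst;
	(void)old;
	return NULL;
}

/* True only when the PMTU update may dereference the handlers. */
static bool pmtu_update_is_safe(const struct dst_sketch *dst)
{
	if (!dst)
		return false;	/* packet came in without a dst */
	if (!dst->ops->mtu)
		return false;	/* e.g. md_dst_ops has no ->mtu */
	if (dst->metrics_read_only && !dst->ops->cow_metrics)
		return false;	/* metrics cannot be made writable */
	return true;
}
```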

Fixes: 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to 
ip_md_tunnel_xmit")
CC: wenxu 
Signed-off-by: Nicolas Dichtel 
---
 net/ipv4/ip_tunnel.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 2756fb725bf0..e2e0e4601c0f 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -508,6 +508,12 @@ static int tnl_update_pmtu(struct net_device *dev, struct 
sk_buff *skb,
int pkt_size;
int mtu;
 
+   if (!skb_dst(skb) ||
+   !skb_dst(skb)->ops->mtu ||
+   (dst_metrics_read_only(skb_dst(skb)) &&
+!skb_dst(skb)->ops->cow_metrics))
+   return 0;
+
tunnel_hlen = md ? tunnel_hlen : tunnel->hlen;
pkt_size = skb->len - tunnel_hlen - dev->hard_header_len;
 
-- 
2.21.0



Re: general protection fault in sctp_sched_rr_dequeue

2019-03-06 Thread Xin Long
On Wed, Mar 6, 2019 at 9:42 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:63bdf4284c38 Merge branch 'linus' of git://git.kernel.org/..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=100347cb20
> kernel config:  https://syzkaller.appspot.com/x/.config?x=872be05707464aaa
> dashboard link: https://syzkaller.appspot.com/bug?extid=4c9934f20522c0efd657
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11cd9b0320
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=127de8e720
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+4c9934f20522c0efd...@syzkaller.appspotmail.com
>
> kauditd_printk_skb: 2 callbacks suppressed
> audit: type=1400 audit(1551833288.424:35): avc:  denied  { map } for
> pid=8035 comm="bash" path="/bin/bash" dev="sda1" ino=1457
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=system_u:object_r:file_t:s0 tclass=file permissive=1
> audit: type=1400 audit(1551833294.934:36): avc:  denied  { map } for
> pid=8047 comm="syz-executor778" path="/root/syz-executor778173561"
> dev="sda1" ino=16484 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 8047 Comm: syz-executor778 Not tainted 5.0.0+ #7
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:sctp_sched_rr_dequeue+0xd3/0x170 net/sctp/stream_sched_rr.c:141
The panic was caused by sched->init() resetting stream->rr_next to NULL
even when outq->out_chunk_list is not empty.

We should remove the sched->init() call from sctp_stream_init(), since
all sched info was moved into sout->ext and sctp_stream_alloc_out()
will not affect it.

> Code: ea 03 80 3c 02 00 0f 85 a2 00 00 00 48 8b 5b 08 e8 62 20 ee fa 48 8d
> 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75
> 53 4c 8b 6b 30 4c 89 e7 49 83 ed 18 4c 89 ee e8 b4
> RSP: 0018:88809eacf040 EFLAGS: 00010206
> RAX: dc00 RBX:  RCX: 8679cd9f
> RDX: 0006 RSI: 8681c41e RDI: 0030
> RBP: 88809eacf058 R08: 8880a12bc300 R09: 0002
> R10: ed1015d25bcf R11: 8880ae92de7b R12: 88807cae6ca0
> R13: 88807cae6580 R14: dc00 R15: 88809eacf198
> FS:  01865880() GS:8880ae90() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55af2d491150 CR3: 8dd7b000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   sctp_outq_dequeue_data net/sctp/outqueue.c:90 [inline]
>   sctp_outq_flush_data net/sctp/outqueue.c:1079 [inline]
>   sctp_outq_flush+0xba2/0x2790 net/sctp/outqueue.c:1205
>   sctp_outq_uncork+0x6c/0x80 net/sctp/outqueue.c:772
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x513/0x5390 net/sctp/sm_sideeffect.c:1191
>   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1074
>   sctp_inq_push+0x1ea/0x290 net/sctp/inqueue.c:95
>   sctp_backlog_rcv+0x189/0xbc0 net/sctp/input.c:354
>   sk_backlog_rcv include/net/sock.h:937 [inline]
>   __release_sock+0x12e/0x3a0 net/core/sock.c:2413
>   release_sock+0x59/0x1c0 net/core/sock.c:2929
>   sctp_wait_for_connect+0x316/0x540 net/sctp/socket.c:8999
>   sctp_sendmsg_to_asoc+0x13e2/0x17d0 net/sctp/socket.c:1968
>   sctp_sendmsg+0x10a9/0x17e0 net/sctp/socket.c:2114
>   inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:622 [inline]
>   sock_sendmsg+0xdd/0x130 net/socket.c:632
>   ___sys_sendmsg+0x806/0x930 net/socket.c:2137
>   __sys_sendmsg+0x105/0x1d0 net/socket.c:2175
>   __do_sys_sendmsg net/socket.c:2184 [inline]
>   __se_sys_sendmsg net/socket.c:2182 [inline]
>   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2182
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x440159
> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7fff8801fd38 EFLAGS: 0246 ORIG_RAX: 002e
> RAX: ffda RBX: 004002c8 RCX: 00440159
> RDX:  RSI: 2001afc8 RDI: 0003
> RBP: 006ca018 R08: 0002 R09: 004002c8
> R10: 0008 R11: 0246 R12: 004019e0
> R13: 00401a70 R14: 

Re: [RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors

2019-03-06 Thread Christophe de Dinechin



> On 6 Mar 2019, at 08:18, Jason Wang  wrote:
> 
> This is used to hide the metadata address from virtqueue helpers. This
> will allow to implement a vmap based fast accessing to metadata.
> 
> Signed-off-by: Jason Wang 
> ---
> drivers/vhost/vhost.c | 94 +--
> 1 file changed, 77 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 400aa78..29709e7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -869,6 +869,34 @@ static inline void __user *__vhost_get_user(struct 
> vhost_virtqueue *vq,
>   ret; \
> })
> 
> +static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> +   vhost_avail_event(vq));
> +}
> +
> +static inline int vhost_put_used(struct vhost_virtqueue *vq,
> +  struct vring_used_elem *head, int idx,
> +  int count)
> +{
> + return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> +   count * sizeof(*head));
> +}
> +
> +static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> +
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> +   &vq->used->flags);
> +}
> +
> +static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> +
> +{
> + return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> +   &vq->used->idx);
> +}
> +
> #define vhost_get_user(vq, x, ptr, type)  \
> ({ \
>   int ret; \
> @@ -907,6 +935,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
> }
> 
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> +   __virtio16 *idx)
> +{
> + return vhost_get_avail(vq, *idx, &vq->avail->idx);
> +}
> +
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> +__virtio16 *head, int idx)
> +{
> + return vhost_get_avail(vq, *head,
> +&vq->avail->ring[idx & (vq->num - 1)]);
> +}
> +
> +static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> + __virtio16 *flags)
> +{
> + return vhost_get_avail(vq, *flags, &vq->avail->flags);
> +}
> +
> +static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> +__virtio16 *event)
> +{
> + return vhost_get_avail(vq, *event, vhost_used_event(vq));
> +}
> +
> +static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> +  __virtio16 *idx)
> +{
> + return vhost_get_used(vq, *idx, &vq->used->idx);
> +}
> +
> +static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> +  struct vring_desc *desc, int idx)
> +{
> + return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> +}
> +
> static int vhost_new_umem_range(struct vhost_umem *umem,
>   u64 start, u64 size, u64 end,
>   u64 userspace_addr, int perm)
> @@ -1840,8 +1905,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct 
> vhost_log *log,
> static int vhost_update_used_flags(struct vhost_virtqueue *vq)
> {
>   void __user *used;
> - if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> -&vq->used->flags) < 0)
> + if (vhost_put_used_flags(vq))
>   return -EFAULT;
>   if (unlikely(vq->log_used)) {
>   /* Make sure the flag is seen before log. */
> @@ -1858,8 +1922,7 @@ static int vhost_update_used_flags(struct 
> vhost_virtqueue *vq)
> 
> static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
> avail_event)
> {
> - if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> -vhost_avail_event(vq)))
> + if (vhost_put_avail_event(vq))
>   return -EFAULT;
>   if (unlikely(vq->log_used)) {
>   void __user *used;
> @@ -1895,7 +1958,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
>   r = -EFAULT;
>   goto err;
>   }
> - r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
> + r = vhost_get_used_idx(vq, &last_used_idx);
>   if (r) {
>   vq_err(vq, "Can't access used idx at %p\n",
>  &vq->used->idx);

From the error case, it looks like you are not entirely encapsulating
knowledge of what the accessor uses, i.e. it’s not:

vq_err(vq, "Can't access used idx at %p\n",
   &last_used_idx);

Maybe move the error message into the accessor?
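
To make the suggestion concrete, here is a userspace sketch of an
accessor that reports its own failure, so the caller never needs to know
which address was touched. struct vq_sketch, sketch_get_used_idx() and
the errors_logged counter are hypothetical; fprintf() stands in for
vq_err(), and a NULL pointer models an inaccessible userspace address.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the parts of struct vhost_virtqueue used here. */
struct vq_sketch {
	const uint16_t *used_idx;	/* NULL models a faulting address */
	int errors_logged;
};

/* Read the used index and report the failing address on error. */
static int sketch_get_used_idx(struct vq_sketch *vq, uint16_t *idx)
{
	if (!vq->used_idx) {
		fprintf(stderr, "Can't access used idx at %p\n",
			(const void *)vq->used_idx);
		vq->errors_logged++;
		return -EFAULT;
	}

	*idx = *vq->used_idx;
	return 0;
}
```

The caller then only checks the return value and jumps to its error
path.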

> @@ -2094,7 +2157,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   last_avail_idx = vq->last_avail_idx;
> 
>   if (vq->avail_idx == vq->last_ava

[PATCH 1/2] appletalk: Fix compile regression

2019-03-06 Thread Arnd Bergmann
A bugfix just broke compilation of appletalk when CONFIG_SYSCTL
is disabled:

In file included from net/appletalk/ddp.c:65:
net/appletalk/ddp.c: In function 'atalk_init':
include/linux/atalk.h:164:34: error: expected expression before 'do'
 #define atalk_register_sysctl()  do { } while(0)
  ^~
net/appletalk/ddp.c:1934:7: note: in expansion of macro 'atalk_register_sysctl'
  rc = atalk_register_sysctl();

This is easier to avoid by using conventional inline functions
as stubs rather than macros. The header already has inline
functions for other purposes, so I'm changing over all the
macros for consistency.
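
A minimal sketch of the difference, outside the kernel tree: with the old
stub, "rc = atalk_register_sysctl();" expands to "rc = do { } while(0);",
which is a statement and not an expression. An inline stub has a real
return value. The sketch_* names below are hypothetical and only mirror
the shape of the stubs in this patch.

```c
#include <assert.h>

/* Stub used when the feature is compiled out: usable as an expression. */
static inline int sketch_register_sysctl(void)
{
	return 0;
}

/* Void stub: replaces the "do { } while(0)" macro for consistency. */
static inline void sketch_unregister_sysctl(void)
{
}

/* Mirrors the call pattern in atalk_init() that broke the build. */
static int sketch_init(void)
{
	int rc;

	rc = sketch_register_sysctl();
	if (rc) {
		sketch_unregister_sysctl();
		return rc;
	}

	return 0;
}
```

The inline form also gives the compiler a prototype to check against,
which the macro stubs did not.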

Fixes: 6377f787aeb9 ("appletalk: Fix use-after-free in atalk_proc_exit")
Signed-off-by: Arnd Bergmann 
---
resent with mailing list on Cc
---
 include/linux/atalk.h | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/atalk.h b/include/linux/atalk.h
index 5a90f28d5ff2..d5cfc0b15b76 100644
--- a/include/linux/atalk.h
+++ b/include/linux/atalk.h
@@ -161,16 +161,26 @@ extern int sysctl_aarp_resolve_time;
 extern int atalk_register_sysctl(void);
 extern void atalk_unregister_sysctl(void);
 #else
-#define atalk_register_sysctl()  do { } while(0)
-#define atalk_unregister_sysctl()  do { } while(0)
+static inline int atalk_register_sysctl(void)
+{
+   return 0;
+}
+static inline void atalk_unregister_sysctl(void)
+{
+}
 #endif
 
 #ifdef CONFIG_PROC_FS
 extern int atalk_proc_init(void);
 extern void atalk_proc_exit(void);
 #else
-#define atalk_proc_init()  ({ 0; })
-#define atalk_proc_exit()  do { } while(0)
+static inline int atalk_proc_init(void)
+{
+   return 0;
+}
+static inline void atalk_proc_exit(void)
+{
+}
 #endif /* CONFIG_PROC_FS */
 
 #endif /* __LINUX_ATALK_H__ */
-- 
2.20.0
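
The build failure above comes from using a statement-like macro where a value is expected. A minimal sketch of the stub pattern the patch switches to, using hypothetical `demo_*` names and a `DEMO_HAVE_SYSCTL` switch standing in for CONFIG_SYSCTL (none of these exist in the kernel):

```c
/* DEMO_HAVE_SYSCTL is a hypothetical stand-in for CONFIG_SYSCTL. */
#ifdef DEMO_HAVE_SYSCTL
int demo_register_sysctl(void);
void demo_unregister_sysctl(void);
#else
/*
 * Inline stubs keep the real prototypes, so callers that use the
 * return value still compile when the feature is configured out.
 * A "do { } while (0)" macro is a statement and cannot be assigned.
 */
static inline int demo_register_sysctl(void)
{
	return 0;
}

static inline void demo_unregister_sysctl(void)
{
}
#endif

/* Hypothetical caller, shaped like the failing line in atalk_init(). */
static int demo_init(void)
{
	int rc = demo_register_sysctl();

	if (rc)
		return rc;
	return 0;
}
```

With the stubs in place, `rc = demo_register_sysctl();` is valid in both configurations, and the compiler still type-checks the call.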



[PATCH 2/2] appletalk: Add atalk.h header files to MAINTAINERS file

2019-03-06 Thread Arnd Bergmann
Add the path names here so that git-send-email can pick up the
netdev@vger.kernel.org Cc line automatically for a patch that
only touches the headers.

Signed-off-by: Arnd Bergmann 
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1e64279f338a..1f344f0c2254 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1059,6 +1059,8 @@ L:netdev@vger.kernel.org
 S: Odd fixes
 F: drivers/net/appletalk/
 F: net/appletalk/
+F: include/linux/atalk.h
+F: include/uapi/linux/atalk.h
 
 APPLIED MICRO (APM) X-GENE DEVICE TREE SUPPORT
 M: Khuong Dinh 
-- 
2.20.0



Re: [RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

2019-03-06 Thread Christophe de Dinechin



> On 6 Mar 2019, at 08:18, Jason Wang  wrote:
> 
> Signed-off-by: Jason Wang 
> ---
> drivers/vhost/vhost.c | 46 --
> 1 file changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2025543..1015464 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
>   vhost_vq_free_iovecs(dev->vqs[i]);
> }
> 
> +static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)

Nit: Any reason not to make `num` unsigned or size_t?

> +{
> + size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> + return sizeof(*vq->avail) +
> +sizeof(*vq->avail->ring) * num + event;
> +}
> +
> +static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
> +{
> + size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> + return sizeof(*vq->used) +
> +sizeof(*vq->used->ring) * num + event;
> +}
> +
> +static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
> +{
> + return sizeof(*vq->desc) * num;
> +}
> +
> void vhost_dev_init(struct vhost_dev *dev,
>   struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> {
> @@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, 
> unsigned int num,
>struct vring_used __user *used)
> 
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> - return access_ok(desc, num * sizeof *desc) &&
> -access_ok(avail,
> -  sizeof *avail + num * sizeof *avail->ring + s) &&
> -access_ok(used,
> - sizeof *used + num * sizeof *used->ring + s);
> + return access_ok(desc, vhost_get_desc_size(vq, num)) &&
> +access_ok(avail, vhost_get_avail_size(vq, num)) &&
> +access_ok(used, vhost_get_used_size(vq, num));
> }
> 
> static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> @@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue 
> *vq,
> 
> int vq_meta_prefetch(struct vhost_virtqueue *vq)
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>   unsigned int num = vq->num;
> 
>   if (!vq->iotlb)
>   return 1;
> 
>   return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
> -num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
> +vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
>  iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
> -sizeof *vq->avail +
> -num * sizeof(*vq->avail->ring) + s,
> +vhost_get_avail_size(vq, num),
>  VHOST_ADDR_AVAIL) &&
>  iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
> -sizeof *vq->used +
> -num * sizeof(*vq->used->ring) + s,
> -VHOST_ADDR_USED);
> +vhost_get_used_size(vq, num), VHOST_ADDR_USED);
> }
> EXPORT_SYMBOL_GPL(vq_meta_prefetch);
> 
> @@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
> static bool vq_log_access_ok(struct vhost_virtqueue *vq,
>void __user *log_base)
> {
> - size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
>   return vq_memory_access_ok(log_base, vq->umem,
>  vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
>   (!vq->log_used || log_access_ok(log_base, vq->log_addr,
> - sizeof *vq->used +
> - vq->num * sizeof *vq->used->ring + s));
> +   vhost_get_used_size(vq, vq->num)));
> }
> 
> /* Can we start vq? */
> -- 
> 1.8.3.1
> 
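
On the nit about `num`: a rough user-space illustration of why a signed count is awkward in a size computation. `struct demo_desc` and both helpers are made up for this sketch; they only mimic the shape of vhost_get_desc_size() and are not the vhost code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor, loosely modelled on a vring descriptor. */
struct demo_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Signed count: a negative value is converted to size_t before the multiply. */
static size_t demo_desc_size_signed(int num)
{
	return sizeof(struct demo_desc) * num;
}

/* Unsigned count: the parameter type itself rules out negative inputs. */
static size_t demo_desc_size_unsigned(unsigned int num)
{
	return sizeof(struct demo_desc) * num;
}
```

With `num = -1` the signed variant silently wraps to a value just below SIZE_MAX instead of failing, which is the kind of surprise an unsigned parameter avoids by construction.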



[PATCH iproute2] iprule: fix printing hint about unresolved iifname and oifname

2019-03-06 Thread Thomas Haller
# ip rule add priority 10 iif eth1 goto 1 protocol 10

was displayed as

10: from all iif eth1 [detached] goto 1unresolved proto mrt

now:

10: from all iif eth1 [detached] goto 1 [unresolved] proto mrt

Fixes: 0dd4ccc56c0e ("iprule: add json support")

Signed-off-by: Thomas Haller 
---
 ip/iprule.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index 4e9437de..0bd4c636 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -455,7 +455,7 @@ int print_rule(struct nlmsghdr *n, void *arg)
print_string(PRINT_ANY, "goto", "goto %s", "none");
 
if (frh->flags & FIB_RULE_UNRESOLVED)
-   print_null(PRINT_ANY, "unresolved", "unresolved", NULL);
+   print_null(PRINT_ANY, "unresolved", " [unresolved]", 
NULL);
} else if (frh->action == FR_ACT_NOP) {
print_null(PRINT_ANY, "nop", "nop", NULL);
} else if (frh->action != FR_ACT_TO_TBL) {
-- 
2.20.1
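
To make the before/after output above concrete, here is a small stand-alone sketch. It does not use iproute2's print_null(); `demo_format_goto` is a hypothetical helper that simply appends the hint text directly after the goto target, the way consecutive plain-text print calls would:

```c
#include <stdio.h>

/*
 * Hypothetical helper: emits "goto <target>", then the optional hint
 * with no separator of its own, then the protocol suffix.
 */
static int demo_format_goto(char *buf, size_t len, const char *target,
			    int unresolved, const char *hint)
{
	return snprintf(buf, len, "goto %s%s proto mrt", target,
			unresolved ? hint : "");
}
```

Passing "unresolved" as the hint reproduces the fused "goto 1unresolved" text, while " [unresolved]" carries its own leading space and brackets.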



[PATCH] vhost: silence an unused-variable warning

2019-03-06 Thread Arnd Bergmann
On some architectures, the MMU can be disabled, leading to access_ok()
becoming an empty macro that does not evaluate its size argument,
which in turn produces an unused-variable warning:

drivers/vhost/vhost.c:1191:9: error: unused variable 's' 
[-Werror,-Wunused-variable]
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;

Mark the variable as __maybe_unused to shut up that warning.

Signed-off-by: Arnd Bergmann 
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a2e5dc7716e2..5ace833de746 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1188,7 +1188,7 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, 
unsigned int num,
 struct vring_used __user *used)
 
 {
-   size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+   size_t s __maybe_unused = vhost_has_feature(vq, 
VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
return access_ok(desc, num * sizeof *desc) &&
   access_ok(avail,
-- 
2.20.0
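
A reduced sketch of the warning and the fix, for GCC or Clang. `demo_access_ok` is a hypothetical stand-in for an access_ok() variant that drops its size argument, and `demo_maybe_unused` mirrors what the kernel's __maybe_unused does; neither name comes from the kernel:

```c
#include <stddef.h>

/* Stand-in for an access_ok() variant that never evaluates its size argument. */
#define demo_access_ok(addr, size)	((addr) != NULL)

/* GCC/Clang attribute; the kernel's __maybe_unused expands to an equivalent one. */
#define demo_maybe_unused		__attribute__((unused))

static int demo_vq_access_ok(const int *ring, unsigned int num)
{
	/*
	 * After preprocessing, 's' is referenced nowhere, so without the
	 * annotation -Wunused-variable would fire here. The "? 2 : 0"
	 * only imitates the shape of the original initializer.
	 */
	size_t s demo_maybe_unused = num ? 2 : 0;

	return demo_access_ok(ring, num * sizeof(*ring) + s);
}
```

The annotation leaves the generated code unchanged; it only tells the compiler that an unreferenced `s` is expected in some configurations.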



[PATCH] tcp: detecting the misuse of .sendpage for Slab objects

2019-03-06 Thread Vasily Averin
sendpage was not designed to process Slab pages;
in some situations it can trigger a BUG_ON on the receiving side.

Signed-off-by: Vasily Averin 
---
 net/ipv4/tcp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ad07dd71063d..dbb08140cdc9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -943,6 +943,10 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
*page, int offset,
ssize_t copied;
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 
+   if (IS_ENABLED(CONFIG_DEBUG_VM) &&
+   WARN_ONCE(PageSlab(page), "page must not be a Slab one"))
+   return -EINVAL;
+
/* Wait for a connection to finish. One exception is TCP Fast Open
 * (passive side) where data is allowed to be sent before a connection
 * is fully established.
-- 
2.17.1



Re: [PATCH v2] net: xfrm: Add '_rcu' tag for rcu protected pointer in netns_xfrm

2019-03-06 Thread Steffen Klassert
On Mon, Mar 04, 2019 at 08:19:14PM -0500, Su Yanjun wrote:
> For RCU-protected pointers, we should add the '__rcu' tag.
> 
> Once the '__rcu' tag is added to an RCU-protected pointer, the sparse tool
> reports warnings.
> 
> net/xfrm/xfrm_user.c:1198:39: sparse:expected struct sock *sk
> net/xfrm/xfrm_user.c:1198:39: sparse:got struct sock [noderef]  
> *nlsk
> [...]
> 
> So introduce a new wrapper function around nlmsg_unicast to handle type
> conversions.
> 
> No functional change.

While that was true for v1 of that patch, it is not
true for this version. This fixes a direct access
to an RCU-protected socket. So please add a proper
'Fixes' tag.


Re: [PATCH] xfrm: Reset secpath in xfrm failure

2019-03-06 Thread Steffen Klassert
On Wed, Mar 06, 2019 at 04:33:08PM +0900, Myungho Jung wrote:
> In esp4_gro_receive() and esp6_gro_receive(), secpath can be allocated
> without adding xfrm state to xvec. Then, sp->xvec[sp->len - 1] would
> fail and result in dereferencing invalid pointer in esp4_gso_segment()
> and esp6_gso_segment(). Reset secpath if xfrm function returns error.
> 
> Reported-by: syzbot+b69368fd933c6c592...@syzkaller.appspotmail.com
> Signed-off-by: Myungho Jung 

The patch itself looks ok, but please add a 'Fixes' tag to
the commit message.

Thanks!


Re: stmmac / meson8b-dwmac

2019-03-06 Thread Simon Huelck
Hi,

I sorted out some more things:

- I had not activated TCP window scaling; with it enabled, iperf3 now
reaches 930 MBits. This was related to my firewall and therefore my uplink.

So the remaining topic is (and I'm currently testing with next-20190306):

TX is starving RX, and the total bandwidth seems to be limited to 930 MBits
instead of something like 930 MBits * 2 for duplex.

@Jose, do you have any hints on that? I saw that you introduced
patches for that, but somehow RX/TX are not sharing the NAPI budget
equally. But I wonder why the TX queue and RX queue collide at all;
shouldn't they be independent?

regards,
Simon
> Hi guys,
>
>
> 1. I discovered something strange. When I never configure my external
> VLAN interface UP (firewall doesn't start, traffic shaper CAKE doesn't
> start), my non-duplex iperf bandwidth increases from 600 MBits to 930.
>
> 2. Duplex isn't working: TX is totally starving RX, and the total bandwidth
> is 900 MBits. What's going on there?
>
> 3. I had an MTU issue (it was set to 1500, but due to VLANs etc. 1450 would
> be better), but this didn't change performance.
> 4. Even when I bring eth0.4 up, then bring it down, then flush iptables, and
> the shaper was never added, I drop to 600 MBits.
>
> questions:
> - Why is duplex still not working even though the kernel says so?
> - Why is TX totally starving RX, even though duplex is "on"?
> - When I flush all my iptables rules and the traffic shaper, I'm still
> bound to 600 MBits ... very strange, does anyone have an idea? Upping eth0.4
> cuts the performance, even when other VLAN IFs like eth0.3,
> eth0.2, eth0.5 are up and bridged (eth0.4 isn't bridged anywhere).
>
>
>
> my setup:
>
> br-dmz  8000.7ef0fd9b157f   no  eth0.2
> br-guest    8000.001f1fbbbd60   no  wlan0
> br-iot  8000.7ef0fd9b157f   no  eth0.5
> br-lan  8000.001f1fbbbd61   no  eth0.3
>     wlan0_1
>     wlan2
>
> eth0.4 is my uplink
>
> all the bridges are internal, eth0.4 is external
>
>
> C:\Users\Simon\Downloads\iperf3.6_64bit\iperf3.6_64bit>iperf3.exe -c
> 10.10.11.1 -i1
> warning: Ignoring nonsense TCP MSS 0
> Connecting to host 10.10.11.1, port 5201
> [  5] local 10.10.11.100 port 52173 connected to 10.10.11.1 port 5201
> [ ID] Interval   Transfer Bitrate
> [  5]   0.00-1.00   sec   384 KBytes  3.14 Mbits/sec
> [  5]   1.00-2.00   sec   384 KBytes  3.15 Mbits/sec
> [  5]   2.00-3.00   sec  1.12 MBytes  9.44 Mbits/sec
> [  5]   3.00-4.00   sec  2.00 MBytes  16.8 Mbits/sec
> [  5]   4.00-5.00   sec  2.38 MBytes  19.9 Mbits/sec
> [  5]   5.00-6.00   sec  3.12 MBytes  26.2 Mbits/sec
> [  5]   6.00-7.00   sec  4.75 MBytes  39.8 Mbits/sec
> [  5]   7.00-8.00   sec  68.4 MBytes   574 Mbits/sec
> [  5]   8.00-9.00   sec   104 MBytes   875 Mbits/sec
> [  5]   9.00-10.00  sec   105 MBytes   881 Mbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval   Transfer Bitrate
> [  5]   0.00-10.00  sec   292 MBytes   245 Mbits/sec  sender
> [  5]   0.00-10.04  sec   292 MBytes   244 Mbits/sec  receiver
>
> iperf Done.
>
>
> root@odroidc2:~# iperf3 -c 10.10.11.100 -i1
> Connecting to host 10.10.11.100, port 5201
> [  5] local 10.10.11.1 port 60022 connected to 10.10.11.100 port 5201
> [ ID] Interval   Transfer Bitrate Retr  Cwnd
> [  5]   0.00-1.00   sec   112 MBytes   941 Mbits/sec    0    417 KBytes
> [  5]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    487 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   940 Mbits/sec    0    487 KBytes
> [  5]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    512 KBytes
> [  5]   4.00-5.01   sec   109 MBytes   907 Mbits/sec    0    543 KBytes
> [  5]   5.01-6.01   sec   109 MBytes   911 Mbits/sec    0    543 KBytes
> [  5]   6.01-7.01   sec   108 MBytes   902 Mbits/sec    0    543 KBytes
> [  5]   7.01-8.01   sec   108 MBytes   905 Mbits/sec    0    543 KBytes
> [  5]   8.01-9.00   sec   106 MBytes   895 Mbits/sec    0    543 KBytes
> [  5]   9.00-10.00  sec   106 MBytes   891 Mbits/sec    0    543 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval   Transfer Bitrate Retr
> [  5]   0.00-10.00  sec  1.07 GBytes   918 Mbits/sec    0 sender
> [  5]   0.00-10.04  sec  1.07 GBytes   915 Mbits/sec      receiver
>
> --
>
>
> Mar  5 09:46:03 localhost kernel: [  105.534204] meson8b-dwmac
> c941.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
>
> iperf Done.
> root@odroidc2:~#
>
>
>

Re: stmmac / meson8b-dwmac

2019-03-06 Thread Simon Huelck
 irq_tx_path_in_lpi_mode_n: 15749
 irq_tx_path_exit_lpi_mode_n: 15748
 irq_rx_path_in_lpi_mode_n: 0
 irq_rx_path_exit_lpi_mode_n: 0
 phy_eee_wakeup_error_n: 0
 ip_hdr_err: 0
 ip_payload_err: 0
 ip_csum_bypassed: 0
 ipv4_pkt_rcvd: 0
 ipv6_pkt_rcvd: 0
 no_ptp_rx_msg_type_ext: 0
 ptp_rx_msg_type_sync: 0
 ptp_rx_msg_type_follow_up: 0
 ptp_rx_msg_type_delay_req: 0
 ptp_rx_msg_type_delay_resp: 0
 ptp_rx_msg_type_pdelay_req: 0
 ptp_rx_msg_type_pdelay_resp: 0
 ptp_rx_msg_type_pdelay_follow_up: 0
 ptp_rx_msg_type_announce: 0
 ptp_rx_msg_type_management: 0
 ptp_rx_msg_pkt_reserved_type: 0
 ptp_frame_type: 0
 ptp_ver: 0
 timestamp_dropped: 0
 av_pkt_rcvd: 0
 av_tagged_pkt_rcvd: 0
 vlan_tag_priority_val: 0
 l3_filter_match: 0
 l4_filter_match: 0
 l3_l4_filter_no_match: 0
 irq_pcs_ane_n: 0
 irq_pcs_link_n: 0
 irq_rgmii_n: 0
 mtl_tx_status_fifo_full: 0
 mtl_tx_fifo_not_empty: 0
 mmtl_fifo_ctrl: 0
 mtl_tx_fifo_read_ctrl_write: 0
 mtl_tx_fifo_read_ctrl_wait: 0
 mtl_tx_fifo_read_ctrl_read: 0
 mtl_tx_fifo_read_ctrl_idle: 0
 mac_tx_in_pause: 0
 mac_tx_frame_ctrl_xfer: 0
 mac_tx_frame_ctrl_idle: 0
 mac_tx_frame_ctrl_wait: 0
 mac_tx_frame_ctrl_pause: 0
 mac_gmii_tx_proto_engine: 0
 mtl_rx_fifo_fill_level_full: 0
 mtl_rx_fifo_fill_above_thresh: 0
 mtl_rx_fifo_fill_below_thresh: 0
 mtl_rx_fifo_fill_level_empty: 0
 mtl_rx_fifo_read_ctrl_flush: 0
 mtl_rx_fifo_read_ctrl_read_data: 0
 mtl_rx_fifo_read_ctrl_status: 0
 mtl_rx_fifo_read_ctrl_idle: 0
 mtl_rx_fifo_ctrl_active: 0
 mac_rx_frame_ctrl_fifo: 0
 mac_gmii_rx_proto_engine: 0
 tx_tso_frames: 0
 tx_tso_nfrags: 0


regards,
Simon


Am 06.03.2019 um 12:35 schrieb Simon Huelck:
> Hi,
>
> i sorted out some more things:
>
> - I had not activated TCP window scaling; with it enabled, iperf3 now
> reaches 930 MBits. This was related to my firewall and therefore my uplink.
>
> So the remaining topic is (and I'm currently testing with next-20190306):
>
> TX is starving RX, and the total bandwidth seems to be limited to 930 MBits
> instead of something like 930 MBits * 2 for duplex.
>
> @Jose, do you have any hints on that? I saw that you introduced
> patches for that, but somehow RX/TX are not sharing the NAPI budget
> equally. But I wonder why the TX queue and RX queue collide at all;
> shouldn't they be independent?
>
> regards,
> Simon
>> Hi guys,
>>
>>
>> 1. I discovered something strange. When I never configure my external
>> VLAN interface UP (firewall doesn't start, traffic shaper CAKE doesn't
>> start), my non-duplex iperf bandwidth increases from 600 MBits to 930.
>>
>> 2. Duplex isn't working: TX is totally starving RX, and the total bandwidth
>> is 900 MBits. What's going on there?
>>
>> 3. I had an MTU issue (it was set to 1500, but due to VLANs etc. 1450 would
>> be better), but this didn't change performance.
>> 4. Even when I bring eth0.4 up, then bring it down, then flush iptables, and
>> the shaper was never added, I drop to 600 MBits.
>>
>> questions:
>> - Why is duplex still not working even though the kernel says so?
>> - Why is TX totally starving RX, even though duplex is "on"?
>> - When I flush all my iptables rules and the traffic shaper, I'm still
>> bound to 600 MBits ... very strange, does anyone have an idea? Upping eth0.4
>> cuts the performance, even when other VLAN IFs like eth0.3,
>> eth0.2, eth0.5 are up and bridged (eth0.4 isn't bridged anywhere).
>>
>>
>>
>> my setup:
>>
>> br-dmz  8000.7ef0fd9b157f   no  eth0.2
>> br-guest    8000.001f1fbbbd60   no  wlan0
>> br-iot  8000.7ef0fd9b157f   no  eth0.5
>> br-lan  8000.001f1fbbbd61   no  eth0.3
>>     wlan0_1
>>     wlan2
>>
>> eth0.4 is my uplink
>>
>> all the bridges are internal, eth0.4 is external
>>
>>
>> C:\Users\Simon\Downloads\iperf3.6_64bit\iperf3.6_64bit>iperf3.exe -c
>> 10.10.11.1 -i1
>> warning: Ignoring nonsense TCP MSS 0
>> Connecting to host 10.10.11.1, port 5201
>> [  5] local 10.10.11.100 port 52173 connected to 10.10.11.1 port 5201
>> [ ID] Interval   Transfer Bitrate
>> [  5]   0.00-1.00   sec   384 KBytes  3.14 Mbits/sec
>> [  5]   1.00-2.00   sec   384 KBytes  3.15 Mbits/sec
>> [  5]   2.00-3.00   sec  1.12 MBytes  9.44 Mbits/sec
>> [  5]   3.00-4.00   sec  2.00 MBytes  16.8 Mbits/sec
>> [  5]   4.00-5.00   sec  2.38 MBy

[PATCH] ray_cs: Check return value of pcmcia_register_driver

2019-03-06 Thread Yue Haibing
From: YueHaibing 

init_ray_cs does not check the return value of pcmcia_register_driver;
if registration fails, this may cause a NULL pointer dereference in
exit_ray_cs.

Signed-off-by: YueHaibing 
---
 drivers/net/wireless/ray_cs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/wireless/ray_cs.c b/drivers/net/wireless/ray_cs.c
index 44a943d..d561659 100644
--- a/drivers/net/wireless/ray_cs.c
+++ b/drivers/net/wireless/ray_cs.c
@@ -2795,6 +2795,8 @@ static int __init init_ray_cs(void)
rc = pcmcia_register_driver(&ray_driver);
pr_debug("raylink init_module register_pcmcia_driver returns 0x%x\n",
  rc);
+   if (rc)
+   return rc;
 
 #ifdef CONFIG_PROC_FS
proc_mkdir("driver/ray_cs", NULL);
-- 
2.7.0
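
The early return added above follows a common init pattern: bail out before creating any state that a later exit path would try to tear down. A minimal user-space sketch, assuming a hypothetical `demo_register_driver` whose result is injected by the caller (it is not the PCMCIA API):

```c
#include <stdbool.h>

/* Tracks whether the later setup step ran; stands in for the procfs entries. */
static bool demo_proc_created;

/* Hypothetical registration call; the caller chooses the result to simulate. */
static int demo_register_driver(int simulated_rc)
{
	return simulated_rc;
}

static int demo_init(int simulated_rc)
{
	int rc = demo_register_driver(simulated_rc);

	if (rc)
		return rc;	/* nothing else has been set up yet */

	demo_proc_created = true;
	return 0;
}
```

Because the error is propagated, a failed init leaves nothing behind for the exit path to undo.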




[PATCH] ray_cs: use remove_proc_subtree to simplify procfs code

2019-03-06 Thread Yue Haibing
From: YueHaibing 

Use remove_proc_subtree to remove the whole subtree in one call
instead of removing each entry individually.

Signed-off-by: YueHaibing 
---
 drivers/net/wireless/ray_cs.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ray_cs.c b/drivers/net/wireless/ray_cs.c
index d561659..ee4d810 100644
--- a/drivers/net/wireless/ray_cs.c
+++ b/drivers/net/wireless/ray_cs.c
@@ -2820,11 +2820,7 @@ static void __exit exit_ray_cs(void)
pr_debug("ray_cs: cleanup_module\n");
 
 #ifdef CONFIG_PROC_FS
-   remove_proc_entry("driver/ray_cs/ray_cs", NULL);
-   remove_proc_entry("driver/ray_cs/essid", NULL);
-   remove_proc_entry("driver/ray_cs/net_type", NULL);
-   remove_proc_entry("driver/ray_cs/translate", NULL);
-   remove_proc_entry("driver/ray_cs", NULL);
+   remove_proc_subtree("driver/ray_cs", NULL);
 #endif
 
pcmcia_unregister_driver(&ray_driver);
-- 
2.7.0




Re: [PATCH net-next v2 00/12] Refactor flower classifier to remove dependency on rtnl lock

2019-03-06 Thread Vlad Buslov


On Tue 05 Mar 2019 at 01:57, Stefano Brivio  wrote:
> On Wed, 27 Feb 2019 12:12:14 +0200
> Vlad Buslov  wrote:
>
>> Currently, all netlink protocol handlers for updating rules, actions and
>> qdiscs are protected with single global rtnl lock which removes any
>> possibility for parallelism. This patch set is a third step to remove
>> rtnl lock dependency from TC rules update path.
>> 
>> Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
>> TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
>> already registered with this flag and only take rtnl lock when qdisc or
>> classifier requires it. Classifiers can indicate that their ops
>> callbacks don't require caller to hold rtnl lock by setting the
>> TCF_PROTO_OPS_DOIT_UNLOCKED flag. The goal of this change is to refactor
>> flower classifier to support unlocked execution and register it with
>> unlocked flag.
>> 
>> This patch set implements following changes to make flower classifier
>> concurrency-safe:
>> 
>> - Implement reference counting for individual filters. Change fl_get to
>>   take reference to filter. Implement tp->ops->put callback that was
>>   introduced in cls API patch set to release reference to flower filter.
>> 
>> - Use tp->lock spinlock to protect internal classifier data structures
>>   from concurrent modification.
>> 
>> - Handle concurrent tcf proto deletion by returning EAGAIN, which will
>>   cause cls API to retry and create new proto instance or return error
>>   to the user (depending on message type).
>> 
>> - Handle concurrent insertion of filter with same priority and handle by
>>   returning EAGAIN, which will cause cls API to lookup filter again and
>>   process it accordingly to netlink message flags.
>> 
>> - Extend flower mask with reference counting and protect masks list with
>>   masks_lock spinlock.
>> 
>> - Prevent concurrent mask insertion by inserting temporary value to
>>   masks hash table. This is necessary because mask initialization is a
>>   sleeping operation and cannot be done while holding tp->lock.
>> 
>> Both chain-level and classifier-level conflicts are resolved by
>> returning -EAGAIN to the cls API, which results in a restart of the whole operation.
>> This retry mechanism is a result of fine-grained locking approach used
>> in this and previous changes in series and is necessary to allow
>> concurrent updates on same chain instance. Alternative approach would be
>> to lock the whole chain while updating filters on any of child tp's,
>> adding and removing classifier instances from the chain. However, since
>> most CPU-intensive parts of filter update code are specifically in
>> classifier code and its dependencies (extensions and hw offloads), such
>> approach would negate most of the gains introduced by this change and
>> previous changes in the series when updating same chain instance.
>> 
>> Tcf hw offloads API is not changed by this patch set and still requires
>> caller to hold rtnl lock. Refactored flower classifier tracks rtnl lock
>> state by means of 'rtnl_held' flag provided by cls API and obtains the
>> lock before calling hw offloads. Following patch set will lift this
>> restriction and refactor cls hw offloads API to support unlocked
>> execution.
>> 
>> With these changes flower classifier is safely registered with
>> TCF_PROTO_OPS_DOIT_UNLOCKED flag in last patch.
>> 
>> Changes from V1 to V2:
>> - Extend cover letter with explanation about retry mechanism.
>> - Rebase on current net-next.
>> - Patch 1:
>>   - Use rcu_dereference_raw() for tp->root dereference.
>>   - Update comment in fl_head_dereference().
>> - Patch 2:
>>   - Remove redundant check in fl_change error handling code.
>>   - Add empty line between error check and new handle assignment.
>> - Patch 3:
>>   - Refactor loop in fl_get_next_filter() to improve readability.
>> - Patch 4:
>>   - Refactor __fl_delete() to improve readability.
>> - Patch 6:
>>   - Fix comment in fl_check_assign_mask().
>> - Patch 9:
>>   - Extend commit message.
>>   - Fix error code in comment.
>> - Patch 11:
>>   - Fix fl_hw_replace_filter() to always release rtnl lock in error
>> handlers.
>> - Patch 12:
>>   - Don't take rtnl lock before calling __fl_destroy_filter() in
>> workqueue context.
>>   - Extend commit message with explanation why flower still takes rtnl
>> lock before calling hardware offloads API.
>
> FWIW,
>
> Reviewed-by: Stefano Brivio 

Thanks for reviewing this!

Regards,
Vlad
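
The retry mechanism described in the quoted cover letter can be pictured with a toy loop. This is only a sketch under stated assumptions: `demo_try_update` is a hypothetical operation that reports a simulated conflict as -EAGAIN, and the loop restarts the whole operation until it goes through. It is not the cls API code.

```c
#include <errno.h>

/* Hypothetical update step: fails with -EAGAIN while simulated conflicts remain. */
static int demo_try_update(int *conflicts_left)
{
	if (*conflicts_left > 0) {
		(*conflicts_left)--;
		return -EAGAIN;
	}
	return 0;
}

/* Restart the whole operation for as long as a concurrent change is reported. */
static int demo_update_with_retry(int conflicts, int *attempts)
{
	int err;

	*attempts = 0;
	do {
		(*attempts)++;
		err = demo_try_update(&conflicts);
	} while (err == -EAGAIN);

	return err;
}
```

Any error other than -EAGAIN ends the loop and is returned to the caller unchanged.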


[PATCH] ssb: Fix possible NULL pointer dereference in ssb_host_pcmcia_exit

2019-03-06 Thread Yue Haibing
From: YueHaibing 

Syzkaller reported this:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN PTI
CPU: 0 PID: 4492 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 
04/01/2014
RIP: 0010:sysfs_remove_file_ns+0x27/0x70 fs/sysfs/file.c:468
Code: 00 00 00 41 54 55 48 89 fd 53 49 89 d4 48 89 f3 e8 ee 76 9c ff 48 8d 7d 
30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 2d 48 89 
da 48 b8 00 00 00 00 00 fc ff df 48 8b 6d
RSP: 0018:8881e9d9fc00 EFLAGS: 00010206
RAX: dc00 RBX: 900367e0 RCX: 81a95952
RDX: 0006 RSI: c90001405000 RDI: 0030
RBP:  R08: fbfff1fa22ed R09: fbfff1fa22ed
R10: 0001 R11: fbfff1fa22ec R12: 
R13: c1abdac0 R14: 11103d3b3f8b R15: 
FS:  7fe409dc1700() GS:8881f120() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001b2d721000 CR3: 0001e98b6005 CR4: 007606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 sysfs_remove_file include/linux/sysfs.h:519 [inline]
 driver_remove_file+0x40/0x50 drivers/base/driver.c:122
 pcmcia_remove_newid_file drivers/pcmcia/ds.c:163 [inline]
 pcmcia_unregister_driver+0x7d/0x2b0 drivers/pcmcia/ds.c:209
 ssb_modexit+0xa/0x1b [ssb]
 __do_sys_delete_module kernel/module.c:1018 [inline]
 __se_sys_delete_module kernel/module.c:961 [inline]
 __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961
 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fe409dc0c58 EFLAGS: 0246 ORIG_RAX: 00b0
RAX: ffda RBX: 0073bf00 RCX: 00462e99
RDX:  RSI:  RDI: 20c0
RBP: 0002 R08:  R09: 
R10:  R11: 0246 R12: 7fe409dc16bc
R13: 004bccaa R14: 006f6bc8 R15: 
Modules linked in: ssb(-) 3c59x nvme_core macvlan tap pata_hpt3x3 rt2x00pci 
null_blk tsc40 pm_notifier_error_inject notifier_error_inject mdio cdc_wdm 
nf_reject_ipv4 ath9k_common ath9k_hw ath pppox ppp_generic slhc ehci_platform 
wl12xx wlcore tps6507x_ts ioc4 nf_synproxy_core ide_gd_mod ax25 can_dev iwlwifi 
can_raw atm tm2_touchkey can_gw can sundance adp5588_keys rt2800mmio rt2800lib 
rt2x00mmio rt2x00lib eeprom_93cx6 pn533 lru_cache elants_i2c ip_set nfnetlink 
gameport tipc hampshire nhc_ipv6 nhc_hop nhc_udp nhc_fragment nhc_routing 
nhc_mobility nhc_dest 6lowpan silead brcmutil nfc mt76_usb mt76 mac80211 
iptable_security iptable_raw iptable_mangle iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti 
ip_gre sit hsr veth vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon vcan 
bridge stp llc ip6_gre ip6_tunnel tunnel6 tun joydev mousedev serio_raw 
ide_pci_generic piix floppy ide_core sch_fq_codel ip_tables x_tables ipv6
 [last unloaded: 3c59x]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 3913cbf8011e1c05 ]---

In ssb_modinit, SSB init does not fail when ssb_host_pcmcia_init fails;
however, in ssb_modexit, ssb_host_pcmcia_exit calls pcmcia_unregister_driver
unconditionally, which may trigger a NULL pointer dereference as shown above.

Reported-by: Hulk Robot 
Fixes: 399500da18f7 ("ssb: pick PCMCIA host code support from b43 driver")
Signed-off-by: YueHaibing 
---
 drivers/ssb/bridge_pcmcia_80211.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/ssb/bridge_pcmcia_80211.c 
b/drivers/ssb/bridge_pcmcia_80211.c
index f51f150..ffa379e 100644
--- a/drivers/ssb/bridge_pcmcia_80211.c
+++ b/drivers/ssb/bridge_pcmcia_80211.c
@@ -113,16 +113,21 @@ static struct pcmcia_driver ssb_host_pcmcia_driver = {
.resume = ssb_host_pcmcia_resume,
 };
 
+static int pcmcia_init_failed;
+
 /*
  * These are not module init/exit functions!
  * The module_pcmcia_driver() helper cannot be used here.
  */
 int ssb_host_pcmcia_init(void)
 {
-   return pcmcia_register_driver(&ssb_host_pcmcia_driver);
+   pcmcia_init_failed = pcmcia_register_driver(&ssb_host_pcmcia_driver);
+
+   return pcmcia_init_failed;
 }
 
 void ssb_host_pcmcia_exit(void)
 {
-   pcmcia_unregister_driver(&ssb_host_pcmcia_driver);
+   if (!pcmcia_init_failed)
+   pcmcia_unregister_driver(&ssb_host_pcmcia_driver);
 }
-- 
2.7.0
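
The same idea in a stand-alone form: remember whether registration succeeded so that the exit path only undoes what was actually done. All `demo_*` names are hypothetical, and the registration result is injected by the caller rather than coming from the PCMCIA core:

```c
/* Number of live registrations; must never go below zero. */
static int demo_registered_count;

/* Result of the last init attempt; non-zero means registration failed. */
static int demo_init_failed;

/* Hypothetical registration: succeeds only when simulated_rc is 0. */
static int demo_register(int simulated_rc)
{
	if (!simulated_rc)
		demo_registered_count++;
	return simulated_rc;
}

static void demo_unregister(void)
{
	demo_registered_count--;
}

static int demo_host_init(int simulated_rc)
{
	demo_init_failed = demo_register(simulated_rc);

	return demo_init_failed;
}

static void demo_host_exit(void)
{
	/* Skip the teardown if there is nothing to tear down. */
	if (!demo_init_failed)
		demo_unregister();
}
```

An alternative design would be for the module init to fail outright when registration fails; the flag approach keeps the existing behaviour where the rest of the module still loads.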




Re: general protection fault in sctp_sched_rr_dequeue

2019-03-06 Thread Neil Horman
On Wed, Mar 06, 2019 at 06:43:48PM +0800, Xin Long wrote:
> On Wed, Mar 6, 2019 at 9:42 AM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:63bdf4284c38 Merge branch 'linus' of git://git.kernel.org/..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=100347cb20
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=872be05707464aaa
> > dashboard link: https://syzkaller.appspot.com/bug?extid=4c9934f20522c0efd657
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11cd9b0320
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=127de8e720
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+4c9934f20522c0efd...@syzkaller.appspotmail.com
> >
> > kauditd_printk_skb: 2 callbacks suppressed
> > audit: type=1400 audit(1551833288.424:35): avc:  denied  { map } for
> > pid=8035 comm="bash" path="/bin/bash" dev="sda1" ino=1457
> > scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> > tcontext=system_u:object_r:file_t:s0 tclass=file permissive=1
> > audit: type=1400 audit(1551833294.934:36): avc:  denied  { map } for
> > pid=8047 comm="syz-executor778" path="/root/syz-executor778173561"
> > dev="sda1" ino=16484 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> > tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 8047 Comm: syz-executor778 Not tainted 5.0.0+ #7
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:sctp_sched_rr_dequeue+0xd3/0x170 net/sctp/stream_sched_rr.c:141
> The panic was caused by sched->init() resetting stream->rr_next to NULL, even
> if outq->out_chunk_list is not empty.
> 
> We should remove the sched->init() call from sctp_stream_init(), since
> all sched info was moved into sout->ext and sctp_stream_alloc_out()
> will not affect it.
> 
I think what you're saying is we can just let sctp_outq_init handle the stream
scheduler initialization, correct?  If so, ACK to that approach.
Neil

> > Code: ea 03 80 3c 02 00 0f 85 a2 00 00 00 48 8b 5b 08 e8 62 20 ee fa 48 8d
> > 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75
> > 53 4c 8b 6b 30 4c 89 e7 49 83 ed 18 4c 89 ee e8 b4
> > RSP: 0018:88809eacf040 EFLAGS: 00010206
> > RAX: dc00 RBX:  RCX: 8679cd9f
> > RDX: 0006 RSI: 8681c41e RDI: 0030
> > RBP: 88809eacf058 R08: 8880a12bc300 R09: 0002
> > R10: ed1015d25bcf R11: 8880ae92de7b R12: 88807cae6ca0
> > R13: 88807cae6580 R14: dc00 R15: 88809eacf198
> > FS:  01865880() GS:8880ae90() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 55af2d491150 CR3: 8dd7b000 CR4: 001406e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >   sctp_outq_dequeue_data net/sctp/outqueue.c:90 [inline]
> >   sctp_outq_flush_data net/sctp/outqueue.c:1079 [inline]
> >   sctp_outq_flush+0xba2/0x2790 net/sctp/outqueue.c:1205
> >   sctp_outq_uncork+0x6c/0x80 net/sctp/outqueue.c:772
> >   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> >   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> >   sctp_do_sm+0x513/0x5390 net/sctp/sm_sideeffect.c:1191
> >   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1074
> >   sctp_inq_push+0x1ea/0x290 net/sctp/inqueue.c:95
> >   sctp_backlog_rcv+0x189/0xbc0 net/sctp/input.c:354
> >   sk_backlog_rcv include/net/sock.h:937 [inline]
> >   __release_sock+0x12e/0x3a0 net/core/sock.c:2413
> >   release_sock+0x59/0x1c0 net/core/sock.c:2929
> >   sctp_wait_for_connect+0x316/0x540 net/sctp/socket.c:8999
> >   sctp_sendmsg_to_asoc+0x13e2/0x17d0 net/sctp/socket.c:1968
> >   sctp_sendmsg+0x10a9/0x17e0 net/sctp/socket.c:2114
> >   inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
> >   sock_sendmsg_nosec net/socket.c:622 [inline]
> >   sock_sendmsg+0xdd/0x130 net/socket.c:632
> >   ___sys_sendmsg+0x806/0x930 net/socket.c:2137
> >   __sys_sendmsg+0x105/0x1d0 net/socket.c:2175
> >   __do_sys_sendmsg net/socket.c:2184 [inline]
> >   __se_sys_sendmsg net/socket.c:2182 [inline]
> >   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2182
> >   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x440159
> > Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> > ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> 

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread Jiri Pirko
Tue, Mar 05, 2019 at 01:50:59AM CET, si-wei@oracle.com wrote:
>When a netdev appears through hot plug then gets enslaved by a failover
>master that is already up and running, the slave will be opened
>right away after getting enslaved. Today there's a race that userspace
>(udev) may fail to rename the slave if the kernel (net_failover)
>opens the slave earlier than when the userspace rename happens.
>Unlike bond or team, the primary slave of failover can't be renamed by
>userspace ahead of time, since the kernel initiated auto-enslavement is
>unable to, or rather, is never meant to be synchronized with the rename
>request from userspace.
>
>The failover slave interfaces are not designed to be operated
>directly by userspace apps: IP configuration, filter rules for
>network traffic, and so on should all be done on the master
>interface. In general, userspace apps only care about the
>name of master interface, while slave names are less important as long
>as admin users can see reliable names that may carry
>other information describing the netdev. For example, they can infer that
>"ens3nsby" is a standby slave of "ens3", while for a
>name like "eth0" they can't tell which master it belongs to.
>
>Historically the name of an IFF_UP interface can't be changed because
>there might be admin scripts or management software already
>relying on such behavior and assuming that the slave name can't be
>changed once UP. But failover is special: with the in-kernel
>auto-enslavement mechanism, the userspace expectation for device
>enumeration and bring-up order is already broken. Previously initramfs
>and various userspace config tools were modified to bypass failover
>slaves because of auto-enslavement and duplicate MAC addresses. Similarly,
>if users care about seeing reliable slave names, the new type
>of failover slaves needs to be taken care of specifically in userspace
>anyway.
>
>For that to work, introduce a module-level tunable,
>"slave_rename_ok", that allows users to lift the rename restriction on
>a failover slave which is already UP. Although this change may
>break userspace components (most likely configuration scripts
>or management software) that assume the slave name can't be changed while
>UP, that is a relatively limited and controllable set among all userspace
>components, which can be fixed specifically to work with the new naming
>behavior of the failover slave. Userspace components interacting with
>slaves should be changed to operate on the failover master instead, as the
>failover slave is dynamic in nature and may come and go at any point.
>The goal is to make the role of failover slaves less relevant, and
>all userspace should only deal with the master in the long run. The default
>for "slave_rename_ok" is true (1). If userspace doesn't have
>the right support in place and users don't care about reliable
>userspace naming, the value can be set to false (0).
>
>Signed-off-by: si-wei@oracle.com
>Reviewed-by: Liran Alon 
>---
> include/linux/netdevice.h |  3 +++
> net/core/dev.c|  3 ++-
> net/core/failover.c   | 11 +--
> 3 files changed, 14 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 857f8ab..6d9e4e0 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1487,6 +1487,7 @@ struct net_device_ops {
>  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>  * @IFF_FAILOVER: device is a failover master device
>  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
>  */
> enum netdev_priv_flags {
>   IFF_802_1Q_VLAN = 1<<0,
>@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
>   IFF_NO_RX_HANDLER   = 1<<26,
>   IFF_FAILOVER= 1<<27,
>   IFF_FAILOVER_SLAVE  = 1<<28,
>+  IFF_SLAVE_RENAME_OK = 1<<29,
> };
> 
> #define IFF_802_1Q_VLAN   IFF_802_1Q_VLAN
>@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
> #define IFF_FAILOVER  IFF_FAILOVER
> #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE
>+#define IFF_SLAVE_RENAME_OK   IFF_SLAVE_RENAME_OK
> 
> /**
>  *struct net_device - The DEVICE structure.
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 722d50d..ae070de 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
>*newname)
>   BUG_ON(!dev_net(dev));
> 
>   net = dev_net(dev);
>-  if (dev->flags & IFF_UP)
>+  if (dev->flags & IFF_UP &&
>+  !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
>   return -EBUSY;
> 
>   write_seqcount_begin(&devnet_rename_seq);
>diff --git a/net/core/failover.c b/net/core/failover.c
>index 4a92a98..1fd8bbb 
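
A minimal sketch of the check that the quoted dev_change_name() hunk introduces, modelled in plain C outside the kernel. The struct and every sketch_ name are hypothetical stand-ins; only the bit positions (IFF_UP as bit 0, IFF_SLAVE_RENAME_OK as 1<<29) and the -EBUSY return follow the quoted patch.

```c
#include <assert.h>
#include <errno.h>

/* Bit values mirror the quoted patch; redefined locally so the
 * sketch builds without kernel headers. */
#define SKETCH_IFF_UP              (1u << 0)
#define SKETCH_IFF_SLAVE_RENAME_OK (1u << 29)

/* Hypothetical stand-in for the two flag words of struct net_device. */
struct sketch_netdev {
	unsigned int flags;       /* models dev->flags */
	unsigned int priv_flags;  /* models dev->priv_flags */
};

/* Returns 0 when a rename may proceed, -EBUSY when it must be refused. */
int sketch_rename_check(const struct sketch_netdev *dev)
{
	/* A running device is only renameable if it carries the new flag. */
	if ((dev->flags & SKETCH_IFF_UP) &&
	    !(dev->priv_flags & SKETCH_IFF_SLAVE_RENAME_OK))
		return -EBUSY;
	return 0;
}
```

With this model, a device that is UP without the flag is refused, while a device that is down, or UP with the flag set, passes the check.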

Re: [PATCH net-next v2 4/7] devlink: allow subports on devlink PCI ports

2019-03-06 Thread Jiri Pirko
Tue, Mar 05, 2019 at 06:15:34PM CET, jakub.kicin...@netronome.com wrote:
>On Tue, 5 Mar 2019 12:06:01 +0100, Jiri Pirko wrote:
>> >> >as ports.  Can we invent a new command (say "partition"?) that'd take
>> >> >the bus info where the partition is to be spawned?
>> >> 
>> >> Got it. But the question is how different this object would be from the
>> >> existing "port" we have today.  
>> >
>> >They'd be where "the other side of a PCI link" is represented,
>> >restricting ports to only ASIC's forwarding plane ports.  
>> 
>> Basically a "host port", right? It can still be the same port object,
>> only with different flavour and attributes. So we would have:
>> 
>> 1) pci/:05:00.0/0: type eth netdev enp5s0np0
>>flavour physical switch_id 00154d130d2f
>> 2) pci/:05:00.0/1: type eth netdev enp5s0npf0s0
>>flavour pci_pf pf 0 subport 0
>>switch_id 00154d130d2f
>>peer pci/:05:00.0/1
>> 3) pci/:05:00.0/10001: type eth netdev enp5s0npf0vf0
>>flavour pci_vf pf 0 vf 0
>>switch_id 00154d130d2f
>>peer pci/:05:10.1/0
>> 4) pci/:05:00.0/10001: type eth netdev enp5s0npf0s1
>>flavour pci_pf pf 0 subport 1
>>switch_id 00154d130d2f
>>peer pci/:05:00.0/2
>> 5) pci/:05:00.0/1: type eth netdev enp5s0f0??
>>flavour host  <
>>peer pci/:05:00.0/1
>> 6) pci/:05:10.1/0: type eth netdev enp5s10f0 
>>flavour host  <
>>peer pci/:05:00.0/10001
>> 7) pci/:05:00.0/2: type eth netdev enp5s0f0??
>>flavour host  <
>>peer pci/:05:00.0/10001
>> 
>> I think it looks quite clear, it gives complete topology view.
>
>Okay, I have some questions :)
>
>What do we use for port_index?

That is just a number totally in control of the driver. Driver can
assign it in any way.


>
>What are the operations one can perform on "host ports"?

That is a good question. I would start with *none* and extend it as
needed.


>
>If we have PCI parameters, do they get set on the ASIC side of the port
>or the host side of the port?

Could you give me an example? But I believe they would be set on the
switch-port side.


>
>How do those behave when device is passed to VM?

In case of VF? VF will have separate devlink instance (separate handle,
probably "aliased" to the PF handle). So it would disappear from
baremetal and appear in VM:
$ devlink dev
pci/:00:10.0
$ devlink dev port
pci/:00:10.1/0: type eth netdev enp5s10f0
flavour host
That's it for the VM.

There's no linkage (peer, alias) between this and the instances on
baremetal. 


>
>You have a VF devlink instance there - what ports does it show?

See above.


>
>How do those look when the PF is connected to another host?  Do they
>get spawned at all?

What do you mean by "PF is connected to another host"?


>
>Will this not be confusing to DSA folks who have a CPU port?

Why do you think so?


Re: [PATCH net] gre: fix kernel panic when using lw tunnel

2019-03-06 Thread Nicolas Dichtel
Le 06/03/2019 à 11:32, Nicolas Dichtel a écrit :
> There were several problems:
>  - skb_dst(skb) can be NULL when the packet comes from a gretap tunnel;
>  - skb_dst(skb)->ops may point to md_dst_ops, which doesn't set ->mtu
>handler, thus calling dst_mtu() leads to a panic.
> 
> I also wonder if ->cow_metrics may be called if skb_dst(skb)->ops points
> to ovs_dst_ops.
> 
> Don't try to do anything in one of those cases.
> 
> Fixes: 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to 
> ip_md_tunnel_xmit")
> CC: wenxu 
> Signed-off-by: Nicolas Dichtel 
Please, drop it, Alan's version is better.


Regards,
Nicolas


Re: [PATCH net] iptunnel: NULL pointer deref for ip_md_tunnel_xmit

2019-03-06 Thread Nicolas Dichtel
Le 06/03/2019 à 11:25, Alan Maguire a écrit :
> Naresh Kamboju noted the following oops during execution of selftest
> tools/testing/selftests/bpf/test_tunnel.sh on x86_64:
> 
> [  274.120445] BUG: unable to handle kernel NULL pointer dereference
> at 
> [  274.128285] #PF error: [INSTR]
> [  274.131351] PGD 800414a0e067 P4D 800414a0e067 PUD 3b6334067 PMD 0
> [  274.138241] Oops: 0010 [#1] SMP PTI
> [  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
> 5.0.0-rc4-next-20190129 #1
> [  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.0b 07/27/2017
> [  274.156526] RIP: 0010:  (null)
> [  274.160280] Code: Bad RIP value.
> [  274.163509] RSP: 0018:bc9681f83540 EFLAGS: 00010286
> [  274.168726] RAX:  RBX: dc967fa80a18 RCX: 
> 
> [  274.175851] RDX: 9db2ee08b540 RSI: 000e RDI: 
> dc967fa809a0
> [  274.182974] RBP: bc9681f83580 R08: 9db2c4d62690 R09: 
> 000c
> [  274.190098] R10:  R11: 9db2ee08b540 R12: 
> 9db31ce7c000
> [  274.197222] R13: 0001 R14: 000c R15: 
> 9db3179cf400
> [  274.204346] FS:  7ff4ae7c5740() GS:9db31fa8()
> knlGS:
> [  274.212424] CS:  0010 DS:  ES:  CR0: 80050033
> [  274.218162] CR2: ffd6 CR3: 0004574da004 CR4: 
> 003606e0
> [  274.225292] DR0:  DR1:  DR2: 
> 
> [  274.232416] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  274.239541] Call Trace:
> [  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
> [  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
> [  274.250176]  gre_fb_xmit+0x330/0x390
> [  274.253754]  gre_tap_xmit+0x128/0x180
> [  274.257414]  dev_hard_start_xmit+0xb7/0x300
> [  274.261598]  sch_direct_xmit+0xf6/0x290
> [  274.265430]  __qdisc_run+0x15d/0x5e0
> [  274.269007]  __dev_queue_xmit+0x2c5/0xc00
> [  274.273011]  ? dev_queue_xmit+0x10/0x20
> [  274.276842]  ? eth_header+0x2b/0xc0
> [  274.280326]  dev_queue_xmit+0x10/0x20
> [  274.283984]  ? dev_queue_xmit+0x10/0x20
> [  274.287813]  arp_xmit+0x1a/0xf0
> [  274.290952]  arp_send_dst.part.19+0x46/0x60
> [  274.295138]  arp_solicit+0x177/0x6b0
> [  274.298708]  ? mod_timer+0x18e/0x440
> [  274.302281]  neigh_probe+0x57/0x70
> [  274.305684]  __neigh_event_send+0x197/0x2d0
> [  274.309862]  neigh_resolve_output+0x18c/0x210
> [  274.314212]  ip_finish_output2+0x257/0x690
> [  274.318304]  ip_finish_output+0x219/0x340
> [  274.322314]  ? ip_finish_output+0x219/0x340
> [  274.326493]  ip_output+0x76/0x240
> [  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
> [  274.334510]  ip_local_out+0x3f/0x70
> [  274.337992]  ip_send_skb+0x19/0x40
> [  274.341391]  ip_push_pending_frames+0x33/0x40
> [  274.345740]  raw_sendmsg+0xc15/0x11d0
> [  274.349403]  ? __might_fault+0x85/0x90
> [  274.353151]  ? _copy_from_user+0x6b/0xa0
> [  274.357070]  ? rw_copy_check_uvector+0x54/0x130
> [  274.361604]  inet_sendmsg+0x42/0x1c0
> [  274.365179]  ? inet_sendmsg+0x42/0x1c0
> [  274.368937]  sock_sendmsg+0x3e/0x50
> [  274.372460]  ___sys_sendmsg+0x26f/0x2d0
> [  274.376293]  ? lock_acquire+0x95/0x190
> [  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
> [  274.384307]  ? lock_acquire+0x95/0x190
> [  274.388053]  ? __audit_syscall_entry+0xdd/0x130
> [  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
> [  274.397461]  ? __audit_syscall_entry+0xdd/0x130
> [  274.401989]  ? trace_hardirqs_on+0x4c/0x100
> [  274.406173]  __sys_sendmsg+0x63/0xa0
> [  274.409744]  ? __sys_sendmsg+0x63/0xa0
> [  274.413488]  __x64_sys_sendmsg+0x1f/0x30
> [  274.417405]  do_syscall_64+0x55/0x190
> [  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  274.426113] RIP: 0033:0x7ff4ae0e6e87
> [  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
> 00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
> 24 08
> [  274.448422] RSP: 002b:7ffcd9b76db8 EFLAGS: 0246 ORIG_RAX:
> 002e
> [  274.455978] RAX: ffda RBX: 0040 RCX: 
> 7ff4ae0e6e87
> [  274.463104] RDX:  RSI: 006092e0 RDI: 
> 0003
> [  274.470228] RBP:  R08: 7ffcd9bc40a0 R09: 
> 7ffcd9bc4080
> [  274.477349] R10: 060a R11: 0246 R12: 
> 0003
> [  274.484475] R13: 0016 R14: 7ffcd9b77fa0 R15: 
> 7ffcd9b78da4
> [  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
> ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
> test_bpf]
> [  274.504634] CR2: 
> [  274.507976] ---[ end trace 196d18386545eae1 ]---
> [  274.512588] RIP: 0010:  (null)
> [  274.516334] Code: Bad RIP value.
> [  274.519557] RSP: 0018:bc9681f83540 EFLAGS: 00010286
> [  274.524775] RAX:  

[PATCH net] net: sched: flower: insert new filter to idr after setting its mask

2019-03-06 Thread Vlad Buslov
When adding a new filter to the flower classifier, fl_change() inserts it into
handle_idr before initializing filter extensions and assigning it a mask.
Normally this ordering doesn't matter because all flower classifier ops
callbacks assume rtnl lock protection. However, when the filter has an action
whose kernel module is not loaded, the rtnl lock is released before the
call to request_module(). During this time the filter can be accessed by a
concurrent task before its initialization is completed, which can lead to a
crash.

Example case of NULL pointer dereference in concurrent dump:

Task 1                           Task 2

tc_new_tfilter()
 fl_change()
  idr_alloc_u32(fnew)
  fl_set_parms()
   tcf_exts_validate()
    tcf_action_init()
     tcf_action_init_1()
      rtnl_unlock()
      request_module()
      ...                        rtnl_lock()
                                 tc_dump_tfilter()
                                  tcf_chain_dump()
                                   fl_walk()
                                    idr_get_next_ul()
                                    tcf_node_dump()
                                     tcf_fill_node()
                                      fl_dump()
                                       mask = &f->mask->key; <- NULL ptr
      rtnl_lock()

Extension initialization and mask assignment don't depend on fnew->handle
that is allocated by idr_alloc_u32(). Move the idr allocation code after action
creation and mask assignment in fl_change() to prevent concurrent access
to a not fully initialized filter when the rtnl lock is released to load an
action module.

Fixes: 01683a146999 ("net: sched: refactor flower walk to iterate over idr")
Signed-off-by: Vlad Buslov 
Reviewed-by: Roi Dayan 
---
 net/sched/cls_flower.c | 43 ++-
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 27300a3e76c7..c04247b403ed 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1348,46 +1348,46 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
if (err < 0)
goto errout;
 
-   if (!handle) {
-   handle = 1;
-   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
-   INT_MAX, GFP_KERNEL);
-   } else if (!fold) {
-   /* user specifies a handle and it doesn't exist */
-   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
-   handle, GFP_KERNEL);
-   }
-   if (err)
-   goto errout;
-   fnew->handle = handle;
-
if (tb[TCA_FLOWER_FLAGS]) {
fnew->flags = nla_get_u32(tb[TCA_FLOWER_FLAGS]);
 
if (!tc_flags_valid(fnew->flags)) {
err = -EINVAL;
-   goto errout_idr;
+   goto errout;
}
}
 
err = fl_set_parms(net, tp, fnew, mask, base, tb, tca[TCA_RATE], ovr,
   tp->chain->tmplt_priv, extack);
if (err)
-   goto errout_idr;
+   goto errout;
 
err = fl_check_assign_mask(head, fnew, fold, mask);
if (err)
-   goto errout_idr;
+   goto errout;
+
+   if (!handle) {
+   handle = 1;
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   INT_MAX, GFP_KERNEL);
+   } else if (!fold) {
+   /* user specifies a handle and it doesn't exist */
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   handle, GFP_KERNEL);
+   }
+   if (err)
+   goto errout_mask;
+   fnew->handle = handle;
 
if (!fold && __fl_lookup(fnew->mask, &fnew->mkey)) {
err = -EEXIST;
-   goto errout_mask;
+   goto errout_idr;
}
 
err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
 fnew->mask->filter_ht_params);
if (err)
-   goto errout_mask;
+   goto errout_idr;
 
if (!tc_skip_hw(fnew->flags)) {
err = fl_hw_replace_filter(tp, fnew, extack);
@@ -1426,12 +1426,13 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
   fnew->mask->filter_ht_params);
 
-errout_mask:
-   fl_mask_put(head, fnew->mask, false);
-
 errout_idr:
if (!fold)
idr_remove(&head->handle_idr, fnew->handle);
+
+errout_mask:
+   fl_mask_put(head, fnew->mask, false);
+
 errout:
tcf_exts_destroy(&fnew->exts);
kfree(fnew);
-- 
2.13.6
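
The reordering in the patch above is an instance of "initialize fully, then publish". A minimal single-threaded sketch of that idea follows; it is not the flower code. The fixed-size table stands in for head->handle_idr, and every sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

struct sketch_mask { int key; };

struct sketch_filter {
	struct sketch_mask *mask;  /* NULL until the filter is initialized */
	unsigned int handle;
};

/* Stand-in for the idr: whatever is stored here is visible to walkers. */
struct sketch_filter *sketch_table[SKETCH_SLOTS];

/* Make a filter visible; returns a 1-based handle, or 0 if the table is full. */
unsigned int sketch_publish(struct sketch_filter *f)
{
	unsigned int i;

	for (i = 0; i < SKETCH_SLOTS; i++) {
		if (!sketch_table[i]) {
			sketch_table[i] = f;
			return i + 1;
		}
	}
	return 0;
}

/* Remove a previously published filter from the table. */
void sketch_unpublish(unsigned int handle)
{
	if (handle >= 1 && handle <= SKETCH_SLOTS)
		sketch_table[handle - 1] = NULL;
}

/* Models a dump walk: counts visible filters a dumper would crash on. */
int sketch_walk_count_null_mask(void)
{
	unsigned int i;
	int n = 0;

	for (i = 0; i < SKETCH_SLOTS; i++)
		if (sketch_table[i] && !sketch_table[i]->mask)
			n++;
	return n;
}

/* Ordering after the fix: assign the mask first, publish last. */
unsigned int sketch_add_filter(struct sketch_filter *f, struct sketch_mask *m)
{
	f->mask = m;
	f->handle = sketch_publish(f);
	return f->handle;
}
```

Publishing a filter before its mask is set leaves a window in which the walker sees an entry with a NULL mask; sketch_add_filter() closes that window by construction, which is the property the patch aims for.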



[PATCH net] net: hsr: fix memory leak in hsr_dev_finalize()

2019-03-06 Thread Mao Wenan
If hsr_add_port(hsr, hsr_dev, HSR_PT_MASTER) fails to add the port,
hsr_dev_finalize() returns res directly and forgets to free the node
that was allocated in hsr_create_self_node(), and forgets to delete
node->mac_list from hsr->self_node_db.

BUG: memory leak
unreferenced object 0x8881cfa0c780 (size 64):
  comm "syz-executor.0", pid 2077, jiffies 4294717969 (age 2415.377s)
  hex dump (first 32 bytes):
e0 c7 a0 cf 81 88 ff ff 00 02 00 00 00 00 ad de  
00 e6 49 cd 81 88 ff ff c0 9b 87 d0 81 88 ff ff  ..I.
  backtrace:
[] hsr_dev_finalize+0x736/0x960 [hsr]
[<3ed2e597>] hsr_newlink+0x2b2/0x3e0 [hsr]
[<3fa8c6b6>] __rtnl_newlink+0xf1f/0x1600 net/core/rtnetlink.c:3182
[<1247a7ad>] rtnl_newlink+0x66/0x90 net/core/rtnetlink.c:3240
[] rtnetlink_rcv_msg+0x54e/0xb90 net/core/rtnetlink.c:5130
[<5556bd3a>] netlink_rcv_skb+0x129/0x340 
net/netlink/af_netlink.c:2477
[<741d5ee6>] netlink_unicast_kernel net/netlink/af_netlink.c:1310 
[inline]
[<741d5ee6>] netlink_unicast+0x49a/0x650 
net/netlink/af_netlink.c:1336
[<9d56f9b7>] netlink_sendmsg+0x88b/0xdf0 
net/netlink/af_netlink.c:1917
[<46b35c59>] sock_sendmsg_nosec net/socket.c:621 [inline]
[<46b35c59>] sock_sendmsg+0xc3/0x100 net/socket.c:631
[] __sys_sendto+0x33e/0x560 net/socket.c:1786
[] __do_sys_sendto net/socket.c:1798 [inline]
[] __se_sys_sendto net/socket.c:1794 [inline]
[] __x64_sys_sendto+0xdd/0x1b0 net/socket.c:1794
[] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
[] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[] 0x

Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for 
slave devices.")
Reported-by: Hulk Robot 
Signed-off-by: Mao Wenan 
---
 net/hsr/hsr_device.c   |  4 +++-
 net/hsr/hsr_framereg.c | 12 
 net/hsr/hsr_framereg.h |  1 +
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index b8cd43c9ed5b..c4676bacb8db 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -486,7 +486,7 @@ int hsr_dev_finalize(struct net_device *hsr_dev, struct 
net_device *slave[2],
 
res = hsr_add_port(hsr, hsr_dev, HSR_PT_MASTER);
if (res)
-   return res;
+   goto err_add_port;
 
res = register_netdevice(hsr_dev);
if (res)
@@ -506,6 +506,8 @@ int hsr_dev_finalize(struct net_device *hsr_dev, struct 
net_device *slave[2],
 fail:
hsr_for_each_port(hsr, port)
hsr_del_port(port);
+err_add_port:
+   hsr_del_node(&hsr->self_node_db);
 
return res;
 }
diff --git a/net/hsr/hsr_framereg.c b/net/hsr/hsr_framereg.c
index 286ceb41ac0c..9af16cb68f76 100644
--- a/net/hsr/hsr_framereg.c
+++ b/net/hsr/hsr_framereg.c
@@ -124,6 +124,18 @@ int hsr_create_self_node(struct list_head *self_node_db,
return 0;
 }
 
+void hsr_del_node(struct list_head *self_node_db)
+{
+   struct hsr_node *node;
+
+   rcu_read_lock();
+   node = list_first_or_null_rcu(self_node_db, struct hsr_node, mac_list);
+   rcu_read_unlock();
+   if (node) {
+   list_del_rcu(&node->mac_list);
+   kfree(node);
+   }
+}
 
 /* Allocate an hsr_node and add it to node_db. 'addr' is the node's AddressA;
  * seq_out is used to initialize filtering of outgoing duplicate frames
diff --git a/net/hsr/hsr_framereg.h b/net/hsr/hsr_framereg.h
index 370b45998121..531fd3dfcac1 100644
--- a/net/hsr/hsr_framereg.h
+++ b/net/hsr/hsr_framereg.h
@@ -16,6 +16,7 @@
 
 struct hsr_node;
 
+void hsr_del_node(struct list_head *self_node_db);
 struct hsr_node *hsr_add_node(struct list_head *node_db, unsigned char addr[],
  u16 seq_out);
 struct hsr_node *hsr_get_node(struct hsr_port *port, struct sk_buff *skb,
-- 
2.20.1
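
The fix above follows the usual goto-based unwind: every early exit after an allocation jumps to a label that releases it. A minimal userspace sketch of that shape follows, with malloc/free standing in for the node allocation. All sketch_ names are hypothetical, and the live-node counter exists only so the leak can be observed.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_node { int placeholder; };

/* Number of nodes currently allocated; nonzero after a failure means a leak. */
int sketch_live_nodes;

struct sketch_node *sketch_create_self_node(void)
{
	struct sketch_node *node = malloc(sizeof(*node));

	if (node) {
		node->placeholder = 0;
		sketch_live_nodes++;
	}
	return node;
}

void sketch_del_node(struct sketch_node *node)
{
	if (node) {
		free(node);
		sketch_live_nodes--;
	}
}

/* Stand-in for a later setup step that can fail on demand. */
int sketch_add_port(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* On success the caller owns *out; on failure nothing stays allocated. */
int sketch_finalize(int fail_add_port, struct sketch_node **out)
{
	struct sketch_node *node;
	int res;

	*out = NULL;
	node = sketch_create_self_node();
	if (!node)
		return -1;

	res = sketch_add_port(fail_add_port);
	if (res)
		goto err_add_port;  /* not a bare "return res" */

	*out = node;
	return 0;

err_add_port:
	sketch_del_node(node);
	return res;
}
```

Replacing the goto with a plain return would leave sketch_live_nodes at 1 after a failed call, which is the leak pattern the report describes.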



Re: [ovs-dev] openvswitch crash on i386

2019-03-06 Thread Juerg Haefliger
On Tue, 5 Mar 2019 11:58:42 -0800
Joe Stringer  wrote:

> On Tue, Mar 5, 2019 at 2:12 AM Christian Ehrhardt
>  wrote:
> >
> > On Tue, Mar 5, 2019 at 10:58 AM Juerg Haefliger
> >  wrote:  
> > >
> > > Hi,
> > >
> > > Running the following commands in a loop will crash an i386 5.0 kernel
> > > typically within a few iterations:
> > >
> > > ovs-vsctl add-br test
> > > ovs-vsctl del-br test
> > >
> > > [  106.215748] BUG: unable to handle kernel paging request at e8a35f3b
> > > [  106.216733] #PF error: [normal kernel read fault]
> > > [  106.217464] *pdpt = 19a76001 *pde = 
> > > [  106.218346] Oops:  [#1] SMP PTI
> > > [  106.218911] CPU: 0 PID: 2050 Comm: systemd-udevd Tainted: G
> > > E 5.0.0 #25
> > > [  106.220103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS 1.11.1-1ubuntu1 04/01/2014
> > > [  106.221447] EIP: kmem_cache_alloc_trace+0x7a/0x1b0
> > > [  106.222178] Code: 01 00 00 8b 07 64 8b 50 04 64 03 05 28 61 e8 d2 8b 
> > > 08 89 4d ec 85 c9 0f 84 03 01 00 00 8b 45 ec 8b 5f 14 8d 4a 01 8b 37 01 
> > > c3 <33> 1b 33 9f b4 00 00 00 64 0f c7 0e 75 cb 8b 75 ec 8b 47 14 0f 18
> > > [  106.224752] EAX: e8a35f3b EBX: e8a35f3b ECX: 869f EDX: 869e
> > > [  106.225683] ESI: d2e96ef0 EDI: da401a00 EBP: d9b85dd0 ESP: d9b85db0
> > > [  106.226662] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 
> > > 00010282
> > > [  106.227710] CR0: 80050033 CR2: e8a35f3b CR3: 185b8000 CR4: 06f0
> > > [  106.228703] DR0:  DR1:  DR2:  DR3: 
> > > [  106.229604] DR6: fffe0ff0 DR7: 0400
> > > [  106.230114] Call Trace:
> > > [  106.230525]  ? kernfs_fop_open+0xb4/0x390
> > > [  106.231176]  kernfs_fop_open+0xb4/0x390
> > > [  106.231856]  ? security_file_open+0x7c/0xc0
> > > [  106.232562]  do_dentry_open+0x131/0x370
> > > [  106.233229]  ? kernfs_fop_write+0x180/0x180
> > > [  106.233905]  vfs_open+0x25/0x30
> > > [  106.234432]  path_openat+0x2fd/0x1450
> > > [  106.235084]  ? cp_new_stat64+0x115/0x140
> > > [  106.235754]  ? cp_new_stat64+0x115/0x140
> > > [  106.236427]  do_filp_open+0x6a/0xd0
> > > [  106.237026]  ? cp_new_stat64+0x115/0x140
> > > [  106.237748]  ? strncpy_from_user+0x3d/0x180
> > > [  106.238539]  ? __alloc_fd+0x36/0x120
> > > [  106.239256]  do_sys_open+0x175/0x210
> > > [  106.239955]  sys_openat+0x1b/0x20
> > > [  106.240596]  do_fast_syscall_32+0x7f/0x1e0
> > > [  106.241313]  entry_SYSENTER_32+0x6b/0xbe
> > > [  106.242017] EIP: 0xb7fae871
> > > [  106.242559] Code: 8b 98 58 cd ff ff 89 c8 85 d2 74 02 89 0a 5b 5d c3 
> > > 8b 04 24 c3 8b 14 24 c3 8b 34 24 c3 8b 3c 24 c3 51 52 55 89 e5 0f 34 cd 
> > > 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > > [  106.245551] EAX: ffda EBX: ff9c ECX: bffdcb60 EDX: 00088000
> > > [  106.246651] ESI:  EDI: b7f9e000 EBP: 00088000 ESP: bffdc970
> > > [  106.247706] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 
> > > 0246
> > > [  106.248851] Modules linked in: openvswitch(E)
> > > [  106.249621] CR2: e8a35f3b
> > > [  106.250218] ---[ end trace 6a8d05679a59cda7 ]---
> > >
> > > I've bisected this down to the following commit that seems to have 
> > > introduced
> > > the issue:
> > >
> > > commit 120645513f55a4ac5543120d9e79925d30a0156f (refs/bisect/bad)
> > > Author: Jarno Rajahalme 
> > > Date:   Fri Apr 21 16:48:06 2017 -0700
> > >
> > > openvswitch: Add eventmask support to CT action.
> > >
> > > Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
> > > which can be used in conjunction with the commit flag
> > > (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
> > > conntrack events (IPCT_*) should be delivered via the Netfilter
> > > netlink multicast groups.  Default behavior depends on the system
> > > configuration, but typically a lot of events are delivered.  This can 
> > > be
> > > very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
> > > types of events are of interest.
> > >
> > > Netfilter core init_conntrack() adds the event cache extension, so we
> > > only need to set the ctmask value.  However, if the system is
> > > configured without support for events, the setting will be skipped due
> > > to extension not being found.
> > >
> > > Signed-off-by: Jarno Rajahalme 
> > > Reviewed-by: Greg Rose 
> > > Acked-by: Joe Stringer 
> > > Signed-off-by: David S. Miller   
> >
> > Hi Juerg,
> > the symptom, the identified breaking commit and actually all of it
> > seems to be [1] which James, Joseph and I worked on already.
> > I wanted to make you aware of the past context that already exists.
> >
> > Back then we already reverted the change and found it to be working.
> > Afterwards Joseph brought it up with Jarno [2] and got some patch, it
> > seems, but that (whatever change it was - I have never seen it) wasn't
> > enough and it was still crashing.
> > Then we lost tr

[PATCH net-next] net: sched: fix potential use-after-free in __tcf_chain_put()

2019-03-06 Thread Vlad Buslov
When used with an unlocked classifier that has filters attached to actions
with goto chain, __tcf_chain_put() for the last non-action reference can race
with calls to the same function from action cleanup code that releases the last
action reference. In this case the action cleanup handler could free the chain
if it executes after all references to the chain were released, but before all
concurrent users finished using it. Modify __tcf_chain_put() to only access
tcf_chain fields when holding block->lock. Remove local variables that were
used to cache some tcf_chain fields and are no longer needed because their
values can now be obtained directly from the chain under block->lock
protection.

Fixes: 726d061286ce ("net: sched: prevent insertion of new classifiers during 
chain flush")
Signed-off-by: Vlad Buslov 
---
 net/sched/cls_api.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 478095d50f95..2c2aac4ac721 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -470,10 +470,9 @@ static void __tcf_chain_put(struct tcf_chain *chain, bool 
by_act,
 {
struct tcf_block *block = chain->block;
const struct tcf_proto_ops *tmplt_ops;
-   bool is_last, free_block = false;
+   bool free_block = false;
unsigned int refcnt;
void *tmplt_priv;
-   u32 chain_index;
 
mutex_lock(&block->lock);
if (explicitly_created) {
@@ -492,23 +491,21 @@ static void __tcf_chain_put(struct tcf_chain *chain, bool 
by_act,
 * save these to temporary variables.
 */
refcnt = --chain->refcnt;
-   is_last = refcnt - chain->action_refcnt == 0;
tmplt_ops = chain->tmplt_ops;
tmplt_priv = chain->tmplt_priv;
-   chain_index = chain->index;
-
-   if (refcnt == 0)
-   free_block = tcf_chain_detach(chain);
-   mutex_unlock(&block->lock);
 
/* The last dropped non-action reference will trigger notification. */
-   if (is_last && !by_act) {
-   tc_chain_notify_delete(tmplt_ops, tmplt_priv, chain_index,
+   if (refcnt - chain->action_refcnt == 0 && !by_act) {
+   tc_chain_notify_delete(tmplt_ops, tmplt_priv, chain->index,
   block, NULL, 0, 0, false);
/* Last reference to chain, no need to lock. */
chain->flushing = false;
}
 
+   if (refcnt == 0)
+   free_block = tcf_chain_detach(chain);
+   mutex_unlock(&block->lock);
+
if (refcnt == 0) {
tc_chain_tmplt_del(tmplt_ops, tmplt_priv);
tcf_chain_destroy(chain, free_block);
-- 
2.13.6
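
The rule the patch above enforces is that every read of a chain field happens while block->lock is held, and only values copied into locals are used after the unlock. A single-threaded sketch of that discipline follows; the "lock" is a plain flag that only checks pairing, it provides no real mutual exclusion, and all sketch_ names are hypothetical.

```c
#include <assert.h>

/* Toy lock: records whether it is held so misuse trips an assert. */
struct sketch_block { int locked; };

struct sketch_chain {
	struct sketch_block *block;
	unsigned int refcnt;         /* all references */
	unsigned int action_refcnt;  /* references held by actions */
};

static void sketch_lock(struct sketch_block *b)
{
	assert(!b->locked);
	b->locked = 1;
}

static void sketch_unlock(struct sketch_block *b)
{
	assert(b->locked);
	b->locked = 0;
}

/*
 * Drops one reference. Returns 1 if this call dropped the last
 * non-action reference (the point where a notification would be sent).
 * *destroy is set to 1 when the chain has no references left.
 */
int sketch_chain_put(struct sketch_chain *chain, int by_act, int *destroy)
{
	struct sketch_block *block = chain->block;
	unsigned int refcnt;
	int notify;

	sketch_lock(block);
	if (by_act)
		chain->action_refcnt--;
	refcnt = --chain->refcnt;
	/* Decide everything that depends on chain fields before unlocking. */
	notify = (refcnt - chain->action_refcnt == 0) && !by_act;
	sketch_unlock(block);

	/* From here on only locals are used; the chain is not touched. */
	*destroy = (refcnt == 0);
	return notify;
}
```

Starting from one filter reference and one action reference, the non-action put reports the notification and the later action put reports destruction, without either path reading the chain outside the locked region.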



Re: [PATCH v3 bpf-next 1/2] bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release

2019-03-06 Thread Lorenz Bauer
On Mon, 4 Mar 2019 at 17:43, Martin Lau  wrote:
>
> On Mon, Mar 04, 2019 at 10:33:46AM +0100, Daniel Borkmann wrote:
> > On 03/02/2019 09:21 PM, Martin Lau wrote:
> > > On Sat, Mar 02, 2019 at 10:03:03AM -0800, Alexei Starovoitov wrote:
> > >> On Sat, Mar 02, 2019 at 08:10:10AM -0800, Martin KaFai Lau wrote:
> > >>> Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
> > >>> can still be accessed after bpf_sk_release(sk).
> > >>> Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
> > >>> This patch addresses them together.
> > >>>
> > >>> A simple reproducer looks like this:
> > >>>
> > >>> sk = bpf_sk_lookup_tcp();
> > >>> /* if (!sk) ... */
> > >>> tp = bpf_tcp_sock(sk);
> > >>> /* if (!tp) ... */
> > >>> bpf_sk_release(sk);
> > >>> snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
> > >>>
> > >>> The problem is the verifier did not scrub the register's states of
> > >>> the tcp_sock ptr (tp) after bpf_sk_release(sk).
> > >>>
> > >>> [ Note that when calling bpf_tcp_sock(sk), the sk is not always
> > >>>   refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
> > >>>   fine for this case. ]
> > >>>
> > >>> Currently, the verifier does not track if a helper's return ptr (in 
> > >>> REG_0)
> > >>> is "carry"-ing one of its argument's refcount status. To carry this 
> > >>> info,
> > >>> the reg1->id needs to be stored in reg0.  The reg0->id has already
> > >>> been used for NULL checking purpose.  Hence, a new "refcount_id"
> > >>> is needed in "struct bpf_reg_state".
> > >>>
> > >>> With refcount_id, when bpf_sk_release(sk) is called, the verifier can 
> > >>> scrub
> > >>> all reg states which has a refcount_id match.  It is done with the 
> > >>> changes
> > >>> in release_reg_references().
> > >>>
> > >>> When acquiring and releasing a refcount, the reg->id is still used.
> > >>> Hence, we cannot do "bpf_sk_release(tp)" in the above reproducer
> > >>> example.
> > >>
> > >> I think the choice of returning listener full sock from req sock
> > >> in sk_to_full_sk() was a wrong one.
> > >> It seems better to make semantics of bpf_tcp_sock() and 
> > >> bpf_sk_fullsock() as
> > >> always type cast or null.
> > >> And have a separate helper for req socket that returns 
> > >> inet_reqsk(sk)->rsk_listener.
> > >>
> > >> Then it will be ok to call bpf_sk_release(tp) when tp came from 
> > >> bpf_sk_lookup_tcp.
> > >> The verifier will know that it's the case because its ID will be in 
> > >> acquired_refs.
> > >>
> > >> The additional refcount_id won't be necessary.
> > >> bpf_sk_fullsock() and bpf_tcp_sock() will not call sk_to_full_sk
> > >> and the verifier will be copying reg1->id into reg0->id.
> > >>
> > >> In release_reference() the verifier will do
> > >>   if (regs[i].id == id)
> > >> mark_reg_unknown(env, regs, i);
> > >> for all socket types.
> > >>
> > >> release_reference_state() will stay as-is.
> > >>
> > >> imo such logic will be easier to follow.
> > >>
> > >> This implicit sk_to_full_sk() makes the whole thing much harder for
> > >> the verifier and for the bpf program writers.
> > >>
> > >> The new bpf_get_listener_sock(sk) doesn't have to copy ID from reg1
> > >> to reg0 since req socket will not be returned from bpf_sk_lookup_tcp
> > >> and its ID will not be stored in acquired_refs.
> > >>
> > >> Does it make sense ?
> > > I like this idea.  Many thanks for thinking it through!
> > >
> > > Allowing bpf_sk_release(tp), no need to call bpf_sk_release() on ptr
> > > returned from bpf_get_listener_sock(sk) and keep one reg->id.
> > >
> > > I think it should work.  I will rework the patches.
> >
> > Agree, makes sense, that seems much better fix.
> While I was working on this change, based on the code, one issue I saw is:
>
> if the bpf prog does this:
>
> sk = bpf_sk_lookup_tcp();
> /* if (!sk) ... */
> fullsock = bpf_sk_fullsock(sk);
> if (!fullsock) {
>         bpf_sk_release(sk); /* Fail. sk_reg->id not found in ref state */
>         return 0;
> }
>
> The bpf_sk_release(sk) failed because the reference state has already
> been released by "release_reference_state(state, fullsock_reg->id)" during
> "if (!fullsock) /* handled by mark_ptr_or_null_regs(is_null == true) */"
> Logically, I think bpf_sk_release(sk) should not fail regardless of
> bpf_sk_fullsock() doing sk_to_full_sk() or not.
>
> bpf_sk_fullsock() could disallow PTR_TO_SOCKET or PTR_TO_TCP_SOCK but that
> would be weird.
>
> I think we still need two ids.  Maybe rename the refcount_id proposed in
> this patch to ref_obj_id which is the original refcounted object id.
>
> If the sk_to_full_sk() is removed from bpf_sk_fullsock() and bpf_tcp_sock(),
> these two helpers become a simple cast (i.e. either return the same pointer
> or NULL).  Then bpf_sk_release(fullsock) and bpf_sk_release(tp) could work:
>
> - When is_null == true, release_reference_state(state, reg->id) is called.

If I understand correctly, this works because we never
acquire_refere

Re: [PATCH] tcp: detecting the misuse of .sendpage for Slab objects

2019-03-06 Thread Eric Dumazet



On 03/06/2019 03:10 AM, Vasily Averin wrote:
> sendpage was not designed for processing of the Slab pages,
> in some situations it can trigger BUG_ON on receiving side.
> 
> Signed-off-by: Vasily Averin 
> ---
>  net/ipv4/tcp.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ad07dd71063d..dbb08140cdc9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -943,6 +943,10 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
> *page, int offset,
>   ssize_t copied;
>   long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
>  
> + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> + WARN_ONCE(PageSlab(page), "page must not be a Slab one"))
> + return -EINVAL;
> +
>   /* Wait for a connection to finish. One exception is TCP Fast Open
>* (passive side) where data is allowed to be sent before a connection
>* is fully established.
> 

SGTM

David, this probably can be merged into net tree.

Signed-off-by: Eric Dumazet 



Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-06 Thread Michael S. Tsirkin
On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> It was noticed that the copy_user() friends that were used to access
> virtqueue metadata tend to be very expensive for dataplane
> implementations like vhost since they involve lots of software checks,
> speculation barriers, hardware feature toggling (e.g. SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing becomes more significant.
> 
> This patch tries to eliminate those overheads by accessing them
> through kernel virtual address by vmap(). To allow the pages to be
> migrated, instead of pinning them through GUP, we use MMU notifiers to
> invalidate vmaps and re-establish vmaps during each round of metadata
> prefetching if necessary. It looks to me .invalidate_range() is
> sufficient for catching this since we don't need extra TLB flush. For
> devices that don't use metadata prefetching, the memory accessors
> fall back to the normal copy_user() implementation gracefully. The
> invalidation was synchronized with the datapath through the vq mutex, and
> in order to avoid holding the vq mutex during range checking, the MMU
> notifier was torn down when trying to modify vq metadata.
> 
> Dirty page checking is done by calling set_page_dirty_locked()
> explicitly for the page that used ring stay after each round of
> processing.
> 
> Note that this was only done when device IOTLB is not enabled. We
> could use similar method to optimize it in the future.
> 
> Tests show at most about 22% improvement on TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
> 
>         SMAP on | SMAP off
> Before: 5.0Mpps | 6.6Mpps
> After:  6.1Mpps | 7.4Mpps
> 
> Cc: 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c   |   2 +
>  drivers/vhost/vhost.c | 281 
> +-
>  drivers/vhost/vhost.h |  16 +++
>  3 files changed, 297 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bf55f99..c276371 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -982,6 +982,7 @@ static void handle_tx(struct vhost_net *net)
>   else
>   handle_tx_copy(net, sock);
>  
> + vq_meta_prefetch_done(vq);
>  out:
>   mutex_unlock(&vq->mutex);
>  }
> @@ -1250,6 +1251,7 @@ static void handle_rx(struct vhost_net *net)
>   vhost_net_enable_vq(net, vq);
>  out:
>   vhost_net_signal_used(nvq);
> + vq_meta_prefetch_done(vq);
>   mutex_unlock(&vq->mutex);
>  }
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1015464..36ccf7c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -434,6 +434,74 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue 
> *vq, int num)
>   return sizeof(*vq->desc) * num;
>  }
>  
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> + if (map->addr) {
> + vunmap(map->unmap_addr);
> + kfree(map->pages);
> + map->pages = NULL;
> + map->npages = 0;
> + }
> +
> + map->addr = NULL;
> + map->unmap_addr = NULL;
> +}
> +
> +static void vhost_invalidate_vmap(struct vhost_virtqueue *vq,
> +   struct vhost_vmap *map,
> +   unsigned long ustart,
> +   size_t size,
> +   unsigned long start,
> +   unsigned long end)
> +{
> + if (end < ustart || start > ustart - 1 + size)
> + return;
> +
> + dump_stack();
> + mutex_lock(&vq->mutex);
> + vhost_uninit_vmap(map);
> + mutex_unlock(&vq->mutex);
> +}
> +
> +
> +static void vhost_invalidate(struct vhost_dev *dev,
> +  unsigned long start, unsigned long end)
> +{
> + int i;
> +
> + for (i = 0; i < dev->nvqs; i++) {
> + struct vhost_virtqueue *vq = dev->vqs[i];
> +
> + vhost_invalidate_vmap(vq, &vq->avail_ring,
> +   (unsigned long)vq->avail,
> +   vhost_get_avail_size(vq, vq->num),
> +   start, end);
> + vhost_invalidate_vmap(vq, &vq->desc_ring,
> +   (unsigned long)vq->desc,
> +   vhost_get_desc_size(vq, vq->num),
> +   start, end);
> + vhost_invalidate_vmap(vq, &vq->used_ring,
> +   (unsigned long)vq->used,
> +   vhost_get_used_size(vq, vq->num),
> +   start, end);
> + }
> +}
> +
> +
> +static void vhost_invalidate_range(struct mmu_notifier *mn,
> +struct mm_struct *mm,
> +unsigned long start, unsigned long end)
> +{
> + struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +  

Re: [PATCH] vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock

2019-03-06 Thread Stefan Hajnoczi
On Wed, Mar 06, 2019 at 11:10:41AM +0200, Adalbert Lazăr wrote:
> On Wed, 6 Mar 2019 08:41:04 +, Stefan Hajnoczi  wrote:
> > On Tue, Mar 05, 2019 at 08:01:45PM +0200, Adalbert Lazăr wrote:
> > The pkt argument is the received packet that we must reply to.
> > > The reply packet is allocated just before line 680 and must be freed
> > > explicitly for return -ENOTCONN.
> > 
> > You can avoid the leak and make the code easier to read like this:
> > 
> >   struct virtio_vsock_pkt *reply;
> > 
> >   ...
> > 
> >  -- avoid reusing 'pkt'
> > v
> >   reply = virtio_transport_alloc_pkt(&info, 0, ...);
> >   if (!reply)
> >   return -ENOMEM;
> > 
> >   t = virtio_transport_get_ops();
> >   if (!t) {
> >   virtio_transport_free_pkt(reply); <-- prevent memory leak
> >   return -ENOTCONN;
> >   }
> >   return t->send_pkt(reply);
> 
> What do you think about Stefano's suggestion, to move the check above
> the line where the reply is allocated?

That's fine too.

However a follow up patch to eliminate the confusing way that 'pkt' is
reused is still warranted.  If you are busy I'd be happy to send that
cleanup.

Stefan
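
The ordering discussed above (look up the transport first, allocate the
reply only afterwards) can be sketched in user space as follows. This is
a minimal illustration, not the kernel code: every name here (struct pkt,
struct transport, alloc_pkt, free_pkt, consume_pkt, cur_ops,
reset_no_sock, live_allocs) is a hypothetical stand-in, and the real
virtio_transport_* functions take different arguments.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real vsock API. */
struct pkt { int op; };
struct transport { int (*send_pkt)(struct pkt *p); };

static int live_allocs;            /* packets allocated and not yet freed */
static struct transport *cur_ops;  /* NULL until a transport is registered */

static struct pkt *alloc_pkt(void)
{
	struct pkt *p = calloc(1, sizeof(*p));

	if (p)
		live_allocs++;
	return p;
}

static void free_pkt(struct pkt *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* A trivial send_pkt implementation that takes ownership and frees. */
static int consume_pkt(struct pkt *p)
{
	free_pkt(p);
	return 0;
}

static int reset_no_sock(void)
{
	const struct transport *t = cur_ops;
	struct pkt *reply;

	if (!t)
		return -ENOTCONN;   /* nothing allocated yet, nothing to free */

	reply = alloc_pkt();
	if (!reply)
		return -ENOMEM;

	return t->send_pkt(reply);  /* send_pkt takes ownership of reply */
}
```

With the check placed before the allocation there is no error path that
owns a packet, so no explicit free is needed for the -ENOTCONN case.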




[PATCH mlx5-next] net/mlx5: Fix DCT creation bad flow

2019-03-06 Thread Leon Romanovsky
From: Yishai Hadas 

In case the DCT creation command has succeeded a DRAIN must be issued
before calling DESTROY.

In addition, the original code used the wrong parameter for the DESTROY
command, 'in' instead of 'din', which caused another creation try
instead of destroying.

Cc:  # 4.15
Fixes: 57cda166bbe0 ("net/mlx5: Add DCT command interface")
Signed-off-by: Yishai Hadas 
Reviewed-by: Artemy Kovalyov 
Signed-off-by: Leon Romanovsky 
---
Jason, Doug

If it is possible, I would like to take this patch too:
https://patchwork.kernel.org/patch/10828299/

Thanks
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 64 +++-
 1 file changed, 34 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 370ca94b6775..54cdfb354c0e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -40,6 +40,9 @@
 #include "mlx5_core.h"
 #include "lib/eq.h"

+static int mlx5_core_drain_dct(struct mlx5_core_dev *dev,
+  struct mlx5_core_dct *dct);
+
 static struct mlx5_core_rsc_common *
 mlx5_get_rsc(struct mlx5_qp_table *table, u32 rsn)
 {
@@ -227,13 +230,40 @@ static void destroy_resource_common(struct mlx5_core_dev 
*dev,
wait_for_completion(&qp->common.free);
 }

+static int _mlx5_core_destroy_dct(struct mlx5_core_dev *dev,
+ struct mlx5_core_dct *dct, bool need_cleanup)
+{
+   u32 out[MLX5_ST_SZ_DW(destroy_dct_out)] = {0};
+   u32 in[MLX5_ST_SZ_DW(destroy_dct_in)]   = {0};
+   struct mlx5_core_qp *qp = &dct->mqp;
+   int err;
+
+   err = mlx5_core_drain_dct(dev, dct);
+   if (err) {
+   if (dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR) {
+   goto destroy;
+   } else {
+   mlx5_core_warn(dev, "failed drain DCT 0x%x with error 
0x%x\n", qp->qpn, err);
+   return err;
+   }
+   }
+   wait_for_completion(&dct->drained);
+destroy:
+   if (need_cleanup)
+   destroy_resource_common(dev, &dct->mqp);
+   MLX5_SET(destroy_dct_in, in, opcode, MLX5_CMD_OP_DESTROY_DCT);
+   MLX5_SET(destroy_dct_in, in, dctn, qp->qpn);
+   MLX5_SET(destroy_dct_in, in, uid, qp->uid);
+   err = mlx5_cmd_exec(dev, (void *)&in, sizeof(in),
+   (void *)&out, sizeof(out));
+   return err;
+}
+
 int mlx5_core_create_dct(struct mlx5_core_dev *dev,
 struct mlx5_core_dct *dct,
 u32 *in, int inlen)
 {
u32 out[MLX5_ST_SZ_DW(create_dct_out)]   = {0};
-   u32 din[MLX5_ST_SZ_DW(destroy_dct_in)]   = {0};
-   u32 dout[MLX5_ST_SZ_DW(destroy_dct_out)] = {0};
struct mlx5_core_qp *qp = &dct->mqp;
int err;

@@ -254,11 +284,7 @@ int mlx5_core_create_dct(struct mlx5_core_dev *dev,

return 0;
 err_cmd:
-   MLX5_SET(destroy_dct_in, din, opcode, MLX5_CMD_OP_DESTROY_DCT);
-   MLX5_SET(destroy_dct_in, din, dctn, qp->qpn);
-   MLX5_SET(destroy_dct_in, din, uid, qp->uid);
-   mlx5_cmd_exec(dev, (void *)&in, sizeof(din),
- (void *)&out, sizeof(dout));
+   _mlx5_core_destroy_dct(dev, dct, false);
return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_core_create_dct);
@@ -323,29 +349,7 @@ static int mlx5_core_drain_dct(struct mlx5_core_dev *dev,
 int mlx5_core_destroy_dct(struct mlx5_core_dev *dev,
  struct mlx5_core_dct *dct)
 {
-   u32 out[MLX5_ST_SZ_DW(destroy_dct_out)] = {0};
-   u32 in[MLX5_ST_SZ_DW(destroy_dct_in)]   = {0};
-   struct mlx5_core_qp *qp = &dct->mqp;
-   int err;
-
-   err = mlx5_core_drain_dct(dev, dct);
-   if (err) {
-   if (dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR) {
-   goto destroy;
-   } else {
-   mlx5_core_warn(dev, "failed drain DCT 0x%x with error 
0x%x\n", qp->qpn, err);
-   return err;
-   }
-   }
-   wait_for_completion(&dct->drained);
-destroy:
-   destroy_resource_common(dev, &dct->mqp);
-   MLX5_SET(destroy_dct_in, in, opcode, MLX5_CMD_OP_DESTROY_DCT);
-   MLX5_SET(destroy_dct_in, in, dctn, qp->qpn);
-   MLX5_SET(destroy_dct_in, in, uid, qp->uid);
-   err = mlx5_cmd_exec(dev, (void *)&in, sizeof(in),
-   (void *)&out, sizeof(out));
-   return err;
+   return _mlx5_core_destroy_dct(dev, dct, true);
 }
 EXPORT_SYMBOL_GPL(mlx5_core_destroy_dct);

--
2.19.1



[PATCH mlx5-next] IB/mlx5: Use mlx5 core to create/destroy a DEVX DCT

2019-03-06 Thread Leon Romanovsky
From: Yishai Hadas 

To prevent a hardware memory leak when a DEVX DCT object is destroyed
without a prior DRAIN DCT (e.g. in a cleanup flow), its creation and
destruction must be managed via mlx5 core.

In that case the DRAIN DCT command is issued first, and the DESTROY DCT
command is called only once the drain has completed.
Otherwise, the DESTROY DCT may fail and a hardware leak may occur.

With this change the DRAIN DCT command should no longer be exposed
from DEVX; it is managed internally by the driver to work as expected by
the device specification.

Fixes: 7efce3691d33 ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas 
Reviewed-by: Artemy Kovalyov 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c| 34 +++-
 drivers/infiniband/hw/mlx5/qp.c  |  4 ++-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |  6 ++--
 include/linux/mlx5/qp.h  |  3 +-
 4 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index eaa055007f28..9e08df7914aa 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -20,6 +20,7 @@
 
 enum devx_obj_flags {
DEVX_OBJ_FLAGS_INDIRECT_MKEY = 1 << 0,
+   DEVX_OBJ_FLAGS_DCT = 1 << 1,
 };
 
 struct devx_async_data {
@@ -39,7 +40,10 @@ struct devx_obj {
u32 dinlen; /* destroy inbox length */
u32 dinbox[MLX5_MAX_DESTROY_INBOX_SIZE_DW];
u32 flags;
-   struct mlx5_ib_devx_mr  devx_mr;
+   union {
+   struct mlx5_ib_devx_mr  devx_mr;
+   struct mlx5_core_dct    core_dct;
+   };
 };
 
 struct devx_umem {
@@ -347,7 +351,6 @@ static u64 devx_get_obj_id(const void *in)
obj_id = get_enc_obj_id(MLX5_CMD_OP_CREATE_RQ,
MLX5_GET(arm_rq_in, in, srq_number));
break;
-   case MLX5_CMD_OP_DRAIN_DCT:
case MLX5_CMD_OP_ARM_DCT_FOR_KEY_VIOLATION:
obj_id = get_enc_obj_id(MLX5_CMD_OP_CREATE_DCT,
MLX5_GET(drain_dct_in, in, dctn));
@@ -618,7 +621,6 @@ static bool devx_is_obj_modify_cmd(const void *in)
case MLX5_CMD_OP_2RST_QP:
case MLX5_CMD_OP_ARM_XRC_SRQ:
case MLX5_CMD_OP_ARM_RQ:
-   case MLX5_CMD_OP_DRAIN_DCT:
case MLX5_CMD_OP_ARM_DCT_FOR_KEY_VIOLATION:
case MLX5_CMD_OP_ARM_XRQ:
case MLX5_CMD_OP_SET_XRQ_DC_PARAMS_ENTRY:
@@ -1124,7 +1126,11 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
devx_cleanup_mkey(obj);
 
-   ret = mlx5_cmd_exec(obj->mdev, obj->dinbox, obj->dinlen, out, 
sizeof(out));
+   if (obj->flags & DEVX_OBJ_FLAGS_DCT)
+   ret = mlx5_core_destroy_dct(obj->mdev, &obj->core_dct);
+   else
+   ret = mlx5_cmd_exec(obj->mdev, obj->dinbox, obj->dinlen, out,
+   sizeof(out));
if (ib_is_destroy_retryable(ret, why, uobject))
return ret;
 
@@ -1185,9 +1191,17 @@ static int 
UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_CREATE)(
devx_set_umem_valid(cmd_in);
}
 
-   err = mlx5_cmd_exec(dev->mdev, cmd_in,
-   cmd_in_len,
-   cmd_out, cmd_out_len);
+   if (opcode == MLX5_CMD_OP_CREATE_DCT) {
+   obj->flags |= DEVX_OBJ_FLAGS_DCT;
+   err = mlx5_core_create_dct(dev->mdev, &obj->core_dct,
+  cmd_in, cmd_in_len,
+  cmd_out, cmd_out_len);
+   } else {
+   err = mlx5_cmd_exec(dev->mdev, cmd_in,
+   cmd_in_len,
+   cmd_out, cmd_out_len);
+   }
+
if (err)
goto obj_free;
 
@@ -1214,7 +1228,11 @@ static int 
UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_CREATE)(
if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
devx_cleanup_mkey(obj);
 obj_destroy:
-   mlx5_cmd_exec(obj->mdev, obj->dinbox, obj->dinlen, out, sizeof(out));
+   if (obj->flags & DEVX_OBJ_FLAGS_DCT)
+   mlx5_core_destroy_dct(obj->mdev, &obj->core_dct);
+   else
+   mlx5_cmd_exec(obj->mdev, obj->dinbox, obj->dinlen, out,
+ sizeof(out));
 obj_free:
kfree(obj);
return err;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 6b1f0e76900b..7cd006da1dae 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3729,6 +3729,7 @@ static int mlx5_ib_modify_dct(struct ib_qp *ibqp, struct 
ib_qp_attr *attr,
 
} else if (cur_state == IB_QPS_INIT && new_state == IB_QPS_RTR) {
struct mlx5_ib_modif

Re: [PATCH] net/rds: Accept peer connection reject messages due to incompatible version

2019-03-06 Thread Santosh Shilimkar

On 3/5/2019 11:04 PM, Gerd Rausch wrote:

Prior to
commit d021fabf525ff ("rds: rdma: add consumer reject")

function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
connection attempt by issuing a "rds_conn_drop".

The commit mentioned above added a "break", eliminating
the "fallthrough" case and made the "rds_conn_drop" rather conditional:

Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
carries an integer-value of "1" inside "private_data":


if (!conn)
+   break;
+   err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
+   if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+   pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping 
connection\n",
+   &conn->c_laddr, &conn->c_faddr);
+   conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+   rds_conn_drop(conn);
+   }
 rdsdebug("Connection rejected: %s\n",
  rdma_reject_msg(cm_id, event->status));
+   break;
 /* FALLTHROUGH */


A number of issues are worth mentioning here:
   #1) Previous versions of the RDS code simply rejected a connection
   by calling "rdma_reject(cm_id, NULL, 0);"
   So the value of the payload in "private_data" will not be "1",
   but "0".

   #2) Now the code has become dependent on host byte order and sizing.
   If one peer is big-endian, the other is little-endian,
   or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
   the *err check does not work as intended.

   #3) There is no check for "len" to see if the data behind *err is even valid.
   Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
   carry 148 bytes of zeroized payload.
   But that should probably not be relied upon here.

   #4) With the added "break;",
   we might as well drop the misleading "/* FALLTHROUGH */" comment.

This commit does _not_ address issue #2, as the sender would have to
agree on a byte order as well.

Here is the sequence of messages in this observed error-scenario:
   Host-A is pre-QoS changes (excluding the commit mentioned above)
   Host-B is post-QoS changes (including the commit mentioned above)

   #1 Host-B
  issues a connection request via function "rds_conn_path_transition"
  connection state transitions to "RDS_CONN_CONNECTING"

   #2 Host-A
  rejects the incompatible connection request (from #1)
  It does so by calling "rdma_reject(cm_id, NULL, 0);"

   #3 Host-B
  receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
  But since the code is changed in the way described above,
  it won't drop the connection here, simply because "*err == 0".

   #4 Host-A
  issues a connection request

   #5 Host-B
  receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
  and ends up calling "rds_ib_cm_handle_connect".
  But since the state is already in "RDS_CONN_CONNECTING"
  (as of #1) it will end up issuing a "rdma_reject" without
  dropping the connection:
 if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
 /* Wait and see - our connect may still be succeeding */
 rds_ib_stats_inc(s_ib_connect_raced);
 }
 goto out;

   #6 Host-A
  receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
  drops the connection and tries again (goto #4) until it gives up.

Orabug: 29444532

Signed-off-by: Gerd Rausch 
---
  net/rds/rdma_transport.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 46bce8389066..f628e7fda66d 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -112,7 +112,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id 
*cm_id,
if (!conn)
break;
err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
-   if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+   if (!err || (err && len >= sizeof(*err) && ((*err) <= 
RDS_RDMA_REJ_INCOMPAT))) {
pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping 
connection\n",
&conn->c_laddr, &conn->c_faddr);
conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
@@ -122,7 +122,6 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id 
*cm_id,
rdsdebug("Connection rejected: %s\n",
 rdma_reject_msg(cm_id, event->status));
break;
-   /* FALLTHROUGH */
case RDMA_CM_EVENT_ADDR_ERROR:
case RDMA_CM_EVENT_ROUTE_ERROR:
case RDMA_CM_EVENT_CONNECT_ERROR:

I sent Yanjun a very similar test diff [1] to test yesterday. Thanks for
checking; I will submit a cleaned-up fix, Gerd.
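
The length check added by the patch (issue #3 above) can be shown in
isolation. The following is a minimal user-space sketch, assuming the
reject code is a host-order int and that the "incompatible" code is 1 as
described in the commit message; reject_means_drop and REJ_INCOMPAT are
hypothetical names, and the byte-order concern (issue #2) is deliberately
not addressed here either.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REJ_INCOMPAT 1  /* assumed value, per the "1" mentioned above */

/*
 * Decide whether a reject event should drop the connection.
 * data/len describe the private data attached to the reject, if any.
 */
static bool reject_means_drop(const void *data, size_t len)
{
	int code;

	if (!data)
		return true;             /* no payload at all */
	if (len < sizeof(code))
		return false;            /* too short to hold a code: do not read it */

	memcpy(&code, data, sizeof(code)); /* avoids an unaligned load */
	return code <= REJ_INCOMPAT;     /* 0 (older peers) or 1 (incompatible) */
}
```

A zero payload from an older peer and an explicit code of 1 both lead to
a drop, while a payload shorter than an int is never dereferenced.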



Re: [PATCH net-next v2 4/7] devlink: allow subports on devlink PCI ports

2019-03-06 Thread Jakub Kicinski
On Wed, 6 Mar 2019 13:20:37 +0100, Jiri Pirko wrote:
> Tue, Mar 05, 2019 at 06:15:34PM CET, jakub.kicin...@netronome.com wrote:
> >On Tue, 5 Mar 2019 12:06:01 +0100, Jiri Pirko wrote:  
> >> >> >as ports.  Can we invent a new command (say "partition"?) that'd take
> >> >> >the bus info where the partition is to be spawned?  
> >> >> 
> >> >> Got it. But the question is how different this object would be from the
> >> >> existing "port" we have today.
> >> >
> >> >They'd be where "the other side of a PCI link" is represented,
> >> >restricting ports to only ASIC's forwarding plane ports.
> >> 
> >> Basically a "host port", right? It can still be the same port object,
> >> only with different flavour and attributes. So we would have:
> >> 
> >> 1) pci/:05:00.0/0: type eth netdev enp5s0np0
> >>flavour physical switch_id 00154d130d2f
> >> 2) pci/:05:00.0/1: type eth netdev enp5s0npf0s0
> >>flavour pci_pf pf 0 subport 0
> >>switch_id 00154d130d2f
> >>peer pci/:05:00.0/1
> >> 3) pci/:05:00.0/10001: type eth netdev enp5s0npf0vf0
> >>flavour pci_vf pf 0 vf 0
> >>switch_id 00154d130d2f
> >>peer pci/:05:10.1/0
> >> 4) pci/:05:00.0/10001: type eth netdev enp5s0npf0s1
> >>flavour pci_pf pf 0 subport 1
> >>switch_id 00154d130d2f
> >>peer pci/:05:00.0/2
> >> 5) pci/:05:00.0/1: type eth netdev enp5s0f0??
> >>flavour host  <
> >>peer pci/:05:00.0/1
> >> 6) pci/:05:10.1/0: type eth netdev enp5s10f0 
> >>flavour host  <
> >>peer pci/:05:00.0/10001
> >> 7) pci/:05:00.0/2: type eth netdev enp5s0f0??
> >>flavour host  <
> >>peer pci/:05:00.0/10001
> >> 
> >> I think it looks quite clear, it gives complete topology view.  
> >
> >Okay, I have some questions :)
> >
> >What do we use for port_index?  
> 
> That is just a number totally in control of the driver. Driver can
> assign it in any way.
> 
> >
> >What are the operations one can perform on "host ports"?  
> 
> That is a good question. I would start with *none* and extend it upon
> needs.
> 
> 
> >
> >If we have PCI parameters, do they get set on the ASIC side of the port
> >or the host side of the port?  
> 
> Could you give me an example?

Let's take msix_vec_per_pf_min as an example.  

> But I believe it is on the switch-port side.

Ok.

> >How do those behave when device is passed to VM?  
> 
> In case of VF? VF will have separate devlink instance (separate handle,
> probably "aliased" to the PF handle). So it would disappear from
> baremetal and appear in VM:
> $ devlink dev
> pci/:00:10.0
> $ devlink dev port
> pci/:00:10.1/0: type eth netdev enp5s10f0
> flavour host
> That's it for the VM.
> 
> There's no linkage (peer, alias) between this and the instances on
> baremetal. 

Ok, I guess this is the main advantage from your perspective?
The fact that "host ports" are visible inside a VM?
Or do you believe that having both ends of a pipe as ports makes the
topology easier to understand?

For creating subdevices, I don't think the handle should ever be port.
We create new ports on a devlink instance, and configure its forwarding
with offloads of well established Linux SW constructs.  New devices are
not logically associated with other ports (see how in my patches there
are 2 "subports" but no main port on that PF - a split not a hierarchy).

How we want to model forwarding inside a VM (who configures the
underlying switching) remains unclear.

> >You have a VF devlink instance there - what ports does it show?  
> 
> See above.
> 
> 
> >
> >How do those look when the PF is connected to another host?  Do they
> >get spawned at all?  
> 
> What do you mean by "PF is connected to another host"?

Either "SmartNIC":

http://www.mellanox.com/products/smartnic/?ls=gppc&lsd=SmartNIC-gen-smartnic&gclid=EAIaIQobChMIxIrGmYju4AIVy5yzCh2SFwQJEAAYASAAEgIui_D_BwE

or

Multi-host NIC: http://www.mellanox.com/page/multihost

> >Will this not be confusing to DSA folks who have a CPU port?  
> 
> Why do you think so?

Host and CPU sound quite similar, it is unclear how they differ, and
why we have a need for both (from user's perspective).


[PATCH net] tcp: do not report TCP_CM_INQ of 0 for closed connections

2019-03-06 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Returning 0 as inq to userspace indicates there is no more data to
read, and the application needs to wait for EPOLLIN. For a connection
that has received FIN from the remote peer, however, the application
must continue reading until getting EOF (return value of 0
from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
being used. Otherwise, the application will never receive a new
EPOLLIN, since there is no epoll edge after the FIN.

Return 1 when there is no data left on the queue but the
connection has received FIN, so that the applications continue
reading.

Fixes: b75eba76d3d72 (tcp: send in-queue bytes in cmsg upon read)
Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 
---
 net/ipv4/tcp.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ad07dd71063da..8b25017e0dc93 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1933,6 +1933,11 @@ static int tcp_inq_hint(struct sock *sk)
inq = tp->rcv_nxt - tp->copied_seq;
release_sock(sk);
}
+   /* After receiving a FIN, tell the user-space to continue reading
+* by returning a non-zero inq.
+*/
+   if (inq == 0 && sock_flag(sk, SOCK_DONE))
+   inq = 1;
return inq;
 }
 
-- 
2.21.0.352.gf09ad66450-goog
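
For context, the application-side rule the commit message relies on (with
EPOLLET, keep reading until EOF or EAGAIN) can be sketched as below. This
is a generic illustration, not code from the patch: drain_fd is a
hypothetical helper, it uses plain read() and does not parse the
TCP_CM_INQ control message.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read from fd until EOF or until a non-blocking fd reports EAGAIN.
 * Returns the number of bytes consumed, or -1 on a real error.
 * *eof is set when read() returned 0, i.e. the peer has closed.
 */
static ssize_t drain_fd(int fd, bool *eof)
{
	char buf[4096];
	ssize_t total = 0;

	*eof = false;
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0) {
			*eof = true;     /* end of stream: no further edge will come */
			return total;
		}
		if (errno == EINTR)
			continue;        /* interrupted, retry */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return total;    /* drained; wait for the next EPOLLIN edge */
		return -1;
	}
}
```

An application that instead stops as soon as the reported inq is 0 would
never reach the read() that returns 0, which is the situation the patch
avoids by reporting 1 after a FIN.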



Re: [PATCH] tools: testing: selftests: Remove duplicate headers

2019-03-06 Thread Souptick Joarder
On Mon, Mar 4, 2019 at 4:19 PM Souptick Joarder  wrote:
>
> On Tue, Feb 26, 2019 at 10:59 AM Souptick Joarder  
> wrote:
> >
> > On Tue, Feb 26, 2019 at 7:18 AM Michael Ellerman  
> > wrote:
> > >
> > > Souptick Joarder  writes:
> > > > Remove duplicate headers which are included twice.
> > > >
> > > > Signed-off-by: Sabyasachi Gupta 
> > > > Signed-off-by: Souptick Joarder 
> > > > ---
> > > ...
> > > >  tools/testing/selftests/powerpc/pmu/ebb/fork_cleanup_test.c | 1 -
> > >
> > > I took this hunk via the powerpc tree.
> >
> > How about taking this entirely through a single tree ?
> > or Shall I send these changes in different patches ?
>
> If no comment, can we get this patch in queue for 5.1 ?

I will drop this patch as we have submitted these changes in
different patches. ( Except the one picked by Michael).

>
> >
> > >
> > > > diff --git 
> > > > a/tools/testing/selftests/powerpc/pmu/ebb/fork_cleanup_test.c 
> > > > b/tools/testing/selftests/powerpc/pmu/ebb/fork_cleanup_test.c
> > > > index 167135b..af1b802 100644
> > > > --- a/tools/testing/selftests/powerpc/pmu/ebb/fork_cleanup_test.c
> > > > +++ b/tools/testing/selftests/powerpc/pmu/ebb/fork_cleanup_test.c
> > > > @@ -11,7 +11,6 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > -#include 
> > > >
> > > >  #include "ebb.h"
> > >
> > >
> > > cheers


BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!

2019-03-06 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:    cf08baa29613 Add linux-next specific files for 20190306
git tree:   linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15bc76c720
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8b6073d992e8217
dashboard link: https://syzkaller.appspot.com/bug?extid=91fd909b6e62ebe06131
compiler:   gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+91fd909b6e62ebe06...@syzkaller.appspotmail.com

BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
turning off the locking correctness validator.
CPU: 0 PID: 11902 Comm: kworker/u5:27 Not tainted 5.0.0-next-20190306 #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Workqueue: hci94 hci_power_on
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 add_chain_cache kernel/locking/lockdep.c:2582 [inline]
 lookup_chain_cache_add kernel/locking/lockdep.c:2656 [inline]
 validate_chain kernel/locking/lockdep.c:2676 [inline]
 __lock_acquire.cold+0x250/0x50d kernel/locking/lockdep.c:3692
 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4202
 __flush_work+0x677/0x8a0 kernel/workqueue.c:3034
 flush_work+0x18/0x20 kernel/workqueue.c:3060
 hci_dev_do_open+0xa92/0x1780 net/bluetooth/hci_core.c:1543
 hci_power_on+0x10d/0x580 net/bluetooth/hci_core.c:2173
 process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x357/0x430 kernel/kthread.c:253
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.


Re: [PATCH] appletalk: Correctly handle return value of register_snap_client

2019-03-06 Thread David Miller
From: Yue Haibing 
Date: Wed, 6 Mar 2019 15:27:40 +0800

> @@ -879,15 +879,17 @@ static struct notifier_block aarp_notifier = {
>  
>  static unsigned char aarp_snap_id[] = { 0x00, 0x00, 0x00, 0x80, 0xF3 };
>  
> -void __init aarp_proto_init(void)
> +int __init aarp_proto_init(void)
>  {
>   aarp_dl = register_snap_client(aarp_snap_id, aarp_rcv);
> - if (!aarp_dl)
> - printk(KERN_CRIT "Unable to register AARP with SNAP.\n");
> + if (!aarp_dl) {
> + pr_crit("Unable to register AARP with SNAP.\n");
> + return -ENOMEM;
> + }
>   timer_setup(&aarp_timer, aarp_expire_timeout, 0);
>   aarp_timer.expires  = jiffies + sysctl_aarp_expiry_time;
>   add_timer(&aarp_timer);
> - register_netdevice_notifier(&aarp_notifier);
> + return register_netdevice_notifier(&aarp_notifier);
>  }
>  

Your error paths in the caller of aarp_proto_init() do not handle the case
where aarp_dl is created but register_netdevice_notifier() fails.  You have
to unregister aarp_dl if it is non-NULL.

So your out_dev label path needs to handle aarp_dl if non-NULL.

Probably best is to jump to out_aarp: instead and make aarp_cleanup_module
able to handle partial cleanups.
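
The partial-cleanup concern can be illustrated with a small user-space
sketch. Note that it unwinds inside the init function itself rather than
in the caller's out_aarp path suggested above, so it is one possible
shape and not the requested fix. All names (register_snap,
unregister_snap, register_notifier, proto_init, fail_notifier) are
hypothetical stand-ins that only record state, and the timer setup is
omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins that record state instead of touching the kernel. */
static bool snap_registered, notifier_registered;
static bool fail_notifier;   /* test knob: force the second step to fail */

static bool register_snap(void)
{
	snap_registered = true;
	return true;
}

static void unregister_snap(void)
{
	snap_registered = false;
}

static int register_notifier(void)
{
	if (fail_notifier)
		return -ENOMEM;
	notifier_registered = true;
	return 0;
}

static int proto_init(void)
{
	int rc;

	if (!register_snap())
		return -ENOMEM;

	rc = register_notifier();
	if (rc)
		goto out_snap;
	return 0;

out_snap:
	unregister_snap();   /* undo the step that did succeed */
	return rc;
}
```

Whichever function owns the unwind, the point is the same: once the SNAP
client exists, every later failure has to release it.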


Re: [PATCH net 1/3] sctp: sctp_sock_migrate() returns error if sctp_bind_addr_dup() fails

2019-03-06 Thread Neil Horman
On Sun, Mar 03, 2019 at 05:54:53PM +0800, Xin Long wrote:
> It should fail to create the new sk if sctp_bind_addr_dup() fails
> when accepting or peeling off an association.
> 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/socket.c | 34 --
>  1 file changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index a2771b3..22adb8d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -102,9 +102,9 @@ static int sctp_send_asconf(struct sctp_association *asoc,
>   struct sctp_chunk *chunk);
>  static int sctp_do_bind(struct sock *, union sctp_addr *, int);
>  static int sctp_autobind(struct sock *sk);
> -static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
> -   struct sctp_association *assoc,
> -   enum sctp_socket_type type);
> +static int sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
> +  struct sctp_association *assoc,
> +  enum sctp_socket_type type);
>  
>  static unsigned long sctp_memory_pressure;
>  static atomic_long_t sctp_memory_allocated;
> @@ -4655,7 +4655,11 @@ static struct sock *sctp_accept(struct sock *sk, int 
> flags, int *err, bool kern)
>   /* Populate the fields of the newsk from the oldsk and migrate the
>* asoc to the newsk.
>*/
> - sctp_sock_migrate(sk, newsk, asoc, SCTP_SOCKET_TCP);
> + error = sctp_sock_migrate(sk, newsk, asoc, SCTP_SOCKET_TCP);
> + if (error) {
> + sk_common_release(newsk);
sctp_sock_migrate may fail after the pending packets have been moved from the
old socket to the new socket.  Normally those packets will get purged by
successful transmission, or when the socket is closed (via sctp_close), but
neither of those cases applies here.  What's going to dequeue and free any
pending skbs on the sk_receive_queue here?

> + newsk = NULL;
> + }
>  
>  out:
>   release_sock(sk);
> @@ -5401,7 +5405,12 @@ int sctp_do_peeloff(struct sock *sk, sctp_assoc_t id, 
> struct socket **sockp)
>   /* Populate the fields of the newsk from the oldsk and migrate the
>* asoc to the newsk.
>*/
> - sctp_sock_migrate(sk, sock->sk, asoc, SCTP_SOCKET_UDP_HIGH_BANDWIDTH);
> + err = sctp_sock_migrate(sk, sock->sk, asoc,
> + SCTP_SOCKET_UDP_HIGH_BANDWIDTH);
> + if (err) {
> + sock_release(sock);
Same question here, what frees any pending skbs on the new socket, if the
migration fails after the skbs have been queued to it?

> + sock = NULL;
> + }
>  
>   *sockp = sock;
>  
> @@ -8924,9 +8933,9 @@ static inline void sctp_copy_descendant(struct sock 
> *sk_to,
>  /* Populate the fields of the newsk from the oldsk and migrate the assoc
>   * and its messages to the newsk.
>   */
> -static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
> -   struct sctp_association *assoc,
> -   enum sctp_socket_type type)
> +static int sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
> +  struct sctp_association *assoc,
> +  enum sctp_socket_type type)
>  {
>   struct sctp_sock *oldsp = sctp_sk(oldsk);
>   struct sctp_sock *newsp = sctp_sk(newsk);
> @@ -8935,6 +8944,7 @@ static void sctp_sock_migrate(struct sock *oldsk, 
> struct sock *newsk,
>   struct sk_buff *skb, *tmp;
>   struct sctp_ulpevent *event;
>   struct sctp_bind_hashbucket *head;
> + int err;
>  
>   /* Migrate socket buffer sizes and all the socket level options to the
>* new socket.
> @@ -8963,8 +8973,10 @@ static void sctp_sock_migrate(struct sock *oldsk, 
> struct sock *newsk,
>   /* Copy the bind_addr list from the original endpoint to the new
>* endpoint so that we can handle restarts properly
>*/
> - sctp_bind_addr_dup(&newsp->ep->base.bind_addr,
> - &oldsp->ep->base.bind_addr, GFP_KERNEL);
> + err = sctp_bind_addr_dup(&newsp->ep->base.bind_addr,
> +  &oldsp->ep->base.bind_addr, GFP_KERNEL);
> + if (err)
> + return err;
>  
>   /* Move any messages in the old socket's receive queue that are for the
>* peeled off association to the new socket's receive queue.
> @@ -9049,6 +9061,8 @@ static void sctp_sock_migrate(struct sock *oldsk, 
> struct sock *newsk,
>   }
>  
>   release_sock(newsk);
> +
> + return 0;
>  }
>  
>  
> -- 
> 2.1.0
> 
> 


Re: [PATCH net 2/3] sctp: move up sctp_auth_init_hmacs() in sctp_endpoint_init()

2019-03-06 Thread Neil Horman
On Sun, Mar 03, 2019 at 05:54:54PM +0800, Xin Long wrote:
> sctp_auth_init_hmacs() is called only when ep->auth_enable is set.
> It is better to move up sctp_auth_init_hmacs() and remove auth_enable
> check in it and check auth_enable only once in sctp_endpoint_init().
> 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/auth.c|  6 --
>  net/sctp/endpointola.c | 18 ++
>  2 files changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 5b53761..39d72e5 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -471,12 +471,6 @@ int sctp_auth_init_hmacs(struct sctp_endpoint *ep, gfp_t 
> gfp)
>   struct crypto_shash *tfm = NULL;
>   __u16   id;
>  
> - /* If AUTH extension is disabled, we are done */
> - if (!ep->auth_enable) {
> - ep->auth_hmacs = NULL;
> - return 0;
> - }
> -
>   /* If the transforms are already allocated, we are done */
>   if (ep->auth_hmacs)
>   return 0;
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index 40c7eb9..0448b68 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -107,6 +107,13 @@ static struct sctp_endpoint *sctp_endpoint_init(struct 
> sctp_endpoint *ep,
>   auth_chunks->param_hdr.length =
>   htons(sizeof(struct sctp_paramhdr) + 2);
>   }
> +
> + /* Allocate and initialize transorms arrays for supported
> +  * HMACs.
> +  */
> + err = sctp_auth_init_hmacs(ep, gfp);
> + if (err)
> + goto nomem;
>   }
>  
>   /* Initialize the base structure. */
> @@ -150,15 +157,10 @@ static struct sctp_endpoint *sctp_endpoint_init(struct 
> sctp_endpoint *ep,
>   INIT_LIST_HEAD(&ep->endpoint_shared_keys);
>   null_key = sctp_auth_shkey_create(0, gfp);
>   if (!null_key)
> - goto nomem;
> + goto nomem_shkey;
>  
>   list_add(&null_key->key_list, &ep->endpoint_shared_keys);
>  
> - /* Allocate and initialize transorms arrays for supported HMACs. */
> - err = sctp_auth_init_hmacs(ep, gfp);
> - if (err)
> - goto nomem_hmacs;
> -
>   /* Add the null key to the endpoint shared keys list and
>* set the hmcas and chunks pointers.
>*/
> @@ -169,8 +171,8 @@ static struct sctp_endpoint *sctp_endpoint_init(struct 
> sctp_endpoint *ep,
>  
>   return ep;
>  
> -nomem_hmacs:
> - sctp_auth_destroy_keys(&ep->endpoint_shared_keys);
> +nomem_shkey:
> + sctp_auth_destroy_hmacs(ep->auth_hmacs);
>  nomem:
>   /* Free all allocations */
>   kfree(auth_hmacs);
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net 3/3] sctp: call sctp_auth_init_hmacs() in sctp_sock_migrate()

2019-03-06 Thread Neil Horman
On Sun, Mar 03, 2019 at 05:54:55PM +0800, Xin Long wrote:
> New ep's auth_hmacs should be set if old ep's is set, in case that
> net->sctp.auth_enable has been changed to 0 by users and new ep's
> auth_hmacs couldn't be set in sctp_endpoint_init().
> 
> It can even crash kernel by doing:
> 
>   1. on server: sysctl -w net.sctp.auth_enable=1,
> sysctl -w net.sctp.addip_enable=1,
> sysctl -w net.sctp.addip_noauth_enable=0,
> listen() on server,
> sysctl -w net.sctp.auth_enable=0.
>   2. on client: connect() to server.
>   3. on server: accept() the asoc,
> sysctl -w net.sctp.auth_enable=1.
>   4. on client: send() asconf packet to server.
> 
> The call trace:
> 
>   [  245.280251] BUG: unable to handle kernel NULL pointer dereference at 
> 0008
>   [  245.286872] RIP: 0010:sctp_auth_calculate_hmac+0xa3/0x140 [sctp]
>   [  245.304572] Call Trace:
>   [  245.305091]  
>   [  245.311287]  sctp_sf_authenticate+0x110/0x160 [sctp]
>   [  245.312311]  sctp_sf_eat_auth+0xf2/0x230 [sctp]
>   [  245.313249]  sctp_do_sm+0x9a/0x2d0 [sctp]
>   [  245.321483]  sctp_assoc_bh_rcv+0xed/0x1a0 [sctp]
>   [  245.322495]  sctp_rcv+0xa66/0xc70 [sctp]
> 
> It's because the old ep->auth_hmacs wasn't copied to the new ep while
> ep->auth_hmacs is used in sctp_auth_calculate_hmac() when processing
> the incoming auth chunks, and it should have been done when migrating
> sock.
> 
> Reported-by: Ying Xu 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/socket.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 22adb8d..def3335 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -8978,6 +8978,16 @@ static int sctp_sock_migrate(struct sock *oldsk, 
> struct sock *newsk,
>   if (err)
>   return err;
>  
> + /* New ep's auth_hmacs should be set if old ep's is set, in case
> +  * that net->sctp.auth_enable has been changed to 0 by users and
> +  * new ep's auth_hmacs couldn't be set in sctp_endpoint_init().
> +  */
> + if (oldsp->ep->auth_hmacs) {
> + err = sctp_auth_init_hmacs(newsp->ep, GFP_KERNEL);
> + if (err)
> + return err;
> + }
> +
>   /* Move any messages in the old socket's receive queue that are for the
>* peeled off association to the new socket's receive queue.
>*/
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net v2] ipv4/route: fail early when inet dev is missing

2019-03-06 Thread David Miller
From: Paolo Abeni 
Date: Wed,  6 Mar 2019 10:42:53 +0100

> If a non local multicast packet reaches ip_route_input_rcu() while
> the ingress device IPv4 private data (in_dev) is NULL, we end up
> doing a NULL pointer dereference in IN_DEV_MFORWARD().
> 
> Since the later call to ip_route_input_mc() is going to fail if
> !in_dev, we can fail early in such scenario and avoid the dangerous
> code path.
> 
> v1 -> v2:
>  - clarified the commit message, no code changes
> 
> Reported-by: Tianhao Zhao 
> Fixes: e58e41596811 ("net: Enable support for VRF with ipv4 multicast")
> Signed-off-by: Paolo Abeni 
> Reviewed-by: David Ahern 

Applied and queued up for -stable.


Re: [PATCH net] net: hns3: Fix a logical vs bitwise typo

2019-03-06 Thread David Miller
From: Dan Carpenter 
Date: Wed, 6 Mar 2019 11:12:34 +0300

> There were a couple logical ORs accidentally mixed in with the bitwise
> ORs.
> 
> Fixes: e8149933b1fa ("net: hns3: remove hnae3_get_bit in data path")
> Signed-off-by: Dan Carpenter 
> ---
> Very recent bug.

Applied.


Re: [RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

2019-03-06 Thread Souptick Joarder
On Wed, Mar 6, 2019 at 12:48 PM Jason Wang  wrote:
>
> Signed-off-by: Jason Wang 

Is the change log left out for any particular reason?
> ---
>  drivers/vhost/vhost.c | 46 --
>  1 file changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2025543..1015464 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
> vhost_vq_free_iovecs(dev->vqs[i]);
>  }
>
> +static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)
> +{
> +   size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> +   return sizeof(*vq->avail) +
> +  sizeof(*vq->avail->ring) * num + event;
> +}
> +
> +static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
> +{
> +   size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> +   return sizeof(*vq->used) +
> +  sizeof(*vq->used->ring) * num + event;
> +}
> +
> +static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
> +{
> +   return sizeof(*vq->desc) * num;
> +}
> +
>  void vhost_dev_init(struct vhost_dev *dev,
> struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
>  {
> @@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, 
> unsigned int num,
>  struct vring_used __user *used)
>
>  {
> -   size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> -   return access_ok(desc, num * sizeof *desc) &&
> -  access_ok(avail,
> -sizeof *avail + num * sizeof *avail->ring + s) &&
> -  access_ok(used,
> -   sizeof *used + num * sizeof *used->ring + s);
> +   return access_ok(desc, vhost_get_desc_size(vq, num)) &&
> +  access_ok(avail, vhost_get_avail_size(vq, num)) &&
> +  access_ok(used, vhost_get_used_size(vq, num));
>  }
>
>  static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> @@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue 
> *vq,
>
>  int vq_meta_prefetch(struct vhost_virtqueue *vq)
>  {
> -   size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> unsigned int num = vq->num;
>
> if (!vq->iotlb)
> return 1;
>
> return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
> -  num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
> +  vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) 
> &&
>iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
> -  sizeof *vq->avail +
> -  num * sizeof(*vq->avail->ring) + s,
> +  vhost_get_avail_size(vq, num),
>VHOST_ADDR_AVAIL) &&
>iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
> -  sizeof *vq->used +
> -  num * sizeof(*vq->used->ring) + s,
> -  VHOST_ADDR_USED);
> +  vhost_get_used_size(vq, num), VHOST_ADDR_USED);
>  }
>  EXPORT_SYMBOL_GPL(vq_meta_prefetch);
>
> @@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
>  static bool vq_log_access_ok(struct vhost_virtqueue *vq,
>  void __user *log_base)
>  {
> -   size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> return vq_memory_access_ok(log_base, vq->umem,
>vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
> (!vq->log_used || log_access_ok(log_base, vq->log_addr,
> -   sizeof *vq->used +
> -   vq->num * sizeof *vq->used->ring + 
> s));
> + vhost_get_used_size(vq, vq->num)));
>  }
>
>  /* Can we start vq? */
> --
> 1.8.3.1
>

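For readers who want concrete numbers, here is a standalone sketch of the
three size computations in the patch above. The struct definitions are
local mirrors that are assumed to match the split-ring layouts; the names
desc_size(), avail_size() and used_size() are illustrative, and the feature
test is reduced to a plain int argument.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the split-ring layouts, redefined here so the
 * sketch is standalone (assumed to match the virtio ring structures).
 */
struct desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct used_elem { uint32_t id; uint32_t len; };
struct used { uint16_t flags; uint16_t idx; struct used_elem ring[]; };

/* event_idx is nonzero when VIRTIO_RING_F_EVENT_IDX is negotiated; it
 * adds one trailing 16-bit field to the avail and used rings.
 */
size_t desc_size(unsigned int num)
{
	return sizeof(struct desc) * num;
}

size_t avail_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct avail) + sizeof(uint16_t) * num + event;
}

size_t used_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct used) + sizeof(struct used_elem) * num + event;
}
```

For a 256-entry queue with the event index enabled, this gives 4096 bytes
of descriptors, 4 + 512 + 2 = 518 bytes for the avail ring, and
4 + 2048 + 2 = 2054 bytes for the used ring.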

[PATCH net] fou, fou6: avoid uninit-value in gue_err() and gue6_err()

2019-03-06 Thread Eric Dumazet
My prior commit missed the fact that these functions
were using udp_hdr() (aka skb_transport_header())
to get access to GUE header.

Since use pskb_transport_may_pull() does not exist yet,
we have to add transport_offset to our pskb_may_pull() calls.
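
A rough user-space sketch of why the offset matters. The struct and helper
names are hypothetical, and the buffer model is much simpler than a real
skb (it never pulls data in from fragments); it only shows the arithmetic.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified packet buffer: only the first
 * linear_len bytes from the current data pointer are directly readable,
 * and the transport header begins transport_offset bytes in.
 */
struct pkt {
	size_t linear_len;
	size_t transport_offset;
};

/* Stand-in for a pull check: are the first len bytes from data readable? */
int may_pull(const struct pkt *p, size_t len)
{
	return len <= p->linear_len;
}

/* Checking only hdr_len ignores where the header actually starts. */
int header_readable_wrong(const struct pkt *p, size_t hdr_len)
{
	return may_pull(p, hdr_len);
}

/* Adding the transport offset covers the bytes that are really read. */
int header_readable(const struct pkt *p, size_t hdr_len)
{
	return may_pull(p, p->transport_offset + hdr_len);
}
```

With 24 readable bytes and a transport header at offset 20, a 12-byte
header passes the first check although 8 of its bytes lie outside the
readable area; the second check rejects it.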

BUG: KMSAN: uninit-value in gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
CPU: 1 PID: 10648 Comm: syz-executor.1 Not tainted 5.0.0+ #11
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600
 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
 gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
 __udp4_lib_err_encap_no_sk net/ipv4/udp.c:571 [inline]
 __udp4_lib_err_encap net/ipv4/udp.c:626 [inline]
 __udp4_lib_err+0x12e6/0x1d40 net/ipv4/udp.c:665
 udp_err+0x74/0x90 net/ipv4/udp.c:737
 icmp_socket_deliver net/ipv4/icmp.c:767 [inline]
 icmp_unreach+0xb65/0x1070 net/ipv4/icmp.c:884
 icmp_rcv+0x11a1/0x1950 net/ipv4/icmp.c:1066
 ip_protocol_deliver_rcu+0x584/0xbb0 net/ipv4/ip_input.c:208
 ip_local_deliver_finish net/ipv4/ip_input.c:234 [inline]
 NF_HOOK include/linux/netfilter.h:289 [inline]
 ip_local_deliver+0x624/0x7b0 net/ipv4/ip_input.c:255
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
 NF_HOOK include/linux/netfilter.h:289 [inline]
 ip_rcv+0x6bd/0x740 net/ipv4/ip_input.c:524
 __netif_receive_skb_one_core net/core/dev.c:4973 [inline]
 __netif_receive_skb net/core/dev.c:5083 [inline]
 process_backlog+0x756/0x10e0 net/core/dev.c:5923
 napi_poll net/core/dev.c:6346 [inline]
 net_rx_action+0x78b/0x1a60 net/core/dev.c:6412
 __do_softirq+0x53f/0x93a kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:375 [inline]
 irq_exit+0x214/0x250 kernel/softirq.c:416
 exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
 smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1064
 apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:814
 
RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2597
Code: 48 89 e5 53 48 89 fb e8 63 e7 95 00 8b b8 88 0c 00 00 48 8b 00 48 85 c0 
75 12 48 89 df e8 dd db 95 00 c6 00 00 c6 03 00 fb 5b <5d> c3 e8 4e e6 95 00 eb 
e7 66 90 66 2e 0f 1f 84 00 00 00 00 00 55
RSP: 0018:888081a0fc80 EFLAGS: 0296 ORIG_RAX: ff13
RAX: 88821fd6bd80 RBX: 888027898000 RCX: d000
RDX: 88821fca8d80 RSI: 8880 RDI: 04a0
RBP: 888081a0fc80 R08: 0002 R09: 888081a0fb08
R10:  R11:  R12: 0001
R13: 88811130e388 R14: 88811130da00 R15: 88812fdb7d80
 finish_task_switch+0xfc/0x2d0 kernel/sched/core.c:2698
 context_switch kernel/sched/core.c:2851 [inline]
 __schedule+0x6cc/0x800 kernel/sched/core.c:3491
 schedule+0x15b/0x240 kernel/sched/core.c:3535
 freezable_schedule include/linux/freezer.h:172 [inline]
 do_nanosleep+0x2ba/0x980 kernel/time/hrtimer.c:1679
 hrtimer_nanosleep kernel/time/hrtimer.c:1733 [inline]
 __do_sys_nanosleep kernel/time/hrtimer.c:1767 [inline]
 __se_sys_nanosleep+0x746/0x960 kernel/time/hrtimer.c:1754
 __x64_sys_nanosleep+0x3e/0x60 kernel/time/hrtimer.c:1754
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x4855a0
Code: 00 00 48 c7 c0 d4 ff ff ff 64 c7 00 16 00 00 00 31 c0 eb be 66 0f 1f 44 
00 00 83 3d b1 11 5d 00 00 75 14 b8 23 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 
04 e2 f8 ff c3 48 83 ec 08 e8 3a 55 fd ff
RSP: 002b:00a4fd58 EFLAGS: 0246 ORIG_RAX: 0023
RAX: ffda RBX: 00085780 RCX: 004855a0
RDX:  RSI:  RDI: 00a4fd60
RBP: 07ec R08: 0001 R09: 00ceb940
R10:  R11: 0246 R12: 0008
R13: 00a4fdb0 R14: 00085711 R15: 00a4fdc0

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline]
 kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
 kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2773 [inline]
 __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4398
 __kmalloc_reserve net/core/skbuff.c:140 [inline]
 __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
 alloc_skb include/linux/skbuff.h:1012 [inline]
 alloc_skb_with_frags+0x186/0xa60 net/core/skbuff.c:5287
 sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091
 sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108
 __ip_append_data+0x34cd/0x5000 net/ipv4/ip_output.c:998
 ip_append_data+0x324/0x480 net/ipv4/ip_output.c:1220
 icmp_push_reply+0x23d/0x7e0 net/ipv4/icmp.c:375
 __icmp_send+0x2ea3/0x30f0 net/ipv4/icmp.c:737
 icmp_send include/net/icmp.h:47 [inline]
 ipv4_link_failure+0x6d/0x230 net/ipv4/route.c:1190
 dst_link_failure include/net/dst.h:427 [inlin

Re: [PATCH net] iptunnel: NULL pointer deref for ip_md_tunnel_xmit

2019-03-06 Thread David Miller
From: Alan Maguire 
Date: Wed, 6 Mar 2019 10:25:42 + (GMT)

> Naresh Kamboju noted the following oops during execution of selftest
> tools/testing/selftests/bpf/test_tunnel.sh on x86_64:
...
> I'm also seeing the same failure on x86_64, and it reproduces
> consistently.
> 
> From poking around it looks like the skb's dst entry is being used
> to calculate the mtu in:
> 
> mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
> 
> ...but because that dst_entry  has an "ops" value set to md_dst_ops,
> the various ops (including mtu) are not set:
> 
> crash> struct sk_buff._skb_refdst 928f87447700 -x
>   _skb_refdst = 0xcd6fbf5ea590
> crash> struct dst_entry.ops 0xcd6fbf5ea590
>   ops = 0xa0193800
> crash> struct dst_ops.mtu 0xa0193800
>   mtu = 0x0
> crash>
> 
> I confirmed that the dst entry also has dst->input set to
> dst_md_discard, so it looks like it's an entry that's been
> initialized via __metadata_dst_init alright.
> 
> I think the fix here is to use skb_valid_dst(skb) - it checks
> for  DST_METADATA also, and with that fix in place, the
> problem - which was previously 100% reproducible - disappears.
> 
> The below patch resolves the panic and all bpf tunnel tests pass
> without incident.
> 
> Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
> Reported-by: Naresh Kamboju 
> Signed-off-by: Alan Maguire 
> Acked-by: Alexei Starovoitov 
> Tested-by: Anders Roxell 

Applied, thanks.
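
The failure mode and the fix can be sketched outside the kernel. Everything
below is a hypothetical model: the flag value, the struct layouts, and the
names valid_dst() and pick_mtu() are illustrative, and only the shape of
the check follows the description above.

```c
#include <stddef.h>

#define DST_METADATA 0x1   /* illustrative flag value, not the kernel's */

struct dst_ops { unsigned int (*mtu)(void); };
struct dst { unsigned int flags; const struct dst_ops *ops; };

static unsigned int route_mtu(void)
{
	return 1400;
}

const struct dst_ops route_ops = { .mtu = route_mtu };
const struct dst_ops md_ops = { .mtu = NULL };   /* metadata dst: no mtu op */

/* Mirrors the idea of skb_valid_dst(): present and not metadata-only. */
int valid_dst(const struct dst *d)
{
	return d && !(d->flags & DST_METADATA);
}

unsigned int pick_mtu(const struct dst *d, unsigned int dev_mtu)
{
	/* A bare "d ?" test would call md_ops.mtu, which is NULL. */
	return valid_dst(d) ? d->ops->mtu() : dev_mtu;
}
```

A metadata entry now falls back to the device MTU instead of calling
through a NULL function pointer.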


Re: [PATCH 1/2] appletalk: Fix compile regression

2019-03-06 Thread David Miller
From: Arnd Bergmann 
Date: Wed,  6 Mar 2019 11:52:36 +0100

> A bugfix just broke compilation of appletalk when CONFIG_SYSCTL
> is disabled:
> 
> In file included from net/appletalk/ddp.c:65:
> net/appletalk/ddp.c: In function 'atalk_init':
> include/linux/atalk.h:164:34: error: expected expression before 'do'
>  #define atalk_register_sysctl()  do { } while(0)
>   ^~
> net/appletalk/ddp.c:1934:7: note: in expansion of macro 
> 'atalk_register_sysctl'
>   rc = atalk_register_sysctl();
> 
> This is easier to avoid by using conventional inline functions
> as stubs rather than macros. The header already has inline
> functions for other purposes, so I'm changing over all the
> macros for consistency.
> 
> Fixes: 6377f787aeb9 ("appletalk: Fix use-after-free in atalk_proc_exit")
> Signed-off-by: Arnd Bergmann 

Applied.
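
A minimal illustration of the stub style the commit message argues for.
The names register_sysctl_stub() and demo_init() are hypothetical and do
not come from atalk.h.

```c
/* A statement-like stub such as
 *     #define register_sysctl_stub()  do { } while (0)
 * cannot appear on the right-hand side of an assignment. An inline
 * function can, and its arguments and return type are checked too.
 */
static inline int register_sysctl_stub(void)
{
	return 0;   /* feature compiled out: report success */
}

int demo_init(void)
{
	int rc;

	rc = register_sysctl_stub();   /* valid: the stub is an expression */
	if (rc)
		return rc;
	return 0;
}
```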


Re: [PATCH 2/2] appletalk: Add atalk.h header files to MAINTAINERS file

2019-03-06 Thread David Miller
From: Arnd Bergmann 
Date: Wed,  6 Mar 2019 11:52:37 +0100

> Add the path names here so that git-send-email can pick up the
> netdev@vger.kernel.org Cc line automatically for a patch that
> only touches the headers.
> 
> Signed-off-by: Arnd Bergmann 

Applied.


Re: [PATCH] tcp: detecting the misuse of .sendpage for Slab objects

2019-03-06 Thread David Miller
From: Vasily Averin 
Date: Wed, 6 Mar 2019 14:10:22 +0300

> sendpage was not designed for processing of the Slab pages,
> in some situations it can trigger BUG_ON on receiving side.
> 
> Signed-off-by: Vasily Averin 

Applied.


Re: [PATCH mlx5-next] net/mlx5: Fix DCT creation bad flow

2019-03-06 Thread Jason Gunthorpe
On Wed, Mar 06, 2019 at 07:20:50PM +0200, Leon Romanovsky wrote:
> From: Yishai Hadas 
> 
> In case the DCT creation command has succeeded a DRAIN must be issued
> before calling DESTROY.
> 
> In addition, the original code used the wrong parameter for the DESTROY
> command, 'in' instead of 'din', which caused another creation try
> instead of destroying.
> 
> Cc:  # 4.15
> Fixes: 57cda166bbe0 ("net/mlx5: Add DCT command interface")
> Signed-off-by: Yishai Hadas 
> Reviewed-by: Artemy Kovalyov 
> Signed-off-by: Leon Romanovsky 
> Jason, Doug
> 
> If it is possible, I would like to take this patch too:
> https://patchwork.kernel.org/patch/10828299/

This should have been applied to the shared tree though??

It is RDMA focused, do you want it to go to the RDMA tree?

Jason


Re: [PATCH net] net: sched: flower: insert new filter to idr after setting its mask

2019-03-06 Thread David Miller
From: Vlad Buslov 
Date: Wed,  6 Mar 2019 16:22:12 +0200

> When adding new filter to flower classifier, fl_change() inserts it to
> handle_idr before initializing filter extensions and assigning it a mask.
> Normally this ordering doesn't matter because all flower classifier ops
> callbacks assume rtnl lock protection. However, when filter has an action
> that doesn't have its kernel module loaded, rtnl lock is released before
> call to request_module(). During this time the filter can be accessed by
> concurrent task before its initialization is completed, which can lead to a
> crash.
> 
> Example case of NULL pointer dereference in concurrent dump:
 ...
> Extension initialization and mask assignment don't depend on fnew->handle
> that is allocated by idr_alloc_u32(). Move idr allocation code after action
> creation and mask assignment in fl_change() to prevent concurrent access
> to not fully initialized filter when rtnl lock is released to load action
> module.
> 
> Fixes: 01683a146999 ("net: sched: refactor flower walk to iterate over idr")
> Signed-off-by: Vlad Buslov 
> Reviewed-by: Roi Dayan 

Applied and queued up for -stable.
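
The ordering change can be pictured with a small single-threaded model. It
is a sketch only: the table stands in for handle_idr, the names are
hypothetical, and it models neither the rtnl lock nor any real concurrency.

```c
#include <stddef.h>

#define MAX_FILTERS 8

struct filter { unsigned int handle; const char *mask; };

static struct filter *table[MAX_FILTERS];   /* stand-in for handle_idr */

/* What a concurrent dump would see for a given handle. */
const struct filter *lookup(unsigned int handle)
{
	return handle < MAX_FILTERS ? table[handle] : NULL;
}

/* Initialize every field first; make the filter reachable last. */
int filter_add(struct filter *f, unsigned int handle, const char *mask)
{
	if (handle >= MAX_FILTERS || table[handle])
		return -1;

	f->mask = mask;        /* a step that may drop the lock belongs here */
	f->handle = handle;
	table[handle] = f;     /* publish only once f is complete */
	return 0;
}
```

Because the table entry is written last, any lookup that finds the filter
also finds its mask set.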


Re: [PATCH mlx5-next] net/mlx5: Fix DCT creation bad flow

2019-03-06 Thread Leon Romanovsky
On Wed, Mar 06, 2019 at 08:52:16PM +0200, Jason Gunthorpe wrote:
> On Wed, Mar 06, 2019 at 07:20:50PM +0200, Leon Romanovsky wrote:
> > From: Yishai Hadas 
> >
> > In case the DCT creation command has succeeded a DRAIN must be issued
> > before calling DESTROY.
> >
> > In addition, the original code used the wrong parameter for the DESTROY
> > command, 'in' instead of 'din', which caused another creation try
> > instead of destroying.
> >
> > Cc:  # 4.15
> > Fixes: 57cda166bbe0 ("net/mlx5: Add DCT command interface")
> > Signed-off-by: Yishai Hadas 
> > Reviewed-by: Artemy Kovalyov 
> > Signed-off-by: Leon Romanovsky 
> > Jason, Doug
> >
> > If it is possible, I would like to take this patch too:
> > https://patchwork.kernel.org/patch/10828299/
>
> This should have been applied to the shared tree though??
>
> It is RDMA focused, do you want it to go to the RDMA tree?

Yes, that would be awesome. Since net-next is closed, there is no need for
the shared tree.

Thanks

>
> Jason


Re: [PATCH net] net: hsr: fix memory leak in hsr_dev_finalize()

2019-03-06 Thread David Miller
From: Mao Wenan 
Date: Wed, 6 Mar 2019 22:45:01 +0800

> If hsr_add_port(hsr, hsr_dev, HSR_PT_MASTER) failed to
> add port, it directly returns res and forgets to free the node
> that allocated in hsr_create_self_node(), and forgets to delete
> the node->mac_list linked in hsr->self_node_db.
 ...
> Fixes: c5a759117210("net/hsr: Use list_head (and rcu) instead of array for 
> slave devices.")
> Reported-by: Hulk Robot 
> Signed-off-by: Mao Wenan 

Applied and queued up for -stable, thanks.


Re: [PATCH net] tcp: do not report TCP_CM_INQ of 0 for closed connections

2019-03-06 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Wed,  6 Mar 2019 13:01:36 -0500

> From: Soheil Hassas Yeganeh 
> 
> Returning 0 as inq to userspace indicates there is no more data to
> read, and the application needs to wait for EPOLLIN. For a connection
> that has received FIN from the remote peer, however, the application
> must continue reading until getting EOF (return value of 0
> from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
> being used. Otherwise, the application will never receive a new
> EPOLLIN, since there is no epoll edge after the FIN.
> 
> Return 1 when there is no data left on the queue but the
> connection has received FIN, so that the applications continue
> reading.
> 
> Fixes: b75eba76d3d72 (tcp: send in-queue bytes in cmsg upon read)
> Signed-off-by: Soheil Hassas Yeganeh 
> Acked-by: Neal Cardwell 
> Signed-off-by: Eric Dumazet 
> Acked-by: Yuchung Cheng 

Applied and queued up for -stable, thank you.
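
The rule in the commit message reduces to one conditional. The helper
below is a hypothetical sketch, not the kernel function; queued is the
number of unread bytes and fin_received is nonzero once the peer has sent
FIN.

```c
/* Hypothetical helper modelling the reported in-queue hint. */
int inq_hint(int queued, int fin_received)
{
	if (queued == 0 && fin_received)
		return 1;   /* keep the reader going until it sees EOF */
	return queued;
}
```

An application using edge-triggered epoll that stops reading when the hint
is 0 would otherwise never observe the EOF that follows the FIN.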


Re: [PATCH net] fou, fou6: avoid uninit-value in gue_err() and gue6_err()

2019-03-06 Thread Stefano Brivio
On Wed,  6 Mar 2019 10:41:00 -0800
Eric Dumazet  wrote:

> My prior commit missed the fact that these functions
> were using udp_hdr() (aka skb_transport_header())
> to get access to GUE header.

Ouch, I totally missed this too. :/

> Since use pskb_transport_may_pull() does not exist yet,

Nit: s/use //

> we have to add transport_offset to our pskb_may_pull() calls.

Thanks for fixing this.

Acked-by: Stefano Brivio 

-- 
Stefano


[PATCH bpf] bpf: only test gso type on gso packets

2019-03-06 Thread Willem de Bruijn
From: Willem de Bruijn 

BPF can adjust gso only for tcp bytestreams. Fail on other gso types.

But only on gso packets. It does not touch this field if !gso_size.

Fixes: b90efd225874 ("bpf: only adjust gso_size on bytestream protocols")
Signed-off-by: Willem de Bruijn 

---

Stupid bug on my part. Found only when adding tests for the feature.
Will try to upstream those once bpf-next opens.

On a related note, also working on a flag BPF_F_ADJ_ROOM_FIXED_GSO
that will allow reenabling this field for UDP (and possibly avoiding
the expensive skb_cow for the TCP common case).
---
 include/linux/skbuff.h | 4 ++--
 net/core/filter.c  | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 27beb549ffbe1..f32f32407dc43 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4232,10 +4232,10 @@ static inline bool skb_is_gso_sctp(const struct sk_buff 
*skb)
return skb_shinfo(skb)->gso_type & SKB_GSO_SCTP;
 }
 
+/* Note: Should be called only if skb_is_gso(skb) is true */
 static inline bool skb_is_gso_tcp(const struct sk_buff *skb)
 {
-   return skb_is_gso(skb) &&
-  skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6);
+   return skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6);
 }
 
 static inline void skb_gso_reset(struct sk_buff *skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 5ceba98069d46..f274620945ff0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2804,7 +2804,7 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb)
u32 off = skb_mac_header_len(skb);
int ret;
 
-   if (!skb_is_gso_tcp(skb))
+   if (skb_is_gso(skb) && !skb_is_gso_tcp(skb))
return -ENOTSUPP;
 
ret = skb_cow(skb, len_diff);
@@ -2845,7 +2845,7 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb)
u32 off = skb_mac_header_len(skb);
int ret;
 
-   if (!skb_is_gso_tcp(skb))
+   if (skb_is_gso(skb) && !skb_is_gso_tcp(skb))
return -ENOTSUPP;
 
ret = skb_unclone(skb, GFP_ATOMIC);
@@ -2970,7 +2970,7 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 
len_diff)
u32 off = skb_mac_header_len(skb) + bpf_skb_net_base_len(skb);
int ret;
 
-   if (!skb_is_gso_tcp(skb))
+   if (skb_is_gso(skb) && !skb_is_gso_tcp(skb))
return -ENOTSUPP;
 
ret = skb_cow(skb, len_diff);
@@ -2999,7 +2999,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 
len_diff)
u32 off = skb_mac_header_len(skb) + bpf_skb_net_base_len(skb);
int ret;
 
-   if (!skb_is_gso_tcp(skb))
+   if (skb_is_gso(skb) && !skb_is_gso_tcp(skb))
return -ENOTSUPP;
 
ret = skb_unclone(skb, GFP_ATOMIC);
-- 
2.21.0.352.gf09ad66450-goog
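
The corrected condition can be checked with a small truth table. The
sketch below uses hypothetical names and illustrative bit values; only the
shape of the test follows the patch.

```c
#define GSO_TCPV4 0x1   /* illustrative bit values */
#define GSO_TCPV6 0x2
#define GSO_UDP   0x4

struct pkt_gso { unsigned int gso_size; unsigned int gso_type; };

int is_gso(const struct pkt_gso *p)
{
	return p->gso_size != 0;
}

/* Only meaningful when is_gso(p) is true. */
int is_gso_tcp(const struct pkt_gso *p)
{
	return (p->gso_type & (GSO_TCPV4 | GSO_TCPV6)) != 0;
}

/* Returns nonzero when the adjustment must be refused. */
int must_reject(const struct pkt_gso *p)
{
	return is_gso(p) && !is_gso_tcp(p);
}
```

Non-GSO packets and TCP GSO packets pass; only GSO packets of another type
are refused. Before the fix, the negated helper also refused every non-GSO
packet, because it folded the is_gso() test into the TCP check.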



[PATCH net-next] 8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC

2019-03-06 Thread Matthew Whitehead
Add PCI vendor and device identifier for U.S. Robotics USR997901A
10/100 Cardbus NIC. Tested on real hardware.

Signed-off-by: Matthew Whitehead 
---
 drivers/net/ethernet/realtek/8139too.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/realtek/8139too.c 
b/drivers/net/ethernet/realtek/8139too.c
index 69d752f..55d0126 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -258,6 +258,7 @@ enum {
{0x126c, 0x1211, PCI_ANY_ID, PCI_ANY_ID, 0, 0, RTL8139 },
{0x1743, 0x8139, PCI_ANY_ID, PCI_ANY_ID, 0, 0, RTL8139 },
{0x021b, 0x8139, PCI_ANY_ID, PCI_ANY_ID, 0, 0, RTL8139 },
+   {0x16ec, 0xab06, PCI_ANY_ID, PCI_ANY_ID, 0, 0, RTL8139 },
 
 #ifdef CONFIG_SH_SECUREEDGE5410
/* Bogus 8139 silicon reports 8129 without external PROM :-( */
-- 
1.8.3.1



Re: [PATCH mlx5-next] net/mlx5: ODP support for XRC transport is not enabled by default in FW

2019-03-06 Thread Jason Gunthorpe
On Mon, Feb 25, 2019 at 08:54:39AM +0200, Leon Romanovsky wrote:
> From: Moni Shoua 
> 
> ODP support for XRC transport is not enabled by default in FW,
> so we need separate ODP checks to enable/disable it.
> 
> While at it, rewrite the set of ODP SRQ support capabilities in a way
> that tests each field separately for clarity; this is not needed
> for current FW, but it is better to keep the checks separate.
> 
> Signed-off-by: Moni Shoua 
> Signed-off-by: Leon Romanovsky 
> ---
>  .../net/ethernet/mellanox/mlx5/core/main.c| 38 +++
>  1 file changed, 22 insertions(+), 16 deletions(-)
> 
> --
> 2.19.1

Applied to for-next, thanks

Jason


[PATCH bpf 2/2] libbpf: force fixdep compilation at the start of the build

2019-03-06 Thread Stanislav Fomichev
libbpf targets don't explicitly depend on the fixdep target, so when
we do 'make -j$(nproc)', there is a high probability that some
objects will be built before the fixdep binary is available.

Fix this by running sub-make; this makes sure that fixdep dependency
is properly accounted for.

For the same issue in perf, see commit abb26210a395 ("perf tools: Force
fixdep compilation at the start of the build").

Before:

$ rm -rf /tmp/bld; mkdir /tmp/bld; make -j$(nproc) O=/tmp/bld -C tools/lib/bpf/

Auto-detecting system features:
...libelf: [ on  ]
...   bpf: [ on  ]

  HOSTCC   /tmp/bld/fixdep.o
  CC   /tmp/bld/libbpf.o
  CC   /tmp/bld/bpf.o
  CC   /tmp/bld/btf.o
  CC   /tmp/bld/nlattr.o
  CC   /tmp/bld/libbpf_errno.o
  CC   /tmp/bld/str_error.o
  CC   /tmp/bld/netlink.o
  CC   /tmp/bld/bpf_prog_linfo.o
  CC   /tmp/bld/libbpf_probes.o
  CC   /tmp/bld/xsk.o
  HOSTLD   /tmp/bld/fixdep-in.o
  LINK /tmp/bld/fixdep
  LD   /tmp/bld/libbpf-in.o
  LINK /tmp/bld/libbpf.a
  LINK /tmp/bld/libbpf.so
  LINK /tmp/bld/test_libbpf

$ head /tmp/bld/.libbpf.o.cmd
 # cannot find fixdep (/usr/local/google/home/sdf/src/linux/xxx//fixdep)
 # using basic dep data

/tmp/bld/libbpf.o: libbpf.c /usr/include/stdc-predef.h \
 /usr/include/stdlib.h /usr/include/features.h \
 /usr/include/x86_64-linux-gnu/sys/cdefs.h \
 /usr/include/x86_64-linux-gnu/bits/wordsize.h \
 /usr/include/x86_64-linux-gnu/gnu/stubs.h \
 /usr/include/x86_64-linux-gnu/gnu/stubs-64.h \
 /usr/lib/gcc/x86_64-linux-gnu/7/include/stddef.h \

After:

$ rm -rf /tmp/bld; mkdir /tmp/bld; make -j$(nproc) O=/tmp/bld -C tools/lib/bpf/

Auto-detecting system features:
...libelf: [ on  ]
...   bpf: [ on  ]

  HOSTCC   /tmp/bld/fixdep.o
  HOSTLD   /tmp/bld/fixdep-in.o
  LINK /tmp/bld/fixdep
  CC   /tmp/bld/libbpf.o
  CC   /tmp/bld/bpf.o
  CC   /tmp/bld/nlattr.o
  CC   /tmp/bld/btf.o
  CC   /tmp/bld/libbpf_errno.o
  CC   /tmp/bld/str_error.o
  CC   /tmp/bld/netlink.o
  CC   /tmp/bld/bpf_prog_linfo.o
  CC   /tmp/bld/libbpf_probes.o
  CC   /tmp/bld/xsk.o
  LD   /tmp/bld/libbpf-in.o
  LINK /tmp/bld/libbpf.a
  LINK /tmp/bld/libbpf.so
  LINK /tmp/bld/test_libbpf

$ head /tmp/bld/.libbpf.o.cmd
cmd_/tmp/bld/libbpf.o := gcc -Wp,-MD,/tmp/bld/.libbpf.o.d 
-Wp,-MT,/tmp/bld/libbpf.o -g -Wall -DHAVE_LIBELF_MMAP_SUPPORT 
-DCOMPAT_NEED_REALLOCARRAY -Wbad-function-cast -Wdeclaration-after-statement 
-Wformat-security -Wformat-y2k -Winit-self -Wmissing-declarations 
-Wmissing-prototypes -Wnested-externs -Wno-system-headers 
-Wold-style-definition -Wpacked -Wredundant-decls -Wshadow -Wstrict-prototypes 
-Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat 
-Wstrict-aliasing=3 -Werror -Wall -fPIC -I. 
-I/usr/local/google/home/sdf/src/linux/tools/include 
-I/usr/local/google/home/sdf/src/linux/tools/arch/x86/include/uapi 
-I/usr/local/google/home/sdf/src/linux/tools/include/uapi -fvisibility=hidden 
-D"BUILD_STR(s)=$(pound)s" -c -o /tmp/bld/libbpf.o libbpf.c

source_/tmp/bld/libbpf.o := libbpf.c

deps_/tmp/bld/libbpf.o := \
  /usr/include/stdc-predef.h \
  /usr/include/stdlib.h \
  /usr/include/features.h \
  /usr/include/x86_64-linux-gnu/sys/cdefs.h \
  /usr/include/x86_64-linux-gnu/bits/wordsize.h \

Reported-by: Eric Dumazet 
Fixes: 7c422f557266 ("tools build: Build fixdep helper from perf and basic libs")

Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index a05c43468bd0..61aaacf0cfa1 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -147,7 +147,8 @@ endif
 
 TARGETS = $(CMD_TARGETS)
 
-all: fixdep all_cmd
+all: fixdep
+   $(Q)$(MAKE) all_cmd
 
 all_cmd: $(CMD_TARGETS) check
 
-- 
2.21.0.352.gf09ad66450-goog



[PATCH bpf 1/2] selftests: bpf: fix compilation with out-of-tree $(OUTPUT)

2019-03-06 Thread Stanislav Fomichev
A bunch of related changes lumped together:
* Create prog_tests and verifier output directories; these don't exist with
  out-of-tree $(OUTPUT)
* Add missing -I (via separate TEST_{PROGS,VERIFIER}_CFLAGS) for the main tree
  ($(PWD) != $(OUTPUT) for out-of-tree)
* Add libbpf.a dependency for test_progs_32 (parallel make fails otherwise)
* Add missing "; \" after "cd" when generating test.h headers

Tested by:
$ alias m="make -s -j$(nproc)"
$ m -C tools/testing/selftests/bpf/ clean
$ m -C tools/lib/bpf/ clean
$ rm -rf xxx; mkdir xxx; m -C tools/testing/selftests/bpf/ OUTPUT=$PWD/xxx
$ m -C tools/testing/selftests/bpf/

Fixes: 3f30658830f3 ("selftests: bpf: break up test_progs - preparations")
Fixes: 2dfb40121ee8 ("selftests: bpf: prepare for break up of verifier tests")
Fixes: 3ef84346c561 ("selftests: bpf: makefile support sub-register code-gen test mode")

Signed-off-by: Stanislav Fomichev 
---
 tools/testing/selftests/bpf/Makefile | 33 +++-
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 518cd587cd63..2aed37ea61a4 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -153,6 +153,9 @@ endif
 endif
 endif
 
+TEST_PROGS_CFLAGS := -I. -I$(OUTPUT)
+TEST_VERIFIER_CFLAGS := -I. -I$(OUTPUT) -Iverifier
+
 ifneq ($(SUBREG_CODEGEN),)
 ALU32_BUILD_DIR = $(OUTPUT)/alu32
 TEST_CUSTOM_PROGS += $(ALU32_BUILD_DIR)/test_progs_32
@@ -162,13 +165,15 @@ TEST_CUSTOM_PROGS += $(ALU32_BUILD_DIR)/test_progs_32
 $(ALU32_BUILD_DIR)/urandom_read: $(OUTPUT)/urandom_read
cp $< $@
 
-$(ALU32_BUILD_DIR)/test_progs_32: test_progs.c $(ALU32_BUILD_DIR) \
+$(ALU32_BUILD_DIR)/test_progs_32: test_progs.c $(OUTPUT)/libbpf.a\
+   $(ALU32_BUILD_DIR) \
$(ALU32_BUILD_DIR)/urandom_read
-   $(CC) $(CFLAGS) -o $(ALU32_BUILD_DIR)/test_progs_32 $< \
-   trace_helpers.c prog_tests/*.c $(OUTPUT)/libbpf.a $(LDLIBS)
+   $(CC) $(TEST_PROGS_CFLAGS) $(CFLAGS) \
+   -o $(ALU32_BUILD_DIR)/test_progs_32 \
+   test_progs.c trace_helpers.c prog_tests/*.c \
+   $(OUTPUT)/libbpf.a $(LDLIBS)
 
 $(ALU32_BUILD_DIR)/test_progs_32: $(PROG_TESTS_H)
-$(ALU32_BUILD_DIR)/test_progs_32: CFLAGS += -I$(OUTPUT)
 $(ALU32_BUILD_DIR)/test_progs_32: prog_tests/*.c
 
 $(ALU32_BUILD_DIR)/%.o: progs/%.c $(ALU32_BUILD_DIR) \
@@ -202,12 +207,16 @@ endif
 
 PROG_TESTS_H := $(OUTPUT)/prog_tests/tests.h
 $(OUTPUT)/test_progs: $(PROG_TESTS_H)
-$(OUTPUT)/test_progs: CFLAGS += -I$(OUTPUT)
+$(OUTPUT)/test_progs: CFLAGS += $(TEST_PROGS_CFLAGS)
 $(OUTPUT)/test_progs: prog_tests/*.c
 
+PROG_TESTS_DIR = $(OUTPUT)/prog_tests
+$(PROG_TESTS_DIR):
+   mkdir -p $@
+
 PROG_TESTS_FILES := $(wildcard prog_tests/*.c)
-$(PROG_TESTS_H): $(PROG_TESTS_FILES)
-   $(shell ( cd prog_tests/
+$(PROG_TESTS_H): $(PROG_TESTS_DIR) $(PROG_TESTS_FILES)
+   $(shell ( cd prog_tests/; \
  echo '/* Generated header, do not edit */'; \
  echo '#ifdef DECLARE'; \
  ls *.c 2> /dev/null | \
@@ -221,11 +230,15 @@ $(PROG_TESTS_H): $(PROG_TESTS_FILES)
 
 VERIFIER_TESTS_H := $(OUTPUT)/verifier/tests.h
 $(OUTPUT)/test_verifier: $(VERIFIER_TESTS_H)
-$(OUTPUT)/test_verifier: CFLAGS += -I$(OUTPUT)
+$(OUTPUT)/test_verifier: CFLAGS += $(TEST_VERIFIER_CFLAGS)
+
+VERIFIER_TESTS_DIR = $(OUTPUT)/verifier
+$(VERIFIER_TESTS_DIR):
+   mkdir -p $@
 
 VERIFIER_TEST_FILES := $(wildcard verifier/*.c)
-$(OUTPUT)/verifier/tests.h: $(VERIFIER_TEST_FILES)
-   $(shell ( cd verifier/
+$(OUTPUT)/verifier/tests.h: $(VERIFIER_TESTS_DIR) $(VERIFIER_TEST_FILES)
+   $(shell ( cd verifier/; \
  echo '/* Generated header, do not edit */'; \
  echo '#ifdef FILL_ARRAY'; \
  ls *.c 2> /dev/null | \
-- 
2.21.0.352.gf09ad66450-goog



[PATCH RFC v2] mac80211: debugfs option to force TX status frames

2019-03-06 Thread Julius Niedworok
At Technical University of Munich we use MAC 802.11 TX status frames to
perform several measurements in MAC 802.11 setups.

With ath based drivers this was possible until commit d94a461d7a7df6
("ath9k: use ieee80211_tx_status_noskb where possible") as the driver
ignored the IEEE80211_TX_CTL_REQ_TX_STATUS flag and always delivered
tx_status frames. Since that commit, the driver adheres to
IEEE80211_TX_CTL_REQ_TX_STATUS.

For performance reasons, IEEE80211_TX_CTL_REQ_TX_STATUS is not set for
data frames from interfaces in managed mode. Hence, frames that are sent
from a managed mode interface never deliver tx_status frames. This
remains true even if a monitor mode interface (the measurement interface)
is added to the same ieee80211 physical device. Thus, there is no way
to receive tx_status frames for frames sent on an interface in managed
mode if the driver adheres to IEEE80211_TX_CTL_REQ_TX_STATUS.

In order to force delivery of tx_status frames for research and debugging
purposes, implement a debugfs option force_tx_status for ieee80211 physical
devices. When this option is set for a physical device,
IEEE80211_TX_CTL_REQ_TX_STATUS is enabled in all packets sent from that
device. This option can be set via
/sys/kernel/debug/ieee80211//force_tx_status. The default is disabled.

Co-developed-by: Charlie Groh 
Signed-off-by: Charlie Groh 
Signed-off-by: Julius Niedworok 
---
 net/mac80211/debugfs.c | 53 ++
 net/mac80211/ieee80211_i.h |  1 +
 net/mac80211/tx.c  | 10 +
 3 files changed, 64 insertions(+)

diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
index 3fe541e..074b5d1 100644
--- a/net/mac80211/debugfs.c
+++ b/net/mac80211/debugfs.c
@@ -150,6 +150,58 @@ static const struct file_operations aqm_ops = {
.llseek = default_llseek,
 };
 
+static ssize_t force_tx_status_read(struct file *file,
+   char __user *user_buf,
+   size_t count,
+   loff_t *ppos)
+{
+   struct ieee80211_local *local = file->private_data;
+   char buf[3];
+   int len = 0;
+
+   len = scnprintf(buf, sizeof(buf), "%d\n", (int)local->force_tx_status);
+
+   return simple_read_from_buffer(user_buf, count, ppos,
+  buf, len);
+}
+
+static ssize_t force_tx_status_write(struct file *file,
+const char __user *user_buf,
+size_t count,
+loff_t *ppos)
+{
+   struct ieee80211_local *local = file->private_data;
+   char buf[3];
+   size_t len;
+
+   if (count > sizeof(buf))
+   return -EINVAL;
+
+   if (copy_from_user(buf, user_buf, count))
+   return -EFAULT;
+
+   buf[sizeof(buf) - 1] = '\0';
+   len = strlen(buf);
+   if (len > 0 && buf[len - 1] == '\n')
+   buf[len - 1] = 0;
+
+   if (buf[0] == '0' && buf[1] == '\0')
+   local->force_tx_status = 0;
+   else if (buf[0] == '1' && buf[1] == '\0')
+   local->force_tx_status = 1;
+   else
+   return -EINVAL;
+
+   return count;
+}
+
+static const struct file_operations force_tx_status_ops = {
+   .write = force_tx_status_write,
+   .read = force_tx_status_read,
+   .open = simple_open,
+   .llseek = default_llseek,
+};
+
 #ifdef CONFIG_PM
 static ssize_t reset_write(struct file *file, const char __user *user_buf,
   size_t count, loff_t *ppos)
@@ -379,6 +431,7 @@ void debugfs_hw_add(struct ieee80211_local *local)
DEBUGFS_ADD(hwflags);
DEBUGFS_ADD(user_power);
DEBUGFS_ADD(power);
+   DEBUGFS_ADD_MODE(force_tx_status, 0600);
 
if (local->ops->wake_tx_queue)
DEBUGFS_ADD_MODE(aqm, 0600);
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 7dfb4e2..3339b5d 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1367,6 +1367,7 @@ struct ieee80211_local {
struct dentry *rcdir;
struct dentry *keys;
} debugfs;
+   bool force_tx_status;
 #endif
 
/*
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 928f13a..717fa71 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -2463,6 +2463,11 @@ static struct sk_buff *ieee80211_build_hdr(struct ieee80211_sub_if_data *sdata,
if (IS_ERR(sta))
sta = NULL;
 
+#ifdef CONFIG_MAC80211_DEBUGFS
+   if (local->force_tx_status)
+   info_flags |= IEEE80211_TX_CTL_REQ_TX_STATUS;
+#endif
+
/* convert Ethernet header to proper 802.11 header (based on
 * operation mode) */
ethertype = (skb->data[12] << 8) | skb->data[13];
@@ -3468,6 +3473,11 @@ static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
  (tid_tx ? IEEE80
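
The "0"/"1" parsing done by force_tx_status_write() can be modelled in user space as below. This is only a sketch: parse_bool_write() and its return convention are hypothetical names, and memcpy() stands in for copy_from_user(). It terminates the buffer at the number of bytes actually received, because in the handler as posted a one-byte write (for example "1" without a newline) appears to leave buf[1] uninitialized when strlen() runs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for the debugfs write handler.
 * Returns 0 or 1 for a valid input, -1 otherwise.
 */
int parse_bool_write(const char *src, size_t count)
{
	char buf[3];

	/* Accept "0", "1", "0\n" or "1\n" only: one or two bytes. */
	if (count == 0 || count >= sizeof(buf))
		return -1;

	memcpy(buf, src, count);	/* stands in for copy_from_user() */
	buf[count] = '\0';		/* terminate at the bytes actually copied */

	/* Strip a single trailing newline, as written by "echo". */
	if (buf[count - 1] == '\n')
		buf[count - 1] = '\0';

	if (buf[0] == '0' && buf[1] == '\0')
		return 0;
	if (buf[0] == '1' && buf[1] == '\0')
		return 1;
	return -1;
}
```

In kernel code, a helper such as kstrtobool_from_user() may be a simpler way to get the same behaviour, at the cost of accepting more spellings than just "0" and "1".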

[PATCH 2/2] ip6mr: Make cache queue length configurable

2019-03-06 Thread Brodie Greenfield
We want to be able to keep more spaces available in our queue for
processing incoming IPv6 multicast traffic (adding (S,G) entries) - this
lets us learn more groups faster, rather than dropping them at this stage.

Signed-off-by: Brodie Greenfield 
---
 Documentation/networking/ip-sysctl.txt | 8 
 include/net/netns/ipv6.h   | 1 +
 net/ipv6/af_inet6.c| 1 +
 net/ipv6/ip6mr.c   | 4 +++-
 net/ipv6/sysctl_net_ipv6.c | 7 +++
 5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 02f77e932adf..68eada3ca915 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1481,6 +1481,14 @@ skip_notify_on_dev_down - BOOLEAN
on userspace caches to track link events and evict routes.
Default: false (generate message)
 
+ip_mr_cache_queue_length - INTEGER
+   Limit the number of multicast packets we can have in the queue to be
+   resolved.
+   Bear in mind that when an unresolved multicast packet is received,
+   there is an O(n) traversal of the queue. This should be considered
+   if increasing.
+   Default: 10
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index ef1ed529f33c..84b58424c799 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -46,6 +46,7 @@ struct netns_sysctl_ipv6 {
int max_hbh_opts_len;
int seg6_flowlabel;
bool skip_notify_on_dev_down;
+   unsigned int ip6_mr_cache_queue_length;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d99753b5e39b..6551bb63e5a2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -856,6 +856,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
+   net->ipv6.sysctl.ip6_mr_cache_queue_length = 10;
atomic_set(&net->ipv6.fib6_sernum, 1);
 
err = ipv6_init_mibs(net);
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index cc01aa3f2b5e..bb445871437e 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1135,6 +1135,7 @@ static int ip6mr_cache_report(struct mr_table *mrt, struct sk_buff *pkt,
 static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
  struct sk_buff *skb, struct net_device *dev)
 {
+   struct net *net = dev_net(dev);
struct mfc6_cache *c;
bool found = false;
int err;
@@ -1153,7 +1154,8 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
 *  Create a new entry if allowable
 */
 
-   if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
+   if (atomic_read(&mrt->cache_resolve_queue_len) >=
+   net->ipv6.sysctl.ip6_mr_cache_queue_length ||
(c = ip6mr_cache_alloc_unres()) == NULL) {
spin_unlock_bh(&mfc_unres_lock);
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index e15cd37024fd..a27299d4cc34 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -159,6 +159,13 @@ static struct ctl_table ipv6_table_template[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "ip6_mr_cache_queue_length",
+   .data   = &init_net.ipv6.sysctl.ip6_mr_cache_queue_length,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
-- 
2.21.0



[PATCH 0/2] Make ipmr queue length configurable

2019-03-06 Thread Brodie Greenfield
We want to have some more space in our queue for processing incoming
multicast packets, so we can process more of them without dropping
them prematurely. It is useful to be able to increase this limit on
higher-spec platforms that can handle more items.

For the particular use case here at Allied Telesis, we have Linux
running on our switches and routers, and we are increasing the number
of multicast groups they support. This queue length affects the time
taken to fully learn all of the multicast streams.

Changes in v2:
 - Tidy up a few unnecessary bits of code.
 - Put the sysctl inside ip multicast ifdef.
 - Included the IPv6 version.

Brodie Greenfield (2):
  ipmr: Make cache queue length configurable
  ip6mr: Make cache queue length configurable

 Documentation/networking/ip-sysctl.txt | 16 
 include/net/netns/ipv4.h   |  1 +
 include/net/netns/ipv6.h   |  1 +
 net/ipv4/af_inet.c |  1 +
 net/ipv4/ipmr.c|  4 +++-
 net/ipv4/sysctl_net_ipv4.c |  7 +++
 net/ipv6/af_inet6.c|  1 +
 net/ipv6/ip6mr.c   |  4 +++-
 net/ipv6/sysctl_net_ipv6.c |  7 +++
 9 files changed, 40 insertions(+), 2 deletions(-)

-- 
2.21.0



[PATCH 1/2] ipmr: Make cache queue length configurable

2019-03-06 Thread Brodie Greenfield
We want to be able to keep more spaces available in our queue for
processing incoming multicast traffic (adding (S,G) entries) - this lets
us learn more groups faster, rather than dropping them at this stage.

Signed-off-by: Brodie Greenfield 
---
 Documentation/networking/ip-sysctl.txt | 8 
 include/net/netns/ipv4.h   | 1 +
 net/ipv4/af_inet.c | 1 +
 net/ipv4/ipmr.c| 4 +++-
 net/ipv4/sysctl_net_ipv4.c | 7 +++
 5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index acdfb5d2bcaa..02f77e932adf 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -887,6 +887,14 @@ ip_local_reserved_ports - list of comma separated ranges
 
Default: Empty
 
+ip_mr_cache_queue_length - INTEGER
+   Limit the number of multicast packets we can have in the queue to be
+   resolved.
+   Bear in mind that when an unresolved multicast packet is received,
+   there is an O(n) traversal of the queue. This should be considered
+   if increasing.
+   Default: 10
+
 ip_unprivileged_port_start - INTEGER
This is a per-namespace sysctl.  It defines the first
unprivileged port in the network namespace.  Privileged ports
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 104a6669e344..3411d3f18d51 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -187,6 +187,7 @@ struct netns_ipv4 {
int sysctl_igmp_max_msf;
int sysctl_igmp_llm_reports;
int sysctl_igmp_qrv;
+   unsigned int sysctl_ip_mr_cache_queue_length;
 
struct ping_group_range ping_group_range;
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 0dfb72c46671..8e25538bdb1e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1827,6 +1827,7 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.sysctl_igmp_llm_reports = 1;
net->ipv4.sysctl_igmp_qrv = 2;
 
+   net->ipv4.sysctl_ip_mr_cache_queue_length = 10;
return 0;
 }
 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index ddbf8c9a1abb..c6a6c3e453a9 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1127,6 +1127,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 struct sk_buff *skb, struct net_device *dev)
 {
const struct iphdr *iph = ip_hdr(skb);
+   struct net *net = dev_net(dev);
struct mfc_cache *c;
bool found = false;
int err;
@@ -1142,7 +1143,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 
if (!found) {
/* Create a new entry if allowable */
-   if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
+   if (atomic_read(&mrt->cache_resolve_queue_len) >=
+   net->ipv4.sysctl_ip_mr_cache_queue_length ||
(c = ipmr_cache_alloc_unres()) == NULL) {
spin_unlock_bh(&mfc_unres_lock);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ba0fc4b18465..78ae86e8c6cb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -784,6 +784,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
 #ifdef CONFIG_IP_MULTICAST
+   {
+   .procname   = "ip_mr_cache_queue_length",
+   .data   = &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{
.procname   = "igmp_qrv",
.data   = &init_net.ipv4.sysctl_igmp_qrv,
-- 
2.21.0
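
The admission check that this patch makes configurable can be sketched in user space as follows. The names unres_queue and unres_queue_try_add() are hypothetical and only model the comparison against the new limit; locking, the atomic counter and the skb handling of the real code are left out.

```c
#include <stdbool.h>

/* Hypothetical model of the per-table unresolved queue accounting. */
struct unres_queue {
	unsigned int len;	/* entries currently waiting to be resolved */
	unsigned int limit;	/* configurable cap; the patch defaults to 10 */
};

/* Admit a new unresolved entry only while the queue is below its cap. */
bool unres_queue_try_add(struct unres_queue *q)
{
	if (q->len >= q->limit)
		return false;	/* the kernel path would drop the packet here */
	q->len++;
	return true;
}
```

With a limit of 2, the first two calls succeed and the third is rejected; a limit of 0 rejects every new entry, which is worth keeping in mind when choosing sysctl values.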



Re: [PATCH 2/2] ip6mr: Make cache queue length configurable

2019-03-06 Thread Chris Packham
On 7/03/19 9:20 AM, Brodie Greenfield wrote:
> We want to be able to keep more spaces available in our queue for
> processing incoming IPv6 multicast traffic (adding (S,G) entries) - this
> lets us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield 
> ---
>   Documentation/networking/ip-sysctl.txt | 8 
>   include/net/netns/ipv6.h   | 1 +
>   net/ipv6/af_inet6.c| 1 +
>   net/ipv6/ip6mr.c   | 4 +++-
>   net/ipv6/sysctl_net_ipv6.c | 7 +++
>   5 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 02f77e932adf..68eada3ca915 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1481,6 +1481,14 @@ skip_notify_on_dev_down - BOOLEAN
>   on userspace caches to track link events and evict routes.
>   Default: false (generate message)
>   
> +ip_mr_cache_queue_length - INTEGER

Should be "ip6_mr_cache_queue_length" for this patch.

> + Limit the number of multicast packets we can have in the queue to be
> + resolved.
> + Bear in mind that when an unresolved multicast packet is received,
> + there is an O(n) traversal of the queue. This should be considered
> + if increasing.
> + Default: 10
> +
>   IPv6 Fragmentation:
>   
>   ip6frag_high_thresh - INTEGER
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index ef1ed529f33c..84b58424c799 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -46,6 +46,7 @@ struct netns_sysctl_ipv6 {
>   int max_hbh_opts_len;
>   int seg6_flowlabel;
>   bool skip_notify_on_dev_down;
> + unsigned int ip6_mr_cache_queue_length;
>   };
>   
>   struct netns_ipv6 {
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index d99753b5e39b..6551bb63e5a2 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -856,6 +856,7 @@ static int __net_init inet6_net_init(struct net *net)
>   net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
>   net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
>   net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
> + net->ipv6.sysctl.ip6_mr_cache_queue_length = 10;
>   atomic_set(&net->ipv6.fib6_sernum, 1);
>   
>   err = ipv6_init_mibs(net);
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index cc01aa3f2b5e..bb445871437e 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -1135,6 +1135,7 @@ static int ip6mr_cache_report(struct mr_table *mrt, struct sk_buff *pkt,
>   static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
> struct sk_buff *skb, struct net_device *dev)
>   {
> + struct net *net = dev_net(dev);
>   struct mfc6_cache *c;
>   bool found = false;
>   int err;
> @@ -1153,7 +1154,8 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
>*  Create a new entry if allowable
>*/
>   
> - if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> + if (atomic_read(&mrt->cache_resolve_queue_len) >=
> + net->ipv6.sysctl.ip6_mr_cache_queue_length ||
>   (c = ip6mr_cache_alloc_unres()) == NULL) {
>   spin_unlock_bh(&mfc_unres_lock);
>   
> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index e15cd37024fd..a27299d4cc34 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -159,6 +159,13 @@ static struct ctl_table ipv6_table_template[] = {
>   .mode   = 0644,
>   .proc_handler   = proc_dointvec
>   },
> + {
> + .procname   = "ip6_mr_cache_queue_length",
> + .data   = &init_net.ipv6.sysctl.ip6_mr_cache_queue_length,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec
> + },
>   { }
>   };
>   
> 



Re: [PATCH] xfrm: Reset secpath in xfrm failure

2019-03-06 Thread Myungho Jung
On Wed, Mar 06, 2019 at 12:35:43PM +0100, Steffen Klassert wrote:
> On Wed, Mar 06, 2019 at 04:33:08PM +0900, Myungho Jung wrote:
> > In esp4_gro_receive() and esp6_gro_receive(), secpath can be allocated
> > without adding xfrm state to xvec. Then, sp->xvec[sp->len - 1] would
> > fail and result in dereferencing an invalid pointer in esp4_gso_segment()
> > and esp6_gso_segment(). Reset secpath if the xfrm function returns an error.
> > 
> > Reported-by: syzbot+b69368fd933c6c592...@syzkaller.appspotmail.com
> > Signed-off-by: Myungho Jung 
> 
> The patch itself looks ok, but please add a 'Fixes' tag to
> the commit message.
> 
> Thanks!

Hi Steffen,

Thanks for reviewing the change. I'll add a Fixes tag and resubmit it.

Thanks,
Myungho


Re: [PATCH v1 iproute2-next 1/4] rdma: add helper rd_sendrecv_msg()

2019-03-06 Thread Steve Wise


On 3/4/2019 8:13 AM, Steve Wise wrote:
> Hey Leon, adding this to rd_recv_msg():
>
> @@ -693,10 +693,28 @@ int rd_recv_msg(struct rd *rd, mnl_cb_t callback, void *data, unsigned int seq)
> ret = mnl_cb_run(buf, ret, seq, portid, callback, data);
> } while (ret > 0);
>
> +   if (ret < 0)
> +   perror(NULL);
> +
> mnl_socket_close(rd->nl);
> return ret;
>  }
>
> Results in unexpected errors being logged when doing a query such as:
>
> [root@stevo1 iproute2]# ./rdma/rdma res show qp lqpn 176
> error: Invalid argument
> link mlx5_0/1 lqpn 176 type UD state RTS sq-psn 0 comm [ib_core]
> error: Invalid argument
> error: No such file or directory
> error: Invalid argument
> error: No such file or directory
>
> It appears the "invalid argument" errors are due to rdmatool sending a
> RDMA_NLDEV_CMD_RES_QP_GET command using the doit kernel method to allow
> querying for just a QP with lqpn = 176.  However, rdmatool isn't passing a
> port index in the messages that generate the "invalid argument" error from
> the kernel.  IE you must provide a device index and port index when issuing
> a doit command vs a dumpit command.  I think. 
>
> This error was not found because rd_recv_msg() never displayed any errors
> previously.  Further, the RES_FUNC() massive macro has code that will retry
> a failed doit call with a dumpit call.  I think _##name() should distinguish
> between failures reported by the kernel doit function vs failures because no
> doit function exists.  Not sure how to support that.
>
>
> static inline int _##name(struct rd *rd) \
> { \
>         uint32_t idx; \
>         int ret; \
>         if (id) { \
>                 ret = rd_doit_index(rd, &idx); \
>                 if (ret) { \
>                         ret = _res_send_idx_msg(rd, command, \
>                                                 name##_idx_parse_cb, \
>                                                 idx, id); \
>                         if (!ret) \
>                                 return ret; \
>                         /* Fallback for old systems without .doit callbacks */ \
>                 } \
>         } \
>         return _res_send_msg(rd, command, name##_parse_cb); \
> } \
>
>
>
> The "no such file or dir" errors are being returned because, in my setup,
> there are 2 other links that do not have lqpn 176.   So there are 2 issues
> uncovered by adding generic printing of errors in rd_recv_msg()
>
> 1) the doit code in rdmatool is generating requests for a doit method in the
> kernel w/o providing a port index.
> 2) some paths in rdmatool should not print "benign" errors like filtering on
> a GET command causing a "does not exist" error returned by the kernel doit
> func.
>
> #1 is a bug, IMO.  Can you propose a fix?
> #2 could be solved by adding an error callback func passed to rd_recv_msg().
> Then the RES_FUNC() functions could parse errors like "no such file or dir"
> when doing a filtered query and silently drop them.  And functions like
> dev_set_name() would display all errors returned because there are no
> expected errors other than "success".
>
> Steve.
>

Hey Leon, you've been quiet. :)   Thoughts?

Thanks,

Steve.
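
The error-callback idea from point #2 above could look roughly like the sketch below. All names here (rd_err_cb_t, res_filter_err_cb(), report_all_err_cb(), rd_report_error()) are hypothetical and are not existing rdmatool functions; the sketch only shows how a per-caller callback could decide which errno values get printed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: return true if the error should be reported. */
typedef bool (*rd_err_cb_t)(int err);

/* Filtered queries: a missing object on some link is expected, so hide it. */
bool res_filter_err_cb(int err)
{
	return err != ENOENT;
}

/* Commands with no expected errors: report everything. */
bool report_all_err_cb(int err)
{
	(void)err;
	return true;
}

/*
 * Hypothetical helper for the place where the receive loop sees ret < 0.
 * Returns true if the error was printed, false if it was suppressed.
 */
bool rd_report_error(int err, rd_err_cb_t cb)
{
	if (cb && !cb(err))
		return false;
	fprintf(stderr, "error: %s\n", strerror(err));
	return true;
}
```

A filtered "res show qp lqpn ..." query would pass res_filter_err_cb so that "No such file or directory" from links without a match is dropped, while a caller like dev_set_name() would pass report_all_err_cb (or no callback) and see every error.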




Re: [PATCH] connector: fix unsafe usage of ->real_parent

2019-03-06 Thread Evgeniy Polyakov
Hi

06.03.2019, 09:46, "Li RongQing" :
> proc_exit_connector() uses ->real_parent locklessly. This is not
> safe because the parent can go away at any moment, so use RCU to
> protect the access, and make sure that this task has not been released.
>

> Fixes: b086ff87251b4a4 ("connector: add parent pid and tgid to coredump and exit events")
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 

Looks good to me, thank you!
Acked-by: Evgeniy Polyakov 

> ---
>  drivers/connector/cn_proc.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
> index ed5e42461094..ad48fd52cb53 100644
> --- a/drivers/connector/cn_proc.c
> +++ b/drivers/connector/cn_proc.c
> @@ -250,6 +250,7 @@ void proc_coredump_connector(struct task_struct *task)
>  {
>  struct cn_msg *msg;
>  struct proc_event *ev;
> + struct task_struct *parent;
>  __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8);
>
>  if (atomic_read(&proc_event_num_listeners) < 1)
> @@ -262,8 +263,14 @@ void proc_coredump_connector(struct task_struct *task)
>  ev->what = PROC_EVENT_COREDUMP;
>  ev->event_data.coredump.process_pid = task->pid;
>  ev->event_data.coredump.process_tgid = task->tgid;
> - ev->event_data.coredump.parent_pid = task->real_parent->pid;
> - ev->event_data.coredump.parent_tgid = task->real_parent->tgid;
> +
> + rcu_read_lock();
> + if (pid_alive(task)) {
> + parent = rcu_dereference(task->real_parent);
> + ev->event_data.coredump.parent_pid = parent->pid;
> + ev->event_data.coredump.parent_tgid = parent->tgid;
> + }
> + rcu_read_unlock();
>
>  memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
>  msg->ack = 0; /* not used */
> @@ -276,6 +283,7 @@ void proc_exit_connector(struct task_struct *task)
>  {
>  struct cn_msg *msg;
>  struct proc_event *ev;
> + struct task_struct *parent;
>  __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8);
>
>  if (atomic_read(&proc_event_num_listeners) < 1)
> @@ -290,8 +298,14 @@ void proc_exit_connector(struct task_struct *task)
>  ev->event_data.exit.process_tgid = task->tgid;
>  ev->event_data.exit.exit_code = task->exit_code;
>  ev->event_data.exit.exit_signal = task->exit_signal;
> - ev->event_data.exit.parent_pid = task->real_parent->pid;
> - ev->event_data.exit.parent_tgid = task->real_parent->tgid;
> +
> + rcu_read_lock();
> + if (pid_alive(task)) {
> + parent = rcu_dereference(task->real_parent);
> + ev->event_data.exit.parent_pid = parent->pid;
> + ev->event_data.exit.parent_tgid = parent->tgid;
> + }
> + rcu_read_unlock();
>
>  memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
>  msg->ack = 0; /* not used */
> --
> 2.16.2


[PATCH v2] xfrm: Reset secpath in xfrm failure

2019-03-06 Thread Myungho Jung
In esp4_gro_receive() and esp6_gro_receive(), secpath can be allocated
without adding xfrm state to xvec. Then, sp->xvec[sp->len - 1] would
fail and result in dereferencing an invalid pointer in esp4_gso_segment()
and esp6_gso_segment(). Reset secpath if the xfrm function returns an error.

Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Reported-by: syzbot+b69368fd933c6c592...@syzkaller.appspotmail.com
Signed-off-by: Myungho Jung 
---
Changes in v2:
  - Add fixes tag.

 net/ipv4/esp4_offload.c | 9 +++--
 net/ipv6/esp6_offload.c | 9 +++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 8756e0e790d2..7329e40c73f6 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -51,14 +51,18 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
if (!sp)
goto out;
 
-   if (sp->len == XFRM_MAX_DEPTH)
+   if (sp->len == XFRM_MAX_DEPTH) {
+   secpath_reset(skb);
goto out;
+   }
 
x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
  (xfrm_address_t *)&ip_hdr(skb)->daddr,
  spi, IPPROTO_ESP, AF_INET);
-   if (!x)
+   if (!x) {
+   secpath_reset(skb);
goto out;
+   }
 
sp->xvec[sp->len++] = x;
sp->olen++;
@@ -66,6 +70,7 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
xo = xfrm_offload(skb);
if (!xo) {
xfrm_state_put(x);
+   secpath_reset(skb);
goto out;
}
}
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index d46b4eb645c2..399c688d192e 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -73,14 +73,18 @@ static struct sk_buff *esp6_gro_receive(struct list_head *head,
if (!sp)
goto out;
 
-   if (sp->len == XFRM_MAX_DEPTH)
+   if (sp->len == XFRM_MAX_DEPTH) {
+   secpath_reset(skb);
goto out;
+   }
 
x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
  (xfrm_address_t *)&ipv6_hdr(skb)->daddr,
  spi, IPPROTO_ESP, AF_INET6);
-   if (!x)
+   if (!x) {
+   secpath_reset(skb);
goto out;
+   }
 
sp->xvec[sp->len++] = x;
sp->olen++;
@@ -88,6 +92,7 @@ static struct sk_buff *esp6_gro_receive(struct list_head *head,
xo = xfrm_offload(skb);
if (!xo) {
xfrm_state_put(x);
+   secpath_reset(skb);
goto out;
}
}
-- 
2.17.1
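
For readers following along, here is a minimal userspace sketch of the
failure mode described in the commit message. It is not kernel code: the
types and helpers (struct path, struct pkt, pkt_receive(), pkt_path_reset(),
pkt_last_state()) are hypothetical stand-ins for sec_path, sk_buff, the
gro_receive handlers, secpath_reset() and the later xvec[len - 1] lookup.
The point it tries to show is the invariant the patch restores: a packet
either has no path at all, or a path holding at least one state.

```c
#include <stdlib.h>

#define MAX_DEPTH 6	/* stand-in for XFRM_MAX_DEPTH */

struct state { int id; };

/* Simplified stand-in for struct sec_path: len counts valid xvec entries. */
struct path {
	int len;
	struct state *xvec[MAX_DEPTH];
};

struct pkt { struct path *sp; };

/* Stand-in for secpath_reset(): detach the path from the packet. */
static void pkt_path_reset(struct pkt *p)
{
	free(p->sp);
	p->sp = NULL;
}

/*
 * Receive step: the path may be allocated before the state lookup runs.
 * If the lookup fails (found == NULL) or the path is full, the path is
 * reset so that no path with a stale or zero length stays attached.
 */
static int pkt_receive(struct pkt *p, struct state *found)
{
	if (!p->sp) {
		p->sp = calloc(1, sizeof(*p->sp));
		if (!p->sp)
			return -1;
	}

	if (p->sp->len == MAX_DEPTH || !found) {
		pkt_path_reset(p);
		return -1;
	}

	p->sp->xvec[p->sp->len++] = found;
	return 0;
}

/*
 * Later consumer: indexes xvec[len - 1]. This is only safe because
 * pkt_receive() never leaves an empty path attached; with len == 0 the
 * index would be -1.
 */
static struct state *pkt_last_state(const struct pkt *p)
{
	if (!p->sp)
		return NULL;
	return p->sp->xvec[p->sp->len - 1];
}
```

Without the reset in the failure branch, pkt_last_state() would be handed a
path with len == 0 and read one slot before the array, which is the kind of
invalid access the report describes.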



TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread Captain Wiggum
We are using the TAHI Self-test tools from IPv6 Ready Logo Program:
https://www.ipv6ready.org/?page=documents&tag=ipv6-core-protocols

The tests passed up to 4.9.133 and have failed ever since.

There are about 20 failing test cases regarding IPv6 fragments,
where the kernel is issuing an ICMPv6 parameter problem pointing
to the Fragmentation Header.

I see lots of commits regarding improving fragment processing.
Has anyone else run TAHI tests on kernel 4.9 or later?
Is there any interest in looking into this to improve
the IPv6 functionality?

Any tips appreciated.

--John Masinter


Re: TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread David Miller
From: Captain Wiggum 
Date: Wed, 6 Mar 2019 15:26:43 -0700

> We are using the TAHI Self-test tools from IPv6 Ready Logo Program:
> https://www.ipv6ready.org/?page=documents&tag=ipv6-core-protocols
> 
> The tests passed up to 4.9.133 and have failed ever since.
> 
> There are about 20 failing test cases regarding IPv6 fragments,
> where the kernel is issuing an ICMPv6 parameter problem pointing
> to the Fragmentation Header.
> 
> I see lots of commits regarding improving fragment processing.
> Has anyone else run TAHI tests on kernel 4.9 or later?
> Is there any interest in looking into this to improve
> the IPv6 functionality?

It is intentionally failing those tests to fix a denial of service
issue with ipv6 fragmentation and this will therefore not be changed.


Re: TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread Florian Fainelli
On 3/6/19 2:26 PM, Captain Wiggum wrote:
> We are using the TAHI Self-test tools from IPv6 Ready Logo Program:
> https://www.ipv6ready.org/?page=documents&tag=ipv6-core-protocols
> 
> The tests passed up to 4.9.133 and have failed ever since.
> 
> There are about 20 failing test cases regarding IPv6 fragments,
> where the kernel is issuing an ICMPv6 parameter problem pointing
> to the Fragmentation Header.
> 
> I see lots of commits regarding improving fragment processing.
> Has anyone else run TAHI tests on kernel 4.9 or later?
> Is there any interest in looking into this to improve
> the IPv6 functionality?

Can you run a git bisect to determine when things started to fail,
and check whether a newer 4.9 release improves the situation?
-- 
Florian


Re: [PATCH v2] xfrm: Reset secpath in xfrm failure

2019-03-06 Thread Eric Dumazet



On 03/06/2019 01:55 PM, Myungho Jung wrote:
> In esp4_gro_receive() and esp6_gro_receive(), secpath can be allocated
> without adding xfrm state to xvec. Then, sp->xvec[sp->len - 1] would
> fail and result in dereferencing invalid pointer in esp4_gso_segment()
> and esp6_gso_segment(). Reset secpath if xfrm function returns error.
> 
> Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
> Reported-by: syzbot+b69368fd933c6c592...@syzkaller.appspotmail.com
> Signed-off-by: Myungho Jung 
> ---
> Changes in v2:
>   - Add fixes tag.
> 
>  net/ipv4/esp4_offload.c | 9 +++--
>  net/ipv6/esp6_offload.c | 9 +++--
>  2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
> index 8756e0e790d2..7329e40c73f6 100644
> --- a/net/ipv4/esp4_offload.c
> +++ b/net/ipv4/esp4_offload.c
> @@ -51,14 +51,18 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
>   if (!sp)
>   goto out;
>  
> - if (sp->len == XFRM_MAX_DEPTH)
> + if (sp->len == XFRM_MAX_DEPTH) {
> + secpath_reset(skb);
>   goto out;
> + }
>  
>   x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
> (xfrm_address_t *)&ip_hdr(skb)->daddr,
> spi, IPPROTO_ESP, AF_INET);
> - if (!x)
> + if (!x) {
> + secpath_reset(skb);
>   goto out;
> + }
>  

I suggest adding another exit label, so that you can replace "goto out" with
"goto out_reset":

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 8756e0e790d2a94a5b4a587c3bc3de0673baf2c4..76f754f6692696ba2aa8c9eb03b68b92d1e39ee1 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -82,6 +82,8 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
xfrm_input(skb, IPPROTO_ESP, spi, -2);
 
return ERR_PTR(-EINPROGRESS);
+out_reset:
+   secpath_reset(skb);
 out:
skb_push(skb, offset);
NAPI_GRO_CB(skb)->same_flow = 0;
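
The stacked-label idea above can be sketched in plain C as follows. This is
only an illustration under assumptions: struct ctx, process() and its
boolean parameters are made up for the example and do not correspond to
anything in esp4_offload.c. Failures that happen after the path has been
attached jump to out_reset, which undoes the attachment and then falls
through into the common out path; failures before that point jump straight
to out.

```c
#include <stdbool.h>

/* Hypothetical context: tracks what the error path has to undo. */
struct ctx {
	bool path_attached;	/* stands in for "skb has a secpath" */
	int restored;		/* counts runs of the common exit path */
};

static int process(struct ctx *c, bool alloc_ok, bool depth_ok, bool lookup_ok)
{
	if (!alloc_ok)
		goto out;		/* nothing attached yet, plain exit */

	c->path_attached = true;

	if (!depth_ok)
		goto out_reset;		/* attached, so it must be undone */

	if (!lookup_ok)
		goto out_reset;

	return 0;

out_reset:
	c->path_attached = false;	/* extra cleanup, then fall through */
out:
	c->restored++;			/* cleanup shared by every failure */
	return -1;
}
```

Compared with repeating the reset call in every failing branch, a second
label keeps each branch a single goto and puts the cleanup order in one
place.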




>   sp->xvec[sp->len++] = x;
>   sp->olen++;
> @@ -66,6 +70,7 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
>   xo = xfrm_offload(skb);
>   if (!xo) {
>   xfrm_state_put(x);
> + secpath_reset(skb);
>   goto out;
>   }
>   }




Re: TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread Eric Dumazet



On 03/06/2019 02:28 PM, David Miller wrote:
> From: Captain Wiggum 
> Date: Wed, 6 Mar 2019 15:26:43 -0700
> 
>> We are using the TAHI Self-test tools from IPv6 Ready Logo Program:
>> https://www.ipv6ready.org/?page=documents&tag=ipv6-core-protocols
>>
>> The tests passed up to 4.9.133 and have failed ever since.
>>
>> There are about 20 failing test cases regarding IPv6 fragments,
>> where the kernel is issuing an ICMPv6 parameter problem pointing
>> to the Fragmentation Header.
>>
>> I see lots of commits regarding improving fragment processing.
>> Has anyone else run TAHI tests on kernel 4.9 or later?
>> Is there any interest in looking into this to improve
>> the IPv6 functionality?
> 
> It is intentionally failing those tests to fix a denial of service
> issue with ipv6 fragmentation and this will therefore not be changed.


That was Florian's patch that was later reverted, right?

0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ipv6: defrag: drop non-last frags smaller than min mtu

-> reverted in

d4289fcc9b16b89619ee1c54f829e05e56de8b9a net: IP6 defrag: use rbtrees for IPv6 defrag





Re: TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread David Miller
From: Eric Dumazet 
Date: Wed, 6 Mar 2019 14:34:25 -0800

> That was Florian's patch that was later reverted, right?
> 
> 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ipv6: defrag: drop non-last frags 
> smaller than min mtu
> 
> -> reverted in 
> 
> d4289fcc9b16b89619ee1c54f829e05e56de8b9a net: IP6 defrag: use rbtrees for 
> IPv6 defrag

Oh yes, that's right.


Re: TAHI testing fails for IPv6 Fragments in Kernel 4.9

2019-03-06 Thread Captain Wiggum
Thank you all for the reply.
The problem started with 4.9.133, and persists to latest 4.9 today.
I will work on the bisect approach to find the bad commit(s).
It will take a few days. I will reply back when I have more info.

On Wed, Mar 6, 2019 at 3:34 PM Eric Dumazet  wrote:
>
>
>
> On 03/06/2019 02:28 PM, David Miller wrote:
> > From: Captain Wiggum 
> > Date: Wed, 6 Mar 2019 15:26:43 -0700
> >
> >> We are using the TAHI Self-test tools from IPv6 Ready Logo Program:
> >> https://www.ipv6ready.org/?page=documents&tag=ipv6-core-protocols
> >>
> >> The tests passed up to 4.9.133 and have failed ever since.
> >>
> >> There are about 20 failing test cases regarding IPv6 fragments,
> >> where the kernel is issuing an ICMPv6 parameter problem pointing
> >> to the Fragmentation Header.
> >>
> >> I see lots of commits regarding improving fragment processing.
> >> Has anyone else run TAHI tests on kernel 4.9 or later?
> >> Is there any interest in looking into this to improve
> >> the IPv6 functionality?
> >
> > It is intentionally failing those tests to fix a denial of service
> > issue with ipv6 fragmentation and this will therefore not be changed.
>
>
> That was Florian's patch that was later reverted, right?
>
> 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ipv6: defrag: drop non-last frags 
> smaller than min mtu
>
> -> reverted in
>
> d4289fcc9b16b89619ee1c54f829e05e56de8b9a net: IP6 defrag: use rbtrees for 
> IPv6 defrag
>
>
>


Re: [Linuxptp-devel] strangeness

2019-03-06 Thread Paul Thomas
On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam  wrote:
>
> +netdev
>
> Hi Paul,
> On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
>  wrote:
> >
> > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > somewhere in the network stack that this is just masking.
>
> Ok thanks.
> One place to check in the driver will be:
> if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> /* skb now belongs to timestamp buffer
> * and will be removed later
> */
> tx_skb->skb = NULL;
> }
> When all TX packets are timestamped, the skb always belongs to the
> timestamp buffer.
>
> >
> > Or the HW isn't sending the frames in the first place.
> >
> > Check that first!
>
> To check this, the statistics registers in MAC will be one way.
> But if there is no TX completion interrupt, then I wouldn't expect
> these statistics to increase either. The used bit status in BD dump
> might be of more use.
>
> I will also try to reproduce (with TX timestamp ALL) and see if any of
> the above gives some clue.
>
> Regards,
> Harini

Hi Harini, any luck looking at this?

I didn't get very far, even in the "broken" state I see plenty of tx_frames:
root@xu5:/opt/linuxptp# ethtool -S eth0
NIC statistics:
 ...
 tx_frames: 39763
 ...

When you said "registers in the MAC" is ethtool -S displaying that?

-Paul


[PATCH bpf] selftests: bpf: test_progs: initialize duration in signal_pending test

2019-03-06 Thread Stanislav Fomichev
CHECK macro implicitly uses duration. We call CHECK() a couple of times
before duration is initialized from bpf_prog_test_run().
Explicitly set duration to 0 to avoid compiler warnings.

Fixes: 740f8a657221 ("selftests/bpf: make sure signal interrupts BPF_PROG_TEST_RUN")

Signed-off-by: Stanislav Fomichev 
---
 tools/testing/selftests/bpf/prog_tests/signal_pending.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/signal_pending.c b/tools/testing/selftests/bpf/prog_tests/signal_pending.c
index f2a37bbf91ab..996e808f43a2 100644
--- a/tools/testing/selftests/bpf/prog_tests/signal_pending.c
+++ b/tools/testing/selftests/bpf/prog_tests/signal_pending.c
@@ -12,7 +12,7 @@ static void test_signal_pending_by_type(enum bpf_prog_type prog_type)
struct itimerval timeo = {
.it_value.tv_usec = 10, /* 100ms */
};
-   __u32 duration, retval;
+   __u32 duration = 0, retval;
int prog_fd;
int err;
int i;
-- 
2.21.0.352.gf09ad66450-goog
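
To illustrate why the initialization matters, here is a small standalone
sketch. CHECK_SKETCH(), report() and run_test() are hypothetical and much
simpler than the real selftest macro; the only property they share with it
is the one the commit message describes, namely that the macro picks up a
local variable named duration from the calling scope without it being
passed explicitly.

```c
#include <stdio.h>

/* Helper behind the sketch macro: prints a result line and returns failure. */
static int report(int failed, const char *tag, unsigned int duration)
{
	printf("%s:%s (%u nsec)\n", tag, failed ? "FAIL" : "PASS", duration);
	return failed;
}

/*
 * Like the selftest CHECK(), this macro silently refers to a variable
 * called `duration` that must exist in the caller's scope.
 */
#define CHECK_SKETCH(cond, tag) report(!!(cond), (tag), duration)

static int run_test(int setup_err)
{
	/*
	 * Initialized to 0 because the first CHECK_SKETCH() runs before
	 * anything has measured a duration. Left uninitialized, that use
	 * would read an indeterminate value and draw a compiler warning.
	 */
	unsigned int duration = 0;

	if (CHECK_SKETCH(setup_err, "setup"))
		return -1;

	/* Stands in for the value a test run would report back. */
	duration = 1234;

	if (CHECK_SKETCH(0, "run"))
		return -1;

	return 0;
}
```

Calling run_test(0) prints two PASS lines and returns 0; run_test(1) stops
at the setup check and returns -1.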



Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread Liran Alon



> On 6 Mar 2019, at 23:42, si-wei liu  wrote:
> 
> 
> 
> On 3/6/2019 1:36 PM, Samudrala, Sridhar wrote:
>> 
>> On 3/6/2019 1:26 PM, si-wei liu wrote:
>>> 
>>> 
>>> On 3/6/2019 4:04 AM, Jiri Pirko wrote:
> --- a/net/core/failover.c
> +++ b/net/core/failover.c
> @@ -16,6 +16,11 @@
> 
> static LIST_HEAD(failover_list);
> static DEFINE_SPINLOCK(failover_lock);
> +static bool slave_rename_ok = true;
> +
> +module_param(slave_rename_ok, bool, (S_IRUGO | S_IWUSR));
> +MODULE_PARM_DESC(slave_rename_ok,
> +  "If set allow renaming the slave when failover master is up");
> 
 No module parameters please. If you need to set something do it using
 rtnl_link_ops. Thanks.
 
 
>>> I understand what you ask for, but without module parameters userspace 
>>> don't work. During boot (dracut) the virtio netdev gets enslaved earlier 
>>> than when userspace comes up, so failover has to determine the setting 
>>> during initialization/creation. This config is not dynamic, at least for 
>>> the life cycle of a particular failover link it shouldn't be changed. 
>>> Without module parameter, how does the userspace specify this value during 
>>> kernel initialization? 
>>> 
>> Can we enable this by default and not make it configurable via module 
>> parameter?
>> Is there any  usecase where someone expects rename to fail with failover 
>> slaves?
> Probably just cater for those application that assumes fixed name on UP 
> interface?
> 
> It's already the default for the configurable. I myself don't think that's a 
> big problem for failover users. So far there's not even QEMU support I think 
> everything can be changed. I don't feel strong to just fix it without 
> introducing configurable. But maybe Michael or others think it differently...
> 
> If no one objects, I don't feel strong to make it fixed behavior.
> 
> -Siwei
> 

I agree we should just remove the module parameter.

-Liran




[PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself

2019-03-06 Thread Al Viro
From: Al Viro 

simplifies the caller

Signed-off-by: Al Viro 
---
 fs/aio.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5837a29e63fe..af51b1360305 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1029,6 +1029,11 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
return NULL;
 
+   if (unlikely(!get_reqs_available(ctx))) {
+   kfree(req);
+   return NULL;
+   }
+
percpu_ref_get(&ctx->reqs);
req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
@@ -1805,13 +1810,9 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return -EINVAL;
}
 
-   if (!get_reqs_available(ctx))
-   return -EAGAIN;
-
-   ret = -EAGAIN;
req = aio_get_req(ctx);
if (unlikely(!req))
-   goto out_put_reqs_available;
+   return -EAGAIN;
 
req->ki_filp = fget(iocb->aio_fildes);
ret = -EBADF;
@@ -1883,7 +1884,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return 0;
 out_put_req:
iocb_put(req);
-out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
 }
-- 
2.11.0
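
The shape of this change can be sketched outside the kernel as below. All
names here (struct pool, pool_reserve(), pool_release(), req_get(),
req_put()) are invented for the example. As a simplification that differs
from fs/aio.c, req_put() in this sketch also returns the reserved slot. What
the sketch is meant to show is that once the allocator performs the
reservation itself, the caller has a single NULL check and no separate
unwind label for the reservation.

```c
#include <stdlib.h>

/* Hypothetical slot pool, standing in for the per-context request budget. */
struct pool { int available; };

struct req { struct pool *owner; };

/* Take one slot; returns 1 on success, 0 if none are left. */
static int pool_reserve(struct pool *p)
{
	if (p->available <= 0)
		return 0;
	p->available--;
	return 1;
}

static void pool_release(struct pool *p)
{
	p->available++;
}

/*
 * Allocation and reservation in one step. On any failure nothing is
 * held, so the caller only needs to test for NULL.
 */
static struct req *req_get(struct pool *p)
{
	struct req *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;

	if (!pool_reserve(p)) {
		free(r);
		return NULL;
	}

	r->owner = p;
	return r;
}

/* Sketch-only simplification: dropping the request gives back the slot. */
static void req_put(struct req *r)
{
	pool_release(r->owner);
	free(r);
}
```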



[PATCH 6/8] move dropping ->ki_eventfd into iocb_put()

2019-03-06 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 fs/aio.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5dd5f35d054c..5837a29e63fe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1066,6 +1066,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 
 static inline void iocb_put(struct aio_kiocb *iocb)
 {
+   if (iocb->ki_eventfd)
+   eventfd_ctx_put(iocb->ki_eventfd);
if (iocb->ki_filp)
fput(iocb->ki_filp);
percpu_ref_put(&iocb->ki_ctx->reqs);
@@ -1142,10 +1144,8 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
 * eventfd. The eventfd_signal() function is safe to be called
 * from IRQ context.
 */
-   if (iocb->ki_eventfd) {
+   if (iocb->ki_eventfd)
eventfd_signal(iocb->ki_eventfd, 1);
-   eventfd_ctx_put(iocb->ki_eventfd);
-   }
 
/*
 * We have to order our ring_info tail store above and test
@@ -1819,18 +1819,19 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
 
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
+   struct eventfd_ctx *eventfd;
/*
 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
 * instance of the file* now. The file descriptor must be
 * an eventfd() fd, and will be signaled for each completed
 * event using the eventfd_signal() function.
 */
-   req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
-   if (IS_ERR(req->ki_eventfd)) {
-   ret = PTR_ERR(req->ki_eventfd);
-   req->ki_eventfd = NULL;
+   eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
+   if (IS_ERR(eventfd)) {
+   ret = PTR_ERR(eventfd);
goto out_put_req;
}
+   req->ki_eventfd = eventfd;
}
 
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
@@ -1881,8 +1882,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
return 0;
 out_put_req:
-   if (req->ki_eventfd)
-   eventfd_ctx_put(req->ki_eventfd);
iocb_put(req);
 out_put_reqs_available:
put_reqs_available(ctx, 1);
-- 
2.11.0
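
A rough userspace sketch of the two ideas in this patch, under assumptions:
struct notifier, notifier_get(), notifier_put(), req_attach_notifier() and
req_put() are hypothetical, and the sketch signals failure with NULL where
the kernel code uses IS_ERR()/PTR_ERR(). First, the optional reference is
dropped inside the put helper, so no caller has to remember it. Second, the
lookup result goes into a local variable and is stored in the request only
on success, so the field never holds an error value that the put helper
would then try to release.

```c
#include <stdlib.h>

/* Hypothetical refcounted object, standing in for the eventfd context. */
struct notifier { int refs; };

struct req { struct notifier *note; };	/* note is optional, may be NULL */

/* Returns the object with a reference taken, or NULL on failure. */
static struct notifier *notifier_get(struct notifier *n)
{
	if (!n)
		return NULL;
	n->refs++;
	return n;
}

static void notifier_put(struct notifier *n)
{
	n->refs--;
}

/* Single release point: drops the optional reference, then the request. */
static void req_put(struct req *r)
{
	if (r->note)
		notifier_put(r->note);
	free(r);
}

/*
 * The result is held in a local first. On failure r->note is left
 * untouched, so a following req_put() remains safe.
 */
static int req_attach_notifier(struct req *r, struct notifier *candidate)
{
	struct notifier *n = notifier_get(candidate);

	if (!n)
		return -1;

	r->note = n;
	return 0;
}
```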


