date:20180914

Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-14 Thread kbuild test robot

Hi Joe,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Joe-Stringer/Add-socket-lookup-support/20180914-134632
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-s0-09141346 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   net/core/filter.c: In function 'sk_lookup':
>> net/core/filter.c:4870:1: error: invalid storage class for function 
>> 'bpf_sk_lookup'
bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
^
>> net/core/filter.c:4869:1: warning: ISO C90 forbids mixed declarations and 
>> code [-Wdeclaration-after-statement]
static unsigned long
^~
   In file included from include/net/sock.h:64:0,
from include/linux/sock_diag.h:8,
from net/core/filter.c:29:
>> include/linux/filter.h:432:6: error: invalid storage class for function 
>> 'bpf_sk_lookup_tcp'
 u64 ##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__));   \
 ^
>> include/linux/filter.h:446:31: note: in expansion of macro 'BPF_CALL_x'
#define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
  ^~
>> net/core/filter.c:4896:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
^~
>> net/core/filter.c:4896:12: error: static declaration of 'bpf_sk_lookup_tcp' 
>> follows non-static declaration
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
   ^
   include/linux/filter.h:434:6: note: in definition of macro 'BPF_CALL_x'
 u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__))\
 ^~~~
>> net/core/filter.c:4896:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
^~
   net/core/filter.c:4896:12: note: previous declaration of 'bpf_sk_lookup_tcp' 
was here
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
   ^
   include/linux/filter.h:433:6: note: in definition of macro 'BPF_CALL_x'
 u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__));\
 ^~~~
>> net/core/filter.c:4896:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
^~
   net/core/filter.c: In function 'bpf_sk_lookup_tcp':
>> include/linux/filter.h:436:10: error: implicit declaration of function 
>> 'bpf_sk_lookup_tcp' [-Werror=implicit-function-declaration]
  return ##name(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
 ^
>> include/linux/filter.h:446:31: note: in expansion of macro 'BPF_CALL_x'
#define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
  ^~
>> net/core/filter.c:4896:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
^~
   net/core/filter.c: In function 'sk_lookup':
   include/linux/filter.h:439:6: error: invalid storage class for function 
'bpf_sk_lookup_tcp'
 u64 ##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__))
 ^
>> include/linux/filter.h:446:31: note: in expansion of macro 'BPF_CALL_x'
#define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
  ^~
>> net/core/filter.c:4896:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
^~
>> net/core/filter.c:4903:11: error: initializer element is not constant
 .func  = bpf_sk_lookup_tcp,
  ^
   net/core/filter.c:4903:11: note: (near initialization for 
'bpf_sk_lookup_tcp_proto.func')
   In file included from include/net/sock.h:64:0,
from include/linux/sock_diag.h:8,
from net/core/filter.c:29:
>> include/linux/filter.h:432:6: error: invalid storage class for function 
>> 'bpf_sk_lookup_udp'
 u64 ##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__));   \
 ^
>> include/linux/filter.h:446:31: note: in expansion of macro 'BPF_CALL_x'
#define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__)
  ^~
   net/core/filter.c:4913:1: note: in expansion of macro 'BPF_CALL_5'
BPF_CALL_5(bpf_sk_lookup_ud

[PATCH][net-next] net: move definition of pcpu_lstats to header file

2018-09-14 Thread Li RongQing

pcpu_lstats is defined in several files, so unify the definitions
into one and move it to a header file.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/net/loopback.c|  6 --
 drivers/net/nlmon.c   |  6 --
 drivers/net/vsockmon.c| 14 --
 include/linux/netdevice.h |  6 ++
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 30612497643c..a7207fa7e451 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -59,12 +59,6 @@
 #include 
 #include 
 
-struct pcpu_lstats {
-   u64 packets;
-   u64 bytes;
-   struct u64_stats_sync   syncp;
-};
-
 /* The higher levels take care of making this non-reentrant (it's
  * called with bh's disabled).
  */
diff --git a/drivers/net/nlmon.c b/drivers/net/nlmon.c
index 4b22955de191..dd0db7534cb3 100644
--- a/drivers/net/nlmon.c
+++ b/drivers/net/nlmon.c
@@ -6,12 +6,6 @@
 #include 
 #include 
 
-struct pcpu_lstats {
-   u64 packets;
-   u64 bytes;
-   struct u64_stats_sync syncp;
-};
-
 static netdev_tx_t nlmon_xmit(struct sk_buff *skb, struct net_device *dev)
 {
int len = skb->len;
diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index c28bdce14fd5..7bad5c95551f 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -11,12 +11,6 @@
 #define DEFAULT_MTU (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + \
 sizeof(struct af_vsockmon_hdr))
 
-struct pcpu_lstats {
-   u64 rx_packets;
-   u64 rx_bytes;
-   struct u64_stats_sync syncp;
-};
-
 static int vsockmon_dev_init(struct net_device *dev)
 {
dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
@@ -56,8 +50,8 @@ static netdev_tx_t vsockmon_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats);
 
u64_stats_update_begin(&stats->syncp);
-   stats->rx_bytes += len;
-   stats->rx_packets++;
+   stats->bytes += len;
+   stats->packets++;
u64_stats_update_end(&stats->syncp);
 
dev_kfree_skb(skb);
@@ -80,8 +74,8 @@ vsockmon_get_stats64(struct net_device *dev, struct 
rtnl_link_stats64 *stats)
 
do {
start = u64_stats_fetch_begin_irq(&vstats->syncp);
-   tbytes = vstats->rx_bytes;
-   tpackets = vstats->rx_packets;
+   tbytes = vstats->bytes;
+   tpackets = vstats->packets;
} while (u64_stats_fetch_retry_irq(&vstats->syncp, start));
 
packets += tpackets;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e2b3bd750c98..baed5d5088c5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2382,6 +2382,12 @@ struct pcpu_sw_netstats {
struct u64_stats_sync   syncp;
 };
 
+struct pcpu_lstats {
+   u64 packets;
+   u64 bytes;
+   struct u64_stats_sync syncp;
+};
+
 #define __netdev_alloc_pcpu_stats(type, gfp)   \
 ({ \
typeof(type) __percpu *pcpu_stats = alloc_percpu_gfp(type, gfp);\
-- 
2.16.2
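
For readers following along, here is a minimal userspace sketch of how a unified packets/bytes structure like the one above is typically updated per CPU and summed by a reader. The names lstats_sketch, lstats_account and lstats_sum are hypothetical, and the struct u64_stats_sync member and its begin/retry protocol are deliberately left out, so this is not the driver code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the unified struct. The kernel version also
 * carries a struct u64_stats_sync, which is omitted in this sketch. */
struct lstats_sketch {
	uint64_t packets;
	uint64_t bytes;
};

/* Account one packet of `len` bytes in one CPU's slot. */
static void lstats_account(struct lstats_sketch *s, uint64_t len)
{
	assert(s != NULL);
	s->bytes += len;
	s->packets++;
}

/* Sum every per-CPU slot into totals, as a stats reader would. */
static void lstats_sum(const struct lstats_sketch *percpu, size_t ncpus,
		       uint64_t *packets, uint64_t *bytes)
{
	uint64_t p = 0, b = 0;

	for (size_t i = 0; i < ncpus; i++) {
		p += percpu[i].packets;
		b += percpu[i].bytes;
	}
	*packets = p;
	*bytes = b;
}
```

Because all three drivers only ever count in one direction, the shared field names packets and bytes are enough, which is why the rx_ prefix in vsockmon can go.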

Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-14 Thread Jason Wang





On 2018-09-14 12:46, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:53 PM Jason Wang  wrote:



On 2018-09-14 11:40, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:


On 2018-09-13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:

On 2018-09-13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:

On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:

On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of private flags, for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?

I mean if device is not up, there's no need to toggle napi state and tx
lock.


And to respond to the other follow-up notes at once:


Considering that we may have interrupt moderation in the future, I tend
to prefer set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.

Ok.

Since you prefer coalesce, let's go with that (and a revision of your
latest patch).

Good to know this.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to enable the cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.

But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?

Right. It's a bit of a roundabout way to ensure that
netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
It might make more sense to just wake the device without going through
an interrupt.

I'm not sure I get this. If we don't enable tx napi, we tend to delay the
TX interrupt when we find the ring is about to be full, to avoid an
interrupt storm, so we're probably ok in this case.

I'm only concerned about the transition state when converting from
napi to no-napi when the queue is stopped and tx interrupt disabled.

With napi mode the interrupt is only disabled if napi is scheduled,
in which case it will eventually reenable the interrupt. But when
switching to no-napi mode in this state no progress will be made.

But it seems this cannot happen. When converting to no-napi
mode, set_coalesce waits for napi to complete in napi_disable.
So the interrupt should always start enabled when transitioning
into no-napi mode.


Yes, I see.

Thanks
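
As an illustration of the encoding from the original patch description quoted above (tx napi toggled through bit 10 of tx-frames), here is a minimal userspace sketch. The macro and function names are hypothetical and are not taken from the virtio_net patch; only the choice of bit 10 comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the flag; bit 10 has the value 1024. */
#define TX_NAPI_FLAG_SKETCH (1u << 10)

/* True if the tx-frames value asks for tx napi. */
static bool tx_frames_wants_napi(uint32_t tx_frames)
{
	return (tx_frames & TX_NAPI_FLAG_SKETCH) != 0;
}

/* The part of tx-frames left over for real interrupt moderation. */
static uint32_t tx_frames_moderation(uint32_t tx_frames)
{
	return tx_frames & ~TX_NAPI_FLAG_SKETCH;
}

/* Combine a moderation value with the napi flag. */
static uint32_t tx_frames_encode(uint32_t frames, bool napi)
{
	/* The moderation value must not already occupy the flag bit. */
	assert((frames & TX_NAPI_FLAG_SKETCH) == 0);
	return napi ? (frames | TX_NAPI_FLAG_SKETCH) : frames;
}
```

Under this scheme the default of 1 frame with tx napi enabled would read back as 1025, which is the kind of overloading that prompted the priv_flags suggestion earlier in the thread.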

Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-09-14 Thread Saeed Mahameed

On Thu, Sep 13, 2018 at 11:36 PM, Jesper Dangaard Brouer
 wrote:
> On Thu, 13 Sep 2018 15:55:29 -0700
> Alexei Starovoitov  wrote:
>
>> On Thu, Aug 30, 2018 at 1:35 AM, Tariq Toukan  wrote:
>> >
>> >
>> > On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:
>> >>
>> >> Hi Saeed,
>> >>
>> >> I'm having issues loading mlx5 driver on v4.19 kernels (tested both
>> >> net-next and bpf-next), while kernel v4.18 seems to work.  It happens
>> >> with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
>> >> from the system now).
>> >>
>> >
>> > Hi Jesper,
>> >
>> > Thanks for your report!
>> >
>> > We are working to analyze and debug the issue.
>>
>> looks like serious issue to me... while no news in 2 weeks.
>> any update?
>
> Mellanox took it offlist, and Sep 6th found that this is a regression
> introduced by commit 269d26f47f6f ("net/mlx5: Reduce command polling
> interval"), but only if CONFIG_PREEMPT is on.
>
> I can confirm that reverting this commit fixed the issue (and not the
> firmware upgrade I also did).
>
> I think Moshe (Cc) is responsible for this case, and I expect to soon
> see a revert or alternative solution to this!?
>
> Thanks for the kick Alexei :-)

Thank you, Alexei and Jesper, for following up.
The fix is already being tested [1] and will be submitted tomorrow.
As Jesper pointed out, the issue happens only with 269d26f47f6f
("net/mlx5: Reduce command polling interval"), and only if
CONFIG_PREEMPT is on.
The only affected kernel is 4.19, which is not GA yet.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=net-mlx5

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

[PATCH 1/1] net: rds: use memset to optimize the recv

2018-09-14 Thread Zhu Yanjun

The function rds_inc_init is in the receive path. Using memset to clear
i_rx_lat_trace instead of a loop speeds it up.
The test result:

Before:
1) + 24.950 us   |rds_inc_init [rds]();
After:
1) + 10.990 us   |rds_inc_init [rds]();

Signed-off-by: Zhu Yanjun 
---
 net/rds/recv.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6bcc54c..a9399ddbb7bf 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -43,8 +43,6 @@
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 struct in6_addr *saddr)
 {
-   int i;
-
refcount_set(&inc->i_refcount, 1);
INIT_LIST_HEAD(&inc->i_item);
inc->i_conn = conn;
@@ -53,8 +51,7 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_rx_tstamp.tv_sec = 0;
inc->i_rx_tstamp.tv_usec = 0;
 
-   for (i = 0; i < RDS_RX_MAX_TRACES; i++)
-   inc->i_rx_lat_trace[i] = 0;
+   memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
-- 
2.17.1
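
The following userspace sketch shows the equivalence the patch relies on: zeroing an array member element by element and zeroing it with memset give the same result. The struct and the constant RX_MAX_TRACES_SKETCH are hypothetical stand-ins for struct rds_incoming and RDS_RX_MAX_TRACES, and the sketch makes no claim about the timing numbers quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for RDS_RX_MAX_TRACES. */
#define RX_MAX_TRACES_SKETCH 3

/* Hypothetical stand-in for the relevant part of struct rds_incoming. */
struct inc_sketch {
	uint64_t lat_trace[RX_MAX_TRACES_SKETCH];
};

/* The form removed by the patch: clear one element at a time. */
static void clear_with_loop(struct inc_sketch *inc)
{
	assert(inc != NULL);
	for (int i = 0; i < RX_MAX_TRACES_SKETCH; i++)
		inc->lat_trace[i] = 0;
}

/* The form added by the patch: clear the whole array in one call.
 * sizeof is applied to the array member, so it covers every element. */
static void clear_with_memset(struct inc_sketch *inc)
{
	assert(inc != NULL);
	memset(inc->lat_trace, 0, sizeof(inc->lat_trace));
}
```

The sizeof expression only works this way because lat_trace is an array inside the struct; with a pointer member it would yield the pointer size instead.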

Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-09-14 Thread Jesper Dangaard Brouer

On Fri, 14 Sep 2018 01:22:15 -0700
Saeed Mahameed  wrote:

> On Thu, Sep 13, 2018 at 11:36 PM, Jesper Dangaard Brouer
>  wrote:
> > On Thu, 13 Sep 2018 15:55:29 -0700
> > Alexei Starovoitov  wrote:
> >  
> >> On Thu, Aug 30, 2018 at 1:35 AM, Tariq Toukan  wrote: 
> >>  
> >> >
> >> >
> >> > On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:  
> >> >>
> >> >> Hi Saeed,
> >> >>
> >> >> I'm having issues loading mlx5 driver on v4.19 kernels (tested both
> >> >> net-next and bpf-next), while kernel v4.18 seems to work.  It happens
> >> >> with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
> >> >> from the system now).
> >> >>  
> >> >
> >> > Hi Jesper,
> >> >
> >> > Thanks for your report!
> >> >
> >> > We are working to analyze and debug the issue.  
> >>
> >> looks like serious issue to me... while no news in 2 weeks.
> >> any update?  
> >
> > Mellanox took it offlist, and Sep 6th found that this is a regression
> > introduced by commit 269d26f47f6f ("net/mlx5: Reduce command polling
> > interval"), but only if CONFIG_PREEMPT is on.
> >
> > I can confirm that reverting this commit fixed the issue (and not the
> > firmware upgrade I also did).
> >
> > I think Moshe (Cc) is responsible for this case, and I expect to soon
> > see a revert or alternative solution to this!?
> >
> > Thanks for the kick Alexei :-)  
> 
> Thank you, Alexei and Jesper, for following up.
> The fix is already being tested [1] and will be submitted tomorrow.
> As Jesper pointed out, the issue happens only with 269d26f47f6f
> ("net/mlx5: Reduce command polling interval"), and only if
> CONFIG_PREEMPT is on.
> The only affected kernel is 4.19, which is not GA yet.
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=net-mlx5

Sounds good.

I would appreciate it if you added a:

Reported-by: Jesper Dangaard Brouer 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net-next 5/8] bnxt_en: Use hw_tc_offload and ignore_ari devlink parameters

2018-09-14 Thread kbuild test robot

Hi Vasundhara,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Vasundhara-Volam/bnxt_en-devlink-param-updates/20180914-141937
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/broadcom/bnxt/bnxt_devlink.c: In function 
'bnxt_hwrm_nvm_req.constprop':
   drivers/net//ethernet/broadcom/bnxt/bnxt_devlink.c:38:27: warning: 
'nvm_param.num_bits' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 struct bnxt_dl_nvm_param nvm_param;
  ^
   drivers/net//ethernet/broadcom/bnxt/bnxt_devlink.c:53:5: warning: 
'nvm_param.dir_type' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 if (nvm_param.dir_type == BNXT_NVM_PORT_CFG)
^
   In file included from include/linux/byteorder/big_endian.h:5:0,
from arch/powerpc/include/uapi/asm/byteorder.h:14,
from include/asm-generic/bitops/le.h:6,
from arch/powerpc/include/asm/bitops.h:247,
from include/linux/bitops.h:19,
from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/pci.h:26,
from drivers/net//ethernet/broadcom/bnxt/bnxt_devlink.c:10:
>> include/uapi/linux/byteorder/big_endian.h:35:27: warning: 'nvm_param.offset' 
>> may be used uninitialized in this function [-Wmaybe-uninitialized]
#define __cpu_to_le16(x) ((__force __le16)__swab16((x)))
  ^
   drivers/net//ethernet/broadcom/bnxt/bnxt_devlink.c:38:27: note: 
'nvm_param.offset' was declared here
 struct bnxt_dl_nvm_param nvm_param;
  ^
--
   drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c: In function 
'bnxt_hwrm_nvm_req.constprop':
   drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c:38:27: warning: 
'nvm_param.num_bits' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 struct bnxt_dl_nvm_param nvm_param;
  ^
   drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c:53:5: warning: 
'nvm_param.dir_type' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 if (nvm_param.dir_type == BNXT_NVM_PORT_CFG)
^
   In file included from include/linux/byteorder/big_endian.h:5:0,
from arch/powerpc/include/uapi/asm/byteorder.h:14,
from include/asm-generic/bitops/le.h:6,
from arch/powerpc/include/asm/bitops.h:247,
from include/linux/bitops.h:19,
from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/pci.h:26,
from drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c:10:
>> include/uapi/linux/byteorder/big_endian.h:35:27: warning: 'nvm_param.offset' 
>> may be used uninitialized in this function [-Wmaybe-uninitialized]
#define __cpu_to_le16(x) ((__force __le16)__swab16((x)))
  ^
   drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c:38:27: note: 
'nvm_param.offset' was declared here
 struct bnxt_dl_nvm_param nvm_param;
  ^

vim +35 include/uapi/linux/byteorder/big_endian.h

5921e6f8 David Howells 2012-10-13  14  
5921e6f8 David Howells 2012-10-13  15  #define __constant_htonl(x) ((__force 
__be32)(__u32)(x))
5921e6f8 David Howells 2012-10-13  16  #define __constant_ntohl(x) ((__force 
__u32)(__be32)(x))
5921e6f8 David Howells 2012-10-13  17  #define __constant_htons(x) ((__force 
__be16)(__u16)(x))
5921e6f8 David Howells 2012-10-13  18  #define __constant_ntohs(x) ((__force 
__u16)(__be16)(x))
5921e6f8 David Howells 2012-10-13  19  #define __constant_cpu_to_le64(x) 
((__force __le64)___constant_swab64((x)))
5921e6f8 David Howells 2012-10-13  20  #define __constant_le64_to_cpu(x) 
___constant_swab64((__force __u64)(__le64)(x))
5921e6f8 David Howells 2012-10-13  21  #define __constant_cpu_to_le32(x) 
((__force __le32)___constant_swab32((x)))
5921e6f8 David Howells 2012-10-13  22  #define __constant_le32_to_cpu(x) 
___constant_swab32((__force __u32)(__le32)(x))
5921e6f8 David Howells 2012-10-13  23  #define __constant

[PATCH net-next] cxgb4: Fix endianness issue in t4_fwcache()

2018-09-14 Thread Ganesh Goudar

Do not put a host-endian 0 or 1 into a big-endian field.

Reported-by: Al Viro 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index c28a1d8..f85eab5 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -3889,7 +3889,7 @@ int t4_fwcache(struct adapter *adap, enum 
fw_params_param_dev_fwcache op)
c.param[0].mnem =
cpu_to_be32(FW_PARAMS_MNEM_V(FW_PARAMS_MNEM_DEV) |
FW_PARAMS_PARAM_X_V(FW_PARAMS_PARAM_DEV_FWCACHE));
-   c.param[0].val = (__force __be32)op;
+   c.param[0].val = cpu_to_be32(op);
 
return t4_wr_mbox(adap, adap->mbox, &c, sizeof(c), NULL);
 }
-- 
2.1.0
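
To make the difference between the cast and the conversion concrete, here is a small userspace sketch. to_be32_sketch is a hypothetical, portable stand-in for the kernel's cpu_to_be32(), and wire_low_byte is a hypothetical helper for inspecting the result; neither is part of the cxgb4 driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for cpu_to_be32(): returns a value whose in-memory
 * bytes are in big-endian order, whatever the host byte order is. */
static uint32_t to_be32_sketch(uint32_t host)
{
	uint8_t bytes[4] = {
		(uint8_t)(host >> 24), (uint8_t)(host >> 16),
		(uint8_t)(host >> 8), (uint8_t)host,
	};
	uint32_t be;

	assert(sizeof(be) == sizeof(bytes));
	memcpy(&be, bytes, sizeof(be));
	return be;
}

/* Last byte in memory of a 32-bit field, i.e. the least significant
 * byte when the field is read as big-endian. */
static uint8_t wire_low_byte(uint32_t field)
{
	uint8_t bytes[4];

	memcpy(bytes, &field, sizeof(bytes));
	return bytes[3];
}
```

A bare cast leaves the host byte order in place, so on a little-endian machine the value 1 would sit in the first byte of the field rather than the last; the conversion puts it in the last byte on every host.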

Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-14 Thread Or Gerlitz

On Fri, Sep 14, 2018 at 1:31 AM, Jesse Brandeburg
 wrote:

Hi Jesse,

> This series contains changes to i40evf so that it becomes a more
> generic virtual function driver for current and future silicon.
>
> While doing the rename of i40evf to a more generic name of iavf,
> we also put the driver on a severe diet due to how much of the
> code was unneeded or was unused.  The outcome is a lean and mean
> virtual function driver that continues to work on existing 40GbE
> (i40e) virtual devices and prepped for future supported devices,
> like the 100GbE (ice) virtual devices.

On what HW ring format do you standardize? Can the i40e/Fortville and
ice/what's-the-intel-code-name HWs use the same posting/completion
descriptors?

> This solves 2 issues we saw coming or were already present, the
> first was constant code duplication happening with i40e/i40evf,
> when much of the duplicate code in the i40evf was not used or was
> not needed.

Could you spare a few words on the origin/nature of these duplicates? Were
they just developer copy-and-paste mistakes for functionality which is
irrelevant for a VF? Like what?
If not, what was there?

> The second was to remove the future confusion of why
> future VF devices that were not considered "40GbE" only devices
> were supported by i40evf.

Can you elaborate further?

> The thought is that iavf will be the virtual function driver for
> all future devices, so it should have a "generic" name to properly
> represent that it is the VF driver for multiple generations of
> devices.

To that end, as I think was explained at the netdev Tokyo AVF session,
you would need a mechanism for feature negotiation. Is it here or coming up?

>  41 files changed, 3436 insertions(+), 7581 deletions(-)

code diet is cool!

[PATCH net-next] cxgb4: add per rx-queue counter for packet errors

2018-09-14 Thread Ganesh Goudar

Print per-rx-queue packet errors in sge_qinfo.

Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 1 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c   | 4 
 3 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 298701ed..b5010bd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -692,6 +692,7 @@ struct sge_eth_stats {  /* Ethernet queue 
statistics */
unsigned long rx_cso;   /* # of Rx checksum offloads */
unsigned long vlan_ex;  /* # of Rx VLAN extractions */
unsigned long rx_drops; /* # of packets dropped due to no mem */
+   unsigned long bad_rx_pkts;  /* # of packets with err_vec!=0 */
 };
 
 struct sge_eth_rxq {/* SW Ethernet Rx queue */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 0f72f9c..cab492e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2784,6 +2784,7 @@ do { \
RL("LROmerged:", stats.lro_merged);
RL("LROpackets:", stats.lro_pkts);
RL("RxDrops:", stats.rx_drops);
+   RL("RxBadPkts:", stats.bad_rx_pkts);
TL("TSO:", tso);
TL("TxCSO:", tx_cso);
TL("VLANins:", vlan_ins);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6807bc3..b901884 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2830,6 +2830,10 @@ int t4_ethrx_handler(struct sge_rspq *q, const __be64 
*rsp,
 
csum_ok = pkt->csum_calc && !err_vec &&
  (q->netdev->features & NETIF_F_RXCSUM);
+
+   if (err_vec)
+   rxq->stats.bad_rx_pkts++;
+
if (((pkt->l2info & htonl(RXF_TCP_F)) ||
 tnl_hdr_len) &&
(q->netdev->features & NETIF_F_GRO) && csum_ok && !pkt->ip_frag) {
-- 
2.1.0

Re: [net-next,RFC PATCH] Introduce TC Range classifier

2018-09-14 Thread Jiri Pirko

Thu, Sep 13, 2018 at 10:52:01PM CEST, amritha.namb...@intel.com wrote:
>This patch introduces a TC range classifier to support filtering based
>on ranges. Only port-range filters are supported currently. This can
>be combined with flower classifier to support filters that are a
>combination of port-ranges and other parameters based on existing
>fields supported by cls_flower. The 'goto chain' action can be used to
>combine the flower and range filter.
>The filter precedence is decided based on the 'prio' value.

For example, the Spectrum ASIC supports mask-based and range-based matching
in a single TCAM rule. No chains needed. Also, I don't really understand
why this is a separate cls. I believe that this functionality should be
added as an extension of the existing cls_flower.

Re: [net-next, RFC PATCH] net: sched: cls_range: Introduce Range classifier

2018-09-14 Thread Jiri Pirko

Thu, Sep 13, 2018 at 10:52:06PM CEST, amritha.namb...@intel.com wrote:

[...]

>+static struct cls_range_filter *range_lookup(struct cls_range_head *head,
>+   struct range_flow_key *key,
>+   struct range_flow_key *mkey,
>+   bool is_skb)
>+{
>+  struct cls_range_filter *filter, *next_filter;
>+  struct range_params range;
>+  int ret;
>+  size_t cmp_size;
>+
>+  list_for_each_entry_safe(filter, next_filter, &head->filters, flist) {

This really should be list_for_each_entry_rcu()

Also, as I wrote in the previous email, this should be done in
cls_flower. Look at fl_lookup(): it looks up a hashtable. You just need to
add linked list traversal and range comparison to that function for the
hit in the hashtable.


>+  if (!is_skb) {
>+  /* Existing filter comparison */
>+  cmp_size = sizeof(filter->mkey);
>+  } else {
>+  /* skb classification */
>+  ret = range_compare_params(&range, filter, key,
>+ RANGE_PORT_DST);
>+  if (ret < 0)
>+  continue;
>+
>+  ret = range_compare_params(&range, filter, key,
>+ RANGE_PORT_SRC);
>+  if (ret < 0)
>+  continue;
>+
>+  /* skb does not have min and max values */
>+  cmp_size = RANGE_KEY_MEMBER_OFFSET(tp_min);
>+  }
>+  if (!memcmp(mkey, &filter->mkey, cmp_size))
>+  return filter;
>+  }
>+  return NULL;

[...]
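
The list traversal plus range comparison discussed in this review can be sketched in userspace as follows. All names here (port_range_sketch, range_filter_sketch, port_in_range, range_lookup_sketch) are hypothetical, an array stands in for the kernel linked list, and ports are plain host-order integers, so this only illustrates the matching logic, not the proposed cls_range or cls_flower code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive port range, host byte order (an assumption of this sketch). */
struct port_range_sketch {
	uint16_t min;
	uint16_t max;
};

/* One filter: a destination-port range and a source-port range. */
struct range_filter_sketch {
	struct port_range_sketch dst;
	struct port_range_sketch src;
};

/* True if `port` lies within [min, max]. */
static bool port_in_range(const struct port_range_sketch *r, uint16_t port)
{
	assert(r->min <= r->max);
	return port >= r->min && port <= r->max;
}

/* Walk the filters in order and return the index of the first one
 * whose destination and source ranges both contain the packet's
 * ports, or -1 if none matches. */
static int range_lookup_sketch(const struct range_filter_sketch *filters,
			       size_t n, uint16_t sport, uint16_t dport)
{
	for (size_t i = 0; i < n; i++) {
		if (!port_in_range(&filters[i].dst, dport))
			continue;
		if (!port_in_range(&filters[i].src, sport))
			continue;
		return (int)i;
	}
	return -1;
}
```

The walk is linear in the number of filters, which is one reason to run it only on the entries that already hit in the hashtable, as suggested above.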

[PATCH net] net/sched: act_sample: fix NULL dereference in the data path

2018-09-14 Thread Davide Caratti

Matteo reported the following splat, testing the datapath of TC 'sample':

 BUG: KASAN: null-ptr-deref in tcf_sample_act+0xc4/0x310
 Read of size 8 at addr  by task nc/433

 CPU: 0 PID: 433 Comm: nc Not tainted 4.19.0-rc3-kvm #17
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
 Call Trace:
  kasan_report.cold.6+0x6c/0x2fa
  tcf_sample_act+0xc4/0x310
  ? dev_hard_start_xmit+0x117/0x180
  tcf_action_exec+0xa3/0x160
  tcf_classify+0xdd/0x1d0
  htb_enqueue+0x18e/0x6b0
  ? deref_stack_reg+0x7a/0xb0
  ? htb_delete+0x4b0/0x4b0
  ? unwind_next_frame+0x819/0x8f0
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  __dev_queue_xmit+0x722/0xca0
  ? unwind_get_return_address_ptr+0x50/0x50
  ? netdev_pick_tx+0xe0/0xe0
  ? save_stack+0x8c/0xb0
  ? kasan_kmalloc+0xbe/0xd0
  ? __kmalloc_track_caller+0xe4/0x1c0
  ? __kmalloc_reserve.isra.45+0x24/0x70
  ? __alloc_skb+0xdd/0x2e0
  ? sk_stream_alloc_skb+0x91/0x3b0
  ? tcp_sendmsg_locked+0x71b/0x15a0
  ? tcp_sendmsg+0x22/0x40
  ? __sys_sendto+0x1b0/0x250
  ? __x64_sys_sendto+0x6f/0x80
  ? do_syscall_64+0x5d/0x150
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ? __sys_sendto+0x1b0/0x250
  ? __x64_sys_sendto+0x6f/0x80
  ? do_syscall_64+0x5d/0x150
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ip_finish_output2+0x495/0x590
  ? ip_copy_metadata+0x2e0/0x2e0
  ? skb_gso_validate_network_len+0x6f/0x110
  ? ip_finish_output+0x174/0x280
  __tcp_transmit_skb+0xb17/0x12b0
  ? __tcp_select_window+0x380/0x380
  tcp_write_xmit+0x913/0x1de0
  ? __sk_mem_schedule+0x50/0x80
  tcp_sendmsg_locked+0x49d/0x15a0
  ? tcp_rcv_established+0x8da/0xa30
  ? tcp_set_state+0x220/0x220
  ? clear_user+0x1f/0x50
  ? iov_iter_zero+0x1ae/0x590
  ? __fget_light+0xa0/0xe0
  tcp_sendmsg+0x22/0x40
  __sys_sendto+0x1b0/0x250
  ? __ia32_sys_getpeername+0x40/0x40
  ? _copy_to_user+0x58/0x70
  ? poll_select_copy_remaining+0x176/0x200
  ? __pollwait+0x1c0/0x1c0
  ? ktime_get_ts64+0x11f/0x140
  ? kern_select+0x108/0x150
  ? core_sys_select+0x360/0x360
  ? vfs_read+0x127/0x150
  ? kernel_write+0x90/0x90
  __x64_sys_sendto+0x6f/0x80
  do_syscall_64+0x5d/0x150
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fefef2b129d
 Code: ff ff ff ff eb b6 0f 1f 80 00 00 00 00 48 8d 05 51 37 0c 00 41 89 ca 8b 
00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b 
f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41
 RSP: 002b:7fff2f5350c8 EFLAGS: 0246 ORIG_RAX: 002c
 RAX: ffda RBX: 56118d60c120 RCX: 7fefef2b129d
 RDX: 2000 RSI: 56118d629320 RDI: 0003
 RBP: 56118d530370 R08:  R09: 
 R10:  R11: 0246 R12: 2000
 R13: 56118d5c2a10 R14: 56118d5c2a10 R15: 56118d5303b8

tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init()
forgot to allocate them, because tcf_idr_create() was called with a wrong
value of 'cpustats'. Setting it to true proved to fix the reported crash.

Reported-by: Matteo Croce 
Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR")
Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Tested-by: Matteo Croce 
Signed-off-by: Davide Caratti 
---
 net/sched/act_sample.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 44e9c00657bc..6b67aa13d2dd 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -69,7 +69,7 @@ static int tcf_sample_init(struct net *net, struct nlattr 
*nla,
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
-&act_sample_ops, bind, false);
+&act_sample_ops, bind, true);
if (ret) {
tcf_idr_cleanup(tn, parm->index);
return ret;
-- 
2.17.1
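
The failure mode described in the commit message can be modelled in userspace as below. The type action_sketch and its functions are hypothetical; the real code goes through tcf_idr_create() and per-CPU allocators, which this sketch replaces with calloc() and a single counter. The only point carried over is that the data path assumes storage which is allocated only when the creation flag is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of an action with optional per-CPU stats. */
struct action_sketch {
	uint64_t *cpu_packets;	/* NULL unless created with cpustats == true */
};

/* Create the action; stats storage exists only if `cpustats` is true. */
static int action_create(struct action_sketch *a, bool cpustats)
{
	assert(a != NULL);
	a->cpu_packets = NULL;
	if (cpustats) {
		a->cpu_packets = calloc(1, sizeof(*a->cpu_packets));
		if (!a->cpu_packets)
			return -1;
	}
	return 0;
}

/* Data path that expects per-CPU stats. It returns false where an
 * unchecked update would dereference NULL. */
static bool action_count_percpu(struct action_sketch *a)
{
	if (!a->cpu_packets)
		return false;
	(*a->cpu_packets)++;
	return true;
}

/* Release the stats storage, if any. */
static void action_destroy(struct action_sketch *a)
{
	free(a->cpu_packets);
	a->cpu_packets = NULL;
}
```

In this model, creating with false and then counting is the crashing combination; the one-line fix above corresponds to passing true at creation time rather than adding a check in the data path.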

Re: [PATCH net-next 0/8] bnxt_en: devlink param updates

2018-09-14 Thread Jiri Pirko

Fri, Sep 14, 2018 at 06:17:07AM CEST, vasundhara-v.vo...@broadcom.com wrote:
>On Wed, Sep 12, 2018 at 3:20 PM Jakub Kicinski
> wrote:
>>
>> On Wed, 12 Sep 2018 12:09:37 +0530, Vasundhara Volam wrote:
>> > On Tue, Sep 11, 2018 at 5:04 PM Jakub Kicinski wrote:
>> > > On Tue, 11 Sep 2018 14:14:57 +0530, Vasundhara Volam wrote:
>> > > > This patchset adds support for 4 generic and 1 driver-specific devlink
>> > > > parameters.
>> > > >
>> > > > Also, this patchset adds support to return proper error code if
>> > > > HWRM_NVM_GET/SET_VARIABLE commands return error code
>> > > > HWRM_ERR_CODE_RESOURCE_ACCESS_DENIED.
>> > > >
>> > > > Vasundhara Volam (8):
>> > > >   devlink: Add generic parameter hw_tc_offload
>> > >
>> > > Much like Jiri, I can't help but wonder why do you need this?
>> >
>> > There is a request from our customer for a way to toggle tc_offload
>> > feature in our adapter.
>>
>> Vasundhara, again, we don't need to know who asked you to do this, but
>> _why_.  What problem are you solving?  What is the customer trying to
>> achieve?
>For brand-new big features like TC_offload, a few customers are not willing
>to enable it by default in the adapter (firmware). Disabling TC_offload by
>default in the adapter was a subjective decision.

Again, why? Why can't it be enabled in FW and just enabled/disabled by
an ethtool flag? Don't say that "customers want it" please...


>>
>> > > >   devlink: Add generic parameter ignore_ari
>> > > >   devlink: Add generic parameter msix_vec_per_pf_max
>> > > >   devlink: Add generic parameter msix_vec_per_pf_min
>> > >
>> > > IMHO more structured API would be preferable if possible.  The string
>> > > keys won't scale if you want to set the parameters per PF, and
>> > > creating more structured API for PCIe which is a relatively slow
>> > > moving HW spec seems tractable.
>> >
>> > Sorry, could you please suggest an example? We will try to adapt.
>>
>> My thinking was that the same way devlink device has ports, it should
>> have PCIe functions as objects which then have attributes.  Instead of
>> making everything a string-identified device attribute.  But I'm not
>> dead set on this if others don't think it's a good idea.
>Actually, these parameters are for the port, but the value given to each param
>applies to an individual PF. That's the reason I added the "per_pf" string.
>If you think this is not a good idea, I can make these params driver-specific.

Re: [PATCH net] net/sched: act_sample: fix NULL dereference in the data path

2018-09-14 Thread Jiri Pirko

Fri, Sep 14, 2018 at 12:03:18PM CEST, dcara...@redhat.com wrote:
>Matteo reported the following splat, testing the datapath of TC 'sample':
>
> BUG: KASAN: null-ptr-deref in tcf_sample_act+0xc4/0x310
> Read of size 8 at addr  by task nc/433
>
> CPU: 0 PID: 433 Comm: nc Not tainted 4.19.0-rc3-kvm #17
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
> Call Trace:
>  kasan_report.cold.6+0x6c/0x2fa
>  tcf_sample_act+0xc4/0x310
>  ? dev_hard_start_xmit+0x117/0x180
>  tcf_action_exec+0xa3/0x160
>  tcf_classify+0xdd/0x1d0
>  htb_enqueue+0x18e/0x6b0
>  ? deref_stack_reg+0x7a/0xb0
>  ? htb_delete+0x4b0/0x4b0
>  ? unwind_next_frame+0x819/0x8f0
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  __dev_queue_xmit+0x722/0xca0
>  ? unwind_get_return_address_ptr+0x50/0x50
>  ? netdev_pick_tx+0xe0/0xe0
>  ? save_stack+0x8c/0xb0
>  ? kasan_kmalloc+0xbe/0xd0
>  ? __kmalloc_track_caller+0xe4/0x1c0
>  ? __kmalloc_reserve.isra.45+0x24/0x70
>  ? __alloc_skb+0xdd/0x2e0
>  ? sk_stream_alloc_skb+0x91/0x3b0
>  ? tcp_sendmsg_locked+0x71b/0x15a0
>  ? tcp_sendmsg+0x22/0x40
>  ? __sys_sendto+0x1b0/0x250
>  ? __x64_sys_sendto+0x6f/0x80
>  ? do_syscall_64+0x5d/0x150
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  ? __sys_sendto+0x1b0/0x250
>  ? __x64_sys_sendto+0x6f/0x80
>  ? do_syscall_64+0x5d/0x150
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  ip_finish_output2+0x495/0x590
>  ? ip_copy_metadata+0x2e0/0x2e0
>  ? skb_gso_validate_network_len+0x6f/0x110
>  ? ip_finish_output+0x174/0x280
>  __tcp_transmit_skb+0xb17/0x12b0
>  ? __tcp_select_window+0x380/0x380
>  tcp_write_xmit+0x913/0x1de0
>  ? __sk_mem_schedule+0x50/0x80
>  tcp_sendmsg_locked+0x49d/0x15a0
>  ? tcp_rcv_established+0x8da/0xa30
>  ? tcp_set_state+0x220/0x220
>  ? clear_user+0x1f/0x50
>  ? iov_iter_zero+0x1ae/0x590
>  ? __fget_light+0xa0/0xe0
>  tcp_sendmsg+0x22/0x40
>  __sys_sendto+0x1b0/0x250
>  ? __ia32_sys_getpeername+0x40/0x40
>  ? _copy_to_user+0x58/0x70
>  ? poll_select_copy_remaining+0x176/0x200
>  ? __pollwait+0x1c0/0x1c0
>  ? ktime_get_ts64+0x11f/0x140
>  ? kern_select+0x108/0x150
>  ? core_sys_select+0x360/0x360
>  ? vfs_read+0x127/0x150
>  ? kernel_write+0x90/0x90
>  __x64_sys_sendto+0x6f/0x80
>  do_syscall_64+0x5d/0x150
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fefef2b129d
> Code: ff ff ff ff eb b6 0f 1f 80 00 00 00 00 48 8d 05 51 37 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41
> RSP: 002b:7fff2f5350c8 EFLAGS: 0246 ORIG_RAX: 002c
> RAX: ffda RBX: 56118d60c120 RCX: 7fefef2b129d
> RDX: 2000 RSI: 56118d629320 RDI: 0003
> RBP: 56118d530370 R08:  R09: 
> R10:  R11: 0246 R12: 2000
> R13: 56118d5c2a10 R14: 56118d5c2a10 R15: 56118d5303b8
>
>tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init()
>forgot to allocate them, because tcf_idr_create() was called with a wrong
>value of 'cpustats'. Setting it to true proved to fix the reported crash.
>
>Reported-by: Matteo Croce 
>Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR")
>Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
>Tested-by: Matteo Croce 
>Signed-off-by: Davide Caratti 

Acked-by: Jiri Pirko

Re: [PATCH net-next 08/13] net: sched: rename tcf_block_get{_ext}() and tcf_block_put{_ext}()

2018-09-14 Thread Vlad Buslov



On Thu 13 Sep 2018 at 17:21, Cong Wang  wrote:
> On Wed, Sep 12, 2018 at 1:24 AM Vlad Buslov  wrote:
>>
>>
>> On Fri 07 Sep 2018 at 20:09, Cong Wang  wrote:
>> > On Thu, Sep 6, 2018 at 12:59 AM Vlad Buslov  wrote:
>> >>
>> >> Functions tcf_block_get{_ext}() and tcf_block_put{_ext}() actually
>> >> attach/detach block to specific Qdisc besides just taking/putting
>> >> reference. Rename them according to their purpose.
>> >
>> > Where exactly does it attach to?
>> >
>> > Each qdisc provides a pointer to a pointer of a block, like
>> > &cl->block. It is where the result is saved to. It takes a parameter
>> > of Qdisc* merely for read-only purpose.
>>
>> tcf_block_attach_ext() passes qdisc parameter to tcf_block_owner_add()
>> which saves qdisc to new tcf_block_owner_item and adds the item to
>> block's owner list. I proposed several naming options for these
>> functions to Jiri on internal review and he suggested "attach" as better
>> option.
>
> But that is merely item->q = q, this is why I said it is read-only,
> hard to claim this is attaching.
>
>
>>
>> >
>> > So, renaming it to *attach() is even confusing, at least not
>> > any better. Please find other names or leave them as they are.
>>
>> What would you recommend?
>
> I don't know, perhaps "acquire"?
>
> Or, leaving tcf_block_get() as it is but rename your refcnt
> increment function to be something like tcf_block_refcnt_get()?

Cong, I'm okay with both options.

Jiri, which naming would you prefer?

Re: [PATCH net-next v2] net: sched: change tcf_del_walker() to take idrinfo->lock

2018-09-14 Thread Vlad Buslov



On Thu 13 Sep 2018 at 17:13, Cong Wang  wrote:
> On Wed, Sep 12, 2018 at 1:51 AM Vlad Buslov  wrote:
>>
>>
>> On Fri 07 Sep 2018 at 19:12, Cong Wang  wrote:
>> > On Fri, Sep 7, 2018 at 6:52 AM Vlad Buslov  wrote:
>> >>
>> >> Action API was changed to work with actions and action_idr in concurrency
>> >> safe manner, however tcf_del_walker() still uses actions without taking a
>> >> reference or idrinfo->lock first, and deletes them directly, disregarding
>> >> possible concurrent delete.
>> >>
>> >> Add tc_action_wq workqueue to action API. Implement
>> >> tcf_idr_release_unsafe() that assumes external synchronization by caller
>> >> and delays blocking action cleanup part to tc_action_wq workqueue. Extend
>> >> tcf_action_cleanup() with 'async' argument to indicate that function
>> >> should free action asynchronously.
>> >
>> > Where exactly is blocking in tcf_action_cleanup()?
>> >
>> > From your code, it looks like free_tcf(), but from my observation,
>> > the only blocking function inside is tcf_action_goto_chain_fini()
>> > which calls __tcf_chain_put(). But, __tcf_chain_put() is blocking
>> > _ONLY_ when tc_chain_notify() is called, for tc action it is never
>> > called.
>> >
>> > So, what else is blocking?
>>
>> __tcf_chain_put() calls tc_chain_tmplt_del(), which calls
>> ops->tmplt_destroy(). This last function uses hw offload API, which is
>> blocking.
>
> Good to know.
>
> Can we just make ops->tmplt_destroy() use a workqueue?
> Moving tc actions to a workqueue seems like overkill to me.

How about changing tcf_chain_put_by_act() to use tc_filter_wq, instead
of directly calling __tcf_chain_put()? IMO it is a better solution
because it benefits all classifiers, instead of requiring every
classifier with templates support to implement non-blocking
ops->tmplt_destroy().
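
The pattern under discussion, handing a potentially blocking cleanup step to a
worker instead of running it in the caller's context, can be sketched in plain
C. This is a single-threaded model with no locking; struct toy_wq,
toy_wq_queue(), toy_wq_drain() and toy_blocking_cleanup() are hypothetical
names and are not the kernel workqueue API or tc_filter_wq.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item: a callback plus its argument. */
struct toy_work {
	void (*fn)(void *arg);
	void *arg;
	struct toy_work *next;
};

/* Hypothetical FIFO of pending work items. */
struct toy_wq {
	struct toy_work *head;
	struct toy_work **tail;
};

void toy_wq_init(struct toy_wq *wq)
{
	wq->head = NULL;
	wq->tail = &wq->head;
}

/* Called from the context that must not block: only links the item. */
void toy_wq_queue(struct toy_wq *wq, struct toy_work *w,
		  void (*fn)(void *), void *arg)
{
	assert(fn != NULL);
	w->fn = fn;
	w->arg = arg;
	w->next = NULL;
	*wq->tail = w;
	wq->tail = &w->next;
}

/* Called later from a context that may block: runs items in FIFO order
 * and returns how many callbacks were executed.
 */
int toy_wq_drain(struct toy_wq *wq)
{
	int ran = 0;

	while (wq->head) {
		struct toy_work *w = wq->head;

		/* Unlink before calling, so the callback may free 'w'. */
		wq->head = w->next;
		if (!wq->head)
			wq->tail = &wq->head;
		w->fn(w->arg);
		ran++;
	}
	return ran;
}

/* Stand-in for a blocking cleanup step such as a template destroy. */
void toy_blocking_cleanup(void *arg)
{
	int *cleaned = arg;

	(*cleaned)++;
}
```

Queueing at the chain-put level, as proposed above, would mean every caller
benefits from the deferral without each classifier needing its own
non-blocking destroy path.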

[PATCH net-next] cxgb4: update supported DCB version

2018-09-14 Thread Ganesh Goudar

- In the CXGB4_DCB_STATE_FW_INCOMPLETE state, check whether the DCB
  version has changed and update the supported DCB version.

- Also, fill in the priority code point value for priority-based
  flow control.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c | 27 ++
 drivers/net/ethernet/chelsio/cxgb4/l2t.c   |  6 --
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
index b34f0f0..6ba3104 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
@@ -114,6 +114,24 @@ void cxgb4_dcb_reset(struct net_device *dev)
cxgb4_dcb_state_init(dev);
 }
 
+/* update the dcb port support, if version is IEEE then set it to
+ * FW_PORT_DCB_VER_IEEE and if DCB_CAP_DCBX_VER_CEE is already set then
+ * clear that. and if it is set to CEE then set dcb supported to
+ * DCB_CAP_DCBX_VER_CEE & if DCB_CAP_DCBX_VER_IEEE is set, clear it
+ */
+static inline void cxgb4_dcb_update_support(struct port_dcb_info *dcb)
+{
+   if (dcb->dcb_version == FW_PORT_DCB_VER_IEEE) {
+   if (dcb->supported & DCB_CAP_DCBX_VER_CEE)
+   dcb->supported &= ~DCB_CAP_DCBX_VER_CEE;
+   dcb->supported |= DCB_CAP_DCBX_VER_IEEE;
+   } else if (dcb->dcb_version == FW_PORT_DCB_VER_CEE1D01) {
+   if (dcb->supported & DCB_CAP_DCBX_VER_IEEE)
+   dcb->supported &= ~DCB_CAP_DCBX_VER_IEEE;
+   dcb->supported |= DCB_CAP_DCBX_VER_CEE;
+   }
+}
+
 /* Finite State machine for Data Center Bridging.
  */
 void cxgb4_dcb_state_fsm(struct net_device *dev,
@@ -165,6 +183,15 @@ void cxgb4_dcb_state_fsm(struct net_device *dev,
}
 
case CXGB4_DCB_STATE_FW_INCOMPLETE: {
+   if (transition_to != CXGB4_DCB_INPUT_FW_DISABLED) {
+   /* during this CXGB4_DCB_STATE_FW_INCOMPLETE state,
+* check if the dcb version is changed (there can be
+* mismatch in default config & the negotiated switch
+* configuration at FW, so update the dcb support
+* accordingly.
+*/
+   cxgb4_dcb_update_support(dcb);
+   }
switch (transition_to) {
case CXGB4_DCB_INPUT_FW_ENABLED: {
/* we're alreaady in firmware DCB mode */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 301c4df..99022c0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -433,10 +433,12 @@ struct l2t_entry *cxgb4_l2t_get(struct l2t_data *d, struct neighbour *neigh,
else
lport = netdev2pinfo(physdev)->lport;
 
-   if (is_vlan_dev(neigh->dev))
+   if (is_vlan_dev(neigh->dev)) {
vlan = vlan_dev_vlan_id(neigh->dev);
-   else
+   vlan |= vlan_dev_get_egress_qos_mask(neigh->dev, priority);
+   } else {
vlan = VLAN_NONE;
+   }
 
write_lock_bh(&d->lock);
for (e = d->l2tab[hash].first; e; e = e->next)
-- 
2.1.0
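
The two bit manipulations in this patch can be tried out in isolation with the
sketch below. The TOY_* constants and function names are illustrative
stand-ins rather than the kernel definitions, and the TCI helper assumes the
standard 802.1Q layout (3-bit priority in bits 15..13, VLAN ID in bits 11..0)
and that the priority value is supplied unshifted.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real values live in the kernel headers. */
#define TOY_DCBX_VER_CEE	0x04u
#define TOY_DCBX_VER_IEEE	0x08u

#define TOY_VLAN_VID_MASK	0x0fffu
#define TOY_VLAN_PRIO_SHIFT	13

enum toy_dcb_version { TOY_VER_CEE, TOY_VER_IEEE, TOY_VER_OTHER };

/* Keep exactly one of the two version bits set, matching the negotiated
 * version. Other bits in 'supported' are left untouched, and an unknown
 * version changes nothing.
 */
uint8_t toy_update_support(uint8_t supported, enum toy_dcb_version ver)
{
	if (ver == TOY_VER_IEEE) {
		supported &= (uint8_t)~TOY_DCBX_VER_CEE;
		supported |= TOY_DCBX_VER_IEEE;
	} else if (ver == TOY_VER_CEE) {
		supported &= (uint8_t)~TOY_DCBX_VER_IEEE;
		supported |= TOY_DCBX_VER_CEE;
	}
	return supported;
}

/* Combine a VLAN ID with a 3-bit priority code point into one 16-bit
 * tag control value.
 */
uint16_t toy_vlan_tci(uint16_t vid, uint8_t pcp)
{
	assert(vid <= TOY_VLAN_VID_MASK);
	return (uint16_t)(vid | ((uint16_t)(pcp & 0x7u) << TOY_VLAN_PRIO_SHIFT));
}
```

Clearing the opposite bit unconditionally gives the same result as the
test-then-clear form used in the patch, since clearing an unset bit is a no-op.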

Re: [PATCH net-next 4/4] bnxt_en: Always forward VF MAC address to the PF.

2018-09-14 Thread Siwei Liu

This commit is toxic; if possible, I hope it can be reverted and
reworked as a new patch.

First, the patch introduced backward-incompatible changes to the bnxt_en
VF driver that cause issues when interoperating with an old PF
driver that lacks this commit. In that case, VF probing fails from
within the VM:

[5.660331] Broadcom NetXtreme-C/E driver bnxt_en v1.9.1
[5.663653] bnxt_en :00:03.0 (unnamed net_device) (uninitialized): hwrm req_type 0xf seq id 0x6 error 0x4
[5.665804] bnxt_en :00:03.0 (unnamed net_device) (uninitialized): VF MAC address 00:01:02:03:04:05 not approved by the PF
[5.668268] bnxt_en :00:03.0: Unable to initialize mac address.
[5.670974] bnxt_en: probe of :00:03.0 failed with error -99

Second, this commit contains driver changes to both the PF and VF sides,
and incorrectly assumes that both PF and VF can/should be updated at
the same time to resolve the original issue it tried to address (zero
VF MAC address in 'ip link show'). That assumption is not warranted. A
possible fix is for the VF driver to ignore whatever
bnxt_approve_mac() returns when it got a valid MAC address from the
firmware. In that case the only purpose of the bnxt_approve_mac() call
is a best-effort attempt to inform the PF of the MAC address; it should
not fail the VF driver probe when talking to an old PF driver.

Canonical reported a similar issue a few days back due to the same cause.

https://www.spinics.net/lists/netdev/msg521428.html

Regards,
-Siwei

On Tue, May 8, 2018 at 12:18 AM, Michael Chan  wrote:
> The current code already forwards the VF MAC address to the PF, except
> in one case.  If the VF driver gets a valid MAC address from the firmware
> during probe time, it will not forward the MAC address to the PF,
> incorrectly assuming that the PF already knows the MAC address.  This
> causes "ip link show" to show zero VF MAC addresses for this case.
>
> This assumption is not correct.  Newer firmware remembers the VF MAC
> address last used by the VF and provides it to the VF driver during
> probe.  So we need to always forward the VF MAC address to the PF.
>
> The forwarded MAC address may now be the PF assigned MAC address and so we
> need to make sure we approve it for this case.
>
> Signed-off-by: Michael Chan 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c   | 2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index cd3ab78..dfa0839 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -8678,8 +8678,8 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
> memcpy(bp->dev->dev_addr, vf->mac_addr, ETH_ALEN);
> } else {
> eth_hw_addr_random(bp->dev);
> -   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
> }
> +   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
>  #endif
> }
> return rc;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> index cc21d87..a649108 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> @@ -923,7 +923,8 @@ static int bnxt_vf_configure_mac(struct bnxt *bp, struct bnxt_vf_info *vf)
> if (req->enables & cpu_to_le32(FUNC_VF_CFG_REQ_ENABLES_DFLT_MAC_ADDR)) {
> if (is_valid_ether_addr(req->dflt_mac_addr) &&
> ((vf->flags & BNXT_VF_TRUST) ||
> -(!is_valid_ether_addr(vf->mac_addr)))) {
> +!is_valid_ether_addr(vf->mac_addr) ||
> +ether_addr_equal(req->dflt_mac_addr, vf->mac_addr))) {
> ether_addr_copy(vf->vf_mac_addr, req->dflt_mac_addr);
> return bnxt_hwrm_exec_fwd_resp(bp, vf, msg_size);
> }
> --
> 1.8.3.1
>
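
The best-effort behaviour suggested at the top of this message can be modelled
as a small decision function. This is a sketch under assumptions, not the
bnxt_en code: toy_vf_init_mac() is a hypothetical name, 'fw_mac_valid' stands
for "firmware supplied a valid MAC at probe time", and 'approve_rc' stands for
the result of asking the PF to approve the address (0 on success, negative on
failure, such as the -99 seen in the log above).

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the value the (modelled) MAC initialisation would hand back
 * to the probe path: 0 lets the probe continue, negative fails it.
 */
int toy_vf_init_mac(bool fw_mac_valid, int approve_rc)
{
	assert(approve_rc <= 0);

	if (fw_mac_valid) {
		/* Firmware already supplied a valid address. Informing the
		 * PF is best-effort here, so a rejection from an old PF
		 * driver must not fail the probe.
		 */
		return 0;
	}

	/* Randomly generated address: PF approval is still required. */
	return approve_rc;
}
```

Under this model a new VF driver talking to an old PF driver still probes
successfully whenever the firmware provided the address.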

Re: [PATCH iproute2] q_cake: Add printing of no-split-gso option

2018-09-14 Thread Toke Høiland-Jørgensen

Stephen Hemminger  writes:

> On Wed, 12 Sep 2018 00:32:16 +0200
> Toke Høiland-Jørgensen  wrote:
>
>> When the GSO splitting was turned into dual split-gso/no-split-gso options,
>> the printing of the latter was left out. Add that, so output is consistent
>> with the options passed.
>> 
>> Signed-off-by: Toke Høiland-Jørgensen 
>
> Applied. I noticed that nat/nonat and wash/nowash have similar missing
> output.

Thanks! And yeah, you're right; I'll send another patch :)

-Toke

Re: [PATCH net] veth: Orphan skb before GRO

2018-09-14 Thread Paolo Abeni

On Fri, 2018-09-14 at 13:33 +0900, Toshiaki Makita wrote:
> GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
> passed skbs owned by sockets. It caused corrupted sk_wmem_alloc.
> 
> Paolo Abeni reported the following splat:
> 
> [  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in 
> iperf3[1644], uid/euid: 0/0
> [  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 
> refcount_error_report+0xa0/0xa4
> [  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac 
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore 
> intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support 
> ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich 
> acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci 
> ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror 
> dm_region_hash dm_log dm_mod
> [  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ 
> #2025
> [  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 
> 06/16/2016
> [  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
> [  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 
> 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b 
> eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
> [  362.219711] RSP: 0018:9ee6ff603c20 EFLAGS: 00010282
> [  362.225538] RAX:  RBX: 9de83e10 RCX: 
> 
> [  362.233497] RDX: 0001 RSI: 9ee6ff6167d8 RDI: 
> 9ee6ff6167d8
> [  362.241457] RBP: 9ee6ff603d78 R08: 0490 R09: 
> 0004
> [  362.249416] R10:  R11: 9ee6ff603990 R12: 
> 9ee664b94500
> [  362.257377] R13:  R14: 0004 R15: 
> 9de615f9
> [  362.265337] FS:  7f1d22d28740() GS:9ee6ff60() 
> knlGS:
> [  362.274363] CS:  0010 DS:  ES:  CR0: 80050033
> [  362.280773] CR2: 7f1d222f35d0 CR3: 001fddfec003 CR4: 
> 001606f0
> [  362.288733] Call Trace:
> [  362.291459]  
> [  362.293702]  ex_handler_refcount+0x4e/0x80
> [  362.298269]  fixup_exception+0x35/0x40
> [  362.302451]  do_trap+0x109/0x150
> [  362.306048]  do_error_trap+0xd5/0x130
> [  362.315766]  invalid_op+0x14/0x20
> [  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
> [  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 
> 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d 
> c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
> [  362.345465] RSP: 0018:9ee6ff603e20 EFLAGS: 00010a86
> [  362.351291] RAX: 1100 RBX: 9ee65deec700 RCX: 
> 9ee65e829244
> [  362.359250] RDX: 0100 RSI: 9ee65e829100 RDI: 
> 9ee65deec700
> [  362.367210] RBP: 9ee65e829100 R08: 0002a380 R09: 
> 
> [  362.375169] R10: 0002 R11: f1a4bf77bb00 R12: 
> c0754661d000
> [  362.383130] R13: 9ee65deec200 R14: 9ee65f597000 R15: 
> 00aa
> [  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
> [  362.399357]  veth_poll+0x4d/0x17a [veth]
> [  362.403731]  net_rx_action+0x2af/0x3f0
> [  362.407912]  __do_softirq+0xdd/0x29e
> [  362.411897]  do_softirq_own_stack+0x2a/0x40
> [  362.416561]  
> [  362.418899]  do_softirq+0x4b/0x70
> [  362.422594]  __local_bh_enable_ip+0x50/0x60
> [  362.427258]  ip_finish_output2+0x16a/0x390
> [  362.431824]  ip_output+0x71/0xe0
> [  362.440670]  __tcp_transmit_skb+0x583/0xab0
> [  362.445333]  tcp_write_xmit+0x247/0xfb0
> [  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
> [  362.454760]  tcp_sendmsg_locked+0x857/0xd30
> [  362.459424]  tcp_sendmsg+0x27/0x40
> [  362.463216]  sock_sendmsg+0x36/0x50
> [  362.467104]  sock_write_iter+0x87/0x100
> [  362.471382]  __vfs_write+0x112/0x1a0
> [  362.475369]  vfs_write+0xad/0x1a0
> [  362.479062]  ksys_write+0x52/0xc0
> [  362.482759]  do_syscall_64+0x5b/0x180
> [  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  362.492473] RIP: 0033:0x7f1d22293238
> [  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 
> 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 
> 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [  362.517409] RSP: 002b:7ffebaef8008 EFLAGS: 0246 ORIG_RAX: 
> 0001
> [  362.525855] RAX: ffda RBX: 2800 RCX: 
> 7f1d22293238
> [  362.533816] RDX: 2800 RSI: 7f1d22d36000 RDI: 
> 0005
> [  362.541775] RBP: 7f1d22d36000 R08: 0002db777a30 R09: 
> 562b70712b20
> [  362.549734] R10:  R11: 0246 R12: 
> 0005
> [  362.557693]

[PATCH net] pppoe: fix reception of frames with no mac header

2018-09-14 Thread Guillaume Nault

pppoe_rcv() needs to look back at the Ethernet header in order to
look up the PPPoE session. Therefore we need to ensure that the mac
header is big enough to contain an Ethernet header. Otherwise
eth_hdr(skb)->h_source might access invalid data.

==
BUG: KMSAN: uninit-value in __get_item drivers/net/ppp/pppoe.c:172 [inline]
BUG: KMSAN: uninit-value in get_item drivers/net/ppp/pppoe.c:236 [inline]
BUG: KMSAN: uninit-value in pppoe_rcv+0xcef/0x10e0 drivers/net/ppp/pppoe.c:450
CPU: 0 PID: 4543 Comm: syz-executor355 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
 __get_item drivers/net/ppp/pppoe.c:172 [inline]
 get_item drivers/net/ppp/pppoe.c:236 [inline]
 pppoe_rcv+0xcef/0x10e0 drivers/net/ppp/pppoe.c:450
 __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
 vfs_write+0x463/0x8d0 fs/read_write.c:544
 SYSC_write+0x172/0x360 fs/read_write.c:589
 SyS_write+0x55/0x80 fs/read_write.c:581
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x4447c9
RSP: 002b:7fff64c8fc28 EFLAGS: 0297 ORIG_RAX: 0001
RAX: ffda RBX:  RCX: 004447c9
RDX: fd87 RSI: 2600 RDI: 0004
RBP: 006cf018 R08: 7fff64c8fda8 R09: 7fff6bda
R10: 5fe7 R11: 0297 R12: 004020d0
R13: 00402160 R14:  R15: 

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
 sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
 tun_alloc_skb drivers/net/tun.c:1532 [inline]
 tun_get_user+0x2242/0x7c60 drivers/net/tun.c:1829
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
 vfs_write+0x463/0x8d0 fs/read_write.c:544
 SYSC_write+0x172/0x360 fs/read_write.c:589
 SyS_write+0x55/0x80 fs/read_write.c:581
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==

Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
Reported-by: syzbot+f5f6080811c849739...@syzkaller.appspotmail.com
Signed-off-by: Guillaume Nault 
---
 drivers/net/ppp/pppoe.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index ce61231e96ea..62dc564b251d 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -429,6 +429,9 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
if (!skb)
goto out;
 
+   if (skb_mac_header_len(skb) < ETH_HLEN)
+   goto drop;
+
if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
goto drop;
 
-- 
2.19.0
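
The check added by this patch follows a general rule: verify that the
link-layer header is long enough before reading fields out of it. A userspace
sketch of that rule is below. struct toy_buf and its helpers are hypothetical
simplifications (offsets into a flat byte array) and do not reproduce
struct sk_buff, skb_mac_header_len() or eth_hdr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6		/* bytes in a MAC address */
#define TOY_ETH_HLEN 14		/* destination + source + ethertype */

/* Hypothetical, simplified buffer: offsets into 'data' mark where the
 * link-layer and network-layer headers start.
 */
struct toy_buf {
	const uint8_t *data;
	size_t mac_off;
	size_t net_off;
};

size_t toy_mac_header_len(const struct toy_buf *b)
{
	assert(b->net_off >= b->mac_off);
	return b->net_off - b->mac_off;
}

/* Copy the source MAC out of the frame, refusing frames whose link-layer
 * header is too short to be an Ethernet header.
 * Returns 0 on success, -1 if the frame should be dropped.
 */
int toy_get_src_mac(const struct toy_buf *b, uint8_t src[TOY_ETH_ALEN])
{
	if (toy_mac_header_len(b) < TOY_ETH_HLEN)
		return -1;

	/* The source address follows the 6-byte destination address. */
	memcpy(src, b->data + b->mac_off + TOY_ETH_ALEN, TOY_ETH_ALEN);
	return 0;
}
```

A frame injected with no mac header at all (mac_off == net_off) is rejected
before any header byte is read, which is the case the syzbot report exercised
through tun.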

Re: [PATCH net] pppoe: fix reception of frames with no mac header

2018-09-14 Thread Guillaume Nault

On Fri, Sep 14, 2018 at 04:28:05PM +0200, Guillaume Nault wrote:
> pppoe_rcv() needs to look back at the Ethernet header in order to
> look up the PPPoE session. Therefore we need to ensure that the mac
> header is big enough to contain an Ethernet header. Otherwise
> eth_hdr(skb)->h_source might access invalid data.
> 
Forgot to Cc Alexander :/
Sorry...
BTW, thanks for your first analysis.

[bpf-next, v4 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 include/linux/bpf.h |   1 +
 include/linux/bpf_types.h   |   1 +
 include/linux/skbuff.h  |   7 ++
 include/net/net_namespace.h |   3 +
 include/net/sch_generic.h   |  12 +++-
 include/uapi/linux/bpf.h|  26 +++
 kernel/bpf/syscall.c|   8 +++
 kernel/bpf/verifier.c   |  32 +
 net/core/filter.c   |  70 +++
 net/core/flow_dissector.c   | 134 
 10 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..988a00797bcd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ enum bpf_reg_type {
PTR_TO_PACKET_META,  /* skb->data - meta_len */
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
+   PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 const struct flow_dissector_key *key,
 unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+  struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
void *target_container,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..99d4148e0f90 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -43,6 +43,7 @@ struct ctl_table_header;
 struct net_generic;
 struct uevent_sock;
 struct netns_ipvs;
+struct bpf_prog;
 
 
 #define NETDEV_HASHBITS8
@@ -145,6 +146,8 @@ struct net {
 #endif
struct net_generic __rcu*gen;
 
+   struct bpf_prog __rcu   *flow_dissector_prog;
+
/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
struct netns_xfrm   xfrm;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..1b81ba85fd2d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@ struct Qdisc_ops;
 struct qdisc_walker;
 struct tcf_walker;
 struct module;
+struct bpf_flow_keys;
 
 typedef int tc_setup_cb_t(enum tc_setup_type type,
  void *type_data, void *cb_priv);
@@ -307,9 +308,14 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-   unsigned intpkt_len;
-   u16 slave_dev_queue_mapping;
-   u16 tc_classid;
+   union {
+   struct {
+   unsigned intpkt_len;
+   u16 slave_dev_queue_mapping;
+   u16 tc_classid;
+   };
+   struct bpf_flow_keys *flow_keys;
+   };
 #define QDISC_CB_PRIV_LEN 20
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..aa5ccd2385ed 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+   BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP4_SENDMSG,
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
+   BPF_FLOW_DISSECTOR,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
/*

[bpf-next, v4 0/5] Introduce eBPF flow dissector

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get stuck in an infinite loop, as happened with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.
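
As an illustration of the "all memory accesses are checked" property, the
sketch below shows the kind of per-access bounds check that a dissector
written against a [data, data_end) window has to perform. It is ordinary
userspace C, not a BPF program and not the verifier; toy_read_be16() and
enum toy_verdict are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parse result for the sketch. */
enum toy_verdict { TOY_DROP = 0, TOY_OK = 1 };

/* Read a 16-bit big-endian field at 'off', but only after proving that
 * both bytes lie inside [data, data_end). The comparison is written so
 * that it cannot overflow for large 'off'.
 */
enum toy_verdict toy_read_be16(const uint8_t *data, const uint8_t *data_end,
			       size_t off, uint16_t *out)
{
	size_t len;

	assert(data <= data_end);
	len = (size_t)(data_end - data);

	if (off > len || len - off < sizeof(uint16_t))
		return TOY_DROP;

	*out = (uint16_t)((data[off] << 8) | data[off + 1]);
	return TOY_OK;
}
```

A parser built from such reads stops at the first field that does not fit in
the packet instead of reading past its end.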

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds the new BPF flow dissector definitions to tools/uapi.

Patch 3 adds support for the new BPF program type to libbpf and bpftool.

Patch 4 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 5 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
-t 10

In-kernel Dissector:
__skb_flow_dissect overhead: 2.12%
Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
__skb_flow_dissect overhead: 1.63% 
Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op BPF Dissector:
__skb_flow_dissect overhead: 1.52% 
Total Packets: 3,330,635 (from output of ./test_flow_dissector)

Changes since v3:
1/ struct bpf_flow_keys reorganized to remove holes in patch 1 and patch 2.

Changes since v2:
1/ Changes to tools/include/uapi pulled into a separate patch 2
2/ Changes to tools/lib and tools/bpftool pulled into a separate patch 3
3/ Changed flow_keys in __sk_buff from __u32 to struct bpf_flow_keys *
4/ Added nhoff field in struct bpf_flow_keys to pass initial offset
5/ Saving all of the modified control block, rather than just the qdisc
6/ Sample BPF program in patch 4 modified to use the changes above

Changes since v1:
1/ LD_ABS instructions now disallowed for the new BPF prog type 
2/ now checks if skb is NULL in __skb_flow_dissect()
3/ fixed incorrect accesses in flow_dissector_is_valid_access()
- writes to the flow_keys field now disallowed
- reads/writes to tc_classid and data_meta now disallowed 
4/ headers pulled with bpf_skb_load_data if direct access fails 

Changes since RFC:
1/ Flow dissector hook changed from global to per-netns
2/ Defined struct bpf_flow_keys to be used in BPF flow dissector
programs instead of exposing the internal flow keys layout. Added a
function to translate from bpf_flow_keys to the internal layout after BPF
dissection is complete. The pointer to this struct is stored in
qdisc_skb_cb rather than inside of the 20 byte control block which
simplifies verification and allows access to all 20 bytes of the cb.
3/ Removed GUE parsing as it relied on a hardcoded port
4/ MPLS parsing now stops at the first label which is consistent
with the in-kernel flow dissector
5/ Refactored to use direct packet access and to write out to
struct bpf_flow_keys

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (5):
  flow_dissector: implements flow dissector BPF hook
  bpf: sync bpf.h uapi with tools/
  bpf: support flow dissector in libbpf and bpftool
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf.h   |   1 +
 include/linux/bpf_types.h |   1 +
 include/linux/skbuff.h|   7 +
 include/net/net_namespace.h   |   3 +
 include/net/sch_generic.h |  12 +-
 include/uapi/linux/bpf.h  |  26 +
 kernel/bpf/syscall.c  |   8 +
 kernel/bpf/verifier.c |  32 +
 net/core/filter.c |  70 ++
 net/core/flow_dissector.c | 134 +++
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  26 +
 tools/lib/bpf/libbpf.c|   2 +
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c| 373 +
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++

[bpf-next, v4 2/5] bpf: sync bpf.h uapi with tools/

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

This patch syncs tools/include/uapi/linux/bpf.h with the flow dissector
definitions from include/uapi/linux/bpf.h

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/include/uapi/linux/bpf.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..aa5ccd2385ed 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+   BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP4_SENDMSG,
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
+   BPF_FLOW_DISSECTOR,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
/* ... here. */
 
__u32 data_meta;
+   struct bpf_flow_keys *flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
BPF_FD_TYPE_URETPROBE,  /* filename + offset */
 };
 
+struct bpf_flow_keys {
+   __u16   nhoff;
+   __u16   thoff;
+   __u16   addr_proto; /* ETH_P_* of valid addrs */
+   __u8    is_frag;
+   __u8    is_first_frag;
+   __u8    is_encap;
+   __u8    ip_proto;
+   __be16  n_proto;
+   __be16  sport;
+   __be16  dport;
+   union {
+   struct {
+   __be32  ipv4_src;
+   __be32  ipv4_dst;
+   };
+   struct {
+   __u32   ipv6_src[4];/* in6_addr; network order */
+   __u32   ipv6_dst[4];/* in6_addr; network order */
+   };
+   };
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.19.0.397.gdd90340f6a-goog

[bpf-next, v4 3/5] bpf: support flow dissector in libbpf and bpftool

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

This patch extends libbpf and bpftool to work with programs of type
BPF_PROG_TYPE_FLOW_DISSECTOR.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/bpf/bpftool/prog.c | 1 +
 tools/lib/bpf/libbpf.c   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
[BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
[BPF_PROG_TYPE_LIRC_MODE2]  = "lirc_mode2",
+   [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8476da7f2720..9ca8e0e624d8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_LIRC_MODE2:
case BPF_PROG_TYPE_SK_REUSEPORT:
+   case BPF_PROG_TYPE_FLOW_DISSECTOR:
return false;
case BPF_PROG_TYPE_UNSPEC:
case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
BPF_PROG_SEC("sk_skb",  BPF_PROG_TYPE_SK_SKB),
BPF_PROG_SEC("sk_msg",  BPF_PROG_TYPE_SK_MSG),
BPF_PROG_SEC("lirc_mode2",  BPF_PROG_TYPE_LIRC_MODE2),
+   BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
-- 
2.19.0.397.gdd90340f6a-goog

[bpf-next, v4 4/5] flow_dissector: implements eBPF parser

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP encapsulation,
and VLAN, along with IPv4/IPv6 and extension headers.  This program is
meant to show how flow dissection and key extraction can be done in
eBPF.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 373 +
 2 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index ..5fb809d95867
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,373 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
+ */
+enum {
+   IP,
+   IPV6,
+   IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */
+   IPV6FR, /* Fragmentation IPv6 Extension Header */
+   MPLS,
+   VLAN,
+};
+
+#define IP_MF  0x2000
+#define IP_OFFSET  0x1FFF
+#define IP6_MF 0x0001
+#define IP6_OFFSET 0xFFF8
+
+struct vlan_hdr {
+   __be16 h_vlan_TCI;
+   __be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+   __be16 flags;
+   __be16 proto;
+};
+
+struct frag_hdr {
+   __u8 nexthdr;
+   __u8 reserved;
+   __be16 frag_off;
+   __be32 identification;
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+   .type = BPF_MAP_TYPE_PROG_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u32),
+   .max_entries = 8
+};
+
+static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
+__u16 hdr_size,
+void *buffer)
+{
+   void *data_end = (void *)(long)skb->data_end;
+   void *data = (void *)(long)skb->data;
+   __u16 nhoff = skb->flow_keys->nhoff;
+   __u8 *hdr;
+
+   /* Verifies this variable offset does not overflow */
+   if (nhoff > (USHRT_MAX - hdr_size))
+   return NULL;
+
+   hdr = data + nhoff;
+   if (hdr + hdr_size <= data_end)
+   return hdr;
+
+   if (bpf_skb_load_bytes(skb, nhoff, buffer, hdr_size))
+   return NULL;
+
+   return buffer;
+}
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+   struct bpf_flow_keys *keys = skb->flow_keys;
+
+   keys->n_proto = proto;
+   switch (proto) {
+   case bpf_htons(ETH_P_IP):
+   bpf_tail_call(skb, &jmp_table, IP);
+   break;
+   case bpf_htons(ETH_P_IPV6):
+   bpf_tail_call(skb, &jmp_table, IPV6);
+   break;
+   case bpf_htons(ETH_P_MPLS_MC):
+   case bpf_htons(ETH_P_MPLS_UC):
+   bpf_tail_call(skb, &jmp_table, MPLS);
+   break;
+   case bpf_htons(ETH_P_8021Q):
+   case bpf_htons(ETH_P_8021AD):
+   bpf_tail_call(skb, &jmp_table, VLAN);
+   break;
+   default:
+   /* Protocol not supported */
+   return BPF_DROP;
+   }
+
+   return BPF_DROP;
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+   if (!skb->vlan_present)
+   return parse_eth_proto(skb, skb->protocol);
+   else
+   return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{

[bpf-next, v4 5/5] selftests/bpf: test bpf flow dissection

2018-09-14 Thread Petar Penkov

From: Petar Penkov 

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   6 +-
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++
 .../selftests/bpf/test_flow_dissector.sh  | 115 +++
 tools/testing/selftests/bpf/with_addr.sh  |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 4d789c1e5167..8a60c9b9892d 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -23,3 +23,5 @@ test_skb_cgroup_id_user
 test_socket_cookie
 test_cgroup_storage
 test_select_reuseport
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
test_tunnel.sh \
test_lwt_seg6local.sh \
test_lirc_mode2.sh \
-   test_skb_cgroup_id.sh
+   test_skb_cgroup_id.sh \
+   test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+   flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index ..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+   struct bpf_program *prog, *main_prog;
+   struct bpf_map *prog_array;
+   int i, fd, prog_fd, ret;
+   struct bpf_object *obj;
+   int prog_array_fd;
+
+   ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+   &prog_fd);
+   if (ret)
+   error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+   main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+   if (!main_prog)
+   error(1, 0, "bpf_object__find_program_by_title %s",
+ cfg_section_name);
+
+   prog_fd = bpf_program__fd(main_prog);
+   if (prog_fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+   if (!prog_array)
+   error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+   prog_array_fd = bpf_map__fd(prog_array);
+   if (prog_array_fd < 0)
+   error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+   i = 0;
+   bpf_object__for_each_program(prog, obj) {
+   fd = bpf_program__fd(prog);
+   if (fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   if (fd != prog_fd) {
+   printf("%d: %s\n", i, bpf_program__title(prog, false));
+   bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+   ++i;
+   }
+   }
+
+   ret = bpf_

Re: [PATCH net-next v3 0/2] net: stmmac: Coalesce and tail addr fixes

2018-09-14 Thread Jerome Brunet

On Thu, 2018-09-13 at 09:02 +0100, Jose Abreu wrote:
> The fix for coalesce timer and a fix in tail address setting that impacts
> XGMAC2 operation.
> 
> Cc: Florian Fainelli 
> Cc: Neil Armstrong 
> Cc: Jerome Brunet 
> Cc: Martin Blumenstingl 
> Cc: David S. Miller 
> Cc: Joao Pinto 
> Cc: Giuseppe Cavallaro 
> Cc: Alexandre Torgue 
> 
> Jose Abreu (2):
>   net: stmmac: Rework coalesce timer and fix multi-queue races
>   net: stmmac: Fixup the tail addr setting in xmit path

Looks better this time. Stable so far, with even a small throughput improvement
on the Tx path.

so for the a113 s400 board (single queue)
Tested-by: Jerome Brunet 

> 
>  drivers/net/ethernet/stmicro/stmmac/common.h  |   4 +-
>  drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  14 +-
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 238 --
>  include/linux/stmmac.h|   1 +
>  4 files changed, 149 insertions(+), 108 deletions(-)
>

Re: [PATCH net-next 08/13] net: sched: rename tcf_block_get{_ext}() and tcf_block_put{_ext}()

2018-09-14 Thread Jiri Pirko

Fri, Sep 14, 2018 at 12:38:08PM CEST, vla...@mellanox.com wrote:
>
>On Thu 13 Sep 2018 at 17:21, Cong Wang  wrote:
>> On Wed, Sep 12, 2018 at 1:24 AM Vlad Buslov  wrote:
>>>
>>>
>>> On Fri 07 Sep 2018 at 20:09, Cong Wang  wrote:
>>> > On Thu, Sep 6, 2018 at 12:59 AM Vlad Buslov  wrote:
>>> >>
>>> >> Functions tcf_block_get{_ext}() and tcf_block_put{_ext}() actually
>>> >> attach/detach block to specific Qdisc besides just taking/putting
>>> >> reference. Rename them according to their purpose.
>>> >
>>> > Where exactly does it attach to?
>>> >
>>> > Each qdisc provides a pointer to a pointer of a block, like
>>> > &cl->block. It is where the result is saved to. It takes a parameter
>>> > of Qdisc* merely for read-only purpose.
>>>
>>> tcf_block_attach_ext() passes qdisc parameter to tcf_block_owner_add()
>>> which saves qdisc to new tcf_block_owner_item and adds the item to
>>> block's owner list. I proposed several naming options for these
>>> functions to Jiri on internal review and he suggested "attach" as better
>>> option.
>>
>> But that is merely item->q = q, this is why I said it is read-only,
>> hard to claim this is attaching.
>>
>>
>>>
>>> >
>>> > So, renaming it to *attach() is even confusing, at least not
>>> > any better. Please find other names or leave them as they are.
>>>
>>> What would you recommend?
>>
>> I don't know, perhaps "acquire"?
>>
>> Or, leaving tcf_block_get() as it is but rename your refcnt
>> increment function to be something like tcf_block_refcnt_get()?
>
>Cong, I'm okay with both options.
>
>Jiri, which naming would you prefer?

Maybe tcf_block_refcnt_get() is better.

Re: [PATCH net] net: diag: Fix swapped src/dst in udp_dump_one.

2018-09-14 Thread David Miller

From: Lorenzo Colitti 
Date: Fri, 14 Sep 2018 15:25:53 +0900

> Since its inception, udp_dump_one has had a bug where userspace
> needs to swap src and dst addresses and ports in order to find
> the socket it wants.
> 
> This is because udp_dump_one misuses __udp[46]_lib_lookup by
> passing the source address as the source address argument.
> Unfortunately, those functions are intended to find local sockets
> matching received packets, so the order of the arguments is
> inverted: the argument that ends up being compared with, e.g.,
> sk_daddr is actually saddr, not daddr.
> 
> While it's true that this creates a backwards compatibility
> problem, this is clearly a bug since inet_diag_sockid is very
> clear about which struct elements are the source address and port
> and which are the destination address and port. Also, this bug
> does not affect TCP sockets, SOCK_DESTROY of UDP sockets, or
> finding UDP sockets with NLMSG_DUMP.
> 
> Fixes: a925aa00a55 ("udp_diag: Implement the get_exact dumping functionality")
> Tested: https://android-review.googlesource.com/c/kernel/tests/+/755889/
> Signed-off-by: Lorenzo Colitti 

Unfortunately I think we are stuck with how things are now.

Indisputably, your patch breaks userland components that have
workarounds in order to work with existing kernels.  People who
wrote such code:

1) Won't get any warnings that things are about to break on them

2) Will have limited options to have their code work on all kernels,
   ones that have this change and ones that do not.

Maybe if this got introduced 1 or 2 releases ago we could consider
doing this, but all the way back to v3.3?  No way.

I cannot apply this, sorry.

Re: [PATCH][net-next] net: move definition of pcpu_lstats to header file

2018-09-14 Thread David Miller

From: Li RongQing 
Date: Fri, 14 Sep 2018 16:00:51 +0800

> pcpu_lstats is defined in several files, so unify them as one
> and move to header file
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 

This looks fine, applied, thanks.

Re: [PATCH net-next] cxgb4: add per rx-queue counter for packet errors

2018-09-14 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 14 Sep 2018 14:46:04 +0530

> print per rx-queue packet errors in sge_qinfo
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 

Applied.

Re: [PATCH net-next] cxgb4: Fix endianness issue in t4_fwcache()

2018-09-14 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 14 Sep 2018 14:36:27 +0530

> Do not put host-endian 0 or 1 into big endian field.
> 
> Reported-by: Al Viro 
> Signed-off-by: Ganesh Goudar 

Applied.

[RFC PATCH 0/4] UDP: implement GRO support for UDP_SEGMENT socket

2018-09-14 Thread Paolo Abeni

This series implements GRO support for UDP sockets, as the RX counterpart
of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
The first two patches allow UDP GRO registration on demand, avoiding additional
overhead when no UDP_SEGMENT sockets are created, actually decreasing the GRO
engine costs for the default configuration for UDP packets. They could possibly
live on their own.
The third patch contains the actual UDP GRO implementation, while the 4th patch
allows using the udpgso_bench_rx program under selftest to trigger UDP GRO. A
full self-test is not there yet.

Paolo Abeni (4):
  net: add new helper to update an already registered offload
  net: enable UDP gro on demand.
  udp: implement GRO plain UDP sockets.
  selftests: add GRO support, fix port option processing

 include/linux/udp.h   |  18 +-
 include/net/addrconf.h|   1 +
 include/net/protocol.h|   4 +
 include/net/udp.h |  12 ++
 net/ipv4/protocol.c   |  13 +-
 net/ipv4/udp.c|   3 +
 net/ipv4/udp_offload.c| 170 +++---
 net/ipv4/udp_tunnel.c |   1 +
 net/ipv6/af_inet6.c   |   1 +
 net/ipv6/protocol.c   |  13 +-
 net/ipv6/udp_offload.c|  31 +++-
 tools/testing/selftests/net/udpgso_bench_rx.c |  18 +-
 12 files changed, 244 insertions(+), 41 deletions(-)

-- 
2.17.1

[RFC PATCH 4/4] selftests: add GRO support, fix port option processing

2018-09-14 Thread Paolo Abeni

Not a full test-case yet, but allows triggering the UDP GSO code
path.
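
The port option fix in the diff below comes down to the getopt() option
string: "p" without a trailing colon tells getopt() that -p takes no
argument, so optarg is not set for it. The following standalone sketch shows
the corrected parsing. struct cfg and this parse_opts signature are
illustrative stand-ins, not the selftest's actual globals, and the meaning of
-v is assumed from the cfg_verify name.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative container for the options; the selftest uses globals. */
struct cfg {
	int port;
	bool tcp;
	bool verify;
	bool gro_segment;
};

/* Parse -p <port>, -S, -t and -v. Returns 0 on success, -1 on a bad option. */
static int parse_opts(int argc, char **argv, struct cfg *cfg)
{
	int c;

	optind = 1;	/* restart scanning so the function can be called again */
	while ((c = getopt(argc, argv, "p:Stv")) != -1) {
		switch (c) {
		case 'p':
			/* "p:" guarantees optarg points at the port string */
			cfg->port = (int)strtoul(optarg, NULL, 0);
			break;
		case 'S':
			cfg->gro_segment = true;
			break;
		case 't':
			cfg->tcp = true;
			break;
		case 'v':
			cfg->verify = true;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

The port is kept in host byte order here, as in the patched selftest; any
htons() conversion is assumed to happen where the socket address is filled
in.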

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench_rx.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..f8bb7ea6bd25 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,9 +31,14 @@
 #include 
 #include 
 
+#ifndef UDP_SEGMENT
+#define UDP_SEGMENT 103
+#endif
+
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
+static bool cfg_gro_segment;
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -199,10 +204,13 @@ static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "p:Stv")) != -1) {
switch (c) {
case 'p':
-   cfg_port = htons(strtoul(optarg, NULL, 0));
+   cfg_port = strtoul(optarg, NULL, 0);
+   break;
+   case 'S':
+   cfg_gro_segment = true;
break;
case 't':
cfg_tcp = true;
@@ -227,6 +235,12 @@ static void do_recv(void)
 
fd = do_socket(cfg_tcp);
 
+   if (cfg_gro_segment) {
+   int val = 1;
+   if (setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &val, sizeof(val)))
+   error(1, errno, "setsockopt UDP_SEGMENT");
+   }
+
treport = gettimeofday_ms() + 1000;
do {
do_poll(fd);
-- 
2.17.1

[RFC PATCH 1/4] net: add new helper to update an already registered offload

2018-09-14 Thread Paolo Abeni

This will allow us to enable/disable UDP GRO at runtime in
a later patch.
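
The helper is a single compare-and-swap on the per-protocol offload slot. A
rough userspace analogue using C11 atomics is sketched here; struct offload,
the offloads table and MAX_PROTOS are stand-ins for the kernel types, the
add helper is written in terms of the update helper only for brevity, and
the synchronize_net() step of the delete path is not modeled.

```c
#include <stdatomic.h>
#include <stddef.h>

struct offload {
	int id;		/* stand-in for the callbacks in struct net_offload */
};

#define MAX_PROTOS 256
static _Atomic(const struct offload *) offloads[MAX_PROTOS];

/* Replace old_prot with new_prot only if the slot still holds old_prot. */
static int update_offload(const struct offload *old_prot,
			  const struct offload *new_prot,
			  unsigned char protocol)
{
	const struct offload *expected = old_prot;

	return atomic_compare_exchange_strong(&offloads[protocol],
					      &expected, new_prot) ? 0 : -1;
}

/* Registering is an update from the empty slot. */
static int add_offload(const struct offload *prot, unsigned char protocol)
{
	return update_offload(NULL, prot, protocol);
}

/* Unregistering is an update to the empty slot. */
static int del_offload(const struct offload *prot, unsigned char protocol)
{
	return update_offload(prot, NULL, protocol);
}
```

Because the swap only succeeds when the caller names the currently installed
entry, two racing updaters cannot both win, and a stale caller gets -1
instead of silently overwriting a newer registration.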

Signed-off-by: Paolo Abeni 
---
 include/net/protocol.h |  4 
 net/ipv4/protocol.c| 13 +
 net/ipv6/protocol.c| 13 +
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 4fc75f7ae23b..aa77e7feffab 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -104,6 +104,8 @@ extern struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS];
 int inet_add_protocol(const struct net_protocol *prot, unsigned char num);
 int inet_del_protocol(const struct net_protocol *prot, unsigned char num);
 int inet_add_offload(const struct net_offload *prot, unsigned char num);
+int inet_update_offload(const struct net_offload *old_prot,
+   const struct net_offload *new_prot, unsigned char num);
 int inet_del_offload(const struct net_offload *prot, unsigned char num);
 void inet_register_protosw(struct inet_protosw *p);
 void inet_unregister_protosw(struct inet_protosw *p);
@@ -115,6 +117,8 @@ int inet6_register_protosw(struct inet_protosw *p);
 void inet6_unregister_protosw(struct inet_protosw *p);
 #endif
 int inet6_add_offload(const struct net_offload *prot, unsigned char num);
+int inet6_update_offload(const struct net_offload *old_prot,
+const struct net_offload *new_prot, unsigned char num);
 int inet6_del_offload(const struct net_offload *prot, unsigned char num);
 
 #endif /* _PROTOCOL_H */
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 32a691b7ce2c..b60f1686b918 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -65,12 +65,17 @@ int inet_del_protocol(const struct net_protocol *prot, unsigned char protocol)
 }
 EXPORT_SYMBOL(inet_del_protocol);
 
-int inet_del_offload(const struct net_offload *prot, unsigned char protocol)
+int inet_update_offload(const struct net_offload *old_prot,
+   const struct net_offload *new_prot,
+   unsigned char protocol)
 {
-   int ret;
+   return (cmpxchg((const struct net_offload **)&inet_offloads[protocol],
+   old_prot, new_prot) == old_prot) ? 0 : -1;
+}
 
-   ret = (cmpxchg((const struct net_offload **)&inet_offloads[protocol],
-  prot, NULL) == prot) ? 0 : -1;
+int inet_del_offload(const struct net_offload *prot, unsigned char protocol)
+{
+   int ret = inet_update_offload(prot, NULL, protocol);
 
synchronize_net();
 
diff --git a/net/ipv6/protocol.c b/net/ipv6/protocol.c
index b5d54d4f995c..9ee6aff1f3fa 100644
--- a/net/ipv6/protocol.c
+++ b/net/ipv6/protocol.c
@@ -60,12 +60,17 @@ int inet6_add_offload(const struct net_offload *prot, unsigned char protocol)
 }
 EXPORT_SYMBOL(inet6_add_offload);
 
-int inet6_del_offload(const struct net_offload *prot, unsigned char protocol)
+int inet6_update_offload(const struct net_offload *old_prot,
+const struct net_offload *new_prot,
+unsigned char protocol)
 {
-   int ret;
+   return (cmpxchg((const struct net_offload **)&inet6_offloads[protocol],
+   old_prot, new_prot) == old_prot) ? 0 : -1;
+}
 
-   ret = (cmpxchg((const struct net_offload **)&inet6_offloads[protocol],
-  prot, NULL) == prot) ? 0 : -1;
+int inet6_del_offload(const struct net_offload *prot, unsigned char protocol)
+{
+   int ret = inet6_update_offload(prot, NULL, protocol);
 
synchronize_net();
 
-- 
2.17.1

[RFC PATCH 2/4] net: enable UDP gro on demand.

2018-09-14 Thread Paolo Abeni

Currently, the UDP GRO callback is always invoked, regardless of
the existence of any actual user (e.g. a UDP tunnel). With retpoline
enabled, this causes measurable overhead.

This changeset introduces explicit accounting of the sockets requiring
UDP GRO and updates the UDP offloads at runtime accordingly, so that
the GRO callback is present (and invoked) only when there is at least
one socket requiring it.
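
The accounting idea can be sketched in a few lines of plain C. This is an
assumption about the general shape only: the names below are hypothetical,
there is no locking, and the real udp_gro_in_use_changed() is not shown in
this archive, so it may well differ.

```c
#include <stdbool.h>

static int gro_users;		/* sockets currently requesting GRO */
static bool gro_enabled;	/* is the GRO-capable offload installed? */
static int offload_switches;	/* how many times the offload was swapped */

struct sock_state {
	bool gro_in_use;
};

/* Record a socket's GRO wish and swap the offload on 0 <-> 1 transitions. */
static void update_gro_in_use(struct sock_state *sk, bool want_gro)
{
	if (want_gro == sk->gro_in_use)
		return;		/* no change for this socket */

	sk->gro_in_use = want_gro;
	if (want_gro) {
		if (gro_users++ == 0) {
			gro_enabled = true;	/* first user: install GRO */
			offload_switches++;
		}
	} else {
		if (--gro_users == 0) {
			gro_enabled = false;	/* last user gone: remove GRO */
			offload_switches++;
		}
	}
}
```

Only the first and the last interested socket trigger an offload swap, so the
cost of updating the offload table is not paid per socket.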

Tested with pktgen vs udpgso_bench_rx
Before:
udp rx: 27 MB/s  1613271 calls/s

After:
udp rx: 30 MB/s  1771537 calls/s

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h| 18 +++-
 include/net/addrconf.h |  1 +
 include/net/udp.h  | 12 
 net/ipv4/udp.c |  2 ++
 net/ipv4/udp_offload.c | 63 --
 net/ipv4/udp_tunnel.c  |  1 +
 net/ipv6/af_inet6.c|  1 +
 net/ipv6/udp_offload.c | 25 +++--
 8 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..56a321a55ba1 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,8 @@ struct udp_sock {
unsigned int corkflag;  /* Cork is required */
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned char no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+gro_in_use:1;  /* UDP GRO is requested */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
@@ -105,6 +106,11 @@ static inline void udp_set_no_check6_rx(struct sock *sk, bool val)
udp_sk(sk)->no_check6_rx = val;
 }
 
+static inline void udp_set_gro_in_use(struct sock *sk, bool val)
+{
+   udp_sk(sk)->gro_in_use = val;
+}
+
 static inline bool udp_get_no_check6_tx(struct sock *sk)
 {
return udp_sk(sk)->no_check6_tx;
@@ -115,6 +121,16 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
+static inline bool udp_get_gro_in_use(struct sock *sk)
+{
+   return udp_sk(sk)->gro_in_use;
+}
+
+static inline bool udp_want_gro(struct sock *sk)
+{
+   return udp_sk(sk)->gro_receive;
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..fb2ac3ca3417 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -254,6 +254,7 @@ struct ipv6_stub {
 struct in6_addr *saddr);
 
void (*udpv6_encap_enable)(void);
+   void (*udpv6_update_offload)(bool enable_gro);
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
  const struct in6_addr *solicited_addr,
  bool router, bool solicited, bool override, bool inc_opt);
diff --git a/include/net/udp.h b/include/net/udp.h
index 8482a990b0bb..eff2dfa0571b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -444,8 +444,20 @@ int udpv4_offload_init(void);
 void udp_init(void);
 
 void udp_encap_enable(void);
+void udp_gro_in_use_changed(struct sock *sk);
+
 #if IS_ENABLED(CONFIG_IPV6)
 void udpv6_encap_enable(void);
+void udpv6_update_offload(bool);
 #endif
 
+static inline void udp_update_gro_in_use(struct sock *sk, bool want_gro)
+{
+   if (want_gro == udp_get_gro_in_use(sk))
+   return;
+
+   udp_set_gro_in_use(sk, want_gro);
+   udp_gro_in_use_changed(sk);
+}
+
 #endif /* _UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4e35b2ff8b8..5ac794230013 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1438,6 +1438,8 @@ void udp_destruct_sock(struct sock *sk)
}
udp_rmem_release(sk, total, 0, true);
 
+   udp_update_gro_in_use(sk, 0);
+
inet_sock_destruct(sk);
 }
 EXPORT_SYMBOL_GPL(udp_destruct_sock);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0c0522b79b43..08b225adf763 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -14,6 +14,10 @@
 #include 
 #include 
 
+#if IS_ENABLED(CONFIG_IPV6)
+#include 
+#endif
+
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
@@ -472,7 +476,13 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
-static const struct net_offload udpv4_offload = {
+static const struct net_offload udpv4_no_gro_offload = {
+   .callbacks = {
+   .gso_segment = udp4_ufo_fragment,
+   },
+};
+
+static const struct net_offload udpv4_gro_offload = {
.callbacks = {
.gso_segment

[RFC PATCH 3/4] udp: implement GRO plain UDP sockets.

2018-09-14 Thread Paolo Abeni

This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_SEGMENT is enabled, such a socket is also
eligible for GRO in the rx path: UDP segments directed to such a socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled/updated on setsockopt(UDP_SEGMENT) and
disabled, if needed, at socket destruction time.

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s    24877 calls/s

This change introduces a side effect with respect to UDP tunnels:
after a UDP tunnel is created, the kernel now performs a lookup for every
ingress UDP packet; previously, such a lookup happened only if the ingress
packet carried a valid internal header csum.

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h|   2 +-
 net/ipv4/udp.c |   1 +
 net/ipv4/udp_offload.c | 107 +
 net/ipv6/udp_offload.c |   6 +--
 4 files changed, 90 insertions(+), 26 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 56a321a55ba1..27dea956ef6e 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -128,7 +128,7 @@ static inline bool udp_get_gro_in_use(struct sock *sk)
 
 static inline bool udp_want_gro(struct sock *sk)
 {
-   return udp_sk(sk)->gro_receive;
+   return udp_sk(sk)->gro_receive || udp_sk(sk)->gso_size;
 }
 
 #define udp_portaddr_for_each_entry(__sk, list) \
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5ac794230013..871ee55afd96 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2450,6 +2450,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
if (val < 0 || val > USHRT_MAX)
return -EINVAL;
up->gso_size = val;
+   udp_update_gro_in_use(sk, udp_want_gro(sk));
break;
 
/*
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 08b225adf763..4ff150bb84de 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -347,6 +347,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
return segs;
 }
 
+#define UDO_GRO_CNT_MAX 64
+static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
+  struct sk_buff *skb)
+{
+   struct udphdr *uh = udp_hdr(skb);
+   struct sk_buff *pp = NULL;
+   struct udphdr *uh2;
+   struct sk_buff *p;
+
+   /* requires non zero csum, for symmetry with GSO */
+   if (!uh->check) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   /* pull encapsulating udp header */
+   skb_gro_pull(skb, sizeof(struct udphdr));
+   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   uh2 = udp_hdr(p);
+
+   /* Match ports only, as csum is always non zero */
+   if ((*(u32 *)&uh->source != *(u32 *)&uh2->source)) {
+   NAPI_GRO_CB(p)->same_flow = 0;
+   continue;
+   }
+
+   /* Terminate the flow on len mismatch or if it grows "too much".
+* Under small packet flood GRO count could otherwise grow a lot
+* leading to excessive truesize values
+*/
+   if (!skb_gro_receive(p, skb) &&
+   NAPI_GRO_CB(p)->count > UDO_GRO_CNT_MAX)
+   pp = p;
+   else if (uh->len != uh2->len)
+   pp = p;
+
+   return pp;
+   }
+
+   /* mismatch, but we never need to flush */
+   return NULL;
+}
+
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -357,23 +405,29 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
int flush = 1;
struct sock *sk;
 
+   rcu_read_lock();
+   sk = (*lookup)(skb, uh->source, uh->dest);
+   if (!sk)
+   goto out_unlock;
+
+   if (udp_sk(sk)->gso_size) {
+   pp = call_gro_receive(udp_gro_receive_segment, head, skb);
+   rcu_read_unlock();
+   return pp;
+   }
+
if (NAPI_GRO_CB(skb)->encap_mark ||
(skb->ip_summed != CHECKSUM_PARTIAL &&
 NAPI_GRO_CB(skb)->csum_cnt == 0 &&
 !NAPI_GRO_CB(skb)->csum_valid))
-   goto out;
+   goto out_unlock;
 
/* mark that this skb passed once through the tunnel gro layer */
NAPI_GRO_CB(skb)->encap_mark = 1;
 
-   rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
-
-   if (sk && udp_sk(sk)->gro_receive)
-   goto unflush;
-   goto out_unlock;
+   if (!udp_sk(sk)->gro_receive)
+   goto out_unlock;
 
-
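
The port match in udp_gro_receive_segment() above compares source and
destination port with one 32-bit load, relying on the two 16-bit fields being
adjacent at the start of the UDP header. A minimal userspace sketch of that
idea follows. struct udphdr_model and same_ports() are hypothetical names
used only for illustration, and memcpy() replaces the patch's pointer cast to
avoid alignment and aliasing questions outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct udphdr: four 16-bit fields, no padding. */
struct udphdr_model {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

/* The single 32-bit compare is only valid if dest directly follows source. */
static_assert(offsetof(struct udphdr_model, dest) == 2,
	      "source and dest must be adjacent");

/* Return true when both headers carry the same source and dest port. */
static bool same_ports(const struct udphdr_model *a,
		       const struct udphdr_model *b)
{
	uint32_t pa, pb;

	/* The first four bytes of the header are source followed by dest. */
	memcpy(&pa, a, sizeof(pa));
	memcpy(&pb, b, sizeof(pb));
	return pa == pb;
}
```

Length and checksum are deliberately ignored here, which mirrors the "match
ports only" comment in the patch.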

Re: [PATCH net] net/sched: act_sample: fix NULL dereference in the data path

2018-09-14 Thread David Miller

From: Davide Caratti 
Date: Fri, 14 Sep 2018 12:03:18 +0200

> Matteo reported the following splat, testing the datapath of TC 'sample':
 ...
> tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init()
> forgot to allocate them, because tcf_idr_create() was called with a wrong
> value of 'cpustats'. Setting it to true proved to fix the reported crash.
> 
> Reported-by: Matteo Croce 
> Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use 
> IDR")
> Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
> Tested-by: Matteo Croce 
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable, thanks.

Re: [PATCH net-next] cxgb4: update supported DCB version

2018-09-14 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 14 Sep 2018 17:35:55 +0530

> - In CXGB4_DCB_STATE_FW_INCOMPLETE state check if the dcb
>   version is changed and update the dcb supported version.
> 
> - Also, fill the priority code point value for priority
>   based flow control.
> 
> Signed-off-by: Ganesh Goudar 

Applied, thank you.

Re: [PATCH net] pppoe: fix reception of frames with no mac header

2018-09-14 Thread Alexander Potapenko

On Fri, Sep 14, 2018 at 4:35 PM Guillaume Nault  wrote:
>
> On Fri, Sep 14, 2018 at 04:28:05PM +0200, Guillaume Nault wrote:
> > pppoe_rcv() needs to look back at the Ethernet header in order to
> > lookup the PPPoE session. Therefore we need to ensure that the mac
> > header is big enough to contain an Ethernet header. Otherwise
> > eth_hdr(skb)->h_source might access invalid data.
> >
> Forgot to Cc Alexander :/
> Sorry...
> BTW, thanks for your first analysis.
Thank you!



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

Re: [RFC PATCH 3/4] udp: implement GRO plain UDP sockets.

2018-09-14 Thread Eric Dumazet




On 09/14/2018 08:43 AM, Paolo Abeni wrote:
> This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
> with UDP_SEGMENT"). When UDP_SEGMENT is enabled, such socket is also
> eligible for GRO in the rx path: UDP segments directed to such socket
> are assembled into a larger GSO_UDP_L4 packet.
> 
> The core UDP GRO support is enabled/updated on setsockopt(UDP_SEGMENT) and
> disabled, if needed, at socket destruction time.
> 
> Initial benchmark numbers:
> 
> Before:
> udp rx:   1079 MB/s   769065 calls/s
> 
> After:
> udp rx:   1466 MB/s    24877 calls/s


Are you sure the data is actually fully copied to user space ?

tools/testing/selftests/net/udpgso_bench_rx.c

uses :

static char rbuf[ETH_DATA_LEN];
   /* MSG_TRUNC will make return value full datagram length */
   ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT);

So you need to change this program.

Also, GRO reception would mean that userspace can retrieve
not only the full bytes of X datagrams, but also the gso_size (or the length
of the individual datagrams).

You cannot know the size of the packets in advance; the sender will decide.
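
To make the MSG_TRUNC concern concrete, here is a small userspace sketch. It
is not the selftest itself: recv_len_with_msg_trunc() and MAX_LEN are
hypothetical names, and the sketch assumes a Linux host with a working
loopback interface. It sends one UDP datagram and receives it into a buffer
that may be smaller than the datagram.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_LEN 4096	/* hypothetical upper bound for this sketch */

/* Send payload_len bytes over loopback UDP and receive them into a buffer
 * of buf_len bytes with MSG_TRUNC. Returns what recv() reports, -1 on error.
 */
static ssize_t recv_len_with_msg_trunc(size_t payload_len, size_t buf_len)
{
	static char payload[MAX_LEN], rbuf[MAX_LEN];
	struct timeval tv = { .tv_sec = 1 };	/* do not block forever */
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	ssize_t ret = -1;
	int rx, tx;

	assert(payload_len <= MAX_LEN && buf_len <= MAX_LEN);

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	/* Port 0: the kernel picks a free port, read back via getsockname(). */
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) ||
	    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
		goto out;

	memset(payload, 'x', payload_len);
	if (sendto(tx, payload, payload_len, 0, (struct sockaddr *)&addr,
		   sizeof(addr)) != (ssize_t)payload_len)
		goto out;

	/* With MSG_TRUNC the return value is the full datagram length,
	 * even though at most buf_len bytes are copied into rbuf.
	 */
	ret = recv(rx, rbuf, buf_len, MSG_TRUNC);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

A 3000-byte datagram received into a 1500-byte buffer is reported as 3000
bytes, so a benchmark that sums the return values counts bytes that never
reached user space.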

Re: [PATCH 1/1] net: rds: use memset to optimize the recv

2018-09-14 Thread Santosh Shilimkar


On 9/14/2018 1:45 AM, Zhu Yanjun wrote:

The function rds_inc_init is in the recv process. Using memset can optimize
the function rds_inc_init.
The test result:

 Before:
 1) + 24.950 us   |rds_inc_init [rds]();
 After:
 1) + 10.990 us   |rds_inc_init [rds]();

Signed-off-by: Zhu Yanjun 
---

Looks good. Thanks !!

Acked-by: Santosh Shilimkar

Re: [RFC PATCH 2/4] net: enable UDP gro on demand.

2018-09-14 Thread Willem de Bruijn

On Fri, Sep 14, 2018 at 11:47 AM Paolo Abeni  wrote:
>
> Currently, the UDP GRO callback is always invoked, regardless of
> the existence of any actual user (e.g. a UDP tunnel). With retpoline
> enabled, this causes measurable overhead.
>
> This changeset introduces explicit accounting of the sockets requiring
> UDP GRO and updates the UDP offloads at runtime accordingly, so that
> the GRO callback is present (and invoked) only when there is at least
> one socket requiring it.
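
The accounting described in the quoted text can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions: the
global counter gro_users and the flag gro_offload_active are hypothetical
stand-ins for whatever udp_gro_in_use_changed() does in the patch (its body
is not shown in this thread), and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-socket state: does this socket currently count as a
 * UDP GRO user?
 */
struct sock_model {
	bool gro_in_use;
};

static int gro_users;		/* assumed global count of GRO users */
static bool gro_offload_active;	/* stands in for the GRO-capable offload */

/* Called only when a socket's state actually flipped. */
static void gro_in_use_changed(const struct sock_model *sk)
{
	gro_users += sk->gro_in_use ? 1 : -1;
	assert(gro_users >= 0);
	/* The GRO callback is present only while at least one user exists. */
	gro_offload_active = gro_users > 0;
}

/* Mirrors the shape of udp_update_gro_in_use(): idempotent per socket. */
static void update_gro_in_use(struct sock_model *sk, bool want_gro)
{
	if (want_gro == sk->gro_in_use)
		return;

	sk->gro_in_use = want_gro;
	gro_in_use_changed(sk);
}
```

The early return is what keeps repeated setsockopt() calls on one socket
from skewing the count; the destruct path can then call the helper with
false unconditionally.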

I have a different solution both to the UDP socket lookup avoidance
and configurable GRO in general.

The first can be achieved by exporting the udp_encap_needed_key static key:

"
diff --git a/include/net/udp.h b/include/net/udp.h
index 8482a990b0bb..9e82cb391dea 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -443,8 +443,10 @@ int udpv4_offload_init(void);

 void udp_init(void);

+DECLARE_STATIC_KEY_FALSE(udp_encap_needed_key);
 void udp_encap_enable(void);

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4e35b2ff8b8..bd873a5b8a86 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1889,7 +1889,7 @@ static int __udp_queue_rcv_skb(struct sock *sk,
struct sk_buff *skb)
return 0;
 }

-static DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
+DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
 void udp_encap_enable(void)
 {
static_branch_enable(&udp_encap_needed_key);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 4f6aa95a9b12..f44fe328aa0f 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct
list_head *head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);

-   if (unlikely(!uh))
+   if (unlikely(!uh) || !static_branch_unlikely(&udp_encap_needed_key))
goto flush;
 "

.. and same for ipv6.

The second is a larger patchset that converts dev_offload to
net_offload, so that all offloads share the same infrastructure, and a
sysctl interface to be able to disable all gro_receive types, not just
udp.

I've been sitting on it for too long. Let me slightly clean it up and
send it out for discussion's sake.

>
> Tested with pktgen vs udpgso_bench_rx
> Before:
> udp rx: 27 MB/s  1613271 calls/s
>
> After:
> udp rx: 30 MB/s  1771537 calls/s
>
> Signed-off-by: Paolo Abeni

[PATCH] net: caif: remove redundant null check on frontpkt

2018-09-14 Thread Colin King

From: Colin Ian King 

It is impossible for frontpkt to be null at the point of the null
check because it has been assigned from rearpkt and there is no
way realpkt can be null at the point of the assignment because
of the sanity checking and exit paths taken previously. Remove
the redundant null check.

Detected by CoverityScan, CID#114434 ("Logically dead code")

Signed-off-by: Colin Ian King 
---
 net/caif/cfrfml.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/caif/cfrfml.c b/net/caif/cfrfml.c
index b82440e1fcb4..a931a71ef6df 100644
--- a/net/caif/cfrfml.c
+++ b/net/caif/cfrfml.c
@@ -264,9 +264,6 @@ static int cfrfml_transmit(struct cflayer *layr, struct 
cfpkt *pkt)
frontpkt = rearpkt;
rearpkt = NULL;
 
-   err = -ENOMEM;
-   if (frontpkt == NULL)
-   goto out;
err = -EPROTO;
if (cfpkt_add_head(frontpkt, head, 6) < 0)
goto out;
-- 
2.17.1

Re: [PATCH iproute2] libnetlink: fix leak and using unused memory on error

2018-09-14 Thread महेश बंडेवार

On Thu, Sep 13, 2018 at 12:33 PM, Stephen Hemminger
 wrote:
> If an error happens in multi-segment message (tc only)
> then report the error and stop processing further responses.
> This also fixes referring to the buffer after free.
>
> The sequence check is not necessary here because the
> response message has already been validated to be in
> the window of the sequence number of the iov.
>
> Reported-by: Mahesh Bandewar 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Stephen Hemminger 
Acked-by: Mahesh Bandewar 
> ---
>  lib/libnetlink.c | 23 +--
>  1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 928de1dd16d8..586809292594 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -617,7 +617,6 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, 
> struct iovec *iov,
> msg.msg_iovlen = 1;
> i = 0;
> while (1) {
> -next:
> status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
> ++i;
>
> @@ -660,27 +659,23 @@ next:
>
> if (l < sizeof(struct nlmsgerr)) {
> fprintf(stderr, "ERROR truncated\n");
> -   } else if (!err->error) {
> +   free(buf);
> +   return -1;
> +   }
> +
> +   if (!err->error)
> /* check messages from kernel */
> nl_dump_ext_ack(h, errfn);
>
> -   if (answer)
> -   *answer = (struct nlmsghdr 
> *)buf;
> -   else
> -   free(buf);
> -   if (h->nlmsg_seq == seq)
> -   return 0;
> -   else if (i < iovlen)
> -   goto next;
> -   return 0;
> -   }
> -
> if (rtnl->proto != NETLINK_SOCK_DIAG &&
> show_rtnl_err)
> rtnl_talk_error(h, err, errfn);
>
> errno = -err->error;
> -   free(buf);
> +   if (answer)
> +   *answer = (struct nlmsghdr *)buf;
> +   else
> +   free(buf);
> return -i;
> }
>
> --
> 2.18.0
>

Re: [PATCH] net: caif: remove redundant null check on frontpkt

2018-09-14 Thread Sergei Shtylyov

Hello!

On 09/14/2018 08:19 PM, Colin King wrote:

> From: Colin Ian King 
> 
> It is impossible for frontpkt to be null at the point of the null
> check because it has been assigned from rearpkt and there is no
> way realpkt can be null at the point of the assignment because

   rearpkt?

> of the sanity checking and exit paths taken previously. Remove
> the redundant null check.
> 
> Detected by CoverityScan, CID#114434 ("Logically dead code")
> 
> Signed-off-by: Colin Ian King 
[...]

MBR, Sergei

[PATCH net-next RFC 6/8] net: make gro configurable

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Add net_offload flag NET_OFF_FLAG_GRO_OFF. If set, a net_offload will
not be used for gro receive processing.

Also add sysctl helper proc_do_net_offload that toggles this flag and
register sysctls net.{core,ipv4,ipv6}.gro.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/vxlan.c|  8 +
 include/linux/netdevice.h  |  7 -
 net/core/dev.c |  1 +
 net/core/sysctl_net_core.c | 60 ++
 net/ipv4/sysctl_net_ipv4.c |  7 +
 net/ipv6/ip6_offload.c | 10 +--
 net/ipv6/sysctl_net_ipv6.c |  8 +
 7 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index e5d236595206..8cb8e02c8ab6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -572,6 +572,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
 struct list_head *head,
 struct sk_buff *skb)
 {
+   const struct net_offload *ops;
struct sk_buff *pp = NULL;
struct sk_buff *p;
struct vxlanhdr *vh, *vh2;
@@ -606,6 +607,12 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
goto out;
}
 
+   rcu_read_lock();
+   ops = net_gro_receive(dev_offloads, ETH_P_TEB);
+   rcu_read_unlock();
+   if (!ops)
+   goto out;
+
skb_gro_pull(skb, sizeof(struct vxlanhdr)); /* pull vxlan header */
 
list_for_each_entry(p, head, list) {
@@ -621,6 +628,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
}
 
pp = call_gro_receive(eth_gro_receive, head, skb);
+
flush = 0;
 
 out:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b9e671887fc2..93e8c9ade593 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2377,6 +2377,10 @@ struct net_offload {
 
 /* This should be set for any extension header which is compatible with GSO. */
 #define INET6_PROTO_GSO_EXTHDR 0x1
+#define NET_OFF_FLAG_GRO_OFF   0x2
+
+int proc_do_net_offload(struct ctl_table *ctl, int write, void __user *buffer,
+   size_t *lenp, loff_t *ppos);
 
 /* often modified stats are per-CPU, other are shared (netdev->stats) */
 struct pcpu_sw_netstats {
@@ -3583,7 +3587,8 @@ net_gro_receive(struct net_offload __rcu **offs, u16 type)
 
off = rcu_dereference(offs[net_offload_from_type(type)]);
if (off && off->callbacks.gro_receive &&
-   (!off->type || off->type == type))
+   (!off->type || off->type == type) &&
+   !(off->flags & NET_OFF_FLAG_GRO_OFF))
return off;
else
return NULL;
diff --git a/net/core/dev.c b/net/core/dev.c
index 20d9552afd38..0fd5273bc931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -154,6 +154,7 @@
 #define GRO_MAX_HEAD (MAX_HEADER + 128)
 
 static DEFINE_SPINLOCK(ptype_lock);
+DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b1a2c5e38530..d2d72afdd9eb 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -34,6 +35,58 @@ static int net_msg_warn; /* Unused, but still a sysctl */
 int sysctl_fb_tunnels_only_for_init_net __read_mostly = 0;
 EXPORT_SYMBOL(sysctl_fb_tunnels_only_for_init_net);
 
+extern spinlock_t offload_lock;
+
+#define NET_OFF_TBL_LEN 256
+
+int proc_do_net_offload(struct ctl_table *ctl, int write, void __user *buffer,
+   size_t *lenp, loff_t *ppos)
+{
+   unsigned long bitmap[NET_OFF_TBL_LEN / (sizeof(unsigned long) << 3)];
+   struct ctl_table tbl = { .maxlen = NET_OFF_TBL_LEN, .data = bitmap };
+   unsigned long flag = (unsigned long) ctl->extra2;
+   struct net_offload __rcu **offs = ctl->extra1;
+   struct net_offload *off;
+   int i, ret;
+
+   memset(bitmap, 0, sizeof(bitmap));
+
+   spin_lock(&offload_lock);
+
+   for (i = 0; i < tbl.maxlen; i++) {
+   off = rcu_dereference_protected(offs[i], 
lockdep_is_held(&offload_lock));
+   if (off && off->flags & flag) {
+   /* flag specific constraints */
+   if (flag == NET_OFF_FLAG_GRO_OFF) {
+   /* gro disable bit: only if can gro */
+   if (!off->callbacks.gro_receive &&
+   !(off->flags & INET6_PROTO_GSO_EXTHDR))
+   continue;
+   }
+   set_bit(i, bitmap);
+   }
+   }
+
+   ret = proc_do_large_bitmap(&tbl, write, buffer, lenp, ppos);
+
+   if (write && !ret) {

[PATCH net-next RFC 7/8] udp: gro behind static key

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Avoid the socket lookup cost in udp_gro_receive if no socket has a
gro callback configured.

Signed-off-by: Willem de Bruijn 
---
 include/net/udp.h  | 2 ++
 net/ipv4/udp.c | 2 +-
 net/ipv4/udp_offload.c | 2 +-
 net/ipv6/udp.c | 2 +-
 net/ipv6/udp_offload.c | 2 +-
 5 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 8482a990b0bb..9e82cb391dea 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -443,8 +443,10 @@ int udpv4_offload_init(void);
 
 void udp_init(void);
 
+DECLARE_STATIC_KEY_FALSE(udp_encap_needed_key);
 void udp_encap_enable(void);
 #if IS_ENABLED(CONFIG_IPV6)
+DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void);
 #endif
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4e35b2ff8b8..bd873a5b8a86 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1889,7 +1889,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return 0;
 }
 
-static DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
+DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
 void udp_encap_enable(void)
 {
static_branch_enable(&udp_encap_needed_key);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 4f6aa95a9b12..f44fe328aa0f 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct list_head 
*head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
-   if (unlikely(!uh))
+   if (unlikely(!uh) || !static_branch_unlikely(&udp_encap_needed_key))
goto flush;
 
/* Don't bother verifying checksum if we're going to flush anyway. */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 83f4c77c79d8..d84672959f10 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -548,7 +548,7 @@ static __inline__ void udpv6_err(struct sk_buff *skb,
__udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
 }
 
-static DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
+DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void)
 {
static_branch_enable(&udpv6_encap_needed_key);
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 2a41da0dd33f..e00f19c4a939 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -119,7 +119,7 @@ static struct sk_buff *udp6_gro_receive(struct list_head 
*head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
-   if (unlikely(!uh))
+   if (unlikely(!uh) || !static_branch_unlikely(&udpv6_encap_needed_key))
goto flush;
 
/* Don't bother verifying checksum if we're going to flush anyway. */
-- 
2.19.0.397.gdd90340f6a-goog

[PATCH net-next RFC 4/8] ipv6: remove offload exception for hopopts

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Extension headers in ipv6 are pulled without calling a callback
function. An inet6_offload signals this feature with flag
INET6_PROTO_GSO_EXTHDR.

Add net_offload_has_flag helper to hide implementation details and in
preparation for configurable gro.

Convert NEXTHDR_HOP from a special case branch to a standard
extension header offload.

Signed-off-by: Willem de Bruijn 
---
 include/linux/netdevice.h  |  9 +
 net/ipv6/exthdrs_offload.c | 17 ++---
 net/ipv6/ip6_offload.c | 36 +---
 3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0be594f8d1ce..1c97a048506f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3567,6 +3567,15 @@ static inline u8 net_offload_from_type(u16 type)
return type & 0xFF;
 }
 
+static inline bool net_offload_has_flag(const struct net_offload __rcu **offs,
+   u16 type, u16 flag)
+{
+   const struct net_offload *off;
+
+   off = offs ? rcu_dereference(offs[net_offload_from_type(type)]) : NULL;
+   return off && off->flags & flag;
+}
+
 static inline const struct net_offload *
 net_gro_receive(const struct net_offload __rcu **offs, u16 type)
 {
diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index f5e2ba1c18bf..2230331c6012 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -12,11 +12,15 @@
 #include 
 #include "ip6_offload.h"
 
-static const struct net_offload rthdr_offload = {
+static struct net_offload hophdr_offload = {
.flags  =   INET6_PROTO_GSO_EXTHDR,
 };
 
-static const struct net_offload dstopt_offload = {
+static struct net_offload rthdr_offload = {
+   .flags  =   INET6_PROTO_GSO_EXTHDR,
+};
+
+static struct net_offload dstopt_offload = {
.flags  =   INET6_PROTO_GSO_EXTHDR,
 };
 
@@ -24,10 +28,14 @@ int __init ipv6_exthdrs_offload_init(void)
 {
int ret;
 
-   ret = inet6_add_offload(&rthdr_offload, IPPROTO_ROUTING);
+   ret = inet6_add_offload(&hophdr_offload, IPPROTO_HOPOPTS);
if (ret)
goto out;
 
+   ret = inet6_add_offload(&rthdr_offload, IPPROTO_ROUTING);
+   if (ret)
+   goto out_hop;
+
ret = inet6_add_offload(&dstopt_offload, IPPROTO_DSTOPTS);
if (ret)
goto out_rt;
@@ -37,5 +45,8 @@ int __init ipv6_exthdrs_offload_init(void)
 
 out_rt:
inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+
+out_hop:
+   inet6_del_offload(&hophdr_offload, IPPROTO_HOPOPTS);
goto out;
 }
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 9d301bef0e23..4854509a2c5d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -22,21 +22,13 @@
 
 static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto)
 {
-   const struct net_offload *ops = NULL;
-
for (;;) {
struct ipv6_opt_hdr *opth;
int len;
 
-   if (proto != NEXTHDR_HOP) {
-   ops = rcu_dereference(inet6_offloads[proto]);
-
-   if (unlikely(!ops))
-   break;
-
-   if (!(ops->flags & INET6_PROTO_GSO_EXTHDR))
-   break;
-   }
+   if (!net_offload_has_flag(inet6_offloads, proto,
+ INET6_PROTO_GSO_EXTHDR))
+   break;
 
if (unlikely(!pskb_may_pull(skb, 8)))
break;
@@ -141,26 +133,24 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
*skb,
 /* Return the total length of all the extension hdrs, following the same
  * logic in ipv6_gso_pull_exthdrs() when parsing ext-hdrs.
  */
-static int ipv6_exthdrs_len(struct ipv6hdr *iph,
-   const struct net_offload **opps)
+static int ipv6_exthdrs_len(struct ipv6hdr *iph, u8 *pproto)
 {
struct ipv6_opt_hdr *opth = (void *)iph;
int len = 0, proto, optlen = sizeof(*iph);
 
proto = iph->nexthdr;
for (;;) {
-   if (proto != NEXTHDR_HOP) {
-   *opps = rcu_dereference(inet6_offloads[proto]);
-   if (unlikely(!(*opps)))
-   break;
-   if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR))
-   break;
-   }
+   if (!net_offload_has_flag(inet6_offloads, proto,
+ INET6_PROTO_GSO_EXTHDR))
+   break;
+
opth = (void *)opth + optlen;
optlen = ipv6_optlen(opth);
len += optlen;
proto = opth->nexthdr;
}
+
+   *pproto = proto;
return len;
 }
 
@@ -296,8 +286,8 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head 
*hea
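
The extension-header walk in ipv6_exthdrs_len() above can be illustrated with
a userspace sketch. The names exthdrs_len() and is_gso_exthdr() are
hypothetical; the fixed set of three header types stands in for the
INET6_PROTO_GSO_EXTHDR flag lookup that net_offload_has_flag() performs in
the patch, and the bounds checks are an addition for safety outside the
kernel. The length formula matches ipv6_optlen(), i.e. (hdrlen + 1) * 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP	0	/* hop-by-hop options */
#define NEXTHDR_ROUTING	43	/* routing header */
#define NEXTHDR_DEST	60	/* destination options */

/* Stand-in for "offload registered with INET6_PROTO_GSO_EXTHDR set". */
static int is_gso_exthdr(uint8_t proto)
{
	return proto == NEXTHDR_HOP || proto == NEXTHDR_ROUTING ||
	       proto == NEXTHDR_DEST;
}

/* ext points just past the fixed IPv6 header, avail is the number of bytes
 * available there, first_proto is the nexthdr value of the IPv6 header.
 * Returns the total extension header length and stores the protocol that
 * ended the walk in *pproto.
 */
static size_t exthdrs_len(const uint8_t *ext, size_t avail,
			  uint8_t first_proto, uint8_t *pproto)
{
	uint8_t proto = first_proto;
	size_t len = 0;

	assert(ext != NULL && pproto != NULL);

	while (is_gso_exthdr(proto)) {
		size_t optlen;

		if (avail - len < 2)	/* need nexthdr and hdrlen bytes */
			break;
		optlen = ((size_t)ext[len + 1] + 1) << 3;
		if (optlen > avail - len)
			break;
		proto = ext[len];	/* byte 0 of each header: nexthdr */
		len += optlen;
	}

	*pproto = proto;
	return len;
}
```

With hop-by-hop treated like any other flagged header, the loop needs no
special case for NEXTHDR_HOP, which is the point of the patch.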

[PATCH net-next RFC 0/8] udp and configurable gro

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

This is a *very rough* draft. Mainly for discussion while we also
look at another partially overlapping approach [1].

Reduce UDP receive cost for bulk traffic by enabling datagram
coalescing with GRO.

Before adding more GRO callbacks, make GRO configurable by the
administrator to optionally reduce the attack surface of this
early receive path. See also [2].

Introduce sysctls net.(core|ipv4|ipv6).gro that expose the table of
protocols for which GRO is supported. Allow the administrator to disable
individual entries in the table.

To have a single infrastructure, convert dev_offloads to the
table-based approach of the existing inet(6)_offloads. An additional small
benefit is that ipv6 will no longer take two list lookups to find.

Patch 1 converts dev_offloads to the infra of inet(6)_offloads
Patch 2 deduplicates gro_complete logic now that all share infra
Patch 3 does the same for gro_receive, in anticipation of adding
a branch to check whether gro_receive is enabled
Patch 4 harmonizes ipv6 header opts, so that those, too, can be
optionally disabled.
Patch 5 makes inet(6)_offloads non-const to allow disabling a flag
Patch 6 introduces the administrative sysctl
Patch 7 avoids udp gro cost if no udp gro callback is registered
Patch 8 introduces udp gro

[1] http://patchwork.ozlabs.org/project/netdev/list/?series=65741
[2] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Willem de Bruijn (8):
  gro: convert device offloads to net_offload
  gro: deduplicate gro_complete
  gro: add net_gro_receive
  ipv6: remove offload exception for hopopts
  net: deconstify net_offload
  net: make gro configurable
  udp: gro behind static key
  udp: add gro

 drivers/net/geneve.c   |  11 +---
 drivers/net/vxlan.c|   8 +++
 include/linux/netdevice.h  |  64 +++--
 include/net/protocol.h |  19 ++-
 include/net/udp.h  |   2 +
 include/uapi/linux/udp.h   |   1 +
 net/8021q/vlan.c   |  12 +---
 net/core/dev.c | 112 -
 net/core/sysctl_net_core.c |  60 
 net/ethernet/eth.c |  13 +
 net/ipv4/af_inet.c |  21 ++-
 net/ipv4/esp4_offload.c|   2 +-
 net/ipv4/fou.c |  41 --
 net/ipv4/gre_offload.c |  26 -
 net/ipv4/protocol.c|  10 ++--
 net/ipv4/sysctl_net_ipv4.c |   7 +++
 net/ipv4/tcp_offload.c |   2 +-
 net/ipv4/udp.c |  73 +++-
 net/ipv4/udp_offload.c |  19 +++
 net/ipv6/esp6_offload.c|   2 +-
 net/ipv6/exthdrs_offload.c |  17 +-
 net/ipv6/ip6_offload.c |  69 +--
 net/ipv6/protocol.c|  10 ++--
 net/ipv6/sysctl_net_ipv6.c |   8 +++
 net/ipv6/tcpv6_offload.c   |   2 +-
 net/ipv6/udp.c |   2 +-
 net/ipv6/udp_offload.c |   4 +-
 net/sctp/offload.c |   2 +-
 28 files changed, 344 insertions(+), 275 deletions(-)

-- 
2.19.0.397.gdd90340f6a-goog
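
As a rough illustration of the table-based lookup this series converges on,
the sketch below models a 256-entry offload table indexed by the low byte of
the type, with a per-entry flag that disables GRO. It is a userspace model,
not the kernel code: struct offload_model, slot_from_type() and
lookup_gro_receive() are hypothetical names, has_gro_receive stands in for a
non-NULL callbacks.gro_receive, and the type values in any usage are made up.
The flag value 0x2 and the match condition follow patch 6/8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NET_OFF_FLAG_GRO_OFF	0x2	/* value used in patch 6/8 */
#define OFFLOAD_TBL_LEN		256

/* Simplified, hypothetical stand-in for struct net_offload. */
struct offload_model {
	uint16_t type;		/* 0: matches any type mapping to the slot */
	unsigned int flags;
	int has_gro_receive;	/* models callbacks.gro_receive != NULL */
};

/* Same mapping as net_offload_from_type(): low byte selects the slot. */
static uint8_t slot_from_type(uint16_t type)
{
	return type & 0xFF;
}

/* Return the entry to use for GRO receive, or NULL when there is none,
 * when the slot belongs to a different type, or when GRO was disabled.
 */
static const struct offload_model *
lookup_gro_receive(struct offload_model *const *table, uint16_t type)
{
	const struct offload_model *off;

	assert(table != NULL);

	off = table[slot_from_type(type)];
	if (off && off->has_gro_receive &&
	    (!off->type || off->type == type) &&
	    !(off->flags & NET_OFF_FLAG_GRO_OFF))
		return off;

	return NULL;
}
```

Setting NET_OFF_FLAG_GRO_OFF on an entry makes the lookup return NULL for
it, which is the effect the sysctl is meant to have on the receive path.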

[PATCH net-next RFC 2/8] gro: deduplicate gro_complete

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

The gro completion datapath is open coded for all protocols.
Deduplicate with new helper function net_gro_complete.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/geneve.c  |  9 +
 include/linux/netdevice.h | 19 ++-
 net/8021q/vlan.c  | 10 +-
 net/core/dev.c| 24 +---
 net/ethernet/eth.c| 11 +--
 net/ipv4/af_inet.c| 15 ++-
 net/ipv4/fou.c| 25 +++--
 net/ipv4/gre_offload.c| 12 +++-
 net/ipv6/ip6_offload.c| 13 +
 9 files changed, 31 insertions(+), 107 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 6625fabe2c88..a3a4621d9bee 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -488,7 +488,6 @@ static int geneve_gro_complete(struct sock *sk, struct 
sk_buff *skb,
   int nhoff)
 {
struct genevehdr *gh;
-   struct packet_offload *ptype;
__be16 type;
int gh_len;
int err = -ENOSYS;
@@ -497,13 +496,7 @@ static int geneve_gro_complete(struct sock *sk, struct 
sk_buff *skb,
gh_len = geneve_hlen(gh);
type = gh->proto_type;
 
-   rcu_read_lock();
-   ptype = gro_find_complete_by_type(type);
-   if (ptype)
-   err = ptype->callbacks.gro_complete(skb, nhoff + gh_len);
-
-   rcu_read_unlock();
-
+   err = net_gro_complete(dev_offloads, type, skb, nhoff + gh_len);
skb_set_inner_mac_header(skb, nhoff + gh_len);
 
return err;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7425068fa249..0d292ea6716e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3557,7 +3557,8 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
-struct packet_offload *gro_find_complete_by_type(__be16 type);
+
+extern const struct net_offload __rcu *dev_offloads[256];
 
 static inline u8 net_offload_from_type(u16 type)
 {
@@ -3567,6 +3568,22 @@ static inline u8 net_offload_from_type(u16 type)
return type & 0xFF;
 }
 
+static inline int net_gro_complete(const struct net_offload __rcu **offs,
+  u16 type, struct sk_buff *skb, int nhoff)
+{
+   const struct net_offload *off;
+   int ret = -ENOENT;
+
+   rcu_read_lock();
+   off = rcu_dereference(offs[net_offload_from_type(type)]);
+   if (off && off->callbacks.gro_complete &&
+   (!off->type || off->type == type))
+   ret = off->callbacks.gro_complete(skb, nhoff);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
kfree_skb(napi->skb);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 5e9950453955..6ac27aa9f158 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -703,16 +703,8 @@ static int vlan_gro_complete(struct sk_buff *skb, int 
nhoff)
 {
struct vlan_hdr *vhdr = (struct vlan_hdr *)(skb->data + nhoff);
__be16 type = vhdr->h_vlan_encapsulated_proto;
-   struct packet_offload *ptype;
-   int err = -ENOENT;
 
-   rcu_read_lock();
-   ptype = gro_find_complete_by_type(type);
-   if (ptype)
-   err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*vhdr));
-
-   rcu_read_unlock();
-   return err;
+   return net_gro_complete(dev_offloads, type, skb, nhoff + sizeof(*vhdr));
 }
 
 static struct packet_offload vlan_packet_offloads[] __read_mostly = {
diff --git a/net/core/dev.c b/net/core/dev.c
index 55f86b6d3182..2c21e507291f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5235,10 +5235,6 @@ static void flush_all_backlogs(void)
 
 static int napi_gro_complete(struct sk_buff *skb)
 {
-   const struct packet_offload *ptype;
-   __be16 type = skb->protocol;
-   int err = -ENOENT;
-
BUILD_BUG_ON(sizeof(struct napi_gro_cb) > sizeof(skb->cb));
 
if (NAPI_GRO_CB(skb)->count == 1) {
@@ -5246,13 +5242,7 @@ static int napi_gro_complete(struct sk_buff *skb)
goto out;
}
 
-   rcu_read_lock();
-   ptype = dev_offloads[net_offload_from_type(type)];
-   if (ptype && ptype->callbacks.gro_complete)
-   err = ptype->callbacks.gro_complete(skb, 0);
-   rcu_read_unlock();
-
-   if (err) {
+   if (net_gro_complete(dev_offloads, skb->protocol, skb, 0)) {
kfree_skb(skb);
return NET_RX_SUCCESS;
}
@@ -5505,18 +5495,6 @@ struct packet_offload *gro_find_receive_by_type(__be16 
type)
 }
 EXPORT_SYMBOL(gro_find_receive_by_type);
 
-struct packet_offload *gro_find_complete_by_type(__be16 type)
-{
-   struct net_offload *off;
-
-   off = (struct net_offload *) rcu_dereference(dev_of

[PATCH net-next RFC 1/8] gro: convert device offloads to net_offload

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

In preparation for making GRO receive configurable, have all offloads
share the same infrastructure.

Signed-off-by: Willem de Bruijn 
---
 include/linux/netdevice.h |  17 +-
 include/net/protocol.h|   7 ---
 net/core/dev.c| 105 +-
 3 files changed, 51 insertions(+), 78 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e2b3bd750c98..7425068fa249 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2366,13 +2366,18 @@ struct offload_callbacks {
int (*gro_complete)(struct sk_buff *skb, int nhoff);
 };
 
-struct packet_offload {
+struct net_offload {
__be16   type;  /* This is really htons(ether_type). */
u16  priority;
struct offload_callbacks callbacks;
-   struct list_head list;
+   unsigned int flags; /* Flags used by IPv6 for now */
 };
 
+#define packet_offload net_offload
+
+/* This should be set for any extension header which is compatible with GSO. */
+#define INET6_PROTO_GSO_EXTHDR 0x1
+
 /* often modified stats are per-CPU, other are shared (netdev->stats) */
 struct pcpu_sw_netstats {
u64 rx_packets;
@@ -3554,6 +3559,14 @@ gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
 
+static inline u8 net_offload_from_type(u16 type)
+{
+   /* Do not bother handling collisions. There are none.
+* If they do occur with new offloads, add a mapping function here.
+*/
+   return type & 0xFF;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
kfree_skb(napi->skb);
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 4fc75f7ae23b..53a0322ee545 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -69,13 +69,6 @@ struct inet6_protocol {
 #define INET6_PROTO_FINAL  0x2
 #endif
 
-struct net_offload {
-   struct offload_callbacks callbacks;
-   unsigned int flags; /* Flags used by IPv6 for now */
-};
-/* This should be set for any extension header which is compatible with GSO. */
-#define INET6_PROTO_GSO_EXTHDR 0x1
-
 /* This is used to register socket interfaces for IP protocols.  */
 struct inet_protosw {
struct list_head list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0b2d777e5b9e..55f86b6d3182 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -154,7 +154,6 @@
 #define GRO_MAX_HEAD (MAX_HEADER + 128)
 
 static DEFINE_SPINLOCK(ptype_lock);
-static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
@@ -467,6 +466,9 @@ void dev_remove_pack(struct packet_type *pt)
 EXPORT_SYMBOL(dev_remove_pack);
 
 
+const struct net_offload __rcu *dev_offloads[256] __read_mostly;
+EXPORT_SYMBOL(dev_offloads);
+
 /**
  * dev_add_offload - register offload handlers
  * @po: protocol offload declaration
@@ -481,15 +483,9 @@ EXPORT_SYMBOL(dev_remove_pack);
  */
 void dev_add_offload(struct packet_offload *po)
 {
-   struct packet_offload *elem;
-
-   spin_lock(&offload_lock);
-   list_for_each_entry(elem, &offload_base, list) {
-   if (po->priority < elem->priority)
-   break;
-   }
-   list_add_rcu(&po->list, elem->list.prev);
-   spin_unlock(&offload_lock);
+   cmpxchg((const struct net_offload **)
+   &dev_offloads[net_offload_from_type(po->type)],
+   NULL, po);
 }
 EXPORT_SYMBOL(dev_add_offload);
 
@@ -506,23 +502,11 @@ EXPORT_SYMBOL(dev_add_offload);
  * and must not be freed until after all the CPU's have gone
  * through a quiescent state.
  */
-static void __dev_remove_offload(struct packet_offload *po)
+static int __dev_remove_offload(struct packet_offload *po)
 {
-   struct list_head *head = &offload_base;
-   struct packet_offload *po1;
-
-   spin_lock(&offload_lock);
-
-   list_for_each_entry(po1, head, list) {
-   if (po == po1) {
-   list_del_rcu(&po->list);
-   goto out;
-   }
-   }
-
-   pr_warn("dev_remove_offload: %p not found\n", po);
-out:
-   spin_unlock(&offload_lock);
+   return (cmpxchg((const struct net_offload **)
+   &dev_offloads[net_offload_from_type(po->type)],
+  po, NULL) == po) ? 0 : -1;
 }
 
 /**
@@ -2962,7 +2946,7 @@ struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
netdev_features_t features)
 {
struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
-   struct packet_offload *ptype;
+   const struct net_offload *off;
int vlan_depth = skb->
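
The registration scheme in the patch above (dev_add_offload() and
__dev_remove_offload() doing a cmpxchg on a 256-entry table) can be
sketched in userspace with C11 atomics. This is only an illustration
under stated assumptions: struct offload, offload_table, slot_from_type,
offload_add and offload_remove are hypothetical names, types are treated
as plain host-order integers, and unlike the void dev_add_offload() in
the patch, the add helper here reports whether the slot was taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net_offload; only type matters here. */
struct offload {
	uint16_t type;
};

/* One slot per possible low byte of the protocol type. */
static _Atomic(struct offload *) offload_table[256];

/* Same idea as net_offload_from_type(): the low 8 bits select the slot. */
static uint8_t slot_from_type(uint16_t type)
{
	return type & 0xFF;
}

/* Claim the slot only if it is empty. Returns 0 on success, -1 if occupied. */
static int offload_add(struct offload *po)
{
	struct offload *expected = NULL;

	assert(po != NULL);
	return atomic_compare_exchange_strong(
			&offload_table[slot_from_type(po->type)],
			&expected, po) ? 0 : -1;
}

/* Clear the slot only if it still points at po. Returns 0 or -1. */
static int offload_remove(struct offload *po)
{
	struct offload *expected = po;

	assert(po != NULL);
	return atomic_compare_exchange_strong(
			&offload_table[slot_from_type(po->type)],
			&expected, (struct offload *)NULL) ? 0 : -1;
}
```

Two types that share a low byte compete for the same slot in this
sketch; the second offload_add() fails rather than replacing the first,
which is the collision case the comment in net_offload_from_type()
chooses not to handle.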

[PATCH net-next RFC 3/8] gro: add net_gro_receive

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

For configurable gro_receive all callsites need to be updated. Similar
to gro_complete, introduce a single shared helper, net_gro_receive.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/geneve.c  |  2 +-
 include/linux/netdevice.h | 14 +-
 net/8021q/vlan.c  |  2 +-
 net/core/dev.c| 20 
 net/ethernet/eth.c|  2 +-
 net/ipv4/af_inet.c|  4 ++--
 net/ipv4/fou.c|  8 
 net/ipv4/gre_offload.c| 12 ++--
 net/ipv6/ip6_offload.c|  8 
 9 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index a3a4621d9bee..a812a774e5fd 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -467,7 +467,7 @@ static struct sk_buff *geneve_gro_receive(struct sock *sk,
type = gh->proto_type;
 
rcu_read_lock();
-   ptype = gro_find_receive_by_type(type);
+   ptype = net_gro_receive(dev_offloads, type);
if (!ptype)
goto out_unlock;
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0d292ea6716e..0be594f8d1ce 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3556,7 +3556,6 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
-struct packet_offload *gro_find_receive_by_type(__be16 type);
 
 extern const struct net_offload __rcu *dev_offloads[256];
 
@@ -3568,6 +3567,19 @@ static inline u8 net_offload_from_type(u16 type)
return type & 0xFF;
 }
 
+static inline const struct net_offload *
+net_gro_receive(const struct net_offload __rcu **offs, u16 type)
+{
+   const struct net_offload *off;
+
+   off = rcu_dereference(offs[net_offload_from_type(type)]);
+   if (off && off->callbacks.gro_receive &&
+   (!off->type || off->type == type))
+   return off;
+   else
+   return NULL;
+}
+
 static inline int net_gro_complete(const struct net_offload __rcu **offs,
   u16 type, struct sk_buff *skb, int nhoff)
 {
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 6ac27aa9f158..a106c5373b1d 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -670,7 +670,7 @@ static struct sk_buff *vlan_gro_receive(struct list_head *head,
type = vhdr->h_vlan_encapsulated_proto;
 
rcu_read_lock();
-   ptype = gro_find_receive_by_type(type);
+   ptype = net_gro_receive(dev_offloads, type);
if (!ptype)
goto out_unlock;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2c21e507291f..ae5fbd4114d2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5382,7 +5382,7 @@ static void gro_flush_oldest(struct list_head *head)
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
-   const struct packet_offload *ptype;
+   const struct net_offload *ops;
__be16 type = skb->protocol;
struct list_head *gro_head;
struct sk_buff *pp = NULL;
@@ -5396,8 +5396,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
gro_head = gro_list_prepare(napi, skb);
 
rcu_read_lock();
-   ptype = dev_offloads[net_offload_from_type(type)];
-   if (ptype && ptype->callbacks.gro_receive) {
+   ops = net_gro_receive(dev_offloads, type);
+   if (ops) {
skb_set_network_header(skb, skb_gro_offset(skb));
skb_reset_mac_len(skb);
NAPI_GRO_CB(skb)->same_flow = 0;
@@ -5425,7 +5425,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = ops->callbacks.gro_receive(gro_head, skb);
rcu_read_unlock();
} else {
rcu_read_unlock();
@@ -5483,18 +5483,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
goto pull;
 }
 
-struct packet_offload *gro_find_receive_by_type(__be16 type)
-{
-   struct net_offload *off;
-
-   off = (struct net_offload *) rcu_dereference(dev_offloads[type & 0xFF]);
-   if (off && off->type == type && off->callbacks.gro_receive)
-   return off;
-   else
-   return NULL;
-}
-EXPORT_SYMBOL(gro_find_receive_by_type);
-
 static void napi_skb_free_stolen_head(struct sk_buff *skb)
 {
skb_dst_drop(skb);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index fb17a13722e8..542dbc2ec956 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -462,7 +462,7 @@ struct sk_buff *eth_gro_receive(struct list
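
The lookup rule in net_gro_receive() above (slot chosen by the low byte,
then rejected unless a receive callback exists and the stored type is
either unset or equal to the requested one) can be sketched as follows.
This is a simplified userspace model, not the kernel helper: struct
offload, dummy_receive and offload_lookup are hypothetical names, there
is no RCU, and types are plain host-order integers. Treating a zero type
as a wildcard is assumed here to cover tables whose entries do not
record a type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct net_offload. */
struct offload {
	uint16_t type;              /* 0 means "no type recorded" */
	int (*gro_receive)(void);   /* placeholder callback */
};

/* Placeholder callback so that an entry can count as receive-capable. */
static int dummy_receive(void)
{
	return 0;
}

/* Return the entry for type, or NULL if the slot is empty, has no receive
 * callback, or holds a different type that shares the same low byte.
 */
static const struct offload *
offload_lookup(struct offload *const table[256], uint16_t type)
{
	const struct offload *off;

	assert(table != NULL);
	off = table[type & 0xFF];
	if (off && off->gro_receive && (!off->type || off->type == type))
		return off;
	return NULL;
}
```

The type comparison is what keeps a low-byte collision from returning
the wrong handler: a lookup for 0x1200 lands in the same slot as 0x0800
but is rejected.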

[PATCH net-next RFC 5/8] net: deconstify net_offload

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

With configurable gro, the flags field in net_offloads may be changed.

Remove the const keyword. This is a noop otherwise.

Signed-off-by: Willem de Bruijn 
---
 include/linux/netdevice.h | 14 +++---
 include/net/protocol.h| 12 ++--
 net/core/dev.c|  8 +++-
 net/ipv4/af_inet.c|  2 +-
 net/ipv4/esp4_offload.c   |  2 +-
 net/ipv4/fou.c|  8 
 net/ipv4/gre_offload.c|  2 +-
 net/ipv4/protocol.c   | 10 +-
 net/ipv4/tcp_offload.c|  2 +-
 net/ipv4/udp_offload.c|  6 +++---
 net/ipv6/esp6_offload.c   |  2 +-
 net/ipv6/ip6_offload.c|  6 +++---
 net/ipv6/protocol.c   | 10 +-
 net/ipv6/tcpv6_offload.c  |  2 +-
 net/ipv6/udp_offload.c|  2 +-
 net/sctp/offload.c|  2 +-
 16 files changed, 44 insertions(+), 46 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1c97a048506f..b9e671887fc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3557,7 +3557,7 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 
-extern const struct net_offload __rcu *dev_offloads[256];
+extern struct net_offload __rcu *dev_offloads[256];
 
 static inline u8 net_offload_from_type(u16 type)
 {
@@ -3567,19 +3567,19 @@ static inline u8 net_offload_from_type(u16 type)
return type & 0xFF;
 }
 
-static inline bool net_offload_has_flag(const struct net_offload __rcu **offs,
+static inline bool net_offload_has_flag(struct net_offload __rcu **offs,
u16 type, u16 flag)
 {
-   const struct net_offload *off;
+   struct net_offload *off;
 
off = offs ? rcu_dereference(offs[net_offload_from_type(type)]) : NULL;
return off && off->flags & flag;
 }
 
 static inline const struct net_offload *
-net_gro_receive(const struct net_offload __rcu **offs, u16 type)
+net_gro_receive(struct net_offload __rcu **offs, u16 type)
 {
-   const struct net_offload *off;
+   struct net_offload *off;
 
off = rcu_dereference(offs[net_offload_from_type(type)]);
if (off && off->callbacks.gro_receive &&
@@ -3589,10 +3589,10 @@ net_gro_receive(const struct net_offload __rcu **offs, u16 type)
return NULL;
 }
 
-static inline int net_gro_complete(const struct net_offload __rcu **offs,
+static inline int net_gro_complete(struct net_offload __rcu **offs,
   u16 type, struct sk_buff *skb, int nhoff)
 {
-   const struct net_offload *off;
+   struct net_offload *off;
int ret = -ENOENT;
 
rcu_read_lock();
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 53a0322ee545..5e2c20b662d1 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -87,8 +87,8 @@ struct inet_protosw {
 #define INET_PROTOSW_ICSK  0x04  /* Is this an inet_connection_sock? */
 
 extern struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
-extern const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS];
-extern const struct net_offload __rcu *inet6_offloads[MAX_INET_PROTOS];
+extern struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS];
+extern struct net_offload __rcu *inet6_offloads[MAX_INET_PROTOS];
 
 #if IS_ENABLED(CONFIG_IPV6)
 extern struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS];
@@ -96,8 +96,8 @@ extern struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS];
 
 int inet_add_protocol(const struct net_protocol *prot, unsigned char num);
 int inet_del_protocol(const struct net_protocol *prot, unsigned char num);
-int inet_add_offload(const struct net_offload *prot, unsigned char num);
-int inet_del_offload(const struct net_offload *prot, unsigned char num);
+int inet_add_offload(struct net_offload *prot, unsigned char num);
+int inet_del_offload(struct net_offload *prot, unsigned char num);
 void inet_register_protosw(struct inet_protosw *p);
 void inet_unregister_protosw(struct inet_protosw *p);
 
@@ -107,7 +107,7 @@ int inet6_del_protocol(const struct inet6_protocol *prot, unsigned char num);
 int inet6_register_protosw(struct inet_protosw *p);
 void inet6_unregister_protosw(struct inet_protosw *p);
 #endif
-int inet6_add_offload(const struct net_offload *prot, unsigned char num);
-int inet6_del_offload(const struct net_offload *prot, unsigned char num);
+int inet6_add_offload(struct net_offload *prot, unsigned char num);
+int inet6_del_offload(struct net_offload *prot, unsigned char num);
 
 #endif /* _PROTOCOL_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index ae5fbd4114d2..20d9552afd38 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -466,7 +466,7 @@ void dev_remove_pack(struct packet_type *pt)
 EXPORT_SYMBOL(dev_remove_pack);
 
 
-const struct net_offload __rcu *dev_offloads[256] __read_mostly;
+struct net_offload __rcu *dev_offloads[256] __read_mostly;

[PATCH net-next RFC 8/8] udp: add gro

2018-09-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Very rough initial version of udp gro, for discussion purpose only at
this point.

Among others it
- lacks the cmsg UDP_SEGMENT to return gso_size
- probably breaks udp tunnels
- hard breaks at 40 segments
- does not allow a last segment of unequal size

Signed-off-by: Willem de Bruijn 
---
 include/uapi/linux/udp.h |  1 +
 net/ipv4/udp.c   | 71 
 net/ipv4/udp_offload.c   | 11 +++
 3 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 09d00f8c442b..7fda3e8c7fcf 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -33,6 +33,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT103 /* Set GSO segmentation size */
+#define UDP_GRO104 /* Enable GRO */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index bd873a5b8a86..ae49c08e6225 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2387,6 +2387,51 @@ void udp_destroy_sock(struct sock *sk)
}
 }
 
+static struct sk_buff *udp_gro_receive_cb(struct sock *sk,
+ struct list_head *head,
+ struct sk_buff *skb)
+{
+   struct sk_buff *p;
+   unsigned int off;
+
+   off = skb_gro_offset(skb) - sizeof(struct udphdr);
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   /* TODO: for UDP_GRO: match size unless last segment */
+   if (NAPI_GRO_CB(p)->flush)
+   break;
+
+   /* TODO: look into ip id check */
+   if (skb_gro_receive(p, skb)) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   break;
+   }
+
+   if (NAPI_GRO_CB(skb)->count >= 40) {
+   return p;
+   }
+
+   return NULL;
+   }
+
+   return NULL;
+}
+
+static int udp_gro_complete_cb(struct sock *sk, struct sk_buff *skb,
+  int nhoff)
+{
+   skb->csum_start = (unsigned char *)udp_hdr(skb) - skb->head;
+   skb->csum_offset = offsetof(struct udphdr, check);
+   skb->ip_summed = CHECKSUM_PARTIAL;
+
+   skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+   return 0;
+}
+
 /*
  * Socket option code for UDP
  */
@@ -2450,6 +2495,32 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
up->gso_size = val;
break;
 
+   case UDP_GRO:
+   {
+   if (val < 0 || val > 1)
+   return -EINVAL;
+
+   lock_sock(sk);
+   if (val) {
+
+   if (!udp_sk(sk)->gro_receive) {
+   udp_sk(sk)->gro_complete = udp_gro_complete_cb;
+   udp_sk(sk)->gro_receive = udp_gro_receive_cb;
+   } else {
+   err = -EALREADY;
+   }
+   } else {
+   if (udp_sk(sk)->gro_receive) {
+   udp_sk(sk)->gro_receive = NULL;
+   udp_sk(sk)->gro_complete = NULL;
+   } else {
+   err = -ENOENT;
+   }
+   }
+   release_sock(sk);
+   break;
+   }
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f44fe328aa0f..6dd3f0a28b5e 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -386,6 +386,8 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
+
+   /* TODO: for UDP_GRO: match size */
}
 
skb_gro_pull(skb, sizeof(struct udphdr)); /* pull encapsulating udp header */
@@ -437,11 +439,6 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
 
uh->len = newlen;
 
-   /* Set encapsulation before calling into inner gro_complete() functions
-* to make them set up the inner offsets.
-*/
-   skb->encapsulation = 1;
-
rcu_read_lock();
sk = (*lookup)(skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_complete)
@@ -462,11 +459,11 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
 
if (uh->check) {
-   skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
+   skb_sh
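
The UDP_GRO case added to udp_lib_setsockopt() above behaves like a
small on/off state machine with distinct errors for redundant requests.
A userspace sketch of just that logic, assuming hypothetical names
(struct gro_state, gro_set) and leaving out the socket lock and the
actual callback pointers:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-socket GRO callback slots. */
struct gro_state {
	bool receive_set;
	bool complete_set;
};

/* Apply a UDP_GRO-style option value. Returns 0 or a negative errno,
 * following the case block in the patch: only 0 and 1 are valid,
 * enabling twice yields -EALREADY, disabling when off yields -ENOENT.
 */
static int gro_set(struct gro_state *st, int val)
{
	assert(st != NULL);

	if (val < 0 || val > 1)
		return -EINVAL;

	if (val) {
		if (st->receive_set)
			return -EALREADY;
		st->complete_set = true;
		st->receive_set = true;
	} else {
		if (!st->receive_set)
			return -ENOENT;
		st->receive_set = false;
		st->complete_set = false;
	}
	return 0;
}
```

As in the patch, the presence of the receive callback is the single
source of truth for "enabled"; the complete callback just follows it.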

Re: [PATCH] net: caif: remove redundant null check on frontpkt

2018-09-14 Thread Colin Ian King

On 14/09/18 18:54, Sergei Shtylyov wrote:
> Hello!
> 
> On 09/14/2018 08:19 PM, Colin King wrote:
> 
>> From: Colin Ian King 
>>
>> It is impossible for frontpkt to be null at the point of the null
>> check because it has been assigned from rearpkt and there is no
>> way realpkt can be null at the point of the assignment because
> 
>rearpkt?

Good spot. Can this be fixed up when the patch is applied?

> 
>> of the sanity checking and exit paths taken previously. Remove
>> the redundant null check.
>>
>> Detected by CoverityScan, CID#114434 ("Logically dead code")
>>
>> Signed-off-by: Colin Ian King 
> [...]
> 
> MBR, Sergei
>

Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-09-14 Thread Saeed Mahameed

On Fri, Sep 14, 2018 at 1:52 AM, Jesper Dangaard Brouer
 wrote:
> On Fri, 14 Sep 2018 01:22:15 -0700
> Saeed Mahameed  wrote:
>
>> On Thu, Sep 13, 2018 at 11:36 PM, Jesper Dangaard Brouer
>>  wrote:
>> > On Thu, 13 Sep 2018 15:55:29 -0700
>> > Alexei Starovoitov  wrote:
>> >
>> >> On Thu, Aug 30, 2018 at 1:35 AM, Tariq Toukan  wrote:
>> >> >
>> >> >
>> >> > On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:
>> >> >>
>> >> >> Hi Saeed,
>> >> >>
>> >> >> I'm having issues loading mlx5 driver on v4.19 kernels (tested both
>> >> >> net-next and bpf-next), while kernel v4.18 seems to work.  It happens
>> >> >> with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
>> >> >> from the system now).
>> >> >>
>> >> >
>> >> > Hi Jesper,
>> >> >
>> >> > Thanks for your report!
>> >> >
>> >> > We are working to analyze and debug the issue.
>> >>
>> >> looks like serious issue to me... while no news in 2 weeks.
>> >> any update?
>> >
>> > Mellanox took it offlist, and Sep 6th found that this is a regression
>> > introduced by commit 269d26f47f6f ("net/mlx5: Reduce command polling
>> > interval"), but only if CONFIG_PREEMPT is on.
>> >
>> > I can confirm that reverting this commit fixed the issue (and not the
>> > firmware upgrade I also did).
>> >
>> > I think Moshe (Cc) is responsible for this case, and I expect to soon
>> > see a revert or alternative solution to this!?
>> >
>> > Thanks for the kick Alexei :-)
>>
>> Thanks you Alexei and Jesper for following up,
>> the fix is already being tested [1] and will be submitted tomorrow,
>> as Jesper pointed out the issue happens only with 269d26f47f6f
>> ("net/mlx5: Reduce command polling
>> interval"), and only if CONFIG_PREEMPT is on.
>> the only affected kernel is 4.19 which is not GA yet.
>>
>> [1] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=net-mlx5
>
> Sound good.
>
> I will appreciate if you add a:
>
> Reported-by: Jesper Dangaard Brouer 
>

Of course I will add it; the patch was simply in my review queue
before your report :).

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net-next RFC 6/8] net: make gro configurable

2018-09-14 Thread Stephen Hemminger

On Fri, 14 Sep 2018 13:59:39 -0400
Willem de Bruijn  wrote:

> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index e5d236595206..8cb8e02c8ab6 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -572,6 +572,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
>struct list_head *head,
>struct sk_buff *skb)
>  {
> + const struct net_offload *ops;
>   struct sk_buff *pp = NULL;
>   struct sk_buff *p;
>   struct vxlanhdr *vh, *vh2;
> @@ -606,6 +607,12 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
>   goto out;
>   }
>  
> + rcu_read_lock();
> + ops = net_gro_receive(dev_offloads, ETH_P_TEB);
> + rcu_read_unlock();
> + if (!ops)
> + goto out;

Isn't rcu_read_lock already held here?
The RCU read lock is always held in the receive handler path.

> +
>   skb_gro_pull(skb, sizeof(struct vxlanhdr)); /* pull vxlan header */
>  
>   list_for_each_entry(p, head, list) {
> @@ -621,6 +628,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
>   }
>  
>   pp = call_gro_receive(eth_gro_receive, head, skb);
> +
>   flush = 0;

A whitespace change crept into this patch.

Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-14 Thread Jesse Brandeburg

On Fri, 14 Sep 2018 13:39:17 +0900 Benjamin wrote:
> > Jesse Brandeburg (14):
> >   intel-ethernet: rename i40evf to iavf  
> 
> Seems like patch 1 didn't make it to netdev
> https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20180910/014025.html

Hi Ben, Thanks for the note, I don't know why it didn't show up for
you, it's here if you want to take a look:
https://patchwork.ozlabs.org/patch/969557/

[PATCH net] ipv6: fix possible use-after-free in ip6_xmit()

2018-09-14 Thread Eric Dumazet

In the unlikely case ip6_xmit() has to call skb_realloc_headroom(),
we need to call skb_set_owner_w() before consuming original skb,
otherwise we risk a use-after-free.

Bring IPv6 in line with what we do in IPv4 to fix this.

Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/ipv6/ip6_output.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 16f200f06500758c4cae84ea16229d5dbce912cb..f9f8f554d141676a7d342f85088d12d9a6815e9d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -219,12 +219,10 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
kfree_skb(skb);
return -ENOBUFS;
}
+   if (skb->sk)
+   skb_set_owner_w(skb2, skb->sk);
consume_skb(skb);
skb = skb2;
-   /* skb_set_owner_w() changes sk->sk_wmem_alloc atomically,
-* it is safe to call in our context (socket lock not held)
-*/
-   skb_set_owner_w(skb, (struct sock *)sk);
}
if (opt->opt_flen)
ipv6_push_frag_opts(skb, opt, &proto);
-- 
2.19.0.397.gdd90340f6a-goog
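
The ordering issue fixed above can be illustrated with a small userspace
model. It assumes, as one reading of the commit message, that the old
buffer holds the last thing keeping its owner alive, so the replacement
buffer must be attached to the owner before the old one is consumed.
All names (struct owner, struct buf, buf_grow, ...) are hypothetical;
the owner is flagged as released rather than freed so the effect can be
observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical owner: valid only while at least one buffer is charged. */
struct owner {
	unsigned int refs;
	bool released;          /* set instead of freeing, for inspection */
};

/* Hypothetical buffer; owner may be NULL. */
struct buf {
	struct owner *owner;
	size_t size;
};

static struct buf *buf_alloc(size_t size)
{
	struct buf *b = calloc(1, sizeof(*b));

	if (b)
		b->size = size;
	return b;
}

/* Charge b to o; o must still be alive at this point. */
static void buf_set_owner(struct buf *b, struct owner *o)
{
	assert(b && o && !o->released);
	b->owner = o;
	o->refs++;
}

/* Free b and drop its charge; the last charge releases the owner. */
static void buf_consume(struct buf *b)
{
	struct owner *o = b->owner;

	free(b);
	if (o && --o->refs == 0)
		o->released = true;
}

/* Replace old with a larger buffer. On allocation failure old is left
 * untouched for the caller to handle.
 */
static struct buf *buf_grow(struct buf *old, size_t new_size)
{
	struct buf *nb = buf_alloc(new_size);

	if (!nb)
		return NULL;
	if (old->owner)
		buf_set_owner(nb, old->owner);  /* owner still pinned by old */
	buf_consume(old);                       /* may drop old's charge */
	return nb;
}
```

Swapping the two calls in buf_grow() would trip the assertion in
buf_set_owner() whenever old held the only charge, which is the model's
equivalent of touching an owner that has already gone away.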

Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-14 Thread Jesse Brandeburg

On Fri, 14 Sep 2018 12:10:45 +0300 Or wrote:
> On Fri, Sep 14, 2018 at 1:31 AM, Jesse Brandeburg
>  wrote:
> on what HW ring format do you standardize? do i40e/Fortville and
> ice/what's-the-intel-code-name?  HWs can/use the same posting/completion
> descriptor?

The initial ring format is the same as used for XL710/X722 devices, and
is planned to be supported for the Intel Ethernet E800 series (ice driver)
and future VF devices using SR-IOV.

> > This solves 2 issues we saw coming or were already present, the
> > first was constant code duplication happening with i40e/i40evf,
> > when much of the duplicate code in the i40evf was not used or was
> > not needed.  
> 
> could you spare few words on the origin/nature of these duplicates? were them
> just developer C&P mistakes for functionality which is irrelevant for
> a VF? like what?
> if not, what was there?

In particular, some of the code was not used at all, but was not caught
by any automation because it was in a header file and included into
multiple file scopes.  The other big chunk of the duplicate code was for
the PF's usage of the communication channel to firmware, which for some
reason was left in the VF driver code (probably just to avoid changing
the file) - but the VF driver doesn't communicate to firmware, just to
the PF.

> > The second was to remove the future confusion of why
> > future VF devices that were not considered "40GbE" only devices
> > were supported by i40evf.  
> 
> can elaborate further?

The name i40evf was generating customer questions, and was confusing
when you add in multiple generations of PF hardware that are no longer
using the i40e driver.

> > The thought is that iavf will be the virtual function driver for
> > all future devices, so it should have a "generic" name to propery
> > represent that it is the VF driver for multiple generations of
> > devices.  
> 
> for that end,  as I think was explained @ the netdev Tokyo AVF session,
> you would need a mechanism for feature negotiation, is it here or coming up?

The driver already has it (feature negotiation); please see the
function called iavf_send_vf_config_msg, and follow from where it is
called.  Basically the VF driver negotiates with the PF for what it can
do, and the PF guarantees that the base set of features will always
work, with optional advanced features which the code may/may-not have
in the future.
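
As a rough illustration of that negotiation model (not the actual
virtchnl messages or flags, which are not shown in this thread), the
PF can be pictured as always granting a fixed base set and adding only
the optional features that the VF asked for and the PF supports. Every
name and bit value below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits, for illustration only. */
#define CAP_BASE_L2    (1u << 0)
#define CAP_BASE_IRQ   (1u << 1)
#define CAP_OPT_RSS    (1u << 2)
#define CAP_OPT_VLAN   (1u << 3)

#define CAP_BASE_MASK  (CAP_BASE_L2 | CAP_BASE_IRQ)

/* PF side: always grant the base set, plus any requested optional
 * feature that this PF also supports.
 */
static uint32_t pf_grant(uint32_t vf_requested, uint32_t pf_supported)
{
	uint32_t optional = vf_requested & pf_supported & ~CAP_BASE_MASK;

	return CAP_BASE_MASK | optional;
}

/* VF side: a feature is usable only if the PF granted every bit of it. */
static int vf_can_use(uint32_t granted, uint32_t cap)
{
	assert(cap != 0);
	return (granted & cap) == cap;
}
```

With this shape, an older PF that knows nothing about a newer optional
bit simply never grants it, and the VF falls back to the base set.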

> >  41 files changed, 3436 insertions(+), 7581 deletions(-)  
> 
> code diet is cool!

Thanks! ~4000 lines less made me very happy too.

Re: [bpf-next, v4 0/5] Introduce eBPF flow dissector

2018-09-14 Thread Alexei Starovoitov

On Fri, Sep 14, 2018 at 07:46:17AM -0700, Petar Penkov wrote:
> From: Petar Penkov 
> 
> This patch series hardens the RX stack by allowing flow dissection in BPF,
> as previously discussed [1]. Because of the rigorous checks of the BPF
> verifier, this provides significant security guarantees. In particular, the
> BPF flow dissector cannot get inside of an infinite loop, as with
> CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
> read outside of packet bounds, because all memory accesses are checked.
> Also, with BPF the administrator can decide which protocols to support,
> reducing potential attack surface. Rarely encountered protocols can be
> excluded from dissection and the program can be updated without kernel
> recompile or reboot if a bug is discovered.
> 
> Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
> This includes a new BPF program and attach type.
> 
> Patch 2 adds the new BPF flow dissector definitions to tools/uapi.
> 
> Patch 3 adds support for the new BPF program type to libbpf and bpftool.
> 
> Patch 4 adds a flow dissector program in BPF. This parses most protocols in
> __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
> and address types).
> 
> Patch 5 adds a selftest that attaches the BPF program to the flow dissector
> and sends traffic with different levels of encapsulation.
> 
> Performance Evaluation:
> The in-kernel implementation was compared against the demo program from
> patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
>   $perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
>   -t 10

Looks great. Applied to bpf-next with one extra patch:
 SEC("dissect")
-int dissect(struct __sk_buff *skb)
+int _dissect(struct __sk_buff *skb)

otherwise the test doesn't build.
I'm not sure how it builds for you. Which llvm did you use?

Also, the above command works and the IPv4 test in ./test_flow_dissector.sh
passes as well, but the script still fails at the end for me:
./test_flow_dissector.sh
bpffs not mounted. Mounting...
0: IP
1: IPV6
2: IPV6OP
3: IPV6FR
4: MPLS
5: VLAN
Testing IPv4...
inner.dest4: 127.0.0.1
inner.source4: 127.0.0.3
pkts: tx=10 rx=10
inner.dest4: 127.0.0.1
inner.source4: 127.0.0.3
pkts: tx=10 rx=0
inner.dest4: 127.0.0.1
inner.source4: 127.0.0.3
pkts: tx=10 rx=10
Testing IPIP...
tunnels before test:
tunl0: any/ip remote any local any ttl inherit nopmtudisc
sit_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
ipip_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
gre_test_LV5N: gre/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
gre0: gre/ip remote any local any ttl inherit nopmtudisc
inner.dest4: 192.168.0.1
inner.source4: 1.1.1.1
encap proto:   4
outer.dest4: 127.0.0.1
outer.source4: 127.0.0.2
pkts: tx=10 rx=0
tunnels after test:
tunl0: any/ip remote any local any ttl inherit nopmtudisc
sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
gre0: gre/ip remote any local any ttl inherit nopmtudisc
selftests: test_flow_dissector [FAILED]

Is it something in my setup, or is the test broken?

Re: [PATCH net-next v2] net/tls: Add support for async decryption of tls records

2018-09-14 Thread John Fastabend

On 08/29/2018 02:56 AM, Vakul Garg wrote:
> When tls records are decrypted using asynchronous acclerators such as
> NXP CAAM engine, the crypto apis return -EINPROGRESS. Presently, on
> getting -EINPROGRESS, the tls record processing stops till the time the
> crypto accelerator finishes off and returns the result. This incurs a
> context switch and is not an efficient way of accessing the crypto
> accelerators. Crypto accelerators work efficient when they are queued
> with multiple crypto jobs without having to wait for the previous ones
> to complete.
> 
> The patch submits multiple crypto requests without having to wait for
> for previous ones to complete. This has been implemented for records
> which are decrypted in zero-copy mode. At the end of recvmsg(), we wait
> for all the asynchronous decryption requests to complete.
> 
> The references to records which have been sent for async decryption are
> dropped. For cases where record decryption is not possible in zero-copy
> mode, asynchronous decryption is not used and we wait for decryption
> crypto api to complete.
> 
> For crypto requests executing in async fashion, the memory for
> aead_request, sglists and skb etc is freed from the decryption
> completion handler. The decryption completion handler wakesup the
> sleeping user context when recvmsg() flags that it has done sending
> all the decryption requests and there are no more decryption requests
> pending to be completed.
> 
> Signed-off-by: Vakul Garg 
> Reviewed-by: Dave Watson 
> ---

[...]


> @@ -1271,6 +1377,8 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
>   goto free_aead;
>  
>   if (sw_ctx_rx) {
> + (*aead)->reqsize = sizeof(struct decrypt_req_ctx);
> +

This is not valid and may cause a GPF or, in the best case, only a KASAN
warning. 'reqsize' should not be modified outside the internal crypto
APIs, but the real problem is that reqsize determines how much space is
needed at the end of the aead_request for the crypto layer's private
context in encrypt/decrypt. After this patch, when we submit an
aead_request the crypto layer will think it has room for its private
structs at the end, but only 8B will be there, and the crypto layer
will happily memset some arbitrary memory, among other things.
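
To make the size mismatch concrete, here is a small userspace model of the
accounting. It is only a sketch: struct model_tfm and both helpers are
hypothetical stand-ins, not kernel definitions, and the byte counts used with
them are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a transform: reqsize is the value callers read
 * back (via crypto_aead_reqsize() in the kernel) to size the request.
 */
struct model_tfm {
	size_t reqsize;
};

/* Bytes a caller allocates: request header plus trailing private area. */
static size_t model_alloc_size(const struct model_tfm *tfm, size_t hdr_size)
{
	assert(tfm != NULL);
	return hdr_size + tfm->reqsize;
}

/* Bytes the driver would touch past the end of the allocation if it
 * really needs driver_ctx bytes of private context; 0 means no overrun.
 */
static size_t model_overrun(const struct model_tfm *tfm, size_t driver_ctx)
{
	assert(tfm != NULL);
	return driver_ctx > tfm->reqsize ? driver_ctx - tfm->reqsize : 0;
}
```

If a driver needed, say, 64 bytes of context and reqsize were overwritten with
8, model_overrun() would report 56 bytes written outside the allocation.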

Anyway, I am testing a fix now and will post it shortly.

Thanks,
John

[PATCH net] bnxt_en: Fix VF mac address regression.

2018-09-14 Thread Michael Chan

The recent commit to always forward the VF MAC address to the PF for
approval may not work if the PF driver or the firmware is older.  This
will cause the VF driver to fail during probe:

  bnxt_en :00:03.0 (unnamed net_device) (uninitialized): hwrm req_type 0xf seq id 0x5 error 0x
  bnxt_en :00:03.0 (unnamed net_device) (uninitialized): VF MAC address 00:00:17:02:05:d0 not approved by the PF
  bnxt_en :00:03.0: Unable to initialize mac address.
  bnxt_en: probe of :00:03.0 failed with error -99

We fix it by treating the error as fatal only if the VF MAC address is
locally generated by the VF.
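
The resulting policy can be modelled in a few lines of userspace C. This is
only a sketch: model_approve_mac() and model_init_mac() are hypothetical
stand-ins for the driver functions, and fw_rc stands for whatever the
forwarding request to the PF returned (0 or a negative value in this model).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Model of the approval step: a failure is fatal only in strict mode. */
static int model_approve_mac(int fw_rc, bool strict)
{
	assert(fw_rc <= 0);	/* model assumption: 0 or negative error */
	if (fw_rc && strict)
		return -EADDRNOTAVAIL;
	return 0;
}

/* Model of the probe path: an admin-assigned MAC is not strictly
 * checked, because an older PF driver or firmware may reject it; a
 * MAC generated locally by the VF still requires approval.
 */
static int model_init_mac(bool admin_mac_valid, int fw_rc)
{
	bool strict_approval = !admin_mac_valid;

	return model_approve_mac(fw_rc, strict_approval);
}
```

With an admin-assigned MAC, a failed approval no longer aborts the probe; with
a random MAC the same failure still returns -EADDRNOTAVAIL.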

Fixes: 707e7e966026 ("bnxt_en: Always forward VF MAC address to the PF.")
Reported-by: Seth Forshee 
Reported-by: Siwei Liu 
Signed-off-by: Michael Chan 
---
Please queue this for stable as well.  Thanks.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c   | 9 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 9 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h | 2 +-
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index cecbb1d..177587f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -8027,7 +8027,7 @@ static int bnxt_change_mac_addr(struct net_device *dev, void *p)
if (ether_addr_equal(addr->sa_data, dev->dev_addr))
return 0;
 
-   rc = bnxt_approve_mac(bp, addr->sa_data);
+   rc = bnxt_approve_mac(bp, addr->sa_data, true);
if (rc)
return rc;
 
@@ -8827,14 +8827,19 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
} else {
 #ifdef CONFIG_BNXT_SRIOV
struct bnxt_vf_info *vf = &bp->vf;
+   bool strict_approval = true;
 
if (is_valid_ether_addr(vf->mac_addr)) {
/* overwrite netdev dev_addr with admin VF MAC */
memcpy(bp->dev->dev_addr, vf->mac_addr, ETH_ALEN);
+   /* Older PF driver or firmware may not approve this
+* correctly.
+*/
+   strict_approval = false;
} else {
eth_hw_addr_random(bp->dev);
}
-   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
+   rc = bnxt_approve_mac(bp, bp->dev->dev_addr, strict_approval);
 #endif
}
return rc;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index fcd085a..3962f6f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -1104,7 +1104,7 @@ void bnxt_update_vf_mac(struct bnxt *bp)
mutex_unlock(&bp->hwrm_cmd_lock);
 }
 
-int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
+int bnxt_approve_mac(struct bnxt *bp, u8 *mac, bool strict)
 {
struct hwrm_func_vf_cfg_input req = {0};
int rc = 0;
@@ -1122,12 +1122,13 @@ int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
memcpy(req.dflt_mac_addr, mac, ETH_ALEN);
rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
 mac_done:
-   if (rc) {
+   if (rc && strict) {
rc = -EADDRNOTAVAIL;
netdev_warn(bp->dev, "VF MAC address %pM not approved by the PF\n",
mac);
+   return rc;
}
-   return rc;
+   return 0;
 }
 #else
 
@@ -1144,7 +1145,7 @@ void bnxt_update_vf_mac(struct bnxt *bp)
 {
 }
 
-int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
+int bnxt_approve_mac(struct bnxt *bp, u8 *mac, bool strict)
 {
return 0;
 }
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
index e9b20cd..2eed9ed 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
@@ -39,5 +39,5 @@ int bnxt_sriov_configure(struct pci_dev *pdev, int num_vfs);
 void bnxt_sriov_disable(struct bnxt *);
 void bnxt_hwrm_exec_fwd_req(struct bnxt *);
 void bnxt_update_vf_mac(struct bnxt *);
-int bnxt_approve_mac(struct bnxt *, u8 *);
+int bnxt_approve_mac(struct bnxt *, u8 *, bool);
 #endif
-- 
2.5.1

[PATCH v2 0/2] hv_netvsc: associate VF and PV device by serial number

2018-09-14 Thread Stephen Hemminger

The Hyper-V implementation of the PCI controller has the concept of a 32-bit
serial number (not to be confused with the PCI-E serial number).  This value is
sent in the protocol from the host to indicate that an SR-IOV VF device is
attached to a synthetic NIC.

Using the serial number (instead of MAC address) to associate the two devices
avoids lots of potential problems when there are duplicate MAC addresses from
tunnels or layered devices.

The patch set is broken into two parts, one is for the PCI controller
and the other is for the netvsc device. Normally, these go through different
trees but sending them together here for better review. The PCI changes
were submitted previously, but the main review comment was "why do you
need this?". This is why.

v2 - slot name can be shorter.
 remove locking when creating pci_slots; see comment for explanation

Stephen Hemminger (2):
  PCI: hv: support reporting serial number as slot information
  hv_netvsc: pair VF based on serial number

 drivers/net/hyperv/netvsc.c |  3 ++
 drivers/net/hyperv/netvsc_drv.c | 58 -
 drivers/pci/controller/pci-hyperv.c | 37 ++
 3 files changed, 73 insertions(+), 25 deletions(-)

-- 
2.18.0

[PATCH v2 1/2] PCI: hv: support reporting serial number as slot information

2018-09-14 Thread Stephen Hemminger

The Hyper-V host API for PCI provides a unique "serial number" which
can be used as basis for sysfs PCI slot table. This can be useful
for cases where userspace wants to find the PCI device based on
serial number.

When an SR-IOV NIC is added, the host sends an attach message
with serial number. The kernel doesn't use the serial number, but
it is useful when doing the same thing in a userspace driver such
as the DPDK. By having /sys/bus/pci/slots/N it provides a direct
way to find the matching PCI device.
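
As a quick check of the buffer size used for the slot name (SLOT_NAME_SIZE is
11 in the diff further down), the sketch below formats a 32-bit serial the same
way the patch does. model_slot_name() is a hypothetical userspace helper, not
part of the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SLOT_NAME_SIZE 11	/* same value as in the patch */

/* Format a 32-bit serial number as a decimal string. Returns
 * snprintf()'s result: the number of characters needed, not counting
 * the terminating NUL.
 */
static int model_slot_name(char *name, size_t size, uint32_t serial)
{
	assert(name != NULL && size > 0);
	return snprintf(name, size, "%u", (unsigned int)serial);
}
```

The largest 32-bit value needs ten decimal digits, so eleven bytes are enough
for the digits plus the terminating NUL.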

There may be some cases where the serial number is not unique, such
as when using GPUs. But the PCI slot infrastructure will handle
that.

This has a side effect which may also be useful. The common udev
network device naming policy uses the slot information (rather
than PCI address).

Signed-off-by: Stephen Hemminger 
---
 drivers/pci/controller/pci-hyperv.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index c00f82cc54aa..ee80e79db21a 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -89,6 +89,9 @@ static enum pci_protocol_version_t pci_protocol_version;
 
 #define STATUS_REVISION_MISMATCH 0xC059
 
+/* space for 32bit serial number as string */
+#define SLOT_NAME_SIZE 11
+
 /*
  * Message Types
  */
@@ -494,6 +497,7 @@ struct hv_pci_dev {
struct list_head list_entry;
refcount_t refs;
enum hv_pcichild_state state;
+   struct pci_slot *pci_slot;
struct pci_function_description desc;
bool reported_missing;
struct hv_pcibus_device *hbus;
@@ -1457,6 +1461,34 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
spin_unlock_irqrestore(&hbus->device_list_lock, flags);
 }
 
+/*
+ * Assign entries in sysfs pci slot directory.
+ *
+ * Note that this function does not need to lock the children list
+ * because it is called from pci_devices_present_work which
+ * is serialized with hv_eject_device_work because they are on the
+ * same ordered workqueue. Therefore hbus->children list will not change
+ * even when pci_create_slot sleeps.
+ */
+static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
+{
+   struct hv_pci_dev *hpdev;
+   char name[SLOT_NAME_SIZE];
+   int slot_nr;
+
+   list_for_each_entry(hpdev, &hbus->children, list_entry) {
+   if (hpdev->pci_slot)
+   continue;
+
+   slot_nr = PCI_SLOT(wslot_to_devfn(hpdev->desc.win_slot.slot));
+   snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser);
+   hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr,
+ name, NULL);
+   if (!hpdev->pci_slot)
+   pr_warn("pci_create slot %s failed\n", name);
+   }
+}
+
 /**
  * create_root_hv_pci_bus() - Expose a new root PCI bus
  * @hbus:  Root PCI bus, as understood by this driver
@@ -1480,6 +1512,7 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device *hbus)
pci_lock_rescan_remove();
pci_scan_child_bus(hbus->pci_bus);
pci_bus_assign_resources(hbus->pci_bus);
+   hv_pci_assign_slots(hbus);
pci_bus_add_devices(hbus->pci_bus);
pci_unlock_rescan_remove();
hbus->state = hv_pcibus_installed;
@@ -1742,6 +1775,7 @@ static void pci_devices_present_work(struct work_struct *work)
 */
pci_lock_rescan_remove();
pci_scan_child_bus(hbus->pci_bus);
+   hv_pci_assign_slots(hbus);
pci_unlock_rescan_remove();
break;
 
@@ -1858,6 +1892,9 @@ static void hv_eject_device_work(struct work_struct *work)
list_del(&hpdev->list_entry);
spin_unlock_irqrestore(&hpdev->hbus->device_list_lock, flags);
 
+   if (hpdev->pci_slot)
+   pci_destroy_slot(hpdev->pci_slot);
+
memset(&ctxt, 0, sizeof(ctxt));
ejct_pkt = (struct pci_eject_response *)&ctxt.pkt.message;
ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE;
-- 
2.18.0

[PATCH v2 2/2] hv_netvsc: pair VF based on serial number

2018-09-14 Thread Stephen Hemminger

Matching network devices based on MAC address is problematic
since a non-VF network device can be created with a duplicate MAC
address, causing confusion and problems.  The VMBus API does provide
a serial number that is a better matching method.
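
The lookup this enables can be sketched in userspace C as follows. The struct
and function names are hypothetical and an array stands in for the driver's
device list; only the matching rule (skip devices with no VF allocated, then
compare the serial against the slot number) follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-device state the synthetic NIC keeps. */
struct model_netvsc {
	const char *name;
	bool vf_alloc;		/* host says a VF is attached */
	uint32_t vf_serial;	/* serial sent in the VF association */
};

/* Return the synthetic device whose VF serial equals the PCI slot
 * number of the VF, or NULL when no such device exists.
 */
static const struct model_netvsc *
model_find_by_serial(const struct model_netvsc *devs, size_t n, uint32_t slot)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!devs[i].vf_alloc)
			continue;
		if (devs[i].vf_serial == slot)
			return &devs[i];
	}
	return NULL;
}
```

Unlike a MAC comparison, a tunnel or layered device that copies the MAC
address cannot produce a false match here, because it has no PCI slot.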

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c |  3 ++
 drivers/net/hyperv/netvsc_drv.c | 58 +++--
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 31c3d77b4733..fe01e141c8f8 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1203,6 +1203,9 @@ static void netvsc_send_vf(struct net_device *ndev,
 
net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
+   netdev_info(ndev, "VF slot %u %s\n",
+   net_device_ctx->vf_serial,
+   net_device_ctx->vf_alloc ? "added" : "removed");
 }
 
 static  void netvsc_receive_inband(struct net_device *ndev,
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1121a1ec407c..9dedc1463e88 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1894,20 +1894,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-   struct net_device_context *ndev_ctx;
-
-   list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
-   struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
-
-   if (ether_addr_equal(mac, dev->perm_addr))
-   return dev;
-   }
-
-   return NULL;
-}
-
 static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
 {
struct net_device_context *net_device_ctx;
@@ -2036,26 +2022,48 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
 }
 
+/* Find netvsc by VMBus serial number.
+ * The PCI hyperv controller records the serial number as the slot.
+ */
+static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
+{
+   struct device *parent = vf_netdev->dev.parent;
+   struct net_device_context *ndev_ctx;
+   struct pci_dev *pdev;
+
+   if (!parent || !dev_is_pci(parent))
+   return NULL; /* not a PCI device */
+
+   pdev = to_pci_dev(parent);
+   if (!pdev->slot) {
+   netdev_notice(vf_netdev, "no PCI slot information\n");
+   return NULL;
+   }
+
+   list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+   if (!ndev_ctx->vf_alloc)
+   continue;
+
+   if (ndev_ctx->vf_serial == pdev->slot->number)
+   return hv_get_drvdata(ndev_ctx->device_ctx);
+   }
+
+   netdev_notice(vf_netdev,
+ "no netdev found for slot %u\n", pdev->slot->number);
+   return NULL;
+}
+
 static int netvsc_register_vf(struct net_device *vf_netdev)
 {
-   struct net_device *ndev;
struct net_device_context *net_device_ctx;
-   struct device *pdev = vf_netdev->dev.parent;
struct netvsc_device *netvsc_dev;
+   struct net_device *ndev;
int ret;
 
if (vf_netdev->addr_len != ETH_ALEN)
return NOTIFY_DONE;
 
-   if (!pdev || !dev_is_pci(pdev) || dev_is_pf(pdev))
-   return NOTIFY_DONE;
-
-   /*
-* We will use the MAC address to locate the synthetic interface to
-* associate with the VF interface. If we don't find a matching
-* synthetic interface, move on.
-*/
-   ndev = get_netvsc_bymac(vf_netdev->perm_addr);
+   ndev = get_netvsc_byslot(vf_netdev);
if (!ndev)
return NOTIFY_DONE;
 
-- 
2.18.0

[net-next PATCH] tls: async support causes out-of-bounds access in crypto APIs

2018-09-14 Thread John Fastabend

When async support was added it needed to access the sk from the async
callback to report errors up the stack. The patch tried to use space
after the aead request struct by directly setting the reqsize field in
aead_request. This is an internal field that should not be used
outside the crypto APIs. It is used by the crypto code to define extra
space for private structures used in the crypto context. Users of the
API then use crypto_aead_reqsize() and add the returned amount of
bytes to the end of the request memory allocation before posting the
request to encrypt/decrypt APIs.

So this breaks (with a general protection fault and a KASAN error, if
enabled) because the request sent to decrypt is shorter than required,
causing out-of-bounds accesses in the crypto API. It also seems unlikely
the sk is even valid by the time it gets to the callback because of the
memset in the crypto layer.

Fix this by holding the sk in the skb->sk field when the callback is
set up. Because the skb is already passed through to the callback
handler via void*, we can access it in the handler. Then in the handler
we need to be careful to NULL the pointer again before kfree_skb. I
added comments on both the setup (in tls_do_decryption) and when we
clear it from the crypto callback handler tls_decrypt_done(). After
this, selftests pass again and the KASAN errors/warnings are fixed.

Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records")
Signed-off-by: John Fastabend 
---
 include/net/tls.h |4 
 net/tls/tls_sw.c  |   39 +++
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index cd0a65b..8630d28 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -128,10 +128,6 @@ struct tls_sw_context_rx {
bool async_notify;
 };
 
-struct decrypt_req_ctx {
-   struct sock *sk;
-};
-
 struct tls_record_info {
struct list_head list;
u32 end_seq;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index be4f2e9..cef69b6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -122,25 +122,32 @@ static int skb_nsg(struct sk_buff *skb, int offset, int len)
 static void tls_decrypt_done(struct crypto_async_request *req, int err)
 {
struct aead_request *aead_req = (struct aead_request *)req;
-   struct decrypt_req_ctx *req_ctx =
-   (struct decrypt_req_ctx *)(aead_req + 1);
-
struct scatterlist *sgout = aead_req->dst;
-
-   struct tls_context *tls_ctx = tls_get_ctx(req_ctx->sk);
-   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-   int pending = atomic_dec_return(&ctx->decrypt_pending);
+   struct tls_sw_context_rx *ctx;
+   struct tls_context *tls_ctx;
struct scatterlist *sg;
+   struct sk_buff *skb;
unsigned int pages;
+   int pending;
+
+   skb = (struct sk_buff *)req->data;
+   tls_ctx = tls_get_ctx(skb->sk);
+   ctx = tls_sw_ctx_rx(tls_ctx);
+   pending = atomic_dec_return(&ctx->decrypt_pending);
 
/* Propagate if there was an err */
if (err) {
ctx->async_wait.err = err;
-   tls_err_abort(req_ctx->sk, err);
+   tls_err_abort(skb->sk, err);
}
 
+   /* After using skb->sk to propagate sk through crypto async callback
+* we need to NULL it again.
+*/
+   skb->sk = NULL;
+
/* Release the skb, pages and memory allocated for crypto req */
-   kfree_skb(req->data);
+   kfree_skb(skb);
 
/* Skip the first S/G entry as it points to AAD */
for_each_sg(sg_next(sgout), sg, UINT_MAX, pages) {
@@ -175,11 +182,13 @@ static int tls_do_decryption(struct sock *sk,
   (u8 *)iv_recv);
 
if (async) {
-   struct decrypt_req_ctx *req_ctx;
-
-   req_ctx = (struct decrypt_req_ctx *)(aead_req + 1);
-   req_ctx->sk = sk;
-
+   /* Using skb->sk to push sk through to crypto async callback
+* handler. This allows propagating errors up to the socket
+* if needed. It _must_ be cleared in the async handler
+* before kfree_skb is called. We _know_ skb->sk is NULL
+* because it is a clone from strparser.
+*/
+   skb->sk = sk;
aead_request_set_callback(aead_req,
  CRYPTO_TFM_REQ_MAY_BACKLOG,
  tls_decrypt_done, skb);
@@ -1455,8 +1464,6 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
goto free_aead;
 
if (sw_ctx_rx) {
-   (*aead)->reqsize = sizeof(struct decrypt_req_ctx);
-
/* Set up strparser */
memset(&cb, 0, sizeof(cb));
cb.rcv_msg = tls_queue;

Re: [PATCH net] bnxt_en: Fix VF mac address regression.

2018-09-14 Thread Siwei Liu

Ack. Looks fine to me.

-Siwei

On Fri, Sep 14, 2018 at 12:41 PM, Michael Chan wrote:
> [...]

[Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-14 Thread Cong Wang

ra_mutex is an IPv4-specific mutex that lives inside struct netns_ipv4,
but it is initialized in the generic netns code, setup_net().

Move it to IPv4 specific net init code, inet_init_net().

Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
Cc: Kirill Tkhai 
Signed-off-by: Cong Wang 
---
 net/core/net_namespace.c | 1 -
 net/ipv4/af_inet.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 670c84b1bfc2..b272ccfcbf63 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -308,7 +308,6 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
net->user_ns = user_ns;
idr_init(&net->netns_ids);
spin_lock_init(&net->nsid_lock);
-   mutex_init(&net->ipv4.ra_mutex);
 
list_for_each_entry(ops, &pernet_list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 20fda8fb8ffd..57b7bffb93e5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1817,6 +1817,8 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.sysctl_igmp_llm_reports = 1;
net->ipv4.sysctl_igmp_qrv = 2;
 
+   mutex_init(&net->ipv4.ra_mutex);
+
return 0;
 }
 
-- 
2.14.4

Re: [PATCH net-next v2] net: sched: change tcf_del_walker() to take idrinfo->lock

2018-09-14 Thread Cong Wang

On Fri, Sep 14, 2018 at 3:46 AM Vlad Buslov  wrote:
>
>
> On Thu 13 Sep 2018 at 17:13, Cong Wang  wrote:
> > On Wed, Sep 12, 2018 at 1:51 AM Vlad Buslov  wrote:
> >>
> >>
> >> On Fri 07 Sep 2018 at 19:12, Cong Wang  wrote:
> >> > On Fri, Sep 7, 2018 at 6:52 AM Vlad Buslov  wrote:
> >> >>
> >> >> Action API was changed to work with actions and action_idr in concurrency
> >> >> safe manner, however tcf_del_walker() still uses actions without taking a
> >> >> reference or idrinfo->lock first, and deletes them directly, disregarding
> >> >> possible concurrent delete.
> >> >>
> >> >> Add tc_action_wq workqueue to action API. Implement
> >> >> tcf_idr_release_unsafe() that assumes external synchronization by caller
> >> >> and delays blocking action cleanup part to tc_action_wq workqueue. Extend
> >> >> tcf_action_cleanup() with 'async' argument to indicate that function should
> >> >> free action asynchronously.
> >> >
> >> > Where exactly is blocking in tcf_action_cleanup()?
> >> >
> >> > From your code, it looks like free_tcf(), but from my observation,
> >> > the only blocking function inside is tcf_action_goto_chain_fini()
> >> > which calls __tcf_chain_put(). But, __tcf_chain_put() is blocking
> >> > _ONLY_ when tc_chain_notify() is called, for tc action it is never
> >> > called.
> >> >
> >> > So, what else is blocking?
> >>
> >> __tcf_chain_put() calls tc_chain_tmplt_del(), which calls
> >> ops->tmplt_destroy(). This last function uses hw offload API, which is
> >> blocking.
> >
> > Good to know.
> >
> > Can we just make ops->tmplt_destroy() to use workqueue?
> > Making tc action to workqueue seems overkill, for me.
>
> How about changing tcf_chain_put_by_act() to use tc_filter_wq, instead
> of directly calling __tcf_chain_put()? IMO it is a better solution
> because it benefits all classifiers, instead of requiring every
> classifier with templates support to implement non-blocking
> ops->tmplt_destroy().

My point is, only one filter implements ops->tmplt_destroy so far,
so there is no reason to adjust all filters for this single one.
Not to mention actions; actions are innocent here.

[PATCH net] tls: fix currently broken MSG_PEEK behavior

2018-09-14 Thread Daniel Borkmann

In kTLS MSG_PEEK behavior is currently failing, strace example:

  [pid  2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
  [pid  2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
  [pid  2430] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2430] listen(4, 10)   = 0
  [pid  2430] getsockname(4, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  [pid  2430] connect(3, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2430] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2430] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2430] accept(4, {sa_family=AF_INET, sin_port=htons(49636), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5
  [pid  2430] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2430] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2430] close(4)= 0
  [pid  2430] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14
  [pid  2430] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11
  [pid  2430] recvfrom(5, "test_read_peektest_read_peektest"..., 64, MSG_PEEK, NULL, NULL) = 64

As can be seen from strace, there are two TLS records sent,
i) 'test_read_peek' and ii) '_mult_recs\0' where we end up
peeking 'test_read_peektest_read_peektest'. This is clearly
wrong, and what happens is that given peek cannot call into
tls_sw_advance_skb() to unpause strparser and proceed with
the next skb, we end up looping over the current one, copying
the 'test_read_peek' over and over into the user provided
buffer.

Here, we can only peek into the currently held skb (the current,
full TLS record), as otherwise we would have to hold all the
original skb(s) (depending on the peek depth) in a separate queue
when unpausing strparser to process the next records. The
minimally intrusive fix is to return only up to the current
record's size (which is likely what c46234ebb4d1 ("tls: RX path
for ktls") originally intended as well). Thus, after the patch we
properly peek the first record:

  [pid  2046] wait4(2075,  
  [pid  2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
  [pid  2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
  [pid  2075] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2075] listen(4, 10)   = 0
  [pid  2075] getsockname(4, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  [pid  2075] connect(3, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2075] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2075] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2075] accept(4, {sa_family=AF_INET, sin_port=htons(45732), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5
  [pid  2075] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2075] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2075] close(4)= 0
  [pid  2075] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14
  [pid  2075] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11
  [pid  2075] recvfrom(5, "test_read_peek", 64, MSG_PEEK, NULL, NULL) = 14
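
The difference between the two traces can be reproduced with a small userspace
model of the receive loop. This is only a sketch of the control flow:
model_peek() is a hypothetical function, and a flat byte string stands in for
the skb held by strparser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy from the current record into buf under peek semantics. The
 * record is never consumed, so without the early break the loop keeps
 * seeing the same record until buf is full.
 */
static size_t model_peek(const char *rec, size_t rec_len,
			 char *buf, size_t len, bool break_after_record)
{
	size_t copied = 0;

	assert(rec != NULL && buf != NULL && rec_len > 0);
	while (copied < len) {
		size_t room = len - copied;
		size_t chunk = rec_len < room ? rec_len : room;

		memcpy(buf + copied, rec, chunk);
		copied += chunk;
		if (break_after_record)
			break;	/* fixed behavior: stop at the record end */
	}
	return copied;
}
```

With break_after_record set to false, a 64-byte peek of the 14-byte record
fills the whole buffer with repeated copies, as in the first trace; with it
set to true the call returns 14, as in the second.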

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Daniel Borkmann 
---
 net/tls/tls_sw.c  |  8 +++
 tools/testing/selftests/net/tls.c | 49 +++
 2 files changed, 57 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e28a6ff..b0cea79 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -931,7 +931,15 @@ int tls_sw_recvmsg(struct sock *sk,
if (control != TLS_RECORD_TYPE_DATA)
goto recv_end;
}
+   } else {
+   /* MSG_PEEK right now cannot look beyond current skb
+* from strparser, meaning we cannot advance skb here
+* and thus unpause strparser since we'd lose original
+* one.
+*/
+   break;
}
+
/* If we have a new message from strparser, continue now. */
if (copied >= target && !ctx->recv_pkt)
break;
diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index b3ebf26..8fdfeaf 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -502,6 +502,55 @@ TEST_F(tls, recv_peek_multiple)
EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
 }
 
+TEST_F(tls, recv_peek_multiple_records)
+{
+

Re: [net-next, RFC PATCH] net: sched: cls_range: Introduce Range classifier

2018-09-14 Thread Cong Wang

On Thu, Sep 13, 2018 at 6:53 PM Amritha Nambiar wrote:
>
> This patch introduces a range classifier to support filtering based
> on ranges. Only port-range filters are supported currently. This can
> be combined with flower classifier to support filters that are a
> combination of port-ranges and other parameters based on existing
> fields supported by cls_flower.

Why should we have a special-purpose filter just for ports here?

We have achieved almost the same goal with u32 filter:

https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/network/port_mapping.cpp

There is a large overlap with other general purpose filters.

I don't see any justification for it in the patch. If it is just for
convenience, can't we build it on top of the other general-purpose
header-matching filters?

Re: [net-next,RFC PATCH] Introduce TC Range classifier

2018-09-14 Thread Cong Wang

On Fri, Sep 14, 2018 at 2:53 AM Jiri Pirko  wrote:
>
> Thu, Sep 13, 2018 at 10:52:01PM CEST, amritha.namb...@intel.com wrote:
> >This patch introduces a TC range classifier to support filtering based
> >on ranges. Only port-range filters are supported currently. This can
> >be combined with flower classifier to support filters that are a
> >combination of port-ranges and other parameters based on existing
> >fields supported by cls_flower. The 'goto chain' action can be used to
> >combine the flower and range filter.
> >The filter precedence is decided based on the 'prio' value.
>
> For example Spectrum ASIC supports mask-based and range-based matching
> in a single TCAM rule. No chains needed. Also, I don't really understand
> why is this a separate cls. I believe that this functionality should be
> put as an extension of existing cls_flower.

Exactly. u32 filters support range matching too with proper masks.
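
For reference, the usual way to express a port range with value/mask matches
is to split it into aligned power-of-two blocks. The sketch below shows that
decomposition in plain C; it does not use any tc or u32 API, and
struct model_match and model_range_to_masks() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One masked match: a port p matches when (p & mask) == value. */
struct model_match {
	uint16_t value;
	uint16_t mask;
};

/* Split the inclusive range [lo, hi] into aligned power-of-two blocks.
 * Returns the number of blocks needed and stores at most max of them.
 */
static size_t model_range_to_masks(uint16_t lo, uint16_t hi,
				   struct model_match *out, size_t max)
{
	uint32_t cur = lo;
	size_t n = 0;

	assert(lo <= hi);
	while (cur <= hi) {
		uint32_t size = 1;

		/* Grow the block while it stays aligned and in range. */
		while ((cur & ((size << 1) - 1)) == 0 &&
		       cur + (size << 1) - 1 <= hi)
			size <<= 1;
		if (n < max) {
			out[n].value = (uint16_t)cur;
			out[n].mask = (uint16_t)~(size - 1);
		}
		n++;
		cur += size;
	}
	return n;
}
```

For example, ports 1024-65535 split into six blocks, starting with 1024/0xfc00
and ending with 32768/0x8000, so six masked rules cover the whole range.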

mlx5_core: null pointer dereference in mlx5_accel_tls_device_caps() (net-next kernel)

2018-09-14 Thread Michal Kubecek

I just encountered a null pointer dereference on mlx5_core module
initialization while booting net-next kernel (based on commit
ee4fccbee7d3) on an aarch64 machine:

[   12.021971] iommu: Adding device :01:00.0 to group 3
[   12.022925] mlx5_core :01:00.0: firmware version: 12.17.2020
[   12.022954] mlx5_core :01:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
[   12.068709] Adding 98830144k swap on /dev/sda4.  Priority:-2 extents:1 across:98830144k FS
[   12.347571] (:01:00.0): E-Switch: Total vports 9, per vport: max uc(1024) max mc(16384)
[   12.351962] mlx5_core :01:00.0: Port module event: module 0, Cable plugged
[   12.366306] mlx5_core :01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(128) RxCqeCmprss(0)
[   12.366741] Unable to handle kernel NULL pointer dereference at virtual address 0050
[   12.374603] Mem abort info:
[   12.377368]   ESR = 0x9604
[   12.380406]   Exception class = DABT (current EL), IL = 32 bits
[   12.386357]   SET = 0, FnV = 0
[   12.389347]   EA = 0, S1PTW = 0
[   12.392471] Data abort info:
[   12.395343]   ISV = 0, ISS = 0x0004
[   12.399156]   CM = 0, WnR = 0
[   12.402108] user pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
[   12.408711] [0050] pgd=
[   12.413567] Internal error: Oops: 9604 [#1] SMP
[   12.418427] Modules linked in: fat mlx5_core(+) ipmi_ssif(+) aes_ce_blk 
crypto_simd cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 
sha2_ce sha256_arm64 sha1_ce ipmi_devintf ipmi_msghandler sbsa_gwdt tls mlxfw 
devlink at803x qcom_emac btrfs libcrc32c xor zlib_deflate raid6_pq 
ahci_platform libahci_platform hdma hdma_mgmt i2c_qup sg dm_multipath dm_mod 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[   12.454800] CPU: 40 PID: 742 Comm: systemd-udevd Not tainted 
4.19.0-rc3-ethnl.15-default #1
[   12.463131] Hardware name: To be filled by O.E.M. To be filled by O.E.M./To 
be filled by O.E.M., BIOS 5.13 12/12/2012
[   12.473722] pstate: 6045 (nZCv daif +PAN -UAO)
[   12.478559] pc : mlx5_accel_tls_device_caps+0x28/0x38 [mlx5_core]
[   12.484598] lr : mlx5e_tls_build_netdev+0x24/0x98 [mlx5_core]
[   12.490301] sp : 21873a30
[   12.493599] x29: 21873a30 x28: 2a72560a7940 
[   12.498895] x27: 2a7256df6000 x26: 2a71a0fed650 
[   12.504190] x25:  x24: 92c7f2b988c0 
[   12.509485] x23: 92c7fe01c0c0 x22: 2a71a0fcfa70 
[   12.514780] x21: 92c7f2b808c0 x20: 92c7f741c110 
[   12.520075] x19: 92c7f2b988c0 x18: 218739b0 
[   12.525370] x17:  x16: 2a725625ade0 
[   12.530665] x15: 29818ed4 x14: d47aab07 
[   12.535961] x13: 8a24 x12:  
[   12.541256] x11:  x10:  
[   12.546551] x9 :  x8 :  
[   12.551846] x7 :  x6 : 92c8159dc910 
[   12.557141] x5 : 0400 x4 : 7e4b205a20c7 
[   12.562436] x3 :  x2 : 2a725625ae1c 
[   12.567731] x1 : ab078a24 x0 :  
[   12.573027] Process systemd-udevd (pid: 742, stack limit = 
0x(ptrval))
[   12.580232] Call trace:
[   12.582688]  mlx5_accel_tls_device_caps+0x28/0x38 [mlx5_core]
[   12.588419]  mlx5e_build_nic_netdev+0x27c/0x348 [mlx5_core]
[   12.593974]  mlx5e_nic_init+0x1a0/0x258 [mlx5_core]
[   12.598835]  mlx5e_create_netdev+0x74/0x118 [mlx5_core]
[   12.604043]  mlx5e_add+0xf0/0x2c0 [mlx5_core]
[   12.608384]  mlx5_add_device+0x88/0x1a8 [mlx5_core]
[   12.613246]  mlx5_register_interface+0x78/0xb0 [mlx5_core]
[   12.618713]  mlx5e_init+0x24/0x30 [mlx5_core]
[   12.623052]  init+0x88/0xa0 [mlx5_core]
[   12.626850]  do_one_initcall+0x54/0x200
[   12.630667]  do_init_module+0x64/0x1d8
[   12.634401]  load_module+0x1480/0x1510
[   12.638132]  __se_sys_finit_module+0xc8/0xd8
[   12.642385]  __arm64_sys_finit_module+0x24/0x30
[   12.646901]  el0_svc_common+0x7c/0x118
[   12.650631]  el0_svc_handler+0x38/0x78
[   12.654364]  el0_svc+0x8/0xc
[   12.657229] Code: d503201f f97c7e60 f9400bf3 a8c27bfd (f9402800) 
[   12.663306] ---[ end trace 57e772dd3cf718f1 ]---

The function looks like this:


drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c:
68  {
   0x00058230 <+0>: stp x29, x30, [sp, #-32]!
   0x00058234 <+4>: mov x29, sp
   0x00058238 <+8>: str x19, [sp, #16]
   0x0005823c <+12>:mov x19, x0
   0x00058240 <+16>:mov x0, x30

drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h:
68  return mdev->fpga->tls->caps;
   0x00058244 <+20>:add x19, x19, #0x38, lsl #12

drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c:
68  {
   0x00058248 <+24>:bl  0x58248


drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h:
68  return mdev->fpga->tls->caps;
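
The faulting instruction word in the Code: line (f9402800) decodes as a load from [x0, #0x50], which would be consistent with one pointer in the mdev->fpga->tls chain being NULL. As a rough sketch of what a guarded accessor could look like, assuming nothing about the real driver layout: the structures and the function name below are invented stand-ins, and this is not a proposed fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver structures; layouts are invented. */
struct tls_state { uint32_t caps; };
struct fpga_device { struct tls_state *tls; };
struct core_dev { struct fpga_device *fpga; };

/* Return the capability word, or 0 when any link in the chain is missing. */
static uint32_t tls_device_caps_checked(const struct core_dev *mdev)
{
	if (!mdev || !mdev->fpga || !mdev->fpga->tls)
		return 0;
	return mdev->fpga->tls->caps;
}
```

Whether returning 0 is the right behaviour, or whether the caller should not be reached at all on devices without the FPGA, is a separate question this sketch does not answer.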

Re: [PATCH net-next v3 0/2] net: stmmac: Coalesce and tail addr fixes

2018-09-14 Thread David Miller

From: Jose Abreu 
Date: Thu, 13 Sep 2018 09:02:21 +0100

> The fix for coalesce timer and a fix in tail address setting that impacts
> XGMAC2 operation.

This series is fixing bugs going all the way back to 4.7

There is no logical way that targeting net-next is valid.

net-next is always for new features and cleanups.

Bug fixes always go to 'net'.

Thank you.

[PATH RFC net-next 5/8] net: phy: Add linkmode equivalents to some of the MII ethtool helpers

2018-09-14 Thread Andrew Lunn

Add helpers which take a linkmode bitmap rather than an ethtool u32 for
the advertising settings.

Signed-off-by: Andrew Lunn 
---
 include/linux/mii.h | 50 +
 1 file changed, 50 insertions(+)

diff --git a/include/linux/mii.h b/include/linux/mii.h
index 9ed49c8261d0..2da85b02e1c0 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -132,6 +132,34 @@ static inline u32 ethtool_adv_to_mii_adv_t(u32 ethadv)
return result;
 }
 
+/**
+ * linkmode_adv_to_mii_adv_t
+ * @advertising: the linkmode advertisement settings
+ *
+ * A small helper function that translates linkmode advertisement
+ * settings to phy autonegotiation advertisements for the
+ * MII_ADVERTISE register.
+ */
+static inline u32 linkmode_adv_to_mii_adv_t(unsigned long *advertising)
+{
+   u32 result = 0;
+
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT, advertising))
+   result |= ADVERTISE_10HALF;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT, advertising))
+   result |= ADVERTISE_10FULL;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT, advertising))
+   result |= ADVERTISE_100HALF;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT, advertising))
+   result |= ADVERTISE_100FULL;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_Pause_BIT, advertising))
+   result |= ADVERTISE_PAUSE_CAP;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, advertising))
+   result |= ADVERTISE_PAUSE_ASYM;
+
+   return result;
+}
+
 /**
  * mii_adv_to_ethtool_adv_t
  * @adv: value of the MII_ADVERTISE register
@@ -179,6 +207,28 @@ static inline u32 ethtool_adv_to_mii_ctrl1000_t(u32 ethadv)
return result;
 }
 
+/**
+ * linkmode_adv_to_mii_ctrl1000_t
+ * @advertising: the linkmode advertisement settings
+ *
+ * A small helper function that translates linkmode advertisement
+ * settings to phy autonegotiation advertisements for the
+ * MII_CTRL1000 register when in 1000T mode.
+ */
+static inline u32 linkmode_adv_to_mii_ctrl1000_t(unsigned long *advertising)
+{
+   u32 result = 0;
+
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+ advertising))
+   result |= ADVERTISE_1000HALF;
+   if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+ advertising))
+   result |= ADVERTISE_1000FULL;
+
+   return result;
+}
+
 /**
  * mii_ctrl1000_to_ethtool_adv_t
  * @adv: value of the MII_CTRL1000 register
-- 
2.19.0.rc1

[PATH RFC net-next 2/8] net: phy: Add phydev_warn()

2018-09-14 Thread Andrew Lunn

Not all new style LINK_MODE bits can be converted into old style
SUPPORTED bits. We need to warn when such a conversion is attempted.
Add a helper for this.

Signed-off-by: Andrew Lunn 
---
 include/linux/phy.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index d24cc46748e2..0ab9f89773fd 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -968,6 +968,9 @@ static inline void phy_device_reset(struct phy_device 
*phydev, int value)
 #define phydev_err(_phydev, format, args...)   \
dev_err(&_phydev->mdio.dev, format, ##args)
 
+#define phydev_warn(_phydev, format, args...)  \
+   dev_warn(&_phydev->mdio.dev, format, ##args)
+
 #define phydev_dbg(_phydev, format, args...)   \
dev_dbg(&_phydev->mdio.dev, format, ##args)
 
-- 
2.19.0.rc1
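
A possible use of the new macro, sketched in userspace. Everything except the phydev_warn() definition itself is an assumption: the structures are cut-down mocks, dev_warn() is replaced by a macro that records the message in a buffer, and linkmode_to_legacy_u32() is a hypothetical conversion helper. The named variadic arguments (args...) and ##args are GNU C extensions, as in the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-ins so the sketch compiles outside the kernel. */
struct device { const char *name; };
struct phy_device { struct { struct device dev; } mdio; };

static char last_warning[128];

/* Mock of dev_warn(): records the message instead of logging it. */
#define dev_warn(dev, format, args...) \
	snprintf(last_warning, sizeof(last_warning), "%s: " format, \
		 (dev)->name, ##args)

/* Same shape as the macro added by the patch. */
#define phydev_warn(_phydev, format, args...) \
	dev_warn(&_phydev->mdio.dev, format, ##args)

/*
 * Hypothetical conversion: fold a 64-bit link mode mask into a legacy
 * u32, warning when modes above bit 31 would be lost.
 */
static uint32_t linkmode_to_legacy_u32(struct phy_device *phydev,
				       uint64_t modes)
{
	uint64_t lost = modes >> 32;

	if (lost)
		phydev_warn(phydev, "dropping link modes 0x%llx\n",
			    (unsigned long long)(lost << 32));
	return (uint32_t)modes;
}
```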

[PATH RFC net-next 4/8] net: phy: Add helper for advertise to lcl value

2018-09-14 Thread Andrew Lunn

Add a helper to convert the local advertising to LCL capabilities,
which are then used to resolve pause flow control settings.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mt7530.c  |  6 +-
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c   |  5 +
 drivers/net/ethernet/freescale/fman/mac.c |  6 +-
 drivers/net/ethernet/freescale/gianfar.c  |  7 +--
 .../hisilicon/hns3/hns3pf/hclge_main.c|  6 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   |  6 +-
 drivers/net/ethernet/socionext/sni_ave.c  |  5 +
 include/linux/mii.h   | 19 +++
 8 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 62e486652e62..a5de9bffe5be 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -658,11 +658,7 @@ static void mt7530_adjust_link(struct dsa_switch *ds, int 
port,
if (phydev->asym_pause)
rmt_adv |= LPA_PAUSE_ASYM;
 
-   if (phydev->advertising & ADVERTISED_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_CAP;
-   if (phydev->advertising & ADVERTISED_Asym_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_ASYM;
-
+   lcl_adv = ethtool_adv_to_lcl_adv_t(phydev->advertising);
flowctrl = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
 
if (flowctrl & FLOW_CTRL_TX)
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
index 289129011b9f..a7e03e3ecc93 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
@@ -1495,10 +1495,7 @@ static void xgbe_phy_phydev_flowctrl(struct 
xgbe_prv_data *pdata)
if (!phy_data->phydev)
return;
 
-   if (phy_data->phydev->advertising & ADVERTISED_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_CAP;
-   if (phy_data->phydev->advertising & ADVERTISED_Asym_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_ASYM;
+   lcl_adv = ethtool_adv_to_lcl_adv_t(phy_data->phydev->advertising);
 
if (phy_data->phydev->pause) {
XGBE_SET_LP_ADV(lks, Pause);
diff --git a/drivers/net/ethernet/freescale/fman/mac.c 
b/drivers/net/ethernet/freescale/fman/mac.c
index a847b9c3b31a..d79e4e009d63 100644
--- a/drivers/net/ethernet/freescale/fman/mac.c
+++ b/drivers/net/ethernet/freescale/fman/mac.c
@@ -393,11 +393,7 @@ void fman_get_pause_cfg(struct mac_device *mac_dev, bool 
*rx_pause,
 */
 
/* get local capabilities */
-   lcl_adv = 0;
-   if (phy_dev->advertising & ADVERTISED_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_CAP;
-   if (phy_dev->advertising & ADVERTISED_Asym_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_ASYM;
+   lcl_adv = ethtool_adv_to_lcl_adv_t(phy_dev->advertising);
 
/* get link partner capabilities */
rmt_adv = 0;
diff --git a/drivers/net/ethernet/freescale/gianfar.c 
b/drivers/net/ethernet/freescale/gianfar.c
index 40a1a87cd338..a24b242bf752 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -3658,12 +3658,7 @@ static u32 gfar_get_flowctrl_cfg(struct gfar_private 
*priv)
if (phydev->asym_pause)
rmt_adv |= LPA_PAUSE_ASYM;
 
-   lcl_adv = 0;
-   if (phydev->advertising & ADVERTISED_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_CAP;
-   if (phydev->advertising & ADVERTISED_Asym_Pause)
-   lcl_adv |= ADVERTISE_PAUSE_ASYM;
-
+   lcl_adv = ethtool_adv_to_lcl_adv_t(phydev->advertising);
flowctrl = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
if (flowctrl & FLOW_CTRL_TX)
val |= MACCFG1_TX_FLOW;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index cf18608669f5..a8088ba2ac9c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5270,11 +5270,7 @@ int hclge_cfg_flowctrl(struct hclge_dev *hdev)
if (!phydev->link || !phydev->autoneg)
return 0;
 
-   if (phydev->advertising & ADVERTISED_Pause)
-   local_advertising = ADVERTISE_PAUSE_CAP;
-
-   if (phydev->advertising & ADVERTISED_Asym_Pause)
-   local_advertising |= ADVERTISE_PAUSE_ASYM;
+   local_advertising = ethtool_adv_to_lcl_adv_t(phydev->advertising);
 
if (phydev->pause)
remote_advertising = LPA_PAUSE_CAP;
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index cc1e9a96a43b..7dbfdac4067a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b
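
The mii.h hunk that adds the helper is not visible above, so the following is a reconstruction from the open-coded lines the patch removes, not the patch's actual text. The resolution function is modelled on mii_resolve_flowctrl_fdx() and is given a different name to make clear it is a userspace sketch; the constants are written out as they appear in the kernel's uapi ethtool.h and mii.h.

```c
#include <stdint.h>

#define ADVERTISED_Pause	(1u << 13)
#define ADVERTISED_Asym_Pause	(1u << 14)
#define ADVERTISE_PAUSE_CAP	0x0400
#define ADVERTISE_PAUSE_ASYM	0x0800
#define FLOW_CTRL_TX		0x01
#define FLOW_CTRL_RX		0x02

/* Reconstructed helper: ethtool pause bits -> MII_ADVERTISE pause bits. */
static uint32_t ethtool_adv_to_lcl_adv_t(uint32_t advertising)
{
	uint32_t lcl_adv = 0;

	if (advertising & ADVERTISED_Pause)
		lcl_adv |= ADVERTISE_PAUSE_CAP;
	if (advertising & ADVERTISED_Asym_Pause)
		lcl_adv |= ADVERTISE_PAUSE_ASYM;

	return lcl_adv;
}

/* Full-duplex pause resolution from local and link partner abilities. */
static uint8_t resolve_flowctrl_fdx(uint16_t lcladv, uint16_t rmtadv)
{
	uint8_t cap = 0;

	if (lcladv & rmtadv & ADVERTISE_PAUSE_CAP) {
		cap = FLOW_CTRL_TX | FLOW_CTRL_RX;
	} else if (lcladv & rmtadv & ADVERTISE_PAUSE_ASYM) {
		if (lcladv & ADVERTISE_PAUSE_CAP)
			cap = FLOW_CTRL_RX;
		else if (rmtadv & ADVERTISE_PAUSE_CAP)
			cap = FLOW_CTRL_TX;
	}

	return cap;
}
```

Because the helper always starts from zero, callers no longer need their own "lcl_adv = 0" initialisation, which only some of the converted drivers had.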

[PATH RFC net-next 1/8] net: phy: Move linkmode helpers to somewhere public

2018-09-14 Thread Andrew Lunn

phylink has some useful helpers for working with linkmode bitmaps.
Move them to their own header so other code can use them.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/phylink.c | 27 
 include/linux/linkmode.h  | 67 +++
 include/linux/mii.h   |  1 +
 include/linux/phy.h   |  1 +
 4 files changed, 69 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/linkmode.h

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 3ba5cf2a8a5f..95ab492089f2 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -68,33 +68,6 @@ struct phylink {
struct sfp_bus *sfp_bus;
 };
 
-static inline void linkmode_zero(unsigned long *dst)
-{
-   bitmap_zero(dst, __ETHTOOL_LINK_MODE_MASK_NBITS);
-}
-
-static inline void linkmode_copy(unsigned long *dst, const unsigned long *src)
-{
-   bitmap_copy(dst, src, __ETHTOOL_LINK_MODE_MASK_NBITS);
-}
-
-static inline void linkmode_and(unsigned long *dst, const unsigned long *a,
-   const unsigned long *b)
-{
-   bitmap_and(dst, a, b, __ETHTOOL_LINK_MODE_MASK_NBITS);
-}
-
-static inline void linkmode_or(unsigned long *dst, const unsigned long *a,
-   const unsigned long *b)
-{
-   bitmap_or(dst, a, b, __ETHTOOL_LINK_MODE_MASK_NBITS);
-}
-
-static inline bool linkmode_empty(const unsigned long *src)
-{
-   return bitmap_empty(src, __ETHTOOL_LINK_MODE_MASK_NBITS);
-}
-
 /**
  * phylink_set_port_modes() - set the port type modes in the ethtool mask
  * @mask: ethtool link mode mask
diff --git a/include/linux/linkmode.h b/include/linux/linkmode.h
new file mode 100644
index ..014fb86c7114
--- /dev/null
+++ b/include/linux/linkmode.h
@@ -0,0 +1,67 @@
+#ifndef __LINKMODE_H
+#define __LINKMODE_H
+
+#include 
+#include 
+#include 
+
+static inline void linkmode_zero(unsigned long *dst)
+{
+   bitmap_zero(dst, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline void linkmode_copy(unsigned long *dst, const unsigned long *src)
+{
+   bitmap_copy(dst, src, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline void linkmode_and(unsigned long *dst, const unsigned long *a,
+   const unsigned long *b)
+{
+   bitmap_and(dst, a, b, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline void linkmode_or(unsigned long *dst, const unsigned long *a,
+   const unsigned long *b)
+{
+   bitmap_or(dst, a, b, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline bool linkmode_empty(const unsigned long *src)
+{
+   return bitmap_empty(src, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline int linkmode_andnot(unsigned long *dst, const unsigned long 
*src1,
+ const unsigned long *src2)
+{
+   return bitmap_andnot(dst, src1, src2,  __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static inline void linkmode_set_bit(int nr, volatile unsigned long *addr)
+{
+   __set_bit(nr, addr);
+}
+
+static inline void linkmode_clear_bit(int nr, volatile unsigned long *addr)
+{
+   __clear_bit(nr, addr);
+}
+
+static inline void linkmode_change_bit(int nr, volatile unsigned long *addr)
+{
+   __change_bit(nr, addr);
+}
+
+static inline int linkmode_test_bit(int nr, volatile unsigned long *addr)
+{
+   return test_bit(nr, addr);
+}
+
+static inline int linkmode_equal(const unsigned long *src1,
+const unsigned long *src2)
+{
+   return bitmap_equal(src1, src2, __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+#endif /* __LINKMODE_H */
diff --git a/include/linux/mii.h b/include/linux/mii.h
index 55000ee5c6ad..567047ef0309 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -10,6 +10,7 @@
 
 
 #include 
+#include 
 #include 
 
 struct ethtool_cmd;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 192a1fa0c73b..d24cc46748e2 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.19.0.rc1
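
For readers without a kernel tree at hand, here is a minimal userspace sketch of how such helpers could be used once they are public. The linkmode_* functions below are simplified mocks written with plain loops instead of the kernel bitmap API, the NBITS value is an assumption, and have_common_mode() is a hypothetical caller.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for __ETHTOOL_LINK_MODE_MASK_NBITS; the value is an assumption. */
#define NBITS		64
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define NLONGS		((NBITS + BITS_PER_LONG - 1) / BITS_PER_LONG)

static void linkmode_zero(unsigned long *dst)
{
	for (size_t i = 0; i < NLONGS; i++)
		dst[i] = 0;
}

static void linkmode_set_bit(int nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static void linkmode_and(unsigned long *dst, const unsigned long *a,
			 const unsigned long *b)
{
	for (size_t i = 0; i < NLONGS; i++)
		dst[i] = a[i] & b[i];
}

static bool linkmode_empty(const unsigned long *src)
{
	for (size_t i = 0; i < NLONGS; i++)
		if (src[i])
			return false;
	return true;
}

/* Hypothetical caller: true when local and partner share at least one mode. */
static bool have_common_mode(const unsigned long *local,
			     const unsigned long *partner)
{
	unsigned long common[NLONGS];

	linkmode_and(common, local, partner);
	return !linkmode_empty(common);
}
```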

[PATH RFC net-next 6/8] net: ethernet xgbe expand PHY_GBIT_FEATURES

2018-09-14 Thread Andrew Lunn

The macro PHY_GBIT_FEATURES needs to change into a bitmap in order to
support link_modes. Remove its use from xgbe by replacing it with its
definition.

Probably, the current behavior is wrong. It probably should be
ANDing not assigning.

Signed-off-by: Andrew Lunn 
---
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
index a7e03e3ecc93..d49e76982453 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
@@ -878,8 +878,9 @@ static bool xgbe_phy_finisar_phy_quirks(struct 
xgbe_prv_data *pdata)
phy_write(phy_data->phydev, 0x04, 0x0d01);
phy_write(phy_data->phydev, 0x00, 0x9140);
 
-   phy_data->phydev->supported = PHY_GBIT_FEATURES;
-   phy_data->phydev->advertising = phy_data->phydev->supported;
+   phy_data->phydev->supported = (PHY_10BT_FEATURES |
+  PHY_100BT_FEATURES |
+  PHY_1000BT_FEATURES);
phy_support_asym_pause(phy_data->phydev);
 
netif_dbg(pdata, drv, pdata->netdev,
@@ -950,8 +951,9 @@ static bool xgbe_phy_belfuse_phy_quirks(struct 
xgbe_prv_data *pdata)
reg = phy_read(phy_data->phydev, 0x00);
phy_write(phy_data->phydev, 0x00, reg & ~0x00800);
 
-   phy_data->phydev->supported = PHY_GBIT_FEATURES;
-   phy_data->phydev->advertising = phy_data->phydev->supported;
+   phy_data->phydev->supported = (PHY_10BT_FEATURES |
+  PHY_100BT_FEATURES |
+  PHY_1000BT_FEATURES);
phy_support_asym_pause(phy_data->phydev);
 
netif_dbg(pdata, drv, pdata->netdev,
-- 
2.19.0.rc1
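
The remark about ANDing versus assigning can be shown with a small sketch. The SUPPORTED_* values are the legacy bits from the kernel's uapi ethtool.h; SPEED_MODES and the two functions are hypothetical names that stand for the three macros ORed together in the diff and for the two possible behaviours.

```c
#include <stdint.h>

#define SUPPORTED_10baseT_Half		(1u << 0)
#define SUPPORTED_10baseT_Full		(1u << 1)
#define SUPPORTED_100baseT_Half		(1u << 2)
#define SUPPORTED_100baseT_Full		(1u << 3)
#define SUPPORTED_1000baseT_Half	(1u << 4)
#define SUPPORTED_1000baseT_Full	(1u << 5)
#define SUPPORTED_10000baseT_Full	(1u << 12)

/* The 10/100/1000 twisted-pair modes. */
#define SPEED_MODES	(SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full | \
			 SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full | \
			 SUPPORTED_1000baseT_Half | SUPPORTED_1000baseT_Full)

/* Assigning: whatever the PHY reported is overwritten. */
static uint32_t supported_assign(uint32_t probed)
{
	(void)probed;
	return SPEED_MODES;
}

/* ANDing: keep only the modes the PHY reported that are also in the mask. */
static uint32_t supported_and(uint32_t probed)
{
	return probed & SPEED_MODES;
}
```

With a PHY that only reported 100baseT/Full and 10000baseT/Full, assigning claims all six modes while ANDing leaves just 100baseT/Full.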

[PATH RFC net-next 3/8] net: phy: Add helper to convert MII ADV register to a linkmode

2018-09-14 Thread Andrew Lunn

The phy_mii_ioctl can be used to write a value into the MII_ADVERTISE
register in the PHY. Since this changes the state of the PHY, we need
to make the same change to phydev->advertising. Add a helper which can
convert the register value to a linkmode.

Signed-off-by: Andrew Lunn 
---
 include/linux/mii.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/mii.h b/include/linux/mii.h
index 567047ef0309..8c7da9473ad9 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -303,6 +303,37 @@ static inline u32 mii_lpa_to_ethtool_lpa_x(u32 lpa)
return result | mii_adv_to_ethtool_adv_x(lpa);
 }
 
+/**
+ * mii_adv_to_linkmode_adv_t
+ * @advertising:pointer to destination link mode.
+ * @adv: value of the MII_ADVERTISE register
+ *
+ * A small helper function that translates MII_ADVERTISE bits
+ * to linkmode advertisement settings.
+ */
+static inline void mii_adv_to_linkmode_adv_t(unsigned long *advertising,
+u32 adv)
+{
+   linkmode_zero(advertising);
+
+   if (adv & ADVERTISE_10HALF)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+advertising);
+   if (adv & ADVERTISE_10FULL)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+advertising);
+   if (adv & ADVERTISE_100HALF)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+advertising);
+   if (adv & ADVERTISE_100FULL)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+advertising);
+   if (adv & ADVERTISE_PAUSE_CAP)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_Pause_BIT, advertising);
+   if (adv & ADVERTISE_PAUSE_ASYM)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, advertising);
+}
+
 /**
  * mii_advertise_flowctrl - get flow control advertisement flags
  * @cap: Flow control capabilities (FLOW_CTRL_RX, FLOW_CTRL_TX or both)
-- 
2.19.0.rc1
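
The helper above and linkmode_adv_to_mii_adv_t() from patch 5 are inverses for the six bits they share. Below is a table-driven userspace variant of both directions, offered as an alternative way to see the mapping rather than as what the patches do. The single-word linkmode macros are mocks that only work because every bit used here is below 32, and the adv_map table is a hypothetical construct; the numeric values are those of the kernel's uapi headers.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	ETHTOOL_LINK_MODE_10baseT_Half_BIT	= 0,
	ETHTOOL_LINK_MODE_10baseT_Full_BIT	= 1,
	ETHTOOL_LINK_MODE_100baseT_Half_BIT	= 2,
	ETHTOOL_LINK_MODE_100baseT_Full_BIT	= 3,
	ETHTOOL_LINK_MODE_Pause_BIT		= 13,
	ETHTOOL_LINK_MODE_Asym_Pause_BIT	= 14,
};

#define ADVERTISE_10HALF	0x0020
#define ADVERTISE_10FULL	0x0040
#define ADVERTISE_100HALF	0x0080
#define ADVERTISE_100FULL	0x0100
#define ADVERTISE_PAUSE_CAP	0x0400
#define ADVERTISE_PAUSE_ASYM	0x0800

/* Single-word mocks of the linkmode helpers (assumption: bits < 32). */
#define linkmode_zero(p)		(*(p) = 0UL)
#define linkmode_set_bit(nr, p)		(*(p) |= 1UL << (nr))
#define linkmode_test_bit(nr, p)	((*(p) >> (nr)) & 1UL)

/* One row per mode that exists in both representations. */
struct adv_map {
	int bit;
	uint32_t flag;
};

static const struct adv_map adv_map[] = {
	{ ETHTOOL_LINK_MODE_10baseT_Half_BIT,	ADVERTISE_10HALF },
	{ ETHTOOL_LINK_MODE_10baseT_Full_BIT,	ADVERTISE_10FULL },
	{ ETHTOOL_LINK_MODE_100baseT_Half_BIT,	ADVERTISE_100HALF },
	{ ETHTOOL_LINK_MODE_100baseT_Full_BIT,	ADVERTISE_100FULL },
	{ ETHTOOL_LINK_MODE_Pause_BIT,		ADVERTISE_PAUSE_CAP },
	{ ETHTOOL_LINK_MODE_Asym_Pause_BIT,	ADVERTISE_PAUSE_ASYM },
};

#define ADV_MAP_LEN	(sizeof(adv_map) / sizeof(adv_map[0]))

static uint32_t linkmode_adv_to_mii_adv_t(const unsigned long *advertising)
{
	uint32_t result = 0;

	for (size_t i = 0; i < ADV_MAP_LEN; i++)
		if (linkmode_test_bit(adv_map[i].bit, advertising))
			result |= adv_map[i].flag;

	return result;
}

static void mii_adv_to_linkmode_adv_t(unsigned long *advertising, uint32_t adv)
{
	linkmode_zero(advertising);

	for (size_t i = 0; i < ADV_MAP_LEN; i++)
		if (adv & adv_map[i].flag)
			linkmode_set_bit(adv_map[i].bit, advertising);
}
```

Register bits with no row in the table are dropped on the way in, so the round trip is only lossless for these six modes.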

[PATH RFC net-next 7/8] net: phy: Replace phy driver features u32 with link_mode bitmap

2018-09-14 Thread Andrew Lunn

This is one step in allowing phylib to make use of link_mode bitmaps,
instead of u32 for supported and advertised features. Convert the phy
drivers to use bitmaps to indicate the features they support. This
requires some macro magic in order to construct constant bitmaps used
to initialise the driver structures.

Some new PHY_*_FEATURES are added, to indicate FIBRE is supported, and
that all media ports are supported. This is done since bitmaps cannot
be ORed together at compile time.

Within phylib, the features bitmap is currently turned back into a
u32.  The MAC API to phylib needs to be cleaned up before the core of
phylib can be converted to using bitmaps instead of u32.

Signed-off-by: Andrew Lunn 
---
 drivers/net/ethernet/marvell/pxa168_eth.c |   4 +-
 drivers/net/phy/aquantia.c|  12 +-
 drivers/net/phy/bcm63xx.c |   9 +-
 drivers/net/phy/marvell.c |   2 +-
 drivers/net/phy/marvell10g.c  |  11 +-
 drivers/net/phy/microchip_t1.c|   2 +-
 drivers/net/phy/phy_device.c  | 204 +-
 include/linux/phy.h   |  24 ++-
 8 files changed, 229 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c 
b/drivers/net/ethernet/marvell/pxa168_eth.c
index 3a9730612a70..b406395bbb37 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -988,8 +988,8 @@ static int pxa168_init_phy(struct net_device *dev)
cmd.base.phy_address = pep->phy_addr;
cmd.base.speed = pep->phy_speed;
cmd.base.duplex = pep->phy_duplex;
-   ethtool_convert_legacy_u32_to_link_mode(cmd.link_modes.advertising,
-   PHY_BASIC_FEATURES);
+   bitmap_copy(cmd.link_modes.advertising, PHY_BASIC_FEATURES,
+   __ETHTOOL_LINK_MODE_MASK_NBITS);
cmd.base.autoneg = AUTONEG_ENABLE;
 
if (cmd.base.speed != 0)
diff --git a/drivers/net/phy/aquantia.c b/drivers/net/phy/aquantia.c
index 319edc9c8ec7..632472cab3bb 100644
--- a/drivers/net/phy/aquantia.c
+++ b/drivers/net/phy/aquantia.c
@@ -115,7 +115,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQ1202,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQ1202",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
@@ -127,7 +127,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQ2104,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQ2104",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
@@ -139,7 +139,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQR105,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQR105",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
@@ -151,7 +151,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQR106,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQR106",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
@@ -163,7 +163,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQR107,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQR107",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
@@ -175,7 +175,7 @@ static struct phy_driver aquantia_driver[] = {
.phy_id = PHY_ID_AQR405,
.phy_id_mask= 0xfff0,
.name   = "Aquantia AQR405",
-   .features   = PHY_AQUANTIA_FEATURES,
+   .features   = PHY_10GBIT_FULL_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.aneg_done  = genphy_c45_aneg_done,
.config_aneg= aquantia_config_aneg,
diff --git a/drivers/net/phy/bcm63xx.c b/drivers/net/phy/bcm63xx.c
index cf14613745c9..ff5acf01b877 100644
--- a/drivers/net/phy/bcm63xx.c
+++ b/drivers/net/phy/bcm63xx.c
@@ -42,6 +42,9 @@ static int bcm63xx_config_init(struct phy_devi

[PATH RFC net-next 8/8] net: phy: Add build warning if assumptions get broken

2018-09-14 Thread Andrew Lunn

The macro magic to build constant bitmaps of supported PHY features
breaks when we have more than 63 ETHTOOL_LINK_MODE bits. Make the
breakage loud, not a subtle bug, when we get to that condition.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/phy_device.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index eed61ee1d394..7bee59c7834b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2297,6 +2297,13 @@ static int __init phy_init(void)
 {
int rc;
 
+   /* The phy_basic_features, phy_gbit_features etc, above, only
+* work for values up to 63. Ensure we get a loud error if
+* this threshold is exceeded, and the necessary changes are
+* made.
+*/
+   BUILD_BUG_ON(__ETHTOOL_LINK_MODE_LAST > 63);
+
rc = mdio_bus_init();
if (rc)
return rc;
-- 
2.19.0.rc1
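
The phy_device.c macros the comment refers to are not visible in this thread, so the following only sketches the general idea under stated assumptions: feature sets are built as 64-bit constants, split into unsigned long words at compile time, and guarded by a build-time check. All names and the LINK_MODE_LAST value are hypothetical, and _Static_assert stands in for BUILD_BUG_ON().

```c
#include <limits.h>
#include <stdint.h>

#define BIT_ULL(nr)	(1ULL << (nr))

/* Hypothetical stand-in for __ETHTOOL_LINK_MODE_LAST. */
#define LINK_MODE_LAST	51

/* Fail the build, not the runtime, once a bit no longer fits. */
_Static_assert(LINK_MODE_LAST <= 63, "link modes no longer fit in 64 bits");

/* Bits 0-3 are the 10/100 half/full modes, 4-5 the 1000baseT modes. */
#define BASIC_FEATURES_64	(BIT_ULL(0) | BIT_ULL(1) | BIT_ULL(2) | BIT_ULL(3))
#define GBIT_FEATURES_64	(BASIC_FEATURES_64 | BIT_ULL(4) | BIT_ULL(5))

/* Split a 64-bit constant into however many longs the platform needs. */
#if ULONG_MAX > 0xffffffffUL
#define FEATURES_INIT(x)	{ (unsigned long)(x) }
#define FEATURES_LONGS		1
#else
#define FEATURES_INIT(x)	{ (unsigned long)((x) & 0xffffffffULL), \
				  (unsigned long)((x) >> 32) }
#define FEATURES_LONGS		2
#endif

static const unsigned long basic_features[FEATURES_LONGS] =
	FEATURES_INIT(BASIC_FEATURES_64);
static const unsigned long gbit_features[FEATURES_LONGS] =
	FEATURES_INIT(GBIT_FEATURES_64);

/* Test one bit in a bitmap stored as an array of unsigned long. */
static int features_test_bit(int nr, const unsigned long *map)
{
	const unsigned int bits = sizeof(unsigned long) * CHAR_BIT;

	return (map[nr / bits] >> (nr % bits)) & 1UL;
}
```

The ORing happens on the 64-bit constants, where the compiler can fold it. The array initialisers only ever see the finished value, which is why the scheme stops working once a mode needs bit 64 or above.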

[PATH RFC net-next 0/8] Continue towards using linkmode in phylib

2018-09-14 Thread Andrew Lunn

These patches contain some further cleanup and helpers, and the first
real patch towards using linkmode bitmaps in phylink.

It is RFC because I don't like patch #7 and maybe somebody has a
better idea how to do this. Ideally, we want to initialise a Linux
generic bitmap at compile time.

Thanks
Andrew

Andrew Lunn (8):
  net: phy: Move linkmode helpers to somewhere public
  net: phy: Add phydev_warn()
  net: phy: Add helper to convert MII ADV register to a linkmode
  net: phy: Add helper for advertise to lcl value
  net: phy: Add linkmode equivalents to some of the MII ethtool helpers
  net: ethernet xgbe expand PHY_GBIT_FEATURES
  net: phy: Replace phy driver features u32 with link_mode bitmap
  net: phy: Add build warning if assumptions get broken

 drivers/net/dsa/mt7530.c  |   6 +-
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c   |  15 +-
 drivers/net/ethernet/freescale/fman/mac.c |   6 +-
 drivers/net/ethernet/freescale/gianfar.c  |   7 +-
 .../hisilicon/hns3/hns3pf/hclge_main.c|   6 +-
 drivers/net/ethernet/marvell/pxa168_eth.c |   4 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   |   6 +-
 drivers/net/ethernet/socionext/sni_ave.c  |   5 +-
 drivers/net/phy/aquantia.c|  12 +-
 drivers/net/phy/bcm63xx.c |   9 +-
 drivers/net/phy/marvell.c |   2 +-
 drivers/net/phy/marvell10g.c  |  11 +-
 drivers/net/phy/microchip_t1.c|   2 +-
 drivers/net/phy/phy_device.c  | 211 +-
 drivers/net/phy/phylink.c |  27 ---
 include/linux/linkmode.h  |  67 ++
 include/linux/mii.h   | 101 +
 include/linux/phy.h   |  28 ++-
 18 files changed, 421 insertions(+), 104 deletions(-)
 create mode 100644 include/linux/linkmode.h

-- 
2.19.0.rc1

[PATCH net] net: dsa: mv88e6xxx: Fix ATU Miss Violation

2018-09-14 Thread Andrew Lunn

Fix a cut/paste error and a typo which results in ATU miss violations
not being reported.

Fixes: 0977644c5005 ("net: dsa: mv88e6xxx: Decode ATU problem interrupt")
Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/global1.h | 2 +-
 drivers/net/dsa/mv88e6xxx/global1_atu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/global1.h 
b/drivers/net/dsa/mv88e6xxx/global1.h
index 7c791c1da4b9..bef01331266f 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -128,7 +128,7 @@
 #define MV88E6XXX_G1_ATU_OP_GET_CLR_VIOLATION  0x7000
 #define MV88E6XXX_G1_ATU_OP_AGE_OUT_VIOLATION  BIT(7)
 #define MV88E6XXX_G1_ATU_OP_MEMBER_VIOLATION   BIT(6)
-#define MV88E6XXX_G1_ATU_OP_MISS_VIOLTATIONBIT(5)
+#define MV88E6XXX_G1_ATU_OP_MISS_VIOLATION BIT(5)
 #define MV88E6XXX_G1_ATU_OP_FULL_VIOLATION BIT(4)
 
 /* Offset 0x0C: ATU Data Register */
diff --git a/drivers/net/dsa/mv88e6xxx/global1_atu.c 
b/drivers/net/dsa/mv88e6xxx/global1_atu.c
index 307410898fc9..5200e4bdce93 100644
--- a/drivers/net/dsa/mv88e6xxx/global1_atu.c
+++ b/drivers/net/dsa/mv88e6xxx/global1_atu.c
@@ -349,7 +349,7 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int 
irq, void *dev_id)
chip->ports[entry.portvec].atu_member_violation++;
}
 
-   if (val & MV88E6XXX_G1_ATU_OP_MEMBER_VIOLATION) {
+   if (val & MV88E6XXX_G1_ATU_OP_MISS_VIOLATION) {
dev_err_ratelimited(chip->dev,
"ATU miss violation for %pM portvec %x\n",
entry.mac, entry.portvec);
-- 
2.19.0.rc1
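
To make the effect of the cut/paste error concrete, here is a small userspace sketch of the two conditions before and after the fix. The bit positions come from the global1.h hunk above; struct atu_counters and the two functions are hypothetical stand-ins for the per-port statistics and the interrupt thread.

```c
#include <stdint.h>

#define BIT(n)				(1u << (n))
#define ATU_OP_MEMBER_VIOLATION		BIT(6)
#define ATU_OP_MISS_VIOLATION		BIT(5)

/* Hypothetical stand-in for the per-port violation counters. */
struct atu_counters {
	unsigned int member;
	unsigned int miss;
};

/* Before the fix: the miss branch tests the member bit again. */
static void count_violations_buggy(uint16_t val, struct atu_counters *c)
{
	if (val & ATU_OP_MEMBER_VIOLATION)
		c->member++;
	if (val & ATU_OP_MEMBER_VIOLATION)	/* cut/paste error */
		c->miss++;
}

/* After the fix: each branch tests its own bit. */
static void count_violations_fixed(uint16_t val, struct atu_counters *c)
{
	if (val & ATU_OP_MEMBER_VIOLATION)
		c->member++;
	if (val & ATU_OP_MISS_VIOLATION)
		c->miss++;
}
```

With only the miss bit set, the buggy version counts nothing, and with only the member bit set it reports a miss that did not happen.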

Re: [bpf-next, v4 0/5] Introduce eBPF flow dissector

2018-09-14 Thread Y Song

On Fri, Sep 14, 2018 at 12:24 PM Alexei Starovoitov
 wrote:
>
> On Fri, Sep 14, 2018 at 07:46:17AM -0700, Petar Penkov wrote:
> > From: Petar Penkov 
> >
> > This patch series hardens the RX stack by allowing flow dissection in BPF,
> > as previously discussed [1]. Because of the rigorous checks of the BPF
> > verifier, this provides significant security guarantees. In particular, the
> > BPF flow dissector cannot get inside of an infinite loop, as with
> > CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
> > read outside of packet bounds, because all memory accesses are checked.
> > Also, with BPF the administrator can decide which protocols to support,
> > reducing potential attack surface. Rarely encountered protocols can be
> > excluded from dissection and the program can be updated without kernel
> > recompile or reboot if a bug is discovered.
> >
> > Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
> > This includes a new BPF program and attach type.
> >
> > Patch 2 adds the new BPF flow dissector definitions to tools/uapi.
> >
> > Patch 3 adds support for the new BPF program type to libbpf and bpftool.
> >
> > Patch 4 adds a flow dissector program in BPF. This parses most protocols in
> > __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
> > and address types).
> >
> > Patch 5 adds a selftest that attaches the BPF program to the flow dissector
> > and sends traffic with different levels of encapsulation.
> >
> > Performance Evaluation:
> > The in-kernel implementation was compared against the demo program from
> > patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
> >   $perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
> >   -t 10
>
> Looks great. Applied to bpf-next with one extra patch:
>  SEC("dissect")
> -int dissect(struct __sk_buff *skb)
> +int _dissect(struct __sk_buff *skb)
>
> otherwise the test doesn't build.
> I'm not sure how it builds for you. Which llvm did you use?

This is a known issue. IIRC, llvm <= 4 should be okay and llvm >= 5 would fail.

>
> Also above command works and ipv4 test in ./test_flow_dissector.sh
> is passing as well, but it still fails at the end for me:
> ./test_flow_dissector.sh
> bpffs not mounted. Mounting...
> 0: IP
> 1: IPV6
> 2: IPV6OP
> 3: IPV6FR
> 4: MPLS
> 5: VLAN
> Testing IPv4...
> inner.dest4: 127.0.0.1
> inner.source4: 127.0.0.3
> pkts: tx=10 rx=10
> inner.dest4: 127.0.0.1
> inner.source4: 127.0.0.3
> pkts: tx=10 rx=0
> inner.dest4: 127.0.0.1
> inner.source4: 127.0.0.3
> pkts: tx=10 rx=10
> Testing IPIP...
> tunnels before test:
> tunl0: any/ip remote any local any ttl inherit nopmtudisc
> sit_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
> ipip_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
> sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
> gre_test_LV5N: gre/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
> gre0: gre/ip remote any local any ttl inherit nopmtudisc
> inner.dest4: 192.168.0.1
> inner.source4: 1.1.1.1
> encap proto:   4
> outer.dest4: 127.0.0.1
> outer.source4: 127.0.0.2
> pkts: tx=10 rx=0
> tunnels after test:
> tunl0: any/ip remote any local any ttl inherit nopmtudisc
> sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
> gre0: gre/ip remote any local any ttl inherit nopmtudisc
> selftests: test_flow_dissector [FAILED]
>
> is it something in my setup or test is broken?
>

Re: [bpf-next, v4 0/5] Introduce eBPF flow dissector

2018-09-14 Thread Petar Penkov

On Fri, Sep 14, 2018 at 2:47 PM, Y Song  wrote:
> On Fri, Sep 14, 2018 at 12:24 PM Alexei Starovoitov
>  wrote:
>>
>> On Fri, Sep 14, 2018 at 07:46:17AM -0700, Petar Penkov wrote:
>> > From: Petar Penkov 
>> >
>> > This patch series hardens the RX stack by allowing flow dissection in BPF,
>> > as previously discussed [1]. Because of the rigorous checks of the BPF
>> > verifier, this provides significant security guarantees. In particular, the
>> > BPF flow dissector cannot get inside of an infinite loop, as with
>> > CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
>> > read outside of packet bounds, because all memory accesses are checked.
>> > Also, with BPF the administrator can decide which protocols to support,
>> > reducing potential attack surface. Rarely encountered protocols can be
>> > excluded from dissection and the program can be updated without kernel
>> > recompile or reboot if a bug is discovered.
>> >
>> > Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
>> > This includes a new BPF program and attach type.
>> >
>> > Patch 2 adds the new BPF flow dissector definitions to tools/uapi.
>> >
>> > Patch 3 adds support for the new BPF program type to libbpf and bpftool.
>> >
>> > Patch 4 adds a flow dissector program in BPF. This parses most protocols in
>> > __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
>> > and address types).
>> >
>> > Patch 5 adds a selftest that attaches the BPF program to the flow dissector
>> > and sends traffic with different levels of encapsulation.
>> >
>> > Performance Evaluation:
>> > The in-kernel implementation was compared against the demo program from
>> > patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
>> >   $perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
>> >   -t 10
>>
>> Looks great. Applied to bpf-next with one extra patch:
>>  SEC("dissect")
>> -int dissect(struct __sk_buff *skb)
>> +int _dissect(struct __sk_buff *skb)
>>
>> otherwise the test doesn't build.
>> I'm not sure how it builds for you. Which llvm did you use?
>
> This is a known issue. IIRC, llvm <= 4 should be okay and llvm >= 5 would fail.
>
I was running a much older version of llvm so I imagine this was the
issue. Thanks for the fix!
>>
>> Also above command works and ipv4 test in ./test_flow_dissector.sh
>> is passing as well, but it still fails at the end for me:
>> ./test_flow_dissector.sh
>> bpffs not mounted. Mounting...
>> 0: IP
>> 1: IPV6
>> 2: IPV6OP
>> 3: IPV6FR
>> 4: MPLS
>> 5: VLAN
>> Testing IPv4...
>> inner.dest4: 127.0.0.1
>> inner.source4: 127.0.0.3
>> pkts: tx=10 rx=10
>> inner.dest4: 127.0.0.1
>> inner.source4: 127.0.0.3
>> pkts: tx=10 rx=0
>> inner.dest4: 127.0.0.1
>> inner.source4: 127.0.0.3
>> pkts: tx=10 rx=10
>> Testing IPIP...
>> tunnels before test:
>> tunl0: any/ip remote any local any ttl inherit nopmtudisc
>> sit_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
>> ipip_test_LV5N: any/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
>> sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
>> gre_test_LV5N: gre/ip remote 127.0.0.2 local 127.0.0.1 dev lo ttl inherit
>> gre0: gre/ip remote any local any ttl inherit nopmtudisc
>> inner.dest4: 192.168.0.1
>> inner.source4: 1.1.1.1
>> encap proto:   4
>> outer.dest4: 127.0.0.1
>> outer.source4: 127.0.0.2
>> pkts: tx=10 rx=0
>> tunnels after test:
>> tunl0: any/ip remote any local any ttl inherit nopmtudisc
>> sit0: ipv6/ip remote any local any ttl 64 nopmtudisc
>> gre0: gre/ip remote any local any ttl inherit nopmtudisc
>> selftests: test_flow_dissector [FAILED]
>>
>> is it something in my setup or test is broken?
>>
I just reran the test and it is passing. We will investigate what
could be causing the issue.

[PATCH bpf-next] tools/bpf: bpftool: improve output format for bpftool net

2018-09-14 Thread Yonghong Song

This is a follow-up patch to commit f6f3bac08ff9
("tools/bpf: bpftool: add net support").
Some improvements are made to the bpftool net output.
Specifically, the plain output is more concise, so that
each attachment should fit nicely on one line.
Compared to the previous output, the prog tag is removed
since it can easily be obtained from the program id.
As with xdp attachments, the device name is added
to tc_filters attachments.

BPF programs attached through the shared block
mechanism are supported as well.
  $ ip link add dev v1 type veth peer name v2
  $ tc qdisc add dev v1 ingress_block 10 egress_block 20 clsact
  $ tc qdisc add dev v2 ingress_block 10 egress_block 20 clsact
  $ tc filter add block 10 protocol ip prio 25 bpf obj bpf_shared.o sec ingress flowid 1:1
  $ tc filter add block 20 protocol ip prio 30 bpf obj bpf_cyclic.o sec classifier flowid 1:1
  $ bpftool net
  xdp [
  ]
  tc_filters [
   v2(7) qdisc_clsact_ingress bpf_shared.o:[ingress] id 23
   v2(7) qdisc_clsact_egress bpf_cyclic.o:[classifier] id 24
   v1(8) qdisc_clsact_ingress bpf_shared.o:[ingress] id 23
   v1(8) qdisc_clsact_egress bpf_cyclic.o:[classifier] id 24
  ]

The documentation and "bpftool net help" are updated
to make it clear that the current implementation
supports only xdp and tc attachments. For programs
attached to cgroups, "bpftool cgroup" can be used
to dump attachments. For other programs, e.g.
sk_{filter,skb,msg,reuseport} and lwt/seg6,
iproute2 tools should be used.

The new output:
  $ bpftool net
  xdp [
   eth0(2) id/drv 198
  ]
  tc_filters [
   eth0(2) qdisc_clsact_ingress fbflow_icmp id 335 act [{icmp_action id 336}]
   eth0(2) qdisc_clsact_egress fbflow_egress id 334
  ]
  $ bpftool -jp net
  [{
      "xdp": [{
          "devname": "eth0",
          "ifindex": 2,
          "id/drv": 198
      }
      ],
      "tc_filters": [{
          "devname": "eth0",
          "ifindex": 2,
          "kind": "qdisc_clsact_ingress",
          "name": "fbflow_icmp",
          "id": 335,
          "act": [{
              "name": "icmp_action",
              "id": 336
          }
          ]
      },{
          "devname": "eth0",
          "ifindex": 2,
          "kind": "qdisc_clsact_egress",
          "name": "fbflow_egress",
          "id": 334
      }
      ]
  }
  ]

Signed-off-by: Yonghong Song 
---
 .../bpf/bpftool/Documentation/bpftool-net.rst |  58 +-
 tools/bpf/bpftool/main.h  |   3 +-
 tools/bpf/bpftool/net.c   | 100 --
 tools/bpf/bpftool/netlink_dumper.c|  78 ++
 tools/bpf/bpftool/netlink_dumper.h|  20 ++--
 5 files changed, 143 insertions(+), 116 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
index 48a61837a264..433581592c72 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-net.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
@@ -26,9 +26,20 @@ NET COMMANDS
 DESCRIPTION
 ===
**bpftool net { show | list } [ dev name ]**
- List all networking device driver and tc attachment in the system.
-
-  Output will start with all xdp program attachment, followed by
+  List bpf program attachments in the kernel networking subsystem.
+
+  Currently, only device driver xdp attachments and tc filter
+  classification/action attachments are implemented, i.e., for
+  program types **BPF_PROG_TYPE_SCHED_CLS**,
+  **BPF_PROG_TYPE_SCHED_ACT** and **BPF_PROG_TYPE_XDP**.
+  For programs attached to a particular cgroup, e.g.,
+  **BPF_PROG_TYPE_CGROUP_SKB**, **BPF_PROG_TYPE_CGROUP_SOCK**,
+  **BPF_PROG_TYPE_SOCK_OPS** and **BPF_PROG_TYPE_CGROUP_SOCK_ADDR**,
+  users can use **bpftool cgroup** to dump cgroup attachments.
+  For sk_{filter, skb, msg, reuseport} and lwt/seg6
+  bpf programs, users should consult other tools, e.g., iproute2.
+
+  The current output will start with all xdp program attachments, followed by
   all tc class/qdisc bpf program attachments. Both xdp programs and
   tc programs are ordered based on ifindex number. If multiple bpf
   programs attached to the same networking device through **tc filter**,
@@ -62,19 +73,14 @@ EXAMPLES
 ::
 
   xdp [
-  ifindex 2 devname eth0 prog_id 198
+   eth0(2) id/drv 198
   ]
   tc_filters [
-  ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
-prog_id 111727 tag d08fe3b4319bc2fd act []
-  ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
-prog_id 130246 tag 3f265c7f26db62c9 act []
-

Re: [PATH RFC net-next 1/8] net: phy: Move linkmode helpers to somewhere public

2018-09-14 Thread Florian Fainelli

On 09/14/2018 02:38 PM, Andrew Lunn wrote:
> phylink has some useful helpers for working with linkmode bitmaps.
> Move them to their own header so other code can use them.

Good idea, I wonder if we should create a more specific directory within
include/linux/ that can host a variety of PHYLIB, PHYLINK and what not
header files, but this could be solved later on.

> 
> Signed-off-by: Andrew Lunn 

Acked-by: Florian Fainelli 
-- 
Florian

Re: [PATH RFC net-next 2/8] net: phy: Add phydev_warn()

2018-09-14 Thread Florian Fainelli

On 09/14/2018 02:38 PM, Andrew Lunn wrote:
> Not all new style LINK_MODE bits can be converted into old style
> SUPPORTED bits. We need to warn when such a conversion is attempted.
> Add a helper for this.
> 
> Signed-off-by: Andrew Lunn 

Acked-by: Florian Fainelli 

Do you mind converting drivers/net/phy/marvell10g.c to use it? I would
also suggest adding phydev_info() while we are at it, and making the two
conversions to it that exist in drivers/net/phy/phy_device.c.

Thanks!
-- 
Florian

Re: [PATH RFC net-next 3/8] net: phy: Add helper to convert MII ADV register to a linkmode

2018-09-14 Thread Florian Fainelli

On 09/14/2018 02:38 PM, Andrew Lunn wrote:
> The phy_mii_ioctl can be used to write a value into the MII_ADVERTISE
> register in the PHY. Since this changes the state of the PHY, we need
> to make the same change to phydev->advertising. Add a helper which can
> convert the register value to a linkmode.

It would have been nice if we could eliminate the duplication between
mii_adv_to_ethtool_adv_t() and mii_adv_to_linkmode_adv_t() but I don't
really see how without changing the former function's signature.

Reviewed-by: Florian Fainelli 

> 
> Signed-off-by: Andrew Lunn 
> ---
>  include/linux/mii.h | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/include/linux/mii.h b/include/linux/mii.h
> index 567047ef0309..8c7da9473ad9 100644
> --- a/include/linux/mii.h
> +++ b/include/linux/mii.h
> @@ -303,6 +303,37 @@ static inline u32 mii_lpa_to_ethtool_lpa_x(u32 lpa)
>   return result | mii_adv_to_ethtool_adv_x(lpa);
>  }
>  
> +/**
> + * mii_adv_to_linkmode_adv_t
> + * @advertising:pointer to destination link mode.
> + * @adv: value of the MII_ADVERTISE register
> + *
> + * A small helper function that translates MII_ADVERTISE bits
> + * to linkmode advertisement settings.
> + */
> +static inline void mii_adv_to_linkmode_adv_t(unsigned long *advertising,
> +  u32 adv)
> +{
> + linkmode_zero(advertising);
> +
> + if (adv & ADVERTISE_10HALF)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
> +  advertising);
> + if (adv & ADVERTISE_10FULL)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
> +  advertising);
> + if (adv & ADVERTISE_100HALF)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
> +  advertising);
> + if (adv & ADVERTISE_100FULL)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
> +  advertising);
> + if (adv & ADVERTISE_PAUSE_CAP)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_Pause_BIT, advertising);
> + if (adv & ADVERTISE_PAUSE_ASYM)
> + linkmode_set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, advertising);
> +}
> +
>  /**
>   * mii_advertise_flowctrl - get flow control advertisement flags
>   * @cap: Flow control capabilities (FLOW_CTRL_RX, FLOW_CTRL_TX or both)
> 


-- 
Florian
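
For readers following along outside the kernel tree, the bit mapping in the
quoted helper can be tried in user space. What follows is only a sketch under
stated assumptions: the demo_* names are hypothetical, the constants are local
copies of the MII_ADVERTISE and ethtool link mode values, a single unsigned
long stands in for the multi-word linkmode bitmap, and a lookup table replaces
the patch's chain of if statements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MII_ADVERTISE register bits (local copies of the uapi mii.h values). */
#define ADVERTISE_10HALF	0x0020
#define ADVERTISE_10FULL	0x0040
#define ADVERTISE_100HALF	0x0080
#define ADVERTISE_100FULL	0x0100
#define ADVERTISE_PAUSE_CAP	0x0400
#define ADVERTISE_PAUSE_ASYM	0x0800

/* Link mode bit numbers (local copies of the ethtool bit indices). */
enum {
	DEMO_LINK_MODE_10baseT_Half_BIT		= 0,
	DEMO_LINK_MODE_10baseT_Full_BIT		= 1,
	DEMO_LINK_MODE_100baseT_Half_BIT	= 2,
	DEMO_LINK_MODE_100baseT_Full_BIT	= 3,
	DEMO_LINK_MODE_Pause_BIT		= 13,
	DEMO_LINK_MODE_Asym_Pause_BIT		= 14,
};

/* Hypothetical table pairing each register bit with a link mode bit. */
struct demo_adv_map {
	uint32_t adv_bit;
	unsigned int mode_bit;
};

static const struct demo_adv_map demo_adv_table[] = {
	{ ADVERTISE_10HALF,	DEMO_LINK_MODE_10baseT_Half_BIT },
	{ ADVERTISE_10FULL,	DEMO_LINK_MODE_10baseT_Full_BIT },
	{ ADVERTISE_100HALF,	DEMO_LINK_MODE_100baseT_Half_BIT },
	{ ADVERTISE_100FULL,	DEMO_LINK_MODE_100baseT_Full_BIT },
	{ ADVERTISE_PAUSE_CAP,	DEMO_LINK_MODE_Pause_BIT },
	{ ADVERTISE_PAUSE_ASYM,	DEMO_LINK_MODE_Asym_Pause_BIT },
};

/*
 * User-space stand-in for the helper in the patch: clear the destination,
 * then set one link mode bit per advertised register bit.
 */
static void demo_mii_adv_to_linkmode(unsigned long *advertising, uint32_t adv)
{
	size_t i;

	assert(advertising != NULL);
	*advertising = 0;	/* plays the role of linkmode_zero() */

	for (i = 0; i < sizeof(demo_adv_table) / sizeof(demo_adv_table[0]); i++)
		if (adv & demo_adv_table[i].adv_bit)
			*advertising |= 1UL << demo_adv_table[i].mode_bit;
}
```

Calling demo_mii_adv_to_linkmode() with ADVERTISE_10HALF | ADVERTISE_100FULL |
ADVERTISE_PAUSE_CAP sets bits 0, 3 and 13 of the destination word. Any bits
that were set in the destination beforehand are discarded, matching the
linkmode_zero() call at the top of the quoted helper.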

Re: [PATCH net-next RFC 6/8] net: make gro configurable

2018-09-14 Thread Willem de Bruijn

On Fri, Sep 14, 2018 at 2:39 PM Stephen Hemminger
 wrote:
>
> On Fri, 14 Sep 2018 13:59:39 -0400
> Willem de Bruijn  wrote:
>
> > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > index e5d236595206..8cb8e02c8ab6 100644
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -572,6 +572,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
> >struct list_head *head,
> >struct sk_buff *skb)
> >  {
> > + const struct net_offload *ops;
> >   struct sk_buff *pp = NULL;
> >   struct sk_buff *p;
> >   struct vxlanhdr *vh, *vh2;
> > @@ -606,6 +607,12 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
> >   goto out;
> >   }
> >
> > + rcu_read_lock();
> > + ops = net_gro_receive(dev_offloads, ETH_P_TEB);
> > + rcu_read_unlock();
> > + if (!ops)
> > + goto out;
>
> Isn't rcu_read_lock already held here?
> RCU read lock is always held in the receive handler path

There is a critical section on receive, taken in
netif_receive_skb_core, but gro code runs before that. All the
existing gro handlers call rcu_read_lock.

> > +
> >   skb_gro_pull(skb, sizeof(struct vxlanhdr)); /* pull vxlan header */
> >
> >   list_for_each_entry(p, head, list) {
> > @@ -621,6 +628,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
> >   }
> >
> >   pp = call_gro_receive(eth_gro_receive, head, skb);
> > +
> >   flush = 0;
>
> whitespace change crept into this patch.

Oops, thanks.

Re: [PATCH net-next RFC 6/8] net: make gro configurable

2018-09-14 Thread Willem de Bruijn

On Fri, Sep 14, 2018 at 6:50 PM Willem de Bruijn
 wrote:
>
> On Fri, Sep 14, 2018 at 2:39 PM Stephen Hemminger
>  wrote:
> >
> > On Fri, 14 Sep 2018 13:59:39 -0400
> > Willem de Bruijn  wrote:
> >
> > > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > > index e5d236595206..8cb8e02c8ab6 100644
> > > --- a/drivers/net/vxlan.c
> > > +++ b/drivers/net/vxlan.c
> > > @@ -572,6 +572,7 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
> > >struct list_head *head,
> > >struct sk_buff *skb)
> > >  {
> > > + const struct net_offload *ops;
> > >   struct sk_buff *pp = NULL;
> > >   struct sk_buff *p;
> > >   struct vxlanhdr *vh, *vh2;
> > > @@ -606,6 +607,12 @@ static struct sk_buff *vxlan_gro_receive(struct sock *sk,
> > >   goto out;
> > >   }
> > >
> > > + rcu_read_lock();
> > > + ops = net_gro_receive(dev_offloads, ETH_P_TEB);
> > > + rcu_read_unlock();
> > > + if (!ops)
> > > + goto out;
> >
> > Isn't rcu_read_lock already held here?
> > RCU read lock is always held in the receive handler path
>
> There is a critical section on receive, taken in
> netif_receive_skb_core, but gro code runs before that. All the
> existing gro handlers call rcu_read_lock.

Though if dev_gro_receive is the entry point for all of gro, then
all other handlers are guaranteed to execute within its RCU
read-side critical section.

Re: [PATCH net-next RFC 6/8] net: make gro configurable

2018-09-14 Thread Willem de Bruijn

On Fri, Sep 14, 2018 at 1:59 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> Add net_offload flag NET_OFF_FLAG_GRO_OFF. If set, a net_offload will
> not be used for gro receive processing.
>
> Also add sysctl helper proc_do_net_offload that toggles this flag and
> register sysctls net.{core,ipv4,ipv6}.gro
>
> Signed-off-by: Willem de Bruijn 
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 20d9552afd38..0fd5273bc931 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -154,6 +154,7 @@
>  #define GRO_MAX_HEAD (MAX_HEADER + 128)
>
>  static DEFINE_SPINLOCK(ptype_lock);
> +DEFINE_SPINLOCK(offload_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;  /* Taps */
>  static struct list_head offload_base __read_mostly;
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index b1a2c5e38530..d2d72afdd9eb 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -34,6 +35,58 @@ static int net_msg_warn; /* Unused, but still a sysctl */
>  int sysctl_fb_tunnels_only_for_init_net __read_mostly = 0;
>  EXPORT_SYMBOL(sysctl_fb_tunnels_only_for_init_net);
>
> +extern spinlock_t offload_lock;
> +
> +#define NET_OFF_TBL_LEN256
> +
> +int proc_do_net_offload(struct ctl_table *ctl, int write, void __user *buffer,
> +   size_t *lenp, loff_t *ppos)
> +{
> +   unsigned long bitmap[NET_OFF_TBL_LEN / (sizeof(unsigned long) << 3)];
> +   struct ctl_table tbl = { .maxlen = NET_OFF_TBL_LEN, .data = bitmap };
> +   unsigned long flag = (unsigned long) ctl->extra2;
> +   struct net_offload __rcu **offs = ctl->extra1;
> +   struct net_offload *off;
> +   int i, ret;
> +
> +   memset(bitmap, 0, sizeof(bitmap));
> +
> +   spin_lock(&offload_lock);
> +
> +   for (i = 0; i < tbl.maxlen; i++) {
> +   off = rcu_dereference_protected(offs[i], lockdep_is_held(&offload_lock));
> +   if (off && off->flags & flag) {

This does not actually work as is. No protocol will have this flag set
out of the box.

I was in the middle of rewriting some of this when it became topical,
so I sent it out for discussion. It's bound not to be the only bug of
the patchset as is. I'll work through them to get it back in shape.
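
To make the remark above concrete, here is a rough user-space sketch of a
flag-to-bitmap scan shaped like the quoted loop. It is an illustration under
assumptions, not the patch itself: the quoted hunk is cut off before the loop
body, so setting bit i for a matching slot is a guess, and struct
demo_offload, DEMO_FLAG_GRO_OFF and the other demo_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TBL_LEN		256
#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITMAP_LONGS	(DEMO_TBL_LEN / DEMO_BITS_PER_LONG)
#define DEMO_FLAG_GRO_OFF	0x1UL	/* hypothetical flag value */

/* Hypothetical stand-in for struct net_offload; only the flags matter here. */
struct demo_offload {
	unsigned long flags;
};

/*
 * Scan a table of DEMO_TBL_LEN slots and set bit i in the bitmap for every
 * populated slot whose flags include 'flag'. Returns the number of bits set.
 */
static size_t demo_flags_to_bitmap(struct demo_offload *const *offs,
				   unsigned long flag, unsigned long *bitmap)
{
	size_t i, nset = 0;

	assert(offs != NULL && bitmap != NULL);
	memset(bitmap, 0, DEMO_BITMAP_LONGS * sizeof(unsigned long));

	for (i = 0; i < DEMO_TBL_LEN; i++) {
		/* Parenthesised for clarity; same result as 'off->flags & flag'. */
		if (offs[i] && (offs[i]->flags & flag)) {
			bitmap[i / DEMO_BITS_PER_LONG] |=
				1UL << (i % DEMO_BITS_PER_LONG);
			nset++;
		}
	}
	return nset;
}
```

With a table whose populated entries all have flags == 0, the function returns
0 and leaves the bitmap empty. That corresponds to the out-of-the-box state
described in the reply, where no protocol has the flag set.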

Re: [PATH RFC net-next 3/8] net: phy: Add helper to convert MII ADV register to a linkmode

2018-09-14 Thread Andrew Lunn

On Fri, Sep 14, 2018 at 03:23:14PM -0700, Florian Fainelli wrote:
> On 09/14/2018 02:38 PM, Andrew Lunn wrote:
> > The phy_mii_ioctl can be used to write a value into the MII_ADVERTISE
> > register in the PHY. Since this changes the state of the PHY, we need
> > to make the same change to phydev->advertising. Add a helper which can
> > convert the register value to a linkmode.
> 
> It would have been nice if we could eliminate the duplication between
> mii_adv_to_ethtool_adv_t() and mii_adv_to_linkmode_adv_t() but I don't
> really see how without changing the former function's signature.

Some of these functions are also used by non-phylib MAC drivers. So
the ethtool version cannot be eliminated.

And the UAPI for EEE still uses a u32 for which modes EEE is
advertised :-(

   Andrew
