Re: [PATCH net-next 0/2] net: ipv4: fix circular-list infinite loop

2019-07-01 Thread Tariq Toukan


On 7/1/2019 9:46 AM, Ran Rozenstein wrote:
> 
> 
>> -Original Message-
>> From: Tariq Toukan
>> Sent: Sunday, June 30, 2019 10:57
>> To: David Miller ; f...@strlen.de
>> Cc: netdev@vger.kernel.org; Ran Rozenstein ; Tariq
>> Toukan 
>> Subject: Re: [PATCH net-next 0/2] net: ipv4: fix circular-list infinite loop
>>
>>
>>
>> On 6/27/2019 7:54 PM, David Miller wrote:
>>> From: Florian Westphal 
>>> Date: Thu, 27 Jun 2019 14:03:31 +0200
>>>
 Tariq and Ran reported a regression caused by net-next commit
 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list").

 This happens when net.ipv4.conf.$dev.promote_secondaries sysctl is
 enabled -- we can arrange for ifa->next to point at ifa, so next
 process that tries to walk the list loops forever.

 Fix this and extend rtnetlink.sh with a small test case for this.
>>>
>>> Series applied, thanks Florian.
>>>
>>
>> Thanks Florian!
>>
>> Ran, please test and update.
>>
>> Tariq
> 
> Thanks Florian.
> The issue didn't reproduce tonight with the fixes.
> 
> Ran.
> 

Sounds good!

Thanks,
Tariq
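
The regression described in this thread — an ifa_list entry whose next pointer refers back to itself — can be sketched with a minimal walker. A bounded walk makes the cycle observable instead of spinning forever; the struct and names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative miniature of an ifa-style singly linked list; names are
 * ours, not the kernel's. */
struct ifa {
	struct ifa *next;
};

/* Bounded walk: returns the node count, or -1 once more than 'limit'
 * steps have been taken -- which is exactly what a self-referencing
 * entry (ifa->next == ifa) produces.  The kernel's unbounded walk
 * simply spins. */
static int walk(const struct ifa *head, int limit)
{
	int n = 0;
	const struct ifa *p;

	for (p = head; p; p = p->next) {
		if (++n > limit)
			return -1;	/* cycle detected */
	}
	return n;
}
```

With promote_secondaries enabled, the buggy list surgery could leave `b.next == &b`, at which point every subsequent walker loops.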


Re: [PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local

2019-07-01 Thread Jiri Benc
On Sat, 29 Jun 2019 11:04:54 -0700, Song Liu wrote:
> > Maybe use "__always_inline" as most other tests do?  
> 
> I meant "static __always_inline".

Sure, I can do that. It doesn't seem to be as consistent as you
suggest, though.

There are three different forms used in selftests/bpf/progs:

static __always_inline
static inline __attribute__((__always_inline__))
static inline __attribute__((always_inline))
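
The three spellings are interchangeable ways of forcing inlining. A userspace sketch showing their equivalence — the `bpf_always_inline` macro name is ours (the kernel's `__always_inline` comes from its own headers), the attribute spellings are the three found in the selftests:

```c
#include <assert.h>

/* Stand-in for the kernel macro; expands to the same attribute as the
 * other two forms below. */
#define bpf_always_inline inline __attribute__((__always_inline__))

static bpf_always_inline int form_one(int x) { return x + 1; }

static inline __attribute__((__always_inline__)) int form_two(int x)
{
	return x + 2;
}

static inline __attribute__((always_inline)) int form_three(int x)
{
	return x + 3;
}
```

All three compile to mandatorily-inlined functions; the bug being fixed is that a bare `inline` is only a hint, which some clang/llvm versions ignore in BPF programs.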

As this is a bug causing selftests to fail (at least for some clang/llvm
versions), how about applying this to bpf.git as a minimal fix and
unifying the progs in bpf-next?

Thanks,

 Jiri


[PATCH] xfrm: use list_for_each_entry_safe in xfrm_policy_flush

2019-07-01 Thread Li RongQing
The iterated pol may be freed, since it is not protected
by RCU or a spinlock when it is put, leading to a use-after-free.
Use the _safe variant to iterate safely against removal.

Signed-off-by: Li RongQing 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3235562f6588..87d770dab1f5 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1772,7 +1772,7 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, 
bool task_valid)
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 {
int dir, err = 0, cnt = 0;
-   struct xfrm_policy *pol;
+   struct xfrm_policy *pol, *tmp;
 
spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 
@@ -1781,7 +1781,7 @@ int xfrm_policy_flush(struct net *net, u8 type, bool 
task_valid)
goto out;
 
 again:
-   list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+   list_for_each_entry_safe(pol, tmp, &net->xfrm.policy_all, walk.all) {
dir = xfrm_policy_id2dir(pol->index);
if (pol->walk.dead ||
dir >= XFRM_POLICY_MAX ||
-- 
2.16.2
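
The _safe iteration the patch proposes can be sketched outside the kernel: the successor is cached before the current entry may be freed. This is a plain C stand-in for `list_for_each_entry_safe()`, not the xfrm code itself:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int key;
	struct node *next;
};

static struct node *push(struct node *head, int key)
{
	struct node *n = malloc(sizeof(*n));	/* error handling elided */

	n->key = key;
	n->next = head;
	return n;
}

/* Delete every node matching 'key'.  As in list_for_each_entry_safe(),
 * the successor is cached in 'tmp' before the current node may be
 * freed, so the iterator never reads freed memory. */
static int delete_matching(struct node **head, int key)
{
	struct node **pp = head, *n, *tmp;
	int freed = 0;

	for (n = *head; n; n = tmp) {
		tmp = n->next;		/* cache successor first */
		if (n->key == key) {
			*pp = tmp;	/* unlink */
			free(n);
			freed++;
		} else {
			pp = &n->next;
		}
	}
	return freed;
}
```

Without the cached `tmp`, the loop's `n = n->next` would read out of a just-freed node — the UAF the commit message is worried about.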



Re: [PATCH rdma-next v4 06/17] RDMA/counter: Add "auto" configuration mode support

2019-07-01 Thread Leon Romanovsky
On Sun, Jun 30, 2019 at 12:40:54AM +, Jason Gunthorpe wrote:
> On Tue, Jun 18, 2019 at 08:26:14PM +0300, Leon Romanovsky wrote:
>
> > +static void __rdma_counter_dealloc(struct rdma_counter *counter)
> > +{
> > +   mutex_lock(&counter->lock);
> > +   counter->device->ops.counter_dealloc(counter);
> > +   mutex_unlock(&counter->lock);
> > +}
>
> Does this lock do anything? The kref is 0 at this point, so no other
> thread can have a pointer to this lock.

Yes, it is a leftover from the atomic_read implementation.

>
> > +
> > +static void rdma_counter_dealloc(struct rdma_counter *counter)
> > +{
> > +   if (!counter)
> > +   return;
>
> Counter is never NULL.

Ohh, right, I'll clean some code near 
rdma_counter_dealloc/__rdma_counter_dealloc.

Thanks

>
> Jason


[PATCH net] Documentation/networking: fix default_ttl typo in mpls-sysctl

2019-07-01 Thread Hangbin Liu
default_ttl should be an integer instead of a bool

Reported-by: Ying Xu 
Fixes: a59166e47086 ("mpls: allow TTL propagation from IP packets to be 
configured")
Signed-off-by: Hangbin Liu 
---
 Documentation/networking/mpls-sysctl.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 2f24a1912a48..025cc9b96992 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -30,7 +30,7 @@ ip_ttl_propagate - BOOL
0 - disabled / RFC 3443 [Short] Pipe Model
1 - enabled / RFC 3443 Uniform Model (default)
 
-default_ttl - BOOL
+default_ttl - INTEGER
Default TTL value to use for MPLS packets where it cannot be
propagated from an IP header, either because one isn't present
or ip_ttl_propagate has been disabled.
-- 
2.19.2



Re: [PATCH] xfrm: use list_for_each_entry_safe in xfrm_policy_flush

2019-07-01 Thread Florian Westphal
Li RongQing  wrote:
> The iterated pol may be freed, since it is not protected
> by RCU or a spinlock when it is put, leading to a use-after-free.
> Use the _safe variant to iterate safely against removal.
> 
> Signed-off-by: Li RongQing 
> ---
>  net/xfrm/xfrm_policy.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 3235562f6588..87d770dab1f5 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1772,7 +1772,7 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 
> type, bool task_valid)
>  int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
>  {
>   int dir, err = 0, cnt = 0;
> - struct xfrm_policy *pol;
> + struct xfrm_policy *pol, *tmp;
>  
>   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
>  
> @@ -1781,7 +1781,7 @@ int xfrm_policy_flush(struct net *net, u8 type, bool 
> task_valid)
>   goto out;
>  
>  again:
> - list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
> + list_for_each_entry_safe(pol, tmp, &net->xfrm.policy_all, walk.all) {
>   dir = xfrm_policy_id2dir(pol->index);
>   if (pol->walk.dead ||
>   dir >= XFRM_POLICY_MAX ||

This function drops the lock, but after re-acquire jumps to the 'again'
label, so I do not see the UAF as the entire loop gets restarted.
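
The restart pattern Florian describes — rescan from the head after each step that may have invalidated the iterator — can be sketched the same way (illustrative C, not the xfrm code, and minus the locking):

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int key;
	struct node *next;
};

static struct node *push(struct node *head, int key)
{
	struct node *n = malloc(sizeof(*n));	/* error handling elided */

	n->key = key;
	n->next = head;
	return n;
}

/* Delete matching nodes by restarting the scan from the head after
 * every removal.  The plain (non-_safe) iterator is fine here because
 * no pointer out of a freed node is ever followed -- the shape of the
 * 'again:' loop in xfrm_policy_flush(). */
static int flush(struct node **head, int key)
{
	struct node **pp, *n;
	int freed = 0;

again:
	for (pp = head; (n = *pp); pp = &n->next) {
		if (n->key == key) {
			*pp = n->next;	/* unlink before freeing */
			free(n);
			freed++;
			goto again;	/* rescan from the head */
		}
	}
	return freed;
}
```

The restart costs extra scans but removes any need to cache a successor pointer, which is why the `_safe` conversion is unnecessary in the flush path.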


[PATCH] sis900: add ethtool tests (link, eeprom)

2019-07-01 Thread Sergej Benilov
Add tests for ethtool: link test, EEPROM read test.
Correct a few typos, too.

Signed-off-by: Sergej Benilov 
---
 drivers/net/ethernet/sis/sis900.c | 78 +--
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sis/sis900.c 
b/drivers/net/ethernet/sis/sis900.c
index 9b036c857b1d..a781bce23ec8 100644
--- a/drivers/net/ethernet/sis/sis900.c
+++ b/drivers/net/ethernet/sis/sis900.c
@@ -262,7 +262,7 @@ static int sis900_get_mac_addr(struct pci_dev *pci_dev,
/* check to see if we have sane EEPROM */
signature = (u16) read_eeprom(ioaddr, EEPROMSignature);
	if (signature == 0xffff || signature == 0x0000) {
-   printk (KERN_WARNING "%s: Error EERPOM read %x\n",
+   printk (KERN_WARNING "%s: Error EEPROM read %x\n",
pci_name(pci_dev), signature);
return 0;
}
@@ -359,9 +359,9 @@ static int sis635_get_mac_addr(struct pci_dev *pci_dev,
  *
  * SiS962 or SiS963 model, use EEPROM to store MAC address. And EEPROM
  * is shared by
- * LAN and 1394. When access EEPROM, send EEREQ signal to hardware first
- * and wait for EEGNT. If EEGNT is ON, EEPROM is permitted to be access
- * by LAN, otherwise is not. After MAC address is read from EEPROM, send
+ * LAN and 1394. When accessing EEPROM, send EEREQ signal to hardware first
+ * and wait for EEGNT. If EEGNT is ON, EEPROM is permitted to be accessed
+ * by LAN, otherwise it is not. After MAC address is read from EEPROM, send
  * EEDONE signal to refuse EEPROM access by LAN.
  * The EEPROM map of SiS962 or SiS963 is different to SiS900.
  * The signature field in SiS962 or SiS963 spec is meaningless.
@@ -2122,6 +2122,73 @@ static void sis900_get_wol(struct net_device *net_dev, 
struct ethtool_wolinfo *w
wol->supported = (WAKE_PHY | WAKE_MAGIC);
 }
 
+static const char sis900_gstrings_test[][ETH_GSTRING_LEN] = {
+   "Link test (on/offline)",
+   "EEPROM read test   (on/offline)",
+};
+#define SIS900_TEST_LEN	ARRAY_SIZE(sis900_gstrings_test)
+
+static int sis900_eeprom_readtest(struct net_device *net_dev)
+{
+   struct sis900_private *sis_priv = netdev_priv(net_dev);
+   void __iomem *ioaddr = sis_priv->ioaddr;
+   int wait, ret = -EAGAIN;
+   u16 signature;
+
+	if (sis_priv->chipset_rev == SIS96x_900_REV) {
+		sw32(mear, EEREQ);
+		for (wait = 0; wait < 2000; wait++) {
+			if (sr32(mear) & EEGNT) {
+				signature = (u16) read_eeprom(ioaddr, EEPROMSignature);
+				ret = 0;
+				break;
+			}
+			udelay(1);
+		}
+		sw32(mear, EEDONE);
+	} else {
+		signature = (u16) read_eeprom(ioaddr, EEPROMSignature);
+		if (signature != 0xffff && signature != 0x0000)
+			ret = 0;
+	}
+	return ret;
+}
+
+static void sis900_diag_test(struct net_device *netdev,
+   struct ethtool_test *test, u64 *data)
+{
+   struct sis900_private *nic = netdev_priv(netdev);
+   int i;
+
+	memset(data, 0, SIS900_TEST_LEN * sizeof(u64));
+	data[0] = !mii_link_ok(&nic->mii_info);
+	data[1] = sis900_eeprom_readtest(netdev);
+	for (i = 0; i < SIS900_TEST_LEN; i++)
+		test->flags |= data[i] ? ETH_TEST_FL_FAILED : 0;
+
+   msleep_interruptible(4 * 1000);
+}
+
+static int sis900_get_sset_count(struct net_device *netdev, int sset)
+{
+   switch (sset) {
+   case ETH_SS_TEST:
+   return SIS900_TEST_LEN;
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static void sis900_get_strings(struct net_device *netdev, u32 stringset, u8 
*data)
+{
+   switch (stringset) {
+   case ETH_SS_TEST:
+   memcpy(data, *sis900_gstrings_test, 
sizeof(sis900_gstrings_test));
+   break;
+   }
+}
+
 static const struct ethtool_ops sis900_ethtool_ops = {
.get_drvinfo= sis900_get_drvinfo,
.get_msglevel   = sis900_get_msglevel,
@@ -2132,6 +2199,9 @@ static const struct ethtool_ops sis900_ethtool_ops = {
.set_wol= sis900_set_wol,
.get_link_ksettings = sis900_get_link_ksettings,
.set_link_ksettings = sis900_set_link_ksettings,
+   .self_test  = sis900_diag_test,
+   .get_strings= sis900_get_strings,
+   .get_sset_count = sis900_get_sset_count,
 };
 
 /**
-- 
2.17.1



Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-07-01 Thread Song Liu
Hi Andy,

Thanks for these detailed analysis. 

> On Jun 30, 2019, at 8:12 AM, Andy Lutomirski  wrote:
> 
> On Fri, Jun 28, 2019 at 12:05 PM Song Liu  wrote:
>> 
>> Hi Andy,
>> 
>>> On Jun 27, 2019, at 4:40 PM, Andy Lutomirski  wrote:
>>> 
>>> On 6/27/19 1:19 PM, Song Liu wrote:
 This patch introduce unprivileged BPF access. The access control is
 achieved via device /dev/bpf. Users with write access to /dev/bpf are able
 to call sys_bpf().
 Two ioctl command are added to /dev/bpf:
 The two commands enable/disable permission to call sys_bpf() for current
 task. This permission is noted by bpf_permitted in task_struct. This
 permission is inherited during clone(CLONE_THREAD).
 Helper function bpf_capable() is added to check whether the task has got
 permission via /dev/bpf.
>>> 
 diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
 index 0e079b2298f8..79dc4d641cf3 100644
 --- a/kernel/bpf/verifier.c
 +++ b/kernel/bpf/verifier.c
 @@ -9134,7 +9134,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
 *attr,
 env->insn_aux_data[i].orig_idx = i;
 env->prog = *prog;
 env->ops = bpf_verifier_ops[env->prog->type];
 -is_priv = capable(CAP_SYS_ADMIN);
 +is_priv = bpf_capable(CAP_SYS_ADMIN);
>>> 
>>> Huh?  This isn't a hardening measure -- the "is_priv" verifier mode allows 
>>> straight-up leaks of private kernel state to user mode.
>>> 
>>> (For that matter, the pending lockdown stuff should possibly consider this 
>>> a "confidentiality" issue.)
>>> 
>>> 
>>> I have a bigger issue with this patch, though: it's a really awkward way to 
>>> pretend to have capabilities. For bpf, it seems like you could make this be 
>>> a *real* capability without too much pain since there's only one syscall 
>>> there.  Just find a way to pass an fd to /dev/bpf into the syscall.  If 
>>> this means you need a new bpf_with_cap() syscall that takes an extra 
>>> argument, so be it.  The old bpf() syscall can just translate to 
>>> bpf_with_cap(..., -1).
>>> 
>>> For a while, I've considered a scheme I call "implicit rights".  There 
>>> would be a directory in /dev called /dev/implicit_rights.  This would 
>>> either be part of devtmpfs or a whole new filesystem -- it would *not* be 
>>> any other filesystem.  The contents would be files that can't be read or 
>>> written and exist only in memory. You create them with a privileged 
>>> syscall.  Certain actions that are sensitive but not at the level of 
>>> CAP_SYS_ADMIN (use of large-attack-surface bpf stuff, creation of user 
>>> namespaces, profiling the kernel, etc) could require an "implicit right".  
>>> When you do them, if you don't have CAP_SYS_ADMIN, the kernel would do a 
>>> path walk for, say, /dev/implicit_rights/bpf and, if the object exists, can 
>>> be opened, and actually refers to the "bpf" rights object, then the action 
>>> is allowed.  Otherwise it's denied.
>>> 
>>> This is extensible, and it doesn't require the rather ugly per-task state 
>>> of whether it's enabled.
>>> 
>>> For things like creation of user namespaces, there's an existing API, and 
>>> the default is that it works without privilege.  Switching it to an 
>>> implicit right has the benefit of not requiring code changes to programs 
>>> that already work as non-root.
>>> 
>>> But, for BPF in particular, this type of compatibility issue doesn't exist 
>>> now.  You already can't use most eBPF functionality without privilege.  New 
>>> bpf-using programs meant to run without privilege are *new*, so they can 
>>> use a new improved API.  So, rather than adding this obnoxious ioctl, just 
>>> make the API explicit, please.
>>> 
>>> Also, please cc: linux-abi next time.
>> 
>> Thanks for your inputs.
>> 
>> I think we need to clarify the use case here. In this case, we are NOT
>> thinking about creating new tools for unprivileged users. Instead, we
>> would like to use existing tools without root.
> 
> I read patch 4, and I interpret it very differently.  Patches 2-4 are
> creating a new version of libbpf and a new version of bpftool.  Given
> this, I see no real justification for adding a new in-kernel per-task
> state instead of just pushing the complexity into libbpf.

I am not sure whether we are on the same page. Let me try an example, 
say we have application A, which calls sys_bpf(). 

Before the series: we have to run A with root; 
After the series:  we add a special user with access to /dev/bpf, and 
   run A with this special user. 

If we look at the whole system, I would say we are more secure after 
the series. 

I am not trying to make an extreme example here, because this use case
is the motivation here. 

To stay safe, we have to properly manage the permission of /dev/bpf. 
This is just like we need to properly manage access to /etc/sudoers and 
/dev/mem. 

Does this make sense? 

Thanks,
Song
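
The access model in this thread — a task may use sys_bpf() iff it can open the control node for writing — is ordinary discretionary access control, and can be sketched in userspace. Note that `/dev/bpf` is the node the series proposes and does not exist on current kernels; `/dev/null` stands in for a writable node below:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the calling task can open 'ctl_node' for writing --
 * the permission test the proposed /dev/bpf scheme reduces to. */
static int may_use_bpf(const char *ctl_node)
{
	int fd = open(ctl_node, O_WRONLY);

	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}
```

A program would run this check before issuing bpf() calls; granting a special user write access to the node is then analogous to managing `/etc/sudoers`, as the message above argues.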



Re: [PATCH net-next v1] bonding: add an option to specify a delay between peer notifications

2019-07-01 Thread Jiri Pirko
Sun, Jun 30, 2019 at 08:59:31PM CEST, vinc...@bernat.ch wrote:

[...]


>+module_param(peer_notif_delay, int, 0);
>+MODULE_PARM_DESC(peer_notif_delay, "Delay between each peer notification on "
>+ "failover event, in milliseconds");

No module options please. Use netlink. See bond_changelink() function.

[...]


Re: [PATCH bpf-next] virtio_net: add XDP meta data support

2019-07-01 Thread Jason Wang



On 2019/6/27 4:06 PM, Yuya Kusakabe wrote:

This adds XDP meta data support to both receive_small() and
receive_mergeable().

Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access")
Signed-off-by: Yuya Kusakabe 
---
  drivers/net/virtio_net.c | 40 +---
  1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4f3de0ac8b0b..e787657fc568 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -371,7 +371,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
   struct receive_queue *rq,
   struct page *page, unsigned int offset,
   unsigned int len, unsigned int truesize,
-  bool hdr_valid)
+  bool hdr_valid, unsigned int metasize)
  {
struct sk_buff *skb;
struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -393,17 +393,25 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
else
hdr_padded_len = sizeof(struct padded_vnet_hdr);
  
-	if (hdr_valid)

+   if (hdr_valid && !metasize)
memcpy(hdr, p, hdr_len);
  
  	len -= hdr_len;

offset += hdr_padded_len;
p += hdr_padded_len;
  
-	copy = len;

+   copy = len + metasize;
if (copy > skb_tailroom(skb))
copy = skb_tailroom(skb);
-   skb_put_data(skb, p, copy);
+
+   if (metasize) {
+   skb_put_data(skb, p - metasize, copy);



I would rather keep copy untouched above, and use copy + metasize here;
then you can save the following decrement as well. Or tweak the caller
to count the meta into the offset; then we only need to deal with
skb_pull() and skb_metadata_set() here.




+   __skb_pull(skb, metasize);
+   skb_metadata_set(skb, metasize);
+   copy -= metasize;
+   } else {
+   skb_put_data(skb, p, copy);
+   }
  
  	len -= copy;

offset += copy;
@@ -644,6 +652,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
unsigned int delta = 0;
struct page *xdp_page;
int err;
+   unsigned int metasize = 0;
  
  	len -= vi->hdr_len;

stats->bytes += len;
@@ -683,8 +692,8 @@ static struct sk_buff *receive_small(struct net_device *dev,
  
  		xdp.data_hard_start = buf + VIRTNET_RX_PAD + vi->hdr_len;

xdp.data = xdp.data_hard_start + xdp_headroom;
-   xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
+   xdp.data_meta = xdp.data;
xdp.rxq = &rq->xdp_rxq;
orig_data = xdp.data;
act = bpf_prog_run_xdp(xdp_prog, &xdp);
@@ -695,9 +704,11 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
/* Recalculate length in case bpf program changed it */
delta = orig_data - xdp.data;
len = xdp.data_end - xdp.data;
+   metasize = xdp.data - xdp.data_meta;
break;
case XDP_TX:
stats->xdp_tx++;
+   xdp.data_meta = xdp.data;



Why need this?



xdpf = convert_to_xdp_frame(&xdp);
if (unlikely(!xdpf))
goto err_xdp;
@@ -735,11 +746,14 @@ static struct sk_buff *receive_small(struct net_device 
*dev,
}
skb_reserve(skb, headroom - delta);
skb_put(skb, len);
-   if (!delta) {
+   if (!delta && !metasize) {
buf += header_offset;
memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
} /* keep zeroed vnet hdr since packet was changed by bpf */



Is there any method to preserve the vnet header here? We probably don't 
want to lose it for XDP_PASS when packet is not modified.



  
+	if (metasize)

+   skb_metadata_set(skb, metasize);
+
  err:
return skb;
  
@@ -761,7 +775,7 @@ static struct sk_buff *receive_big(struct net_device *dev,

  {
struct page *page = buf;
struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len,
- PAGE_SIZE, true);
+ PAGE_SIZE, true, 0);
  
  	stats->bytes += len - vi->hdr_len;

if (unlikely(!skb))
@@ -793,6 +807,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
unsigned int truesize;
unsigned int headroom = mergeable_ctx_to_headroom(ctx);
int err;
+   unsigned int metasize = 0;
  
  	head_skb = NULL;

stats->bytes += len - vi->hdr_len;
@@ -839,8 +854,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
data = page_address(xdp_page) + offset;
xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM
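
The meta-data bookkeeping discussed in this review reduces to pointer arithmetic on the xdp buffer. A plain-buffer sketch that mirrors — but does not reproduce — `bpf_xdp_adjust_meta()` and the driver's `metasize` computation:

```c
#include <assert.h>

/* Plain-buffer stand-in for the xdp pointer triple. */
struct xdp {
	unsigned char *data;
	unsigned char *data_end;
	unsigned char *data_meta;
};

/* What the patch computes after the BPF program has run. */
static unsigned long metasize(const struct xdp *x)
{
	return (unsigned long)(x->data - x->data_meta);
}

/* Shape of bpf_xdp_adjust_meta(): grow (negative delta) or shrink the
 * meta area.  The real helper also bounds data_meta against the
 * headroom; that check is elided here. */
static int adjust_meta(struct xdp *x, int delta)
{
	unsigned char *m = x->data_meta + delta;

	if (m > x->data)	/* meta must stay in front of data */
		return -1;
	x->data_meta = m;
	return 0;
}
```

Initializing `data_meta = data` (rather than marking it invalid) is what lets the driver later copy `[data_meta, data_end)` into the skb and `__skb_pull()` the meta bytes off, as the patch does.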

Re: [PATCH net-next v1] bonding: add an option to specify a delay between peer notifications

2019-07-01 Thread Vincent Bernat
 ❦  1 July 2019 11:27 +02, Jiri Pirko :

>>+module_param(peer_notif_delay, int, 0);
>>+MODULE_PARM_DESC(peer_notif_delay, "Delay between each peer notification on "
>>+"failover event, in milliseconds");
>
> No module options please. Use netlink. See bond_changelink() function.

It's also present in the patch. I'll do a v2 removing the ability to set
the default value through a module parameter.
-- 
Don't patch bad code - rewrite it.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-07-01 Thread Lorenz Bauer
On Fri, 28 Jun 2019 at 20:10, Song Liu  wrote:
> There should be a master thread, no? Can we do that from the master thread at
> the beginning of the execution?

Unfortunately, no. The Go runtime has no such concept. This is all
that is defined about program start up:

  https://golang.org/ref/spec#Program_initialization_and_execution

Salient section:

  Package initialization—variable initialization and the invocation of init
  functions—happens in a single goroutine, sequentially, one package at
  a time. An init function may launch other goroutines, which can run
  concurrently with the initialization code. However, initialization always
  sequences the init functions: it will not invoke the next one until the
  previous one has returned.

This means that at the earliest possible moment for Go code to run,
the scheduler is already active with at least GOMAXPROCS threads.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


Re: [PATCH 2/6 bpf-next] Clean up xsk reuseq API

2019-07-01 Thread Magnus Karlsson
On Fri, Jun 28, 2019 at 11:09 PM Jonathan Lemon
 wrote:
>
>
>
> On 28 Jun 2019, at 13:41, Jakub Kicinski wrote:
>
> > On Thu, 27 Jun 2019 19:31:26 -0700, Jonathan Lemon wrote:
> >> On 27 Jun 2019, at 15:38, Jakub Kicinski wrote:
> >>
> >>> On Thu, 27 Jun 2019 15:08:32 -0700, Jonathan Lemon wrote:
>  The reuseq is actually a recycle stack, only accessed from the kernel
>  side.
>  Also, the implementation details of the stack should belong to the
>  umem
>  object, and not exposed to the caller.
> 
>  Clean up and rename for consistency in preparation for the next
>  patch.
> 
>  Signed-off-by: Jonathan Lemon 
> >>>
> >>> Prepare/swap is to cater to how drivers should be written - being able
> >>> to allocate resources independently of those currently used.  Allowing
> >>> for changing ring sizes and counts on the fly.  This patch makes it
> >>> harder to write drivers in the way we are encouraging people to.
> >>>
> >>> IOW no, please don't do this.
> >>
> >> The main reason I rewrote this was to provide the same type
> >> of functionality as realloc() - no need to allocate/initialize a new
> >> array if the old one would still end up being used.  This would seem
> >> to be a win for the typical case of having the interface go up/down.
> >>
> >> Perhaps I should have named the function differently?
> >
> > Perhaps add a helper which calls both parts to help poorly architected
> > drivers?
>
> Still ends up taking more memory.
>
> There are only 3 drivers in the tree which do AF_XDP: i40e, ixgbe, and mlx5.
>
> All of these do the same thing:
> reuseq = xsk_reuseq_prepare(n)
> if (!reuseq)
>error
> xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
>
> I figured simplifying was a good thing.
>
> But I do take your point that some future driver might want to allocate
> everything up front before performing a commit of the resources.

Jonathan, can you come up with a solution that satisfies both these
goals: providing a lower level API that Jakub can and would like to
use for his driver and a higher level helper that can be used by
today's driver to make the AF_XDP part smaller and easier to
implement? I like the fact that you are simplifying the AF_XDP enabled
drivers that are out there today, but at the same time I do not want to
hinder Jakub from hopefully upstreaming his support in the future.

Thanks: Magnus

> --
> Jonathan
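
The prepare/swap split Jakub is defending can be sketched in a few lines: build the new resource first, commit it atomically, then free whatever came out. Names loosely mirror `xsk_reuseq_*`; this is not the kernel code:

```c
#include <assert.h>
#include <stdlib.h>

struct reuseq {
	unsigned int nentries;
};

/* allocate a queue of the requested size, independently of the one
 * currently in use -- the "prepare" half */
static struct reuseq *reuseq_prepare(unsigned int nentries)
{
	struct reuseq *q = malloc(sizeof(*q));

	if (q)
		q->nentries = nentries;
	return q;
}

/* commit the new queue and hand the old one back for the caller to
 * free -- the "swap" half */
static struct reuseq *reuseq_swap(struct reuseq **slot,
				  struct reuseq *newq)
{
	struct reuseq *old = *slot;

	*slot = newq;
	return old;
}
```

The three in-tree drivers chain the two halves as `xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq))`; keeping the halves separate is what allows a driver to allocate all new resources (say, for a ring-size change) before committing any of them.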


[PATCH net-next 5/8] net: mscc: describe the PTP register range

2019-07-01 Thread Antoine Tenart
This patch adds support for using the PTP register range, and adds a
description of its registers. This bank is used when configuring PTP.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/mscc/ocelot.h   |  9 ++
 drivers/net/ethernet/mscc/ocelot_board.c | 10 +-
 drivers/net/ethernet/mscc/ocelot_ptp.h   | 41 
 drivers/net/ethernet/mscc/ocelot_regs.c  | 11 +++
 4 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mscc/ocelot_ptp.h

diff --git a/drivers/net/ethernet/mscc/ocelot.h 
b/drivers/net/ethernet/mscc/ocelot.h
index f7eeb4806897..e0da8b4eddf2 100644
--- a/drivers/net/ethernet/mscc/ocelot.h
+++ b/drivers/net/ethernet/mscc/ocelot.h
@@ -23,6 +23,7 @@
 #include "ocelot_sys.h"
 #include "ocelot_qs.h"
 #include "ocelot_tc.h"
+#include "ocelot_ptp.h"
 
 #define PGID_AGGR	64
 #define PGID_SRC 80
@@ -71,6 +72,7 @@ enum ocelot_target {
SYS,
S2,
HSIO,
+   PTP,
TARGET_MAX,
 };
 
@@ -343,6 +345,13 @@ enum ocelot_reg {
S2_CACHE_ACTION_DAT,
S2_CACHE_CNT_DAT,
S2_CACHE_TG_DAT,
+   PTP_PIN_CFG = PTP << TARGET_OFFSET,
+   PTP_PIN_TOD_SEC_MSB,
+   PTP_PIN_TOD_SEC_LSB,
+   PTP_PIN_TOD_NSEC,
+   PTP_CFG_MISC,
+   PTP_CLK_CFG_ADJ_CFG,
+   PTP_CLK_CFG_ADJ_FREQ,
 };
 
 enum ocelot_regfield {
diff --git a/drivers/net/ethernet/mscc/ocelot_board.c 
b/drivers/net/ethernet/mscc/ocelot_board.c
index 58bde1a9eacb..c508e51c1e28 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -182,6 +182,7 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
struct {
enum ocelot_target id;
char *name;
+   u8 optional:1;
} res[] = {
{ SYS, "sys" },
{ REW, "rew" },
@@ -189,6 +190,7 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
{ ANA, "ana" },
{ QS, "qs" },
{ S2, "s2" },
+   { PTP, "ptp", 1 },
};
 
if (!np && !pdev->dev.platform_data)
@@ -205,8 +207,14 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
struct regmap *target;
 
target = ocelot_io_platform_init(ocelot, pdev, res[i].name);
-   if (IS_ERR(target))
+   if (IS_ERR(target)) {
+   if (res[i].optional) {
+   ocelot->targets[res[i].id] = NULL;
+   continue;
+   }
+
return PTR_ERR(target);
+   }
 
ocelot->targets[res[i].id] = target;
}
diff --git a/drivers/net/ethernet/mscc/ocelot_ptp.h 
b/drivers/net/ethernet/mscc/ocelot_ptp.h
new file mode 100644
index ..9ede14a12573
--- /dev/null
+++ b/drivers/net/ethernet/mscc/ocelot_ptp.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT) */
+/*
+ * Microsemi Ocelot Switch driver
+ *
+ * License: Dual MIT/GPL
+ * Copyright (c) 2017 Microsemi Corporation
+ */
+
+#ifndef _MSCC_OCELOT_PTP_H_
+#define _MSCC_OCELOT_PTP_H_
+
+#define PTP_PIN_CFG_RSZ			0x20
+#define PTP_PIN_TOD_SEC_MSB_RSZ		PTP_PIN_CFG_RSZ
+#define PTP_PIN_TOD_SEC_LSB_RSZ		PTP_PIN_CFG_RSZ
+#define PTP_PIN_TOD_NSEC_RSZ		PTP_PIN_CFG_RSZ
+
+#define PTP_PIN_CFG_DOM			BIT(0)
+#define PTP_PIN_CFG_SYNC		BIT(2)
+#define PTP_PIN_CFG_ACTION(x)		((x) << 3)
+#define PTP_PIN_CFG_ACTION_MASK		PTP_PIN_CFG_ACTION(0x7)
+
+enum {
+   PTP_PIN_ACTION_IDLE = 0,
+   PTP_PIN_ACTION_LOAD,
+   PTP_PIN_ACTION_SAVE,
+   PTP_PIN_ACTION_CLOCK,
+   PTP_PIN_ACTION_DELTA,
+   PTP_PIN_ACTION_NOSYNC,
+   PTP_PIN_ACTION_SYNC,
+};
+
+#define PTP_CFG_MISC_PTP_EN		BIT(2)
+
+#define PSEC_PER_SEC			1000000000000LL
+
+#define PTP_CFG_CLK_ADJ_CFG_ENA		BIT(0)
+#define PTP_CFG_CLK_ADJ_CFG_DIR		BIT(1)
+
+#define PTP_CFG_CLK_ADJ_FREQ_NS		BIT(30)
+
+#endif
diff --git a/drivers/net/ethernet/mscc/ocelot_regs.c 
b/drivers/net/ethernet/mscc/ocelot_regs.c
index 6c387f994ec5..e59977d20400 100644
--- a/drivers/net/ethernet/mscc/ocelot_regs.c
+++ b/drivers/net/ethernet/mscc/ocelot_regs.c
@@ -234,6 +234,16 @@ static const u32 ocelot_s2_regmap[] = {
REG(S2_CACHE_TG_DAT,   0x000388),
 };
 
+static const u32 ocelot_ptp_regmap[] = {
+   REG(PTP_PIN_CFG,   0x00),
+   REG(PTP_PIN_TOD_SEC_MSB,   0x04),
+   REG(PTP_PIN_TOD_SEC_LSB,   0x08),
+   REG(PTP_PIN_TOD_NSEC,  0x0c),
+   REG(PTP_CFG_MISC,  0xa0),
+   REG(PTP_CLK_CFG_ADJ_CFG,   0xa4),
+   REG(PTP_CLK_CFG_ADJ_FREQ,  0xa8),
+};
+
 static const u32 *ocelot_regmap[] = {
[ANA] = ocelot_
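
The bitfield encoding introduced in the ocelot_ptp.h hunk above can be exercised standalone — the action occupies bits [5:3] of PTP_PIN_CFG, and the extraction helper below is ours, added for illustration:

```c
#include <assert.h>

/* Encodings copied from the ocelot_ptp.h hunk above. */
#define PTP_PIN_CFG_ACTION(x)	((x) << 3)
#define PTP_PIN_CFG_ACTION_MASK	PTP_PIN_CFG_ACTION(0x7)

enum {
	PTP_PIN_ACTION_IDLE = 0,
	PTP_PIN_ACTION_LOAD,
	PTP_PIN_ACTION_SAVE,
};

/* extract the action back out of a PTP_PIN_CFG register value
 * (illustrative helper, not part of the patch) */
static unsigned int ptp_pin_action_get(unsigned int val)
{
	return (val & PTP_PIN_CFG_ACTION_MASK) >> 3;
}
```

So writing `PTP_PIN_CFG_ACTION(PTP_PIN_ACTION_SAVE)` sets bits [5:3] to 2, and masking with `PTP_PIN_CFG_ACTION_MASK` (0x38) recovers it regardless of the other configuration bits.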

[PATCH net-next 7/8] net: mscc: remove the frame_info cpuq member

2019-07-01 Thread Antoine Tenart
In struct frame_info, the cpuq member is never used. This cosmetic patch
removes it from the structure and from the parsing of the frame header,
where it was only set but never read.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/mscc/ocelot.h   | 1 -
 drivers/net/ethernet/mscc/ocelot_board.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.h 
b/drivers/net/ethernet/mscc/ocelot.h
index e0da8b4eddf2..515dee6fa8a6 100644
--- a/drivers/net/ethernet/mscc/ocelot.h
+++ b/drivers/net/ethernet/mscc/ocelot.h
@@ -45,7 +45,6 @@ struct frame_info {
u32 len;
u16 port;
u16 vid;
-   u8 cpuq;
u8 tag_type;
 };
 
diff --git a/drivers/net/ethernet/mscc/ocelot_board.c 
b/drivers/net/ethernet/mscc/ocelot_board.c
index 09ad6a123347..008a762512b9 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -33,7 +33,6 @@ static int ocelot_parse_ifh(u32 *_ifh, struct frame_info 
*info)
 
info->port = IFH_EXTRACT_BITFIELD64(ifh[1], 43, 4);
 
-   info->cpuq = IFH_EXTRACT_BITFIELD64(ifh[1], 20, 8);
info->tag_type = IFH_EXTRACT_BITFIELD64(ifh[1], 16,  1);
info->vid = IFH_EXTRACT_BITFIELD64(ifh[1], 0,  12);
 
-- 
2.21.0



[PATCH net-next 4/8] MIPS: dts: mscc: describe the PTP ready interrupt

2019-07-01 Thread Antoine Tenart
This patch adds a description of the PTP ready interrupt, which can be
triggered when a PTP timestamp is available on a hardware FIFO.

Signed-off-by: Antoine Tenart 
---
 arch/mips/boot/dts/mscc/ocelot.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi 
b/arch/mips/boot/dts/mscc/ocelot.dtsi
index 1e55a778def5..797d336db54d 100644
--- a/arch/mips/boot/dts/mscc/ocelot.dtsi
+++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
@@ -139,8 +139,8 @@
"port2", "port3", "port4", "port5", "port6",
"port7", "port8", "port9", "port10", "qsys",
"ana", "s2";
-   interrupts = <21 22>;
-   interrupt-names = "xtr", "inj";
+   interrupts = <18 21 22>;
+   interrupt-names = "ptp_rdy", "xtr", "inj";
 
ethernet-ports {
#address-cells = <1>;
-- 
2.21.0



[PATCH net-next 0/8] net: mscc: PTP Hardware Clock (PHC) support

2019-07-01 Thread Antoine Tenart
Hello,

This series introduces the PTP Hardware Clock (PHC) support to the Mscc
Ocelot switch driver. In order to make use of this, a new register bank
is added and described in the device tree, as well as a new interrupt.
The use of this bank and interrupt was made optional in the driver for
device tree compatibility reasons.

Patches 2 and 4 should probably go through the MIPS tree.

Thanks!
Antoine

Antoine Tenart (8):
  Documentation/bindings: net: ocelot: document the PTP bank
  MIPS: dts: mscc: describe the PTP register range
  Documentation/bindings: net: ocelot: document the PTP ready IRQ
  MIPS: dts: mscc: describe the PTP ready interrupt
  net: mscc: describe the PTP register range
  net: mscc: improve the frame header parsing readability
  net: mscc: remove the frame_info cpuq member
  net: mscc: PTP Hardware Clock (PHC) support

 .../devicetree/bindings/net/mscc-ocelot.txt   |  20 +-
 arch/mips/boot/dts/mscc/ocelot.dtsi   |   7 +-
 drivers/net/ethernet/mscc/ocelot.c| 382 +-
 drivers/net/ethernet/mscc/ocelot.h|  47 ++-
 drivers/net/ethernet/mscc/ocelot_board.c  | 139 ++-
 drivers/net/ethernet/mscc/ocelot_ptp.h|  41 ++
 drivers/net/ethernet/mscc/ocelot_regs.c   |  11 +
 7 files changed, 615 insertions(+), 32 deletions(-)
 create mode 100644 drivers/net/ethernet/mscc/ocelot_ptp.h

-- 
2.21.0



[PATCH net-next 8/8] net: mscc: PTP Hardware Clock (PHC) support

2019-07-01 Thread Antoine Tenart
This patch adds support for PTP Hardware Clock (PHC) to the Ocelot
switch for both PTP 1-step and 2-step modes.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/mscc/ocelot.c   | 382 ++-
 drivers/net/ethernet/mscc/ocelot.h   |  37 +++
 drivers/net/ethernet/mscc/ocelot_board.c | 106 ++-
 3 files changed, 517 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index b71e4ecbe469..1a8a7c305f54 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -538,7 +539,7 @@ static int ocelot_port_stop(struct net_device *dev)
  */
 static int ocelot_gen_ifh(u32 *ifh, struct frame_info *info)
 {
-   ifh[0] = IFH_INJ_BYPASS;
+   ifh[0] = IFH_INJ_BYPASS | ((0x1ff & info->rew_op) << 21);
ifh[1] = (0xf00 & info->port) >> 8;
ifh[2] = (0xff & info->port) << 24;
ifh[3] = (info->tag_type << 16) | info->vid;
@@ -550,6 +551,7 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct ocelot_port *port = netdev_priv(dev);
struct ocelot *ocelot = port->ocelot;
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
u32 val, ifh[IFH_LEN];
struct frame_info info = {};
u8 grp = 0; /* Send everything on CPU group 0 */
@@ -566,6 +568,14 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
info.port = BIT(port->chip_port);
info.tag_type = IFH_TAG_TYPE_C;
info.vid = skb_vlan_tag_get(skb);
+
+   /* Check if timestamping is needed */
+   if (ocelot->ptp && shinfo->tx_flags & SKBTX_HW_TSTAMP) {
+   info.rew_op = port->ptp_cmd;
+   if (port->ptp_cmd == IFH_REW_OP_TWO_STEP_PTP)
+   info.rew_op |= (port->ts_id  % 4) << 3;
+   }
+
ocelot_gen_ifh(ifh, &info);
 
for (i = 0; i < IFH_LEN; i++)
@@ -596,11 +606,43 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 
dev->stats.tx_packets++;
dev->stats.tx_bytes += skb->len;
-   dev_kfree_skb_any(skb);
+
+   if (ocelot->ptp && shinfo->tx_flags & SKBTX_HW_TSTAMP &&
+   port->ptp_cmd == IFH_REW_OP_TWO_STEP_PTP) {
+   struct ocelot_skb *oskb =
+   kzalloc(sizeof(struct ocelot_skb), GFP_KERNEL);
+
+   oskb->skb = skb;
+   oskb->id = port->ts_id % 4;
+   port->ts_id++;
+
+   list_add_tail(&oskb->head, &port->skbs);
+   } else {
+   dev_kfree_skb_any(skb);
+   }
 
return NETDEV_TX_OK;
 }
 
+void ocelot_get_hwtimestamp(struct ocelot *ocelot, struct timespec64 *ts)
+{
+   /* Read current PTP time to get seconds */
+   u32 val = ocelot_read_rix(ocelot, PTP_PIN_CFG, TOD_ACC_PIN);
+
+   val &= ~(PTP_PIN_CFG_SYNC | PTP_PIN_CFG_ACTION_MASK | PTP_PIN_CFG_DOM);
+   val |= PTP_PIN_CFG_ACTION(PTP_PIN_ACTION_SAVE);
+   ocelot_write_rix(ocelot, val, PTP_PIN_CFG, TOD_ACC_PIN);
+   ts->tv_sec = ocelot_read_rix(ocelot, PTP_PIN_TOD_SEC_LSB, TOD_ACC_PIN);
+
+   /* Read packet HW timestamp from FIFO */
+   val = ocelot_read(ocelot, SYS_PTP_TXSTAMP);
+   ts->tv_nsec = SYS_PTP_TXSTAMP_PTP_TXSTAMP(val);
+
+   /* Sec has incremented since the ts was registered */
+   if ((ts->tv_sec & 0x1) != !!(val & SYS_PTP_TXSTAMP_PTP_TXSTAMP_SEC))
+   ts->tv_sec--;
+}
+
 static int ocelot_mc_unsync(struct net_device *dev, const unsigned char *addr)
 {
struct ocelot_port *port = netdev_priv(dev);
@@ -917,6 +959,97 @@ static int ocelot_get_port_parent_id(struct net_device *dev,
return 0;
 }
 
+static int ocelot_hwstamp_get(struct ocelot_port *port, struct ifreq *ifr)
+{
+   struct ocelot *ocelot = port->ocelot;
+
+   return copy_to_user(ifr->ifr_data, &ocelot->hwtstamp_config,
+   sizeof(ocelot->hwtstamp_config)) ? -EFAULT : 0;
+}
+
+static int ocelot_hwstamp_set(struct ocelot_port *port, struct ifreq *ifr)
+{
+   struct ocelot *ocelot = port->ocelot;
+   struct hwtstamp_config cfg;
+
+   if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
+   return -EFAULT;
+
+   /* reserved for future extensions */
+   if (cfg.flags)
+   return -EINVAL;
+
+   /* Tx type sanity check */
+   switch (cfg.tx_type) {
+   case HWTSTAMP_TX_ON:
+   port->ptp_cmd = IFH_REW_OP_TWO_STEP_PTP;
+   break;
+   case HWTSTAMP_TX_ONESTEP_SYNC:
+   /* IFH_REW_OP_ONE_STEP_PTP updates the correctional field, we
+* need to update the origin time.
+*/
+   port->ptp_cmd = IFH_REW_OP_ORIGIN_PTP;
+   break;
+   case HWTSTAMP_TX_OFF:
+   port->ptp_cmd = 0;
+   break;
+   default:
+   
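The rollover handling in ocelot_get_hwtimestamp() above is subtle: the timestamp FIFO stores only the nanoseconds plus the least significant bit of the seconds counter, so the driver decrements the current seconds value when that bit disagrees with the stamp. A minimal user-space sketch of that correction (the function name is illustrative, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the seconds-rollover fix in ocelot_get_hwtimestamp():
 * the HW stamp carries only the LSB of the seconds counter.  If the
 * LSB of the current seconds value disagrees with the stored bit, a
 * second boundary was crossed after the stamp was taken, so step the
 * current seconds value back by one. */
static int64_t fixup_ts_sec(int64_t cur_sec, unsigned int stamp_sec_lsb)
{
	if ((cur_sec & 0x1) != (stamp_sec_lsb & 0x1))
		cur_sec--;
	return cur_sec;
}
```

This works because at most one second can elapse between the stamp being latched and the FIFO being read.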

[PATCH net-next 3/8] Documentation/bindings: net: ocelot: document the PTP ready IRQ

2019-07-01 Thread Antoine Tenart
One additional interrupt needs to be described within the Ocelot device
tree node: the PTP ready one. This patch documents the binding needed to
do so.

Signed-off-by: Antoine Tenart 
---
 Documentation/devicetree/bindings/net/mscc-ocelot.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
index 0afadfaa33ee..33bbeb998166 100644
--- a/Documentation/devicetree/bindings/net/mscc-ocelot.txt
+++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
@@ -17,9 +17,10 @@ Required properties:
   - "ana"
   - "portX" with X from 0 to the number of last port index available on that
 switch
-- interrupts: Should contain the switch interrupts for frame extraction and
-  frame injection
-- interrupt-names: should contain the interrupt names: "xtr", "inj"
+- interrupts: Should contain the switch interrupts for frame extraction,
+  frame injection and PTP ready.
+- interrupt-names: should contain the interrupt names: "xtr", "inj" and
+  "ptp_rdy".
 - ethernet-ports: A container for child nodes representing switch ports.
 
 The ethernet-ports container has the following properties
@@ -63,8 +64,8 @@ Example:
"port2", "port3", "port4", "port5", "port6",
"port7", "port8", "port9", "port10", "qsys",
"ana";
-   interrupts = <21 22>;
-   interrupt-names = "xtr", "inj";
+   interrupts = <18 21 22>;
+   interrupt-names = "ptp_rdy", "xtr", "inj";
 
ethernet-ports {
#address-cells = <1>;
-- 
2.21.0



[PATCH net-next 6/8] net: mscc: improve the frame header parsing readability

2019-07-01 Thread Antoine Tenart
This cosmetic patch improves the frame header parsing readability by
introducing a new macro to access and mask its fields.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/mscc/ocelot_board.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c
index c508e51c1e28..09ad6a123347 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -16,24 +16,26 @@
 
 #include "ocelot.h"
 
-static int ocelot_parse_ifh(u32 *ifh, struct frame_info *info)
+#define IFH_EXTRACT_BITFIELD64(x, o, w) (((x) >> (o)) & GENMASK_ULL((w) - 1, 0))
+
+static int ocelot_parse_ifh(u32 *_ifh, struct frame_info *info)
 {
-   int i;
u8 llen, wlen;
+   u64 ifh[2];
+
+   ifh[0] = be64_to_cpu(((__force __be64 *)_ifh)[0]);
+   ifh[1] = be64_to_cpu(((__force __be64 *)_ifh)[1]);
 
-   /* The IFH is in network order, switch to CPU order */
-   for (i = 0; i < IFH_LEN; i++)
-   ifh[i] = ntohl((__force __be32)ifh[i]);
+   wlen = IFH_EXTRACT_BITFIELD64(ifh[0], 7,  8);
+   llen = IFH_EXTRACT_BITFIELD64(ifh[0], 15,  6);
 
-   wlen = (ifh[1] >> 7) & 0xff;
-   llen = (ifh[1] >> 15) & 0x3f;
info->len = OCELOT_BUFFER_CELL_SZ * wlen + llen - 80;
 
-   info->port = (ifh[2] & GENMASK(14, 11)) >> 11;
+   info->port = IFH_EXTRACT_BITFIELD64(ifh[1], 43, 4);
 
-   info->cpuq = (ifh[3] & GENMASK(27, 20)) >> 20;
-   info->tag_type = (ifh[3] & BIT(16)) >> 16;
-   info->vid = ifh[3] & GENMASK(11, 0);
+   info->cpuq = IFH_EXTRACT_BITFIELD64(ifh[1], 20, 8);
+   info->tag_type = IFH_EXTRACT_BITFIELD64(ifh[1], 16,  1);
+   info->vid = IFH_EXTRACT_BITFIELD64(ifh[1], 0,  12);
 
return 0;
 }
-- 
2.21.0



[PATCH net-next 2/8] MIPS: dts: mscc: describe the PTP register range

2019-07-01 Thread Antoine Tenart
This patch adds one register range within the mscc,vsc7514-switch node,
to describe the PTP registers.

Signed-off-by: Antoine Tenart 
---
 arch/mips/boot/dts/mscc/ocelot.dtsi | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
index 33ae74aaa1bb..1e55a778def5 100644
--- a/arch/mips/boot/dts/mscc/ocelot.dtsi
+++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
@@ -120,6 +120,7 @@
	reg = <0x1010000 0x10000>,
	  <0x1030000 0x10000>,
	  <0x1080000 0x100>,
	+ <0x10e0000 0x10000>,
	  <0x11e0000 0x100>,
	  <0x11f0000 0x100>,
	  <0x1200000 0x100>,
@@ -134,7 +135,7 @@
	  <0x1800000 0x80000>,
	  <0x1880000 0x10000>,
	  <0x1060000 0x10000>;
-   reg-names = "sys", "rew", "qs", "port0", "port1",
+   reg-names = "sys", "rew", "qs", "ptp", "port0", "port1",
"port2", "port3", "port4", "port5", "port6",
"port7", "port8", "port9", "port10", "qsys",
"ana", "s2";
-- 
2.21.0



Re: [PATCH 0/6 bpf-next] xsk: reuseq cleanup

2019-07-01 Thread Magnus Karlsson
On Fri, Jun 28, 2019 at 12:10 AM Jonathan Lemon
 wrote:
>
> Clean up and normalize usage of the recycle queue in order to
> support upcoming TX from RX queue functionality.
>
> Jonathan Lemon (6):
>   Have xsk_umem_peek_addr_rq() return chunk-aligned handles.
>   Clean up xsk reuseq API
>   Always check the recycle stack when using the umem fq.
>   Simplify AF_XDP umem allocation path for Intel drivers.
>   Remove use of umem _rq variants from Mellanox driver.
>   Remove the umem _rq variants now that the last consumer is gone.

Maybe it is just me, but I cannot find patch 6/6. Not in my gmail
account and not in my Intel account. Am I just going insane :-)?

/Magnus

>  drivers/net/ethernet/intel/i40e/i40e_xsk.c| 86 +++
>  .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 59 ++---
>  .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  8 +-
>  .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |  7 +-
>  include/net/xdp_sock.h| 69 ++-
>  net/xdp/xdp_umem.c|  2 +-
>  net/xdp/xsk.c | 22 -
>  net/xdp/xsk_queue.c   | 56 +---
>  net/xdp/xsk_queue.h   |  2 +-
>  10 files changed, 68 insertions(+), 245 deletions(-)
>
> --
> 2.17.1
>


[PATCH net-next 1/8] Documentation/bindings: net: ocelot: document the PTP bank

2019-07-01 Thread Antoine Tenart
One additional register range needs to be described within the Ocelot
device tree node: the PTP one. This patch documents the binding needed to do
so.

Signed-off-by: Antoine Tenart 
---
 Documentation/devicetree/bindings/net/mscc-ocelot.txt | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
index 9e5c17d426ce..0afadfaa33ee 100644
--- a/Documentation/devicetree/bindings/net/mscc-ocelot.txt
+++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
@@ -12,6 +12,7 @@ Required properties:
   - "sys"
   - "rew"
   - "qs"
+  - "ptp"
   - "qsys"
   - "ana"
   - "portX" with X from 0 to the number of last port index available on that
@@ -44,6 +45,7 @@ Example:
	reg = <0x1010000 0x10000>,
	  <0x1030000 0x10000>,
	  <0x1080000 0x100>,
	+ <0x10e0000 0x10000>,
	  <0x11e0000 0x100>,
	  <0x11f0000 0x100>,
	  <0x1200000 0x100>,
@@ -57,9 +59,10 @@ Example:
	  <0x1280000 0x100>,
	  <0x1800000 0x80000>,
	  <0x1880000 0x10000>;
-   reg-names = "sys", "rew", "qs", "port0", "port1", "port2",
-   "port3", "port4", "port5", "port6", "port7",
-   "port8", "port9", "port10", "qsys", "ana";
+   reg-names = "sys", "rew", "qs", "ptp", "port0", "port1",
+   "port2", "port3", "port4", "port5", "port6",
+   "port7", "port8", "port9", "port10", "qsys",
+   "ana";
interrupts = <21 22>;
interrupt-names = "xtr", "inj";
 
-- 
2.21.0



RE: [PATCH] xfrm: use list_for_each_entry_safe in xfrm_policy_flush

2019-07-01 Thread Li,Rongqing


> -----Original Message-----
> From: Florian Westphal [mailto:f...@strlen.de]
> Sent: July 1, 2019 17:04
> To: Li,Rongqing 
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH] xfrm: use list_for_each_entry_safe in xfrm_policy_flush
> 
> Li RongQing  wrote:
> > The iterated pol maybe be freed since it is not protected by RCU or
> > spinlock when put it, lead to UAF, so use _safe function to iterate
> > over it against removal
> >
> > Signed-off-by: Li RongQing 
> > ---
> >  net/xfrm/xfrm_policy.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index
> > 3235562f6588..87d770dab1f5 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -1772,7 +1772,7 @@ xfrm_policy_flush_secctx_check(struct net *net,
> > u8 type, bool task_valid)  int xfrm_policy_flush(struct net *net, u8
> > type, bool task_valid)  {
> > int dir, err = 0, cnt = 0;
> > -   struct xfrm_policy *pol;
> > +   struct xfrm_policy *pol, *tmp;
> >
> > spin_lock_bh(&net->xfrm.xfrm_policy_lock);
> >
> > @@ -1781,7 +1781,7 @@ int xfrm_policy_flush(struct net *net, u8 type, bool
> task_valid)
> > goto out;
> >
> >  again:
> > -   list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
> > +   list_for_each_entry_safe(pol, tmp, &net->xfrm.policy_all, walk.all)
> > +{
> > dir = xfrm_policy_id2dir(pol->index);
> > if (pol->walk.dead ||
> > dir >= XFRM_POLICY_MAX ||
> 
> This function drops the lock, but after re-acquire jumps to the 'again'
> label, so I do not see the UAF as the entire loop gets restarted.

You are right, sorry for the noise

-Li
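For readers following along: the original code is safe because after dropping the lock, xfrm_policy_flush() jumps back to the `again:` label and restarts the walk from the list head, so no iterator is held across the unlock. A toy sketch of that restart-from-head lookup (types and names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the `again:` pattern: rather than keeping a cursor
 * across an unlock (which may be freed meanwhile), always restart
 * the scan from the head and return the first entry still alive. */
struct node {
	int dead;
	struct node *next;
};

static struct node *next_live(struct node *head)
{
	struct node *n;

	for (n = head; n; n = n->next)
		if (!n->dead)
			return n;
	return NULL;
}
```

The caller processes the returned entry, drops the lock, re-acquires it, and calls next_live() again from the head; list_for_each_entry_safe() would only be needed if the iteration continued from a stale cursor.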


Re: [PATCH V3] can: flexcan: fix stop mode acknowledgment

2019-07-01 Thread Marc Kleine-Budde
On 6/19/19 9:42 AM, Joakim Zhang wrote:
> To enter stop mode, the CPU should manually assert a global Stop Mode
> request and check the acknowledgment asserted by FlexCAN. The CPU must
> only consider the FlexCAN in stop mode when both request and
> acknowledgment conditions are satisfied.
> 
> Fixes: de3578c198c6 ("can: flexcan: add self wakeup support")
> Reported-by: Marc Kleine-Budde 
> Signed-off-by: Joakim Zhang 
> 
> ChangeLog:
> V1->V2:
>   * regmap_read()-->regmap_read_poll_timeout()
> V2->V3:
>   * change the way of error return, it will make easy for function
>   extension.

Please rebase to linux-next/master, as this is a fix.

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 0/3 bpf-next] intel: AF_XDP support for TX of RX packets

2019-07-01 Thread Magnus Karlsson
On Sat, Jun 29, 2019 at 12:18 AM Jonathan Lemon
 wrote:
>
> NOTE: This patch depends on my previous "xsk: reuse cleanup" patch,
> sent to netdev earlier.
>
> The motivation is to have packets which were received on a zero-copy
> AF_XDP socket, and which returned a TX verdict from the bpf program,
> queued directly on the TX ring (if they're in the same napi context).
>
> When these TX packets are completed, they are placed back onto the
> reuse queue, as there isn't really any other place to handle them.
>
> Space in the reuse queue is preallocated at init time for both the
> RX and TX rings.  Another option would have a smaller TX queue size
> and count in-flight TX packets, dropping any which exceed the reuseq
> size - this approach is omitted for simplicity.

This should speed up XDP_TX under ZC substantially, which of course is
a good thing. Would be great if you could add some performance
numbers.

As other people have pointed out, it would have been great if we had a
page pool we could return the buffers to. But we do not, so there are
only two options: keep it in the kernel on the reuse queue in this
case, or return the buffer to user space with a length of zero
indicating that there is no packet data. Just a transfer of ownership.
But let us go with the former one, as you have done in this patch set,
since we have so far always tried to reuse the buffers inside the
kernel. But the latter option might be good to have in store as a
solution for other problems.

/Magnus

>
> Jonathan Lemon (3):
>   net: add convert_to_xdp_frame_keep_zc function
>   i40e: Support zero-copy XDP_TX on the RX path for AF_XDP sockets.
>   ixgbe: Support zero-copy XDP_TX on the RX path for AF_XDP sockets.
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h  |  1 +
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c   | 54 --
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h |  1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 74 +---
>  include/net/xdp.h| 20 --
>  5 files changed, 134 insertions(+), 16 deletions(-)
>
> --
> 2.17.1
>


Re: [PATCH 1/3 bpf-next] net: add convert_to_xdp_frame_keep_zc function

2019-07-01 Thread Magnus Karlsson
On Sat, Jun 29, 2019 at 12:19 AM Jonathan Lemon
 wrote:
>
> Add a function which converts a ZC xdp_buff to a an xdp_frame, while

nit: "a an" -> "an"

> keeping the zc backing storage.  This will be used to support TX of
> received AF_XDP frames.
>
> Signed-off-by: Jonathan Lemon 
> ---
>  include/net/xdp.h | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 40c6d3398458..abe5f47ff0a5 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -82,6 +82,7 @@ struct xdp_frame {
>  */
> struct xdp_mem_info mem;
> struct net_device *dev_rx; /* used by cpumap */
> +   unsigned long handle;
>  };
>
>  /* Clear kernel pointers in xdp_frame */
> @@ -95,15 +96,12 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
>
>  /* Convert xdp_buff to xdp_frame */
>  static inline
> -struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> +struct xdp_frame *__convert_to_xdp_frame(struct xdp_buff *xdp)
>  {
> struct xdp_frame *xdp_frame;
> int metasize;
> int headroom;
>
> -   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> -   return xdp_convert_zc_to_xdp_frame(xdp);
> -
> /* Assure headroom is available for storing info */
> headroom = xdp->data - xdp->data_hard_start;
> metasize = xdp->data - xdp->data_meta;
> @@ -125,6 +123,20 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> return xdp_frame;
>  }
>
> +static inline
> +struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> +{
> +   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> +   return xdp_convert_zc_to_xdp_frame(xdp);
> +   return __convert_to_xdp_frame(xdp);
> +}
> +
> +static inline
> +struct xdp_frame *convert_to_xdp_frame_keep_zc(struct xdp_buff *xdp)
> +{
> +   return __convert_to_xdp_frame(xdp);
> +}
> +
>  void xdp_return_frame(struct xdp_frame *xdpf);
>  void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
>  void xdp_return_buff(struct xdp_buff *xdp);
> --
> 2.17.1
>


Re: [PATCH 2/3 bpf-next] i40e: Support zero-copy XDP_TX on the RX path for AF_XDP sockets.

2019-07-01 Thread Magnus Karlsson
On Sat, Jun 29, 2019 at 12:18 AM Jonathan Lemon
 wrote:
>
> When the XDP program attached to a zero-copy AF_XDP socket returns XDP_TX,
> queue the umem frame on the XDP TX ring.  Space on the recycle stack is
> pre-allocated when the xsk is created.  (taken from tx_ring, since the
> xdp ring is not initialized yet)
>
> Signed-off-by: Jonathan Lemon 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 54 +++--
>  2 files changed, 51 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 100e92d2982f..3e7954277737 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -274,6 +274,7 @@ static inline unsigned int i40e_txd_use_count(unsigned int size)
>  #define I40E_TX_FLAGS_TSYN BIT(8)
>  #define I40E_TX_FLAGS_FD_SBBIT(9)
>  #define I40E_TX_FLAGS_UDP_TUNNEL   BIT(10)
> +#define I40E_TX_FLAGS_ZC_FRAME BIT(11)
>  #define I40E_TX_FLAGS_VLAN_MASK 0xffff0000
>  #define I40E_TX_FLAGS_VLAN_PRIO_MASK   0xe0000000
>  #define I40E_TX_FLAGS_VLAN_PRIO_SHIFT  29
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index ce8650d06962..020f9859215d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -91,7 +91,8 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
> qid >= netdev->real_num_tx_queues)
> return -EINVAL;
>
> -   if (!xsk_umem_recycle_alloc(umem, vsi->rx_rings[0]->count))
> +   if (!xsk_umem_recycle_alloc(umem, vsi->rx_rings[0]->count +
> + vsi->tx_rings[0]->count))
> return -ENOMEM;
>
> err = i40e_xsk_umem_dma_map(vsi, umem);
> @@ -175,6 +176,48 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
> i40e_xsk_umem_disable(vsi, qid);
>  }
>
> +static int i40e_xmit_rcvd_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)

This function looks very much like i40e_xmit_xdp_ring(). How can we
refactor them to make them share more code and not lose performance at
the same time? This comment is also valid for the ixgbe driver patch
that follows.

Thanks: Magnus

> +{
> +   struct i40e_ring *xdp_ring;
> +   struct i40e_tx_desc *tx_desc;
> +   struct i40e_tx_buffer *tx_bi;
> +   struct xdp_frame *xdpf;
> +   dma_addr_t dma;
> +
> +   xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> +
> +   if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
> +   xdp_ring->tx_stats.tx_busy++;
> +   return I40E_XDP_CONSUMED;
> +   }
> +   xdpf = convert_to_xdp_frame_keep_zc(xdp);
> +   if (unlikely(!xdpf))
> +   return I40E_XDP_CONSUMED;
> +   xdpf->handle = xdp->handle;
> +
> +   dma = xdp_umem_get_dma(rx_ring->xsk_umem, xdp->handle);
> +   tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
> +   tx_bi->bytecount = xdpf->len;
> +   tx_bi->gso_segs = 1;
> +   tx_bi->xdpf = xdpf;
> +   tx_bi->tx_flags = I40E_TX_FLAGS_ZC_FRAME;
> +
> +   tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> +   tx_desc->buffer_addr = cpu_to_le64(dma);
> +   tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> + I40E_TX_DESC_CMD_EOP,
> + 0, xdpf->len, 0);
> +   smp_wmb();
> +
> +   xdp_ring->next_to_use++;
> +   if (xdp_ring->next_to_use == xdp_ring->count)
> +   xdp_ring->next_to_use = 0;
> +
> +   tx_bi->next_to_watch = tx_desc;
> +
> +   return I40E_XDP_TX;
> +}
> +
>  /**
>   * i40e_run_xdp_zc - Executes an XDP program on an xdp_buff
>   * @rx_ring: Rx ring
> @@ -187,7 +230,6 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
>  static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
>  {
> int err, result = I40E_XDP_PASS;
> -   struct i40e_ring *xdp_ring;
> struct bpf_prog *xdp_prog;
> u32 act;
>
> @@ -202,8 +244,7 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
> case XDP_PASS:
> break;
> case XDP_TX:
> -   xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
> -   result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
> +   result = i40e_xmit_rcvd_zc(rx_ring, xdp);
> break;
> case XDP_REDIRECT:
> err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> @@ -628,6 +669,11 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>  static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
>  

[PATCH v3] ss: introduce switch to print exact value of data rates

2019-07-01 Thread Tomasz Torcz
  Introduce the -X/--exact switch to disable human-friendly printing
 of data rates. Without the switch (default), data is presented as MBps/Kbps.

  Signed-off-by: Tomasz Torcz 
---
 man/man8/ss.8 |  3 +++
 misc/ss.c | 12 ++--
 2 files changed, 13 insertions(+), 2 deletions(-)

 Changes in v3:
  - updated ss man page with new option

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 9054fab9..2ba5fda2 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -290,6 +290,9 @@ that parsing /proc/net/tcp is painful.
 .B \-E, \-\-events
 Continually display sockets as they are destroyed
 .TP
+.B \-X, \-\-exact
+Show exact bandwidth values, instead of human-readable
+.TP
 .B \-Z, \-\-context
 As the
 .B \-p
diff --git a/misc/ss.c b/misc/ss.c
index 99c06d31..ba1bfff6 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -110,6 +110,7 @@ static int resolve_services = 1;
 int preferred_family = AF_UNSPEC;
 static int show_options;
 int show_details;
+static int show_human_readable = 1;
 static int show_users;
 static int show_mem;
 static int show_tcpinfo;
@@ -2361,7 +2362,9 @@ static int proc_inet_split_line(char *line, char **loc, char **rem, char **data)
 
 static char *sprint_bw(char *buf, double bw)
 {
-   if (bw > 1000000.)
+   if (!show_human_readable)
+   sprintf(buf, "%.0f", bw);
+   else if (bw > 1000000.)
 	sprintf(buf, "%.1fM", bw / 1000000.);
else if (bw > 1000.)
sprintf(buf, "%.1fK", bw / 1000.);
@@ -4883,6 +4886,7 @@ static void _usage(FILE *dest)
 "   --tos   show tos and priority information\n"
 "   -b, --bpf   show bpf filter socket information\n"
 "   -E, --eventscontinually display sockets as they are destroyed\n"
+"   -X, --exact show exact bandwidth values, instead of human-readable\n"
 "   -Z, --context   display process SELinux security contexts\n"
 "   -z, --contexts  display process and socket SELinux security contexts\n"
 "   -N, --net   switch to the specified network namespace name\n"
@@ -5031,6 +5035,7 @@ static const struct option long_opts[] = {
{ "no-header", 0, 0, 'H' },
{ "xdp", 0, 0, OPT_XDPSOCK},
{ "oneline", 0, 0, 'O' },
+   { "exact", 0, 0, 'X' },
{ 0 }
 
 };
@@ -5046,7 +5051,7 @@ int main(int argc, char *argv[])
int state_filter = 0;
 
while ((ch = getopt_long(argc, argv,
-"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHSO",
+"dhaletuwxXnro460spbEf:miA:D:F:vVzZN:KHSO",
 long_opts, NULL)) != EOF) {
switch (ch) {
case 'n':
@@ -5097,6 +5102,9 @@ int main(int argc, char *argv[])
case 'x':
filter_af_set(&current_filter, AF_UNIX);
break;
+   case 'X':
+   show_human_readable = 0;
+   break;
case OPT_VSOCK:
filter_af_set(&current_filter, AF_VSOCK);
break;
-- 
2.21.0



[PATCH net-next 1/3] devlink: Introduce PCI PF port flavour and port attribute

2019-07-01 Thread Parav Pandit
In an eswitch, a PCI PF may have a port which is normally represented
using a representor netdevice.
To give better visibility of an eswitch port and its association with
the PF, a representor netdevice and a port number, introduce a PCI PF
port flavour and port attribute.

When devlink port flavour is PCI PF, fill up PCI PF attributes of the
port.

Extend port name creation to use the PCI PF number on a best-effort
basis, so that vendor drivers can skip defining their own scheme.

$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0

Acked-by: Jiri Pirko 
Signed-off-by: Parav Pandit 
---
 include/net/devlink.h| 11 ++
 include/uapi/linux/devlink.h |  5 +++
 net/core/devlink.c   | 71 +---
 3 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 6625ea068d5e..8db9c0e83fb5 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -38,6 +38,10 @@ struct devlink {
char priv[0] __aligned(NETDEV_ALIGN);
 };
 
+struct devlink_port_pci_pf_attrs {
+   u16 pf; /* Associated PCI PF for this port. */
+};
+
 struct devlink_port_attrs {
u8 set:1,
   split:1,
@@ -46,6 +50,9 @@ struct devlink_port_attrs {
u32 port_number; /* same value as "split group" */
u32 split_subport_number;
struct netdev_phys_item_id switch_id;
+   union {
+   struct devlink_port_pci_pf_attrs pci_pf;
+   };
 };
 
 struct devlink_port {
@@ -590,6 +597,10 @@ void devlink_port_attrs_set(struct devlink_port *devlink_port,
u32 split_subport_number,
const unsigned char *switch_id,
unsigned char switch_id_len);
+void devlink_port_attrs_pci_pf_set(struct devlink_port *devlink_port,
+  u32 port_number,
+  const unsigned char *switch_id,
+  unsigned char switch_id_len, u16 pf);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
u32 size, u16 ingress_pools_count,
u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 5287b42c181f..f7323884c3fe 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -169,6 +169,10 @@ enum devlink_port_flavour {
DEVLINK_PORT_FLAVOUR_DSA, /* Distributed switch architecture
   * interconnect port.
   */
+   DEVLINK_PORT_FLAVOUR_PCI_PF, /* Represents eswitch port for
+ * the PCI PF. It is an internal
+ * port that faces the PCI PF.
+ */
 };
 
 enum devlink_param_cmode {
@@ -337,6 +341,7 @@ enum devlink_attr {
DEVLINK_ATTR_FLASH_UPDATE_STATUS_DONE,  /* u64 */
DEVLINK_ATTR_FLASH_UPDATE_STATUS_TOTAL, /* u64 */
 
+   DEVLINK_ATTR_PORT_PCI_PF_NUMBER,/* u16 */
/* add new attributes above here, update the policy in devlink.c */
 
__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 89c533778135..001f9e2c96f0 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -517,6 +517,11 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
return -EMSGSIZE;
if (nla_put_u32(msg, DEVLINK_ATTR_PORT_NUMBER, attrs->port_number))
return -EMSGSIZE;
+   if (devlink_port->attrs.flavour == DEVLINK_PORT_FLAVOUR_PCI_PF) {
+   if (nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER,
+   attrs->pci_pf.pf))
+   return -EMSGSIZE;
+   }
if (!attrs->split)
return 0;
if (nla_put_u32(msg, DEVLINK_ATTR_PORT_SPLIT_GROUP, attrs->port_number))
@@ -5738,6 +5743,30 @@ void devlink_port_type_clear(struct devlink_port *devlink_port)
 }
 EXPORT_SYMBOL_GPL(devlink_port_type_clear);
 
+static void __devlink_port_attrs_set(struct devlink_port *devlink_port,
+enum devlink_port_flavour flavour,
+u32 port_number,
+const unsigned char *switch_id,
+unsigned char switch_id_len)
+{
+   struct devlink_port_attrs *attrs = &devlink_port->attrs;
+
+   if (WARN_ON(devlink_port->registered))
+   return;
+   attrs->set = true;
+   attrs->flavour = flavour;
+   attrs->port_number = port_number;
+   if (switch_id) {
+   attrs->switch_port = true;
+   if (WARN_ON(switch_id_len > MAX_PHYS_ITEM_ID_LEN))
+   switch_id_len = MAX_PHYS_ITEM_ID_LEN;
+   memcpy(attrs->switch_id.id, switch_id, switch_id_len);
+   attrs->switch_id.id

[PATCH net-next 0/3] devlink: Introduce PCI PF, VF ports and attributes

2019-07-01 Thread Parav Pandit
This patchset carries forward the work initiated in [1] and the
discussion further concluded at [2].

To improve the visibility of a representor netdevice and its
association with a PF or VF and the physical port, two new devlink
port flavours are added: PCI PF and PCI VF ports.

A sample eswitch view can be seen below, which will be further
extended to mdev subdevices of a PCI function in the future.

Patches 1 and 2 extend the devlink port attributes and port flavours.
Patch 3 extends the mlx5 driver to register devlink ports for the PF,
VFs and the physical link.

+---+  +---+
  vf|   |  |   | pf
+-+-+  +-+-+
physical link <-+ |  |
| |  |
| |  |
  +-+-+ +-+-+  +-+-+
  | 1 | | 2 |  | 3 |
   +--+---+-+---+--+---+--+
   |  physical   vf pf|
   |  port   port   port  |
   |  |
   | eswitch  |
   |  |
   +--+

[1] https://www.spinics.net/lists/netdev/msg555797.html
[2] https://marc.info/?l=linux-netdev&m=155354609408485&w=2

Parav Pandit (3):
  devlink: Introduce PCI PF port flavour and port attribute
  devlink: Introduce PCI VF port flavour and port attribute
  net/mlx5e: Register devlink ports for physical link, PCI PF, VFs

 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 108 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |   1 +
 include/net/devlink.h |  22 
 include/uapi/linux/devlink.h  |  11 ++
 net/core/devlink.c| 107 ++---
 5 files changed, 204 insertions(+), 45 deletions(-)

-- 
2.19.2



[PATCH net-next 3/3] net/mlx5e: Register devlink ports for physical link, PCI PF, VFs

2019-07-01 Thread Parav Pandit
Register a devlink port of physical port, PCI PF and PCI VF flavour
for each PF and VF when a given devlink instance is in switchdev mode.

Implement the ndo_get_devlink_port callback to make use of the
registered devlink ports.
This makes the ndo_get_phys_port_name() and ndo_get_port_parent_id()
callbacks unnecessary, so remove them.

An example output with 2 VFs, without a PF and single uplink port is
below.

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical
pci/0000:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0
pci/0000:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1

Reviewed-by: Roi Dayan 
Acked-by: Jiri Pirko 
Signed-off-by: Parav Pandit 
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 108 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |   1 +
 2 files changed, 78 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 330034fcdfc5..aa47be3c139f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "eswitch.h"
 #include "en.h"
@@ -1119,32 +1120,6 @@ static int mlx5e_rep_close(struct net_device *dev)
return ret;
 }
 
-static int mlx5e_rep_get_phys_port_name(struct net_device *dev,
-   char *buf, size_t len)
-{
-   struct mlx5e_priv *priv = netdev_priv(dev);
-   struct mlx5e_rep_priv *rpriv = priv->ppriv;
-   struct mlx5_eswitch_rep *rep = rpriv->rep;
-   unsigned int fn;
-   int ret;
-
-   fn = PCI_FUNC(priv->mdev->pdev->devfn);
-   if (fn >= MLX5_MAX_PORTS)
-   return -EOPNOTSUPP;
-
-   if (rep->vport == MLX5_VPORT_UPLINK)
-   ret = snprintf(buf, len, "p%d", fn);
-   else if (rep->vport == MLX5_VPORT_PF)
-   ret = snprintf(buf, len, "pf%d", fn);
-   else
-   ret = snprintf(buf, len, "pf%dvf%d", fn, rep->vport - 1);
-
-   if (ret >= len)
-   return -EOPNOTSUPP;
-
-   return 0;
-}
-
 static int
 mlx5e_rep_setup_tc_cls_flower(struct mlx5e_priv *priv,
  struct tc_cls_flower_offload *cls_flower, int 
flags)
@@ -1298,17 +1273,24 @@ static int mlx5e_uplink_rep_set_vf_vlan(struct 
net_device *dev, int vf, u16 vlan
return 0;
 }
 
+static struct devlink_port *mlx5e_get_devlink_port(struct net_device *dev)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+   struct mlx5e_rep_priv *rpriv = priv->ppriv;
+
+   return &rpriv->dl_port;
+}
+
 static const struct net_device_ops mlx5e_netdev_ops_rep = {
.ndo_open= mlx5e_rep_open,
.ndo_stop= mlx5e_rep_close,
.ndo_start_xmit  = mlx5e_xmit,
-   .ndo_get_phys_port_name  = mlx5e_rep_get_phys_port_name,
.ndo_setup_tc= mlx5e_rep_setup_tc,
+   .ndo_get_devlink_port = mlx5e_get_devlink_port,
.ndo_get_stats64 = mlx5e_rep_get_stats,
.ndo_has_offload_stats   = mlx5e_rep_has_offload_stats,
.ndo_get_offload_stats   = mlx5e_rep_get_offload_stats,
.ndo_change_mtu  = mlx5e_rep_change_mtu,
-   .ndo_get_port_parent_id  = mlx5e_rep_get_port_parent_id,
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_uplink_rep = {
@@ -1316,8 +1298,8 @@ static const struct net_device_ops 
mlx5e_netdev_ops_uplink_rep = {
.ndo_stop= mlx5e_close,
.ndo_start_xmit  = mlx5e_xmit,
.ndo_set_mac_address = mlx5e_uplink_rep_set_mac,
-   .ndo_get_phys_port_name  = mlx5e_rep_get_phys_port_name,
.ndo_setup_tc= mlx5e_rep_setup_tc,
+   .ndo_get_devlink_port = mlx5e_get_devlink_port,
.ndo_get_stats64 = mlx5e_get_stats,
.ndo_has_offload_stats   = mlx5e_rep_has_offload_stats,
.ndo_get_offload_stats   = mlx5e_rep_get_offload_stats,
@@ -1330,7 +1312,6 @@ static const struct net_device_ops 
mlx5e_netdev_ops_uplink_rep = {
.ndo_get_vf_config   = mlx5e_get_vf_config,
.ndo_get_vf_stats= mlx5e_get_vf_stats,
.ndo_set_vf_vlan = mlx5e_uplink_rep_set_vf_vlan,
-   .ndo_get_port_parent_id  = mlx5e_rep_get_port_parent_id,
.ndo_set_features= mlx5e_set_features,
 };
 
@@ -1731,6 +1712,55 @@ static const struct mlx5e_profile 
mlx5e_uplink_rep_profile = {
.max_tc = MLX5E_MAX_NUM_TC,
 };
 
+static bool
+is_devlink_port_supported(const struct mlx5_core_dev *dev,
+ const struct mlx5e_rep_priv *rpriv)
+{
+   return rpriv->rep->vport == MLX5_VPORT_UPLINK ||
+  rpriv->rep->vport == MLX5_VPORT_PF ||
+  mlx5_eswitch_is_vf_vport(dev->priv.eswitch, rpriv->rep->vport);
+}
+
+static int register_devlink_port(struct mlx5_core_dev *dev,
+   

[PATCH net-next 2/3] devlink: Introduce PCI VF port flavour and port attribute

2019-07-01 Thread Parav Pandit
In an eswitch, a PCI VF may have a port which is normally represented
using a representor netdevice.
To give better visibility of an eswitch port, its association with a VF,
its representor netdevice and its port number, introduce a PCI VF
port flavour.

When devlink port flavour is PCI VF, fill up PCI VF attributes of
the port.

Extend port name creation using a PCI PF and VF numbering scheme on a
best-effort basis, so that vendor drivers can skip defining their own scheme.

$ devlink port show
pci/:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0
pci/:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0
pci/:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1
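
The pf/vf naming scheme described above can be sketched in plain userspace C.
The function and enum names below are illustrative, not the kernel's devlink
implementation; the format strings mirror the "p%d"/"pf%d"/"pf%dvf%d" scheme
that the mlx5 driver previously open-coded and that devlink now derives from
the port flavour and attributes:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative port flavours mirroring the uapi enum names. */
enum port_flavour { FLAVOUR_PHYSICAL, FLAVOUR_PCI_PF, FLAVOUR_PCI_VF };

/* Sketch of the "p%u" / "pf%u" / "pf%uvf%u" naming scheme; pf and vf
 * are the PCI PF and VF numbers carried in the port attributes. */
static int port_name(char *buf, size_t len, enum port_flavour flavour,
                     unsigned int pf, unsigned int vf)
{
	int n;

	switch (flavour) {
	case FLAVOUR_PHYSICAL:
		n = snprintf(buf, len, "p%u", pf);
		break;
	case FLAVOUR_PCI_PF:
		n = snprintf(buf, len, "pf%u", pf);
		break;
	case FLAVOUR_PCI_VF:
		n = snprintf(buf, len, "pf%uvf%u", pf, vf);
		break;
	default:
		return -1;
	}
	/* Reject truncation, as the driver callback did with -EOPNOTSUPP. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With pfnum 0 and vfnum 1 this yields "pf0vf1", matching the style of the
example `devlink port show` output above.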

Acked-by: Jiri Pirko 
Signed-off-by: Parav Pandit 
---
 include/net/devlink.h| 11 +++
 include/uapi/linux/devlink.h |  6 ++
 net/core/devlink.c   | 36 
 3 files changed, 53 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 8db9c0e83fb5..dff7c7797f3e 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -42,6 +42,11 @@ struct devlink_port_pci_pf_attrs {
u16 pf; /* Associated PCI PF for this port. */
 };
 
+struct devlink_port_pci_vf_attrs {
+   u16 pf; /* Associated PCI PF for this port. */
+   u16 vf; /* Associated PCI VF of the PCI PF for this port. */
+};
+
 struct devlink_port_attrs {
u8 set:1,
   split:1,
@@ -52,6 +57,7 @@ struct devlink_port_attrs {
struct netdev_phys_item_id switch_id;
union {
struct devlink_port_pci_pf_attrs pci_pf;
+   struct devlink_port_pci_vf_attrs pci_vf;
};
 };
 
@@ -601,6 +607,11 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port 
*devlink_port,
   u32 port_number,
   const unsigned char *switch_id,
   unsigned char switch_id_len, u16 pf);
+void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port,
+  u32 port_number,
+  const unsigned char *switch_id,
+  unsigned char switch_id_len,
+  u16 pf, u16 vf);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
u32 size, u16 ingress_pools_count,
u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index f7323884c3fe..ffc993256527 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -173,6 +173,10 @@ enum devlink_port_flavour {
  * the PCI PF. It is an internal
  * port that faces the PCI PF.
  */
+   DEVLINK_PORT_FLAVOUR_PCI_VF, /* Represents eswitch port
+ * for the PCI VF. It is an internal
+ * port that faces the PCI VF.
+ */
 };
 
 enum devlink_param_cmode {
@@ -342,6 +346,8 @@ enum devlink_attr {
DEVLINK_ATTR_FLASH_UPDATE_STATUS_TOTAL, /* u64 */
 
DEVLINK_ATTR_PORT_PCI_PF_NUMBER,/* u16 */
+   DEVLINK_ATTR_PORT_PCI_VF_NUMBER,/* u16 */
+
/* add new attributes above here, update the policy in devlink.c */
 
__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 001f9e2c96f0..d62c4591351b 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -521,6 +521,12 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
if (nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER,
attrs->pci_pf.pf))
return -EMSGSIZE;
+   } else if (devlink_port->attrs.flavour == DEVLINK_PORT_FLAVOUR_PCI_VF) {
+   if (nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER,
+   attrs->pci_vf.pf) ||
+   nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_VF_NUMBER,
+   attrs->pci_vf.vf))
+   return -EMSGSIZE;
}
if (!attrs->split)
return 0;
@@ -5820,6 +5826,32 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port 
*devlink_port,
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_pf_set);
 
+/**
+ * devlink_port_attrs_pci_vf_set - Set PCI VF port attributes
+ *
+ * @devlink_port: devlink port
+ * @port_number: number of the port that is facing a VF
+ * @pf: associated PF for the devlink port instance
+ * @vf: associated VF of a PF for the devlink port instance
+ * @switch_id: if the port is part of switch, this is buffer with ID,
+ * otherwise this is NULL
+ * @switch_id_len: length of the switch_id buffer
+ */
+void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_por

[PATCH net-next v4 0/5] Add MPLS actions to TC

2019-07-01 Thread John Hurley
This patchset introduces a new TC action module that allows the
manipulation of the MPLS headers of packets. The code implements
functionality including push, pop, and modify.

Also included are tests for the new functionality. Note that these will
require iproute2 changes to be submitted soon.

NOTE: these patches are applied to net-next along with the patch:
[PATCH net 1/1] net: openvswitch: fix csum updates for MPLS actions
This patch has been accepted into net but, at time of posting, is not yet
in net-next.

v3-v4:
- refactor and reuse OvS code (Cong Wang)
- use csum API rather than skb_post*rscum to update skb->csum (Cong Wang)
- remove unnecessary warning (Cong Wang)
- add comments to uapi attributes (David Ahern)
- set strict type policy check for TCA_MPLS_UNSPEC (David Ahern)
- expand/improve extack messages (David Ahern)
- add option to manually set BOS
v2-v3:
- remove a few unnecessary line breaks (Jiri Pirko)
- retract hw offload patch from set (resubmit with driver changes) (Jiri)
v1->v2:
- ensure TCA_ID_MPLS does not conflict with TCA_ID_CTINFO (Davide Caratti)

John Hurley (5):
  net: core: move push MPLS functionality from OvS to core helper
  net: core: move pop MPLS functionality from OvS to core helper
  net: core: add MPLS update core helper and use in OvS
  net: sched: add mpls manipulation actions to TC
  selftests: tc-tests: actions: add MPLS tests

 include/linux/skbuff.h |   3 +
 include/net/tc_act/tc_mpls.h   |  29 +
 include/uapi/linux/pkt_cls.h   |   3 +-
 include/uapi/linux/tc_act/tc_mpls.h|  33 +
 net/core/skbuff.c  | 140 
 net/openvswitch/actions.c  |  81 +-
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/act_mpls.c   | 413 +++
 .../tc-testing/tc-tests/actions/mpls.json  | 812 +
 10 files changed, 1453 insertions(+), 73 deletions(-)
 create mode 100644 include/net/tc_act/tc_mpls.h
 create mode 100644 include/uapi/linux/tc_act/tc_mpls.h
 create mode 100644 net/sched/act_mpls.c
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/mpls.json

-- 
2.7.4



[PATCH net-next v4 1/5] net: core: move push MPLS functionality from OvS to core helper

2019-07-01 Thread John Hurley
Open vSwitch provides code to push an MPLS header onto a packet. In
preparation for supporting this in TC, move the push code to an skb helper
that can be reused.
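
The buffer manipulation at the heart of the helper, opening a 4-byte gap
between the MAC header and the network header and writing the label stack
entry into it, can be sketched on a plain byte array. This is a userspace
illustration of the memmove()/write pattern in skb_mpls_push(), not the skb
code itself:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MPLS_HLEN 4 /* one MPLS label stack entry */

/* Sketch: data points at mac_len bytes of MAC header followed by the
 * payload, with at least MPLS_HLEN bytes of headroom before it.
 * Returns the new start of the packet after the LSE is inserted. */
static uint8_t *push_lse(uint8_t *data, size_t mac_len, uint32_t lse)
{
	uint8_t *new_start = data - MPLS_HLEN;

	/* Move the MAC header back to open a 4-byte gap after it,
	 * mirroring the memmove() in skb_mpls_push(). */
	memmove(new_start, data, mac_len);
	/* Write the label stack entry where the network header was;
	 * the kernel stores this in big-endian (__be32) form. */
	memcpy(new_start + mac_len, &lse, MPLS_HLEN);
	return new_start;
}
```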

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 include/linux/skbuff.h|  1 +
 net/core/skbuff.c | 64 +++
 net/openvswitch/actions.c | 31 +++
 3 files changed, 69 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b5d427b..0112256 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3446,6 +3446,7 @@ int skb_ensure_writable(struct sk_buff *skb, int 
write_len);
 int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci);
 int skb_vlan_pop(struct sk_buff *skb);
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
+int skb_mpls_push(struct sk_buff *skb, __be32 mpls_lse, __be16 mpls_proto);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
 gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5323441..f1d1e47 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -5326,6 +5327,69 @@ int skb_vlan_push(struct sk_buff *skb, __be16 
vlan_proto, u16 vlan_tci)
 }
 EXPORT_SYMBOL(skb_vlan_push);
 
+/* Update the ethertype of hdr and the skb csum value if required. */
+static void skb_mod_eth_type(struct sk_buff *skb, struct ethhdr *hdr,
+__be16 ethertype)
+{
+   if (skb->ip_summed == CHECKSUM_COMPLETE) {
+   __be16 diff[] = { ~hdr->h_proto, ethertype };
+
+   skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum);
+   }
+
+   hdr->h_proto = ethertype;
+}
+
+/**
+ * skb_mpls_push() - push a new MPLS header after the mac header
+ *
+ * @skb: buffer
+ * @mpls_lse: MPLS label stack entry to push
+ * @mpls_proto: ethertype of the new MPLS header (expects 0x8847 or 0x8848)
+ *
+ * Expects skb->data at mac header.
+ *
+ * Returns 0 on success, -errno otherwise.
+ */
+int skb_mpls_push(struct sk_buff *skb, __be32 mpls_lse, __be16 mpls_proto)
+{
+   struct mpls_shim_hdr *lse;
+   int err;
+
+   if (unlikely(!eth_p_mpls(mpls_proto)))
+   return -EINVAL;
+
+   /* Networking stack does not allow simultaneous Tunnel and MPLS GSO. */
+   if (skb->encapsulation)
+   return -EINVAL;
+
+   err = skb_cow_head(skb, MPLS_HLEN);
+   if (unlikely(err))
+   return err;
+
+   if (!skb->inner_protocol) {
+   skb_set_inner_network_header(skb, skb->mac_len);
+   skb_set_inner_protocol(skb, skb->protocol);
+   }
+
+   skb_push(skb, MPLS_HLEN);
+   memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
+   skb->mac_len);
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, skb->mac_len);
+
+   lse = mpls_hdr(skb);
+   lse->label_stack_entry = mpls_lse;
+   skb_postpush_rcsum(skb, lse, MPLS_HLEN);
+
+   if (skb->dev && skb->dev->type == ARPHRD_ETHER)
+   skb_mod_eth_type(skb, eth_hdr(skb), mpls_proto);
+   skb->protocol = mpls_proto;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(skb_mpls_push);
+
 /**
  * alloc_skb_with_frags - allocate skb with page frags
  *
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index bd13146..a9a6c9c 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -175,34 +175,11 @@ static void update_ethertype(struct sk_buff *skb, struct 
ethhdr *hdr,
 static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
 const struct ovs_action_push_mpls *mpls)
 {
-   struct mpls_shim_hdr *new_mpls_lse;
-
-   /* Networking stack do not allow simultaneous Tunnel and MPLS GSO. */
-   if (skb->encapsulation)
-   return -ENOTSUPP;
-
-   if (skb_cow_head(skb, MPLS_HLEN) < 0)
-   return -ENOMEM;
-
-   if (!skb->inner_protocol) {
-   skb_set_inner_network_header(skb, skb->mac_len);
-   skb_set_inner_protocol(skb, skb->protocol);
-   }
-
-   skb_push(skb, MPLS_HLEN);
-   memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
-   skb->mac_len);
-   skb_reset_mac_header(skb);
-   skb_set_network_header(skb, skb->mac_len);
-
-   new_mpls_lse = mpls_hdr(skb);
-   new_mpls_lse->label_stack_entry = mpls->mpls_lse;
-
-   skb_postpush_rcsum(skb, new_mpls_lse, MPLS_HLEN);
+   int err;
 
-   if (ovs_key_mac_proto(key) == MAC_PROTO_ETHERNET)
-   update_ethertype(skb, eth_hdr(skb), mpls->mpls_ethertype);
-   skb->protocol = mpls->mpls_ethertype;
+   err = skb_mpls_push(skb, mpls->mpls_lse, mpls->mpls_ethertype);
+   if (err)
+   return err;
 
invalidate_flow_key(key);
re

[PATCH net-next v4 2/5] net: core: move pop MPLS functionality from OvS to core helper

2019-07-01 Thread John Hurley
Open vSwitch provides code to pop an MPLS header from a packet. In
preparation for supporting this in TC, move the pop code to an skb helper
that can be reused.

Remove the now-unused update_ethertype() static function from OvS.
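
A detail worth noting in the moved code is the VLAN comment: mac_len covers
the Ethernet header plus any VLAN tags, so locating the ethertype relative to
mpls_hdr() keeps the arithmetic right for any number of tags. The offset
calculation can be illustrated in userspace C (ETH_HLEN and VLAN_HLEN values
as in the kernel headers):

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN  14 /* Ethernet header without VLAN tags */
#define VLAN_HLEN 4  /* one 802.1Q tag */

/* The MPLS header starts at mac_len, so the ethertype to rewrite is
 * the last two bytes before it. This is the same result as the
 * kernel's "(struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN)"
 * trick, whose h_proto member sits at offset ETH_HLEN - 2 within
 * that struct. */
static size_t ethertype_off(unsigned int n_vlan_tags)
{
	size_t mac_len = ETH_HLEN + n_vlan_tags * VLAN_HLEN;

	return mac_len - 2;
}
```

With no VLAN tags this is the familiar offset 12; each tag shifts it by 4.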

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 include/linux/skbuff.h|  1 +
 net/core/skbuff.c | 42 ++
 net/openvswitch/actions.c | 37 ++---
 3 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0112256..89d5c43 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3447,6 +3447,7 @@ int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci);
 int skb_vlan_pop(struct sk_buff *skb);
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
 int skb_mpls_push(struct sk_buff *skb, __be32 mpls_lse, __be16 mpls_proto);
+int skb_mpls_pop(struct sk_buff *skb, __be16 next_proto);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
 gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f1d1e47..ce30989 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5391,6 +5391,48 @@ int skb_mpls_push(struct sk_buff *skb, __be32 mpls_lse, 
__be16 mpls_proto)
 EXPORT_SYMBOL_GPL(skb_mpls_push);
 
 /**
+ * skb_mpls_pop() - pop the outermost MPLS header
+ *
+ * @skb: buffer
+ * @next_proto: ethertype of header after popped MPLS header
+ *
+ * Expects skb->data at mac header.
+ *
+ * Returns 0 on success, -errno otherwise.
+ */
+int skb_mpls_pop(struct sk_buff *skb, __be16 next_proto)
+{
+   int err;
+
+   if (unlikely(!eth_p_mpls(skb->protocol)))
+   return -EINVAL;
+
+   err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
+   if (unlikely(err))
+   return err;
+
+   skb_postpull_rcsum(skb, mpls_hdr(skb), MPLS_HLEN);
+   memmove(skb_mac_header(skb) + MPLS_HLEN, skb_mac_header(skb),
+   skb->mac_len);
+
+   __skb_pull(skb, MPLS_HLEN);
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, skb->mac_len);
+
+   if (skb->dev && skb->dev->type == ARPHRD_ETHER) {
+   struct ethhdr *hdr;
+
+   /* use mpls_hdr() to get ethertype to account for VLANs. */
+   hdr = (struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN);
+   skb_mod_eth_type(skb, hdr, next_proto);
+   }
+   skb->protocol = next_proto;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(skb_mpls_pop);
+
+/**
  * alloc_skb_with_frags - allocate skb with page frags
  *
  * @header_len: size of linear part
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a9a6c9c..62715bb 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -160,18 +160,6 @@ static int do_execute_actions(struct datapath *dp, struct 
sk_buff *skb,
  struct sw_flow_key *key,
  const struct nlattr *attr, int len);
 
-static void update_ethertype(struct sk_buff *skb, struct ethhdr *hdr,
-__be16 ethertype)
-{
-   if (skb->ip_summed == CHECKSUM_COMPLETE) {
-   __be16 diff[] = { ~(hdr->h_proto), ethertype };
-
-   skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum);
-   }
-
-   hdr->h_proto = ethertype;
-}
-
 static int push_mpls(struct sk_buff *skb, struct sw_flow_key *key,
 const struct ovs_action_push_mpls *mpls)
 {
@@ -190,31 +178,10 @@ static int pop_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
 {
int err;
 
-   err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
-   if (unlikely(err))
+   err = skb_mpls_pop(skb, ethertype);
+   if (err)
return err;
 
-   skb_postpull_rcsum(skb, mpls_hdr(skb), MPLS_HLEN);
-
-   memmove(skb_mac_header(skb) + MPLS_HLEN, skb_mac_header(skb),
-   skb->mac_len);
-
-   __skb_pull(skb, MPLS_HLEN);
-   skb_reset_mac_header(skb);
-   skb_set_network_header(skb, skb->mac_len);
-
-   if (ovs_key_mac_proto(key) == MAC_PROTO_ETHERNET) {
-   struct ethhdr *hdr;
-
-   /* mpls_hdr() is used to locate the ethertype field correctly 
in the
-* presence of VLAN tags.
-*/
-   hdr = (struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN);
-   update_ethertype(skb, hdr, ethertype);
-   }
-   if (eth_p_mpls(skb->protocol))
-   skb->protocol = ethertype;
-
invalidate_flow_key(key);
return 0;
 }
-- 
2.7.4



[PATCH net-next v4 3/5] net: core: add MPLS update core helper and use in OvS

2019-07-01 Thread John Hurley
Open vSwitch allows the updating of an existing MPLS header on a packet.
In preparation for supporting similar functionality in TC, move this to a
common skb helper function.
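
The CHECKSUM_COMPLETE branch relies on the standard incremental-update
property of the Internet checksum (RFC 1624): folding ~old_word + new_word
into an existing one's-complement sum gives the same result as recomputing
the sum over the modified data, which is exactly what the { ~old, new } diff
passed to csum_partial() does. A minimal userspace demonstration of that
identity (not kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over 16-bit words with a 32-bit accumulator,
 * folded back to 16 bits (RFC 1071 style). */
static uint16_t csum16(const uint16_t *w, size_t n, uint32_t sum)
{
	while (n--)
		sum += *w++;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Incremental update: feed { ~old, new } into the running sum,
 * mirroring the { ~old_lse, new_lse } diff in skb_mpls_update_lse(). */
static uint16_t csum_update(uint16_t old_sum, uint16_t old_w, uint16_t new_w)
{
	uint16_t diff[2] = { (uint16_t)~old_w, new_w };

	return csum16(diff, 2, old_sum);
}
```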

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 include/linux/skbuff.h|  1 +
 net/core/skbuff.c | 34 ++
 net/openvswitch/actions.c | 13 +++--
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 89d5c43..1545c4c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3448,6 +3448,7 @@ int skb_vlan_pop(struct sk_buff *skb);
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
 int skb_mpls_push(struct sk_buff *skb, __be32 mpls_lse, __be16 mpls_proto);
 int skb_mpls_pop(struct sk_buff *skb, __be16 next_proto);
+int skb_mpls_update_lse(struct sk_buff *skb, __be32 mpls_lse);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
 gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ce30989..398ebcb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5433,6 +5433,40 @@ int skb_mpls_pop(struct sk_buff *skb, __be16 next_proto)
 EXPORT_SYMBOL_GPL(skb_mpls_pop);
 
 /**
+ * skb_mpls_update_lse() - modify outermost MPLS header and update csum
+ *
+ * @skb: buffer
+ * @mpls_lse: new MPLS label stack entry to update to
+ *
+ * Expects skb->data at mac header.
+ *
+ * Returns 0 on success, -errno otherwise.
+ */
+int skb_mpls_update_lse(struct sk_buff *skb, __be32 mpls_lse)
+{
+   struct mpls_shim_hdr *old_lse = mpls_hdr(skb);
+   int err;
+
+   if (unlikely(!eth_p_mpls(skb->protocol)))
+   return -EINVAL;
+
+   err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
+   if (unlikely(err))
+   return err;
+
+   if (skb->ip_summed == CHECKSUM_COMPLETE) {
+   __be32 diff[] = { ~old_lse->label_stack_entry, mpls_lse };
+
+   skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum);
+   }
+
+   old_lse->label_stack_entry = mpls_lse;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(skb_mpls_update_lse);
+
+/**
  * alloc_skb_with_frags - allocate skb with page frags
  *
  * @header_len: size of linear part
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 62715bb..3572e11 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -193,19 +193,12 @@ static int set_mpls(struct sk_buff *skb, struct 
sw_flow_key *flow_key,
__be32 lse;
int err;
 
-   err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
-   if (unlikely(err))
-   return err;
-
stack = mpls_hdr(skb);
lse = OVS_MASKED(stack->label_stack_entry, *mpls_lse, *mask);
-   if (skb->ip_summed == CHECKSUM_COMPLETE) {
-   __be32 diff[] = { ~(stack->label_stack_entry), lse };
-
-   skb->csum = csum_partial((char *)diff, sizeof(diff), skb->csum);
-   }
+   err = skb_mpls_update_lse(skb, lse);
+   if (err)
+   return err;
 
-   stack->label_stack_entry = lse;
flow_key->mpls.top_lse = lse;
return 0;
 }
-- 
2.7.4



[PATCH net-next v4 4/5] net: sched: add mpls manipulation actions to TC

2019-07-01 Thread John Hurley
Currently, TC offers the ability to match on the MPLS fields of a packet
through the use of the flow_dissector_key_mpls struct. However, as yet, TC
actions do not allow the modification or manipulation of such fields.

Add a new module that registers TC action ops to allow manipulation of
MPLS. This includes the ability to push and pop headers as well as modify
the contents of new or existing headers. A further action to decrement the
TTL field of an MPLS header is also provided.
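
The label/tc/bos/ttl attributes map onto the 32-bit MPLS label stack entry of
RFC 3032: a 20-bit label in the top bits, then 3 bits of TC, one BOS bit and
an 8-bit TTL, matching the field-width comments on the new uapi attributes.
A small userspace sketch of that packing (illustrative; the kernel builds the
LSE in big-endian form internally):

```c
#include <assert.h>
#include <stdint.h>

/* RFC 3032 label stack entry layout, matching the widths in the
 * TCA_MPLS_{LABEL,TC,BOS,TTL} attribute comments. */
static uint32_t mpls_lse(uint32_t label, uint8_t tc, uint8_t bos, uint8_t ttl)
{
	return ((label & 0xfffff) << 12) |   /* 20-bit label */
	       ((uint32_t)(tc & 0x7) << 9) | /* 3-bit TC     */
	       ((uint32_t)(bos & 0x1) << 8) |/* 1-bit BOS    */
	       ttl;                          /* 8-bit TTL    */
}

static uint32_t lse_label(uint32_t lse) { return lse >> 12; }
static uint8_t  lse_ttl(uint32_t lse)   { return (uint8_t)(lse & 0xff); }
```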

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 include/net/tc_act/tc_mpls.h|  29 +++
 include/uapi/linux/pkt_cls.h|   3 +-
 include/uapi/linux/tc_act/tc_mpls.h |  33 +++
 net/sched/Kconfig   |  11 +
 net/sched/Makefile  |   1 +
 net/sched/act_mpls.c| 413 
 6 files changed, 489 insertions(+), 1 deletion(-)
 create mode 100644 include/net/tc_act/tc_mpls.h
 create mode 100644 include/uapi/linux/tc_act/tc_mpls.h
 create mode 100644 net/sched/act_mpls.c

diff --git a/include/net/tc_act/tc_mpls.h b/include/net/tc_act/tc_mpls.h
new file mode 100644
index 000..7df7a1d
--- /dev/null
+++ b/include/net/tc_act/tc_mpls.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (C) 2019 Netronome Systems, Inc. */
+
+#ifndef __NET_TC_MPLS_H
+#define __NET_TC_MPLS_H
+
+#include 
+#include 
+
+struct tcf_mpls_params {
+   int tcfm_action;
+   u32 tcfm_label;
+   u8 tcfm_tc;
+   u8 tcfm_ttl;
+   u8 tcfm_bos;
+   __be16 tcfm_proto;
+   struct rcu_head rcu;
+};
+
+#define ACT_MPLS_TC_NOT_SET0xff
+#define ACT_MPLS_BOS_NOT_SET   0xff
+
+struct tcf_mpls {
+   struct tc_action common;
+   struct tcf_mpls_params __rcu *mpls_p;
+};
+#define to_mpls(a) ((struct tcf_mpls *)a)
+
+#endif /* __NET_TC_MPLS_H */
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 8cc6b67..e22ef4a 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -104,8 +104,9 @@ enum tca_id {
TCA_ID_SIMP = TCA_ACT_SIMP,
TCA_ID_IFE = TCA_ACT_IFE,
TCA_ID_SAMPLE = TCA_ACT_SAMPLE,
-   /* other actions go here */
TCA_ID_CTINFO,
+   TCA_ID_MPLS,
+   /* other actions go here */
__TCA_ID_MAX = 255
 };
 
diff --git a/include/uapi/linux/tc_act/tc_mpls.h 
b/include/uapi/linux/tc_act/tc_mpls.h
new file mode 100644
index 000..9360e95
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_mpls.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2019 Netronome Systems, Inc. */
+
+#ifndef __LINUX_TC_MPLS_H
+#define __LINUX_TC_MPLS_H
+
+#include 
+
+#define TCA_MPLS_ACT_POP   1
+#define TCA_MPLS_ACT_PUSH  2
+#define TCA_MPLS_ACT_MODIFY3
+#define TCA_MPLS_ACT_DEC_TTL   4
+
+struct tc_mpls {
+   tc_gen; /* generic TC action fields. */
+   int m_action;   /* action of type TCA_MPLS_ACT_*. */
+};
+
+enum {
+   TCA_MPLS_UNSPEC,
+   TCA_MPLS_TM,/* struct tcf_t; time values associated with action. */
+   TCA_MPLS_PARMS, /* struct tc_mpls; action type and general TC fields. */
+   TCA_MPLS_PAD,
+   TCA_MPLS_PROTO, /* be16; eth_type of pushed or next (for pop) header. */
+   TCA_MPLS_LABEL, /* u32; MPLS label. Lower 20 bits are used. */
+   TCA_MPLS_TC,/* u8; MPLS TC field. Lower 3 bits are used. */
+   TCA_MPLS_TTL,   /* u8; MPLS TTL field. Must not be 0. */
+   TCA_MPLS_BOS,   /* u8; MPLS BOS field. Either 1 or 0. */
+   __TCA_MPLS_MAX,
+};
+#define TCA_MPLS_MAX (__TCA_MPLS_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 360fdd3..731f5fb 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -842,6 +842,17 @@ config NET_ACT_CSUM
  To compile this code as a module, choose M here: the
  module will be called act_csum.
 
+config NET_ACT_MPLS
+   tristate "MPLS manipulation"
+   depends on NET_CLS_ACT
+   help
+ Say Y here to push or pop MPLS headers.
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_mpls.
+
 config NET_ACT_VLAN
 tristate "Vlan manipulation"
 depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index d54bfcb..c266036 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_NET_ACT_PEDIT)   += act_pedit.o
 obj-$(CONFIG_NET_ACT_SIMP) += act_simple.o
 obj-$(CONFIG_NET_ACT_SKBEDIT)  += act_skbedit.o
 obj-$(CONFIG_NET_ACT_CSUM) += act_csum.o
+obj-$(CONFIG_NET_ACT_MPLS) += act_mpls.o
 obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o
diff --git a/net/sched/act_mpls.c b/net/sched/act_mpls.c
new file mode 100644
index 000..dbd3fc8
--- /dev/null
+++ b

[PATCH net-next v4 5/5] selftests: tc-tests: actions: add MPLS tests

2019-07-01 Thread John Hurley
Add a new series of selftests to verify the functionality of act_mpls in
TC.

Signed-off-by: John Hurley 
Reviewed-by: Simon Horman 
Acked-by: Jakub Kicinski 
---
 .../tc-testing/tc-tests/actions/mpls.json  | 812 +
 1 file changed, 812 insertions(+)
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/mpls.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mpls.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/mpls.json
new file mode 100644
index 000..9708de9
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mpls.json
@@ -0,0 +1,812 @@
+[
+{
+"id": "a933",
+"name": "Add MPLS dec_ttl action with pipe opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl pipe index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action mpls",
+"matchPattern": "action order [0-9]+: mpls.*dec_ttl.*pipe.*index 8 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "08d1",
+"name": "Add mpls dec_ttl action with pass opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl pass index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mpls index 8",
+"matchPattern": "action order [0-9]+: mpls.*dec_ttl.*pass.*index 8 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "d786",
+"name": "Add mpls dec_ttl action with drop opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl drop index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mpls index 8",
+"matchPattern": "action order [0-9]+: mpls.*dec_ttl.*drop.*index 8 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "f334",
+"name": "Add mpls dec_ttl action with reclassify opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl reclassify index 
8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mpls index 8",
+"matchPattern": "action order [0-9]+: mpls.*dec_ttl.*reclassify.*index 
8 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "29bd",
+"name": "Add mpls dec_ttl action with continue opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl continue index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mpls index 8",
+"matchPattern": "action order [0-9]+: mpls.*dec_ttl.*continue.*index 8 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "48df",
+"name": "Add mpls dec_ttl action with jump opcode",
+"category": [
+"actions",
+"mpls"
+],
+"setup": [
+[
+"$TC actions flush action mpls",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mpls dec_ttl jump 10 index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action mpls",
+"matchPattern": "action order [0-9]+: mpls.*jump 10.*index 8 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mpls"
+]
+},
+{
+"id": "62eb",
+"name": "Add mpls dec_ttl action with trap opcode"

Re: [RFC iproute2 1/1] ip: netns: add mounted state file for each netns

2019-07-01 Thread Nicolas Dichtel
On 28/06/2019 at 18:26, David Howells wrote:
> Nicolas Dichtel  wrote:
> 
>> David Howells was working on a mount notification mechanism:
>> https://lwn.net/Articles/760714/
>> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>>
>> I don't know what is the status of this series.
> 
> It's still alive.  I just posted a new version of it.  I'm hoping, possibly
> futilely, to get it in in this merge window.
Nice to hear. It will help to properly solve this issue.


Thank you,
Nicolas


Re: [RFC iproute2] netns: add mounting state file for each netns

2019-07-01 Thread Nicolas Dichtel
On 30/06/2019 at 21:29, Matteo Croce wrote:
> When ip creates a netns, there is a small time interval between the
> placeholder file creation in NETNS_RUN_DIR and the bind mount from /proc.
> 
> Add a temporary file named .mounting-$netns which gets deleted after the
> bind mount, so watching for delete event matching the .mounting-* name
> will notify watchers only after the bind mount has been done.
Probably a naive question, but why create those '.mounting-$netns' files in
the directory where the netns are stored? Why not another directory, something
like /var/run/netns-monitor/?


Regards,
Nicolas


Re: [PATCH] net: ethernet: mediatek: Fix overlapping capability bits.

2019-07-01 Thread René van Dorst

Quoting Willem de Bruijn :


On Sat, Jun 29, 2019 at 8:24 AM René van Dorst  wrote:


Both MTK_TRGMII_MT7621_CLK and MTK_PATH_BIT are defined as bit 10.

This causes issues on non-MT7621 devices which have the
MTK_PATH_BIT(MTK_ETH_PATH_GMAC1_RGMII) capability set.
The wrong TRGMII setup code is executed.

Moving the MTK_PATH_BIT to bit 11 fixes the issue.

Fixes: 8efaa653a8a5 ("net: ethernet: mediatek: Add MT7621 TRGMII mode
support")
Signed-off-by: René van Dorst 


This targets net? Please mark networking patches [PATCH net] or [PATCH
net-next].


Hi Willem,

Thanks for your input.

This patch was for net-next.




---
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h  
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h

index 876ce6798709..2cb8a915731c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -626,7 +626,7 @@ enum mtk_eth_path {
 #define MTK_TRGMII_MT7621_CLK  BIT(10)

 /* Supported path present on SoCs */
-#define MTK_PATH_BIT(x) BIT((x) + 10)

+#define MTK_PATH_BIT(x) BIT((x) + 11)



To avoid this happening again, perhaps make the reserved range more explicit?

For instance

#define MTK_FIXED_BIT_LAST 10
#define MTK_TRGMII_MT7621_CLK  BIT(MTK_FIXED_BIT_LAST)

#define MTK_PATH_BIT_FIRST  (MTK_FIXED_BIT_LAST + 1)
#define MTK_PATH_BIT_LAST (MTK_FIXED_BIT_LAST + 7)
#define MTK_MUX_BIT_FIRST (MTK_PATH_BIT_LAST + 1)

Though I imagine there are cleaner approaches. Perhaps define all
fields as enum instead of just mtk_eth_mux and mtk_eth_path. Then
there can be no accidental collision.


You mean in a similar way as done in the ethtool.h [0]?

Use a enum to define the unique bits.

enum mtk_bits {
MTK_RGMII_BIT = 0,
MTK_SGMII_BIT,
MTK_TRGMII_BIT,
AND SO ON 
};

Also move the mtk_eth_mux and mtk_eth_path in to this enum.

Then use defines to convert bits to values.

#define MTK_RGMII  BIT(MTK_RGMII_BIT)
#define MTK_TRGMII BIT(MTK_TRGMII_BIT)

Replace the MTK_PATH_BIT and MTK_MUX_BIT macros with BIT()

Is this what you had in mind?

Greats,

René

[0]:  
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/ethtool.h#L1402
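The overlap this patch fixes can be checked with quick arithmetic: with the old macro, the first path capability lands on the same bit as MTK_TRGMII_MT7621_CLK (BIT(10)). A rough sketch mimicking the macros — BIT() and the path index 0 here are stand-ins for the kernel definitions, not the real driver code:

```shell
# mimic the kernel macros (illustrative only)
BIT() { echo $(( 1 << $1 )); }

clk=$(BIT 10)                    # MTK_TRGMII_MT7621_CLK = BIT(10)
path0_old=$(( 1 << (0 + 10) ))   # old MTK_PATH_BIT(0): same bit as clk
path0_new=$(( 1 << (0 + 11) ))   # fixed MTK_PATH_BIT(0)

echo $(( clk & path0_old )) $(( clk & path0_new ))   # -> 1024 0
```

A nonzero AND with the old macro is exactly the collision that made the wrong TRGMII setup code run.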






Re: [PATCH] net: ethernet: mediatek: Allow non TRGMII mode with MT7621 DDR2 devices

2019-07-01 Thread René van Dorst

Quoting René van Dorst :

I see that I also forgot to tag this patch for net-next.

Greats,

René


No reason to error out on a MT7621 device with DDR2 memory when non
TRGMII mode is selected.
Only MT7621 DDR2 clock setup is not supported for TRGMII mode.
But non TRGMII mode doesn't need any special clock setup.

Signed-off-by: René van Dorst 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c  
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c

index 066712f2e985..b20b3a5a1ebb 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -139,9 +139,12 @@ static int mt7621_gmac0_rgmii_adjust(struct  
mtk_eth *eth,

 {
u32 val;

-   /* Check DDR memory type. Currently DDR2 is not supported. */
+   /* Check DDR memory type.
+* Currently TRGMII mode with DDR2 memory is not supported.
+*/
regmap_read(eth->ethsys, ETHSYS_SYSCFG, &val);
-   if (val & SYSCFG_DRAM_TYPE_DDR2) {
+   if (interface == PHY_INTERFACE_MODE_TRGMII &&
+   val & SYSCFG_DRAM_TYPE_DDR2) {
dev_err(eth->dev,
"TRGMII mode with DDR2 memory is not supported!\n");
return -EOPNOTSUPP;
--
2.20.1






Re: [PATCH] net: ethernet: mediatek: Fix overlapping capability bits.

2019-07-01 Thread Willem de Bruijn
On Mon, Jul 1, 2019 at 8:44 AM René van Dorst  wrote:
>
> Quoting Willem de Bruijn :
>
> > On Sat, Jun 29, 2019 at 8:24 AM René van Dorst  
> > wrote:
> >>
> >> Both MTK_TRGMII_MT7621_CLK and MTK_PATH_BIT are defined as bit 10.
> >>
> >> This causes issues on non-MT7621 devices which has the
> >> MTK_PATH_BIT(MTK_ETH_PATH_GMAC1_RGMII) capability set.
> >> The wrong TRGMII setup code is executed.
> >>
> >> Moving the MTK_PATH_BIT to bit 11 fixes the issue.
> >>
> >> Fixes: 8efaa653a8a5 ("net: ethernet: mediatek: Add MT7621 TRGMII mode
> >> support")
> >> Signed-off-by: René van Dorst 
> >
> > This targets net? Please mark networking patches [PATCH net] or [PATCH
> > net-next].
>
> Hi Willem,
>
> Thanks for you input.
>
> This patch was for net-next.
>
> >
> >> ---
> >>  drivers/net/ethernet/mediatek/mtk_eth_soc.h | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> >> b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> >> index 876ce6798709..2cb8a915731c 100644
> >> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> >> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
> >> @@ -626,7 +626,7 @@ enum mtk_eth_path {
> >>  #define MTK_TRGMII_MT7621_CLK  BIT(10)
> >>
> >>  /* Supported path present on SoCs */
> >> -#define MTK_PATH_BIT(x) BIT((x) + 10)
> >>
> >> +#define MTK_PATH_BIT(x) BIT((x) + 11)
> >>
> >
> > To avoid this happening again, perhaps make the reserved range more 
> > explicit?
> >
> > For instance
> >
> > #define MTK_FIXED_BIT_LAST 10
> > #define MTK_TRGMII_MT7621_CLK  BIT(MTK_FIXED_BIT_LAST)
> >
> > #define MTK_PATH_BIT_FIRST  (MTK_FIXED_BIT_LAST + 1)
> > #define MTK_PATH_BIT_LAST (MTK_FIXED_BIT_LAST + 7)
> > #define MTK_MUX_BIT_FIRST (MTK_PATH_BIT_LAST + 1)
> >
> > Though I imagine there are cleaner approaches. Perhaps define all
> > fields as enum instead of just mtk_eth_mux and mtk_eth_path. Then
> > there can be no accidental collision.
>
> You mean in a similar way as done in the ethtool.h [0]?
>
> Use a enum to define the unique bits.
>
> enum mtk_bits {
> MTK_RGMII_BIT = 0,
> MTK_SGMII_BIT,
> MTK_TRGMII_BIT,
> AND SO ON 
> };
>
> Also move the mtk_eth_mux and mtk_eth_path in to this enum.

That's the key part: they are all part of the same namespace and these
enums are not used anywhere else, so a single enum will avoid
accidental namespace collisions.

> Then use defines to convert bits to values.
>
> #define MTK_RGMII  BIT(MTK_RGMII_BIT)
> #define MTK_TRGMII BIT(MTK_TRGMII_BIT)
>
> Replace the MTK_PATH_BIT and MTK_PATH_BIT macro with BIT()
>
> Is this what you had in mind?

Great find. Exactly, but I did not find such a clear example.

>
> Greats,
>
> René
>
> [0]:
> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/ethtool.h#L1402
>
>
>


Re: [PATCH net-next 3/3] macsec: add brackets and indentation after calling macsec_decrypt

2019-07-01 Thread Sabrina Dubroca
2019-06-30, 22:05:41 -0400, Willem de Bruijn wrote:
> On Sun, Jun 30, 2019 at 4:48 PM Andreas Steinmetz  wrote:
> >
> > At this point, skb could only be a valid pointer, so this patch does
> > not introduce any functional change.
> 
> Previously, macsec_post_decrypt could be called on the original skb if
> the initial condition was false and macsec_decrypt is skipped. That
> was probably unintended. Either way, then this is a functional change,
> and perhaps a bugfix?

Ouch, I missed that when Andreas sent me that patch before. No, it is
actually intended. If we skip macsec_decrypt(), we should still
account for that packet in the InPktsUnchecked/InPktsDelayed
counters. That's in Figure 10-5 in the standard.

Thanks for catching this, Willem. That patch should only move the
IS_ERR(skb) case under the block where macsec_decrypt() is called, but
not move the call to macsec_post_decrypt().


> > Signed-off-by: Andreas Steinmetz 
> >
> > --- a/drivers/net/macsec.c  2019-06-30 22:05:17.785683634 +0200
> > +++ b/drivers/net/macsec.c  2019-06-30 22:05:20.526171178 +0200
> > @@ -1205,21 +1205,22 @@
> >
> > /* Disabled && !changed text => skip validation */
> > if (hdr->tci_an & MACSEC_TCI_C ||
> > -   secy->validate_frames != MACSEC_VALIDATE_DISABLED)
> > +   secy->validate_frames != MACSEC_VALIDATE_DISABLED) {
> > skb = macsec_decrypt(skb, dev, rx_sa, sci, secy);
> >
> > -   if (IS_ERR(skb)) {
> > -   /* the decrypt callback needs the reference */
> > -   if (PTR_ERR(skb) != -EINPROGRESS) {
> > -   macsec_rxsa_put(rx_sa);
> > -   macsec_rxsc_put(rx_sc);
> > +   if (IS_ERR(skb)) {
> > +   /* the decrypt callback needs the reference */
> > +   if (PTR_ERR(skb) != -EINPROGRESS) {
> > +   macsec_rxsa_put(rx_sa);
> > +   macsec_rxsc_put(rx_sc);
> > +   }
> > +   rcu_read_unlock();
> > +   return RX_HANDLER_CONSUMED;
> > }
> > -   rcu_read_unlock();
> > -   return RX_HANDLER_CONSUMED;
> > -   }
> >
> > -   if (!macsec_post_decrypt(skb, secy, pn))
> > -   goto drop;
> > +   if (!macsec_post_decrypt(skb, secy, pn))
> > +   goto drop;
> > +   }
> >
> >  deliver:
> > macsec_finalize_skb(skb, secy->icv_len,
> >

-- 
Sabrina


Re: r8169 not working on 5.2.0rc6 with GPD MicroPC

2019-07-01 Thread Andrew Lunn
> When the vendor driver assigns a random MAC address, it writes it to the
> chip. The related registers may be persistent (can't say exactly due to
> missing documentation).

If the device supports WOL, it could be it is powered using the
standby supply, not the main supply. Try pulling the plug from the
wall to really remove all power.

 Andrew


[PATCH net-next] ipv6: icmp: allow flowlabel reflection in echo replies

2019-07-01 Thread Eric Dumazet
Extend flowlabel_reflect bitmask to allow conditional
reflection of incoming flowlabels in echo replies.

Note this takes precedence over auto flowlabels.

Add flowlabel_reflect enum to replace hard coded
values.

Signed-off-by: Eric Dumazet 
---
 Documentation/networking/ip-sysctl.txt | 4 +++-
 include/net/ipv6.h | 7 +++
 net/ipv6/af_inet6.c| 2 +-
 net/ipv6/icmp.c| 3 +++
 net/ipv6/sysctl_net_ipv6.c | 4 ++--
 net/ipv6/tcp_ipv6.c| 2 +-
 6 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 
e0d8a96e2c671e3d09d234c8ed49799b08240259..f0e6d1f53485d6cbfcd73c9cd079b970d976b6d9
 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1452,7 +1452,7 @@ flowlabel_reflect - INTEGER
environments. See RFC 7690 and:
https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
 
-   This is a mask of two bits.
+   This is a bitmask.
1: enabled for established flows
 
Note that this prevents automatic flowlabel changes, as done
@@ -1463,6 +1463,8 @@ flowlabel_reflect - INTEGER
If set, a RST packet sent in response to a SYN packet on a closed
port will reflect the incoming flow label.
 
+   4: enabled for ICMPv6 echo reply messages.
+
Default: 0
 
 fib_multipath_hash_policy - INTEGER
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 
b41f6a0fa903e9916e293f86f8bfb0f264161e80..8eca5fb30376f3a0a40ff0dc438cbad9ff56142a
 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -301,6 +301,13 @@ struct ipv6_txoptions {
/* Option buffer, as read by IPV6_PKTOPTIONS, starts here. */
 };
 
+/* flowlabel_reflect sysctl values */
+enum flowlabel_reflect {
+   FLOWLABEL_REFLECT_ESTABLISHED   = 1,
+   FLOWLABEL_REFLECT_TCP_RESET = 2,
+   FLOWLABEL_REFLECT_ICMPV6_ECHO_REPLIES   = 4,
+};
+
 struct ip6_flowlabel {
struct ip6_flowlabel __rcu *next;
__be32  label;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 
7382a927d1eb74a6bbf4d5f83de336ccab5a2ae2..8369af32cef619b5d8fd2fcfaeb12924941d4ae8
 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -208,7 +208,7 @@ static int inet6_create(struct net *net, struct socket 
*sock, int protocol,
np->mc_loop = 1;
np->mc_all  = 1;
np->pmtudisc= IPV6_PMTUDISC_WANT;
-   np->repflow = net->ipv6.sysctl.flowlabel_reflect & 1;
+   np->repflow = net->ipv6.sysctl.flowlabel_reflect & 
FLOWLABEL_REFLECT_ESTABLISHED;
sk->sk_ipv6only = net->ipv6.sysctl.bindv6only;
 
/* Init the ipv4 part of the socket since we can have sockets
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 
12906301ec7baedcccfba224b93d30cb6060c3b9..62c997201970a664cbcfd526d426af07ae019b0e
 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -703,6 +703,9 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
tmp_hdr.icmp6_type = ICMPV6_ECHO_REPLY;
 
memset(&fl6, 0, sizeof(fl6));
+   if (net->ipv6.sysctl.flowlabel_reflect & 
FLOWLABEL_REFLECT_ICMPV6_ECHO_REPLIES)
+   fl6.flowlabel = ip6_flowlabel(ipv6_hdr(skb));
+
fl6.flowi6_proto = IPPROTO_ICMPV6;
fl6.daddr = ipv6_hdr(skb)->saddr;
if (saddr)
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 
6d86fac472e7298cbd8df7aa0b190cf0087675e2..8b3fe81783ed945e2f9172fd9008f48fed474475
 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -23,7 +23,7 @@
 
 static int zero;
 static int one = 1;
-static int three = 3;
+static int flowlabel_reflect_max = 0x7;
 static int auto_flowlabels_min;
 static int auto_flowlabels_max = IP6_AUTO_FLOW_LABEL_MAX;
 
@@ -116,7 +116,7 @@ static struct ctl_table ipv6_table_template[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
.extra1 = &zero,
-   .extra2 = &three,
+   .extra2 = &flowlabel_reflect_max,
},
{
.procname   = "max_dst_opts_number",
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 
408d9ec2697154e840a26675765e8a9c1636ada4..4f3f99b3982099b3c64669f0445bc68d27390c89
 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -989,7 +989,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct 
sk_buff *skb)
if (sk->sk_state == TCP_TIME_WAIT)
label = cpu_to_be32(inet_twsk(sk)->tw_flowlabel);
} else {
-   if (net->ipv6.sysctl.flowlabel_reflect & 2)
+   if (net->ipv6.sysctl.flowlabel_reflect & 
FLOWLABEL_REFLECT_TCP_RESET)
label = ip6_flowlabel(ipv6h);
}
 
-- 
2.22.0.410.gd8fdbe21b5-goog
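Since flowlabel_reflect is a bitmask, the enum values above are simply OR'ed together when writing the sysctl. A quick sketch of composing a value that enables both established-flow and echo-reply reflection — the sysctl write itself is commented out, as it needs root and a kernel carrying this patch:

```shell
# flowlabel_reflect is a bitmask; combine the enum values with OR
established=1   # FLOWLABEL_REFLECT_ESTABLISHED
echo_replies=4  # FLOWLABEL_REFLECT_ICMPV6_ECHO_REPLIES

val=$(( established | echo_replies ))
echo "net.ipv6.flowlabel_reflect = $val"   # -> net.ipv6.flowlabel_reflect = 5
# sysctl -w net.ipv6.flowlabel_reflect=$val   # needs root and a patched kernel
```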



Re: [PATCH] sis900: add ethtool tests (link, eeprom)

2019-07-01 Thread Andrew Lunn
On Mon, Jul 01, 2019 at 11:03:33AM +0200, Sergej Benilov wrote:
> Add tests for ethtool: link test, EEPROM read test.
> Correct a few typos, too.

Hi Sergej

Please split this up into two patches. The first one should fix the
typos.

Rather than implementing a test for the EEPROM, add support for
ethtool --eeprom-dump. That is much more useful.

The link test does not show you anything which you cannot get via ip
link show. If there is no carrier, the link is down. So drop that.

The patch also has white space issues. Spaces where there should be
tabs. Please run ./scripts/checkpatch.pl.

 Andrew


Re: [RFC iproute2] netns: add mounting state file for each netns

2019-07-01 Thread Matteo Croce
On Mon, Jul 1, 2019 at 2:38 PM Nicolas Dichtel
 wrote:
>
> Le 30/06/2019 à 21:29, Matteo Croce a écrit :
> > When ip creates a netns, there is a small time interval between the
> > placeholder file creation in NETNS_RUN_DIR and the bind mount from /proc.
> >
> > Add a temporary file named .mounting-$netns which gets deleted after the
> > bind mount, so watching for delete event matching the .mounting-* name
> > will notify watchers only after the bind mount has been done.
> Probably a naive question, but why creating those '.mounting-$netns' files in
> the directory where netns are stored? Why not another directory, something 
> like
> /var/run/netns-monitor/?
>
>
> Regards,
> Nicolas

Yes, would work too. But ideally I'd wait for the mount inotify notifications.

-- 
Matteo Croce
per aspera ad upstream


Re: [PATCH net-next 1/8] Documentation/bindings: net: ocelot: document the PTP bank

2019-07-01 Thread Andrew Lunn
On Mon, Jul 01, 2019 at 12:03:20PM +0200, Antoine Tenart wrote:
> One additional register range needs to be described within the Ocelot
> device tree node: the PTP. This patch documents the binding needed to do
> so.

Hi Antoine

Are there any more register banks? Maybe just add them all?

Also, you should probably add a comment that despite it being in the
Required part of the binding, it is actually optional.

 Andrew


Re: [PATCH net-next 3/8] Documentation/bindings: net: ocelot: document the PTP ready IRQ

2019-07-01 Thread Andrew Lunn
On Mon, Jul 01, 2019 at 12:03:22PM +0200, Antoine Tenart wrote:
> One additional interrupt needs to be described within the Ocelot device
> tree node: the PTP ready one. This patch documents the binding needed to
> do so.

Hi Antoine

Same questions/points as for the register bank :-)

Andrew


[PATCH iproute2] man: tc-netem.8: fix URL for netem page

2019-07-01 Thread Andrea Claudi
The URL for the netem page in the sources section points to a resource
that no longer exists. Fix this by using the correct URL.

Fixes: cd72dcf13c8a4 ("netem: add man-page")
Signed-off-by: Andrea Claudi 
---
 man/man8/tc-netem.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index 09cf042f0..5a08a406a4a7b 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -219,7 +219,7 @@ April 2005
 (http://devresources.linux-foundation.org/shemminger/netem/LCA2005_paper.pdf)
 
 .IP " 2. " 4
-Netem page from Linux foundation, (http://www.linuxfoundation.org/en/Net:Netem)
+Netem page from Linux foundation, 
(https://wiki.linuxfoundation.org/networking/netem)
 
 .IP " 3. " 4
 Salsano S., Ludovici F., Ordine A., "Definition of a general and intuitive loss
-- 
2.20.1



Re: [PATCH iproute2] man: tc-netem.8: fix URL for netem page

2019-07-01 Thread Andrea Claudi
On Mon, Jul 1, 2019 at 4:05 PM Andrea Claudi  wrote:
>
> URL for netem page on sources section points to a no more existent
> resource. Fix this using the correct URL.
>
> Fixes: cd72dcf13c8a4 ("netem: add man-page")
> Signed-off-by: Andrea Claudi 
> ---
>  man/man8/tc-netem.8 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
> index 09cf042f0..5a08a406a4a7b 100644
> --- a/man/man8/tc-netem.8
> +++ b/man/man8/tc-netem.8
> @@ -219,7 +219,7 @@ April 2005
>  (http://devresources.linux-foundation.org/shemminger/netem/LCA2005_paper.pdf)
>
>  .IP " 2. " 4
> -Netem page from Linux foundation, 
> (http://www.linuxfoundation.org/en/Net:Netem)
> +Netem page from Linux foundation, 
> (https://wiki.linuxfoundation.org/networking/netem)
>
>  .IP " 3. " 4
>  Salsano S., Ludovici F., Ordine A., "Definition of a general and intuitive 
> loss
> --
> 2.20.1
>

Hi Stephen,
I noticed that the link to your LCA 2005 paper is wrong, too (it
actually redirects to the home page of the Linux Foundation Wiki). If
you provide me the correct URL, I will happily send a v2 of this patch
fixing that, too.

Regards,
Andrea


Re: [RFC iproute2] netns: add mounting state file for each netns

2019-07-01 Thread Nicolas Dichtel
On 01/07/2019 at 15:50, Matteo Croce wrote:
> On Mon, Jul 1, 2019 at 2:38 PM Nicolas Dichtel
>  wrote:
>>
>> Le 30/06/2019 à 21:29, Matteo Croce a écrit :
>>> When ip creates a netns, there is a small time interval between the
>>> placeholder file creation in NETNS_RUN_DIR and the bind mount from /proc.
>>>
>>> Add a temporary file named .mounting-$netns which gets deleted after the
>>> bind mount, so watching for delete event matching the .mounting-* name
>>> will notify watchers only after the bind mount has been done.
>> Probably a naive question, but why creating those '.mounting-$netns' files in
>> the directory where netns are stored? Why not another directory, something 
>> like
>> /var/run/netns-monitor/?
>>
>>
>> Regards,
>> Nicolas
> 
> Yes, would work too. But ideally I'd wait for the mount inotify notifications.
> 
Yes, I agree.


Re: [PATCH v3] ss: introduce switch to print exact value of data rates

2019-07-01 Thread David Ahern
On 7/1/19 5:52 AM, Tomasz Torcz wrote:
>   Introduce -X/--exact switch to disable human-friendly printing
>  of data rates. Without the switch (default), data is presented as MBps/Kbps.
> 
>   Signed-off-by: Tomasz Torcz 
> ---
>  man/man8/ss.8 |  3 +++
>  misc/ss.c | 12 ++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
>  Changes in v3:
>   - updated ss man page with new option
> 

ss now has a Numeric option which can be used for this as well if we
broaden the meaning to be 'raw numbers over human readable'.



Re: [PATCH 1/2] samples: pktgen: add some helper functions for port parsing

2019-07-01 Thread Jesper Dangaard Brouer
On Sat, 29 Jun 2019 22:33:57 +0900
"Daniel T. Lee"  wrote:

> This commit adds port parsing and port validate helper function to parse
> single or range of port(s) from a given string. (e.g. 1234, 443-444)
> 
> Helpers will be used later to set the target port(s) in samples/pktgen.
> 
> Signed-off-by: Daniel T. Lee 
> ---
>  samples/pktgen/functions.sh | 34 ++
>  1 file changed, 34 insertions(+)


Nice bash shellcode with use of array variables.

Acked-by: Jesper Dangaard Brouer 

> diff --git a/samples/pktgen/functions.sh b/samples/pktgen/functions.sh
> index f8bb3cd0f4ce..4af4046d71be 100644
> --- a/samples/pktgen/functions.sh
> +++ b/samples/pktgen/functions.sh
> @@ -162,3 +162,37 @@ function get_node_cpus()
>  
>   echo $node_cpu_list
>  }
> +
> +# Given a single or range of port(s), return minimum and maximum port number.
> +function parse_ports()
> +{
> +local port_str=$1
> +local port_list
> +local min_port
> +local max_port
> +
> +IFS="-" read -ra port_list <<< $port_str
> +
> +min_port=${port_list[0]}
> +max_port=${port_list[1]:-$min_port}
> +
> +echo $min_port $max_port
> +}
> +
> +# Given a minimum and maximum port, verify port number.
> +function validate_ports()
> +{
> +local min_port=$1
> +local max_port=$2
> +
> +# 0 < port < 65536
> +if [[ $min_port -gt 0 && $min_port -lt 65536 ]]; then
> + if [[ $max_port -gt 0 && $max_port -lt 65536 ]]; then
> + if [[ $min_port -le $max_port ]]; then
> + return 0
> + fi
> + fi
> +fi
> +
> +err 5 "Invalid port(s): $min_port-$max_port"
> +}



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
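For anyone wanting to try the helper outside the pktgen scripts, it can be exercised standalone; the function below is a copy of parse_ports from the patch:

```shell
# parse_ports, copied from the patch: split "min-max" (or a single port)
# into minimum and maximum port numbers
function parse_ports()
{
    local port_str=$1
    local port_list
    local min_port
    local max_port

    IFS="-" read -ra port_list <<< $port_str

    min_port=${port_list[0]}
    max_port=${port_list[1]:-$min_port}

    echo $min_port $max_port
}

parse_ports "443-444"   # -> 443 444
parse_ports "1234"      # -> 1234 1234
```

Note the `${port_list[1]:-$min_port}` default, which is what makes a single port behave as a degenerate range.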


Re: [PATCH 00/11] XDP unaligned chunk placement support

2019-07-01 Thread Laatz, Kevin

On 28/06/2019 21:29, Jonathan Lemon wrote:

On 28 Jun 2019, at 9:19, Laatz, Kevin wrote:

On 27/06/2019 22:25, Jakub Kicinski wrote:

I think that's very limiting.  What is the challenge in providing
aligned addresses, exactly?

The challenges are two-fold:
1) it prevents using arbitrary buffer sizes, which will be an issue 
supporting e.g. jumbo frames in future.
2) higher level user-space frameworks which may want to use AF_XDP, 
such as DPDK, do not currently support having buffers with 'fixed' 
alignment.

    The reason that DPDK uses arbitrary placement is that:
        - it would stop things working on certain NICs which need the 
actual writable space specified in units of 1k - therefore we need 2k 
+ metadata space.
        - we place padding between buffers to avoid constantly 
hitting the same memory channels when accessing memory.
        - it allows the application to choose the actual buffer size 
it wants to use.
    We make use of the above to allow us to speed up processing 
significantly and also reduce the packet buffer memory size.


    Not having arbitrary buffer alignment also means an AF_XDP driver 
for DPDK cannot be a drop-in replacement for existing drivers in 
those frameworks. Even with a new capability to allow an arbitrary 
buffer alignment, existing apps will need to be modified to use that 
new capability.


Since all buffers in the umem are the same chunk size, the original
buffer address can be recalculated with some multiply/shift math.
However, this is more expensive than just a mask operation.



Yes, we can do this.

Another option we have is to add a socket option for querying the 
metadata length from the driver (assuming it doesn't vary per packet). 
We can use that information to get back the original address using 
subtraction.


Alternatively, we can change the Rx descriptor format to include the 
metadata length. We could do this in a couple of ways: for example, 
rather than returning the address at the start of the packet, instead 
return the buffer address that was passed in, and add another 16-bit 
field to specify the start-of-packet offset within that buffer. If 
using 16 bits of descriptor space is not desirable, an alternative could 
be to limit umem sizes to e.g. 2^48 bits (256 terabytes should be 
enough, right :-) ) and use the remaining 16 bits of the address as a 
packet offset. Other variations on these approaches are obviously 
possible too.
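The trade-off between multiply/shift math and a mask mentioned in the thread can be sketched with plain arithmetic: recovering a chunk's base address from an address inside it. The chunk sizes below are made-up examples, not values from the patch set:

```shell
chunk=3000                         # arbitrary (non power-of-two) chunk size
addr=$(( 5 * chunk + 123 ))        # an address 123 bytes into chunk 5
base=$(( addr / chunk * chunk ))   # recover chunk start: divide then multiply
echo $base                         # -> 15000

chunk2=2048                        # power-of-two chunk size
addr2=$(( 5 * chunk2 + 123 ))
base2=$(( addr2 & ~(chunk2 - 1) )) # recover chunk start: a cheap mask
echo $base2                        # -> 10240
```

The mask only works when the chunk size is a power of two, which is exactly why aligned chunks allow the cheaper operation.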




Re: [PATCH 2/2] samples: pktgen: allow to specify destination port

2019-07-01 Thread Jesper Dangaard Brouer
On Sat, 29 Jun 2019 22:33:58 +0900
"Daniel T. Lee"  wrote:

> Currently, kernel pktgen has the feature to specify udp destination port
> for sending packet. (e.g. pgset "udp_dst_min 9")
> 
> But on samples, each of the scripts doesn't have any option to achieve this.
> 
> This commit adds the DST_PORT option to specify the target port(s) in the 
> script.
> 
> -p : ($DST_PORT)  destination PORT range (e.g. 433-444) is also allowed
> 
> Signed-off-by: Daniel T. Lee 

Nice feature, this looks very usable for testing.  I think my QA asked
me for something similar.

One nitpick is that script named pktgen_sample03_burst_single_flow.sh
implies this is a single flow, but by specifying a port-range this will
be more flows.  I'm okay with adding this, as the end-user specifying a
port-range should realize this.  Thus, you get my ACK.

Acked-by: Jesper Dangaard Brouer 

Another thing you should realize (but you/we cannot do anything about)
is that when the scripts use burst or clone, then the port (UDPDST_RND)
will be the same for all packets in the same burst.  I don't know if it
matters for your use-case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 8/8] net: mscc: PTP Hardware Clock (PHC) support

2019-07-01 Thread Willem de Bruijn
On Mon, Jul 1, 2019 at 6:05 AM Antoine Tenart
 wrote:
>
> This patch adds support for PTP Hardware Clock (PHC) to the Ocelot
> switch for both PTP 1-step and 2-step modes.
>
> Signed-off-by: Antoine Tenart 

>  void ocelot_deinit(struct ocelot *ocelot)
>  {
> +   struct ocelot_port *port;
> +   struct ocelot_skb *entry;
> +   struct list_head *pos;
> +   int i;
> +
> destroy_workqueue(ocelot->stats_queue);
> mutex_destroy(&ocelot->stats_lock);
> ocelot_ace_deinit();
> +
> +   for (i = 0; i < ocelot->num_phys_ports; i++) {
> +   port = ocelot->ports[i];
> +
> +   list_for_each(pos, &port->skbs) {
> +   entry = list_entry(pos, struct ocelot_skb, head);
> +
> +   list_del(pos);

list_for_each_safe

> +   kfree(entry);
> +   }
> +   }
>  }
>  EXPORT_SYMBOL(ocelot_deinit);


[PATCH net] net: don't warn in inet diag when IPV6 is disabled

2019-07-01 Thread Stephen Hemminger
If IPV6 was disabled, then the ss command would cause a kernel warning
because the command was attempting to dump IPV6 socket information.
This should not be a warning; instead just return a normal error
code.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202249
Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets")
Signed-off-by: Stephen Hemminger 
---
 net/ipv4/raw_diag.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 899e34ceb560..045485d39f23 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -19,9 +19,11 @@ raw_get_hashinfo(const struct inet_diag_req_v2 *r)
 {
if (r->sdiag_family == AF_INET) {
return &raw_v4_hashinfo;
-#if IS_ENABLED(CONFIG_IPV6)
} else if (r->sdiag_family == AF_INET6) {
+#if IS_ENABLED(CONFIG_IPV6)
return &raw_v6_hashinfo;
+#else
+   return ERR_PTR(-EOPNOTSUPP);
 #endif
} else {
pr_warn_once("Unexpected inet family %d\n",
-- 
2.20.1



Re: [PATCH net-next v4 3/5] net: core: add MPLS update core helper and use in OvS

2019-07-01 Thread Willem de Bruijn
On Mon, Jul 1, 2019 at 8:31 AM John Hurley  wrote:
>
> Open vSwitch allows the updating of an existing MPLS header on a packet.
> In preparation for supporting similar functionality in TC, move this to a
> common skb helper function.
>
> Signed-off-by: John Hurley 
> Reviewed-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 
> ---
>  /**
> + * skb_mpls_update_lse() - modify outermost MPLS header and update csum
> + *
> + * @skb: buffer
> + * @mpls_lse: new MPLS label stack entry to update to
> + *
> + * Expects skb->data at mac header.
> + *
> + * Returns 0 on success, -errno otherwise.
> + */
> +int skb_mpls_update_lse(struct sk_buff *skb, __be32 mpls_lse)
> +{
> +   struct mpls_shim_hdr *old_lse = mpls_hdr(skb);
> +   int err;
> +
> +   if (unlikely(!eth_p_mpls(skb->protocol)))
> +   return -EINVAL;
> +
> +   err = skb_ensure_writable(skb, skb->mac_len + MPLS_HLEN);
> +   if (unlikely(err))
> +   return err;
> +
> +   if (skb->ip_summed == CHECKSUM_COMPLETE) {
> +   __be32 diff[] = { ~old_lse->label_stack_entry, mpls_lse };
> +
> +   skb->csum = csum_partial((char *)diff, sizeof(diff), 
> skb->csum);
> +   }
> +
> +   old_lse->label_stack_entry = mpls_lse;

skb_ensure_writable may have reallocated the skb linear. old_lse needs
to be loaded after. Or, safer:

  mpls_hdr(skb)->label_stack_entry = mpls_lse;
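As background, the {~old, new} diff fed to csum_partial() in the hunk above relies on one's complement arithmetic, where adding ~old and new to a running sum is equivalent to replacing old with new (the incremental-update identity from RFC 1624). A minimal userspace sketch of that identity, using 16-bit words and made-up data for brevity:

```shell
# fold a running sum back into 16 bits (one's complement carry wraparound)
fold() {
    local s=$1
    while (( s >> 16 )); do
        s=$(( (s & 0xffff) + (s >> 16) ))
    done
    echo $s
}

old=0xabcd new=0x0000
sum=$(fold $(( 0x1234 + old + 0x0001 )))         # sum over the original words
full=$(fold $(( 0x1234 + new + 0x0001 )))        # full recompute after the change
incr=$(fold $(( sum + (~old & 0xffff) + new )))  # incremental: add ~old + new

echo "$full $incr"   # -> 4661 4661 (the two agree)
```

This is why the kernel can patch skb->csum without re-summing the whole packet when one MPLS label stack entry changes.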


[PATCH 1/4] net: dsa: Change DT bindings for Vitesse VSC73xx switches

2019-07-01 Thread Pawel Dembicki
This commit documents the changes after splitting the vsc73xx driver
into core and SPI parts. The change of DT bindings is required to
support the same vsc73xx chips which need a PI bus to communicate with
the CPU. It also describes how to use the vsc73xx platform driver.

Signed-off-by: Pawel Dembicki 
---
 .../bindings/net/dsa/vitesse,vsc73xx.txt  | 74 ---
 1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/vitesse,vsc73xx.txt 
b/Documentation/devicetree/bindings/net/dsa/vitesse,vsc73xx.txt
index ed4710c40641..c6a4cd85891c 100644
--- a/Documentation/devicetree/bindings/net/dsa/vitesse,vsc73xx.txt
+++ b/Documentation/devicetree/bindings/net/dsa/vitesse,vsc73xx.txt
@@ -2,8 +2,8 @@ Vitesse VSC73xx Switches
 
 
 This defines device tree bindings for the Vitesse VSC73xx switch chips.
-The Vitesse company has been acquired by Microsemi and Microsemi in turn
-acquired by Microchip but retains this vendor branding.
+The Vitesse company has been acquired by Microsemi and Microsemi has
+been acquired by Microchip but retains this vendor branding.
 
 The currently supported switch chips are:
 Vitesse VSC7385 SparX-G5 5+1-port Integrated Gigabit Ethernet Switch
@@ -11,16 +11,26 @@ Vitesse VSC7388 SparX-G8 8-port Integrated Gigabit Ethernet 
Switch
 Vitesse VSC7395 SparX-G5e 5+1-port Integrated Gigabit Ethernet Switch
 Vitesse VSC7398 SparX-G8e 8-port Integrated Gigabit Ethernet Switch
 
-The device tree node is an SPI device so it must reside inside a SPI bus
-device tree node, see spi/spi-bus.txt
+This switch can have two different management interfaces.
+
+If SPI interface is used, the device tree node is an SPI device so it must
+reside inside a SPI bus device tree node, see spi/spi-bus.txt
+
+If the platform driver is used, the device tree node is a platform device so it
+must reside inside a platform bus device tree node.
 
 Required properties:
 
-- compatible: must be exactly one of:
-   "vitesse,vsc7385"
-   "vitesse,vsc7388"
-   "vitesse,vsc7395"
-   "vitesse,vsc7398"
+- compatible (SPI): must be exactly one of:
+   "vitesse,vsc7385-spi"
+   "vitesse,vsc7388-spi"
+   "vitesse,vsc7395-spi"
+   "vitesse,vsc7398-spi"
+- compatible (Platform): must be exactly one of:
+   "vitesse,vsc7385-platform"
+   "vitesse,vsc7388-platform"
+   "vitesse,vsc7395-platform"
+   "vitesse,vsc7398-platform"
 - gpio-controller: indicates that this switch is also a GPIO controller,
   see gpio/gpio.txt
 - #gpio-cells: this must be set to <2> and indicates that we are a twocell
@@ -38,8 +48,9 @@ and subnodes of DSA switches.
 
 Examples:
 
+SPI:
 switch@0 {
-   compatible = "vitesse,vsc7395";
+   compatible = "vitesse,vsc7395-spi";
reg = <0>;
/* Specified for 2.5 MHz or below */
spi-max-frequency = <250>;
@@ -79,3 +90,46 @@ switch@0 {
};
};
 };
+
+Platform:
+switch@2,0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "vitesse,vsc7385-platform";
+   reg = <0x2 0x0 0x2>;
+   reset-gpios = <&gpio0 12 GPIO_ACTIVE_LOW>;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   port@0 {
+   reg = <0>;
+   label = "lan1";
+   };
+   port@1 {
+   reg = <1>;
+   label = "lan2";
+   };
+   port@2 {
+   reg = <2>;
+   label = "lan3";
+   };
+   port@3 {
+   reg = <3>;
+   label = "lan4";
+   };
+   vsc: port@6 {
+   reg = <6>;
+   label = "cpu";
+   ethernet = <&enet0>;
+   phy-mode = "rgmii";
+   fixed-link {
+   speed = <1000>;
+   full-duplex;
+   pause;
+   };
+   };
+   };
+
+};
-- 
2.20.1



Re: [PATCH bpf-next 1/2] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Andrii Nakryiko
On Sat, Jun 29, 2019 at 10:53 PM Yonghong Song  wrote:
>
>
>
> On 6/28/19 4:10 PM, Stanislav Fomichev wrote:
> > Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided
> > that it can do a single u64 store into user_ip6[2] instead of two
> > separate u32 ones:
> >
> >   #  17: (18) r2 = 0x100
> >   #  ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
> >   #  19: (7b) *(u64 *)(r1 +16) = r2
> >   #  invalid bpf_context access off=16 size=8
> >
> >  From the compiler point of view it does look like a correct thing
> > to do, so let's support it on the kernel side.
> >
> > Credit to Andrii Nakryiko for a proper implementation of
> > bpf_ctx_wide_store_ok.
> >
> > Cc: Andrii Nakryiko 
> > Cc: Yonghong Song 
> > Fixes: cd17d7770578 ("bpf/tools: sync bpf.h")
> > Reported-by: kernel test robot 
> > Signed-off-by: Stanislav Fomichev 
>
> The change looks good to me with the following nits:
>1. could you add a cover letter for the patch set?
>   typically if the number of patches is more than one,
>   it would be a good practice with a cover letter.
>   See bpf_devel_QA.rst .
>2. with this change, the comments in uapi bpf.h
>   are not accurate any more.
>  __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
>   * Stored in network byte order.
>
>   */
>  __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
>   * Stored in network byte order.
>   */
>   now for stores, aligned 8-byte write is permitted.
>   could you update this as well?
>
>  From the typical usage pattern, I did not see a need
> for 8-byte read of user_ip6 and msg_src_ip6 yet. So let
> us just deal with write for now.

But I guess it's still possible for clang to optimize two consecutive
4-byte reads into single 8-byte read in some circumstances? If that's
the case, maybe it's a good idea to have corresponding read checks as
well?
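For readers following the thread, the acceptance/rejection pattern of the proposed check can be reproduced in plain userspace C. The struct below is a simplified stand-in for the start of struct bpf_sock_addr (not the real uapi definition), and offsetofend() is re-derived locally:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in: user_ip6 starts at offset 8 and is u64-aligned. */
struct sock_addr_like {
	uint32_t user_family;  /* offset 0 */
	uint32_t user_ip4;     /* offset 4 */
	uint32_t user_ip6[4];  /* offsets 8..23 */
};

#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Mirrors the bpf_ctx_wide_store_ok() logic from the patch. */
#define wide_store_ok(off, size, type, field)                    \
	((size) == sizeof(uint64_t) &&                           \
	 (off) >= offsetof(type, field) &&                       \
	 (off) + sizeof(uint64_t) <= offsetofend(type, field) && \
	 (off) % sizeof(uint64_t) == 0)

int check(uint32_t off)
{
	return wide_store_ok(off, sizeof(uint64_t),
			     struct sock_addr_like, user_ip6);
}
```

The results match the selftest expectations discussed in patch 2/2: 8-byte stores at offsets 8 and 16 pass, while offsets 12, 20 and 24 are rejected.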

But overall this looks good to me:

Acked-by: Andrii Nakryiko 

>
> With the above two nits,
> Acked-by: Yonghong Song 
>
> > ---
> >   include/linux/filter.h |  6 ++
> >   net/core/filter.c  | 22 ++
> >   2 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 340f7d648974..3901007e36f1 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -746,6 +746,12 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
> > size_default)
> >   return size <= size_default && (size & (size - 1)) == 0;
> >   }
> >
> > +#define bpf_ctx_wide_store_ok(off, size, type, field)  
> >   \
> > + (size == sizeof(__u64) &&   \
> > + off >= offsetof(type, field) && \
> > + off + sizeof(__u64) <= offsetofend(type, field) &&  \
> > + off % sizeof(__u64) == 0)
> > +
> >   #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
> >
> >   static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index dc8534be12fc..5d33f2146dab 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6849,6 +6849,16 @@ static bool sock_addr_is_valid_access(int off, int 
> > size,
> >   if (!bpf_ctx_narrow_access_ok(off, size, 
> > size_default))
> >   return false;
> >   } else {
> > + if (bpf_ctx_wide_store_ok(off, size,
> > +   struct bpf_sock_addr,
> > +   user_ip6))
> > + return true;
> > +
> > + if (bpf_ctx_wide_store_ok(off, size,
> > +   struct bpf_sock_addr,
> > +   msg_src_ip6))
> > + return true;
> > +
> >   if (size != size_default)
> >   return false;
> >   }
> > @@ -7689,9 +7699,6 @@ static u32 xdp_convert_ctx_access(enum 
> > bpf_access_type type,
> >   /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
> >* SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
> >*
> > - * It doesn't support SIZE argument though since narrow stores are not
> > - * supported for now.
> > - *
> >* In addition it uses Temporary Field TF (member of struct S) as the 3rd
> >* "register" since two registers available in convert_ctx_access are not
> >* enough: we can't override neither SRC, since it contains value to 
> > store, nor
> > @@ -7699,7 +7706,7 @@ static u32 xdp_convert_ctx_access(enum 
> > bpf_access_type type,
> >* instructions. But we need a temporary place to save pointer to neste

Re: [PATCH bpf-next 2/2] selftests/bpf: add verifier tests for wide stores

2019-07-01 Thread Andrii Nakryiko
On Sat, Jun 29, 2019 at 11:02 PM Yonghong Song  wrote:
>
>
>
> On 6/28/19 4:10 PM, Stanislav Fomichev wrote:
> > Make sure that wide stores are allowed at proper (aligned) addresses.
> > Note that user_ip6 is naturally aligned on 8-byte boundary, so
> > correct addresses are user_ip6[0] and user_ip6[2]. msg_src_ip6 is,
> > however, aligned on a 4-byte bondary, so only msg_src_ip6[1]
> > can be wide-stored.
> >
> > Cc: Andrii Nakryiko 
> > Cc: Yonghong Song 
> > Signed-off-by: Stanislav Fomichev 
> > ---
> >   tools/testing/selftests/bpf/test_verifier.c   | 17 ++--
> >   .../selftests/bpf/verifier/wide_store.c   | 40 +++
> >   2 files changed, 54 insertions(+), 3 deletions(-)
> >   create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c 
> > b/tools/testing/selftests/bpf/test_verifier.c
> > index c5514daf8865..b0773291012a 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -105,6 +105,7 @@ struct bpf_test {
> >   __u64 data64[TEST_DATA_LEN / 8];
> >   };
> >   } retvals[MAX_TEST_RUNS];
> > + enum bpf_attach_type expected_attach_type;
> >   };
> >
> >   /* Note we want this to be 64 bit aligned so that the end of our array is
> > @@ -850,6 +851,7 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> >   int fd_prog, expected_ret, alignment_prevented_execution;
> >   int prog_len, prog_type = test->prog_type;
> >   struct bpf_insn *prog = test->insns;
> > + struct bpf_load_program_attr attr;
> >   int run_errs, run_successes;
> >   int map_fds[MAX_NR_MAPS];
> >   const char *expected_err;
> > @@ -881,8 +883,17 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> >   pflags |= BPF_F_STRICT_ALIGNMENT;
> >   if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
> >   pflags |= BPF_F_ANY_ALIGNMENT;
> > - fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
> > -  "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 4);
> > +
> > + memset(&attr, 0, sizeof(attr));
> > + attr.prog_type = prog_type;
> > + attr.expected_attach_type = test->expected_attach_type;
> > + attr.insns = prog;
> > + attr.insns_cnt = prog_len;
> > + attr.license = "GPL";
> > + attr.log_level = 4;
> > + attr.prog_flags = pflags;
> > +
> > + fd_prog = bpf_load_program_xattr(&attr, bpf_vlog, sizeof(bpf_vlog));
> >   if (fd_prog < 0 && !bpf_probe_prog_type(prog_type, 0)) {
> >   printf("SKIP (unsupported program type %d)\n", prog_type);
> >   skips++;
> > @@ -912,7 +923,7 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> >   printf("FAIL\nUnexpected success to load!\n");
> >   goto fail_log;
> >   }
> > - if (!strstr(bpf_vlog, expected_err)) {
> > + if (!expected_err || !strstr(bpf_vlog, expected_err)) {
> >   printf("FAIL\nUnexpected error message!\n\tEXP: 
> > %s\n\tRES: %s\n",
> > expected_err, bpf_vlog);
> >   goto fail_log;
> > diff --git a/tools/testing/selftests/bpf/verifier/wide_store.c 
> > b/tools/testing/selftests/bpf/verifier/wide_store.c
> > new file mode 100644
> > index ..c6385f45b114
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/verifier/wide_store.c
> > @@ -0,0 +1,40 @@
> > +#define BPF_SOCK_ADDR(field, off, res, err) \
> > +{ \
> > + "wide store to bpf_sock_addr." #field "[" #off "]", \
> > + .insns = { \
> > + BPF_MOV64_IMM(BPF_REG_0, 1), \
> > + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, \
> > + offsetof(struct bpf_sock_addr, field[off])), \
> > + BPF_EXIT_INSN(), \
> > + }, \
> > + .result = res, \
> > + .prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR, \
> > + .expected_attach_type = BPF_CGROUP_UDP6_SENDMSG, \
> > + .errstr = err, \
> > +}
> > +
> > +/* user_ip6[0] is u64 aligned */
> > +BPF_SOCK_ADDR(user_ip6, 0, ACCEPT,
> > +   NULL),
> > +BPF_SOCK_ADDR(user_ip6, 1, REJECT,
> > +   "invalid bpf_context access off=12 size=8"),
> > +BPF_SOCK_ADDR(user_ip6, 2, ACCEPT,
> > +   NULL),
> > +BPF_SOCK_ADDR(user_ip6, 3, REJECT,
> > +   "invalid bpf_context access off=20 size=8"),
> > +BPF_SOCK_ADDR(user_ip6, 4, REJECT,
> > +   "invalid bpf_context access off=24 size=8"),
>
> With offset 4, we have
> #968/p wide store to bpf_sock_addr.user_ip6[4] OK
>
> This test case can be removed. user code typically
> won't write bpf_sock_addr.user_ip6[4], and compiler
> typically will give a warning since it is out of
> array bound. Any particular reason you want to
> include this one?

I agree, user_ip6[4] is essentially 8-byte write to user_port field.

>
>
> > +
> > +/* ms

Re: [PATCH net-next v4 4/5] net: sched: add mpls manipulation actions to TC

2019-07-01 Thread David Ahern
On 7/1/19 6:30 AM, John Hurley wrote:
> Currently, TC offers the ability to match on the MPLS fields of a packet
> through the use of the flow_dissector_key_mpls struct. However, as yet, TC
> actions do not allow the modification or manipulation of such fields.
> 
> Add a new module that registers TC action ops to allow manipulation of
> MPLS. This includes the ability to push and pop headers as well as modify
> the contents of new or existing headers. A further action to decrement the
> TTL field of an MPLS header is also provided.

Would be good to document an example here and how to handle a label
stack. The same example can be used with the iproute2 patch (I presume
this one ;-)).


> +static int valid_label(const struct nlattr *attr,
> +struct netlink_ext_ack *extack)
> +{
> + const u32 *label = nla_data(attr);
> +
> + if (!*label || *label & ~MPLS_LABEL_MASK) {
> + NL_SET_ERR_MSG_MOD(extack, "MPLS label out of range");
> + return -EINVAL;
> + }

core MPLS code (nla_get_labels) checks for MPLS_LABEL_IMPLNULL as well.
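For illustration only, a stricter validation along these lines might look like the sketch below. The mask and reserved-label values follow the MPLS uapi conventions, but the exact policy (which reserved labels to reject) is an assumption, not the final patch:

```c
#include <assert.h>
#include <stdint.h>

#define MPLS_LABEL_MASK     0xFFFFFu /* 20-bit label field */
#define MPLS_LABEL_IMPLNULL 3u       /* implicit-null reserved label */

/* Sketch: reject zero, out-of-range, and implicit-null labels,
 * similar to what nla_get_labels() checks in the core MPLS code. */
int label_ok(uint32_t label)
{
	if (!label || (label & ~MPLS_LABEL_MASK))
		return 0; /* zero or wider than 20 bits */
	if (label == MPLS_LABEL_IMPLNULL)
		return 0; /* reserved, must not appear on the wire */
	return 1;
}
```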


> +
> + return 0;
> +}
> +
> +static const struct nla_policy mpls_policy[TCA_MPLS_MAX + 1] = {
> + [TCA_MPLS_UNSPEC]   = { .strict_start_type = TCA_MPLS_UNSPEC + 1 },
> + [TCA_MPLS_PARMS]= NLA_POLICY_EXACT_LEN(sizeof(struct tc_mpls)),
> + [TCA_MPLS_PROTO]= { .type = NLA_U16 },
> + [TCA_MPLS_LABEL]= NLA_POLICY_VALIDATE_FN(NLA_U32, valid_label),
> + [TCA_MPLS_TC]   = NLA_POLICY_RANGE(NLA_U8, 0, 7),
> + [TCA_MPLS_TTL]  = NLA_POLICY_MIN(NLA_U8, 1),
> + [TCA_MPLS_BOS]  = NLA_POLICY_RANGE(NLA_U8, 0, 1),
> +};
> +
> +static int tcf_mpls_init(struct net *net, struct nlattr *nla,
> +  struct nlattr *est, struct tc_action **a,
> +  int ovr, int bind, bool rtnl_held,
> +  struct tcf_proto *tp, struct netlink_ext_ack *extack)
> +{
> + struct tc_action_net *tn = net_generic(net, mpls_net_id);
> + struct nlattr *tb[TCA_MPLS_MAX + 1];
> + struct tcf_chain *goto_ch = NULL;
> + struct tcf_mpls_params *p;
> + struct tc_mpls *parm;
> + bool exists = false;
> + struct tcf_mpls *m;
> + int ret = 0, err;
> + u8 mpls_ttl = 0;
> +
> + if (!nla) {
> + NL_SET_ERR_MSG_MOD(extack, "Missing netlink attributes");
> + return -EINVAL;
> + }
> +
> + err = nla_parse_nested(tb, TCA_MPLS_MAX, nla, mpls_policy, extack);
> + if (err < 0)
> + return err;
> +
> + if (!tb[TCA_MPLS_PARMS]) {
> + NL_SET_ERR_MSG_MOD(extack, "No MPLS params");
> + return -EINVAL;
> + }
> + parm = nla_data(tb[TCA_MPLS_PARMS]);
> +
> + /* Verify parameters against action type. */
> + switch (parm->m_action) {
> + case TCA_MPLS_ACT_POP:
> + if (!tb[TCA_MPLS_PROTO] ||
> + !eth_proto_is_802_3(nla_get_be16(tb[TCA_MPLS_PROTO]))) {
> + NL_SET_ERR_MSG_MOD(extack, "Invalid protocol type for 
> MPLS pop");

would be better to call out '!tb[TCA_MPLS_PROTO]' with its own 'Protocol
must be set for pop' message.



Re: [PATCH net-next v4 4/5] net: sched: add mpls manipulation actions to TC

2019-07-01 Thread Willem de Bruijn
On Mon, Jul 1, 2019 at 8:31 AM John Hurley  wrote:
>
> Currently, TC offers the ability to match on the MPLS fields of a packet
> through the use of the flow_dissector_key_mpls struct. However, as yet, TC
> actions do not allow the modification or manipulation of such fields.
>
> Add a new module that registers TC action ops to allow manipulation of
> MPLS. This includes the ability to push and pop headers as well as modify
> the contents of new or existing headers. A further action to decrement the
> TTL field of an MPLS header is also provided.
>
> Signed-off-by: John Hurley 
> Reviewed-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 

> +static __be32 tcf_mpls_get_lse(struct mpls_shim_hdr *lse,
> +  struct tcf_mpls_params *p, bool set_bos)
> +{
> +   u32 new_lse = 0;
> +
> +   if (lse)
> +   new_lse = be32_to_cpu(lse->label_stack_entry);
> +
> +   if (p->tcfm_label) {
> +   new_lse &= ~MPLS_LS_LABEL_MASK;
> +   new_lse |= p->tcfm_label << MPLS_LS_LABEL_SHIFT;
> +   }
> +   if (p->tcfm_ttl) {
> +   new_lse &= ~MPLS_LS_TTL_MASK;
> +   new_lse |= p->tcfm_ttl << MPLS_LS_TTL_SHIFT;
> +   }
> +   if (p->tcfm_tc != ACT_MPLS_TC_NOT_SET) {
> +   new_lse &= ~MPLS_LS_TC_MASK;
> +   new_lse |= p->tcfm_tc << MPLS_LS_TC_SHIFT;
> +   }
> +   if (p->tcfm_bos != ACT_MPLS_BOS_NOT_SET) {
> +   new_lse &= ~MPLS_LS_S_MASK;
> +   new_lse |= p->tcfm_bos << MPLS_LS_S_SHIFT;
> +   } else if (set_bos) {
> +   new_lse |= 1 << MPLS_LS_S_SHIFT;
> +   }

not necessarily for this patchset, but perhaps it would make code more
readable to add a struct mpls_label_type with integer bit fields to avoid
all this explicit masking and shifting.
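As a hedged alternative to a bit-field struct (which would need endian-specific field ordering, as struct iphdr does), small per-field helpers over the existing masks would also hide the shifting. A userspace sketch using the standard host-order LSE layout (label bits 12-31, TC bits 9-11, S bit 8, TTL bits 0-7):

```c
#include <assert.h>
#include <stdint.h>

#define MPLS_LS_LABEL_SHIFT 12
#define MPLS_LS_LABEL_MASK  (0xFFFFFu << MPLS_LS_LABEL_SHIFT)
#define MPLS_LS_TC_SHIFT    9
#define MPLS_LS_TC_MASK     (0x7u << MPLS_LS_TC_SHIFT)
#define MPLS_LS_S_SHIFT     8
#define MPLS_LS_S_MASK      (0x1u << MPLS_LS_S_SHIFT)
#define MPLS_LS_TTL_SHIFT   0
#define MPLS_LS_TTL_MASK    (0xFFu << MPLS_LS_TTL_SHIFT)

/* Per-field set/get helpers keep the masking in one place; the
 * helper names here are illustrative, not from the patch. */
uint32_t lse_set_ttl(uint32_t lse, uint8_t ttl)
{
	return (lse & ~MPLS_LS_TTL_MASK) |
	       ((uint32_t)ttl << MPLS_LS_TTL_SHIFT);
}

uint8_t lse_get_ttl(uint32_t lse)
{
	return (lse & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT;
}
```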

> +static int tcf_mpls_act(struct sk_buff *skb, const struct tc_action *a,
> +   struct tcf_result *res)
> +{
> +   struct tcf_mpls *m = to_mpls(a);
> +   struct mpls_shim_hdr *pkt_lse;
> +   struct tcf_mpls_params *p;
> +   __be32 new_lse;
> +   u32 cpu_lse;
> +   int ret;
> +   u8 ttl;
> +
> +   tcf_lastuse_update(&m->tcf_tm);
> +   bstats_cpu_update(this_cpu_ptr(m->common.cpu_bstats), skb);
> +
> +   /* Ensure 'data' points at mac_header prior calling mpls manipulating
> +* functions.
> +*/
> +   if (skb_at_tc_ingress(skb))
> +   skb_push_rcsum(skb, skb->mac_len);
> +
> +   ret = READ_ONCE(m->tcf_action);
> +
> +   p = rcu_dereference_bh(m->mpls_p);
> +
> +   switch (p->tcfm_action) {
> +   case TCA_MPLS_ACT_POP:
> +   if (skb_mpls_pop(skb, p->tcfm_proto))
> +   goto drop;
> +   break;
> +   case TCA_MPLS_ACT_PUSH:
> +   new_lse = tcf_mpls_get_lse(NULL, p, 
> !eth_p_mpls(skb->protocol));
> +   if (skb_mpls_push(skb, new_lse, p->tcfm_proto))
> +   goto drop;
> +   break;
> +   case TCA_MPLS_ACT_MODIFY:
> +   new_lse = tcf_mpls_get_lse(mpls_hdr(skb), p, false);
> +   if (skb_mpls_update_lse(skb, new_lse))
> +   goto drop;
> +   break;
> +   case TCA_MPLS_ACT_DEC_TTL:
> +   pkt_lse = mpls_hdr(skb);
> +   cpu_lse = be32_to_cpu(pkt_lse->label_stack_entry);
> +   ttl = (cpu_lse & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT;
> +   if (!--ttl)
> +   goto drop;
> +
> +   cpu_lse &= ~MPLS_LS_TTL_MASK;
> +   cpu_lse |= ttl << MPLS_LS_TTL_SHIFT;

this could perhaps use a helper of its own?
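A minimal userspace sketch of what such a helper could look like; the name lse_dec_ttl and the return convention are assumptions, and it mirrors the decrement logic quoted above on a host-order LSE:

```c
#include <assert.h>
#include <stdint.h>

#define MPLS_LS_TTL_SHIFT 0
#define MPLS_LS_TTL_MASK  (0xFFu << MPLS_LS_TTL_SHIFT)

/* Decrement the TTL inside *lse; return 0 on success, -1 when the
 * decremented TTL reaches zero (the caller would drop the packet).
 * Mirrors the "if (!--ttl) goto drop" pattern from the action code. */
int lse_dec_ttl(uint32_t *lse)
{
	uint8_t ttl = (*lse & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT;

	if (!--ttl)
		return -1;
	*lse = (*lse & ~MPLS_LS_TTL_MASK) |
	       ((uint32_t)ttl << MPLS_LS_TTL_SHIFT);
	return 0;
}
```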


> +   if (skb_mpls_update_lse(skb, cpu_to_be32(cpu_lse)))
> +   goto drop;
> +   break;
> +   }


[RFC PATCH v2 1/2] Documentation: net: dsa: Describe DSA switch configuration

2019-07-01 Thread Benedikt Spranger
Document DSA tagged and VLAN based switch configuration by showcases.

Signed-off-by: Benedikt Spranger 
---
 .../networking/dsa/configuration.rst  | 292 ++
 Documentation/networking/dsa/index.rst|   1 +
 2 files changed, 293 insertions(+)
 create mode 100644 Documentation/networking/dsa/configuration.rst

diff --git a/Documentation/networking/dsa/configuration.rst 
b/Documentation/networking/dsa/configuration.rst
new file mode 100644
index ..55d6dce6500d
--- /dev/null
+++ b/Documentation/networking/dsa/configuration.rst
@@ -0,0 +1,292 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+DSA switch configuration from userspace
+===
+
+The DSA switch configuration has not yet been integrated into the main
+userspace network configuration suites and has to be performed manually.
+
+.. _dsa-config-showcases:
+
+Configuration showcases
+---
+
+To configure a DSA switch a couple of commands need to be executed. In this
+documentation some common configuration scenarios are handled as showcases:
+
+*single port*
+  Every switch port acts as a different configurable Ethernet port
+
+*bridge*
+  Every switch port is part of one configurable Ethernet bridge
+
+*gateway*
+  Every switch port except one upstream port is part of a configurable
+  Ethernet bridge.
+  The upstream port acts as a different configurable Ethernet port.
+
+All configurations are performed with tools from iproute2, which is available at
+https://www.kernel.org/pub/linux/utils/net/iproute2/
+
+Through DSA every port of a switch is handled like a normal Linux Ethernet
+interface. The CPU port is the switch port connected to an Ethernet MAC chip.
+The corresponding Linux Ethernet interface is called the master interface.
+All other corresponding Linux interfaces are called slave interfaces.
+
+The slave interfaces depend on the master interface. They can only be brought
+up when the master interface is up.
+
+In this documentation the following Ethernet interfaces are used:
+
+*eth0*
+  the master interface
+
+*lan1*
+  a slave interface
+
+*lan2*
+  another slave interface
+
+*lan3*
+  a third slave interface
+
+*wan*
+  A slave interface dedicated for upstream traffic
+
+Further Ethernet interfaces can be configured similarly.
+The configured IPs and networks are:
+
+*single port*
+  * lan1: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3)
+  * lan2: 192.0.2.5/30 (192.0.2.4 - 192.0.2.7)
+  * lan3: 192.0.2.9/30 (192.0.2.8 - 192.0.2.11)
+
+*bridge*
+  * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255)
+
+*gateway*
+  * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255)
+  * wan: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3)
+
+.. _dsa-tagged-configuration:
+
+Configuration with tagging support
+--
+
+The tagging-based configuration is preferred and supported by the majority of
+DSA switches. These switches are capable of tagging incoming and outgoing traffic
+without using a VLAN based configuration.
+
+single port
+~~~
+
+.. code-block:: sh
+
+  # configure each interface
+  ip addr add 192.0.2.1/30 dev lan1
+  ip addr add 192.0.2.5/30 dev lan2
+  ip addr add 192.0.2.9/30 dev lan3
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+
+  # bring up the slave interfaces
+  ip link set lan1 up
+  ip link set lan2 up
+  ip link set lan3 up
+
+bridge
+~~
+
+.. code-block:: sh
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+
+  # bring up the slave interfaces
+  ip link set lan1 up
+  ip link set lan2 up
+  ip link set lan3 up
+
+  # create bridge
+  ip link add name br0 type bridge
+
+  # add ports to bridge
+  ip link set dev lan1 master br0
+  ip link set dev lan2 master br0
+  ip link set dev lan3 master br0
+
+  # configure the bridge
+  ip addr add 192.0.2.129/25 dev br0
+
+  # bring up the bridge
+  ip link set dev br0 up
+
+gateway
+~~~
+
+.. code-block:: sh
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+
+  # bring up the slave interfaces
+  ip link set wan up
+  ip link set lan1 up
+  ip link set lan2 up
+
+  # configure the upstream port
+  ip addr add 192.0.2.1/30 dev wan
+
+  # create bridge
+  ip link add name br0 type bridge
+
+  # add ports to bridge
+  ip link set dev lan1 master br0
+  ip link set dev lan2 master br0
+
+  # configure the bridge
+  ip addr add 192.0.2.129/25 dev br0
+
+  # bring up the bridge
+  ip link set dev br0 up
+
+.. _dsa-vlan-configuration:
+
+Configuration without tagging support
+-
+
+A minority of switches are not capable of using a tagging protocol
+(DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN-based
+configuration.
+
+single port
+~~~
+The configuration can only be set up via VLAN tagging and bridge setup.
+
+.. code-block:: sh
+
+  # tag traffic on CPU port
+

[RFC PATCH v2 2/2] Documentation: net: dsa: b53: Describe b53 configuration

2019-07-01 Thread Benedikt Spranger
Document the configuration requirements specific to the b53 driver.

Signed-off-by: Benedikt Spranger 
---
 Documentation/networking/dsa/b53.rst   | 174 +
 Documentation/networking/dsa/index.rst |   1 +
 2 files changed, 175 insertions(+)
 create mode 100644 Documentation/networking/dsa/b53.rst

diff --git a/Documentation/networking/dsa/b53.rst 
b/Documentation/networking/dsa/b53.rst
new file mode 100644
index ..23f1d79a6258
--- /dev/null
+++ b/Documentation/networking/dsa/b53.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Broadcom RoboSwitch Ethernet switch driver
+==
+
+The Broadcom RoboSwitch Ethernet switch family is used in quite a range of
+xDSL routers, cable modems and other multimedia devices.
+
+The current implementation supports the devices BCM5325E, BCM5365, BCM539x,
+BCM53115 and BCM53125 as well as BCM63XX.
+
+Implementation details
+==
+
+The driver is located in ``drivers/net/dsa/b53/`` and is implemented as a
+DSA driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the
+subsystem and what it provides.
+
+The switch is, if possible, configured to enable a Broadcom specific 4-bytes
+switch tag which gets inserted by the switch for every packet forwarded to the
+CPU interface; conversely, the CPU network interface should insert a similar
+tag for packets entering the CPU port. The tag format is described in
+``net/dsa/tag_brcm.c``.
+
+The configuration of the device depends on whether or not tagging is
+supported.
+
+The interface names and example network configuration are used according to
+the configuration described in the :ref:`dsa-config-showcases`.
+
+Configuration with tagging support
+--
+
+The tagging-based configuration is preferred. It is not specific to the b53
+DSA driver and works the same way as for all DSA drivers that support tagging.
+
+See :ref:`dsa-tagged-configuration`.
+
+Configuration without tagging support
+-
+
+Older models (5325, 5365) support a different tag format that is not supported
+yet. 539x and 531x5 require managed mode and some special handling, which is
+also not yet supported. The tagging support is disabled in these cases and the
+switch needs a different configuration.
+
+The configuration differs slightly from the :ref:`dsa-vlan-configuration`.
+
+single port
+~~~
+The configuration can only be set up via VLAN tagging and bridge setup.
+By default packets are tagged with VID 1:
+
+.. code-block:: sh
+
+  # tag traffic on CPU port
+  ip link add link eth0 name eth0.1 type vlan id 1
+  ip link add link eth0 name eth0.2 type vlan id 2
+  ip link add link eth0 name eth0.3 type vlan id 3
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+  ip link set eth0.1 up
+  ip link set eth0.2 up
+  ip link set eth0.3 up
+
+  # bring up the slave interfaces
+  ip link set wan up
+  ip link set lan1 up
+  ip link set lan2 up
+
+  # create bridge
+  ip link add name br0 type bridge
+
+  # activate VLAN filtering
+  ip link set dev br0 type bridge vlan_filtering 1
+
+  # add ports to bridges
+  ip link set dev wan master br0
+  ip link set dev lan1 master br0
+  ip link set dev lan2 master br0
+
+  # tag traffic on ports
+  bridge vlan add dev lan1 vid 2 pvid untagged
+  bridge vlan del dev lan1 vid 1
+  bridge vlan add dev lan2 vid 3 pvid untagged
+  bridge vlan del dev lan2 vid 1
+
+  # configure the VLANs
+  ip addr add 192.0.2.1/30 dev eth0.1
+  ip addr add 192.0.2.5/30 dev eth0.2
+  ip addr add 192.0.2.9/30 dev eth0.3
+
+  # bring up the bridge devices
+  ip link set br0 up
+
+
+bridge
+~~
+
+.. code-block:: sh
+
+  # tag traffic on CPU port
+  ip link add link eth0 name eth0.1 type vlan id 1
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+  ip link set eth0.1 up
+
+  # bring up the slave interfaces
+  ip link set wan up
+  ip link set lan1 up
+  ip link set lan2 up
+
+  # create bridge
+  ip link add name br0 type bridge
+
+  # activate VLAN filtering
+  ip link set dev br0 type bridge vlan_filtering 1
+
+  # add ports to bridge
+  ip link set dev wan master br0
+  ip link set dev lan1 master br0
+  ip link set dev lan2 master br0
+  ip link set eth0.1 master br0
+
+  # configure the bridge
+  ip addr add 192.0.2.129/25 dev br0
+
+  # bring up the bridge
+  ip link set dev br0 up
+
+gateway
+~~~
+
+.. code-block:: sh
+
+  # tag traffic on CPU port
+  ip link add link eth0 name eth0.1 type vlan id 1
+  ip link add link eth0 name eth0.2 type vlan id 2
+
+  # The master interface needs to be brought up before the slave ports.
+  ip link set eth0 up
+  ip link set eth0.1 up
+  ip link set eth0.2 up
+
+  # bring up the slave interfaces
+  ip link set wan up
+  ip link set lan1 up
+  ip link set lan2 up
+
+  # create bridge
+  ip link add 

[RFC PATCH v2 0/2] Document the configuration of b53

2019-07-01 Thread Benedikt Spranger
Hi,

this is the second RFC to document the configuration of a b53 supported
switch.

Thanks for the comments.

Regards
Bene Spranger

v1..v2:
- split out generic parts of the configuration.
- address comments by Andrew Lunn and Florian Fainelli.
- make changes visible to build system

Benedikt Spranger (2):
  Documentation: net: dsa: Describe DSA switch configuration
  Documentation: net: dsa: b53: Describe b53 configuration

 Documentation/networking/dsa/b53.rst  | 174 +++
 .../networking/dsa/configuration.rst  | 292 ++
 Documentation/networking/dsa/index.rst|   2 +
 3 files changed, 468 insertions(+)
 create mode 100644 Documentation/networking/dsa/b53.rst
 create mode 100644 Documentation/networking/dsa/configuration.rst

-- 
2.20.1



Re: [PATCH net-next 8/8] net: mscc: PTP Hardware Clock (PHC) support

2019-07-01 Thread Eric Dumazet



On 7/1/19 8:12 AM, Willem de Bruijn wrote:
> On Mon, Jul 1, 2019 at 6:05 AM Antoine Tenart
>  wrote:
>>
>> This patch adds support for PTP Hardware Clock (PHC) to the Ocelot
>> switch for both PTP 1-step and 2-step modes.
>>
>> Signed-off-by: Antoine Tenart 
> 
>>  void ocelot_deinit(struct ocelot *ocelot)
>>  {
>> +   struct ocelot_port *port;
>> +   struct ocelot_skb *entry;
>> +   struct list_head *pos;
>> +   int i;
>> +
>> destroy_workqueue(ocelot->stats_queue);
>> mutex_destroy(&ocelot->stats_lock);
>> ocelot_ace_deinit();
>> +
>> +   for (i = 0; i < ocelot->num_phys_ports; i++) {
>> +   port = ocelot->ports[i];
>> +
>> +   list_for_each(pos, &port->skbs) {
>> +   entry = list_entry(pos, struct ocelot_skb, head);
>> +
>> +   list_del(pos);
> 
> list_for_each_safe

Also entry->skb seems to be leaked ?

dev_kfree_skb_any(entry->skb) seems to be needed


> 
>> +   kfree(entry);
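Taking both comments together, the teardown loop could look like the userspace sketch below. The list primitives are minimal stand-ins for the kernel API (the kernel's list_for_each_entry_safe does not take an explicit type argument), and free() stands in for dev_kfree_skb_any()/kfree():

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-ins for the kernel's intrusive list API. */
struct list_head { struct list_head *next, *prev; };

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}

void list_del(struct list_head *e)
{
	e->prev->next = e->next; e->next->prev = e->prev;
}

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each_entry_safe(pos, n, head, type, member)      \
	for (pos = list_entry((head)->next, type, member),        \
	     n = list_entry(pos->member.next, type, member);      \
	     &pos->member != (head);                              \
	     pos = n, n = list_entry(n->member.next, type, member))

struct ocelot_skb {
	struct list_head head;
	void *skb; /* stands in for the queued sk_buff */
};

int freed; /* counts released entries, for the test below */

/* The _safe variant allows list_del()+free() inside the walk, and
 * the skb is released before its container entry. */
void port_skbs_teardown(struct list_head *skbs)
{
	struct ocelot_skb *entry, *tmp;

	list_for_each_entry_safe(entry, tmp, skbs, struct ocelot_skb, head) {
		list_del(&entry->head);
		free(entry->skb); /* dev_kfree_skb_any() in the kernel */
		free(entry);
		freed++;
	}
}
```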



Re: [PATCH bpf-next 2/2] selftests/bpf: add verifier tests for wide stores

2019-07-01 Thread Stanislav Fomichev
On 06/30, Yonghong Song wrote:
> 
> 
> On 6/28/19 4:10 PM, Stanislav Fomichev wrote:
> > Make sure that wide stores are allowed at proper (aligned) addresses.
> > Note that user_ip6 is naturally aligned on 8-byte boundary, so
> > correct addresses are user_ip6[0] and user_ip6[2]. msg_src_ip6 is,
> > however, aligned on a 4-byte boundary, so only msg_src_ip6[1]
> > can be wide-stored.
> > 
> > Cc: Andrii Nakryiko 
> > Cc: Yonghong Song 
> > Signed-off-by: Stanislav Fomichev 
> > ---
> >   tools/testing/selftests/bpf/test_verifier.c   | 17 ++--
> >   .../selftests/bpf/verifier/wide_store.c   | 40 +++
> >   2 files changed, 54 insertions(+), 3 deletions(-)
> >   create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> > 
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c 
> > b/tools/testing/selftests/bpf/test_verifier.c
> > index c5514daf8865..b0773291012a 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -105,6 +105,7 @@ struct bpf_test {
> > __u64 data64[TEST_DATA_LEN / 8];
> > };
> > } retvals[MAX_TEST_RUNS];
> > +   enum bpf_attach_type expected_attach_type;
> >   };
> >   
> >   /* Note we want this to be 64 bit aligned so that the end of our array is
> > @@ -850,6 +851,7 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> > int fd_prog, expected_ret, alignment_prevented_execution;
> > int prog_len, prog_type = test->prog_type;
> > struct bpf_insn *prog = test->insns;
> > +   struct bpf_load_program_attr attr;
> > int run_errs, run_successes;
> > int map_fds[MAX_NR_MAPS];
> > const char *expected_err;
> > @@ -881,8 +883,17 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> > pflags |= BPF_F_STRICT_ALIGNMENT;
> > if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
> > pflags |= BPF_F_ANY_ALIGNMENT;
> > -   fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
> > -"GPL", 0, bpf_vlog, sizeof(bpf_vlog), 4);
> > +
> > +   memset(&attr, 0, sizeof(attr));
> > +   attr.prog_type = prog_type;
> > +   attr.expected_attach_type = test->expected_attach_type;
> > +   attr.insns = prog;
> > +   attr.insns_cnt = prog_len;
> > +   attr.license = "GPL";
> > +   attr.log_level = 4;
> > +   attr.prog_flags = pflags;
> > +
> > +   fd_prog = bpf_load_program_xattr(&attr, bpf_vlog, sizeof(bpf_vlog));
> > if (fd_prog < 0 && !bpf_probe_prog_type(prog_type, 0)) {
> > printf("SKIP (unsupported program type %d)\n", prog_type);
> > skips++;
> > @@ -912,7 +923,7 @@ static void do_test_single(struct bpf_test *test, bool 
> > unpriv,
> > printf("FAIL\nUnexpected success to load!\n");
> > goto fail_log;
> > }
> > -   if (!strstr(bpf_vlog, expected_err)) {
> > +   if (!expected_err || !strstr(bpf_vlog, expected_err)) {
> > printf("FAIL\nUnexpected error message!\n\tEXP: 
> > %s\n\tRES: %s\n",
> >   expected_err, bpf_vlog);
> > goto fail_log;
> > diff --git a/tools/testing/selftests/bpf/verifier/wide_store.c 
> > b/tools/testing/selftests/bpf/verifier/wide_store.c
> > new file mode 100644
> > index ..c6385f45b114
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/verifier/wide_store.c
> > @@ -0,0 +1,40 @@
> > +#define BPF_SOCK_ADDR(field, off, res, err) \
> > +{ \
> > +   "wide store to bpf_sock_addr." #field "[" #off "]", \
> > +   .insns = { \
> > +   BPF_MOV64_IMM(BPF_REG_0, 1), \
> > +   BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, \
> > +   offsetof(struct bpf_sock_addr, field[off])), \
> > +   BPF_EXIT_INSN(), \
> > +   }, \
> > +   .result = res, \
> > +   .prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR, \
> > +   .expected_attach_type = BPF_CGROUP_UDP6_SENDMSG, \
> > +   .errstr = err, \
> > +}
> > +
> > +/* user_ip6[0] is u64 aligned */
> > +BPF_SOCK_ADDR(user_ip6, 0, ACCEPT,
> > + NULL),
> > +BPF_SOCK_ADDR(user_ip6, 1, REJECT,
> > + "invalid bpf_context access off=12 size=8"),
> > +BPF_SOCK_ADDR(user_ip6, 2, ACCEPT,
> > + NULL),
> > +BPF_SOCK_ADDR(user_ip6, 3, REJECT,
> > + "invalid bpf_context access off=20 size=8"),
> > +BPF_SOCK_ADDR(user_ip6, 4, REJECT,
> > + "invalid bpf_context access off=24 size=8"),
> 
> With offset 4, we have
> #968/p wide store to bpf_sock_addr.user_ip6[4] OK
> 
> This test case can be removed. user code typically
> won't write bpf_sock_addr.user_ip6[4], and compiler
> typically will give a warning since it is out of
> array bound. Any particular reason you want to
> include this one?
Agreed on both, I'm being overly cautious here. They should
be caught by the outer switch and be rejected because of
other reasons.

> > +
> > +/* msg_src_ip6[0] is _not_ u64 aligned */
> > +BPF_SOCK_ADDR(msg_src

Re: [PATCH bpf-next 1/2] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Stanislav Fomichev
On 06/30, Yonghong Song wrote:
> 
> 
> On 6/28/19 4:10 PM, Stanislav Fomichev wrote:
> > Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided
> > that it can do a single u64 store into user_ip6[2] instead of two
> > separate u32 ones:
> > 
> >   #  17: (18) r2 = 0x100
> >   #  ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
> >   #  19: (7b) *(u64 *)(r1 +16) = r2
> >   #  invalid bpf_context access off=16 size=8
> > 
> >  From the compiler point of view it does look like a correct thing
> > to do, so let's support it on the kernel side.
> > 
> > Credit to Andrii Nakryiko for a proper implementation of
> > bpf_ctx_wide_store_ok.
> > 
> > Cc: Andrii Nakryiko 
> > Cc: Yonghong Song 
> > Fixes: cd17d7770578 ("bpf/tools: sync bpf.h")
> > Reported-by: kernel test robot 
> > Signed-off-by: Stanislav Fomichev 
> 
> The change looks good to me with the following nits:
>1. could you add a cover letter for the patch set?
>   typically if the number of patches is more than one,
>   it would be a good practice with a cover letter.
>   See bpf_devel_QA.rst .
>2. with this change, the comments in uapi bpf.h
>   are not accurate any more.
>  __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
>   * Stored in network byte order. 
> 
>   */
>  __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
>   * Stored in network byte order.
>   */
>   now for stores, aligned 8-byte write is permitted.
>   could you update this as well?
> 
>  From the typical usage pattern, I did not see a need
> for 8-byte read of user_ip6 and msg_src_ip6 yet. So let
> us just deal with write for now.
> 
> With the above two nits,
> Acked-by: Yonghong Song 
Thank you for a review, will follow up with a v2 shortly with both
things addressed!

> > ---
> >   include/linux/filter.h |  6 ++
> >   net/core/filter.c  | 22 ++
> >   2 files changed, 20 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 340f7d648974..3901007e36f1 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -746,6 +746,12 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
> > size_default)
> > return size <= size_default && (size & (size - 1)) == 0;
> >   }
> >   
> > +#define bpf_ctx_wide_store_ok(off, size, type, field)  
> > \
> > +   (size == sizeof(__u64) &&   \
> > +   off >= offsetof(type, field) && \
> > +   off + sizeof(__u64) <= offsetofend(type, field) &&  \
> > +   off % sizeof(__u64) == 0)
> > +
> >   #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
> >   
> >   static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index dc8534be12fc..5d33f2146dab 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6849,6 +6849,16 @@ static bool sock_addr_is_valid_access(int off, int 
> > size,
> > if (!bpf_ctx_narrow_access_ok(off, size, size_default))
> > return false;
> > } else {
> > +   if (bpf_ctx_wide_store_ok(off, size,
> > + struct bpf_sock_addr,
> > + user_ip6))
> > +   return true;
> > +
> > +   if (bpf_ctx_wide_store_ok(off, size,
> > + struct bpf_sock_addr,
> > + msg_src_ip6))
> > +   return true;
> > +
> > if (size != size_default)
> > return false;
> > }
> > @@ -7689,9 +7699,6 @@ static u32 xdp_convert_ctx_access(enum 
> > bpf_access_type type,
> >   /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
> >* SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
> >*
> > - * It doesn't support SIZE argument though since narrow stores are not
> > - * supported for now.
> > - *
> >* In addition it uses Temporary Field TF (member of struct S) as the 3rd
> >* "register" since two registers available in convert_ctx_access are not
> >* enough: we can't override neither SRC, since it contains value to 
> > store, nor
> > @@ -7699,7 +7706,7 @@ static u32 xdp_convert_ctx_access(enum 
> > bpf_access_type type,
> >* instructions. But we need a temporary place to save pointer to nested
> >* structure whose field we want to store to.
> >*/
> > -#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, TF)
> >\
> > +#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, SIZE, OFF, TF)  
> >\

Re: [PATCH bpf-next 1/2] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Stanislav Fomichev
On 07/01, Andrii Nakryiko wrote:
> On Sat, Jun 29, 2019 at 10:53 PM Yonghong Song  wrote:
> >
> >
> >
> > On 6/28/19 4:10 PM, Stanislav Fomichev wrote:
> > > Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided
> > > that it can do a single u64 store into user_ip6[2] instead of two
> > > separate u32 ones:
> > >
> > >   #  17: (18) r2 = 0x100
> > >   #  ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
> > >   #  19: (7b) *(u64 *)(r1 +16) = r2
> > >   #  invalid bpf_context access off=16 size=8
> > >
> > >  From the compiler point of view it does look like a correct thing
> > > to do, so let's support it on the kernel side.
> > >
> > > Credit to Andrii Nakryiko for a proper implementation of
> > > bpf_ctx_wide_store_ok.
> > >
> > > Cc: Andrii Nakryiko 
> > > Cc: Yonghong Song 
> > > Fixes: cd17d7770578 ("bpf/tools: sync bpf.h")
> > > Reported-by: kernel test robot 
> > > Signed-off-by: Stanislav Fomichev 
> >
> > The change looks good to me with the following nits:
> >1. could you add a cover letter for the patch set?
> >   typically if the number of patches is more than one,
> >   it would be a good practice with a cover letter.
> >   See bpf_devel_QA.rst .
> >2. with this change, the comments in uapi bpf.h
> >   are not accurate any more.
> >  __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
> >   * Stored in network byte order.
> >
> >   */
> >  __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
> >   * Stored in network byte order.
> >   */
> >   now for stores, aligned 8-byte write is permitted.
> >   could you update this as well?
> >
> >  From the typical usage pattern, I did not see a need
> > for 8-byte read of user_ip6 and msg_src_ip6 yet. So let
> > us just deal with write for now.
> 
> But I guess it's still possible for clang to optimize two consecutive
> 4-byte reads into single 8-byte read in some circumstances? If that's
> the case, maybe it's a good idea to have corresponding read checks as
> well?
I guess clang can do those kinds of optimizations. I can put it on my
todo and address later (or when we actually see it out in the wild).

> But overall this looks good to me:
> 
> Acked-by: Andrii Nakryiko 
Thanks for a review!

> >
> > With the above two nits,
> > Acked-by: Yonghong Song 
> >
> > > ---
> > >   include/linux/filter.h |  6 ++
> > >   net/core/filter.c  | 22 ++
> > >   2 files changed, 20 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 340f7d648974..3901007e36f1 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -746,6 +746,12 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
> > > size_default)
> > >   return size <= size_default && (size & (size - 1)) == 0;
> > >   }
> > >
> > > +#define bpf_ctx_wide_store_ok(off, size, type, field)
> > > \
> > > + (size == sizeof(__u64) &&   \
> > > + off >= offsetof(type, field) && \
> > > + off + sizeof(__u64) <= offsetofend(type, field) &&  \
> > > + off % sizeof(__u64) == 0)
> > > +
> > >   #define bpf_classic_proglen(fprog) (fprog->len * 
> > > sizeof(fprog->filter[0]))
> > >
> > >   static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index dc8534be12fc..5d33f2146dab 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -6849,6 +6849,16 @@ static bool sock_addr_is_valid_access(int off, int 
> > > size,
> > >   if (!bpf_ctx_narrow_access_ok(off, size, 
> > > size_default))
> > >   return false;
> > >   } else {
> > > + if (bpf_ctx_wide_store_ok(off, size,
> > > +   struct bpf_sock_addr,
> > > +   user_ip6))
> > > + return true;
> > > +
> > > + if (bpf_ctx_wide_store_ok(off, size,
> > > +   struct bpf_sock_addr,
> > > +   msg_src_ip6))
> > > + return true;
> > > +
> > >   if (size != size_default)
> > >   return false;
> > >   }
> > > @@ -7689,9 +7699,6 @@ static u32 xdp_convert_ctx_access(enum 
> > > bpf_access_type type,
> > >   /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
> > >* SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
> > >*
> > > - * It doesn't support SIZE argument though since narrow stores are not
> > > - * supported for now.
> > 

Re: [PATCH v4 bpf-next 0/9] libbpf: add bpf_link and tracing attach APIs

2019-07-01 Thread Stanislav Fomichev
On 06/28, Andrii Nakryiko wrote:
> This patchset adds the following APIs to allow attaching BPF programs to
> tracing entities:
> - bpf_program__attach_perf_event for attaching to any opened perf event FD,
>   allowing users full control;
> - bpf_program__attach_kprobe for attaching to kernel probes (both entry and
>   return probes);
> - bpf_program__attach_uprobe for attaching to user probes (both entry/return);
> - bpf_program__attach_tracepoint for attaching to kernel tracepoints;
> - bpf_program__attach_raw_tracepoint for attaching to raw kernel tracepoint
>   (wrapper around bpf_raw_tracepoint_open);
> 
> This set of APIs makes libbpf more useful for tracing applications.
> 
> All attach APIs return abstract struct bpf_link that encapsulates logic of
> detaching BPF program. See patch #2 for details. bpf_assoc was considered as
> an alternative name for this opaque "handle", but bpf_link seems to be
> appropriate semantically and is nice and short.
> 
> Pre-patch #1 makes internal libbpf_strerror_r helper function work w/ negative
> error codes, lifting the burden off callers to keep track of error sign.
> Patch #2 adds bpf_link abstraction.
> Patch #3 adds attach_perf_event, which is the base for all other APIs.
> Patch #4 adds kprobe/uprobe APIs.
> Patch #5 adds tracepoint API.
> Patch #6 adds raw_tracepoint API.
> Patch #7 converts one existing test to use attach_perf_event.
> Patch #8 adds new kprobe/uprobe tests.
> Patch #9 converts some selftests currently using tracepoint to new APIs.
> 
> v3->v4:
> - proper errno handling (Stanislav);
> - bpf_fd -> prog_fd (Stanislav);
> - switch to fprintf (Song);
Reviewed-by: Stanislav Fomichev 

Thanks!

> v2->v3:
> - added bpf_link concept (Daniel);
> - didn't add generic bpf_link__attach_program for reasons described in [0];
> - dropped Stanislav's Reviewed-by from patches #2-#6, in case he doesn't like
>   the change;
> v1->v2:
> - preserve errno before close() call (Stanislav);
> - use libbpf_perf_event_disable_and_close in selftest (Stanislav);
> - remove unnecessary memset (Stanislav);
> 
> [0] 
> https://lore.kernel.org/bpf/caef4bzz7em5ep2eazn7t2yb5qgvriwas+epelr1g01ttx-6...@mail.gmail.com/
> 
> Andrii Nakryiko (9):
>   libbpf: make libbpf_strerror_r agnostic to sign of error
>   libbpf: introduce concept of bpf_link
>   libbpf: add ability to attach/detach BPF program to perf event
>   libbpf: add kprobe/uprobe attach API
>   libbpf: add tracepoint attach API
>   libbpf: add raw tracepoint attach API
>   selftests/bpf: switch test to new attach_perf_event API
>   selftests/bpf: add kprobe/uprobe selftests
>   selftests/bpf: convert existing tracepoint tests to new APIs
> 
>  tools/lib/bpf/libbpf.c| 359 ++
>  tools/lib/bpf/libbpf.h|  21 +
>  tools/lib/bpf/libbpf.map  |   8 +-
>  tools/lib/bpf/str_error.c |   2 +-
>  .../selftests/bpf/prog_tests/attach_probe.c   | 155 
>  .../bpf/prog_tests/stacktrace_build_id.c  |  50 +--
>  .../bpf/prog_tests/stacktrace_build_id_nmi.c  |  31 +-
>  .../selftests/bpf/prog_tests/stacktrace_map.c |  43 +--
>  .../bpf/prog_tests/stacktrace_map_raw_tp.c|  15 +-
>  .../selftests/bpf/progs/test_attach_probe.c   |  55 +++
>  10 files changed, 644 insertions(+), 95 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_probe.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_attach_probe.c
> 
> -- 
> 2.17.1
> 


Re: [PATCH net] Documentation/networking: fix default_ttl typo in mpls-sysctl

2019-07-01 Thread David Ahern
On 7/1/19 2:45 AM, Hangbin Liu wrote:
> default_ttl should be integer instead of bool
> 
> Reported-by: Ying Xu 
> Fixes: a59166e47086 ("mpls: allow TTL propagation from IP packets to be 
> configured")
> Signed-off-by: Hangbin Liu 
> ---
>  Documentation/networking/mpls-sysctl.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
Reviewed-by: David Ahern 



[PATCH bpf-next v2 3/3] selftests/bpf: add verifier tests for wide stores

2019-07-01 Thread Stanislav Fomichev
Make sure that wide stores are allowed at proper (aligned) addresses.
Note that user_ip6 is naturally aligned on 8-byte boundary, so
correct addresses are user_ip6[0] and user_ip6[2]. msg_src_ip6 is,
however, aligned on a 4-byte bondary, so only msg_src_ip6[1]
can be wide-stored.

Cc: Andrii Nakryiko 
Cc: Yonghong Song 
Signed-off-by: Stanislav Fomichev 
---
 tools/testing/selftests/bpf/test_verifier.c   | 17 +++--
 .../selftests/bpf/verifier/wide_store.c   | 36 +++
 2 files changed, 50 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c5514daf8865..b0773291012a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -105,6 +105,7 @@ struct bpf_test {
__u64 data64[TEST_DATA_LEN / 8];
};
} retvals[MAX_TEST_RUNS];
+   enum bpf_attach_type expected_attach_type;
 };
 
 /* Note we want this to be 64 bit aligned so that the end of our array is
@@ -850,6 +851,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
int fd_prog, expected_ret, alignment_prevented_execution;
int prog_len, prog_type = test->prog_type;
struct bpf_insn *prog = test->insns;
+   struct bpf_load_program_attr attr;
int run_errs, run_successes;
int map_fds[MAX_NR_MAPS];
const char *expected_err;
@@ -881,8 +883,17 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
pflags |= BPF_F_STRICT_ALIGNMENT;
if (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)
pflags |= BPF_F_ANY_ALIGNMENT;
-   fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
-"GPL", 0, bpf_vlog, sizeof(bpf_vlog), 4);
+
+   memset(&attr, 0, sizeof(attr));
+   attr.prog_type = prog_type;
+   attr.expected_attach_type = test->expected_attach_type;
+   attr.insns = prog;
+   attr.insns_cnt = prog_len;
+   attr.license = "GPL";
+   attr.log_level = 4;
+   attr.prog_flags = pflags;
+
+   fd_prog = bpf_load_program_xattr(&attr, bpf_vlog, sizeof(bpf_vlog));
if (fd_prog < 0 && !bpf_probe_prog_type(prog_type, 0)) {
printf("SKIP (unsupported program type %d)\n", prog_type);
skips++;
@@ -912,7 +923,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
printf("FAIL\nUnexpected success to load!\n");
goto fail_log;
}
-   if (!strstr(bpf_vlog, expected_err)) {
+   if (!expected_err || !strstr(bpf_vlog, expected_err)) {
printf("FAIL\nUnexpected error message!\n\tEXP: %s\n\tRES: %s\n",
  expected_err, bpf_vlog);
goto fail_log;
diff --git a/tools/testing/selftests/bpf/verifier/wide_store.c b/tools/testing/selftests/bpf/verifier/wide_store.c
new file mode 100644
index ..8fe99602ded4
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/wide_store.c
@@ -0,0 +1,36 @@
+#define BPF_SOCK_ADDR(field, off, res, err) \
+{ \
+   "wide store to bpf_sock_addr." #field "[" #off "]", \
+   .insns = { \
+   BPF_MOV64_IMM(BPF_REG_0, 1), \
+   BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, \
+   offsetof(struct bpf_sock_addr, field[off])), \
+   BPF_EXIT_INSN(), \
+   }, \
+   .result = res, \
+   .prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR, \
+   .expected_attach_type = BPF_CGROUP_UDP6_SENDMSG, \
+   .errstr = err, \
+}
+
+/* user_ip6[0] is u64 aligned */
+BPF_SOCK_ADDR(user_ip6, 0, ACCEPT,
+ NULL),
+BPF_SOCK_ADDR(user_ip6, 1, REJECT,
+ "invalid bpf_context access off=12 size=8"),
+BPF_SOCK_ADDR(user_ip6, 2, ACCEPT,
+ NULL),
+BPF_SOCK_ADDR(user_ip6, 3, REJECT,
+ "invalid bpf_context access off=20 size=8"),
+
+/* msg_src_ip6[0] is _not_ u64 aligned */
+BPF_SOCK_ADDR(msg_src_ip6, 0, REJECT,
+ "invalid bpf_context access off=44 size=8"),
+BPF_SOCK_ADDR(msg_src_ip6, 1, ACCEPT,
+ NULL),
+BPF_SOCK_ADDR(msg_src_ip6, 2, REJECT,
+ "invalid bpf_context access off=52 size=8"),
+BPF_SOCK_ADDR(msg_src_ip6, 3, REJECT,
+ "invalid bpf_context access off=56 size=8"),
+
+#undef BPF_SOCK_ADDR
-- 
2.22.0.410.gd8fdbe21b5-goog



[PATCH bpf-next v2 2/3] bpf: sync bpf.h to tools/

2019-07-01 Thread Stanislav Fomichev
Sync user_ip6 & msg_src_ip6 comments.

Signed-off-by: Stanislav Fomichev 
---
 tools/include/uapi/linux/bpf.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a396b516a2b2..586867fe6102 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3237,7 +3237,7 @@ struct bpf_sock_addr {
__u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
 * Stored in network byte order.
 */
-   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
+   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4,8-byte write.
 * Stored in network byte order.
 */
__u32 user_port;/* Allows 4-byte read and write.
@@ -3249,7 +3249,7 @@ struct bpf_sock_addr {
__u32 msg_src_ip4;  /* Allows 1,2,4-byte read an 4-byte write.
 * Stored in network byte order.
 */
-   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
+   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4,8-byte write.
 * Stored in network byte order.
 */
__bpf_md_ptr(struct bpf_sock *, sk);
-- 
2.22.0.410.gd8fdbe21b5-goog



[PATCH bpf-next v2 0/3] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Stanislav Fomichev
Clang can generate 8-byte stores for user_ip6 & msg_src_ip6,
let's support that on the verifier side.

v2:
* Add simple cover letter (Yonghong Song)
* Update comments (Yonghong Song)
* Remove [4] selftests (Yonghong Song)

Stanislav Fomichev (3):
  bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr
  bpf: sync bpf.h to tools/
  selftests/bpf: add verifier tests for wide stores

 include/linux/filter.h|  6 
 include/uapi/linux/bpf.h  |  4 +--
 net/core/filter.c | 22 +++-
 tools/include/uapi/linux/bpf.h|  4 +--
 tools/testing/selftests/bpf/test_verifier.c   | 17 +++--
 .../selftests/bpf/verifier/wide_store.c   | 36 +++
 6 files changed, 74 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c

-- 
2.22.0.410.gd8fdbe21b5-goog


[PATCH bpf-next v2 1/3] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Stanislav Fomichev
Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided
that it can do a single u64 store into user_ip6[2] instead of two
separate u32 ones:

 #  17: (18) r2 = 0x100
 #  ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
 #  19: (7b) *(u64 *)(r1 +16) = r2
 #  invalid bpf_context access off=16 size=8

From the compiler point of view it does look like a correct thing
to do, so let's support it on the kernel side.

Credit to Andrii Nakryiko for a proper implementation of
bpf_ctx_wide_store_ok.

Cc: Andrii Nakryiko 
Cc: Yonghong Song 
Fixes: cd17d7770578 ("bpf/tools: sync bpf.h")
Reported-by: kernel test robot 
Acked-by: Yonghong Song 
Acked-by: Andrii Nakryiko 
Signed-off-by: Stanislav Fomichev 
---
 include/linux/filter.h   |  6 ++
 include/uapi/linux/bpf.h |  4 ++--
 net/core/filter.c| 22 ++
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 340f7d648974..3901007e36f1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -746,6 +746,12 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
size_default)
return size <= size_default && (size & (size - 1)) == 0;
 }
 
+#define bpf_ctx_wide_store_ok(off, size, type, field)  \
+   (size == sizeof(__u64) &&   \
+   off >= offsetof(type, field) && \
+   off + sizeof(__u64) <= offsetofend(type, field) &&  \
+   off % sizeof(__u64) == 0)
+
 #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
 
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a396b516a2b2..586867fe6102 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3237,7 +3237,7 @@ struct bpf_sock_addr {
__u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
 * Stored in network byte order.
 */
-   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
+   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4,8-byte write.
 * Stored in network byte order.
 */
__u32 user_port;/* Allows 4-byte read and write.
@@ -3249,7 +3249,7 @@ struct bpf_sock_addr {
__u32 msg_src_ip4;  /* Allows 1,2,4-byte read an 4-byte write.
 * Stored in network byte order.
 */
-   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
+   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4,8-byte write.
 * Stored in network byte order.
 */
__bpf_md_ptr(struct bpf_sock *, sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index dc8534be12fc..5d33f2146dab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6849,6 +6849,16 @@ static bool sock_addr_is_valid_access(int off, int size,
if (!bpf_ctx_narrow_access_ok(off, size, size_default))
return false;
} else {
+   if (bpf_ctx_wide_store_ok(off, size,
+ struct bpf_sock_addr,
+ user_ip6))
+   return true;
+
+   if (bpf_ctx_wide_store_ok(off, size,
+ struct bpf_sock_addr,
+ msg_src_ip6))
+   return true;
+
if (size != size_default)
return false;
}
@@ -7689,9 +7699,6 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
  * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
  *
- * It doesn't support SIZE argument though since narrow stores are not
- * supported for now.
- *
  * In addition it uses Temporary Field TF (member of struct S) as the 3rd
  * "register" since two registers available in convert_ctx_access are not
  * enough: we can't override neither SRC, since it contains value to store, nor
@@ -7699,7 +7706,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
  * instructions. But we need a temporary place to save pointer to nested
  * structure whose field we want to store to.
  */
-#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, TF)   \
+#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, SIZE, OFF, TF) \
do {   \
int tmp_reg = BPF_REG_9;   \
if (s

[PATCH] User mode linux bump maximum MTU tuntap interface

2019-07-01 Thread Алексей
Hello, the parameter ETH_MAX_PACKET is limited to 1500 bytes, which does
not support jumbo frames.

This patch changes ETH_MAX_PACKET to 65535 bytes to support jumbo frames
with the User Mode Linux tuntap driver.


PATCH:

---

diff -ruNp ./src/1/linux-5.1/arch/um/include/shared/net_user.h
./src/linux-5.1/$
--- ./arch/um/include/shared/net_user.h 2019-05-06 00:42:58.0 +
+++ ./arch/um/include/shared/net_user.h 2019-07-01 16:09:20.31597 +
@@ -9,7 +9,7 @@
 #define ETH_ADDR_LEN (6)
 #define ETH_HEADER_ETHERTAP (16)
 #define ETH_HEADER_OTHER (26) /* 14 for ethernet + VLAN + MPLS for
crazy peopl$
-#define ETH_MAX_PACKET (1500)
+#define ETH_MAX_PACKET (65535)
 
 #define UML_NET_VERSION (4)



[PATCH net-next 0/7] net/rds: RDMA fixes

2019-07-01 Thread Gerd Rausch
A number of net/rds fixes necessary to make "rds_rdma.ko"
pass some basic Oracle internal tests.

Gerd Rausch (7):
  net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  net/rds: Get rid of "wait_clean_list_grace" and add locking
  net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after
posting IB_WR_LOCAL_INV
  net/rds: Fix NULL/ERR_PTR inconsistency
  net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been
successful
  net/rds: Keep track of and wait for FRWR segments in use upon shutdown
  net/rds: Initialize ic->i_fastreg_wrs upon allocation

 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   9 +++-
 net/rds/ib_frmr.c | 103 ++
 net/rds/ib_mr.h   |   4 ++
 net/rds/ib_rdma.c |  60 +--
 5 files changed, 128 insertions(+), 49 deletions(-)

-- 
2.18.0



[PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE

2019-07-01 Thread Gerd Rausch
In the context of FRMR (ib_frmr.c):

Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool",
after the memory region has been posted for invalidation via
"rds_ib_post_inv".

At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE",
since the only place where "fr_state" transitions to "FRMR_IS_FREE"
is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet.

So in case we notice that "fr_state != FRMR_IS_FREE" (see below),
we wait for "fr_inv_done" to trigger with a maximum of 10msec.
Then we check again, and only put the memory region onto the drop_list
(via "rds_ib_free_frmr") in case the situation remains unchanged.

This avoids the problem of memory-regions bouncing between "clean_list"
and "drop_list" before they even have a chance to be properly invalidated.

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_frmr.c | 32 +++-
 net/rds/ib_mr.h   |  1 +
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 32ae26ed58a0..9f8aa310c27a 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -75,6 +75,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct 
rds_ib_device *rds_ibdev,
pool->max_items_soft = pool->max_items;
 
frmr->fr_state = FRMR_IS_FREE;
+   init_waitqueue_head(&frmr->fr_inv_done);
return ibmr;
 
 out_no_cigar:
@@ -285,6 +286,7 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
if (frmr->fr_inv) {
frmr->fr_state = FRMR_IS_FREE;
frmr->fr_inv = false;
+   wake_up(&frmr->fr_inv_done);
}
 
atomic_inc(&ic->i_fastreg_wrs);
@@ -345,8 +347,36 @@ struct rds_ib_mr *rds_ib_reg_frmr(struct rds_ib_device *rds_ibdev,
}
 
do {
-   if (ibmr)
+   if (ibmr) {
+   /* Memory regions make it onto the "clean_list" via
+* "rds_ib_flush_mr_pool", after the memory region has
+* been posted for invalidation via "rds_ib_post_inv".
+*
+* At that point in time, "fr_state" may still be
+* in state "FRMR_IS_INUSE", since the only place where
+* "fr_state" transitions to "FRMR_IS_FREE"
+* is in "rds_ib_mr_cqe_handler", which is
+* triggered by a tasklet.
+*
+* So in case we notice that
+* "fr_state != FRMR_IS_FREE" (see below), * we wait for
+* "fr_inv_done" to trigger with a maximum of 10msec.
+* Then we check again, and only put the memory region
+* onto the drop_list (via "rds_ib_free_frmr")
+* in case the situation remains unchanged.
+*
+* This avoids the problem of memory-regions bouncing
+* between "clean_list" and "drop_list" before they
+* even have a chance to be properly invalidated.
+*/
+   frmr = &ibmr->u.frmr;
+   wait_event_timeout(frmr->fr_inv_done,
+  frmr->fr_state == FRMR_IS_FREE,
+  msecs_to_jiffies(10));
+   if (frmr->fr_state == FRMR_IS_FREE)
+   break;
rds_ib_free_frmr(ibmr, true);
+   }
ibmr = rds_ib_alloc_frmr(rds_ibdev, nents);
if (IS_ERR(ibmr))
return ibmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 5da12c248431..42daccb7b5eb 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -57,6 +57,7 @@ struct rds_ib_frmr {
struct ib_mr*mr;
enum rds_ib_fr_statefr_state;
boolfr_inv;
+   wait_queue_head_t   fr_inv_done;
struct ib_send_wr   fr_wr;
unsigned intdma_npages;
unsigned intsg_byte_len;
-- 
2.18.0




[PATCH net-next 3/7] net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV

2019-07-01 Thread Gerd Rausch
In order to:
1) avoid a silly bouncing between "clean_list" and "drop_list"
   triggered by function "rds_ib_reg_frmr" as it releases frmr
   regions whose state is not "FRMR_IS_FREE" right away.

2) prevent an invalid access error in a race from a pending
   "IB_WR_LOCAL_INV" operation with a teardown ("dma_unmap_sg", "put_page")
   and de-registration ("ib_dereg_mr") of the corresponding
   memory region.

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_frmr.c | 89 ++-
 net/rds/ib_mr.h   |  2 ++
 2 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 9f8aa310c27a..3c953034dca3 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,6 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct 
rds_ib_device *rds_ibdev,
 
frmr->fr_state = FRMR_IS_FREE;
init_waitqueue_head(&frmr->fr_inv_done);
+   init_waitqueue_head(&frmr->fr_reg_done);
return ibmr;
 
 out_no_cigar:
@@ -124,6 +125,7 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 */
ib_update_fast_reg_key(frmr->mr, ibmr->remap_count++);
frmr->fr_state = FRMR_IS_INUSE;
+   frmr->fr_reg = true;
 
memset(®_wr, 0, sizeof(reg_wr));
reg_wr.wr.wr_id = (unsigned long)(void *)ibmr;
@@ -144,7 +146,29 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
if (printk_ratelimit())
pr_warn("RDS/IB: %s returned error(%d)\n",
__func__, ret);
+   goto out;
+   }
+
+   if (!frmr->fr_reg)
+   goto out;
+
+   /* Wait for the registration to complete in order to prevent an invalid
+* access error resulting from a race between the memory region already
+* being accessed while registration is still pending.
+*/
+   wait_event_timeout(frmr->fr_reg_done, !frmr->fr_reg,
+  msecs_to_jiffies(100));
+
+   /* Registration did not complete within one second, something's wrong */
+   if (frmr->fr_reg) {
+   pr_warn("RDS/IB: %s registration still incomplete after 100msec\n",
+   __func__);
+   frmr->fr_state = FRMR_IS_STALE;
+   ret = -EBUSY;
}
+
+out:
+
return ret;
 }
 
@@ -262,6 +286,26 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
goto out;
}
+
+   if (frmr->fr_state != FRMR_IS_INUSE)
+   goto out;
+
+   /* Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition in order to
+* 1) avoid a silly bouncing between "clean_list" and "drop_list"
+*triggered by function "rds_ib_reg_frmr" as it is releases frmr
+*regions whose state is not "FRMR_IS_FREE" right away.
+* 2) prevents an invalid access error in a race
+*from a pending "IB_WR_LOCAL_INV" operation
+*with a teardown ("dma_unmap_sg", "put_page")
+*and de-registration ("ib_dereg_mr") of the corresponding
+*memory region.
+*/
+   wait_event_timeout(frmr->fr_inv_done, frmr->fr_state != FRMR_IS_INUSE,
+  msecs_to_jiffies(50));
+
+   if (frmr->fr_state == FRMR_IS_INUSE)
+   ret = -EBUSY;
+
 out:
return ret;
 }
@@ -289,6 +333,11 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
wake_up(&frmr->fr_inv_done);
}
 
+   if (frmr->fr_reg) {
+   frmr->fr_reg = false;
+   wake_up(&frmr->fr_reg_done);
+   }
+
atomic_inc(&ic->i_fastreg_wrs);
 }
 
@@ -297,14 +346,18 @@ void rds_ib_unreg_frmr(struct list_head *list, unsigned int *nfreed,
 {
struct rds_ib_mr *ibmr, *next;
struct rds_ib_frmr *frmr;
-   int ret = 0;
+   int ret = 0, ret2;
unsigned int freed = *nfreed;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
list_for_each_entry(ibmr, list, unmap_list) {
-   if (ibmr->sg_dma_len)
-   ret |= rds_ib_post_inv(ibmr);
+   if (ibmr->sg_dma_len) {
+   ret2 = rds_ib_post_inv(ibmr);
+   if (ret2 && !ret)
+   ret = ret2;
+   }
}
+
if (ret)
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, ret);
 
@@ -347,36 +400,8 @@ struct rds_ib_mr *rds_ib_reg_frmr(struct rds_ib_device *rds_ibdev,
}
 
do {
-   if (ibmr) {
-   /* Memory regions make it onto the "clean_list" via
-* "rds_ib_flush_mr_pool", after the memory region has
-* been posted for invalidation via "rds_ib_post_inv".
-*
-* At that point in time, "fr_state" ma

[PATCH net-next 4/7] net/rds: Fix NULL/ERR_PTR inconsistency

2019-07-01 Thread Gerd Rausch
Make function "rds_ib_try_reuse_ibmr" return NULL in case a
memory region could not be allocated, since callers
simply check whether the return value is non-NULL.
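
As a hedged userspace sketch (the ERR_PTR/IS_ERR helpers below are simplified
re-implementations for illustration, not the kernel's <linux/err.h>), this is
why returning ERR_PTR from a function whose callers only test for NULL goes
wrong:

```c
/* Sketch: a NULL-checking caller treats ERR_PTR(-EAGAIN) as a valid
 * pointer -- the inconsistency this patch removes.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define EAGAIN    11

static inline void *ERR_PTR(intptr_t error)
{
	return (void *)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int pool_depleted = 1;

/* Old behaviour: returns ERR_PTR(-EAGAIN) when the pool is depleted. */
static void *try_reuse_old(void)
{
	if (pool_depleted)
		return ERR_PTR(-EAGAIN);
	return NULL; /* no reusable MR found */
}

/* Fixed behaviour: consistently returns NULL when nothing is available. */
static void *try_reuse_fixed(void)
{
	return NULL;
}

/* A caller that only checks for NULL, as rds_ib_try_reuse_ibmr's do. */
static int caller_thinks_valid(void *(*try_reuse)(void))
{
	void *mr = try_reuse();

	return mr != NULL;
}
```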

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_rdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 6b047e63a769..c8c1e3ae8d84 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -450,7 +450,7 @@ struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct 
rds_ib_mr_pool *pool)
rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
else
rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
+   break;
}
 
/* We do have some empty MRs. Flush them out. */
@@ -464,7 +464,7 @@ struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct 
rds_ib_mr_pool *pool)
return ibmr;
}
 
-   return ibmr;
+   return NULL;
 }
 
 static void rds_ib_mr_pool_flush_worker(struct work_struct *work)
-- 
2.18.0




[PATCH net-next 2/7] net/rds: Get rid of "wait_clean_list_grace" and add locking

2019-07-01 Thread Gerd Rausch
Waiting for activity on the "clean_list" to quiesce is no substitute
for proper locking.

We can have multiple threads competing for "llist_del_first"
via "rds_ib_reuse_mr", and a single thread competing
for "llist_del_all" and "llist_del_first" via "rds_ib_flush_mr_pool".

Since "llist_del_first" depends on "list->first->next" not changing
in the midst of the operation, simply waiting for all current calls
to "rds_ib_reuse_mr" to quiesce across all CPUs is woefully inadequate:

By the time "wait_clean_list_grace" is done iterating over all CPUs to see
that there is no concurrent caller to "rds_ib_reuse_mr", a new caller may
have just shown up on the first CPU.

Furthermore, the llist documentation explicitly calls out the need for locking:
 * Cases where locking is needed:
 * If we have multiple consumers with llist_del_first used in one consumer,
 * and llist_del_first or llist_del_all used in other consumers,
 * then a lock is needed.

Also, while at it, drop the unused "pool" parameter
from "list_to_llist_nodes".
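
The rule this patch applies can be sketched in userspace (a hedged analogue
using a C11 atomic_flag spinlock and a plain singly-linked list; names such
as pool_reuse/pool_flush_all mirror the RDS functions but are illustrative,
not kernel code): both the single-pop path and the pop-all path take the
same lock, so "first->next" cannot change mid-removal.

```c
/* Sketch of the clean_lock scheme: every consumer that removes entries
 * from the shared list does so under one spinlock.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

struct pool {
	struct node *clean_list;
	atomic_flag clean_lock; /* plays the role of pool->clean_lock */
};

static void pool_lock(struct pool *p)
{
	while (atomic_flag_test_and_set_explicit(&p->clean_lock,
						 memory_order_acquire))
		; /* spin */
}

static void pool_unlock(struct pool *p)
{
	atomic_flag_clear_explicit(&p->clean_lock, memory_order_release);
}

/* Analogue of rds_ib_reuse_mr: pop one entry ("llist_del_first"). */
static struct node *pool_reuse(struct pool *p)
{
	struct node *n;

	pool_lock(p);
	n = p->clean_list;
	if (n)
		p->clean_list = n->next;
	pool_unlock(p);
	return n;
}

/* Analogue of the "llist_del_all" path in rds_ib_flush_mr_pool. */
static struct node *pool_flush_all(struct pool *p)
{
	struct node *all;

	pool_lock(p);
	all = p->clean_list;
	p->clean_list = NULL;
	pool_unlock(p);
	return all;
}

static void pool_push(struct pool *p, struct node *n)
{
	pool_lock(p);
	n->next = p->clean_list;
	p->clean_list = n;
	pool_unlock(p);
}
```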

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_mr.h   |  1 +
 net/rds/ib_rdma.c | 56 +++
 2 files changed, 19 insertions(+), 38 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 42daccb7b5eb..ab26c20ed66f 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -98,6 +98,7 @@ struct rds_ib_mr_pool {
struct llist_head   free_list;  /* unused MRs */
struct llist_head   clean_list; /* unused & unmapped MRs */
wait_queue_head_t   flush_wait;
+   spinlock_t  clean_lock; /* "clean_list" concurrency */
 
atomic_tfree_pinned;/* memory pinned by free MRs */
unsigned long   max_items;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0b347f46b2f4..6b047e63a769 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -40,9 +40,6 @@
 
 struct workqueue_struct *rds_ib_mr_wq;
 
-static DEFINE_PER_CPU(unsigned long, clean_list_grace);
-#define CLEAN_LIST_BUSY_BIT 0
-
 static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr)
 {
struct rds_ib_device *rds_ibdev;
@@ -195,12 +192,11 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
 {
struct rds_ib_mr *ibmr = NULL;
struct llist_node *ret;
-   unsigned long *flag;
+   unsigned long flags;
 
-   preempt_disable();
-   flag = this_cpu_ptr(&clean_list_grace);
-   set_bit(CLEAN_LIST_BUSY_BIT, flag);
+   spin_lock_irqsave(&pool->clean_lock, flags);
ret = llist_del_first(&pool->clean_list);
+   spin_unlock_irqrestore(&pool->clean_lock, flags);
if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
if (pool->pool_type == RDS_IB_MR_8K_POOL)
@@ -209,23 +205,9 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
}
 
-   clear_bit(CLEAN_LIST_BUSY_BIT, flag);
-   preempt_enable();
return ibmr;
 }
 
-static inline void wait_clean_list_grace(void)
-{
-   int cpu;
-   unsigned long *flag;
-
-   for_each_online_cpu(cpu) {
-   flag = &per_cpu(clean_list_grace, cpu);
-   while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
-   cpu_relax();
-   }
-}
-
 void rds_ib_sync_mr(void *trans_private, int direction)
 {
struct rds_ib_mr *ibmr = trans_private;
@@ -324,8 +306,7 @@ static unsigned int llist_append_to_list(struct llist_head 
*llist,
  * of clusters.  Each cluster has linked llist nodes of
  * MR_CLUSTER_SIZE mrs that are ready for reuse.
  */
-static void list_to_llist_nodes(struct rds_ib_mr_pool *pool,
-   struct list_head *list,
+static void list_to_llist_nodes(struct list_head *list,
struct llist_node **nodes_head,
struct llist_node **nodes_tail)
 {
@@ -402,8 +383,13 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 */
dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
-   if (free_all)
+   if (free_all) {
+   unsigned long flags;
+
+   spin_lock_irqsave(&pool->clean_lock, flags);
llist_append_to_list(&pool->clean_list, &unmap_list);
+   spin_unlock_irqrestore(&pool->clean_lock, flags);
+   }
 
free_goal = rds_ib_flush_goal(pool, free_all);
 
@@ -416,27 +402,20 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
rds_ib_unreg_fmr(&unmap_list, &nfreed, &unpinned, free_goal);
 
if (!list_empty(&unmap_list)) {
-   /* we have to make sure that none of the things we're about
-* to put on the clean list would race with other cpus trying
-* to pull items off.  The llist would explode if we managed t

[PATCH net-next 7/7] net/rds: Initialize ic->i_fastreg_wrs upon allocation

2019-07-01 Thread Gerd Rausch
Otherwise, if an IB connection is torn down before "rds_ib_setup_qp"
is called, the value of "ic->i_fastreg_wrs" is still at zero
(as it wasn't initialized by "rds_ib_setup_qp").
Consequently "rds_ib_conn_path_shutdown" will spin forever,
waiting for it to go back to "RDS_IB_DEFAULT_FR_WR",
which of course will never happen as there are no
outstanding work requests.

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 1b6fd6c8b12b..4de0214da63c 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -527,7 +527,6 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
-   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -1139,6 +1138,7 @@ int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t 
gfp)
spin_lock_init(&ic->i_ack_lock);
 #endif
atomic_set(&ic->i_signaled_sends, 0);
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * rds_ib_conn_shutdown() waits for these to be emptied so they
-- 
2.18.0



[PATCH net-next 5/7] net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been successful

2019-07-01 Thread Gerd Rausch
Fix a bug where fr_state first goes to FRMR_IS_STALE, because of a failure
of operation IB_WR_LOCAL_INV, but then gets set back to "FRMR_IS_FREE"
unconditionally, even though the operation failed.

Signed-off-by: Gerd Rausch 
---
 net/rds/ib_frmr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 3c953034dca3..a5d8f4128515 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -328,7 +328,8 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
}
 
if (frmr->fr_inv) {
-   frmr->fr_state = FRMR_IS_FREE;
+   if (frmr->fr_state == FRMR_IS_INUSE)
+   frmr->fr_state = FRMR_IS_FREE;
frmr->fr_inv = false;
wake_up(&frmr->fr_inv_done);
}
-- 
2.18.0




[PATCH net-next 6/7] net/rds: Keep track of and wait for FRWR segments in use upon shutdown

2019-07-01 Thread Gerd Rausch
Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put
the FRMR memory segments on the "drop_list" or "free_list",
and it is the job of "rds_ib_flush_mr_pool" to reap those entries
by ultimately issuing an "IB_WR_LOCAL_INV" work-request,
we need to trigger and then wait for all those memory segments
attached to a particular connection to be fully released before
we can move on to release the QP, CQ, etc.

So we make "rds_ib_conn_path_shutdown" wait for one more
atomic_t called "i_fastreg_inuse_count" that keeps track of how
many FRWR memory segments are out there marked "FRMR_IS_INUSE"
(and also wake_up rds_ib_ring_empty_wait, as they go away).
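
The core of the mechanism can be sketched with C11 atomics (a hedged
userspace analogue of the patch's cmpxchg-based rds_transition_frwr_state,
not the kernel code): only the caller that wins the compare-and-swap out of
FRMR_IS_INUSE decrements the in-use counter, so racing completion paths
cannot underflow it.

```c
/* Sketch: compare-and-swap state transition that pairs exactly one
 * counter decrement with each FRMR_IS_INUSE exit.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum fr_state { FRMR_IS_INUSE, FRMR_IS_FREE, FRMR_IS_STALE };

struct conn {
	atomic_int inuse_count; /* stand-in for ic->i_fastreg_inuse_count */
};

struct frmr {
	_Atomic enum fr_state state;
	struct conn *conn;
};

/* Returns true only for the caller that performed the transition. */
static bool transition(struct frmr *f, enum fr_state old, enum fr_state new)
{
	enum fr_state expected = old;

	if (!atomic_compare_exchange_strong(&f->state, &expected, new))
		return false;
	if (old == FRMR_IS_INUSE)
		atomic_fetch_sub(&f->conn->inuse_count, 1);
	return true;
}
```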

Signed-off-by: Gerd Rausch 
---
 net/rds/ib.h  |  1 +
 net/rds/ib_cm.c   |  7 +++
 net/rds/ib_frmr.c | 45 ++---
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 66c03c7665b2..303c6ee8bdb7 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -156,6 +156,7 @@ struct rds_ib_connection {
 
/* To control the number of wrs from fastreg */
atomic_ti_fastreg_wrs;
+   atomic_ti_fastreg_inuse_count;
 
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8891822eba4f..1b6fd6c8b12b 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -40,6 +40,7 @@
 #include "rds_single_path.h"
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
 /*
  * Set the selected protocol version
@@ -993,6 +994,11 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
ic->i_cm_id, err);
}
 
+   /* kick off "flush_worker" for all pools in order to reap
+* all FRMR registrations that are still marked "FRMR_IS_INUSE"
+*/
+   rds_ib_flush_mrs();
+
/*
 * We want to wait for tx and rx completion to finish
 * before we tear down the connection, but we have to be
@@ -1005,6 +1011,7 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
   (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_inuse_count) == 0) &&
   (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index a5d8f4128515..19c4cafb6952 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -32,6 +32,24 @@
 
 #include "ib_mr.h"
 
+static inline void
+rds_transition_frwr_state(struct rds_ib_mr *ibmr,
+ enum rds_ib_fr_state old_state,
+ enum rds_ib_fr_state new_state)
+{
+   if (cmpxchg(&ibmr->u.frmr.fr_state,
+   old_state, new_state) == old_state &&
+   old_state == FRMR_IS_INUSE) {
+   /* enforce order of ibmr->u.frmr.fr_state update
+* before decrementing i_fastreg_inuse_count
+*/
+   smp_mb__before_atomic();
+   atomic_dec(&ibmr->ic->i_fastreg_inuse_count);
+   if (waitqueue_active(&rds_ib_ring_empty_wait))
+   wake_up(&rds_ib_ring_empty_wait);
+   }
+}
+
 static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
   int npages)
 {
@@ -118,13 +136,18 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
if (unlikely(ret != ibmr->sg_len))
return ret < 0 ? ret : -EINVAL;
 
+   if (cmpxchg(&frmr->fr_state,
+   FRMR_IS_FREE, FRMR_IS_INUSE) != FRMR_IS_FREE)
+   return -EBUSY;
+
+   atomic_inc(&ibmr->ic->i_fastreg_inuse_count);
+
/* Perform a WR for the fast_reg_mr. Each individual page
 * in the sg list is added to the fast reg page list and placed
 * inside the fast_reg_mr WR.  The key used is a rolling 8bit
 * counter, which should guarantee uniqueness.
 */
ib_update_fast_reg_key(frmr->mr, ibmr->remap_count++);
-   frmr->fr_state = FRMR_IS_INUSE;
frmr->fr_reg = true;
 
memset(®_wr, 0, sizeof(reg_wr));
@@ -141,7 +164,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
ret = ib_post_send(ibmr->ic->i_cm_id->qp, ®_wr.wr, NULL);
if (unlikely(ret)) {
/* Failure here can be because of -ENOMEM as well */
-   frmr->fr_state = FRMR_IS_STALE;
+   rds_transition_frwr_state(ibmr, FRMR_IS_INUSE, FRMR_IS_STALE);
+
atomic_inc(&ibmr->ic->i_fastreg_wrs);
if (printk_ratelimit())
pr_warn("RDS/IB: %s returned error(%d)\n",
@

Re: [PATCH bpf-next v2 1/3] bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr

2019-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2019 at 9:51 AM Stanislav Fomichev  wrote:
>
> Since commit cd17d7770578 ("bpf/tools: sync bpf.h") clang decided
> that it can do a single u64 store into user_ip6[2] instead of two
> separate u32 ones:
>
>  #  17: (18) r2 = 0x100
>  #  ; ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
>  #  19: (7b) *(u64 *)(r1 +16) = r2
>  #  invalid bpf_context access off=16 size=8
>
> From the compiler point of view it does look like a correct thing
> to do, so let's support it on the kernel side.
>
> Credit to Andrii Nakryiko for a proper implementation of
> bpf_ctx_wide_store_ok.
>
> Cc: Andrii Nakryiko 
> Cc: Yonghong Song 
> Fixes: cd17d7770578 ("bpf/tools: sync bpf.h")
> Reported-by: kernel test robot 
> Acked-by: Yonghong Song 
> Acked-by: Andrii Nakryiko 
> Signed-off-by: Stanislav Fomichev 
> ---
>  include/linux/filter.h   |  6 ++
>  include/uapi/linux/bpf.h |  4 ++--
>  net/core/filter.c| 22 ++
>  3 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 340f7d648974..3901007e36f1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -746,6 +746,12 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
> size_default)
> return size <= size_default && (size & (size - 1)) == 0;
>  }
>
> +#define bpf_ctx_wide_store_ok(off, size, type, field)  \
> +   (size == sizeof(__u64) &&   \
> +   off >= offsetof(type, field) && \
> +   off + sizeof(__u64) <= offsetofend(type, field) &&  \
> +   off % sizeof(__u64) == 0)
> +
>  #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
>
>  static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a396b516a2b2..586867fe6102 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3237,7 +3237,7 @@ struct bpf_sock_addr {
> __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
>  * Stored in network byte order.
>  */
> -   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
> +   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4,8-byte write.

typo: an -> and

>  * Stored in network byte order.
>  */
> __u32 user_port;/* Allows 4-byte read and write.
> @@ -3249,7 +3249,7 @@ struct bpf_sock_addr {
> __u32 msg_src_ip4;  /* Allows 1,2,4-byte read an 4-byte write.

same

>  * Stored in network byte order.
>  */
> -   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4-byte write.
> +   __u32 msg_src_ip6[4];   /* Allows 1,2,4-byte read an 4,8-byte write.

the power of copy/paste! :)

>  * Stored in network byte order.
>  */
> __bpf_md_ptr(struct bpf_sock *, sk);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index dc8534be12fc..5d33f2146dab 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6849,6 +6849,16 @@ static bool sock_addr_is_valid_access(int off, int 
> size,
> if (!bpf_ctx_narrow_access_ok(off, size, 
> size_default))
> return false;
> } else {
> +   if (bpf_ctx_wide_store_ok(off, size,
> + struct bpf_sock_addr,
> + user_ip6))
> +   return true;
> +
> +   if (bpf_ctx_wide_store_ok(off, size,
> + struct bpf_sock_addr,
> + msg_src_ip6))
> +   return true;
> +
> if (size != size_default)
> return false;
> }
> @@ -7689,9 +7699,6 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type 
> type,
>  /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
>   * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
>   *
> - * It doesn't support SIZE argument though since narrow stores are not
> - * supported for now.
> - *
>   * In addition it uses Temporary Field TF (member of struct S) as the 3rd
>   * "register" since two registers available in convert_ctx_access are not
>   * enough: we can't override neither SRC, since it contains value to store, 
> nor
> @@ -7699,7 +7706,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type 
> type,
>   * instructions. But we need a temporary place to save pointer to nested
>   * structure whose field we want to store to.
>   */
> -#define SOCK_ADDR_STORE_NES

[PATCH net-next] tipc: use rcu dereference functions properly

2019-07-01 Thread Xin Long
For the places protected by rcu_read_lock, we change
rcu_dereference_rtnl to rcu_dereference, as there is no need to
check whether the rtnl lock is held.

For the places protected by rtnl_lock, we change rcu_dereference_rtnl
to rtnl_dereference/rcu_dereference_protected, as no extra memory
barriers are needed under rtnl_lock(), which also protects
tn->bearer_list[] and dev->tipc_ptr/b->media_ptr updates.

rcu_dereference_rtnl will only be used in the places where the code
may run under either rcu_read_lock or rtnl_lock.

Signed-off-by: Xin Long 
---
 net/tipc/bearer.c| 14 +++---
 net/tipc/udp_media.c |  8 
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 2bed658..a809c0e 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -62,7 +62,7 @@ static struct tipc_bearer *bearer_get(struct net *net, int 
bearer_id)
 {
struct tipc_net *tn = tipc_net(net);
 
-   return rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
+   return rcu_dereference(tn->bearer_list[bearer_id]);
 }
 
 static void bearer_disable(struct net *net, struct tipc_bearer *b);
@@ -210,7 +210,7 @@ void tipc_bearer_add_dest(struct net *net, u32 bearer_id, 
u32 dest)
struct tipc_bearer *b;
 
rcu_read_lock();
-   b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
+   b = rcu_dereference(tn->bearer_list[bearer_id]);
if (b)
tipc_disc_add_dest(b->disc);
rcu_read_unlock();
@@ -222,7 +222,7 @@ void tipc_bearer_remove_dest(struct net *net, u32 
bearer_id, u32 dest)
struct tipc_bearer *b;
 
rcu_read_lock();
-   b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
+   b = rcu_dereference(tn->bearer_list[bearer_id]);
if (b)
tipc_disc_remove_dest(b->disc);
rcu_read_unlock();
@@ -444,7 +444,7 @@ int tipc_l2_send_msg(struct net *net, struct sk_buff *skb,
struct net_device *dev;
int delta;
 
-   dev = (struct net_device *)rcu_dereference_rtnl(b->media_ptr);
+   dev = (struct net_device *)rcu_dereference(b->media_ptr);
if (!dev)
return 0;
 
@@ -481,7 +481,7 @@ int tipc_bearer_mtu(struct net *net, u32 bearer_id)
struct tipc_bearer *b;
 
rcu_read_lock();
-   b = rcu_dereference_rtnl(tipc_net(net)->bearer_list[bearer_id]);
+   b = rcu_dereference(tipc_net(net)->bearer_list[bearer_id]);
if (b)
mtu = b->mtu;
rcu_read_unlock();
@@ -574,8 +574,8 @@ static int tipc_l2_rcv_msg(struct sk_buff *skb, struct 
net_device *dev,
struct tipc_bearer *b;
 
rcu_read_lock();
-   b = rcu_dereference_rtnl(dev->tipc_ptr) ?:
-   rcu_dereference_rtnl(orig_dev->tipc_ptr);
+   b = rcu_dereference(dev->tipc_ptr) ?:
+   rcu_dereference(orig_dev->tipc_ptr);
if (likely(b && test_bit(0, &b->up) &&
   (skb->pkt_type <= PACKET_MULTICAST))) {
skb_mark_not_on_list(skb);
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index b8962df..62b85db 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -231,7 +231,7 @@ static int tipc_udp_send_msg(struct net *net, struct 
sk_buff *skb,
}
 
skb_set_inner_protocol(skb, htons(ETH_P_TIPC));
-   ub = rcu_dereference_rtnl(b->media_ptr);
+   ub = rcu_dereference(b->media_ptr);
if (!ub) {
err = -ENODEV;
goto out;
@@ -490,7 +490,7 @@ int tipc_udp_nl_dump_remoteip(struct sk_buff *skb, struct 
netlink_callback *cb)
}
}
 
-   ub = rcu_dereference_rtnl(b->media_ptr);
+   ub = rtnl_dereference(b->media_ptr);
if (!ub) {
rtnl_unlock();
return -EINVAL;
@@ -532,7 +532,7 @@ int tipc_udp_nl_add_bearer_data(struct tipc_nl_msg *msg, 
struct tipc_bearer *b)
struct udp_bearer *ub;
struct nlattr *nest;
 
-   ub = rcu_dereference_rtnl(b->media_ptr);
+   ub = rtnl_dereference(b->media_ptr);
if (!ub)
return -ENODEV;
 
@@ -806,7 +806,7 @@ static void tipc_udp_disable(struct tipc_bearer *b)
 {
struct udp_bearer *ub;
 
-   ub = rcu_dereference_rtnl(b->media_ptr);
+   ub = rtnl_dereference(b->media_ptr);
if (!ub) {
pr_err("UDP bearer instance not found\n");
return;
-- 
2.1.0



Re: [PATCH bpf-next v2 3/3] selftests/bpf: add verifier tests for wide stores

2019-07-01 Thread Andrii Nakryiko
On Mon, Jul 1, 2019 at 9:54 AM Stanislav Fomichev  wrote:
>
> Make sure that wide stores are allowed at proper (aligned) addresses.
> Note that user_ip6 is naturally aligned on 8-byte boundary, so
> correct addresses are user_ip6[0] and user_ip6[2]. msg_src_ip6 is,
> however, aligned on a 4-byte bondary, so only msg_src_ip6[1]
> can be wide-stored.
>
> Cc: Andrii Nakryiko 
> Cc: Yonghong Song 
> Signed-off-by: Stanislav Fomichev 
> ---

Acked-by: Andrii Nakryiko 

>  tools/testing/selftests/bpf/test_verifier.c   | 17 +++--
>  .../selftests/bpf/verifier/wide_store.c   | 36 +++
>  2 files changed, 50 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
>




[PATCH net-next] tipc: remove ub->ubsock checks

2019-07-01 Thread Xin Long
Both tipc_udp_enable and tipc_udp_disable are called under rtnl_lock,
so ub->ubsock can never be NULL in tipc_udp_disable and cleanup_bearer;
remove those checks.

Also remove the one in tipc_udp_enable by adding "free" label.

Signed-off-by: Xin Long 
---
 net/tipc/udp_media.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 62b85db..287df687 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -759,7 +759,7 @@ static int tipc_udp_enable(struct net *net, struct 
tipc_bearer *b,
 
err = dst_cache_init(&ub->rcast.dst_cache, GFP_ATOMIC);
if (err)
-   goto err;
+   goto free;
 
/**
 * The bcast media address port is used for all peers and the ip
@@ -771,13 +771,14 @@ static int tipc_udp_enable(struct net *net, struct 
tipc_bearer *b,
else
err = tipc_udp_rcast_add(b, &remote);
if (err)
-   goto err;
+   goto free;
 
return 0;
-err:
+
+free:
dst_cache_destroy(&ub->rcast.dst_cache);
-   if (ub->ubsock)
-   udp_tunnel_sock_release(ub->ubsock);
+   udp_tunnel_sock_release(ub->ubsock);
+err:
kfree(ub);
return err;
 }
@@ -795,8 +796,7 @@ static void cleanup_bearer(struct work_struct *work)
}
 
dst_cache_destroy(&ub->rcast.dst_cache);
-   if (ub->ubsock)
-   udp_tunnel_sock_release(ub->ubsock);
+   udp_tunnel_sock_release(ub->ubsock);
synchronize_net();
kfree(ub);
 }
@@ -811,8 +811,7 @@ static void tipc_udp_disable(struct tipc_bearer *b)
pr_err("UDP bearer instance not found\n");
return;
}
-   if (ub->ubsock)
-   sock_set_flag(ub->ubsock->sk, SOCK_DEAD);
+   sock_set_flag(ub->ubsock->sk, SOCK_DEAD);
RCU_INIT_POINTER(ub->bearer, NULL);
 
/* sock_release need to be done outside of rtnl lock */
-- 
2.1.0



Re: [PATCH v4 bpf-next 2/9] libbpf: introduce concept of bpf_link

2019-07-01 Thread Yonghong Song


On 6/28/19 8:48 PM, Andrii Nakryiko wrote:
> bpf_link is and abstraction of an association of a BPF program and one

"is and" => "is an".

> of many possible BPF attachment points (hooks). This allows to have
> uniform interface for detaching BPF programs regardless of the nature of
> link and how it was created. Details of creation and setting up of
> a specific bpf_link is handled by corresponding attachment methods
> (bpf_program__attach_xxx) added in subsequent commits. Once successfully
> created, bpf_link has to be eventually destroyed with
> bpf_link__destroy(), at which point BPF program is disassociated from
> a hook and all the relevant resources are freed.
> 
> Signed-off-by: Andrii Nakryiko 
> Acked-by: Song Liu 
> ---
>   tools/lib/bpf/libbpf.c   | 17 +
>   tools/lib/bpf/libbpf.h   |  4 
>   tools/lib/bpf/libbpf.map |  3 ++-
>   3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6e6ebef11ba3..455795e6f8af 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -3941,6 +3941,23 @@ int bpf_prog_load_xattr(const struct 
> bpf_prog_load_attr *attr,
>   return 0;
>   }
>   
> +struct bpf_link {
> + int (*destroy)(struct bpf_link *link);
> +};
> +
> +int bpf_link__destroy(struct bpf_link *link)
> +{
> + int err;
> +
> + if (!link)
> + return 0;
> +
> + err = link->destroy(link);
> + free(link);
> +
> + return err;
> +}
> +
>   enum bpf_perf_event_ret
>   bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> page_size,
>  void **copy_mem, size_t *copy_size,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index d639f47e3110..5082a5ebb0c2 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -165,6 +165,10 @@ LIBBPF_API int bpf_program__pin(struct bpf_program 
> *prog, const char *path);
>   LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char 
> *path);
>   LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
>   
> +struct bpf_link;
> +
> +LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
> +
>   struct bpf_insn;
>   
>   /*
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 2c6d835620d2..3cde850fc8da 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -167,10 +167,11 @@ LIBBPF_0.0.3 {
>   
>   LIBBPF_0.0.4 {
>   global:
> + bpf_link__destroy;
> + bpf_object__load_xattr;
>   btf_dump__dump_type;
>   btf_dump__free;
>   btf_dump__new;
>   btf__parse_elf;
> - bpf_object__load_xattr;
>   libbpf_num_possible_cpus;
>   } LIBBPF_0.0.3;
> 


[PATCH net v2] ipv4: don't set IPv6 only flags to IPv4 addresses

2019-07-01 Thread Matteo Croce
Avoid the situation where an IPv6-only flag is applied to an IPv4 address:

# ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute
# ip -4 addr show dev dummy0
2: dummy0:  mtu 1500 qdisc noqueue state 
UNKNOWN group default qlen 1000
inet 192.0.2.1/24 scope global noprefixroute dummy0
   valid_lft forever preferred_lft forever

Or worse, by sending a malicious netlink command:

# ip -4 addr show dev dummy0
2: dummy0:  mtu 1500 qdisc noqueue state 
UNKNOWN group default qlen 1000
inet 192.0.2.1/24 scope global nodad optimistic dadfailed home 
tentative mngtmpaddr noprefixroute stable-privacy dummy0
   valid_lft forever preferred_lft forever
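
The fix itself is a simple mask on insert; a hedged standalone sketch (flag
values mirror the uapi if_addr.h definitions but are reproduced here only
for illustration, and the helper name is invented):

```c
/* Sketch: strip IPv6-only bits from an address's flag word before an
 * IPv4 address is inserted, regardless of what userspace requested.
 * IPv4-valid flags such as IFA_F_NOPREFIXROUTE must survive the mask.
 */
#include <assert.h>

#define IFA_F_NODAD		0x02
#define IFA_F_OPTIMISTIC	0x04
#define IFA_F_DADFAILED		0x08
#define IFA_F_HOMEADDRESS	0x10
#define IFA_F_TENTATIVE		0x40
#define IFA_F_MANAGETEMPADDR	0x100
#define IFA_F_NOPREFIXROUTE	0x200
#define IFA_F_STABLE_PRIVACY	0x800

#define IPV6ONLY_FLAGS \
	(IFA_F_NODAD | IFA_F_OPTIMISTIC | IFA_F_DADFAILED | \
	 IFA_F_HOMEADDRESS | IFA_F_TENTATIVE | \
	 IFA_F_MANAGETEMPADDR | IFA_F_STABLE_PRIVACY)

static unsigned int sanitize_ipv4_ifa_flags(unsigned int flags)
{
	return flags & ~IPV6ONLY_FLAGS;
}
```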

Signed-off-by: Matteo Croce 
---
 net/ipv4/devinet.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index c6bd0f7a020a..c5ebfa199794 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -62,6 +62,11 @@
 #include 
 #include 
 
+#define IPV6ONLY_FLAGS \
+   (IFA_F_NODAD | IFA_F_OPTIMISTIC | IFA_F_DADFAILED | \
+IFA_F_HOMEADDRESS | IFA_F_TENTATIVE | \
+IFA_F_MANAGETEMPADDR | IFA_F_STABLE_PRIVACY)
+
 static struct ipv4_devconf ipv4_devconf = {
.data = {
[IPV4_DEVCONF_ACCEPT_REDIRECTS - 1] = 1,
@@ -468,6 +473,9 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct 
nlmsghdr *nlh,
ifa->ifa_flags &= ~IFA_F_SECONDARY;
last_primary = &in_dev->ifa_list;
 
+   /* Don't set IPv6 only flags to IPv4 addresses */
+   ifa->ifa_flags &= ~IPV6ONLY_FLAGS;
+
for (ifap = &in_dev->ifa_list; (ifa1 = *ifap) != NULL;
 ifap = &ifa1->ifa_next) {
if (!(ifa1->ifa_flags & IFA_F_SECONDARY) &&
-- 
2.21.0



Re: [PATCH v4 bpf-next 3/9] libbpf: add ability to attach/detach BPF program to perf event

2019-07-01 Thread Yonghong Song


On 6/28/19 8:49 PM, Andrii Nakryiko wrote:
> bpf_program__attach_perf_event allows to attach BPF program to existing
> perf event hook, providing most generic and most low-level way to attach BPF
> programs. It returns struct bpf_link, which should be passed to
> bpf_link__destroy to detach and free resources, associated with a link.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
>   tools/lib/bpf/libbpf.c   | 61 
>   tools/lib/bpf/libbpf.h   |  3 ++
>   tools/lib/bpf/libbpf.map |  1 +
>   3 files changed, 65 insertions(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 455795e6f8af..98c155ec3bfa 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -32,6 +32,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>   #include 
>   #include 
>   #include 
> @@ -3958,6 +3959,66 @@ int bpf_link__destroy(struct bpf_link *link)
>   return err;
>   }
>   
> +struct bpf_link_fd {
> + struct bpf_link link; /* has to be at the top of struct */
> + int fd; /* hook FD */
> +};
> +
> +static int bpf_link__destroy_perf_event(struct bpf_link *link)
> +{
> + struct bpf_link_fd *l = (void *)link;
> + int err;
> +
> + if (l->fd < 0)
> + return 0;
> +
> + err = ioctl(l->fd, PERF_EVENT_IOC_DISABLE, 0);
> + if (err)
> + err = -errno;
> +
> + close(l->fd);
> + return err;
> +}
> +
> +struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
> + int pfd)
> +{
> + char errmsg[STRERR_BUFSIZE];
> + struct bpf_link_fd *link;
> + int prog_fd, err;
> +
> + prog_fd = bpf_program__fd(prog);
> + if (prog_fd < 0) {
> + pr_warning("program '%s': can't attach before loaded\n",
> +bpf_program__title(prog, false));
> + return ERR_PTR(-EINVAL);
> + }

should we check validity of pfd here?
If pfd < 0, we just return ERR_PTR(-EINVAL)?
This way, in bpf_link__destroy_perf_event(), we do not need to check
l->fd < 0 since it will always be nonnegative.

> +
> + link = malloc(sizeof(*link));
> + if (!link)
> + return ERR_PTR(-ENOMEM);
> + link->link.destroy = &bpf_link__destroy_perf_event;
> + link->fd = pfd;
> +
> + if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
> + err = -errno;
> + free(link);
> + pr_warning("program '%s': failed to attach to pfd %d: %s\n",
> +bpf_program__title(prog, false), pfd,
> +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> + return ERR_PTR(err);
> + }
> + if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
> + err = -errno;
> + free(link);
> + pr_warning("program '%s': failed to enable pfd %d: %s\n",
> +bpf_program__title(prog, false), pfd,
> +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> + return ERR_PTR(err);
> + }
> + return (struct bpf_link *)link;
> +}
> +
>   enum bpf_perf_event_ret
>   bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> page_size,
>  void **copy_mem, size_t *copy_size,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 5082a5ebb0c2..1bf66c4a9330 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -169,6 +169,9 @@ struct bpf_link;
>   
>   LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
>   
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_perf_event(struct bpf_program *prog, int pfd);
> +
>   struct bpf_insn;
>   
>   /*
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 3cde850fc8da..756f5aa802e9 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -169,6 +169,7 @@ LIBBPF_0.0.4 {
>   global:
>   bpf_link__destroy;
>   bpf_object__load_xattr;
> + bpf_program__attach_perf_event;
>   btf_dump__dump_type;
>   btf_dump__free;
>   btf_dump__new;
> 


Re: [PATCH net-next 0/7] net/rds: RDMA fixes

2019-07-01 Thread santosh . shilimkar

On 7/1/19 9:39 AM, Gerd Rausch wrote:

A number of net/rds fixes necessary to make "rds_rdma.ko"
pass some basic Oracle internal tests.

Gerd Rausch (7):
   net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
   net/rds: Get rid of "wait_clean_list_grace" and add locking
   net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after
 posting IB_WR_LOCAL_INV
   net/rds: Fix NULL/ERR_PTR inconsistency
   net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been
 successful
   net/rds: Keep track of and wait for FRWR segments in use upon shutdown
   net/rds: Initialize ic->i_fastreg_wrs upon allocation


Will apply these on top of earlier few fixes after going through them.
Thanks for posting them out.

Regards,
Santosh


[PATCH net-next 0/5] net: use ICW for sk_proto->{send,recv}msg

2019-07-01 Thread Paolo Abeni
This series extends ICW usage to one of the few remaining fast-path spots
still hitting per-packet retpoline overhead, namely the
sk_proto->{send,recv}msg calls.

The first 3 patches in this series refactor the existing code so that applying
the ICW macros is straightforward: we demux inet_{recv,send}msg in ipv4 and
ipv6 variants so that each of them can easily select the appropriate TCP or UDP
direct call. While at it, a new helper is created to avoid excessive code
duplication, and the current ICWs for inet_{recv,send}msg are adjusted
accordingly.

The last 2 patches really introduce the new ICW use-case, respectively for the
ipv6 and the ipv4 code path.

This gives up to 5% performance improvement under UDP flood, and smaller but
measurable gains for TCP RR workloads.

Paolo Abeni (5):
  inet: factor out inet_send_prepare()
  ipv6: provide and use ipv6 specific version for {recv,send}msg
  net: adjust socket level ICW to cope with ipv6 variant of
{recv,send}msg
  ipv6: use indirect call wrappers for {tcp,udpv6}_{recv,send}msg()
  ipv4: use indirect call wrappers for {tcp,udp}_{recv,send}msg()

 include/net/inet_common.h |  1 +
 include/net/ipv6.h        |  3 +++
 net/ipv4/af_inet.c        | 33 ---
 net/ipv6/af_inet6.c       | 41 +++
 net/socket.c              | 17 ++--
 5 files changed, 69 insertions(+), 26 deletions(-)

-- 
2.20.1



[PATCH net-next 5/5] ipv4: use indirect call wrappers for {tcp,udp}_{recv,send}msg()

2019-07-01 Thread Paolo Abeni
This avoids an indirect call per syscall for common ipv4 transports

Signed-off-by: Paolo Abeni 
---
 net/ipv4/af_inet.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8421e2f5bbb3..9a2f17d0c5f5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -797,6 +797,8 @@ int inet_send_prepare(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_send_prepare);
 
+INDIRECT_CALLABLE_DECLARE(int udp_sendmsg(struct sock *, struct msghdr *,
+ size_t));
 int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 {
struct sock *sk = sock->sk;
@@ -804,7 +806,8 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
if (unlikely(inet_send_prepare(sk)))
return -EAGAIN;
 
-   return sk->sk_prot->sendmsg(sk, msg, size);
+   return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
+  sk, msg, size);
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
@@ -822,6 +825,8 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 }
 EXPORT_SYMBOL(inet_sendpage);
 
+INDIRECT_CALLABLE_DECLARE(int udp_recvmsg(struct sock *, struct msghdr *,
+ size_t, int, int, int *));
 int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 int flags)
 {
@@ -832,8 +837,9 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
if (likely(!(flags & MSG_ERRQUEUE)))
sock_rps_record_flow(sk);
 
-   err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-  flags & ~MSG_DONTWAIT, &addr_len);
+   err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
+ sk, msg, size, flags & MSG_DONTWAIT,
+ flags & ~MSG_DONTWAIT, &addr_len);
if (err >= 0)
msg->msg_namelen = addr_len;
return err;
-- 
2.20.1



[PATCH net-next 4/5] ipv6: use indirect call wrappers for {tcp,udpv6}_{recv,send}msg()

2019-07-01 Thread Paolo Abeni
This avoids an indirect call per syscall for common ipv6 transports

Signed-off-by: Paolo Abeni 
---
 net/ipv6/af_inet6.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 4628681eca88..d5e98ee9fc79 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -564,6 +564,8 @@ int inet6_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(inet6_ioctl);
 
+INDIRECT_CALLABLE_DECLARE(int udpv6_sendmsg(struct sock *, struct msghdr *,
+   size_t));
 int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 {
struct sock *sk = sock->sk;
@@ -571,9 +573,12 @@ int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
if (unlikely(inet_send_prepare(sk)))
return -EAGAIN;
 
-   return sk->sk_prot->sendmsg(sk, msg, size);
+   return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udpv6_sendmsg,
+  sk, msg, size);
 }
 
+INDIRECT_CALLABLE_DECLARE(int udpv6_recvmsg(struct sock *, struct msghdr *,
+   size_t, int, int, int *));
 int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
  int flags)
 {
@@ -584,8 +589,9 @@ int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
if (likely(!(flags & MSG_ERRQUEUE)))
sock_rps_record_flow(sk);
 
-   err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-  flags & ~MSG_DONTWAIT, &addr_len);
+   err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udpv6_recvmsg,
+ sk, msg, size, flags & MSG_DONTWAIT,
+ flags & ~MSG_DONTWAIT, &addr_len);
if (err >= 0)
msg->msg_namelen = addr_len;
return err;
-- 
2.20.1



[PATCH net-next 2/5] ipv6: provide and use ipv6 specific version for {recv,send}msg

2019-07-01 Thread Paolo Abeni
This will simplify indirect call wrapper invocation in the following
patch.

No functional change intended, any - out-of-tree - IPv6 user of
inet_{recv,send}msg can keep using the existing functions.

SCTP code still uses the existing version even for ipv6: as this series
will not add ICW for SCTP, moving to the new helper would not give
any benefit.

The only other in-kernel user of inet_{recv,send}msg is
pvcalls_conn_back_read(), but psvcalls explicitly creates only IPv4 socket,
so no need to update that code path, too.

Signed-off-by: Paolo Abeni 
---
 include/net/ipv6.h  |  3 +++
 net/ipv6/af_inet6.c | 35 +++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index b41f6a0fa903..aecc28dff8f8 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1089,6 +1089,9 @@ int inet6_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
 
 int inet6_hash_connect(struct inet_timewait_death_row *death_row,
  struct sock *sk);
+int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
+int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
+ int flags);
 
 /*
  * reassembly.c
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 7382a927d1eb..4628681eca88 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -564,6 +564,33 @@ int inet6_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(inet6_ioctl);
 
+int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
+{
+   struct sock *sk = sock->sk;
+
+   if (unlikely(inet_send_prepare(sk)))
+   return -EAGAIN;
+
+   return sk->sk_prot->sendmsg(sk, msg, size);
+}
+
+int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
+ int flags)
+{
+   struct sock *sk = sock->sk;
+   int addr_len = 0;
+   int err;
+
+   if (likely(!(flags & MSG_ERRQUEUE)))
+   sock_rps_record_flow(sk);
+
+   err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
+  flags & ~MSG_DONTWAIT, &addr_len);
+   if (err >= 0)
+   msg->msg_namelen = addr_len;
+   return err;
+}
+
 const struct proto_ops inet6_stream_ops = {
.family= PF_INET6,
.owner = THIS_MODULE,
@@ -580,8 +607,8 @@ const struct proto_ops inet6_stream_ops = {
.shutdown  = inet_shutdown, /* ok   */
.setsockopt= sock_common_setsockopt,/* ok   */
.getsockopt= sock_common_getsockopt,/* ok   */
-   .sendmsg   = inet_sendmsg,  /* ok   */
-   .recvmsg   = inet_recvmsg,  /* ok   */
+   .sendmsg   = inet6_sendmsg, /* retpoline's sake */
+   .recvmsg   = inet6_recvmsg, /* retpoline's sake */
 #ifdef CONFIG_MMU
.mmap  = tcp_mmap,
 #endif
@@ -614,8 +641,8 @@ const struct proto_ops inet6_dgram_ops = {
.shutdown  = inet_shutdown, /* ok   */
.setsockopt= sock_common_setsockopt,/* ok   */
.getsockopt= sock_common_getsockopt,/* ok   */
-   .sendmsg   = inet_sendmsg,  /* ok   */
-   .recvmsg   = inet_recvmsg,  /* ok   */
+   .sendmsg   = inet6_sendmsg, /* retpoline's sake */
+   .recvmsg   = inet6_recvmsg, /* retpoline's sake */
.mmap  = sock_no_mmap,
.sendpage  = sock_no_sendpage,
.set_peek_off  = sk_set_peek_off,
-- 
2.20.1


