Re: [PATCH bpf-next v5 1/3] devmap/cpumap: Use flush list instead of bitmap

2019-06-28 Thread Toke Høiland-Jørgensen
Daniel Borkmann  writes:

> On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen 
>> 
>> The socket map uses a linked list instead of a bitmap to keep track of
>> which entries to flush. Do the same for devmap and cpumap, as this means we
>> don't have to care about the map index when enqueueing things into the
>> map (and so we can cache the map lookup).
>> 
>> Signed-off-by: Toke Høiland-Jørgensen 
> [...]
>> +static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
>>  {
>> +struct bpf_cpu_map_entry *rcpu = bq->obj;
>>  unsigned int processed = 0, drops = 0;
>>  const int to_cpu = rcpu->cpu;
>>  struct ptr_ring *q;
>> @@ -621,6 +630,9 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry 
>> *rcpu,
>>  bq->count = 0;
>>  spin_unlock(&q->producer_lock);
>>  
>> +__list_del(bq->flush_node.prev, bq->flush_node.next);
>> +bq->flush_node.prev = NULL;
>
> Given this and below is a bit non-standard way of using list API, maybe add
> these as inline helpers to include/linux/list.h to make sure anyone changing
> list API semantics doesn't overlook these in future?

Sure, can do.

-Toke
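
For reference, the inline helper this turned into in v6 of the series
(patch 1/5, later in this digest) is essentially the following, added to
include/linux/list.h:

static inline void __list_del_clearprev(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->prev = NULL;
}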


Re: XDP multi-buffer incl. jumbo-frames (Was: [RFC V1 net-next 1/1] net: ena: implement XDP drop support)

2019-06-28 Thread Eelco Chaudron
On 26 Jun 2019, at 10:38, Jesper Dangaard Brouer wrote:

> On Tue, 25 Jun 2019 03:19:22 +
> "Machulsky, Zorik"  wrote:
>
>> On 6/23/19, 7:21 AM, "Jesper Dangaard Brouer"  wrote:
>>
>>     On Sun, 23 Jun 2019 10:06:49 +0300  wrote:
>>
>>     > This commit implements the basic functionality of drop/pass logic in the
>>     > ena driver.
>>
>>     Usually we require a driver to implement all the XDP return codes,
>>     before we accept it.  But as Daniel and I discussed with Zorik during
>>     NetConf[1], we are going to make an exception and accept the driver
>>     if you also implement XDP_TX.
>>
>>     As we trust that Zorik/Amazon will follow and implement XDP_REDIRECT
>>     later, given he/you wants AF_XDP support which requires XDP_REDIRECT.
>>
>> Jesper, thanks for your comments and very helpful discussion during
>> NetConf! That's the plan, as we agreed. From our side I would like to
>> reiterate the importance of multi-buffer support by xdp frame.
>> We would really prefer not to see our MTU shrinking because of xdp
>> support.
>
> Okay, we really need to make a serious attempt to find a way to support
> multi-buffer packets with XDP, with the important criterion of not
> hurting performance of the single-buffer-per-packet design.
>
> I've created a design document[2], which I will update based on our
> discussions: [2]
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
>
> The use-case that really convinced me was Eric's packet header-split.
>
>
> Let's refresh: why XDP doesn't have multi-buffer support:
>
> XDP is designed for maximum performance, which is why certain
> driver-level use-cases were not supported, like multi-buffer packets
> (e.g. jumbo-frames), as they complicate the driver RX-loop and memory
> model handling.
>
> The single-buffer-per-packet design is also tied into eBPF
> Direct-Access (DA) to packet data, which can only be allowed if the
> packet memory is contiguous.  This DA feature is essential for XDP
> performance.
>
>
> One way forward is to define that XDP only gets access to the first
> packet buffer, and it cannot see subsequent buffers.  For XDP_TX and
> XDP_REDIRECT to work, XDP still needs to carry pointers (plus
> len+offset) to the other buffers, which is 16 bytes per extra buffer.

I’ve seen various network processor HW designs, and they normally get
the first x bytes (128 - 512) which they can manipulate
(append/prepend/insert/modify/delete).

There are designs where they can “page in” the additional fragments,
but it’s expensive as it requires additional memory transfers. The
majority do not care about (and cannot change) the remaining fragments.
I also can’t think of a reason why you might want to remove something
at the end of the frame (thinking about routing/forwarding needs here).

If we do want XDP to access other fragments, we could do this through a
helper which swaps the packet context?


//Eelco
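
For concreteness, the 16 bytes of per-extra-buffer metadata Jesper
describes could be laid out as in this sketch; the struct name and exact
field types are hypothetical, only the 8+4+4 byte budget comes from the
discussion above:

/* Hypothetical per-fragment descriptor, 16 bytes on 64-bit.  It would
 * be carried along for XDP_TX/XDP_REDIRECT, but not be visible to the
 * BPF program itself. */
struct xdp_frag {
	void *addr;	/* 8 bytes: pointer to the extra buffer */
	__u32 len;	/* 4 bytes: data length in this buffer */
	__u32 offset;	/* 4 bytes: data offset within the buffer */
};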



Re: [PATCH bpf-next v5 2/3] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Toke Høiland-Jørgensen
Daniel Borkmann  writes:

> On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen 
>> 
>> The bpf_redirect_map() helper used by XDP programs doesn't return any
>> indication of whether it can successfully redirect to the map index it was
>> given. Instead, BPF programs have to track this themselves, leading to
>> programs using duplicate maps to track which entries are populated in the
>> devmap.
>> 
>> This patch fixes this by moving the map lookup into the bpf_redirect_map()
>> helper, which makes it possible to return failure to the eBPF program. The
>> lower bits of the flags argument are used as the return code, which means
>> that existing users who pass a '0' flag argument will get XDP_ABORTED.
>> 
>> With this, a BPF program can check the return code from the helper call and
>> react by, for instance, substituting a different redirect. This works for
>> any type of map used for redirect.
>> 
>> Signed-off-by: Toke Høiland-Jørgensen 
>
> Overall series looks good to me. Just very small things inline here & in the
> other two patches:
>
> [...]
>> @@ -3750,9 +3742,16 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
>> map, u32, ifindex,
>>  {
>>  struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>>  
>> -if (unlikely(flags))
>> +/* Lower bits of the flags are used as return code on lookup failure */
>> +if (unlikely(flags > XDP_TX))
>>  return XDP_ABORTED;
>>  
>> +ri->item = __xdp_map_lookup_elem(map, ifindex);
>> +if (unlikely(!ri->item)) {
>> +WRITE_ONCE(ri->map, NULL);
>
> This WRITE_ONCE() is not needed. We never set it before at this point.

You mean the WRITE_ONCE() wrapper is not needed, or the set-to-NULL is
not needed? The reason I added it is in case an eBPF program calls the
helper twice before returning, where the first lookup succeeds but the
second fails; in that case we want to clear the ->map pointer, no?

-Toke
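
To illustrate the new flags semantics from the program side, here is a
minimal sketch of an XDP program relying on the error return; the map
and program names are made up, and it follows the libbpf conventions of
the time:

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") tx_devmap = {
	.type        = BPF_MAP_TYPE_DEVMAP,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 64,
};

SEC("xdp")
int xdp_redirect_or_pass(struct xdp_md *ctx)
{
	__u32 idx = ctx->rx_queue_index;

	/* With this series, if idx is not populated in the devmap the
	 * helper returns the lower bits of the flags argument (here
	 * XDP_PASS), so the packet falls back to the network stack
	 * instead of being dropped later in xdp_do_redirect(). */
	return bpf_redirect_map(&tx_devmap, idx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";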


Re: [PATCH RFC net-next 1/5] net: dsa: mt7530: Convert to PHYLINK API

2019-06-28 Thread Daniel Santos
Hello Andrew,

On 6/27/19 2:28 PM, Andrew Lunn wrote:
>>> Looking at the data sheet page, you want FORCE_MODE_Pn set. You never
>>> want the MAC directly talking to the PHY. Bad things will happen.
>> So what exactly do you mean by the MAC directly talking to the PHY?  Do
>> you mean setting speed, duplex, etc. via the MAC registers instead of
>> via MDIO to the MII registers of the PHY?
> The data sheet implies the MAC hardware performs reads on the PHY to
> get the status, and then uses that to configure the MAC. This is often
> called PHY polling. The MAC hardware however has no idea what the PHY
> driver is doing. The MAC does not take the PHY mutex. So the PHY
> driver might be doing something at the same time the MAC hardware
> polls the PHY, and it gets odd results. Some PHYs have multiple pages,
> and for example reading the temperature sensor means swapping
> pages. If the MAC hardware was to poll while the sensor is being read,
> it would not get the link status, it would read some random
> temperature register.
>
> So you want the PHY driver to read the results of auto-neg and it then
> tell the MAC the results, so the MAC can be configured correctly.

Thank you, this is very helpful!  I finally understand why these
settings are in two different places. :)  Unfortunately this driver (in
OpenWRT) writes a lot of "magic" values during init to registers that I
don't have documentation for, but I see where auto-polling can be
disabled now.

>>> Then use FORCE_RX_FC_Pn and FORCE_TX_Pn to reflect phydev->pause and
>>> phydev->asym_pause.
>>>
>>> The same idea applies when using phylink.
>>>
>>> Andrew
>> Your help is greatly appreciated here.  Admittedly, I'm also trying to
>> get this working in the now deprecated swconfig for a 3.18 kernel that's
>> in production.
> I'm not very familiar with swconfig. Is there software driving the
> PHY? If not, it is then safe for the MAC hardware to poll the PHY.
>
>  Andrew

swconfig is a netlink-based interface the OpenWRT team developed for
configuring switches before DSA was converted into a vendor-neutral
interface.  Now that DSA does what swconfig was designed for, it has
been deprecated, although (to my knowledge) we don't yet have DSA for
all devices that OpenWRT supports.

Daniel


Re: [PATCH bpf-next v5 2/3] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Toke Høiland-Jørgensen
Daniel Borkmann  writes:

> On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen 
>> 
>> The bpf_redirect_map() helper used by XDP programs doesn't return any
>> indication of whether it can successfully redirect to the map index it was
>> given. Instead, BPF programs have to track this themselves, leading to
>> programs using duplicate maps to track which entries are populated in the
>> devmap.
>> 
>> This patch fixes this by moving the map lookup into the bpf_redirect_map()
>> helper, which makes it possible to return failure to the eBPF program. The
>> lower bits of the flags argument are used as the return code, which means
>> that existing users who pass a '0' flag argument will get XDP_ABORTED.
>> 
>> With this, a BPF program can check the return code from the helper call and
>> react by, for instance, substituting a different redirect. This works for
>> any type of map used for redirect.
>> 
>> Signed-off-by: Toke Høiland-Jørgensen 
> [...]
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 183bf4d8e301..a6779e1cc1b8 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -3605,17 +3605,13 @@ static int xdp_do_redirect_map(struct net_device 
>> *dev, struct xdp_buff *xdp,
>> struct bpf_redirect_info *ri)
>>  {
>>  u32 index = ri->ifindex;
>> -void *fwd = NULL;
>> +void *fwd = ri->item;
>>  int err;
>>  
>>  ri->ifindex = 0;
>> +ri->item = NULL;
>>  WRITE_ONCE(ri->map, NULL);
>>  
>> -fwd = __xdp_map_lookup_elem(map, index);
>> -if (unlikely(!fwd)) {
>> -err = -EINVAL;
>> -goto err;
>> -}
>
> If you look at the _trace_xdp_redirect{,_err}(), we should also get rid of the
> extra NULL test in devmap_ifindex() which is not under tracepoint static key.

ACK, will add.

-Toke


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Thu, Jun 27, 2019 at 07:14:31PM CEST, dsah...@gmail.com wrote:
>On 6/27/19 3:43 AM, Jiri Pirko wrote:
>> Hi all.
>> 
>> The IFNAMSIZ (16) limit on netdevice name length has been discussed
>> repeatedly in the past. Now that we have PF and VF representors
>> with port names like "pfXvfY", it has become quite common to hit this limit:
>> 0123456789012345
>> enp131s0f1npf0vf6
>> enp131s0f1npf0vf22
>
>QinQ (stacked vlans) is another example.

There are more usecases for this, yes.


>
>> 
>> Since IFLA_NAME is just a string, I thought it might be possible to use
>> it to carry longer names as-is. However, the userspace tools, like
>> iproute2, perform checks before printing. So for example in the output
>> of "ip addr", when IFLA_NAME is longer than IFNAMSIZ, the netdevice is
>> omitted entirely.
>> 
>> So here is a proposal that might work:
>> 1) Add a new attribute IFLA_NAME_EXT that could carry names longer than
>>IFNAMSIZ, say 64 bytes. The max size should be defined only in the
>>kernel; userspace should be prepared for any string size.
>> 2) Add a file in sysfs that would indicate that NAME_EXT is supported by
>>the kernel.
>
>no sysfs files.
>
>Johannes added infrastructure to retrieve the policy. That is a more
>flexible and robust option for determining what the kernel supports.

Sure, udev can query rtnetlink. I just proposed it as an option; anyway,
it's an implementation detail.


>
>
>> 3) Udev is going to look for the sysfs indication file. In case the
>>kernel supports long names, it will rename to the longer name, setting
>>IFLA_NAME_EXT. If not, it does what it does now - fail.
>> 4) There are two cases that can happen during rename:
>>A) The name is shorter than IFNAMSIZ
>>   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same string:
>>  original IFLA_NAME = eth0
>>  original IFLA_NAME_EXT = eth0
>>  renamed  IFLA_NAME = enp5s0f1npf0vf1
>>  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
>>B) The name is longer than IFNAMSIZ
>>   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT would 
>>  contain the new one:
>>  original IFLA_NAME = eth0
>>  original IFLA_NAME_EXT = eth0
>>  renamed  IFLA_NAME = eth0
>>  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22
>
>so kernel side there will be 2 names for the same net_device?

Yes. However, updated tools (which would eventually be all of them) are
going to show only the ext one.



>
>> 
>> This would allow the old tools to work with "eth0" and the new
>> tools would work with "enp131s0f1npf0vf22". In sysfs, there would
>> be a symlink from one name to another.
>
>I would prefer a solution that does not rely on sysfs hooks.

Please note that these /sys/class/net/ifacename dirs are already created.
What I propose is to have a symlink from the ext name to the short name,
or vice versa. The solution really does not "rely" on this...


>
>>   
>> Also, a warning might be added to the kernel when someone works
>> with IFLA_NAME, saying that the userspace tool should be upgraded.
>
>that seems like spam and confusion for the first few years of a new api.

Spam? warn_once?


>
>> 
>> Eventually, only IFLA_NAME_EXT is going to be used by everyone.
>> 
>> I'm aware there are other places where similar new attribute
>> would have to be introduced too (ip rule for example).
>> I'm not saying this is a simple work.
>> 
>> The question is what to do with the ioctl API (get ifindex etc). I would
>> probably leave it as is and push tools to use rtnetlink instead.
>
>The ioctl API is going to be a limiter here. ifconfig is still quite
>prevalent and net-snmp still uses ioctl (as just 2 common examples).
>snmp showing one set of names and rtnetlink s/w showing another is going
>to be really confusing.

I don't see another way though, do you? The ioctl names are unextendable :/
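
To make step 1) of the proposal concrete, the kernel side could look
roughly like the following sketch; the size constant and the policy
entry are assumptions, not an actual patch:

/* hypothetical: a separate, larger limit for the new attribute */
#define IFNAMSIZ_EXT	64

/* IFLA_NAME_EXT would be appended at the end of the IFLA_* enum in
 * include/uapi/linux/if_link.h, and the rtnetlink policy would gain
 * an entry next to the existing IFLA_IFNAME one: */
static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
	[IFLA_IFNAME]	= { .type = NLA_STRING, .len = IFNAMSIZ-1 },
	[IFLA_NAME_EXT]	= { .type = NLA_STRING, .len = IFNAMSIZ_EXT-1 },
};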



[PATCH net-next v2 10/10] net: stmmac: Update Kconfig entry

2019-06-28 Thread Jose Abreu
We support more speeds now. Update the Kconfig entry.

Signed-off-by: Jose Abreu 
Cc: Joao Pinto 
Cc: David S. Miller 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig 
b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index c43e2da4e7e3..2acb999b7f63 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config STMMAC_ETH
-   tristate "STMicroelectronics 10/100/1000/EQOS Ethernet driver"
+   tristate "STMicroelectronics 10/100/1000/EQOS/2500/5000/1 Ethernet 
driver"
depends on HAS_IOMEM && HAS_DMA
select MII
select PHYLINK
-- 
2.7.4



Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Thu, Jun 27, 2019 at 09:35:27PM CEST, d...@redhat.com wrote:
>On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote:
>> On Thu, 27 Jun 2019 20:39:48 +0200
>> Michal Kubecek  wrote:
>> 
>> > > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > > # ip link show "Onboard Ethernet"
>> > > Device "Onboard Ethernet" does not exist.
>> > > 
>> > > So it does not really appear to be an alias, it is a label. To be
>> > > truly useful, it needs to be more than a label, it needs to be a
>> > > real
>> > > alias which you can use.  
>> > 
>> > That's exactly what I meant: to be really useful, one should be
>> > able to
>> > use the alias(es) for setting device options, for adding routes, in
>> > netfilter rules etc.
>> > 
>> > Michal
>> 
>> The kernel doesn't enforce uniqueness of alias.
>
>Can we even enforce unique aliases/labels? Given that the kernel hasn't
>enforced that in the past there's a good possibility of breaking stuff
>if it started. (unfortunately)

Correct. I think that Michal's idea to introduce "real aliases" is very
interesting. However, the existing "alias" as we have it does not seem
right to use for this, also because of the UAPI: we have IFLA_IFALIAS,
which is a single value. For "real aliases" we need a nested array.

[...]


Re: XDP multi-buffer incl. jumbo-frames (Was: [RFC V1 net-next 1/1] net: ena: implement XDP drop support)

2019-06-28 Thread Toke Høiland-Jørgensen
"Eelco Chaudron"  writes:

> On 26 Jun 2019, at 10:38, Jesper Dangaard Brouer wrote:
>
>> On Tue, 25 Jun 2019 03:19:22 +
>> "Machulsky, Zorik"  wrote:
>>
>>> On 6/23/19, 7:21 AM, "Jesper Dangaard Brouer"  wrote:
>>>
>>>     On Sun, 23 Jun 2019 10:06:49 +0300  wrote:
>>>
>>>     > This commit implements the basic functionality of drop/pass logic in the
>>>     > ena driver.
>>>
>>>     Usually we require a driver to implement all the XDP return codes,
>>>     before we accept it.  But as Daniel and I discussed with Zorik during
>>>     NetConf[1], we are going to make an exception and accept the driver
>>>     if you also implement XDP_TX.
>>>
>>>     As we trust that Zorik/Amazon will follow and implement XDP_REDIRECT
>>>     later, given he/you wants AF_XDP support which requires XDP_REDIRECT.
>>>
>>> Jesper, thanks for your comments and very helpful discussion during
>>> NetConf! That's the plan, as we agreed. From our side I would like to
>>> reiterate the importance of multi-buffer support by xdp frame.
>>> We would really prefer not to see our MTU shrinking because of xdp
>>> support.
>>
>> Okay, we really need to make a serious attempt to find a way to support
>> multi-buffer packets with XDP, with the important criterion of not
>> hurting performance of the single-buffer-per-packet design.
>>
>> I've created a design document[2], which I will update based on our
>> discussions: [2]
>> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
>>
>> The use-case that really convinced me was Eric's packet header-split.
>>
>>
>> Let's refresh: why XDP doesn't have multi-buffer support:
>>
>> XDP is designed for maximum performance, which is why certain
>> driver-level use-cases were not supported, like multi-buffer packets
>> (e.g. jumbo-frames), as they complicate the driver RX-loop and memory
>> model handling.
>>
>> The single-buffer-per-packet design is also tied into eBPF
>> Direct-Access (DA) to packet data, which can only be allowed if the
>> packet memory is contiguous.  This DA feature is essential for XDP
>> performance.
>>
>>
>> One way forward is to define that XDP only gets access to the first
>> packet buffer, and it cannot see subsequent buffers.  For XDP_TX and
>> XDP_REDIRECT to work, XDP still needs to carry pointers (plus
>> len+offset) to the other buffers, which is 16 bytes per extra buffer.
>
>
> I’ve seen various network processor HW designs, and they normally get
> the first x bytes (128 - 512) which they can manipulate
> (append/prepend/insert/modify/delete).
>
> There are designs where they can “page in” the additional fragments,
> but it’s expensive as it requires additional memory transfers. The
> majority do not care about (and cannot change) the remaining fragments.
> I also can’t think of a reason why you might want to remove something
> at the end of the frame (thinking about routing/forwarding needs here).
>
> If we do want XDP to access other fragments, we could do this through a
> helper which swaps the packet context?

Yeah, I was also going to suggest a helper for that. It doesn't
necessarily need to swap the packet context; it could just return a new
pointer?

-Toke


Re: [PATCH bpf-next v5 2/3] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Daniel Borkmann
On 06/28/2019 09:17 AM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann  writes:
> 
>> On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
>>> From: Toke Høiland-Jørgensen 
>>>
>>> The bpf_redirect_map() helper used by XDP programs doesn't return any
>>> indication of whether it can successfully redirect to the map index it was
>>> given. Instead, BPF programs have to track this themselves, leading to
>>> programs using duplicate maps to track which entries are populated in the
>>> devmap.
>>>
>>> This patch fixes this by moving the map lookup into the bpf_redirect_map()
>>> helper, which makes it possible to return failure to the eBPF program. The
>>> lower bits of the flags argument are used as the return code, which means
>>> that existing users who pass a '0' flag argument will get XDP_ABORTED.
>>>
>>> With this, a BPF program can check the return code from the helper call and
>>> react by, for instance, substituting a different redirect. This works for
>>> any type of map used for redirect.
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen 
>>
>> Overall series looks good to me. Just very small things inline here & in the
>> other two patches:
>>
>> [...]
>>> @@ -3750,9 +3742,16 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
>>> map, u32, ifindex,
>>>  {
>>> struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>>>  
>>> -   if (unlikely(flags))
>>> +   /* Lower bits of the flags are used as return code on lookup failure */
>>> +   if (unlikely(flags > XDP_TX))
>>> return XDP_ABORTED;
>>>  
>>> +   ri->item = __xdp_map_lookup_elem(map, ifindex);
>>> +   if (unlikely(!ri->item)) {
>>> +   WRITE_ONCE(ri->map, NULL);
>>
>> This WRITE_ONCE() is not needed. We never set it before at this point.
> 
> You mean the WRITE_ONCE() wrapper is not needed, or the set-to-NULL is
> not needed? The reason I added it is in case an eBPF program calls the
> helper twice before returning, where the first lookup succeeds but the
> second fails; in that case we want to clear the ->map pointer, no?

Yeah I meant the set-to-NULL. So if first call succeeds, and the second one
fails, then the expected semantics wrt the first call are as if the program
would have called bpf_xdp_redirect() only?

Looking at the code again, if we set ri->item to NULL, then we /must/ also
set ri->map to NULL. I guess there are two options: i) leave as is, ii) keep
the __xdp_map_lookup_elem() result in a temp var, if it's NULL return flags,
otherwise only /then/ update ri->item, so that semantics are similar to the
invalid flags check earlier. I guess fine either way, in case of i) there
should probably be a comment since it's less obvious.

Thanks,
Daniel


Re: [PATCH bpf-next v5 2/3] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Toke Høiland-Jørgensen
Daniel Borkmann  writes:

> On 06/28/2019 09:17 AM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann  writes:
>> 
>>> On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
 From: Toke Høiland-Jørgensen 

 The bpf_redirect_map() helper used by XDP programs doesn't return any
 indication of whether it can successfully redirect to the map index it was
 given. Instead, BPF programs have to track this themselves, leading to
 programs using duplicate maps to track which entries are populated in the
 devmap.

 This patch fixes this by moving the map lookup into the bpf_redirect_map()
 helper, which makes it possible to return failure to the eBPF program. The
 lower bits of the flags argument are used as the return code, which means
 that existing users who pass a '0' flag argument will get XDP_ABORTED.

 With this, a BPF program can check the return code from the helper call and
 react by, for instance, substituting a different redirect. This works for
 any type of map used for redirect.

 Signed-off-by: Toke Høiland-Jørgensen 
>>>
>>> Overall series looks good to me. Just very small things inline here & in the
>>> other two patches:
>>>
>>> [...]
 @@ -3750,9 +3742,16 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
 map, u32, ifindex,
  {
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
  
 -  if (unlikely(flags))
 +  /* Lower bits of the flags are used as return code on lookup failure */
 +  if (unlikely(flags > XDP_TX))
return XDP_ABORTED;
  
 +  ri->item = __xdp_map_lookup_elem(map, ifindex);
 +  if (unlikely(!ri->item)) {
 +  WRITE_ONCE(ri->map, NULL);
>>>
>>> This WRITE_ONCE() is not needed. We never set it before at this point.
>> 
>> You mean the WRITE_ONCE() wrapper is not needed, or the set-to-NULL is
>> not needed? The reason I added it is in case an eBPF program calls the
>> helper twice before returning, where the first lookup succeeds but the
>> second fails; in that case we want to clear the ->map pointer, no?
>
> Yeah I meant the set-to-NULL. So if first call succeeds, and the second one
> fails, then the expected semantics wrt the first call are as if the program
> would have called bpf_xdp_redirect() only?
>
> Looking at the code again, if we set ri->item to NULL, then we /must/
> also set ri->map to NULL. I guess there are two options: i) leave as
> is, ii) keep the __xdp_map_lookup_elem() result in a temp var, if it's
> NULL return flags, otherwise only /then/ update ri->item, so that
> semantics are similar to the invalid flags check earlier. I guess fine
> either way, in case of i) there should probably be a comment since
> it's less obvious.

Yeah, I think a temp var is probably clearer, will do that :)

-Toke


Re: XDP multi-buffer incl. jumbo-frames (Was: [RFC V1 net-next 1/1] net: ena: implement XDP drop support)

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 09:14:39 +0200
"Eelco Chaudron"  wrote:

> On 26 Jun 2019, at 10:38, Jesper Dangaard Brouer wrote:
> 
> > On Tue, 25 Jun 2019 03:19:22 +
> > "Machulsky, Zorik"  wrote:
> >
> >> On 6/23/19, 7:21 AM, "Jesper Dangaard Brouer"  wrote:
> >>
> >>     On Sun, 23 Jun 2019 10:06:49 +0300  wrote:
> >>
> >>     > This commit implements the basic functionality of drop/pass logic in the
> >>     > ena driver.
> >>
> >>     Usually we require a driver to implement all the XDP return codes,
> >>     before we accept it.  But as Daniel and I discussed with Zorik during
> >>     NetConf[1], we are going to make an exception and accept the driver
> >>     if you also implement XDP_TX.
> >>
> >>     As we trust that Zorik/Amazon will follow and implement XDP_REDIRECT
> >>     later, given he/you wants AF_XDP support which requires XDP_REDIRECT.
> >>
> >> Jesper, thanks for your comments and very helpful discussion during
> >> NetConf! That's the plan, as we agreed. From our side I would like to
> >> reiterate the importance of multi-buffer support by xdp frame.
> >> We would really prefer not to see our MTU shrinking because of xdp
> >> support.
> >
> > Okay, we really need to make a serious attempt to find a way to support
> > multi-buffer packets with XDP, with the important criterion of not
> > hurting performance of the single-buffer-per-packet design.
> >
> > I've created a design document[2], which I will update based on our
> > discussions: [2]
> > https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> >
> > The use-case that really convinced me was Eric's packet header-split.
> >
> >
> > Let's refresh: why XDP doesn't have multi-buffer support:
> >
> > XDP is designed for maximum performance, which is why certain
> > driver-level use-cases were not supported, like multi-buffer packets
> > (e.g. jumbo-frames), as they complicate the driver RX-loop and memory
> > model handling.
> >
> > The single-buffer-per-packet design is also tied into eBPF
> > Direct-Access (DA) to packet data, which can only be allowed if the
> > packet memory is contiguous.  This DA feature is essential for XDP
> > performance.
> >
> >
> > One way forward is to define that XDP only gets access to the first
> > packet buffer, and it cannot see subsequent buffers.  For XDP_TX and
> > XDP_REDIRECT to work, XDP still needs to carry pointers (plus
> > len+offset) to the other buffers, which is 16 bytes per extra buffer.
> 
> 
> I’ve seen various network processor HW designs, and they normally get
> the first x bytes (128 - 512) which they can manipulate
> (append/prepend/insert/modify/delete).

Good data point, thank you!  It confirms that XDP only getting access to
the first packet-buffer makes sense for most use-cases.

We also have to remember that XDP is not meant to handle every
use-case.  XDP is a software fast-path that can accelerate certain
use-cases.  We have the existing network stack as a fall-back for
handling the corner-cases that would otherwise slow down our XDP
fast-path.


> There are designs where they can “page in” the additional fragments,
> but it’s expensive as it requires additional memory transfers. The
> majority do not care about (and cannot change) the remaining fragments.
> I also can’t think of a reason why you might want to remove something
> at the end of the frame (thinking about routing/forwarding needs here).

Use-cases that need to adjust the tail of the packet:

- ICMP replies directly from XDP[1] need to shorten the packet tail, but
  this use-case doesn't use fragments.

- IPsec needs to add/extend the packet tail for the IPsec trailer[2];
  again, it's unlikely that this needs fragments(?). (This use-case
  convinced me that we need to add extend-tail support to
  bpf_xdp_adjust_tail.)

- DNS or memcached replies directly from XDP need to extend the packet
  tail to have room for the reply. (It would be interesting to allow
  larger replies, but I'm not sure we should ever support that.)


> If we do want XDP to access other fragments, we could do this through
> a helper which swaps the packet context?

That might be a way forward.  If the XDP developer has to call a
helper, then they should realize they are "buying into" an additional
overhead/cost.


[1] 
https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_adjust_tail_kern.c
[2] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
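
A hypothetical shape for such a fragment-access helper, purely for
illustration (the name, signature and semantics are made up, nothing
like this exists):

/* Hypothetical: switch the program's packet window to fragment
 * 'frag_idx', so that ctx->data/ctx->data_end point into that fragment
 * afterwards.  Returns a negative error if the fragment doesn't exist.
 * The cost of "paging in" another fragment is only paid by programs
 * that actually call this. */
long bpf_xdp_switch_frag(struct xdp_md *ctx, __u32 frag_idx);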


Re: XDP multi-buffer incl. jumbo-frames (Was: [RFC V1 net-next 1/1] net: ena: implement XDP drop support)

2019-06-28 Thread Jesper Dangaard Brouer


On Wed, 26 Jun 2019 11:20:45 -0400 Willem de Bruijn 
 wrote:
> On Wed, Jun 26, 2019 at 11:01 AM Toke Høiland-Jørgensen  
> wrote:
> > Jesper Dangaard Brouer  writes:
> > > On Wed, 26 Jun 2019 13:52:16 +0200
> > > Toke Høiland-Jørgensen  wrote:
> > >  
> > >> Jesper Dangaard Brouer  writes:
> > >>  
[...]
> > >
> > > You touch upon some interesting complications already:
> > >
> > > 1. Is it valuable for the XDP bpf_prog to know the "full" length?
> > >(if so, then we need to extend the xdp ctx with this info)
> >
> > Valuable, quite likely. A hard requirement, probably not (for all use
> > cases).  
> 
> Agreed.
> 
> One common validation use would be to drop any packets whose header
> length disagrees with the actual packet length.

That is a good point.

Added a section "XDP access to full packet length?" to capture this:
- https://github.com/xdp-project/xdp-project/commit/da5b84264b85b0d
- 
https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org#xdp-access-to-full-packet-length


> > >  But if we need to know the full length when the first buffer is
> > >  processed, then realize that this affects the driver's RX-loop,
> > >  because we then need to "collect" all the buffers before we can know
> > >  the length (although some HW provides this in the first descriptor).
> > >
> > >  We likely have to change the driver's RX-loop anyhow, as XDP_TX and
> > >  XDP_REDIRECT will also need to "collect" all buffers before the packet
> > >  can be forwarded. (Although this could potentially happen later in the
> > >  driver loop, when it finds the End-Of-Packet descriptor bit.)
> 
> Yes, this might be quite a bit of refactoring of device driver code.
> 
> Should we move forward with some initial constraints, e.g., no
> XDP_REDIRECT, no "full" length and no bpf_xdp_adjust_tail?

I generally like this...

If we don't add the "full" length, maybe we could add an indication to
the XDP developer that this is a multi-buffer/multi-segment packet, so
that they know validating the header length against the packet length
(data_end - data) is not possible.  This is user visible, so we would
have to keep it forever... I'm leaning towards adding the "full" length
from the beginning.

> That already allows many useful programs.
>
> As long as we don't arrive at a design that cannot be extended with
> those features later.

That is the important part...

 
> > > 2. Can we even allow helper bpf_xdp_adjust_tail() ?
[...]
> >  
> > >  Perhaps it is better to let bpf_xdp_adjust_tail() fail runtime?  
> >
> > If we do disallow it, I think I'd lean towards failing the call at
> > runtime...  
> 
> Disagree. I'd rather have a program fail at load if it depends on
> multi-frag support while the (driver) implementation does not yet
> support it.

I usually agree that we should fail the program early, at load time.
For XDP we are unfortunately missing some knobs to do this, see [1].

Specifically for bpf_xdp_adjust_tail(), it might be better to fail at
runtime, because the driver might have enabled TSO for TCP packets,
while your XDP use-case is adjusting UDP packets (and doing XDP-level
replies), which will never see multi-buffer packets.  If we failed use
of bpf_xdp_adjust_tail() at load time, you would have to disable TSO to
allow loading your XDP prog, hurting the other TSO-TCP use-case.


[1] http://vger.kernel.org/netconf2019_files/xdp-feature-detection.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 10/16] qlge: Factor out duplicated expression

2019-06-28 Thread Benjamin Poirier
On 2019/06/25 18:32, Manish Chopra wrote:
[...]
> > 
> > What I inferred from the presence of that expression though is that in the
> > places where it is used, the device interprets a value of 0 as 65536. 
> > Manish,
> > can you confirm that? As David points out, the expression is useless. A
> > comment might not be however.
> 
> Yes, I think it could be simplified to just a cast.
> 

I checked and it seems like the device treats cqicq->lbq_buf_size = 0
more like 65536 than like 0. That is, I saw it write up to 9596B
(running at 9000 mtu...) in a single lbq buffer.
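
Given that, the duplicated expression could indeed reduce to a plain
cast plus a comment at the usage site. A sketch, assuming the
deduplicated qdev->lbq_buf_size from patch 03/16 and the cqicb field
discussed above:

/* The chip treats an lbq_buf_size of 0 as 65536, so the truncating
 * cast is intentional: 65536 wraps to 0, smaller values pass through
 * unchanged. */
cqicb->lbq_buf_size = cpu_to_le16((u16)qdev->lbq_buf_size);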


Re: [PATCH iproute2-next v3 1/2] ipaddress: correctly print a VF hw address in the IPoIB case

2019-06-28 Thread Denis Kirjanov
On 6/24/19, David Ahern  wrote:
> On 6/22/19 12:00 PM, Denis Kirjanov wrote:
>> @@ -365,13 +367,45 @@ static void print_vfinfo(FILE *fp, struct rtattr
>> *vfinfo)
>>  parse_rtattr_nested(vf, IFLA_VF_MAX, vfinfo);
>>
>>  vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
>> +vf_broadcast = RTA_DATA(vf[IFLA_VF_BROADCAST]);
>>  vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
>>
>>  print_string(PRINT_FP, NULL, "%s", _SL_);
>>  print_int(PRINT_ANY, "vf", "vf %d ", vf_mac->vf);
>> -print_string(PRINT_ANY, "mac", "MAC %s",
>> - ll_addr_n2a((unsigned char *) &vf_mac->mac,
>> - ETH_ALEN, 0, b1, sizeof(b1)));
>> +
>> +print_string(PRINT_ANY,
>> +"link_type",
>> +"link/%s ",
>> +ll_type_n2a(ifi->ifi_type, b1, sizeof(b1)));
>> +
>> +print_color_string(PRINT_ANY,
>> +COLOR_MAC,
>> +"address",
>> +"%s",
>> +ll_addr_n2a((unsigned char *) &vf_mac->mac,
>> +ifi->ifi_type == ARPHRD_ETHER ?
>> +ETH_ALEN : INFINIBAND_ALEN,
>> +ifi->ifi_type,
>> +b1, sizeof(b1)));
>
> you still have a lot of lines that are not lined up column wise. See how
> the COLOR_MAC is offset to the right from PRINT_ANY?
>
>> +
>> +if (vf[IFLA_VF_BROADCAST]) {
>> +if (ifi->ifi_flags&IFF_POINTOPOINT) {
>> +print_string(PRINT_FP, NULL, " peer ", NULL);
>> +print_bool(PRINT_JSON,
>> +"link_pointtopoint", NULL, true);
>> +} else
>> +print_string(PRINT_FP, NULL, " brd ", NULL);
>> +
>> +print_color_string(PRINT_ANY,
>> +COLOR_MAC,
>> +"broadcast",
>> +"%s",
>> +ll_addr_n2a((unsigned char *) 
>> &vf_broadcast->broadcast,
>> +ifi->ifi_type == ARPHRD_ETHER ?
>> +ETH_ALEN : INFINIBAND_ALEN,
>> +ifi->ifi_type,
>> +b1, sizeof(b1)));
>
> And then these lines are offset to the left.

Hi David,

I've just sent a new version, with the strings formatted in a similar
way to what is used throughout the source.

Thank you.

>


Re: [PATCH bpf-next v5 2/3] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Toke Høiland-Jørgensen
Toke Høiland-Jørgensen  writes:

> Daniel Borkmann  writes:
>
>> On 06/28/2019 09:17 AM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann  writes:
>>> 
 On 06/23/2019 04:17 AM, Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen 
>
> The bpf_redirect_map() helper used by XDP programs doesn't return any
> indication of whether it can successfully redirect to the map index it was
> given. Instead, BPF programs have to track this themselves, leading to
> programs using duplicate maps to track which entries are populated in the
> devmap.
>
> This patch fixes this by moving the map lookup into the bpf_redirect_map()
> helper, which makes it possible to return failure to the eBPF program. The
> lower bits of the flags argument are used as the return code, which means
> that existing users who pass a '0' flag argument will get XDP_ABORTED.
>
> With this, a BPF program can check the return code from the helper call 
> and
> react by, for instance, substituting a different redirect. This works for
> any type of map used for redirect.
>
> Signed-off-by: Toke Høiland-Jørgensen 

 Overall series looks good to me. Just very small things inline here & in 
 the
 other two patches:

 [...]
> @@ -3750,9 +3742,16 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
> map, u32, ifindex,
>  {
>   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  
> - if (unlikely(flags))
> + /* Lower bits of the flags are used as return code on lookup failure */
> + if (unlikely(flags > XDP_TX))
>   return XDP_ABORTED;
>  
> + ri->item = __xdp_map_lookup_elem(map, ifindex);
> + if (unlikely(!ri->item)) {
> + WRITE_ONCE(ri->map, NULL);

 This WRITE_ONCE() is not needed. We never set it before at this point.
>>> 
>>> You mean the WRITE_ONCE() wrapper is not needed, or the set-to-NULL is
>>> not needed? The reason I added it is in case an eBPF program calls the
>>> helper twice before returning, where the first lookup succeeds but the
>>> second fails; in that case we want to clear the ->map pointer, no?
>>
>> Yeah I meant the set-to-NULL. So if first call succeeds, and the second one
>> fails, then the expected semantics wrt the first call are as if the program
>> would have called bpf_xdp_redirect() only?
>>
>> Looking at the code again, if we set ri->item to NULL, then we /must/
>> also set ri->map to NULL. I guess there are two options: i) leave as
>> is, ii) keep the __xdp_map_lookup_elem() result in a temp var, if it's
>> NULL return flags, otherwise only /then/ update ri->item, so that
>> semantics are similar to the invalid flags check earlier. I guess fine
>> either way, in case of i) there should probably be a comment since
>> it's less obvious.
>
> Yeah, I think a temp var is probably clearer, will do that :)

Actually, thinking about this some more, I think it's better to
completely clear out the state the second time around. I'll add a
comment to explain.

-Toke
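
The v6 version of this hunk (patch 4/5 below is cut off before this
point in the digest) ends up roughly like the following sketch, with
the field renamed to tgt_value:

BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
	   u64, flags)
{
	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

	/* Lower bits of the flags are used as return code on lookup failure */
	if (unlikely(flags > XDP_TX))
		return XDP_ABORTED;

	ri->tgt_value = __xdp_map_lookup_elem(map, ifindex);
	if (unlikely(!ri->tgt_value)) {
		/* If the lookup fails, clear out all redirect state, so a
		 * later xdp_do_redirect() cannot act on stale values from
		 * an earlier, successful call. */
		WRITE_ONCE(ri->map, NULL);
		return flags;
	}

	ri->flags = flags;
	ri->tgt_index = ifindex;
	WRITE_ONCE(ri->map, map);

	return XDP_REDIRECT;
}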


Re: [EXT] [PATCH net-next 03/16] qlge: Deduplicate lbq_buf_size

2019-06-28 Thread Benjamin Poirier
On 2019/06/26 11:42, Willem de Bruijn wrote:
> On Wed, Jun 26, 2019 at 7:37 AM Benjamin Poirier  wrote:
> >
> > On 2019/06/26 09:24, Manish Chopra wrote:
> > > > -Original Message-
> > > > From: Benjamin Poirier 
> > > > Sent: Monday, June 17, 2019 1:19 PM
> > > > To: Manish Chopra ; GR-Linux-NIC-Dev  > > > nic-...@marvell.com>; netdev@vger.kernel.org
> > > > Subject: [EXT] [PATCH net-next 03/16] qlge: Deduplicate lbq_buf_size
> > > >
> > > > External Email
> > > >
> > > > --
> > > > lbq_buf_size is duplicated to every rx_ring structure whereas 
> > > > lbq_buf_order is
> > > > present once in the ql_adapter structure. All rings use the same buf 
> > > > size, keep
> > > > only one copy of it. Also factor out the calculation of lbq_buf_size 
> > > > instead of
> > > > having two copies.
> > > >
> > > > Signed-off-by: Benjamin Poirier 
> > > > ---
> > [...]
> > >
> > > Not sure if this change is really required, I think fields relevant to 
> > > rx_ring should be present in the rx_ring structure.
> > > There are various other fields like "lbq_len" and "lbq_size" which would 
> > > be same for all rx rings but still under the relevant rx_ring structure.
> 
> The one argument against deduplicating might be if the original fields
> are in a hot cacheline and the new location adds a cacheline access to
> a hot path. Not sure if that is relevant here. But maybe something to
> double check.
> 

Thanks for the hint. I didn't check before because my hunch was that
this driver is not near that level of optimization, but I checked now
and got the following results.

The other side of the link is doing pktgen of 257B frames (smallest to
exercise the lbq path), over 10 runs of 100s, number of frames received
by the stack before/after applying this patch:
before: 1487000±2000pps
after: 1538000±2000pps
(3.4% improvement)

Looking with perf stat, boiling down the numbers to be per packet, most
interesting differences are:
cycles -4.32%
instructions -0.97%
L1-dcache-loads +6.30%
L1-dcache-load-misses -1.26%
LLC-loads +2.93%
LLC-load-misses -1.16%

So, fewer cycles/packet, fewer cache misses. Note however that
L1-dcache-loads had high variability between runs (~12%).

Looking with pahole, struct rx_ring is reduced by 8 bytes, but it's a
784-byte structure full of holes to begin with.

Numbers are from an old Xeon E3-1275

Full output follows:
root@dtest:~# cat /tmp/mini.sh
#!/bin/bash

da=$(ethtool -S ql1 | awk '/rx_nic_fifo_drop/ {print $2}')
ra=$(ip -s link show dev ql1 | awk 'NR == 4 {print $2}')
sleep 100
db=$(ethtool -S ql1 | awk '/rx_nic_fifo_drop/ {print $2}')
rb=$(ip -s link show dev ql1 | awk 'NR == 4 {print $2}')
echo "rx $((rb - ra))"
echo "dropped $((db - da))"

root@dtest:~# perf stat -a -r10 -d /tmp/mini.sh
rx 148763519
dropped 296128797
rx 148784865
dropped 296107878
rx 148675349
dropped 296217234
rx 148634271
dropped 296260358
rx 148813581
dropped 296089785
rx 148755713
dropped 296136071
rx 148434116
dropped 296459743
rx 148390638
dropped 296501405
rx 148500812
dropped 296391286
rx 148973912
dropped 295920014

 Performance counter stats for 'system wide' (10 runs):

800,245.24 msec cpu-clock #8.000 CPUs utilized  
  ( +-  0.00% )
11,559  context-switches  #   14.445 M/sec  
  ( +-  0.13% )
36  cpu-migrations#0.044 M/sec  
  ( +-  7.11% )
62,100  page-faults   #   77.601 M/sec  
  ( +-  0.51% )
   362,221,639,977  cycles# 452638.486 GHz  
  ( +-  0.77% )  (49.92%)
   515,318,132,327  instructions  #1.42  insn per cycle 
  ( +-  0.64% )  (62.42%)
91,744,969,260  branches  # 114646115.533 M/sec 
  ( +-  0.04% )  (62.50%)
   549,681,601  branch-misses #0.60% of all branches
  ( +-  0.25% )  (62.50%)
   115,207,692,963  L1-dcache-loads   # 143965544.751 M/sec 
  ( +- 12.09% )  (40.59%)
 2,629,809,584  L1-dcache-load-misses #2.28% of all L1-dcache 
hits( +-  3.24% )  (29.90%)
   837,806,984  LLC-loads # 1046938.235 M/sec   
  ( +-  0.73% )  (25.31%)
   569,343,079  LLC-load-misses   #   67.96% of all LL-cache 
hits ( +-  0.07% )  (37.50%)

 100.03177 +- 0.00104 seconds time elapsed  ( +-  0.00% )

#
# changed the driver now to include this patch
#

root@dtest:~# perf stat -a -r10 -d /tmp/mini.sh
rx 154094001
dropped 290799944
rx 153798301
dropped 291094799
rx 153942166
dropped 290950792
rx 154150506
dropped 290743194
rx 154179791
dropped 290712400
rx 153907213
dropped 290985227
rx 153813515
dropped 291078475
rx 153603554
dropped 291289250
rx 153521414
dropped 291372145
rx 153592955
dropped 291299734

 Performance counter stats for 's

Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-06-28 Thread Lorenz Bauer
On Thu, 27 Jun 2019 at 21:19, Song Liu  wrote:
>
> This patch introduces unprivileged BPF access. The access control is
> achieved via the device /dev/bpf. Users with write access to /dev/bpf are
> able to call sys_bpf().
>
> Two ioctl commands are added to /dev/bpf:
>
> The two commands enable/disable permission to call sys_bpf() for the
> current task. This permission is noted by bpf_permitted in task_struct.
> This permission is inherited during clone(CLONE_THREAD).

If I understand it correctly, a process would have to open /dev/bpf before
spawning other threads for this to work?

That still wouldn't work for Go I'm afraid. The runtime creates and destroys
threads on an ad-hoc basis, and there is no way to "initialize" in the
first thread.
With the API as is, any Go wrapper wishing to use this would have to do the
following _for every BPF syscall_:

1. Use runtime.LockOSThread() to prevent the scheduler from moving the
goroutine.
2. Open /dev/bpf to set the bit in current_task
3. Execute the syscall
4. Call runtime.UnlockOSThread()

Note that calling into C code via CGo doesn't change this. Is it not possible to
set the bit on all processes in the current thread group?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
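
For readers following along, the usage pattern under discussion is
roughly the following; the ioctl command names are hypothetical, since
the excerpt above doesn't name them:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* hypothetical names for the two ioctl commands */
#define BPF_PERM_ENABLE		_IO('b', 1)
#define BPF_PERM_DISABLE	_IO('b', 2)

void with_bpf_permission(void)
{
	int fd = open("/dev/bpf", O_RDWR);	/* needs write access */

	ioctl(fd, BPF_PERM_ENABLE);	/* current task may now call sys_bpf() */
	/* ... issue bpf() syscalls here ... */
	ioctl(fd, BPF_PERM_DISABLE);	/* drop the permission again */
	close(fd);
}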


Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-06-28 Thread Lorenz Bauer
On Fri, 28 Jun 2019 at 00:40, Andy Lutomirski  wrote:
>
> I have a bigger issue with this patch, though: it's a really awkward way
> to pretend to have capabilities.  For bpf, it seems like you could make
> this be a *real* capability without too much pain since there's only one
> syscall there.  Just find a way to pass an fd to /dev/bpf into the
> syscall.  If this means you need a new bpf_with_cap() syscall that takes
> an extra argument, so be it.  The old bpf() syscall can just translate
> to bpf_with_cap(..., -1).

I agree, this seems nicer from my POV, since it evades the issues with
the Go runtime I pointed out in the other message.

It also seems like this wouldn't have to create API churn in libbpf? We can
"feature detect" the presence of the new syscall and use that instead. If
you want you can even keep the semantics of having a "global" credential.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
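
Sketched out, the suggestion would make the legacy entry point a thin
wrapper around the new syscall; the names and argument order here are
assumptions:

/* hypothetical new syscall: like bpf(2), but takes an fd to /dev/bpf
 * (or -1 for "no capability") as an explicit token */
long bpf_with_cap(int cmd, union bpf_attr *attr, unsigned int size,
		  int cap_fd);

/* the existing syscall keeps working, just without a capability fd */
long bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return bpf_with_cap(cmd, attr, size, -1);
}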


[PATCH bpf-next v6 4/5] bpf_xdp_redirect_map: Perform map lookup in eBPF helper

2019-06-28 Thread Toke Høiland-Jørgensen
From: Toke Høiland-Jørgensen 

The bpf_redirect_map() helper used by XDP programs doesn't return any
indication of whether it can successfully redirect to the map index it was
given. Instead, BPF programs have to track this themselves, leading to
programs using duplicate maps to track which entries are populated in the
devmap.

This patch fixes this by moving the map lookup into the bpf_redirect_map()
helper, which makes it possible to return failure to the eBPF program. The
lower bits of the flags argument are used as the return code, which means
that existing users who pass a '0' flag argument will get XDP_ABORTED.

With this, a BPF program can check the return code from the helper call and
react by, for instance, substituting a different redirect. This works for
any type of map used for redirect.

Signed-off-by: Toke Høiland-Jørgensen 
Acked-by: Jonathan Lemon 
Acked-by: Andrii Nakryiko 
---
 include/linux/filter.h |1 +
 include/trace/events/xdp.h |5 ++---
 include/uapi/linux/bpf.h   |7 +--
 net/core/filter.c  |   32 ++--
 4 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index af8601340dae..f2a03f232386 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -580,6 +580,7 @@ struct bpf_skb_data_end {
 struct bpf_redirect_info {
u32 flags;
u32 tgt_index;
+   void *tgt_value;
struct bpf_map *map;
struct bpf_map *map_to_flush;
u32 kern_flags;
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index bb5e380e2ef3..393f3d7ca819 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -146,9 +146,8 @@ struct _bpf_dtab_netdev {
 #endif /* __DEVMAP_OBJ_TYPE */
 
 #define devmap_ifindex(fwd, map)   \
-   (!fwd ? 0 : \
-((map->map_type == BPF_MAP_TYPE_DEVMAP) ?  \
- ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0))
+   ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?   \
+ ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)
 
 #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)   \
 trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map), \
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b077507efa3f..59f3fe61b2b0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1568,8 +1568,11 @@ union bpf_attr {
  * but this is only implemented for native XDP (with driver
  * support) as of this writing).
  *
- * All values for *flags* are reserved for future usage, and must
- * be left at zero.
+ * The lower two bits of *flags* are used as the return code if
+ * the map lookup fails. This is so that the return value can be
+ * one of the XDP program return codes up to XDP_TX, as chosen by
+ * the caller. Any higher bits in the *flags* argument must be
+ * unset.
  *
  * When used to redirect packets to net devices, this helper
  * provides a high performance increase over **bpf_redirect**\ ().
diff --git a/net/core/filter.c b/net/core/filter.c
index 1d7f5f7f32cf..f9fcc8168e1c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3605,17 +3605,13 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
   struct bpf_redirect_info *ri)
 {
u32 index = ri->tgt_index;
-   void *fwd = NULL;
+   void *fwd = ri->tgt_value;
int err;
 
ri->tgt_index = 0;
+   ri->tgt_value = NULL;
WRITE_ONCE(ri->map, NULL);
 
-   fwd = __xdp_map_lookup_elem(map, index);
-   if (unlikely(!fwd)) {
-   err = -EINVAL;
-   goto err;
-   }
if (ri->map_to_flush && unlikely(ri->map_to_flush != map))
xdp_do_flush_map();
 
@@ -3652,18 +3648,13 @@ static int xdp_do_generic_redirect_map(struct 
net_device *dev,
 {
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->tgt_index;
-   void *fwd = NULL;
+   void *fwd = ri->tgt_value;
int err = 0;
 
ri->tgt_index = 0;
+   ri->tgt_value = NULL;
WRITE_ONCE(ri->map, NULL);
 
-   fwd = __xdp_map_lookup_elem(map, index);
-   if (unlikely(!fwd)) {
-   err = -EINVAL;
-   goto err;
-   }
-
if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
struct bpf_dtab_netdev *dst = fwd;
 
@@ -3732,6 +3723,7 @@ BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 
ri->flags = flags;
ri->tgt_index = ifindex;
+   ri->tgt_value = NULL;
WRITE_ONCE(ri->map, NULL);
 
return XDP_REDIRECT;
@@ -3750,9 +3742,21 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, 
u32, ifindex,
 {
 

[PATCH bpf-next v6 2/5] devmap/cpumap: Use flush list instead of bitmap

2019-06-28 Thread Toke Høiland-Jørgensen
From: Toke Høiland-Jørgensen 

The socket map uses a linked list instead of a bitmap to keep track of
which entries to flush. Do the same for devmap and cpumap, as this means we
don't have to care about the map index when enqueueing things into the
map (and so we can cache the map lookup).

Signed-off-by: Toke Høiland-Jørgensen 
Acked-by: Jonathan Lemon 
Acked-by: Andrii Nakryiko 
---
 kernel/bpf/cpumap.c |  105 +++---
 kernel/bpf/devmap.c |  107 ++-
 net/core/filter.c   |2 -
 3 files changed, 95 insertions(+), 119 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8dff08768087..ef49e17ae47c 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -32,14 +32,19 @@
 
 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call.  It is
- * guaranteed that setting flush bit and flush operation happen on
+ * guaranteed that queueing the frame and the flush operation happen on
  * same CPU.  Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
  * which queue in bpf_cpu_map_entry contains packets.
  */
 
 #define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
+struct bpf_cpu_map_entry;
+struct bpf_cpu_map;
+
 struct xdp_bulk_queue {
void *q[CPU_MAP_BULK_SIZE];
+   struct list_head flush_node;
+   struct bpf_cpu_map_entry *obj;
unsigned int count;
 };
 
@@ -52,6 +57,8 @@ struct bpf_cpu_map_entry {
/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
struct xdp_bulk_queue __percpu *bulkq;
 
+   struct bpf_cpu_map *cmap;
+
/* Queue with potential multi-producers, and single-consumer kthread */
struct ptr_ring *queue;
struct task_struct *kthread;
@@ -65,23 +72,17 @@ struct bpf_cpu_map {
struct bpf_map map;
/* Below members specific for map type */
struct bpf_cpu_map_entry **cpu_map;
-   unsigned long __percpu *flush_needed;
+   struct list_head __percpu *flush_list;
 };
 
-static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
-struct xdp_bulk_queue *bq, bool in_napi_ctx);
-
-static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
-{
-   return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
-}
+static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx);
 
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
struct bpf_cpu_map *cmap;
int err = -ENOMEM;
+   int ret, cpu;
u64 cost;
-   int ret;
 
if (!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
@@ -105,7 +106,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
/* make sure page count doesn't overflow */
cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
-   cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
+   cost += sizeof(struct list_head) * num_possible_cpus();
 
/* Notice returns -EPERM on if map size is larger than memlock limit */
ret = bpf_map_charge_init(&cmap->map.memory, cost);
@@ -114,12 +115,13 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
goto free_cmap;
}
 
-   /* A per cpu bitfield with a bit per possible CPU in map  */
-   cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
-   __alignof__(unsigned long));
-   if (!cmap->flush_needed)
+   cmap->flush_list = alloc_percpu(struct list_head);
+   if (!cmap->flush_list)
goto free_charge;
 
+   for_each_possible_cpu(cpu)
+   INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu));
+
/* Alloc array for possible remote "destination" CPUs */
cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
   sizeof(struct bpf_cpu_map_entry *),
@@ -129,7 +131,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
return &cmap->map;
 free_percpu:
-   free_percpu(cmap->flush_needed);
+   free_percpu(cmap->flush_list);
 free_charge:
bpf_map_charge_finish(&cmap->map.memory);
 free_cmap:
@@ -334,7 +336,8 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 
qsize, u32 cpu,
 {
gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
struct bpf_cpu_map_entry *rcpu;
-   int numa, err;
+   struct xdp_bulk_queue *bq;
+   int numa, err, i;
 
/* Have map->numa_node, but choose node of redirect target CPU */
numa = cpu_to_node(cpu);
@@ -349,6 +352,11 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 
qsize, u32 cpu,
if (!rcpu->bulkq)
goto free_rcu;
 
+   for_each_possible_cpu(i) {
+   bq = per_cpu_ptr(rcpu->bulkq, i);
+   bq->obj = rcpu;
+   }
+
/* Al

[PATCH bpf-next v6 0/5] xdp: Allow lookup into devmaps before redirect

2019-06-28 Thread Toke Høiland-Jørgensen
When using the bpf_redirect_map() helper to redirect packets from XDP, the eBPF
program cannot currently know whether the redirect will succeed, which makes it
impossible to gracefully handle errors. To properly fix this will probably
require deeper changes to the way TX resources are allocated, but one thing that
is fairly straight forward to fix is to allow lookups into devmaps, so programs
can at least know when a redirect is *guaranteed* to fail because there is no
entry in the map. Currently, programs work around this by keeping a shadow map
of another type which indicates whether a map index is valid.

This series contains two changes that are complementary ways to fix this issue:

- Moving the map lookup into the bpf_redirect_map() helper (and caching the
  result), so the helper can return an error if no value is found in the map.
  This includes a refactoring of the devmap and cpumap code to not care about
  the index on enqueue.

- Allowing regular lookups into devmaps from eBPF programs, using the read-only
  flag to make sure they don't change the values.

The performance impact of the series is negligible, in the sense that I cannot
measure it because the variance between test runs is higher than the difference
pre/post series.
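
To make the new calling convention concrete: per the v3 changelog below, the
lower two bits of the flags argument now encode the XDP action returned when
the helper's map lookup fails. A minimal sketch (the map name 'tx_port' and
'index' are illustrative):

    /* on a missing devmap entry, fall back to XDP_PASS
     * instead of aborting the packet
     */
    return bpf_redirect_map(&tx_port, index, XDP_PASS);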

Changelog:

v6:
  - Factor out list handling in maps to a helper in list.h (new patch 1)
  - Rename variables in struct bpf_redirect_info (new patch 3 + patch 4)
  - Explain why we are clearing out the map in the info struct on lookup failure
  - Remove unneeded check for forwarding target in tracepoint macro

v5:
  - Rebase on latest bpf-next.
  - Update documentation for bpf_redirect_map() with the new meaning of flags.

v4:
  - Fix a few nits from Andrii
  - Lose the #defines in bpf.h and just compare the flags argument directly to
XDP_TX in bpf_xdp_redirect_map().

v3:
  - Adopt Jonathan's idea of using the lower two bits of the flag value as the
return code.
  - Always do the lookup, and cache the result for use in xdp_do_redirect(); to
    achieve this, refactor the devmap and cpumap code to get rid of the bitmap for
selecting which devices to flush.

v2:
  - For patch 1, make it clear that the change works for any map type.
  - For patch 2, just use the new BPF_F_RDONLY_PROG flag to make the return
value read-only.

---

Toke Høiland-Jørgensen (5):
  xskmap: Move non-standard list manipulation to helper
  devmap/cpumap: Use flush list instead of bitmap
  devmap: Rename ifindex member in bpf_redirect_info
  bpf_xdp_redirect_map: Perform map lookup in eBPF helper
  devmap: Allow map lookups from eBPF


 include/linux/filter.h |3 +
 include/linux/list.h   |   14 ++
 include/trace/events/xdp.h |5 +-
 include/uapi/linux/bpf.h   |7 ++-
 kernel/bpf/cpumap.c|  105 +++--
 kernel/bpf/devmap.c|  112 
 kernel/bpf/verifier.c  |7 +--
 kernel/bpf/xskmap.c|3 -
 net/core/filter.c  |   60 
 9 files changed, 157 insertions(+), 159 deletions(-)



[PATCH bpf-next v6 1/5] xskmap: Move non-standard list manipulation to helper

2019-06-28 Thread Toke Høiland-Jørgensen
From: Toke Høiland-Jørgensen 

Add a helper in list.h for the non-standard way of clearing a list that is
used in xskmap. This makes it easier to reuse it in the other map types,
and also makes sure this usage is not forgotten in any list refactorings in
the future.

Signed-off-by: Toke Høiland-Jørgensen 
---
 include/linux/list.h |   14 ++
 kernel/bpf/xskmap.c  |3 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index e951228db4b2..85c92555e31f 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -106,6 +106,20 @@ static inline void __list_del(struct list_head * prev, struct list_head * next)
WRITE_ONCE(prev->next, next);
 }
 
+/*
+ * Delete a list entry and clear the 'prev' pointer.
+ *
+ * This is a special-purpose list clearing method used in the networking code
+ * for lists allocated as per-cpu, where we don't want to incur the extra
+ * WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this
+ * needs to check the node 'prev' pointer instead of calling list_empty().
+ */
+static inline void __list_del_clearprev(struct list_head *entry)
+{
+   __list_del(entry->prev, entry->next);
+   entry->prev = NULL;
+}
+
 /**
  * list_del - deletes entry from list.
  * @entry: the element to delete from the list.
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index ef7338cebd18..9bb96ace9fa1 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -145,8 +145,7 @@ void __xsk_map_flush(struct bpf_map *map)
 
list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
xsk_flush(xs);
-   __list_del(xs->flush_node.prev, xs->flush_node.next);
-   xs->flush_node.prev = NULL;
+   __list_del_clearprev(&xs->flush_node);
}
 }
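
For context, patch 2 converts devmap/cpumap to the same pattern; a condensed
sketch of how a consumer pairs the 'prev'-pointer check with this helper
(names follow the series):

    /* enqueue: the node is only on the flush list if 'prev' is set */
    if (!bq->flush_node.prev)
        list_add(&bq->flush_node, flush_list);

    /* flush: delete the node and mark it off-list again */
    __list_del_clearprev(&bq->flush_node);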
 



[PATCH bpf-next v6 3/5] devmap: Rename ifindex member in bpf_redirect_info

2019-06-28 Thread Toke Høiland-Jørgensen
From: Toke Høiland-Jørgensen 

The bpf_redirect_info struct has an 'ifindex' member which was named back
when the redirects could only target egress interfaces. Now that we can
also redirect to sockets and CPUs, this is a bit misleading, so rename the
member to tgt_index.

Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next
to each other in a subsequent patch.

Signed-off-by: Toke Høiland-Jørgensen 
---
 include/linux/filter.h |2 +-
 net/core/filter.c  |   26 +-
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 43b45d6db36d..af8601340dae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -578,8 +578,8 @@ struct bpf_skb_data_end {
 };
 
 struct bpf_redirect_info {
-   u32 ifindex;
u32 flags;
+   u32 tgt_index;
struct bpf_map *map;
struct bpf_map *map_to_flush;
u32 kern_flags;
diff --git a/net/core/filter.c b/net/core/filter.c
index 183bf4d8e301..1d7f5f7f32cf 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2158,8 +2158,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
 
-   ri->ifindex = ifindex;
ri->flags = flags;
+   ri->tgt_index = ifindex;
 
return TC_ACT_REDIRECT;
 }
@@ -2169,8 +2169,8 @@ int skb_do_redirect(struct sk_buff *skb)
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
-   dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
-   ri->ifindex = 0;
+   dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index);
+   ri->tgt_index = 0;
if (unlikely(!dev)) {
kfree_skb(skb);
return -EINVAL;
@@ -3488,11 +3488,11 @@ xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp,
 struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri)
 {
struct net_device *fwd;
-   u32 index = ri->ifindex;
+   u32 index = ri->tgt_index;
int err;
 
fwd = dev_get_by_index_rcu(dev_net(dev), index);
-   ri->ifindex = 0;
+   ri->tgt_index = 0;
if (unlikely(!fwd)) {
err = -EINVAL;
goto err;
@@ -3604,11 +3604,11 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog, struct bpf_map *map,
   struct bpf_redirect_info *ri)
 {
-   u32 index = ri->ifindex;
+   u32 index = ri->tgt_index;
void *fwd = NULL;
int err;
 
-   ri->ifindex = 0;
+   ri->tgt_index = 0;
WRITE_ONCE(ri->map, NULL);
 
fwd = __xdp_map_lookup_elem(map, index);
@@ -3651,11 +3651,11 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
   struct bpf_map *map)
 {
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-   u32 index = ri->ifindex;
+   u32 index = ri->tgt_index;
void *fwd = NULL;
int err = 0;
 
-   ri->ifindex = 0;
+   ri->tgt_index = 0;
WRITE_ONCE(ri->map, NULL);
 
fwd = __xdp_map_lookup_elem(map, index);
@@ -3695,14 +3695,14 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 {
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = READ_ONCE(ri->map);
-   u32 index = ri->ifindex;
+   u32 index = ri->tgt_index;
struct net_device *fwd;
int err = 0;
 
if (map)
return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog,
   map);
-   ri->ifindex = 0;
+   ri->tgt_index = 0;
fwd = dev_get_by_index_rcu(dev_net(dev), index);
if (unlikely(!fwd)) {
err = -EINVAL;
@@ -3730,8 +3730,8 @@ BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
if (unlikely(flags))
return XDP_ABORTED;
 
-   ri->ifindex = ifindex;
ri->flags = flags;
+   ri->tgt_index = ifindex;
WRITE_ONCE(ri->map, NULL);
 
return XDP_REDIRECT;
@@ -3753,8 +3753,8 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
if (unlikely(flags))
return XDP_ABORTED;
 
-   ri->ifindex = ifindex;
ri->flags = flags;
+   ri->tgt_index = ifindex;
WRITE_ONCE(ri->map, map);
 
return XDP_REDIRECT;



[PATCH bpf-next v6 5/5] devmap: Allow map lookups from eBPF

2019-06-28 Thread Toke Høiland-Jørgensen
From: Toke Høiland-Jørgensen 

We don't currently allow lookups into a devmap from eBPF, because the map
lookup returns a pointer directly to the dev->ifindex, which shouldn't be
modifiable from eBPF.

However, being able to do lookups in devmaps is useful to know (e.g.)
whether forwarding to a specific interface is enabled. Currently, programs
work around this by keeping a shadow map of another type which indicates
whether a map index is valid.

Since we now have a flag to make maps read-only from the eBPF side, we can
simply lift the lookup restriction if we make sure this flag is always set.

Signed-off-by: Toke Høiland-Jørgensen 
Acked-by: Jonathan Lemon 
Acked-by: Andrii Nakryiko 
---
 kernel/bpf/devmap.c   |5 +
 kernel/bpf/verifier.c |7 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a4dddc867cbf..d83cf8ccc872 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -89,6 +89,11 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
 
+   /* Lookup returns a pointer straight to dev->ifindex, so make sure the
+* verifier prevents writes from the BPF side
+*/
+   attr->map_flags |= BPF_F_RDONLY_PROG;
+
dtab = kzalloc(sizeof(*dtab), GFP_USER);
if (!dtab)
return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0e079b2298f8..51c327ee7d3f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3407,12 +3407,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (func_id != BPF_FUNC_get_local_storage)
goto error;
break;
-   /* devmap returns a pointer to a live net_device ifindex that we cannot
-* allow to be modified from bpf side. So do not allow lookup elements
-* for now.
-*/
case BPF_MAP_TYPE_DEVMAP:
-   if (func_id != BPF_FUNC_redirect_map)
+   if (func_id != BPF_FUNC_redirect_map &&
+   func_id != BPF_FUNC_map_lookup_elem)
goto error;
break;
/* Restrict bpf side of cpumap and xskmap, open when use-cases



[PATCH 1/4 nf-next v2] netfilter:nf_flow_table: Refactor flow_offload_tuple to destination

2019-06-28 Thread wenxu
From: wenxu 

Add struct flow_offload_dst to support more offload methods, replacing
dst_cache, which only works for route offload.

Signed-off-by: wenxu 
---
 include/net/netfilter/nf_flow_table.h | 12 ++--
 net/netfilter/nf_flow_table_core.c| 22 +++---
 net/netfilter/nf_flow_table_ip.c  |  4 ++--
 net/netfilter/nft_flow_offload.c  | 12 ++--
 4 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index d8c1879..d40d409 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -33,6 +33,10 @@ enum flow_offload_tuple_dir {
FLOW_OFFLOAD_DIR_MAX = IP_CT_DIR_MAX
 };
 
+struct flow_offload_dst {
+   struct dst_entry*dst_cache;
+};
+
 struct flow_offload_tuple {
union {
struct in_addr  src_v4;
@@ -55,7 +59,7 @@ struct flow_offload_tuple {
 
u16 mtu;
 
-   struct dst_entry*dst_cache;
+   struct flow_offload_dst dst;
 };
 
 struct flow_offload_tuple_rhash {
@@ -85,8 +89,12 @@ struct nf_flow_route {
} tuple[FLOW_OFFLOAD_DIR_MAX];
 };
 
+struct nf_flow_dst {
+   struct nf_flow_route route;
+};
+
 struct flow_offload *flow_offload_alloc(struct nf_conn *ct,
-   struct nf_flow_route *route);
+   struct nf_flow_dst *flow_dst);
 void flow_offload_free(struct flow_offload *flow);
 
 int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload 
*flow);
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index e3d7972..7e0b5bd 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -24,13 +24,13 @@ struct flow_offload_entry {
 
 static void
 flow_offload_fill_dir(struct flow_offload *flow, struct nf_conn *ct,
- struct nf_flow_route *route,
+ struct nf_flow_dst *flow_dst,
  enum flow_offload_tuple_dir dir)
 {
struct flow_offload_tuple *ft = &flow->tuplehash[dir].tuple;
struct nf_conntrack_tuple *ctt = &ct->tuplehash[dir].tuple;
-   struct dst_entry *other_dst = route->tuple[!dir].dst;
-   struct dst_entry *dst = route->tuple[dir].dst;
+   struct dst_entry *other_dst = flow_dst->route.tuple[!dir].dst;
+   struct dst_entry *dst = flow_dst->route.tuple[dir].dst;
 
ft->dir = dir;
 
@@ -57,7 +57,7 @@ struct flow_offload_entry {
 }
 
 struct flow_offload *
-flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route *route)
+flow_offload_alloc(struct nf_conn *ct, struct nf_flow_dst *flow_dst)
 {
struct flow_offload_entry *entry;
struct flow_offload *flow;
@@ -72,16 +72,16 @@ struct flow_offload *
 
flow = &entry->flow;
 
-   if (!dst_hold_safe(route->tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst))
-   if (!dst_hold_safe(route->tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst))
goto err_dst_cache_original;
 
-   if (!dst_hold_safe(route->tuple[FLOW_OFFLOAD_DIR_REPLY].dst))
+   if (!dst_hold_safe(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_REPLY].dst))
goto err_dst_cache_reply;
 
entry->ct = ct;
 
-   flow_offload_fill_dir(flow, ct, route, FLOW_OFFLOAD_DIR_ORIGINAL);
-   flow_offload_fill_dir(flow, ct, route, FLOW_OFFLOAD_DIR_REPLY);
+   flow_offload_fill_dir(flow, ct, flow_dst, FLOW_OFFLOAD_DIR_ORIGINAL);
+   flow_offload_fill_dir(flow, ct, flow_dst, FLOW_OFFLOAD_DIR_REPLY);
 
if (ct->status & IPS_SRC_NAT)
flow->flags |= FLOW_OFFLOAD_SNAT;
@@ -91,7 +91,7 @@ struct flow_offload *
return flow;
 
 err_dst_cache_reply:
-   dst_release(route->tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst);
+   dst_release(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst);
 err_dst_cache_original:
kfree(entry);
 err_ct_refcnt:
@@ -139,8 +139,8 @@ void flow_offload_free(struct flow_offload *flow)
 {
struct flow_offload_entry *e;
 
-   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_cache);
-   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_cache);
+   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst.dst_cache);
+   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst.dst_cache);
e = container_of(flow, struct flow_offload_entry, flow);
if (flow->flags & FLOW_OFFLOAD_DYING)
nf_ct_delete(e->ct, 0, 0);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 2413174..0016bb8 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -241,7 +241,7 @@ static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 
dir = tuplehash->tuple.dir;
flow = container_of(tuplehash, struct flow_

[PATCH 2/4 nf-next v2] netfilter:nf_flow_table: Separate inet operation to single function

2019-06-28 Thread wenxu
From: wenxu 

This patch separates the inet family operations into dedicated functions,
preparing for bridge family support.

Signed-off-by: wenxu 
---
 net/netfilter/nf_flow_table_core.c | 52 --
 net/netfilter/nf_flow_table_ip.c   | 34 +++--
 net/netfilter/nft_flow_offload.c   |  2 +-
 3 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 7e0b5bd..2bec409 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -22,6 +22,20 @@ struct flow_offload_entry {
 static DEFINE_MUTEX(flowtable_lock);
 static LIST_HEAD(flowtables);
 
+static struct dst_entry *
+flow_offload_fill_inet_dst(struct flow_offload_tuple *ft,
+  struct nf_flow_route *route,
+  enum flow_offload_tuple_dir dir)
+{
+   struct dst_entry *other_dst = route->tuple[!dir].dst;
+   struct dst_entry *dst = route->tuple[dir].dst;
+
+   ft->iifidx = other_dst->dev->ifindex;
+   ft->dst.dst_cache = dst;
+
+   return dst;
+}
+
 static void
 flow_offload_fill_dir(struct flow_offload *flow, struct nf_conn *ct,
  struct nf_flow_dst *flow_dst,
@@ -29,9 +43,9 @@ struct flow_offload_entry {
 {
struct flow_offload_tuple *ft = &flow->tuplehash[dir].tuple;
struct nf_conntrack_tuple *ctt = &ct->tuplehash[dir].tuple;
-   struct dst_entry *other_dst = flow_dst->route.tuple[!dir].dst;
-   struct dst_entry *dst = flow_dst->route.tuple[dir].dst;
+   struct dst_entry *dst;
 
+   dst = flow_offload_fill_inet_dst(ft, &flow_dst->route, dir);
ft->dir = dir;
 
switch (ctt->src.l3num) {
@@ -51,9 +65,19 @@ struct flow_offload_entry {
ft->l4proto = ctt->dst.protonum;
ft->src_port = ctt->src.u.tcp.port;
ft->dst_port = ctt->dst.u.tcp.port;
+}
 
-   ft->iifidx = other_dst->dev->ifindex;
-   ft->dst_cache = dst;
+static int flow_offload_dst_hold(struct nf_flow_dst *flow_dst)
+{
+   if (!dst_hold_safe(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst))
+   return -1;
+
+   if (!dst_hold_safe(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_REPLY].dst)) {
+   dst_release(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst);
+   return -1;
+   }
+
+   return 0;
 }
 
 struct flow_offload *
@@ -72,11 +96,8 @@ struct flow_offload *
 
flow = &entry->flow;
 
-   if (!dst_hold_safe(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst))
-   goto err_dst_cache_original;
-
-   if (!dst_hold_safe(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_REPLY].dst))
-   goto err_dst_cache_reply;
+   if (flow_offload_dst_hold(flow_dst))
+   goto err_dst_cache;
 
entry->ct = ct;
 
@@ -90,9 +111,7 @@ struct flow_offload *
 
return flow;
 
-err_dst_cache_reply:
-   dst_release(flow_dst->route.tuple[FLOW_OFFLOAD_DIR_ORIGINAL].dst);
-err_dst_cache_original:
+err_dst_cache:
kfree(entry);
 err_ct_refcnt:
nf_ct_put(ct);
@@ -135,12 +154,17 @@ static void flow_offload_fixup_ct_state(struct nf_conn *ct)
ct->timeout = nfct_time_stamp + timeout;
 }
 
+static void flow_offload_dst_release(struct flow_offload *flow)
+{
+   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst.dst_cache);
+   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst.dst_cache);
+}
+
 void flow_offload_free(struct flow_offload *flow)
 {
struct flow_offload_entry *e;
 
-   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst.dst_cache);
-   dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst.dst_cache);
+   flow_offload_dst_release(flow);
e = container_of(flow, struct flow_offload_entry, flow);
if (flow->flags & FLOW_OFFLOAD_DYING)
nf_ct_delete(e->ct, 0, 0);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 0016bb8..24263e2 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -214,6 +214,25 @@ static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
return true;
 }
 
+static void nf_flow_inet_xmit(struct flow_offload *flow, struct sk_buff *skb,
+ enum flow_offload_tuple_dir dir)
+{
+   struct net_device *outdev;
+   struct rtable *rt;
+   struct iphdr *iph;
+   __be32 nexthop;
+
+   rt = (struct rtable *)flow->tuplehash[dir].tuple.dst.dst_cache;
+   outdev = rt->dst.dev;
+   iph = ip_hdr(skb);
+   ip_decrease_ttl(iph);
+
+   skb->dev = outdev;
+   nexthop = rt_nexthop(rt, flow->tuplehash[!dir].tuple.src_v4.s_addr);
+   skb_dst_set_noref(skb, &rt->dst);
+   neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
+}
+
 unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,

[PATCH 3/4 nf-next v2] netfilter:nf_flow_table: Support bridge family flow offload

2019-06-28 Thread wenxu
From: wenxu 

With the nf_conntrack_bridge function, the bridge family can do
conntrack itself. The flow offload function is based on conntrack,
so bridge flows tracked by conntrack can be offloaded.

Signed-off-by: wenxu 
---
 include/net/netfilter/nf_flow_table.h |  31 +++-
 net/bridge/br_vlan.c  |   1 +
 net/netfilter/nf_flow_table_core.c|  58 ---
 net/netfilter/nf_flow_table_ip.c  |  43 +++-
 net/netfilter/nft_flow_offload.c  | 129 --
 5 files changed, 243 insertions(+), 19 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index d40d409..dcf197a 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -33,8 +33,23 @@ enum flow_offload_tuple_dir {
FLOW_OFFLOAD_DIR_MAX = IP_CT_DIR_MAX
 };
 
+enum flow_offload_tuple_type {
+   FLOW_OFFLOAD_TYPE_INET,
+   FLOW_OFFLOAD_TYPE_BRIDGE,
+};
+
+struct dst_br_port {
+   struct net_device *dev;
+   u16 dst_vlan_tag;
+   u16 vlan_proto;
+};
+
 struct flow_offload_dst {
-   struct dst_entry*dst_cache;
+   enum flow_offload_tuple_type type;
+   union {
+   struct dst_entry*dst_cache;
+   struct dst_br_port  dst_port;
+   };
 };
 
 struct flow_offload_tuple {
@@ -52,6 +67,7 @@ struct flow_offload_tuple {
};
 
int iifidx;
+   u16 vlan_tag;
 
u8  l3proto;
u8  l4proto;
@@ -89,8 +105,19 @@ struct nf_flow_route {
} tuple[FLOW_OFFLOAD_DIR_MAX];
 };
 
+struct nf_flow_forward {
+   struct {
+   struct dst_br_port  dst_port;
+   u16 vlan_tag;
+   } tuple[FLOW_OFFLOAD_DIR_MAX];
+};
+
 struct nf_flow_dst {
-   struct nf_flow_route route;
+   enum flow_offload_tuple_type type;
+   union {
+   struct nf_flow_route route;
+   struct nf_flow_forward forward;
+   };
 };
 
 struct flow_offload *flow_offload_alloc(struct nf_conn *ct,
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index f47f526..bcb7ce4 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -729,6 +729,7 @@ struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid)
 
return br_vlan_lookup(&vg->vlan_hash, vid);
 }
+EXPORT_SYMBOL_GPL(br_vlan_find);
 
 /* Must be protected by RTNL. */
 static void recalculate_group_addr(struct net_bridge *br)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 2bec409..08c1ca4 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -36,6 +36,21 @@ struct flow_offload_entry {
return dst;
 }
 
+static struct dst_br_port *
+flow_offload_fill_bridge_dst(struct flow_offload_tuple *ft,
+struct nf_flow_forward *forward,
+enum flow_offload_tuple_dir dir)
+{
+   struct dst_br_port other_dst_port = forward->tuple[!dir].dst_port;
+   struct dst_br_port dst_port = forward->tuple[dir].dst_port;
+
+   ft->iifidx = other_dst_port.dev->ifindex;
+   ft->dst.dst_port = dst_port;
+   ft->vlan_tag = forward->tuple[dir].vlan_tag;
+
+   return &ft->dst.dst_port;
+}
+
 static void
 flow_offload_fill_dir(struct flow_offload *flow, struct nf_conn *ct,
  struct nf_flow_dst *flow_dst,
@@ -43,16 +58,29 @@ struct flow_offload_entry {
 {
struct flow_offload_tuple *ft = &flow->tuplehash[dir].tuple;
struct nf_conntrack_tuple *ctt = &ct->tuplehash[dir].tuple;
+   struct dst_br_port *dst_port;
struct dst_entry *dst;
 
-   dst = flow_offload_fill_inet_dst(ft, &flow_dst->route, dir);
+   switch (flow_dst->type) {
+   case FLOW_OFFLOAD_TYPE_INET:
+   dst = flow_offload_fill_inet_dst(ft, &flow_dst->route, dir);
+   break;
+   case FLOW_OFFLOAD_TYPE_BRIDGE:
+   dst_port = flow_offload_fill_bridge_dst(ft, &flow_dst->forward, dir);
+   break;
+   }
+
+   ft->dst.type = flow_dst->type;
ft->dir = dir;
 
switch (ctt->src.l3num) {
case NFPROTO_IPV4:
ft->src_v4 = ctt->src.u3.in;
ft->dst_v4 = ctt->dst.u3.in;
-   ft->mtu = ip_dst_mtu_maybe_forward(dst, true);
+   if (flow_dst->type == FLOW_OFFLOAD_TYPE_INET)
+   ft->mtu = ip_dst_mtu_maybe_forward(dst, true);
+   else
+   ft->mtu = dst_port->dev->mtu;
break;
case NFPROTO_IPV6:
ft->src_v6 = ctt->src.u3.in6;
@@ -67,13 +95,13 @@ struct flow_offload_entry {
ft->dst_port = ctt->dst.u.tcp.port;
 }
 
-static int flow_offload_dst_hold(struct nf_flow_dst *flow_dst)
+sta

[PATCH 4/4 nf-next v2] netfilter: Flow table support for the bridge family

2019-06-28 Thread wenxu
From: wenxu 

This patch adds the bridge flow table type, that implements the datapath
flow table to forward IPv4 traffic through bridge.

Signed-off-by: wenxu 
---
 net/bridge/netfilter/Kconfig|  8 +
 net/bridge/netfilter/Makefile   |  1 +
 net/bridge/netfilter/nf_flow_table_bridge.c | 46 +
 3 files changed, 55 insertions(+)
 create mode 100644 net/bridge/netfilter/nf_flow_table_bridge.c

diff --git a/net/bridge/netfilter/Kconfig b/net/bridge/netfilter/Kconfig
index f4fb0b9..cba5f71 100644
--- a/net/bridge/netfilter/Kconfig
+++ b/net/bridge/netfilter/Kconfig
@@ -33,6 +33,14 @@ config NF_CONNTRACK_BRIDGE
 
  To compile it as a module, choose M here.  If unsure, say N.
 
+config NF_FLOW_TABLE_BRIDGE
+   tristate "Netfilter flow table bridge module"
+   depends on NF_FLOW_TABLE && NF_CONNTRACK_BRIDGE
+   help
+     This option adds the flow table bridge support.
+
+     To compile it as a module, choose M here.
+
 endif # NF_TABLES_BRIDGE
 
 menuconfig BRIDGE_NF_EBTABLES
diff --git a/net/bridge/netfilter/Makefile b/net/bridge/netfilter/Makefile
index 9d77673..deb81e6 100644
--- a/net/bridge/netfilter/Makefile
+++ b/net/bridge/netfilter/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_NFT_BRIDGE_REJECT)  += nft_reject_bridge.o
 
 # connection tracking
 obj-$(CONFIG_NF_CONNTRACK_BRIDGE) += nf_conntrack_bridge.o
+obj-$(CONFIG_NF_FLOW_TABLE_BRIDGE) += nf_flow_table_bridge.o
 
 # packet logging
 obj-$(CONFIG_NF_LOG_BRIDGE) += nf_log_bridge.o
diff --git a/net/bridge/netfilter/nf_flow_table_bridge.c b/net/bridge/netfilter/nf_flow_table_bridge.c
new file mode 100644
index 000..3a65b44
--- /dev/null
+++ b/net/bridge/netfilter/nf_flow_table_bridge.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static unsigned int
+nf_flow_offload_bridge_hook(void *priv, struct sk_buff *skb,
+   const struct nf_hook_state *state)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   return nf_flow_offload_ip_hook(priv, skb, state);
+   }
+
+   return NF_ACCEPT;
+}
+
+static struct nf_flowtable_type flowtable_bridge = {
+   .family = NFPROTO_BRIDGE,
+   .init   = nf_flow_table_init,
+   .free   = nf_flow_table_free,
+   .hook   = nf_flow_offload_bridge_hook,
+   .owner  = THIS_MODULE,
+};
+
+static int __init nf_flow_bridge_module_init(void)
+{
+   nft_register_flowtable_type(&flowtable_bridge);
+
+   return 0;
+}
+
+static void __exit nf_flow_bridge_module_exit(void)
+{
+   nft_unregister_flowtable_type(&flowtable_bridge);
+}
+
+module_init(nf_flow_bridge_module_init);
+module_exit(nf_flow_bridge_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("wenxu ");
+MODULE_ALIAS_NF_FLOWTABLE(AF_BRIDGE);
-- 
1.8.3.1



Re: [PATCH 2/3 nf-next] netfilter:nf_flow_table: Support bridge type flow offload

2019-06-28 Thread wenxu


On 6/28/2019 2:06 PM, Florian Westphal wrote:
> wenxu  wrote:
>> ns21 iperf to 10.0.0.8 with dport 22 in ns22
>> first time with OFFLOAD enable
>>
>> nft add flowtable bridge firewall fb2 { hook ingress priority 0 \; devices = { veth21, veth22 } \; }
>> nft add chain bridge firewall ftb-all {type filter hook forward priority 0 \; policy accept \; }
>> nft add rule bridge firewall ftb-all counter ct zone 2 ip protocol tcp flow offload @fb2
>>
>> # iperf -c 10.0.0.8 -p 22 -t 60 -i2
> [..]
>> [  3]  0.0-60.0 sec   353 GBytes  50.5 Gbits/sec
>>
>> The second time without any offload:
>> # iperf -c 10.0.0.8 -p 22 -t 60 -i2
>> [  3]  0.0-60.0 sec   271 GBytes  38.8 Gbits/sec
> Wow, this is pretty impressive.  Do you have numbers without
> offload and no connection tracking?

There is no other connection on the bridge in zone 2

>
> Is this with CONFIG_RETPOLINE=y (just curious)?
Yes, it is enabled.


[PATCH iproute2-next v4 2/2] uapi: update if_link.h

2019-06-28 Thread Denis Kirjanov
update if_link.h to commit 75345f888f700c4ab2448287e35d48c760b202e6
("ipoib: show VF broadcast address")

Signed-off-by: Denis Kirjanov 
---
 include/uapi/linux/if_link.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index bfe7f9e6..5f271d84 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -692,6 +692,7 @@ enum {
IFLA_VF_IB_NODE_GUID,   /* VF Infiniband node GUID */
IFLA_VF_IB_PORT_GUID,   /* VF Infiniband port GUID */
IFLA_VF_VLAN_LIST,  /* nested list of vlans, option for QinQ */
+   IFLA_VF_BROADCAST,  /* VF broadcast */
__IFLA_VF_MAX,
 };
 
@@ -702,6 +703,10 @@ struct ifla_vf_mac {
__u8 mac[32]; /* MAX_ADDR_LEN */
 };
 
+struct ifla_vf_broadcast {
+   __u8 broadcast[32];
+};
+
 struct ifla_vf_vlan {
__u32 vf;
__u32 vlan; /* 0 - 4095, 0 disables VLAN filter */
-- 
2.12.3



[PATCH iproute2-next v4 1/2] ipaddress: correctly print a VF hw address in the IPoIB case

2019-06-28 Thread Denis Kirjanov
Current code assumes that we print an Ethernet MAC, which doesn't
work in the IPoIB case with SR-IOV-enabled hardware

Before:
11: ib1:  mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state
disable,
trust off, query_rss off
...

After:
11: ib1:  mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off

v1->v2: updated kernel headers to uapi commit
v2->v3: fixed alignment
v3->v4: aligned print statements as used through the source

Signed-off-by: Denis Kirjanov 
---
 ip/ipaddress.c | 40 +++-
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index b504200b..741b8667 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -349,9 +350,10 @@ static void print_af_spec(FILE *fp, struct rtattr *af_spec_attr)
 
 static void print_vf_stats64(FILE *fp, struct rtattr *vfstats);
 
-static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
+static void print_vfinfo(struct ifinfomsg *ifi, FILE *fp, struct rtattr 
*vfinfo)
 {
struct ifla_vf_mac *vf_mac;
+   struct ifla_vf_broadcast *vf_broadcast;
struct ifla_vf_tx_rate *vf_tx_rate;
struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 
@@ -365,13 +367,41 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
parse_rtattr_nested(vf, IFLA_VF_MAX, vfinfo);
 
vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
+   vf_broadcast = RTA_DATA(vf[IFLA_VF_BROADCAST]);
vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
 
print_string(PRINT_FP, NULL, "%s", _SL_);
print_int(PRINT_ANY, "vf", "vf %d ", vf_mac->vf);
-   print_string(PRINT_ANY, "mac", "MAC %s",
-ll_addr_n2a((unsigned char *) &vf_mac->mac,
-ETH_ALEN, 0, b1, sizeof(b1)));
+
+   print_string(PRINT_ANY,
+   "link_type",
+   "link/%s ",
+   ll_type_n2a(ifi->ifi_type, b1, sizeof(b1)));
+
+   print_color_string(PRINT_ANY, COLOR_MAC,
+   "address", "%s",
+   ll_addr_n2a((unsigned char *) &vf_mac->mac,
+   ifi->ifi_type == ARPHRD_ETHER ?
+   ETH_ALEN : INFINIBAND_ALEN,
+   ifi->ifi_type,
+   b1, sizeof(b1)));
+
+   if (vf[IFLA_VF_BROADCAST]) {
+   if (ifi->ifi_flags&IFF_POINTOPOINT) {
+   print_string(PRINT_FP, NULL, " peer ", NULL);
+   print_bool(PRINT_JSON,
+   "link_pointtopoint", NULL, true);
+   } else
+   print_string(PRINT_FP, NULL, " brd ", NULL);
+
+   print_color_string(PRINT_ANY, COLOR_MAC,
+   "broadcast", "%s",
+   ll_addr_n2a((unsigned char *) &vf_broadcast->broadcast,
+   ifi->ifi_type == ARPHRD_ETHER ?
+   ETH_ALEN : INFINIBAND_ALEN,
+   ifi->ifi_type,
+   b1, sizeof(b1)));
+   }
 
if (vf[IFLA_VF_VLAN_LIST]) {
struct rtattr *i, *vfvlanlist = vf[IFLA_VF_VLAN_LIST];
@@ -1102,7 +1132,7 @@ int print_linkinfo(struct nlmsghdr *n, void *arg)
open_json_array(PRINT_JSON, "vfinfo_list");
for (i = RTA_DATA(vflist); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
open_json_object(NULL);
-   print_vfinfo(fp, i);
+   print_vfinfo(ifi, fp, i);
close_json_object();
}
close_json_array(PRINT_JSON, NULL);
-- 
2.12.3



Re: [PATCH v2 net-next 0/3] Better PHYLINK compliance for SJA1105 DSA

2019-06-28 Thread Russell King - ARM Linux admin
On Fri, Jun 28, 2019 at 12:46:34AM +0300, Vladimir Oltean wrote:
> After discussing with Russell King, it appears this driver is confusing
> a few things and not performing some checks for consistent operation.
> 
> Changes in v2:
> - Removed redundant print in the phylink_validate callback (in 2/3).
> 
> Vladimir Oltean (3):
>   net: dsa: sja1105: Don't check state->link in phylink_mac_config
>   net: dsa: sja1105: Check for PHY mode mismatches with what PHYLINK
> reports
>   net: dsa: sja1105: Mark in-band AN modes not supported for PHYLINK
> 
>  drivers/net/dsa/sja1105/sja1105_main.c | 56 +-
>  1 file changed, 54 insertions(+), 2 deletions(-)

Thanks.  For the whole series:

Acked-by: Russell King 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH bpf-next v2] libbpf: add xsk_ring_prod__nb_free() function

2019-06-28 Thread Magnus Karlsson
On Wed, Jun 26, 2019 at 10:33 AM Eelco Chaudron  wrote:
>
> When an AF_XDP application receives X packets, it does not mean X
> frames can be stuffed into the producer ring. To make it easier for
> AF_XDP applications this API allows them to check how many frames can
> be added into the ring.
>
> Signed-off-by: Eelco Chaudron 
> ---
>
> v1 -> v2
>  - Renamed xsk_ring_prod__free() to xsk_ring_prod__nb_free()
>  - Add caching so it will only touch global state when needed
>
>  tools/lib/bpf/xsk.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index 82ea71a0f3ec..6acb81102346 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -76,11 +76,11 @@ xsk_ring_cons__rx_desc(const struct xsk_ring_cons *rx, __u32 idx)
> return &descs[idx & rx->mask];
>  }
>
> -static inline __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
> +static inline __u32 xsk_prod__nb_free(struct xsk_ring_prod *r, __u32 nb)
>  {
> __u32 free_entries = r->cached_cons - r->cached_prod;
>
> -   if (free_entries >= nb)
> +   if (free_entries >= nb && nb != 0)
> return free_entries;

Thanks Eelco for the patch. Is the test nb != 0 introduced here so
that the function will continue with the refresh from the global state
when nb is set to 0? If so, could a user not instead just set the nb
parameter to the size of the ring? This would always trigger a
refresh, except when the number of free entries is equal to the size
of the ring, but then we do not need the refresh anyway. This would
eliminate the nb != 0 test that you introduced from the fast path.

/Magnus

> /* Refresh the local tail pointer.
> @@ -110,7 +110,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
>  static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
> size_t nb, __u32 *idx)
>  {
> -   if (xsk_prod_nb_free(prod, nb) < nb)
> +   if (xsk_prod__nb_free(prod, nb) < nb)
> return 0;
>
> *idx = prod->cached_prod;
> --
> 2.20.1
>
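
To make the suggestion concrete, a caller that wants to force a refresh of
the cached consumer pointer could pass the ring size instead of 0 (sketch;
struct xsk_ring_prod carries a 'size' member):

    /* asking for the whole ring forces a refresh of cached_cons,
     * unless the ring is already completely free
     */
    __u32 free_entries = xsk_prod__nb_free(r, r->size);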


Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-06-28 Thread Christian Brauner
On Thu, Jun 27, 2019 at 04:42:18PM -0700, Andy Lutomirski wrote:
> [sigh, I finally set up lore nntp, and I goofed some addresses.  Hi
> Kees and linux-api.]

Love it or hate it but that should probably also Cc linux-security...

> 
> On Thu, Jun 27, 2019 at 4:40 PM Andy Lutomirski  wrote:
> >
> > On 6/27/19 1:19 PM, Song Liu wrote:
> > > This patch introduce unprivileged BPF access. The access control is
> > > achieved via device /dev/bpf. Users with write access to /dev/bpf are able
> > > to call sys_bpf().
> > >
> > > Two ioctl commands are added to /dev/bpf:
> > >
> > > The two commands enable/disable permission to call sys_bpf() for current
> > > task. This permission is noted by bpf_permitted in task_struct. This
> > > permission is inherited during clone(CLONE_THREAD).
> > >
> > > Helper function bpf_capable() is added to check whether the task has got
> > > permission via /dev/bpf.
> > >
> >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 0e079b2298f8..79dc4d641cf3 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -9134,7 +9134,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
> > >   env->insn_aux_data[i].orig_idx = i;
> > >   env->prog = *prog;
> > >   env->ops = bpf_verifier_ops[env->prog->type];
> > > - is_priv = capable(CAP_SYS_ADMIN);
> > > + is_priv = bpf_capable(CAP_SYS_ADMIN);
> >
> > Huh?  This isn't a hardening measure -- the "is_priv" verifier mode
> > allows straight-up leaks of private kernel state to user mode.
> >
> > (For that matter, the pending lockdown stuff should possibly consider
> > this a "confidentiality" issue.)
> >
> >
> > I have a bigger issue with this patch, though: it's a really awkward way
> > to pretend to have capabilities.  For bpf, it seems like you could make
> > this be a *real* capability without too much pain since there's only one
> > syscall there.  Just find a way to pass an fd to /dev/bpf into the
> > syscall.  If this means you need a new bpf_with_cap() syscall that takes
> > an extra argument, so be it.  The old bpf() syscall can just translate
> > to bpf_with_cap(..., -1).
> >
> > For a while, I've considered a scheme I call "implicit rights".  There
> > would be a directory in /dev called /dev/implicit_rights.  This would
> > either be part of devtmpfs or a whole new filesystem -- it would *not*
> > be any other filesystem.  The contents would be files that can't be read
> > or written and exist only in memory.  You create them with a privileged
> > syscall.  Certain actions that are sensitive but not at the level of
> > CAP_SYS_ADMIN (use of large-attack-surface bpf stuff, creation of user
> > namespaces, profiling the kernel, etc) could require an "implicit
> > right".  When you do them, if you don't have CAP_SYS_ADMIN, the kernel
> > would do a path walk for, say, /dev/implicit_rights/bpf and, if the
> > object exists, can be opened, and actually refers to the "bpf" rights
> > object, then the action is allowed.  Otherwise it's denied.
> >
> > This is extensible, and it doesn't require the rather ugly per-task
> > state of whether it's enabled.
> >
> > For things like creation of user namespaces, there's an existing API,
> > and the default is that it works without privilege.  Switching it to an
> > implicit right has the benefit of not requiring code changes to programs
> > that already work as non-root.
> >
> > But, for BPF in particular, this type of compatibility issue doesn't
> > exist now.  You already can't use most eBPF functionality without
> > privilege.  New bpf-using programs meant to run without privilege are
> > *new*, so they can use a new improved API.  So, rather than adding this
> > obnoxious ioctl, just make the API explicit, please.
> >
> > Also, please cc: linux-abi next time.


[PATCH 2/3, net-next] net: page_pool: add helper function for retrieving dma direction

2019-06-28 Thread Ilias Apalodimas
Since the dma direction is stored in the page pool params, offer an API
helper for drivers that choose not to keep track of it locally.

Signed-off-by: Ilias Apalodimas 
---
 include/net/page_pool.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index f07c518ef8a5..ee9c871d2043 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -112,6 +112,15 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
return page_pool_alloc_pages(pool, gfp);
 }
 
+/* get the stored dma direction. A driver might decide to treat this locally and
+ * avoid the extra cache line from page_pool to determine the direction
+ */
+static
+inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
+{
+   return pool->p.dma_dir;
+}
+
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
 void __page_pool_free(struct page_pool *pool);
-- 
2.20.1
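
A driver-side usage sketch for the new helper ('rx_ring', 'dev', 'dma_handle'
and 'pkt_len' are illustrative driver-local names):

    /* sync the received buffer for CPU access using the pool's
     * stored direction instead of tracking it in the driver
     */
    enum dma_data_direction dma_dir;

    dma_dir = page_pool_get_dma_dir(rx_ring->page_pool);
    dma_sync_single_for_cpu(dev, dma_handle, pkt_len, dma_dir);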



[PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread Ilias Apalodimas
Use page_pool and its DMA mapping capabilities for Rx buffers instead
of netdev/napi_alloc_frag()

Although this will result in a slight performance penalty on small-sized
packets (~10%), the use of the API will allow us to easily add XDP
support. The penalty won't be visible in network testing, i.e.
iperf/netperf etc.; it only happens during raw packet drops.
Furthermore we intend to add recycling capabilities to the API
in the future. Once the recycling is added the performance penalty will
go away.
The only 'real' penalty is the slightly increased memory usage, since we
now allocate a page per packet instead of the amount of bytes we need +
skb metadata (the difference is roughly 2kb per packet).
With a minimum of 4GB of RAM on the only SoC that has this NIC, the
extra memory usage is negligible (a bit more on 64K pages)

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/Kconfig  |   1 +
 drivers/net/ethernet/socionext/netsec.c | 121 +++-
 2 files changed, 75 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig
index 25f18be27423..95e99baf3f45 100644
--- a/drivers/net/ethernet/socionext/Kconfig
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -26,6 +26,7 @@ config SNI_NETSEC
tristate "Socionext NETSEC ethernet support"
depends on (ARCH_SYNQUACER || COMPILE_TEST) && OF
select PHYLIB
+   select PAGE_POOL
select MII
---help---
  Enable to add support for the SocioNext NetSec Gigabit Ethernet
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 48fd7448b513..e653b24d0534 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -11,6 +11,7 @@
 #include 
 
 #include 
+#include 
 #include 
 
 #define NETSEC_REG_SOFT_RST0x104
@@ -235,7 +236,8 @@
 #define DESC_NUM   256
 
 #define NETSEC_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#define NETSEC_RX_BUF_SZ 1536
+#define NETSEC_RX_BUF_NON_DATA (NETSEC_SKB_PAD + \
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 
 #define DESC_SZsizeof(struct netsec_de)
 
@@ -258,6 +260,8 @@ struct netsec_desc_ring {
struct netsec_desc *desc;
void *vaddr;
u16 head, tail;
+   struct page_pool *page_pool;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 struct netsec_priv {
@@ -673,33 +677,27 @@ static void netsec_process_tx(struct netsec_priv *priv)
 }
 
 static void *netsec_alloc_rx_data(struct netsec_priv *priv,
- dma_addr_t *dma_handle, u16 *desc_len,
- bool napi)
+ dma_addr_t *dma_handle, u16 *desc_len)
+
 {
-   size_t total_len = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-   size_t payload_len = NETSEC_RX_BUF_SZ;
-   dma_addr_t mapping;
-   void *buf;
 
-   total_len += SKB_DATA_ALIGN(payload_len + NETSEC_SKB_PAD);
+   struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
+   struct page *page;
 
-   buf = napi ? napi_alloc_frag(total_len) : netdev_alloc_frag(total_len);
-   if (!buf)
+   page = page_pool_dev_alloc_pages(dring->page_pool);
+   if (!page)
return NULL;
 
-   mapping = dma_map_single(priv->dev, buf + NETSEC_SKB_PAD, payload_len,
-DMA_FROM_DEVICE);
-   if (unlikely(dma_mapping_error(priv->dev, mapping)))
-   goto err_out;
-
-   *dma_handle = mapping;
-   *desc_len = payload_len;
-
-   return buf;
+   /* page_pool API will map the whole page, skip
+* NET_SKB_PAD + NET_IP_ALIGN for the payload
+*/
+   *dma_handle = page_pool_get_dma_addr(page) + NETSEC_SKB_PAD;
+   /* make sure the incoming payload fits in the page with the needed
+* NET_SKB_PAD + NET_IP_ALIGN + skb_shared_info
+*/
+   *desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
 
-err_out:
-   skb_free_frag(buf);
-   return NULL;
+   return page_address(page);
 }
 
 static void netsec_rx_fill(struct netsec_priv *priv, u16 from, u16 num)
@@ -728,10 +726,10 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
u16 idx = dring->tail;
struct netsec_de *de = dring->vaddr + (DESC_SZ * idx);
struct netsec_desc *desc = &dring->desc[idx];
+   struct page *page = virt_to_page(desc->addr);
u16 pkt_len, desc_len;
dma_addr_t dma_handle;
void *buf_addr;
-   u32 truesize;
 
if (de->attr & (1U << NETSEC_RX_PKT_OWN_FIELD)) {
/* reading the register clears the irq */
@@ -766,8 +764,8 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
/* allocate a fresh buffer and map it to the hardware.
 * T
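
The hunk above is truncated in this archive. For reference, the pool backing
these allocations is created roughly like this (sketch; the exact parameters
in the full patch may differ):

    struct page_pool_params pp_params = {
        .flags     = PP_FLAG_DMA_MAP, /* let page_pool do the DMA mapping */
        .order     = 0,
        .pool_size = DESC_NUM,
        .nid       = NUMA_NO_NODE,
        .dev       = priv->dev,
        .dma_dir   = DMA_FROM_DEVICE,
    };

    dring->page_pool = page_pool_create(&pp_params);
    if (IS_ERR(dring->page_pool)) {
        int err = PTR_ERR(dring->page_pool);

        dring->page_pool = NULL;
        return err;
    }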

[PATCH 0/3, net-next] net: netsec: Add XDP Support

2019-06-28 Thread Ilias Apalodimas
This is a respin of https://www.spinics.net/lists/netdev/msg526066.html
Since the page_pool API fixes are merged into net-next, we can now
safely use its DMA mapping capabilities.

First patch changes the buffer allocation from napi/netdev_alloc_frag()
to page_pool API. Although this will lead to slightly reduced performance 
(on raw packet drops only) we can use the API for XDP buffer recycling. 
Another side effect is a slight increase in memory usage, due to using a 
single page per packet.

The second patch adds XDP support on the driver. 
There's a bunch of interesting options that come up due to the single 
Tx queue.
Locking is needed (to avoid messing up the Tx queues, since ndo_xdp_xmit
and the normal stack can co-exist). We also need to track the
'buffer type' for Tx and properly free or recycle the packet depending
on its nature.


Changes since RFC:
- Bug fixes from Jesper and Maciej
- Added page pool API to retrieve the DMA direction

Ilias Apalodimas (3):
  net: netsec: Use page_pool API
  net: page_pool: add helper function for retrieving dma direction
  net: netsec: add XDP support

 drivers/net/ethernet/socionext/Kconfig  |   1 +
 drivers/net/ethernet/socionext/netsec.c | 469 
 include/net/page_pool.h |   9 +
 3 files changed, 412 insertions(+), 67 deletions(-)

-- 
2.20.1



[PATCH 3/3, net-next] net: netsec: add XDP support

2019-06-28 Thread Ilias Apalodimas
The interface only supports 1 Tx queue, so locking is introduced on
the Tx queue if XDP is enabled to make sure .ndo_start_xmit and
.ndo_xdp_xmit won't corrupt the Tx ring

- Performance (SMMU off)

Benchmark   XDP_SKB XDP_DRV
xdp1291kpps 344kpps
rxdrop  282kpps 342kpps

- Performance (SMMU on)
Benchmark   XDP_SKB XDP_DRV
xdp1167kpps 324kpps
rxdrop  164kpps 323kpps

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 361 ++--
 1 file changed, 334 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index e653b24d0534..d200d47df1b4 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -9,6 +9,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -236,23 +239,41 @@
 #define DESC_NUM   256
 
 #define NETSEC_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#define NETSEC_RX_BUF_NON_DATA (NETSEC_SKB_PAD + \
+#define NETSEC_RXBUF_HEADROOM (max(XDP_PACKET_HEADROOM, NET_SKB_PAD) + \
+  NET_IP_ALIGN)
+#define NETSEC_RX_BUF_NON_DATA (NETSEC_RXBUF_HEADROOM + \
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 
 #define DESC_SZsizeof(struct netsec_de)
 
 #define NETSEC_F_NETSEC_VER_MAJOR_NUM(x)   ((x) & 0x)
 
+#define NETSEC_XDP_PASS  0
+#define NETSEC_XDP_CONSUMED  BIT(0)
+#define NETSEC_XDP_TXBIT(1)
+#define NETSEC_XDP_REDIR BIT(2)
+#define NETSEC_XDP_RX_OK (NETSEC_XDP_PASS | NETSEC_XDP_TX | NETSEC_XDP_REDIR)
+
 enum ring_id {
NETSEC_RING_TX = 0,
NETSEC_RING_RX
 };
 
+enum buf_type {
+   TYPE_NETSEC_SKB = 0,
+   TYPE_NETSEC_XDP_TX,
+   TYPE_NETSEC_XDP_NDO,
+};
+
 struct netsec_desc {
-   struct sk_buff *skb;
+   union {
+   struct sk_buff *skb;
+   struct xdp_frame *xdpf;
+   };
dma_addr_t dma_addr;
void *addr;
u16 len;
+   u8 buf_type;
 };
 
 struct netsec_desc_ring {
@@ -260,13 +281,17 @@ struct netsec_desc_ring {
struct netsec_desc *desc;
void *vaddr;
u16 head, tail;
+   u16 xdp_xmit; /* netsec_xdp_xmit packets */
+   bool is_xdp;
struct page_pool *page_pool;
struct xdp_rxq_info xdp_rxq;
+   spinlock_t lock; /* XDP tx queue locking */
 };
 
 struct netsec_priv {
struct netsec_desc_ring desc_ring[NETSEC_RING_MAX];
struct ethtool_coalesce et_coalesce;
+   struct bpf_prog *xdp_prog;
spinlock_t reglock; /* protect reg access */
struct napi_struct napi;
phy_interface_t phy_interface;
@@ -303,6 +328,11 @@ struct netsec_rx_pkt_info {
bool err_flag;
 };
 
+static void netsec_set_tx_de(struct netsec_priv *priv,
+struct netsec_desc_ring *dring,
+const struct netsec_tx_pkt_ctrl *tx_ctrl,
+const struct netsec_desc *desc, void *buf);
+
 static void netsec_write(struct netsec_priv *priv, u32 reg_addr, u32 val)
 {
writel(val, priv->ioaddr + reg_addr);
@@ -609,6 +639,9 @@ static bool netsec_clean_tx_dring(struct netsec_priv *priv)
int tail = dring->tail;
int cnt = 0;
 
+   if (dring->is_xdp)
+   spin_lock(&dring->lock);
+
pkts = 0;
bytes = 0;
entry = dring->vaddr + DESC_SZ * tail;
@@ -622,13 +655,23 @@ static bool netsec_clean_tx_dring(struct netsec_priv *priv)
eop = (entry->attr >> NETSEC_TX_LAST) & 1;
dma_rmb();
 
-   dma_unmap_single(priv->dev, desc->dma_addr, desc->len,
-DMA_TO_DEVICE);
-   if (eop) {
-   pkts++;
+   if (desc->buf_type == TYPE_NETSEC_SKB)
+   dma_unmap_single(priv->dev, desc->dma_addr, desc->len,
+DMA_TO_DEVICE);
+   else if (desc->buf_type == TYPE_NETSEC_XDP_NDO)
+   dma_unmap_single(priv->dev, desc->dma_addr,
+desc->len, DMA_TO_DEVICE);
+
+   if (!eop)
+   goto next;
+
+   if (desc->buf_type == TYPE_NETSEC_SKB) {
bytes += desc->skb->len;
dev_kfree_skb(desc->skb);
+   } else {
+   xdp_return_frame(desc->xdpf);
}
+next:
/* clean up so netsec_uninit_pkt_dring() won't free the skb
 * again
 */
@@ -645,6 +688,8 @@ static bool netsec_clean_tx_dring(struct netsec_priv *priv)
entry = dring->vaddr + DESC_SZ * tail;
cnt++;
}
+   if (dring->is_xdp)
+   spin_unlock(&dring->lock);
 
if (!cnt)
return false;
@@ -688,1
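
The rest of the Rx path (truncated here) maps each XDP verdict onto the
NETSEC_XDP_* bits defined above, roughly as follows (sketch; the helper name
netsec_xdp_xmit_back() is illustrative):

    switch (act) {
    case XDP_PASS:
        ret = NETSEC_XDP_PASS;
        break;
    case XDP_TX:
        /* returns NETSEC_XDP_TX on success, NETSEC_XDP_CONSUMED on failure */
        ret = netsec_xdp_xmit_back(priv, xdp);
        break;
    case XDP_REDIRECT:
        if (!xdp_do_redirect(priv->ndev, xdp, prog))
            ret = NETSEC_XDP_REDIR;
        else
            ret = NETSEC_XDP_CONSUMED;
        break;
    default:
        bpf_warn_invalid_xdp_action(act);
        /* fall through */
    case XDP_ABORTED:
        trace_xdp_exception(priv->ndev, prog, act);
        /* fall through -- treat as drop */
    case XDP_DROP:
        ret = NETSEC_XDP_CONSUMED;
        break;
    }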

Re: [PATCH net] sctp: fix error handling on stream scheduler initialization

2019-06-28 Thread Neil Horman
On Thu, Jun 27, 2019 at 07:48:10PM -0300, Marcelo Ricardo Leitner wrote:
> It allocates the extended area for outbound streams only on sendmsg
> calls, if they are not yet allocated.  When using the priority
> stream scheduler, this initialization may imply a subsequent
> allocation, which may fail.  In this case, it was aborting the stream
> scheduler initialization but leaving the ->ext pointer (allocated) in
> there, thus in a partially initialized state.  On a subsequent call to
> sendmsg, it would notice the ->ext pointer in there, and trip on
> uninitialized stuff when trying to schedule the data chunk.
> 
> The fix is to undo the ->ext initialization if the stream scheduler
> initialization fails and avoid the partially initialized state.
> 
> Although syzkaller bisected this to commit 4ff40b86262b ("sctp: set
> chunk transport correctly when it's a new asoc"), this bug was actually
> introduced on the commit I marked below.
> 
> Reported-by: syzbot+c1a380d42b190ad1e...@syzkaller.appspotmail.com
> Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> Tested-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/stream.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index 93ed07877337eace4ef7f4775dda5868359ada37..25946604af85c09917e63e5c4a8d7d6fa2caebc4 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -153,13 +153,20 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
>  int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
>  {
>   struct sctp_stream_out_ext *soute;
> + int ret;
>  
>   soute = kzalloc(sizeof(*soute), GFP_KERNEL);
>   if (!soute)
>   return -ENOMEM;
>   SCTP_SO(stream, sid)->ext = soute;
>  
> - return sctp_sched_init_sid(stream, sid, GFP_KERNEL);
> + ret = sctp_sched_init_sid(stream, sid, GFP_KERNEL);
> + if (ret) {
> + kfree(SCTP_SO(stream, sid)->ext);
> + SCTP_SO(stream, sid)->ext = NULL;
> + }
> +
> + return ret;
>  }
>  
>  void sctp_stream_free(struct sctp_stream *stream)
> -- 
> 2.21.0
> 
> 
Acked-by: Neil Horman 


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
>On Thu, 27 Jun 2019 20:39:48 +0200
>Michal Kubecek  wrote:
>
>> > 
>> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> > # ip link show "Onboard Ethernet"
>> > Device "Onboard Ethernet" does not exist.
>> > 
>> > So it does not really appear to be an alias, it is a label. To be
>> > truly useful, it needs to be more than a label, it needs to be a real
>> > alias which you can use.  
>> 
>> That's exactly what I meant: to be really useful, one should be able to
>> use the alias(es) for setting device options, for adding routes, in
>> netfilter rules etc.
>> 
>> Michal
>
>The kernel doesn't enforce uniqueness of alias.
>Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
>
>If it did, then handling it in iproute would be something like:

I think that it is desirable for the kernel to work with a "real alias" as a
handle. Userspace could pass either an ifindex, IFLA_NAME or a "real alias".
A userspace mapping like you did here might perhaps be okay for iproute2,
but I think that we need something generic and easy to use for all.

Let's call it "altname". Get would return:

IFLA_NAME  eth0
IFLA_ALT_NAME_LIST
   IFLA_ALT_NAME  eth0
   IFLA_ALT_NAME  somethingelse
   IFLA_ALT_NAME  somenamethatisreallylong

then userspace would pass with a request (get/set/del):
IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
or
IFLA_NAME eth0 if it is talking with older kernel

Then following would do exactly the same:
ip link set eth0 addr 11:22:33:44:55:66
ip link set somethingelse addr 11:22:33:44:55:66
ip link set somenamethatisreallylong addr 11:22:33:44:55:66

We would have to figure out the iproute2 iface to add/del altnames:
ip link add eth0 altname somethingelse
ip link del eth0 altname somethingelse
  this might be also:
  ip link del somethingelse altname somethingelse

How does this sound?




Re: [RFC] longer netdev names proposal

2019-06-28 Thread Michal Kubecek
On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
> 
> I think that it is desired for kernel to work with "real alias" as a
> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> Userspace mapping like you did here might be perhaps okay for iproute2,
> but I think that we need something and easy to use for all.
> 
> Let's call it "altname". Get would return:
> 
> IFLA_NAME  eth0
> IFLA_ALT_NAME_LIST
>IFLA_ALT_NAME  eth0
>IFLA_ALT_NAME  somethingelse
>IFLA_ALT_NAME  somenamethatisreallylong
> 
> then userspace would pass with a request (get/set/del):
> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
> or
> IFLA_NAME eth0 if it is talking with older kernel
> 
> Then following would do exactly the same:
> ip link set eth0 addr 11:22:33:44:55:66
> ip link set somethingelse addr 11:22:33:44:55:66
> ip link set somenamethatisreallylong addr 11:22:33:44:55:66

Yes, this sounds nice.

> We would have to figure out the iproute2 iface to add/del altnames:
> ip link add eth0 altname somethingelse
> ip link del eth0 altname somethingelse
>   this might be also:
>   ip link del somethingelse altname somethingelse

This would be a bit confusing, IMHO, as so far

  ip link add $name ...

always means we want to add or delete a new device $name, which would not
be the case here. How about the other way around:

  ip link add somethingelse altname_for eth0

(preferably with a better keyword than "altname_for" :-) ). Or maybe

  ip altname add somethingelse dev eth0
  ip altname del somethingelse dev eth0

Michal


Re: [RFC PATCH 1/1] Documentation: net: dsa: b53: Describe b53 configuration

2019-06-28 Thread Kurt Kanzenbach
Hi,

On Thu, Jun 27, 2019 at 09:38:16AM -0700, Florian Fainelli wrote:
> On 6/27/19 3:15 AM, Benedikt Spranger wrote:
> > Document the different needs of documentation for the b53 driver.
> >
> > Signed-off-by: Benedikt Spranger 
> > ---
> >  Documentation/networking/dsa/b53.rst | 300 +++
> >  1 file changed, 300 insertions(+)
> >  create mode 100644 Documentation/networking/dsa/b53.rst
> >
> > diff --git a/Documentation/networking/dsa/b53.rst 
> > b/Documentation/networking/dsa/b53.rst
> > new file mode 100644
> > index ..5838cf6230da
> > --- /dev/null
> > +++ b/Documentation/networking/dsa/b53.rst
> > @@ -0,0 +1,300 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==
> > +Broadcom RoboSwitch Ethernet switch driver
> > +==
> > +
> > +The Broadcom RoboSwitch Ethernet switch family is used in quite a range of
> > +xDSL router, cable modems and other multimedia devices.
> > +
> > +The actual implementation supports the devices BCM5325E, BCM5365, BCM539x,
> > +BCM53115 and BCM53125 as well as BCM63XX.
> > +
> > +Implementation details
> > +==
> > +
> > +The driver is located in ``drivers/net/dsa/bcm_sf2.c`` and is implemented as a
> > +DSA driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the
> > +subsystem and what it provides.
>
> The driver is under drivers/net/dsa/b53/
>
> s/ethernet/Ethernet/ for your global submission.
>
> What you are describing is not entirely specific to the B53 driver
> (maybe with the exception of having a VLAN tag on the DSA master network
> device, since B53 puts the CPU port tagged in all VLANs by default), and
> therefore the entire document should be written with the general DSA
> devices in mind, and eventually pointing out where B53 differs in a
> separate document.

That's true. It makes sense to split the documentation into a generic
part and a b53-specific part.

Thanks a lot, Bene, for doing this. It's really helpful.

Thanks,
Kurt




Re: XDP multi-buffer incl. jumbo-frames (Was: [RFC V1 net-next 1/1] net: ena: implement XDP drop support)

2019-06-28 Thread Eelco Chaudron




On 28 Jun 2019, at 9:46, Toke Høiland-Jørgensen wrote:

> "Eelco Chaudron"  writes:
>
>> On 26 Jun 2019, at 10:38, Jesper Dangaard Brouer wrote:
>>
>>> On Tue, 25 Jun 2019 03:19:22 +
>>> "Machulsky, Zorik"  wrote:
>>>
>>>> On 6/23/19, 7:21 AM, "Jesper Dangaard Brouer"  wrote:
>>>>
>>>>     On Sun, 23 Jun 2019 10:06:49 +0300  wrote:
>>>>
>>>>     > This commit implements the basic functionality of drop/pass logic
>>>>     > in the ena driver.
>>>>
>>>>     Usually we require a driver to implement all the XDP return codes,
>>>>     before we accept it.  But as Daniel and I discussed with Zorik during
>>>>     NetConf[1], we are going to make an exception and accept the driver
>>>>     if you also implement XDP_TX.
>>>>
>>>>     As we trust that Zorik/Amazon will follow and implement XDP_REDIRECT
>>>>     later, given he/you wants AF_XDP support which requires XDP_REDIRECT.
>>>>
>>>> Jesper, thanks for your comments and very helpful discussion during
>>>> NetConf! That's the plan, as we agreed. From our side I would like to
>>>> reiterate again the importance of multi-buffer support by xdp frame.
>>>> We would really prefer not to see our MTU shrinking because of xdp
>>>> support.
>>>
>>> Okay we really need to make a serious attempt to find a way to support
>>> multi-buffer packets with XDP. With the important criteria of not
>>> hurting performance of the single-buffer per packet design.
>>>
>>> I've created a design document[2], that I will update based on our
>>> discussions: [2]
>>> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
>>>
>>> The use-case that really convinced me was Eric's packet header-split.
>>>
>>> Lets refresh: Why XDP don't have multi-buffer support:
>>>
>>> XDP is designed for maximum performance, which is why certain
>>> driver-level use-cases were not supported, like multi-buffer packets
>>> (like jumbo-frames).  As it e.g. complicated the driver RX-loop and
>>> memory model handling.
>>>
>>> The single buffer per packet design, is also tied into eBPF
>>> Direct-Access (DA) to packet data, which can only be allowed if the
>>> packet memory is in contiguous memory.  This DA feature is essential
>>> for XDP performance.
>>>
>>> One way forward is to define that XDP only get access to the first
>>> packet buffer, and it cannot see subsequent buffers.  For XDP_TX and
>>> XDP_REDIRECT to work then XDP still need to carry pointers (plus
>>> len+offset) to the other buffers, which is 16 bytes per extra buffer.
>>
>> I’ve seen various network processor HW designs, and they normally get
>> the first x bytes (128 - 512) which they can manipulate
>> (append/prepend/insert/modify/delete).
>>
>> There are designs where they can “page in” the additional fragments
>> but it’s expensive as it requires additional memory transfers. But the
>> majority do not care (cannot change) the remaining fragments. Can also
>> not think of a reason why you might want to remove something at the
>> end of the frame (thinking about routing/forwarding needs here).
>>
>> If we do want XDP to access other fragments we could do this through
>> a helper which swaps the packet context?
>
> Yeah, I was also going to suggest a helper for that. It doesn't
> necessarily need to swap the packet context, it could just return a new
> pointer?

Yes that will work, my head was still thinking ASICs where there is
limited SRAM space…
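
To make that concrete, usage from an XDP program could look roughly like
this (purely hypothetical sketch: bpf_xdp_get_frag() is an assumed helper
name, nothing like it exists today):

	__u32 frag_len = 0;
	void *frag;

	/* ask for fragment 1, i.e. the second buffer of the frame */
	frag = bpf_xdp_get_frag(ctx, 1, &frag_len);
	if (!frag)
		return XDP_PASS;	/* single-buffer frame */

	/* inspect up to frag_len bytes of the fragment here */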


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Fri, Jun 28, 2019 at 01:42:12PM CEST, mkube...@suse.cz wrote:
>On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
>> 
>> I think that it is desired for kernel to work with "real alias" as a
>> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> Userspace mapping like you did here might be perhaps okay for iproute2,
>> but I think that we need something that is easy to use for all.
>> 
>> Let's call it "altname". Get would return:
>> 
>> IFLA_NAME  eth0
>> IFLA_ALT_NAME_LIST
>>IFLA_ALT_NAME  eth0
>>IFLA_ALT_NAME  somethingelse
>>IFLA_ALT_NAME  somenamethatisreallylong
>> 
>> then userspace would pass with a request (get/set/del):
>> IFLA_ALT_NAME eth0/somethingelse/somenamethatisreallylong
>> or
>> IFLA_NAME eth0 if it is talking to an older kernel
>> 
>> Then following would do exactly the same:
>> ip link set eth0 addr 11:22:33:44:55:66
>> ip link set somethingelse addr 11:22:33:44:55:66
>> ip link set somenamethatisreallylong addr 11:22:33:44:55:66
>
>Yes, this sounds nice.
>
>> We would have to figure out the iproute2 iface to add/del altnames:
>> ip link add eth0 altname somethingelse
>> ip link del eth0 altname somethingelse
>>   this might also be:
>>   ip link del somethingelse altname somethingelse
>
>This would be a bit confusing, IMHO, as so far
>
>  ip link add $name ...
>
>always means we want to add or delete new device $name which would not
>be the case here. How about the other way around:
>
>  ip link add somethingelse altname_for eth0
>
>(preferably with a better keyword than "altname_for" :-) ). Or maybe
>
>  ip altname add somethingelse dev eth0
>  ip altname del somethingelse dev eth0

Yeah, I like this.

Let's see how it will work during the implementation.



Re: What to do when a bridge port gets its pvid deleted?

2019-06-28 Thread Ido Schimmel
On Tue, Jun 25, 2019 at 11:49:29PM +0300, Vladimir Oltean wrote:
> A number of DSA drivers (BCM53XX, Microchip KSZ94XX, Mediatek MT7530
> at the very least), as well as Mellanox Spectrum (I didn't look at all
> the pure switchdev drivers) try to restore the pvid to a default value
> on .port_vlan_del.

I don't know about DSA drivers, but that's not what mlxsw is doing. If
the VLAN that is configured as PVID is deleted from the bridge port, the
driver instructs the port to discard untagged and prio-tagged packets.
This is consistent with the bridge driver's behavior.

We do have a flow that "restores" the PVID, but that's when a port is
unlinked from its bridge master. The PVID we set is 4095 which cannot be
configured by the 8021q / bridge driver. This is due to the way the
underlying hardware works. Even if a port is not bridged and used purely
for routing, packets still do L2 lookup first which sends them directly
to the router block. If PVID is not configured, untagged packets could
not be routed. Obviously, at egress we strip this VLAN.

> Sure, the port stops receiving traffic when its pvid is a VLAN ID that
> is not installed in its hw filter, but as far as the bridge core is
> concerned, this is to be expected:
> 
> # bridge vlan add dev swp2 vid 100 pvid untagged
> # bridge vlan
> port    vlan ids
> swp5 1 PVID Egress Untagged
> 
> swp2 1 Egress Untagged
>  100 PVID Egress Untagged
> 
> swp3 1 PVID Egress Untagged
> 
> swp4 1 PVID Egress Untagged
> 
> br0  1 PVID Egress Untagged
> # ping 10.0.0.1
> PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
> 64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.682 ms
> 64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.299 ms
> 64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.251 ms
> 64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.324 ms
> 64 bytes from 10.0.0.1: icmp_seq=5 ttl=64 time=0.257 ms
> ^C
> --- 10.0.0.1 ping statistics ---
> 5 packets transmitted, 5 received, 0% packet loss, time 4188ms
> rtt min/avg/max/mdev = 0.251/0.362/0.682/0.163 ms
> # bridge vlan del dev swp2 vid 100
> # bridge vlan
> port    vlan ids
> swp5 1 PVID Egress Untagged
> 
> swp2 1 Egress Untagged
> 
> swp3 1 PVID Egress Untagged
> 
> swp4 1 PVID Egress Untagged
> 
> br0  1 PVID Egress Untagged
> 
> # ping 10.0.0.1
> PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
> ^C
> --- 10.0.0.1 ping statistics ---
> 8 packets transmitted, 0 received, 100% packet loss, time 7267ms
> 
> What is the consensus here? Is there a reason why the bridge driver
> doesn't take care of this? 

Take care of what? :) Always maintaining a PVID on the bridge port? It's
completely OK not to have a PVID.

> Do switchdev drivers have to restore the pvid to always be
> operational, even if their state becomes inconsistent with the upper
> dev? Is it just 'nice to have'? What if VID 1 isn't in the hw filter
> either (perfectly legal)?

Are you saying that DSA drivers always maintain a PVID on the bridge
port and allow untagged traffic to ingress regardless of the bridge
driver's configuration? If so, I think this needs to be fixed.


Re: What to do when a bridge port gets its pvid deleted?

2019-06-28 Thread Vladimir Oltean
On Fri, 28 Jun 2019 at 15:30, Ido Schimmel  wrote:
>
> On Tue, Jun 25, 2019 at 11:49:29PM +0300, Vladimir Oltean wrote:
> > A number of DSA drivers (BCM53XX, Microchip KSZ94XX, Mediatek MT7530
> > at the very least), as well as Mellanox Spectrum (I didn't look at all
> > the pure switchdev drivers) try to restore the pvid to a default value
> > on .port_vlan_del.
>
> I don't know about DSA drivers, but that's not what mlxsw is doing. If
> the VLAN that is configured as PVID is deleted from the bridge port, the
> driver instructs the port to discard untagged and prio-tagged packets.
> This is consistent with the bridge driver's behavior.
>
> We do have a flow that "restores" the PVID, but that's when a port is
> unlinked from its bridge master. The PVID we set is 4095 which cannot be
> configured by the 8021q / bridge driver. This is due to the way the
> underlying hardware works. Even if a port is not bridged and used purely
> for routing, packets still do L2 lookup first which sends them directly
> to the router block. If PVID is not configured, untagged packets could
> not be routed. Obviously, at egress we strip this VLAN.
>
> > Sure, the port stops receiving traffic when its pvid is a VLAN ID that
> > is not installed in its hw filter, but as far as the bridge core is
> > concerned, this is to be expected:
> >
> > # bridge vlan add dev swp2 vid 100 pvid untagged
> > # bridge vlan
> > port    vlan ids
> > swp5 1 PVID Egress Untagged
> >
> > swp2 1 Egress Untagged
> >  100 PVID Egress Untagged
> >
> > swp3 1 PVID Egress Untagged
> >
> > swp4 1 PVID Egress Untagged
> >
> > br0  1 PVID Egress Untagged
> > # ping 10.0.0.1
> > PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
> > 64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.682 ms
> > 64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.299 ms
> > 64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.251 ms
> > 64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.324 ms
> > 64 bytes from 10.0.0.1: icmp_seq=5 ttl=64 time=0.257 ms
> > ^C
> > --- 10.0.0.1 ping statistics ---
> > 5 packets transmitted, 5 received, 0% packet loss, time 4188ms
> > rtt min/avg/max/mdev = 0.251/0.362/0.682/0.163 ms
> > # bridge vlan del dev swp2 vid 100
> > # bridge vlan
> > port    vlan ids
> > swp5 1 PVID Egress Untagged
> >
> > swp2 1 Egress Untagged
> >
> > swp3 1 PVID Egress Untagged
> >
> > swp4 1 PVID Egress Untagged
> >
> > br0  1 PVID Egress Untagged
> >
> > # ping 10.0.0.1
> > PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
> > ^C
> > --- 10.0.0.1 ping statistics ---
> > 8 packets transmitted, 0 received, 100% packet loss, time 7267ms
> >
> > What is the consensus here? Is there a reason why the bridge driver
> > doesn't take care of this?
>
> Take care of what? :) Always maintaining a PVID on the bridge port? It's
> completely OK not to have a PVID.
>

Yes, I didn't think it through during the first email. I came to the
same conclusion in the second one.

> > Do switchdev drivers have to restore the pvid to always be
> > operational, even if their state becomes inconsistent with the upper
> > dev? Is it just 'nice to have'? What if VID 1 isn't in the hw filter
> > either (perfectly legal)?
>
> Are you saying that DSA drivers always maintain a PVID on the bridge
> port and allow untagged traffic to ingress regardless of the bridge
> driver's configuration? If so, I think this needs to be fixed.

Well, not at the DSA core level.
But for Microchip:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/microchip/ksz9477.c#n576
For Broadcom:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/b53/b53_common.c#n1376
For Mediatek:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/mt7530.c#n1196

There might be others as well.

Thanks,
-Vladimir


Re: [EXT] [PATCH V4] bnx2x: Prevent ptp_task to be rescheduled indefinitely

2019-06-28 Thread Guilherme G. Piccoli
On 28/06/2019 02:22, Sudarsana Reddy Kalluru wrote:
> [...]
> Thanks for the changes.
> 
> Acked-by: Sudarsana Reddy Kalluru 
> 

Thanks a lot Sudarsana!


David, do you think it's still possible to get this merged as a fix in
the 5.2 cycle, or is it going to go into 5.3?

Thanks,

Guilherme


[PATCH 4/4] ipvs: reduce kernel stack usage

2019-06-28 Thread Arnd Bergmann
With the new CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL option, the stack
usage in the ipvs debug output grows because each instance of
IP_VS_DBG_BUF() now has its own 160-byte buffer; these buffers add up
instead of reusing the same stack slots:

net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_sched_persist':
net/netfilter/ipvs/ip_vs_core.c:427:1: error: the frame size of 1052 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_new_conn_out':
net/netfilter/ipvs/ip_vs_core.c:1231:1: error: the frame size of 1048 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
net/netfilter/ipvs/ip_vs_ftp.c: In function 'ip_vs_ftp_out':
net/netfilter/ipvs/ip_vs_ftp.c:397:1: error: the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
net/netfilter/ipvs/ip_vs_ftp.c: In function 'ip_vs_ftp_in':
net/netfilter/ipvs/ip_vs_ftp.c:555:1: error: the frame size of 1200 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

Since printk() already has a way to print IPv4/IPv6 addresses using
the %pIS format string, use that instead, combined with a macro that
creates a local sockaddr structure on the stack. These will still
add up, but the stack frames are now under 200 bytes.
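
As an illustration (hypothetical example, not taken from the patch), the
compound-literal approach boils down to:

	struct sockaddr_in sin = {
		.sin_family = AF_INET,
		.sin_addr   = { .s_addr = htonl(INADDR_LOOPBACK) },
		.sin_port   = htons(8080),	/* %pIS wants network byte order */
	};

	pr_info("endpoint: %pISp\n", &sin);	/* prints "127.0.0.1:8080" */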

Signed-off-by: Arnd Bergmann 
---
I'm not sure this actually does what I think it does. Someone
needs to verify that we correctly print the addresses here.
I've also only added three files that caused the warning messages
to be reported. There are still a lot of other instances of
IP_VS_DBG_BUF() that could be converted the same way after the
basic idea is confirmed.
---
 include/net/ip_vs.h | 71 +++--
 net/netfilter/ipvs/ip_vs_core.c | 44 ++--
 net/netfilter/ipvs/ip_vs_ftp.c  | 20 +-
 3 files changed, 72 insertions(+), 63 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 3759167f91f5..3dfbeef67be6 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -227,6 +227,16 @@ static inline const char *ip_vs_dbg_addr(int af, char *buf, size_t buf_len,
   sizeof(ip_vs_dbg_buf), addr, \
   &ip_vs_dbg_idx)
 
+#define IP_VS_DBG_SOCKADDR4(fam, addr, port)   \
+   (struct sockaddr*)&(struct sockaddr_in) \
+   { .sin_family = (fam), .sin_addr = (addr)->in, .sin_port = (port) }
+#define IP_VS_DBG_SOCKADDR6(fam, addr, port)   \
+   (struct sockaddr*)&(struct sockaddr_in6) \
+   { .sin6_family = (fam), .sin6_addr = (addr)->in6, .sin6_port = (port) }
+#define IP_VS_DBG_SOCKADDR(fam, addr, port) (fam == AF_INET ?  \
+   IP_VS_DBG_SOCKADDR4(fam, addr, port) :  \
+   IP_VS_DBG_SOCKADDR6(fam, addr, port))
+
 #define IP_VS_DBG(level, msg, ...) \
do {\
if (level <= ip_vs_get_debug_level())   \
@@ -251,6 +261,7 @@ static inline const char *ip_vs_dbg_addr(int af, char *buf, size_t buf_len,
 #else  /* NO DEBUGGING at ALL */
 #define IP_VS_DBG_BUF(level, msg...)  do {} while (0)
 #define IP_VS_ERR_BUF(msg...)  do {} while (0)
+#define IP_VS_DBG_SOCKADDR(fam, addr, port) NULL
 #define IP_VS_DBG(level, msg...)  do {} while (0)
 #define IP_VS_DBG_RL(msg...)  do {} while (0)
 #define IP_VS_DBG_PKT(level, af, pp, skb, ofs, msg)do {} while (0)
@@ -1244,31 +1255,31 @@ static inline void ip_vs_control_del(struct ip_vs_conn *cp)
 {
struct ip_vs_conn *ctl_cp = cp->control;
if (!ctl_cp) {
-   IP_VS_ERR_BUF("request control DEL for uncontrolled: "
- "%s:%d to %s:%d\n",
- IP_VS_DBG_ADDR(cp->af, &cp->caddr),
- ntohs(cp->cport),
- IP_VS_DBG_ADDR(cp->af, &cp->vaddr),
- ntohs(cp->vport));
+   pr_err("request control DEL for uncontrolled: "
+  "%pISp to %pISp\n",
+  IP_VS_DBG_SOCKADDR(cp->af, &cp->caddr,
+  ntohs(cp->cport)),
+  IP_VS_DBG_SOCKADDR(cp->af, &cp->vaddr,
+  ntohs(cp->vport)));
 
return;
}
 
-   IP_VS_DBG_BUF(7, "DELeting control for: "
- "cp.dst=%s:%d ctl_cp.dst=%s:%d\n",
- IP_VS_DBG_ADDR(cp->af, &cp->caddr),
- ntohs(cp->cport),
- IP_VS_DBG_ADDR(cp->af, &ctl_cp->caddr),
- ntohs(ctl_cp->cport));
+   IP_VS_DBG(7, "DELeting control for: "
+ "cp.dst=%pISp ctl_cp.dst=%pISp\n",
+ IP_VS_DBG_SOCKADDR(cp->af, &cp->caddr,
+ ntohs(cp->cport)),
+ IP_VS_DBG_SOCKADDR(cp->af, &ctl_cp->caddr,
+

Re: [PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 13:39:13 +0300
Ilias Apalodimas  wrote:

> @@ -1079,11 +1095,22 @@ static int netsec_setup_rx_dring(struct netsec_priv *priv)
>   }
>  
>   netsec_rx_fill(priv, 0, DESC_NUM);
> + err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
> + if (err)
> + goto err_out;
> +
> + err = xdp_rxq_info_reg_mem_model(&dring->xdp_rxq, MEM_TYPE_PAGE_POOL,
> +  dring->page_pool);
> + if (err) {
> + page_pool_free(dring->page_pool);
> + goto err_out;
> + }
>  
>   return 0;
>  
>  err_out:
> - return -ENOMEM;
> + netsec_uninit_pkt_dring(priv, NETSEC_RING_RX);
> + return err;
>  }

I think you need to move page_pool_free(dring->page_pool) to after
netsec_uninit_pkt_dring(), as the latter uses dring->page_pool.
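
E.g., a sketch of the suggested ordering (illustrative only):

err_out:
	netsec_uninit_pkt_dring(priv, NETSEC_RING_RX);
	page_pool_free(dring->page_pool);
	return err;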

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 13:39:13 +0300
Ilias Apalodimas  wrote:

> Use page_pool and its DMA mapping capabilities for Rx buffers instead
> of netdev/napi_alloc_frag()
> 
> Although this will result in a slight performance penalty on small sized
> packets (~10%) the use of the API will allow to easily add XDP support.
> The penalty won't be visible in network testing i.e. iperf/netperf etc, it
> only happens during raw packet drops.
> Furthermore we intend to add recycling capabilities on the API
> in the future. Once the recycling is added the performance penalty will
> go away.
> The only 'real' penalty is the slightly increased memory usage, since we
> now allocate a page per packet instead of the amount of bytes we need +
> skb metadata (difference is roughly 2KB per packet).
> With a minimum of 4GB of RAM on the only SoC that has this NIC the
> extra memory usage is negligible (a bit more on 64K pages)
> 
> Signed-off-by: Ilias Apalodimas 
> ---
>  drivers/net/ethernet/socionext/Kconfig  |   1 +
>  drivers/net/ethernet/socionext/netsec.c | 121 +++-
>  2 files changed, 75 insertions(+), 47 deletions(-)

Acked-by: Jesper Dangaard Brouer 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 2/3, net-next] net: page_pool: add helper function for retrieving dma direction

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 13:39:14 +0300
Ilias Apalodimas  wrote:

> Since the dma direction is stored in page pool params, offer an API
> helper for drivers that choose not to keep track of it locally
> 
> Signed-off-by: Ilias Apalodimas 
> ---
>  include/net/page_pool.h | 9 +
>  1 file changed, 9 insertions(+)

Acked-by: Jesper Dangaard Brouer 

This is simple enough and you also explained the downside.
Thanks for adding a helper for this.

 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index f07c518ef8a5..ee9c871d2043 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -112,6 +112,15 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
>   return page_pool_alloc_pages(pool, gfp);
>  }
>  
> +/* get the stored dma direction. A driver might decide to treat this locally and
> + * avoid the extra cache line from page_pool to determine the direction
> + */
> +static
> +inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
> +{
> + return pool->p.dma_dir;
> +}
> +
>  struct page_pool *page_pool_create(const struct page_pool_params *params);
>  
>  void __page_pool_free(struct page_pool *pool);


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


macvtap vlan and tcp header overhead (and mtu size)

2019-06-28 Thread Marc Roos



I hope this is the right place to ask. I have a host setup for libvirt 
kvm/qemu vms. And I wonder a bit about the overhead of the macvtap and 
how to configure the mtu's properly. To be able to communicate with the 
host, I have moved the ip address of the host from the adapter to a 
macvtap to allow host communication. 

I have the below setup on hosts. 

  
                 +---------+
                 | macvtap0|
              +--|   ip    |
              |  | mtu1500 |
              |  +---------+
 +---------+  |  +---------+
 |eth0.v100|  |  | macvtap1|
 |  no ip  +--+--|   ip    |
 | mtu1500 |     | mtu1500 |
 +----+----+     +---------+
      |
 +----+----+     +---------+
 |   eth0  |     |eth0.v101|
 | mtu9000 +-----|   ip    |
 +---------+     | mtu9000 |
                 +---------+

https://pastebin.com/9jJrMCTD



I can do a ping -M do -s 9000 between hosts via the vlan interface 
eth0.v101. That is as expected.

The ping -M do -s 1500 between macvtap0 and another host, or macvtap1,
fails. The maximum size that does not fragment is 1472.
That is 28 bytes??? Where have they gone? I am only using macvtap; can
this be the combination of the parent interface being a vlan and
macvtap not handling this properly? Has anyone experienced something
similar, or can someone explain where these 28 bytes go?


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Andrew Lunn
On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
> >On Thu, 27 Jun 2019 20:39:48 +0200
> >Michal Kubecek  wrote:
> >
> >> > 
> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> >> > # ip link show "Onboard Ethernet"
> >> > Device "Onboard Ethernet" does not exist.
> >> > 
> >> > So it does not really appear to be an alias, it is a label. To be
> >> > truly useful, it needs to be more than a label, it needs to be a real
> >> > alias which you can use.  
> >> 
> >> That's exactly what I meant: to be really useful, one should be able to
> >> use the alias(es) for setting device options, for adding routes, in
> >> netfilter rules etc.
> >> 
> >> Michal
> >
> >The kernel doesn't enforce uniqueness of alias.
>Also current kernel RTM_GETLINK doesn't filter by alias (easily fixed).
> >
> >If it did, then handling it in iproute would be something like:
> 
> I think that it is desired for kernel to work with "real alias" as a
> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> Userspace mapping like you did here might be perhaps okay for iproute2,
> but I think that we need something that is easy to use for all.
> 
> Let's call it "altname". Get would return:
> 
> IFLA_NAME  eth0
> IFLA_ALT_NAME_LIST
>IFLA_ALT_NAME  eth0
>IFLA_ALT_NAME  somethingelse
>IFLA_ALT_NAME  somenamethatisreallylong

Hi Jiri

What is your use case for having multiple IFLA_ALT_NAME for the same
IFLA_NAME?

Thanks
Andrew
 


Re: [PATCH 3/3, net-next] net: netsec: add XDP support

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 13:39:15 +0300
Ilias Apalodimas  wrote:

> +static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog *prog,
> + struct netlink_ext_ack *extack)
> +{
> + struct net_device *dev = priv->ndev;
> + struct bpf_prog *old_prog;
> +
> + /* For now just support only the usual MTU sized frames */
> + if (prog && dev->mtu > 1500) {
> + NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
> + return -EOPNOTSUPP;
> + }
> +
> + if (netif_running(dev))
> + netsec_netdev_stop(dev);
> +
> + /* Detach old prog, if any */
> + old_prog = xchg(&priv->xdp_prog, prog);
> + if (old_prog)
> + bpf_prog_put(old_prog);
> +
> + if (netif_running(dev))
> + netsec_netdev_open(dev);

Shouldn't the if-statement be if (!netif_running(dev))

> +
> + return 0;
> +}



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v4 03/13] dt-bindings: net: Add a YAML schemas for the generic MDIO options

2019-06-28 Thread Maxime Ripard
On Thu, Jun 27, 2019 at 10:06:57AM -0600, Rob Herring wrote:
> On Thu, Jun 27, 2019 at 9:57 AM Maxime Ripard  
> wrote:
> > > > +
> > > > +reset-gpios = <&gpio2 5 1>;
> > > > +reset-delay-us = <2>;
> > > > +
> > > > +ethphy0: ethernet-phy@1 {
> > > > +reg = <1>;
> > >
> > > Need a child node schema to validate the unit-address and reg property.
> >
> > This should be already covered by the ethernet-phy.yaml schemas
> > earlier in this series.
>
> Partially, yes.
>
> > Were you expecting something else?
>
> That would not prevent having a child node such as 'foo {};'  or
> 'foo@bad {};'. It would also not check valid nodes named something
> other than 'ethernet-phy'.

Right, but listing the nodes won't either, since we can't enable
additionalProperties in that schema. So any node that wouldn't match
ethernet-phy@.* wouldn't be validated, but wouldn't generate a warning
either.

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com




Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
>On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:
>> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:
>> >On Thu, 27 Jun 2019 20:39:48 +0200
>> >Michal Kubecek  wrote:
>> >
>> >> > 
>> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> >> > # ip link show "Onboard Ethernet"
>> >> > Device "Onboard Ethernet" does not exist.
>> >> > 
>> >> > So it does not really appear to be an alias, it is a label. To be
>> >> > truly useful, it needs to be more than a label, it needs to be a real
>> >> > alias which you can use.  
>> >> 
>> >> That's exactly what I meant: to be really useful, one should be able to
>> >> use the alias(es) for setting device options, for adding routes, in
>> >> netfilter rules etc.
>> >> 
>> >> Michal
>> >
>> >The kernel doesn't enforce uniqueness of alias.
>> >Also current kernel RTM_GETLINK doesn't filter by alias (easily fixed).
>> >
>> >If it did, then handling it in iproute would be something like:
>> 
>> I think that it is desired for kernel to work with "real alias" as a
>> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> Userspace mapping like you did here might be perhaps okay for iproute2,
>> but I think that we need something that is easy to use for all.
>> 
>> Let's call it "altname". Get would return:
>> 
>> IFLA_NAME  eth0
>> IFLA_ALT_NAME_LIST
>>IFLA_ALT_NAME  eth0
>>IFLA_ALT_NAME  somethingelse
>>IFLA_ALT_NAME  somenamethatisreallylong
>
>Hi Jiri
>
>What is your use case for having multiple IFLA_ALT_NAME for the same
>IFLA_NAME?

I don't know about a specific use case for having more. Perhaps Michal
does.

From the implementation perspective it is handy to have the ifname as
the first alt name in the kernel, so userspace would just pass
IFLA_ALT_NAME always. Also for avoiding name collisions etc.
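
As a rough sketch (hypothetical: the per-device altname list and its
entry type are assumptions, nothing of this exists yet), a lookup that
treats the ifname as just another altname could be:

	/* caller holds RTNL */
	static struct net_device *dev_get_by_alt_name(struct net *net,
						      const char *name)
	{
		struct netdev_name_node *node;	/* assumed list entry type */
		struct net_device *dev;

		for_each_netdev(net, dev) {
			if (!strcmp(dev->name, name))
				return dev;
			list_for_each_entry(node, &dev->alt_name_list, list)
				if (!strcmp(node->name, name))
					return dev;
		}
		return NULL;
	}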



>
>   Thanks
>   Andrew
> 


[PATCH 02/10] batman-adv: Fix includes for *_MAX constants

2019-06-28 Thread Simon Wunderlich
From: Sven Eckelmann 

The commit 54d50897d544 ("linux/kernel.h: split *_MAX and *_MIN macros into
") moved the U32_MAX/INT_MAX/ULONG_MAX from linux/kernel.h
to linux/limits.h. Adjust the includes accordingly.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Simon Wunderlich 
---
 net/batman-adv/gateway_common.c | 1 +
 net/batman-adv/hard-interface.c | 1 +
 net/batman-adv/netlink.c| 1 +
 net/batman-adv/sysfs.c  | 1 +
 net/batman-adv/tp_meter.c   | 1 +
 5 files changed, 5 insertions(+)

diff --git a/net/batman-adv/gateway_common.c b/net/batman-adv/gateway_common.c
index dac097f9be03..fc55750542e4 100644
--- a/net/batman-adv/gateway_common.c
+++ b/net/batman-adv/gateway_common.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index 79d1731b8306..899487641bca 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/batman-adv/netlink.c b/net/batman-adv/netlink.c
index a67720fad46c..7253699c3151 100644
--- a/net/batman-adv/netlink.c
+++ b/net/batman-adv/netlink.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/batman-adv/sysfs.c b/net/batman-adv/sysfs.c
index 80fc3253c336..1efcb97039cd 100644
--- a/net/batman-adv/sysfs.c
+++ b/net/batman-adv/sysfs.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 820392146249..dd6a9a40dbb9 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.11.0



[PATCH 01/10] batman-adv: Start new development cycle

2019-06-28 Thread Simon Wunderlich
Signed-off-by: Simon Wunderlich 
---
 net/batman-adv/main.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/batman-adv/main.h b/net/batman-adv/main.h
index c59afcba31e0..11d051dbbda4 100644
--- a/net/batman-adv/main.h
+++ b/net/batman-adv/main.h
@@ -13,7 +13,7 @@
 #define BATADV_DRIVER_DEVICE "batman-adv"
 
 #ifndef BATADV_SOURCE_VERSION
-#define BATADV_SOURCE_VERSION "2019.2"
+#define BATADV_SOURCE_VERSION "2019.3"
 #endif
 
 /* B.A.T.M.A.N. parameters */
-- 
2.11.0



Re: [PATCH 1/2] tls: remove close callback sock unlock/lock and flush_sync

2019-06-28 Thread John Fastabend
Jakub Kicinski wrote:
> On Thu, 27 Jun 2019 10:36:42 -0700, John Fastabend wrote:
> > The tls close() callback currently drops the sock lock, makes a
> > cancel_delayed_work_sync() call, and then relocks the sock. This
> > seems suspect at best. The lock_sock() is applied to stop concurrent
> > operations on the socket while tearing the sock down. Further we
> > will need to add support for unhash() shortly and this complicates
> > matters because the lock may or may not be held then.
> > 
> > So to fix the above situation and simplify the next patch to add
> > unhash this patch creates a function tls_sk_proto_cleanup() that
> > tears down the socket without calling lock_sock/release_sock. In
> > order to flush the workqueue then we do the following,
> > 
> >   - Add a new bit to ctx, BIT_TX_CLOSING that is set when the
> > tls resources are being removed.
> >   - Check this bit before scheduling any new work. This way we
> > avoid queueing new work after tear down has started.
> >   - With the BIT_TX_CLOSING ensuring no new work is being added
> > convert the cancel_delayed_work_sync to flush_delayed_work()
> >   - Finally call tlx_tx_records() to complete any available records
> > before,
> >   - releasing and removing tls ctx.
> > 
> > The above is implemented for the software case namely any of
> > the following configurations from build_protos,
> > 
> >prot[TLS_SW][TLS_BASE]
> >prot[TLS_BASE][TLS_SW]
> >prot[TLS_SW][TLS_SW]
> > 
> > The implication is a follow up patch is needed to resolve the
> > hardware offload case.
> > 
> > Tested with net selftests and bpf selftests.
> > 
> > Signed-off-by: John Fastabend 
> > ---
> >  include/net/tls.h  |4 ++--
> >  net/tls/tls_main.c |   54 ++--
> >  net/tls/tls_sw.c   |   50 
> >  3 files changed, 62 insertions(+), 46 deletions(-)
> > 
> > diff --git a/include/net/tls.h b/include/net/tls.h
> > index 4a55ce6a303f..6fe1f5c96f4a 100644
> > --- a/include/net/tls.h
> > +++ b/include/net/tls.h
> > @@ -105,9 +105,7 @@ struct tls_device {
> >  enum {
> > TLS_BASE,
> > TLS_SW,
> > -#ifdef CONFIG_TLS_DEVICE
> > TLS_HW,
> > -#endif
> > TLS_HW_RECORD,
> > TLS_NUM_CONFIG,
> >  };
> > @@ -160,6 +158,7 @@ struct tls_sw_context_tx {
> > int async_capable;
> >  
> >  #define BIT_TX_SCHEDULED   0
> 
> BTW do you understand why we track this bit separately?  Just to avoid
> the irq operations in the workqueue code?
> 

Sorry, not sure I understand. You mean vs simply scheduling the work
without checking the bit? Presumably it's better to avoid scheduling
unnecessary work.
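
I.e., keep today's scheduling pattern and just add the closing check,
roughly (sketch against the fields in this patch):

	/* don't queue new tx work once teardown has started */
	if (!test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask) &&
	    !test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
		schedule_delayed_work(&ctx->tx_work.work, 1);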

> > +#define BIT_TX_CLOSING 1
> 
> But since we do have the above, and I think it's tested everywhere,
> wouldn't setting SCHEDULED without accentually scheduling have
> effectively the same result?

It would block a send from calling tls_tx_records(), but I guess that is
OK because this is a teardown operation and we are about to call
tls_tx_records anyway.

Sure, we can do it this way; it might be slightly nicer to avoid
checking two bits.

> 
> > unsigned long tx_bitmask;
> >  };
> >  
> > @@ -327,6 +326,7 @@ void tls_sw_close(struct sock *sk, long timeout);
> >  void tls_sw_free_resources_tx(struct sock *sk);
> >  void tls_sw_free_resources_rx(struct sock *sk);
> >  void tls_sw_release_resources_rx(struct sock *sk);
> > +void tls_sw_release_strp_rx(struct tls_context *tls_ctx);
> >  int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> >int nonblock, int flags, int *addr_len);
> >  bool tls_sw_stream_read(const struct sock *sk);
> > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > index fc81ae18cc44..51cb19e24dd9 100644
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -261,24 +261,9 @@ static void tls_ctx_free(struct tls_context *ctx)
> > kfree(ctx);
> >  }
> >  
> > -static void tls_sk_proto_close(struct sock *sk, long timeout)
> > +static void tls_sk_proto_cleanup(struct sock *sk,
> > +struct tls_context *ctx, long timeo)
> >  {
> > -   struct tls_context *ctx = tls_get_ctx(sk);
> > -   long timeo = sock_sndtimeo(sk, 0);
> > -   void (*sk_proto_close)(struct sock *sk, long timeout);
> > -   bool free_ctx = false;
> > -
> > -   lock_sock(sk);
> > -   sk_proto_close = ctx->sk_proto_close;
> > -
> > -   if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
> > -   goto skip_tx_cleanup;
> > -
> > -   if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
> > -   free_ctx = true;
> > -   goto skip_tx_cleanup;
> > -   }
> > -
> > if (!tls_complete_pending_work(sk, ctx, 0, &timeo))
> > tls_handle_open_record(sk, 0);
> >  
> > @@ -299,22 +284,37 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
> >  #ifdef CONFIG_TLS_DEVICE
> > if (ctx->rx_conf == TLS_HW)
> > tls_devi

[PATCH net-next] af_packet: convert pending frame counter to atomic_t

2019-06-28 Thread Neil Horman
The AF_PACKET protocol, when running as a memory mapped socket uses a
pending frame counter to track the number of skbs in flight during
transmission.  It is incremented during the sendmsg call (via
tpacket_snd), and decremented (possibly asynchronously) in the skb
destructor during skb_free.

The counter is currently implemented as a percpu variable for each open
socket, but for reads (via packet_read_pending), we iterate over every
cpu variable, accumulating the total pending count.

Given that the socket transmit path is an exclusive path (locked via the
pg_vec_lock mutex), we do not have the ability to increment this counter
on multiple cpus in parallel.  This implementation also seems to have
the potential to be broken, in that, should an skb be freed on a cpu
other than the one that it was initially transmitted on, we may
decrement a counter that was not initially incremented, leading to
underflow.

As such, adjust the packet socket struct to convert the per-cpu counter
to an atomic_t variable (to enforce consistency between the send path
and the skb free path).  This saves us some space in the packet_sock
structure, prevents the possibility of underflow, and should reduce the
run time of packet_read_pending, as we only need to read a single
variable, instead of having to loop over every available cpu variable
instance.

Tested by running a small program which sends frames via
AF_PACKET on multiple cpus in parallel, with good results.

Signed-off-by: Neil Horman 
CC: "David S. Miller" 
CC: Willem de Bruijn 
---
 net/packet/af_packet.c | 40 +---
 net/packet/internal.h  |  2 +-
 2 files changed, 6 insertions(+), 36 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8d54f3047768..25ffb486fac9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1154,43 +1154,17 @@ static void packet_increment_head(struct packet_ring_buffer *buff)
 
 static void packet_inc_pending(struct packet_ring_buffer *rb)
 {
-   this_cpu_inc(*rb->pending_refcnt);
+   atomic_inc(&rb->pending_refcnt);
 }
 
 static void packet_dec_pending(struct packet_ring_buffer *rb)
 {
-   this_cpu_dec(*rb->pending_refcnt);
+   atomic_dec(&rb->pending_refcnt);
 }
 
 static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
 {
-   unsigned int refcnt = 0;
-   int cpu;
-
-   /* We don't use pending refcount in rx_ring. */
-   if (rb->pending_refcnt == NULL)
-   return 0;
-
-   for_each_possible_cpu(cpu)
-   refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);
-
-   return refcnt;
-}
-
-static int packet_alloc_pending(struct packet_sock *po)
-{
-   po->rx_ring.pending_refcnt = NULL;
-
-   po->tx_ring.pending_refcnt = alloc_percpu(unsigned int);
-   if (unlikely(po->tx_ring.pending_refcnt == NULL))
-   return -ENOBUFS;
-
-   return 0;
-}
-
-static void packet_free_pending(struct packet_sock *po)
-{
-   free_percpu(po->tx_ring.pending_refcnt);
+   return atomic_read(&rb->pending_refcnt);
 }
 
 #define ROOM_POW_OFF   2
@@ -3046,7 +3020,6 @@ static int packet_release(struct socket *sock)
/* Purge queues */
 
skb_queue_purge(&sk->sk_receive_queue);
-   packet_free_pending(po);
sk_refcnt_debug_release(sk);
 
sock_put(sk);
@@ -3236,9 +3209,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
po->num = proto;
po->xmit = dev_queue_xmit;
 
-   err = packet_alloc_pending(po);
-   if (err)
-   goto out2;
+   atomic_set(&po->tx_ring.pending_refcnt, 0);
+   atomic_set(&po->rx_ring.pending_refcnt, 0);
 
packet_cached_dev_reset(po);
 
@@ -3273,8 +3245,6 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
preempt_enable();
 
return 0;
-out2:
-   sk_free(sk);
 out:
return err;
 }
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 82fb2b10f790..b0fdb54bb91b 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -68,7 +68,7 @@ struct packet_ring_buffer {
unsigned intpg_vec_pages;
unsigned intpg_vec_len;
 
-   unsigned int __percpu   *pending_refcnt;
+   atomic_tpending_refcnt;
 
struct tpacket_kbdq_coreprb_bdqc;
 };
-- 
2.21.0



Re: [EXT] [PATCH net-next 03/16] qlge: Deduplicate lbq_buf_size

2019-06-28 Thread Willem de Bruijn
On Fri, Jun 28, 2019 at 4:57 AM Benjamin Poirier  wrote:
>
> On 2019/06/26 11:42, Willem de Bruijn wrote:
> > On Wed, Jun 26, 2019 at 7:37 AM Benjamin Poirier  wrote:
> > >
> > > On 2019/06/26 09:24, Manish Chopra wrote:
> > > > > -Original Message-
> > > > > From: Benjamin Poirier 
> > > > > Sent: Monday, June 17, 2019 1:19 PM
> > > > > To: Manish Chopra ; GR-Linux-NIC-Dev  > > > > nic-...@marvell.com>; netdev@vger.kernel.org
> > > > > Subject: [EXT] [PATCH net-next 03/16] qlge: Deduplicate lbq_buf_size
> > > > >
> > > > > External Email
> > > > >
> > > > > --
> > > > > lbq_buf_size is duplicated to every rx_ring structure whereas lbq_buf_order is
> > > > > present once in the ql_adapter structure. All rings use the same buf size, keep
> > > > > only one copy of it. Also factor out the calculation of lbq_buf_size instead of
> > > > > having two copies.
> > > > >
> > > > > Signed-off-by: Benjamin Poirier 
> > > > > ---
> > > [...]
> > > >
> > > > Not sure if this change is really required, I think fields relevant to 
> > > > rx_ring should be present in the rx_ring structure.
> > > > There are various other fields like "lbq_len" and "lbq_size" which 
> > > > would be same for all rx rings but still under the relevant rx_ring 
> > > > structure.
> >
> > The one argument against deduplicating might be if the original fields
> > are in a hot cacheline and the new location adds a cacheline access to
> > a hot path. Not sure if that is relevant here. But maybe something to
> > double check.
> >
>
> Thanks for the hint. I didn't check before because my hunch was that
> this driver is not near that level of optimization but I checked now and
> got the following results.

Thanks for the data. I didn't mean to ask you to do a lot of extra work.
Sorry if it resulted in that.

Fully agreed on your point about optimization (see also.. that 784B
struct with holes). I support the patch and meant to argue against the
previous response: this cleanup makes sense to me, just take a second
look at struct layout. To be more crystal clear:

Acked-by: Willem de Bruijn 


[net-next 1/1] tipc: embed jiffies in macro TIPC_BC_RETR_LIM

2019-06-28 Thread Jon Maloy
The macro TIPC_BC_RETR_LIM is always used in combination with 'jiffies',
so we can just as well perform the addition in the macro itself. This
way, we get a few shorter code lines and one less line break.

Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index f8bf63b..66d3a07 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -207,7 +207,7 @@ enum {
BC_NACK_SND_SUPPRESS,
 };
 
-#define TIPC_BC_RETR_LIM msecs_to_jiffies(10)   /* [ms] */
+#define TIPC_BC_RETR_LIM  (jiffies + msecs_to_jiffies(10))
 #define TIPC_UC_RETR_TIME (jiffies + msecs_to_jiffies(1))
 
 /*
@@ -976,8 +976,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head 
*list,
__skb_queue_tail(transmq, skb);
/* next retransmit attempt */
if (link_is_bc_sndlink(l))
-   TIPC_SKB_CB(skb)->nxt_retr =
-   jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
__skb_queue_tail(xmitq, _skb);
TIPC_SKB_CB(skb)->ackers = l->ackers;
l->rcv_unacked = 0;
@@ -1027,7 +1026,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
__skb_queue_tail(&l->transmq, skb);
/* next retransmit attempt */
if (link_is_bc_sndlink(l))
-   TIPC_SKB_CB(skb)->nxt_retr = jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
 
__skb_queue_tail(xmitq, _skb);
TIPC_SKB_CB(skb)->ackers = l->ackers;
@@ -1123,7 +1122,7 @@ static int tipc_link_bc_retrans(struct tipc_link *l, 
struct tipc_link *r,
if (link_is_bc_sndlink(l)) {
if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
continue;
-   TIPC_SKB_CB(skb)->nxt_retr = jiffies + TIPC_BC_RETR_LIM;
+   TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
}
_skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, GFP_ATOMIC);
if (!_skb)
-- 
2.1.4



Re: [PATCH net-next] af_packet: convert pending frame counter to atomic_t

2019-06-28 Thread Willem de Bruijn
On Fri, Jun 28, 2019 at 10:53 AM Neil Horman  wrote:
>
> The AF_PACKET protocol, when running as a memory mapped socket uses a
> pending frame counter to track the number of skbs in flight during
> transmission.  It is incremented during the sendmsg call (via
> tpacket_snd), and decremented (possibly asynchronously) in the skb
> destructor during skb_free.
>
> The counter is currently implemented as a percpu variable for each open
> socket, but for reads (via packet_read_pending), we iterate over every
> cpu variable, accumulating the total pending count.
>
> Given that the socket transmit path is an exclusive path (locked via the
> pg_vec_lock mutex), we do not have the ability to increment this counter
> on multiple cpus in parallel.  This implementation also seems to have
> the potential to be broken, in that, should an skb be freed on a cpu
> other than the one that it was initially transmitted on, we may
> decrement a counter that was not initially incremented, leading to
> underflow.
>
> As such, adjust the packet socket struct to convert the per-cpu counter
> to an atomic_t variable (to enforce consistency between the send path
> and the skb free path).  This saves us some space in the packet_sock
> structure, prevents the possibility of underflow, and should reduce the
> run time of packet_read_pending, as we only need to read a single
> variable, instead of having to loop over every available cpu variable
> instance.
>
> Tested by running a small program which sends frames via
> AF_PACKET on multiple cpus in parallel, with good results.
>
> Signed-off-by: Neil Horman 
> CC: "David S. Miller" 
> CC: Willem de Bruijn 
> ---

This essentially is a revert of commit b013840810c2 ("packet: use
percpu mmap tx frame pending refcount"). That has some benchmark
numbers and also discusses the overflow issue.

I think more interesting would be to eschew the counter when
MSG_DONTWAIT, as it is only used to know when to exit the loop if
need_wait.

But, IMHO packet sockets are deprecated in favor of AF_XDP and
should be limited to bug fixes at this point.

>  net/packet/af_packet.c | 40 +---
>  net/packet/internal.h  |  2 +-
>  2 files changed, 6 insertions(+), 36 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 8d54f3047768..25ffb486fac9 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1154,43 +1154,17 @@ static void packet_increment_head(struct packet_ring_buffer *buff)
>
>  static void packet_inc_pending(struct packet_ring_buffer *rb)
>  {
> -   this_cpu_inc(*rb->pending_refcnt);
> +   atomic_inc(&rb->pending_refcnt);
>  }

If making this change, can also remove these helper functions. The
layer of indirection just hinders readability.


[PATCH v2 bpf-next 0/4] capture integers in BTF type info for map defs

2019-06-28 Thread Andrii Nakryiko
This patch set implements an update to how BTF-defined maps are specified. The
change is in how integer attributes, e.g., type, max_entries, map_flags, are
specified: now they are captured as part of map definition struct's BTF type
information (using array dimension), eliminating the need for compile-time
data initialization and keeping all the metadata in one place.
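
For example, after this series an array map from the selftests reads:

	struct {
		__int(type, BPF_MAP_TYPE_ARRAY);
		__int(max_entries, 1);
		__type(key, __u32);
		__type(value, __u64);
	} cg_ids SEC(".maps");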

All existing selftests that were using BTF-defined maps are updated,
along with some other selftests that were switched to the new syntax.

v1->v2:
- split bpf_helpers.h change from libbpf change (Song).

Andrii Nakryiko (4):
  libbpf: capture value in BTF type info for BTF-defined map defs
  selftests/bpf: add __int and __type macro for BTF-defined maps
  selftests/bpf: convert selftests using BTF-defined maps to new syntax
  selftests/bpf: convert legacy BPF maps to BTF-defined ones

 tools/lib/bpf/libbpf.c|  58 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  28 ++---
 .../selftests/bpf/progs/get_cgroup_id_kern.c  |  26 ++---
 .../testing/selftests/bpf/progs/netcnt_prog.c |  20 ++--
 tools/testing/selftests/bpf/progs/pyperf.h|  90 +++---
 .../selftests/bpf/progs/sample_map_ret0.c |  24 ++--
 .../selftests/bpf/progs/socket_cookie_prog.c  |  13 +--
 .../bpf/progs/sockmap_verdict_prog.c  |  48 
 .../testing/selftests/bpf/progs/strobemeta.h  |  68 +--
 .../selftests/bpf/progs/test_btf_newkv.c  |  13 +--
 .../bpf/progs/test_get_stack_rawtp.c  |  39 +++
 .../selftests/bpf/progs/test_global_data.c|  37 +++---
 tools/testing/selftests/bpf/progs/test_l4lb.c |  65 ---
 .../selftests/bpf/progs/test_l4lb_noinline.c  |  65 ---
 .../selftests/bpf/progs/test_map_in_map.c |  30 ++---
 .../selftests/bpf/progs/test_map_lock.c   |  26 ++---
 .../testing/selftests/bpf/progs/test_obj_id.c |  12 +-
 .../bpf/progs/test_select_reuseport_kern.c|  67 ---
 .../bpf/progs/test_send_signal_kern.c |  26 ++---
 .../bpf/progs/test_sock_fields_kern.c |  78 +
 .../selftests/bpf/progs/test_spin_lock.c  |  36 +++---
 .../bpf/progs/test_stacktrace_build_id.c  |  55 -
 .../selftests/bpf/progs/test_stacktrace_map.c |  52 +++--
 .../selftests/bpf/progs/test_tcp_estats.c |  13 +--
 .../selftests/bpf/progs/test_tcpbpf_kern.c|  26 ++---
 .../selftests/bpf/progs/test_tcpnotify_kern.c |  28 ++---
 tools/testing/selftests/bpf/progs/test_xdp.c  |  26 ++---
 .../selftests/bpf/progs/test_xdp_loop.c   |  26 ++---
 .../selftests/bpf/progs/test_xdp_noinline.c   |  81 +
 .../selftests/bpf/progs/xdp_redirect_map.c|  12 +-
 .../testing/selftests/bpf/progs/xdping_kern.c |  12 +-
 .../selftests/bpf/test_queue_stack_map.h  |  30 ++---
 .../testing/selftests/bpf/test_sockmap_kern.h | 110 +-
 34 files changed, 571 insertions(+), 772 deletions(-)

-- 
2.17.1



[PATCH v2 bpf-next 1/4] libbpf: capture value in BTF type info for BTF-defined map defs

2019-06-28 Thread Andrii Nakryiko
Change BTF-defined map definitions to capture compile-time integer
values as part of the BTF type definition, to avoid splitting key/value
type information from the actual type/size/flags initialization for maps.

Signed-off-by: Andrii Nakryiko 
---
 tools/lib/bpf/libbpf.c | 58 --
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6e6ebef11ba3..9e099ecb2c2b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1028,40 +1028,40 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
}
 }
 
-static bool get_map_field_int(const char *map_name,
- const struct btf *btf,
+/*
+ * Fetch integer attribute of BTF map definition. Such attributes are
+ * represented using a pointer to an array, in which dimensionality of array
+ * encodes specified integer value. E.g., int (*type)[BPF_MAP_TYPE_ARRAY];
+ * encodes `type => BPF_MAP_TYPE_ARRAY` key/value pair completely using BTF
+ * type definition, while using only sizeof(void *) space in ELF data section.
+ */
+static bool get_map_field_int(const char *map_name, const struct btf *btf,
  const struct btf_type *def,
- const struct btf_member *m,
- const void *data, __u32 *res) {
+ const struct btf_member *m, __u32 *res) {
const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
const char *name = btf__name_by_offset(btf, m->name_off);
-   __u32 int_info = *(const __u32 *)(const void *)(t + 1);
+   const struct btf_array *arr_info;
+   const struct btf_type *arr_t;
 
-   if (BTF_INFO_KIND(t->info) != BTF_KIND_INT) {
-   pr_warning("map '%s': attr '%s': expected INT, got %u.\n",
+   if (BTF_INFO_KIND(t->info) != BTF_KIND_PTR) {
+   pr_warning("map '%s': attr '%s': expected PTR, got %u.\n",
   map_name, name, BTF_INFO_KIND(t->info));
return false;
}
-   if (t->size != 4 || BTF_INT_BITS(int_info) != 32 ||
-   BTF_INT_OFFSET(int_info)) {
-   pr_warning("map '%s': attr '%s': expected 32-bit non-bitfield 
integer, "
-  "got %u-byte (%d-bit) one with bit offset %d.\n",
-  map_name, name, t->size, BTF_INT_BITS(int_info),
-  BTF_INT_OFFSET(int_info));
-   return false;
-   }
-   if (BTF_INFO_KFLAG(def->info) && BTF_MEMBER_BITFIELD_SIZE(m->offset)) {
-   pr_warning("map '%s': attr '%s': bitfield is not supported.\n",
-  map_name, name);
+
+   arr_t = btf__type_by_id(btf, t->type);
+   if (!arr_t) {
+   pr_warning("map '%s': attr '%s': type [%u] not found.\n",
+  map_name, name, t->type);
return false;
}
-   if (m->offset % 32) {
-   pr_warning("map '%s': attr '%s': unaligned fields are not 
supported.\n",
-  map_name, name);
+   if (BTF_INFO_KIND(arr_t->info) != BTF_KIND_ARRAY) {
+   pr_warning("map '%s': attr '%s': expected ARRAY, got %u.\n",
+  map_name, name, BTF_INFO_KIND(arr_t->info));
return false;
}
-
-   *res = *(const __u32 *)(data + m->offset / 8);
+   arr_info = (const void *)(arr_t + 1);
+   *res = arr_info->nelems;
return true;
 }
 
@@ -1074,7 +1074,6 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
const struct btf_var_secinfo *vi;
const struct btf_var *var_extra;
const struct btf_member *m;
-   const void *def_data;
const char *map_name;
struct bpf_map *map;
int vlen, i;
@@ -1131,7 +1130,6 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
pr_debug("map '%s': at sec_idx %d, offset %zu.\n",
 map_name, map->sec_idx, map->sec_offset);
 
-   def_data = data->d_buf + vi->offset;
vlen = BTF_INFO_VLEN(def->info);
m = (const void *)(def + 1);
for (i = 0; i < vlen; i++, m++) {
@@ -1144,19 +1142,19 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
}
if (strcmp(name, "type") == 0) {
if (!get_map_field_int(map_name, obj->btf, def, m,
-  def_data, &map->def.type))
+  &map->def.type))
return -EINVAL;
pr_debug("map '%s': found type = %u.\n",
 map_name, map->def.type);
} else if (strcmp(name, "max_entries") == 0) {
if (!get_map_field_int(map_name, obj->btf, def, m,
-  def_data, &map-

[PATCH v2 bpf-next 4/4] selftests/bpf: convert legacy BPF maps to BTF-defined ones

2019-06-28 Thread Andrii Nakryiko
Convert the selftests that were originally left out, as well as some
recently added ones, to consistently use BTF-defined maps.

Signed-off-by: Andrii Nakryiko 
---
 .../selftests/bpf/progs/get_cgroup_id_kern.c  |  26 ++---
 tools/testing/selftests/bpf/progs/pyperf.h|  90 +++---
 .../selftests/bpf/progs/sample_map_ret0.c |  24 ++--
 .../bpf/progs/sockmap_verdict_prog.c  |  48 
 .../testing/selftests/bpf/progs/strobemeta.h  |  68 +--
 .../selftests/bpf/progs/test_map_in_map.c |  30 ++---
 .../testing/selftests/bpf/progs/test_obj_id.c |  12 +-
 .../selftests/bpf/progs/test_xdp_loop.c   |  26 ++---
 .../selftests/bpf/progs/xdp_redirect_map.c|  12 +-
 .../testing/selftests/bpf/progs/xdping_kern.c |  12 +-
 .../selftests/bpf/test_queue_stack_map.h  |  30 ++---
 .../testing/selftests/bpf/test_sockmap_kern.h | 110 +-
 12 files changed, 240 insertions(+), 248 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/get_cgroup_id_kern.c b/tools/testing/selftests/bpf/progs/get_cgroup_id_kern.c
index 014dba10b8a5..d7f4003ac3a2 100644
--- a/tools/testing/selftests/bpf/progs/get_cgroup_id_kern.c
+++ b/tools/testing/selftests/bpf/progs/get_cgroup_id_kern.c
@@ -4,19 +4,19 @@
 #include 
 #include "bpf_helpers.h"
 
-struct bpf_map_def SEC("maps") cg_ids = {
-   .type = BPF_MAP_TYPE_ARRAY,
-   .key_size = sizeof(__u32),
-   .value_size = sizeof(__u64),
-   .max_entries = 1,
-};
-
-struct bpf_map_def SEC("maps") pidmap = {
-   .type = BPF_MAP_TYPE_ARRAY,
-   .key_size = sizeof(__u32),
-   .value_size = sizeof(__u32),
-   .max_entries = 1,
-};
+struct {
+   __int(type, BPF_MAP_TYPE_ARRAY);
+   __int(max_entries, 1);
+   __type(key, __u32);
+   __type(value, __u64);
+} cg_ids SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_ARRAY);
+   __int(max_entries, 1);
+   __type(key, __u32);
+   __type(value, __u32);
+} pidmap SEC(".maps");
 
 SEC("tracepoint/syscalls/sys_enter_nanosleep")
 int trace(void *ctx)
diff --git a/tools/testing/selftests/bpf/progs/pyperf.h b/tools/testing/selftests/bpf/progs/pyperf.h
index 6b0781391be5..f0c7250ab045 100644
--- a/tools/testing/selftests/bpf/progs/pyperf.h
+++ b/tools/testing/selftests/bpf/progs/pyperf.h
@@ -58,14 +58,6 @@ typedef struct {
 } Event;
 
 
-struct bpf_elf_map {
-   __u32 type;
-   __u32 size_key;
-   __u32 size_value;
-   __u32 max_elem;
-   __u32 flags;
-};
-
 typedef int pid_t;
 
 typedef struct {
@@ -119,47 +111,47 @@ get_frame_data(void* frame_ptr, PidData* pidData, FrameData* frame, Symbol* symb
return true;
 }
 
-struct bpf_elf_map SEC("maps") pidmap = {
-   .type = BPF_MAP_TYPE_HASH,
-   .size_key = sizeof(int),
-   .size_value = sizeof(PidData),
-   .max_elem = 1,
-};
-
-struct bpf_elf_map SEC("maps") eventmap = {
-   .type = BPF_MAP_TYPE_HASH,
-   .size_key = sizeof(int),
-   .size_value = sizeof(Event),
-   .max_elem = 1,
-};
-
-struct bpf_elf_map SEC("maps") symbolmap = {
-   .type = BPF_MAP_TYPE_HASH,
-   .size_key = sizeof(Symbol),
-   .size_value = sizeof(int),
-   .max_elem = 1,
-};
-
-struct bpf_elf_map SEC("maps") statsmap = {
-   .type = BPF_MAP_TYPE_ARRAY,
-   .size_key = sizeof(Stats),
-   .size_value = sizeof(int),
-   .max_elem = 1,
-};
-
-struct bpf_elf_map SEC("maps") perfmap = {
-   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
-   .size_key = sizeof(int),
-   .size_value = sizeof(int),
-   .max_elem = 32,
-};
-
-struct bpf_elf_map SEC("maps") stackmap = {
-   .type = BPF_MAP_TYPE_STACK_TRACE,
-   .size_key = sizeof(int),
-   .size_value = sizeof(long long) * 127,
-   .max_elem = 1000,
-};
+struct {
+   __int(type, BPF_MAP_TYPE_HASH);
+   __int(max_entries, 1);
+   __type(key, int);
+   __type(value, PidData);
+} pidmap SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_HASH);
+   __int(max_entries, 1);
+   __type(key, int);
+   __type(value, Event);
+} eventmap SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_HASH);
+   __int(max_entries, 1);
+   __type(key, Symbol);
+   __type(value, int);
+} symbolmap SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_ARRAY);
+   __int(max_entries, 1);
+   __type(key, int);
+   __type(value, Stats);
+} statsmap SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+   __int(max_entries, 32);
+   __int(key_size, sizeof(int));
+   __int(value_size, sizeof(int));
+} perfmap SEC(".maps");
+
+struct {
+   __int(type, BPF_MAP_TYPE_STACK_TRACE);
+   __int(max_entries, 1000);
+   __int(key_size, sizeof(int));
+   __int(value_size, sizeof(long long) * 127);
+} stackmap SEC(".maps");
 
 static inline __attribute__((__always_inline__)) int __on_event(struct pt_regs 
*ctx)
 {
diff --git a/tools/testing/selftests/bpf/progs/sample_map_ret0.c 
b/tools/testing/s

[PATCH v2 bpf-next 3/4] selftests/bpf: convert selftests using BTF-defined maps to new syntax

2019-06-28 Thread Andrii Nakryiko
Convert all the existing selftests that are already using BTF-defined
maps to use new syntax (with no static data initialization).

Signed-off-by: Andrii Nakryiko 
---
 tools/testing/selftests/bpf/progs/bpf_flow.c  | 28 +++
 .../testing/selftests/bpf/progs/netcnt_prog.c | 20 ++---
 .../selftests/bpf/progs/socket_cookie_prog.c  | 13 ++-
 .../selftests/bpf/progs/test_btf_newkv.c  | 13 ++-
 .../bpf/progs/test_get_stack_rawtp.c  | 39 -
 .../selftests/bpf/progs/test_global_data.c| 37 -
 tools/testing/selftests/bpf/progs/test_l4lb.c | 65 ++-
 .../selftests/bpf/progs/test_l4lb_noinline.c  | 65 ++-
 .../selftests/bpf/progs/test_map_lock.c   | 26 +++---
 .../bpf/progs/test_select_reuseport_kern.c| 67 ++-
 .../bpf/progs/test_send_signal_kern.c | 26 +++---
 .../bpf/progs/test_sock_fields_kern.c | 78 +++---
 .../selftests/bpf/progs/test_spin_lock.c  | 36 -
 .../bpf/progs/test_stacktrace_build_id.c  | 55 +
 .../selftests/bpf/progs/test_stacktrace_map.c | 52 +---
 .../selftests/bpf/progs/test_tcp_estats.c | 13 ++-
 .../selftests/bpf/progs/test_tcpbpf_kern.c| 26 +++---
 .../selftests/bpf/progs/test_tcpnotify_kern.c | 28 +++
 tools/testing/selftests/bpf/progs/test_xdp.c  | 26 +++---
 .../selftests/bpf/progs/test_xdp_noinline.c   | 81 +++
 20 files changed, 300 insertions(+), 494 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c 
b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 849f42e548b5..a5564a90525d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -58,26 +58,18 @@ struct frag_hdr {
 };
 
 struct {
-   __u32 type;
-   __u32 max_entries;
-   __u32 key_size;
-   __u32 value_size;
-} jmp_table SEC(".maps") = {
-   .type = BPF_MAP_TYPE_PROG_ARRAY,
-   .max_entries = 8,
-   .key_size = sizeof(__u32),
-   .value_size = sizeof(__u32),
-};
+   __int(type, BPF_MAP_TYPE_PROG_ARRAY);
+   __int(max_entries, 8);
+   __int(key_size, sizeof(__u32));
+   __int(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
 
 struct {
-   __u32 type;
-   __u32 max_entries;
-   __u32 *key;
-   struct bpf_flow_keys *value;
-} last_dissection SEC(".maps") = {
-   .type = BPF_MAP_TYPE_ARRAY,
-   .max_entries = 1,
-};
+   __int(type, BPF_MAP_TYPE_ARRAY);
+   __int(max_entries, 1);
+   __type(key, __u32);
+   __type(value, struct bpf_flow_keys);
+} last_dissection SEC(".maps");
 
 static __always_inline int export_flow_keys(struct bpf_flow_keys *keys,
int ret)
diff --git a/tools/testing/selftests/bpf/progs/netcnt_prog.c 
b/tools/testing/selftests/bpf/progs/netcnt_prog.c
index a25c82a5b7c8..31afe01ca52e 100644
--- a/tools/testing/selftests/bpf/progs/netcnt_prog.c
+++ b/tools/testing/selftests/bpf/progs/netcnt_prog.c
@@ -11,20 +11,16 @@
 #define NS_PER_SEC 10
 
 struct {
-   __u32 type;
-   struct bpf_cgroup_storage_key *key;
-   struct percpu_net_cnt *value;
-} percpu_netcnt SEC(".maps") = {
-   .type = BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
-};
+   __int(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
+   __type(key, struct bpf_cgroup_storage_key);
+   __type(value, struct percpu_net_cnt);
+} percpu_netcnt SEC(".maps");
 
 struct {
-   __u32 type;
-   struct bpf_cgroup_storage_key *key;
-   struct net_cnt *value;
-} netcnt SEC(".maps") = {
-   .type = BPF_MAP_TYPE_CGROUP_STORAGE,
-};
+   __int(type, BPF_MAP_TYPE_CGROUP_STORAGE);
+   __type(key, struct bpf_cgroup_storage_key);
+   __type(value, struct net_cnt);
+} netcnt SEC(".maps");
 
 SEC("cgroup/skb")
 int bpf_nextcnt(struct __sk_buff *skb)
diff --git a/tools/testing/selftests/bpf/progs/socket_cookie_prog.c 
b/tools/testing/selftests/bpf/progs/socket_cookie_prog.c
index 6aabb681fb9a..6cd0a9457175 100644
--- a/tools/testing/selftests/bpf/progs/socket_cookie_prog.c
+++ b/tools/testing/selftests/bpf/progs/socket_cookie_prog.c
@@ -13,14 +13,11 @@ struct socket_cookie {
 };
 
 struct {
-   __u32 type;
-   __u32 map_flags;
-   int *key;
-   struct socket_cookie *value;
-} socket_cookies SEC(".maps") = {
-   .type = BPF_MAP_TYPE_SK_STORAGE,
-   .map_flags = BPF_F_NO_PREALLOC,
-};
+   __int(type, BPF_MAP_TYPE_SK_STORAGE);
+   __int(map_flags, BPF_F_NO_PREALLOC);
+   __type(key, int);
+   __type(value, struct socket_cookie);
+} socket_cookies SEC(".maps");
 
 SEC("cgroup/connect6")
 int set_cookie(struct bpf_sock_addr *ctx)
diff --git a/tools/testing/selftests/bpf/progs/test_btf_newkv.c 
b/tools/testing/selftests/bpf/progs/test_btf_newkv.c
index 28c16bb583b6..572ee1050e61 100644
--- a/tools/testing/selftests/bpf/progs/test_btf_newkv.c
+++ b/tools/testing/selftests/bpf/progs/test_btf_newkv.c
@@ -21,14 +21,11 @@ struct bpf_

[PATCH v2 bpf-next 2/4] selftests/bpf: add __int and __type macro for BTF-defined maps

2019-06-28 Thread Andrii Nakryiko
Add simple __int and __type macros that hide the details of how type and
integer values are captured in BTF-defined maps.

Signed-off-by: Andrii Nakryiko 
---
 tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 1a5b1accf091..aa5ddf58c088 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -8,6 +8,9 @@
  */
 #define SEC(NAME) __attribute__((section(NAME), used))
 
+#define __int(name, val) int (*name)[val]
+#define __type(name, val) val *name
+
 /* helper macro to print out debug messages */
 #define bpf_printk(fmt, ...)   \
 ({ \
-- 
2.17.1
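
For illustration: a map declared with these macros, e.g.

struct {
	__int(type, BPF_MAP_TYPE_ARRAY);
	__int(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} cg_ids SEC(".maps");

expands after preprocessing to roughly

struct {
	int (*type)[BPF_MAP_TYPE_ARRAY];
	int (*max_entries)[1];
	__u32 *key;
	__u64 *value;
} cg_ids SEC(".maps");

so the integer values and key/value types are captured purely in the
struct's BTF (pointer-to-array and pointer types), with no static data
initialization needed.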



Re: [PATCH 0/5] net: dsa: microchip: Further regmap cleanups

2019-06-28 Thread Vivien Didelot
Hi Marek,

On Thu, 27 Jun 2019 23:55:51 +0200, Marek Vasut  wrote:
> This patchset cleans up KSZ9477 switch driver by replacing various
> ad-hoc polling implementations and register RMW with regmap functions.
> 
> Each polling function is replaced separately to make it easier to review
> and possibly bisect, but maybe the patches can be squashed.
> 
> Signed-off-by: Marek Vasut 
> Cc: Andrew Lunn 
> Cc: Florian Fainelli 
> Cc: Tristram Ha 
> Cc: Woojung Huh 

Please copy me next time, as per the MAINTAINERS file.


Thanks,
Vivien


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Stephen Hemminger
On Fri, 28 Jun 2019 15:55:53 +0200
Jiri Pirko  wrote:

> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
> >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:  
> >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:  
> >> >On Thu, 27 Jun 2019 20:39:48 +0200
> >> >Michal Kubecek  wrote:
> >> >  
> >> >> > 
> >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> >> >> > # ip link show "Onboard Ethernet"
> >> >> > Device "Onboard Ethernet" does not exist.
> >> >> > 
> >> >> > So it does not really appear to be an alias, it is a label. To be
> >> >> > truly useful, it needs to be more than a label, it needs to be a real
> >> >> > alias which you can use.
> >> >> 
> >> >> That's exactly what I meant: to be really useful, one should be able to
> >> >> use the alias(es) for setting device options, for adding routes, in
> >> >> netfilter rules etc.
> >> >> 
> >> >> Michal  
> >> >
> >> >The kernel doesn't enforce uniqueness of alias.
> >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily fixed).
> >> >
> >> >If it did, then handling it in iproute would be something like:  
> >> 
> >> I think that it is desired for kernel to work with "real alias" as a
> >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
> >> Userspace mapping like you did here might be perhaps okay for iproute2,
> >> but I think that we need something easy to use for all.
> >> 
> >> Let's call it "altname". Get would return:
> >> 
> >> IFLA_NAME  eth0
> >> IFLA_ALT_NAME_LIST
> >>    IFLA_ALT_NAME  eth0
> >>    IFLA_ALT_NAME  somethingelse
> >>    IFLA_ALT_NAME  somenamethatisreallylong
> >
> >Hi Jiri
> >
> >What is your user case for having multiple IFLA_ALT_NAME for the same
> >IFLA_NAME?  
> 
> I don't know about specific usecase for having more. Perhaps Michal
> does.
> 
> From the implementation perspective it is handy to have the ifname as
> the first alt name in kernel, so the userspace would just pass
> IFLA_ALT_NAME always. Also for avoiding name collisions etc.

I like the alternate name proposal. The kernel would have to impose uniqueness.
Does an alt_name have to be unique across both regular names and alt_names?
Having a list of multiple names seems less interesting, but it could be useful.


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Jiri Pirko
Fri, Jun 28, 2019 at 05:44:47PM CEST, step...@networkplumber.org wrote:
>On Fri, 28 Jun 2019 15:55:53 +0200
>Jiri Pirko  wrote:
>
>> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
>> >On Fri, Jun 28, 2019 at 01:12:16PM +0200, Jiri Pirko wrote:  
>> >> Thu, Jun 27, 2019 at 09:20:41PM CEST, step...@networkplumber.org wrote:  
>> >> >On Thu, 27 Jun 2019 20:39:48 +0200
>> >> >Michal Kubecek  wrote:
>> >> >  
>> >> >> > 
>> >> >> > $ ip li set dev enp3s0 alias "Onboard Ethernet"
>> >> >> > # ip link show "Onboard Ethernet"
>> >> >> > Device "Onboard Ethernet" does not exist.
>> >> >> > 
>> >> >> > So it does not really appear to be an alias, it is a label. To be
>> >> >> > truly useful, it needs to be more than a label, it needs to be a real
>> >> >> > alias which you can use.
>> >> >> 
>> >> >> That's exactly what I meant: to be really useful, one should be able to
>> >> >> use the alias(es) for setting device options, for adding routes, in
>> >> >> netfilter rules etc.
>> >> >> 
>> >> >> Michal  
>> >> >
>> >> >The kernel doesn't enforce uniqueness of alias.
>> >> >Also current kernel RTM_GETLINK doesn't do filter by alias (easily 
>> >> >fixed).
>> >> >
>> >> >If it did, then handling it in iproute would be something like:  
>> >> 
>> >> I think that it is desired for kernel to work with "real alias" as a
>> >> handle. Userspace could either pass ifindex, IFLA_NAME or "real alias".
>> >> Userspace mapping like you did here might be perhaps okay for iproute2,
>> >> but I think that we need something easy to use for all.
>> >> 
>> >> Let's call it "altname". Get would return:
>> >> 
>> >> IFLA_NAME  eth0
>> >> IFLA_ALT_NAME_LIST
>> >>    IFLA_ALT_NAME  eth0
>> >>    IFLA_ALT_NAME  somethingelse
>> >>    IFLA_ALT_NAME  somenamethatisreallylong
>> >
>> >Hi Jiri
>> >
>> >What is your user case for having multiple IFLA_ALT_NAME for the same
>> >IFLA_NAME?  
>> 
>> I don't know about specific usecase for having more. Perhaps Michal
>> does.
>> 
>> From the implementation perspective it is handy to have the ifname as
>> the first alt name in kernel, so the userspace would just pass
>> IFLA_ALT_NAME always. Also for avoiding name collisions etc.
>
>I like the alternate name proposal. The kernel would have to impose
>uniqueness.
>Does an alt_name have to be unique across both regular names and alt_names?

Yes. That is my idea. To have one big hashtable to contain them all.


>Having a list of multiple names seems less interesting, but it could be useful.

Yeah. Okay, I'm going to jump on this.



Re: [PATCH v3 bpf-next 2/9] libbpf: introduce concept of bpf_link

2019-06-28 Thread Stanislav Fomichev
On 06/27, Andrii Nakryiko wrote:
> bpf_link is an abstraction of an association of a BPF program and one
> of many possible BPF attachment points (hooks). This allows having a
> uniform interface for detaching BPF programs regardless of the nature of
> the link and how it was created. Details of creation and setting up of
> a specific bpf_link are handled by corresponding attachment methods
> (bpf_program__attach_xxx) added in subsequent commits. Once successfully
> created, a bpf_link has to be eventually destroyed with
> bpf_link__destroy(), at which point the BPF program is disassociated from
> a hook and all the relevant resources are freed.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
>  tools/lib/bpf/libbpf.c   | 17 +
>  tools/lib/bpf/libbpf.h   |  4 
>  tools/lib/bpf/libbpf.map |  3 ++-
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6e6ebef11ba3..455795e6f8af 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -3941,6 +3941,23 @@ int bpf_prog_load_xattr(const struct 
> bpf_prog_load_attr *attr,
>   return 0;
>  }
>  
> +struct bpf_link {
Maybe call it bpf_attachment? You call the bpf_program__attach_to_blah
and you get an attachment?

> + int (*destroy)(struct bpf_link *link);
> +};
> +
> +int bpf_link__destroy(struct bpf_link *link)
> +{
> + int err;
> +
> + if (!link)
> + return 0;
> +
> + err = link->destroy(link);
> + free(link);
> +
> + return err;
> +}
> +
>  enum bpf_perf_event_ret
>  bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> page_size,
>  void **copy_mem, size_t *copy_size,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index d639f47e3110..5082a5ebb0c2 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -165,6 +165,10 @@ LIBBPF_API int bpf_program__pin(struct bpf_program 
> *prog, const char *path);
>  LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char 
> *path);
>  LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
>  
> +struct bpf_link;
> +
> +LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
> +
>  struct bpf_insn;
>  
>  /*
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 2c6d835620d2..3cde850fc8da 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -167,10 +167,11 @@ LIBBPF_0.0.3 {
>  
>  LIBBPF_0.0.4 {
>   global:
> + bpf_link__destroy;
> + bpf_object__load_xattr;
>   btf_dump__dump_type;
>   btf_dump__free;
>   btf_dump__new;
>   btf__parse_elf;
> - bpf_object__load_xattr;
>   libbpf_num_possible_cpus;
>  } LIBBPF_0.0.3;
> -- 
> 2.17.1
> 


[PATCH iproute2-next] utils: move parse_percent() to tc_util

2019-06-28 Thread Andrea Claudi
parse_percent() is used only in tc, so move it there.

This reduces ip, bridge and genl binaries size:

$ bloat-o-meter -t bridge/bridge bridge/bridge.new
add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-109 (-109)
Total: Before=50973, After=50864, chg -0.21%

$ bloat-o-meter -t genl/genl genl/genl.new
add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-109 (-109)
Total: Before=30298, After=30189, chg -0.36%

$ bloat-o-meter ip/ip ip/ip.new
add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-109 (-109)
Total: Before=674164, After=674055, chg -0.02%

Signed-off-by: Andrea Claudi 
---
 include/utils.h |  1 -
 lib/utils.c | 16 
 tc/tc_util.c| 16 
 tc/tc_util.h|  1 +
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index ec0231f9772d0..1d9c11276bbe6 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -146,7 +146,6 @@ int get_addr_rta(inet_prefix *dst, const struct rtattr 
*rta, int family);
 int get_addr_ila(__u64 *val, const char *arg);
 
 int read_prop(const char *dev, char *prop, long *value);
-int parse_percent(double *val, const char *str);
 int get_hex(char c);
 int get_integer(int *val, const char *arg, int base);
 int get_unsigned(unsigned *val, const char *arg, int base);
diff --git a/lib/utils.c b/lib/utils.c
index be0f11b00280d..5da9a47848966 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -101,22 +101,6 @@ out:
return -1;
 }
 
-/* Parse a percent e.g: '30%'
- * return: 0 = ok, -1 = error, 1 = out of range
- */
-int parse_percent(double *val, const char *str)
-{
-   char *p;
-
-   *val = strtod(str, &p) / 100.;
-   if (*val == HUGE_VALF || *val == HUGE_VALL)
-   return 1;
-   if (*p && strcmp(p, "%"))
-   return -1;
-
-   return 0;
-}
-
 int get_hex(char c)
 {
if (c >= 'A' && c <= 'F')
diff --git a/tc/tc_util.c b/tc/tc_util.c
index e5d15281581df..53d15e08e9734 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -190,6 +190,22 @@ static const struct rate_suffix {
{ NULL }
 };
 
+/* Parse a percent e.g: '30%'
+ * return: 0 = ok, -1 = error, 1 = out of range
+ */
+int parse_percent(double *val, const char *str)
+{
+   char *p;
+
+   *val = strtod(str, &p) / 100.;
+   if (*val == HUGE_VALF || *val == HUGE_VALL)
+   return 1;
+   if (*p && strcmp(p, "%"))
+   return -1;
+
+   return 0;
+}
+
 static int parse_percent_rate(char *rate, size_t len,
  const char *str, const char *dev)
 {
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 825fea36a0809..eb4b60db3fdd7 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -101,6 +101,7 @@ int print_tc_classid(char *buf, int len, __u32 h);
 char *sprint_tc_classid(__u32 h, char *buf);
 
 int tc_print_police(FILE *f, struct rtattr *tb);
+int parse_percent(double *val, const char *str);
 int parse_police(int *argc_p, char ***argv_p, int tca_id, struct nlmsghdr *n);
 
 int parse_action_control(int *argc_p, char ***argv_p,
-- 
2.20.1
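
A minimal usage sketch (the rate computation and the dev_speed variable are
illustrative, not taken from the patch):

	double frac;
	__u64 rate;

	/* "30%" -> 0.30; returns 0 = ok, -1 = error, 1 = out of range */
	if (parse_percent(&frac, "30%") != 0)
		return -1;
	rate = (__u64)(frac * dev_speed);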



Re: [PATCH v3 bpf-next 3/9] libbpf: add ability to attach/detach BPF program to perf event

2019-06-28 Thread Stanislav Fomichev
On 06/27, Andrii Nakryiko wrote:
> bpf_program__attach_perf_event allows attaching a BPF program to an existing
> perf event hook, providing the most generic and most low-level way to attach
> BPF programs. It returns a struct bpf_link, which should be passed to
> bpf_link__destroy to detach and free the resources associated with a link.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
>  tools/lib/bpf/libbpf.c   | 58 
>  tools/lib/bpf/libbpf.h   |  3 +++
>  tools/lib/bpf/libbpf.map |  1 +
>  3 files changed, 62 insertions(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 455795e6f8af..606705f878ba 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -3958,6 +3959,63 @@ int bpf_link__destroy(struct bpf_link *link)
>   return err;
>  }
>  
> +struct bpf_link_fd {
> + struct bpf_link link; /* has to be at the top of struct */
[..]
> + int fd; /* hook FD */
> +};
Any cons to storing everything in bpf_link, instead of creating a
"subclass"? Less things to worry about.

> +static int bpf_link__destroy_perf_event(struct bpf_link *link)
> +{
> + struct bpf_link_fd *l = (void *)link;
> + int err;
> +
> + if (l->fd < 0)
> + return 0;
> +
> + err = ioctl(l->fd, PERF_EVENT_IOC_DISABLE, 0);
> + close(l->fd);
> + return err;
> +}
> +
> +struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
> + int pfd)
> +{
> + char errmsg[STRERR_BUFSIZE];
> + struct bpf_link_fd *link;
> + int bpf_fd, err;
> +
> + bpf_fd = bpf_program__fd(prog);
> + if (bpf_fd < 0) {
> + pr_warning("program '%s': can't attach before loaded\n",
> +bpf_program__title(prog, false));
> + return ERR_PTR(-EINVAL);
> + }
> +
> + link = malloc(sizeof(*link));
> + if (!link)
> + return ERR_PTR(-ENOMEM);
> + link->link.destroy = &bpf_link__destroy_perf_event;
> + link->fd = pfd;
> +
> + if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, bpf_fd) < 0) {
> + err = -errno;
> + free(link);
> + pr_warning("program '%s': failed to attach to pfd %d: %s\n",
> +bpf_program__title(prog, false), pfd,
> +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> + return ERR_PTR(err);
> + }
> + if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
> + err = -errno;
> + free(link);
> + pr_warning("program '%s': failed to enable pfd %d: %s\n",
> +bpf_program__title(prog, false), pfd,
> +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> + return ERR_PTR(err);
> + }
> + return (struct bpf_link *)link;
> +}
> +
>  enum bpf_perf_event_ret
>  bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> page_size,
>  void **copy_mem, size_t *copy_size,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 5082a5ebb0c2..1bf66c4a9330 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -169,6 +169,9 @@ struct bpf_link;
>  
>  LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
>  
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_perf_event(struct bpf_program *prog, int pfd);
> +
>  struct bpf_insn;
>  
>  /*
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 3cde850fc8da..756f5aa802e9 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -169,6 +169,7 @@ LIBBPF_0.0.4 {
>   global:
>   bpf_link__destroy;
>   bpf_object__load_xattr;
> + bpf_program__attach_perf_event;
>   btf_dump__dump_type;
>   btf_dump__free;
>   btf_dump__new;
> -- 
> 2.17.1
> 
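
For context, the intended call pattern for the two APIs in this patch is
roughly the following sketch (prog is a loaded bpf_program, pfd an existing
perf event FD; error handling abbreviated):

	struct bpf_link *link;
	long err;

	link = bpf_program__attach_perf_event(prog, pfd);
	err = libbpf_get_error(link);
	if (err)
		return err;	/* attach failed, nothing to clean up */
	/* ... BPF program now runs on each perf event ... */
	bpf_link__destroy(link);	/* PERF_EVENT_IOC_DISABLE + close(pfd) */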


Re: [PATCH 00/11] XDP unaligned chunk placement support

2019-06-28 Thread Laatz, Kevin

On 27/06/2019 22:25, Jakub Kicinski wrote:

On Thu, 27 Jun 2019 12:14:50 +0100, Laatz, Kevin wrote:

On the application side (xdpsock), we don't have to worry about the user
defined headroom, since it is 0, so we only need to account for the
XDP_PACKET_HEADROOM when computing the original address (in the default
scenario).

That assumes specific layout for the data inside the buffer.  Some NICs
will prepend information like timestamp to the packet, meaning the
packet would start at offset XDP_PACKET_HEADROOM + metadata len..


Yes, if NICs prepend extra data to the packet that would be a problem for
using this feature in isolation. However, if we also add in support for
in-order RX and TX rings, that would no longer be an issue. However, even
for NICs which do prepend data, this patchset should not break anything
that is currently working.



I think that's very limiting.  What is the challenge in providing
aligned addresses, exactly?

The challenges are two-fold:
1) it prevents using arbitrary buffer sizes, which will be an issue 
supporting e.g. jumbo frames in future.
2) higher level user-space frameworks which may want to use AF_XDP, such 
as DPDK, do not currently support having buffers with 'fixed' alignment.

    The reason that DPDK uses arbitrary placement is that:
        - it would stop things working on certain NICs which need the 
actual writable space specified in units of 1k - therefore we need 2k + 
metadata space.
        - we place padding between buffers to avoid constantly hitting 
the same memory channels when accessing memory.
        - it allows the application to choose the actual buffer size it 
wants to use.
    We make use of the above to allow us to speed up processing 
significantly and also reduce the packet buffer memory size.


    Not having arbitrary buffer alignment also means an AF_XDP driver 
for DPDK cannot be a drop-in replacement for existing drivers in those 
frameworks. Even with a new capability to allow an arbitrary buffer 
alignment, existing apps will need to be modified to use that new 
capability.
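
Concretely, the "original address" computation mentioned above reduces, in
the default xdpsock scenario (zero user-defined headroom), to a sketch like
this (helper name is made up):

	static inline __u64 xdpsock_orig_addr(__u64 addr)
	{
		/* addr points at packet data; the buffer itself starts
		 * XDP_PACKET_HEADROOM bytes earlier
		 */
		return addr - XDP_PACKET_HEADROOM;
	}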




[PATCH bpf-next] bpf: fix precision tracking

2019-06-28 Thread Alexei Starovoitov
When an equivalent state is found, the current state needs to propagate
precision marks. Otherwise the verifier will prune the search incorrectly.

There is a price for correctness:
                      before    before    broken    fixed
                      cnst      spill     precise   precise
bpf_lb-DLB_L3.o       1923      8128      1863      1898
bpf_lb-DLB_L4.o       3077      6707      2468      2666
bpf_lb-DUNKNOWN.o     1062      1062      544       544
bpf_lxc-DDROP_ALL.o   166729    380712    22629     36823
bpf_lxc-DUNKNOWN.o    174607    440652    28805     45325
bpf_netdev.o          8407      31904     6801      7002
bpf_overlay.o         5420      23569     4754      4858
bpf_lxc_jit.o         39389     359445    50925     69631
Overall precision tracking is still very effective.

Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
Reported-by: Lawrence Brakmo 
Signed-off-by: Alexei Starovoitov 
---
Sending the fix early w/o tests, since I'm traveling.
Will add proper tests when I'm back.
---
 kernel/bpf/verifier.c | 121 +-
 1 file changed, 107 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6b5623d320f9..62afc4058ced 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1659,16 +1659,18 @@ static void mark_all_scalars_precise(struct 
bpf_verifier_env *env,
}
 }
 
-static int mark_chain_precision(struct bpf_verifier_env *env, int regno)
+static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
+ int spi)
 {
struct bpf_verifier_state *st = env->cur_state;
int first_idx = st->first_insn_idx;
int last_idx = env->insn_idx;
struct bpf_func_state *func;
struct bpf_reg_state *reg;
-   u32 reg_mask = 1u << regno;
-   u64 stack_mask = 0;
+   u32 reg_mask = regno >= 0 ? 1u << regno : 0;
+   u64 stack_mask = spi >= 0 ? 1ull << spi : 0;
bool skip_first = true;
+   bool new_marks = false;
int i, err;
 
if (!env->allow_ptr_leaks)
@@ -1676,18 +1678,43 @@ static int mark_chain_precision(struct bpf_verifier_env 
*env, int regno)
return 0;
 
func = st->frame[st->curframe];
-   reg = &func->regs[regno];
-   if (reg->type != SCALAR_VALUE) {
-   WARN_ONCE(1, "backtracing misuse");
-   return -EFAULT;
+   if (regno >= 0) {
+   reg = &func->regs[regno];
+   if (reg->type != SCALAR_VALUE) {
+   WARN_ONCE(1, "backtracing misuse");
+   return -EFAULT;
+   }
+   if (!reg->precise)
+   new_marks = true;
+   else
+   reg_mask = 0;
+   reg->precise = true;
}
-   if (reg->precise)
-   return 0;
-   func->regs[regno].precise = true;
 
+   while (spi >= 0) {
+   if (func->stack[spi].slot_type[0] != STACK_SPILL) {
+   stack_mask = 0;
+   break;
+   }
+   reg = &func->stack[spi].spilled_ptr;
+   if (reg->type != SCALAR_VALUE) {
+   stack_mask = 0;
+   break;
+   }
+   if (!reg->precise)
+   new_marks = true;
+   else
+   stack_mask = 0;
+   reg->precise = true;
+   break;
+   }
+
+   if (!new_marks)
+   return 0;
+   if (!reg_mask && !stack_mask)
+   return 0;
for (;;) {
DECLARE_BITMAP(mask, 64);
-   bool new_marks = false;
u32 history = st->jmp_history_cnt;
 
if (env->log.level & BPF_LOG_LEVEL)
@@ -1730,12 +1757,15 @@ static int mark_chain_precision(struct bpf_verifier_env 
*env, int regno)
if (!st)
break;
 
+   new_marks = false;
func = st->frame[st->curframe];
bitmap_from_u64(mask, reg_mask);
for_each_set_bit(i, mask, 32) {
reg = &func->regs[i];
-   if (reg->type != SCALAR_VALUE)
+   if (reg->type != SCALAR_VALUE) {
+   reg_mask &= ~(1u << i);
continue;
+   }
if (!reg->precise)
new_marks = true;
reg->precise = true;
@@ -1756,11 +1786,15 @@ static int mark_chain_precision(struct bpf_verifier_env 
*env, int regno)
return -EFAULT;
}
 
-   if (func->stack[i].slot_type[0] != STACK_SPILL)
+   if (func->stack[i].slot_type[0] != STACK_SPILL) {
+   stack_mask &= 

Re: [RFC iproute2 1/1] ip: netns: add mounted state file for each netns

2019-06-28 Thread David Howells
Nicolas Dichtel  wrote:

> David Howells was working on a mount notification mechanism:
> https://lwn.net/Articles/760714/
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> 
> I don't know what is the status of this series.

It's still alive.  I just posted a new version on it.  I'm hoping, possibly
futiley, to get it in in this merge window.

David


Re: [RFC] longer netdev names proposal

2019-06-28 Thread Michal Kubecek
On Fri, Jun 28, 2019 at 03:55:53PM +0200, Jiri Pirko wrote:
> Fri, Jun 28, 2019 at 03:14:01PM CEST, and...@lunn.ch wrote:
> >
> >What is your user case for having multiple IFLA_ALT_NAME for the same
> >IFLA_NAME?
> 
> I don't know about specific usecase for having more. Perhaps Michal
> does.

One use case that comes to my mind are the "predictable names"
implemented by udev/systemd which can be based on different naming
schemes (bus address, BIOS numbering, MAC address etc.) and it's not
always obvious which scheme is going to be used. I have even seen
multiple times that one scheme was used during system installation and
another in the installed system, so that the network configuration created
by the installer did not work.

For block devices, current practice is not to rename the device and only
create multiple symlinks based on different naming schemes (by id, by
uuid, by label, etc.). With support for multiple altnames, we could also
identify the network device in different ways (all applicable ones).

Michal


Re: [PATCH v2 net-next 0/3] Better PHYLINK compliance for SJA1105 DSA

2019-06-28 Thread David Miller
From: Russell King - ARM Linux admin 
Date: Fri, 28 Jun 2019 11:05:42 +0100

> On Fri, Jun 28, 2019 at 12:46:34AM +0300, Vladimir Oltean wrote:
>> After discussing with Russell King, it appears this driver is making a
>> few confusions and not performing some checks for consistent operation.
>> 
>> Changes in v2:
>> - Removed redundant print in the phylink_validate callback (in 2/3).
>> 
>> Vladimir Oltean (3):
>>   net: dsa: sja1105: Don't check state->link in phylink_mac_config
>>   net: dsa: sja1105: Check for PHY mode mismatches with what PHYLINK
>> reports
>>   net: dsa: sja1105: Mark in-band AN modes not supported for PHYLINK
>> 
>>  drivers/net/dsa/sja1105/sja1105_main.c | 56 +-
>>  1 file changed, 54 insertions(+), 2 deletions(-)
> 
> Thanks.  For the whole series:
> 
> Acked-by: Russell King 

Series applied, thanks all.


Re: [PATCH 0/5] net: dsa: microchip: Further regmap cleanups

2019-06-28 Thread Marek Vasut
On 6/28/19 5:31 PM, Vivien Didelot wrote:
> Hi Marek,
> 
> On Thu, 27 Jun 2019 23:55:51 +0200, Marek Vasut  wrote:
>> This patchset cleans up KSZ9477 switch driver by replacing various
>> ad-hoc polling implementations and register RMW with regmap functions.
>>
>> Each polling function is replaced separately to make it easier to review
>> and possibly bisect, but maybe the patches can be squashed.
>>
>> Signed-off-by: Marek Vasut 
>> Cc: Andrew Lunn 
>> Cc: Florian Fainelli 
>> Cc: Tristram Ha 
>> Cc: Woojung Huh 
> 
> Please copy me next time, as per the MAINTAINERS file.

Will do. I plan to submit KSZ8795 support next.

-- 
Best regards,
Marek Vasut


Re: [PATCH v3 bpf-next 2/9] libbpf: introduce concept of bpf_link

2019-06-28 Thread Andrii Nakryiko
On Fri, Jun 28, 2019 at 9:02 AM Stanislav Fomichev  wrote:
>
> On 06/27, Andrii Nakryiko wrote:
> > bpf_link is an abstraction of an association of a BPF program and one
> > of many possible BPF attachment points (hooks). This allows having a
> > uniform interface for detaching BPF programs regardless of the nature of
> > the link and how it was created. Details of creation and setting up of
> > a specific bpf_link are handled by corresponding attachment methods
> > (bpf_program__attach_xxx) added in subsequent commits. Once successfully
> > created, a bpf_link has to be eventually destroyed with
> > bpf_link__destroy(), at which point the BPF program is disassociated from
> > a hook and all the relevant resources are freed.
> >
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  tools/lib/bpf/libbpf.c   | 17 +
> >  tools/lib/bpf/libbpf.h   |  4 
> >  tools/lib/bpf/libbpf.map |  3 ++-
> >  3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 6e6ebef11ba3..455795e6f8af 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -3941,6 +3941,23 @@ int bpf_prog_load_xattr(const struct 
> > bpf_prog_load_attr *attr,
> >   return 0;
> >  }
> >
> > +struct bpf_link {
> Maybe call it bpf_attachment? You call the bpf_program__attach_to_blah
> and you get an attachment?

I wanted to keep it as short as possible, bpf_attachment is way too
long (it's also why as an alternative I've proposed bpf_assoc, not
bpf_association, but bpf_attach isn't great shortening).

>
> > + int (*destroy)(struct bpf_link *link);
> > +};
> > +
> > +int bpf_link__destroy(struct bpf_link *link)
> > +{
> > + int err;
> > +
> > + if (!link)
> > + return 0;
> > +
> > + err = link->destroy(link);
> > + free(link);
> > +
> > + return err;
> > +}
> > +
> >  enum bpf_perf_event_ret
> >  bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> > page_size,
> >  void **copy_mem, size_t *copy_size,
> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > index d639f47e3110..5082a5ebb0c2 100644
> > --- a/tools/lib/bpf/libbpf.h
> > +++ b/tools/lib/bpf/libbpf.h
> > @@ -165,6 +165,10 @@ LIBBPF_API int bpf_program__pin(struct bpf_program 
> > *prog, const char *path);
> >  LIBBPF_API int bpf_program__unpin(struct bpf_program *prog, const char 
> > *path);
> >  LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
> >
> > +struct bpf_link;
> > +
> > +LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
> > +
> >  struct bpf_insn;
> >
> >  /*
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index 2c6d835620d2..3cde850fc8da 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -167,10 +167,11 @@ LIBBPF_0.0.3 {
> >
> >  LIBBPF_0.0.4 {
> >   global:
> > + bpf_link__destroy;
> > + bpf_object__load_xattr;
> >   btf_dump__dump_type;
> >   btf_dump__free;
> >   btf_dump__new;
> >   btf__parse_elf;
> > - bpf_object__load_xattr;
> >   libbpf_num_possible_cpus;
> >  } LIBBPF_0.0.3;
> > --
> > 2.17.1
> >


Re: [PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Fri, 28 Jun 2019 15:04:34 +0200

> On Fri, 28 Jun 2019 13:39:13 +0300
> Ilias Apalodimas  wrote:
> 
>> Use page_pool and its DMA mapping capabilities for Rx buffers instead
>> of netdev/napi_alloc_frag()
>> 
>> Although this will result in a slight performance penalty on small sized
>> packets (~10%) the use of the API will allow to easily add XDP support.
>> The penalty won't be visible in network testing, i.e. iperf/netperf etc., it
>> only happens during raw packet drops.
>> Furthermore we intend to add recycling capabilities on the API
>> in the future. Once the recycling is added the performance penalty will
>> go away.
>> The only 'real' penalty is the slightly increased memory usage, since we
>> now allocate a page per packet instead of the amount of bytes we need +
>> skb metadata (difference is roughly 2kb per packet).
>> With a minimum of 4GB of RAM on the only SoC that has this NIC the
>> extra memory usage is negligible (a bit more on 64K pages)
>> 
>> Signed-off-by: Ilias Apalodimas 
>> ---
>>  drivers/net/ethernet/socionext/Kconfig  |   1 +
>>  drivers/net/ethernet/socionext/netsec.c | 121 +++-
>>  2 files changed, 75 insertions(+), 47 deletions(-)
> 
> Acked-by: Jesper Dangaard Brouer 

Jesper this is confusing, you just asked if the code needs to be moved
around to be correct and then right now immediately afterwards you ACK
the patch.


Re: [PATCH 3/3, net-next] net: netsec: add XDP support

2019-06-28 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Fri, 28 Jun 2019 15:35:52 +0200

> On Fri, 28 Jun 2019 13:39:15 +0300
> Ilias Apalodimas  wrote:
> 
>> +static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog *prog,
>> +struct netlink_ext_ack *extack)
>> +{
>> +struct net_device *dev = priv->ndev;
>> +struct bpf_prog *old_prog;
>> +
>> +/* For now just support only the usual MTU sized frames */
>> +if (prog && dev->mtu > 1500) {
>> +NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
>> +return -EOPNOTSUPP;
>> +}
>> +
>> +if (netif_running(dev))
>> +netsec_netdev_stop(dev);
>> +
>> +/* Detach old prog, if any */
>> +old_prog = xchg(&priv->xdp_prog, prog);
>> +if (old_prog)
>> +bpf_prog_put(old_prog);
>> +
>> +if (netif_running(dev))
>> +netsec_netdev_open(dev);
> 
> Shouldn't the if-statement be if (!netif_running(dev))

Hmmm, does netsec_netdev_stop() clear the running flag?  That just
runs the driver internal routine and doesn't update IFF_UP et
al. which the core networking would do.


Re: What to do when a bridge port gets its pvid deleted?

2019-06-28 Thread Florian Fainelli
On 6/28/19 5:37 AM, Vladimir Oltean wrote:
> On Fri, 28 Jun 2019 at 15:30, Ido Schimmel  wrote:
>>
>> On Tue, Jun 25, 2019 at 11:49:29PM +0300, Vladimir Oltean wrote:
>>> A number of DSA drivers (BCM53XX, Microchip KSZ94XX, Mediatek MT7530
>>> at the very least), as well as Mellanox Spectrum (I didn't look at all
>>> the pure switchdev drivers) try to restore the pvid to a default value
>>> on .port_vlan_del.
>>
>> I don't know about DSA drivers, but that's not what mlxsw is doing. If
>> the VLAN that is configured as PVID is deleted from the bridge port, the
>> driver instructs the port to discard untagged and prio-tagged packets.
>> This is consistent with the bridge driver's behavior.
>>
>> We do have a flow the "restores" the PVID, but that's when a port is
>> unlinked from its bridge master. The PVID we set is 4095 which cannot be
>> configured by the 8021q / bridge driver. This is due to the way the
>> underlying hardware works. Even if a port is not bridged and used purely
>> for routing, packets still do L2 lookup first which sends them directly
>> to the router block. If PVID is not configured, untagged packets could
>> not be routed. Obviously, at egress we strip this VLAN.
>>
>>> Sure, the port stops receiving traffic when its pvid is a VLAN ID that
>>> is not installed in its hw filter, but as far as the bridge core is
>>> concerned, this is to be expected:
>>>
>>> # bridge vlan add dev swp2 vid 100 pvid untagged
>>> # bridge vlan
>>> portvlan ids
>>> swp5 1 PVID Egress Untagged
>>>
>>> swp2 1 Egress Untagged
>>>  100 PVID Egress Untagged
>>>
>>> swp3 1 PVID Egress Untagged
>>>
>>> swp4 1 PVID Egress Untagged
>>>
>>> br0  1 PVID Egress Untagged
>>> # ping 10.0.0.1
>>> PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
>>> 64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.682 ms
>>> 64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.299 ms
>>> 64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.251 ms
>>> 64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.324 ms
>>> 64 bytes from 10.0.0.1: icmp_seq=5 ttl=64 time=0.257 ms
>>> ^C
>>> --- 10.0.0.1 ping statistics ---
>>> 5 packets transmitted, 5 received, 0% packet loss, time 4188ms
>>> rtt min/avg/max/mdev = 0.251/0.362/0.682/0.163 ms
>>> # bridge vlan del dev swp2 vid 100
>>> # bridge vlan
>>> portvlan ids
>>> swp5 1 PVID Egress Untagged
>>>
>>> swp2 1 Egress Untagged
>>>
>>> swp3 1 PVID Egress Untagged
>>>
>>> swp4 1 PVID Egress Untagged
>>>
>>> br0  1 PVID Egress Untagged
>>>
>>> # ping 10.0.0.1
>>> PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
>>> ^C
>>> --- 10.0.0.1 ping statistics ---
>>> 8 packets transmitted, 0 received, 100% packet loss, time 7267ms
>>>
>>> What is the consensus here? Is there a reason why the bridge driver
>>> doesn't take care of this?
>>
>> Take care of what? :) Always maintaining a PVID on the bridge port? It's
>> completely OK not to have a PVID.
>>
> 
> Yes, I didn't think it through during the first email. I came to the
> same conclusion in the second one.
> 
>>> Do switchdev drivers have to restore the pvid to always be
>>> operational, even if their state becomes inconsistent with the upper
>>> dev? Is it just 'nice to have'? What if VID 1 isn't in the hw filter
>>> either (perfectly legal)?
>>
>> Are you saying that DSA drivers always maintain a PVID on the bridge
>> port and allow untagged traffic to ingress regardless of the bridge
>> driver's configuration? If so, I think this needs to be fixed.
> 
> Well, not at the DSA core level.
> But for Microchip:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/microchip/ksz9477.c#n576
> For Broadcom:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/b53/b53_common.c#n1376
> For Mediatek:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/dsa/mt7530.c#n1196
> 
> There might be others as well.

That sounds bogus indeed, and I bet that the two other drivers just
followed the b53 driver there. I will have to test this again and come
up with a patch eventually.

When the port leaves the bridge we do bring it back into a default PVID
(which is different than the bridge's default PVID) so that part should
be okay.
-- 
Florian


Re: [PATCH 3/3, net-next] net: netsec: add XDP support

2019-06-28 Thread Ilias Apalodimas
On Fri, Jun 28, 2019 at 03:35:52PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 28 Jun 2019 13:39:15 +0300
> Ilias Apalodimas  wrote:
> 
> > +static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog 
> > *prog,
> > +   struct netlink_ext_ack *extack)
> > +{
> > +   struct net_device *dev = priv->ndev;
> > +   struct bpf_prog *old_prog;
> > +
> > +   /* For now just support only the usual MTU sized frames */
> > +   if (prog && dev->mtu > 1500) {
> > +   NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
> > +   return -EOPNOTSUPP;
> > +   }
> > +
> > +   if (netif_running(dev))
> > +   netsec_netdev_stop(dev);
> > +
> > +   /* Detach old prog, if any */
> > +   old_prog = xchg(&priv->xdp_prog, prog);
> > +   if (old_prog)
> > +   bpf_prog_put(old_prog);
> > +
> > +   if (netif_running(dev))
> > +   netsec_netdev_open(dev);
> 
> Shouldn't the if-statement be if (!netif_running(dev))
> 
> > +
This is there to restart the device if it's up already (to rebuild the rings).
This should be fine as-is

> > +   return 0;
> > +}

Thanks
/Ilias


Re: [PATCH v3 bpf-next 3/9] libbpf: add ability to attach/detach BPF program to perf event

2019-06-28 Thread Andrii Nakryiko
On Fri, Jun 28, 2019 at 9:04 AM Stanislav Fomichev  wrote:
>
> On 06/27, Andrii Nakryiko wrote:
> > bpf_program__attach_perf_event allows attaching a BPF program to an existing
> > perf event hook, providing the most generic and most low-level way to attach
> > BPF programs. It returns a struct bpf_link, which should be passed to
> > bpf_link__destroy to detach and free the resources associated with a link.
> >
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  tools/lib/bpf/libbpf.c   | 58 
> >  tools/lib/bpf/libbpf.h   |  3 +++
> >  tools/lib/bpf/libbpf.map |  1 +
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 455795e6f8af..606705f878ba 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -3958,6 +3959,63 @@ int bpf_link__destroy(struct bpf_link *link)
> >   return err;
> >  }
> >
> > +struct bpf_link_fd {
> > + struct bpf_link link; /* has to be at the top of struct */
> [..]
> > + int fd; /* hook FD */
> > +};
> Any cons to storing everything in bpf_link, instead of creating a
> "subclass"? Less things to worry about.

Yes, it's not always enough to just have a single FD to detach a BPF
program. Check bpf_prog_detach and bpf_prog_detach2 in
tools/lib/bpf/bpf.c. For some types of attachment you have to provide
target_fd+attach_type, for some target_fd+attach_type+attach_bpf_fd.
So those two will use their own bpf_link extensions.

I haven't implemented those attachment APIs yet, but we should.

What should go into bpf_link itself is any information that's common
to any kind of attachment (e.g., "kind of attachment" itself). It's
conceivable that we might allow "casting" bpf_link into a specific
variation and having extra "methods" on those. I haven't done that, as
I didn't have a need yet.
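
As an illustration of the pattern (purely hypothetical, not part of this
series), a cgroup-attachment flavor could mirror bpf_link_fd like so:

struct bpf_link_cgroup {
	struct bpf_link link; /* has to be at the top of struct */
	int cgroup_fd;
	enum bpf_attach_type attach_type;
};

static int bpf_link__destroy_cgroup(struct bpf_link *link)
{
	struct bpf_link_cgroup *l = (void *)link;

	return bpf_prog_detach(l->cgroup_fd, l->attach_type);
}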

>
> > +static int bpf_link__destroy_perf_event(struct bpf_link *link)
> > +{
> > + struct bpf_link_fd *l = (void *)link;
> > + int err;
> > +
> > + if (l->fd < 0)
> > + return 0;
> > +
> > + err = ioctl(l->fd, PERF_EVENT_IOC_DISABLE, 0);
> > + close(l->fd);
> > + return err;
> > +}
> > +
> > +struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
> > + int pfd)
> > +{
> > + char errmsg[STRERR_BUFSIZE];
> > + struct bpf_link_fd *link;
> > + int bpf_fd, err;
> > +
> > + bpf_fd = bpf_program__fd(prog);
> > + if (bpf_fd < 0) {
> > + pr_warning("program '%s': can't attach before loaded\n",
> > +bpf_program__title(prog, false));
> > + return ERR_PTR(-EINVAL);
> > + }
> > +
> > + link = malloc(sizeof(*link));
> > + if (!link)
> > + return ERR_PTR(-ENOMEM);
> > + link->link.destroy = &bpf_link__destroy_perf_event;
> > + link->fd = pfd;
> > +
> > + if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, bpf_fd) < 0) {
> > + err = -errno;
> > + free(link);
> > + pr_warning("program '%s': failed to attach to pfd %d: %s\n",
> > +bpf_program__title(prog, false), pfd,
> > +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> > + return ERR_PTR(err);
> > + }
> > + if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
> > + err = -errno;
> > + free(link);
> > + pr_warning("program '%s': failed to enable pfd %d: %s\n",
> > +bpf_program__title(prog, false), pfd,
> > +libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
> > + return ERR_PTR(err);
> > + }
> > + return (struct bpf_link *)link;
> > +}
> > +
> >  enum bpf_perf_event_ret
> >  bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t 
> > page_size,
> >  void **copy_mem, size_t *copy_size,
> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > index 5082a5ebb0c2..1bf66c4a9330 100644
> > --- a/tools/lib/bpf/libbpf.h
> > +++ b/tools/lib/bpf/libbpf.h
> > @@ -169,6 +169,9 @@ struct bpf_link;
> >
> >  LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
> >
> > +LIBBPF_API struct bpf_link *
> > +bpf_program__attach_perf_event(struct bpf_program *prog, int pfd);
> > +
> >  struct bpf_insn;
> >
> >  /*
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index 3cde850fc8da..756f5aa802e9 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -169,6 +169,7 @@ LIBBPF_0.0.4 {
> >   global:
> >   bpf_link__destroy;
> >   bpf_object__load_xattr;
> > + bpf_program__attach_perf_event;
> >   btf_dump__dump_type;
> >   btf_dump__free;
> >   btf_dump__new;
> > --
> > 2.17.1
> >


Re: [PATCH 00/11] XDP unaligned chunk placement support

2019-06-28 Thread Björn Töpel

On 2019-06-28 18:19, Laatz, Kevin wrote:

On 27/06/2019 22:25, Jakub Kicinski wrote:

On Thu, 27 Jun 2019 12:14:50 +0100, Laatz, Kevin wrote:

On the application side (xdpsock), we don't have to worry about the user
defined headroom, since it is 0, so we only need to account for the
XDP_PACKET_HEADROOM when computing the original address (in the default
scenario).

That assumes specific layout for the data inside the buffer.  Some NICs
will prepend information like timestamp to the packet, meaning the
packet would start at offset XDP_PACKET_HEADROOM + metadata len..


Yes, if NICs prepend extra data to the packet that would be a problem for
using this feature in isolation. However, if we also add in support for
in-order RX and TX rings, that would no longer be an issue. However, even
for NICs which do prepend data, this patchset should not break anything
that is currently working.


(Late on the ball. I'm in vacation mode.)

In your example Jakub, how would this look in XDP? Wouldn't the
timestamp be part of the metadata (xdp_md.data_meta)? Isn't
data-data_meta (if valid) <= XDP_PACKET_HEADROOM? That was my assumption.

There was some discussion on having the meta data length in the struct
xdp_desc, before AF_XDP was merged, but the conclusion was that this was
*not* needed, because AF_XDP and the XDP program had an implicit
contract. If you're running AF_XDP, you also have an XDP program running
and you can determine the meta data length (and also get back the
original buffer).

So, today in AF_XDP if XDP metadata is added, the userland application
can look it up before the xdp_desc.addr (just like regular XDP), and how
the XDP/AF_XDP application determines length/layout of the metadata i
out-of-band/not specified.

This is a bit messy/handwavy TBH, so maybe adding the length to the
descriptor *is* a good idea (extending the options part of the
xdp_desc)? Less clean though. OTOH the layout of the meta data still
needs to be determined.


Björn
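
To make the implicit contract concrete, the userland side of the sketch
would be something like this (META_SIZE is whatever the cooperating XDP
program prepends, known out-of-band; it is not a defined constant):

	void *pkt  = xsk_umem__get_data(umem_area, desc->addr);
	void *meta = (char *)pkt - META_SIZE; /* sits just before the data */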


Re: [RFC PATCH 1/1] Documentation: net: dsa: b53: Describe b53 configuration

2019-06-28 Thread Benedikt Spranger
Am Thu, 27 Jun 2019 09:38:16 -0700
schrieb Florian Fainelli :

> On 6/27/19 3:15 AM, Benedikt Spranger wrote:
> > Document the different needs of documentation for the b53 driver.
> > 
> > Signed-off-by: Benedikt Spranger 
> > ---
> >  Documentation/networking/dsa/b53.rst | 300
> > +++ 1 file changed, 300 insertions(+)
> >  create mode 100644 Documentation/networking/dsa/b53.rst
> > 
> > diff --git a/Documentation/networking/dsa/b53.rst
> > b/Documentation/networking/dsa/b53.rst new file mode 100644
> > index ..5838cf6230da
> > --- /dev/null
> > +++ b/Documentation/networking/dsa/b53.rst
> > @@ -0,0 +1,300 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==========================================
> > +Broadcom RoboSwitch Ethernet switch driver
> > +==========================================
> > +
> > +The Broadcom RoboSwitch Ethernet switch family is used in quite a range of
> > +xDSL routers, cable modems and other multimedia devices.
> > +
> > +The current implementation supports the devices BCM5325E, BCM5365, BCM539x,
> > +BCM53115 and BCM53125 as well as BCM63XX.
> > +
> > +Implementation details
> > +======================
> > +
> > +The driver is located in ``drivers/net/dsa/bcm_sf2.c`` and is implemented as a
> > +DSA driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the
> > +subsystem and what it provides.
> 
> The driver is under drivers/net/dsa/b53/
Fixed.

> s/ethernet/Ethernet/ for your global submission.
OK.
 
> What you are describing is not entirely specific to the B53 driver
> (maybe with the exception of having a VLAN tag on the DSA master
> network device, since B53 puts the CPU port tagged in all VLANs by
> default), and therefore the entire document should be written with
> the general DSA devices in mind, and eventually pointing out where
> B53 differs in a separate document.

I have split up the Documentation into 
Documentation/networking/dsa/configuration.rst
and
Documentation/networking/dsa/b53.rst

> There are largely two kinds of behavior:
> 
> - switches that are configured with DSA_TAG_PROTO_NONE, which behave
> in a certain way and require a bridge with VLAN filtering even when
> ports are intended to be used as standalone devices.
> 
> - switches that are configured with a tagging protocol other than
> DSA_TAG_PROTO_NONE, which behave more like traditional network devices
> people are used to use.
OK.

> > +bridge
> > +~~~~~~
> 
> I would add something like:
> 
> All ports being part of a single bridge/broadcast domain or something
> along those lines. Seeing the "wan" interface being added to a bridge
> is a bit confusing.

 
> > +
> > +.. code-block:: sh
> > +
> > +  # create bridge
> > +  ip link add name br0 type bridge
> > +
> > +  # add ports to bridge
> > +  ip link set dev wan master br0
> > +  ip link set dev lan1 master br0
> > +  ip link set dev lan2 master br0
> > +
> > +  # configure the bridge
> > +  ip addr add 192.0.2.129/25 dev br0
> > +
> > +  # The master interface needs to be brought up before the slave
> > ports.
> > +  ip link set eth0 up
> > +
> > +  # bring up the slave interfaces
> > +  ip link set wan up
> > +  ip link set lan1 up
> > +  ip link set lan2 up
> > +
> > +  # bring up the bridge
> > +  ip link set dev br0 up
> > +
> > +gateway
> > +~~~~~~~
> > +
> > +.. code-block:: sh
> > +
> > +  # create bridge
> > +  ip link add name br0 type bridge
> > +
> > +  # add ports to bridge
> > +  ip link set dev lan1 master br0
> > +  ip link set dev lan2 master br0
> > +
> > +  # configure the bridge
> > +  ip addr add 192.0.2.129/25 dev br0
> > +
> > +  # configure the upstream port
> > +  ip addr add 192.0.2.1/30 dev wan
> > +
> > +  # The master interface needs to be brought up before the slave
> > ports.
> > +  ip link set eth0 up
> > +
> > +  # bring up the slave interfaces
> > +  ip link set wan up
> > +  ip link set lan1 up
> > +  ip link set lan2 up
> > +
> > +  # bring up the bridge
> > +  ip link set dev br0 up
> > +
> > +Configuration without tagging support
> > +-------------------------------------
> > +
> > +Older models (5325, 5365) support a different tag format that is not supported
> > +yet. 539x and 531x5 require managed mode and some special handling, which is
> > +also not yet supported. The tagging support is disabled in these cases and the
> > +switch needs a different configuration.
> 
> On that topic, did the patch I sent you ended up working the way you
> wanted it with ifupdown-scripts or are you still chasing some other
> issues with it?
Your patch is needed and working. I stumbled over my own feet.

I have a hacky solution for now. Hacky in terms of not tracking the
master interface state (it just brings the master *up* unconditionally).

> > +
> > +single port
> > +~~~~~~~~~~~
> > +The configuration can only be set up via VLAN tagging and bridge setup.
> > +By default packets are tagged with vid 1:
> > +
> > +.. code-block:: sh
> > +
> > +  # tag traffic on CPU port
> > + 

Re: [RFC iproute2 1/1] ip: netns: add mounted state file for each netns

2019-06-28 Thread Matteo Croce
On Fri, Jun 28, 2019 at 6:27 PM David Howells  wrote:
>
> Nicolas Dichtel  wrote:
>
> > David Howells was working on a mount notification mechanism:
> > https://lwn.net/Articles/760714/
> > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> >
> > I don't know what is the status of this series.
>
> It's still alive.  I just posted a new version on it.  I'm hoping, possibly
> futiley, to get it in in this merge window.
>
> David

Hi all,

this could cause a clash if I create a netns with name ending with .mounted.

$ sudo ip/ip netns add ns1.mounted
$ sudo ip/ip netns add ns1
Cannot create namespace file "/var/run/netns/ns1.mounted": File exists
Cannot remove namespace file "/var/run/netns/ns1.mounted": Device or
resource busy

If you want to go along this road, please either:
- disallow netns creation with names ending with .mounted
- or properly document it in the manpage

Regards,
-- 
Matteo Croce
per aspera ad upstream
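
A minimal guard for the first option could look like this sketch (placement
and error message are illustrative):

	size_t len = strlen(name);

	/* reject names that collide with the state-file suffix */
	if (len >= strlen(".mounted") &&
	    strcmp(name + len - strlen(".mounted"), ".mounted") == 0) {
		fprintf(stderr, "Invalid netns name \"%s\"\n", name);
		return -1;
	}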


[PATCH 0/1] Disable all ports on b53 setup

2019-06-28 Thread Benedikt Spranger
Hi,

while working on a Banana Pi R1 based system I faced inconsistent
switch configurations. The switch is attached to an EEPROM which feeds
additional configuration that is applied after the reset of the chip.

As a result all ports remained active while the DSA subsystem
assumed that those ports were inactive after the reset.
Disable the ports on switch setup to get a consistent view of things
between real life and DSA. 

Benedikt Spranger (1):
  net: dsa: b53: Disable all ports on setup

 drivers/net/dsa/b53/b53_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.20.1



[PATCH 1/1] net: dsa: b53: Disable all ports on setup

2019-06-28 Thread Benedikt Spranger
A b53 device may be configured through an external EEPROM, like the switch
device on the Lamobo R1 router board. The configuration of a port may
therefore differ from the reset configuration of the switch.

The switch configuration reported by the DSA subsystem is different until
the port is configured by DSA, i.e. a port can be active, while the DSA
subsystem reports the port is inactive. Disable all ports and not only
the unused ones to put all ports into a well defined state.

Signed-off-by: Benedikt Spranger 
---
 drivers/net/dsa/b53/b53_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index a47f5bc667bd..5127c2fefba9 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -962,13 +962,13 @@ static int b53_setup(struct dsa_switch *ds)
if (ret)
dev_err(ds->dev, "failed to apply configuration\n");
 
-   /* Configure IMP/CPU port, disable unused ports. Enabled
+   /* Configure IMP/CPU port, disable all other ports. Enabled
 * ports will be configured with .port_enable
 */
for (port = 0; port < dev->num_ports; port++) {
if (dsa_is_cpu_port(ds, port))
b53_enable_cpu_port(dev, port);
-   else if (dsa_is_unused_port(ds, port))
+   else
b53_disable_port(ds, port);
}
 
-- 
2.20.1



Re: [PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread Ilias Apalodimas
Hi David, 

> >> Use page_pool and it's DMA mapping capabilities for Rx buffers instead
> >> of netdev/napi_alloc_frag()
> >> 
> >> Although this will result in a slight performance penalty on small sized
> >> packets (~10%) the use of the API will allow to easily add XDP support.
> >> The penalty won't be visible in network testing i.e. iperf/netperf etc., it
> >> only happens during raw packet drops.
> >> Furthermore we intend to add recycling capabilities on the API
> >> in the future. Once the recycling is added the performance penalty will
> >> go away.
> >> The only 'real' penalty is the slightly increased memory usage, since we
> >> now allocate a page per packet instead of the amount of bytes we need +
> >> skb metadata (difference is roughly 2kb per packet).
> >> With a minimum of 4GB of RAM on the only SoC that has this NIC the
> >> extra memory usage is negligible (a bit more on 64K pages)
> >> 
> >> Signed-off-by: Ilias Apalodimas 
> >> ---
> >>  drivers/net/ethernet/socionext/Kconfig  |   1 +
> >>  drivers/net/ethernet/socionext/netsec.c | 121 +++-
> >>  2 files changed, 75 insertions(+), 47 deletions(-)
> > 
> > Acked-by: Jesper Dangaard Brouer 
> 
> Jesper this is confusing, you just asked if the code needs to be moved
> around to be correct and then right now immediately afterwards you ACK
> the patch.
I can answer on the driver: page_pool_free() needs re-arranging indeed.
I'll fix it and post a V2. I guess Jesper meant 'acked-if-fixed', so I can
add it on V2.


Thanks
/Ilias





Re: [PATCH 1/1] net: dsa: b53: Disable all ports on setup

2019-06-28 Thread Florian Fainelli
On 6/28/19 9:58 AM, Benedikt Spranger wrote:
> A b53 device may be configured through an external EEPROM, like the switch
> device on the Lamobo R1 router board. The configuration of a port may
> therefore differ from the reset configuration of the switch.
> 
> Until a port is configured by DSA, the switch configuration reported by
> the DSA subsystem can differ from the real one, i.e. a port can be active
> while the DSA subsystem reports it as inactive. Disable all ports, not
> only the unused ones, to put every port into a well-defined state.
> 
> Signed-off-by: Benedikt Spranger 

Reviewed-by: Florian Fainelli 

Makes sense, in fact, that should probably be moved to the DSA core at
some point (wink wink Vivien).
-- 
Florian


Re: [PATCH 3/3, net-next] net: netsec: add XDP support

2019-06-28 Thread Maciej Fijalkowski
On Fri, 28 Jun 2019 19:47:41 +0300
Ilias Apalodimas  wrote:

> On Fri, Jun 28, 2019 at 03:35:52PM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 28 Jun 2019 13:39:15 +0300
> > Ilias Apalodimas  wrote:
> >   
> > > +static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog 
> > > *prog,
> > > + struct netlink_ext_ack *extack)
> > > +{
> > > + struct net_device *dev = priv->ndev;
> > > + struct bpf_prog *old_prog;
> > > +
> > > + /* For now just support only the usual MTU sized frames */
> > > + if (prog && dev->mtu > 1500) {
> > > + NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
> > > + return -EOPNOTSUPP;
> > > + }
> > > +
> > > + if (netif_running(dev))
> > > + netsec_netdev_stop(dev);
> > > +
> > > + /* Detach old prog, if any */
> > > + old_prog = xchg(&priv->xdp_prog, prog);
> > > + if (old_prog)
> > > + bpf_prog_put(old_prog);
> > > +
> > > + if (netif_running(dev))
> > > + netsec_netdev_open(dev);  
> > 
> > Shouldn't the if-statement be if (!netif_running(dev))
> >   
> > > +  
> This is there to restart the device if it's up already (to rebuild the rings).
> This should be fine as-is

I think Jesper's concern was that you could have already stopped the
netdev earlier via netsec_netdev_stop() (before the xchg), so at this
point __LINK_STATE_START might not be set.

Maybe store what netif_running(dev) returns in a stack variable up front
and act on that, so your stop/open calls stay symmetric?
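
Untested sketch of that idea, based on the function quoted above (only
the netif_running() handling differs from the posted patch):

static int netsec_xdp_setup(struct netsec_priv *priv, struct bpf_prog *prog,
			    struct netlink_ext_ack *extack)
{
	struct net_device *dev = priv->ndev;
	bool running = netif_running(dev);	/* sample the state once */
	struct bpf_prog *old_prog;

	/* For now just support only the usual MTU sized frames */
	if (prog && dev->mtu > 1500) {
		NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
		return -EOPNOTSUPP;
	}

	if (running)
		netsec_netdev_stop(dev);

	/* Detach old prog, if any */
	old_prog = xchg(&priv->xdp_prog, prog);
	if (old_prog)
		bpf_prog_put(old_prog);

	if (running)	/* reopen only because we stopped it above */
		netsec_netdev_open(dev);

	return 0;
}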

> 
> > > + return 0;
> > > +}  
> 
> Thanks
> /Ilias



Re: [PATCH bpf-next v3 1/2] xsk: remove AF_XDP socket from map when the socket is released

2019-06-28 Thread Björn Töpel
On Fri, 28 Jun 2019 at 02:33, Daniel Borkmann  wrote:
>
> On 06/20/2019 12:06 PM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > When an AF_XDP socket is released/closed the XSKMAP still holds a
> > reference to the socket in a "released" state. The socket will still
> > use the netdev queue resource, and block newly created sockets from
> > attaching to that queue, but no user application can access the
> > fill/complete/rx/tx queues. As a result, all applications need
> > to explicitly clear the map entry from the old "zombie state"
> > socket. This should be done automatically.
> >
> > After this patch, when a socket is released, it will remove itself
> > from all the XSKMAPs it resides in, allowing the socket application to
> > remove the code that cleans the XSKMAP entry.
> >
> > This behavior is also closer to that of SOCKMAP, making the two socket
> > maps more consistent.
> >
> > Suggested-by: Bruce Richardson 
> > Signed-off-by: Björn Töpel 
>
> Sorry for the bit of delay in reviewing, few comments inline:
>

No worries!

> > ---
> >  include/net/xdp_sock.h |   3 ++
> >  kernel/bpf/xskmap.c| 101 +++--
> >  net/xdp/xsk.c  |  25 ++
> >  3 files changed, 116 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index ae0f368a62bb..011a1b08d7c9 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -68,6 +68,8 @@ struct xdp_sock {
> >*/
> >   spinlock_t tx_completion_lock;
> >   u64 rx_dropped;
> > + struct list_head map_list;
> > + spinlock_t map_list_lock;
> >  };
> >
> >  struct xdp_buff;
> > @@ -87,6 +89,7 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem 
> > *umem,
> > struct xdp_umem_fq_reuse *newq);
> >  void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
> >  struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 
> > queue_id);
> > +void xsk_map_delete_from_node(struct xdp_sock *xs, struct list_head *node);
> >
> >  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
> >  {
> > diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
> > index ef7338cebd18..af802c89ebab 100644
> > --- a/kernel/bpf/xskmap.c
> > +++ b/kernel/bpf/xskmap.c
> > @@ -13,8 +13,58 @@ struct xsk_map {
> >   struct bpf_map map;
> >   struct xdp_sock **xsk_map;
> >   struct list_head __percpu *flush_list;
> > + spinlock_t lock;
> >  };
> >
> > +/* Nodes are linked in the struct xdp_sock map_list field, and used to
> > + * track which maps a certain socket reside in.
> > + */
> > +struct xsk_map_node {
> > + struct list_head node;
> > + struct xsk_map *map;
> > + struct xdp_sock **map_entry;
> > +};
> > +
> > +static struct xsk_map_node *xsk_map_node_alloc(void)
> > +{
> > + return kzalloc(sizeof(struct xsk_map_node), GFP_ATOMIC | 
> > __GFP_NOWARN);
> > +}
> > +
> > +static void xsk_map_node_free(struct xsk_map_node *node)
> > +{
> > + kfree(node);
> > +}
> > +
> > +static void xsk_map_node_init(struct xsk_map_node *node,
> > +   struct xsk_map *map,
> > +   struct xdp_sock **map_entry)
> > +{
> > + node->map = map;
> > + node->map_entry = map_entry;
> > +}
> > +
> > +static void xsk_map_add_node(struct xdp_sock *xs, struct xsk_map_node 
> > *node)
> > +{
> > + spin_lock_bh(&xs->map_list_lock);
> > + list_add_tail(&node->node, &xs->map_list);
> > + spin_unlock_bh(&xs->map_list_lock);
> > +}
> > +
> > +static void xsk_map_del_node(struct xdp_sock *xs, struct xdp_sock 
> > **map_entry)
> > +{
> > + struct xsk_map_node *n, *tmp;
> > +
> > + spin_lock_bh(&xs->map_list_lock);
> > + list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
> > + if (map_entry == n->map_entry) {
> > + list_del(&n->node);
> > + xsk_map_node_free(n);
> > + }
> > + }
> > + spin_unlock_bh(&xs->map_list_lock);
> > +
> > +}
> > +
> >  static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
> >  {
> >   struct xsk_map *m;
> > @@ -34,6 +84,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
> >   return ERR_PTR(-ENOMEM);
> >
> >   bpf_map_init_from_attr(&m->map, attr);
> > + spin_lock_init(&m->lock);
> >
> >   cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
> >   cost += sizeof(struct list_head) * num_possible_cpus();
> > @@ -76,15 +127,16 @@ static void xsk_map_free(struct bpf_map *map)
> >   bpf_clear_redirect_map(map);
> >   synchronize_net();
> >
> > + spin_lock_bh(&m->lock);
> >   for (i = 0; i < map->max_entries; i++) {
> > - struct xdp_sock *xs;
> > -
> > - xs = m->xsk_map[i];
> > - if (!xs)
> > - continue;
> > + struct xdp_sock **map_entry = &m->xsk_map[i];
> > + stru

Re: [PATCH 1/3, net-next] net: netsec: Use page_pool API

2019-06-28 Thread Jesper Dangaard Brouer
On Fri, 28 Jun 2019 20:19:34 +0300
Ilias Apalodimas  wrote:

> Hi David, 
> 
> > >> Use page_pool and it's DMA mapping capabilities for Rx buffers instead
> > >> of netdev/napi_alloc_frag()
> > >> 
> > >> Although this will result in a slight performance penalty on small sized
> > >> packets (~10%) the use of the API will allow to easily add XDP support.
> > >> The penalty won't be visible in network testing i.e. iperf/netperf etc., it
> > >> only happens during raw packet drops.
> > >> Furthermore we intend to add recycling capabilities on the API
> > >> in the future. Once the recycling is added the performance penalty will
> > >> go away.
> > >> The only 'real' penalty is the slightly increased memory usage, since we
> > >> now allocate a page per packet instead of the amount of bytes we need +
> > >> skb metadata (difference is roughly 2kb per packet).
> > >> With a minimum of 4GB of RAM on the only SoC that has this NIC the
> > >> extra memory usage is negligible (a bit more on 64K pages)
> > >> 
> > >> Signed-off-by: Ilias Apalodimas 
> > >> ---
> > >>  drivers/net/ethernet/socionext/Kconfig  |   1 +
> > >>  drivers/net/ethernet/socionext/netsec.c | 121 +++-
> > >>  2 files changed, 75 insertions(+), 47 deletions(-)  
> > > 
> > > Acked-by: Jesper Dangaard Brouer   
> > 
> > Jesper this is confusing, you just asked if the code needs to be moved
> > around to be correct and then right now immediately afterwards you ACK
> > the patch.
>
> I can answer on the driver: page_pool_free() needs re-arranging
> indeed. I'll fix it and post a V2. I guess Jesper meant
> 'acked-if-fixed', so I can add it on V2.

Sorry, it was a mistake.  I thought I had spotted an issue in 3/3 and
then I wanted to ACK 1/3.  Ilias, you can add my ACK in V2, as this was
the only issue I spotted in 1/3.

Sorry for the confusion.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH 2/4] ipvs: fix tinfo memory leak in start_sync_thread

2019-06-28 Thread Pablo Neira Ayuso
From: Julian Anastasov 

syzkaller reports a memory leak in start_sync_thread [1]

As Eric points out, a kthread may start and stop before the
threadfn function is called, so there is no chance for the
data (tinfo in our case) to be released in the thread.

Fix this by releasing tinfo in the controlling code instead.
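
The ownership rule this establishes, sketched with made-up names outside
the ipvs specifics (illustration only, not the actual patch code):

static int example_thread_fn(void *data);	/* uses, but never frees, its data */

struct tinfo {
	struct task_struct *task;
	/* ... per-thread state ... */
};

static int start_thread(struct tinfo **ptinfo)
{
	struct tinfo *ti;
	struct task_struct *task;

	ti = kzalloc(sizeof(*ti), GFP_KERNEL);
	if (!ti)
		return -ENOMEM;

	task = kthread_run(example_thread_fn, ti, "example-sync");
	if (IS_ERR(task)) {
		kfree(ti);		/* thread never ran, free here */
		return PTR_ERR(task);
	}
	ti->task = task;
	*ptinfo = ti;			/* keep it for stop_thread() */
	return 0;
}

static void stop_thread(struct tinfo *ti)
{
	kthread_stop(ti->task);		/* threadfn must not free ti... */
	kfree(ti);			/* ...the controlling code does */
}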

[1]
BUG: memory leak
unreferenced object 0x8881206bf700 (size 32):
 comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s)
 hex dump (first 32 bytes):
   00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff  .@|..E.!
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
 backtrace:
   [<57619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 
[inline]
   [<57619e23>] slab_post_alloc_hook mm/slab.h:439 [inline]
   [<57619e23>] slab_alloc mm/slab.c:3326 [inline]
   [<57619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
   [<86ce5479>] kmalloc include/linux/slab.h:547 [inline]
   [<86ce5479>] start_sync_thread+0x5d2/0xe10 
net/netfilter/ipvs/ip_vs_sync.c:1862
   [<1a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 
net/netfilter/ipvs/ip_vs_ctl.c:2402
   [] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
   [] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115
   [<942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline]
   [<942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238
   [] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
   [] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
   [<95eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
   [<9747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline]
   [<9747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline]
   [<9747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
   [] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
   [<893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+7e2e50c8adfccd2e5...@syzkaller.appspotmail.com
Suggested-by: Eric Biggers 
Fixes: 998e7a76804b ("ipvs: Use kthread_run() instead of doing a double-fork 
via kernel_thread()")
Signed-off-by: Julian Anastasov 
Acked-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ip_vs.h |   6 +-
 net/netfilter/ipvs/ip_vs_ctl.c  |   4 --
 net/netfilter/ipvs/ip_vs_sync.c | 134 +---
 3 files changed, 76 insertions(+), 68 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 2ac40135b576..b36a1df93e7c 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -808,11 +808,12 @@ struct ipvs_master_sync_state {
struct ip_vs_sync_buff  *sync_buff;
unsigned long   sync_queue_len;
unsigned intsync_queue_delay;
-   struct task_struct  *master_thread;
struct delayed_work master_wakeup_work;
struct netns_ipvs   *ipvs;
 };
 
+struct ip_vs_sync_thread_data;
+
 /* How much time to keep dests in trash */
 #define IP_VS_DEST_TRASH_PERIOD(120 * HZ)
 
@@ -943,7 +944,8 @@ struct netns_ipvs {
spinlock_t  sync_lock;
struct ipvs_master_sync_state *ms;
spinlock_t  sync_buff_lock;
-   struct task_struct  **backup_threads;
+   struct ip_vs_sync_thread_data *master_tinfo;
+   struct ip_vs_sync_thread_data *backup_tinfo;
int threads_mask;
volatile intsync_state;
struct mutexsync_mutex;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 776c87ed4813..741d91aa4a8d 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2396,9 +2396,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
*user, unsigned int len)
cfg.syncid = dm->syncid;
ret = start_sync_thread(ipvs, &cfg, dm->state);
} else {
-   mutex_lock(&ipvs->sync_mutex);
ret = stop_sync_thread(ipvs, dm->state);
-   mutex_unlock(&ipvs->sync_mutex);
}
goto out_dec;
}
@@ -3515,10 +3513,8 @@ static int ip_vs_genl_del_daemon(struct netns_ipvs 
*ipvs, struct nlattr **attrs)
if (!attrs[IPVS_DAEMON_ATTR_STATE])
return -EINVAL;
 
-   mutex_lock(&ipvs->sync_mutex);
ret = stop_sync_thread(ipvs,
   nla_get_u32(attrs[IPVS_DAEMON_ATTR_STATE]));
-   mutex_unlock(&ipvs->sync_mutex);
return ret;
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 2526be6b3d90..a4a78c4b06de 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -195,6 +195,7 @@ union ip_vs_sync_conn {
 #define IPVS_OPT_F_PARAM   (1 << (IP

[PATCH 1/4] ipvs: defer hook registration to avoid leaks

2019-06-28 Thread Pablo Neira Ayuso
From: Julian Anastasov 

syzkaller reports a memory leak when registering hooks [1]

As we moved the nf_unregister_net_hooks() call into
__ip_vs_dev_cleanup(), defer the nf_register_net_hooks()
call as well, so that the hooks are allocated and freed from
the same pernet_operations (ipvs_core_dev_ops).

[1]
BUG: memory leak
unreferenced object 0x88810acd8a80 (size 96):
 comm "syz-executor073", pid 7254, jiffies 4294950560 (age 22.250s)
 hex dump (first 32 bytes):
   02 00 00 00 00 00 00 00 50 8b bb 82 ff ff ff ff  P...
   00 00 00 00 00 00 00 00 00 77 bb 82 ff ff ff ff  .w..
 backtrace:
   [<13db61f1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 
[inline]
   [<13db61f1>] slab_post_alloc_hook mm/slab.h:439 [inline]
   [<13db61f1>] slab_alloc_node mm/slab.c:3269 [inline]
   [<13db61f1>] kmem_cache_alloc_node_trace+0x15b/0x2a0 mm/slab.c:3597
   [<1a27307d>] __do_kmalloc_node mm/slab.c:3619 [inline]
   [<1a27307d>] __kmalloc_node+0x38/0x50 mm/slab.c:3627
   [<25054add>] kmalloc_node include/linux/slab.h:590 [inline]
   [<25054add>] kvmalloc_node+0x4a/0xd0 mm/util.c:431
   [<50d1bc00>] kvmalloc include/linux/mm.h:637 [inline]
   [<50d1bc00>] kvzalloc include/linux/mm.h:645 [inline]
   [<50d1bc00>] allocate_hook_entries_size+0x3b/0x60 
net/netfilter/core.c:61
   [] nf_hook_entries_grow+0xae/0x270 net/netfilter/core.c:128
   [<4b94797c>] __nf_register_net_hook+0x9a/0x170 
net/netfilter/core.c:337
   [] nf_register_net_hook+0x34/0xc0 net/netfilter/core.c:464
   [<876c9b55>] nf_register_net_hooks+0x53/0xc0 net/netfilter/core.c:480
   [<2ea868e0>] __ip_vs_init+0xe8/0x170 
net/netfilter/ipvs/ip_vs_core.c:2280
   [<2eb2d451>] ops_init+0x4c/0x140 net/core/net_namespace.c:130
   [<0284ec48>] setup_net+0xde/0x230 net/core/net_namespace.c:316
   [] copy_net_ns+0xf0/0x1e0 net/core/net_namespace.c:439
   [] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:107
   [] copy_namespaces+0xa1/0xe0 kernel/nsproxy.c:165
   [<7cc008a2>] copy_process.part.0+0x11fd/0x2150 kernel/fork.c:2035
   [] copy_process kernel/fork.c:1800 [inline]
   [] _do_fork+0x121/0x4f0 kernel/fork.c:2369

Reported-by: syzbot+722da59ccb264bc19...@syzkaller.appspotmail.com
Fixes: 719c7d563c17 ("ipvs: Fix use-after-free in ip_vs_in")
Signed-off-by: Julian Anastasov 
Acked-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/ipvs/ip_vs_core.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 7138556b206b..d5103a9eb302 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -2245,7 +2245,6 @@ static const struct nf_hook_ops ip_vs_ops[] = {
 static int __net_init __ip_vs_init(struct net *net)
 {
struct netns_ipvs *ipvs;
-   int ret;
 
ipvs = net_generic(net, ip_vs_net_id);
if (ipvs == NULL)
@@ -2277,17 +2276,11 @@ static int __net_init __ip_vs_init(struct net *net)
if (ip_vs_sync_net_init(ipvs) < 0)
goto sync_fail;
 
-   ret = nf_register_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
-   if (ret < 0)
-   goto hook_fail;
-
return 0;
 /*
  * Error handling
  */
 
-hook_fail:
-   ip_vs_sync_net_cleanup(ipvs);
 sync_fail:
ip_vs_conn_net_cleanup(ipvs);
 conn_fail:
@@ -2317,6 +2310,19 @@ static void __net_exit __ip_vs_cleanup(struct net *net)
net->ipvs = NULL;
 }
 
+static int __net_init __ip_vs_dev_init(struct net *net)
+{
+   int ret;
+
+   ret = nf_register_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
+   if (ret < 0)
+   goto hook_fail;
+   return 0;
+
+hook_fail:
+   return ret;
+}
+
 static void __net_exit __ip_vs_dev_cleanup(struct net *net)
 {
struct netns_ipvs *ipvs = net_ipvs(net);
@@ -2336,6 +2342,7 @@ static struct pernet_operations ipvs_core_ops = {
 };
 
 static struct pernet_operations ipvs_core_dev_ops = {
+   .init = __ip_vs_dev_init,
.exit = __ip_vs_dev_cleanup,
 };
 
-- 
2.11.0




[PATCH 4/4] netfilter: Fix remainder of pseudo-header protocol 0

2019-06-28 Thread Pablo Neira Ayuso
From: He Zhe 

Since v5.1-rc1, some types of packets do not get an unreachable reply with
the following iptables setting. For example,

$ iptables -A INPUT -p icmp --icmp-type 8 -j REJECT
$ ping 127.0.0.1 -c 1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

We should have got the following reply from the command line, but we did not.
From 127.0.0.1 icmp_seq=1 Destination Port Unreachable

Yi Zhao reported it and narrowed it down to:
7fc38225363d ("netfilter: reject: skip csum verification for protocols that 
don't support it"),

This is because nf_ip_checksum() still expects pseudo-header protocol 0 for
packets that are neither TCP nor UDP, and thus ICMP packets are mistakenly
treated as TCP/UDP.

This patch corrects the conditions in nf_ip_checksum and all other places that
still call it with protocol 0.

Fixes: 7fc38225363d ("netfilter: reject: skip csum verification for protocols 
that don't support it")
Reported-by: Yi Zhao 
Signed-off-by: He Zhe 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_proto_icmp.c | 2 +-
 net/netfilter/nf_nat_proto.c| 2 +-
 net/netfilter/utils.c   | 5 +++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_icmp.c 
b/net/netfilter/nf_conntrack_proto_icmp.c
index 9becac953587..71a84a0517f3 100644
--- a/net/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/netfilter/nf_conntrack_proto_icmp.c
@@ -221,7 +221,7 @@ int nf_conntrack_icmpv4_error(struct nf_conn *tmpl,
/* See ip_conntrack_proto_tcp.c */
if (state->net->ct.sysctl_checksum &&
state->hook == NF_INET_PRE_ROUTING &&
-   nf_ip_checksum(skb, state->hook, dataoff, 0)) {
+   nf_ip_checksum(skb, state->hook, dataoff, IPPROTO_ICMP)) {
icmp_error_log(skb, state, "bad hw icmp checksum");
return -NF_ACCEPT;
}
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 84f5c90a7f21..9f3e52ebd3b8 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -567,7 +567,7 @@ int nf_nat_icmp_reply_translation(struct sk_buff *skb,
 
if (!skb_make_writable(skb, hdrlen + sizeof(*inside)))
return 0;
-   if (nf_ip_checksum(skb, hooknum, hdrlen, 0))
+   if (nf_ip_checksum(skb, hooknum, hdrlen, IPPROTO_ICMP))
return 0;
 
inside = (void *)skb->data + hdrlen;
diff --git a/net/netfilter/utils.c b/net/netfilter/utils.c
index 06dc55590441..51b454d8fa9c 100644
--- a/net/netfilter/utils.c
+++ b/net/netfilter/utils.c
@@ -17,7 +17,8 @@ __sum16 nf_ip_checksum(struct sk_buff *skb, unsigned int hook,
case CHECKSUM_COMPLETE:
if (hook != NF_INET_PRE_ROUTING && hook != NF_INET_LOCAL_IN)
break;
-   if ((protocol == 0 && !csum_fold(skb->csum)) ||
+   if ((protocol != IPPROTO_TCP && protocol != IPPROTO_UDP &&
+   !csum_fold(skb->csum)) ||
!csum_tcpudp_magic(iph->saddr, iph->daddr,
   skb->len - dataoff, protocol,
   skb->csum)) {
@@ -26,7 +27,7 @@ __sum16 nf_ip_checksum(struct sk_buff *skb, unsigned int hook,
}
/* fall through */
case CHECKSUM_NONE:
-   if (protocol == 0)
+   if (protocol != IPPROTO_TCP && protocol != IPPROTO_UDP)
skb->csum = 0;
else
skb->csum = csum_tcpudp_nofold(iph->saddr, iph->daddr,
-- 
2.11.0



