date:20190527

Re: [PATCH net-next] vrf: local route leaking

2019-05-27 Thread George Wilkie

On Sat, May 25, 2019 at 09:13:13AM -0600, David Ahern wrote:
> > Using a loopback doesn't work, e.g. if 10.1.1.0/24 was on a global 
> > interface:
> >ip ro add vrf vrf-a 10.1.1.0/24 dev lo
> 
> That works for MPLS when you exit the LSP and deliver locally, so it
> should work here as well. I'll take a look early next week.

OK, thanks.

> I would prefer to avoid it if possible. VRF route leaking for forwarding
> does not have the second lookup and that is the primary use case. VRL
> with local delivery is a 1-off use case and you could just easily argue
> that the connection should not rely on the leaked route. ie., the
> control plane is aware of both VRFs, and the userspace process could use
> the VRF-B path.
> 

Although it isn't always possible to change the userspace process -
may be running in a specific vrf by 'ip vrf exec'

Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives

2019-05-27 Thread Jakub Sitnicki

On Fri, May 24, 2019 at 05:51 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>>
>> Now that those pesky crashes are gone, we plan to look into drops when
>> doing echo with sockmap. Marek tried running echo-sockmap [1] with
>> latest bpf-next (plus mentioned crash fixes) and reports that not all
>> data bounces back:
>>
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 971832
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 867352
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 952648
>>
>> I'm tring to turn echo-sockmap into a selftest but as you can probably
>> guess over loopback all works fine.
>>
>
> Right, sockmap when used from recvmsg with redirect is lossy. This
> was a design choice I made that apparently caught a few people
> by surprise. The original rationale for this was when doing a
> multiplex operation, e.g. single ingress socket to many egress
> sockets blocking would cause head of line blocking on all
> sockets. To resolve this I simply dropped the packet and then allow
> the flow to continue. This pushes the logic up to the application
> to do retries, etc. when this happens. FWIW userspace proxies I
> tested also had similar points where they fell over and dropped
> packets. In hind sight though it probably would have made more
> sense to make this behavior opt-in vs the default. But, the
> use case I was solving at the time I wrote this could handle
> drops and was actually a NxM sockets with N ingress sockets and M
> egress sockets so head of line blocking was a real problem.
>
> Adding a flag to turn this into a blocking op has been on my
> todo list for awhile. Especially when sockmap is being used as
> a single ingress to single egress socket then blocking vs dropping
> makes much more sense.
>
> The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT
> case there is a few checks and notice we can fallthrough to a
> kfree_skb(skb). This is most likely the drops you are hitting.
> Maybe annotate it with a dbg statement to check.
>
> To fix this we could have a flag to _not_ drop but enqueue the
> packet regardless of the test or hold it until space is
> available. I even think sk_psock_strp_read could push back
> on the stream parser which would eventually push back via TCP
> and get the behavior you want.
>
> Also, I have a couple items on my TODO list that I'll eventually
> get to. First we run without stream parsers in some Cilium
> use cases. I'll push some patches to allow this in the next
> months or so. This avoids the annoying stream parser prog that
> simply returns skb->len. This is mostly an optimizations. A
> larger change I want to make at some point is to remove the
> backlog workqueue altogether. Originally it was added for
> simplicity but actually causes some latency spikes when
> looking at 99+ percentiles. It really doesn't need to be
> there it was a hold over from some original architecture that
> got pushed upstream. If you have time and want to let me know
> if you would like to tackle removing it.

This is great stuff. Thanks for explaining the sockmap's design and
decisions behind it. The opt-in blocking mode idea is spot on.

I imagine we'll get back to sockmap once we have a replacement for
TPROXY figured out (unrelated to sockmap). Let's sync then.

-Jakub

reset value of MV88E6XXX_G1_IEEE_PRI

2019-05-27 Thread Rasmus Villemoes

Hi,

Looking through the data sheets comparing the mv88e6240 and 6250, I
noticed that they have the exact same description of the G1_IEEE_PRI
register (global1, offset 0x18). However, the current code used by 6240 does

int mv88e6085_g1_ieee_pri_map(struct mv88e6xxx_chip *chip)
{
/* Reset the IEEE Tag priorities to defaults */
return mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IEEE_PRI, 0xfa41);
}

while if my reading of the data sheet is correct, the reset value is
really 0xfa50 (fields 7:6 and 5:4 are RWS to 0x1, field 3:2 and 1:0 are
RWR) - and this is also the value I read from the 6250 on our old BSP
with an out-of-tree driver that doesn't touch that register. This seems
to go way back (at least 3b1588593097). Should this be left alone for
not risking breaking existing setups (just updating the comment), or can
we make the code match the comment? Or am I just reading this all wrong?

Rasmus

Re: [PATCH net 1/4] net: aquantia: tx clean budget logic error

2019-05-27 Thread Igor Russkikh



On 27.05.2019 8:15, David Miller wrote:
> From: Igor Russkikh 
> Date: Sat, 25 May 2019 09:57:59 +
> 
>> In case no other traffic happening on the ring, full tx cleanup
>> may not be completed. That may cause socket buffer to overflow
>> and tx traffic to stuck until next activity on the ring happens.
>>
>> This is due to logic error in budget variable decrementor.
>> Variable is compared with zero, and then post decremented,
>> causing it to become MAX_INT. Solution is remove decrementor
>> from the `for` statement and rewrite it in a clear way.
>>
>> Fixes: b647d3980948e ("net: aquantia: Add tx clean budget and valid budget 
>> handling logic")
>> Signed-off-by: Igor Russkikh 
> 
> I think the TX clean budget is a very bad idea.
> 
> You should always do as much TX clean work as there is TODO.

Hi David,

This is not about introducing tx clean budget, but about fixing a bug.

tx clean budget logic is present in majority of the drivers as I see,
including igb,ixgbe,mlx5.

I see it as a logical action to limit the time driver spends in napi_poll
under napi budget.

Regards,
  Igor

Re: [RFC] vsock: proposal to support multiple transports at runtime

2019-05-27 Thread Stefano Garzarella

On Thu, May 23, 2019 at 04:37:03PM +0100, Stefan Hajnoczi wrote:
> On Tue, May 14, 2019 at 10:15:43AM +0200, Stefano Garzarella wrote:
> > Hi guys,
> > I'm currently interested on implement a multi-transport support for VSOCK in
> > order to handle nested VMs.
> > 
> > As Stefan suggested me, I started to look at this discussion:
> > https://lkml.org/lkml/2017/8/17/551
> > Below I tried to summarize a proposal for a discussion, following the ideas
> > from Dexuan, Jorgen, and Stefan.
> > 
> > 
> > We can define two types of transport that we have to handle at the same time
> > (e.g. in a nested VM we would have both types of transport running 
> > together):
> > 
> > - 'host side transport', it runs in the host and it is used to communicate 
> > with
> >   the guests of a specific hypervisor (KVM, VMWare or HyperV)
> > 
> >   Should we support multiple 'host side transport' running at the same time?
> > 
> > - 'guest side transport'. it runs in the guest and it is used to communicate
> >   with the host transport
> 
> I find this terminology confusing.  Perhaps "host->guest" (your 'host
> side transport') and "guest->host" (your 'guest side transport') is
> clearer?

I agree, "host->guest" and "guest->host" are better, I'll use them.

> 
> Or maybe the nested virtualization terminology of L2 transport (your
> 'host side transport') and L0 transport (your 'guest side transport')?
> Here we are the L1 guest and L0 is the host and L2 is our nested guest.
>

I'm confused, if L2 is the nested guest, it should be the
'guest side transport'. Did I miss anything?

Maybe it is another point to your first proposal :)

> > 
> > 
> > The main goal is to find a way to decide what transport use in these cases:
> > 1. connect() / sendto()
> > 
> > a. use the 'host side transport', if the destination is the guest
> >(dest_cid > VMADDR_CID_HOST).
> >If we want to support multiple 'host side transport' running at the
> >same time, we should assign CIDs uniquely across all transports.
> >In this way, a packet generated by the host side will get directed
> >to the appropriate transport based on the CID
> 
> The multiple host side transport case is unlikely to be necessary on x86
> where only one hypervisor uses VMX at any given time.  But eventually it
> may happen so it's wise to at least allow it in the design.
> 

Okay, I was in doubt, but I'll keep it in the design.

> > 
> > b. use the 'guest side transport', if the destination is the host
> >(dest_cid == VMADDR_CID_HOST)
> 
> Makes sense to me.
> 
> > 
> > 
> > 2. listen() / recvfrom()
> > 
> > a. use the 'host side transport', if the socket is bound to
> >VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> >guest transport.
> >We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> >address this case.
> >If we want to support multiple 'host side transport' running at the
> >same time, we should find a way to allow an application to bound a
> >specific host transport (e.g. adding new VMADDR_CID_LISTEN_FROM_KVM,
> >VMADDR_CID_LISTEN_FROM_VMWARE, VMADDR_CID_LISTEN_FROM_HYPERV)
> 
> Hmm...VMADDR_CID_LISTEN_FROM_KVM, VMADDR_CID_LISTEN_FROM_VMWARE,
> VMADDR_CID_LISTEN_FROM_HYPERV isn't very flexible.  What if my service
> should only be available to a subset of VMware VMs?

You're right, it is not very flexible.

> 
> Instead it might be more appropriate to use network namespaces to create
> independent AF_VSOCK addressing domains.  Then you could have two
> separate groups of VMware VMs and selectively listen to just one group.
> 

Does AF_VSOCK support network namespace or it could be another
improvement to take care? (IIUC is not currently supported)

A possible issue that I'm seeing with netns is if they are used for
other purpose (e.g. to isolate the network of a VM), we should have
multiple instances of the application, one per netns.

> > 
> > b. use the 'guest side transport', if the socket is bound to local CID
> >different from the VMADDR_CID_HOST (guest CID get with
> >IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY
> >(to be backward compatible).
> >Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
> 
> Two additional topics:
> 
> 1. How will loading af_vsock.ko change?

I'd allow the loading of af_vsock.ko without any transport.
Maybe we should move the MODULE_ALIAS_NETPROTO(PF_VSOCK) from the
vmci_transport.ko to the af_vsock.ko, but this can impact the VMware
driver.

>In particular, can an
>application create a socket in af_vsock.ko without any loaded
>transport?  Can it enter listen state without any loaded transport
>(this seems useful with VMADDR_CID_ANY)?

I'll check if we can allow listen sockets without any loaded transport,
but I think could be a nice behaviour to have.

> 
> 2. Does your proposed behavior match

[PATCH] net: stmmac: Do not output error on deferred probe

2019-05-27 Thread Thierry Reding

From: Thierry Reding 

If the subdriver defers probe, do not show an error message. It's
perfectly fine for this error to occur since the driver will get another
chance to probe after some time and will usually succeed after all of
the resources that it requires have been registered.

Signed-off-by: Thierry Reding 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
index 3256e5cbad27..5bc224834c77 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
@@ -455,7 +455,11 @@ static int dwc_eth_dwmac_probe(struct platform_device 
*pdev)
priv = data->probe(pdev, plat_dat, &stmmac_res);
if (IS_ERR(priv)) {
ret = PTR_ERR(priv);
-   dev_err(&pdev->dev, "failed to probe subdriver: %d\n", ret);
+
+   if (ret != -EPROBE_DEFER)
+   dev_err(&pdev->dev, "failed to probe subdriver: %d\n",
+   ret);
+
goto remove_config;
}
 
-- 
2.21.0

[PATCH net-next v5] net: sched: Introduce act_ctinfo action

2019-05-27 Thread Kevin 'ldir' Darbyshire-Bryant

ctinfo is a new tc filter action module.  It is designed to restore
information contained in firewall conntrack marks to other packet fields
and is typically used on packet ingress paths.  At present it has two
independent sub-functions or operating modes, DSCP restoration mode &
skb mark restoration mode.

The DSCP restore mode:

This mode copies DSCP values that have been placed in the firewall
conntrack mark back into the IPv4/v6 diffserv fields of relevant
packets.

The DSCP restoration is intended for use and has been found useful for
restoring ingress classifications based on egress classifications across
links that bleach or otherwise change DSCP, typically home ISP Internet
links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
but by no means limited to CAKE to shape inbound packets according to
policies that are easier to set & mark on egress.

Ingress classification is traditionally a challenging task since
iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
lookups, hence are unable to see internal IPv4 addresses as used on the
typical home masquerading gateway.  Thus marking the connection in some
manner on egress for later restoration of classification on ingress is
easier to implement.

Parameters related to DSCP restore mode:

dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
conntrack mark field contain the DSCP value to be restored.

statemask - a 32 bit mask of (usually) 1 bit length, outside the area
specified by dscpmask.  This represents a conditional operation flag
whereby the DSCP is only restored if the flag is set.  This is useful to
implement a 'one shot' iptables based classification where the
'complicated' iptables rules are only run once to classify the
connection on initial (egress) packet and subsequent packets are all
marked/restored with the same DSCP.  A mask of zero disables the
conditional behaviour ie. the conntrack mark DSCP bits are always
restored to the ip diffserv field (assuming the conntrack entry is found
& the skb is an ipv4/ipv6 type)

e.g. dscpmask 0xfc00 statemask 0x0100

|0xFCconntrack mark00---|
| Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
| DSCP   | unused | flag  |unused   |
|---0x01---00---|
  |   |
  |   |
  ---| Conditional flag
 v only restore if set
|-ip diffserv-|
| 6 bits  |
|-|

The skb mark restore mode (cpmark):

This mode copies the firewall conntrack mark to the skb's mark field.
It is completely the functional equivalent of the existing act_connmark
action with the additional feature of being able to apply a mask to the
restored value.

Parameters related to skb mark restore mode:

mask - a 32 bit mask applied to the firewall conntrack mark to mask out
bits unwanted for restoration.  This can be useful where the conntrack
mark is being used for different purposes by different applications.  If
not specified and by default the whole mark field is copied (i.e.
default mask of 0x)

e.g. mask 0x00ff to mask out the top 8 bits being used by the
aforementioned DSCP restore mode.

|0x00conntrack markff---|
| Bits 31-24 |  |
| DSCP & flag|  some value here |
|---|
|
|
v
|skb mark---|
||  |
|  zeroed|  |
|---|

Overall parameters:

zone - conntrack zone

control - action related control (reclassify | pipe | drop | continue |
ok | goto chain )

Mode specific values are passed as parameter structures across netlink.
Similarly statistics indicating DSCP & skb mark restoration counts are
also returned via netlink.

Signed-off-by: Kevin Darbyshire-Bryant 

---
v2 - add equivalent connmark functionality with an enhancement
to accept a mask
pass statistics for each sub-function as individual netlink
attributes and stop (ab)using overlimits, drops
update the testing config correctly
v3 - fix a licensing silly & tidy up GPL boilerplate
v4 - drop stray copy paste inline
reverse christmas tree local vars
v5 - rebase on net-next/master not net/master by mistake - doh! now
 applies!
 rename connmark to cpmark.
 always use structures across netlink to pass parameters.
 rework commit message to clarify modes & applicable parameters
 without getting bogged down in userspace syntax.
 restrict dscpmask parameter to 6 contiguous bits only instead
 of >=6 contiguous bits.
 re-order netlink TLV values into functional groupings.

 include/net/tc_act/tc_ctinfo.h|  28 ++
 include/uapi/linux/pkt_cls.h  |   1 +
 include/uapi/linux/tc_act/tc_ctinfo.h |  52 +++
 net/sched/Kconfig |  17 +
 n

[PATCH net-next] net: stmmac: Switch to devm_alloc_etherdev_mqs

2019-05-27 Thread Jisheng Zhang

Make use of devm_alloc_etherdev_mqs() to simplify the code.

Signed-off-by: Jisheng Zhang 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index a87ec70b19f1..08022fbcb67a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4243,9 +4243,9 @@ int stmmac_dvr_probe(struct device *device,
u32 queue, maxq;
int ret = 0;
 
-   ndev = alloc_etherdev_mqs(sizeof(struct stmmac_priv),
- MTL_MAX_TX_QUEUES,
- MTL_MAX_RX_QUEUES);
+   ndev = devm_alloc_etherdev_mqs(sizeof(struct stmmac_priv),
+  MTL_MAX_TX_QUEUES,
+  MTL_MAX_RX_QUEUES);
if (!ndev)
return -ENOMEM;
 
@@ -4277,8 +4277,7 @@ int stmmac_dvr_probe(struct device *device,
priv->wq = create_singlethread_workqueue("stmmac_wq");
if (!priv->wq) {
dev_err(priv->device, "failed to create workqueue\n");
-   ret = -ENOMEM;
-   goto error_wq;
+   return -ENOMEM;
}
 
INIT_WORK(&priv->service_task, stmmac_service_task);
@@ -4434,8 +4433,6 @@ int stmmac_dvr_probe(struct device *device,
}
 error_hw_init:
destroy_workqueue(priv->wq);
-error_wq:
-   free_netdev(ndev);
 
return ret;
 }
@@ -4472,7 +4469,6 @@ int stmmac_dvr_remove(struct device *dev)
stmmac_mdio_unregister(ndev);
destroy_workqueue(priv->wq);
mutex_destroy(&priv->lock);
-   free_netdev(ndev);
 
return 0;
 }
-- 
2.20.1

Re: reset value of MV88E6XXX_G1_IEEE_PRI

2019-05-27 Thread Vivien Didelot

Hi Rasmus,

On Mon, 27 May 2019 09:36:13 +, Rasmus Villemoes 
 wrote:
> Looking through the data sheets comparing the mv88e6240 and 6250, I
> noticed that they have the exact same description of the G1_IEEE_PRI
> register (global1, offset 0x18). However, the current code used by 6240 does
> 
> int mv88e6085_g1_ieee_pri_map(struct mv88e6xxx_chip *chip)
> {
>   /* Reset the IEEE Tag priorities to defaults */
>   return mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IEEE_PRI, 0xfa41);
> }
> 
> while if my reading of the data sheet is correct, the reset value is
> really 0xfa50 (fields 7:6 and 5:4 are RWS to 0x1, field 3:2 and 1:0 are
> RWR) - and this is also the value I read from the 6250 on our old BSP
> with an out-of-tree driver that doesn't touch that register. This seems
> to go way back (at least 3b1588593097). Should this be left alone for
> not risking breaking existing setups (just updating the comment), or can
> we make the code match the comment? Or am I just reading this all wrong?

If the reset value isn't the same, the bits are certainly differently
organized inside this register, so the proper way would be to add a
mv88e6240_g1_ieee_pri_map function, used by both 88E6240 and 88E6250.

I'm not a big fan of rewriting the default values, but that is the
way we chose until we make actually use of these tag priority bits.


Thanks,
Vivien

Re: reset value of MV88E6XXX_G1_IEEE_PRI

2019-05-27 Thread Rasmus Villemoes

On 27/05/2019 14.32, Vivien Didelot wrote:
> Hi Rasmus,
> 
> On Mon, 27 May 2019 09:36:13 +, Rasmus Villemoes 
>  wrote:
>> Looking through the data sheets comparing the mv88e6240 and 6250, I
>> noticed that they have the exact same description of the G1_IEEE_PRI
>> register (global1, offset 0x18). However, the current code used by 6240 does
>>
>> int mv88e6085_g1_ieee_pri_map(struct mv88e6xxx_chip *chip)
>> {
>>  /* Reset the IEEE Tag priorities to defaults */
>>  return mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IEEE_PRI, 0xfa41);
>> }
>>
>> while if my reading of the data sheet is correct, the reset value is
>> really 0xfa50 (fields 7:6 and 5:4 are RWS to 0x1, field 3:2 and 1:0 are
>> RWR) - and this is also the value I read from the 6250 on our old BSP
>> with an out-of-tree driver that doesn't touch that register. This seems
>> to go way back (at least 3b1588593097). Should this be left alone for
>> not risking breaking existing setups (just updating the comment), or can
>> we make the code match the comment? Or am I just reading this all wrong?
> 
> If the reset value isn't the same, the bits are certainly differently
> organized inside this register, so the proper way would be to add a
> mv88e6240_g1_ieee_pri_map function, used by both 88E6240 and 88E6250.
> 

Based on the very systematic description [ieee tags 7 and 6 are mapped
to 3, 5 and 4 to 2, 3 and 2 to 1, and 1 and 0 to 0], I strongly believe
that 0xfa50 is also the reset value for the 6085, so this is most likely
wrong for all the current chips - though I don't have a 6085 data sheet.

I can certainly add a 6250 variant that does the right thing for the
6250, and I probably will - this is more a question about the current code.

> I'm not a big fan of rewriting the default values, but that is the
> way we chose until we make actually use of these tag priority bits.

Yes, I was wondering why there's a lot of code which simply serves to
set default values - but I guess it makes sense to force the switch into
a known state in case the bootloader did something odd.

Rasmus

Re: [patch net-next 3/7] mlxfw: Propagate error messages through extack

2019-05-27 Thread Jiri Pirko

Sat, May 25, 2019 at 02:08:52AM CEST, jakub.kicin...@netronome.com wrote:
>On Sat, 25 May 2019 00:26:35 +0200, Jiri Pirko wrote:
>> Fri, May 24, 2019 at 05:54:46PM CEST, jakub.kicin...@netronome.com wrote:
>> >On Fri, 24 May 2019 10:11:10 +0200, Jiri Pirko wrote:  
>> >> Thu, May 23, 2019 at 05:19:46PM CEST, dsah...@gmail.com wrote:  
>> >> >On 5/23/19 3:45 AM, Jiri Pirko wrote:
>> >> >> @@ -57,11 +58,13 @@ static int mlxfw_fsm_state_wait(struct mlxfw_dev 
>> >> >> *mlxfw_dev, u32 fwhandle,
>> >> >>if (fsm_state_err != MLXFW_FSM_STATE_ERR_OK) {
>> >> >>pr_err("Firmware flash failed: %s\n",
>> >> >>   mlxfw_fsm_state_err_str[fsm_state_err]);
>> >> >> +  NL_SET_ERR_MSG_MOD(extack, "Firmware flash failed");
>> >> >>return -EINVAL;
>> >> >>}
>> >> >>if (curr_fsm_state != fsm_state) {
>> >> >>if (--times == 0) {
>> >> >>pr_err("Timeout reached on FSM state change");
>> >> >> +  NL_SET_ERR_MSG_MOD(extack, "Timeout reached on 
>> >> >> FSM state change");
>> >> >
>> >> >FSM? Is the meaning obvious to users?
>> >> 
>> >> It is specific to mlx drivers.  
>> >
>> >What does it stand for?  Isn't it just Finite State Machine?  
>> 
>> I believe so.
>
>In which case it doesn't really add much, no?  I second David's request
>to make the messages as easy to understand as possible.  

Well, FSM is something that is used in the code and known. I would
change it to "finite state machine" (which I'm still not sure it
really is) but I don't believe that would bring more info to the user.
Well, nothing. On contrary, a MLX engineer might get confused if customer
sends him the message, because he is used to "FSM" :)

Same with "MFA2" in the other message. I only know it is the format
of the binary, have no clue what it actually stands for (other than
it is version 2).


>
>PSID for better or worse I have previously capitulated on, so I guess
>the ship has indeed sailed there :)
>
>$ grep -A4 psid -- Documentation/networking/devlink-info-versions.rst 
>fw.psid
>===
>
>Unique identifier of the firmware parameter set.

Re: reset value of MV88E6XXX_G1_IEEE_PRI

2019-05-27 Thread Vivien Didelot

Hi Rasmus,

On Mon, 27 May 2019 13:02:04 +, Rasmus Villemoes 
 wrote:
> > On Mon, 27 May 2019 09:36:13 +, Rasmus Villemoes 
> >  wrote:
> >> Looking through the data sheets comparing the mv88e6240 and 6250, I
> >> noticed that they have the exact same description of the G1_IEEE_PRI
> >> register (global1, offset 0x18). However, the current code used by 6240 
> >> does
> >>
> >> int mv88e6085_g1_ieee_pri_map(struct mv88e6xxx_chip *chip)
> >> {
> >>/* Reset the IEEE Tag priorities to defaults */
> >>return mv88e6xxx_g1_write(chip, MV88E6XXX_G1_IEEE_PRI, 0xfa41);
> >> }
> >>
> >> while if my reading of the data sheet is correct, the reset value is
> >> really 0xfa50 (fields 7:6 and 5:4 are RWS to 0x1, field 3:2 and 1:0 are
> >> RWR) - and this is also the value I read from the 6250 on our old BSP
> >> with an out-of-tree driver that doesn't touch that register. This seems
> >> to go way back (at least 3b1588593097). Should this be left alone for
> >> not risking breaking existing setups (just updating the comment), or can
> >> we make the code match the comment? Or am I just reading this all wrong?
> > 
> > If the reset value isn't the same, the bits are certainly differently
> > organized inside this register, so the proper way would be to add a
> > mv88e6240_g1_ieee_pri_map function, used by both 88E6240 and 88E6250.
> > 
> 
> Based on the very systematic description [ieee tags 7 and 6 are mapped
> to 3, 5 and 4 to 2, 3 and 2 to 1, and 1 and 0 to 0], I strongly believe
> that 0xfa50 is also the reset value for the 6085, so this is most likely
> wrong for all the current chips - though I don't have a 6085 data sheet.
> 
> I can certainly add a 6250 variant that does the right thing for the
> 6250, and I probably will - this is more a question about the current code.

Good catch, I double checked 88E6085 and 88E6352 and both describe
a reset value of 0xFA50. Fixing mv88e6085_g1_ieee_pri_map should
be enough.
> 
> > I'm not a big fan of rewriting the default values, but that is the
> > way we chose until we make actually use of these tag priority bits.
> 
> Yes, I was wondering why there's a lot of code which simply serves to
> set default values - but I guess it makes sense to force the switch into
> a known state in case the bootloader did something odd.

That was the idea, yes.


Thank you,
Vivien

Re: netdev_alloc_skb is failing for 16k length

2019-05-27 Thread Stephen Hemminger

On Mon, 27 May 2019 12:21:51 +0530
Arun Kumar Neelakantam  wrote:

> Hi team,
> 
> we are using "skb = netdev_alloc_skb(NULL, len);" which is getting 
> failed sometimes for len = 16k.
> 
> I suspect mostly system memory got fragmented and hence atomic memory 
> allocation for 16k is failing, can you please suggest best way to handle 
> this failure case.
> 
> Thanks
> 
> Arun N
> 

If you are handling big frames, then put the data in page size chunks
and use build_skb.

[PATCH net-next] enetc: Enable TC offloading with mqprio

2019-05-27 Thread Claudiu Manoil

From: Camelia Groza 

Add support to configure multiple prioritized TX traffic
classes with mqprio.

Configure one BD ring per TC for the moment, one netdev
queue per TC.

Signed-off-by: Camelia Groza 
Signed-off-by: Claudiu Manoil 
---
 drivers/net/ethernet/freescale/enetc/enetc.c  | 56 +++
 drivers/net/ethernet/freescale/enetc/enetc.h  |  3 +
 .../net/ethernet/freescale/enetc/enetc_hw.h   | 12 +++-
 .../net/ethernet/freescale/enetc/enetc_pf.c   |  1 +
 .../net/ethernet/freescale/enetc/enetc_vf.c   |  1 +
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c 
b/drivers/net/ethernet/freescale/enetc/enetc.c
index d2ace299bed0..315957dff473 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -1425,6 +1425,62 @@ int enetc_close(struct net_device *ndev)
return 0;
 }
 
+int enetc_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+  void *type_data)
+{
+   struct enetc_ndev_priv *priv = netdev_priv(ndev);
+   struct tc_mqprio_qopt *mqprio = type_data;
+   struct enetc_bdr *tx_ring;
+   u8 num_tc;
+   int i;
+
+   if (type != TC_SETUP_QDISC_MQPRIO)
+   return -EOPNOTSUPP;
+
+   mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+   num_tc = mqprio->num_tc;
+
+   if (!num_tc) {
+   netdev_reset_tc(ndev);
+   netif_set_real_num_tx_queues(ndev, priv->num_tx_rings);
+
+   /* Reset all ring priorities to 0 */
+   for (i = 0; i < priv->num_tx_rings; i++) {
+   tx_ring = priv->tx_ring[i];
+   enetc_set_bdr_prio(&priv->si->hw, tx_ring->index, 0);
+   }
+
+   return 0;
+   }
+
+   /* Check if we have enough BD rings available to accommodate all TCs */
+   if (num_tc > priv->num_tx_rings) {
+   netdev_err(ndev, "Max %d traffic classes supported\n",
+  priv->num_tx_rings);
+   return -EINVAL;
+   }
+
+   /* For the moment, we use only one BD ring per TC.
+*
+* Configure num_tc BD rings with increasing priorities.
+*/
+   for (i = 0; i < num_tc; i++) {
+   tx_ring = priv->tx_ring[i];
+   enetc_set_bdr_prio(&priv->si->hw, tx_ring->index, i);
+   }
+
+   /* Reset the number of netdev queues based on the TC count */
+   netif_set_real_num_tx_queues(ndev, num_tc);
+
+   netdev_set_num_tc(ndev, num_tc);
+
+   /* Each TC is associated with one netdev queue */
+   for (i = 0; i < num_tc; i++)
+   netdev_set_tc_queue(ndev, i, 1, i);
+
+   return 0;
+}
+
 struct net_device_stats *enetc_get_stats(struct net_device *ndev)
 {
struct enetc_ndev_priv *priv = netdev_priv(ndev);
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.h 
b/drivers/net/ethernet/freescale/enetc/enetc.h
index ea443268bf70..541b4e2073fe 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc.h
@@ -229,6 +229,9 @@ struct net_device_stats *enetc_get_stats(struct net_device 
*ndev);
 int enetc_set_features(struct net_device *ndev,
   netdev_features_t features);
 int enetc_ioctl(struct net_device *ndev, struct ifreq *rq, int cmd);
+int enetc_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+  void *type_data);
+
 /* ethtool */
 void enetc_set_ethtool_ops(struct net_device *ndev);
 
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_hw.h 
b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
index 6559cef4b07d..88276299f447 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_hw.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
@@ -127,7 +127,7 @@ enum enetc_bdr_type {TX, RX};
 #define ENETC_TBSR_BUSYBIT(0)
 #define ENETC_TBMR_VIH BIT(9)
 #define ENETC_TBMR_PRIO_MASK   GENMASK(2, 0)
-#define ENETC_TBMR_PRIO_SET(val)   val
+#define ENETC_TBMR_SET_PRIO(val)   ((val) & ENETC_TBMR_PRIO_MASK)
 #define ENETC_TBMR_EN  BIT(31)
 #define ENETC_TBSR 0x4
 #define ENETC_TBBAR0   0x10
@@ -544,3 +544,13 @@ static inline void enetc_enable_txvlan(struct enetc_hw 
*hw, int si_idx,
val = (val & ~ENETC_TBMR_VIH) | (en ? ENETC_TBMR_VIH : 0);
enetc_txbdr_wr(hw, si_idx, ENETC_TBMR, val);
 }
+
+static inline void enetc_set_bdr_prio(struct enetc_hw *hw, int bdr_idx,
+ int prio)
+{
+   u32 val = enetc_txbdr_rd(hw, bdr_idx, ENETC_TBMR);
+
+   val &= ~ENETC_TBMR_PRIO_MASK;
+   val |= ENETC_TBMR_SET_PRIO(prio);
+   enetc_txbdr_wr(hw, bdr_idx, ENETC_TBMR, val);
+}
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c 
b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
index d78ec8d43c39..258b3cb38a6f 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
@@ -70

Re: [PATCH net-next v5] net: sched: Introduce act_ctinfo action

2019-05-27 Thread Toke Høiland-Jørgensen

Kevin 'ldir' Darbyshire-Bryant  writes:

> ctinfo is a new tc filter action module.  It is designed to restore
> information contained in firewall conntrack marks to other packet fields
> and is typically used on packet ingress paths.  At present it has two
> independent sub-functions or operating modes, DSCP restoration mode &
> skb mark restoration mode.
>
> The DSCP restore mode:
>
> This mode copies DSCP values that have been placed in the firewall
> conntrack mark back into the IPv4/v6 diffserv fields of relevant
> packets.
>
> The DSCP restoration is intended for use and has been found useful for
> restoring ingress classifications based on egress classifications across
> links that bleach or otherwise change DSCP, typically home ISP Internet
> links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
> but by no means limited to CAKE to shape inbound packets according to
> policies that are easier to set & mark on egress.
>
> Ingress classification is traditionally a challenging task since
> iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
> lookups, hence are unable to see internal IPv4 addresses as used on the
> typical home masquerading gateway.  Thus marking the connection in some
> manner on egress for later restoration of classification on ingress is
> easier to implement.
>
> Parameters related to DSCP restore mode:
>
> dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
> conntrack mark field contain the DSCP value to be restored.
>
> statemask - a 32 bit mask of (usually) 1 bit length, outside the area
> specified by dscpmask.  This represents a conditional operation flag
> whereby the DSCP is only restored if the flag is set.  This is useful to
> implement a 'one shot' iptables based classification where the
> 'complicated' iptables rules are only run once to classify the
> connection on initial (egress) packet and subsequent packets are all
> marked/restored with the same DSCP.  A mask of zero disables the
> conditional behaviour ie. the conntrack mark DSCP bits are always
> restored to the ip diffserv field (assuming the conntrack entry is found
> & the skb is an ipv4/ipv6 type)
>
> e.g. dscpmask 0xfc00 statemask 0x0100
>
> |0xFCconntrack mark00---|
> | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
> | DSCP   | unused | flag  |unused   |
> |---0x01---00---|
>   |   |
>   |   |
>   ---| Conditional flag
>  v only restore if set
> |-ip diffserv-|
> | 6 bits  |
> |-|
>
> The skb mark restore mode (cpmark):
>
> This mode copies the firewall conntrack mark to the skb's mark field.
> It is completely the functional equivalent of the existing act_connmark
> action with the additional feature of being able to apply a mask to the
> restored value.
>
> Parameters related to skb mark restore mode:
>
> mask - a 32 bit mask applied to the firewall conntrack mark to mask out
> bits unwanted for restoration.  This can be useful where the conntrack
> mark is being used for different purposes by different applications.  If
> not specified and by default the whole mark field is copied (i.e.
> default mask of 0x)
>
> e.g. mask 0x00ff to mask out the top 8 bits being used by the
> aforementioned DSCP restore mode.
>
> |0x00conntrack markff---|
> | Bits 31-24 |  |
> | DSCP & flag|  some value here |
> |---|
>   |
>   |
>   v
> |skb mark---|
> ||  |
> |  zeroed|  |
> |---|
>
> Overall parameters:
>
> zone - conntrack zone
>
> control - action related control (reclassify | pipe | drop | continue |
> ok | goto chain )
>
> Mode specific values are passed as parameter structures across netlink.
> Similarly statistics indicating DSCP & skb mark restoration counts are
> also returned via netlink.
>
> Signed-off-by: Kevin Darbyshire-Bryant 

I like the new commit message! :)

> ---
> v2 - add equivalent connmark functionality with an enhancement
> to accept a mask
> pass statistics for each sub-function as individual netlink
> attributes and stop (ab)using overlimits, drops
> update the testing config correctly
> v3 - fix a licensing silly & tidy up GPL boilerplate
> v4 - drop stray copy paste inline
> reverse christmas tree local vars
> v5 - rebase on net-next/master not net/master by mistake - doh! now
>  applies!

Still getting errors from 'git am' for your email. It helps to add
--ignore-whitespace, but then I get ^M line endings in the resulting
file...

>  rename connmark to cpmark.
>  always use structures across netlink to pass parameters.
>  rework commit message to clarify modes & applicable parameters

Re: [PATCH net-next] team: add ethtool get_link_ksettings

2019-05-27 Thread David Miller

From: Hangbin Liu 
Date: Mon, 27 May 2019 11:31:10 +0800

> Like bond, add ethtool get_link_ksettings to show the total speed.
> 
> Signed-off-by: Hangbin Liu 

Yeah, this mirrors what bonding does.

Jiri, please review.

Re: [PATCH] enetc: fix le32/le16 degrading to integer warnings

2019-05-27 Thread David Miller

From: "Y.b. Lu" 
Date: Mon, 27 May 2019 03:55:20 +

> Fix blow sparse warning introduced by a previous patch.
> - restricted __le32 degrades to integer
> - restricted __le16 degrades to integer
> 
> Fixes: d39823121911 ("enetc: add hardware timestamping support")
> Signed-off-by: Yangbo Lu 

Applied, thank you.

Re: [PATCH net 1/4] net: aquantia: tx clean budget logic error

2019-05-27 Thread David Miller

From: Igor Russkikh 
Date: Mon, 27 May 2019 09:57:18 +

> This is not about introducing tx clean budget, but about fixing a bug.
> 
> tx clean budget logic is present in majority of the drivers as I see,
> including igb,ixgbe,mlx5.
> 
> I see it as a logical action to limit the time driver spends in napi_poll
> under napi budget.

Ok, series applied, thank you.

[PATCH net] selftests: pmtu: Fix encapsulating device in pmtu_vti6_link_change_mtu

2019-05-27 Thread Stefano Brivio

In the pmtu_vti6_link_change_mtu test, both local and remote addresses
for the vti6 tunnel are assigned to the same address given to the dummy
interface that we use as encapsulating device with a known MTU.

This works as long as the dummy interface is actually selected, via
rt6_lookup(), as encapsulating device. But if the remote address of the
tunnel is a local address too, the loopback interface could also be
selected, and there's nothing wrong with it.

This is what some older -stable kernels do (3.18.z, at least), and
nothing prevents us from subtly changing FIB implementation to revert
back to that behaviour in the future.

Define an IPv6 prefix instead, and use two separate addresses as local
and remote for vti6, so that the encapsulating device can't be a
loopback interface.

Reported-by: Xiumei Mu 
Fixes: 1fad59ea1c34 ("selftests: pmtu: Add pmtu_vti6_link_change_mtu test")
Signed-off-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index b9171a7b3aaa..317dafcd605d 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -208,8 +208,8 @@ tunnel6_a_addr="fd00:2::a"
 tunnel6_b_addr="fd00:2::b"
 tunnel6_mask="64"
 
-dummy6_0_addr="fc00:1000::0"
-dummy6_1_addr="fc00:1001::0"
+dummy6_0_prefix="fc00:1000::"
+dummy6_1_prefix="fc00:1001::"
 dummy6_mask="64"
 
 cleanup_done=1
@@ -1005,13 +1005,13 @@ test_pmtu_vti6_link_change_mtu() {
run_cmd ${ns_a} ip link set dummy0 up
run_cmd ${ns_a} ip link set dummy1 up
 
-   run_cmd ${ns_a} ip addr add ${dummy6_0_addr}/${dummy6_mask} dev dummy0
-   run_cmd ${ns_a} ip addr add ${dummy6_1_addr}/${dummy6_mask} dev dummy1
+   run_cmd ${ns_a} ip addr add ${dummy6_0_prefix}1/${dummy6_mask} dev 
dummy0
+   run_cmd ${ns_a} ip addr add ${dummy6_1_prefix}1/${dummy6_mask} dev 
dummy1
 
fail=0
 
# Create vti6 interface bound to device, passing MTU, check it
-   run_cmd ${ns_a} ip link add vti6_a mtu 1300 type vti6 remote 
${dummy6_0_addr} local ${dummy6_0_addr}
+   run_cmd ${ns_a} ip link add vti6_a mtu 1300 type vti6 remote 
${dummy6_0_prefix}2 local ${dummy6_0_prefix}1
mtu="$(link_get_mtu "${ns_a}" vti6_a)"
if [ ${mtu} -ne 1300 ]; then
err "  vti6 MTU ${mtu} doesn't match configured value 1300"
@@ -1020,7 +1020,7 @@ test_pmtu_vti6_link_change_mtu() {
 
# Move to another device with different MTU, without passing MTU, check
# MTU is adjusted
-   run_cmd ${ns_a} ip link set vti6_a type vti6 remote ${dummy6_1_addr} 
local ${dummy6_1_addr}
+   run_cmd ${ns_a} ip link set vti6_a type vti6 remote ${dummy6_1_prefix}2 
local ${dummy6_1_prefix}1
mtu="$(link_get_mtu "${ns_a}" vti6_a)"
if [ ${mtu} -ne $((3000 - 40)) ]; then
err "  vti MTU ${mtu} is not dummy MTU 3000 minus IPv6 header 
length"
@@ -1028,7 +1028,7 @@ test_pmtu_vti6_link_change_mtu() {
fi
 
# Move it back, passing MTU, check MTU is not overridden
-   run_cmd ${ns_a} ip link set vti6_a mtu 1280 type vti6 remote 
${dummy6_0_addr} local ${dummy6_0_addr}
+   run_cmd ${ns_a} ip link set vti6_a mtu 1280 type vti6 remote 
${dummy6_0_prefix}2 local ${dummy6_0_prefix}1
mtu="$(link_get_mtu "${ns_a}" vti6_a)"
if [ ${mtu} -ne 1280 ]; then
err "  vti6 MTU ${mtu} doesn't match configured value 1280"
-- 
2.20.1

Re: [RESEND][PATCH] Fix MACsec kernel panics, oopses and bugs

2019-05-27 Thread Andreas Steinmetz

Patch will be worked over and split. I'll need to investigate one more
problem. Split patch will be resent when ready.

On Thu, 2019-05-23 at 09:11 -0700, David Miller wrote:
> From: Andreas Steinmetz 
> Date: Thu, 23 May 2019 09:46:15 +0200
> 
> > MACsec causes oopses followed by a kernel panic when attached directly or 
> > indirectly to
> a bridge. It causes erroneous
> > checksum messages when attached to vxlan. When I did investigate I did find 
> > skb leaks,
> apparent skb mis-handling and
> > superfluous code. The attached patch fixes all MACsec misbehaviour I could 
> > find. As I
> am no kernel developer somebody
> > with sufficient kernel network knowledge should verify and correct the 
> > patch where
> necessary.
> > 
> > Signed-off-by: Andreas Steinmetz 
> 
> Subject lines should be of the form:
> 
> [PATCH $DST_TREE] $subsystem_prefix: Description.
> 
> Where $DST_TREE here would be "net" and $subsystem_prefix would be "macsec".
> 
> > + /* FIXME: any better way to prevent calls to netdev_rx_csum_fault? */
> > + skb->csum_complete_sw = 1;
> 
> Create a helper for this in linux/skbuff.h with very clear and clean comments
> explaining what is going on.

Re: [patch net-next 0/7] expose flash update status to user

2019-05-27 Thread Florian Fainelli




On 5/23/2019 2:45 AM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> When user is flashing device using devlink, he currenly does not see any
> information about what is going on, percentages, etc.
> Drivers, for example mlxsw and mlx5, have notion about the progress
> and what is happening. This patchset exposes this progress
> information to userspace.
> 
> See this console recording which shows flashing FW on a Mellanox
> Spectrum device:
> https://asciinema.org/a/247926

It would be great to explain why you went that route instead of
implementing a MTD device (like what sfc) which would have presumably
allowed you to more or less the same thing using a standard device
driver model that is establish with flash devices.

> 
> Jiri Pirko (7):
>   mlxsw: Move firmware flash implementation to devlink
>   mlx5: Move firmware flash implementation to devlink
>   mlxfw: Propagate error messages through extack
>   devlink: allow driver to update progress of flash update
>   mlxfw: Introduce status_notify op and call it to notify about the
> status
>   mlxsw: Implement flash update status notifications
>   netdevsim: implement fake flash updating with notifications
> 
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 -
>  .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  35 --
>  drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +-
>  .../mellanox/mlx5/core/ipoib/ethtool.c|   9 --
>  .../net/ethernet/mellanox/mlx5/core/main.c|  20 
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   3 +-
>  drivers/net/ethernet/mellanox/mlxfw/mlxfw.h   |  11 +-
>  .../net/ethernet/mellanox/mlxfw/mlxfw_fsm.c   |  57 --
>  drivers/net/ethernet/mellanox/mlxsw/core.c|  15 +++
>  drivers/net/ethernet/mellanox/mlxsw/core.h|   3 +
>  .../net/ethernet/mellanox/mlxsw/spectrum.c|  75 +++--
>  drivers/net/netdevsim/dev.c   |  35 ++
>  include/net/devlink.h |   8 ++
>  include/uapi/linux/devlink.h  |   5 +
>  net/core/devlink.c| 102 ++
>  15 files changed, 295 insertions(+), 91 deletions(-)
> 

-- 
Florian

[PATCH net-next 0/3] net: phy: improve handling of more complex C45 PHY's

2019-05-27 Thread Heiner Kallweit

This series tries to address few problematic aspects raised by
Russell. Concrete example is the Marvell 88x3310, the changes
should be helpful for other complex C45 PHY's too.

Heiner Kallweit (3):
  net: phy: export phy_queue_state_machine
  net: phy: add callback for custom interrupt handler to struct
phy_driver
  net: phy: move handling latched link-down to phylib state machine

 drivers/net/phy/phy-c45.c| 12 
 drivers/net/phy/phy.c| 28 +++-
 drivers/net/phy/phy_device.c | 14 +-
 include/linux/phy.h  | 13 -
 4 files changed, 36 insertions(+), 31 deletions(-)

-- 
2.21.0

[PATCH net-next 3/3] net: phy: move handling latched link-down to phylib state machine

2019-05-27 Thread Heiner Kallweit

Especially with fibre links there may be very short link drops. And if
interrupt handling is slow we may miss such a link drop. To deal with
this we remove the double link status read from the generic link status
read functions, and call the state machine twice instead.
The flag for double-reading link status can be set by phy_mac_interrupt
from hard irq context, therefore we have to use an atomic operation.

Signed-off-by: Heiner Kallweit 
Suggested-by: Russell King - ARM Linux admin 
---
 drivers/net/phy/phy-c45.c| 12 
 drivers/net/phy/phy.c| 11 +++
 drivers/net/phy/phy_device.c | 14 +-
 include/linux/phy.h  |  8 
 4 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index b9d414578..63d9e8483 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -223,18 +223,6 @@ int genphy_c45_read_link(struct phy_device *phydev)
devad = __ffs(mmd_mask);
mmd_mask &= ~BIT(devad);
 
-   /* The link state is latched low so that momentary link
-* drops can be detected. Do not double-read the status
-* in polling mode to detect such short link drops.
-*/
-   if (!phy_polling_mode(phydev)) {
-   val = phy_read_mmd(phydev, devad, MDIO_STAT1);
-   if (val < 0)
-   return val;
-   else if (val & MDIO_STAT1_LSTATUS)
-   continue;
-   }
-
val = phy_read_mmd(phydev, devad, MDIO_STAT1);
if (val < 0)
return val;
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 8030d0a97..575412ff5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -778,6 +778,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
if (phydev->drv->handle_interrupt(phydev))
goto phy_err;
} else {
+   set_bit(PHY_LINK_DOUBLE_READ, &phydev->atomic_flags);
/* reschedule state queue work to run as soon as possible */
phy_trigger_machine(phydev);
}
@@ -969,6 +970,15 @@ void phy_state_machine(struct work_struct *work)
phydev->drv->link_change_notify(phydev);
}
 
+   /* link-down is latched, in order not to lose a link-up event we have
+* to read the link status twice
+*/
+   if (test_and_clear_bit(PHY_LINK_DOUBLE_READ, &phydev->atomic_flags)) {
+   if (!phydev->link)
+   phy_trigger_machine(phydev);
+   return;
+   }
+
/* Only re-schedule a PHY state machine change if we are polling the
 * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
 * between states from phy_mac_interrupt().
@@ -992,6 +1002,7 @@ void phy_state_machine(struct work_struct *work)
  */
 void phy_mac_interrupt(struct phy_device *phydev)
 {
+   set_bit(PHY_LINK_DOUBLE_READ, &phydev->atomic_flags);
/* Trigger a state machine change */
phy_trigger_machine(phydev);
 }
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index dcc93a873..f2a78baa8 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1704,23 +1704,11 @@ int genphy_update_link(struct phy_device *phydev)
 {
int status;
 
-   /* The link state is latched low so that momentary link
-* drops can be detected. Do not double-read the status
-* in polling mode to detect such short link drops.
-*/
-   if (!phy_polling_mode(phydev)) {
-   status = phy_read(phydev, MII_BMSR);
-   if (status < 0)
-   return status;
-   else if (status & BMSR_LSTATUS)
-   goto done;
-   }
-
/* Read link and autonegotiation status */
status = phy_read(phydev, MII_BMSR);
if (status < 0)
return status;
-done:
+
phydev->link = status & BMSR_LSTATUS ? 1 : 0;
phydev->autoneg_complete = status & BMSR_ANEGCOMPLETE ? 1 : 0;
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f90158c67..1d1dcfa90 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -332,6 +332,10 @@ struct phy_c45_device_ids {
u32 device_ids[8];
 };
 
+enum phy_atomic_flags {
+   PHY_LINK_DOUBLE_READ,
+};
+
 /* phy_device: An instance of a PHY
  *
  * drv: Pointer to the driver for this PHY instance
@@ -346,6 +350,7 @@ struct phy_c45_device_ids {
  * sysfs_links: Internal boolean tracking sysfs symbolic links setup/removal.
  * loopback_enabled: Set true if this phy has been loopbacked successfully.
  * state: state of the PHY for management purposes
+ * atomic_flags: flags accessed in atomic context
  * dev_flags: Device-specific flags used by the PH

[PATCH net-next 1/3] net: phy: export phy_queue_state_machine

2019-05-27 Thread Heiner Kallweit

We face the issue that link change interrupt and link status may be
reported by different layers. As a result the link change interrupt
may occur before the link status changes.
Export phy_queue_state_machine to allow PHY drivers to specify a
delay between link status change interrupt and link status check.

Signed-off-by: Heiner Kallweit 
Suggested-by: Russell King - ARM Linux admin 
---
 drivers/net/phy/phy.c | 8 +---
 include/linux/phy.h   | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index e88854292..20955836c 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -29,6 +29,8 @@
 #include 
 #include 
 
+#define PHY_STATE_TIME HZ
+
 #define PHY_STATE_STR(_state)  \
case PHY_##_state:  \
return __stringify(_state); \
@@ -478,12 +480,12 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq 
*ifr, int cmd)
 }
 EXPORT_SYMBOL(phy_mii_ioctl);
 
-static void phy_queue_state_machine(struct phy_device *phydev,
-   unsigned int secs)
+void phy_queue_state_machine(struct phy_device *phydev, unsigned int jiffies)
 {
mod_delayed_work(system_power_efficient_wq, &phydev->state_queue,
-secs * HZ);
+jiffies);
 }
+EXPORT_SYMBOL(phy_queue_state_machine);
 
 static void phy_trigger_machine(struct phy_device *phydev)
 {
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 7180b1d1e..b133d59f3 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -188,7 +188,6 @@ static inline const char *phy_modes(phy_interface_t 
interface)
 
 
 #define PHY_INIT_TIMEOUT   10
-#define PHY_STATE_TIME 1
 #define PHY_FORCE_TIMEOUT  10
 
 #define PHY_MAX_ADDR   32
@@ -1137,6 +1136,7 @@ int phy_driver_register(struct phy_driver *new_driver, 
struct module *owner);
 int phy_drivers_register(struct phy_driver *new_driver, int n,
 struct module *owner);
 void phy_state_machine(struct work_struct *work);
+void phy_queue_state_machine(struct phy_device *phydev, unsigned int jiffies);
 void phy_mac_interrupt(struct phy_device *phydev);
 void phy_start_machine(struct phy_device *phydev);
 void phy_stop_machine(struct phy_device *phydev);
-- 
2.21.0

[PATCH net-next 2/3] net: phy: add callback for custom interrupt handler to struct phy_driver

2019-05-27 Thread Heiner Kallweit

The phylib interrupt handler handles link change events only currently.
However PHY drivers may want to use other interrupt sources too,
e.g. to report temperature monitoring events. Therefore add a callback
to struct phy_driver allowing PHY drivers to implement a custom
interrupt handler.

Signed-off-by: Heiner Kallweit 
Suggested-by: Russell King - ARM Linux admin 
---
 drivers/net/phy/phy.c | 9 +++--
 include/linux/phy.h   | 3 +++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 20955836c..8030d0a97 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -774,8 +774,13 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
if (phydev->drv->did_interrupt && !phydev->drv->did_interrupt(phydev))
return IRQ_NONE;
 
-   /* reschedule state queue work to run as soon as possible */
-   phy_trigger_machine(phydev);
+   if (phydev->drv->handle_interrupt) {
+   if (phydev->drv->handle_interrupt(phydev))
+   goto phy_err;
+   } else {
+   /* reschedule state queue work to run as soon as possible */
+   phy_trigger_machine(phydev);
+   }
 
if (phy_clear_interrupt(phydev))
goto phy_err;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index b133d59f3..f90158c67 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -536,6 +536,9 @@ struct phy_driver {
 */
int (*did_interrupt)(struct phy_device *phydev);
 
+   /* Override default interrupt handling */
+   int (*handle_interrupt)(struct phy_device *phydev);
+
/* Clears up any memory if needed */
void (*remove)(struct phy_device *phydev);
 
-- 
2.21.0

Re: [PATCH v2 1/4] net: phy: dp83867: fix speed 10 in sgmii mode

2019-05-27 Thread Florian Fainelli




On 5/26/2019 11:16 PM, Max Uvarov wrote:
> For support 10Mps sped in SGMII mode DP83867_10M_SGMII_RATE_ADAPT bit
> of DP83867_10M_SGMII_CFG register has to be cleared by software.
> That does not affect speeds 100 and 1000 so can be done on init.

s/support/supporting/
s/sped/speed/

> 
> Signed-off-by: Max Uvarov 
> Cc: Heiner Kallweit 
> ---
>  drivers/net/phy/dp83867.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index fd35131a0c39..75861b8f3b4d 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -30,6 +30,7 @@
>  #define DP83867_STRAP_STS1   0x006E
>  #define DP83867_RGMIIDCTL0x0086
>  #define DP83867_IO_MUX_CFG   0x0170
> +#define DP83867_10M_SGMII_CFG  0x016F
>  
>  #define DP83867_SW_RESET BIT(15)
>  #define DP83867_SW_RESTART   BIT(14)
> @@ -74,6 +75,9 @@
>  /* CFG4 bits */
>  #define DP83867_CFG4_PORT_MIRROR_EN  BIT(0)
>  
> +/* 10M_SGMII_CFG bits */
> +#define DP83867_10M_SGMII_RATE_ADAPT  BIT(7)
> +
>  enum {
>   DP83867_PORT_MIRROING_KEEP,
>   DP83867_PORT_MIRROING_EN,
> @@ -277,6 +281,22 @@ static int dp83867_config_init(struct phy_device *phydev)
>  DP83867_IO_MUX_CFG_IO_IMPEDANCE_CTRL);
>   }
>  
> + if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
> + /* For support SPEED_10 in SGMII mode
> +  * DP83867_10M_SGMII_RATE_ADAPT bit
> +  * has to be cleared by software. That
> +  * does not affect SPEED_100 and
> +  * SPEED_1000.

Likewise, s/support/supporting/ with that and Heiner's suggestion to use
phy_modify_mmd():

Reviewed-by: Florian Fainelli 

> +  */
> + val = phy_read_mmd(phydev, DP83867_DEVADDR,
> +DP83867_10M_SGMII_CFG);
> + val &= ~DP83867_10M_SGMII_RATE_ADAPT;
> + ret = phy_write_mmd(phydev, DP83867_DEVADDR,
> + DP83867_10M_SGMII_CFG, val);
> + if (ret)
> + return ret;
> + }
> +
>   /* Enable Interrupt output INT_OE in CFG3 register */
>   if (phy_interrupt_is_valid(phydev)) {
>   val = phy_read(phydev, DP83867_CFG3);
> 

-- 
Florian

Re: [PATCH v2 4/4] net: phy: dp83867: Set up RGMII TX delay

2019-05-27 Thread Florian Fainelli




On 5/26/2019 11:16 PM, Max Uvarov wrote:
> PHY_INTERFACE_MODE_RGMII_RXID is less then TXID
> so code to set tx delay is never called.
> Fixes: 2a10154abcb75 ("net: phy: dp83867: Add TI dp83867 phy")
> 
> Signed-off-by: Max Uvarov 
> Cc: Florian Fainelli 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH v2 3/4] net: phy: dp83867: do not call config_init twice

2019-05-27 Thread Florian Fainelli




On 5/26/2019 11:16 PM, Max Uvarov wrote:
> Phy state machine calls _config_init just after
> reset.
> 
> Signed-off-by: Max Uvarov 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH v2 2/4] net: phy: dp83867: increase SGMII autoneg timer duration

2019-05-27 Thread Florian Fainelli




On 5/26/2019 11:16 PM, Max Uvarov wrote:
> After reset SGMII Autoneg timer is set to 2us (bits 6 and 5 are 01).
> That us not enough to finalize autonegatiation on some devices.

s/us/is/

> Increase this timer duration to maximum supported 16ms.



> 
> Signed-off-by: Max Uvarov 
> Cc: Heiner Kallweit 
> ---
>  drivers/net/phy/dp83867.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index 75861b8f3b4d..5fafcc091525 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -295,6 +295,16 @@ static int dp83867_config_init(struct phy_device *phydev)
>   DP83867_10M_SGMII_CFG, val);
>   if (ret)
>   return ret;
> +
> + /* After reset SGMII Autoneg timer is set to 2us (bits 6 and 5
> +  * are 01). That us not enough to finalize autoneg on some

Likewise, same typo was carried over here. With that fixed and Heiner's
suggestions addressed:

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next 1/3] net: phy: export phy_queue_state_machine

2019-05-27 Thread Florian Fainelli




On 5/27/2019 11:28 AM, Heiner Kallweit wrote:
> We face the issue that link change interrupt and link status may be
> reported by different layers. As a result the link change interrupt
> may occur before the link status changes.
> Export phy_queue_state_machine to allow PHY drivers to specify a
> delay between link status change interrupt and link status check.
> 
> Signed-off-by: Heiner Kallweit 
> Suggested-by: Russell King - ARM Linux admin 
> ---
>  drivers/net/phy/phy.c | 8 +---
>  include/linux/phy.h   | 2 +-
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index e88854292..20955836c 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -29,6 +29,8 @@
>  #include 
>  #include 
>  
> +#define PHY_STATE_TIME   HZ
> +
>  #define PHY_STATE_STR(_state)\
>   case PHY_##_state:  \
>   return __stringify(_state); \
> @@ -478,12 +480,12 @@ int phy_mii_ioctl(struct phy_device *phydev, struct 
> ifreq *ifr, int cmd)
>  }
>  EXPORT_SYMBOL(phy_mii_ioctl);
>  
> -static void phy_queue_state_machine(struct phy_device *phydev,
> - unsigned int secs)
> +void phy_queue_state_machine(struct phy_device *phydev, unsigned int jiffies)

mod_delayed_work() takes an unsigned long delay argument, so I would
replicate that here as well. Other than that, making PHY_STATE_TIME
local to phy.c is definitively a good idea.

>  {
>   mod_delayed_work(system_power_efficient_wq, &phydev->state_queue,
> -  secs * HZ);
> +  jiffies);
>  }
> +EXPORT_SYMBOL(phy_queue_state_machine);
>  
>  static void phy_trigger_machine(struct phy_device *phydev)
>  {
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 7180b1d1e..b133d59f3 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -188,7 +188,6 @@ static inline const char *phy_modes(phy_interface_t 
> interface)
>  
>  
>  #define PHY_INIT_TIMEOUT 10
> -#define PHY_STATE_TIME   1
>  #define PHY_FORCE_TIMEOUT10
>  
>  #define PHY_MAX_ADDR 32
> @@ -1137,6 +1136,7 @@ int phy_driver_register(struct phy_driver *new_driver, 
> struct module *owner);
>  int phy_drivers_register(struct phy_driver *new_driver, int n,
>struct module *owner);
>  void phy_state_machine(struct work_struct *work);
> +void phy_queue_state_machine(struct phy_device *phydev, unsigned int 
> jiffies);
>  void phy_mac_interrupt(struct phy_device *phydev);
>  void phy_start_machine(struct phy_device *phydev);
>  void phy_stop_machine(struct phy_device *phydev);
> 

-- 
Florian

Re: [PATCH net-next 2/3] net: phy: add callback for custom interrupt handler to struct phy_driver

2019-05-27 Thread Florian Fainelli




On 5/27/2019 11:28 AM, Heiner Kallweit wrote:
> The phylib interrupt handler handles link change events only currently.
> However PHY drivers may want to use other interrupt sources too,
> e.g. to report temperature monitoring events. Therefore add a callback
> to struct phy_driver allowing PHY drivers to implement a custom
> interrupt handler.
> 
> Signed-off-by: Heiner Kallweit 
> Suggested-by: Russell King - ARM Linux admin 
> ---
>  drivers/net/phy/phy.c | 9 +++--
>  include/linux/phy.h   | 3 +++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 20955836c..8030d0a97 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -774,8 +774,13 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
>   if (phydev->drv->did_interrupt && !phydev->drv->did_interrupt(phydev))
>   return IRQ_NONE;
>  
> - /* reschedule state queue work to run as soon as possible */
> - phy_trigger_machine(phydev);
> + if (phydev->drv->handle_interrupt) {
> + if (phydev->drv->handle_interrupt(phydev))

If Russell is okay with such a model where the PHY state machine still
manages the interrupts at large, and only calls a specific callback for
specific even handling, that's fine. We might have to allow PHY drivers
to let them specify what they want to get passed to
request_threaded_irq(), or leave it to them to do it.
-- 
Florian

Re: [PATCH net-next 2/3] net: phy: add callback for custom interrupt handler to struct phy_driver

2019-05-27 Thread Heiner Kallweit

On 27.05.2019 21:25, Florian Fainelli wrote:
> 
> 
> On 5/27/2019 11:28 AM, Heiner Kallweit wrote:
>> The phylib interrupt handler handles link change events only currently.
>> However PHY drivers may want to use other interrupt sources too,
>> e.g. to report temperature monitoring events. Therefore add a callback
>> to struct phy_driver allowing PHY drivers to implement a custom
>> interrupt handler.
>>
>> Signed-off-by: Heiner Kallweit 
>> Suggested-by: Russell King - ARM Linux admin 
>> ---
>>  drivers/net/phy/phy.c | 9 +++--
>>  include/linux/phy.h   | 3 +++
>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 20955836c..8030d0a97 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -774,8 +774,13 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
>>  if (phydev->drv->did_interrupt && !phydev->drv->did_interrupt(phydev))
>>  return IRQ_NONE;
>>  
>> -/* reschedule state queue work to run as soon as possible */
>> -phy_trigger_machine(phydev);
>> +if (phydev->drv->handle_interrupt) {
>> +if (phydev->drv->handle_interrupt(phydev))
> 
> If Russell is okay with such a model where the PHY state machine still
> manages the interrupts at large, and only calls a specific callback for
> specific even handling, that's fine. We might have to allow PHY drivers
> to let them specify what they want to get passed to
> request_threaded_irq(), or leave it to them to do it.
> 
This proposed easy model should be able to cover quite some use cases.
One constraint may be that interrupts are disabled if phylib state
machine isn't in a started state. Means most likely it's not able to
cover e.g. the requirement to allow temperature warning interrupts if
PHY is in state HALTED.

Re: [PATCH net-next v5] net: sched: Introduce act_ctinfo action

2019-05-27 Thread Kevin 'ldir' Darbyshire-Bryant

I have to call it a day.  I have no idea why the patches are becoming corrupt 
and hence how to fix it, it’s probably something Apple has done to git, or 
maybe MS to my email server.

Sadly I also think that the only way this patch/functionality will ever be 
acceptable is if someone else writes it, where they or their company can take 
the credit/blame.

I tried very hard to approach the process of upstream submission in a positive 
way, seeking advice & guidance in the form of RFC patches, many rounds later I 
feel they’re further away from acceptance than ever.  Clearly it is not desired 
functionality/code otherwise it would have been written by now and I cannot 
face another 3 rounds of the same thing for act_ctinfo user space, the 
x_tables/nf_tables kernel helper to store the DSCP in the first place and the 
user space code to handle that.

As a rank outsider, amateur coder I shall leave it that I’ve found the process 
completely discouraging.  The professionals are of course paid to deal with 
this.


> On 27 May 2019, at 16:47, Toke Høiland-Jørgensen  wrote:
> 
> Kevin 'ldir' Darbyshire-Bryant  writes:
> 
>> ctinfo is a new tc filter action module.  It is designed to restore
>> information contained in firewall conntrack marks to other packet fields
>> and is typically used on packet ingress paths.  At present it has two
>> independent sub-functions or operating modes, DSCP restoration mode &
>> skb mark restoration mode.
>> 
>> The DSCP restore mode:
>> 
>> This mode copies DSCP values that have been placed in the firewall
>> conntrack mark back into the IPv4/v6 diffserv fields of relevant
>> packets.
>> 
>> The DSCP restoration is intended for use and has been found useful for
>> restoring ingress classifications based on egress classifications across
>> links that bleach or otherwise change DSCP, typically home ISP Internet
>> links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
>> but by no means limited to CAKE to shape inbound packets according to
>> policies that are easier to set & mark on egress.
>> 
>> Ingress classification is traditionally a challenging task since
>> iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
>> lookups, hence are unable to see internal IPv4 addresses as used on the
>> typical home masquerading gateway.  Thus marking the connection in some
>> manner on egress for later restoration of classification on ingress is
>> easier to implement.
>> 
>> Parameters related to DSCP restore mode:
>> 
>> dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
>> conntrack mark field contain the DSCP value to be restored.
>> 
>> statemask - a 32 bit mask of (usually) 1 bit length, outside the area
>> specified by dscpmask.  This represents a conditional operation flag
>> whereby the DSCP is only restored if the flag is set.  This is useful to
>> implement a 'one shot' iptables based classification where the
>> 'complicated' iptables rules are only run once to classify the
>> connection on initial (egress) packet and subsequent packets are all
>> marked/restored with the same DSCP.  A mask of zero disables the
>> conditional behaviour ie. the conntrack mark DSCP bits are always
>> restored to the ip diffserv field (assuming the conntrack entry is found
>> & the skb is an ipv4/ipv6 type)
>> 
>> e.g. dscpmask 0xfc00 statemask 0x0100
>> 
>> |0xFCconntrack mark00---|
>> | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
>> | DSCP   | unused | flag  |unused   |
>> |---0x01---00---|
>>  |   |
>>  |   |
>>  ---| Conditional flag
>> v only restore if set
>> |-ip diffserv-|
>> | 6 bits  |
>> |-|
>> 
>> The skb mark restore mode (cpmark):
>> 
>> This mode copies the firewall conntrack mark to the skb's mark field.
>> It is completely the functional equivalent of the existing act_connmark
>> action with the additional feature of being able to apply a mask to the
>> restored value.
>> 
>> Parameters related to skb mark restore mode:
>> 
>> mask - a 32 bit mask applied to the firewall conntrack mark to mask out
>> bits unwanted for restoration.  This can be useful where the conntrack
>> mark is being used for different purposes by different applications.  If
>> not specified and by default the whole mark field is copied (i.e.
>> default mask of 0x)
>> 
>> e.g. mask 0x00ff to mask out the top 8 bits being used by the
>> aforementioned DSCP restore mode.
>> 
>> |0x00conntrack markff---|
>> | Bits 31-24 |  |
>> | DSCP & flag|  some value here |
>> |---|
>>  |
>>  |
>>  v
>> |skb mark---|
>> ||  |
>> |  zeroed|  |
>> |---|
>> 
>>

[PATCH net-next] selftests/net: ipv6 flowlabel

2019-05-27 Thread Willem de Bruijn

From: Willem de Bruijn 

Test the IPv6 flowlabel control and datapath interfaces:

Acquire and release the right to use flowlabels with socket option
IPV6_FLOWLABEL_MGR.

Then configure flowlabels on send and read them on recv with cmsg
IPV6_FLOWINFO. Also verify auto-flowlabel if not explicitly set.

This helped identify the issue fixed in commit 95c169251bf73 ("ipv6:
invert flowlabel sharing check in process and user mode")

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/.gitignore|   2 +
 tools/testing/selftests/net/Makefile  |   4 +-
 tools/testing/selftests/net/ipv6_flowlabel.c  | 230 ++
 tools/testing/selftests/net/ipv6_flowlabel.sh |  22 ++
 .../selftests/net/ipv6_flowlabel_mgr.c| 200 +++
 5 files changed, 456 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/ipv6_flowlabel.c
 create mode 100755 tools/testing/selftests/net/ipv6_flowlabel.sh
 create mode 100644 tools/testing/selftests/net/ipv6_flowlabel_mgr.c

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 27ef4d07ac915..99a4e41d52499 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -18,3 +18,5 @@ tls
 txring_overwrite
 ip_defrag
 so_txtime
+flowlabel
+flowlabel_mgr
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 8af7869e0f1c8..8343fb9d8a463 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,13 +9,13 @@ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh 
netdevice.sh \
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh 
reuseport_addr_any.sh
-TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh
+TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd txring_overwrite
 TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
-TEST_GEN_FILES += so_txtime
+TEST_GEN_FILES += so_txtime ipv6_flowlabel ipv6_flowlabel_mgr
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
diff --git a/tools/testing/selftests/net/ipv6_flowlabel.c 
b/tools/testing/selftests/net/ipv6_flowlabel.c
new file mode 100644
index 0..7b1cf51084a62
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_flowlabel.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Test IPV6_FLOWINFO cmsg on send and recv */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* uapi/glibc weirdness may leave this undefined */
+#ifndef IPV6_FLOWINFO
+#define IPV6_FLOWINFO 11
+#endif
+
+#ifndef IPV6_FLOWLABEL_MGR
+#define IPV6_FLOWLABEL_MGR 32
+#endif
+
+#define FLOWLABEL_WILDCARD ((uint32_t) -1)
+
+static const char cfg_data[]   = "a";
+static uint32_t cfg_label  = 1;
+
+static void do_send(int fd, bool with_flowlabel, uint32_t flowlabel)
+{
+   char control[CMSG_SPACE(sizeof(flowlabel))] = {0};
+   struct msghdr msg = {0};
+   struct iovec iov = {0};
+   int ret;
+
+   iov.iov_base = (char *)cfg_data;
+   iov.iov_len = sizeof(cfg_data);
+
+   msg.msg_iov = &iov;
+   msg.msg_iovlen = 1;
+
+   if (with_flowlabel) {
+   struct cmsghdr *cm;
+
+   cm = (void *)control;
+   cm->cmsg_len = CMSG_LEN(sizeof(flowlabel));
+   cm->cmsg_level = SOL_IPV6;
+   cm->cmsg_type = IPV6_FLOWINFO;
+   *(uint32_t *)CMSG_DATA(cm) = htonl(flowlabel);
+
+   msg.msg_control = control;
+   msg.msg_controllen = sizeof(control);
+   }
+
+   ret = sendmsg(fd, &msg, 0);
+   if (ret == -1)
+   error(1, errno, "send");
+
+   if (with_flowlabel)
+   fprintf(stderr, "sent with label %u\n", flowlabel);
+   else
+   fprintf(stderr, "sent without label\n");
+}
+
+static void do_recv(int fd, bool with_flowlabel, uint32_t expect)
+{
+   char control[CMSG_SPACE(sizeof(expect))];
+   char data[sizeof(cfg_data)];
+   struct msghdr msg = {0};
+   struct iovec iov = {0};
+   struct cmsghdr *cm;
+   uint32_t flowlabel;
+   int ret;
+
+   iov.iov_base = data;
+   iov.iov_len = sizeof(data);
+
+   msg.msg_iov = &iov;
+   msg.msg_iovlen = 1;
+
+   memset(control, 0, sizeof(control));
+   msg.msg_control = control;
+   msg.msg_controllen = sizeof(control);
+
+

Re: [PATCH net-next v5] net: sched: Introduce act_ctinfo action

2019-05-27 Thread Toke Høiland-Jørgensen

Kevin 'ldir' Darbyshire-Bryant  writes:

> I have to call it a day. I have no idea why the patches are becoming
> corrupt and hence how to fix it, it’s probably something Apple has
> done to git, or maybe MS to my email server.

Or maybe it's just that your editor saves things with the wrong type of
line ending (if you're on a Mac)?

> Sadly I also think that the only way this patch/functionality will
> ever be acceptable is if someone else writes it, where they or their
> company can take the credit/blame.

Not sure why you would think so.

> I tried very hard to approach the process of upstream submission in a
> positive way, seeking advice & guidance in the form of RFC patches,
> many rounds later I feel they’re further away from acceptance than
> ever.

Not sure why you'd think that either; I thought you were rather close,
actually...

> Clearly it is not desired functionality/code otherwise it would have
> been written by now and I cannot face another 3 rounds of the same
> thing for act_ctinfo user space, the x_tables/nf_tables kernel helper
> to store the DSCP in the first place and the user space code to handle
> that.
>
> As a rank outsider, amateur coder I shall leave it that I’ve found the
> process completely discouraging. The professionals are of course paid
> to deal with this.

It's up to you if you want to continue, of course; but honestly, I'm not
actually sure what it is you are finding hard to "deal with"? No one has
told you "go away, this is junk"; you've gotten a few suggestions for
improvements, most of which you have already fixed. So what, exactly, is
the problem? :)

-Toke

[PATCH] tc: flower: fix port value truncation

2019-05-27 Thread Lukasz Czapnik

sscanf truncates read port values silently without any error. As sscanf
man says:
(...) sscanf() conform to C89 and C99 and POSIX.1-2001. These standards
do not specify the ERANGE error.

Replace sscanf with safer get_be16 that returns error when value is out
of range.

Example:
tc filter add dev eth0 protocol ip parent : prio 1 flower ip_proto
tcp dst_port 7 hw_tc 1

Would result in filter for port 4464 without any warning.

Fixes: 8930840e678b ("tc: flower: Classify packets based port ranges")
Signed-off-by: Lukasz Czapnik 
---
 tc/f_flower.c | 48 
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/tc/f_flower.c b/tc/f_flower.c
index 9659e894..e2420d92 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -493,23 +493,40 @@ static int flower_port_range_attr_type(__u8 ip_proto, 
enum flower_endpoint type,
return 0;
 }
 
+/* parse range args in format 10-20 */
+static int parse_range(char *str, __be16 *min, __be16 *max)
+{
+   char *sep;
+
+   sep = strchr(str, '-');
+   if (sep) {
+   *sep = '\0';
+
+   if (get_be16(min, str, 10))
+   return -1;
+
+   if (get_be16(max, sep + 1, 10))
+   return -1;
+   } else {
+   if (get_be16(min, str, 10))
+   return -1;
+   }
+   return 0;
+}
+
 static int flower_parse_port(char *str, __u8 ip_proto,
 enum flower_endpoint endpoint,
 struct nlmsghdr *n)
 {
-   __u16 min, max;
+   __be16 min = 0;
+   __be16 max = 0;
int ret;
 
-   ret = sscanf(str, "%hu-%hu", &min, &max);
-
-   if (ret == 1) {
-   int type;
+   ret = parse_range(str, &min, &max);
+   if (ret)
+   return -1;
 
-   type = flower_port_attr_type(ip_proto, endpoint);
-   if (type < 0)
-   return -1;
-   addattr16(n, MAX_MSG, type, htons(min));
-   } else if (ret == 2) {
+   if (min && max) {
__be16 min_port_type, max_port_type;
 
if (max <= min) {
@@ -520,8 +537,15 @@ static int flower_parse_port(char *str, __u8 ip_proto,
&min_port_type, &max_port_type))
return -1;
 
-   addattr16(n, MAX_MSG, min_port_type, htons(min));
-   addattr16(n, MAX_MSG, max_port_type, htons(max));
+   addattr16(n, MAX_MSG, min_port_type, min);
+   addattr16(n, MAX_MSG, max_port_type, max);
+   } else if (min && !max) {
+   int type;
+
+   type = flower_port_attr_type(ip_proto, endpoint);
+   if (type < 0)
+   return -1;
+   addattr16(n, MAX_MSG, type, min);
} else {
return -1;
}
-- 
2.17.2

[PATCH 03/11] net: phy: Check against net_device being NULL

2019-05-27 Thread Ioana Ciornei

In general, we don't want MAC drivers calling phy_attach_direct with the
net_device being NULL. Add checks against this in all the functions
calling it: phy_attach() and phy_connect_direct().

Signed-off-by: Ioana Ciornei 
Suggested-by: Andrew Lunn 
---
 drivers/net/phy/phy_device.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index da3bf3f70d63..0b7730fd41ba 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -948,6 +948,9 @@ int phy_connect_direct(struct net_device *dev, struct 
phy_device *phydev,
 {
int rc;
 
+   if (!dev)
+   return ERR_PTR(-EINVAL);
+
rc = phy_attach_direct(dev, phydev, phydev->dev_flags, interface);
if (rc)
return rc;
@@ -1307,6 +1310,9 @@ struct phy_device *phy_attach(struct net_device *dev, 
const char *bus_id,
struct device *d;
int rc;
 
+   if (!dev)
+   return ERR_PTR(-EINVAL);
+
/* Search the list of PHY devices on the mdio bus for the
 * PHY with the requested name
 */
-- 
2.21.0

[PATCH 01/11] net: phy: Add phy_sysfs_create_links helper function

2019-05-27 Thread Ioana Ciornei

From: Vladimir Oltean 

This is a cosmetic patch that wraps the operation of creating sysfs
links between the netdev->phydev and the phydev->attached_dev.

This is needed to keep the indentation level in check in a follow-up
patch where this function will be guarded against the existence of a
phydev->attached_dev.

Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
 drivers/net/phy/phy_device.c | 43 ++--
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 5d288da9a3b0..8fd1bf37718b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1133,6 +1133,31 @@ void phy_attached_print(struct phy_device *phydev, const 
char *fmt, ...)
 }
 EXPORT_SYMBOL(phy_attached_print);
 
+static void phy_sysfs_create_links(struct phy_device *phydev)
+{
+   struct net_device *dev = phydev->attached_dev;
+   int err;
+
+   err = sysfs_create_link(&phydev->mdio.dev.kobj, &dev->dev.kobj,
+   "attached_dev");
+   if (err)
+   return;
+
+   err = sysfs_create_link_nowarn(&dev->dev.kobj,
+  &phydev->mdio.dev.kobj,
+  "phydev");
+   if (err) {
+   dev_err(&dev->dev, "could not add device link to %s err %d\n",
+   kobject_name(&phydev->mdio.dev.kobj),
+   err);
+   /* non-fatal - some net drivers can use one netdevice
+* with more then one phy
+*/
+   }
+
+   phydev->sysfs_links = true;
+}
+
 /**
  * phy_attach_direct - attach a network device to a given PHY device pointer
  * @dev: network device to attach
@@ -1216,23 +1241,7 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
 */
phydev->sysfs_links = false;
 
-   err = sysfs_create_link(&phydev->mdio.dev.kobj, &dev->dev.kobj,
-   "attached_dev");
-   if (!err) {
-   err = sysfs_create_link_nowarn(&dev->dev.kobj,
-  &phydev->mdio.dev.kobj,
-  "phydev");
-   if (err) {
-   dev_err(&dev->dev, "could not add device link to %s err 
%d\n",
-   kobject_name(&phydev->mdio.dev.kobj),
-   err);
-   /* non-fatal - some net drivers can use one netdevice
-* with more then one phy
-*/
-   }
-
-   phydev->sysfs_links = true;
-   }
+   phy_sysfs_create_links(phydev);
 
phydev->dev_flags = flags;
 
-- 
2.21.0

[PATCH 00/11] Decoupling PHYLINK from struct net_device

2019-05-27 Thread Ioana Ciornei

Following two separate discussion threads in:
  https://www.spinics.net/lists/netdev/msg569087.html
and:
  https://www.spinics.net/lists/netdev/msg570450.html

Previous RFC patch set: https://www.spinics.net/lists/netdev/msg571995.html

PHYLINK was reworked in order to accept multiple operation types,
PHYLINK_NETDEV and PHYLINK_DEV, passed through a phylink_config
structure alongside the corresponding struct device.

One of the main concerns expressed in the RFC was that using notifiers
to signal the corresponding phylink_mac_ops would break PHYLINK's API
unity and that it would become harder to grep for its users.
Using the current approach, we maintain a common API for all users.
Also, printing useful information in PHYLINK, when decoupled from a
net_device, is achieved using dev_err&co on the struct device received
(in DSA's case is the device corresponding to the dsa_switch).

PHYLIB (which PHYLINK uses) was reworked to the extent that it does not
crash when connecting to a PHY and the net_device pointer is NULL.

Lastly, DSA has been reworked in its way that it handles PHYs for ports
that lack a net_device (CPU and DSA ports).  For these, it was
previously using PHYLIB and is now using the PHYLINK_DEV operation type.
Previously, a driver that wanted to support PHY operations on CPU/DSA
ports has to implement .adjust_link(). This patch set not only gives
drivers the options to use PHYLINK uniformly but also urges them to
convert to it. For compatibility, the old code is kept but it will be
removed once all drivers switch over.

The patchset was tested on the NXP LS1021A-TSN board having the
following Ethernet layout:
  https://lkml.org/lkml/2019/5/5/279
The CPU port was moved from the internal RGMII fixed-link (enet2 ->
switch port 4) to an external loopback Cat5 cable between the enet1 port
and the front-facing swp2 SJA1105 port. In this mode, both the master
and the CPU port have an attached PHY which detects link change events:

[   49.105426] fsl-gianfar soc:ethernet@2d5 eth1: Link is Down
[   50.305486] sja1105 spi0.1: Link is Down
[   53.265596] fsl-gianfar soc:ethernet@2d5 eth1: Link is Up - 1Gbps/Full - 
flow control off
[   54.466304] sja1105 spi0.1: Link is Up - 1Gbps/Full - flow control off

Ioana Ciornei (9):
  net: phy: Guard against the presence of a netdev
  net: phy: Check against net_device being NULL
  net: phy: Add phy_standalone sysfs entry
  net: phylink: Add phylink_mac_link_{up,down} wrapper functions
  net: phylink: Add struct phylink_config to PHYLINK API
  net: phylink: Add PHYLINK_DEV operation type
  net: phylink: Add phylink_{printk,err,warn,info,dbg} macros
  net: dsa: Move the phylink driver calls into port.c
  net: dsa: Use PHYLINK for the CPU/DSA ports

Vladimir Oltean (2):
  net: phy: Add phy_sysfs_create_links helper function
  net: dsa: sja1105: Fix broken fixed-link interfaces on user ports

 Documentation/networking/sfp-phylink.rst  |   5 +-
 drivers/net/dsa/sja1105/sja1105_main.c|  11 +-
 drivers/net/ethernet/marvell/mvneta.c |  36 ++-
 drivers/net/ethernet/marvell/mvpp2/mvpp2.h|   1 +
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  43 ++--
 drivers/net/phy/phy_device.c  | 100 ++--
 drivers/net/phy/phylink.c | 220 +++---
 include/linux/phylink.h   |  57 +++--
 include/net/dsa.h |   2 +
 net/dsa/dsa_priv.h|  17 ++
 net/dsa/port.c| 157 +
 net/dsa/slave.c   |  99 +---
 12 files changed, 491 insertions(+), 257 deletions(-)

-- 
2.21.0

[PATCH 04/11] net: phy: Add phy_standalone sysfs entry

2019-05-27 Thread Ioana Ciornei

Export a phy_standalone device attribute that is meant to give the
indication that this PHY lacks an attached_dev and its corresponding
sysfs link. The attribute will be created only when the
phy_attach_direct() function will be called with a NULL net_device.

Signed-off-by: Ioana Ciornei 
---
 drivers/net/phy/phy_device.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 0b7730fd41ba..711971897e10 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1164,6 +1164,16 @@ static void phy_sysfs_create_links(struct phy_device 
*phydev)
phydev->sysfs_links = true;
 }
 
+static ssize_t
+phy_standalone_show(struct device *dev, struct device_attribute *attr,
+   char *buf)
+{
+   struct phy_device *phydev = to_phy_device(dev);
+
+   return sprintf(buf, "%d\n", !phydev->attached_dev);
+}
+static DEVICE_ATTR_RO(phy_standalone);
+
 /**
  * phy_attach_direct - attach a network device to a given PHY device pointer
  * @dev: network device to attach
@@ -1253,6 +1263,13 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
 
phy_sysfs_create_links(phydev);
 
+   if (!phydev->attached_dev) {
+   err = sysfs_create_file(&phydev->mdio.dev.kobj,
+   &dev_attr_phy_standalone.attr);
+   if (err)
+   phydev_err(phydev, "error creating 'phy_standalone' 
sysfs entry\n");
+   }
+
phydev->dev_flags = flags;
 
phydev->interface = interface;
@@ -1380,6 +1397,11 @@ void phy_detach(struct phy_device *phydev)
sysfs_remove_link(&dev->dev.kobj, "phydev");
sysfs_remove_link(&phydev->mdio.dev.kobj, "attached_dev");
}
+
+   if (!phydev->attached_dev)
+   sysfs_remove_file(&phydev->mdio.dev.kobj,
+ &dev_attr_phy_standalone.attr);
+
phy_suspend(phydev);
if (dev) {
phydev->attached_dev->phydev = NULL;
-- 
2.21.0

[PATCH 10/11] net: dsa: Use PHYLINK for the CPU/DSA ports

2019-05-27 Thread Ioana Ciornei

For DSA switches that do not have an .adjust_link callback, aka those
who transitioned totally to the PHYLINK-compliant API, use PHYLINK to
drive the CPU/DSA ports.

The PHYLIB usage and .adjust_link are kept but deprecated, and users are
asked to transition from it.  The reason why we can't do anything for
them is because PHYLINK does not wrap the fixed-link state behind a
phydev object, so we cannot wrap .phylink_mac_config into .adjust_link
unless we fabricate a phy_device structure.

For these ports, the newly introduced PHYLINK_DEV operation type is
used and the dsa_switch device structure is passed to PHYLINK for
printing purposes.  The handling of the PHYLINK_NETDEV and PHYLINK_DEV
PHYLINK instances is common from the perspective of the driver.

Signed-off-by: Ioana Ciornei 
Signed-off-by: Vladimir Oltean 
---
 net/dsa/port.c | 69 +-
 1 file changed, 63 insertions(+), 6 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 0051f5006248..30d5b08a5653 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -481,12 +481,15 @@ void dsa_port_phylink_mac_link_down(struct phylink_config 
*config,
phy_interface_t interface)
 {
struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
-   struct net_device *dev = dp->slave;
+   struct phy_device *phydev = NULL;
struct dsa_switch *ds = dp->ds;
 
+   if (dsa_is_user_port(ds, dp->index))
+   phydev = dp->slave->phydev;
+
if (!ds->ops->phylink_mac_link_down) {
-   if (ds->ops->adjust_link && dev->phydev)
-   ds->ops->adjust_link(ds, dp->index, dev->phydev);
+   if (ds->ops->adjust_link && phydev)
+   ds->ops->adjust_link(ds, dp->index, phydev);
return;
}
 
@@ -500,12 +503,11 @@ void dsa_port_phylink_mac_link_up(struct phylink_config 
*config,
  struct phy_device *phydev)
 {
struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
-   struct net_device *dev = dp->slave;
struct dsa_switch *ds = dp->ds;
 
if (!ds->ops->phylink_mac_link_up) {
-   if (ds->ops->adjust_link && dev->phydev)
-   ds->ops->adjust_link(ds, dp->index, dev->phydev);
+   if (ds->ops->adjust_link && phydev)
+   ds->ops->adjust_link(ds, dp->index, phydev);
return;
}
 
@@ -599,8 +601,53 @@ static int dsa_port_fixed_link_register_of(struct dsa_port 
*dp)
return 0;
 }
 
+int dsa_port_phylink_register(struct dsa_port *dp)
+{
+   struct dsa_switch *ds = dp->ds;
+   struct device_node *port_dn = dp->dn;
+   int mode, err;
+
+   mode = of_get_phy_mode(port_dn);
+   if (mode < 0)
+   mode = PHY_INTERFACE_MODE_NA;
+
+   dp->pl_config.dev = ds->dev;
+   dp->pl_config.type = PHYLINK_DEV;
+
+   dp->pl = phylink_create(&dp->pl_config, of_fwnode_handle(port_dn),
+   mode, &dsa_port_phylink_mac_ops);
+   if (IS_ERR(dp->pl)) {
+   pr_err("error creating PHYLINK: %ld\n", PTR_ERR(dp->pl));
+   return PTR_ERR(dp->pl);
+   }
+
+   err = phylink_of_phy_connect(dp->pl, port_dn, 0);
+   if (err) {
+   pr_err("could not attach to PHY: %d\n", err);
+   goto err_phy_connect;
+   }
+
+   rtnl_lock();
+   phylink_start(dp->pl);
+   rtnl_unlock();
+
+   return 0;
+
+err_phy_connect:
+   phylink_destroy(dp->pl);
+   return err;
+}
+
 int dsa_port_link_register_of(struct dsa_port *dp)
 {
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->adjust_link)
+   return dsa_port_phylink_register(dp);
+
+   dev_warn(ds->dev,
+"Using legacy PHYLIB callbacks. Please migrate to PHYLINK!\n");
+
if (of_phy_is_fixed_link(dp->dn))
return dsa_port_fixed_link_register_of(dp);
else
@@ -609,6 +656,16 @@ int dsa_port_link_register_of(struct dsa_port *dp)
 
 void dsa_port_link_unregister_of(struct dsa_port *dp)
 {
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->adjust_link) {
+   rtnl_lock();
+   phylink_disconnect_phy(dp->pl);
+   rtnl_unlock();
+   phylink_destroy(dp->pl);
+   return;
+   }
+
if (of_phy_is_fixed_link(dp->dn))
of_phy_deregister_fixed_link(dp->dn);
else
-- 
2.21.0

[PATCH 02/11] net: phy: Guard against the presence of a netdev

2019-05-27 Thread Ioana Ciornei

A prerequisite for PHYLIB to work in the absence of a struct net_device
is to not access pointers to it.

Changes are needed in the following areas:

 - Printing: In some places netdev_err was replaced with phydev_err.

 - Incrementing reference count to the parent MDIO bus driver: If there
   is no net device, then the reference count should definitely be
   incremented since there is no chance that it was an Ethernet driver
   who registered the MDIO bus.

 - Sysfs links are not created in case there is no attached_dev.

 - No netif_carrier_off is done if there is no attached_dev.

Signed-off-by: Ioana Ciornei 
Signed-off-by: Vladimir Oltean 
Reviewed-by: Florian Fainelli 
---
 drivers/net/phy/phy_device.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 8fd1bf37718b..da3bf3f70d63 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1138,6 +1138,9 @@ static void phy_sysfs_create_links(struct phy_device 
*phydev)
struct net_device *dev = phydev->attached_dev;
int err;
 
+   if (!dev)
+   return;
+
err = sysfs_create_link(&phydev->mdio.dev.kobj, &dev->dev.kobj,
"attached_dev");
if (err)
@@ -1176,9 +1179,9 @@ static void phy_sysfs_create_links(struct phy_device 
*phydev)
 int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  u32 flags, phy_interface_t interface)
 {
-   struct module *ndev_owner = dev->dev.parent->driver->owner;
struct mii_bus *bus = phydev->mdio.bus;
struct device *d = &phydev->mdio.dev;
+   struct module *ndev_owner = NULL;
bool using_genphy = false;
int err;
 
@@ -1187,8 +1190,10 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
 * our own module->refcnt here, otherwise we would not be able to
 * unload later on.
 */
+   if (dev)
+   ndev_owner = dev->dev.parent->driver->owner;
if (ndev_owner != bus->owner && !try_module_get(bus->owner)) {
-   dev_err(&dev->dev, "failed to get the bus module\n");
+   phydev_err(phydev, "failed to get the bus module\n");
return -EIO;
}
 
@@ -1207,7 +1212,7 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
}
 
if (!try_module_get(d->driver->owner)) {
-   dev_err(&dev->dev, "failed to get the device driver module\n");
+   phydev_err(phydev, "failed to get the device driver module\n");
err = -EIO;
goto error_put_device;
}
@@ -1228,8 +1233,10 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
}
 
phydev->phy_link_change = phy_link_change;
-   phydev->attached_dev = dev;
-   dev->phydev = phydev;
+   if (dev) {
+   phydev->attached_dev = dev;
+   dev->phydev = phydev;
+   }
 
/* Some Ethernet drivers try to connect to a PHY device before
 * calling register_netdevice() -> netdev_register_kobject() and
@@ -1252,7 +1259,8 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
/* Initial carrier state is off as the phy is about to be
 * (re)initialized.
 */
-   netif_carrier_off(phydev->attached_dev);
+   if (dev)
+   netif_carrier_off(phydev->attached_dev);
 
/* Do initial configuration here, now that
 * we have certain key parameters
@@ -1358,16 +1366,19 @@ EXPORT_SYMBOL_GPL(phy_driver_is_genphy_10g);
 void phy_detach(struct phy_device *phydev)
 {
struct net_device *dev = phydev->attached_dev;
-   struct module *ndev_owner = dev->dev.parent->driver->owner;
+   struct module *ndev_owner = NULL;
struct mii_bus *bus;
 
if (phydev->sysfs_links) {
-   sysfs_remove_link(&dev->dev.kobj, "phydev");
+   if (dev)
+   sysfs_remove_link(&dev->dev.kobj, "phydev");
sysfs_remove_link(&phydev->mdio.dev.kobj, "attached_dev");
}
phy_suspend(phydev);
-   phydev->attached_dev->phydev = NULL;
-   phydev->attached_dev = NULL;
+   if (dev) {
+   phydev->attached_dev->phydev = NULL;
+   phydev->attached_dev = NULL;
+   }
phydev->phylink = NULL;
 
phy_led_triggers_unregister(phydev);
@@ -1390,6 +1401,8 @@ void phy_detach(struct phy_device *phydev)
bus = phydev->mdio.bus;
 
put_device(&phydev->mdio.dev);
+   if (dev)
+   ndev_owner = dev->dev.parent->driver->owner;
if (ndev_owner != bus->owner)
module_put(bus->owner);
 
-- 
2.21.0

[PATCH 09/11] net: dsa: Move the phylink driver calls into port.c

2019-05-27 Thread Ioana Ciornei

In order to have a common handling of PHYLINK for the slave and non-user
ports, the DSA core glue logic (between PHYLINK and the driver) must use
an API that does not rely on a struct net_device.

These will also be called by the CPU-port-handling code in a further
patch.

Signed-off-by: Ioana Ciornei 
Suggested-by: Vladimir Oltean 
---
 net/dsa/dsa_priv.h |  17 
 net/dsa/port.c | 100 +
 net/dsa/slave.c|  96 +--
 3 files changed, 118 insertions(+), 95 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 8f1222324646..418490bda3a4 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -163,6 +163,23 @@ int dsa_port_vid_add(struct dsa_port *dp, u16 vid, u16 
flags);
 int dsa_port_vid_del(struct dsa_port *dp, u16 vid);
 int dsa_port_link_register_of(struct dsa_port *dp);
 void dsa_port_link_unregister_of(struct dsa_port *dp);
+void dsa_port_phylink_validate(struct phylink_config *config,
+  unsigned long *supported,
+  struct phylink_link_state *state);
+int dsa_port_phylink_mac_link_state(struct phylink_config *config,
+   struct phylink_link_state *state);
+void dsa_port_phylink_mac_config(struct phylink_config *config,
+unsigned int mode,
+const struct phylink_link_state *state);
+void dsa_port_phylink_mac_an_restart(struct phylink_config *config);
+void dsa_port_phylink_mac_link_down(struct phylink_config *config,
+   unsigned int mode,
+   phy_interface_t interface);
+void dsa_port_phylink_mac_link_up(struct phylink_config *config,
+ unsigned int mode,
+ phy_interface_t interface,
+ struct phy_device *phydev);
+extern const struct phylink_mac_ops dsa_port_phylink_mac_ops;
 
 /* slave.c */
 extern const struct dsa_device_ops notag_netdev_ops;
diff --git a/net/dsa/port.c b/net/dsa/port.c
index ed8ba9daa3ba..0051f5006248 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -422,6 +422,106 @@ static struct phy_device *dsa_port_get_phy_device(struct 
dsa_port *dp)
return phydev;
 }
 
+void dsa_port_phylink_validate(struct phylink_config *config,
+  unsigned long *supported,
+  struct phylink_link_state *state)
+{
+   struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->phylink_validate)
+   return;
+
+   ds->ops->phylink_validate(ds, dp->index, supported, state);
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_validate);
+
+int dsa_port_phylink_mac_link_state(struct phylink_config *config,
+   struct phylink_link_state *state)
+{
+   struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
+   struct dsa_switch *ds = dp->ds;
+
+   /* Only called for SGMII and 802.3z */
+   if (!ds->ops->phylink_mac_link_state)
+   return -EOPNOTSUPP;
+
+   return ds->ops->phylink_mac_link_state(ds, dp->index, state);
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_mac_link_state);
+
+void dsa_port_phylink_mac_config(struct phylink_config *config,
+unsigned int mode,
+const struct phylink_link_state *state)
+{
+   struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->phylink_mac_config)
+   return;
+
+   ds->ops->phylink_mac_config(ds, dp->index, mode, state);
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_mac_config);
+
+void dsa_port_phylink_mac_an_restart(struct phylink_config *config)
+{
+   struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->phylink_mac_an_restart)
+   return;
+
+   ds->ops->phylink_mac_an_restart(ds, dp->index);
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_mac_an_restart);
+
+void dsa_port_phylink_mac_link_down(struct phylink_config *config,
+   unsigned int mode,
+   phy_interface_t interface)
+{
+   struct dsa_port *dp = container_of(config, struct dsa_port, pl_config);
+   struct net_device *dev = dp->slave;
+   struct dsa_switch *ds = dp->ds;
+
+   if (!ds->ops->phylink_mac_link_down) {
+   if (ds->ops->adjust_link && dev->phydev)
+   ds->ops->adjust_link(ds, dp->index, dev->phydev);
+   return;
+   }
+
+   ds->ops->phylink_mac_link_down(ds, dp->index, mode, interface);
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_mac_link_down);
+
+void dsa_port_phylink_mac_link_up(struct p

[PATCH 06/11] net: phylink: Add struct phylink_config to PHYLINK API

2019-05-27 Thread Ioana Ciornei

The phylink_config structure will encapsulate a pointer to a struct
device and the operation type requested for this instance of PHYLINK.
This patch does not make any functional changes, it just transitions the
PHYLINK internals and all its users to the new API.

A pointer to a phylink_config structure will be passed to
phylink_create() instead of the net_device directly. Also, the same
phylink_config pointer will be passed back to all phylink_mac_ops
callbacks instead of the net_device. Using this mechanism, a PHYLINK
user can get the original net_device using a structure such as
'to_net_dev(config->dev)' or directly the structure containing the
phylink_config using a container_of call.

At the moment, only the PHYLINK_NETDEV is defined as a valid operation
type for PHYLINK. In this mode, a valid reference to a struct device
linked to the original net_device should be passed to PHYLINK through
the phylink_config structure.

This API changes is mainly driven by the necessity of adding a new
operation type in PHYLINK that disconnects the phy_device from the
net_device and also works when the net_device is lacking.

Signed-off-by: Ioana Ciornei 
Signed-off-by: Vladimir Oltean 
---
 Documentation/networking/sfp-phylink.rst  |  5 +-
 drivers/net/ethernet/marvell/mvneta.c | 36 
 drivers/net/ethernet/marvell/mvpp2/mvpp2.h|  1 +
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 43 --
 drivers/net/phy/phylink.c | 26 ++---
 include/linux/phylink.h   | 56 ---
 include/net/dsa.h |  2 +
 net/dsa/slave.c   | 31 +-
 8 files changed, 128 insertions(+), 72 deletions(-)

diff --git a/Documentation/networking/sfp-phylink.rst 
b/Documentation/networking/sfp-phylink.rst
index 5bd26cb07244..91446b431b70 100644
--- a/Documentation/networking/sfp-phylink.rst
+++ b/Documentation/networking/sfp-phylink.rst
@@ -98,6 +98,7 @@ this documentation.
 4. Add::
 
struct phylink *phylink;
+   struct phylink_config phylink_config;
 
to the driver's private data structure.  We shall refer to the
driver's private data pointer as ``priv`` below, and the driver's
@@ -223,8 +224,10 @@ this documentation.
.. code-block:: c
 
struct phylink *phylink;
+   priv->phylink_config.dev = &dev.dev;
+   priv->phylink_config.type = PHYLINK_NETDEV;
 
-   phylink = phylink_create(dev, node, phy_mode, &phylink_ops);
+   phylink = phylink_create(&priv->phylink_config, node, phy_mode, 
&phylink_ops);
if (IS_ERR(phylink)) {
err = PTR_ERR(phylink);
fail probe;
diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index e758650b2c26..adbbcdde73e6 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -437,6 +437,7 @@ struct mvneta_port {
struct device_node *dn;
unsigned int tx_csum_limit;
struct phylink *phylink;
+   struct phylink_config phylink_config;
struct phy *comphy;
 
struct mvneta_bm *bm_priv;
@@ -3356,9 +3357,11 @@ static int mvneta_set_mac_addr(struct net_device *dev, 
void *addr)
return 0;
 }
 
-static void mvneta_validate(struct net_device *ndev, unsigned long *supported,
+static void mvneta_validate(struct phylink_config *config,
+   unsigned long *supported,
struct phylink_link_state *state)
 {
+   struct net_device *ndev = to_net_dev(config->dev);
struct mvneta_port *pp = netdev_priv(ndev);
__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
@@ -3408,9 +3411,10 @@ static void mvneta_validate(struct net_device *ndev, 
unsigned long *supported,
phylink_helper_basex_speed(state);
 }
 
-static int mvneta_mac_link_state(struct net_device *ndev,
+static int mvneta_mac_link_state(struct phylink_config *config,
 struct phylink_link_state *state)
 {
+   struct net_device *ndev = to_net_dev(config->dev);
struct mvneta_port *pp = netdev_priv(ndev);
u32 gmac_stat;
 
@@ -3438,8 +3442,9 @@ static int mvneta_mac_link_state(struct net_device *ndev,
return 1;
 }
 
-static void mvneta_mac_an_restart(struct net_device *ndev)
+static void mvneta_mac_an_restart(struct phylink_config *config)
 {
+   struct net_device *ndev = to_net_dev(config->dev);
struct mvneta_port *pp = netdev_priv(ndev);
u32 gmac_an = mvreg_read(pp, MVNETA_GMAC_AUTONEG_CONFIG);
 
@@ -3449,9 +3454,10 @@ static void mvneta_mac_an_restart(struct net_device 
*ndev)
gmac_an & ~MVNETA_GMAC_INBAND_RESTART_AN);
 }
 
-static void mvneta_mac_config(struct net_device *ndev, unsigned int mode,
-   const struct phylink_link_state *state)
+static void mvneta_mac_config(struct phylink_config *config, unsigned int mode,
+ const str

[PATCH 11/11] net: dsa: sja1105: Fix broken fixed-link interfaces on user ports

2019-05-27 Thread Ioana Ciornei

From: Vladimir Oltean 

PHYLIB and PHYLINK handle fixed-link interfaces differently. PHYLIB
wraps them in a software PHY ("pseudo fixed link") phydev construct such
that .adjust_link driver callbacks see an unified API. Whereas PHYLINK
simply creates a phylink_link_state structure and passes it to
.mac_config.

At the time the driver was introduced, DSA was using PHYLIB for the
CPU/cascade ports (the ones with no net devices) and PHYLINK for
everything else.

As explained below:

commit aab9c4067d2389d0adfc9c53806437df7b0fe3d5
Author: Florian Fainelli 
Date:   Thu May 10 13:17:36 2018 -0700

  net: dsa: Plug in PHYLINK support

  Drivers that utilize fixed links for user-facing ports (e.g: bcm_sf2)
  will need to implement phylink_mac_ops from now on to preserve
  functionality, since PHYLINK *does not* create a phy_device instance
  for fixed links.

In the above patch, DSA guards the .phylink_mac_config callback against
a NULL phydev pointer.  Therefore, .adjust_link is not called in case of
a fixed-link user port.

This patch fixes the situation by converting the driver from using
.adjust_link to .phylink_mac_config.  This can be done now in a unified
fashion for both slave and CPU/cascade ports because DSA now uses
PHYLINK for all ports.

Signed-off-by: Vladimir Oltean 
Signed-off-by: Ioana Ciornei 
Reviewed-by: Florian Fainelli 
---
 drivers/net/dsa/sja1105/sja1105_main.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_main.c 
b/drivers/net/dsa/sja1105/sja1105_main.c
index 0663b78a2f6c..cfdefd9f1905 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -734,15 +734,16 @@ static int sja1105_adjust_port_config(struct 
sja1105_private *priv, int port,
return sja1105_clocking_setup_port(priv, port);
 }
 
-static void sja1105_adjust_link(struct dsa_switch *ds, int port,
-   struct phy_device *phydev)
+static void sja1105_mac_config(struct dsa_switch *ds, int port,
+  unsigned int link_an_mode,
+  const struct phylink_link_state *state)
 {
struct sja1105_private *priv = ds->priv;
 
-   if (!phydev->link)
+   if (!state->link)
sja1105_adjust_port_config(priv, port, 0, false);
else
-   sja1105_adjust_port_config(priv, port, phydev->speed, true);
+   sja1105_adjust_port_config(priv, port, state->speed, true);
 }
 
 static void sja1105_phylink_validate(struct dsa_switch *ds, int port,
@@ -1515,9 +1516,9 @@ static int sja1105_set_ageing_time(struct dsa_switch *ds,
 static const struct dsa_switch_ops sja1105_switch_ops = {
.get_tag_protocol   = sja1105_get_tag_protocol,
.setup  = sja1105_setup,
-   .adjust_link= sja1105_adjust_link,
.set_ageing_time= sja1105_set_ageing_time,
.phylink_validate   = sja1105_phylink_validate,
+   .phylink_mac_config = sja1105_mac_config,
.get_strings= sja1105_get_strings,
.get_ethtool_stats  = sja1105_get_ethtool_stats,
.get_sset_count = sja1105_get_sset_count,
-- 
2.21.0

[PATCH 05/11] net: phylink: Add phylink_mac_link_{up,down} wrapper functions

2019-05-27 Thread Ioana Ciornei

This is a cosmetic patch that reduces the clutter in phylink_resolve
around calling the .mac_link_up/.mac_link_down driver callbacks.  In a
further patch this logic will be extended to emit notifications in case
a net device does not exist.

Signed-off-by: Ioana Ciornei 
Reviewed-by: Florian Fainelli 
---
 drivers/net/phy/phylink.c | 50 +--
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 74983593834b..83ab83c3edba 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -395,6 +395,34 @@ static const char *phylink_pause_to_str(int pause)
}
 }
 
+static void phylink_mac_link_up(struct phylink *pl,
+   struct phylink_link_state link_state)
+{
+   struct net_device *ndev = pl->netdev;
+
+   pl->ops->mac_link_up(ndev, pl->link_an_mode,
+pl->phy_state.interface,
+pl->phydev);
+
+   netif_carrier_on(ndev);
+
+   netdev_info(ndev,
+   "Link is Up - %s/%s - flow control %s\n",
+   phy_speed_to_str(link_state.speed),
+   phy_duplex_to_str(link_state.duplex),
+   phylink_pause_to_str(link_state.pause));
+}
+
+static void phylink_mac_link_down(struct phylink *pl)
+{
+   struct net_device *ndev = pl->netdev;
+
+   netif_carrier_off(ndev);
+   pl->ops->mac_link_down(ndev, pl->link_an_mode,
+  pl->phy_state.interface);
+   netdev_info(ndev, "Link is Down\n");
+}
+
 static void phylink_resolve(struct work_struct *w)
 {
struct phylink *pl = container_of(w, struct phylink, resolve);
@@ -443,24 +471,10 @@ static void phylink_resolve(struct work_struct *w)
}
 
if (link_state.link != netif_carrier_ok(ndev)) {
-   if (!link_state.link) {
-   netif_carrier_off(ndev);
-   pl->ops->mac_link_down(ndev, pl->link_an_mode,
-  pl->phy_state.interface);
-   netdev_info(ndev, "Link is Down\n");
-   } else {
-   pl->ops->mac_link_up(ndev, pl->link_an_mode,
-pl->phy_state.interface,
-pl->phydev);
-
-   netif_carrier_on(ndev);
-
-   netdev_info(ndev,
-   "Link is Up - %s/%s - flow control %s\n",
-   phy_speed_to_str(link_state.speed),
-   phy_duplex_to_str(link_state.duplex),
-   phylink_pause_to_str(link_state.pause));
-   }
+   if (!link_state.link)
+   phylink_mac_link_down(pl);
+   else
+   phylink_mac_link_up(pl, link_state);
}
if (!link_state.link && pl->mac_link_dropped) {
pl->mac_link_dropped = false;
-- 
2.21.0

[PATCH 07/11] net: phylink: Add PHYLINK_DEV operation type

2019-05-27 Thread Ioana Ciornei

In the PHYLINK_DEV operation type, the PHYLINK infrastructure can work
without an attached net_device. For printing usecases, instead, a struct
device * should be passed to PHYLINK using the phylink_config structure.

Also, netif_carrier_* calls ar guarded by the presence of a valid
net_device. When using the PHYLINK_DEV operation type, we cannot check
link status using the netif_carrier_ok() API so instead, keep an
internal state of the MAC and call mac_link_{down,up} only when the link
changed.

Signed-off-by: Ioana Ciornei 
Signed-off-by: Vladimir Oltean 
---
 drivers/net/phy/phylink.c | 25 -
 include/linux/phylink.h   |  1 +
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 5a283bf9d402..5f6120f3fa3f 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -42,6 +42,8 @@ struct phylink {
struct net_device *netdev;
const struct phylink_mac_ops *ops;
struct phylink_config *config;
+   struct device *dev;
+   unsigned int old_link_state:1;
 
unsigned long phylink_disable_state; /* bitmask of disables */
struct phy_device *phydev;
@@ -404,7 +406,8 @@ static void phylink_mac_link_up(struct phylink *pl,
 pl->phy_state.interface,
 pl->phydev);
 
-   netif_carrier_on(ndev);
+   if (ndev)
+   netif_carrier_on(ndev);
 
netdev_info(ndev,
"Link is Up - %s/%s - flow control %s\n",
@@ -417,7 +420,8 @@ static void phylink_mac_link_down(struct phylink *pl)
 {
struct net_device *ndev = pl->netdev;
 
-   netif_carrier_off(ndev);
+   if (ndev)
+   netif_carrier_off(ndev);
pl->ops->mac_link_down(pl->config, pl->link_an_mode,
   pl->phy_state.interface);
netdev_info(ndev, "Link is Down\n");
@@ -428,6 +432,7 @@ static void phylink_resolve(struct work_struct *w)
struct phylink *pl = container_of(w, struct phylink, resolve);
struct phylink_link_state link_state;
struct net_device *ndev = pl->netdev;
+   int link_changed;
 
mutex_lock(&pl->state_mutex);
if (pl->phylink_disable_state) {
@@ -470,7 +475,13 @@ static void phylink_resolve(struct work_struct *w)
}
}
 
-   if (link_state.link != netif_carrier_ok(ndev)) {
+   if (pl->netdev)
+   link_changed = (link_state.link != netif_carrier_ok(ndev));
+   else
+   link_changed = (link_state.link != pl->old_link_state);
+
+   if (link_changed) {
+   pl->old_link_state = link_state.link;
if (!link_state.link)
phylink_mac_link_down(pl);
else
@@ -571,6 +582,8 @@ struct phylink *phylink_create(struct phylink_config 
*config,
pl->config = config;
if (config->type == PHYLINK_NETDEV) {
pl->netdev = to_net_dev(config->dev);
+   } else if (config->type == PHYLINK_DEV) {
+   pl->dev = config->dev;
} else {
kfree(pl);
return ERR_PTR(-EINVAL);
@@ -910,7 +923,8 @@ void phylink_start(struct phylink *pl)
phy_modes(pl->link_config.interface));
 
/* Always set the carrier off */
-   netif_carrier_off(pl->netdev);
+   if (pl->netdev)
+   netif_carrier_off(pl->netdev);
 
/* Apply the link configuration to the MAC when starting. This allows
 * a fixed-link to start with the correct parameters, and also
@@ -1255,7 +1269,8 @@ int phylink_ethtool_set_pauseparam(struct phylink *pl,
switch (pl->link_an_mode) {
case MLO_AN_PHY:
/* Silently mark the carrier down, and then trigger a 
resolve */
-   netif_carrier_off(pl->netdev);
+   if (pl->netdev)
+   netif_carrier_off(pl->netdev);
phylink_run_resolve(pl);
break;
 
diff --git a/include/linux/phylink.h b/include/linux/phylink.h
index 67f35f07ac4b..0f6f65bb9d44 100644
--- a/include/linux/phylink.h
+++ b/include/linux/phylink.h
@@ -56,6 +56,7 @@ struct phylink_link_state {
 
 enum phylink_op_type {
PHYLINK_NETDEV = 0,
+   PHYLINK_DEV,
 };
 
 /**
-- 
2.21.0

[PATCH 08/11] net: phylink: Add phylink_{printk,err,warn,info,dbg} macros

2019-05-27 Thread Ioana Ciornei

With the latest addition to the PHYLINK infrastructure, we are faced
with a decision on when to print necessary info using the struct
net_device and when with the struct device.

Add a series of macros that encapsulate this decision and replace all
uses of netdev_err&co with phylink_err.

Signed-off-by: Ioana Ciornei 
Signed-off-by: Vladimir Oltean 
---
 drivers/net/phy/phylink.c | 137 +-
 1 file changed, 77 insertions(+), 60 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 5f6120f3fa3f..68d0a89c52be 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -68,6 +68,23 @@ struct phylink {
struct sfp_bus *sfp_bus;
 };
 
+#define phylink_printk(level, pl, fmt, ...) \
+   do { \
+   if ((pl)->config->type == PHYLINK_NETDEV) \
+   netdev_printk(level, (pl)->netdev, fmt, ##__VA_ARGS__); 
\
+   else if ((pl)->config->type == PHYLINK_DEV) \
+   dev_printk(level, (pl)->dev, fmt, ##__VA_ARGS__); \
+   } while (0)
+
+#define phylink_err(pl, fmt, ...) \
+   phylink_printk(KERN_ERR, pl, fmt, ##__VA_ARGS__)
+#define phylink_warn(pl, fmt, ...) \
+   phylink_printk(KERN_WARNING, pl, fmt, ##__VA_ARGS__)
+#define phylink_info(pl, fmt, ...) \
+   phylink_printk(KERN_INFO, pl, fmt, ##__VA_ARGS__)
+#define phylink_dbg(pl, fmt, ...) \
+   phylink_printk(KERN_DEBUG, pl, fmt, ##__VA_ARGS__)
+
 /**
  * phylink_set_port_modes() - set the port type modes in the ethtool mask
  * @mask: ethtool link mode mask
@@ -164,7 +181,7 @@ static int phylink_parse_fixedlink(struct phylink *pl,
ret = fwnode_property_read_u32_array(fwnode, "fixed-link",
 NULL, 0);
if (ret != ARRAY_SIZE(prop)) {
-   netdev_err(pl->netdev, "broken fixed-link?\n");
+   phylink_err(pl, "broken fixed-link?\n");
return -EINVAL;
}
 
@@ -183,8 +200,8 @@ static int phylink_parse_fixedlink(struct phylink *pl,
 
if (pl->link_config.speed > SPEED_1000 &&
pl->link_config.duplex != DUPLEX_FULL)
-   netdev_warn(pl->netdev, "fixed link specifies half duplex for 
%dMbps link?\n",
-   pl->link_config.speed);
+   phylink_warn(pl, "fixed link specifies half duplex for %dMbps 
link?\n",
+pl->link_config.speed);
 
bitmap_fill(pl->supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
linkmode_copy(pl->link_config.advertising, pl->supported);
@@ -197,9 +214,9 @@ static int phylink_parse_fixedlink(struct phylink *pl,
if (s) {
__set_bit(s->bit, pl->supported);
} else {
-   netdev_warn(pl->netdev, "fixed link %s duplex %dMbps not 
recognised\n",
-   pl->link_config.duplex == DUPLEX_FULL ? "full" : 
"half",
-   pl->link_config.speed);
+   phylink_warn(pl, "fixed link %s duplex %dMbps not recognised\n",
+pl->link_config.duplex == DUPLEX_FULL ? "full" : 
"half",
+pl->link_config.speed);
}
 
linkmode_and(pl->link_config.advertising, pl->link_config.advertising,
@@ -224,8 +241,8 @@ static int phylink_parse_mode(struct phylink *pl, struct 
fwnode_handle *fwnode)
if (fwnode_property_read_string(fwnode, "managed", &managed) == 0 &&
strcmp(managed, "in-band-status") == 0) {
if (pl->link_an_mode == MLO_AN_FIXED) {
-   netdev_err(pl->netdev,
-  "can't use both fixed-link and 
in-band-status\n");
+   phylink_err(pl,
+   "can't use both fixed-link and 
in-band-status\n");
return -EINVAL;
}
 
@@ -272,17 +289,17 @@ static int phylink_parse_mode(struct phylink *pl, struct 
fwnode_handle *fwnode)
break;
 
default:
-   netdev_err(pl->netdev,
-  "incorrect link mode %s for in-band 
status\n",
-  phy_modes(pl->link_config.interface));
+   phylink_err(pl,
+   "incorrect link mode %s for in-band 
status\n",
+   phy_modes(pl->link_config.interface));
return -EINVAL;
}
 
linkmode_copy(pl->link_config.advertising, pl->supported);
 
if (phylink_validate(pl, pl->supported, &pl->link_config)) {
-   netdev_err(pl->netdev,
-  "failed to validate link configuration for 
in-band status\n");
+   phylink_err(pl,
+   "failed to validate link configuratio

Re: [PATCH 03/11] net: phy: Check against net_device being NULL

2019-05-27 Thread kbuild test robot

Hi Ioana,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v5.2-rc2 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Ioana-Ciornei/Decoupling-PHYLINK-from-struct-net_device/20190528-061507
config: x86_64-randconfig-x004201921-201921 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All warnings (new ones prefixed by >>):

   drivers/net//phy/phy_device.c: In function 'phy_connect_direct':
>> drivers/net//phy/phy_device.c:952:10: warning: return makes integer from 
>> pointer without a cast [-Wint-conversion]
  return ERR_PTR(-EINVAL);
 ^~~~

vim +952 drivers/net//phy/phy_device.c

   937  
   938  /**
   939   * phy_connect_direct - connect an ethernet device to a specific 
phy_device
   940   * @dev: the network device to connect
   941   * @phydev: the pointer to the phy device
   942   * @handler: callback function for state change notifications
   943   * @interface: PHY device's interface
   944   */
   945  int phy_connect_direct(struct net_device *dev, struct phy_device 
*phydev,
   946 void (*handler)(struct net_device *),
   947 phy_interface_t interface)
   948  {
   949  int rc;
   950  
   951  if (!dev)
 > 952  return ERR_PTR(-EINVAL);
   953  
   954  rc = phy_attach_direct(dev, phydev, phydev->dev_flags, 
interface);
   955  if (rc)
   956  return rc;
   957  
   958  phy_prepare_link(phydev, handler);
   959  if (phy_interrupt_is_valid(phydev))
   960  phy_request_interrupt(phydev);
   961  
   962  return 0;
   963  }
   964  EXPORT_SYMBOL(phy_connect_direct);
   965  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH 01/11] net: phy: Add phy_sysfs_create_links helper function

2019-05-27 Thread Fabio Estevam

Hi Ioana,

On Mon, May 27, 2019 at 6:47 PM Ioana Ciornei  wrote:
>
> From: Vladimir Oltean 
>
> This is a cosmetic patch that wraps the operation of creating sysfs
> links between the netdev->phydev and the phydev->attached_dev.
>
> This is needed to keep the indentation level in check in a follow-up
> patch where this function will be guarded against the existence of a
> phydev->attached_dev.
>
> Signed-off-by: Vladimir Oltean 
> Reviewed-by: Florian Fainelli 

As you are transmitting the patch authored by other person, you need
to provide your Signed-off-by tag.

[PATCH net-next 3/3] inet: frags: fix use-after-free read in inet_frag_destroy_rcu

2019-05-27 Thread Eric Dumazet

As caught by syzbot [1], the rcu grace period that is respected
before fqdir_rwork_fn() proceeds and frees fqdir is not enough
to prevent inet_frag_destroy_rcu() being run after the freeing.

We need a proper rcu_barrier() synchronization to replace
the one we had in inet_frags_fini()

We also have to fix a potential problem at module removal :
inet_frags_fini() needs to make sure that all queued work queues
(fqdir_rwork_fn) have completed, otherwise we might
call kmem_cache_destroy() too soon and get another use-after-free.

[1]
BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 
net/ipv4/inet_fragment.c:201
Read of size 8 at addr 88806ed47a18 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch kernel/rcu/tree.c:2092 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
 rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 
f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4  90 55 48 89 e5 41 57 
41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
RSP: 0018:8880a98e7d78 EFLAGS: 0282 ORIG_RAX: ff13
RAX: 11164e11 RBX: 8880a98d4340 RCX: 
RDX: dc00 RSI: 0006 RDI: 8880a98d4bbc
RBP: 8880a98e7da8 R08: 8880a98d4340 R09: 
R10:  R11:  R12: 0001
R13: 88b27078 R14: 0001 R15: 
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
 default_idle_call+0x36/0x90 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x377/0x560 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
 start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243

Allocated by task 8877:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
 kmalloc include/linux/slab.h:547 [inline]
 kzalloc include/linux/slab.h:742 [inline]
 fqdir_init include/net/inet_frag.h:115 [inline]
 ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 17:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at 88806ed47a00
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 24 bytes inside of
 512-byte region [88806ed47a00, 88806ed47c00)
The buggy address belongs to the page:
page:ea0001bb51c0 refcount:1 mapcount:0 mapping:8880aa400940 index:0x0
flags: 0x1fffc000200(slab)
raw: 01fffc000200 ea000282a788 ea0001bb53c8 8880aa400940
raw:  88806ed47000 00010006 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88806ed4

[PATCH net-next 0/3] inet: frags: followup to 'inet-frags-avoid-possible-races-at-netns-dismantle'

2019-05-27 Thread Eric Dumazet

Latest patch series ('inet-frags-avoid-possible-races-at-netns-dismantle')
brought another syzbot report shown in the third patch changelog.

While fixing the issue, I had to call inet_frags_fini() later
in IPv6 and ilowpan.

Also I believe a completion is needed to ensure proper dismantle
at module removal.

Eric Dumazet (3):
  inet: frags: uninline fqdir_init()
  inet: frags: call inet_frags_fini() after unregister_pernet_subsys()
  inet: frags: fix use-after-free read in inet_frag_destroy_rcu

 include/net/inet_frag.h | 23 +++--
 net/ieee802154/6lowpan/reassembly.c |  2 +-
 net/ipv4/inet_fragment.c| 39 +++--
 net/ipv6/reassembly.c   |  2 +-
 4 files changed, 43 insertions(+), 23 deletions(-)

-- 
2.22.0.rc1.257.g3120a18244-goog

[PATCH net-next 2/3] inet: frags: call inet_frags_fini() after unregister_pernet_subsys()

2019-05-27 Thread Eric Dumazet

Both IPv6 and 6lowpan are calling inet_frags_fini() too soon.

inet_frags_fini() is dismantling a kmem_cache, that might be needed
later when unregister_pernet_subsys() eventually has to remove
frags queues from hash tables and free them.

This fixes potential use-after-free, and is a prereq for the following patch.

Fixes: d4ad4d22e7ac ("inet: frags: use kmem_cache for inet_frag_queue")
Signed-off-by: Eric Dumazet 
---
 net/ieee802154/6lowpan/reassembly.c | 2 +-
 net/ipv6/reassembly.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 
e59c3b7089691ef95ce3b7ce02afe68ffc256dcc..5b56f16ed86b09ac7235ebaf741cc8c434bbef5c
 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -540,7 +540,7 @@ int __init lowpan_net_frag_init(void)
 
 void lowpan_net_frag_exit(void)
 {
-   inet_frags_fini(&lowpan_frags);
lowpan_frags_sysctl_unregister();
unregister_pernet_subsys(&lowpan_frags_ops);
+   inet_frags_fini(&lowpan_frags);
 }
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 
836ea964cf140d8b0134894455f18addc069e13e..ff5b6d8de2c6e65f5b2925649c77fad900b1e768
 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -583,8 +583,8 @@ int __init ipv6_frag_init(void)
 
 void ipv6_frag_exit(void)
 {
-   inet_frags_fini(&ip6_frags);
ip6_frags_sysctl_unregister();
unregister_pernet_subsys(&ip6_frags_ops);
inet6_del_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+   inet_frags_fini(&ip6_frags);
 }
-- 
2.22.0.rc1.257.g3120a18244-goog

[PATCH net-next 1/3] inet: frags: uninline fqdir_init()

2019-05-27 Thread Eric Dumazet

fqdir_init() is not fast path and is getting bigger.

Signed-off-by: Eric Dumazet 
---
 include/net/inet_frag.h  | 20 +---
 net/ipv4/inet_fragment.c | 19 +++
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 
002f23c1a1a7126e146f596300aee0e52b6cafc6..94092b1ef22e9729d99d56323a77faf1ea4c92a6
 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -109,25 +109,7 @@ struct inet_frags {
 int inet_frags_init(struct inet_frags *);
 void inet_frags_fini(struct inet_frags *);
 
-static inline int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f,
-struct net *net)
-{
-   struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
-   int res;
-
-   if (!fqdir)
-   return -ENOMEM;
-   fqdir->f = f;
-   fqdir->net = net;
-   res = rhashtable_init(&fqdir->rhashtable, &fqdir->f->rhash_params);
-   if (res < 0) {
-   kfree(fqdir);
-   return res;
-   }
-   *fqdirp = fqdir;
-   return 0;
-}
-
+int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net);
 void fqdir_exit(struct fqdir *fqdir);
 
 void inet_frag_kill(struct inet_frag_queue *q);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 
6ca9523374dab03737cd073a1aa990130c4a87ca..7c07aae969e6c84d4f0345b5c4852b2e37d088f6
 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -154,6 +154,25 @@ static void fqdir_rwork_fn(struct work_struct *work)
kfree(fqdir);
 }
 
+int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
+{
+   struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
+   int res;
+
+   if (!fqdir)
+   return -ENOMEM;
+   fqdir->f = f;
+   fqdir->net = net;
+   res = rhashtable_init(&fqdir->rhashtable, &fqdir->f->rhash_params);
+   if (res < 0) {
+   kfree(fqdir);
+   return res;
+   }
+   *fqdirp = fqdir;
+   return 0;
+}
+EXPORT_SYMBOL(fqdir_init);
+
 void fqdir_exit(struct fqdir *fqdir)
 {
fqdir->high_thresh = 0; /* prevent creation of new frags */
-- 
2.22.0.rc1.257.g3120a18244-goog

[PATCH net] llc: fix skb leak in llc_build_and_send_ui_pkt()

2019-05-27 Thread Eric Dumazet

If llc_mac_hdr_init() returns an error, we must drop the skb
since no llc_build_and_send_ui_pkt() caller will take care of this.

BUG: memory leak
unreferenced object 0x8881202b6800 (size 2048):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.590s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
1a 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@
  backtrace:
[] kmemleak_alloc_recursive include/linux/kmemleak.h:55 
[inline]
[] slab_post_alloc_hook mm/slab.h:439 [inline]
[] slab_alloc mm/slab.c:3326 [inline]
[] __do_kmalloc mm/slab.c:3658 [inline]
[] __kmalloc+0x161/0x2c0 mm/slab.c:3669
[] kmalloc include/linux/slab.h:552 [inline]
[] sk_prot_alloc+0xd6/0x170 net/core/sock.c:1608
[] sk_alloc+0x35/0x2f0 net/core/sock.c:1662
[<2ecae075>] llc_sk_alloc+0x35/0x170 net/llc/llc_conn.c:950
[<551f7c47>] llc_ui_create+0x7b/0x140 net/llc/af_llc.c:173
[<29027f0e>] __sock_create+0x164/0x250 net/socket.c:1430
[<8bdec225>] sock_create net/socket.c:1481 [inline]
[<8bdec225>] __sys_socket+0x69/0x110 net/socket.c:1523
[] __do_sys_socket net/socket.c:1532 [inline]
[] __se_sys_socket net/socket.c:1530 [inline]
[] __x64_sys_socket+0x1e/0x30 net/socket.c:1530
[] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
[<0c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0x88811d750d00 (size 224):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.600s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 f0 0c 24 81 88 ff ff 00 68 2b 20 81 88 ff ff  ...$.h+ 
  backtrace:
[<53026172>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 
[inline]
[<53026172>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<53026172>] slab_alloc_node mm/slab.c:3269 [inline]
[<53026172>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
[] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
[] alloc_skb include/linux/skbuff.h:1058 [inline]
[] alloc_skb_with_frags+0x5f/0x250 net/core/skbuff.c:5327
[<0a34a2e7>] sock_alloc_send_pskb+0x269/0x2a0 net/core/sock.c:2225
[] sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2242
[] llc_ui_sendmsg+0x10a/0x540 net/llc/af_llc.c:933
[] sock_sendmsg_nosec net/socket.c:652 [inline]
[] sock_sendmsg+0x54/0x70 net/socket.c:671
[<3b687167>] __sys_sendto+0x148/0x1f0 net/socket.c:1964
[<922d78d9>] __do_sys_sendto net/socket.c:1976 [inline]
[<922d78d9>] __se_sys_sendto net/socket.c:1972 [inline]
[<922d78d9>] __x64_sys_sendto+0x2a/0x30 net/socket.c:1972
[] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
[<0c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/llc/llc_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/llc/llc_output.c b/net/llc/llc_output.c
index 
94425e421213a3fd2719428244156a890a98127a..9e4b6bcf6920d216530c9ac89cdb197a32bed473
 100644
--- a/net/llc/llc_output.c
+++ b/net/llc/llc_output.c
@@ -72,6 +72,8 @@ int llc_build_and_send_ui_pkt(struct llc_sap *sap, struct 
sk_buff *skb,
rc = llc_mac_hdr_init(skb, skb->dev->dev_addr, dmac);
if (likely(!rc))
rc = dev_queue_xmit(skb);
+   else
+   kfree_skb(skb);
return rc;
 }
 
-- 
2.22.0.rc1.257.g3120a18244-goog

Re: [PATCH 06/11] net: phylink: Add struct phylink_config to PHYLINK API

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> The phylink_config structure will encapsulate a pointer to a struct
> device and the operation type requested for this instance of PHYLINK.
> This patch does not make any functional changes, it just transitions the
> PHYLINK internals and all its users to the new API.
> 
> A pointer to a phylink_config structure will be passed to
> phylink_create() instead of the net_device directly. Also, the same
> phylink_config pointer will be passed back to all phylink_mac_ops
> callbacks instead of the net_device. Using this mechanism, a PHYLINK
> user can get the original net_device using a structure such as
> 'to_net_dev(config->dev)' or directly the structure containing the
> phylink_config using a container_of call.
> 
> At the moment, only the PHYLINK_NETDEV is defined as a valid operation
> type for PHYLINK. In this mode, a valid reference to a struct device
> linked to the original net_device should be passed to PHYLINK through
> the phylink_config structure.
> 
> This API changes is mainly driven by the necessity of adding a new
> operation type in PHYLINK that disconnects the phy_device from the
> net_device and also works when the net_device is lacking.
> 
> Signed-off-by: Ioana Ciornei 
> Signed-off-by: Vladimir Oltean 

The PHYLINK and DSA portions look good to me, and this is a lot nicer
than the notifier, thanks for coming up with that scheme:

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 08/11] net: phylink: Add phylink_{printk,err,warn,info,dbg} macros

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> With the latest addition to the PHYLINK infrastructure, we are faced
> with a decision on when to print necessary info using the struct
> net_device and when with the struct device.
> 
> Add a series of macros that encapsulate this decision and replace all
> uses of netdev_err&co with phylink_err.
> 
> Signed-off-by: Ioana Ciornei 
> Signed-off-by: Vladimir Oltean 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 10/11] net: dsa: Use PHYLINK for the CPU/DSA ports

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> For DSA switches that do not have an .adjust_link callback, aka those
> who transitioned totally to the PHYLINK-compliant API, use PHYLINK to
> drive the CPU/DSA ports.
> 
> The PHYLIB usage and .adjust_link are kept but deprecated, and users are
> asked to transition from it.  The reason why we can't do anything for
> them is because PHYLINK does not wrap the fixed-link state behind a
> phydev object, so we cannot wrap .phylink_mac_config into .adjust_link
> unless we fabricate a phy_device structure.
> 
> For these ports, the newly introduced PHYLINK_DEV operation type is
> used and the dsa_switch device structure is passed to PHYLINK for
> printing purposes.  The handling of the PHYLINK_NETDEV and PHYLINK_DEV
> PHYLINK instances is common from the perspective of the driver.
> 
> Signed-off-by: Ioana Ciornei 
> Signed-off-by: Vladimir Oltean 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 04/11] net: phy: Add phy_standalone sysfs entry

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> Export a phy_standalone device attribute that is meant to give the
> indication that this PHY lacks an attached_dev and its corresponding
> sysfs link. The attribute will be created only when the
> phy_attach_direct() function will be called with a NULL net_device.
> 
> Signed-off-by: Ioana Ciornei 

If you update Documentation/ABI/testing/sysfs-class-net-phydev, this is:

Reviewed-by: Florian Fainelli 

I will take care of removing sysfs-bus-mdio which duplicates that
information.
-- 
Florian

Re: [PATCH 07/11] net: phylink: Add PHYLINK_DEV operation type

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> In the PHYLINK_DEV operation type, the PHYLINK infrastructure can work
> without an attached net_device. For printing usecases, instead, a struct
> device * should be passed to PHYLINK using the phylink_config structure.
> 
> Also, netif_carrier_* calls ar guarded by the presence of a valid
> net_device. When using the PHYLINK_DEV operation type, we cannot check
> link status using the netif_carrier_ok() API so instead, keep an
> internal state of the MAC and call mac_link_{down,up} only when the link
> changed.
> 
> Signed-off-by: Ioana Ciornei 
> Signed-off-by: Vladimir Oltean 

Should not this patch be re-ordered to be after patch #8? Other than that:

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 09/11] net: dsa: Move the phylink driver calls into port.c

2019-05-27 Thread Florian Fainelli




On 5/27/2019 2:22 PM, Ioana Ciornei wrote:
> In order to have a common handling of PHYLINK for the slave and non-user
> ports, the DSA core glue logic (between PHYLINK and the driver) must use
> an API that does not rely on a struct net_device.
> 
> These will also be called by the CPU-port-handling code in a further
> patch.
> 
> Signed-off-by: Ioana Ciornei 
> Suggested-by: Vladimir Oltean 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 03/11] net: phy: Check against net_device being NULL

2019-05-27 Thread kbuild test robot

Hi Ioana,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v5.2-rc2 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Ioana-Ciornei/Decoupling-PHYLINK-from-struct-net_device/20190528-061507
reproduce:
# apt-get install sparse
# sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
make ARCH=x86_64 allmodconfig
make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 


sparse warnings: (new ones prefixed by >>)

>> drivers/net/phy/phy_device.c:952:31: sparse: sparse: incorrect type in 
>> return expression (different base types) @@expected int @@got voiint 
>> @@
>> drivers/net/phy/phy_device.c:952:31: sparse:expected int
>> drivers/net/phy/phy_device.c:952:31: sparse:got void *

vim +952 drivers/net/phy/phy_device.c

   937  
   938  /**
   939   * phy_connect_direct - connect an ethernet device to a specific 
phy_device
   940   * @dev: the network device to connect
   941   * @phydev: the pointer to the phy device
   942   * @handler: callback function for state change notifications
   943   * @interface: PHY device's interface
   944   */
   945  int phy_connect_direct(struct net_device *dev, struct phy_device 
*phydev,
   946 void (*handler)(struct net_device *),
   947 phy_interface_t interface)
   948  {
   949  int rc;
   950  
   951  if (!dev)
 > 952  return ERR_PTR(-EINVAL);
   953  
   954  rc = phy_attach_direct(dev, phydev, phydev->dev_flags, 
interface);
   955  if (rc)
   956  return rc;
   957  
   958  phy_prepare_link(phydev, handler);
   959  if (phy_interrupt_is_valid(phydev))
   960  phy_request_interrupt(phydev);
   961  
   962  return 0;
   963  }
   964  EXPORT_SYMBOL(phy_connect_direct);
   965  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[PATCH net-next 1/2] qed: Reduce the severity of ptp debug message.

2019-05-27 Thread Sudarsana Reddy Kalluru

PTP Tx implementation continuously polls for the availability of timestamp.
Reducing the severity of a debug message in this path to avoid filling up
the syslog buffer with this message, especially in the error scenarios. 

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qed/qed_ptp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_ptp.c 
b/drivers/net/ethernet/qlogic/qed/qed_ptp.c
index 1302b30..f3ebdc5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ptp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ptp.c
@@ -157,7 +157,8 @@ static int qed_ptp_hw_read_tx_ts(struct qed_dev *cdev, u64 
*timestamp)
*timestamp = 0;
val = qed_rd(p_hwfn, p_ptt, NIG_REG_TX_LLH_PTP_BUF_SEQID);
if (!(val & QED_TIMESTAMP_MASK)) {
-   DP_INFO(p_hwfn, "Invalid Tx timestamp, buf_seqid = %d\n", val);
+   DP_VERBOSE(p_hwfn, QED_MSG_DEBUG,
+  "Invalid Tx timestamp, buf_seqid = %08x\n", val);
return -EINVAL;
}
 
-- 
1.8.3.1

[PATCH net-next 2/2] qede: Handle infinite driver spinning for Tx timestamp.

2019-05-27 Thread Sudarsana Reddy Kalluru

In PTP Tx implementation, driver kept scheduling a poll thread until the
timestamp is available. In the error scenarios (e.g. app requesting the
timestamp for non-ptp packet), this thread kept waiting for the timestamp
forever.  This patch add changes to report such scenario as an error and
terminate the thread. Added a timeout of 2 seconds i.e., max time to wait
for Tx timestamp. Added a stat value ptp_skip_txts for reporting the number
of packets for which Tx timestamping is skipped. 

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qede/qede.h |  2 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c|  3 ++
 drivers/net/ethernet/qlogic/qede/qede_ptp.c | 37 -
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index 92fe226..b972ab0 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -92,6 +92,7 @@ struct qede_stats_common {
u64 non_coalesced_pkts;
u64 coalesced_bytes;
u64 link_change_count;
+   u64 ptp_skip_txts;
 
/* port */
u64 rx_64_byte_packets;
@@ -189,6 +190,7 @@ struct qede_dev {
 
const struct qed_eth_ops*ops;
struct qede_ptp *ptp;
+   u64 ptp_skip_txts;
 
struct qed_dev_eth_info dev_info;
 #define QEDE_MAX_RSS_CNT(edev) ((edev)->dev_info.num_queues)
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 8911a97..e85f9fe 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -174,6 +174,7 @@
QEDE_STAT(coalesced_bytes),
 
QEDE_STAT(link_change_count),
+   QEDE_STAT(ptp_skip_txts),
 };
 
 #define QEDE_NUM_STATS ARRAY_SIZE(qede_stats_arr)
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index a9684a8..741377b 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -390,6 +390,7 @@ void qede_fill_by_demand_stats(struct qede_dev *edev)
p_common->brb_discards = stats.common.brb_discards;
p_common->tx_mac_ctrl_frames = stats.common.tx_mac_ctrl_frames;
p_common->link_change_count = stats.common.link_change_count;
+   p_common->ptp_skip_txts = edev->ptp_skip_txts;
 
if (QEDE_IS_BB(edev)) {
struct qede_stats_bb *p_bb = &edev->stats.bb;
@@ -2232,6 +2233,8 @@ static void qede_unload(struct qede_dev *edev, enum 
qede_unload_mode mode,
if (mode != QEDE_UNLOAD_RECOVERY)
DP_NOTICE(edev, "Link is down\n");
 
+   edev->ptp_skip_txts = 0;
+
DP_INFO(edev, "Ending qede unload\n");
 }
 
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ptp.c 
b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
index bddb2b5..f815435 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ptp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 #include "qede_ptp.h"
+#define QEDE_PTP_TX_TIMEOUT (2 * HZ)
 
 struct qede_ptp {
const struct qed_eth_ptp_ops*ops;
@@ -38,6 +39,7 @@ struct qede_ptp {
struct timecounter  tc;
struct ptp_clock*clock;
struct work_struct  work;
+   unsigned long   ptp_tx_start;
struct qede_dev *edev;
struct sk_buff  *tx_skb;
 
@@ -160,18 +162,30 @@ static void qede_ptp_task(struct work_struct *work)
struct qede_dev *edev;
struct qede_ptp *ptp;
u64 timestamp, ns;
+   bool timedout;
int rc;
 
ptp = container_of(work, struct qede_ptp, work);
edev = ptp->edev;
+   timedout = time_is_before_jiffies(ptp->ptp_tx_start +
+ QEDE_PTP_TX_TIMEOUT);
 
/* Read Tx timestamp registers */
spin_lock_bh(&ptp->lock);
rc = ptp->ops->read_tx_ts(edev->cdev, ×tamp);
spin_unlock_bh(&ptp->lock);
if (rc) {
-   /* Reschedule to keep checking for a valid timestamp value */
-   schedule_work(&ptp->work);
+   if (unlikely(timedout)) {
+   DP_INFO(edev, "Tx timestamp is not recorded\n");
+   dev_kfree_skb_any(ptp->tx_skb);
+   ptp->tx_skb = NULL;
+   clear_bit_unlock(QEDE_FLAGS_PTP_TX_IN_PRORGESS,
+&edev->flags);
+   edev->ptp_skip_txts++;
+   } else {
+   /* Reschedule to keep checking for a valid TS value */
+   schedule_work(&ptp->work);
+   }

[PATCH net-next 0/2] qed*: Fix inifinite spinning of PTP poll thread.

2019-05-27 Thread Sudarsana Reddy Kalluru

The patch series addresses an error scenario in the PTP Tx implementation.

Please consider applying it to net-next.

Sudarsana Reddy Kalluru (2):
  qed: Reduce the severity of the debug message.
  qede: Handle driver spin for Tx timestamp.

 drivers/net/ethernet/qlogic/qed/qed_ptp.c   |  3 +-
 drivers/net/ethernet/qlogic/qede/qede.h |  2 ++
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c|  3 ++
 drivers/net/ethernet/qlogic/qede/qede_ptp.c | 37 -
 5 files changed, 38 insertions(+), 8 deletions(-)

-- 
1.8.3.1

Re: [PATCH 10/11] net: dsa: Use PHYLINK for the CPU/DSA ports

2019-05-27 Thread kbuild test robot

Hi Ioana,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v5.2-rc2 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Ioana-Ciornei/Decoupling-PHYLINK-from-struct-net_device/20190528-061507
reproduce:
# apt-get install sparse
# sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
make ARCH=x86_64 allmodconfig
make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 


sparse warnings: (new ones prefixed by >>)

>> net/dsa/port.c:604:5: sparse: sparse: symbol 'dsa_port_phylink_register' was 
>> not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[RFC PATCH] net: dsa: dsa_port_phylink_register() can be static

2019-05-27 Thread kbuild test robot



Fixes: 9dd6d07682b1 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: kbuild test robot 
---
 port.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 30d5b08..d74bc9d 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -601,7 +601,7 @@ static int dsa_port_fixed_link_register_of(struct dsa_port 
*dp)
return 0;
 }
 
-int dsa_port_phylink_register(struct dsa_port *dp)
+static int dsa_port_phylink_register(struct dsa_port *dp)
 {
struct dsa_switch *ds = dp->ds;
struct device_node *port_dn = dp->dn;

[PATCH v3 bpf-next 0/6] bpf: Propagate cn to TCP

2019-05-27 Thread brakmo

This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patch-set adds support
for cgroup skb BPF programs to send congestion notifications in the
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
NET_XMIT_SUCCESS(0)- continue with packet output
NET_XMIT_DROP   (1)- drop packet and do cn
NET_XMIT_CN (2)- continue with packet output and do cn
-EPERM - drop packet

Finally, HBM programs are modified to collect and return more
statistics.

There has been some discussion regarding the best place to manage
bandwidths. Some believe this should be done in the qdisc where it can
also be managed with a BPF program. We believe there are advantages
for doing it with a BPF program in the cgroup/skb callback. For example,
it reduces overheads in the cases where there is on primary workload and
one or more secondary workloads, where each workload is running on its
own cgroupv2. In this scenario, we only need to throttle the secondary
workloads and there is no overhead for the primary workload since there
will be no BPF program attached to its cgroup.

Regardless, we agree that this mechanism should not penalize those that
are not using it. We tested this by doing 1 byte req/reply RPCs over
loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
Each test was repeated 50 times with a 1 minute delay between each set
of 10. We then calculated the average RPCs/sec over the 50 tests. We
compare upstream with upstream + patchset and no BPF program as well
as upstream + patchset and a BPF program that just returns ALLOW_PKT.
Here are the results:

upstream   80937 RPCs/sec
upstream + patches, no BPF program 80894 RPCs/sec
upstream + patches, BPF program80634 RPCs/sec

These numbers indicate that there is no penalty for these patches

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.

The following results are obtained for rate limits of 1Gbps,
between two servers using netperf, and only one flow. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.

Command used was:
  ./do_hbm_test.sh -l -D --stats -N -r= [--no_cn] [dctcp] \
   -s=
  where:
is 1000
 --no_cn  specifies no cwr notifications
 dctcpuses dctcp

   CubicDCTCP
Lim, DA  Mbps cwnd cred drops  Mbps cwnd cred drops
    -     -
  1G, 40   35  462 -320 67% 9951 -212  0.05%
  1G, 40,cn   7369  -78  0.07   9951 -212  0.05
  1G,  5,cn   9412 -189  0.13   9951 -212  0.05

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  DA = maximum delay ack timer
  cred = credit in packets
  drops = % packets dropped

v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
New egress values apply to all protocols, not just TCP
Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
Removed changes to __tcp_transmit_skb (patch 5), no longer needed
Removed sample use of EDT
v2->v3: Removed the probe timer related changes

brakmo (6):
  bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  bpf: cgroup inet skb programs can return 0 to 3
  bpf: Update __cgroup_bpf_run_filter_skb with cn
  bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  bpf: Add cn support to hbm_out_kern.c
  bpf: Add more stats to HBM

 include/linux/bpf.h| 50 +
 include/linux/filter.h |  3 +-
 kernel/bpf/cgroup.c| 25 ---
 kernel/bpf/syscall.c   | 12 +++
 kernel/bpf/verifier.c  | 16 +++--
 net/ipv4/ip_output.c   | 34 +---
 net/ipv6/ip6_output.c  | 26 +--
 samples/bpf/do_hbm_test.sh | 10 --
 samples/bpf/hbm.c  | 51 +++--
 samples/bpf/hbm.h  |  9 +-
 samples/bpf/hbm_kern.h | 66 --
 samples/bpf/hbm_out_kern.c | 48 +++
 12 files changed, 299 insertions(+), 51 deletions(-)

-- 
2.17.1

[PATCH v3 bpf-next 1/6] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY

2019-05-27 Thread brakmo

Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.

Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet. This macro changes the behavior so the low order bit
indicates whether the packet should be dropped (0) or not (1)
and the next bit is used for congestion notification (cn).

Hence, new allowed return values of CGROUP EGRESS BPF programs are:
  0: drop packet
  1: keep packet
  2: drop packet and call cwr
  3: keep packet and call cwr

This macro then converts it to one of NET_XMIT values or -EPERM
that has the effect of dropping the packet with no cn.
  0: NET_XMIT_SUCCESS  skb should be transmitted (no cn)
  1: NET_XMIT_DROP skb should be dropped and cwr called
  2: NET_XMIT_CN   skb should be transmitted and cwr called
  3: -EPERMskb should be dropped (no cn)

Note that when more than one BPF program is called, the packet is
dropped if at least one of programs requests it be dropped, and
there is cn if at least one program returns cn.

Signed-off-by: Lawrence Brakmo 
---
 include/linux/bpf.h | 50 +
 1 file changed, 50 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d98141edb74b..49be4f88454c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -552,6 +552,56 @@ _out:  
\
_ret;   \
 })
 
+/* To be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs
+ * so BPF programs can request cwr for TCP packets.
+ *
+ * Current cgroup skb programs can only return 0 or 1 (0 to drop the
+ * packet. This macro changes the behavior so the low order bit
+ * indicates whether the packet should be dropped (0) or not (1)
+ * and the next bit is a congestion notification bit. This could be
+ * used by TCP to call tcp_enter_cwr()
+ *
+ * Hence, new allowed return values of CGROUP EGRESS BPF programs are:
+ *   0: drop packet
+ *   1: keep packet
+ *   2: drop packet and cn
+ *   3: keep packet and cn
+ *
+ * This macro then converts it to one of the NET_XMIT or an error
+ * code that is then interpreted as drop packet (and no cn):
+ *   0: NET_XMIT_SUCCESS  skb should be transmitted
+ *   1: NET_XMIT_DROP skb should be dropped and cn
+ *   2: NET_XMIT_CN   skb should be transmitted and cn
+ *   3: -EPERMskb should be dropped
+ */
+#define BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(array, ctx, func)
\
+   ({  \
+   struct bpf_prog_array_item *_item;  \
+   struct bpf_prog *_prog; \
+   struct bpf_prog_array *_array;  \
+   u32 ret;\
+   u32 _ret = 1;   \
+   u32 _cn = 0;\
+   preempt_disable();  \
+   rcu_read_lock();\
+   _array = rcu_dereference(array);\
+   _item = &_array->items[0];  \
+   while ((_prog = READ_ONCE(_item->prog))) {  \
+   bpf_cgroup_storage_set(_item->cgroup_storage);  \
+   ret = func(_prog, ctx); \
+   _ret &= (ret & 1);  \
+   _cn |= (ret & 2);   \
+   _item++;\
+   }   \
+   rcu_read_unlock();  \
+   preempt_enable_no_resched();\
+   if (_ret)   \
+   _ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS);  \
+   else\
+   _ret = (_cn ? NET_XMIT_DROP : -EPERM);  \
+   _ret;   \
+   })
+
 #define BPF_PROG_RUN_ARRAY(array, ctx, func)   \
__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
 
-- 
2.17.1

[PATCH v3 bpf-next 2/6] bpf: cgroup inet skb programs can return 0 to 3

2019-05-27 Thread brakmo

Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to deterine if congestion occurred and higher
level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB.  e.g Meaning the current bpf_prog with
expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type will
break backward compatibility.

This patch adds a enforce_expected_attach_type bit to only
enforce the expected_attach_type when it uses the new
return value.

Signed-off-by: Lawrence Brakmo 
Signed-off-by: Martin KaFai Lau 
---
 include/linux/filter.h |  3 ++-
 kernel/bpf/syscall.c   | 12 
 kernel/bpf/verifier.c  | 16 +---
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index ba8b65270e0d..43b45d6db36d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -526,7 +526,8 @@ struct bpf_prog {
blinded:1,  /* Was blinded */
is_func:1,  /* program is a bpf function */
kprobe_override:1, /* Do we override a kprobe? 
*/
-   has_callchain_buf:1; /* callchain buffer 
allocated? */
+   has_callchain_buf:1, /* callchain buffer 
allocated? */
+   enforce_expected_attach_type:1; /* Enforce 
expected_attach_type checking at attach time */
enum bpf_prog_type  type;   /* Type of BPF program */
enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3d546b6f4646..1539774d78c7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1585,6 +1585,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type 
prog_type,
default:
return -EINVAL;
}
+   case BPF_PROG_TYPE_CGROUP_SKB:
+   switch (expected_attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   return 0;
+   default:
+   return -EINVAL;
+   }
default:
return 0;
}
@@ -1836,6 +1844,10 @@ static int bpf_prog_attach_check_attach_type(const 
struct bpf_prog *prog,
case BPF_PROG_TYPE_CGROUP_SOCK:
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+   case BPF_PROG_TYPE_CGROUP_SKB:
+   return prog->enforce_expected_attach_type &&
+   prog->expected_attach_type != attach_type ?
+   -EINVAL : 0;
default:
return 0;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2778417e6e0c..5c2cb5bd84ce 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5508,11 +5508,16 @@ static int check_ld_abs(struct bpf_verifier_env *env, 
struct bpf_insn *insn)
 
 static int check_return_code(struct bpf_verifier_env *env)
 {
+   struct tnum enforce_attach_type_range = tnum_unknown;
struct bpf_reg_state *reg;
struct tnum range = tnum_range(0, 1);
 
switch (env->prog->type) {
case BPF_PROG_TYPE_CGROUP_SKB:
+   if (env->prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+   range = tnum_range(0, 3);
+   enforce_attach_type_range = tnum_range(2, 3);
+   }
case BPF_PROG_TYPE_CGROUP_SOCK:
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_SOCK_OPS:
@@ -5531,18 +5536,23 @@ static int check_return_code(struct bpf_verifier_env 
*env)
}
 
if (!tnum_in(range, reg->var_off)) {
+   char tn_buf[48];
+
verbose(env, "At program exit the register R0 ");
if (!tnum_is_unknown(reg->var_off)) {
-   char tn_buf[48];
-
tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
verbose(env, "has value %s", tn_buf);
} else {
verbose(env, "has unknown scalar value");
}
-   verbose(env, " should have been 0 or 1\n");
+   tnum_strn(tn_buf, sizeof(tn_buf), range);
+   verbose(env, " should have been %s\n", tn_buf);
return -EINVAL;
}
+
+   if (!tnum_is_unknown(enforce_attach_type_range) &&
+   tnum_in(enforce_attach_type_range, reg->var_off))
+   env->prog->enforce_e

[PATCH v3 bpf-next 3/6] bpf: Update __cgroup_bpf_run_filter_skb with cn

2019-05-27 Thread brakmo

For egress packets, __cgroup_bpf_fun_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
in order to propagate congestion notifications (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS(0)- continue with packet output
   NET_XMIT_DROP   (1)- drop packet and notify TCP to call cwr
   NET_XMIT_CN (2)- continue with packet output and notify TCP
to call cwr
   -EPERM - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.

Signed-off-by: Lawrence Brakmo 
---
 kernel/bpf/cgroup.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index fcde0f7b2585..5db8eb8481cb 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -548,8 +548,16 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
  * The program type passed in via @type must be suitable for network
  * filtering. No further check is performed to assert that.
  *
- * This function will return %-EPERM if any if an attached program was found
- * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ * For egress packets, this function can return:
+ *   NET_XMIT_SUCCESS(0)   - continue with packet output
+ *   NET_XMIT_DROP   (1)   - drop packet and notify TCP to call cwr
+ *   NET_XMIT_CN (2)   - continue with packet output and notify TCP
+ *   to call cwr
+ *   -EPERM- drop packet
+ *
+ * For ingress packets, this function will return -EPERM if any
+ * attached program was found and if it returned != 1 during execution.
+ * Otherwise 0 is returned.
  */
 int __cgroup_bpf_run_filter_skb(struct sock *sk,
struct sk_buff *skb,
@@ -575,12 +583,19 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
/* compute pointers for the bpf prog */
bpf_compute_and_save_data_end(skb, &saved_data_end);
 
-   ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
-__bpf_prog_run_save_cb);
+   if (type == BPF_CGROUP_INET_EGRESS) {
+   ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
+   cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
+   } else {
+   ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
+ __bpf_prog_run_save_cb);
+   ret = (ret == 1 ? 0 : -EPERM);
+   }
bpf_restore_data_end(skb, saved_data_end);
__skb_pull(skb, offset);
skb->sk = save_sk;
-   return ret == 1 ? 0 : -EPERM;
+
+   return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
 
-- 
2.17.1

[PATCH v3 bpf-next 5/6] bpf: Add cn support to hbm_out_kern.c

2019-05-27 Thread brakmo

Update hbm_out_kern.c to support returning cn notifications.
Also updates relevant files to allow disabling cn notifications.

Signed-off-by: Lawrence Brakmo 
---
 samples/bpf/do_hbm_test.sh | 10 +++---
 samples/bpf/hbm.c  | 18 +++---
 samples/bpf/hbm.h  |  3 ++-
 samples/bpf/hbm_out_kern.c | 26 +-
 4 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/samples/bpf/do_hbm_test.sh b/samples/bpf/do_hbm_test.sh
index 56c8b4115c95..e48b047d4646 100755
--- a/samples/bpf/do_hbm_test.sh
+++ b/samples/bpf/do_hbm_test.sh
@@ -13,10 +13,10 @@ Usage() {
   echo "egress or ingress bandwidht. It then uses iperf3 or netperf to create"
   echo "loads. The output is the goodput in Mbps (unless -D was used)."
   echo ""
-  echo "USAGE: $name [out] [-b=|--bpf=] [-c=|--cc=] [-D]"
-  echo " [-d=|--delay=] [--debug] [-E]"
+  echo "USAGE: $name [out] [-b=|--bpf=] [-c=|--cc=]"
+  echo " [-D] [-d=|--delay=] [--debug] [-E]"
   echo " [-f=<#flows>|--flows=<#flows>] [-h] [-i=|--id=]"
-  echo " [-l] [-N] [-p=|--port=] [-P]"
+  echo " [-l] [-N] [--no_cn] [-p=|--port=] [-P]"
   echo " [-q=] [-R] [-s=|--server=|--time=] [-w] [cubic|dctcp]"
   echo "  Where:"
@@ -33,6 +33,7 @@ Usage() {
   echo "-f or --flows number of concurrent flows (default=1)"
   echo "-i or --idcgroup id (an integer, default is 1)"
   echo "-Nuse netperf instead of iperf3"
+  echo "--no_cn   Do not return CN notifications"
   echo "-ldo not limit flows using loopback"
   echo "-hHelp"
   echo "-p or --port  iperf3 port (default is 5201)"
@@ -115,6 +116,9 @@ processArgs () {
 -c=*|--cc=*)
   cc="${i#*=}"
   ;;
+--no_cn)
+  flags="$flags --no_cn"
+  ;;
 --debug)
   flags="$flags -d"
   debug_flag=1
diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index a79828ab273f..c5629bae67ab 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -16,6 +16,7 @@
  *-l   Also limit flows doing loopback
  *-n <#>   To create cgroup \"/hbm#\" and attach prog
  * Default is /hbm1
+ *--no_cn   Do not return cn notifications
  *-r Rate limit in Mbps
  *-s   Get HBM stats (marked, dropped, etc.)
  *-t Exit after specified seconds (default is 0)
@@ -42,6 +43,7 @@
 
 #include 
 #include 
+#include 
 
 #include "bpf_load.h"
 #include "bpf_rlimit.h"
@@ -59,6 +61,7 @@ bool stats_flag;
 bool loopback_flag;
 bool debugFlag;
 bool work_conserving_flag;
+bool no_cn_flag;
 
 static void Usage(void);
 static void read_trace_pipe2(void);
@@ -185,6 +188,7 @@ static int run_bpf_prog(char *prog, int cg_id)
qstats.rate = rate;
qstats.stats = stats_flag ? 1 : 0;
qstats.loopback = loopback_flag ? 1 : 0;
+   qstats.no_cn = no_cn_flag ? 1 : 0;
if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
printf("ERROR: Could not update map element\n");
goto err;
@@ -366,14 +370,15 @@ static void Usage(void)
 {
printf("This program loads a cgroup skb BPF program to enforce\n"
   "cgroup output (egress) bandwidth limits.\n\n"
-  "USAGE: hbm [-o] [-d]  [-l] [-n ] [-r ] [-s]\n"
-  "   [-t ] [-w] [-h] [prog]\n"
+  "USAGE: hbm [-o] [-d]  [-l] [-n ] [--no_cn] [-r ]\n"
+  "   [-s] [-t ] [-w] [-h] [prog]\n"
   "  Where:\n"
   "-o indicates egress direction (default)\n"
   "-d print BPF trace debug buffer\n"
   "-l also limit flows using loopback\n"
   "-n <#> to create cgroup \"/hbm#\" and attach prog\n"
   "   Default is /hbm1\n"
+  "--no_cndisable CN notifcations\n"
   "-r   Rate in Mbps\n"
   "-s Update HBM stats\n"
   "-t   Exit after specified seconds (default is 0)\n"
@@ -393,9 +398,16 @@ int main(int argc, char **argv)
int  k;
int cg_id = 1;
char *optstring = "iodln:r:st:wh";
+   struct option loptions[] = {
+   {"no_cn", 0, NULL, 1},
+   {NULL, 0, NULL, 0}
+   };
 
-   while ((k = getopt(argc, argv, optstring)) != -1) {
+   while ((k = getopt_long(argc, argv, optstring, loptions, NULL)) != -1) {
switch (k) {
+   case 1:
+   no_cn_flag = true;
+   break;
case'o':
break;
case 'd':
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index 518e8147d084..c08247cec2a7 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -19,7 +19,8 @@ struct hbm_vqueue {
 struct hbm_queue_stats {
unsigned long rate; /* in Mbps*/
unsigned

[PATCH v3 bpf-next 6/6] bpf: Add more stats to HBM

2019-05-27 Thread brakmo

Adds more stats to HBM, including average cwnd and rtt of all TCP
flows, percents of packets that are ecn ce marked and distribution
of return values.

Signed-off-by: Lawrence Brakmo 
---
 samples/bpf/hbm.c  | 33 +++
 samples/bpf/hbm.h  |  6 
 samples/bpf/hbm_kern.h | 66 --
 samples/bpf/hbm_out_kern.c | 22 -
 4 files changed, 117 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index c5629bae67ab..480b7ad6a1f2 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -316,6 +316,14 @@ static int run_bpf_prog(char *prog, int cg_id)
double percent_pkts, percent_bytes;
char fname[100];
FILE *fout;
+   int k;
+   static const char *returnValNames[] = {
+   "DROP_PKT",
+   "ALLOW_PKT",
+   "DROP_PKT_CWR",
+   "ALLOW_PKT_CWR"
+   };
+#define RET_VAL_COUNT 4
 
 // Future support of ingress
 // if (!outFlag)
@@ -350,6 +358,31 @@ static int run_bpf_prog(char *prog, int cg_id)
(qstats.bytes_total + 1);
fprintf(fout, "pkts_dropped_percent:%6.2f\n", percent_pkts);
fprintf(fout, "bytes_dropped_percent:%6.2f\n", percent_bytes);
+
+   // ECN CE markings
+   percent_pkts = (qstats.pkts_ecn_ce * 100.0) /
+   (qstats.pkts_total + 1);
+   fprintf(fout, "pkts_ecn_ce:%6.2f (%d)\n", percent_pkts,
+   (int)qstats.pkts_ecn_ce);
+
+   // Average cwnd
+   fprintf(fout, "avg cwnd:%d\n",
+   (int)(qstats.sum_cwnd / (qstats.sum_cwnd_cnt + 1)));
+   // Average rtt
+   fprintf(fout, "avg rtt:%d\n",
+   (int)(qstats.sum_rtt / (qstats.pkts_total + 1)));
+   // Average credit
+   fprintf(fout, "avg credit:%d\n",
+   (int)(qstats.sum_credit /
+ (1500 * ((int)qstats.pkts_total) + 1)));
+
+   // Return values stats
+   for (k = 0; k < RET_VAL_COUNT; k++) {
+   percent_pkts = (qstats.returnValCount[k] * 100.0) /
+   (qstats.pkts_total + 1);
+   fprintf(fout, "%s:%6.2f (%d)\n", returnValNames[k],
+   percent_pkts, (int)qstats.returnValCount[k]);
+   }
fclose(fout);
}
 
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index c08247cec2a7..f0963ed6a562 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -29,4 +29,10 @@ struct hbm_queue_stats {
unsigned long long bytes_total;
unsigned long long firstPacketTime;
unsigned long long lastPacketTime;
+   unsigned long long pkts_ecn_ce;
+   unsigned long long returnValCount[4];
+   unsigned long long sum_cwnd;
+   unsigned long long sum_rtt;
+   unsigned long long sum_cwnd_cnt;
+   long long sum_credit;
 };
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index 41384be233b9..be19cf1d5cd5 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -65,17 +65,43 @@ struct bpf_map_def SEC("maps") queue_stats = {
 BPF_ANNOTATE_KV_PAIR(queue_stats, int, struct hbm_queue_stats);
 
 struct hbm_pkt_info {
+   int cwnd;
+   int rtt;
boolis_ip;
boolis_tcp;
short   ecn;
 };
 
+static int get_tcp_info(struct __sk_buff *skb, struct hbm_pkt_info *pkti)
+{
+   struct bpf_sock *sk;
+   struct bpf_tcp_sock *tp;
+
+   sk = skb->sk;
+   if (sk) {
+   sk = bpf_sk_fullsock(sk);
+   if (sk) {
+   if (sk->protocol == IPPROTO_TCP) {
+   tp = bpf_tcp_sock(sk);
+   if (tp) {
+   pkti->cwnd = tp->snd_cwnd;
+   pkti->rtt = tp->srtt_us >> 3;
+   return 0;
+   }
+   }
+   }
+   }
+   return 1;
+}
+
 static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb,
 struct hbm_pkt_info *pkti)
 {
struct iphdr iph;
struct ipv6hdr *ip6h;
 
+   pkti->cwnd = 0;
+   pkti->rtt = 0;
bpf_skb_load_bytes(skb, 0, &iph, 12);
if (iph.version == 6) {
ip6h = (struct ipv6hdr *)&iph;
@@ -91,6 +117,8 @@ static __always_inline void hbm_get_pkt_info(struct 
__sk_buff *skb,
pkti->is_tcp = false;
pkti->ecn = 0;
}
+   if (pkti->is_tcp)
+   get_tcp_info(skb, pkti);
 }
 
 static __always_inline void hbm_init_vqueue(struct hbm_vqueue *qdp, int rate)
@@ -105,8 +133,14

[PATCH v3 bpf-next 4/6] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

2019-05-27 Thread brakmo

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.

Signed-off-by: Lawrence Brakmo 
Signed-off-by: Alexei Starovoitov 
---
 net/ipv4/ip_output.c  | 34 +++---
 net/ipv6/ip6_output.c | 26 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bfd0ca554977..1217a53381c2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -287,16 +287,9 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
return ret;
 }
 
-static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
 {
unsigned int mtu;
-   int ret;
-
-   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-   if (ret) {
-   kfree_skb(skb);
-   return ret;
-   }
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
/* Policy lookup after SNAT yielded a new policy */
@@ -315,18 +308,37 @@ static int ip_finish_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
return ip_finish_output2(net, sk, skb);
 }
 
+static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   int ret;
+
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   switch (ret) {
+   case NET_XMIT_SUCCESS:
+   return __ip_finish_output(net, sk, skb);
+   case NET_XMIT_CN:
+   return __ip_finish_output(net, sk, skb) ? : ret;
+   default:
+   kfree_skb(skb);
+   return ret;
+   }
+}
+
 static int ip_mc_finish_output(struct net *net, struct sock *sk,
   struct sk_buff *skb)
 {
int ret;
 
ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-   if (ret) {
+   switch (ret) {
+   case NET_XMIT_SUCCESS:
+   return dev_loopback_xmit(net, sk, skb);
+   case NET_XMIT_CN:
+   return dev_loopback_xmit(net, sk, skb) ? : ret;
+   default:
kfree_skb(skb);
return ret;
}
-
-   return dev_loopback_xmit(net, sk, skb);
 }
 
 int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index adef2236abe2..a75bc21d8c88 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -128,16 +128,8 @@ static int ip6_finish_output2(struct net *net, struct sock 
*sk, struct sk_buff *
return -EINVAL;
 }
 
-static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+static int __ip6_finish_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
-   int ret;
-
-   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-   if (ret) {
-   kfree_skb(skb);
-   return ret;
-   }
-
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
/* Policy lookup after SNAT yielded a new policy */
if (skb_dst(skb)->xfrm) {
@@ -154,6 +146,22 @@ static int ip6_finish_output(struct net *net, struct sock 
*sk, struct sk_buff *s
return ip6_finish_output2(net, sk, skb);
 }
 
+static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   int ret;
+
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   switch (ret) {
+   case NET_XMIT_SUCCESS:
+   return __ip6_finish_output(net, sk, skb);
+   case NET_XMIT_CN:
+   return __ip6_finish_output(net, sk, skb) ? : ret;
+   default:
+   kfree_skb(skb);
+   return ret;
+   }
+}
+
 int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
-- 
2.17.1

Re: [PATCH V3 net-next 2/6] net: Introduce a new MII time stamping interface.

2019-05-27 Thread Richard Cochran

On Wed, May 22, 2019 at 02:58:23AM +0200, Andrew Lunn wrote:
> > -static int dp83640_hwtstamp(struct phy_device *phydev, struct ifreq *ifr)
> > +static int dp83640_hwtstamp(struct mii_timestamper *mii_ts, struct ifreq 
> > *ifr)
> >  {
> > -   struct dp83640_private *dp83640 = phydev->priv;
> > +   struct dp83640_private *dp83640 =
> > +   container_of(mii_ts, struct dp83640_private, mii_ts);
> > struct hwtstamp_config cfg;
> > u16 txcfg0, rxcfg0;
> 
> Hi Richard
> 
> David might complain about reverse christmas tree. Maybe define a
> macro, to_dp83640() which takes mii_ts?

That is nice idea for another series, I think.  For now this matches
the existing 'container_of' usage within the driver.
 
> > +/**
> > + * struct mii_timestamper - Callback interface to MII time stamping 
> > devices.
> > + *
> > + * @rxtstamp:  Requests a Rx timestamp for 'skb'.  If the skb is 
> > accepted,
> > + * the MII time stamping device promises to deliver it using
> > + * netif_rx() as soon as a timestamp becomes available. One of
> > + * the PTP_CLASS_ values is passed in 'type'.  The function
> > + * must return true if the skb is accepted for delivery.
> > + *
> > + * @txtstamp:  Requests a Tx timestamp for 'skb'.  The MII time 
> > stamping
> > + * device promises to deliver it using skb_complete_tx_timestamp()
> > + * as soon as a timestamp becomes available. One of the PTP_CLASS_
> > + * values is passed in 'type'.
> > + *
> > + * @hwtstamp:  Handles SIOCSHWTSTAMP ioctl for hardware time stamping.
> > + * @link_state:Allows the device to respond to changes in the link 
> > state.
> > + * @ts_info:   Handles ethtool queries for hardware time stamping.
> > + *
> > + * Drivers for PHY time stamping devices should embed their
> > + * mii_timestamper within a private structure, obtaining a reference
> > + * to it using container_of().
> > + */
> 
> I wonder if it is worth mentioning that link_state() is called with
> the phy lock held, but none of the others are?

Will do.

Thanks,
Richard

Re: [PATCH] [PATCH bpf] style fix in while(!feof()) loop

2019-05-27 Thread Y Song

On Sun, May 26, 2019 at 3:35 AM Chang-Hsien Tsai  wrote:
>
> use fgets() as the while loop condition.
>
> Signed-off-by: Chang-Hsien Tsai 
Looks like right now we did not really differentiate error in fgets from EOF,
so the change is in this patch is equivalent to its previous behavior.

Acked-by: Yonghong Song 

> ---
>  tools/bpf/bpftool/xlated_dumper.c   | 4 +---
>  tools/testing/selftests/bpf/trace_helpers.c | 4 +---
>  2 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/tools/bpf/bpftool/xlated_dumper.c 
> b/tools/bpf/bpftool/xlated_dumper.c
> index 0bb17bf88b18..494d7ae3614d 100644
> --- a/tools/bpf/bpftool/xlated_dumper.c
> +++ b/tools/bpf/bpftool/xlated_dumper.c
> @@ -31,9 +31,7 @@ void kernel_syms_load(struct dump_data *dd)
> if (!fp)
> return;
>
> -   while (!feof(fp)) {
> -   if (!fgets(buff, sizeof(buff), fp))
> -   break;
> +   while (fgets(buff, sizeof(buff), fp)) {
> tmp = reallocarray(dd->sym_mapping, dd->sym_count + 1,
>sizeof(*dd->sym_mapping));
> if (!tmp) {
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
> b/tools/testing/selftests/bpf/trace_helpers.c
> index 9a9fc6c9b70b..b47f205f0310 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -30,9 +30,7 @@ int load_kallsyms(void)
> if (!f)
> return -ENOENT;
>
> -   while (!feof(f)) {
> -   if (!fgets(buf, sizeof(buf), f))
> -   break;
> +   while (fgets(buf, sizeof(buf), f)) {
> if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
> break;
> if (!addr)
> --
> 2.17.1
>

Re: [PATCH V3 net-next 3/6] net: Add a layer for non-PHY MII time stamping drivers.

2019-05-27 Thread Richard Cochran

On Wed, May 22, 2019 at 03:08:52AM +0200, Andrew Lunn wrote:
> probe_channel returns an PTR_ERR. So if (mii_ts) should probably be
> if (IS_ERR(mii_ts))

Nice catch.  Thanks for the careful review!

Richard

Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue

2019-05-27 Thread Fred Klassen




> On May 27, 2019, at 6:15 PM, Willem de Bruijn 
>  wrote:
>> I wanted to discuss whether or not to attach a buffer to the
>> recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have
>> MSG_TRUNC errors in my msg_flags. Either I have to add
>> a buffer, or ignore that error flag.
> 
> Either sounds reasonable. It is an expected and well understood
> message if underprovisioning the receive data buffer.
> 

I’ll stick with setting up buffers. It will fail if there are any bugs 
introduced in buffer copy routines.

> 
> The netdev list is archived and available through various websites,
> like lore.kernel.org/netdev . As well as the patches with comments at
> patchwork.ozlabs.org/project/netdev/list
> 

Much better. Sure beats hunting down lost emails.


>> I have been wondering about xmit_more
>> myself. I don’t think it changes anything for software timestamps,
>> but it may with hardware timestamps.
> 
> It arguably makes the software timestamp too early if taken on the
> first segment, as the NIC is only informed of all the new descriptors
> when the last segment is written and the doorbell is rung.
> 

Totally makes sense. Possibly this can be improved software TX
timestamps by delaying until just before ring buffer is advanced.
It would have to be updated in each driver. I may have a look at
this once I am complete this patch. Hopefully that one will be a bit
smoother. 

>>> Can you elaborate on this suspected memory leak?
>> 
>> A user program cannot free a zerocopy buffer until it is reported as free.
>> If zerocopy events are not reported, that could be a memory leak.
>> 
>> I may have a fix. I have added a -P option when I am running an audit.
>> It doesn’t appear to affect performance, and since implementing it I have
>> received all error messages expected for both timestamp and zerocopy.
>> 
>> I am still testing.
> 
> I see, a userspace leak from lack of completion notification.
> 
> If the issue is a few missing notifications at the end of the run,
> then perhaps cfg_waittime_ms is too short.
> 

I’ll get back to you when I have tested this more thoroughly. Early results
suggest that adding the -P poll() option has fixed it without any appreciable
performance hit. I’ll share raw results with you, and we can make a final
decision together.

>> Should the test have failed at this point? I did return an error(), but
>> the script kept running.
> 
> This should normally be cause for test failure, I think yes. Though
> it's fine to send the code for review and possibly even merge, so that
> I can take a look.
> 

Sounds like udpgso_bench.sh needs a ’set -e’ to ensure it stops on
first error.

Re: [PATCH V3 net-next 5/6] net: mdio: of: Register discovered MII time stampers.

2019-05-27 Thread Richard Cochran

On Wed, May 22, 2019 at 03:22:27AM +0200, Andrew Lunn wrote:
> Shouldn't unregister_mii_timestamper() be called on error?

Yes.

Thanks,
Richard

Re: [PATCH V3 net-next 6/6] ptp: Add a driver for InES time stamping IP core.

2019-05-27 Thread Richard Cochran

On Wed, May 22, 2019 at 03:42:20AM +0200, Andrew Lunn wrote:
> I don't know about the PTP subsystem, but in general, forward
> declarations are frowned upon, and it is generally requested to
> reorder the functions to remove them.

I am not aware of any general coding style rule or even defacto
practice in this area, but there is a method to this driver.  The
functions are in alphabetical order.  This makes it easier to find
your way around the driver.

Thanks,
Richard

Cake not doing rate limiting in a way it is expected to do

2019-05-27 Thread Akshat Kakkar

Cake is expected to handle traffic in 2 steps :
First is on the basis of host
Second is within every host, on the basis of flow

So, if I limit traffic to 20Mbps shared across 2 host A & B,
Following are various scenarios, expectation and observations
1. If either A or B is downloading, they will be getting speed of 20Mbps
Observation: Meeting with expectation

2. If both A & B downloads (single download each), each will be
getting speed of 20Mbps
Observation: Meeting with expecation but its very jittery (around
20%), i.e. speed varies from 8Mbps to 12 Mbps. If I use fq_codel speed
is same BUT jitter is very less (around 1%).

3. Now if A starts 3 downloads, and B is still having single download,
A each download should be around 3.3 Mbps and B should be around
10Mbps
Observation: Around 5 Mbps for each download with lot of jitter, i.e.
no advantage of having CAKE!!!

Linux Kernel 4.20
For case 3, output of command : tc -s class show dev eno2

class htb 1:1 root leaf 8003: prio 1 rate 2Kbit ceil 2Kbit
burst 200Kb cburst 1600b
 Sent 688474645 bytes 455058 pkt (dropped 0, overlimits 381196 requeues 0)
 rate 19874Kbit 1641pps backlog 21196b 14p requeues 0
 lended: 382532 borrowed: 0 giants: 0
 tokens: 1260573 ctokens: -9427

class cake 8003:44f parent 8003:
 (dropped 3404, overlimits 0 requeues 0)
 backlog 9084b 6p requeues 0
class cake 8003:516 parent 8003:
 (dropped 3565, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class cake 8003:590 parent 8003:
 (dropped 3023, overlimits 0 requeues 0)
 backlog 4542b 3p requeues 0
class cake 8003:605 parent 8003:
 (dropped 1772, overlimits 0 requeues 0)
 backlog 7570b 5p requeues 0

Re: [PATCH net-next 11/11] inet: frags: rework rhashtable dismantle

2019-05-27 Thread Herbert Xu

Hi Eric:

Eric Dumazet  wrote:
>
> +void fqdir_exit(struct fqdir *fqdir)
> +{
> +   fqdir->high_thresh = 0; /* prevent creation of new frags */
> +
> +   /* paired with READ_ONCE() in inet_frag_kill() :
> +* We want to prevent rhashtable_remove_fast() calls
> +*/
> +   smp_store_release(&fqdir->dead, true);
> +
> +   INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> +   queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> +
> +}

What is the smp_store_release supposed to protect here? If it's
meant to separate the setting of dead and the subsequent destruction
work then it doesn't work because the barrier only protects the code
preceding it, not after.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

RE: [PATCH 01/11] net: phy: Add phy_sysfs_create_links helper function

2019-05-27 Thread Ioana Ciornei


> Subject: Re: [PATCH 01/11] net: phy: Add phy_sysfs_create_links helper 
> function
> 
> Hi Ioana,
> 
> On Mon, May 27, 2019 at 6:47 PM Ioana Ciornei 
> wrote:
> >
> > From: Vladimir Oltean 
> >
> > This is a cosmetic patch that wraps the operation of creating sysfs
> > links between the netdev->phydev and the phydev->attached_dev.
> >
> > This is needed to keep the indentation level in check in a follow-up
> > patch where this function will be guarded against the existence of a
> > phydev->attached_dev.
> >
> > Signed-off-by: Vladimir Oltean 
> > Reviewed-by: Florian Fainelli 
> 
> As you are transmitting the patch authored by other person, you need to 
> provide
> your Signed-off-by tag.

Sorry for the mishap. I forgot to add it to this patch. Thanks for pointing it 
out.

--
Ioana

83 matches

Mail list logo