RE: [PATCH V2 0/5] patches for stmmac
> -----Original Message-----
> From: Jakub Kicinski
> Sent: 6 December 2020, 5:40
> To: Joakim Zhang
> Cc: peppe.cavall...@st.com; alexandre.tor...@st.com;
> joab...@synopsys.com; da...@davemloft.net; netdev@vger.kernel.org;
> dl-linux-imx
> Subject: Re: [PATCH V2 0/5] patches for stmmac
>
> On Fri, 4 Dec 2020 10:46:33 +0800 Joakim Zhang wrote:
> > A patch set for stmmac, fix some driver issues.
>
> These don't apply cleanly to the net tree where fixes go:
>
> https://patchwork.kernel.org/project/netdevbpf/list/?delegate=netdev&param=1&order=date
>
> Please rebase / retest / repost.

Hi Jakub,

I will rebase to the latest net tree, thanks.

Hi all,

I would also like to report a stmmac driver issue here; others may be suffering from it as well. After hundreds of suspend/resume stress-test iterations, I hit the netdev watchdog timeout below: a Tx queue times out and the adapter is then reset.
=== suspend 1000 times ===
Test < suspend_quick_auto.sh > ended
root@imx8mpevk:/unit_tests/Power_Management#
[ 1347.976688] imx-dwmac 30bf.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[ 1358.022784] ------------[ cut here ]------------
[ 1358.027430] NETDEV WATCHDOG: eth0 (imx-dwmac): transmit queue 0 timed out
[ 1358.035469] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:450 dev_watchdog+0x2fc/0x30c
[ 1358.043736] Modules linked in:
[ 1358.046798] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W  5.8.0-rc5-next-20200717-7-g30d24ae22e81-dirty #333
[ 1358.058011] Hardware name: NXP i.MX8MPlus EVK board (DT)
[ 1358.063324] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
[ 1358.068898] pc : dev_watchdog+0x2fc/0x30c
[ 1358.072908] lr : dev_watchdog+0x2fc/0x30c
[ 1358.076915] sp : 800011c5bd90
[ 1358.080228] x29: 800011c5bd90 x28: 0001767f1940
[ 1358.085542] x27: 0004 x26: 000176e88440
[ 1358.090857] x25: 0140 x24:
[ 1358.096171] x23: 000176e8839c x22: 0002
[ 1358.101484] x21: 8000119f6000 x20: 000176e88000
[ 1358.106799] x19: x18: 0030
[ 1358.112112] x17: 0001 x16: 0018bf1a354e
[ 1358.117426] x15: 0001760eae70 x14:
[ 1358.122740] x13: 800091c5ba77 x12: 800011c5ba80
[ 1358.128054] x11: x10: 00017f38b7c0
[ 1358.133368] x9 : 000c x8 : 6928203068746520
[ 1358.138682] x7 : 3a474f4448435441 x6 : 0003
[ 1358.143996] x5 : x4 :
[ 1358.149310] x3 : 0004 x2 : 0100
[ 1358.154624] x1 : b54950db346c9600 x0 :
[ 1358.159939] Call trace:
[ 1358.162389]  dev_watchdog+0x2fc/0x30c
[ 1358.166055]  call_timer_fn.constprop.0+0x24/0x80
[ 1358.170673]  expire_timers+0x98/0xc4
[ 1358.174249]  run_timer_softirq+0xd0/0x200
[ 1358.178261]  efi_header_end+0x124/0x284
[ 1358.182098]  irq_exit+0xdc/0xfc
[ 1358.185241]  __handle_domain_irq+0x80/0xe0
[ 1358.189338]  gic_handle_irq+0xc8/0x170
[ 1358.193087]  el1_irq+0xbc/0x180
[ 1358.196230]  arch_cpu_idle+0x14/0x20
[ 1358.199807]  cpu_startup_entry+0x24/0x80
[ 1358.203732]  secondary_start_kernel+0x138/0x184
[ 1358.208262] ---[ end trace b422761fd811b2a7 ]---
[ 1358.213588]
imx-dwmac 30bf.ethernet eth0: Reset adapter.
[ 1358.228037] imx-dwmac 30bf.ethernet eth0: PHY [stmmac-1:01] driver [RTL8211F Gigabit Ethernet] (irq=POLL)
[ 1358.246815] imx-dwmac 30bf.ethernet eth0: No Safety Features support found
[ 1358.254062] imx-dwmac 30bf.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
[ 1358.264130] imx-dwmac 30bf.ethernet eth0: registered PTP clock
[ 1358.270374] imx-dwmac 30bf.ethernet eth0: configuring for phy/rgmii-id link mode
[ 1358.279481] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1360.328695] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1360.335007] imx-dwmac 30bf.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

I first saw this issue on the latest 5.10, and I confirmed it does not occur on 5.4. After some time digging into the driver commit history, I found nothing. It appears to be related to the stmmac core driver, not the platform driver, so it should be reproducible on other platforms as well.

Could you please point me to how to debug this issue? I don't know how to look into it further, as I only took over the Ethernet driver a short time ago. Any feedback would be appreciated!

Joakim Zhang
RE: [PATCH] net: stmmac: implement .set_intf_mode() callback for imx8dxl
> -----Original Message-----
> From: Jakub Kicinski
> Sent: 6 December 2020, 3:58
> To: Joakim Zhang
> Cc: peppe.cavall...@st.com; alexandre.tor...@st.com;
> joab...@synopsys.com; da...@davemloft.net; dl-linux-imx;
> netdev@vger.kernel.org
> Subject: Re: [PATCH] net: stmmac: implement .set_intf_mode() callback for
> imx8dxl
>
> On Thu, 3 Dec 2020 12:10:38 +0800 Joakim Zhang wrote:
> > From: Fugang Duan
> >
> > Implement .set_intf_mode() callback for imx8dxl.
> >
> > Signed-off-by: Fugang Duan
> > Signed-off-by: Joakim Zhang
>
> A couple of minor issues.
>
> > @@ -86,7 +88,37 @@ imx8dxl_set_intf_mode(struct plat_stmmacenet_data *plat_dat)
> > {
> > 	int ret = 0;
> >
> > -	/* TBD: depends on imx8dxl scu interfaces to be upstreamed */
> > +	struct imx_sc_ipc *ipc_handle;
> > +	int val;
>
> Looks like you're going to have an empty line in the middle of the
> variable declarations? Please remove it and order the variable lines
> longest to shortest.
>
> > +
> > +	ret = imx_scu_get_handle(&ipc_handle);
> > +	if (ret)
> > +		return ret;
> > +
> > +	switch (plat_dat->interface) {
> > +	case PHY_INTERFACE_MODE_MII:
> > +		val = GPR_ENET_QOS_INTF_SEL_MII;
> > +		break;
> > +	case PHY_INTERFACE_MODE_RMII:
> > +		val = GPR_ENET_QOS_INTF_SEL_RMII;
> > +		break;
> > +	case PHY_INTERFACE_MODE_RGMII:
> > +	case PHY_INTERFACE_MODE_RGMII_ID:
> > +	case PHY_INTERFACE_MODE_RGMII_RXID:
> > +	case PHY_INTERFACE_MODE_RGMII_TXID:
> > +		val = GPR_ENET_QOS_INTF_SEL_RGMII;
> > +		break;
> > +	default:
> > +		pr_debug("imx dwmac doesn't support %d interface\n",
> > +			 plat_dat->interface);
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = imx_sc_misc_set_control(ipc_handle, IMX_SC_R_ENET_1,
> > +				      IMX_SC_C_INTF_SEL, val >> 16);
> > +	ret |= imx_sc_misc_set_control(ipc_handle, IMX_SC_R_ENET_1,
> > +				       IMX_SC_C_CLK_GEN_EN, 0x1);
> > 	return ret;
>
> These calls may return different errors AFAICT. You can't just OR the
> errno values together; the result will be meaningless.
> Please use the normal flow, and return the result of the second call
> directly:
>
> 	ret = func1();
> 	if (ret)
> 		return ret;
>
> 	return func2();
>
> Please also CC the maintainers of the Ethernet PHY subsystem on v2, to
> make sure there is nothing wrong with the patch from their PoV.

Thanks Jakub for your kind review, I will improve the patch following your comments.

Best Regards,
Joakim Zhang

> Thanks!
[PATCH bpf-next] xsk: Validate socket state in xsk_recvmsg, prior touching socket members
From: Björn Töpel

In AF_XDP the socket state needs to be checked prior to touching the
members of the socket. This was not the case for the recvmsg
implementation. Fix that by moving the xsk_is_bound() call.

Reported-by: kernel test robot
Fixes: 45a86681844e ("xsk: Add support for recvmsg()")
Signed-off-by: Björn Töpel
---
 net/xdp/xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 56c46e5f57bc..e28c6825e089 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -554,12 +554,12 @@ static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int fl
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
+	if (unlikely(!xsk_is_bound(xs)))
+		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
 	if (unlikely(!xs->rx))
 		return -ENOBUFS;
-	if (unlikely(!xsk_is_bound(xs)))
-		return -ENXIO;
 	if (unlikely(need_wait))
 		return -EOPNOTSUPP;

base-commit: 34da87213d3ddd26643aa83deff7ffc6463da0fc
-- 
2.27.0
Re: Why the auxiliary cipher in gss_krb5_crypto.c?
Herbert Xu wrote:
> > Herbert recently made some changes for MSG_MORE support in the AF_ALG
> > code, which permit a skcipher encryption to be split into several
> > invocations of the skcipher layer without the need for this complexity
> > on the side of the caller. Maybe there is a way to reuse that here.
> > Herbert?
>
> Yes, this was one of the reasons I was pursuing the continuation
> work. It should allow us to kill the special case for CTS in the
> krb5 code.
>
> Hopefully I can get some time to restart work on this soon.

In the krb5 case, we know in advance how much data we're going to be
dealing with, if that helps.

David
[PATCH] net: tipc: prevent possible null deref of link
`tipc_node_apply_property` does a NULL check on a `tipc_link_entry`
pointer but also accesses the same pointer outside the NULL-check block.
This triggers a warning in the Coverity static analyzer because we are
implying that `e->link` can be NULL.

Move the "Update MTU for node link entry" line into the if block to make
sure we never reach a state where `e->link` is dereferenced while NULL.

Signed-off-by: Cengiz Can
---
 net/tipc/node.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index c95d037fde51..83978d5dae59 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2181,9 +2181,11 @@ void tipc_node_apply_property(struct net *net, struct tipc_bearer *b,
 						&xmitq);
 		else if (prop == TIPC_NLA_PROP_MTU)
 			tipc_link_set_mtu(e->link, b->mtu);
+
+		/* Update MTU for node link entry */
+		e->mtu = tipc_link_mss(e->link);
 	}
-	/* Update MTU for node link entry */
-	e->mtu = tipc_link_mss(e->link);
+
 	tipc_node_write_unlock(n);
 	tipc_bearer_xmit(net, bearer_id, &xmitq, &e->maddr, NULL);
 }
-- 
2.29.2
Re: [PATCH net] udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments
From: Xin Long
Date: Mon, 7 Dec 2020 15:55:40 +0800

> Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns
> the proto, and it should pass "ret" unmodified to
> ip_protocol_deliver_rcu(). Otherwise, with a negative value passed, it
> will underflow inet_protos.
>
> This can be reproduced with IPIP FOU:
>
>   # ip fou add port ipproto 4
>   # ethtool -K eth1 rx-gro-list on
>
> Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
> Reported-by: Guillaume Nault
> Signed-off-by: Xin Long

Applied and queued up for -stable, thanks!
Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support
On Sun, 2020-12-06 at 09:08 -0800, Richard Cochran wrote:
> On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote:
> > Adding a new enum to the ioctl means we have to add
> > (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way -
> > drivers, kernel ptp, user space ptp, ethtool.

Not exactly:

1) The flag name should be HWTSTAMP_TX_PTP_EVENTS, similar to what we
already have on the RX side, which will mean: HW-stamp all PTP events,
don't care about the rest.

2) There is no need to add it to drivers from the get-go; only drivers
that are interested may implement it, and I am sure there are plenty
that would like to have this flag if their HW timestamping
implementation is slow! Other drivers will just keep doing what they
are doing, timestamping all traffic even if the user requested this
flag, again exactly like many other drivers do for the RX flags
(hwtstamp_rx_filters).

> > My concerns are:
> >
> > 1. Timestamp applications (like ptp4l or similar) will have to add
> > support for configuring the driver to use
> > HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY, if supported, via ioctl prior to
> > packet transmit. From the application point of view, the dual-mode
> > (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY, HWTSTAMP_TX_ON) support is
> > redundant, as it offers nothing new.
>
> Well said.

I disagree; it is not a dual mode, it just allows the user better
granularity over what the HW stamps, exactly like what we have on RX.
We are not adding any new mechanism.

> > 2. Other vendors will have to support it as well, and it is not
> > clear what is expected from them if they cannot improve accuracy
> > between the modes.
>
> If there were multiple different devices out there with this kind of
> implementation (different levels of accuracy with increasing run time
> performance cost), then we could consider such a flag. However, to my
> knowledge, this feature is unique to your device.
I agree, but I never meant to have a flag that indicates two different
levels of accuracy; that would be a very wild mistake for sure! The new
flag will be about selecting the granularity of what gets a HW stamp and
what doesn't, aligning with the RX filter API.

> > This feature is just an internal enhancement, and as such it should
> > be added only as a vendor-private configuration flag. We are not
> > offering any standard here for others to follow.
>
> +1

Our driver feature is an internal enhancement, yes, but the suggested
flag is very far from indicating any internal enhancement. It is
actually an enhancement to the current API, and a very simple extension
with a wide range of improvements to all layers. Our driver can optimize
accuracy when this flag is set; other drivers might be happy to
implement it, since if they already have slow HW this flag would let
them get better TCP/UDP performance while still performing PTP HW
stamping; and some admins/apps will use it to avoid stamping all traffic
on TX. Win, win, win.
Re: [PATCH 1/1] ice: fix array overflow on receiving too many fragments for a packet
Hi Xiaohui,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tnguy-next-queue/dev-queue]
[also build test WARNING on v5.10-rc7 next-20201204]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Xiaohui-Zhang/ice-fix-array-overflow-on-receiving-too-many-fragments-for-a-packet/20201207-141033
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
config: riscv-allyesconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/b3906f69dcad641195cbf1ce9af3e9105a6f72e1
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Xiaohui-Zhang/ice-fix-array-overflow-on-receiving-too-many-fragments-for-a-packet/20201207-141033
        git checkout b3906f69dcad641195cbf1ce9af3e9105a6f72e1
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All warnings (new ones prefixed by >>):

   In file included from include/vdso/processor.h:10,
                    from arch/riscv/include/asm/processor.h:11,
                    from include/linux/prefetch.h:15,
                    from drivers/net/ethernet/intel/ice/ice_txrx.c:6:
   arch/riscv/include/asm/vdso/processor.h: In function 'cpu_relax':
   arch/riscv/include/asm/vdso/processor.h:14:2: error: implicit declaration of function 'barrier' [-Werror=implicit-function-declaration]
      14 |  barrier();
         |  ^~~
   drivers/net/ethernet/intel/ice/ice_txrx.c: In function 'ice_add_rx_frag':
>> drivers/net/ethernet/intel/ice/ice_txrx.c:828:2: warning: ISO C90 forbids mixed declarations
   and code [-Wdeclaration-after-statement]
     828 |  struct skb_shared_info *shinfo = skb_shinfo(skb);
         |  ^~
>> drivers/net/ethernet/intel/ice/ice_txrx.c:831:24: warning: passing argument 2 of 'skb_add_rx_frag' makes integer from pointer without a cast [-Wint-conversion]
     831 |   skb_add_rx_frag(skb, shinfo, rx_buf->page,
         |                        ^~
         |                        |
         |                        struct skb_shared_info *
   In file included from include/net/net_namespace.h:39,
                    from include/linux/netdevice.h:37,
                    from include/trace/events/xdp.h:8,
                    from include/linux/bpf_trace.h:5,
                    from drivers/net/ethernet/intel/ice/ice_txrx.c:8:
   include/linux/skbuff.h:2187:47: note: expected 'int' but argument is of type 'struct skb_shared_info *'
    2187 | void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
         |                                           ^
   cc1: some warnings being treated as errors

vim +828 drivers/net/ethernet/intel/ice/ice_txrx.c

   825
   826		if (!size)
   827			return;
 > 828		struct skb_shared_info *shinfo = skb_shinfo(skb);
   829
   830		if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
 > 831			skb_add_rx_frag(skb, shinfo, rx_buf->page,
   832					rx_buf->page_offset, size, truesize);
   833		}
   834
   835		/* page is being used so we must update the page offset */
   836		ice_rx_buf_adjust_pg_offset(rx_buf, truesize);
   837	}
   838

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH bpf-next] xsk: Validate socket state in xsk_recvmsg, prior touching socket members
On Mon, Dec 7, 2020 at 9:22 AM Björn Töpel wrote:
>
> From: Björn Töpel
>
> In AF_XDP the socket state needs to be checked prior to touching the
> members of the socket. This was not the case for the recvmsg
> implementation. Fix that by moving the xsk_is_bound() call.
>
> Reported-by: kernel test robot
> Fixes: 45a86681844e ("xsk: Add support for recvmsg()")
> Signed-off-by: Björn Töpel
> ---
>  net/xdp/xsk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Acked-by: Magnus Karlsson

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 56c46e5f57bc..e28c6825e089 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -554,12 +554,12 @@ static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int fl
> 	struct sock *sk = sock->sk;
> 	struct xdp_sock *xs = xdp_sk(sk);
>
> +	if (unlikely(!xsk_is_bound(xs)))
> +		return -ENXIO;
> 	if (unlikely(!(xs->dev->flags & IFF_UP)))
> 		return -ENETDOWN;
> 	if (unlikely(!xs->rx))
> 		return -ENOBUFS;
> -	if (unlikely(!xsk_is_bound(xs)))
> -		return -ENXIO;
> 	if (unlikely(need_wait))
> 		return -EOPNOTSUPP;
>
>
> base-commit: 34da87213d3ddd26643aa83deff7ffc6463da0fc
> --
> 2.27.0
>
[PATCH 1/1] xdp: avoid calling kfree twice
From: Zhu Yanjun

In xdp_umem_pin_pages(), if npgs != umem->npgs and npgs >= 0,
xdp_umem_unpin_pages() is called. That function calls kfree() on
umem->pgs; then, back in xdp_umem_pin_pages(), kfree() is called on
umem->pgs again. As a result, umem->pgs is freed twice.

Signed-off-by: Zhu Yanjun
---
 net/xdp/xdp_umem.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 56a28a686988..ff5173f72920 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -97,7 +97,6 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
 	unsigned int gup_flags = FOLL_WRITE;
 	long npgs;
-	int err;
 
 	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
 			    GFP_KERNEL | __GFP_NOWARN);
@@ -112,20 +111,14 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 	if (npgs != umem->npgs) {
 		if (npgs >= 0) {
 			umem->npgs = npgs;
-			err = -ENOMEM;
-			goto out_pin;
+			xdp_umem_unpin_pages(umem);
+			return -ENOMEM;
 		}
-		err = npgs;
-		goto out_pgs;
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+		return npgs;
 	}
 	return 0;
-
-out_pin:
-	xdp_umem_unpin_pages(umem);
-out_pgs:
-	kfree(umem->pgs);
-	umem->pgs = NULL;
-	return err;
 }
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
-- 
2.18.4
Re: [PATCH] net: stmmac: dwmac-meson8b: fix mask definition of the m250_sel mux
On Sat 05 Dec 2020 at 22:32, Martin Blumenstingl wrote:

> The m250_sel mux clock uses bit 4 in the PRG_ETH0 register. Fix this by
> shifting the PRG_ETH0_CLK_M250_SEL_MASK accordingly, as the "mask" in
> struct clk_mux expects the mask relative to the "shift" field in the
> same struct.
>
> While here, get rid of the PRG_ETH0_CLK_M250_SEL_SHIFT macro and use
> __ffs() to determine it from the existing PRG_ETH0_CLK_M250_SEL_MASK
> macro.
>
> Fixes: 566e8251625304 ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC")
> Signed-off-by: Martin Blumenstingl

Reviewed-by: Jerome Brunet

> ---
>  drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> index dc0b8b6d180d..459ae715b33d 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> @@ -30,7 +30,6 @@
>  #define PRG_ETH0_EXT_RMII_MODE		4
>  
>  /* mux to choose between fclk_div2 (bit unset) and mpll2 (bit set) */
> -#define PRG_ETH0_CLK_M250_SEL_SHIFT	4
>  #define PRG_ETH0_CLK_M250_SEL_MASK	GENMASK(4, 4)
>  
>  /* TX clock delay in ns = "8ns / 4 * tx_dly_val" (where 8ns are exactly one
> @@ -155,8 +154,9 @@ static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac *dwmac)
>  		return -ENOMEM;
>  
>  	clk_configs->m250_mux.reg = dwmac->regs + PRG_ETH0;
> -	clk_configs->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT;
> -	clk_configs->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK;
> +	clk_configs->m250_mux.shift = __ffs(PRG_ETH0_CLK_M250_SEL_MASK);
> +	clk_configs->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK >>
> +				     clk_configs->m250_mux.shift;
>  	clk = meson8b_dwmac_register_clk(dwmac, "m250_sel", mux_parents,
>  					 ARRAY_SIZE(mux_parents), &clk_mux_ops,
>  					 &clk_configs->m250_mux.hw);
Re: [PATCH net] net: openvswitch: fix TTL decrement exception action execution
On 5 Dec 2020, at 1:30, Jakub Kicinski wrote:

> On Fri, 4 Dec 2020 07:16:23 -0500 Eelco Chaudron wrote:
>> Currently, the exception actions are not processed correctly as the
>> wrong dataset is passed. This change fixes this, including the
>> misleading comment.
>>
>> In addition, a check was added to make sure we work on an IPv4
>> packet, and not just assume that if it's not IPv6 it's IPv4.
>>
>> A small cleanup removes an unnecessary parameter from the
>> dec_ttl_exception_handler() function.
>
> No cleanups in fixes, please. Especially when we're at -rc6. You can
> clean this up in net-next within a week after the trees merge.

Ack, will undo the parameter removal and send out a v2.

>> Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action
>> netlink message format")
>
> :( and please add some info on how these changes are tested.

Will add the details to v2.
pull request (net): ipsec 2020-12-07
1) syzbot-reported fixes for the new 64/32-bit compat layer.
   From Dmitry Safonov.

2) Fix a memory leak in xfrm_user_policy that was introduced
   by adding the 64/32-bit compat layer. From Yu Kuai.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 4e0396c59559264442963b349ab71f66e471f84d:

  net: marvell: prestera: fix compilation with CONFIG_BRIDGE=m (2020-11-07 12:43:26 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 48f486e13ffdb49fbb9b38c21d0e108ed60ab1a2:

  net: xfrm: fix memory leak in xfrm_user_policy() (2020-11-10 09:14:25 +0100)

----------------------------------------------------------------
Dmitry Safonov (3):
      xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
      xfrm/compat: memset(0) 64-bit padding at right place
      xfrm/compat: Don't allocate memory with __GFP_ZERO

Steffen Klassert (1):
      Merge branch 'xfrm/compat: syzbot-found fixes'

Yu Kuai (1):
      net: xfrm: fix memory leak in xfrm_user_policy()

 net/xfrm/xfrm_compat.c | 5 +++--
 net/xfrm/xfrm_state.c  | 4 +++-
 2 files changed, 6 insertions(+), 3 deletions(-)
[PATCH 3/4] xfrm/compat: Don't allocate memory with __GFP_ZERO
From: Dmitry Safonov

The 32-bit to 64-bit message translator zeroes the needed paddings
during the translation; the rest is the actual payload. Don't allocate
zeroed pages, as they are not needed.

Fixes: 5106f4a8acff ("xfrm/compat: Add 32=>64-bit messages translator")
Signed-off-by: Dmitry Safonov
Signed-off-by: Steffen Klassert
---
 net/xfrm/xfrm_compat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index 556e9f33b815..d8e8a11ca845 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -564,7 +564,7 @@ static struct nlmsghdr *xfrm_user_rcv_msg_compat(const struct nlmsghdr *h32,
 		return NULL;
 	len += NLMSG_HDRLEN;
 
-	h64 = kvmalloc(len, GFP_KERNEL | __GFP_ZERO);
+	h64 = kvmalloc(len, GFP_KERNEL);
 	if (!h64)
 		return ERR_PTR(-ENOMEM);
-- 
2.25.1
Re: [PATCH ipsec-next] xfrm: interface: support collect metadata mode
On Fri, Nov 27, 2020 at 02:32:44PM +0200, Eyal Birger wrote:
> Hi Steffen,
>
> On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert wrote:
> >
> > On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > > This commit adds support for 'collect_md' mode on xfrm interfaces.
> > >
> > > Each net can have one collect_md device, created by providing the
> > > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > > altered and has no if_id or link device attributes.
> > >
> > > On transmit to this device, the if_id is fetched from the attached
> > > dst metadata on the skb. The dst metadata type used is
> > > METADATA_IP_TUNNEL, since the only needed property is the if_id
> > > stored in the tun_id member of the ip_tunnel_info->key.
> >
> > Can we please have a separate metadata type for xfrm interfaces?
> >
> > Sharing such structures already turned out to be a bad idea on vti
> > interfaces; let's try to avoid that mistake with xfrm interfaces.
>
> My initial thought was to do that, but it looks like most of the
> constructs surrounding this facility - tc, nft, ovs, ebpf, ip routing -
> are built around struct ip_tunnel_info and don't regard other possible
> metadata types.

That is likely because most objects that have a collect_md mode are
tunnels. We already have a second metadata type, and I don't see why we
can't have a third one. Maybe we can create something more generic so
that it can have other users too.

> For xfrm interfaces, the only metadata used is the if_id, which is
> stored in the metadata tun_id, so I think that other than naming
> considerations, the use of struct ip_tunnel_info does not imply
> tunneling and does not limit the use of xfrmi to a specific mode of
> operation.

I agree that this can work, but it is a first step in the wrong
direction. Using a __be64 field of a completely unrelated structure as a
u32 if_id is bad style IMO.
> On the other hand, adding a new metadata type would require changing
> all the other places to regard the new metadata type, with a large
> number of userspace-visible changes.

I admit that this might have some disadvantages too, but I'm not
convinced that this justifies the 'ip_tunnel_info' hack.
[PATCH 1/4] xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
From: Dmitry Safonov

xfrm_xlate32() translates a 64-bit message provided by the kernel to be
sent to a 32-bit listener (acknowledge or monitor). The translator code
doesn't expect an XFRMA_UNSPEC attribute, as it doesn't know its
payload. The kernel never attaches such an attribute, but a user can.

I've searched whether any open-source project does this, and the answer
is no. Nothing turns up on GitHub, and Google finds only tfcproject,
which has such code commented out.

What will happen if a user sends a netlink message with an XFRMA_UNSPEC
attribute? The ipsec code ignores this attribute. But if there is a
monitor process, or a 32-bit user requested an ack, the kernel will try
to translate such a message and will hit WARN_ONCE() in
xfrm_xlate64_attr().

Deal with XFRMA_UNSPEC by copying the attribute payload with
xfrm_nla_cpy(). As a result, the default switch case in
xfrm_xlate64_attr() becomes unused code. Leave those 3 lines in case a
new xfrm attribute is ever added.

Fixes: 5461fc0c8d9f ("xfrm/compat: Add 64=>32-bit messages translator")
Reported-by: syzbot+a7e701c8385bd8543...@syzkaller.appspotmail.com
Signed-off-by: Dmitry Safonov
Signed-off-by: Steffen Klassert
---
 net/xfrm/xfrm_compat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index e28f0c9ecd6a..17edbf935e35 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -234,6 +234,7 @@ static int xfrm_xlate64_attr(struct sk_buff *dst, const struct nlattr *src)
 	case XFRMA_PAD:
 		/* Ignore */
 		return 0;
+	case XFRMA_UNSPEC:
 	case XFRMA_ALG_AUTH:
 	case XFRMA_ALG_CRYPT:
 	case XFRMA_ALG_COMP:
-- 
2.25.1
[PATCH 2/4] xfrm/compat: memset(0) 64-bit padding at right place
From: Dmitry Safonov

32-bit messages translated by xfrm_compat can have attributes attached.
For all but XFRMA_SA and XFRMA_POLICY, the size of the payload is the
same in the 32-bit UABI and the 64-bit UABI. For XFRMA_SA (struct
xfrm_usersa_info) and XFRMA_POLICY (struct xfrm_userpolicy_info) it is
only tail padding that is present in the 64-bit payload but not in the
32-bit one.

The proper size for the destination nlattr is already calculated by
xfrm_user_rcv_calculate_len64() and allocated with kvmalloc().
xfrm_attr_cpy32() copies the 32-bit copy_len into the 64-bit translated
attribute payload, zero-filling the possible padding for SA/POLICY.

Due to a typo, *pos already holds the 64-bit payload size, and as a
result the following memset(0) is called on the memory after the
translated attribute, not on its tail padding.

Fixes: 5106f4a8acff ("xfrm/compat: Add 32=>64-bit messages translator")
Reported-by: syzbot+c43831072e7df506a...@syzkaller.appspotmail.com
Signed-off-by: Dmitry Safonov
Signed-off-by: Steffen Klassert
---
 net/xfrm/xfrm_compat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index 17edbf935e35..556e9f33b815 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -388,7 +388,7 @@ static int xfrm_attr_cpy32(void *dst, size_t *pos, const struct nlattr *src,
 	memcpy(nla, src, nla_attr_size(copy_len));
 	nla->nla_len = nla_attr_size(payload);
-	*pos += nla_attr_size(payload);
+	*pos += nla_attr_size(copy_len);
 	nlmsg->nlmsg_len += nla->nla_len;
 
 	memset(dst + *pos, 0, payload - copy_len);
-- 
2.25.1
[PATCH 4/4] net: xfrm: fix memory leak in xfrm_user_policy()
From: Yu Kuai

If xfrm_get_translator() fails, xfrm_user_policy() returns without
freeing 'data', which was allocated in memdup_sockptr().

Fixes: 96392ee5a13b ("xfrm/compat: Translate 32-bit user_policy from sockptr")
Reported-by: Hulk Robot
Signed-off-by: Yu Kuai
Signed-off-by: Steffen Klassert
---
 net/xfrm/xfrm_state.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index a77da7aae6fe..2f1517827995 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -2382,8 +2382,10 @@ int xfrm_user_policy(struct sock *sk, int optname, sockptr_t optval, int optlen)
 	if (in_compat_syscall()) {
 		struct xfrm_translator *xtr = xfrm_get_translator();
 
-		if (!xtr)
+		if (!xtr) {
+			kfree(data);
 			return -EOPNOTSUPP;
+		}
 
 		err = xtr->xlate_user_policy_sockptr(&data, optlen);
 		xfrm_put_translator(xtr);
-- 
2.25.1
Re: [PATCH v1 1/5] Bluetooth: advmon offload MSFT add rssi support
Hi Archie,

> MSFT needs the rssi parameters for monitoring advertisement packets,
> therefore we should supply them from mgmt.
>
> Signed-off-by: Archie Pusaka
> Reviewed-by: Miao-chen Chou
> Reviewed-by: Yun-Hao Chung

I don't need any Reviewed-by if they are not catching an obvious user
API breakage.

> ---
>
>  include/net/bluetooth/hci_core.h | 9 +++++++++
>  include/net/bluetooth/mgmt.h     | 9 +++++++++
>  net/bluetooth/mgmt.c             | 8 ++++++++
>  3 files changed, 26 insertions(+)
>
> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> index 9873e1c8cd16..42d446417817 100644
> --- a/include/net/bluetooth/hci_core.h
> +++ b/include/net/bluetooth/hci_core.h
> @@ -246,8 +246,17 @@ struct adv_pattern {
>  	__u8 value[HCI_MAX_AD_LENGTH];
>  };
>  
> +struct adv_rssi_thresholds {
> +	__s8 low_threshold;
> +	__s8 high_threshold;
> +	__u16 low_threshold_timeout;
> +	__u16 high_threshold_timeout;
> +	__u8 sampling_period;
> +};
> +
>  struct adv_monitor {
>  	struct list_head patterns;
> +	struct adv_rssi_thresholds rssi;
>  	bool active;
>  	__u16 handle;
>  };
>
> diff --git a/include/net/bluetooth/mgmt.h b/include/net/bluetooth/mgmt.h
> index d8367850e8cd..dc534837be0e 100644
> --- a/include/net/bluetooth/mgmt.h
> +++ b/include/net/bluetooth/mgmt.h
> @@ -763,9 +763,18 @@ struct mgmt_adv_pattern {
>  	__u8 value[31];
>  } __packed;
>  
> +struct mgmt_adv_rssi_thresholds {
> +	__s8 high_threshold;
> +	__le16 high_threshold_timeout;
> +	__s8 low_threshold;
> +	__le16 low_threshold_timeout;
> +	__u8 sampling_period;
> +} __packed;
> +
>  #define MGMT_OP_ADD_ADV_PATTERNS_MONITOR	0x0052
>  struct mgmt_cp_add_adv_patterns_monitor {
>  	__u8 pattern_count;
> +	struct mgmt_adv_rssi_thresholds rssi;
>  	struct mgmt_adv_pattern patterns[];
>  } __packed;

This is something we can not do. It breaks a userspace-facing API.

Is the mgmt opcode 0x0052 in an already released kernel?

>>> Yes, the opcode does exist in an already released kernel.
>>> The DBus method which accesses this API is put behind the experimental
>>> flag, therefore we expect they are flexible enough to support changes.
>>> Previously, we already had a discussion in an email thread with the
>>> title "Offload RSSI tracking to controller", and the outcome supports
>>> this change.
>>>
>>> Here is an excerpt of the discussion.
>>
>> It doesn't matter. This is fixed API now and so we can not just change
>> it. The argument above is void. What matters is whether it is in an
>> already released kernel.
>
> If that is the case, do you have a suggestion to allow RSSI to be
> considered when monitoring advertisement? Would a new MGMT opcode with
> these parameters suffice?

It's the only way.

Regards

Marcel
Re: [PATCH net-next v2 1/4] vm_sockets: Include flags field in the vsock address data structure
On Fri, Dec 04, 2020 at 07:02:32PM +0200, Andra Paraschiv wrote: vsock enables communication between virtual machines and the host they are running on. With the multi transport support (guest->host and host->guest), nested VMs can also use vsock channels for communication. In addition to this, by default, all the vsock packets are forwarded to the host, if no host->guest transport is loaded. This behavior can be implicitly used for enabling vsock communication between sibling VMs. Add a flags field in the vsock address data structure that can be used to explicitly mark the vsock connection as being targeted for a certain type of communication. This way, can distinguish between different use cases such as nested VMs and sibling VMs. Use the already available "svm_reserved1" field and mark it as a flags field instead. This field can be set when initializing the vsock address variable used for the connect() call. Changelog v1 -> v2 * Update the field name to "svm_flags". * Split the current patch in 2 patches. Usually the changelog goes after the 3 dashes, but I'm not sure there is a strict rule :-) Anyway the patch LGTM: Reviewed-by: Stefano Garzarella Signed-off-by: Andra Paraschiv --- include/uapi/linux/vm_sockets.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h index fd0ed7221645d..46735376a57a8 100644 --- a/include/uapi/linux/vm_sockets.h +++ b/include/uapi/linux/vm_sockets.h @@ -145,7 +145,7 @@ struct sockaddr_vm { __kernel_sa_family_t svm_family; - unsigned short svm_reserved1; + unsigned short svm_flags; unsigned int svm_port; unsigned int svm_cid; unsigned char svm_zero[sizeof(struct sockaddr) - -- 2.20.1 (Apple Git-117) Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH net-next v2 2/4] vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
On Fri, Dec 04, 2020 at 07:02:33PM +0200, Andra Paraschiv wrote: Add VMADDR_FLAG_TO_HOST vsock flag that is used to setup a vsock connection where all the packets are forwarded to the host. Then, using this type of vsock channel, vsock communication between sibling VMs can be built on top of it. Changelog v1 -> v2 * New patch in v2, it was split from the first patch in the series. * Remove the default value for the vsock flags field. * Update the naming for the vsock flag to "VMADDR_FLAG_TO_HOST". Signed-off-by: Andra Paraschiv --- include/uapi/linux/vm_sockets.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h index 46735376a57a8..72e1a3d05682d 100644 --- a/include/uapi/linux/vm_sockets.h +++ b/include/uapi/linux/vm_sockets.h @@ -114,6 +114,21 @@ #define VMADDR_CID_HOST 2 +/* The current default use case for the vsock channel is the following: + * local vsock communication between guest and host and nested VMs setup. + * In addition to this, implicitly, the vsock packets are forwarded to the host + * if no host->guest vsock transport is set. + * + * Set this flag value in the sockaddr_vm corresponding field if the vsock + * packets need to be always forwarded to the host. Using this behavior, + * vsock communication between sibling VMs can be setup. Maybe we can add a sentence saying that this flag is set on the remote peer address for an incoming connection when it is routed from the host (local CID and remote CID > VMADDR_CID_HOST). + * + * This way can explicitly distinguish between vsock channels created for + * different use cases, such as nested VMs (or local communication between + * guest and host) and sibling VMs. + */ +#define VMADDR_FLAG_TO_HOST 0x0001 + /* Invalid vSockets version. */ #define VM_SOCKETS_INVALID_VERSION -1U -- 2.20.1 (Apple Git-117) Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. 
Re: [PATCH net-next v2 4/4] af_vsock: Assign the vsock transport considering the vsock address flags
On Fri, Dec 04, 2020 at 07:02:35PM +0200, Andra Paraschiv wrote: The vsock flags field can be set in the connect and (listen) receive paths. When the vsock transport is assigned, the remote CID is used to distinguish between types of connection. Use the vsock flags value (in addition to the CID) from the remote address to decide which vsock transport to assign. For the sibling VMs use case, all the vsock packets need to be forwarded to the host, so always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag is set. For the other use cases, the vsock transport assignment logic is not changed. Changelog v1 -> v2 * Use bitwise operator to check the vsock flag. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. * Merge the checks for the g2h transport assignment in one "if" block. Signed-off-by: Andra Paraschiv --- net/vmw_vsock/af_vsock.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 83d035eab0b05..66e643c3b5f85 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -421,7 +421,8 @@ static void vsock_deassign_transport(struct vsock_sock *vsk) * The vsk->remote_addr is used to decide which transport to use: * - remote CID == VMADDR_CID_LOCAL or g2h->local_cid or VMADDR_CID_HOST if *g2h is not loaded, will use local transport; - * - remote CID <= VMADDR_CID_HOST will use guest->host transport; + * - remote CID <= VMADDR_CID_HOST or h2g is not loaded or remote flags field + *includes VMADDR_FLAG_TO_HOST flag value, will use guest->host transport; * - remote CID > VMADDR_CID_HOST will use host->guest transport; */ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) @@ -429,6 +430,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) const struct vsock_transport *new_transport; struct sock *sk = sk_vsock(vsk); unsigned int remote_cid = vsk->remote_addr.svm_cid; + unsigned short remote_flags; int ret; /* If the 
packet is coming with the source and destination CIDs higher @@ -443,6 +445,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) vsk->remote_addr.svm_cid > VMADDR_CID_HOST) vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST; + remote_flags = vsk->remote_addr.svm_flags; + switch (sk->sk_type) { case SOCK_DGRAM: new_transport = transport_dgram; @@ -450,7 +454,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) case SOCK_STREAM: if (vsock_use_local_transport(remote_cid)) new_transport = transport_local; - else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g) + else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g || +(remote_flags & VMADDR_FLAG_TO_HOST) == VMADDR_FLAG_TO_HOST) Maybe "remote_flags & VMADDR_FLAG_TO_HOST" should be enough, but the patch is okay: Reviewed-by: Stefano Garzarella new_transport = transport_g2h; else new_transport = transport_h2g; -- 2.20.1 (Apple Git-117) Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH net-next v2 3/4] af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
On Fri, Dec 04, 2020 at 07:02:34PM +0200, Andra Paraschiv wrote: The vsock flags can be set during the connect() setup logic, when initializing the vsock address data structure variable. Then the vsock transport is assigned, also considering this flags field. The vsock transport is also assigned on the (listen) receive path. The flags field needs to be set considering the use case. Set the value of the vsock flags of the remote address to the one targeted for packets forwarding to the host, if the following conditions are met: * The source CID of the packet is higher than VMADDR_CID_HOST. * The destination CID of the packet is higher than VMADDR_CID_HOST. Changelog v1 -> v2 * Set the vsock flag on the receive path in the vsock transport assignment logic. * Use bitwise operator for the vsock flag setup. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. Signed-off-by: Andra Paraschiv --- net/vmw_vsock/af_vsock.c | 12 1 file changed, 12 insertions(+) Reviewed-by: Stefano Garzarella diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index d10916ab45267..83d035eab0b05 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -431,6 +431,18 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) unsigned int remote_cid = vsk->remote_addr.svm_cid; int ret; + /* If the packet is coming with the source and destination CIDs higher +* than VMADDR_CID_HOST, then a vsock channel where all the packets are +* forwarded to the host should be established. Then the host will +* need to forward the packets to the guest. +* +* The flag is set on the (listen) receive path (psk is not NULL). On +* the connect path the flag can be set by the user space application. 
+*/ + if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST && + vsk->remote_addr.svm_cid > VMADDR_CID_HOST) + vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST; + switch (sk->sk_type) { case SOCK_DGRAM: new_transport = transport_dgram; -- 2.20.1 (Apple Git-117) Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH net-next v2 0/4] vsock: Add flags field in the vsock address
Hi Andra, On Fri, Dec 04, 2020 at 07:02:31PM +0200, Andra Paraschiv wrote: vsock enables communication between virtual machines and the host they are running on. Nested VMs can be setup to use vsock channels, as the multi transport support has been available in the mainline since the v5.5 Linux kernel has been released. Implicitly, if no host->guest vsock transport is loaded, all the vsock packets are forwarded to the host. This behavior can be used to setup communication channels between sibling VMs that are running on the same host. One example can be the vsock channels that can be established within AWS Nitro Enclaves (see Documentation/virt/ne_overview.rst). To be able to explicitly mark a connection as being used for a certain use case, add a flags field in the vsock address data structure. The "svm_reserved1" field has been repurposed to be the flags field. The value of the flags will then be taken into consideration when the vsock transport is assigned. This way can distinguish between different use cases, such as nested VMs / local communication and sibling VMs. the series seems in a good shape, I left some minor comments. I run my test suite (vsock_test, iperf3, nc) with nested VMs (QEMU/KVM), and everything looks good. Note: I'll be offline today and tomorrow, so I may miss followups. Thanks, Stefano
[PATCH net v2] net: openvswitch: fix TTL decrement exception action execution
Currently, the exception actions are not processed correctly as the
wrong dataset is passed. This change fixes this, including the
misleading comment.

In addition, a check was added to make sure we work on an IPv4 packet,
and not just assume that if it's not IPv6 it's IPv4.

This was all tested using OVS with the patch
https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639
applied, sending packets with a TTL of 1 (and 0), both with IPv4 and
IPv6.

Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format")
Signed-off-by: Eelco Chaudron
---
v2: - Undid unnecessary parameter removal from dec_ttl_exception_handler()
    - Updated commit message to include testing information.

 net/openvswitch/actions.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 5829a020b81c..ace69777cb29 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -956,16 +956,13 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
 				     struct sw_flow_key *key,
 				     const struct nlattr *attr, bool last)
 {
-	/* The first action is always 'OVS_DEC_TTL_ATTR_ARG'. */
-	struct nlattr *dec_ttl_arg = nla_data(attr);
+	/* The first attribute is always 'OVS_DEC_TTL_ATTR_ACTION'. */
+	struct nlattr *actions = nla_data(attr);
 
-	if (nla_len(dec_ttl_arg)) {
-		struct nlattr *actions = nla_data(dec_ttl_arg);
+	if (nla_len(actions))
+		return clone_execute(dp, skb, key, 0, nla_data(actions),
+				     nla_len(actions), last, false);
 
-		if (actions)
-			return clone_execute(dp, skb, key, 0, nla_data(actions),
-					     nla_len(actions), last, false);
-	}
 	consume_skb(skb);
 	return 0;
 }
@@ -1209,7 +1206,7 @@ static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key)
 			return -EHOSTUNREACH;
 
 		key->ip.ttl = --nh->hop_limit;
-	} else {
+	} else if (skb->protocol == htons(ETH_P_IP)) {
 		struct iphdr *nh;
 		u8 old_ttl;
Re: [net-next V2 09/15] net/mlx5e: CT: Use the same counter for both directions
Hi Marcelo, On 12/1/2020 11:41 PM, Saeed Mahameed wrote: On Fri, 2020-11-27 at 11:01 -0300, Marcelo Ricardo Leitner wrote: On Wed, Sep 23, 2020 at 03:48:18PM -0700, sa...@kernel.org wrote: From: Oz Shlomo Sorry for reviving this one, but seemed better for the context. A connection is represented by two 5-tuple entries, one for each direction. Currently, each direction allocates its own hw counter, which is inefficient as ct aging is managed per connection. Share the counter that was allocated for the original direction with the reverse direction. Yes, aging is done per connection, but the stats are not. With this patch, with netperf TCP_RR test, I get this: (mangled for readability) # grep 172.0.0.4 /proc/net/nf_conntrack ipv4 2 tcp 6 src=172.0.0.3 dst=172.0.0.4 sport=34018 dport=33396 packets=3941992 bytes=264113427 src=172.0.0.4 dst=172.0.0.3 sport=33396 dport=34018 packets=4 bytes=218 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3 while without it (594e31bceb + act_ct patch to enable it posted yesterday + revert), I get: # grep 172.0.0.4 /proc/net/nf_conntrack ipv4 2 tcp 6 src=172.0.0.3 dst=172.0.0.4 sport=41856 dport=32776 packets=1876763 bytes=125743084 src=172.0.0.4 dst=172.0.0.3 sport=32776 dport=41856 packets=1876761 bytes=125742951 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3 The same is visible on 'ovs-appctl dpctl/dump-conntrack -s' then. Summing both directions in one like this is at least very misleading. Seems this change was motivated only by hw resources constrains. That said, I'm wondering, can this change be reverted somehow? Marcelo Hi Marcelo, thanks for the report, Sorry i am not familiar with this /procfs Oz, Ariel, Roi, what is your take on this, it seems that we changed the behavior of stats incorrectly. Indeed we overlooked the CT accounting extension. We will submit a driver fix. Thanks, Saeed.
Re: pull-request: wireless-drivers-next-2020-12-03
Jakub Kicinski writes:

> On Thu, 3 Dec 2020 18:57:32 +0000 (UTC) Kalle Valo wrote:
>> wireless-drivers-next patches for v5.11
>>
>> First set of patches for v5.11. rtw88 getting improvements to work
>> better with Bluetooth, and other drivers also getting some new
>> features. The mhi-ath11k-immutable branch was pulled from the mhi
>> tree to avoid conflicts with the mhi tree.
>
> Pulled, but there are a lot of fixes in here which look like they
> should have been part of the other PR, if you ask me.

Yeah, I'm actually on purpose keeping the bar high for patches going to
wireless-drivers (i.e. the fixes going to -rc releases). This is just to
keep things simple for me and to reduce the number of conflicts between
the trees.

> There's also a patch which looks like it renames a module parameter.
> Module parameters are considered uAPI.

Ah, I have actually been wondering whether they are part of the user
space API or not, good to know that they are. I'll keep an eye on this
in the future so that we don't break the uAPI with module parameter
changes.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
[PATCH V3 0/5] patches for stmmac
A patch set for stmmac, fixing some driver issues.

ChangeLogs:
V1->V2:
	* add Fixes tag.
	* add patch 5/5 into this patch set.
V2->V3:
	* rebase to latest net tree where fixes go.

Fugang Duan (5):
  net: stmmac: increase the timeout for dma reset
  net: stmmac: start phylink instance before stmmac_hw_setup()
  net: stmmac: free tx skb buffer in stmmac_resume()
  net: stmmac: delete the eee_ctrl_timer after napi disabled
  net: stmmac: overwrite the dma_cap.addr64 according to HW design

 .../net/ethernet/stmicro/stmmac/dwmac-imx.c   |  9 +---
 .../net/ethernet/stmicro/stmmac/dwmac4_lib.c  |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 51 +++
 include/linux/stmmac.h                        |  1 +
 4 files changed, 43 insertions(+), 20 deletions(-)

-- 
2.17.1
[PATCH V3 1/5] net: stmmac: increase the timeout for dma reset
From: Fugang Duan

The current timeout value is not enough for the gmac5 dma reset on the
i.MX8MP platform, so increase the timeout range.

Signed-off-by: Fugang Duan
Signed-off-by: Joakim Zhang
---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
index 6e30d7eb4983..0b4ee2dbb691 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
@@ -22,7 +22,7 @@ int dwmac4_dma_reset(void __iomem *ioaddr)
 
 	return readl_poll_timeout(ioaddr + DMA_BUS_MODE, value,
 				  !(value & DMA_BUS_MODE_SFT_RESET),
-				  1, 10);
+				  1, 100);
 }
 
 void dwmac4_set_rx_tail_ptr(void __iomem *ioaddr, u32 tail_ptr, u32 chan)
-- 
2.17.1
[PATCH V3 3/5] net: stmmac: free tx skb buffer in stmmac_resume()
From: Fugang Duan

When doing suspend/resume tests, there is a WARN_ON() log dump from the
stmmac_xmit() function. The code logic:

	entry = tx_q->cur_tx;
	first_entry = entry;
	WARN_ON(tx_q->tx_skbuff[first_entry]);

In the normal case, tx_q->tx_skbuff[txq->cur_tx] should be NULL because
the skb should be handled and freed in stmmac_tx_clean().

But stmmac_resume() resets the queue parameters as below, so skb buffers
may not be freed:

	tx_q->cur_tx = 0;
	tx_q->dirty_tx = 0;

So free the tx skb buffers in stmmac_resume() to avoid the warning and
a memory leak.

log:
[ 46.139824] [ cut here ]
[ 46.144453] WARNING: CPU: 0 PID: 0 at drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:3235 stmmac_xmit+0x7a0/0x9d0
[ 46.154969] Modules linked in: crct10dif_ce vvcam(O) flexcan can_dev
[ 46.161328] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.4.24-2.1.0+g2ad925d15481 #1
[ 46.170369] Hardware name: NXP i.MX8MPlus EVK board (DT)
[ 46.175677] pstate: 8005 (Nzcv daif -PAN -UAO)
[ 46.180465] pc : stmmac_xmit+0x7a0/0x9d0
[ 46.184387] lr : dev_hard_start_xmit+0x94/0x158
[ 46.188913] sp : 800010003cc0
[ 46.192224] x29: 800010003cc0 x28: 000177e2a100
[ 46.197533] x27: 000176ef0840 x26: 000176ef0090
[ 46.202842] x25: x24:
[ 46.208151] x23: 0003 x22: 8000119ddd30
[ 46.213460] x21: 00017636f000 x20: 000176ef0cc0
[ 46.218769] x19: 0003 x18:
[ 46.224078] x17: x16:
[ 46.229386] x15: 0079 x14:
[ 46.234695] x13: 0003 x12: 0003
[ 46.240003] x11: 0010 x10: 0010
[ 46.245312] x9 : 00017002b140 x8 :
[ 46.250621] x7 : 00017636f000 x6 : 0010
[ 46.255930] x5 : 0001 x4 : 000176ef
[ 46.261238] x3 : 0003 x2 :
[ 46.266547] x1 : 000177e2a000 x0 :
[ 46.271856] Call trace:
[ 46.274302] stmmac_xmit+0x7a0/0x9d0
[ 46.277874] dev_hard_start_xmit+0x94/0x158
[ 46.282056] sch_direct_xmit+0x11c/0x338
[ 46.285976] __qdisc_run+0x118/0x5f0
[ 46.289549] net_tx_action+0x110/0x198
[ 46.293297] __do_softirq+0x120/0x23c
[ 46.296958] irq_exit+0xb8/0xd8
[ 46.300098] __handle_domain_irq+0x64/0xb8
[ 46.304191] gic_handle_irq+0x5c/0x148
[ 46.307936] el1_irq+0xb8/0x180
[ 46.311076] cpuidle_enter_state+0x84/0x360
[ 46.315256] cpuidle_enter+0x34/0x48
[ 46.318829] call_cpuidle+0x18/0x38
[ 46.322314] do_idle+0x1e0/0x280
[ 46.325539] cpu_startup_entry+0x24/0x40
[ 46.329460] rest_init+0xd4/0xe0
[ 46.332687] arch_call_rest_init+0xc/0x14
[ 46.336695] start_kernel+0x420/0x44c
[ 46.340353] ---[ end trace bc1ee695123cbacd ]---

Fixes: 47dd7a540b8a0 ("net: add support for STMicroelectronics Ethernet controllers.")
Signed-off-by: Fugang Duan
Signed-off-by: Joakim Zhang
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 0cef414f1289..7452f3c1cab9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1533,6 +1533,19 @@ static void dma_free_tx_skbufs(struct stmmac_priv *priv, u32 queue)
 		stmmac_free_tx_buffer(priv, queue, i);
 }
 
+/**
+ * stmmac_free_tx_skbufs - free TX skb buffers
+ * @priv: private structure
+ */
+static void stmmac_free_tx_skbufs(struct stmmac_priv *priv)
+{
+	u32 tx_queue_cnt = priv->plat->tx_queues_to_use;
+	u32 queue;
+
+	for (queue = 0; queue < tx_queue_cnt; queue++)
+		dma_free_tx_skbufs(priv, queue);
+}
+
 /**
  * free_dma_rx_desc_resources - free RX dma desc resources
  * @priv: private structure
@@ -5260,6 +5273,7 @@ int stmmac_resume(struct device *dev)
 
 	stmmac_reset_queues_param(priv);
 
+	stmmac_free_tx_skbufs(priv);
 	stmmac_clear_descriptors(priv);
 
 	stmmac_hw_setup(ndev, false);
-- 
2.17.1
[PATCH V3 2/5] net: stmmac: start phylink instance before stmmac_hw_setup()
From: Fugang Duan

Start the phylink instance and resume the PHY, so that it supplies the
RX clock to the MAC, before the MAC layer initialization in
stmmac_hw_setup(), since the DMA reset depends on the RX clock;
otherwise the DMA reset takes the maximum timeout value and then
finally times out.

Fixes: 74371272f97f ("net: stmmac: Convert to phylink and remove phylib logic")
Signed-off-by: Fugang Duan
Signed-off-by: Joakim Zhang
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ba45fe237512..0cef414f1289 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5247,6 +5247,14 @@ int stmmac_resume(struct device *dev)
 			return ret;
 	}
 
+	if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
+		rtnl_lock();
+		phylink_start(priv->phylink);
+		/* We may have called phylink_speed_down before */
+		phylink_speed_up(priv->phylink);
+		rtnl_unlock();
+	}
+
 	rtnl_lock();
 	mutex_lock(&priv->lock);
 
@@ -5265,14 +5273,6 @@ int stmmac_resume(struct device *dev)
 	mutex_unlock(&priv->lock);
 	rtnl_unlock();
 
-	if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-		rtnl_lock();
-		phylink_start(priv->phylink);
-		/* We may have called phylink_speed_down before */
-		phylink_speed_up(priv->phylink);
-		rtnl_unlock();
-	}
-
 	phylink_mac_change(priv->phylink, true);
 
 	netif_device_attach(ndev);
-- 
2.17.1
[PATCH V3 5/5] net: stmmac: overwrite the dma_cap.addr64 according to HW design
From: Fugang Duan

The current IP register MAC_HW_Feature1[ADDR64] only defines 32/40/64
bit widths, but some SoCs support other widths: for example, i.MX8MP
supports 34 bits, which maps to the 40-bit width in
MAC_HW_Feature1[ADDR64]. So overwrite dma_cap.addr64 according to the
real HW design.

Fixes: 94abdad6974a ("net: ethernet: dwmac: add ethernet glue logic for NXP imx8 chip")
Signed-off-by: Fugang Duan
Signed-off-by: Joakim Zhang
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c   | 9 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 8
 include/linux/stmmac.h                            | 1 +
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
index efef5476a577..223f69da7e95 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
@@ -246,13 +246,7 @@ static int imx_dwmac_probe(struct platform_device *pdev)
 		goto err_parse_dt;
 	}
 
-	ret = dma_set_mask_and_coherent(&pdev->dev,
-					DMA_BIT_MASK(dwmac->ops->addr_width));
-	if (ret) {
-		dev_err(&pdev->dev, "DMA mask set failed\n");
-		goto err_dma_mask;
-	}
-
+	plat_dat->addr64 = dwmac->ops->addr_width;
 	plat_dat->init = imx_dwmac_init;
 	plat_dat->exit = imx_dwmac_exit;
 	plat_dat->fix_mac_speed = imx_dwmac_fix_speed;
@@ -272,7 +266,6 @@ static int imx_dwmac_probe(struct platform_device *pdev)
 err_dwmac_init:
 err_drv_probe:
 	imx_dwmac_exit(pdev, plat_dat->bsp_priv);
-err_dma_mask:
 err_parse_dt:
 err_match_data:
 	stmmac_remove_config_dt(pdev, plat_dat);
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index d2521ebb8217..c33db79cdd0a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4945,6 +4945,14 @@ int stmmac_dvr_probe(struct device *device,
 		dev_info(priv->device, "SPH feature enabled\n");
 	}
 
+	/* The current IP register MAC_HW_Feature1[ADDR64] only defines
+	 * 32/40/64 bit widths, but some SoCs support others, like i.MX8MP,
+	 * which supports 34 bits but maps to the 40-bit width in
+	 * MAC_HW_Feature1[ADDR64]. So overwrite dma_cap.addr64 according
+	 * to the real HW design.
+	 */
+	if (priv->plat->addr64)
+		priv->dma_cap.addr64 = priv->plat->addr64;
+
 	if (priv->dma_cap.addr64) {
 		ret = dma_set_mask_and_coherent(device,
 						DMA_BIT_MASK(priv->dma_cap.addr64));
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 628e28903b8b..15ca6b4167cc 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -170,6 +170,7 @@ struct plat_stmmacenet_data {
 	int unicast_filter_entries;
 	int tx_fifo_size;
 	int rx_fifo_size;
+	u32 addr64;
 	u32 rx_queues_to_use;
 	u32 tx_queues_to_use;
 	u8 rx_sched_algorithm;
-- 
2.17.1
[PATCH V3 4/5] net: stmmac: delete the eee_ctrl_timer after napi disabled
From: Fugang Duan

There is a chance that the eee_ctrl_timer is re-enabled and fired in the
napi callback after it has been deleted in stmmac_release(). The timer
function then accesses EEE registers after the clocks are disabled,
which causes a system hang. This issue was found when doing
suspend/resume and reboot stress tests.

It is safe to delete the timer after napi is disabled and LPI mode is
disabled.

Fixes: d765955d2ae0b ("stmmac: add the Energy Efficient Ethernet support")
Signed-off-by: Fugang Duan
Signed-off-by: Joakim Zhang
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 7452f3c1cab9..d2521ebb8217 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2908,9 +2908,6 @@ static int stmmac_release(struct net_device *dev)
 	struct stmmac_priv *priv = netdev_priv(dev);
 	u32 chan;
 
-	if (priv->eee_enabled)
-		del_timer_sync(&priv->eee_ctrl_timer);
-
 	if (device_may_wakeup(priv->device))
 		phylink_speed_down(priv->phylink, false);
 	/* Stop and disconnect the PHY */
@@ -2929,6 +2926,11 @@ static int stmmac_release(struct net_device *dev)
 	if (priv->lpi_irq > 0)
 		free_irq(priv->lpi_irq, dev);
 
+	if (priv->eee_enabled) {
+		priv->tx_path_in_lpi_mode = false;
+		del_timer_sync(&priv->eee_ctrl_timer);
+	}
+
 	/* Stop TX/RX DMA and clear the descriptors */
 	stmmac_stop_all_dma(priv);
 
@@ -5155,6 +5157,11 @@ int stmmac_suspend(struct device *dev)
 	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
 		del_timer_sync(&priv->tx_queue[chan].txtimer);
 
+	if (priv->eee_enabled) {
+		priv->tx_path_in_lpi_mode = false;
+		del_timer_sync(&priv->eee_ctrl_timer);
+	}
+
 	/* Stop TX/RX DMA */
 	stmmac_stop_all_dma(priv);
-- 
2.17.1
Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support
On 12/7/2020 10:37 AM, Saeed Mahameed wrote: On Sun, 2020-12-06 at 09:08 -0800, Richard Cochran wrote: On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote: Adding new enum to the ioctl means we have add (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way - drivers, kernel ptp, user space ptp, ethtool. Not exactly, 1) the flag name should be HWTSTAMP_TX_PTP_EVENTS, similar to what we already have in RX, which will mean: HW stamp all PTP events, don't care about the rest. 2) no need to add it to drivers from the get go, only drivers who are interested may implement it, and i am sure there are tons who would like to have this flag if their hw timestamping implementation is slow ! other drivers will just keep doing what they are doing, timestamp all traffic even if user requested this flag, again exactly like many other drivers do for RX flags (hwtstamp_rx_filters). My concerns are: 1. Timestamp applications (like ptp4l or similar) will have to add support for configuring the driver to use HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY if supported via ioctl prior to packets transmit. From application point of view, the dual-modes (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY , HWTSTAMP_TX_ON) support is redundant, as it offers nothing new. Well said. disagree, it is not a dual mode, just allow the user to have better granularity for what hw stamps, exactly like what we have in rx. we are not adding any new mechanism. 2. Other vendors will have to support it as well, when not sure what is the expectation from them if they cannot improve accuracy between them. If there were multiple different devices out there with this kind of implementation (different levels of accuracy with increasing run time performance cost), then we could consider such a flag. However, to my knowledge, this feature is unique to your device. I agree, but i never meant to have a flag that indicate two different levels of accuracy, that would be a very wild mistake for sure! 
The new flag will be about selecting the granularity of what gets a hw
stamp and what doesn't, aligning with the RX filter API.

This feature is just an internal enhancement, and as such it should be
added only as a vendor private configuration flag. We are not offering
here any standard for others to follow.

+1

Our driver feature is an internal enhancement, yes, but the suggested
flag is very far from indicating any internal enhancement; it is
actually an enhancement to the current API, and a very simple extension
with a wide range of improvements to all layers. Our driver can optimize
accuracy when this flag is set; other drivers might be happy to
implement it since they already have slow hw, and this flag would allow
them to run better TCP/UDP performance while still performing ptp hw
stamping; some admins/apps will use it to avoid stamping all traffic on
tx. Win win win.

Seems interesting. I can form such V2 patches soon.
[PATCH net-next] nfc: s3fwrn5: Change irqflags
From: Bongsu Jeon

Change the irqflags from IRQF_TRIGGER_HIGH to IRQF_TRIGGER_RISING for
stable interrupt handling on Samsung's NFC chip.

Signed-off-by: Bongsu Jeon
---
 drivers/nfc/s3fwrn5/i2c.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nfc/s3fwrn5/i2c.c b/drivers/nfc/s3fwrn5/i2c.c
index e1bdde105f24..016f6b6df849 100644
--- a/drivers/nfc/s3fwrn5/i2c.c
+++ b/drivers/nfc/s3fwrn5/i2c.c
@@ -213,7 +213,7 @@ static int s3fwrn5_i2c_probe(struct i2c_client *client,
 		return ret;
 
 	ret = devm_request_threaded_irq(&client->dev, phy->i2c_dev->irq, NULL,
-		s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_HIGH | IRQF_ONESHOT,
+		s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_RISING | IRQF_ONESHOT,
 		S3FWRN5_I2C_DRIVER_NAME, phy);
 	if (ret)
 		s3fwrn5_remove(phy->common.ndev);
-- 
2.17.1
[PATCH RFC] ethernet: stmmac: clean up the code for release/suspend/resume function
commit 1c35cc9cf6a0 ("net: stmmac: remove redundant null check before clk_disable_unprepare()") did not clean up all of the redundant NULL checks on clock parameters; this patch finishes the job.

commit e8377e7a29efb ("net: stmmac: only call pmt() during suspend/resume if HW enables PMT"): after this patch, we use
	if (device_may_wakeup(priv->device) && priv->plat->pmt) to check MAC wakeup
	if (device_may_wakeup(priv->device)) to check PHY wakeup
Add a one-line comment for readability.

commit 77b2898394e3b ("net: stmmac: Speed down the PHY if WoL to save energy"): slow down the PHY speed when releasing the net device under any condition.

Slightly adjust the order of the code so that suspend/resume look more symmetrical; generally speaking they should appear symmetrically.

Signed-off-by: Joakim Zhang
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 22 +--
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c33db79cdd0a..a46e865c4acc 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2908,8 +2908,7 @@ static int stmmac_release(struct net_device *dev)
 	struct stmmac_priv *priv = netdev_priv(dev);
 	u32 chan;

-	if (device_may_wakeup(priv->device))
-		phylink_speed_down(priv->phylink, false);
+	phylink_speed_down(priv->phylink, false);
 	/* Stop and disconnect the PHY */
 	phylink_stop(priv->phylink);
 	phylink_disconnect_phy(priv->phylink);
@@ -5183,6 +5182,7 @@ int stmmac_suspend(struct device *dev)
 	} else {
 		mutex_unlock(&priv->lock);
 		rtnl_lock();
+		/* For PHY wakeup case */
 		if (device_may_wakeup(priv->device))
 			phylink_speed_down(priv->phylink, false);
 		phylink_stop(priv->phylink);
@@ -5260,11 +5260,17 @@ int stmmac_resume(struct device *dev)
 		/* enable the clk previously disabled */
 		clk_prepare_enable(priv->plat->stmmac_clk);
 		clk_prepare_enable(priv->plat->pclk);
-		if (priv->plat->clk_ptp_ref)
-			clk_prepare_enable(priv->plat->clk_ptp_ref);
+		clk_prepare_enable(priv->plat->clk_ptp_ref);
 		/* reset the phy so that it's ready */
 		if (priv->mii)
 			stmmac_mdio_reset(priv->mii);
+
+		rtnl_lock();
+		phylink_start(priv->phylink);
+		/* We may have called phylink_speed_down before */
+		if (device_may_wakeup(priv->device))
+			phylink_speed_up(priv->phylink);
+		rtnl_unlock();
 	}

 	if (priv->plat->serdes_powerup) {
@@ -5275,14 +5281,6 @@ int stmmac_resume(struct device *dev)
 		return ret;
 	}

-	if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-		rtnl_lock();
-		phylink_start(priv->phylink);
-		/* We may have called phylink_speed_down before */
-		phylink_speed_up(priv->phylink);
-		rtnl_unlock();
-	}
-
 	rtnl_lock();
 	mutex_lock(&priv->lock);
--
2.17.1
[PATCH net] tcp: fix receive buffer autotuning to trigger for any valid advertised MSS
Previously, receiver buffer auto-tuning started after receiving one advertised window amount of data. After the initial receiver buffer was raised by commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), it may take too long for TCP autotuning to start raising the receiver buffer size.

commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") tried to decrease the threshold at which TCP auto-tuning starts, but it doesn't work well in some environments where the receiver has a large MTU (9001), especially on high-RTT connections. In these environments rcvq_space.space will be the same as rcv_wnd, so TCP autotuning will never start because the sender can't send more than rcv_wnd size in one round trip.

To address this issue, this patch decreases the initial rcvq_space.space so TCP autotuning kicks in whenever the sender is able to send more than 5360 bytes in one round trip, regardless of the receiver's configured MTU.

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
Signed-off-by: Hazem Mohamed Abuelfotoh
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..f0ffac9e937b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_init_buffer_space(struct sock *sk)
 {
 	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;

 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);

-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
 	tcp_mstamp_refresh(tp);
 	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;
--
2.16.6

Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
BUG: unable to handle kernel paging request in bpf_lru_populate
Hello, syzbot found the following issue on: HEAD commit:bcd684aa net/nfc/nci: Support NCI 2.x initial sequence git tree: net-next console output: https://syzkaller.appspot.com/x/log.txt?x=12001bd350 kernel config: https://syzkaller.appspot.com/x/.config?x=3cb098ab0334059f dashboard link: https://syzkaller.appspot.com/bug?extid=ec2234240c96fdd26b93 compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11f7f2ef50 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=103833f750 The issue was bisected to: commit b93ef089d35c3386dd197e85afb6399bbd54cfb3 Author: Martin KaFai Lau Date: Mon Nov 16 20:01:13 2020 + bpf: Fix the irq and nmi check in bpf_sk_storage for tracing usage bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1103b83750 final oops: https://syzkaller.appspot.com/x/report.txt?x=1303b83750 console output: https://syzkaller.appspot.com/x/log.txt?x=1503b83750 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+ec2234240c96fdd26...@syzkaller.appspotmail.com Fixes: b93ef089d35c ("bpf: Fix the irq and nmi check in bpf_sk_storage for tracing usage") BUG: unable to handle page fault for address: f5200471266c #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 23fff2067 P4D 23fff2067 PUD 101a4067 PMD 32e3a067 PTE 0 Oops: [#1] PREEMPT SMP KASAN CPU: 1 PID: 8503 Comm: syz-executor608 Not tainted 5.10.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:bpf_common_lru_populate kernel/bpf/bpf_lru_list.c:569 [inline] RIP: 0010:bpf_lru_populate+0xd8/0x5e0 kernel/bpf/bpf_lru_list.c:614 Code: 03 4d 01 e7 48 01 d8 48 89 4c 24 10 4d 89 fe 48 89 44 24 08 e8 99 23 eb ff 49 8d 7e 12 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <0f> b6 04 18 38 d0 7f 08 84 c0 0f 85 80 04 00 00 49 8d 7e 13 41 c6 RSP: 0018:c9000126fc20 EFLAGS: 00010202 RAX: 19200471266c RBX: dc00 RCX: 8184e3e2 RDX: 0002 RSI: 
8184e2e7 RDI: c90023893362 RBP: 00bc R08: 107c R09: R10: 107c R11: R12: 0001 R13: 107c R14: c90023893350 R15: c900234832f0 FS: 00fe0880() GS:8880b9f0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: f5200471266c CR3: 1ba62000 CR4: 001506e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: prealloc_init kernel/bpf/hashtab.c:319 [inline] htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507 find_and_alloc_map kernel/bpf/syscall.c:123 [inline] map_create kernel/bpf/syscall.c:829 [inline] __do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4374 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4402e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7ffe77af23b8 EFLAGS: 0246 ORIG_RAX: 0141 RAX: ffda RBX: 004002c8 RCX: 004402e9 RDX: 0040 RSI: 2000 RDI: 0d00 RBP: 006ca018 R08: R09: R10: R11: 0246 R12: 00401af0 R13: 00401b80 R14: R15: Modules linked in: CR2: f5200471266c ---[ end trace 4f3928bacde7b3ed ]--- RIP: 0010:bpf_common_lru_populate kernel/bpf/bpf_lru_list.c:569 [inline] RIP: 0010:bpf_lru_populate+0xd8/0x5e0 kernel/bpf/bpf_lru_list.c:614 Code: 03 4d 01 e7 48 01 d8 48 89 4c 24 10 4d 89 fe 48 89 44 24 08 e8 99 23 eb ff 49 8d 7e 12 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <0f> b6 04 18 38 d0 7f 08 84 c0 0f 85 80 04 00 00 49 8d 7e 13 41 c6 RSP: 0018:c9000126fc20 EFLAGS: 00010202 RAX: 19200471266c RBX: dc00 RCX: 8184e3e2 RDX: 0002 RSI: 8184e2e7 RDI: c90023893362 RBP: 00bc R08: 107c R09: R10: 107c R11: R12: 0001 R13: 107c R14: c90023893350 R15: c900234832f0 FS: 00fe0880() GS:8880b9f0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: f5200471266c CR3: 1ba62000 CR4: 001506e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. 
syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue.
Re: [PATCH net-next] nfc: s3fwrn5: Change irqflags
On Mon, Dec 07, 2020 at 08:38:27PM +0900, Bongsu Jeon wrote:
> From: Bongsu Jeon
>
> change irqflags from IRQF_TRIGGER_HIGH to IRQF_TRIGGER_RISING for stable
> Samsung's nfc interrupt handling.

1. Describe the change in the commit title/subject. Just the words "change irqflags" are not enough.
2. Describe in the commit message what you are trying to fix. Was it not stable before? The "for stable interrupt handling" is a little bit vague.
3. This is contradictory to the bindings and current DTS. I think the driver should not force the specific trigger type because I could imagine some configuration where the actual interrupt to the CPU is routed differently. Instead, how about removing the trigger flags here and fixing the DTS and bindings example?

Best regards,
Krzysztof

>
> Signed-off-by: Bongsu Jeon
> ---
> drivers/nfc/s3fwrn5/i2c.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nfc/s3fwrn5/i2c.c b/drivers/nfc/s3fwrn5/i2c.c
> index e1bdde105f24..016f6b6df849 100644
> --- a/drivers/nfc/s3fwrn5/i2c.c
> +++ b/drivers/nfc/s3fwrn5/i2c.c
> @@ -213,7 +213,7 @@ static int s3fwrn5_i2c_probe(struct i2c_client *client,
> 		return ret;
>
> 	ret = devm_request_threaded_irq(&client->dev, phy->i2c_dev->irq, NULL,
> -		s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_HIGH | IRQF_ONESHOT,
> +		s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_RISING | IRQF_ONESHOT,
> 		S3FWRN5_I2C_DRIVER_NAME, phy);
> 	if (ret)
> 		s3fwrn5_remove(phy->common.ndev);
> --
> 2.17.1
>
Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set
On Fri, 4 Dec 2020 23:19:55 +0100 Daniel Borkmann wrote: > On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: > > Daniel Borkmann writes: > [...] > >> We tried to standardize on a minimum guaranteed amount, but unfortunately > >> not > >> everyone seems to implement it, but I think it would be very useful to > >> query > >> this from application side, for example, consider that an app inserts a BPF > >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful > >> to > >> know which of the different encaps it implements are realistically > >> possible on > >> the underlying XDP supported dev. > > > > How many distinct values are there in reality? Enough to express this in > > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need > > an additional field to get the exact value? If we implement the latter > > we also run the risk of people actually implementing all sorts of weird > > values, whereas if we constrain it to a few distinct values it's easier > > to push back against adding new values (as it'll be obvious from the > > addition of new flags). > > It's not everywhere straight forward to determine unfortunately, see also > [0,1] > as some data points where Jesper looked into in the past, so in some cases it > might differ depending on the build/runtime config.. > >[0] > https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ >[1] > https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ Yes, unfortunately drivers have already gotten creative in this area, and variations have sneaked in. I remember that we were forced to allow SFC driver to use 128 bytes headroom, to avoid a memory corruption. I tried hard to have the minimum 192 bytes as it is 3 cachelines, but I failed to enforce this. 
It might be valuable to expose info on the driver's headroom size, as this will allow end-users to take advantage of it (instead of having to use the lowest common headroom) and up-front in userspace reject loading on e.g. SFC, which has this annoying limitation.

BUT thinking about what the driver's headroom size MEANS to userspace, I'm not sure it is wise to give this info to userspace. The XDP-headroom is used for several kernel-internal things that limit the available space for growing packet-headroom. E.g. (1) xdp_frame is something that we likely need to grow (even though I'm pushing back), e.g. (2) the metadata area which Saeed is looking to populate from driver code (also reducing packet-headroom for encap-headers). So, userspace cannot use the XDP-headroom size too much...

--
Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
On Fri, Dec 04, 2020 at 12:37:32AM +0000, Alex Belits wrote:
> On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> > On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:
> > As a heads-up, the arm64 entry code is changing, as we found that
> > our lockdep, RCU, and context-tracking management wasn't quite
> > right. I have a series of patches:
> >
> > https://lore.kernel.org/r/20201130115950.22492-1-mark.rutl...@arm.com
> >
> > ... which are queued in the arm64 for-next/fixes branch. I intend to
> > have some further rework ready for the next cycle.
> > That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> > chosen and context tracking was in use (e.g. with
> > CONTEXT_TRACKING_FORCE),
>
> I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
> CONTEXT_TRACKING have to be enabled for it to do anything.
>
> I will check it with PROVE_LOCKING and your patches.

Thanks. In future, please do test this functionality with PROVE_LOCKING, because otherwise bugs with RCU and IRQ state management will easily be missed (as has been the case until very recently).

Testing with all those debug options enabled (and stating that you have done so) will give reviewers much greater confidence that this works, and if that does start spewing errors it saves everyone the time identifying that.

> Entry code only adds an inline function that, if task isolation is
> enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level
> operations and accesses per-CPU variables by offset, so at the very
> least it should not add any problems. Even raw_local_irq_save() /
> raw_local_irq_restore() probably should be removed, however I wanted to
> have something that can be safely called if by whatever reason
> interrupts were enabled before the kernel was fully entered.

Sure. In the new flows we have new enter_from_*() and exit_to_*() functions where these calls should be able to live (and so we should be able to ensure a more consistent environment).

The near-term plan for arm64 is to migrate more of the exception triage assembly to C, then to rework the arm64 entry code and generic entry code to be more similar, then to migrate as much as possible to the generic entry code. So please bear in mind that anything that adds to the differences between the two is going to be problematic.

> > so I'm assuming that this series has not been
> > tested in that configuration. What sort of testing has this seen?
>
> On various available arm64 hardware, with enabled
>
> CONFIG_TASK_ISOLATION
> CONFIG_NO_HZ_FULL
> CONFIG_HIGH_RES_TIMERS
>
> and disabled:
>
> CONFIG_HZ_PERIODIC
> CONFIG_NO_HZ_IDLE
> CONFIG_NO_HZ

Ok. I'd recommend looking at the various debug options under the "kernel hacking" section in kconfig, and enabling some of those. At the very least PROVE_LOCKING, ideally also using the lockup detectors and anything else for debugging RCU, etc.

[...]

> > > Functions called from there:
> > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> > >
> > > Handlers:
> > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
> > > or task_isolation_kernel_enter()
> > > el1_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_compat_handler() -> task_isolation_kernel_enter()
> > >
> > > handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
> > > There is a separate patch for irqchips that do not follow this rule.
> > >
> > > handle_domain_irq() -> task_isolation_kernel_enter()
> > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> > > nmi_enter() -> task_isolation_kernel_enter()
> >
> > The IRQ cases look very odd to me. With the rework I've just done
> > for arm64, we'll do the regular context tracking accounting before
> > we ever get into handle_domain_irq() or similar, so I suspect that's
> > not necessary at all?

> The goal is to call task_isolation_kernel_enter() before anything that
> depends on a CPU state, including pipeline, that could remain
> unsynchronized when the rest of the kernel was sending synchronization
> IPIs. Similarly task_isolation_kernel_return() should be called when it
> is safe to turn off synchronization. If rework allows it to be done
> earlier, there is no need to touch more specific functions.

Sure; I think that's sorted as a result of the changes I made recently.

> > > --- a/arch/arm64/include/asm/barrier.h
> > > +++ b/arch/arm64/include/asm/barrier.h
> > > @@ -49,6 +49,7 @@
> > >  #define dma_rmb()	dmb(oshld)
> > >  #define dma_wmb()	dmb(oshst)
> > >
> > > +#define instr_sync() isb()
> >
> > I think I've asked on prior versions of the patchset, but what is
> > this for? Where is it going to be used, and what is the expected
> > semantics? I'm wary of exposing this outside of arch code because
> > there aren't stron
Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Fri, Dec 04, 2020 at 12:54:29AM +0000, Alex Belits wrote:
> On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> > On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > > From: Yuri Norov
> > >
> > > For nohz_full CPUs the desirable behavior is to receive interrupts
> > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > > obviously not desirable because it breaks isolation.
> > >
> > > This patch adds a check for it.
> > >
> > > Signed-off-by: Yuri Norov
> > > [abel...@marvell.com: updated, only exclude CPUs running isolated tasks]
> > > Signed-off-by: Alex Belits
> > > ---
> > > kernel/time/tick-sched.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index a213952541db..6c8679e200f0 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -20,6 +20,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >  #include
> > >  #include
> > >  #include
> > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	smp_rmb();
> >
> > What does this barrier pair with? The commit message doesn't mention
> > it, and it's not clear in-context.
>
> With barriers in task_isolation_kernel_enter()
> and task_isolation_exit_to_user_mode().

Please add a comment in the code as to what it pairs with.

Thanks,
Mark.
[PATCH -next] net/mlx5_core: remove unused including
Remove an include that is not needed.

Signed-off-by: Zou Wei
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 989c70c..82ecc161 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -30,7 +30,6 @@
  * SOFTWARE.
  */

-#include
 #include
 #include
 #include
--
2.6.2
Re: Why the auxiliary cipher in gss_krb5_crypto.c?
Ard Biesheuvel wrote: > > Yeah - the problem with that is that for sunrpc, we might be dealing with > > 1MB > > plus bits of non-contiguous pages, requiring >8K of scatterlist elements > > (admittedly, we can chain them, but we may have to do one or more large > > allocations). > > > > > However, I would recommend against it: > > > > Sorry, recommend against what? > > > > Recommend against the current approach of manipulating the input like > this and feeding it into the skcipher piecemeal. Right. I understand the problem, but as I mentioned above, the scatterlist itself becomes a performance issue as it may exceed two pages in size. Double that as there may need to be separate input and output scatterlists. > Herbert recently made some changes for MSG_MORE support in the AF_ALG > code, which permits a skcipher encryption to be split into several > invocations of the skcipher layer without the need for this complexity > on the side of the caller. Maybe there is a way to reuse that here. > Herbert? I wonder if it would help if the input buffer and output buffer didn't have to correspond exactly in usage - ie. the output buffer could be used at a slower rate than the input to allow for buffering inside the crypto algorithm. > > Can you also do SHA at the same time in the same loop? > > SHA-1 or HMAC-SHA1? The latter could probably be modeled as an AEAD. > The former doesn't really fit the current API so we'd have to invent > something for it. The hashes corresponding to the kerberos enctypes I'm supporting are: HMAC-SHA1 for aes128-cts-hmac-sha1-96 and aes256-cts-hmac-sha1-96. HMAC-SHA256 for aes128-cts-hmac-sha256-128 HMAC-SHA384 for aes256-cts-hmac-sha384-192 CMAC-CAMELLIA for camellia128-cts-cmac and camellia256-cts-cmac I'm not sure you can support all of those with the instructions available. David
Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set
Daniel Borkmann writes: > On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: >> Daniel Borkmann writes: > [...] >>> We tried to standardize on a minimum guaranteed amount, but unfortunately >>> not >>> everyone seems to implement it, but I think it would be very useful to query >>> this from application side, for example, consider that an app inserts a BPF >>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful >>> to >>> know which of the different encaps it implements are realistically possible >>> on >>> the underlying XDP supported dev. >> >> How many distinct values are there in reality? Enough to express this in >> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need >> an additional field to get the exact value? If we implement the latter >> we also run the risk of people actually implementing all sorts of weird >> values, whereas if we constrain it to a few distinct values it's easier >> to push back against adding new values (as it'll be obvious from the >> addition of new flags). > > It's not everywhere straight forward to determine unfortunately, see also > [0,1] > as some data points where Jesper looked into in the past, so in some cases it > might differ depending on the build/runtime config.. > >[0] > https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ >[1] > https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ Right, well in that case maybe we should just expose the actual headroom as a separate netlink attribute? Although I suppose that would require another round of driver changes since Jesper's patch you linked above only puts this into xdp_buff at XDP program runtime. Jesper, WDYT? -Toke
Re: [PATCH v2 bpf 0/5] New netdev feature flags for XDP
Jakub Kicinski writes: > On Fri, 04 Dec 2020 18:26:10 +0100 Toke Høiland-Jørgensen wrote: >> Jakub Kicinski writes: >> >> > On Fri, 4 Dec 2020 11:28:56 +0100 alar...@gmail.com wrote: >> >> * Extend ethtool netlink interface in order to get access to the XDP >> >>bitmap (XDP_PROPERTIES_GET). [Toke] >> > >> > That's a good direction, but I don't see why XDP caps belong in ethtool >> > at all? We use rtnetlink to manage the progs... >> >> You normally use ethtool to get all the other features a device support, >> don't you? > > Not really, please take a look at all the IFLA attributes. There's > a bunch of capabilities there. Ah, right, TIL. Well, putting this new property in rtnetlink instead of ethtool is fine by me as well :) -Toke
Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set
Jesper Dangaard Brouer writes: > On Fri, 4 Dec 2020 23:19:55 +0100 > Daniel Borkmann wrote: > >> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: >> > Daniel Borkmann writes: >> [...] >> >> We tried to standardize on a minimum guaranteed amount, but unfortunately >> >> not >> >> everyone seems to implement it, but I think it would be very useful to >> >> query >> >> this from application side, for example, consider that an app inserts a >> >> BPF >> >> prog at XDP doing custom encap shortly before XDP_TX so it would be >> >> useful to >> >> know which of the different encaps it implements are realistically >> >> possible on >> >> the underlying XDP supported dev. >> > >> > How many distinct values are there in reality? Enough to express this in >> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need >> > an additional field to get the exact value? If we implement the latter >> > we also run the risk of people actually implementing all sorts of weird >> > values, whereas if we constrain it to a few distinct values it's easier >> > to push back against adding new values (as it'll be obvious from the >> > addition of new flags). >> >> It's not everywhere straight forward to determine unfortunately, see also >> [0,1] >> as some data points where Jesper looked into in the past, so in some cases it >> might differ depending on the build/runtime config.. >> >>[0] >> https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ >>[1] >> https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ > > Yes, unfortunately drivers have already gotten creative in this area, > and variations have sneaked in. I remember that we were forced to > allow SFC driver to use 128 bytes headroom, to avoid a memory > corruption. I tried hard to have the minimum 192 bytes as it is 3 > cachelines, but I failed to enforce this. 
> It might be valuable to expose info on the driver's headroom size, as
> this will allow end-users to take advantage of it (instead of having
> to use the lowest common headroom) and up-front in userspace reject
> loading on e.g. SFC, which has this annoying limitation.
>
> BUT thinking about what the driver's headroom size MEANS to userspace,
> I'm not sure it is wise to give this info to userspace. The
> XDP-headroom is used for several kernel-internal things that limit the
> available space for growing packet-headroom. E.g. (1) xdp_frame is
> something that we likely need to grow (even though I'm pushing back),
> e.g. (2) the metadata area which Saeed is looking to populate from
> driver code (also reducing packet-headroom for encap-headers). So,
> userspace cannot use the XDP-headroom size too much...

(Ah, you had already replied, sorry - seems I missed that).

Can we calculate a number from the headroom that is meaningful for userspace? I suppose that would be "total number of bytes available for metadata+packet extension"? Even with growing data structures, any particular kernel should be able to inform userspace of the current value, no?

-Toke
Re: [PATCH v8 3/4] phy: Add Sparx5 ethernet serdes PHY driver
On 04.12.2020 15:16, Alexandre Belloni wrote:
> On 03/12/2020 22:52:53+0100, Andrew Lunn wrote:
> > > + if (macro->serdestype == SPX5_SDT_6G) {
> > > + value = sdx5_rd(priv, SD6G_LANE_LANE_DF(macro->stpidx));
> > > + analog_sd = SD6G_LANE_LANE_DF_PMA2PCS_RXEI_FILTERED_GET(value);
> > > + } else if (macro->serdestype == SPX5_SDT_10G) {
> > > + value = sdx5_rd(priv, SD10G_LANE_LANE_DF(macro->stpidx));
> > > + analog_sd = SD10G_LANE_LANE_DF_PMA2PCS_RXEI_FILTERED_GET(value);
> > > + } else {
> > > + value = sdx5_rd(priv, SD25G_LANE_LANE_DE(macro->stpidx));
> > > + analog_sd = SD25G_LANE_LANE_DE_LN_PMA_RXEI_GET(value);
> > > + }
> > > + /* Link up is when analog_sd == 0 */
> > > + return analog_sd;
> > > +}
> >
> > What i have not yet seen is how this code plugs together with
> > phylink_pcs_ops? Can this hardware also be used for SATA, USB? As far
> > as i understand, the Marvell Comphy is multi-purpose, it is used for
> > networking, USB, and SATA, etc. Making it a generic PHY then makes
> > sense, because different subsystems need to use it. But it looks like
> > this is for networking only? So i'm wondering if it belongs in
> > driver/net/pcs and it should be accessed using phylink_pcs_ops?
>
> Ocelot had PCIe on the phys, doesn't Sparx5 have it?
>
> --
> Alexandre Belloni, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com

Yes Ocelot has that, but on Sparx5 the PCIe is separate...

BR
Steen

---
Steen Hegelund
steen.hegel...@microchip.com
[PATCH] bpf: propagate __user annotations properly
__htab_map_lookup_and_delete_batch() stores a user pointer in the local variable ubatch and uses that in copy_{from,to}_user(), but ubatch lacks a __user annotation. So, sparse warns in the various assignments and uses of ubatch: kernel/bpf/hashtab.c:1415:24: warning: incorrect type in initializer (different address spaces) kernel/bpf/hashtab.c:1415:24: expected void *ubatch kernel/bpf/hashtab.c:1415:24: got void [noderef] __user * kernel/bpf/hashtab.c:1444:46: warning: incorrect type in argument 2 (different address spaces) kernel/bpf/hashtab.c:1444:46: expected void const [noderef] __user *from kernel/bpf/hashtab.c:1444:46: got void *ubatch kernel/bpf/hashtab.c:1608:16: warning: incorrect type in assignment (different address spaces) kernel/bpf/hashtab.c:1608:16: expected void *ubatch kernel/bpf/hashtab.c:1608:16: got void [noderef] __user * kernel/bpf/hashtab.c:1609:26: warning: incorrect type in argument 1 (different address spaces) kernel/bpf/hashtab.c:1609:26: expected void [noderef] __user *to kernel/bpf/hashtab.c:1609:26: got void *ubatch Add the __user annotation to repair this chain of propagating __user annotations in __htab_map_lookup_and_delete_batch(). Signed-off-by: Lukas Bulwahn --- applies cleanly on current master (v5.10-rc7) and next-20201204 BPF maintainers, please pick this minor non-urgent clean-up patch.
kernel/bpf/hashtab.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index fe7a0733a63a..76c791def033 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1412,7 +1412,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map, void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val; void __user *uvalues = u64_to_user_ptr(attr->batch.values); void __user *ukeys = u64_to_user_ptr(attr->batch.keys); - void *ubatch = u64_to_user_ptr(attr->batch.in_batch); + void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch); u32 batch, max_count, size, bucket_size; struct htab_elem *node_to_free = NULL; u64 elem_map_flags, map_flags; -- 2.17.1
Re: [PATCH 1/1] xdp: avoid calling kfree twice
On 2020-12-08 07:50, Zhu Yanjun wrote: From: Zhu Yanjun In the function xdp_umem_pin_pages, if npgs != umem->npgs and npgs >= 0, the function xdp_umem_unpin_pages is called. In this function, kfree is called to handle umem->pgs, and then in the function xdp_umem_pin_pages, kfree is called again to handle umem->pgs. Eventually, umem->pgs is freed twice. Hi Zhu, Thanks for the cleanup! kfree(NULL) is valid, so this is not a double-free, but still a nice cleanup! Signed-off-by: Zhu Yanjun --- net/xdp/xdp_umem.c | 17 + 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 56a28a686988..ff5173f72920 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -97,7 +97,6 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address) { unsigned int gup_flags = FOLL_WRITE; long npgs; - int err; umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL | __GFP_NOWARN); @@ -112,20 +111,14 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address) if (npgs != umem->npgs) { if (npgs >= 0) { umem->npgs = npgs; - err = -ENOMEM; - goto out_pin; + xdp_umem_unpin_pages(umem); + return -ENOMEM; } - err = npgs; - goto out_pgs; + kfree(umem->pgs); + umem->pgs = NULL; + return npgs; I'd like an explicit cast "(int)" here (-Wconversion). Please spin a v2 with the cast, with my: Acked-by: Björn Töpel added. Cheers! Björn } return 0; - -out_pin: - xdp_umem_unpin_pages(umem); -out_pgs: - kfree(umem->pgs); - umem->pgs = NULL; - return err; } static int xdp_umem_account_pages(struct xdp_umem *umem)
WARNING: ODEBUG bug in slave_kobj_release
Hello, syzbot found the following issue on: HEAD commit:34816d20 Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.ker.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=153f779d50 kernel config: https://syzkaller.appspot.com/x/.config?x=e49433cfed49b7d9 dashboard link: https://syzkaller.appspot.com/bug?extid=7bce4c2f7e1768ec3fe0 compiler: gcc (GCC) 10.1.0-syz 20200507 Unfortunately, I don't have any reproducer for this issue yet. IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+7bce4c2f7e1768ec3...@syzkaller.appspotmail.com kobject_add_internal failed for bonding_slave (error: -12 parent: veth213) [ cut here ] ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0 WARNING: CPU: 1 PID: 22707 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505 Modules linked in: CPU: 1 PID: 22707 Comm: syz-executor.4 Not tainted 5.10.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505 Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd 20 a2 9d 89 4c 89 ee 48 c7 c7 20 96 9d 89 e8 1e 0e f2 04 <0f> 0b 83 05 a5 87 32 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3 RSP: 0018:c9000e37e9a0 EFLAGS: 00010082 RAX: RBX: 0005 RCX: RDX: 0004 RSI: 8158c855 RDI: f52001c6fd26 RBP: 0001 R08: 0001 R09: 8880b9f2011b R10: R11: R12: 894d3be0 R13: 899d9ca0 R14: 815f15f0 R15: 192001c6fd3f FS: 7fc5d258d700() GS:8880b9f0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 00749138 CR3: 52e81000 CR4: 00350ee0 Call Trace: debug_object_assert_init lib/debugobjects.c:890 [inline] debug_object_assert_init+0x1f4/0x2e0 lib/debugobjects.c:861 debug_timer_assert_init kernel/time/timer.c:737 [inline] debug_assert_init kernel/time/timer.c:782 [inline] del_timer+0x6d/0x110 kernel/time/timer.c:1202 try_to_grab_pending+0x6d/0xd0 kernel/workqueue.c:1252 
__cancel_work_timer+0xa6/0x520 kernel/workqueue.c:3095 slave_kobj_release+0x48/0xe0 drivers/net/bonding/bond_main.c:1468 kobject_cleanup lib/kobject.c:705 [inline] kobject_release lib/kobject.c:736 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x1c8/0x540 lib/kobject.c:753 bond_kobj_init drivers/net/bonding/bond_main.c:1489 [inline] bond_alloc_slave drivers/net/bonding/bond_main.c:1506 [inline] bond_enslave+0x2488/0x4bf0 drivers/net/bonding/bond_main.c:1708 do_set_master+0x1c8/0x220 net/core/rtnetlink.c:2517 do_setlink+0x911/0x3a70 net/core/rtnetlink.c:2713 __rtnl_newlink+0xc1c/0x1740 net/core/rtnetlink.c:3374 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5562 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:671 sys_sendmsg+0x6e8/0x810 net/socket.c:2353 ___sys_sendmsg+0xf3/0x170 net/socket.c:2407 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2440 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45deb9 Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fc5d258cc78 EFLAGS: 0246 ORIG_RAX: 002e RAX: ffda RBX: 0002e740 RCX: 0045deb9 RDX: RSI: 2080 RDI: 0005 RBP: 7fc5d258cca0 R08: R09: R10: R11: 0246 R12: 0009 R13: 7ffdcf6b003f R14: 7fc5d258d9c0 R15: 0119bf2c --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue. 
See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
[PATCH net-next] net/af_iucv: use DECLARE_SOCKADDR to cast from sockaddr
This gets us compile-time size checking. Signed-off-by: Julian Wiedmann --- net/iucv/af_iucv.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c index db7d888914fa..882f028992c3 100644 --- a/net/iucv/af_iucv.c +++ b/net/iucv/af_iucv.c @@ -587,7 +587,7 @@ static void __iucv_auto_name(struct iucv_sock *iucv) static int iucv_sock_bind(struct socket *sock, struct sockaddr *addr, int addr_len) { - struct sockaddr_iucv *sa = (struct sockaddr_iucv *) addr; + DECLARE_SOCKADDR(struct sockaddr_iucv *, sa, addr); char uid[sizeof(sa->siucv_user_id)]; struct sock *sk = sock->sk; struct iucv_sock *iucv; @@ -691,7 +691,7 @@ static int iucv_sock_autobind(struct sock *sk) static int afiucv_path_connect(struct socket *sock, struct sockaddr *addr) { - struct sockaddr_iucv *sa = (struct sockaddr_iucv *) addr; + DECLARE_SOCKADDR(struct sockaddr_iucv *, sa, addr); struct sock *sk = sock->sk; struct iucv_sock *iucv = iucv_sk(sk); unsigned char user_data[16]; @@ -738,7 +738,7 @@ static int afiucv_path_connect(struct socket *sock, struct sockaddr *addr) static int iucv_sock_connect(struct socket *sock, struct sockaddr *addr, int alen, int flags) { - struct sockaddr_iucv *sa = (struct sockaddr_iucv *) addr; + DECLARE_SOCKADDR(struct sockaddr_iucv *, sa, addr); struct sock *sk = sock->sk; struct iucv_sock *iucv = iucv_sk(sk); int err; @@ -874,7 +874,7 @@ static int iucv_sock_accept(struct socket *sock, struct socket *newsock, static int iucv_sock_getname(struct socket *sock, struct sockaddr *addr, int peer) { - struct sockaddr_iucv *siucv = (struct sockaddr_iucv *) addr; + DECLARE_SOCKADDR(struct sockaddr_iucv *, siucv, addr); struct sock *sk = sock->sk; struct iucv_sock *iucv = iucv_sk(sk); -- 2.17.1
Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set
On Fri, 4 Dec 2020 16:21:08 +0100 Daniel Borkmann wrote: > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: > >> alar...@gmail.com writes: > >>> From: Marek Majtyka > >>> > >>> Implement support for checking what kind of xdp functionality a netdev > >>> supports. Previously, there was no way to do this other than to try > >>> to create an AF_XDP socket on the interface or load an XDP program and see > >>> if it worked. This commit changes this by adding a new variable which > >>> describes all xdp supported functions on pretty detailed level: > >> > >> I like the direction this is going! :) (Me too, don't get discouraged by our nitpicking, keep working on this! :-)) > >> > >>> - aborted > >>> - drop > >>> - pass > >>> - tx > > I strongly think we should _not_ merge any native XDP driver patchset > that does not support/implement the above return codes. I agree with the above statement. > Could we instead group them together and call this something like > XDP_BASE functionality to not give a wrong impression? I disagree. I can accept that XDP_BASE includes aborted+drop+pass. I think we need to keep the XDP_TX action separate, because I think that there are use-cases where we want to disable XDP_TX due to end-user policy or hardware limitations. Use-case(1): A cloud provider wants to give customers (running VMs) the ability to load an XDP program for DDoS protection (only), but doesn't want to allow the customer to use XDP_TX (which could implement LB or cheat their VM isolation policy). Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue resources, as the use-case is only DDoS. Today we have this problem with the ixgbe hardware, which cannot load XDP programs on systems with more than 192 CPUs. > If this is properly documented that these are basic must-have > _requirements_, then users and driver developers both know what the > expectations are. 
We can still document that XDP_TX is a must-have requirement, when a driver implements XDP. > >>> - redirect > >> -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH v5 5/6] net: dsa: microchip: Add Microchip KSZ8863 SPI based driver support
Add KSZ88X3 driver support. We add support for the KSZ88X3 three-port switches using the SPI interface. Reviewed-by: Florian Fainelli Signed-off-by: Michael Grzeschik --- v1 -> v2: - this glue was not implemented v2 -> v3: - this glue was part of previous bigger patch v3 -> v4: - this glue was moved to this separate patch v4 -> v5: - added reviewed by from f.fainelli - using device_get_match_data instead of own matching code --- drivers/net/dsa/microchip/ksz8795_spi.c | 44 ++--- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz8795_spi.c b/drivers/net/dsa/microchip/ksz8795_spi.c index 45420c07c99fc..708f8daaedbc2 100644 --- a/drivers/net/dsa/microchip/ksz8795_spi.c +++ b/drivers/net/dsa/microchip/ksz8795_spi.c @@ -14,34 +14,52 @@ #include #include +#include "ksz8.h" #include "ksz_common.h" -#define SPI_ADDR_SHIFT 12 -#define SPI_ADDR_ALIGN 3 -#define SPI_TURNAROUND_SHIFT 1 +#define KSZ8795_SPI_ADDR_SHIFT 12 +#define KSZ8795_SPI_ADDR_ALIGN 3 +#define KSZ8795_SPI_TURNAROUND_SHIFT 1 -KSZ_REGMAP_TABLE(ksz8795, 16, SPI_ADDR_SHIFT, -SPI_TURNAROUND_SHIFT, SPI_ADDR_ALIGN); +#define KSZ8863_SPI_ADDR_SHIFT 8 +#define KSZ8863_SPI_ADDR_ALIGN 8 +#define KSZ8863_SPI_TURNAROUND_SHIFT 0 + +KSZ_REGMAP_TABLE(ksz8795, 16, KSZ8795_SPI_ADDR_SHIFT, +KSZ8795_SPI_TURNAROUND_SHIFT, KSZ8795_SPI_ADDR_ALIGN); + +KSZ_REGMAP_TABLE(ksz8863, 16, KSZ8863_SPI_ADDR_SHIFT, +KSZ8863_SPI_TURNAROUND_SHIFT, KSZ8863_SPI_ADDR_ALIGN); static int ksz8795_spi_probe(struct spi_device *spi) { + const struct regmap_config *regmap_config; + struct device *ddev = &spi->dev; + struct ksz8 *ksz8; struct regmap_config rc; struct ksz_device *dev; - int i, ret; + int i, ret = 0; - dev = ksz_switch_alloc(&spi->dev, spi); + ksz8 = devm_kzalloc(&spi->dev, sizeof(struct ksz8), GFP_KERNEL); + ksz8->priv = spi; + + dev = ksz_switch_alloc(&spi->dev, ksz8); if (!dev) return -ENOMEM; + regmap_config = device_get_match_data(ddev); + if (!regmap_config) + return -EINVAL; + for (i = 0; i 
< ARRAY_SIZE(ksz8795_regmap_config); i++) { - rc = ksz8795_regmap_config[i]; + rc = regmap_config[i]; rc.lock_arg = &dev->regmap_mutex; dev->regmap[i] = devm_regmap_init_spi(spi, &rc); if (IS_ERR(dev->regmap[i])) { ret = PTR_ERR(dev->regmap[i]); dev_err(&spi->dev, "Failed to initialize regmap%i: %d\n", - ksz8795_regmap_config[i].val_bits, ret); + regmap_config[i].val_bits, ret); return ret; } } @@ -85,9 +103,11 @@ static void ksz8795_spi_shutdown(struct spi_device *spi) } static const struct of_device_id ksz8795_dt_ids[] = { - { .compatible = "microchip,ksz8765" }, - { .compatible = "microchip,ksz8794" }, - { .compatible = "microchip,ksz8795" }, + { .compatible = "microchip,ksz8765", .data = &ksz8795_regmap_config }, + { .compatible = "microchip,ksz8794", .data = &ksz8795_regmap_config }, + { .compatible = "microchip,ksz8795", .data = &ksz8795_regmap_config }, + { .compatible = "microchip,ksz8863", .data = &ksz8863_regmap_config }, + { .compatible = "microchip,ksz8873", .data = &ksz8863_regmap_config }, {}, }; MODULE_DEVICE_TABLE(of, ksz8795_dt_ids); -- 2.29.2
[PATCH v5 3/6] net: dsa: microchip: ksz8795: move register offsets and shifts to separate struct
In order to get this driver used with other switches the functions need to use different offsets and register shifts. This patch changes the direct use of the register defines to register description structures, which can be set depending on the chips register layout. Signed-off-by: Michael Grzeschik --- v1 -> v4: - extracted this change from bigger previous patch v4 -> v5: - added missing variables in ksz8_r_vlan_entries - moved shifts, masks and registers to arrays indexed by enums - using unsigned types where possible --- drivers/net/dsa/microchip/ksz8.h| 69 +++ drivers/net/dsa/microchip/ksz8795.c | 261 +--- drivers/net/dsa/microchip/ksz8795_reg.h | 85 3 files changed, 253 insertions(+), 162 deletions(-) create mode 100644 drivers/net/dsa/microchip/ksz8.h diff --git a/drivers/net/dsa/microchip/ksz8.h b/drivers/net/dsa/microchip/ksz8.h new file mode 100644 index 0..d3e89c27e22aa --- /dev/null +++ b/drivers/net/dsa/microchip/ksz8.h @@ -0,0 +1,69 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Microchip KSZ8XXX series register access + * + * Copyright (C) 2019 Pengutronix, Michael Grzeschik + */ + +#ifndef __KSZ8XXX_H +#define __KSZ8XXX_H +#include + +enum ksz_regs { + REG_IND_CTRL_0, + REG_IND_DATA_8, + REG_IND_DATA_CHECK, + REG_IND_DATA_HI, + REG_IND_DATA_LO, + REG_IND_MIB_CHECK, + P_FORCE_CTRL, + P_LINK_STATUS, + P_LOCAL_CTRL, + P_NEG_RESTART_CTRL, + P_REMOTE_STATUS, + P_SPEED_STATUS, + S_TAIL_TAG_CTRL, +}; + +enum ksz_masks { + PORT_802_1P_REMAPPING, + SW_TAIL_TAG_ENABLE, + MIB_COUNTER_OVERFLOW, + MIB_COUNTER_VALID, + VLAN_TABLE_FID, + VLAN_TABLE_MEMBERSHIP, + VLAN_TABLE_VALID, + STATIC_MAC_TABLE_VALID, + STATIC_MAC_TABLE_USE_FID, + STATIC_MAC_TABLE_FID, + STATIC_MAC_TABLE_OVERRIDE, + STATIC_MAC_TABLE_FWD_PORTS, + DYNAMIC_MAC_TABLE_ENTRIES_H, + DYNAMIC_MAC_TABLE_MAC_EMPTY, + DYNAMIC_MAC_TABLE_NOT_READY, + DYNAMIC_MAC_TABLE_ENTRIES, + DYNAMIC_MAC_TABLE_FID, + DYNAMIC_MAC_TABLE_SRC_PORT, + DYNAMIC_MAC_TABLE_TIMESTAMP, +}; + +enum ksz_shifts { + 
VLAN_TABLE_MEMBERSHIP_S, + VLAN_TABLE, + STATIC_MAC_FWD_PORTS, + STATIC_MAC_FID, + DYNAMIC_MAC_ENTRIES_H, + DYNAMIC_MAC_ENTRIES, + DYNAMIC_MAC_FID, + DYNAMIC_MAC_TIMESTAMP, + DYNAMIC_MAC_SRC_PORT, +}; + +struct ksz8 { + const u8 *regs; + const u32 *masks; + const u8 *shifts; + void *priv; +}; + +#endif diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 53cb41087a594..127498a9b8f72 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -20,6 +20,57 @@ #include "ksz_common.h" #include "ksz8795_reg.h" +#include "ksz8.h" + +static const u8 ksz8795_regs[] = { + [REG_IND_CTRL_0]= 0x6E, + [REG_IND_DATA_8]= 0x70, + [REG_IND_DATA_CHECK]= 0x72, + [REG_IND_DATA_HI] = 0x71, + [REG_IND_DATA_LO] = 0x75, + [REG_IND_MIB_CHECK] = 0x74, + [P_FORCE_CTRL] = 0x0C, + [P_LINK_STATUS] = 0x0E, + [P_LOCAL_CTRL] = 0x07, + [P_NEG_RESTART_CTRL]= 0x0D, + [P_REMOTE_STATUS] = 0x08, + [P_SPEED_STATUS]= 0x09, + [S_TAIL_TAG_CTRL] = 0x0C, +}; + +static const u32 ksz8795_masks[] = { + [PORT_802_1P_REMAPPING] = BIT(7), + [SW_TAIL_TAG_ENABLE]= BIT(1), + [MIB_COUNTER_OVERFLOW] = BIT(6), + [MIB_COUNTER_VALID] = BIT(5), + [VLAN_TABLE_FID]= GENMASK(6, 0), + [VLAN_TABLE_MEMBERSHIP] = GENMASK(11, 7), + [VLAN_TABLE_VALID] = BIT(12), + [STATIC_MAC_TABLE_VALID]= BIT(21), + [STATIC_MAC_TABLE_USE_FID] = BIT(23), + [STATIC_MAC_TABLE_FID] = GENMASK(30, 24), + [STATIC_MAC_TABLE_OVERRIDE] = BIT(26), + [STATIC_MAC_TABLE_FWD_PORTS]= GENMASK(24, 20), + [DYNAMIC_MAC_TABLE_ENTRIES_H] = GENMASK(6, 0), + [DYNAMIC_MAC_TABLE_MAC_EMPTY] = BIT(8), + [DYNAMIC_MAC_TABLE_NOT_READY] = BIT(7), + [DYNAMIC_MAC_TABLE_ENTRIES] = GENMASK(31, 29), + [DYNAMIC_MAC_TABLE_FID] = GENMASK(26, 20), + [DYNAMIC_MAC_TABLE_SRC_PORT]= GENMASK(26, 24), + [DYNAMIC_MAC_TABLE_TIMESTAMP] = GENMASK(28, 27), +}; + +static const u8 ksz8795_shifts[] = { + [VLAN_TABLE_MEMBERSHIP] = 7, + [VLAN_TABLE]= 16, + [STATIC_MAC_FWD_PORTS] = 16, + [STATIC_MAC_FID]= 24, + 
[DYNAMIC_MAC_ENTRIES_H] = 3, + [DYNAMIC_MAC_ENTRIES]
[PATCH v5 1/6] net: dsa: microchip: ksz8795: change drivers prefix to be generic
The driver can be used on other chips of this type. To reflect this we rename the drivers prefix from ksz8795 to ksz8. Signed-off-by: Michael Grzeschik --- v1 -> v4: - extracted this change from bigger previous patch v4 -> v5: - removed extra unavailable variables in ksz8_r_vlan_entries --- drivers/net/dsa/microchip/ksz8795.c | 222 drivers/net/dsa/microchip/ksz8795_spi.c | 2 +- drivers/net/dsa/microchip/ksz_common.h | 2 +- 3 files changed, 110 insertions(+), 116 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index c973db101b729..1d12597a1c8a4 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -74,7 +74,7 @@ static void ksz_port_cfg(struct ksz_device *dev, int port, int offset, u8 bits, bits, set ? bits : 0); } -static int ksz8795_reset_switch(struct ksz_device *dev) +static int ksz8_reset_switch(struct ksz_device *dev) { /* reset switch */ ksz_write8(dev, REG_POWER_MANAGEMENT_1, @@ -117,8 +117,7 @@ static void ksz8795_set_prio_queue(struct ksz_device *dev, int port, int queue) true); } -static void ksz8795_r_mib_cnt(struct ksz_device *dev, int port, u16 addr, - u64 *cnt) +static void ksz8_r_mib_cnt(struct ksz_device *dev, int port, u16 addr, u64 *cnt) { u16 ctrl_addr; u32 data; @@ -148,8 +147,8 @@ static void ksz8795_r_mib_cnt(struct ksz_device *dev, int port, u16 addr, mutex_unlock(&dev->alu_mutex); } -static void ksz8795_r_mib_pkt(struct ksz_device *dev, int port, u16 addr, - u64 *dropped, u64 *cnt) +static void ksz8_r_mib_pkt(struct ksz_device *dev, int port, u16 addr, + u64 *dropped, u64 *cnt) { u16 ctrl_addr; u32 data; @@ -195,7 +194,7 @@ static void ksz8795_r_mib_pkt(struct ksz_device *dev, int port, u16 addr, mutex_unlock(&dev->alu_mutex); } -static void ksz8795_freeze_mib(struct ksz_device *dev, int port, bool freeze) +static void ksz8_freeze_mib(struct ksz_device *dev, int port, bool freeze) { /* enable the port for flush/freeze function */ if (freeze) @@ -207,7 
+206,7 @@ static void ksz8795_freeze_mib(struct ksz_device *dev, int port, bool freeze) ksz_cfg(dev, REG_SW_CTRL_6, BIT(port), false); } -static void ksz8795_port_init_cnt(struct ksz_device *dev, int port) +static void ksz8_port_init_cnt(struct ksz_device *dev, int port) { struct ksz_port_mib *mib = &dev->ports[port].mib; @@ -235,8 +234,7 @@ static void ksz8795_port_init_cnt(struct ksz_device *dev, int port) memset(mib->counters, 0, dev->mib_cnt * sizeof(u64)); } -static void ksz8795_r_table(struct ksz_device *dev, int table, u16 addr, - u64 *data) +static void ksz8_r_table(struct ksz_device *dev, int table, u16 addr, u64 *data) { u16 ctrl_addr; @@ -248,8 +246,7 @@ static void ksz8795_r_table(struct ksz_device *dev, int table, u16 addr, mutex_unlock(&dev->alu_mutex); } -static void ksz8795_w_table(struct ksz_device *dev, int table, u16 addr, - u64 data) +static void ksz8_w_table(struct ksz_device *dev, int table, u16 addr, u64 data) { u16 ctrl_addr; @@ -261,7 +258,7 @@ static void ksz8795_w_table(struct ksz_device *dev, int table, u16 addr, mutex_unlock(&dev->alu_mutex); } -static int ksz8795_valid_dyn_entry(struct ksz_device *dev, u8 *data) +static int ksz8_valid_dyn_entry(struct ksz_device *dev, u8 *data) { int timeout = 100; @@ -284,9 +281,9 @@ static int ksz8795_valid_dyn_entry(struct ksz_device *dev, u8 *data) return 0; } -static int ksz8795_r_dyn_mac_table(struct ksz_device *dev, u16 addr, - u8 *mac_addr, u8 *fid, u8 *src_port, - u8 *timestamp, u16 *entries) +static int ksz8_r_dyn_mac_table(struct ksz_device *dev, u16 addr, + u8 *mac_addr, u8 *fid, u8 *src_port, + u8 *timestamp, u16 *entries) { u32 data_hi, data_lo; u16 ctrl_addr; @@ -298,7 +295,7 @@ static int ksz8795_r_dyn_mac_table(struct ksz_device *dev, u16 addr, mutex_lock(&dev->alu_mutex); ksz_write16(dev, REG_IND_CTRL_0, ctrl_addr); - rc = ksz8795_valid_dyn_entry(dev, &data); + rc = ksz8_valid_dyn_entry(dev, &data); if (rc == -EAGAIN) { if (addr == 0) *entries = 0; @@ -341,13 +338,13 @@ static int 
ksz8795_r_dyn_mac_table(struct ksz_device *dev, u16 addr, return rc; } -static int ksz8795_r_sta_mac_table(struct ksz_device *dev, u16 addr, - struct alu_struct *alu) +static int ksz8_r_sta_mac_table(struct ksz_device *dev, u16 addr, + struct alu_struct *alu) {
[PATCH v5 2/6] net: dsa: microchip: ksz8795: move cpu_select_interface to extra function
This patch moves the cpu interface selection code to an individual function specific to the ksz8795. It will make it simpler to customize the code path for different switches supported by this driver. Signed-off-by: Michael Grzeschik --- v1 -> v5: - extracted this from previous bigger patch --- drivers/net/dsa/microchip/ksz8795.c | 92 - 1 file changed, 50 insertions(+), 42 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 1d12597a1c8a4..53cb41087a594 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -911,10 +911,58 @@ static void ksz8_port_mirror_del(struct dsa_switch *ds, int port, PORT_MIRROR_SNIFFER, false); } +static void ksz8795_cpu_interface_select(struct ksz_device *dev, int port) +{ + struct ksz_port *p = &dev->ports[port]; + u8 data8; + + if (!p->interface && dev->compat_interface) { + dev_warn(dev->dev, +"Using legacy switch \"phy-mode\" property, because it is missing on port %d node. " +"Please update your device tree.\n", +port); + p->interface = dev->compat_interface; + } + + /* Configure MII interface for proper network communication. 
*/ + ksz_read8(dev, REG_PORT_5_CTRL_6, &data8); + data8 &= ~PORT_INTERFACE_TYPE; + data8 &= ~PORT_GMII_1GPS_MODE; + switch (p->interface) { + case PHY_INTERFACE_MODE_MII: + p->phydev.speed = SPEED_100; + break; + case PHY_INTERFACE_MODE_RMII: + data8 |= PORT_INTERFACE_RMII; + p->phydev.speed = SPEED_100; + break; + case PHY_INTERFACE_MODE_GMII: + data8 |= PORT_GMII_1GPS_MODE; + data8 |= PORT_INTERFACE_GMII; + p->phydev.speed = SPEED_1000; + break; + default: + data8 &= ~PORT_RGMII_ID_IN_ENABLE; + data8 &= ~PORT_RGMII_ID_OUT_ENABLE; + if (p->interface == PHY_INTERFACE_MODE_RGMII_ID || + p->interface == PHY_INTERFACE_MODE_RGMII_RXID) + data8 |= PORT_RGMII_ID_IN_ENABLE; + if (p->interface == PHY_INTERFACE_MODE_RGMII_ID || + p->interface == PHY_INTERFACE_MODE_RGMII_TXID) + data8 |= PORT_RGMII_ID_OUT_ENABLE; + data8 |= PORT_GMII_1GPS_MODE; + data8 |= PORT_INTERFACE_RGMII; + p->phydev.speed = SPEED_1000; + break; + } + ksz_write8(dev, REG_PORT_5_CTRL_6, data8); + p->phydev.duplex = 1; +} + static void ksz8_port_setup(struct ksz_device *dev, int port, bool cpu_port) { struct ksz_port *p = &dev->ports[port]; - u8 data8, member; + u8 member; /* enable broadcast storm limit */ ksz_port_cfg(dev, port, P_BCAST_STORM_CTRL, PORT_BROADCAST_STORM, true); @@ -931,47 +979,7 @@ static void ksz8_port_setup(struct ksz_device *dev, int port, bool cpu_port) ksz_port_cfg(dev, port, P_PRIO_CTRL, PORT_802_1P_ENABLE, true); if (cpu_port) { - if (!p->interface && dev->compat_interface) { - dev_warn(dev->dev, -"Using legacy switch \"phy-mode\" property, because it is missing on port %d node. " -"Please update your device tree.\n", -port); - p->interface = dev->compat_interface; - } - - /* Configure MII interface for proper network communication. 
*/ - ksz_read8(dev, REG_PORT_5_CTRL_6, &data8); - data8 &= ~PORT_INTERFACE_TYPE; - data8 &= ~PORT_GMII_1GPS_MODE; - switch (p->interface) { - case PHY_INTERFACE_MODE_MII: - p->phydev.speed = SPEED_100; - break; - case PHY_INTERFACE_MODE_RMII: - data8 |= PORT_INTERFACE_RMII; - p->phydev.speed = SPEED_100; - break; - case PHY_INTERFACE_MODE_GMII: - data8 |= PORT_GMII_1GPS_MODE; - data8 |= PORT_INTERFACE_GMII; - p->phydev.speed = SPEED_1000; - break; - default: - data8 &= ~PORT_RGMII_ID_IN_ENABLE; - data8 &= ~PORT_RGMII_ID_OUT_ENABLE; - if (p->interface == PHY_INTERFACE_MODE_RGMII_ID || - p->interface == PHY_INTERFACE_MODE_RGMII_RXID) - data8 |= PORT_RGMII_ID_IN_ENABLE; - if (p->interface == PHY_INTERFACE_MODE_RGMII_ID || - p->interface == PHY_INTERFACE_MODE_RGMII_TXID) -
[PATCH v5 0/6] microchip: add support for ksz88x3 driver family
This series adds support for the ksz88x3 driver family to the DSA-based ksz drivers. The series takes the already available ksz8795 driver and turns it into a generic driver for the ksz8-based chips, which have similar functions but a totally different register layout. This branch is to be rebased on net-next/master. The mainlining discussion history of this branch: v1: https://lore.kernel.org/netdev/20191107110030.25199-1-m.grzesc...@pengutronix.de/ v2: https://lore.kernel.org/netdev/20191218200831.13796-1-m.grzesc...@pengutronix.de/ v3: https://lore.kernel.org/netdev/20200508154343.6074-1-m.grzesc...@pengutronix.de/ v4: https://lore.kernel.org/netdev/20200803054442.20089-1-m.grzesc...@pengutronix.de/ Michael Grzeschik (6): net: dsa: microchip: ksz8795: change drivers prefix to be generic net: dsa: microchip: ksz8795: move cpu_select_interface to extra function net: dsa: microchip: ksz8795: move register offsets and shifts to separate struct net: dsa: microchip: ksz8795: add support for ksz88xx chips net: dsa: microchip: Add Microchip KSZ8863 SPI based driver support dt-bindings: net: dsa: document additional Microchip KSZ8863/8873 switch .../bindings/net/dsa/microchip,ksz.yaml | 2 + drivers/net/dsa/microchip/ksz8.h | 69 ++ drivers/net/dsa/microchip/ksz8795.c | 888 -- drivers/net/dsa/microchip/ksz8795_reg.h | 125 +-- drivers/net/dsa/microchip/ksz8795_spi.c | 46 +- drivers/net/dsa/microchip/ksz_common.h| 3 +- 6 files changed, 730 insertions(+), 403 deletions(-) create mode 100644 drivers/net/dsa/microchip/ksz8.h -- 2.29.2
[PATCH v5 6/6] dt-bindings: net: dsa: document additional Microchip KSZ8863/8873 switch
It is a 3-Port 10/100 Ethernet Switch. One CPU-Port and two Switch-Ports. Cc: devicet...@vger.kernel.org Reviewed-by: Andrew Lunn Acked-by: Rob Herring Reviewed-by: Florian Fainelli Signed-off-by: Michael Grzeschik --- v1 -> v3: - nothing changes - already Acked-by Rob Herring v1 -> v4: - nothing changes v4 -> v5: - nothing changes --- Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml b/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml index 9f7d131bbcef0..84985f53bffd4 100644 --- a/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml +++ b/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml @@ -21,6 +21,8 @@ properties: - microchip,ksz8765 - microchip,ksz8794 - microchip,ksz8795 + - microchip,ksz8863 + - microchip,ksz8873 - microchip,ksz9477 - microchip,ksz9897 - microchip,ksz9896 -- 2.29.2
[PATCH v5 4/6] net: dsa: microchip: ksz8795: add support for ksz88xx chips
We add support for the ksz8863 and ksz8873 chips which are using the same register patterns but other offsets as the ksz8795. Signed-off-by: Michael Grzeschik --- v1 -> v4: - extracted this change from bigger previous patch v4 -> v5: - added clear of reset bit for ksz8863 reset code - using extra device flag IS_KSZ88x3 instead of is_ksz8795 function - using DSA_TAG_PROTO_KSZ9893 protocol for ksz88x3 instead --- drivers/net/dsa/microchip/ksz8795.c | 345 +++- drivers/net/dsa/microchip/ksz8795_reg.h | 40 ++- drivers/net/dsa/microchip/ksz_common.h | 1 + 3 files changed, 299 insertions(+), 87 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 127498a9b8f72..9484667a29a35 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -22,6 +22,9 @@ #include "ksz8795_reg.h" #include "ksz8.h" +/* Used with variable features to indicate capabilities. */ +#define IS_88X3BIT(0) + static const u8 ksz8795_regs[] = { [REG_IND_CTRL_0]= 0x6E, [REG_IND_DATA_8]= 0x70, @@ -72,9 +75,60 @@ static const u8 ksz8795_shifts[] = { [DYNAMIC_MAC_SRC_PORT] = 24, }; -static const struct { +static const u8 ksz8863_regs[] = { + [REG_IND_CTRL_0]= 0x79, + [REG_IND_DATA_8]= 0x7B, + [REG_IND_DATA_CHECK]= 0x7B, + [REG_IND_DATA_HI] = 0x7C, + [REG_IND_DATA_LO] = 0x80, + [REG_IND_MIB_CHECK] = 0x80, + [P_FORCE_CTRL] = 0x0C, + [P_LINK_STATUS] = 0x0E, + [P_LOCAL_CTRL] = 0x0C, + [P_NEG_RESTART_CTRL]= 0x0D, + [P_REMOTE_STATUS] = 0x0E, + [P_SPEED_STATUS]= 0x0F, + [S_TAIL_TAG_CTRL] = 0x03, +}; + +static const u32 ksz8863_masks[] = { + [PORT_802_1P_REMAPPING] = BIT(3), + [SW_TAIL_TAG_ENABLE]= BIT(6), + [MIB_COUNTER_OVERFLOW] = BIT(7), + [MIB_COUNTER_VALID] = BIT(6), + [VLAN_TABLE_FID]= GENMASK(15, 12), + [VLAN_TABLE_MEMBERSHIP] = GENMASK(18, 16), + [VLAN_TABLE_VALID] = BIT(19), + [STATIC_MAC_TABLE_VALID]= BIT(19), + [STATIC_MAC_TABLE_USE_FID] = BIT(21), + [STATIC_MAC_TABLE_FID] = GENMASK(29, 26), + [STATIC_MAC_TABLE_OVERRIDE] = 
BIT(20), + [STATIC_MAC_TABLE_FWD_PORTS]= GENMASK(18, 16), + [DYNAMIC_MAC_TABLE_ENTRIES_H] = GENMASK(5, 0), + [DYNAMIC_MAC_TABLE_MAC_EMPTY] = BIT(7), + [DYNAMIC_MAC_TABLE_NOT_READY] = BIT(7), + [DYNAMIC_MAC_TABLE_ENTRIES] = GENMASK(31, 28), + [DYNAMIC_MAC_TABLE_FID] = GENMASK(19, 16), + [DYNAMIC_MAC_TABLE_SRC_PORT]= GENMASK(21, 20), + [DYNAMIC_MAC_TABLE_TIMESTAMP] = GENMASK(23, 22), +}; + +static u8 ksz8863_shifts[] = { + [VLAN_TABLE_MEMBERSHIP] = 16, + [STATIC_MAC_FWD_PORTS] = 16, + [STATIC_MAC_FID]= 22, + [DYNAMIC_MAC_ENTRIES_H] = 3, + [DYNAMIC_MAC_ENTRIES] = 24, + [DYNAMIC_MAC_FID] = 16, + [DYNAMIC_MAC_TIMESTAMP] = 24, + [DYNAMIC_MAC_SRC_PORT] = 20, +}; + +struct mib_names { char string[ETH_GSTRING_LEN]; -} mib_names[] = { +}; + +static const struct mib_names ksz87xx_mib_names[] = { { "rx_hi" }, { "rx_undersize" }, { "rx_fragments" }, @@ -113,6 +167,43 @@ static const struct { { "tx_discards" }, }; +static const struct mib_names ksz88xx_mib_names[] = { + { "rx" }, + { "rx_hi" }, + { "rx_undersize" }, + { "rx_fragments" }, + { "rx_oversize" }, + { "rx_jabbers" }, + { "rx_symbol_err" }, + { "rx_crc_err" }, + { "rx_align_err" }, + { "rx_mac_ctrl" }, + { "rx_pause" }, + { "rx_bcast" }, + { "rx_mcast" }, + { "rx_ucast" }, + { "rx_64_or_less" }, + { "rx_65_127" }, + { "rx_128_255" }, + { "rx_256_511" }, + { "rx_512_1023" }, + { "rx_1024_1522" }, + { "tx" }, + { "tx_hi" }, + { "tx_late_col" }, + { "tx_pause" }, + { "tx_bcast" }, + { "tx_mcast" }, + { "tx_ucast" }, + { "tx_deferred" }, + { "tx_total_col" }, + { "tx_exc_col" }, + { "tx_single_col" }, + { "tx_mult_col" }, + { "rx_discards" }, + { "tx_discards" }, +}; + static void ksz_cfg(struct ksz_device *dev, u32 addr, u8 bits, bool set) { regmap_update_bits(dev->regmap[0], addr, bits, set ? bits : 0); @@ -127,10 +218,18 @@ static void ksz_port_cfg(struct ksz_device *dev, int port, int offset, u8 bits, static int ksz8_reset_switch(struct ksz_device *dev)
Re: Why the auxiliary cipher in gss_krb5_crypto.c?
On Mon, 7 Dec 2020 at 13:02, David Howells wrote: > > Ard Biesheuvel wrote: > > > > Yeah - the problem with that is that for sunrpc, we might be dealing with > > > 1MB > > > plus bits of non-contiguous pages, requiring >8K of scatterlist elements > > > (admittedly, we can chain them, but we may have to do one or more large > > > allocations). > > > > > > > However, I would recommend against it: > > > > > > Sorry, recommend against what? > > > > > > > Recommend against the current approach of manipulating the input like > > this and feeding it into the skcipher piecemeal. > > Right. I understand the problem, but as I mentioned above, the scatterlist > itself becomes a performance issue as it may exceed two pages in size. Double > that as there may need to be separate input and output scatterlists. > I wasn't aware that Herbert's work hadn't been merged yet. So that means it is entirely reasonable to split the input like this and feed the first part into a cbc(aes) skcipher and the last part into a cts(cbc(aes)) skcipher, provided that you ensure that the last part covers the final two blocks (one full block and one block that is either full or partial) With Herbert's changes, you will be able to use the same skcipher, and pass a flag to all but the final part that more data is coming. But for lack of that, the current approach is optimal for cases where having to cover the entire input with a single scatterlist is undesirable. > > Herbert recently made some changes for MSG_MORE support in the AF_ALG > > code, which permits a skcipher encryption to be split into several > > invocations of the skcipher layer without the need for this complexity > > on the side of the caller. Maybe there is a way to reuse that here. > > Herbert? > > I wonder if it would help if the input buffer and output buffer didn't have to > correspond exactly in usage - ie. the output buffer could be used at a slower > rate than the input to allow for buffering inside the crypto algorithm. 
> I don't follow - how could one be used at a slower rate? > > > Can you also do SHA at the same time in the same loop? > > > > SHA-1 or HMAC-SHA1? The latter could probably be modeled as an AEAD. > > The former doesn't really fit the current API so we'd have to invent > > something for it. > > The hashes corresponding to the kerberos enctypes I'm supporting are: > > HMAC-SHA1 for aes128-cts-hmac-sha1-96 and aes256-cts-hmac-sha1-96. > > HMAC-SHA256 for aes128-cts-hmac-sha256-128 > > HMAC-SHA384 for aes256-cts-hmac-sha384-192 > > CMAC-CAMELLIA for camellia128-cts-cmac and camellia256-cts-cmac > > I'm not sure you can support all of those with the instructions available. > It depends on whether the caller can make use of the authenc() pattern, which is a type of AEAD we support. There are numerous implementations of authenc(hmac(shaXXX),cbc(aes)), including h/w accelerated ones, but none that implement ciphertext stealing. So that means that, even if you manage to use the AEAD layer to perform both at the same time, the generic authenc() template will perform the cts(cbc(aes)) and hmac(shaXXX) by calling into skciphers and ahashes, respectively, which won't give you any benefit until accelerated implementations turn up that perform the whole operation in one pass over the input. And even then, I don't think the performance benefit will be worth it.
[PATCH net-next 2/6] s390/ccwgroup: use bus->dev_groups for bus-based sysfs attributes
Bus drivers have their own way of describing the sysfs attributes that all
devices on a bus should provide. Switch ccwgroup_attr_groups over to use
bus->dev_groups, and thus free up dev->groups for usage by the ccwgroup
device drivers.

While adjusting the attribute naming, use ATTRIBUTE_GROUPS() to get rid of
some boilerplate code.

Signed-off-by: Julian Wiedmann
Acked-by: Heiko Carstens
---
 drivers/s390/cio/ccwgroup.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/cio/ccwgroup.c b/drivers/s390/cio/ccwgroup.c
index 483a9ecfcbb1..444385da5792 100644
--- a/drivers/s390/cio/ccwgroup.c
+++ b/drivers/s390/cio/ccwgroup.c
@@ -210,18 +210,12 @@ static ssize_t ccwgroup_ungroup_store(struct device *dev,
 static DEVICE_ATTR(ungroup, 0200, NULL, ccwgroup_ungroup_store);
 static DEVICE_ATTR(online, 0644, ccwgroup_online_show, ccwgroup_online_store);
 
-static struct attribute *ccwgroup_attrs[] = {
+static struct attribute *ccwgroup_dev_attrs[] = {
 	&dev_attr_online.attr,
 	&dev_attr_ungroup.attr,
 	NULL,
 };
-static struct attribute_group ccwgroup_attr_group = {
-	.attrs = ccwgroup_attrs,
-};
-static const struct attribute_group *ccwgroup_attr_groups[] = {
-	&ccwgroup_attr_group,
-	NULL,
-};
+ATTRIBUTE_GROUPS(ccwgroup_dev);
 
 static void ccwgroup_ungroup_workfn(struct work_struct *work)
 {
@@ -384,7 +378,6 @@ int ccwgroup_create_dev(struct device *parent, struct ccwgroup_driver *gdrv,
 	}
 	dev_set_name(&gdev->dev, "%s", dev_name(&gdev->cdev[0]->dev));
-	gdev->dev.groups = ccwgroup_attr_groups;
 
 	if (gdrv) {
 		gdev->dev.driver = &gdrv->driver;
@@ -487,6 +480,7 @@ static void ccwgroup_shutdown(struct device *dev)
 
 static struct bus_type ccwgroup_bus_type = {
 	.name = "ccwgroup",
+	.dev_groups = ccwgroup_dev_groups,
 	.remove = ccwgroup_remove,
 	.shutdown = ccwgroup_shutdown,
 };
-- 
2.17.1
[PATCH net-next 5/6] s390/qeth: remove QETH_QDIO_BUF_HANDLED_DELAYED state
Reuse the QETH_QDIO_BUF_EMPTY state to indicate that a TX buffer has been
completed with a QAOB notification, and may be cleaned up by
qeth_cleanup_handled_pending().

Signed-off-by: Julian Wiedmann
---
 drivers/s390/net/qeth_core.h      | 2 --
 drivers/s390/net/qeth_core_main.c | 5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index d150da95d073..6f5ddc3eab8c 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -424,8 +424,6 @@ enum qeth_qdio_out_buffer_state {
 	/* Received QAOB notification on CQ: */
 	QETH_QDIO_BUF_QAOB_OK,
 	QETH_QDIO_BUF_QAOB_ERROR,
-	/* Handled via transfer pending / completion queue. */
-	QETH_QDIO_BUF_HANDLED_DELAYED,
 };
 
 struct qeth_qdio_out_buffer {
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 869694217450..da27ef451d05 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -477,8 +477,7 @@ static void qeth_cleanup_handled_pending(struct qeth_qdio_out_q *q, int bidx,
 	while (c) {
 		if (forced_cleanup ||
-		    atomic_read(&c->state) ==
-		    QETH_QDIO_BUF_HANDLED_DELAYED) {
+		    atomic_read(&c->state) == QETH_QDIO_BUF_EMPTY) {
 			struct qeth_qdio_out_buffer *f = c;
 
 			QETH_CARD_TEXT(f->q->card, 5, "fp");
@@ -549,7 +548,7 @@ static void qeth_qdio_handle_aob(struct qeth_card *card,
 			kmem_cache_free(qeth_core_header_cache, data);
 		}
 
-		atomic_set(&buffer->state, QETH_QDIO_BUF_HANDLED_DELAYED);
+		atomic_set(&buffer->state, QETH_QDIO_BUF_EMPTY);
 		break;
 	default:
 		WARN_ON_ONCE(1);
-- 
2.17.1
[PATCH net-next 3/6] s390/qeth: use dev->groups for common sysfs attributes
All qeth devices have a minimum set of sysfs attributes, and non-OSN devices share a group of additional attributes. Depending on whether the device is forced to use a specific discipline, the device_type then specifies further attributes. Shift the common attributes into dev->groups, so that the device_type only contains the discipline-specific attributes. This avoids exposing the common attributes to the disciplines, and nicely cleans up our sysfs code. While replacing the qeth_l*_*_device_attributes() helpers, switch from sysfs_*_groups() to the more generic device_*_groups(). Signed-off-by: Julian Wiedmann --- drivers/s390/net/qeth_core.h | 6 ++--- drivers/s390/net/qeth_core_main.c | 7 -- drivers/s390/net/qeth_core_sys.c | 41 ++- drivers/s390/net/qeth_l2.h| 2 -- drivers/s390/net/qeth_l2_main.c | 4 +-- drivers/s390/net/qeth_l2_sys.c| 19 -- drivers/s390/net/qeth_l3.h| 2 -- drivers/s390/net/qeth_l3_main.c | 4 +-- drivers/s390/net/qeth_l3_sys.c| 21 9 files changed, 30 insertions(+), 76 deletions(-) diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h index 69b474f8735e..d150da95d073 100644 --- a/drivers/s390/net/qeth_core.h +++ b/drivers/s390/net/qeth_core.h @@ -1063,10 +1063,8 @@ extern const struct qeth_discipline qeth_l2_discipline; extern const struct qeth_discipline qeth_l3_discipline; extern const struct ethtool_ops qeth_ethtool_ops; extern const struct ethtool_ops qeth_osn_ethtool_ops; -extern const struct attribute_group *qeth_generic_attr_groups[]; -extern const struct attribute_group *qeth_osn_attr_groups[]; -extern const struct attribute_group qeth_device_attr_group; -extern const struct attribute_group qeth_device_blkt_group; +extern const struct attribute_group *qeth_dev_groups[]; +extern const struct attribute_group *qeth_osn_dev_groups[]; extern const struct device_type qeth_generic_devtype; const char *qeth_get_cardname_short(struct qeth_card *); diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c 
index 8171b9d3a70e..05d0b16bd7d6 100644 --- a/drivers/s390/net/qeth_core_main.c +++ b/drivers/s390/net/qeth_core_main.c @@ -6375,13 +6375,11 @@ void qeth_core_free_discipline(struct qeth_card *card) const struct device_type qeth_generic_devtype = { .name = "qeth_generic", - .groups = qeth_generic_attr_groups, }; EXPORT_SYMBOL_GPL(qeth_generic_devtype); static const struct device_type qeth_osn_devtype = { .name = "qeth_osn", - .groups = qeth_osn_attr_groups, }; #define DBF_NAME_LEN 20 @@ -6561,6 +6559,11 @@ static int qeth_core_probe_device(struct ccwgroup_device *gdev) if (rc) goto err_chp_desc; + if (IS_OSN(card)) + gdev->dev.groups = qeth_osn_dev_groups; + else + gdev->dev.groups = qeth_dev_groups; + enforced_disc = qeth_enforce_discipline(card); switch (enforced_disc) { case QETH_DISCIPLINE_UNDETERMINED: diff --git a/drivers/s390/net/qeth_core_sys.c b/drivers/s390/net/qeth_core_sys.c index 4441b3393eaf..a0f777f76f66 100644 --- a/drivers/s390/net/qeth_core_sys.c +++ b/drivers/s390/net/qeth_core_sys.c @@ -640,23 +640,17 @@ static struct attribute *qeth_blkt_device_attrs[] = { &dev_attr_inter_jumbo.attr, NULL, }; -const struct attribute_group qeth_device_blkt_group = { + +static const struct attribute_group qeth_dev_blkt_group = { .name = "blkt", .attrs = qeth_blkt_device_attrs, }; -EXPORT_SYMBOL_GPL(qeth_device_blkt_group); -static struct attribute *qeth_device_attrs[] = { - &dev_attr_state.attr, - &dev_attr_chpid.attr, - &dev_attr_if_name.attr, - &dev_attr_card_type.attr, +static struct attribute *qeth_dev_extended_attrs[] = { &dev_attr_inbuf_size.attr, &dev_attr_portno.attr, &dev_attr_portname.attr, &dev_attr_priority_queueing.attr, - &dev_attr_buffer_count.attr, - &dev_attr_recover.attr, &dev_attr_performance_stats.attr, &dev_attr_layer2.attr, &dev_attr_isolation.attr, @@ -664,18 +658,12 @@ static struct attribute *qeth_device_attrs[] = { &dev_attr_switch_attrs.attr, NULL, }; -const struct attribute_group qeth_device_attr_group = { - .attrs = qeth_device_attrs, 
-}; -EXPORT_SYMBOL_GPL(qeth_device_attr_group); -const struct attribute_group *qeth_generic_attr_groups[] = { - &qeth_device_attr_group, - &qeth_device_blkt_group, - NULL, +static const struct attribute_group qeth_dev_extended_group = { + .attrs = qeth_dev_extended_attrs, }; -static struct attribute *qeth_osn_device_attrs[] = { +static struct attribute *qeth_dev_attrs[] = { &dev_attr_state.attr, &dev_attr_chpid.attr, &dev_attr_if_name.attr, @@ -684,10 +672,19 @@ static struct attribute *qeth_osn_device_attrs[] = { &dev_attr_re
[PATCH net-next 4/6] s390/qeth: don't replace a fully completed async TX buffer
For TX buffers that require an additional async notification via QAOB, the TX completion code can now manage all the necessary processing if the notification has already occurred (or is occurring concurrently). In such cases we can avoid replacing the metadata that is associated with the buffer's slot on the ring, and just keep using the current one. As qeth_clear_output_buffer() will also handle any kmem cache-allocated memory that was mapped into the TX buffer, qeth_qdio_handle_aob() doesn't need to worry about it. While at it, also remove the unneeded forward declaration for qeth_init_qdio_out_buf(). Signed-off-by: Julian Wiedmann --- drivers/s390/net/qeth_core_main.c | 89 ++- 1 file changed, 51 insertions(+), 38 deletions(-) diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c index 05d0b16bd7d6..869694217450 100644 --- a/drivers/s390/net/qeth_core_main.c +++ b/drivers/s390/net/qeth_core_main.c @@ -75,7 +75,6 @@ static void qeth_notify_skbs(struct qeth_qdio_out_q *queue, enum iucv_tx_notify notification); static void qeth_tx_complete_buf(struct qeth_qdio_out_buffer *buf, bool error, int budget); -static int qeth_init_qdio_out_buf(struct qeth_qdio_out_q *, int); static void qeth_close_dev_handler(struct work_struct *work) { @@ -517,18 +516,6 @@ static void qeth_qdio_handle_aob(struct qeth_card *card, buffer = (struct qeth_qdio_out_buffer *) aob->user1; QETH_CARD_TEXT_(card, 5, "%lx", aob->user1); - /* Free dangling allocations. The attached skbs are handled by -* qeth_cleanup_handled_pending(). 
-*/ - for (i = 0; -i < aob->sb_count && i < QETH_MAX_BUFFER_ELEMENTS(card); -i++) { - void *data = phys_to_virt(aob->sba[i]); - - if (data && buffer->is_header[i]) - kmem_cache_free(qeth_core_header_cache, data); - } - if (aob->aorc) { QETH_CARD_TEXT_(card, 2, "aorc%02X", aob->aorc); new_state = QETH_QDIO_BUF_QAOB_ERROR; @@ -536,10 +523,9 @@ static void qeth_qdio_handle_aob(struct qeth_card *card, switch (atomic_xchg(&buffer->state, new_state)) { case QETH_QDIO_BUF_PRIMED: - /* Faster than TX completion code. */ - notification = qeth_compute_cq_notification(aob->aorc, 0); - qeth_notify_skbs(buffer->q, buffer, notification); - atomic_set(&buffer->state, QETH_QDIO_BUF_HANDLED_DELAYED); + /* Faster than TX completion code, let it handle the async +* completion for us. +*/ break; case QETH_QDIO_BUF_PENDING: /* TX completion code is active and will handle the async @@ -550,6 +536,19 @@ static void qeth_qdio_handle_aob(struct qeth_card *card, /* TX completion code is already finished. */ notification = qeth_compute_cq_notification(aob->aorc, 1); qeth_notify_skbs(buffer->q, buffer, notification); + + /* Free dangling allocations. The attached skbs are handled by +* qeth_cleanup_handled_pending(). 
+*/ + for (i = 0; +i < aob->sb_count && i < QETH_MAX_BUFFER_ELEMENTS(card); +i++) { + void *data = phys_to_virt(aob->sba[i]); + + if (data && buffer->is_header[i]) + kmem_cache_free(qeth_core_header_cache, data); + } + atomic_set(&buffer->state, QETH_QDIO_BUF_HANDLED_DELAYED); break; default: @@ -6078,9 +6077,13 @@ static void qeth_iqd_tx_complete(struct qeth_qdio_out_q *queue, QDIO_OUTBUF_STATE_FLAG_PENDING)) { WARN_ON_ONCE(card->options.cq != QETH_CQ_ENABLED); - if (atomic_cmpxchg(&buffer->state, QETH_QDIO_BUF_PRIMED, - QETH_QDIO_BUF_PENDING) == - QETH_QDIO_BUF_PRIMED) { + QETH_CARD_TEXT_(card, 5, "pel%u", bidx); + + switch (atomic_cmpxchg(&buffer->state, + QETH_QDIO_BUF_PRIMED, + QETH_QDIO_BUF_PENDING)) { + case QETH_QDIO_BUF_PRIMED: + /* We have initial ownership, no QAOB (yet): */ qeth_notify_skbs(queue, buffer, TX_NOTIFY_PENDING); /* Handle race with qeth_qdio_handle_aob(): */ @@ -6088,39 +6091,49 @@ static void qeth_iqd_tx_complete(struct qeth_qdio_out_q *queue, QETH_QDIO_BUF_NEED_QAOB)) { case QETH_QDIO_BUF_PENDING: /* No concurrent QAOB notification. */ - break; + +
[PATCH net-next 0/6] s390/qeth: updates 2020-12-07
Hi Jakub, please apply the following patch series for qeth to netdev's net-next tree. Some sysfs cleanups (with the prep work in ccwgroup acked by Heiko), and a few improvements to the code that deals with async TX completion notifications for IQD devices. This also brings the missing patch from the previous net-next submission. Thanks, Julian Julian Wiedmann (6): s390/qeth: don't call INIT_LIST_HEAD() on iob's list entry s390/ccwgroup: use bus->dev_groups for bus-based sysfs attributes s390/qeth: use dev->groups for common sysfs attributes s390/qeth: don't replace a fully completed async TX buffer s390/qeth: remove QETH_QDIO_BUF_HANDLED_DELAYED state s390/qeth: make qeth_qdio_handle_aob() more robust drivers/s390/cio/ccwgroup.c | 12 +--- drivers/s390/net/qeth_core.h | 10 +-- drivers/s390/net/qeth_core_main.c | 111 +- drivers/s390/net/qeth_core_sys.c | 41 +-- drivers/s390/net/qeth_l2.h| 2 - drivers/s390/net/qeth_l2_main.c | 4 +- drivers/s390/net/qeth_l2_sys.c| 19 - drivers/s390/net/qeth_l3.h| 2 - drivers/s390/net/qeth_l3_main.c | 4 +- drivers/s390/net/qeth_l3_sys.c| 21 -- 10 files changed, 92 insertions(+), 134 deletions(-) -- 2.17.1
[PATCH net-next 6/6] s390/qeth: make qeth_qdio_handle_aob() more robust
When qeth_qdio_handle_aob() frees dangling allocations in the notified TX
buffer, there are rare tear-down cases where qeth_drain_output_queue()
would later call qeth_clear_output_buffer() for the same buffer - and thus
end up walking the buffer a second time to check for dangling kmem_cache
allocations.

Luckily current code previously scrubs such a buffer, so
qeth_clear_output_buffer() would find buf->buffer->element[i].addr as NULL
and not do anything. But this is fragile, and we can easily improve it by
consistently clearing the ->is_header flag after freeing the allocation.

Signed-off-by: Julian Wiedmann
---
 drivers/s390/net/qeth_core_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index da27ef451d05..f4b60294a969 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -546,6 +546,7 @@ static void qeth_qdio_handle_aob(struct qeth_card *card,
 			if (data && buffer->is_header[i])
 				kmem_cache_free(qeth_core_header_cache, data);
+			buffer->is_header[i] = 0;
 		}
 
 		atomic_set(&buffer->state, QETH_QDIO_BUF_EMPTY);
-- 
2.17.1
[PATCH net-next 1/6] s390/qeth: don't call INIT_LIST_HEAD() on iob's list entry
INIT_LIST_HEAD() only needs to be called on actual list heads. While at it clarify the naming of the field. Suggested-by: Vasily Gorbik Signed-off-by: Julian Wiedmann --- drivers/s390/net/qeth_core.h | 2 +- drivers/s390/net/qeth_core_main.c | 9 - 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h index 0e9af2fbaa76..69b474f8735e 100644 --- a/drivers/s390/net/qeth_core.h +++ b/drivers/s390/net/qeth_core.h @@ -624,7 +624,7 @@ struct qeth_reply { }; struct qeth_cmd_buffer { - struct list_head list; + struct list_head list_entry; struct completion done; spinlock_t lock; unsigned int length; diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c index 319190824cd2..8171b9d3a70e 100644 --- a/drivers/s390/net/qeth_core_main.c +++ b/drivers/s390/net/qeth_core_main.c @@ -615,7 +615,7 @@ static void qeth_enqueue_cmd(struct qeth_card *card, struct qeth_cmd_buffer *iob) { spin_lock_irq(&card->lock); - list_add_tail(&iob->list, &card->cmd_waiter_list); + list_add_tail(&iob->list_entry, &card->cmd_waiter_list); spin_unlock_irq(&card->lock); } @@ -623,7 +623,7 @@ static void qeth_dequeue_cmd(struct qeth_card *card, struct qeth_cmd_buffer *iob) { spin_lock_irq(&card->lock); - list_del(&iob->list); + list_del(&iob->list_entry); spin_unlock_irq(&card->lock); } @@ -977,7 +977,7 @@ static void qeth_clear_ipacmd_list(struct qeth_card *card) QETH_CARD_TEXT(card, 4, "clipalst"); spin_lock_irqsave(&card->lock, flags); - list_for_each_entry(iob, &card->cmd_waiter_list, list) + list_for_each_entry(iob, &card->cmd_waiter_list, list_entry) qeth_notify_cmd(iob, -ECANCELED); spin_unlock_irqrestore(&card->lock, flags); } @@ -1047,7 +1047,6 @@ struct qeth_cmd_buffer *qeth_alloc_cmd(struct qeth_channel *channel, init_completion(&iob->done); spin_lock_init(&iob->lock); - INIT_LIST_HEAD(&iob->list); refcount_set(&iob->ref_count, 1); iob->channel = channel; iob->timeout = timeout; @@ -1094,7 +1093,7 @@ 
static void qeth_issue_next_read_cb(struct qeth_card *card, /* match against pending cmd requests */ spin_lock_irqsave(&card->lock, flags); - list_for_each_entry(tmp, &card->cmd_waiter_list, list) { + list_for_each_entry(tmp, &card->cmd_waiter_list, list_entry) { if (tmp->match && tmp->match(tmp, iob)) { request = tmp; /* take the object outside the lock */ -- 2.17.1
[PATCH v2 bpf-next 00/13] Socket migration for SO_REUSEPORT.
The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation[1]. When a SYN packet is received, the connection is tied
to a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.

This situation can happen when various server management tools restart
server (such as nginx) processes. For instance, when we change nginx
configurations and restart it, it spins up new workers that respect the
new configuration and closes all listeners on the old workers, so the
in-flight ACKs of the 3WHS are answered with RST.

The SO_REUSEPORT option is excellent for improving scalability. On the
other hand, as a trade-off, users have to know in detail how the kernel
handles SYN packets and implement connection draining with eBPF[2]:

  1. Stop routing SYN packets to the listener by eBPF.
  2. Wait for all timers to expire to complete requests.
  3. Accept connections until EAGAIN, then close the listener.

or

  1. Start counting SYN packets and accept() syscalls using an eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.

Either way, we cannot close a listener immediately. Ideally, however, the
application should not need to drain the not-yet-accepted sockets, because
the 3WHS and the tying of a connection to a listener are just kernel
behaviour. The root cause is within the kernel, so the issue should be
addressed in kernel space and should not be visible to user space. This
patchset fixes it so that users need not take care of the kernel
implementation or connection draining. With this patchset, the kernel
redistributes requests and connections from a listener to other listeners
in the same reuseport group at/after close() or shutdown() syscalls.
Although some software does connection draining, there are still merits in migration. For some security reasons such as replacing TLS certificates, we may want to apply new settings as soon as possible and/or we may not be able to wait for connection draining. The sockets in the accept queue have not started application sessions yet. So, if we do not drain such sockets, they can be handled by the newer listeners and could have a longer lifetime. It is difficult to drain all connections in every case, but we can decrease such aborted connections by migration. In that sense, migration is always better than draining. Moreover, auto-migration simplifies userspace logic and also works well in a case where we cannot modify and build a server program to implement the workaround. Note that the source and destination listeners MUST have the same settings at the socket API level; otherwise, applications may face inconsistency and cause errors. In such a case, we have to use eBPF program to select a specific listener or to cancel migration. 
Link: [1] The SO_REUSEPORT socket option https://lwn.net/Articles/542629/ [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode https://lore.kernel.org/netdev/1458828813.10868.65.ca...@edumazet-glaptop3.roam.corp.google.com/ Changelog: v2: * Do not save closed sockets in socks[] * Revert 607904c357c61adf20b8fd18af765e501d61a385 * Extract inet_csk_reqsk_queue_migrate() into a single patch * Change the spin_lock order to avoid lockdep warning * Add static to __reuseport_select_sock * Use refcount_inc_not_zero() in reuseport_select_migrated_sock() * Set the default attach type in bpf_prog_load_check_attach() * Define new proto of BPF_FUNC_get_socket_cookie * Fix test to be compiled successfully * Update commit messages v1: https://lore.kernel.org/netdev/20201201144418.35045-1-kun...@amazon.co.jp/ * Remove the sysctl option * Enable migration if eBPF progam is not attached * Add expected_attach_type to check if eBPF program can migrate sockets * Add a field to tell migration type to eBPF program * Support BPF_FUNC_get_socket_cookie to get the cookie of sk * Allocate an empty skb if skb is NULL * Pass req_to_sk(req)->sk_hash because listener's hash is zero * Update commit messages and coverletter RFC: https://lore.kernel.org/netdev/20201117094023.3685-1-kun...@amazon.co.jp/ Kuniyuki Iwashima (13): tcp: Allow TCP_CLOSE sockets to hold the reuseport group. bpf: Define migration types for SO_REUSEPORT. Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested() API" tcp: Introduce inet_csk_reqsk_queue_migrate(). tcp: Set the new listener to migrated TFO requests. tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. tcp: Migrate TCP_NEW_SYN_RECV requests. bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT. libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT. bpf: Add migration to sk_reuseport_(kern|md). bpf: Support BPF
[PATCH v2 bpf-next 01/13] tcp: Allow TCP_CLOSE sockets to hold the reuseport group.
This patch is a preparation for migrating incoming connections in the
later commits; it adds a field (num_closed_socks) to struct sock_reuseport
to allow TCP_CLOSE sockets to keep access to the reuseport group.

When we close a listening socket and want to migrate its connections to
another listener in the same reuseport group, we have to handle two kinds
of child sockets: those the listening socket has a reference to, and those
it does not. The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets
sitting in the accept queue of their listening socket, so we can pop them
out and push them into another listener's queue at close() or shutdown()
syscalls. The latter, TCP_NEW_SYN_RECV sockets, are still in the three-way
handshake and not in the accept queue; we cannot reach them at close() or
shutdown(), and so have to migrate such immature sockets after their
listening socket has been closed.

Currently, once their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK is received or SYN+ACKs are
retransmitted. At that point, if we could select a new listener from the
same reuseport group, no connection would be aborted. However, this is
impossible today because reuseport_detach_sock() sets sk_reuseport_cb to
NULL and forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to hold sk_reuseport_cb while any
child sockets still reference them. The point is that
reuseport_detach_sock() is called twice, from inet_unhash() and from
sk_destruct(). The first call decrements num_socks and increments
num_closed_socks. Later, when all migrated connections have been accepted,
the second call decrements num_closed_socks and sets sk_reuseport_cb to
NULL. With this change, closed sockets can keep sk_reuseport_cb until all
child requests have been freed or accepted.
Consequently calling listen() after shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or inet_csk_bind_conflict() which expect that such sockets should not have the reuseport group. Therefore, this patch also loosens such validation rules so that the socket can listen again if it has the same reuseport group with other listening sockets. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h| 5 +++-- net/core/sock_reuseport.c | 39 +++-- net/ipv4/inet_connection_sock.c | 7 -- 3 files changed, 35 insertions(+), 16 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 505f1e18e9bf..0e558ca7afbf 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock; struct sock_reuseport { struct rcu_head rcu; - u16 max_socks; /* length of socks */ - u16 num_socks; /* elements in socks */ + u16 max_socks; /* length of socks */ + u16 num_socks; /* elements in socks */ + u16 num_closed_socks; /* closed elements in socks */ /* The last synq overflow event timestamp of this * reuse->socks[] group. 
*/ diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index bbdd3c7b6cb5..c26f4256ff41 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -98,14 +98,15 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) return NULL; more_reuse->num_socks = reuse->num_socks; + more_reuse->num_closed_socks = reuse->num_closed_socks; more_reuse->prog = reuse->prog; more_reuse->reuseport_id = reuse->reuseport_id; more_reuse->bind_inany = reuse->bind_inany; more_reuse->has_conns = reuse->has_conns; + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); - more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); for (i = 0; i < reuse->num_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, @@ -152,8 +153,10 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) reuse = rcu_dereference_protected(sk2->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb, -lockdep_is_held(&reuseport_lock)); - if (old_reuse && old_reuse->num_socks != 1) { + lockdep_is_held(&reuseport_lock)); + if (old_reuse == reuse) { + reuse->num_closed_socks--; + } else if (old_reuse && old_reuse->num_socks != 1) { spin_unlock_bh(&reuseport_lock); retu
Re: [PATCH 3/7] net: macb: unprepare clocks in case of failure
Hi Andrew,

On 05.12.2020 16:30, Andrew Lunn wrote:
> On Fri, Dec 04, 2020 at 02:34:17PM +0200, Claudiu Beznea wrote:
>> Unprepare clocks in case of any failure in fu540_c000_clk_init().
>
> Hi Claudiu
>
> Nice patchset. Simple to understand.
>
>> +err_disable_clocks:
>> +	clk_disable_unprepare(*tx_clk);
>> +	clk_disable_unprepare(*hclk);
>> +	clk_disable_unprepare(*pclk);
>> +	clk_disable_unprepare(*rx_clk);
>> +	clk_disable_unprepare(*tsu_clk);
>
> This looks correct, but it would be more symmetrical to add a
>
> macb_clk_uninit()
>
> function for the four main clocks. I'm surprised it does not already
> exist.

I was torn between adding it and not adding it, given that the existing
disable/unprepare paths do not handle all the clocks in the same way
everywhere. Anyway, I will add one function for the main clocks, as you
proposed, in the next version.

Thank you for your review,
Claudiu

>
> Andrew
>
[PATCH v2 bpf-next 02/13] bpf: Define migration types for SO_REUSEPORT.
As noted in the preceding commit, there are two migration types. In addition to that, the kernel will run the same eBPF program to select a listener for SYN packets. This patch defines three types to signal the kernel and the eBPF program if it is receiving a new request or migrating ESTABLISHED/SYN_RECV sockets in the accept queue or NEW_SYN_RECV socket during 3WHS. Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 14 ++ tools/include/uapi/linux/bpf.h | 14 ++ 2 files changed, 28 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1233f14f659f..7a48e0055500 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4423,6 +4423,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + *the accept queue at close() or shutdown(). + * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + *final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. It begins from diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1233f14f659f..7a48e0055500 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4423,6 +4423,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + *the accept queue at close() or shutdown(). 
+ * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + *final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. It begins from -- 2.17.2 (Apple Git-113)
[PATCH v2 bpf-next 03/13] Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested() API"
This reverts commit 607904c357c61adf20b8fd18af765e501d61a385 to use spin_lock_bh_nested() in the next commit. Link: https://lore.kernel.org/netdev/9d290a57-49e1-04cd-2487-262b0d7c5...@gmail.com/ Signed-off-by: Kuniyuki Iwashima CC: Waiman Long --- include/linux/spinlock.h | 8 include/linux/spinlock_api_smp.h | 2 ++ include/linux/spinlock_api_up.h | 1 + kernel/locking/spinlock.c| 8 4 files changed, 19 insertions(+) diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h index 79897841a2cc..c020b375a071 100644 --- a/include/linux/spinlock.h +++ b/include/linux/spinlock.h @@ -227,6 +227,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock_nested(lock, subclass) +# define raw_spin_lock_bh_nested(lock, subclass) \ + _raw_spin_lock_bh_nested(lock, subclass) # define raw_spin_lock_nest_lock(lock, nest_lock) \ do { \ @@ -242,6 +244,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock(((void)(subclass), (lock))) # define raw_spin_lock_nest_lock(lock, nest_lock) _raw_spin_lock(lock) +# define raw_spin_lock_bh_nested(lock, subclass) _raw_spin_lock_bh(lock) #endif #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) @@ -369,6 +372,11 @@ do { \ raw_spin_lock_nested(spinlock_check(lock), subclass); \ } while (0) +#define spin_lock_bh_nested(lock, subclass)\ +do { \ + raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\ +} while (0) + #define spin_lock_nest_lock(lock, nest_lock) \ do { \ raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock); \ diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h index 19a9be9d97ee..d565fb6304f2 100644 --- a/include/linux/spinlock_api_smp.h +++ b/include/linux/spinlock_api_smp.h @@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr); void __lockfunc _raw_spin_lock(raw_spinlock_t 
*lock) __acquires(lock); void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) __acquires(lock); +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) + __acquires(lock); void __lockfunc _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map) __acquires(lock); diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h index d0d188861ad6..d3afef9d8dbe 100644 --- a/include/linux/spinlock_api_up.h +++ b/include/linux/spinlock_api_up.h @@ -57,6 +57,7 @@ #define _raw_spin_lock(lock) __LOCK(lock) #define _raw_spin_lock_nested(lock, subclass) __LOCK(lock) +#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock) #define _raw_read_lock(lock) __LOCK(lock) #define _raw_write_lock(lock) __LOCK(lock) #define _raw_spin_lock_bh(lock)__LOCK_BH(lock) diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c index 0ff08380f531..48e99ed1bdd8 100644 --- a/kernel/locking/spinlock.c +++ b/kernel/locking/spinlock.c @@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) } EXPORT_SYMBOL(_raw_spin_lock_nested); +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) +{ + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); +} +EXPORT_SYMBOL(_raw_spin_lock_bh_nested); + unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) { -- 2.17.2 (Apple Git-113)
[PATCH v2 bpf-next 04/13] tcp: Introduce inet_csk_reqsk_queue_migrate().
This patch defines a new function to migrate ESTABLISHED/SYN_RECV sockets. Listening sockets hold incoming connections as a linked list of struct request_sock in the accept queue, and each request has a reference to its full socket and listener. In inet_csk_reqsk_queue_migrate(), we only unlink the requests from the closing listener's queue and relink them to the head of the new listener's queue. We do not process each request and its reference to the listener, so the migration completes in O(1) time complexity. Moreover, if TFO requests caused an RST before the 3WHS completed, they are held in the listener's TFO queue to prevent a DDoS attack. Thus, we also migrate the requests in the TFO queue in the same way. After the 3WHS has completed, there are three access patterns to incoming sockets: (1) access to the full socket instead of request_sock, (2) access to request_sock from the accept queue, (3) access to request_sock from the TFO queue. In the first case, the full socket does not have a reference to its request socket and listener, so we do not need the correct listener set in the request socket. In the second case, we always have the correct listener and currently do not use req->rsk_listener. However, in the third case of TCP_SYN_RECV sockets, we take special care in the next commit. 
Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 1 + net/ipv4/inet_connection_sock.c| 68 ++ 2 files changed, 69 insertions(+) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 7338b3865a2a..2ea2d743f8fc 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk, struct sock *inet_csk_reqsk_queue_add(struct sock *sk, struct request_sock *req, struct sock *child); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk); void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req, unsigned long timeout); struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 1451aa9712b0..5da38a756e4c 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -992,6 +992,74 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk, } EXPORT_SYMBOL(inet_csk_reqsk_queue_add); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) +{ + struct request_sock_queue *old_accept_queue, *new_accept_queue; + struct fastopen_queue *old_fastopenq, *new_fastopenq; + spinlock_t *l1, *l2, *l3, *l4; + + old_accept_queue = &inet_csk(sk)->icsk_accept_queue; + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; + old_fastopenq = &old_accept_queue->fastopenq; + new_fastopenq = &new_accept_queue->fastopenq; + + l1 = &old_accept_queue->rskq_lock; + l2 = &new_accept_queue->rskq_lock; + l3 = &old_fastopenq->lock; + l4 = &new_fastopenq->lock; + + /* sk is never selected as the new listener from reuse->socks[], +* so inversion deadlock does not happen here, +* but change the order to avoid the warning of lockdep. 
+*/ + if (sk < nsk) { + swap(l1, l2); + swap(l3, l4); + } + + spin_lock(l1); + spin_lock_nested(l2, SINGLE_DEPTH_NESTING); + + if (old_accept_queue->rskq_accept_head) { + if (new_accept_queue->rskq_accept_head) + old_accept_queue->rskq_accept_tail->dl_next = + new_accept_queue->rskq_accept_head; + else + new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail; + + new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head; + old_accept_queue->rskq_accept_head = NULL; + old_accept_queue->rskq_accept_tail = NULL; + + WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog); + WRITE_ONCE(sk->sk_ack_backlog, 0); + } + + spin_unlock(l2); + spin_unlock(l1); + + spin_lock_bh(l3); + spin_lock_bh_nested(l4, SINGLE_DEPTH_NESTING); + + new_fastopenq->qlen += old_fastopenq->qlen; + old_fastopenq->qlen = 0; + + if (old_fastopenq->rskq_rst_head) { + if (new_fastopenq->rskq_rst_head) + old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head; + else + old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail; + + new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head; + old_fastopenq->rskq_rst_head
[PATCH v2 bpf-next 05/13] tcp: Set the new listener to migrated TFO requests.
A TFO request socket is only freed after BOTH 3WHS has completed (or aborted) and the child socket has been accepted (or its listener has been closed). Hence, depending on the order, there can be two kinds of request sockets in the accept queue. 3WHS -> accept : TCP_ESTABLISHED accept -> 3WHS : TCP_SYN_RECV Unlike TCP_ESTABLISHED socket, accept() does not free the request socket for TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove(). Also, it accesses request_sock.rsk_listener. So, in order to complete TFO socket migration, we have to set the current listener to it at accept() before reqsk_fastopen_remove(). Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/ipv4/inet_connection_sock.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 5da38a756e4c..143590858c2e 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) tcp_rsk(req)->tfo_listener) { spin_lock_bh(&queue->fastopenq.lock); if (tcp_rsk(req)->tfo_listener) { + if (req->rsk_listener != sk) { + /* TFO request was migrated to another listener so +* the new listener must be used in reqsk_fastopen_remove() +* to hold requests which cause RST. +*/ + sock_put(req->rsk_listener); + sock_hold(sk); + req->rsk_listener = sk; + } + /* We are still waiting for the final ACK from 3WHS * so can't free req now. Instead, we set req->sk to * NULL to signify that the child socket is taken @@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req, if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) { BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req); - BUG_ON(sk != req->rsk_listener); /* Paranoid, to prevent race condition if * an inbound pkt destined for child is -- 2.17.2 (Apple Git-113)
[PATCH v2 bpf-next 06/13] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
This patch lets reuseport_detach_sock() return a struct sock pointer, which is used only by inet_unhash(). If it is not NULL, inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV sockets from the closing listener to the selected one. By default, the kernel selects a new listener randomly. In order to pick out a different socket every time, we select the last element of socks[] as the new listener. This behaviour is based on how the kernel moves sockets in socks[]. (See also [1]) Basically, in order to redistribute sockets evenly, we have to use an eBPF program called in a later commit, but as a side effect of this default selection, the kernel can redistribute old requests evenly to new listeners for the specific case where the application replaces listeners by generations. For example, we call listen() for four sockets (A, B, C, D), and close() the first two in turn. The sockets move in socks[] like below.

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a request (a, b, c, d) in its accept queue, we can redistribute old requests evenly to new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)          socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.      =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they MUST have the same settings at the socket API level; otherwise, unexpected errors may happen. For instance, if only the new listeners have TCP_SAVE_SYN, old requests do not hold SYN data, so the application will face inconsistency and hit an error. Therefore, if there are different kinds of sockets, we must attach an eBPF program described in later commits. 
Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dh...@mail.gmail.com/ Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h | 2 +- net/core/sock_reuseport.c| 16 +--- net/ipv4/inet_hashtables.c | 9 +++-- 3 files changed, 21 insertions(+), 6 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 0e558ca7afbf..09a1b1539d4c 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -31,7 +31,7 @@ struct sock_reuseport { extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); -extern void reuseport_detach_sock(struct sock *sk); +extern struct sock *reuseport_detach_sock(struct sock *sk); extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index c26f4256ff41..2de42f8103ea 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -184,9 +184,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } EXPORT_SYMBOL(reuseport_add_sock); -void reuseport_detach_sock(struct sock *sk) +struct sock *reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; + struct bpf_prog *prog; + struct sock *nsk = NULL; int i; spin_lock_bh(&reuseport_lock); @@ -215,17 +217,25 @@ void reuseport_detach_sock(struct sock *sk) reuse->num_socks--; reuse->socks[i] = reuse->socks[reuse->num_socks]; + prog = rcu_dereference_protected(reuse->prog, + lockdep_is_held(&reuseport_lock)); + + if (sk->sk_protocol == IPPROTO_TCP) { + if (reuse->num_socks && !prog) + nsk = i == reuse->num_socks ? 
reuse->socks[i - 1] : reuse->socks[i]; - if (sk->sk_protocol == IPPROTO_TCP) reuse->num_closed_socks++; - else + } else { rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + } } if (reuse->num_socks + reuse->num_closed_socks == 0) call_rcu(&reuse->rcu, reuseport_free_rcu); spin_unlock_bh(&reuseport_lock); + + return nsk; } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 45fb450b4522..545538a6bfac 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk) { struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; struct inet_listen_hashbucket *ilb = NULL; + struct sock *nsk; spinlock_t *lock
[PATCH v2 bpf-next 07/13] tcp: Migrate TCP_NEW_SYN_RECV requests.
This patch renames reuseport_select_sock() to __reuseport_select_sock() and adds two wrapper function of it to pass the migration type defined in the previous commit. reuseport_select_sock : BPF_SK_REUSEPORT_MIGRATE_NO reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV requests at receiving the final ACK or sending a SYN+ACK. Therefore, this patch also changes the code to call reuseport_select_migrated_sock() even if the listening socket is TCP_CLOSE. If we can pick out a listening socket from the reuseport group, we rewrite request_sock.rsk_listener and resume processing the request. Link: https://lore.kernel.org/bpf/202012020136.bf0z4guu-...@intel.com/ Reported-by: kernel test robot Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 11 include/net/request_sock.h | 13 ++ include/net/sock_reuseport.h | 8 +++--- net/core/sock_reuseport.c | 40 -- net/ipv4/inet_connection_sock.c| 13 -- net/ipv4/tcp_ipv4.c| 9 +-- net/ipv6/tcp_ipv6.c| 9 +-- 7 files changed, 86 insertions(+), 17 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 2ea2d743f8fc..d8c3be31e987 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -272,6 +272,17 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk) reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); } +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk, +struct sock *nsk, +struct request_sock *req) +{ + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue, +&inet_csk(nsk)->icsk_accept_queue, +req); + sock_put(sk); + req->rsk_listener = nsk; +} + static inline int inet_csk_reqsk_queue_len(const struct sock *sk) { return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue); diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 29e41ff3ec93..d18ba0b857cc 100644 --- 
a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue) atomic_inc(&queue->qlen); } +static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue, + struct request_sock_queue *new_accept_queue, + const struct request_sock *req) +{ + atomic_dec(&old_accept_queue->qlen); + atomic_inc(&new_accept_queue->qlen); + + if (req->num_timeout == 0) { + atomic_dec(&old_accept_queue->young); + atomic_inc(&new_accept_queue->young); + } +} + static inline int reqsk_queue_len(const struct request_sock_queue *queue) { return atomic_read(&queue->qlen); diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 09a1b1539d4c..a48259a974be 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); extern struct sock *reuseport_detach_sock(struct sock *sk); -extern struct sock *reuseport_select_sock(struct sock *sk, - u32 hash, - struct sk_buff *skb, - int hdr_len); +extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len); +extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, + struct sk_buff *skb); extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog); extern int reuseport_detach_prog(struct sock *sk); diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 2de42f8103ea..1011c3756c92 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -170,7 +170,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } reuse->socks[reuse->num_socks] = sk; - /* paired with smp_rmb() in reuseport_select_sock() */ + /* paired with smp_rmb() in __reuseport_select_sock() */ smp_wmb(); reuse->num_socks++; 
rcu_assign_pointer(sk->sk_reuseport_cb, reuse); @@ -277,12 +277,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks, * @hdr_len: BPF filter expects skb data pointer at payload data. If *the skb d
[PATCH v2 bpf-next 09/13] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
This commit introduces a new section (sk_reuseport/migrate) and sets the expected_attach_type for each of the two sections of the BPF_PROG_TYPE_SK_REUSEPORT program. Signed-off-by: Kuniyuki Iwashima --- tools/lib/bpf/libbpf.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 9be88a90a4aa..ba64c891a5e7 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8471,7 +8471,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec, static const struct bpf_sec_def section_defs[] = { BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), - BPF_PROG_SEC("sk_reuseport",BPF_PROG_TYPE_SK_REUSEPORT), + BPF_EAPROG_SEC("sk_reuseport/migrate", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE), + BPF_EAPROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT), SEC_DEF("kprobe/", KPROBE, .attach_fn = attach_kprobe), BPF_PROG_SEC("uprobe/", BPF_PROG_TYPE_KPROBE), -- 2.17.2 (Apple Git-113)
[PATCH v2 bpf-next 10/13] bpf: Add migration to sk_reuseport_(kern|md).
This patch adds u8 migration field to sk_reuseport_kern and sk_reuseport_md to signal the eBPF program if the kernel calls it for selecting a listener for SYN or migrating sockets in the accept queue or an immature socket during 3WHS. Note that this field is accessible only if the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6t...@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/linux/bpf.h| 1 + include/linux/filter.h | 4 ++-- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 15 --- net/core/sock_reuseport.c | 2 +- tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 18 insertions(+), 6 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d05e75ed8c1b..cdeb27f4ad63 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1914,6 +1914,7 @@ struct sk_reuseport_kern { u32 hash; u32 reuseport_id; bool bind_inany; + u8 migration; }; bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); diff --git a/include/linux/filter.h b/include/linux/filter.h index 1b62397bd124..15d5bf13a905 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -967,12 +967,12 @@ void bpf_warn_invalid_xdp_action(u32 act); #ifdef CONFIG_INET struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash); + u32 hash, u8 migration); #else static inline struct sock * bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, -u32 hash) +u32 hash, u8 migration) { return NULL; } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c7f6848c0226..cf518e83df5c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4462,6 +4462,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. 
IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u8 migration; /* Migration type */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 77001a35768f..7bdf62f24044 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9860,7 +9860,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, struct sock_reuseport *reuse, struct sock *sk, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { reuse_kern->skb = skb; reuse_kern->sk = sk; @@ -9869,16 +9869,17 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; reuse_kern->bind_inany = reuse->bind_inany; + reuse_kern->migration = migration; } struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { struct sk_reuseport_kern reuse_kern; enum sk_action action; - bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash); + bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action = BPF_PROG_RUN(prog, &reuse_kern); if (action == SK_PASS) @@ -10017,6 +10018,10 @@ sk_reuseport_is_valid_access(int off, int size, case offsetof(struct sk_reuseport_md, hash): return size == size_default; + case bpf_ctx_range(struct sk_reuseport_md, migration): + return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && + size == sizeof(__u8); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10089,6 +10094,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, bind_inany): SK_REUSEPORT_LOAD_FIELD(bind_inany); break; + + case offsetof(struct 
sk_reuseport_md, migration): + SK_REUSEPORT_LOAD_FIELD(migration); + break; } return insn - insn_buf; diff
[PATCH v2 bpf-next 11/13] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.
We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing in order to select the new listener. Currently, we can get a unique ID for each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the sk pointer available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f...@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 8 net/core/filter.c | 22 ++ tools/include/uapi/linux/bpf.h | 8 3 files changed, 38 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index cf518e83df5c..a688a7a4fe85 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1655,6 +1655,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. + * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4463,6 +4470,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? 
*/ __u32 hash; /* A hash of the packet 4 tuples */ __u8 migration; /* Migration type */ + __bpf_md_ptr(struct bpf_sock *, sk); /* Current listening socket */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 7bdf62f24044..9f7018e3f545 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4631,6 +4631,18 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_pointer_cookie, struct sock *, sk) +{ + return __sock_gen_cookie(sk); +} + +static const struct bpf_func_proto bpf_get_socket_pointer_cookie_proto = { + .func = bpf_get_socket_pointer_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_SOCKET, +}; + BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) { return __sock_gen_cookie(ctx->sk); @@ -9989,6 +10001,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id, return &sk_reuseport_load_bytes_proto; case BPF_FUNC_skb_load_bytes_relative: return &sk_reuseport_load_bytes_relative_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_pointer_cookie_proto; default: return bpf_base_func_proto(func_id); } @@ -10022,6 +10036,10 @@ sk_reuseport_is_valid_access(int off, int size, return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && size == sizeof(__u8); + case offsetof(struct sk_reuseport_md, sk): + info->reg_type = PTR_TO_SOCKET; + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10098,6 +10116,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, migration): SK_REUSEPORT_LOAD_FIELD(migration); break; + + case offsetof(struct sk_reuseport_md, sk): + SK_REUSEPORT_LOAD_FIELD(sk); + break; } return insn - insn_buf; diff --git a/tools/include/uapi/linux/bpf.h 
b/tools/include/uapi/linux/bpf.h index cf518e83df5c..a688a7a4fe85 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1655,6 +1655,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. + * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4463,6 +4470,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? */
[PATCH v2 bpf-next 12/13] bpf: Call bpf_run_sk_reuseport() for socket migration.
This patch supports socket migration by eBPF. If the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. There are two noteworthy points. The first is that we select a listening socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. However, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. The second is that we do not have struct request_sock in unhash path, and the sk_hash of the listener is always zero. So we pass zero as hash to bpf_run_sk_reuseport(). Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/core/filter.c | 19 +++ net/core/sock_reuseport.c | 21 +++-- net/ipv4/inet_hashtables.c | 2 +- 3 files changed, 31 insertions(+), 11 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 9f7018e3f545..53fa3bcbf00f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9890,10 +9890,29 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, { struct sk_reuseport_kern reuse_kern; enum sk_action action; + bool allocated = false; + + if (migration) { + /* cancel migration for possibly incapable eBPF program */ + if (prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE) + return ERR_PTR(-ENOTSUPP); + + if (!skb) { + allocated = true; + skb = alloc_skb(0, GFP_ATOMIC); + if (!skb) + return ERR_PTR(-ENOMEM); + } + } else if (!skb) { + return NULL; /* fall back to select by hash */ + } bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action 
= BPF_PROG_RUN(prog, &reuse_kern); + if (allocated) + kfree_skb(skb); + if (action == SK_PASS) return reuse_kern.selected_sk; else diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index b877c8e552d2..2358e8896199 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -221,8 +221,15 @@ struct sock *reuseport_detach_sock(struct sock *sk) lockdep_is_held(&reuseport_lock)); if (sk->sk_protocol == IPPROTO_TCP) { - if (reuse->num_socks && !prog) - nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + if (reuse->num_socks) { + if (prog) + nsk = bpf_run_sk_reuseport(reuse, sk, prog, NULL, 0, + BPF_SK_REUSEPORT_MIGRATE_QUEUE); + + if (!nsk) + nsk = i == reuse->num_socks ? + reuse->socks[i - 1] : reuse->socks[i]; + } reuse->num_closed_socks++; } else { @@ -306,15 +313,9 @@ static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, if (!prog) goto select_by_hash; - if (migration) - goto out; - - if (!skb) - goto select_by_hash; - if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration); - else + else if (!skb) sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); select_by_hash: @@ -352,7 +353,7 @@ struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, struct sock *nsk; nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST); - if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt))) + if (!IS_ERR_OR_NULL(nsk) && likely(refcount_inc_not_zero(&nsk->sk_refcnt))) return nsk; return NULL; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 545538a6bfac..59f58740c20d 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -699,7 +699,7 @@ void inet_unhash(struct sock *sk) if (rcu_access_pointer(sk->sk_reuseport_cb)) { nsk = reuseport_detach_sock(sk); - if
[PATCH v2 bpf-next 08/13] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, the kernel runs it for socket migration only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. The kernel will change the behaviour depending on the returned value: - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fall back to the random selection - SK_DROP, cancel the migration Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6t...@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 13 + tools/include/uapi/linux/bpf.h | 2 ++ 3 files changed, 17 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7a48e0055500..c7f6848c0226 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0cd3cc2af9c1..0737673c727c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1920,6 +1920,11 @@ static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr) attr->expected_attach_type = BPF_CGROUP_INET_SOCK_CREATE; break; + case BPF_PROG_TYPE_SK_REUSEPORT: + if (!attr->expected_attach_type) + attr->expected_attach_type = + BPF_SK_REUSEPORT_SELECT; + break; } } @@ -2003,6 +2008,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, if (expected_attach_type == BPF_SK_LOOKUP) return 0; return -EINVAL; + case BPF_PROG_TYPE_SK_REUSEPORT: + switch (expected_attach_type) { + case BPF_SK_REUSEPORT_SELECT: + case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: + return 0; + default: + return -EINVAL; + } case BPF_PROG_TYPE_EXT: if (expected_attach_type) return 
-EINVAL; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7a48e0055500..c7f6848c0226 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; -- 2.17.2 (Apple Git-113)
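[Editor's note: for readers following along, a minimal sketch of an eBPF program using the new BPF_SK_REUSEPORT_SELECT_OR_MIGRATE attach type might look as follows. The map layout and names are illustrative only, not taken from this series; it needs a clang BPF build against a patched kernel, so it is a sketch rather than something runnable here.]

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} reuseport_map SEC(".maps");

SEC("sk_reuseport")
int select_or_migrate(struct sk_reuseport_md *md)
{
	__u32 index = 0;

	/* Pick the listener stored at slot 0 of the sockarray.
	 * Returning SK_DROP here instead would cancel the
	 * migration, per the semantics described above. */
	if (bpf_sk_select_reuseport(md, &reuseport_map, &index, 0))
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```

The program would be loaded with expected_attach_type set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and attached via SO_ATTACH_REUSEPORT_EBPF, as the selftest later in this series demonstrates.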
Re: [PATCH v3 0/7] Improve s0ix flows for systems i219LM
Hi, On 12/4/20 9:09 PM, Mario Limonciello wrote: > commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME > systems") > disabled s0ix flows for systems that have various incarnations of the > i219-LM ethernet controller. This was done because of some regressions > caused by an earlier > commit 632fbd5eb5b0e ("e1000e: fix S0ix flows for cable connected case") > with i219-LM controller. > > Performing suspend to idle with these ethernet controllers requires a properly > configured system. To make enabling such systems easier, this patch > series allows determining if enabled and turning on using ethtool. > > The flows have also been confirmed to be configured correctly on Dell's > Latitude > and Precision CML systems containing the i219-LM controller, when the kernel > also > contains the fix for s0i3.2 entry previously submitted here and now part of > this > series. > https://marc.info/?l=linux-netdev&m=160677194809564&w=2 > > Patches 4 through 7 will turn the behavior on by default for some of Dell's > CML and TGL systems. First of all thank you for working on this. I must say though that I don't like the approach taken here very much. This is not so much a criticism of this series as it is a criticism of the earlier decision to simply disable s0ix on all devices with the i219-LM + an active ME. AFAIK there was a perfectly acceptable patch to work around those broken devices, which increased a timeout: https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20200323191639.48826-1-aaron...@canonical.com/ That patch was nacked because it increased the resume time *on broken devices*. So it seems to me that we have a simple choice here: 1. Longer resume time on devices with an improperly configured ME 2. Higher power-consumption on all non-buggy devices Your patches 4-7 try to work around 2. but IMHO those are just bandaids for getting the initial priorities *very* wrong. 
Instead of penalizing non-buggy devices with a higher power-consumption, we should default to penalizing the buggy devices with a higher resume time. And if it is decided that the higher resume time is a worse problem than the higher power-consumption, then there should be a list of broken devices and s0ix can be disabled on those. The current allow-list approach is simply never going to work well, leading to too high power-consumption on countless devices. This is going to be an endless game of whack-a-mole and as such really is a bad idea. A deny-list for broken devices is a much better approach, esp. since missing devices on that list will still work fine, they will just have a somewhat larger resume time. So what needs to happen IMHO is: 1. Merge your fix from patch 1 of this set 2. Merge "e1000e: bump up timeout to wait when ME un-configure ULP mode" 3. Drop the e1000e_check_me check. Then we also do not need the new "s0ix-enabled" ethtool flag because we do not need userspace to work around us doing the wrong thing by default. Note a while ago I had access to one of the devices having suspend/resume issues caused by the S0ix support (a Lenovo Thinkpad X1 Carbon gen 7) and I can confirm that the "e1000e: bump up timeout to wait when ME un-configure ULP mode" patch fixes the suspend/resume problem without any noticeable negative side-effects. Regards, Hans > > Changes from v2 to v3: > - Correct some grammar and spelling issues caught by Bjorn H. >* s/s0ix/S0ix/ in all commit messages >* Fix a typo in commit message >* Fix capitalization of proper nouns > - Add more pre-release systems that pass > - Re-order the series to add systems only at the end of the series > - Add Fixes tag to a patch in series. 
> > Changes from v1 to v2: > - Directly incorporate Vitaly's dependency patch in the series > - Split out s0ix code into it's own file > - Adjust from DMI matching to PCI subsystem vendor ID/device matching > - Remove module parameter and sysfs, use ethtool flag instead. > - Export s0ix flag to ethtool private flags > - Include more people and lists directly in this submission chain. > > Mario Limonciello (6): > e1000e: Move all S0ix related code into its own source file > e1000e: Export S0ix flags to ethtool > e1000e: Add Dell's Comet Lake systems into S0ix heuristics > e1000e: Add more Dell CML systems into S0ix heuristics > e1000e: Add Dell TGL desktop systems into S0ix heuristics > e1000e: Add another Dell TGL notebook system into S0ix heuristics > > Vitaly Lifshits (1): > e1000e: fix S0ix flow to allow S0i3.2 subset entry > > drivers/net/ethernet/intel/e1000e/Makefile | 2 +- > drivers/net/ethernet/intel/e1000e/e1000.h | 4 + > drivers/net/ethernet/intel/e1000e/ethtool.c | 40 +++ > drivers/net/ethernet/intel/e1000e/netdev.c | 272 + > drivers/net/ethernet/intel/e1000e/s0ix.c| 311 > 5 files changed, 361 insertions(+), 268 deletions(-) > create mode 100644 drivers/net/ethernet/intel/e1000e/
[PATCH v2 bpf-next 13/13] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- .../bpf/prog_tests/select_reuseport_migrate.c | 173 ++ .../bpf/progs/test_select_reuseport_migrate.c | 53 ++ 2 files changed, 226 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c create mode 100644 tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c new file mode 100644 index ..814b1e3a4c56 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c @@ -0,0 +1,173 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. call listen() for 5 server sockets. + * 2. update a map to migrate all child socket + *to the last server socket (migrate_map[cookie] = 4) + * 3. call connect() for 25 client sockets. + * 4. call close() for first 4 server sockets. + * 5. call accept() for the last server socket. 
+ * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#include "test_progs.h" +#include "test_select_reuseport_migrate.skel.h" + +#define ADDRESS "127.0.0.1" +#define PORT 80 +#define NUM_SERVERS 5 +#define NUM_CLIENTS (NUM_SERVERS * 5) + + +static int test_listen(struct test_select_reuseport_migrate *skel, int server_fds[]) +{ + int i, err, optval = 1, migrated_to = NUM_SERVERS - 1; + int prog_fd, reuseport_map_fd, migrate_map_fd; + struct sockaddr_in addr; + socklen_t addr_len; + __u64 value; + + prog_fd = bpf_program__fd(skel->progs.prog_select_reuseport_migrate); + reuseport_map_fd = bpf_map__fd(skel->maps.reuseport_map); + migrate_map_fd = bpf_map__fd(skel->maps.migrate_map); + + addr_len = sizeof(addr); + addr.sin_family = AF_INET; + addr.sin_port = htons(PORT); + inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SERVERS; i++) { + server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (CHECK_FAIL(server_fds[i] == -1)) + return -1; + + err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT, +&optval, sizeof(optval)); + if (CHECK_FAIL(err == -1)) + return -1; + + if (i == 0) { + err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, +&prog_fd, sizeof(prog_fd)); + if (CHECK_FAIL(err == -1)) + return -1; + } + + err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len); + if (CHECK_FAIL(err == -1)) + return -1; + + err = listen(server_fds[i], 32); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value); + if (CHECK_FAIL(err == -1)) + return -1; + + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST); + if (CHECK_FAIL(err == -1)) + return -1; + } + + return 0; +} + +static int test_connect(int client_fds[]) +{ + struct sockaddr_in addr; + socklen_t addr_len; + int i, err; + + addr_len = sizeof(addr); + 
addr.sin_family = AF_INET; + addr.sin_port = htons(PORT); + inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_CLIENTS; i++) { + client_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (CHECK_FAIL(client_fds[i] == -1)) + return -1; + + err = connect(client_fds[i], (struct sockaddr *)&addr, addr_len); + if (CHECK_FAIL(err == -1)) + return -1; + } + + return 0; +} + +static void test_close(int server_fds[], int num) +{ + int i; + + for (i = 0; i < num; i++) + if (server_fds[i] > 0) + close(server_fds[i]); +} + +static int test_accept(int server_fd) +{ + struct sockaddr_in addr; + socklen_t addr_len; + int cnt, client_fd; + + fcntl(server_fd, F_SETFL, O_NONBLOCK); + addr_len = sizeof(addr); + + for (cnt = 0; cnt < NUM_CLIENTS; cnt++) { + client_fd = accept(server_fd, (struct sockaddr *)&addr, &addr_len); + if (CHECK_FAIL(client_fd == -1)) + return -1; + } +
Re: [RFC PATCH 2/3] net: sparx5: Add Sparx5 switchdev driver
Mon, Nov 30, 2020 at 02:13:35PM CET, steen.hegel...@microchip.com wrote: >On 27.11.2020 18:15, Andrew Lunn wrote: >> EXTERNAL EMAIL: Do not click links or open attachments unless you know the >> content is safe >> >> This is a very large driver, which is going to make it slow to review. >Hi Andrew, > >Yes I am aware of that, but I think that what is available with this >series, makes for a nice package that can be tested by us, and used by >our customers. Could you perhaps cut it into multiple patches for easier review? Like the basics, host delivery, fwd offload, etc?
RE: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, December 7, 2020 11:54 AM > To: wangyunjian ; m...@redhat.com > Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun > (Jerry) ; xudingke > Subject: Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path > > > On 2020/12/4 下午6:22, wangyunjian wrote: > >> -Original Message- > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Friday, December 4, 2020 2:11 PM > >> To: wangyunjian ; m...@redhat.com > >> Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; > Lilijun > >> (Jerry) ; xudingke > >> Subject: Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error > >> path > >> > >> > >> On 2020/12/3 下午4:00, wangyunjian wrote: > >>> From: Yunjian Wang > >>> > >>> After setting callback for ubuf_info of skb, the callback > >>> (vhost_net_zerocopy_callback) will be called to decrease the refcount > >>> when freeing skb. But when an exception occurs afterwards, the error > >>> handling in vhost handle_tx() will try to decrease the same refcount > >>> again. This is wrong and fix this by clearing ubuf_info when meeting > >>> errors. 
> >>> > >>> Fixes: 4477138fa0ae ("tun: properly test for IFF_UP") > >>> Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP > >>> driver") > >>> > >>> Signed-off-by: Yunjian Wang > >>> --- > >>>drivers/net/tun.c | 11 +++ > >>>1 file changed, 11 insertions(+) > >>> > >>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c index > >>> 2dc1988a8973..3614bb1b6d35 100644 > >>> --- a/drivers/net/tun.c > >>> +++ b/drivers/net/tun.c > >>> @@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct > >> *tun, struct tun_file *tfile, > >>> if (unlikely(!(tun->dev->flags & IFF_UP))) { > >>> err = -EIO; > >>> rcu_read_unlock(); > >>> + if (zerocopy) { > >>> + skb_shinfo(skb)->destructor_arg = NULL; > >>> + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > >>> + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; > >>> + } > >>> + > >>> goto drop; > >>> } > >>> > >>> @@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct > >>> *tun, struct tun_file *tfile, > >>> > >>> if (unlikely(headlen > skb_headlen(skb))) { > >>> atomic_long_inc(&tun->dev->rx_dropped); > >>> + if (zerocopy) { > >>> + skb_shinfo(skb)->destructor_arg = NULL; > >>> + skb_shinfo(skb)->tx_flags &= > ~SKBTX_DEV_ZEROCOPY; > >>> + skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG; > >>> + } > >>> napi_free_frags(&tfile->napi); > >>> rcu_read_unlock(); > >>> mutex_unlock(&tfile->napi_mutex); > >> > >> It looks to me then we miss the failure feedback. > >> > >> The issues comes from the inconsistent error handling in tun. > >> > >> I wonder whether we can simply do uarg->callback(uarg, false) if necessary > on > >> every failture path on tun_get_user(). > > How about this? 
> > > > --- > > drivers/net/tun.c | 29 ++--- > > 1 file changed, 18 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > > index 2dc1988a8973..36a8d8eacd7b 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -1637,6 +1637,19 @@ static struct sk_buff *tun_build_skb(struct > tun_struct *tun, > > return NULL; > > } > > > > +/* copy ubuf_info for callback when skb has no error */ > > +inline static tun_copy_ubuf_info(struct sk_buff *skb, bool zerocopy, void > *msg_control) > > +{ > > + if (zerocopy) { > > + skb_shinfo(skb)->destructor_arg = msg_control; > > + skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; > > + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; > > + } else if (msg_control) { > > + struct ubuf_info *uarg = msg_control; > > + uarg->callback(uarg, false); > > + } > > +} > > + > > /* Get packet from user space buffer */ > > static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file > > *tfile, > > void *msg_control, struct iov_iter *from, > > @@ -1812,16 +1825,6 @@ static ssize_t tun_get_user(struct tun_struct > *tun, struct tun_file *tfile, > > break; > > } > > > > - /* copy skb_ubuf_info for callback when skb has no error */ > > - if (zerocopy) { > > - skb_shinfo(skb)->destructor_arg = msg_control; > > - skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY; > > - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; > > - } else if (msg_control) { > > - struct ubuf_info *uarg = msg_control; > > - uarg->callback(uarg, false); > > - } >
Re: [PATCH net-next] nfc: s3fwrn5: Change irqflags
On Mon, Dec 7, 2020 at 8:51 PM Krzysztof Kozlowski wrote: > > On Mon, Dec 07, 2020 at 08:38:27PM +0900, Bongsu Jeon wrote: > > From: Bongsu Jeon > > > > change irqflags from IRQF_TRIGGER_HIGH to IRQF_TRIGGER_RISING for stable > > Samsung's nfc interrupt handling. > > 1. Describe in commit title/subject the change. Just a word "change irqflags" > is >not enough. > Ok. I'll update it. > 2. Describe in commit message what you are trying to fix. Before was not >stable? The "for stable interrupt handling" is a little bit vauge. > Usually, Samsung's NFC Firmware sends an i2c frame as below. 1. NFC Firmware sets the gpio(interrupt pin) high when there is an i2c frame to send. 2. If the CPU's I2C master has received the i2c frame, NFC F/W sets the gpio low. NFC driver's i2c interrupt handler would be called in the abnormal case as the NFC F/W task of number 2 is delayed because of other high priority tasks. In that case, NFC driver will try to receive the i2c frame but there isn't any i2c frame to send in NFC. It would cause an I2C communication problem. This case would hardly happen. But, I changed the interrupt as a defense code. If Driver uses the TRIGGER_RISING not LEVEL trigger, there would be no problem even if the NFC F/W task is delayed. > 3. This is contradictory to the bindings and current DTS. I think the >driver should not force the specific trigger type because I could >imagine some configuration that the actual interrupt to the CPU is >routed differently. > >Instead, how about removing the trigger flags here and fixing the DTS >and bindings example? > As I mentioned before, I changed this code because of Samsung NFC's I2C Communication way. So, I think that it is okay for the nfc driver to force the specific trigger type( EDGE_RISING). What do you think about it? 
> Best regards, > Krzysztof > > > > > Signed-off-by: Bongsu Jeon > > --- > > drivers/nfc/s3fwrn5/i2c.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/nfc/s3fwrn5/i2c.c b/drivers/nfc/s3fwrn5/i2c.c > > index e1bdde105f24..016f6b6df849 100644 > > --- a/drivers/nfc/s3fwrn5/i2c.c > > +++ b/drivers/nfc/s3fwrn5/i2c.c > > @@ -213,7 +213,7 @@ static int s3fwrn5_i2c_probe(struct i2c_client *client, > > return ret; > > > > ret = devm_request_threaded_irq(&client->dev, phy->i2c_dev->irq, NULL, > > - s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_HIGH | IRQF_ONESHOT, > > + s3fwrn5_i2c_irq_thread_fn, IRQF_TRIGGER_RISING | IRQF_ONESHOT, > > S3FWRN5_I2C_DRIVER_NAME, phy); > > if (ret) > > s3fwrn5_remove(phy->common.ndev); > > -- > > 2.17.1 > >
[PATCH v2] xfrm: interface: Don't hide plain packets from netfilter
With an IPsec tunnel without dedicated interface, netfilter sees locally generated packets twice as they exit the physical interface: Once as "the inner packet" with IPsec context attached and once as the encrypted (ESP) packet. With xfrm_interface, the inner packet did not traverse NF_INET_LOCAL_OUT hook anymore, making it impossible to match on both inner header values and associated IPsec data from that hook. Fix this by looping packets transmitted from xfrm_interface through NF_INET_LOCAL_OUT before passing them on to dst_output(), which makes behaviour consistent again from netfilter's point of view. Fixes: f203b76d78092 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Phil Sutter --- Changes since v1: - Extend recipients list, no code changes. --- net/xfrm/xfrm_interface.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index aa4cdcf69d471..24af61c95b4d4 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -317,7 +317,8 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) skb_dst_set(skb, dst); skb->dev = tdev; - err = dst_output(xi->net, skb->sk, skb); + err = NF_HOOK(skb_dst(skb)->ops->family, NF_INET_LOCAL_OUT, xi->net, + skb->sk, skb, NULL, skb_dst(skb)->dev, dst_output); if (net_xmit_eval(err) == 0) { struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats); -- 2.28.0
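[Editor's note: with the hook in place, locally generated plaintext packets traverse NF_INET_LOCAL_OUT again, so a single rule can match both inner header fields and the associated IPsec policy — for instance with iptables' policy match. The addresses below are placeholders; this is a configuration sketch, not a tested ruleset.]

```sh
# Match locally generated plaintext packets that will be
# IPsec-transformed on their way out of an xfrm interface.
iptables -A OUTPUT -m policy --dir out --pol ipsec \
         -d 192.0.2.0/24 -j ACCEPT
```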
Re: [PATCH bpf-next] bpf: return -EOPNOTSUPP when attaching to non-kernel BTF
Alexei Starovoitov writes: > On Fri, Dec 4, 2020 at 7:11 PM Andrii Nakryiko wrote: >> + return -EOPNOTSUPP; > > $ cd kernel/bpf > $ git grep ENOTSUPP|wc -l > 46 > $ git grep EOPNOTSUPP|wc -l > 11 But also $ cd kernel/include/uapi $ git grep ENOTSUPP | wc -l 0 $ git grep EOPNOTSUPP | wc -l 8 (i.e., ENOTSUPP is not defined in userspace headers at all) -Toke
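[Editor's note: the asymmetry Toke points out is easy to demonstrate from user space. A minimal sketch — the numeric value 524 mentioned in a comment is the kernel-internal ENOTSUPP, given here only for illustration:]

```c
#include <assert.h>
#include <errno.h>

/* EOPNOTSUPP is part of the UAPI errno set, so user space can
 * name and handle it; ENOTSUPP (524) is kernel-internal, has no
 * symbol in <errno.h>, and surfaces to applications as an
 * unrecognisable "Unknown error 524". */
int uapi_has_eopnotsupp(void)
{
#ifdef EOPNOTSUPP
	return 1;
#else
	return 0;
#endif
}

int uapi_has_enotsupp(void)
{
#ifdef ENOTSUPP
	return 1;
#else
	return 0;
#endif
}
```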
Re: [PATCH v5 3/6] net: dsa: microchip: ksz8795: move register offsets and shifts to separate struct
Hi Michael, I love your patch! Perhaps something to improve: [auto build test WARNING on net-next/master] [also build test WARNING on next-20201207] [cannot apply to net/master ipvs/master linus/master v5.10-rc7] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Michael-Grzeschik/microchip-add-support-for-ksz88x3-driver-family/20201207-205945 base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git af3f4a85d90218bb59315d591bd2bffa5e646466 config: arc-allyesconfig (attached as .config) compiler: arceb-elf-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/db1f7322c8fa2c28587f13ab3eebbb6ee02874b1 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Michael-Grzeschik/microchip-add-support-for-ksz88x3-driver-family/20201207-205945 git checkout db1f7322c8fa2c28587f13ab3eebbb6ee02874b1 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All warnings (new ones prefixed by >>): >> drivers/net/dsa/microchip/ksz8795.c:69:27: warning: initialized field >> overwritten [-Woverride-init] 69 | [DYNAMIC_MAC_ENTRIES] = 29, | ^~ drivers/net/dsa/microchip/ksz8795.c:69:27: note: (near initialization for 'ksz8795_shifts[5]') vim +69 drivers/net/dsa/microchip/ksz8795.c 62 63 static const u8 ksz8795_shifts[] = { 64 [VLAN_TABLE_MEMBERSHIP] = 7, 65 [VLAN_TABLE]= 16, 66 [STATIC_MAC_FWD_PORTS] = 16, 67 [STATIC_MAC_FID]= 24, 68 [DYNAMIC_MAC_ENTRIES_H] = 3, > 69 [DYNAMIC_MAC_ENTRIES] = 29, 70 [DYNAMIC_MAC_FID] = 16, 71 
[DYNAMIC_MAC_TIMESTAMP] = 27, 72 [DYNAMIC_MAC_SRC_PORT] = 24, 73 }; 74 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH net-next] nfc: s3fwrn5: Change irqflags
On Mon, Dec 07, 2020 at 10:39:01PM +0900, Bongsu Jeon wrote: > On Mon, Dec 7, 2020 at 8:51 PM Krzysztof Kozlowski wrote: > > > > On Mon, Dec 07, 2020 at 08:38:27PM +0900, Bongsu Jeon wrote: > > > From: Bongsu Jeon > > > > > > change irqflags from IRQF_TRIGGER_HIGH to IRQF_TRIGGER_RISING for stable > > > Samsung's nfc interrupt handling. > > > > 1. Describe in commit title/subject the change. Just a word "change > > irqflags" is > >not enough. > > > Ok. I'll update it. > > > 2. Describe in commit message what you are trying to fix. Before was not > >stable? The "for stable interrupt handling" is a little bit vauge. > > > Usually, Samsung's NFC Firmware sends an i2c frame as below. > > 1. NFC Firmware sets the gpio(interrupt pin) high when there is an i2c > frame to send. > 2. If the CPU's I2C master has received the i2c frame, NFC F/W sets > the gpio low. > > NFC driver's i2c interrupt handler would be called in the abnormal case > as the NFC F/W task of number 2 is delayed because of other high > priority tasks. > In that case, NFC driver will try to receive the i2c frame but there > isn't any i2c frame > to send in NFC. It would cause an I2C communication problem. > This case would hardly happen. > But, I changed the interrupt as a defense code. > If Driver uses the TRIGGER_RISING not LEVEL trigger, there would be no problem > even if the NFC F/W task is delayed. All this should be explained in commit message, not in the email. > > > 3. This is contradictory to the bindings and current DTS. I think the > >driver should not force the specific trigger type because I could > >imagine some configuration that the actual interrupt to the CPU is > >routed differently. > > > >Instead, how about removing the trigger flags here and fixing the DTS > >and bindings example? > > > > As I mentioned before, > I changed this code because of Samsung NFC's I2C Communication way. > So, I think that it is okay for the nfc driver to force the specific > trigger type( EDGE_RISING). 
> > What do you think about it? Some different chip or some different hardware implementation could have the signal inverted, e.g. edge falling, not rising. This is rather a theoretical scenario but still such change makes the code more generic, configurable with DTS. Therefore trigger mode should be configured via DTS, not enforced by the driver. Best regards, Krzysztof
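[Editor's note: for illustration, the DTS-driven configuration Krzysztof suggests would look roughly like this. The bus node, unit address, and GPIO specifier are made up; only the interrupt properties matter here.]

```dts
&i2c4 {
	nfc@27 {
		compatible = "samsung,s3fwrn5-i2c";
		reg = <0x27>;
		interrupt-parent = <&gpa1>;
		/* The trigger type is chosen here, in the device tree,
		 * rather than being hard-coded in the driver. */
		interrupts = <3 IRQ_TYPE_EDGE_RISING>;
	};
};
```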
Re: Why the auxiliary cipher in gss_krb5_crypto.c?
Ard Biesheuvel wrote: > > I wonder if it would help if the input buffer and output buffer didn't > > have to correspond exactly in usage - ie. the output buffer could be used > > at a slower rate than the input to allow for buffering inside the crypto > > algorithm. > > > > I don't follow - how could one be used at a slower rate? I mean that the crypto algorithm might need to buffer the last part of the input until it has a block's worth before it can write to the output. > > The hashes corresponding to the kerberos enctypes I'm supporting are: > > > > HMAC-SHA1 for aes128-cts-hmac-sha1-96 and aes256-cts-hmac-sha1-96. > > > > HMAC-SHA256 for aes128-cts-hmac-sha256-128 > > > > HMAC-SHA384 for aes256-cts-hmac-sha384-192 > > > > CMAC-CAMELLIA for camellia128-cts-cmac and camellia256-cts-cmac > > > > I'm not sure you can support all of those with the instructions available. > > It depends on whether the caller can make use of the authenc() > pattern, which is a type of AEAD we support. Interesting. I didn't realise AEAD was an API. > There are numerous implementations of authenc(hmac(shaXXX),cbc(aes)), > including h/w accelerated ones, but none that implement ciphertext > stealing. So that means that, even if you manage to use the AEAD layer to > perform both at the same time, the generic authenc() template will perform > the cts(cbc(aes)) and hmac(shaXXX) by calling into skciphers and ahashes, > respectively, which won't give you any benefit until accelerated > implementations turn up that perform the whole operation in one pass over > the input. And even then, I don't think the performance benefit will be > worth it. Also, the rfc8009 variants that use AES with SHA256/384 hash the ciphertext, not the plaintext. For the moment, it's probably not worth worrying about, then. If I can manage to abstract the sunrpc bits out into a krb5 library, we can improve the library later. David
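[Editor's note: for context, the authenc() pattern mentioned above is instantiated by name through the kernel crypto API. A rough in-kernel sketch — illustrative only, error handling elided, not runnable outside the kernel:]

```c
#include <crypto/aead.h>

/* Ask the crypto layer for a combined encrypt-and-authenticate
 * AEAD transform. An accelerated one-pass implementation is
 * picked if registered; otherwise the generic authenc template
 * glues together separate hmac(sha1) and cbc(aes) transforms. */
static struct crypto_aead *krb5_alloc_authenc(void)
{
	return crypto_alloc_aead("authenc(hmac(sha1),cbc(aes))", 0, 0);
}
```

Note that the kerberos enctypes discussed above would actually need the ciphertext-stealing variant, authenc(hmac(sha1),cts(cbc(aes))), and as Ard points out no accelerated implementation of that combination currently exists.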
Re: [PATCH net-next] nfc: s3fwrn5: Change irqflags
On Mon, Dec 7, 2020 at 11:13 PM Krzysztof Kozlowski wrote: > > On Mon, Dec 07, 2020 at 10:39:01PM +0900, Bongsu Jeon wrote: > > On Mon, Dec 7, 2020 at 8:51 PM Krzysztof Kozlowski wrote: > > > > > > On Mon, Dec 07, 2020 at 08:38:27PM +0900, Bongsu Jeon wrote: > > > > From: Bongsu Jeon > > > > > > > > change irqflags from IRQF_TRIGGER_HIGH to IRQF_TRIGGER_RISING for stable > > > > Samsung's nfc interrupt handling. > > > > > > 1. Describe in commit title/subject the change. Just a word "change > > > irqflags" is > > >not enough. > > > > > Ok. I'll update it. > > > > > 2. Describe in commit message what you are trying to fix. Before was not > > >stable? The "for stable interrupt handling" is a little bit vauge. > > > > > Usually, Samsung's NFC Firmware sends an i2c frame as below. > > > > 1. NFC Firmware sets the gpio(interrupt pin) high when there is an i2c > > frame to send. > > 2. If the CPU's I2C master has received the i2c frame, NFC F/W sets > > the gpio low. > > > > NFC driver's i2c interrupt handler would be called in the abnormal case > > as the NFC F/W task of number 2 is delayed because of other high > > priority tasks. > > In that case, NFC driver will try to receive the i2c frame but there > > isn't any i2c frame > > to send in NFC. It would cause an I2C communication problem. > > This case would hardly happen. > > But, I changed the interrupt as a defense code. > > If Driver uses the TRIGGER_RISING not LEVEL trigger, there would be no > > problem > > even if the NFC F/W task is delayed. > > All this should be explained in commit message, not in the email. > Okay. I will > > > > > 3. This is contradictory to the bindings and current DTS. I think the > > >driver should not force the specific trigger type because I could > > >imagine some configuration that the actual interrupt to the CPU is > > >routed differently. > > > > > >Instead, how about removing the trigger flags here and fixing the DTS > > >and bindings example? 
> > > > > > > As I mentioned before, > > I changed this code because of Samsung NFC's I2C Communication way. > > So, I think that it is okay for the nfc driver to force the specific > > trigger type( EDGE_RISING). > > > > What do you think about it? > > Some different chip or some different hardware implementation could have > the signal inverted, e.g. edge falling, not rising. This is rather > a theoretical scenario but still such change makes the code more > generic, configurable with DTS. Therefore trigger mode should be > configured via DTS, not enforced by the driver. > Okay. I understand it. > Best regards, > Krzysztof
[PATCH][next] seg6: fix unintentional integer overflow on left shift
From: Colin Ian King

Shifting the integer value 1 is evaluated using 32-bit arithmetic, and
using the result in an expression that expects an unsigned long value
leads to a potential integer overflow. Fix this by using the BIT macro
to perform the shift, avoiding the overflow.

Addresses-Coverity: ("Unintentional integer overflow")
Fixes: 964adce526a4 ("seg6: improve management of behavior attributes")
Signed-off-by: Colin Ian King
---
 net/ipv6/seg6_local.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index b07f7c1c82a4..d68de8cd1207 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1366,7 +1366,7 @@ static void __destroy_attrs(unsigned long parsed_attrs, int max_parsed,
 	 * attribute; otherwise, we call the destroy() callback.
 	 */
 	for (i = 0; i < max_parsed; ++i) {
-		if (!(parsed_attrs & (1 << i)))
+		if (!(parsed_attrs & BIT(i)))
 			continue;

 		param = &seg6_action_params[i];
--
2.29.2
Re: [PATCH net-next 0/6] s390/qeth: updates 2020-12-07
From: Julian Wiedmann
Date: Mon, 7 Dec 2020 14:12:27 +0100

> Hi Jakub,
>
> please apply the following patch series for qeth to netdev's net-next tree.
>
> Some sysfs cleanups (with the prep work in ccwgroup acked by Heiko), and
> a few improvements to the code that deals with async TX completion
> notifications for IQD devices.
>
> This also brings the missing patch from the previous net-next submission.

Series applied, thanks Julian!
Re: [PATCH v5 3/6] net: dsa: microchip: ksz8795: move register offsets and shifts to separate struct
From: Michael Grzeschik
Date: Mon, 7 Dec 2020 13:56:24 +0100

> @@ -991,13 +1090,16 @@ static void ksz8_port_setup(struct ksz_device *dev, int port, bool cpu_port)
>  static void ksz8_config_cpu_port(struct dsa_switch *ds)
>  {
>  	struct ksz_device *dev = ds->priv;
> +	struct ksz8 *ksz8 = dev->priv;
> +	const u8 *regs = ksz8->regs;
> +	const u32 *masks = ksz8->masks;
>  	struct ksz_port *p;
>  	u8 remote;
>  	int i;
>

Please use reverse christmas tree ordering for local variables.

Thank you.
Re: [PATCH] dpaa2-mac: Add a missing of_node_put after of_device_is_available
On Sun, Dec 06, 2020 at 04:13:39PM +0100, Christophe JAILLET wrote:
> Add an 'of_node_put()' call when a tested device node is not available.
>
> Fixes: 94ae899b2096 ("dpaa2-mac: add PCS support through the Lynx module")
> Signed-off-by: Christophe JAILLET

Reviewed-by: Ioana Ciornei

Thanks!

> ---
>  drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
> index 90cd243070d7..828c177df03d 100644
> --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
> +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
> @@ -269,6 +269,7 @@ static int dpaa2_pcs_create(struct dpaa2_mac *mac,
>
>  	if (!of_device_is_available(node)) {
>  		netdev_err(mac->net_dev, "pcs-handle node not available\n");
> +		of_node_put(node);
>  		return -ENODEV;
>  	}
>
> --
> 2.27.0
>
pull request: bluetooth-next 2020-12-07
Hi Dave, Jakub,

Here's the main bluetooth-next pull request for the 5.11 kernel.

 - Updated Bluetooth entries in MAINTAINERS to include Luiz von Dentz
 - Added support for Realtek 8822CE and 8852A devices
 - Added support for MediaTek MT7615E device
 - Improved workarounds for fake CSR devices
 - Fix Bluetooth qualification test case L2CAP/COS/CFD/BV-14-C
 - Fixes for LL Privacy support
 - Enforce 16 byte encryption key size for FIPS security level
 - Added new mgmt commands for extended advertising support
 - Multiple other smaller fixes & improvements

Please let me know if there are any issues pulling.

Thanks.

Johan

---
The following changes since commit bff6f1db91e330d7fba56f815cdbc412c75fe163:

  stmmac: intel: change all EHL/TGL to auto detect phy addr (2020-11-07 16:11:54 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

for you to fetch changes up to 02be5f13aacba2100f1486d3ad16c26b6dede1ce:

  MAINTAINERS: Update Bluetooth entries (2020-12-07 17:02:01 +0200)

----------------------------------------------------------------
Abhishek Pandit-Subedi (2):
      Bluetooth: btqca: Add valid le states quirk
      Bluetooth: Set missing suspend task bits

Anant Thazhemadam (2):
      Bluetooth: hci_h5: close serdev device and free hu in h5_close
      Bluetooth: hci_h5: fix memory leak in h5_close

Anmol Karn (1):
      Bluetooth: Fix null pointer dereference in hci_event_packet()

Archie Pusaka (1):
      Bluetooth: Enforce key size of 16 bytes on FIPS level

Balakrishna Godavarthi (1):
      Bluetooth: hci_qca: Enhance retry logic in qca_setup

Cadel Watson (1):
      Bluetooth: btusb: Support 0bda:c123 Realtek 8822CE device

Chris Chiu (1):
      Bluetooth: btusb: Add support for 13d3:3560 MediaTek MT7615E device

Claire Chang (1):
      Bluetooth: Move force_bredr_smp debugfs into hci_debugfs_create_bredr

Colin Ian King (1):
      Bluetooth: btrtl: fix incorrect skb allocation failure check

Daniel Winkler (6):
      Bluetooth: Resume advertising after LE connection
      Bluetooth: Add helper to set adv data
      Bluetooth: Break add adv into two mgmt commands
      Bluetooth: Use intervals and tx power from mgmt cmds
      Bluetooth: Query LE tx power on startup
      Bluetooth: Change MGMT security info CMD to be more generic

Edward Vear (1):
      Bluetooth: Fix attempting to set RPA timeout when unsupported

Hans de Goede (4):
      Bluetooth: revert: hci_h5: close serdev device and free hu in h5_close
      Bluetooth: hci_h5: Add OBDA0623 ACPI HID
      Bluetooth: btusb: Fix detection of some fake CSR controllers with a bcdDevice val of 0x0134
      Bluetooth: btusb: Add workaround for remote-wakeup issues with Barrot 8041a02 fake CSR controllers

Howard Chung (6):
      Bluetooth: Replace BT_DBG with bt_dev_dbg in HCI request
      Bluetooth: Interleave with allowlist scan
      Bluetooth: Handle system suspend resume case
      Bluetooth: Handle active scan case
      Bluetooth: Refactor read default sys config for various types
      Bluetooth: Add toggle to switch off interleave scan

Jimmy Wahlberg (1):
      Bluetooth: Fix for Bluetooth SIG test L2CAP/COS/CFD/BV-14-C

Jing Xiangfeng (2):
      Bluetooth: btusb: Add the missed release_firmware() in btusb_mtk_setup_firmware()
      Bluetooth: btmtksdio: Add the missed release_firmware() in mtk_setup_firmware()

Julian Pidancet (1):
      Bluetooth: btusb: Add support for 1358:c123 Realtek 8822CE device

Kai-Heng Feng (1):
      Bluetooth: btrtl: Ask 8821C to drop old firmware

Kiran K (5):
      Bluetooth: btintel: Fix endianness issue for TLV version information
      Bluetooth: btusb: Add *setup* function for new generation Intel controllers
      Bluetooth: btusb: Define a function to construct firmware filename
      Bluetooth: btusb: Helper function to download firmware to Intel adapters
      Bluetooth: btusb: Map Typhoon peak controller to BTUSB_INTEL_NEWGEN

Luiz Augusto von Dentz (2):
      Bluetooth: Fix not sending Set Extended Scan Response
      Bluetooth: Rename get_adv_instance_scan_rsp

Marcel Holtmann (2):
      Bluetooth: Increment management interface revision
      MAINTAINERS: Update Bluetooth entries

Max Chou (3):
      Bluetooth: btusb: Add the more support IDs for Realtek RTL8822CE
      Bluetooth: btrtl: Refine the ic_id_table for clearer and more regular
      Bluetooth: btusb: btrtl: Add support for RTL8852A

Nigel Christian (1):
      Bluetooth: hci_qca: resolve various warnings

Ole Bjørn Midtbø (1):
      Bluetooth: hidp: use correct wait queue when removing ctrl_wait

Peilin Ye (1):
      Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()

Reo Shiseki (1):
      Bluetooth: fix typo in struct name

Sathish Narasimman (1):
      Bluetooth: Fix: LL PRivacy BLE device fails to connect

Sergey Shtylyov (1):
      Bluetooth: consolidate error paths in hci_phy_link_complete_evt()

Tim Jiang (1)