date:20200514

[PATCH net] ipv6: Fix suspicious RCU usage warning in ip6mr

2020-05-14 Thread madhuparnabhowmik10

From: Madhuparna Bhowmik 

This patch fixes the following warning:

=============================
WARNING: suspicious RCU usage
5.7.0-rc4-next-20200507-syzkaller #0 Not tainted
-----------------------------
net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!!

ipmr_new_table() returns an existing table, but there is no table at
init. Therefore the condition: either holding rtnl or the list is empty
is used.

Fixes: d13fee049f ("Default enable RCU list lockdep debugging with .."): 
WARNING: suspicious RCU usage
Reported-by: kernel test robot 
Suggested-by: Jakub Kicinski 
Signed-off-by: Madhuparna Bhowmik 
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 65a54d74acc1..fbe282bb8036 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -98,7 +98,7 @@ static void ipmr_expire_process(struct timer_list *t);
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
 #define ip6mr_for_each_table(mrt, net) \
list_for_each_entry_rcu(mrt, &net->ipv6.mr6_tables, list, \
-   lockdep_rtnl_is_held())
+   lockdep_rtnl_is_held() || list_empty(&net->ipv6.mr6_tables))
 
 static struct mr_table *ip6mr_mr_table_iter(struct net *net,
struct mr_table *mrt)
-- 
2.17.1
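
The condition added by this patch can be read as a plain predicate. What follows is a minimal userspace sketch, not kernel code: `struct walk_ctx` and `traversal_is_safe()` are hypothetical names, and the three booleans stand in for being inside an RCU read-side section, `lockdep_rtnl_is_held()` and `list_empty(&net->ipv6.mr6_tables)`. In the kernel the check is evaluated by lockdep inside `list_for_each_entry_rcu()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state that lockdep inspects. */
struct walk_ctx {
	bool in_rcu_reader;	/* models an RCU read-side critical section */
	bool rtnl_held;		/* models lockdep_rtnl_is_held() */
	bool list_is_empty;	/* models list_empty(&net->ipv6.mr6_tables) */
};

/*
 * Returns true when walking the table list would not trigger the
 * "RCU-list traversed in non-reader section" warning under the
 * condition used in the patch.
 */
static bool traversal_is_safe(const struct walk_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->in_rcu_reader || ctx->rtnl_held || ctx->list_is_empty;
}
```

At init time neither lock is held but the list is still empty, so the third term is what silences the warning in that path.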

Re: [PATCH] Fix suspicious RCU usage warning

2020-05-14 Thread Madhuparna Bhowmik

On Wed, May 13, 2020 at 12:00:10PM -0700, David Miller wrote:
> From: madhuparnabhowmi...@gmail.com
> Date: Wed, 13 May 2020 11:46:10 +0530
> 
> > From: Madhuparna Bhowmik 
> > 
> > This patch fixes the following warning:
> > 
> > =============================
> > WARNING: suspicious RCU usage
> > 5.7.0-rc4-next-20200507-syzkaller #0 Not tainted
> > -----------------------------
> > net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!!
> > 
> > ipmr_new_table() returns an existing table, but there is no table at
> > init. Therefore the condition: either holding rtnl or the list is empty
> > is used.
> > 
> > Suggested-by: Jakub Kicinski 
> > Signed-off-by: Madhuparna Bhowmik 
> > 
> > Signed-off-by: Madhuparna Bhowmik 
> 
> Please only provide one signoff line.
> 
> Please provide a proper Fixes: tag for this bug fix.
> 
> And finally, please make your Subject line more appropriate.  It must
> first state the target tree inside of the "[PATCH]" area, the two choices
> are "[PATCH net]" and "[PATCH net-next]" and it depends upon which tree
> this patch is targeting.
> 
> Then your Subject line should also be more descriptive about exactly the
> subsystem and area the change is being made to, for this change for
> example you could use something like:
> 
>   ipv6: Fix suspicious RCU usage warning in ip6mr.
> 
> Also, obviously, there are also syzkaller tags you can add to the
> commit message as well.
Sorry for this malformed patch, I have sent a patch with all these
corrections.

Thank you,
Madhuparna

RE: [EXT] Re: signal quality and cable diagnostic

2020-05-14 Thread Christian Herber

On Tue, May 12, 2020 at 10:22:01AM +0200, Oleksij Rempel wrote:

> So I think we should pass raw SQI value to user space, at least in the
> first implementation.

> What do you think about this?

Hi Oleksij,

I checked the background of the SQI. The table you reference 
with concrete SNR values is informative only and not a requirement. The 
requirements are rather loose.

This is from OA:
- Only for SQI=0 a link loss shall occur.
- The indicated signal quality shall be monotonically increasing/decreasing with 
noise level.
- It shall be indicated in the datasheet at which level a BER<10^-10 (better 
than 10^-10) is achieved (e.g. "from SQI=3 to SQI=7 the link has a BER<10^-10 
(better than 10^-10)")

I.e. SQI does not need to have a direct correlation with SNR. The fundamental 
underlying metric is the BER.
You can report the raw SQI level, and users would have to look up what it means 
in the respective data sheet. There is no guaranteed relation between the SQI 
levels of different devices, e.g. SQI 5 on one device can have a lower BER than 
SQI 6 on another device.
Alternatively, you could report BER < x for the different SQI levels. However, 
this requires the information to be available. While I could provide these for 
NXP, it might not be easily available for other vendors.
If reporting raw SQI, at least the SQI level for BER<10^-10 should be presented 
to give any meaning to the value.

Regards,

Christian
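
To illustrate the last point, here is a small sketch of how a raw SQI value could be interpreted against a per-device threshold. It is only an illustration under stated assumptions: `struct sqi_info`, its fields and `sqi_meets_ber_target()` are hypothetical names, and the threshold would have to come from the vendor data sheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-PHY description; values come from the vendor data sheet. */
struct sqi_info {
	unsigned int sqi_max;		/* highest SQI level the PHY reports */
	unsigned int sqi_ber_ok_min;	/* lowest level with BER < 10^-10 */
};

/* True when the raw SQI reading is at or above the data sheet threshold. */
static bool sqi_meets_ber_target(const struct sqi_info *info,
				 unsigned int raw_sqi)
{
	assert(info != NULL && raw_sqi <= info->sqi_max);
	return raw_sqi >= info->sqi_ber_ok_min;
}
```

Because the threshold is per device, the same raw value can pass on one PHY and fail on another, which is why the raw SQI alone is not comparable across vendors.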

Re: [bpf-next PATCH 2/3] bpf: sk_msg helpers for probe_* and current_task

2020-05-14 Thread Yonghong Song





On 5/13/20 12:24 PM, John Fastabend wrote:

Often it is useful when applying policy to know something about the
task. If the administrator has CAP_SYS_ADMIN rights then they can
use kprobe + sk_msg and link the two programs together to accomplish
this. However, this is a bit clunky and also means we have to call
sk_msg program and kprobe program when we could just use a single
program and avoid passing metadata through sk_msg/skb, socket, etc.

To accomplish this add probe_* helpers to sk_msg programs guarded
by a CAP_SYS_ADMIN check. New supported helpers are the following,

  BPF_FUNC_get_current_task
  BPF_FUNC_current_task_under_cgroup
  BPF_FUNC_probe_read_user
  BPF_FUNC_probe_read_kernel
  BPF_FUNC_probe_read
  BPF_FUNC_probe_read_user_str
  BPF_FUNC_probe_read_kernel_str
  BPF_FUNC_probe_read_str


I think this is a good idea. It will require the bpf program
to be GPL-licensed, which is probably okay. The required capability
is currently CAP_SYS_ADMIN; in the future it may be CAP_PERFMON.

Also, do we want to remove BPF_FUNC_probe_read and
BPF_FUNC_probe_read_str from the list? Since we are
introducing helpers to a new program type, we can deprecate
these two helpers right away.

The new helpers will be subject to the security lockdown
rules, which may affect networking bpf programs
on particular setups.



Signed-off-by: John Fastabend 
---
  kernel/trace/bpf_trace.c |   16 
  net/core/filter.c        |   34 ++
  2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d961428..abe6721 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -147,7 +147,7 @@ BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size,
return ret;
  }
  
-static const struct bpf_func_proto bpf_probe_read_user_proto = {

+const struct bpf_func_proto bpf_probe_read_user_proto = {
.func   = bpf_probe_read_user,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -167,7 +167,7 @@ BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
return ret;
  }
  
-static const struct bpf_func_proto bpf_probe_read_user_str_proto = {

+const struct bpf_func_proto bpf_probe_read_user_str_proto = {
.func   = bpf_probe_read_user_str,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -198,7 +198,7 @@ BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, false);
  }
  
-static const struct bpf_func_proto bpf_probe_read_kernel_proto = {

+const struct bpf_func_proto bpf_probe_read_kernel_proto = {
.func   = bpf_probe_read_kernel,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -213,7 +213,7 @@ BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, true);
  }
  
-static const struct bpf_func_proto bpf_probe_read_compat_proto = {

+const struct bpf_func_proto bpf_probe_read_compat_proto = {
.func   = bpf_probe_read_compat,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -253,7 +253,7 @@ BPF_CALL_3(bpf_probe_read_kernel_str, void *, dst, u32, size,
return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, false);
  }
  
-static const struct bpf_func_proto bpf_probe_read_kernel_str_proto = {

+const struct bpf_func_proto bpf_probe_read_kernel_str_proto = {
.func   = bpf_probe_read_kernel_str,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -268,7 +268,7 @@ BPF_CALL_3(bpf_probe_read_compat_str, void *, dst, u32, size,
return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, true);
  }
  
-static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {

+const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
.func   = bpf_probe_read_compat_str,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -874,7 +874,7 @@ BPF_CALL_0(bpf_get_current_task)
return (long) current;
  }
  
-static const struct bpf_func_proto bpf_get_current_task_proto = {

+const struct bpf_func_proto bpf_get_current_task_proto = {
.func   = bpf_get_current_task,
.gpl_only   = true,
.ret_type   = RET_INTEGER,
@@ -895,7 +895,7 @@ BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx)
return task_under_cgroup_hierarchy(current, cgrp);
  }
  
-static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {

+const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
.func   = bpf_current_task_under_cgroup,
.gpl_only   = false,
.ret_type   = RET_INTEGER,
diff --git a/net/core/filter.c b/net/core/filter.c
index 45b4a16..d1c4739 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
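
The gating discussed above (a privilege check plus the GPL requirement) can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel code: the enum, `struct helper_proto` and `lookup_helper()` are hypothetical, the table is illustrative, and in the kernel the `gpl_only` check is performed by the verifier rather than by the per-program-type lookup function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper IDs; the kernel uses enum bpf_func_id. */
enum helper_id {
	HELPER_GET_CURRENT_PID_TGID,
	HELPER_GET_CURRENT_TASK,
	HELPER_PROBE_READ_USER,
	HELPER_PROBE_READ_KERNEL,
	HELPER_ID_MAX,
};

struct helper_proto {
	const char *name;
	bool gpl_only;
};

/* Illustrative table; gpl_only mirrors the flags visible in the diff. */
static const struct helper_proto protos[HELPER_ID_MAX] = {
	[HELPER_GET_CURRENT_PID_TGID] = { "get_current_pid_tgid", false },
	[HELPER_GET_CURRENT_TASK]     = { "get_current_task", true },
	[HELPER_PROBE_READ_USER]      = { "probe_read_user", true },
	[HELPER_PROBE_READ_KERNEL]    = { "probe_read_kernel", true },
};

/* Returns the helper description, or NULL if the program may not use it. */
static const struct helper_proto *lookup_helper(enum helper_id id,
						bool privileged, bool gpl_prog)
{
	const struct helper_proto *proto;

	assert((unsigned int)id < HELPER_ID_MAX);
	proto = &protos[id];

	/* Task and probe_* helpers are only offered to privileged loaders. */
	if (id != HELPER_GET_CURRENT_PID_TGID && !privileged)
		return NULL;
	/* gpl_only helpers are rejected for non-GPL programs. */
	if (proto->gpl_only && !gpl_prog)
		return NULL;
	return proto;
}
```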

Re: [bpf-next PATCH 3/3] bpf: sk_msg add get socket storage helpers

2020-05-14 Thread Yonghong Song





On 5/13/20 12:24 PM, John Fastabend wrote:

Add helpers to use local socket storage.

Signed-off-by: John Fastabend 
---
  include/uapi/linux/bpf.h |    2 ++
  net/core/filter.c        |   15 +++
  2 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bfb31c1..3ca7cfd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3607,6 +3607,8 @@ struct sk_msg_md {
__u32 remote_port;  /* Stored in network byte order */
__u32 local_port;   /* stored in host byte order */
__u32 size; /* Total size of sk_msg */
+
+   __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */
  };


Sync changes to tools/include/uapi/linux/bpf.h?

For this patch and the previous ones, it would be good to have some
selftests that exercise the newly added helpers.

  
  struct sk_reuseport_md {

diff --git a/net/core/filter.c b/net/core/filter.c
index d1c4739..c42adc8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6395,6 +6395,10 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_current_uid_gid_proto;
case BPF_FUNC_get_current_pid_tgid:
return &bpf_get_current_pid_tgid_proto;
+   case BPF_FUNC_sk_storage_get:
+   return &bpf_sk_storage_get_proto;
+   case BPF_FUNC_sk_storage_delete:
+   return &bpf_sk_storage_delete_proto;
  #ifdef CONFIG_CGROUPS
case BPF_FUNC_get_current_cgroup_id:
return &bpf_get_current_cgroup_id_proto;
@@ -7243,6 +7247,11 @@ static bool sk_msg_is_valid_access(int off, int size,
if (size != sizeof(__u64))
return false;
break;
+   case offsetof(struct sk_msg_md, sk):
+   if (size != sizeof(__u64))
+   return false;
+   info->reg_type = PTR_TO_SOCKET;
+   break;
case bpf_ctx_range(struct sk_msg_md, family):
case bpf_ctx_range(struct sk_msg_md, remote_ip4):
case bpf_ctx_range(struct sk_msg_md, local_ip4):
@@ -8577,6 +8586,12 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
  si->dst_reg, si->src_reg,
  offsetof(struct sk_msg_sg, size));
break;
+
+   case offsetof(struct sk_msg_md, sk):
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg, sk));
+   break;
}
  
  	return insn - insn_buf;
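
The access check added for the new `sk` field follows a common pattern: a pointer-sized context slot must be read with a full 64-bit load, and the result is then typed as a socket pointer. Below is a userspace sketch of that pattern. `struct msg_ctx`, `enum reg_type` and `ctx_access_ok()` are hypothetical and much smaller than the real `struct sk_msg_md` and verifier types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down context layout; not the real struct sk_msg_md. */
struct msg_ctx {
	uint32_t family;
	uint32_t len;
	uint64_t sk;	/* pointer-sized slot, like __bpf_md_ptr(..., sk) */
};

enum reg_type { REG_SCALAR, REG_PTR_TO_SOCKET };

/* Validate a context load of 'width' bytes at offset 'off'. */
static bool ctx_access_ok(size_t off, size_t width, enum reg_type *type)
{
	assert(type != NULL);
	*type = REG_SCALAR;

	switch (off) {
	case offsetof(struct msg_ctx, sk):
		/* Pointer fields must be loaded with a full 64-bit read. */
		if (width != sizeof(uint64_t))
			return false;
		*type = REG_PTR_TO_SOCKET;
		return true;
	case offsetof(struct msg_ctx, family):
	case offsetof(struct msg_ctx, len):
		return width == sizeof(uint32_t);
	default:
		return false;
	}
}
```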

[PATCH net-next 0/2] Fixing compilation warnings and errors

2020-05-14 Thread Ayush Sawal

Patch 1: Fixes the warnings reported when compiling with the sparse tool.

Patch 2: Fixes a cocci check error introduced after commit
567be3a5d227 ("crypto: chelsio - Use multiple txq/rxq per
tfm to process the requests").


Ayush Sawal (2):
  Crypto/chcr: Fixes compilations warnings
  Crypto/chcr: Fixes a cocci check error

 drivers/crypto/chelsio/chcr_algo.c  | 9 +
 drivers/crypto/chelsio/chcr_ipsec.c | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.26.0.rc1.11.g30e9940

[PATCH net-next 1/2] Crypto/chcr: Fixes compilations warnings

2020-05-14 Thread Ayush Sawal

This patch fixes the compilation warnings reported by the sparse tool for
the chcr driver.

Signed-off-by: Ayush Sawal 
---
 drivers/crypto/chelsio/chcr_algo.c  | 8 
 drivers/crypto/chelsio/chcr_ipsec.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/chelsio/chcr_algo.c b/drivers/crypto/chelsio/chcr_algo.c
index b8c1c4dd3ef0..1aed0e8d6558 100644
--- a/drivers/crypto/chelsio/chcr_algo.c
+++ b/drivers/crypto/chelsio/chcr_algo.c
@@ -256,7 +256,7 @@ static void get_aes_decrypt_key(unsigned char *dec_key,
return;
}
for (i = 0; i < nk; i++)
-   w_ring[i] = be32_to_cpu(*(u32 *)&key[4 * i]);
+   w_ring[i] = be32_to_cpu(*(__be32 *)&key[4 * i]);
 
i = 0;
temp = w_ring[nk - 1];
@@ -275,7 +275,7 @@ static void get_aes_decrypt_key(unsigned char *dec_key,
}
i--;
for (k = 0, j = i % nk; k < nk; k++) {
-   *((u32 *)dec_key + k) = htonl(w_ring[j]);
+   *((__be32 *)dec_key + k) = htonl(w_ring[j]);
j--;
if (j < 0)
j += nk;
@@ -2926,7 +2926,7 @@ static int ccm_format_packet(struct aead_request *req,
memcpy(ivptr, req->iv, 16);
}
if (assoclen)
-   *((unsigned short *)(reqctx->scratch_pad + 16)) =
+   *((__be16 *)(reqctx->scratch_pad + 16)) =
htons(assoclen);
 
rc = generate_b0(req, ivptr, op_type);
@@ -3201,7 +3201,7 @@ static struct sk_buff *create_gcm_wr(struct aead_request *req,
} else {
memcpy(ivptr, req->iv, GCM_AES_IV_SIZE);
}
-   *((unsigned int *)(ivptr + 12)) = htonl(0x01);
+   *((__be32 *)(ivptr + 12)) = htonl(0x01);
 
ulptx = (struct ulptx_sgl *)(ivptr + 16);
 
diff --git a/drivers/crypto/chelsio/chcr_ipsec.c b/drivers/crypto/chelsio/chcr_ipsec.c
index d25689837b26..3a10f51ad6fd 100644
--- a/drivers/crypto/chelsio/chcr_ipsec.c
+++ b/drivers/crypto/chelsio/chcr_ipsec.c
@@ -403,7 +403,7 @@ inline void *copy_esn_pktxt(struct sk_buff *skb,
xo = xfrm_offload(skb);
 
aadiv->spi = (esphdr->spi);
-   seqlo = htonl(esphdr->seq_no);
+   seqlo = ntohl(esphdr->seq_no);
seqno = cpu_to_be64(seqlo + ((u64)xo->seq.hi << 32));
memcpy(aadiv->seq_no, &seqno, 8);
iv = skb_transport_header(skb) + sizeof(struct ip_esp_hdr);
-- 
2.26.0.rc1.11.g30e9940
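
The byte-order conversions touched by this patch can be illustrated in userspace. This is a sketch under stated assumptions: `load_be32()` and `esn_seqno()` are hypothetical helpers, and they use the POSIX `ntohl()` from `<arpa/inet.h>` in place of the kernel's `be32_to_cpu()`. The first reads a big-endian word from a key buffer, the second mirrors the `seqlo`/`seq.hi` combination in copy_esn_pktxt().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Read a 32-bit big-endian value from a byte buffer without aliasing casts. */
static uint32_t load_be32(const unsigned char *p)
{
	uint32_t be;

	assert(p != NULL);
	memcpy(&be, p, sizeof(be));
	return ntohl(be);	/* wire (big-endian) order -> host order */
}

/*
 * Combine the high 32 bits (host order) with a low half that is still in
 * wire order into one 64-bit sequence number in host order.
 */
static uint64_t esn_seqno(uint32_t seq_hi, uint32_t seq_lo_be)
{
	return ((uint64_t)seq_hi << 32) + ntohl(seq_lo_be);
}
```

On any host, htonl() and ntohl() perform the same byte swap, so the seqlo change does not alter the generated code. It only makes the direction of the conversion match the data, which is what sparse checks.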

[PATCH net-next 2/2] Crypto/chcr: Fixes a cocci check error

2020-05-14 Thread Ayush Sawal

This fixes an error reported by a coccinelle
check:
drivers/crypto/chelsio/chcr_algo.c:1462:5-8: Unneeded variable:
"err". Return "0" on line 1480

This line was missed in commit 567be3a5d227 ("crypto:
chelsio - Use multiple txq/rxq per tfm to process the requests").

Fixes: 567be3a5d227 ("crypto: chelsio - Use multiple txq/rxq per tfm to process the requests")

Signed-off-by: Ayush Sawal 
---
 drivers/crypto/chelsio/chcr_algo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/crypto/chelsio/chcr_algo.c b/drivers/crypto/chelsio/chcr_algo.c
index 1aed0e8d6558..c90b68aebe65 100644
--- a/drivers/crypto/chelsio/chcr_algo.c
+++ b/drivers/crypto/chelsio/chcr_algo.c
@@ -1462,6 +1462,7 @@ static int chcr_device_init(struct chcr_context *ctx)
int err = 0, rxq_perchan;
 
if (!ctx->dev) {
+   err = -ENXIO;
u_ctx = assign_chcr_device();
if (!u_ctx) {
pr_err("chcr device assignment fails\n");
-- 
2.26.0.rc1.11.g30e9940
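
The cocci report is about an init function that always returns 0, so a failed device assignment is not propagated to the caller. A userspace sketch of the intended error path is shown here. All names (`struct fake_ctx`, `fake_device_init()`, the `assign_*` callbacks) are hypothetical, and in this sketch the error code is set inside the failure branch so that the success path still returns 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_dev { int id; };
struct fake_ctx { struct fake_dev *dev; };

/* Hypothetical stand-in for assign_chcr_device(); may return NULL. */
typedef struct fake_dev *(*assign_fn)(void);

/* Returns 0 on success or -ENXIO when no device could be assigned. */
static int fake_device_init(struct fake_ctx *ctx, assign_fn assign)
{
	int err = 0;

	assert(ctx != NULL && assign != NULL);
	if (!ctx->dev) {
		struct fake_dev *dev = assign();

		if (!dev) {
			err = -ENXIO;	/* no device available */
			goto out;
		}
		ctx->dev = dev;
	}
out:
	return err;
}

/* Two assignment callbacks used to exercise both paths. */
static struct fake_dev the_dev = { 1 };
static struct fake_dev *assign_ok(void) { return &the_dev; }
static struct fake_dev *assign_none(void) { return NULL; }
```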

Re: [PATCH] KVM: MIPS/TLB: Remove Unneeded semicolon in tlb.c

2020-05-14 Thread Thomas Bogendoerfer

On Tue, Apr 28, 2020 at 02:32:45PM +0800, Jason Yan wrote:
> Fix the following coccicheck warning:
> 
> arch/mips/kvm/tlb.c:472:2-3: Unneeded semicolon
> arch/mips/kvm/tlb.c:489:2-3: Unneeded semicolon
> 
> Signed-off-by: Jason Yan 
> ---
>  arch/mips/kvm/tlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

applied to mips-next.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]

[PATCH v3 10/15] net: ethernet: mtk-eth-mac: new driver

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

This adds the driver for the MediaTek Ethernet MAC used on the MT8* SoC
family. For now we only support full-duplex.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/net/ethernet/mediatek/Kconfig       |    6 +
 drivers/net/ethernet/mediatek/Makefile      |    1 +
 drivers/net/ethernet/mediatek/mtk_eth_mac.c | 1578 +++
 3 files changed, 1585 insertions(+)
 create mode 100644 drivers/net/ethernet/mediatek/mtk_eth_mac.c

diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
index 5079b8090f16..5c3793076765 100644
--- a/drivers/net/ethernet/mediatek/Kconfig
+++ b/drivers/net/ethernet/mediatek/Kconfig
@@ -14,4 +14,10 @@ config NET_MEDIATEK_SOC
  This driver supports the gigabit ethernet MACs in the
  MediaTek SoC family.
 
+config NET_MEDIATEK_MAC
+   tristate "MediaTek Ethernet MAC support"
+   select PHYLIB
+   help
+ This driver supports the ethernet IP on MediaTek MT85** SoCs.
+
 endif #NET_VENDOR_MEDIATEK
diff --git a/drivers/net/ethernet/mediatek/Makefile b/drivers/net/ethernet/mediatek/Makefile
index 3362fb7ef859..f7f5638943a0 100644
--- a/drivers/net/ethernet/mediatek/Makefile
+++ b/drivers/net/ethernet/mediatek/Makefile
@@ -5,3 +5,4 @@
 
 obj-$(CONFIG_NET_MEDIATEK_SOC) += mtk_eth.o
 mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o
+obj-$(CONFIG_NET_MEDIATEK_MAC) += mtk_eth_mac.o
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_mac.c b/drivers/net/ethernet/mediatek/mtk_eth_mac.c
new file mode 100644
index ..6fbe49e861d6
--- /dev/null
+++ b/drivers/net/ethernet/mediatek/mtk_eth_mac.c
@@ -0,0 +1,1578 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020 MediaTek Corporation
+ * Copyright (c) 2020 BayLibre SAS
+ *
+ * Author: Bartosz Golaszewski 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MTK_MAC_DRVNAME "mtk_eth_mac"
+
+#define MTK_MAC_WAIT_TIMEOUT   300
+#define MTK_MAC_MAX_FRAME_SIZE 1514
+#define MTK_MAC_SKB_ALIGNMENT  16
+#define MTK_MAC_NAPI_WEIGHT 64
+#define MTK_MAC_HASHTABLE_MC_LIMIT 256
+#define MTK_MAC_HASHTABLE_SIZE_MAX 512
+
+/* This is defined to 0 on arm64 in arch/arm64/include/asm/processor.h but
+ * this IP doesn't work without this alignment being equal to 2.
+ */
+#ifdef NET_IP_ALIGN
+#undef NET_IP_ALIGN
+#endif
+#define NET_IP_ALIGN   2
+
+static const char *const mtk_mac_clk_names[] = { "core", "reg", "trans" };
+#define MTK_MAC_NCLKS ARRAY_SIZE(mtk_mac_clk_names)
+
+/* PHY Control Register 0 */
+#define MTK_MAC_REG_PHY_CTRL0  0x
+#define MTK_MAC_BIT_PHY_CTRL0_WTCMD BIT(13)
+#define MTK_MAC_BIT_PHY_CTRL0_RDCMD BIT(14)
+#define MTK_MAC_BIT_PHY_CTRL0_RWOK BIT(15)
+#define MTK_MAC_MSK_PHY_CTRL0_PREG GENMASK(12, 8)
+#define MTK_MAC_OFF_PHY_CTRL0_PREG 8
+#define MTK_MAC_MSK_PHY_CTRL0_RWDATA   GENMASK(31, 16)
+#define MTK_MAC_OFF_PHY_CTRL0_RWDATA   16
+
+/* PHY Control Register 1 */
+#define MTK_MAC_REG_PHY_CTRL1  0x0004
+#define MTK_MAC_BIT_PHY_CTRL1_LINK_ST  BIT(0)
+#define MTK_MAC_BIT_PHY_CTRL1_AN_EN BIT(8)
+#define MTK_MAC_OFF_PHY_CTRL1_FORCE_SPD 9
+#define MTK_MAC_VAL_PHY_CTRL1_FORCE_SPD_10M 0x00
+#define MTK_MAC_VAL_PHY_CTRL1_FORCE_SPD_100M   0x01
+#define MTK_MAC_VAL_PHY_CTRL1_FORCE_SPD_1000M  0x02
+#define MTK_MAC_BIT_PHY_CTRL1_FORCE_DPX BIT(11)
+#define MTK_MAC_BIT_PHY_CTRL1_FORCE_FC_RX  BIT(12)
+#define MTK_MAC_BIT_PHY_CTRL1_FORCE_FC_TX  BIT(13)
+
+/* MAC Configuration Register */
+#define MTK_MAC_REG_MAC_CFG 0x0008
+#define MTK_MAC_OFF_MAC_CFG_IPG 10
+#define MTK_MAC_VAL_MAC_CFG_IPG_96BIT  GENMASK(4, 0)
+#define MTK_MAC_BIT_MAC_CFG_MAXLEN_1522 BIT(16)
+#define MTK_MAC_BIT_MAC_CFG_AUTO_PAD   BIT(19)
+#define MTK_MAC_BIT_MAC_CFG_CRC_STRIP  BIT(20)
+#define MTK_MAC_BIT_MAC_CFG_VLAN_STRIP BIT(22)
+#define MTK_MAC_BIT_MAC_CFG_NIC_PD BIT(31)
+
+/* Flow-Control Configuration Register */
+#define MTK_MAC_REG_FC_CFG 0x000c
+#define MTK_MAC_BIT_FC_CFG_BP_EN   BIT(7)
+#define MTK_MAC_BIT_FC_CFG_UC_PAUSE_DIR BIT(8)
+#define MTK_MAC_OFF_FC_CFG_SEND_PAUSE_TH   16
+#define MTK_MAC_MSK_FC_CFG_SEND_PAUSE_TH   GENMASK(27, 16)
+#define MTK_MAC_VAL_FC_CFG_SEND_PAUSE_TH_2K 0x800
+
+/* ARL Configuration Register */
+#define MTK_MAC_REG_ARL_CFG 0x0010
+#define MTK_MAC_BIT_ARL_CFG_HASH_ALG   BIT(0)
+#define MTK_MAC_BIT_ARL_CFG_MISC_MODE  BIT(4)
+/* MAC High and Low Bytes Registers */
+#define MT
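
The register definitions above follow the usual BIT()/GENMASK() plus offset convention. As an illustration only, here is a userspace sketch of how such masks and offsets are typically combined into a register word. `BIT32()`, `GENMASK32()` and both functions are hypothetical stand-ins, and whether the driver composes PHY_CTRL0 exactly this way is an assumption, since that part of the patch is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace equivalents of the kernel's BIT() and GENMASK() for 32 bits. */
#define BIT32(n)	(UINT32_C(1) << (n))
#define GENMASK32(h, l)	((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Local copies of the PHY Control Register 0 field layout listed above. */
#define PHY_CTRL0_WTCMD		BIT32(13)
#define PHY_CTRL0_MSK_PREG	GENMASK32(12, 8)
#define PHY_CTRL0_OFF_PREG	8
#define PHY_CTRL0_MSK_RWDATA	GENMASK32(31, 16)
#define PHY_CTRL0_OFF_RWDATA	16

/* Build the register word for a write of 'data' to PHY register 'reg'. */
static uint32_t phy_ctrl0_write_cmd(unsigned int reg, uint16_t data)
{
	uint32_t val = PHY_CTRL0_WTCMD;

	assert(reg < 32);
	val |= ((uint32_t)reg << PHY_CTRL0_OFF_PREG) & PHY_CTRL0_MSK_PREG;
	val |= ((uint32_t)data << PHY_CTRL0_OFF_RWDATA) & PHY_CTRL0_MSK_RWDATA;
	return val;
}

/* Extract the 16-bit data field from a register word. */
static uint16_t phy_ctrl0_read_data(uint32_t val)
{
	return (uint16_t)((val & PHY_CTRL0_MSK_RWDATA) >> PHY_CTRL0_OFF_RWDATA);
}
```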

[PATCH v3 11/15] ARM64: dts: mediatek: add pericfg syscon to mt8516.dtsi

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

This adds support for the PERICFG register range as a syscon. This will
soon be used by the MediaTek Ethernet MAC driver for NIC configuration.

Signed-off-by: Bartosz Golaszewski 
---
 arch/arm64/boot/dts/mediatek/mt8516.dtsi | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/mt8516.dtsi b/arch/arm64/boot/dts/mediatek/mt8516.dtsi
index 2f8adf042195..8cedaf74ae86 100644
--- a/arch/arm64/boot/dts/mediatek/mt8516.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8516.dtsi
@@ -191,6 +191,11 @@ infracfg: infracfg@10001000 {
#clock-cells = <1>;
};
 
+   pericfg: pericfg@10003050 {
+   compatible = "mediatek,mt8516-pericfg", "syscon";
+   reg = <0 0x10003050 0 0x1000>;
+   };
+
apmixedsys: apmixedsys@10018000 {
compatible = "mediatek,mt8516-apmixedsys", "syscon";
reg = <0 0x10018000 0 0x710>;
-- 
2.25.0

[PATCH v3 15/15] ARM64: dts: mediatek: enable ethernet on pumpkin boards

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Add remaining properties to the ethernet node and enable it.

Signed-off-by: Bartosz Golaszewski 
---
 .../boot/dts/mediatek/pumpkin-common.dtsi  | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
index 4b1d5f69aba6..dfceffe6950a 100644
--- a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
+++ b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
@@ -167,6 +167,24 @@ &uart0 {
status = "okay";
 };
 
+&ethernet {
+   pinctrl-names = "default";
+   pinctrl-0 = <&ethernet_pins_default>;
+   phy-handle = <&eth_phy>;
+   phy-mode = "rmii";
+   mac-address = [00 00 00 00 00 00];
+   status = "okay";
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   eth_phy: ethernet-phy@0 {
+   reg = <0>;
+   };
+   };
+};
+
 &usb0 {
status = "okay";
dr_mode = "peripheral";
-- 
2.25.0

[PATCH v3 07/15] net: move devres helpers into a separate source file

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

There's currently only a single devres helper in net/ - devm variant
of alloc_etherdev. Let's move it to net/devres.c with the intention of
adding a second one: devm_register_netdev(). This new routine will need
to know the address of the release function of devm_alloc_etherdev() so
that it can verify (using devres_find()) that the struct net_device
that's being passed to it is also resource managed.

Signed-off-by: Bartosz Golaszewski 
---
 net/Makefile   |  2 +-
 net/devres.c   | 36 
 net/ethernet/eth.c | 28 
 3 files changed, 37 insertions(+), 29 deletions(-)
 create mode 100644 net/devres.c

diff --git a/net/Makefile b/net/Makefile
index 07ea48160874..5744bf1997fd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -6,7 +6,7 @@
 # Rewritten to use lists instead of if-statements.
 #
 
-obj-$(CONFIG_NET)  := socket.o core/
+obj-$(CONFIG_NET)  := devres.o socket.o core/
 
 tmp-$(CONFIG_COMPAT)   := compat.o
 obj-$(CONFIG_NET)  += $(tmp-y)
diff --git a/net/devres.c b/net/devres.c
new file mode 100644
index ..c1465d9f9019
--- /dev/null
+++ b/net/devres.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * This file contains all networking devres helpers.
+ */
+
+#include 
+#include 
+#include 
+
+static void devm_free_netdev(struct device *dev, void *res)
+{
+   free_netdev(*(struct net_device **)res);
+}
+
+struct net_device *devm_alloc_etherdev_mqs(struct device *dev, int sizeof_priv,
+  unsigned int txqs, unsigned int rxqs)
+{
+   struct net_device **dr;
+   struct net_device *netdev;
+
+   dr = devres_alloc(devm_free_netdev, sizeof(*dr), GFP_KERNEL);
+   if (!dr)
+   return NULL;
+
+   netdev = alloc_etherdev_mqs(sizeof_priv, txqs, rxqs);
+   if (!netdev) {
+   devres_free(dr);
+   return NULL;
+   }
+
+   *dr = netdev;
+   devres_add(dev, dr);
+
+   return netdev;
+}
+EXPORT_SYMBOL(devm_alloc_etherdev_mqs);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index c8b903302ff2..dac65180c4ef 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -400,34 +400,6 @@ struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs,
 }
 EXPORT_SYMBOL(alloc_etherdev_mqs);
 
-static void devm_free_netdev(struct device *dev, void *res)
-{
-   free_netdev(*(struct net_device **)res);
-}
-
-struct net_device *devm_alloc_etherdev_mqs(struct device *dev, int sizeof_priv,
-  unsigned int txqs, unsigned int rxqs)
-{
-   struct net_device **dr;
-   struct net_device *netdev;
-
-   dr = devres_alloc(devm_free_netdev, sizeof(*dr), GFP_KERNEL);
-   if (!dr)
-   return NULL;
-
-   netdev = alloc_etherdev_mqs(sizeof_priv, txqs, rxqs);
-   if (!netdev) {
-   devres_free(dr);
-   return NULL;
-   }
-
-   *dr = netdev;
-   devres_add(dev, dr);
-
-   return netdev;
-}
-EXPORT_SYMBOL(devm_alloc_etherdev_mqs);
-
 ssize_t sysfs_format_mac(char *buf, const unsigned char *addr, int len)
 {
return scnprintf(buf, PAGE_SIZE, "%*phC\n", len, addr);
-- 
2.25.0

[PATCH v3 09/15] net: devres: provide devm_register_netdev()

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Provide devm_register_netdev() - a device resource managed variant
of register_netdev(). This new helper will only work for net_device
structs that are also already managed by devres.

Signed-off-by: Bartosz Golaszewski 
---
 .../driver-api/driver-model/devres.rst|  1 +
 include/linux/netdevice.h |  2 +
 net/devres.c  | 55 +++
 3 files changed, 58 insertions(+)

diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst
index 50df28d20fa7..fc242ed4bde5 100644
--- a/Documentation/driver-api/driver-model/devres.rst
+++ b/Documentation/driver-api/driver-model/devres.rst
@@ -375,6 +375,7 @@ MUX
 NET
   devm_alloc_etherdev()
   devm_alloc_etherdev_mqs()
+  devm_register_netdev()
 
 PER-CPU MEM
   devm_alloc_percpu()
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 130a668049ab..c4ad728993dd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4208,6 +4208,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 int register_netdev(struct net_device *dev);
 void unregister_netdev(struct net_device *dev);
 
+int devm_register_netdev(struct device *dev, struct net_device *ndev);
+
 /* General hardware address lists handling functions */
 int __hw_addr_sync(struct netdev_hw_addr_list *to_list,
   struct netdev_hw_addr_list *from_list, int addr_len);
diff --git a/net/devres.c b/net/devres.c
index b97b0c5a8216..57a6a88d11f6 100644
--- a/net/devres.c
+++ b/net/devres.c
@@ -38,3 +38,58 @@ struct net_device *devm_alloc_etherdev_mqs(struct device *dev, int sizeof_priv,
return dr->ndev;
 }
 EXPORT_SYMBOL(devm_alloc_etherdev_mqs);
+
+static void devm_netdev_release(struct device *dev, void *this)
+{
+   struct net_device_devres *res = this;
+
+   unregister_netdev(res->ndev);
+}
+
+static int netdev_devres_match(struct device *dev, void *this, void *match_data)
+{
+   struct net_device_devres *res = this;
+   struct net_device *ndev = match_data;
+
+   return ndev == res->ndev;
+}
+
+/**
+ * devm_register_netdev - resource managed variant of register_netdev()
+ * @dev: managing device for this netdev - usually the parent device
+ * @ndev: device to register
+ *
+ * This is a devres variant of register_netdev() for which the unregister
+ * function will be called automatically when the managing device is
+ * detached. Note: the net_device used must also be resource managed by
+ * the same struct device.
+ */
+int devm_register_netdev(struct device *dev, struct net_device *ndev)
+{
+   struct net_device_devres *dr;
+   int ret;
+
+   /* struct net_device must itself be managed. For now a managed netdev
+* can only be allocated by devm_alloc_etherdev_mqs() so the check is
+* straightforward.
+*/
+   if (WARN_ON(!devres_find(dev, devm_free_netdev,
+netdev_devres_match, ndev)))
+   return -EINVAL;
+
+   dr = devres_alloc(devm_netdev_release, sizeof(*dr), GFP_KERNEL);
+   if (!dr)
+   return -ENOMEM;
+
+   ret = register_netdev(ndev);
+   if (ret) {
+   devres_free(dr);
+   return ret;
+   }
+
+   dr->ndev = ndev;
+   devres_add(ndev->dev.parent, dr);
+
+   return 0;
+}
+EXPORT_SYMBOL(devm_register_netdev);
-- 
2.25.0
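
The idea behind devm_register_netdev() (only register a netdev that is already managed, and undo everything in reverse order on detach) can be modelled in userspace. The sketch below is a simplified model, not the devres API: `struct res`, `res_add()`, `res_find()`, `res_release_all()`, `struct fake_netdev` and `managed_register()` are all hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical model of a devres entry: a release callback plus payload. */
typedef void (*release_fn)(void *data);

struct res {
	struct res *next;
	release_fn release;
	void *data;
};

struct dev {
	struct res *head;	/* most recently added resource first */
};

/* Attach a resource to the device; returns 0 or -ENOMEM. */
static int res_add(struct dev *d, release_fn release, void *data)
{
	struct res *r;

	assert(d != NULL && release != NULL);
	r = malloc(sizeof(*r));
	if (!r)
		return -ENOMEM;
	r->release = release;
	r->data = data;
	r->next = d->head;
	d->head = r;
	return 0;
}

/* Look up a resource by release function and payload. */
static struct res *res_find(struct dev *d, release_fn release, void *data)
{
	struct res *r;

	for (r = d->head; r; r = r->next)
		if (r->release == release && r->data == data)
			return r;
	return NULL;
}

/* Release everything in reverse order of registration, as on detach. */
static void res_release_all(struct dev *d)
{
	while (d->head) {
		struct res *r = d->head;

		d->head = r->next;
		r->release(r->data);
		free(r);
	}
}

/* Fake netdev that records the order in which its callbacks ran. */
struct fake_netdev { int registered; int unregistered_at; int freed_at; };
static int release_counter;

static void fake_free(void *data)
{
	struct fake_netdev *n = data;

	n->freed_at = ++release_counter;
}

static void fake_unregister(void *data)
{
	struct fake_netdev *n = data;

	n->registered = 0;
	n->unregistered_at = ++release_counter;
}

/* Register only if the netdev itself is managed by the same device. */
static int managed_register(struct dev *d, struct fake_netdev *n)
{
	int ret;

	if (!res_find(d, fake_free, n))
		return -EINVAL;
	ret = res_add(d, fake_unregister, n);
	if (ret)
		return ret;
	n->registered = 1;
	return 0;
}
```

Because resources are released in reverse order, the unregister callback runs before the free callback, which is the ordering a managed netdev needs.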

[PATCH v3 13/15] ARM64: dts: mediatek: add an alias for ethernet0 for pumpkin boards

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Add the ethernet0 alias for ethernet so that u-boot can find this node
and fill in the MAC address.

Signed-off-by: Bartosz Golaszewski 
---
 arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
index a31093d7142b..97d9b000c37e 100644
--- a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
+++ b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
@@ -9,6 +9,7 @@
 / {
aliases {
serial0 = &uart0;
+   ethernet0 = &ethernet;
};
 
chosen {
-- 
2.25.0

[PATCH v3 12/15] ARM64: dts: mediatek: add the ethernet node to mt8516.dtsi

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Add the Ethernet MAC node to mt8516.dtsi. This defines parameters common
to all the boards based on this SoC.

Signed-off-by: Bartosz Golaszewski 
---
 arch/arm64/boot/dts/mediatek/mt8516.dtsi | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/mt8516.dtsi b/arch/arm64/boot/dts/mediatek/mt8516.dtsi
index 8cedaf74ae86..89af661e7f63 100644
--- a/arch/arm64/boot/dts/mediatek/mt8516.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8516.dtsi
@@ -406,6 +406,18 @@ mmc2: mmc@1117 {
status = "disabled";
};
 
+   ethernet: ethernet@1118 {
+   compatible = "mediatek,mt8516-eth";
+   reg = <0 0x1118 0 0x1000>;
+   mediatek,pericfg = <&pericfg>;
+   interrupts = ;
+   clocks = <&topckgen CLK_TOP_RG_ETH>,
+<&topckgen CLK_TOP_66M_ETH>,
+<&topckgen CLK_TOP_133M_ETH>;
+   clock-names = "core", "reg", "trans";
+   status = "disabled";
+   };
+
rng: rng@1020c000 {
compatible = "mediatek,mt8516-rng",
 "mediatek,mt7623-rng";
-- 
2.25.0

[PATCH v3 06/15] Documentation: devres: add a missing section for networking helpers

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Add a new section for networking devres helpers to devres.rst and list
the two existing devm functions.

Signed-off-by: Bartosz Golaszewski 
---
 Documentation/driver-api/driver-model/devres.rst | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst
index 46c13780994c..50df28d20fa7 100644
--- a/Documentation/driver-api/driver-model/devres.rst
+++ b/Documentation/driver-api/driver-model/devres.rst
@@ -372,6 +372,10 @@ MUX
   devm_mux_chip_register()
   devm_mux_control_get()
 
+NET
+  devm_alloc_etherdev()
+  devm_alloc_etherdev_mqs()
+
 PER-CPU MEM
   devm_alloc_percpu()
   devm_free_percpu()
-- 
2.25.0

[PATCH v3 08/15] net: devres: define a separate devres structure for devm_alloc_etherdev()

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Not using a proxy structure to store struct net_device doesn't save
anything in terms of compiled code size or memory usage but significantly
decreases the readability of the code with all the pointer casting.

Define struct net_device_devres and use it in devm_alloc_etherdev_mqs().

Signed-off-by: Bartosz Golaszewski 
---
 net/devres.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/devres.c b/net/devres.c
index c1465d9f9019..b97b0c5a8216 100644
--- a/net/devres.c
+++ b/net/devres.c
@@ -7,30 +7,34 @@
 #include 
 #include 
 
-static void devm_free_netdev(struct device *dev, void *res)
+struct net_device_devres {
+   struct net_device *ndev;
+};
+
+static void devm_free_netdev(struct device *dev, void *this)
 {
-   free_netdev(*(struct net_device **)res);
+   struct net_device_devres *res = this;
+
+   free_netdev(res->ndev);
 }
 
 struct net_device *devm_alloc_etherdev_mqs(struct device *dev, int sizeof_priv,
   unsigned int txqs, unsigned int rxqs)
 {
-   struct net_device **dr;
-   struct net_device *netdev;
+   struct net_device_devres *dr;
 
dr = devres_alloc(devm_free_netdev, sizeof(*dr), GFP_KERNEL);
if (!dr)
return NULL;
 
-   netdev = alloc_etherdev_mqs(sizeof_priv, txqs, rxqs);
-   if (!netdev) {
+   dr->ndev = alloc_etherdev_mqs(sizeof_priv, txqs, rxqs);
+   if (!dr->ndev) {
devres_free(dr);
return NULL;
}
 
-   *dr = netdev;
devres_add(dev, dr);
 
-   return netdev;
+   return dr->ndev;
 }
 EXPORT_SYMBOL(devm_alloc_etherdev_mqs);
-- 
2.25.0
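
The readability argument in the commit message above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `fake_netdev`, `fake_netdev_devres`, `fake_alloc()` and `fake_release()` are hypothetical stand-ins, and `calloc()`/`free()` replace `devres_alloc()`, `alloc_etherdev_mqs()` and `free_netdev()`. It only shows how a named proxy structure lets the release callback avoid the `*(struct net_device **)res` cast.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device; not the kernel definition. */
struct fake_netdev {
	int id;
};

/* Proxy record: the managed pointer is reached by name, not by a cast. */
struct fake_netdev_devres {
	struct fake_netdev *ndev;
};

/* Counts release calls; exists only so the sketch can be checked. */
int freed_count;

/* Release callback: takes the opaque record, as a devres release would. */
void fake_free_netdev(void *this)
{
	struct fake_netdev_devres *res = this;

	free(res->ndev);
	res->ndev = NULL;
	freed_count++;
}

/* Allocate the record first, then the device, in the same order as the patch. */
struct fake_netdev_devres *fake_alloc(int id)
{
	struct fake_netdev_devres *dr = calloc(1, sizeof(*dr));

	if (!dr)
		return NULL;

	dr->ndev = calloc(1, sizeof(*dr->ndev));
	if (!dr->ndev) {
		free(dr);
		return NULL;
	}

	dr->ndev->id = id;
	return dr;
}

/* Simulates what happens at driver detach: run the callback, drop the record. */
void fake_release(struct fake_netdev_devres *dr)
{
	fake_free_netdev(dr);
	free(dr);
}
```

A caller would pair `fake_alloc()` with `fake_release()`; in the kernel the second step is performed by the devres core when the device is unbound, which is the point of the managed helpers.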

[PATCH v3 05/15] net: ethernet: mediatek: remove unnecessary spaces from Makefile

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

The Makefile formatting in the kernel tree usually doesn't use tabs,
so remove them before we add a second driver.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/net/ethernet/mediatek/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/Makefile 
b/drivers/net/ethernet/mediatek/Makefile
index 2d8362f9341b..3362fb7ef859 100644
--- a/drivers/net/ethernet/mediatek/Makefile
+++ b/drivers/net/ethernet/mediatek/Makefile
@@ -3,5 +3,5 @@
 # Makefile for the Mediatek SoCs built-in ethernet macs
 #
 
-obj-$(CONFIG_NET_MEDIATEK_SOC) += mtk_eth.o
+obj-$(CONFIG_NET_MEDIATEK_SOC) += mtk_eth.o
 mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o
-- 
2.25.0

[PATCH v3 04/15] net: ethernet: mediatek: rename Kconfig prompt

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

We'll soon be adding a second MediaTek Ethernet driver, so modify the
Kconfig prompt.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/net/ethernet/mediatek/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/Kconfig 
b/drivers/net/ethernet/mediatek/Kconfig
index 4968352ba188..5079b8090f16 100644
--- a/drivers/net/ethernet/mediatek/Kconfig
+++ b/drivers/net/ethernet/mediatek/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NET_VENDOR_MEDIATEK
-   bool "MediaTek ethernet driver"
+   bool "MediaTek devices"
depends on ARCH_MEDIATEK || SOC_MT7621 || SOC_MT7620
---help---
  If you have a Mediatek SoC with ethernet, say Y.
-- 
2.25.0

[PATCH v3 14/15] ARM64: dts: mediatek: add ethernet pins for pumpkin boards

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Setup the pin control for the Ethernet MAC.

Signed-off-by: Bartosz Golaszewski 
---
 arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi 
b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
index 97d9b000c37e..4b1d5f69aba6 100644
--- a/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
+++ b/arch/arm64/boot/dts/mediatek/pumpkin-common.dtsi
@@ -219,4 +219,19 @@ gpio_mux_int_n_pin {
bias-pull-up;
};
};
+
+   ethernet_pins_default: ethernet {
+   pins_ethernet {
+   pinmux = ,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   };
+   };
 };
-- 
2.25.0

[PATCH v3 03/15] dt-bindings: net: add a binding document for MediaTek Ethernet MAC

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

This adds yaml DT bindings for the MediaTek Ethernet MAC present on the
mt8* family of SoCs.

Signed-off-by: Bartosz Golaszewski 
---
 .../bindings/net/mediatek,eth-mac.yaml| 89 +++
 1 file changed, 89 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/mediatek,eth-mac.yaml

diff --git a/Documentation/devicetree/bindings/net/mediatek,eth-mac.yaml 
b/Documentation/devicetree/bindings/net/mediatek,eth-mac.yaml
new file mode 100644
index ..8ffd0b762c0f
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mediatek,eth-mac.yaml
@@ -0,0 +1,89 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/mediatek,eth-mac.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Ethernet MAC Controller
+
+maintainers:
+  - Bartosz Golaszewski 
+
+description:
+  This Ethernet MAC is used on the MT8* family of SoCs from MediaTek.
+  It's compliant with 802.3 standards and supports half- and full-duplex
+  modes with flow-control as well as CRC offloading and VLAN tags.
+
+allOf:
+  - $ref: "ethernet-controller.yaml#"
+
+properties:
+  compatible:
+enum:
+  - mediatek,mt8516-eth
+  - mediatek,mt8518-eth
+  - mediatek,mt8175-eth
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+
+  clocks:
+minItems: 3
+maxItems: 3
+
+  clock-names:
+additionalItems: false
+items:
+  - const: core
+  - const: reg
+  - const: trans
+
+  mediatek,pericfg:
+$ref: /schemas/types.yaml#definitions/phandle
+description:
+  Phandle to the device containing the PERICFG register range. This is used
+  to control the MII mode.
+
+  mdio:
+type: object
+description:
+  Creates and registers an MDIO bus.
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+  - clock-names
+  - mediatek,pericfg
+  - phy-handle
+
+examples:
+  - |
+#include 
+#include 
+
+ethernet: ethernet@1118 {
+compatible = "mediatek,mt8516-eth";
+reg = <0x1118 0x1000>;
+mediatek,pericfg = <&pericfg>;
+interrupts = ;
+clocks = <&topckgen CLK_TOP_RG_ETH>,
+ <&topckgen CLK_TOP_66M_ETH>,
+ <&topckgen CLK_TOP_133M_ETH>;
+clock-names = "core", "reg", "trans";
+phy-handle = <&eth_phy>;
+phy-mode = "rmii";
+
+mdio {
+#address-cells = <1>;
+#size-cells = <0>;
+
+eth_phy: ethernet-phy@0 {
+reg = <0>;
+};
+};
+};
-- 
2.25.0

[PATCH v3 00/15] mediatek: add support for MediaTek Ethernet MAC

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

This adds support for the Ethernet Controller present on MediaTek SoCs from
the MT8* family.

First we convert the existing DT bindings for the PERICFG controller to YAML
and add a new compatible string for mt8516 variant of it. Then we add the DT
bindings for the MAC.

Next we do some cleanup of the mediatek ethernet drivers directory and update
the devres documentation with existing networking devres helpers.

The following patches introduce a resource managed variant of
register_netdev() and move all networking devres helpers into a separate .c
file.

The largest patch in the series adds the actual new driver.

The rest of the patches add DT fixups for the boards already supported
upstream.

v1 -> v2:
- add a generic helper for retrieving the net_device associated with given
  private data
- fix several typos in commit messages
- remove MTK_MAC_VERSION and don't set the driver version
- use NET_IP_ALIGN instead of a magic number (2) but redefine it as it defaults
  to 0 on arm64
- don't manually turn the carrier off in mtk_mac_enable()
- process TX cleanup in napi poll callback
- configure pause in the adjust_link callback
- use regmap_read_poll_timeout() instead of handcoding the polling
- use devres_find() to verify that struct net_device is managed by devres in
  devm_register_netdev()
- add a patch moving all networking devres helpers into net/devres.c
- tweak the dma barriers: remove where unnecessary and add comments to the
  remaining barriers
- don't reset internal counters when enabling the NIC
- set the net_device's mtu size instead of checking the framesize in
  ndo_start_xmit() callback
- fix a race condition in waking up the netif queue
- don't emit log messages on OOM errors
- use dma_set_mask_and_coherent()
- use eth_hw_addr_random()
- rework the receive callback so that we reuse the previous skb if unmapping
  fails, like we already do if skb allocation fails
- rework hash table operations: add proper timeout handling and clear bits when
  appropriate

v2 -> v3:
- drop the patch adding priv_to_netdev() and store the netdev pointer in the
  driver private data
- add an additional dma_wmb() after resetting the descriptor in
  mtk_mac_ring_pop_tail()
- check the return value of dma_set_mask_and_coherent()
- improve the DT bindings for mtk-eth-mac: make the reg property in the example
  use single-cell address and size, extend the description of the PERICFG
  phandle and document the mdio sub-node
- add a patch converting the old .txt bindings for PERICFG to yaml
- limit reading the DMA memory by storing the mapped addresses in the driver
  private structure
- add a patch documenting the existing networking devres helpers

Bartosz Golaszewski (15):
  dt-bindings: convert the binding document for mediatek PERICFG to yaml
  dt-bindings: add new compatible to mediatek,pericfg
  dt-bindings: net: add a binding document for MediaTek Ethernet MAC
  net: ethernet: mediatek: rename Kconfig prompt
  net: ethernet: mediatek: remove unnecessary spaces from Makefile
  Documentation: devres: add a missing section for networking helpers
  net: move devres helpers into a separate source file
  net: devres: define a separate devres structure for
devm_alloc_etherdev()
  net: devres: provide devm_register_netdev()
  net: ethernet: mtk-eth-mac: new driver
  ARM64: dts: mediatek: add pericfg syscon to mt8516.dtsi
  ARM64: dts: mediatek: add the ethernet node to mt8516.dtsi
  ARM64: dts: mediatek: add an alias for ethernet0 for pumpkin boards
  ARM64: dts: mediatek: add ethernet pins for pumpkin boards
  ARM64: dts: mediatek: enable ethernet on pumpkin boards

 .../arm/mediatek/mediatek,pericfg.txt |   36 -
 .../arm/mediatek/mediatek,pericfg.yaml|   64 +
 .../bindings/net/mediatek,eth-mac.yaml|   89 +
 .../driver-api/driver-model/devres.rst|5 +
 arch/arm64/boot/dts/mediatek/mt8516.dtsi  |   17 +
 .../boot/dts/mediatek/pumpkin-common.dtsi |   34 +
 drivers/net/ethernet/mediatek/Kconfig |8 +-
 drivers/net/ethernet/mediatek/Makefile|3 +-
 drivers/net/ethernet/mediatek/mtk_eth_mac.c   | 1578 +
 include/linux/netdevice.h |2 +
 net/Makefile  |2 +-
 net/devres.c  |   95 +
 net/ethernet/eth.c|   28 -
 13 files changed, 1894 insertions(+), 67 deletions(-)
 delete mode 100644 
Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.txt
 create mode 100644 
Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
 create mode 100644 Documentation/devicetree/bindings/net/mediatek,eth-mac.yaml
 create mode 100644 drivers/net/ethernet/mediatek/mtk_eth_mac.c
 create mode 100644 net/devres.c

-- 
2.25.0

[PATCH v3 02/15] dt-bindings: add new compatible to mediatek,pericfg

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

The PERICFG controller is present on the MT8516 SoC. Add an appropriate
compatible variant.

Signed-off-by: Bartosz Golaszewski 
---
 .../devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml 
b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
index 1340c6288024..55209a2baedc 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
@@ -25,6 +25,7 @@ properties:
   - mediatek,mt8135-pericfg
   - mediatek,mt8173-pericfg
   - mediatek,mt8183-pericfg
+  - mediatek,mt8516-pericfg
 - const: syscon
   - items:
 # Special case for mt7623 for backward compatibility
-- 
2.25.0

[PATCH v3 01/15] dt-bindings: convert the binding document for mediatek PERICFG to yaml

2020-05-14 Thread Bartosz Golaszewski

From: Bartosz Golaszewski 

Convert the DT binding .txt file for MediaTek's peripheral configuration
controller to YAML. There's one special case where the compatible has
three positions. Otherwise, it's a pretty normal syscon.

Signed-off-by: Bartosz Golaszewski 
---
 .../arm/mediatek/mediatek,pericfg.txt | 36 ---
 .../arm/mediatek/mediatek,pericfg.yaml| 63 +++
 2 files changed, 63 insertions(+), 36 deletions(-)
 delete mode 100644 
Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.txt
 create mode 100644 
Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml

diff --git 
a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.txt 
b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.txt
deleted file mode 100644
index ecf027a9003a..
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.txt
+++ /dev/null
@@ -1,36 +0,0 @@
-Mediatek pericfg controller
-===
-
-The Mediatek pericfg controller provides various clocks and reset
-outputs to the system.
-
-Required Properties:
-
-- compatible: Should be one of:
-   - "mediatek,mt2701-pericfg", "syscon"
-   - "mediatek,mt2712-pericfg", "syscon"
-   - "mediatek,mt7622-pericfg", "syscon"
-   - "mediatek,mt7623-pericfg", "mediatek,mt2701-pericfg", "syscon"
-   - "mediatek,mt7629-pericfg", "syscon"
-   - "mediatek,mt8135-pericfg", "syscon"
-   - "mediatek,mt8173-pericfg", "syscon"
-   - "mediatek,mt8183-pericfg", "syscon"
-- #clock-cells: Must be 1
-- #reset-cells: Must be 1
-
-The pericfg controller uses the common clk binding from
-Documentation/devicetree/bindings/clock/clock-bindings.txt
-The available clocks are defined in dt-bindings/clock/mt*-clk.h.
-Also it uses the common reset controller binding from
-Documentation/devicetree/bindings/reset/reset.txt.
-The available reset outputs are defined in
-dt-bindings/reset/mt*-resets.h
-
-Example:
-
-pericfg: power-controller@10003000 {
-   compatible = "mediatek,mt8173-pericfg", "syscon";
-   reg = <0 0x10003000 0 0x1000>;
-   #clock-cells = <1>;
-   #reset-cells = <1>;
-};
diff --git 
a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml 
b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
new file mode 100644
index ..1340c6288024
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,pericfg.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: MediaTek Peripheral Configuration Controller
+
+maintainers:
+  - Bartosz Golaszewski 
+
+description:
+  The Mediatek pericfg controller provides various clocks and reset outputs
+  to the system.
+
+properties:
+  compatible:
+oneOf:
+  - items:
+- enum:
+  - mediatek,mt2701-pericfg
+  - mediatek,mt2712-pericfg
+  - mediatek,mt7622-pericfg
+  - mediatek,mt7629-pericfg
+  - mediatek,mt8135-pericfg
+  - mediatek,mt8173-pericfg
+  - mediatek,mt8183-pericfg
+- const: syscon
+  - items:
+# Special case for mt7623 for backward compatibility
+- const: mediatek,mt7623-pericfg
+- const: mediatek,mt2701-pericfg
+- const: syscon
+
+  reg:
+maxItems: 1
+
+  '#clock-cells':
+const: 1
+
+  '#reset-cells':
+const: 1
+
+required:
+  - compatible
+  - reg
+
+examples:
+  - |
+pericfg@10003000 {
+compatible = "mediatek,mt8173-pericfg", "syscon";
+reg = <0x10003000 0x1000>;
+#clock-cells = <1>;
+#reset-cells = <1>;
+};
+
+  - |
+pericfg@10003000 {
+compatible = "mediatek,mt7623-pericfg", "mediatek,mt2701-pericfg", "syscon";
+reg = <0x10003000 0x1000>;
+#clock-cells = <1>;
+#reset-cells = <1>;
+};
-- 
2.25.0

Re: [bpf-next PATCH 2/3] bpf: sk_msg helpers for probe_* and current_task

2020-05-14 Thread Daniel Borkmann


On 5/13/20 9:24 PM, John Fastabend wrote:

Often it is useful when applying policy to know something about the
task. If the administrator has CAP_SYS_ADMIN rights then they can
use kprobe + sk_msg and link the two programs together to accomplish
this. However, this is a bit clunky and also means we have to call
sk_msg program and kprobe program when we could just use a single
program and avoid passing metadata through sk_msg/skb, socket, etc.

To accomplish this add probe_* helpers to sk_msg programs guarded
by a CAP_SYS_ADMIN check. New supported helpers are the following,

  BPF_FUNC_get_current_task
  BPF_FUNC_current_task_under_cgroup
  BPF_FUNC_probe_read_user
  BPF_FUNC_probe_read_kernel
  BPF_FUNC_probe_read
  BPF_FUNC_probe_read_user_str
  BPF_FUNC_probe_read_kernel_str
  BPF_FUNC_probe_read_str


Given the current discussion in the other thread with Linus et al, please
don't add more users for BPF_FUNC_probe_read and BPF_FUNC_probe_read_str
as I'm cooking up a patch to disable them on non-x86, and cleanups from
Christoph would make them less efficient than the *_user/_kernel{,_str}()
versions anyway, so let's only add the latter.

Thanks,
Daniel

Re: [PATCH 7/9] bpf: Compile the BTF id whitelist data in vmlinux

2020-05-14 Thread Jiri Olsa

On Wed, May 13, 2020 at 11:29:40AM -0700, Alexei Starovoitov wrote:

SNIP

> > diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> > index d09ab4afbda4..dee91c6bf450 100755
> > --- a/scripts/link-vmlinux.sh
> > +++ b/scripts/link-vmlinux.sh
> > @@ -130,16 +130,26 @@ gen_btf()
> > info "BTF" ${2}
> > LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1}
> >  
> > -   # Create ${2} which contains just .BTF section but no symbols. Add
> > +   # Create object which contains just .BTF section but no symbols. Add
> > # SHF_ALLOC because .BTF will be part of the vmlinux image. --strip-all
> > # deletes all symbols including __start_BTF and __stop_BTF, which will
> > # be redefined in the linker script. Add 2>/dev/null to suppress GNU
> > # objcopy warnings: "empty loadable segment detected at ..."
> > ${OBJCOPY} --only-section=.BTF --set-section-flags .BTF=alloc,readonly \
> > -   --strip-all ${1} ${2} 2>/dev/null
> > -   # Change e_type to ET_REL so that it can be used to link final vmlinux.
> > -   # Unlike GNU ld, lld does not allow an ET_EXEC input.
> > -   printf '\1' | dd of=${2} conv=notrunc bs=1 seek=16 status=none
> > +   --strip-all ${1} 2>/dev/null
> > +
> > +   # Create object that contains just .BTF_whitelist_* sections generated
> > +   # by bpfwl. Same as BTF section, BTF_whitelist_* data will be part of
> > +   # the vmlinux image, hence SHF_ALLOC.
> > +   whitelist=.btf.vmlinux.whitelist
> > +
> > +   ${BPFWL} ${1} kernel/bpf/helpers-whitelist > ${whitelist}.c
> > +   ${CC} -c -o ${whitelist}.o ${whitelist}.c
> > +   ${OBJCOPY} --only-section=.BTF_whitelist* --set-section-flags 
> > .BTF=alloc,readonly \
> > +--strip-all ${whitelist}.o 2>/dev/null
> > +
> > +   # Link BTF and BTF_whitelist objects together
> > +   ${LD} -r -o ${2} ${1} ${whitelist}.o
> 
> Thank you for working on it!
> Looks great to me overall. In the next rev please drop RFC tag.
> 
> My only concern is this extra linking step. How many extra seconds does it 
> add?

I did not measure, but I haven't noticed any noticeable delay;
I'll add measurements to the next post

> 
> Also in patch 3:
> +   func = func__find(str);
> +   if (func)
> +   func->id = id;
> which means that if somebody mistyped the name or that kernel function
> got renamed there will be no warnings or errors.
> I think it needs to fail the build instead.

it fails later on, when generating the array:

 if (!func->id) {
 fprintf(stderr, "FAILED: '%s' function not found in BTF data\n",
 func->name);
 return -1;
 }

but it can clearly fail before that.. I'll change that
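
A minimal userspace sketch of the earlier failure point discussed here, under stated assumptions: `struct func`, `func_find()` and `resolve_ids()` are hypothetical names (the patch's own helper is `func__find()`), the BTF data is modelled as a plain array of names, and BTF id `i + 1` is assumed for `names[i]`. The idea is to verify every whitelist entry right after the scan, before any array is generated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical whitelist entry; id 0 means "not resolved yet". */
struct func {
	const char *name;
	int id;
};

/* Linear lookup by name; returns NULL when the name is not whitelisted. */
struct func *func_find(struct func *funcs, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(funcs[i].name, name) == 0)
			return &funcs[i];
	return NULL;
}

/*
 * Assign ids from the (simulated) BTF name table, then fail immediately
 * if any whitelist entry is still unresolved, e.g. because of a typo or
 * a renamed kernel function.
 */
int resolve_ids(struct func *funcs, size_t n,
		const char *const *names, size_t n_names)
{
	for (size_t i = 0; i < n_names; i++) {
		struct func *f = func_find(funcs, n, names[i]);

		if (f)
			f->id = (int)i + 1;
	}

	for (size_t i = 0; i < n; i++) {
		if (!funcs[i].id) {
			fprintf(stderr,
				"FAILED: '%s' function not found in BTF data\n",
				funcs[i].name);
			return -1;
		}
	}

	return 0;
}
```

With this shape a mistyped whitelist name aborts the build step as soon as the scan completes, instead of surfacing later during array generation.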

> 
> If additional linking step takes another 20 seconds it could be a reason
> to move the search to run-time.
> We already have that with struct bpf_func_proto->btf_id[].
> Whitelist could be something similar.
> I think this mechanism will be reused for unstable helpers and other
> func->btf_id mappings, so 'bpfwl' name would change eventually.
> It's not white list specific. It generates a mapping of names to btf_ids.
> Doing it at build time vs run-time is a trade off and it doesn't have
> an obvious answer.

I was thinking of putting the names in __init section and generate the BTF
ids on kernel start, but the build time generation seemed more convenient..
let's see the linking times with 'real size' whitelist and we can reconsider

thanks,
jirka

RE: [PATCH 27/33] sctp: export sctp_setsockopt_bindx

2020-05-14 Thread David Laight

From: Marcelo Ricardo Leitner
> Sent: 13 May 2020 19:01
> On Wed, May 13, 2020 at 08:26:42AM +0200, Christoph Hellwig wrote:
> > And call it directly from dlm instead of going through kernel_setsockopt.
> 
> The advantage on using kernel_setsockopt here is that sctp module will
> only be loaded if dlm actually creates a SCTP socket.  With this
> change, sctp will be loaded on setups that may not be actually using
> it. It's a quite big module and might expose the system.
> 
> I'm okay with the SCTP changes, but I'll defer to DLM folks to whether
> that's too bad or what for DLM.

I didn't see these sneak through.

There is a big long list of SCTP socket options that are
needed to make anything work.

They all need exporting.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

Re: [EXT] Re: signal quality and cable diagnostic

2020-05-14 Thread Oleksij Rempel

Hi Christian,

On Thu, May 14, 2020 at 07:13:30AM +, Christian Herber wrote:
> On Tue, May 12, 2020 at 10:22:01AM +0200, Oleksij Rempel wrote:
> 
> > So I think we should pass raw SQI value to user space, at least in the
> > first implementation.
> 
> > What do you think about this?
> 
> Hi Oleksij,
> 
> I had a check about the background of this SQI thing. The table you reference 
> with concrete SNR values is informative only and not a requirement. The 
> requirements are rather loose.
> 
> This is from OA:
> - Only for SQI=0 a link loss shall occur.
> - The indicated signal quality shall monotonic increasing /decreasing with 
> noise level.
> - It shall be indicated in the datasheet at which level a BER<10^-10 (better 
> than 10^-10) is achieved (e.g. "from SQI=3 to SQI=7 the link has a BER<10^-10 
> (better than 10^-10)")
> 
> I.e. SQI does not need to have a direct correlation with SNR. The fundamental 
> underlying metric is the BER.
> You can report the raw SQI level and users would have to look up what it 
> means in the respective data sheet. There is no guaranteed relation between 
> SQI levels of different devices, i.e. SQI 5 can have lower BER than SQI 6 on 
> another device.
> Alternatively, you could report BER < x for the different SQI levels. 
> However, this requires the information to be available. While I could provide 
> these for NXP, it might not be easily available for other vendors.
> If reporting raw SQI, at least the SQI level for BER<10^-10 should be 
> presented to give any meaning to the value.

So the question is, which values to provide via KAPI to user space?

- SQI
  The PHY can probably measure the SNR quite fast and has some internal
  function or lookup table to derive the SQI from the measured SNR.

  If I understand you correctly, we can only compare SQI values of the
  same PHY, as different PHYs give different SQIs for the same link
  characteristics (=SNR).
- SNR range
  We read the SQI from the PHY, look up the SNR range for that value in
  the data sheet and provide that range to user space. This gives a
  better description of the quality of the link.
- "guesstimated" BER
  The manufacturer of the PHY has probably done some extensive testing
  that a measured SNR can be correlated to some BER. This value may be
  provided in the data sheet, too.

The SNR seems to be the most universal value when it comes to comparing
different situations (different links and different PHYs). The
resolution of the BER is not that detailed; for the NXP PHY it says only
"BER below 1e-10" or not.

> While I could provide these for NXP, it might not be easily available
> for other vendors.

It will be great if you can provide this information. It may force other
vendors to do the same :)

The actual procedure to measure the BER is the following testing
strategy suggested by opensig[1]:

Procedure:
1. Configure the DUT as MASTER.
2. Connect the packet monitoring station to the automotive cable.
3. Connect the DUT to the automotive cable.
4. Send 2,470,000 1,518-byte packets (for a 10^-10 BER) and the monitor will
   count the number of packet errors.
5. Repeat step 4 for the remaining automotive cables.
6. Repeat steps 4-5 with the DUT configured as SLAVE.
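
The packet count in step 4 can be sanity-checked with a little arithmetic. The sketch below is one plausible reading, not something the test suite text quoted here states: it assumes every bit of every 1,518-byte frame counts towards the total, that zero errors are observed, and that the target is a 95% confidence bound (the usual "rule of three", where the bound is -ln(0.05)/n for n error-free bits). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* -ln(0.05), hard-coded so the sketch does not depend on libm. */
#define NEG_LN_0_05 2.995732273553991

/* Total number of bits carried by the given number of fixed-size packets. */
uint64_t bits_sent(uint64_t packets, uint64_t bytes_per_packet)
{
	return packets * bytes_per_packet * 8u;
}

/*
 * 95% upper confidence bound on the bit error rate when no bit errors
 * were observed in 'bits' transmitted bits (assumption: independent errors).
 */
double ber_upper_bound_95(uint64_t bits)
{
	return NEG_LN_0_05 / (double)bits;
}
```

Under these assumptions 2,470,000 packets carry 29,995,680,000 bits, roughly 3 x 10^10, and an error-free run bounds the BER just below 10^-10, which matches the figure quoted in step 4.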


[1] 
http://www.opensig.org/download/document/225/Open_Alliance_100BASE-T1_PMA_Test_Suite_v1.0-dec.pdf

Regards,
Oleksij & Marc

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |

RE: remove kernel_setsockopt and kernel_getsockopt

2020-05-14 Thread David Laight

From: Joe Perches
> Sent: 13 May 2020 18:39
> On Wed, 2020-05-13 at 08:26 +0200, Christoph Hellwig wrote:
> > this series removes the kernel_setsockopt and kernel_getsockopt
> > functions, and instead switches their users to small functions that
> > implement setting (or in one case getting) a sockopt directly using
> > a normal kernel function call with type safety and all the other
> > benefits of not having a function call.
> >
> > In some cases these functions seem pretty heavy handed as they do
> > a lock_sock even for just setting a single variable, but this mirrors
> > the real setsockopt implementation - counter to that a few kernel
> > drivers just set the fields directly already.
> >
> > Nevertheless the diffstat looks quite promising:
> >
> >  42 files changed, 721 insertions(+), 799 deletions(-)

I missed this patch going through.
Massive NACK.

You need to export functions that do most of the socket options
for all protocols.
As well as REUSEADDR and NODELAY, SCTP has loads because a lot
of stuff that should have been extra system calls got piled
into setsockopt.

An alternate solution would be to move the copy_to/from_user()
into a wrapper function so that the kernel_[sg]etsockopt()
functions would bypass them completely.
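
A rough userspace sketch of that alternate layout, with everything in it hypothetical: `fake_sock`, `FAKE_SO_REUSEADDR` and the `fake_*` functions are invented for illustration, and `memcpy()` stands in for copy_from_user(). The core routine only ever sees a kernel buffer; the user-facing wrapper does the copy and then calls the core, while an in-kernel caller goes straight to the core.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical socket with a single option; not a kernel structure. */
struct fake_sock {
	int reuseaddr;
};

#define FAKE_SO_REUSEADDR 1

/* Core implementation: operates on a kernel-side buffer only. */
int fake_setsockopt_core(struct fake_sock *sk, int optname,
			 const void *kval, size_t len)
{
	int val;

	if (len < sizeof(val))
		return -EINVAL;
	memcpy(&val, kval, sizeof(val));

	switch (optname) {
	case FAKE_SO_REUSEADDR:
		sk->reuseaddr = !!val;
		return 0;
	default:
		return -ENOPROTOOPT;
	}
}

/* Stand-in for copy_from_user(); returns the number of bytes NOT copied. */
static unsigned long fake_copy_from_user(void *to, const void *from,
					 unsigned long n)
{
	memcpy(to, from, n);
	return 0;
}

/* Syscall-side wrapper: the only place that touches the user pointer. */
int fake_user_setsockopt(struct fake_sock *sk, int optname,
			 const void *uval, size_t len)
{
	char kbuf[64];

	if (len > sizeof(kbuf))
		return -EINVAL;
	if (fake_copy_from_user(kbuf, uval, len))
		return -EFAULT;

	return fake_setsockopt_core(sk, optname, kbuf, len);
}

/* In-kernel entry point: bypasses the user copy completely. */
int fake_kernel_setsockopt(struct fake_sock *sk, int optname,
			   const void *kval, size_t len)
{
	return fake_setsockopt_core(sk, optname, kval, len);
}
```

The trade-off is that option handling stays in one switch per protocol rather than being split into one exported function per option, at the cost of keeping the untyped `void *` interface.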

David


[PATCH bpf-next v2 00/14] Introduce AF_XDP buffer allocation API

2020-05-14 Thread Björn Töpel

Overview


Driver adoption for AF_XDP has been slow. The amount of code required
to properly support AF_XDP is substantial and the driver/core APIs are
vague or even non-existent. Drivers have to manually adjust data
offsets and update AF_XDP handles differently for different modes
(aligned/unaligned).

This series attempts to improve the situation by introducing an AF_XDP
buffer allocation API. The implementation is based on a single core
(single producer/consumer) buffer pool for the AF_XDP UMEM.

A buffer is allocated using the xsk_buff_alloc() function, and
returned using xsk_buff_free(). If a buffer is disassociated from the
pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is
said to be released. Currently, the release function is only used by
the AF_XDP internals and not visible to the driver.
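
To make the alloc/free life cycle concrete, here is a toy userspace model of a single-producer/consumer pool. It is only an assumption-laden sketch: `fake_pool`, `fake_buff`, `fake_buff_alloc()` and `fake_buff_free()` are invented names, the signatures are not those of xsk_buff_alloc()/xsk_buff_free(), and release, DMA handling and the UMEM layout are not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical buffer; the real buffer type in the series is struct xdp_buff. */
struct fake_buff {
	int index;
};

/* Fixed pool with a LIFO free list; no locking, single core assumed. */
struct fake_pool {
	struct fake_buff bufs[POOL_SIZE];
	struct fake_buff *free_list[POOL_SIZE];
	size_t free_cnt;
};

/* Put every buffer on the free list. */
void fake_pool_init(struct fake_pool *p)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		p->bufs[i].index = (int)i;
		p->free_list[i] = &p->bufs[i];
	}
	p->free_cnt = POOL_SIZE;
}

/* Take a buffer from the pool; NULL when the pool is exhausted. */
struct fake_buff *fake_buff_alloc(struct fake_pool *p)
{
	if (p->free_cnt == 0)
		return NULL;
	return p->free_list[--p->free_cnt];
}

/* Return a buffer to the pool; ignored if the free list is already full. */
void fake_buff_free(struct fake_pool *p, struct fake_buff *b)
{
	if (p->free_cnt == POOL_SIZE)
		return;
	p->free_list[p->free_cnt++] = b;
}
```

A driver-like caller would allocate on the Rx refill path and free on error or drop paths; an exhausted pool simply yields NULL, which is the allocation-failure case the cover letter later mentions for statistics.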

Drivers using this API should register the XDP memory model with the
new MEM_TYPE_XSK_BUFF_POOL type, which will supersede the
MEM_TYPE_ZERO_COPY type.

The buffer type is struct xdp_buff, and follows the lifetime of
regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
a NAPI context. In other words, the API is not replacing xdp_frames.

DMA mapping/synching is folded into the buffer handling as well.

@JeffK The Intel drivers changes should go through the bpf-next tree,
   and not your regular Intel tree, since multiple (non-Intel)
   drivers are affected.

The outline of the series is as follows:

Patches 1 to 3 are restructurings/cleanups. The XSKMAP implementation is
moved to net/xdp/. Functions/defines/enums that are only used by the
AF_XDP internals are moved from the global include/net/xdp_sock.h to
net/xdp/xsk.h. We are also introducing a new "driver include file",
include/net/xdp_sock_drv.h, which is the only file NIC driver
developers adding AF_XDP zero-copy support should care about.

Patch 4 adds the new API, and migrates the "copy-mode"/skb-mode AF_XDP
path to the new API.

Patches 5 to 10 migrate the existing zero-copy drivers to the new API.

Patch 11 removes the MEM_TYPE_ZERO_COPY memory type, and the "handle"
member of struct xdp_buff.

Patch 12 simplifies the xdp_return_{frame,frame_rx_napi,buff}
functions.

Patch 13 is a performance patch, where some functions are inlined.

Finally, patch 14 updates the MAINTAINERS file to correctly mirror the
new file layout.

Note that this series removes the "handle" member from struct
xdp_buff, which reduces the xdp_buff size.

After this series, the diff stat of drivers/net/ is:
  27 files changed, 388 insertions(+), 1259 deletions(-)
 
This series is a first step towards simplifying the driver side of
AF_XDP. I think more of the AF_XDP logic can be moved from the drivers
to the AF_XDP core, e.g. the "need wakeup" set/clear functionality.

Statistics when allocation fails can now be added to the socket
statistics via the XDP_STATISTICS getsockopt(). This will be added in
a follow up series.


Performance
===

As a nice side effect, performance is up a bit as well (40 GbE, 64B
packets, i40e):

rxdrop, zero-copy, aligned:
  baseline: 20.4
  new API : 21.3

rxdrop, zero-copy, unaligned:
  baseline: 19.5
  new API : 21.2


Changelog
=

v1->v2: 
  * mlx5: Fix DMA address handling, set XDP metadata to invalid. (Maxim)
  * ixgbe: Fixed xdp_buff data_end update. (Björn)
  * Swapped SoBs in patch 4. (Maxim)

rfc->v1:
  * Fixed build errors/warnings for m68k and riscv. (kbuild test
robot)
  * Added headroom/chunk size getter. (Maxim/Björn)
  * mlx5: Put back the sanity check for XSK params, use XSK API to get
the total headroom size. (Maxim)
  * Fixed spelling in commit message. (Björn)
  * Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)
  * Sorted file entries. (Joe)
  * Added xdp_return_{frame,frame_rx_napi,buff} simplification (Björn)


Thanks for all the comments/input/help!


Cheers,
Björn


Björn Töpel (13):
  xsk: move xskmap.c to net/xdp/
  xsk: move defines only used by AF_XDP internals to xsk.h
  xsk: introduce AF_XDP buffer allocation API
  i40e: refactor rx_bi accesses
  i40e: separate kernel allocated rx_bi rings from AF_XDP rings
  i40e, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL
  ice, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL
  ixgbe, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL
  mlx5, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL
  xsk: remove MEM_TYPE_ZERO_COPY and corresponding code
  xdp: simplify xdp_return_{frame,frame_rx_napi,buff}
  xsk: explicitly inline functions and move definitions
  MAINTAINERS, xsk: update AF_XDP section after moves/adds

Magnus Karlsson (1):
  xsk: move driver interface to xdp_sock_drv.h

 MAINTAINERS   |   6 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  28 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 134 +++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  17 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|  40 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h   |   5 +-
 drivers/net/

[PATCH bpf-next v2 01/14] xsk: move xskmap.c to net/xdp/

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel 
---
 include/net/xdp_sock.h   | 20 
 kernel/bpf/Makefile  |  3 ---
 net/xdp/Makefile |  2 +-
 net/xdp/xsk.h| 16 
 {kernel/bpf => net/xdp}/xskmap.c |  2 ++
 5 files changed, 19 insertions(+), 24 deletions(-)
 rename {kernel/bpf => net/xdp}/xskmap.c (99%)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 67191ccaab85..a26d6c80e43d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -65,22 +65,12 @@ struct xdp_umem {
struct list_head xsk_tx_list;
 };
 
-/* Nodes are linked in the struct xdp_sock map_list field, and used to
- * track which maps a certain socket reside in.
- */
-
 struct xsk_map {
struct bpf_map map;
spinlock_t lock; /* Synchronize map updates */
struct xdp_sock *xsk_map[];
 };
 
-struct xsk_map_node {
-   struct list_head node;
-   struct xsk_map *map;
-   struct xdp_sock **map_entry;
-};
-
 struct xdp_sock {
/* struct sock must be the first member of struct xdp_sock */
struct sock sk;
@@ -114,7 +104,6 @@ struct xdp_sock {
 struct xdp_buff;
 #ifdef CONFIG_XDP_SOCKETS
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
-bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 /* Used from netdev driver */
 bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
 bool xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
@@ -133,10 +122,6 @@ void xsk_clear_rx_need_wakeup(struct xdp_umem *umem);
 void xsk_clear_tx_need_wakeup(struct xdp_umem *umem);
 bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem);
 
-void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
-struct xdp_sock **map_entry);
-int xsk_map_inc(struct xsk_map *map);
-void xsk_map_put(struct xsk_map *map);
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 void __xsk_map_flush(void);
 
@@ -242,11 +227,6 @@ static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return -ENOTSUPP;
 }
 
-static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
-{
-   return false;
-}
-
 static inline bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
 {
return false;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 37b2d8620153..375b933010dd 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -12,9 +12,6 @@ obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
-ifeq ($(CONFIG_XDP_SOCKETS),y)
-obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
-endif
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index 71e2bdafb2ce..90b5460d6166 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o xskmap.o
 obj-$(CONFIG_XDP_SOCKETS_DIAG) += xsk_diag.o
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index 4cfd106bdb53..d6a0979050e6 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -17,9 +17,25 @@ struct xdp_mmap_offsets_v1 {
struct xdp_ring_offset_v1 cr;
 };
 
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket reside in.
+ */
+
+struct xsk_map_node {
+   struct list_head node;
+   struct xsk_map *map;
+   struct xdp_sock **map_entry;
+};
+
 static inline struct xdp_sock *xdp_sk(struct sock *sk)
 {
return (struct xdp_sock *)sk;
 }
 
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
 #endif /* XSK_H_ */
diff --git a/kernel/bpf/xskmap.c b/net/xdp/xskmap.c
similarity index 99%
rename from kernel/bpf/xskmap.c
rename to net/xdp/xskmap.c
index 2cc5c8f4c800..1dc7208c71ba 100644
--- a/kernel/bpf/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -9,6 +9,8 @@
 #include 
 #include 
 
+#include "xsk.h"
+
 int xsk_map_inc(struct xsk_map *map)
 {
bpf_map_inc(&map->map);
-- 
2.25.1

[PATCH bpf-next v2 03/14] xsk: move defines only used by AF_XDP internals to xsk.h

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and
XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP
internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of
explicit shifts.
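
For readers less familiar with the BIT{,_ULL} helpers, the following is a minimal userspace sketch of the idea, not the kernel code itself. The BIT()/BIT_ULL() definitions here are simplified stand-ins, and the demo_* helpers and DEMO_PAGE_FLAG_BITS are hypothetical names introduced only for illustration. It shows that the macro form yields the same values as the explicit shifts, and how a flag can live in the low bits of a page-aligned address.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's BIT()/BIT_ULL() (assumption). */
#define BIT(nr)     (1UL << (nr))
#define BIT_ULL(nr) (1ULL << (nr))

/* Same names and values as in the patch. */
#define XSK_NEXT_PG_CONTIG_SHIFT  0
#define XSK_NEXT_PG_CONTIG_MASK   BIT_ULL(XSK_NEXT_PG_CONTIG_SHIFT)
#define XDP_UMEM_USES_NEED_WAKEUP BIT(1)

/* Hypothetical: the low 12 bits of a page address, free to hold flags. */
#define DEMO_PAGE_FLAG_BITS 0xfffULL

/* Hypothetical helper: tag a page-aligned address as "next page is contiguous". */
static inline uint64_t demo_page_mark_contig(uint64_t page_addr)
{
	return page_addr | XSK_NEXT_PG_CONTIG_MASK;
}

/* Hypothetical helper: test the contiguity flag on a tagged address. */
static inline int demo_page_is_contig(uint64_t tagged)
{
	return (tagged & XSK_NEXT_PG_CONTIG_MASK) != 0;
}

/* Hypothetical helper: strip all flag bits to recover the page address. */
static inline uint64_t demo_page_addr(uint64_t tagged)
{
	return tagged & ~DEMO_PAGE_FLAG_BITS;
}

/* Checks that the macro form matches the explicit shifts it replaces. */
void demo_bit_selfcheck(void)
{
	uint64_t tagged = demo_page_mark_contig(0x7f0000001000ULL);

	assert(XSK_NEXT_PG_CONTIG_MASK == (1ULL << XSK_NEXT_PG_CONTIG_SHIFT));
	assert(XDP_UMEM_USES_NEED_WAKEUP == (1 << 1));
	assert(demo_page_is_contig(tagged));
	assert(demo_page_addr(tagged) == 0x7f0000001000ULL);
}
```

Calling demo_bit_selfcheck() from a test harness exercises all four checks; the macros only change how the constants are spelled, not their values.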

Signed-off-by: Björn Töpel 
---
 include/net/xdp_sock.h | 14 --
 net/xdp/xsk.h  | 14 ++
 net/xdp/xsk_queue.h|  2 ++
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6a986dcbc336..fb7fe3060175 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -17,13 +17,6 @@ struct net_device;
 struct xsk_queue;
 struct xdp_buff;
 
-/* Masks for xdp_umem_page flags.
- * The low 12-bits of the addr will be 0 since this is the page address, so we
- * can use them for flags.
- */
-#define XSK_NEXT_PG_CONTIG_SHIFT 0
-#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
-
 struct xdp_umem_page {
void *addr;
dma_addr_t dma;
@@ -35,13 +28,6 @@ struct xdp_umem_fq_reuse {
u64 handles[];
 };
 
-/* Flags for the umem flags field.
- *
- * The NEED_WAKEUP flag is 1 due to the reuse of the flags field for public
- * flags. See inlude/uapi/include/linux/if_xdp.h.
- */
-#define XDP_UMEM_USES_NEED_WAKEUP (1 << 1)
-
 struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index d6a0979050e6..455ddd480f3d 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -4,6 +4,20 @@
 #ifndef XSK_H_
 #define XSK_H_
 
+/* Masks for xdp_umem_page flags.
+ * The low 12-bits of the addr will be 0 since this is the page address, so we
+ * can use them for flags.
+ */
+#define XSK_NEXT_PG_CONTIG_SHIFT 0
+#define XSK_NEXT_PG_CONTIG_MASK BIT_ULL(XSK_NEXT_PG_CONTIG_SHIFT)
+
+/* Flags for the umem flags field.
+ *
+ * The NEED_WAKEUP flag is 1 due to the reuse of the flags field for public
+ * flags. See inlude/uapi/include/linux/if_xdp.h.
+ */
+#define XDP_UMEM_USES_NEED_WAKEUP BIT(1)
+
 struct xdp_ring_offset_v1 {
__u64 producer;
__u64 consumer;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 648733ec24ac..a322a7dac58c 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+#include "xsk.h"
+
 struct xdp_ring {
u32 producer cacheline_aligned_in_smp;
u32 consumer cacheline_aligned_in_smp;
-- 
2.25.1

[PATCH bpf-next v2 02/14] xsk: move driver interface to xdp_sock_drv.h

2020-05-14 Thread Björn Töpel

From: Magnus Karlsson 

Move the AF_XDP zero-copy driver interface to its own include file
called xdp_sock_drv.h. This will, hopefully, make it clearer to NIC
driver implementors which functions to use for zero-copy support.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c|   2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c  |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.h   |   2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |   2 +-
 include/net/xdp_sock.h| 203 +
 include/net/xdp_sock_drv.h| 207 ++
 net/ethtool/channels.c|   2 +-
 net/ethtool/ioctl.c   |   2 +-
 net/xdp/xdp_umem.h|   2 +-
 net/xdp/xsk.c |   2 +-
 14 files changed, 227 insertions(+), 207 deletions(-)
 create mode 100644 include/net/xdp_sock_drv.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2a037ec244b9..d6b2db4f2c65 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11,7 +11,7 @@
 #include "i40e_diag.h"
 #include "i40e_xsk.h"
 #include 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 /* All i40e tracepoints are defined by the include below, which
  * must be included exactly once across the whole kernel with
  * CREATE_TRACE_POINTS defined
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 0b7d29192b2c..452bba7bc4ff 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -2,7 +2,7 @@
 /* Copyright(c) 2018 Intel Corporation. */
 
 #include 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 #include 
 
 #include "i40e.h"
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c 
b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 8279db15e870..955b0fbb7c9a 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -2,7 +2,7 @@
 /* Copyright (c) 2019, Intel Corporation. */
 
 #include 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 #include 
 #include "ice.h"
 #include "ice_base.h"
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 74b540ebb3dc..5b6edbd8a4ed 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -2,7 +2,7 @@
 /* Copyright(c) 2018 Intel Corporation. */
 
 #include 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 #include 
 
 #include "ixgbe.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index c4a7fb4ecd14..b04b99396f65 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -31,7 +31,7 @@
  */
 
 #include 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 #include "en/xdp.h"
 #include "en/params.h"
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index cab0e93497ae..a8e11adbf426 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -5,7 +5,7 @@
 #define __MLX5_EN_XSK_RX_H__
 
 #include "en.h"
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 
 /* RX data path */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
index 79b487d89757..39fa0a705856 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
@@ -5,7 +5,7 @@
 #define __MLX5_EN_XSK_TX_H__
 
 #include "en.h"
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 
 /* TX data path */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
index 4baaa5788320..5e49fdb564b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /* Copyright (c) 2019 Mellanox Technologies. */
 
-#include <net/xdp_sock.h>
+#include <net/xdp_sock_drv.h>
 #include "umem.h"
 #include "setup.h"
 #include "en/params.h"
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index a26d6c80e43d..6a986dcbc336 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -15,6 +15,7 @@
 
 struct net_device;
 struct xsk_queue;
+struct xdp_buff;
 
 /* Masks for xdp_umem_page flags.
  * The low 12-bits of the addr will be 0 since this is the page address, so we
@@ -101,27 +102,9 @@ struct xdp_sock {
spinlock_t map_list_lock;
 };
 
-struct xdp_buff;
 #ifdef CONFIG_XDP_SOCKETS
-int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
-/* Used

[PATCH bpf-next v2 07/14] i40e, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
APIs. The AF_XDP zero-copy rx_bi ring is now simply a struct xdp_buff
pointer.

Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  19 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 350 ++--
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   1 -
 4 files changed, 47 insertions(+), 332 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 3e1695bb8262..ea7395b391e5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3266,21 +3266,19 @@ static int i40e_configure_rx_ring(struct i40e_ring 
*ring)
ret = i40e_alloc_rx_bi_zc(ring);
if (ret)
return ret;
-   ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
-  XDP_PACKET_HEADROOM;
+   ring->rx_buf_len = xsk_umem_get_rx_frame_size(ring->xsk_umem);
/* For AF_XDP ZC, we disallow packets to span on
 * multiple buffers, thus letting us skip that
 * handling in the fast-path.
 */
chain_len = 1;
-   ring->zca.free = i40e_zca_free;
ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
-MEM_TYPE_ZERO_COPY,
-&ring->zca);
+MEM_TYPE_XSK_BUFF_POOL,
+NULL);
if (ret)
return ret;
dev_info(&vsi->back->pdev->dev,
-"Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
+"Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
 ring->queue_index);
 
} else {
@@ -3351,9 +3349,12 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
writel(0, ring->tail);
 
-   ok = ring->xsk_umem ?
-i40e_alloc_rx_buffers_zc(ring, I40E_DESC_UNUSED(ring)) :
-!i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+   if (ring->xsk_umem) {
+   xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq);
+   ok = i40e_alloc_rx_buffers_zc(ring, I40E_DESC_UNUSED(ring));
+   } else {
+   ok = !i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+   }
if (!ok) {
/* Log this in case the user has forgotten to give the kernel
 * any buffers, even later in the application.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index d343498e8de5..5c255977fd58 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -301,12 +301,6 @@ struct i40e_rx_buffer {
__u16 pagecnt_bias;
 };
 
-struct i40e_rx_buffer_zc {
-   dma_addr_t dma;
-   void *addr;
-   u64 handle;
-};
-
 struct i40e_queue_stats {
u64 packets;
u64 bytes;
@@ -356,7 +350,7 @@ struct i40e_ring {
union {
struct i40e_tx_buffer *tx_bi;
struct i40e_rx_buffer *rx_bi;
-   struct i40e_rx_buffer_zc *rx_bi_zc;
+   struct xdp_buff **rx_bi_zc;
};
DECLARE_BITMAP(state, __I40E_RING_STATE_NBITS);
u16 queue_index;/* Queue number of ring */
@@ -418,7 +412,6 @@ struct i40e_ring {
struct i40e_channel *ch;
struct xdp_rxq_info xdp_rxq;
struct xdp_umem *xsk_umem;
-   struct zero_copy_allocator zca; /* ZC allocator anchor */
 } cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 4fce057f1eec..460f5052e1db 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -23,68 +23,11 @@ void i40e_clear_rx_bi_zc(struct i40e_ring *rx_ring)
   sizeof(*rx_ring->rx_bi_zc) * rx_ring->count);
 }
 
-static struct i40e_rx_buffer_zc *i40e_rx_bi(struct i40e_ring *rx_ring, u32 idx)
+static struct xdp_buff **i40e_rx_bi(struct i40e_ring *rx_ring, u32 idx)
 {
return &rx_ring->rx_bi_zc[idx];
 }
 
-/**
- * i40e_xsk_umem_dma_map - DMA maps all UMEM memory for the netdev
- * @vsi: Current VSI
- * @umem: UMEM to DMA map
- *
- * Returns 0 on success, <0 on failure
- **/
-static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
-{
-   struct i40e_pf *pf = vsi->back;
-   struct device *dev;
-   unsigned i

[PATCH bpf-next v2 06/14] i40e: separate kernel allocated rx_bi rings from AF_XDP rings

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Continuing the path to support MEM_TYPE_XSK_BUFF_POOL, the AF_XDP
zero-copy/sk_buff rx_bi rings are now separate. Functions to properly
allocate the different rings are added as well.

Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   7 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 119 +++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  22 ++--
 .../ethernet/intel/i40e/i40e_txrx_common.h|  40 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h   |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c|  74 ++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.h|   2 +
 7 files changed, 142 insertions(+), 127 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d6b2db4f2c65..3e1695bb8262 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3260,8 +3260,12 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
if (ring->vsi->type == I40E_VSI_MAIN)
xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
 
+   kfree(ring->rx_bi);
ring->xsk_umem = i40e_xsk_umem(ring);
if (ring->xsk_umem) {
+   ret = i40e_alloc_rx_bi_zc(ring);
+   if (ret)
+   return ret;
ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
   XDP_PACKET_HEADROOM;
/* For AF_XDP ZC, we disallow packets to span on
@@ -3280,6 +3284,9 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 ring->queue_index);
 
} else {
+   ret = i40e_alloc_rx_bi(ring);
+   if (ret)
+   return ret;
ring->rx_buf_len = vsi->rx_buf_len;
if (ring->vsi->type == I40E_VSI_MAIN) {
ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 58daba8fabc8..f063df623443 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -521,28 +521,29 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
 /**
  * i40e_fd_handle_status - check the Programming Status for FD
  * @rx_ring: the Rx ring for this descriptor
- * @rx_desc: the Rx descriptor for programming Status, not a packet descriptor.
+ * @qword0_raw: qword0
+ * @qword1: qword1 after le_to_cpu
  * @prog_id: the id originally used for programming
  *
  * This is used to verify if the FD programming or invalidation
  * requested by SW to the HW is successful or not and take actions accordingly.
  **/
-void i40e_fd_handle_status(struct i40e_ring *rx_ring,
-  union i40e_rx_desc *rx_desc, u8 prog_id)
+void i40e_fd_handle_status(struct i40e_ring *rx_ring, u64 qword0_raw,
+  u64 qword1, u8 prog_id)
 {
struct i40e_pf *pf = rx_ring->vsi->back;
struct pci_dev *pdev = pf->pdev;
+   struct i40e_32b_rx_wb_qw0 *qw0;
u32 fcnt_prog, fcnt_avail;
u32 error;
-   u64 qw;
 
-   qw = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
-   error = (qw & I40E_RX_PROG_STATUS_DESC_QW1_ERROR_MASK) >>
+   qw0 = (struct i40e_32b_rx_wb_qw0 *)&qword0_raw;
+   error = (qword1 & I40E_RX_PROG_STATUS_DESC_QW1_ERROR_MASK) >>
I40E_RX_PROG_STATUS_DESC_QW1_ERROR_SHIFT;
 
if (error == BIT(I40E_RX_PROG_STATUS_DESC_FD_TBL_FULL_SHIFT)) {
-   pf->fd_inv = le32_to_cpu(rx_desc->wb.qword0.hi_dword.fd_id);
-   if ((rx_desc->wb.qword0.hi_dword.fd_id != 0) ||
+   pf->fd_inv = le32_to_cpu(qw0->hi_dword.fd_id);
+   if (qw0->hi_dword.fd_id != 0 ||
(I40E_DEBUG_FD & pf->hw.debug_mask))
dev_warn(&pdev->dev, "ntuple filter loc = %d, could not be added\n",
 pf->fd_inv);
@@ -560,7 +561,7 @@ void i40e_fd_handle_status(struct i40e_ring *rx_ring,
/* store the current atr filter count */
pf->fd_atr_cnt = i40e_get_current_atr_cnt(pf);
 
-   if ((rx_desc->wb.qword0.hi_dword.fd_id == 0) &&
+   if (qw0->hi_dword.fd_id == 0 &&
test_bit(__I40E_FD_SB_AUTO_DISABLED, pf->state)) {
/* These set_bit() calls aren't atomic with the
 * test_bit() here, but worse case we potentially
@@ -589,7 +590,7 @@ void i40e_fd_handle_status(struct i40e_ring *rx_ring,
} else if (error == BIT(I40E_RX_PROG_STATUS_DESC_NO_FD_ENTRY_SHIFT)) {
if (I40E_DEBUG_FD & pf->hw.debug_mask)
dev_info(&pdev->dev, "ntuple filter fd_id = %d, could not be removed\n",
-rx_desc->wb.qword0.hi_dword.fd_id);
+

[PATCH bpf-next v2 08/14] ice, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
APIs.

Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Maciej Fijalkowski 
Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ice/ice_base.c |  16 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h |   8 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c  | 372 +++---
 drivers/net/ethernet/intel/ice/ice_xsk.h  |  13 +-
 4 files changed, 54 insertions(+), 355 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c 
b/drivers/net/ethernet/intel/ice/ice_base.c
index a19cd6f5436b..433eb72b1c85 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2019, Intel Corporation. */
 
+#include 
 #include "ice_base.h"
 #include "ice_dcb_lib.h"
 
@@ -308,24 +309,23 @@ int ice_setup_rx_ctx(struct ice_ring *ring)
if (ring->xsk_umem) {
xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
 
-   ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
-  XDP_PACKET_HEADROOM;
+   ring->rx_buf_len =
+   xsk_umem_get_rx_frame_size(ring->xsk_umem);
/* For AF_XDP ZC, we disallow packets to span on
 * multiple buffers, thus letting us skip that
 * handling in the fast-path.
 */
chain_len = 1;
-   ring->zca.free = ice_zca_free;
err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
-MEM_TYPE_ZERO_COPY,
-&ring->zca);
+MEM_TYPE_XSK_BUFF_POOL,
+NULL);
if (err)
return err;
+   xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq);
 
-   dev_info(ice_pf_to_dev(vsi->back), "Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
+   dev_info(ice_pf_to_dev(vsi->back), "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
 ring->q_index);
} else {
-   ring->zca.free = NULL;
if (!xdp_rxq_info_is_reg(&ring->xdp_rxq))
/* coverity[check_return] */
xdp_rxq_info_reg(&ring->xdp_rxq,
@@ -426,7 +426,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring)
writel(0, ring->tail);
 
err = ring->xsk_umem ?
- ice_alloc_rx_bufs_slow_zc(ring, ICE_DESC_UNUSED(ring)) :
+ ice_alloc_rx_bufs_zc(ring, ICE_DESC_UNUSED(ring)) :
  ice_alloc_rx_bufs(ring, ICE_DESC_UNUSED(ring));
if (err)
dev_info(ice_pf_to_dev(vsi->back), "Failed allocate some buffers on %sRx ring %d (pf_q %d)\n",
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h 
b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 7ee00a128663..d0fd2173854f 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -155,17 +155,16 @@ struct ice_tx_offload_params {
 };
 
 struct ice_rx_buf {
-   struct sk_buff *skb;
-   dma_addr_t dma;
union {
struct {
+   struct sk_buff *skb;
+   dma_addr_t dma;
struct page *page;
unsigned int page_offset;
u16 pagecnt_bias;
};
struct {
-   void *addr;
-   u64 handle;
+   struct xdp_buff *xdp;
};
};
 };
@@ -289,7 +288,6 @@ struct ice_ring {
struct rcu_head rcu;/* to avoid race on free */
struct bpf_prog *xdp_prog;
struct xdp_umem *xsk_umem;
-   struct zero_copy_allocator zca;
/* CL3 - 3rd cacheline starts here */
struct xdp_rxq_info xdp_rxq;
/* CLX - the below items are only accessed infrequently and should be
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c 
b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 955b0fbb7c9a..da89589c3137 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -279,28 +279,6 @@ static int ice_xsk_alloc_umems(struct ice_vsi *vsi)
return 0;
 }
 
-/**
- * ice_xsk_add_umem - add a UMEM region for XDP sockets
- * @vsi: VSI to which the UMEM will be added
- * @umem: pointer to a requested UMEM region
- * @qid: queue ID
- *
- * Returns 0 on success, negative on error
- */
-static int ice_xsk_add_umem(struct ice_vsi *vsi, struct xdp_um

[PATCH bpf-next v2 05/14] i40e: refactor rx_bi accesses

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

As a first step to migrate i40e to the new MEM_TYPE_XSK_BUFF_POOL
APIs, code that accesses the rx_bi (SW/shadow ring) is refactored to
use an accessor function.
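
As a rough illustration of why routing every access through one accessor helps, here is a self-contained userspace sketch. All demo_* types and functions are hypothetical stand-ins, not the i40e definitions. The point is that callers never index the shadow ring directly, so the element type behind the accessor can later change in one place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a software (shadow) ring entry. */
struct demo_rx_buffer {
	void *page;
};

/* Hypothetical, simplified stand-in for the ring that owns the entries. */
struct demo_ring {
	struct demo_rx_buffer *rx_bi;
	unsigned int count;
};

/* The single place that knows how entries are laid out in the ring. */
static struct demo_rx_buffer *demo_rx_bi(struct demo_ring *ring,
					 unsigned int idx)
{
	return &ring->rx_bi[idx];
}

/* A caller that walks the ring using only the accessor. */
static unsigned int demo_count_populated(struct demo_ring *ring)
{
	unsigned int i, n = 0;

	for (i = 0; i < ring->count; i++) {
		if (demo_rx_bi(ring, i)->page)
			n++;
	}
	return n;
}

/* Exercises the accessor on a three-entry ring with one populated slot. */
void demo_accessor_selfcheck(void)
{
	struct demo_rx_buffer bufs[3] = { { NULL }, { NULL }, { NULL } };
	struct demo_ring ring = { bufs, 3 };
	int backing;

	bufs[1].page = &backing;
	assert(demo_rx_bi(&ring, 0) == &bufs[0]);
	assert(demo_rx_bi(&ring, 2) == &bufs[2]);
	assert(demo_count_populated(&ring) == 1);
}
```

With this shape, a follow-up change that swaps the entry type only has to touch the accessor and the code that uses the entry's fields, not every open-coded index expression.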

Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 17 +++--
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 18 --
 2 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b8496037ef7f..58daba8fabc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1195,6 +1195,11 @@ static void i40e_update_itr(struct i40e_q_vector 
*q_vector,
rc->total_packets = 0;
 }
 
+static struct i40e_rx_buffer *i40e_rx_bi(struct i40e_ring *rx_ring, u32 idx)
+{
+   return &rx_ring->rx_bi[idx];
+}
+
 /**
  * i40e_reuse_rx_page - page flip buffer and store it back on the ring
  * @rx_ring: rx descriptor ring to store buffers on
@@ -1208,7 +1213,7 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
struct i40e_rx_buffer *new_buff;
u16 nta = rx_ring->next_to_alloc;
 
-   new_buff = &rx_ring->rx_bi[nta];
+   new_buff = i40e_rx_bi(rx_ring, nta);
 
/* update, and store next to alloc */
nta++;
@@ -1272,7 +1277,7 @@ struct i40e_rx_buffer *i40e_clean_programming_status(
ntc = rx_ring->next_to_clean;
 
/* fetch, update, and store next to clean */
-   rx_buffer = &rx_ring->rx_bi[ntc++];
+   rx_buffer = i40e_rx_bi(rx_ring, ntc++);
ntc = (ntc < rx_ring->count) ? ntc : 0;
rx_ring->next_to_clean = ntc;
 
@@ -1361,7 +1366,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 
/* Free all the Rx ring sk_buffs */
for (i = 0; i < rx_ring->count; i++) {
-   struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+   struct i40e_rx_buffer *rx_bi = i40e_rx_bi(rx_ring, i);
 
if (!rx_bi->page)
continue;
@@ -1576,7 +1581,7 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rx_ring, u16 
cleaned_count)
return false;
 
rx_desc = I40E_RX_DESC(rx_ring, ntu);
-   bi = &rx_ring->rx_bi[ntu];
+   bi = i40e_rx_bi(rx_ring, ntu);
 
do {
if (!i40e_alloc_mapped_page(rx_ring, bi))
@@ -1598,7 +1603,7 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rx_ring, u16 
cleaned_count)
ntu++;
if (unlikely(ntu == rx_ring->count)) {
rx_desc = I40E_RX_DESC(rx_ring, 0);
-   bi = rx_ring->rx_bi;
+   bi = i40e_rx_bi(rx_ring, 0);
ntu = 0;
}
 
@@ -1965,7 +1970,7 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer(struct 
i40e_ring *rx_ring,
 {
struct i40e_rx_buffer *rx_buffer;
 
-   rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+   rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_clean);
prefetchw(rx_buffer->page);
 
/* we are reusing so sync this buffer for CPU use */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 452bba7bc4ff..8d29477bb0b6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -9,6 +9,11 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
+static struct i40e_rx_buffer *i40e_rx_bi(struct i40e_ring *rx_ring, u32 idx)
+{
+   return &rx_ring->rx_bi[idx];
+}
+
 /**
  * i40e_xsk_umem_dma_map - DMA maps all UMEM memory for the netdev
  * @vsi: Current VSI
@@ -321,7 +326,7 @@ __i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 
count,
bool ok = true;
 
rx_desc = I40E_RX_DESC(rx_ring, ntu);
-   bi = &rx_ring->rx_bi[ntu];
+   bi = i40e_rx_bi(rx_ring, ntu);
do {
if (!alloc(rx_ring, bi)) {
ok = false;
@@ -340,7 +345,7 @@ __i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 
count,
 
if (unlikely(ntu == rx_ring->count)) {
rx_desc = I40E_RX_DESC(rx_ring, 0);
-   bi = rx_ring->rx_bi;
+   bi = i40e_rx_bi(rx_ring, 0);
ntu = 0;
}
 
@@ -402,7 +407,7 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct 
i40e_ring *rx_ring,
 {
struct i40e_rx_buffer *bi;
 
-   bi = &rx_ring->rx_bi[rx_ring->next_to_clean];
+   bi = i40e_rx_bi(rx_ring, rx_ring->next_to_clean);
 
/* we are reusing so sync this buffer for CPU use */
dma_sync_single_range_for_cpu(rx_ring->dev,
@@ -424,7 +429,8 @@ static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct 
i40e_ring *rx_ring,
 static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
struct i40e_rx_buffer *old_bi)
 {
-   struct i4

[PATCH bpf-next v2 09/14] ixgbe, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Remove MEM_TYPE_ZERO_COPY in favor of the new MEM_TYPE_XSK_BUFF_POOL
APIs.

v1->v2: Fixed xdp_buff data_end update. (Björn)

Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   9 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  15 +-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 305 +++---
 4 files changed, 62 insertions(+), 269 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 2833e4f041ce..5ddfc83a1e46 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -224,17 +224,17 @@ struct ixgbe_tx_buffer {
 };
 
 struct ixgbe_rx_buffer {
-   struct sk_buff *skb;
-   dma_addr_t dma;
union {
struct {
+   struct sk_buff *skb;
+   dma_addr_t dma;
struct page *page;
__u32 page_offset;
__u16 pagecnt_bias;
};
struct {
-   void *addr;
-   u64 handle;
+   bool discard;
+   struct xdp_buff *xdp;
};
};
 };
@@ -351,7 +351,6 @@ struct ixgbe_ring {
};
struct xdp_rxq_info xdp_rxq;
struct xdp_umem *xsk_umem;
-   struct zero_copy_allocator zca; /* ZC allocator anchor */
u16 ring_idx;   /* {rx,tx,xdp}_ring back reference idx */
u16 rx_buf_len;
 } cacheline_internodealigned_in_smp;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 718931d951bc..da7b8042901f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -35,7 +35,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 #include "ixgbe.h"
@@ -3726,8 +3726,7 @@ static void ixgbe_configure_srrctl(struct ixgbe_adapter 
*adapter,
 
/* configure the packet buffer length */
if (rx_ring->xsk_umem) {
-   u32 xsk_buf_len = rx_ring->xsk_umem->chunk_size_nohr -
- XDP_PACKET_HEADROOM;
+   u32 xsk_buf_len = xsk_umem_get_rx_frame_size(rx_ring->xsk_umem);
 
/* If the MAC support setting RXDCTL.RLPML, the
 * SRRCTL[n].BSIZEPKT is set to PAGE_SIZE and
@@ -4074,11 +4073,10 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter 
*adapter,
xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
ring->xsk_umem = ixgbe_xsk_umem(adapter, ring);
if (ring->xsk_umem) {
-   ring->zca.free = ixgbe_zca_free;
WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
-  MEM_TYPE_ZERO_COPY,
-  &ring->zca));
-
+  MEM_TYPE_XSK_BUFF_POOL,
+  NULL));
+   xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq);
} else {
WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
   MEM_TYPE_PAGE_SHARED, NULL));
@@ -4134,8 +4132,7 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter 
*adapter,
}
 
if (ring->xsk_umem && hw->mac.type != ixgbe_mac_82599EB) {
-   u32 xsk_buf_len = ring->xsk_umem->chunk_size_nohr -
- XDP_PACKET_HEADROOM;
+   u32 xsk_buf_len = xsk_umem_get_rx_frame_size(ring->xsk_umem);
 
rxdctl &= ~(IXGBE_RXDCTL_RLPMLMASK |
IXGBE_RXDCTL_RLPML_EN);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index 6d01700b46bc..7887ae4aaf4f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -35,7 +35,7 @@ int ixgbe_xsk_umem_setup(struct ixgbe_adapter *adapter, 
struct xdp_umem *umem,
 
 void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 
-void ixgbe_alloc_rx_buffers_zc(struct ixgbe_ring *rx_ring, u16 cleaned_count);
+bool ixgbe_alloc_rx_buffers_zc(struct ixgbe_ring *rx_ring, u16 cleaned_count);
 int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
  struct ixgbe_ring *rx_ring,
  const int budget);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 5b6edbd8a4ed..86add9fbd36c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -20,54 +20,11 @@ struct xdp_umem *ixgbe_xsk_umem(struct ixgbe_adapter 
*adapter,
ret

[PATCH bpf-next v2 12/14] xdp: simplify xdp_return_{frame,frame_rx_napi,buff}

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

The xdp_return_{frame,frame_rx_napi,buff} functions are never used by
the MEM_TYPE_XSK_BUFF_POOL memory type, except in
xdp_convert_zc_to_xdp_frame().

To simplify and reduce code, change xdp_convert_zc_to_xdp_frame() to
call xsk_buff_free() directly, since the type is known, and remove
MEM_TYPE_XSK_BUFF_POOL from the switch statement in the
__xdp_return() function.

Suggested-by: Maxim Mikityanskiy 
Signed-off-by: Björn Töpel 
---
 net/core/xdp.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/core/xdp.c b/net/core/xdp.c
index 11273c976e19..7ab1f9014c5e 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -334,10 +334,11 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
  * scenarios (e.g. queue full), it is possible to return the xdp_frame
  * while still leveraging this protection.  The @napi_direct boolean
  * is used for those calls sites.  Thus, allowing for faster recycling
- * of xdp_frames/pages in those cases.
+ * of xdp_frames/pages in those cases. This path is never used by the
+ * MEM_TYPE_XSK_BUFF_POOL memory type, so it's explicitly not part of
+ * the switch-statement.
  */
-static void __xdp_return(void *data, struct xdp_mem_info *mem, bool 
napi_direct,
-struct xdp_buff *xdp)
+static void __xdp_return(void *data, struct xdp_mem_info *mem, bool 
napi_direct)
 {
struct xdp_mem_allocator *xa;
struct page *page;
@@ -359,33 +360,29 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
page = virt_to_page(data); /* Assumes order0 page*/
put_page(page);
break;
-   case MEM_TYPE_XSK_BUFF_POOL:
-   /* NB! Only valid from an xdp_buff! */
-   xsk_buff_free(xdp);
-   break;
default:
/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
+   WARN(1, "Incorrect XDP memory type (%d) usage", mem->type);
break;
}
 }
 
 void xdp_return_frame(struct xdp_frame *xdpf)
 {
-   __xdp_return(xdpf->data, &xdpf->mem, false, NULL);
+   __xdp_return(xdpf->data, &xdpf->mem, false);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
 {
-   __xdp_return(xdpf->data, &xdpf->mem, true, NULL);
+   __xdp_return(xdpf->data, &xdpf->mem, true);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
-   __xdp_return(xdp->data, &xdp->rxq->mem, true, xdp);
+   __xdp_return(xdp->data, &xdp->rxq->mem, true);
 }
-EXPORT_SYMBOL_GPL(xdp_return_buff);
 
 /* Only called for MEM_TYPE_PAGE_POOL see xdp.h */
 void __xdp_release_frame(void *data, struct xdp_mem_info *mem)
@@ -466,7 +463,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct 
xdp_buff *xdp)
xdpf->metasize = metasize;
xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
 
-   xdp_return_buff(xdp);
+   xsk_buff_free(xdp);
return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
-- 
2.25.1

[PATCH bpf-next v2 04/14] xsk: introduce AF_XDP buffer allocation API

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

In order to simplify AF_XDP zero-copy enablement for NIC driver
developers, a new AF_XDP buffer allocation API is added. The
implementation is based on a single core (single producer/consumer)
buffer pool for the AF_XDP UMEM.

A buffer is allocated using the xsk_buff_alloc() function, and
returned using xsk_buff_free(). If a buffer is disassociated from the
pool, e.g. when a buffer is passed to an AF_XDP socket, the buffer is
said to be released. Currently, the release function is only used by
the AF_XDP internals and is not visible to the driver.

Drivers using this API should register the XDP memory model with the
new MEM_TYPE_XSK_BUFF_POOL type.

The API is defined in net/xdp_sock_drv.h.

The buffer type is struct xdp_buff, and follows the lifetime of
regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
a NAPI context. In other words, the API is not replacing xdp_frames.
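
The alloc/free/release life cycle described above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kernel implementation: `demo_pool`, `demo_buff` and all `demo_*` functions are hypothetical names, and a small array stands in for the UMEM and its fill queue.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 4 /* hypothetical number of buffers in the pool */

struct demo_pool;

struct demo_buff {
    struct demo_pool *pool; /* owning pool, NULL once released */
    unsigned int id;
};

/* Single producer/consumer: one context owns the pool, so no locking. */
struct demo_pool {
    struct demo_buff bufs[DEMO_POOL_SIZE];
    struct demo_buff *free_stack[DEMO_POOL_SIZE];
    unsigned int free_cnt;
};

/* Put every buffer on the free stack. */
void demo_pool_init(struct demo_pool *pool)
{
    unsigned int i;

    pool->free_cnt = 0;
    for (i = 0; i < DEMO_POOL_SIZE; i++) {
        pool->bufs[i].pool = pool;
        pool->bufs[i].id = i;
        pool->free_stack[pool->free_cnt++] = &pool->bufs[i];
    }
}

/* Allocate: take a buffer from the pool, or NULL if none is left. */
struct demo_buff *demo_buff_alloc(struct demo_pool *pool)
{
    if (!pool->free_cnt)
        return NULL;
    return pool->free_stack[--pool->free_cnt];
}

/* Free: hand the buffer back to its pool so it can be allocated again. */
void demo_buff_free(struct demo_buff *buf)
{
    struct demo_pool *pool = buf->pool;

    assert(pool != NULL && pool->free_cnt < DEMO_POOL_SIZE);
    pool->free_stack[pool->free_cnt++] = buf;
}

/* Release: disassociate the buffer from the pool (it was passed on). */
void demo_buff_release(struct demo_buff *buf)
{
    buf->pool = NULL;
}
```

In this sketch a released buffer simply loses its pool pointer; how a released buffer finds its way back to user space is outside what the sketch models.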

In addition to introducing the API and implementations, the AF_XDP
core is migrated to use the new APIs.

rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test
 robot)
 Added headroom/chunk size getter. (Maxim/Björn)

v1->v2: Swapped SoBs. (Maxim)

Signed-off-by: Björn Töpel 
Signed-off-by: Maxim Mikityanskiy 
---
 include/net/xdp.h   |   4 +-
 include/net/xdp_sock.h  |   2 +
 include/net/xdp_sock_drv.h  | 152 
 include/net/xsk_buff_pool.h |  54 +
 include/trace/events/xdp.h  |   3 +-
 net/core/xdp.c  |  14 +-
 net/xdp/Makefile|   1 +
 net/xdp/xdp_umem.c  |  19 +-
 net/xdp/xsk.c   | 147 +---
 net/xdp/xsk_buff_pool.c | 462 
 net/xdp/xsk_diag.c  |   2 +-
 net/xdp/xsk_queue.h |  59 +++--
 12 files changed, 800 insertions(+), 119 deletions(-)
 create mode 100644 include/net/xsk_buff_pool.h
 create mode 100644 net/xdp/xsk_buff_pool.c

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3cc6d5d84aa4..83173e4d306c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -38,6 +38,7 @@ enum xdp_mem_type {
MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
MEM_TYPE_PAGE_POOL,
MEM_TYPE_ZERO_COPY,
+   MEM_TYPE_XSK_BUFF_POOL,
MEM_TYPE_MAX,
 };
 
@@ -101,7 +102,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
int metasize;
int headroom;
 
-   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
+   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY ||
+   xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
return xdp_convert_zc_to_xdp_frame(xdp);
 
/* Assure headroom is available for storing info */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index fb7fe3060175..6e7265f63c04 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,11 +31,13 @@ struct xdp_umem_fq_reuse {
 struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
+   struct xsk_buff_pool *pool;
struct xdp_umem_page *pages;
u64 chunk_mask;
u64 size;
u32 headroom;
u32 chunk_size_nohr;
+   u32 chunk_size;
struct user_struct *user;
refcount_t users;
struct work_struct work;
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 98dd6962e6d4..5a0970d4c44c 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -7,6 +7,7 @@
 #define _LINUX_XDP_SOCK_DRV_H
 
 #include 
+#include 
 
 #ifdef CONFIG_XDP_SOCKETS
 
@@ -96,6 +97,87 @@ static inline u64 xsk_umem_adjust_offset(struct xdp_umem 
*umem, u64 address,
return address + offset;
 }
 
+static inline u32 xsk_umem_get_headroom(struct xdp_umem *umem)
+{
+   return XDP_PACKET_HEADROOM + umem->headroom;
+}
+
+static inline u32 xsk_umem_get_chunk_size(struct xdp_umem *umem)
+{
+   return umem->chunk_size;
+}
+
+static inline u32 xsk_umem_get_rx_frame_size(struct xdp_umem *umem)
+{
+   return xsk_umem_get_chunk_size(umem) - xsk_umem_get_headroom(umem);
+}
+
+static inline void xsk_buff_set_rxq_info(struct xdp_umem *umem,
+struct xdp_rxq_info *rxq)
+{
+   xp_set_rxq_info(umem->pool, rxq);
+}
+
+static inline void xsk_buff_dma_unmap(struct xdp_umem *umem,
+ unsigned long attrs)
+{
+   xp_dma_unmap(umem->pool, attrs);
+}
+
+static inline int xsk_buff_dma_map(struct xdp_umem *umem, struct device *dev,
+  unsigned long attrs)
+{
+   return xp_dma_map(umem->pool, dev, attrs, umem->pgs, umem->npgs);
+}
+
+static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp)
+{
+   struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
+
+   return xp_get_dma(xskb);
+}
+
+static inline struct xdp_buff *xsk_buff_alloc(struct xdp_umem *umem)
+{
+   return xp_alloc(umem->pool);
+}
+
+static inline bool xsk_buff_can_alloc(struct x

[PATCH bpf-next v2 14/14] MAINTAINERS, xsk: update AF_XDP section after moves/adds

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Update MAINTAINERS to correctly mirror the current AF_XDP socket file
layout. Also, add the AF_XDP files of libbpf.

rfc->v1: Sorted file entries. (Joe)

Cc: Joe Perches 
Signed-off-by: Björn Töpel 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index db7a6d462dff..79e2bb1280e6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18451,8 +18451,12 @@ R: Jonathan Lemon 
 L: netdev@vger.kernel.org
 L: b...@vger.kernel.org
 S: Maintained
-F: kernel/bpf/xskmap.c
+F: include/net/xdp_sock*
+F: include/net/xsk_buff_pool.h
+F: include/uapi/linux/if_xdp.h
 F: net/xdp/
+F: samples/bpf/xdpsock*
+F: tools/lib/bpf/xsk*
 
 XEN BLOCK SUBSYSTEM
 M: Konrad Rzeszutek Wilk 
-- 
2.25.1

[PATCH bpf-next v2 11/14] xsk: remove MEM_TYPE_ZERO_COPY and corresponding code

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding
code, including the "handle" member of struct xdp_buff.

rfc->v1: Fixed spelling in commit message. (Björn)

Signed-off-by: Björn Töpel 
---
 drivers/net/hyperv/netvsc_bpf.c |   1 -
 include/net/xdp.h   |   9 +--
 include/net/xdp_sock.h  |  45 ---
 include/net/xdp_sock_drv.h  | 139 
 include/trace/events/xdp.h  |   1 -
 net/core/xdp.c  |  42 ++
 net/xdp/xdp_umem.c  |  56 +
 net/xdp/xsk.c   |  48 +--
 net/xdp/xsk_buff_pool.c |   7 ++
 net/xdp/xsk_queue.c |  62 --
 net/xdp/xsk_queue.h | 105 
 11 files changed, 15 insertions(+), 500 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index b86611041db6..9f78f774041b 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -49,7 +49,6 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct 
netvsc_channel *nvchan,
xdp_set_data_meta_invalid(xdp);
xdp->data_end = xdp->data + len;
xdp->rxq = &nvchan->xdp_rxq;
-   xdp->handle = 0;
 
memcpy(xdp->data, data, len);
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 83173e4d306c..1495ffb7a642 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -37,7 +37,6 @@ enum xdp_mem_type {
MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
MEM_TYPE_PAGE_POOL,
-   MEM_TYPE_ZERO_COPY,
MEM_TYPE_XSK_BUFF_POOL,
MEM_TYPE_MAX,
 };
@@ -53,10 +52,6 @@ struct xdp_mem_info {
 
 struct page_pool;
 
-struct zero_copy_allocator {
-   void (*free)(struct zero_copy_allocator *zca, unsigned long handle);
-};
-
 struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
@@ -69,7 +64,6 @@ struct xdp_buff {
void *data_end;
void *data_meta;
void *data_hard_start;
-   unsigned long handle;
struct xdp_rxq_info *rxq;
 };
 
@@ -102,8 +96,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
int metasize;
int headroom;
 
-   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY ||
-   xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
+   if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
return xdp_convert_zc_to_xdp_frame(xdp);
 
/* Assure headroom is available for storing info */
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6e7265f63c04..96bfc5f5f24e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -17,26 +17,12 @@ struct net_device;
 struct xsk_queue;
 struct xdp_buff;
 
-struct xdp_umem_page {
-   void *addr;
-   dma_addr_t dma;
-};
-
-struct xdp_umem_fq_reuse {
-   u32 nentries;
-   u32 length;
-   u64 handles[];
-};
-
 struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
struct xsk_buff_pool *pool;
-   struct xdp_umem_page *pages;
-   u64 chunk_mask;
u64 size;
u32 headroom;
-   u32 chunk_size_nohr;
u32 chunk_size;
struct user_struct *user;
refcount_t users;
@@ -48,7 +34,6 @@ struct xdp_umem {
u8 flags;
int id;
struct net_device *dev;
-   struct xdp_umem_fq_reuse *fq_reuse;
bool zc;
spinlock_t xsk_tx_list_lock;
struct list_head xsk_tx_list;
@@ -109,21 +94,6 @@ static inline struct xdp_sock *__xsk_map_lookup_elem(struct 
bpf_map *map,
return xs;
 }
 
-static inline u64 xsk_umem_extract_addr(u64 addr)
-{
-   return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
-}
-
-static inline u64 xsk_umem_extract_offset(u64 addr)
-{
-   return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
-}
-
-static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
-{
-   return xsk_umem_extract_addr(addr) + xsk_umem_extract_offset(addr);
-}
-
 #else
 
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -146,21 +116,6 @@ static inline struct xdp_sock 
*__xsk_map_lookup_elem(struct bpf_map *map,
return NULL;
 }
 
-static inline u64 xsk_umem_extract_addr(u64 addr)
-{
-   return 0;
-}
-
-static inline u64 xsk_umem_extract_offset(u64 addr)
-{
-   return 0;
-}
-
-static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
-{
-   return 0;
-}
-
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 5a0970d4c44c..533ee0ce43de 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -11,16 +11,9 @@
 
 #ifdef CONFIG_XDP_SOCKETS
 
-bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
-bool xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
-void xsk_umem_release_addr(struct xdp_umem *umem);
 void xsk_umem_comple

[PATCH bpf-next v2 10/14] mlx5, xsk: migrate to new MEM_TYPE_XSK_BUFF_POOL

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

Use the new MEM_TYPE_XSK_BUFF_POOL API in lieu of MEM_TYPE_ZERO_COPY in
mlx5e. This allows a lot of code to be dropped from the driver (code that
is now common in the AF_XDP core and was related to XSK RX frame
allocation, DMA mapping, etc.) and slightly improves performance.

rfc->v1: Put back the sanity check for XSK params, use XSK API to get
 the total headroom size. (Maxim)

v1->v2: Fix DMA address handling, set XDP metadata to invalid. (Maxim)

Signed-off-by: Björn Töpel 
Signed-off-by: Maxim Mikityanskiy 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   7 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |  13 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  30 ++---
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 113 --
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  23 +++-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |   6 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c |  49 +---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  15 +--
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  33 -
 10 files changed, 94 insertions(+), 197 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0864b76ca2c0..526e59029beb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -429,10 +429,7 @@ struct mlx5e_dma_info {
dma_addr_t addr;
union {
struct page *page;
-   struct {
-   u64 handle;
-   void *data;
-   } xsk;
+   struct xdp_buff *xsk;
};
 };
 
@@ -650,7 +647,6 @@ struct mlx5e_rq {
} mpwqe;
};
struct {
-   u16umem_headroom;
u16headroom;
u8 map_dir;   /* dma map direction */
} buff;
@@ -682,7 +678,6 @@ struct mlx5e_rq {
struct page_pool  *page_pool;
 
/* AF_XDP zero-copy */
-   struct zero_copy_allocator zca;
struct xdp_umem   *umem;
 
struct work_struct recover_work;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index eb2e1f2138e4..38e4f19d69f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -12,15 +12,16 @@ static inline bool mlx5e_rx_is_xdp(struct mlx5e_params 
*params,
 u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
 struct mlx5e_xsk_param *xsk)
 {
-   u16 headroom = NET_IP_ALIGN;
+   u16 headroom;
 
-   if (mlx5e_rx_is_xdp(params, xsk)) {
+   if (xsk)
+   return xsk->headroom;
+
+   headroom = NET_IP_ALIGN;
+   if (mlx5e_rx_is_xdp(params, xsk))
headroom += XDP_PACKET_HEADROOM;
-   if (xsk)
-   headroom += xsk->headroom;
-   } else {
+   else
headroom += MLX5_RX_HEADROOM;
-   }
 
return headroom;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index b04b99396f65..a2a194525b15 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -71,7 +71,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq 
*rq,
xdptxd.data = xdpf->data;
xdptxd.len  = xdpf->len;
 
-   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) {
+   if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) {
/* The xdp_buff was in the UMEM and was copied into a newly
 * allocated page. The UMEM page was returned via the ZCA, and
 * this new page has to be mapped at this point and has to be
@@ -119,49 +119,33 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct 
mlx5e_rq *rq,
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
- void *va, u16 *rx_headroom, u32 *len, bool xsk)
+ u32 *len, struct xdp_buff *xdp)
 {
struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
-   struct xdp_umem *umem = rq->umem;
-   struct xdp_buff xdp;
u32 act;
int err;
 
if (!prog)
return false;
 
-   xdp.data = va + *rx_headroom;
-   xdp_set_data_meta_invalid(&xdp);
-   xdp.data_end = xdp.data + *len;
-   xdp.data_hard_start = va;
-   if (xsk)
-   xdp.handle = di->xsk.handle;
-   xdp.rxq = &rq->xdp_rxq;
-
-   act = bpf_prog_run_xdp(prog, &xdp);
-   if (xsk) {
-   u64 off = xdp.data - xdp.data_hard_start;
-
-   xdp.handle = xsk_umem_adjust_offset(umem, xdp.handle, off);
-   }
+   act = bpf_prog_run_xdp(prog

[PATCH bpf-next v2 13/14] xsk: explicitly inline functions and move definitions

2020-05-14 Thread Björn Töpel

From: Björn Töpel 

In order to reduce the number of function calls, the struct
xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions
xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(),
xp_validate_desc() and various helper functions are explicitly
inlined.

Further, move xp_get_handle() and xp_release() to xsk.c, to allow for
the compiler to perform inlining.

rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)

Signed-off-by: Björn Töpel 
---
 include/net/xsk_buff_pool.h |  92 +--
 net/xdp/xsk.c   |  15 
 net/xdp/xsk_buff_pool.c | 142 ++--
 net/xdp/xsk_queue.h |  45 
 4 files changed, 151 insertions(+), 143 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 9abef166441d..029522696ccb 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -4,6 +4,7 @@
 #ifndef XSK_BUFF_POOL_H_
 #define XSK_BUFF_POOL_H_
 
+#include 
 #include 
 #include 
 #include 
@@ -24,6 +25,27 @@ struct xdp_buff_xsk {
struct list_head free_list_node;
 };
 
+struct xsk_buff_pool {
+   struct xsk_queue *fq;
+   struct list_head free_list;
+   dma_addr_t *dma_pages;
+   struct xdp_buff_xsk *heads;
+   u64 chunk_mask;
+   u64 addrs_cnt;
+   u32 free_list_cnt;
+   u32 dma_pages_cnt;
+   u32 heads_cnt;
+   u32 free_heads_cnt;
+   u32 headroom;
+   u32 chunk_size;
+   u32 frame_len;
+   bool cheap_dma;
+   bool unaligned;
+   void *addrs;
+   struct device *dev;
+   struct xdp_buff_xsk *free_heads[];
+};
+
 /* AF_XDP core. */
 struct xsk_buff_pool *xp_create(struct page **pages, u32 nr_pages, u32 chunks,
u32 chunk_size, u32 headroom, u64 size,
@@ -31,8 +53,6 @@ struct xsk_buff_pool *xp_create(struct page **pages, u32 
nr_pages, u32 chunks,
 void xp_set_fq(struct xsk_buff_pool *pool, struct xsk_queue *fq);
 void xp_destroy(struct xsk_buff_pool *pool);
 void xp_release(struct xdp_buff_xsk *xskb);
-u64 xp_get_handle(struct xdp_buff_xsk *xskb);
-bool xp_validate_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
 
 /* AF_XDP, and XDP core. */
 void xp_free(struct xdp_buff_xsk *xskb);
@@ -46,9 +66,69 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool);
 bool xp_can_alloc(struct xsk_buff_pool *pool, u32 count);
 void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr);
 dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr);
-dma_addr_t xp_get_dma(struct xdp_buff_xsk *xskb);
-void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb);
-void xp_dma_sync_for_device(struct xsk_buff_pool *pool, dma_addr_t dma,
-   size_t size);
+static inline dma_addr_t xp_get_dma(struct xdp_buff_xsk *xskb)
+{
+   return xskb->dma;
+}
+
+void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb);
+static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
+{
+   if (xskb->pool->cheap_dma)
+   return;
+
+   xp_dma_sync_for_cpu_slow(xskb);
+}
+
+void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
+size_t size);
+static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
+ dma_addr_t dma, size_t size)
+{
+   if (pool->cheap_dma)
+   return;
+
+   xp_dma_sync_for_device_slow(pool, dma, size);
+}
+
+/* Masks for xdp_umem_page flags.
+ * The low 12-bits of the addr will be 0 since this is the page address, so we
+ * can use them for flags.
+ */
+#define XSK_NEXT_PG_CONTIG_SHIFT 0
+#define XSK_NEXT_PG_CONTIG_MASK BIT_ULL(XSK_NEXT_PG_CONTIG_SHIFT)
+
+static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
+u64 addr, u32 len)
+{
+   bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
+
+   if (pool->dma_pages_cnt && cross_pg) {
+   return !(pool->dma_pages[addr >> PAGE_SHIFT] &
+XSK_NEXT_PG_CONTIG_MASK);
+   }
+   return false;
+}
+
+static inline u64 xp_aligned_extract_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+   return addr & pool->chunk_mask;
+}
+
+static inline u64 xp_unaligned_extract_addr(u64 addr)
+{
+   return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
+}
+
+static inline u64 xp_unaligned_extract_offset(u64 addr)
+{
+   return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+}
+
+static inline u64 xp_unaligned_add_offset_to_addr(u64 addr)
+{
+   return xp_unaligned_extract_addr(addr) +
+   xp_unaligned_extract_offset(addr);
+}
 
 #endif /* XSK_BUFF_POOL_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3f2ab732ab8b..b6c0f08bd80d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -99,6 +99,21 @@ bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem)
 }
 EXPORT_SYMBOL(xsk_umem_uses_need_wakeup);
 
+void xp_release(struct xdp_bu

Re: signal quality and cable diagnostic

2020-05-14 Thread Christian Herber

On Tue, May 14, 2020 at 08:28:00AM +, Oleksij Rempel wrote:
> On Thu, May 14, 2020 at 07:13:30AM +, Christian Herber wrote:
> > On Tue, May 12, 2020 at 10:22:01AM +0200, Oleksij Rempel wrote:
> >
> > > So I think we should pass raw SQI value to user space, at least in the
> > > first implementation.
> >
> > > What do you think about this?
> >
> > Hi Oleksij,
> >
> > I had a check about the background of this SQI thing. The table you 
> > reference with concrete SNR values is informative only and not a 
> > requirement. The requirements are rather loose.
> >
> > This is from OA:
> > - Only for SQI=0 a link loss shall occur.
> > - The indicated signal quality shall be monotonically increasing/decreasing
> > with noise level.
> > - It shall be indicated in the datasheet at which level a BER<10^-10 
> > (better than 10^-10) is achieved (e.g. "from SQI=3 to SQI=7 the link has a 
> > BER<10^-10 (better than 10^-10)")
> >
> > I.e. SQI does not need to have a direct correlation with SNR. The 
> > fundamental underlying metric is the BER.
> > You can report the raw SQI level and users would have to look up what it 
> > means in the respective data sheet. There is no guaranteed relation between 
> > SQI levels of different devices, i.e. SQI 5 can have lower BER than SQI 6 
> > on another device.
> > Alternatively, you could report BER < x for the different SQI levels. 
> > However, this requires the information to be available. While I could 
> > provide these for NXP, it might not be easily available for other vendors.
> > If reporting raw SQI, at least the SQI level for BER<10^-10 should be 
> > presented to give any meaning to the value.

> So the question is, which values to provide via KAPI to user space?
>
> - SQI
>   The PHY can probably measure the SNR quite fast and has some internal
>   function or lookup table to deduce the SQI from the measured SNR.
>
>   If I understand you correctly, we can only compare SQI values of the
>   same PHY, as different PHYs give different SQIs for the same link
>   characteristics (=SNR).
> - SNR range
>   We read the SQI from the PHY, look up the SNR range for that value in
>   the data sheet and provide that value to user space. This gives a
>   better description of the quality of the link.
> - "guesstimated" BER
>   The manufacturer of the PHY has probably done some extensive testing
>   that a measured SNR can be correlated to some BER. This value may be
>   provided in the data sheet, too.
>
> The SNR seems to be the most universal value when it comes to comparing
> different situations (different links and different PHYs). The
> resolution of BER is not that detailed; for the NXP PHY it says only
> "BER below 1e-10" or not.

The point I was trying to make is that SQI is intentionally called SQI and NOT
SNR, because it is not a measure of SNR. The standard only suggests a mapping
of SNR to SQI, but vendors do not need to comply with that or report it. The
only mandatory requirement is the link to BER. BER is also what a user would
need, as this is the metric that determines what happens to your traffic, not
the SNR.

So when it comes to KAPI parameters, I see the following options:
- SQI only
- SQI plus an indication of the SQI level at which BER<10^-10 (this is the only
  required and standardized information)
- SQI + BER range (best for users, but requires input from the silicon vendors)
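
A minimal sketch of the second option, assuming a report that carries the raw SQI together with the datasheet level from which BER<10^-10 holds. The structure and function names are hypothetical and do not correspond to any existing kernel or ethtool interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical report layout, for illustration only. */
struct demo_sqi_report {
    unsigned int sqi;        /* raw SQI value read from the PHY */
    unsigned int sqi_max;    /* highest SQI level this PHY can report */
    unsigned int sqi_ber_ok; /* lowest SQI level with BER < 10^-10 (datasheet) */
};

/*
 * True when the raw SQI is at or above the level from which the vendor
 * states a BER better than 10^-10. The comparison only has meaning for
 * one PHY type, because SQI levels of different devices are unrelated.
 */
bool demo_sqi_meets_ber_target(const struct demo_sqi_report *r)
{
    assert(r != NULL);
    assert(r->sqi <= r->sqi_max && r->sqi_ber_ok <= r->sqi_max);
    return r->sqi >= r->sqi_ber_ok;
}
```

With the datasheet example quoted earlier ("from SQI=3 to SQI=7"), `sqi_ber_ok` would be 3 and `sqi_max` would be 7, so a raw SQI of 5 meets the target and a raw SQI of 2 does not.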

SNR in my opinion is neither an option nor helpful.

Regards,

Christian

Re: [PATCH 11/18] maccess: remove strncpy_from_unsafe

2020-05-14 Thread Masami Hiramatsu

On Wed, 13 May 2020 19:43:24 -0700
Linus Torvalds  wrote:

> On Wed, May 13, 2020 at 6:00 PM Masami Hiramatsu  wrote:
> >
> > > But we should likely at least disallow it entirely on platforms where
> > > we really can't - or pick one hardcoded choice. On sparc, you really
> > > _have_ to specify one or the other.
> >
> > OK. BTW, is there any way to detect the kernel/user space overlap on
> > memory layout statically? If there, I can do it. (I don't like
> > "if (CONFIG_X86)" thing)
> > Or, maybe we need CONFIG_ARCH_OVERLAP_ADDRESS_SPACE?
> 
> I think it would be better to have a CONFIG variable that
> architectures can just 'select' to show that they are ok with separate
> kernel and user addresses.
> 
> Because I don't think we have any way to say that right now as-is. You
> can probably come up with hacky ways to approximate it, ie something
> like
> 
> if (TASK_SIZE_MAX > PAGE_OFFSET)
>  they overlap ..
> 
> which would almost work, but..

It seems TASK_SIZE_MAX is defined only on x86 and s390. What about
comparing STACK_TOP_MAX with PAGE_OFFSET?
Anyway, I agree that the best way is to introduce a CONFIG.
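
The two approaches discussed here can be contrasted in a small user-space sketch. Everything in it is an assumption made for illustration: the structure, the field names and the address values are made up, `user_top` stands in for TASK_SIZE_MAX or STACK_TOP_MAX, `kernel_base` for PAGE_OFFSET, and `selects_separate` for a CONFIG symbol that an architecture would 'select'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-architecture description; not a kernel structure. */
struct demo_arch_layout {
    uint64_t user_top;     /* stand-in for TASK_SIZE_MAX / STACK_TOP_MAX */
    uint64_t kernel_base;  /* stand-in for PAGE_OFFSET */
    bool selects_separate; /* stand-in for a 'select'-able CONFIG symbol */
};

/* Heuristic: the ranges overlap if user space reaches past the kernel base. */
bool demo_layout_overlaps(const struct demo_arch_layout *l)
{
    assert(l != NULL);
    return l->user_top > l->kernel_base;
}

/* Explicit opt-in: rely only on what the architecture declared. */
bool demo_can_guess_address_space(const struct demo_arch_layout *l)
{
    assert(l != NULL);
    return l->selects_separate;
}
```

The heuristic needs both constants to exist on every architecture, while the opt-in needs nothing beyond the one symbol.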

Thank you,

-- 
Masami Hiramatsu

RE: [PATCH 32/33] sctp: add sctp_sock_get_primary_addr

2020-05-14 Thread David Laight

From: Marcelo Ricardo Leitner
> Sent: 13 May 2020 19:03
> 
> On Wed, May 13, 2020 at 08:26:47AM +0200, Christoph Hellwig wrote:
> > Add a helper to directly get the SCTP_PRIMARY_ADDR sockopt from kernel
> > space without going through a fake uaccess.
> 
> Same comment as on the other dlm/sctp patch.

Wouldn't it be best to write sctp_[gs]etsockopt() functions that
use a kernel buffer, and then implement the user-space
calls using a wrapper that does the copies to an on-stack
(or malloced if big) buffer?

That will also simplify the code by removing all the copies
and -EFAULT returns.
Only the size checks will be needed and the code can assume
the buffer is at least the size of the on-stack buffer.
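
The wrapper idea can be sketched in user-space C as follows. This is a simulation under stated assumptions, not SCTP code: all `demo_*` names are hypothetical, the option value is made up, and the two copy helpers merely stand in for copy_from_user()/copy_to_user() by treating a NULL pointer as a fault.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_OPT_MAX 64 /* hypothetical size of the on-stack buffer */

/* Stand-in for copy_from_user(): a NULL "user" pointer faults. */
static int demo_copy_from_user(void *dst, const void *usrc, size_t len)
{
    if (!usrc)
        return -EFAULT;
    memcpy(dst, usrc, len);
    return 0;
}

/* Stand-in for copy_to_user(): a NULL "user" pointer faults. */
static int demo_copy_to_user(void *udst, const void *src, size_t len)
{
    if (!udst)
        return -EFAULT;
    memcpy(udst, src, len);
    return 0;
}

/* Core handler: sees only a kernel buffer, so it has no -EFAULT paths. */
int demo_kernel_getsockopt(int optname, void *kbuf, size_t *len)
{
    int value = 42 + optname; /* made-up option value */

    if (*len < sizeof(value))
        return -EINVAL;
    memcpy(kbuf, &value, sizeof(value));
    *len = sizeof(value);
    return 0;
}

/* User-facing wrapper: the only place that copies across the boundary. */
int demo_user_getsockopt(int optname, void *uoptval, size_t *uoptlen)
{
    unsigned char kbuf[DEMO_OPT_MAX];
    size_t len;
    int err;

    err = demo_copy_from_user(&len, uoptlen, sizeof(len));
    if (err)
        return err;
    if (len > sizeof(kbuf))
        len = sizeof(kbuf); /* a big option would need a heap buffer */

    err = demo_kernel_getsockopt(optname, kbuf, &len);
    if (err)
        return err;
    assert(len <= sizeof(kbuf));

    err = demo_copy_to_user(uoptval, kbuf, len);
    if (err)
        return err;
    return demo_copy_to_user(uoptlen, &len, sizeof(len));
}
```

An in-kernel caller would call the core handler directly with its own buffer, which is what removes the need for a fake uaccess.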

Our SCTP code uses SO_REUSEADDR, SCTP_EVENTS, SCTP_NODELAY,
SCTP_STATUS, SCTP_INITMSG, IPV6_ONLY, SCTP_SOCKOPT_BINDX_ADD
and SO_LINGER.

David


[PATCH v2 net-next 07/11] net: qede: optional hw recovery procedure

2020-05-14 Thread Igor Russkikh

The driver is able to initiate a recovery process in reaction to
detected errors, but the code path (recovery_process) was disabled and
never active.

Add an ethtool private flag that lets the user activate the recovery
procedure.

The flag is still not enabled by default, since recovery is not desirable
in some configurations, e.g. it may impact other PFs/VFs.

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 .../net/ethernet/qlogic/qede/qede_ethtool.c   | 24 +++
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 812c7766e096..24cc68391ac4 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -190,12 +190,14 @@ static const struct {
 enum {
QEDE_PRI_FLAG_CMT,
QEDE_PRI_FLAG_SMART_AN_SUPPORT, /* MFW supports SmartAN */
+   QEDE_PRI_FLAG_RECOVER_ON_ERROR,
QEDE_PRI_FLAG_LEN,
 };
 
 static const char qede_private_arr[QEDE_PRI_FLAG_LEN][ETH_GSTRING_LEN] = {
"Coupled-Function",
"SmartAN capable",
+   "Recover on error",
 };
 
 enum qede_ethtool_tests {
@@ -417,9 +419,30 @@ static u32 qede_get_priv_flags(struct net_device *dev)
if (edev->dev_info.common.smart_an)
flags |= BIT(QEDE_PRI_FLAG_SMART_AN_SUPPORT);
 
+   if (edev->err_flags & BIT(QEDE_ERR_IS_RECOVERABLE))
+   flags |= BIT(QEDE_PRI_FLAG_RECOVER_ON_ERROR);
+
return flags;
 }
 
+static int qede_set_priv_flags(struct net_device *dev, u32 flags)
+{
+   struct qede_dev *edev = netdev_priv(dev);
+   u32 cflags = qede_get_priv_flags(dev);
+   u32 dflags = flags ^ cflags;
+
+   /* can only change RECOVER_ON_ERROR flag */
+   if (dflags & ~BIT(QEDE_PRI_FLAG_RECOVER_ON_ERROR))
+   return -EINVAL;
+
+   if (flags & BIT(QEDE_PRI_FLAG_RECOVER_ON_ERROR))
+   set_bit(QEDE_ERR_IS_RECOVERABLE, &edev->err_flags);
+   else
+   clear_bit(QEDE_ERR_IS_RECOVERABLE, &edev->err_flags);
+
+   return 0;
+}
+
 struct qede_link_mode_mapping {
u32 qed_link_mode;
u32 ethtool_link_mode;
@@ -2098,6 +2121,7 @@ static const struct ethtool_ops qede_ethtool_ops = {
.set_phys_id = qede_set_phys_id,
.get_ethtool_stats = qede_get_ethtool_stats,
.get_priv_flags = qede_get_priv_flags,
+   .set_priv_flags = qede_set_priv_flags,
.get_sset_count = qede_get_sset_count,
.get_rxnfc = qede_get_rxnfc,
.set_rxnfc = qede_set_rxnfc,
-- 
2.17.1

[PATCH v2 net-next 06/11] net: qed: attention clearing properties

2020-05-14 Thread Igor Russkikh

Different hardware events require different responses. For some
hardware indications, the driver must clear the hw attention (error
condition) to continue normal functioning.

Introduce attention clear flags and set them on some important
events (in aeu_descs).

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed.h|  3 +++
 drivers/net/ethernet/qlogic/qed/qed_int.c| 22 
 drivers/net/ethernet/qlogic/qed/qed_int.h| 11 ++
 drivers/net/ethernet/qlogic/qed/qed_main.c   |  7 ++-
 drivers/net/ethernet/qlogic/qede/qede_main.c |  6 ++
 include/linux/qed/qed_if.h   |  9 
 6 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index 07f6ef930b52..66ed39d6f357 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -838,6 +838,9 @@ struct qed_dev {
/* Recovery */
bool recov_in_prog;
 
+   /* Indicates whether should prevent attentions from being reasserted */
+   bool attn_clr_en;
+
/* LLH info */
u8 ppfid_bitmap;
struct qed_llh_info *p_llh_info;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c 
b/drivers/net/ethernet/qlogic/qed/qed_int.c
index 1b1447b2f059..b7b974f0ef21 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.c
@@ -96,6 +96,7 @@ struct aeu_invert_reg_bit {
 #define ATTENTION_BB(value) (value << ATTENTION_BB_SHIFT)
 #define ATTENTION_BB_DIFFERENT  BIT(23)
 
+#define ATTENTION_CLEAR_ENABLE  BIT(28)
unsigned int flags;
 
/* Callback to call if attention will be triggered */
@@ -371,6 +372,13 @@ static int qed_fw_assertion(struct qed_hwfn *p_hwfn)
return -EINVAL;
 }
 
+static int qed_general_attention_35(struct qed_hwfn *p_hwfn)
+{
+   DP_INFO(p_hwfn, "General attention 35!\n");
+
+   return 0;
+}
+
 #define QED_DORQ_ATTENTION_REASON_MASK  (0xf)
 #define QED_DORQ_ATTENTION_OPAQUE_MASK  (0x)
 #define QED_DORQ_ATTENTION_OPAQUE_SHIFT (0x0)
@@ -613,14 +621,15 @@ static struct aeu_invert_reg aeu_descs[NUM_ATTN_REGS] = {
 
{
{   /* After Invert 4 */
-   {"General Attention 32", ATTENTION_SINGLE,
-qed_fw_assertion,
+   {"General Attention 32", ATTENTION_SINGLE |
+ATTENTION_CLEAR_ENABLE, qed_fw_assertion,
 MAX_BLOCK_ID},
{"General Attention %d",
 (2 << ATTENTION_LENGTH_SHIFT) |
 (33 << ATTENTION_OFFSET_SHIFT), NULL, MAX_BLOCK_ID},
-   {"General Attention 35", ATTENTION_SINGLE,
-NULL, MAX_BLOCK_ID},
+   {"General Attention 35", ATTENTION_SINGLE |
+ATTENTION_CLEAR_ENABLE, qed_general_attention_35,
+MAX_BLOCK_ID},
{"NWS Parity",
 ATTENTION_PAR | ATTENTION_BB_DIFFERENT |
 ATTENTION_BB(AEU_INVERT_REG_SPECIAL_CNIG_0),
@@ -2361,6 +2370,11 @@ void qed_int_disable_post_isr_release(struct qed_dev 
*cdev)
cdev->hwfns[i].b_int_requested = false;
 }
 
+void qed_int_attn_clr_enable(struct qed_dev *cdev, bool clr_enable)
+{
+   cdev->attn_clr_en = clr_enable;
+}
+
 int qed_int_set_timer_res(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt,
  u8 timer_res, u16 sb_id, bool tx)
 {
diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.h 
b/drivers/net/ethernet/qlogic/qed/qed_int.h
index 9ad568d93ae6..e09db3386367 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.h
@@ -190,6 +190,17 @@ void qed_int_get_num_sbs(struct qed_hwfn   *p_hwfn,
  */
 void qed_int_disable_post_isr_release(struct qed_dev *cdev);
 
+/**
+ * @brief qed_int_attn_clr_enable - sets whether the general behavior is
+ *preventing attentions from being reasserted, or following the
+ *attributes of the specific attention.
+ *
+ * @param cdev
+ * @param clr_enable
+ *
+ */
+void qed_int_attn_clr_enable(struct qed_dev *cdev, bool clr_enable);
+
 /**
  * @brief - Doorbell Recovery handler.
  *  Run doorbell recovery in case of PF overflow (and flush DORQ if
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index d7c9d94e4c59..83e798d4eebb 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -2491,10 +2491,14 @@ void qed_hw_error_occurred(struct qed_hwfn *p_hwfn,
 
DP_NOTICE(p_hwfn, "HW error occurred [%s]\n", err_str);
 
-   /* Call the HW error handler of the protocol driver
+   /* C

[PATCH v2 net-next 04/11] net: qed: critical err reporting to management firmware

2020-05-14 Thread Igor Russkikh

On various critical errors, the notification handler should also report
the error information to the management firmware (MFW).

The MFW can interact with server/motherboard backend agents, which
server manufacturers use to monitor server HW health.

Thus, it is important for the driver to report any faulty conditions.
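
The payload size is passed to the firmware as a field inside the mailbox
parameter word. A minimal standalone sketch of that kind of mask/offset
packing follows; set_field and dbg_param_for_size are hypothetical helpers,
and the offset and mask values are illustrative assumptions rather than
authoritative driver constants.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values; check the driver headers for the real ones. */
#define DBG_SIZE_OFFSET 0
#define DBG_SIZE_MASK   0xFFu

/* Replace the bits selected by mask with value shifted to offset. */
static uint32_t set_field(uint32_t word, uint32_t mask, unsigned int offset,
			  uint32_t value)
{
	word &= ~mask;                    /* drop the old field contents */
	word |= (value << offset) & mask; /* insert the new, bounded value */
	return word;
}

/* Build a mailbox parameter word carrying only the payload size. */
static uint32_t dbg_param_for_size(uint8_t size)
{
	return set_field(0, DBG_SIZE_MASK, DBG_SIZE_OFFSET, size);
}
```

Masking after the shift keeps an oversized value from spilling into
neighbouring fields of the same word.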

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  19 
 drivers/net/ethernet/qlogic/qed/qed_hw.c  |   3 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 125 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  15 +++
 4 files changed, 162 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 4597015b8bff..21d53b00c2e6 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12492,6 +12492,8 @@ struct public_drv_mb {
 #define DRV_MSG_CODE_GET_ENGINE_CONFIG 0x0037
 #define DRV_MSG_CODE_GET_PPFID_BITMAP  0x4300
 
+#define DRV_MSG_CODE_DEBUG_DATA_SEND   0xc004
+
 #define RESOURCE_CMD_REQ_RESC_MASK 0x001F
 #define RESOURCE_CMD_REQ_RESC_SHIFT 0
 #define RESOURCE_CMD_REQ_OPCODE_MASK   0x00E0
@@ -12626,6 +12628,17 @@ struct public_drv_mb {
 #define DRV_MB_PARAM_FEATURE_SUPPORT_PORT_EEE  0x0002
 #define DRV_MB_PARAM_FEATURE_SUPPORT_FUNC_VLINK 0x0001
 
+/* DRV_MSG_CODE_DEBUG_DATA_SEND parameters */
+#define DRV_MSG_CODE_DEBUG_DATA_SEND_SIZE_OFFSET   0
+#define DRV_MSG_CODE_DEBUG_DATA_SEND_SIZE_MASK 0xFF
+
+/* Driver attributes params */
+#define DRV_MB_PARAM_ATTRIBUTE_KEY_OFFSET  0
+#define DRV_MB_PARAM_ATTRIBUTE_KEY_MASK 0x00FF
+#define DRV_MB_PARAM_ATTRIBUTE_CMD_OFFSET  24
+#define DRV_MB_PARAM_ATTRIBUTE_CMD_MASK 0xFF00
+
+#define DRV_MB_PARAM_NVM_CFG_OPTION_ID_OFFSET  0
 #define DRV_MB_PARAM_NVM_CFG_OPTION_ID_SHIFT   0
 #define DRV_MB_PARAM_NVM_CFG_OPTION_ID_MASK0x
 #define DRV_MB_PARAM_NVM_CFG_OPTION_ALL_SHIFT  16
@@ -12678,6 +12691,12 @@ struct public_drv_mb {
 #define FW_MSG_CODE_DRV_CFG_PF_VFS_MSIX_DONE   0x0087
 #define FW_MSG_SEQ_NUMBER_MASK 0x
 
+#define FW_MSG_CODE_DEBUG_DATA_SEND_INV_ARG0xb007
+#define FW_MSG_CODE_DEBUG_DATA_SEND_BUF_FULL   0xb008
+#define FW_MSG_CODE_DEBUG_DATA_SEND_NO_BUF 0xb009
+#define FW_MSG_CODE_DEBUG_NOT_ENABLED  0xb00a
+#define FW_MSG_CODE_DEBUG_DATA_SEND_OK 0xb00b
+
u32 fw_mb_param;
 #define FW_MB_PARAM_RESOURCE_ALLOC_VERSION_MAJOR_MASK  0x
 #define FW_MB_PARAM_RESOURCE_ALLOC_VERSION_MAJOR_SHIFT 16
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.c 
b/drivers/net/ethernet/qlogic/qed/qed_hw.c
index 2d176e1b508c..5fa251489536 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.c
@@ -868,6 +868,9 @@ void qed_hw_err_notify(struct qed_hwfn *p_hwfn,
}
 
qed_hw_error_occurred(p_hwfn, err_type);
+
+   if (fmt)
+   qed_mcp_send_raw_debug_data(p_hwfn, p_ptt, buf, len);
 }
 
 int qed_dmae_sanity(struct qed_hwfn *p_hwfn,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 46653afc385c..62be13d49dd8 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -3821,3 +3821,128 @@ int qed_mcp_nvm_set_cfg(struct qed_hwfn *p_hwfn, struct 
qed_ptt *p_ptt,
  DRV_MSG_CODE_SET_NVM_CFG_OPTION,
  mb_param, &resp, &param, len, (u32 *)p_buf);
 }
+
+#define QED_MCP_DBG_DATA_MAX_SIZE   MCP_DRV_NVM_BUF_LEN
+#define QED_MCP_DBG_DATA_MAX_HEADER_SIZE sizeof(u32)
+#define QED_MCP_DBG_DATA_MAX_PAYLOAD_SIZE \
+   (QED_MCP_DBG_DATA_MAX_SIZE - QED_MCP_DBG_DATA_MAX_HEADER_SIZE)
+
+static int
+__qed_mcp_send_debug_data(struct qed_hwfn *p_hwfn,
+ struct qed_ptt *p_ptt, u8 *p_buf, u8 size)
+{
+   struct qed_mcp_mb_params mb_params;
+   int rc;
+
+   if (size > QED_MCP_DBG_DATA_MAX_SIZE) {
+   DP_ERR(p_hwfn,
+  "Debug data size is %d while it should not exceed %d\n",
+  size, QED_MCP_DBG_DATA_MAX_SIZE);
+   return -EINVAL;
+   }
+
+   memset(&mb_params, 0, sizeof(mb_params));
+   mb_params.cmd = DRV_MSG_CODE_DEBUG_DATA_SEND;
+   SET_MFW_FIELD(mb_params.param, DRV_MSG_CODE_DEBUG_DATA_SEND_SIZE, size);
+   mb_params.p_data_src = p_buf;
+   mb_params.data_src_size = size;
+   rc = qed_mcp_cmd_and_union(p_hwfn, p_ptt, &mb_params);
+   if (rc)
+   return rc;
+
+   if (mb_params.mcp_resp == FW_MSG_CODE_UNSUPPORTED) {
+   DP_INFO(p_hwfn,
+

[PATCH v2 net-next 08/11] net: qede: Implement ndo_tx_timeout

2020-05-14 Thread Igor Russkikh

From: Denis Bolotin 

On TX timeout detection we disable the carrier and print TX queue
info. We then raise a hw error condition and trigger the service
task to handle it.

This handler will capture extra debug info and then optionally
trigger the recovery procedure to try to restore function.
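
The timeout handler must not start a second round of error handling while
one is already running. A userspace sketch of that guard, using C11 atomics
in place of the kernel's test_and_set_bit, could look as follows;
test_and_set_flag and claim_error_handling are hypothetical names, and only
the bit index mirrors QEDE_ERR_IS_HANDLED from this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ERR_IS_HANDLED 31 /* bit index, as QEDE_ERR_IS_HANDLED */

/* Atomically set a bit and report whether it was already set. */
static bool test_and_set_flag(atomic_ulong *flags, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Returns true if this caller should schedule the error handling.
 * As in the patch, the flag is set before the recovery state is
 * checked, so it stays set when recovery is already in progress.
 */
static bool claim_error_handling(atomic_ulong *err_flags, bool in_recovery)
{
	if (test_and_set_flag(err_flags, ERR_IS_HANDLED) || in_recovery)
		return false;
	return true;
}
```

Only the first caller wins; later callers see the bit already set and back
off until the deferred handler clears it.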

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qede/qede.h  |  1 -
 drivers/net/ethernet/qlogic/qede/qede_main.c | 46 
 2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index 695d645d9ba9..8857da1208d7 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -533,7 +533,6 @@ u16 qede_select_queue(struct net_device *dev, struct 
sk_buff *skb,
 netdev_features_t qede_features_check(struct sk_buff *skb,
  struct net_device *dev,
  netdev_features_t features);
-void qede_tx_log_print(struct qede_dev *edev, struct qede_fastpath *fp);
 int qede_alloc_rx_buffer(struct qede_rx_queue *rxq, bool allow_lazy);
 int qede_free_tx_pkt(struct qede_dev *edev,
 struct qede_tx_queue *txq, int *len);
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index ee7662da6413..f50d9a9b76be 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -539,6 +539,51 @@ static int qede_ioctl(struct net_device *dev, struct ifreq 
*ifr, int cmd)
return 0;
 }
 
+static void qede_tx_log_print(struct qede_dev *edev, struct qede_tx_queue *txq)
+{
+   DP_NOTICE(edev,
+ "Txq[%d]: FW cons [host] %04x, SW cons %04x, SW prod %04x 
[Jiffies %lu]\n",
+ txq->index, le16_to_cpu(*txq->hw_cons_ptr),
+ qed_chain_get_cons_idx(&txq->tx_pbl),
+ qed_chain_get_prod_idx(&txq->tx_pbl),
+ jiffies);
+}
+
+static void qede_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+   struct qede_dev *edev = netdev_priv(dev);
+   struct qede_tx_queue *txq;
+   int cos;
+
+   netif_carrier_off(dev);
+   DP_NOTICE(edev, "TX timeout on queue %u!\n", txqueue);
+
+   if (!(edev->fp_array[txqueue].type & QEDE_FASTPATH_TX))
+   return;
+
+   for_each_cos_in_txq(edev, cos) {
+   txq = &edev->fp_array[txqueue].txq[cos];
+
+   if (qed_chain_get_cons_idx(&txq->tx_pbl) !=
+   qed_chain_get_prod_idx(&txq->tx_pbl))
+   qede_tx_log_print(edev, txq);
+   }
+
+   if (IS_VF(edev))
+   return;
+
+   if (test_and_set_bit(QEDE_ERR_IS_HANDLED, &edev->err_flags) ||
+   edev->state == QEDE_STATE_RECOVERY) {
+   DP_INFO(edev,
+   "Avoid handling a Tx timeout while another HW error is 
being handled\n");
+   return;
+   }
+
+   set_bit(QEDE_ERR_GET_DBG_INFO, &edev->err_flags);
+   set_bit(QEDE_SP_HW_ERR, &edev->sp_flags);
+   schedule_delayed_work(&edev->sp_task, 0);
+}
+
 static int qede_setup_tc(struct net_device *ndev, u8 num_tc)
 {
struct qede_dev *edev = netdev_priv(ndev);
@@ -626,6 +671,7 @@ static const struct net_device_ops qede_netdev_ops = {
.ndo_validate_addr = eth_validate_addr,
.ndo_change_mtu = qede_change_mtu,
.ndo_do_ioctl = qede_ioctl,
+   .ndo_tx_timeout = qede_tx_timeout,
 #ifdef CONFIG_QED_SRIOV
.ndo_set_vf_mac = qede_set_vf_mac,
.ndo_set_vf_vlan = qede_set_vf_vlan,
-- 
2.17.1

[PATCH v2 net-next 00/11] net: qed/qede: critical hw error handling

2020-05-14 Thread Igor Russkikh

FastLinQ devices, as complex systems, may encounter various hardware
level error conditions, both severe and recoverable.

The driver is able to detect and report these, but so far it only did
trace/dmesg based reporting.

Here we implement extended hw error detection; the service task
handler captures a dump for later analysis.

I also resubmit a patch from Denis Bolotin on the tx timeout handler,
addressing David's comment regarding the recovery procedure as an extra
reaction to this event.

v2:

Removing the patch with ethtool dump and udev magic. It's quite isolated;
I'm working on devlink based logic for this separately.

v1:

https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irussk...@marvell.com/

Denis Bolotin (1):
  net: qede: Implement ndo_tx_timeout

Igor Russkikh (10):
  net: qed: adding hw_err states and handling
  net: qede: add hw err scheduled handler
  net: qed: invoke err notify on critical areas
  net: qed: critical err reporting to management firmware
  net: qed: cleanup debug related declarations
  net: qed: attention clearing properties
  net: qede: optional hw recovery procedure
  net: qed: introduce critical fan failure handler
  net: qed: introduce critical hardware error handler
  net: qed: fix bad formatting

 drivers/net/ethernet/qlogic/qed/qed.h |  16 +-
 drivers/net/ethernet/qlogic/qed/qed_debug.c   |  26 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c |   4 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  49 +++-
 drivers/net/ethernet/qlogic/qed/qed_hw.c  |  42 ++-
 drivers/net/ethernet/qlogic/qed/qed_hw.h  |  15 ++
 drivers/net/ethernet/qlogic/qed/qed_int.c |  40 ++-
 drivers/net/ethernet/qlogic/qed/qed_int.h |  11 +
 drivers/net/ethernet/qlogic/qed/qed_main.c|  34 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 254 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  28 ++
 drivers/net/ethernet/qlogic/qed/qed_spq.c |  16 +-
 drivers/net/ethernet/qlogic/qede/qede.h   |  14 +-
 .../net/ethernet/qlogic/qede/qede_ethtool.c   |  24 ++
 drivers/net/ethernet/qlogic/qede/qede_main.c  | 147 +-
 include/linux/qed/qed_if.h|  26 +-
 16 files changed, 700 insertions(+), 46 deletions(-)

-- 
2.17.1

[PATCH v2 net-next 03/11] net: qed: invoke err notify on critical areas

2020-05-14 Thread Igor Russkikh

In a number of critical places, printing a debug trace is not enough:
the appropriate hw error condition should also be raised, and error
handling/recovery should start.

Introduce our new qed_hw_err_notify invocation in these places to
record and indicate critical error conditions in hardware.

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c |  4 +++-
 drivers/net/ethernet/qlogic/qed/qed_hw.c  |  7 ---
 drivers/net/ethernet/qlogic/qed/qed_int.c | 20 
 drivers/net/ethernet/qlogic/qed/qed_mcp.c |  2 ++
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 16 ++--
 5 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 7119a18af19e..6e857468e993 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -3085,7 +3085,9 @@ int qed_hw_init(struct qed_dev *cdev, struct 
qed_hw_init_params *p_params)
rc = qed_final_cleanup(p_hwfn, p_hwfn->p_main_ptt,
   p_hwfn->rel_pf_id, false);
if (rc) {
-   DP_NOTICE(p_hwfn, "Final cleanup failed\n");
+   qed_hw_err_notify(p_hwfn, p_hwfn->p_main_ptt,
+ QED_HW_ERR_RAMROD_FAIL,
+ "Final cleanup failed\n");
goto load_err;
}
}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.c 
b/drivers/net/ethernet/qlogic/qed/qed_hw.c
index 90b777019cf5..2d176e1b508c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.c
@@ -762,9 +762,10 @@ static int qed_dmae_execute_command(struct qed_hwfn 
*p_hwfn,
dst_type,
length_cur);
if (qed_status) {
-   DP_NOTICE(p_hwfn,
- "qed_dmae_execute_sub_operation Failed with 
error 0x%x. source_addr 0x%llx, destination addr 0x%llx, size_in_dwords 0x%x\n",
- qed_status, src_addr, dst_addr, length_cur);
+   qed_hw_err_notify(p_hwfn, p_ptt, QED_HW_ERR_DMAE_FAIL,
+ "qed_dmae_execute_sub_operation 
Failed with error 0x%x. source_addr 0x%llx, destination addr 0x%llx, 
size_in_dwords 0x%x\n",
+ qed_status, src_addr,
+ dst_addr, length_cur);
break;
}
}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c 
b/drivers/net/ethernet/qlogic/qed/qed_int.c
index 9f5113639eaf..1b1447b2f059 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.c
@@ -363,6 +363,14 @@ static int qed_pglueb_rbc_attn_cb(struct qed_hwfn *p_hwfn)
return qed_pglueb_rbc_attn_handler(p_hwfn, p_hwfn->p_dpc_ptt);
 }
 
+static int qed_fw_assertion(struct qed_hwfn *p_hwfn)
+{
+   qed_hw_err_notify(p_hwfn, p_hwfn->p_dpc_ptt, QED_HW_ERR_FW_ASSERT,
+ "FW assertion!\n");
+
+   return -EINVAL;
+}
+
 #define QED_DORQ_ATTENTION_REASON_MASK  (0xf)
 #define QED_DORQ_ATTENTION_OPAQUE_MASK  (0x)
 #define QED_DORQ_ATTENTION_OPAQUE_SHIFT (0x0)
@@ -606,7 +614,8 @@ static struct aeu_invert_reg aeu_descs[NUM_ATTN_REGS] = {
{
{   /* After Invert 4 */
{"General Attention 32", ATTENTION_SINGLE,
-NULL, MAX_BLOCK_ID},
+qed_fw_assertion,
+MAX_BLOCK_ID},
{"General Attention %d",
 (2 << ATTENTION_LENGTH_SHIFT) |
 (33 << ATTENTION_OFFSET_SHIFT), NULL, MAX_BLOCK_ID},
@@ -927,9 +936,12 @@ qed_int_deassertion_aeu_bit(struct qed_hwfn *p_hwfn,
qed_int_attn_print(p_hwfn, p_aeu->block_index,
   ATTN_TYPE_INTERRUPT, !b_fatal);
 
-
-   /* If the attention is benign, no need to prevent it */
-   if (!rc)
+   /* Reach assertion if attention is fatal */
+   if (b_fatal)
+   qed_hw_err_notify(p_hwfn, p_hwfn->p_dpc_ptt, QED_HW_ERR_HW_ATTN,
+ "`%s': Fatal attention\n",
+ p_bit_name);
+   else /* If the attention is benign, no need to prevent it */
goto out;
 
/* Prevent this Attention from being asserted in the future */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 280527cc0578..46653afc385c 100644
--- a/drivers/net/

[PATCH v2 net-next 02/11] net: qede: add hw err scheduled handler

2020-05-14 Thread Igor Russkikh

qede (the ethernet level driver) registers a callback handler.
This handler maintains eth dev state flags/bits to track error processing.

It implements an in-place processing part for non-sleeping context
(WARN_ON trigger), and a deferred (delayed work) part which triggers the
recovery process for recoverable errors.

In later patches this atomic handler will gain more functionality.

We introduce err_flags on the eth device structure; it is used to record
error handling properties.
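
The deferred half of this split can be pictured with the userspace sketch
below. It uses C11 atomics instead of the kernel bit operations; dev_state,
test_and_clear_flag and service_task are hypothetical names, and the counter
is only a stand-in for starting the real recovery process. The bit indices
mirror the ones defined in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SP_HW_ERR          4  /* work bit, as QEDE_SP_HW_ERR */
#define ERR_IS_RECOVERABLE 2  /* as QEDE_ERR_IS_RECOVERABLE */
#define ERR_IS_HANDLED     31 /* as QEDE_ERR_IS_HANDLED */

struct dev_state {
	atomic_ulong sp_flags;  /* pending work for the service task */
	atomic_ulong err_flags; /* properties of the current error */
	int recoveries;         /* counts recovery attempts in this sketch */
};

/* Atomically clear a bit and report whether it was set. */
static bool test_and_clear_flag(atomic_ulong *flags, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

/* Deferred part: runs in a context that is allowed to sleep. */
static void service_task(struct dev_state *d)
{
	if (!test_and_clear_flag(&d->sp_flags, SP_HW_ERR))
		return;

	if (atomic_load(&d->err_flags) & (1UL << ERR_IS_RECOVERABLE))
		d->recoveries++; /* stand-in for starting recovery */

	/* Allow the next error to be handled. */
	atomic_fetch_and(&d->err_flags, ~(1UL << ERR_IS_HANDLED));
}
```

Clearing the work bit before acting on it means an error raised during
handling simply schedules another pass.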

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qede/qede.h  | 13 ++-
 drivers/net/ethernet/qlogic/qede/qede_main.c | 95 +++-
 2 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index f6f0b51620ab..695d645d9ba9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -278,6 +278,14 @@ struct qede_dev {
struct qede_rdma_dev rdma_info;
 
struct bpf_prog *xdp_prog;
+
+   unsigned long err_flags;
+#define QEDE_ERR_IS_HANDLED 31
+#define QEDE_ERR_ATTN_CLR_EN 0
+#define QEDE_ERR_GET_DBG_INFO 1
+#define QEDE_ERR_IS_RECOVERABLE 2
+#define QEDE_ERR_WARN 3
+
struct qede_dump_info   dump_info;
 };
 
@@ -485,12 +493,15 @@ struct qede_fastpath {
 
 #define QEDE_SP_RECOVERY   0
 #define QEDE_SP_RX_MODE 1
+#define QEDE_SP_RSVD1   2
+#define QEDE_SP_RSVD2   3
+#define QEDE_SP_HW_ERR  4
+#define QEDE_SP_ARFS_CONFIG 5
 #define QEDE_SP_AER 7
 
 #ifdef CONFIG_RFS_ACCEL
 int qede_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
   u16 rxq_index, u32 flow_id);
-#define QEDE_SP_ARFS_CONFIG 4
 #define QEDE_SP_TASK_POLL_DELAY (5 * HZ)
 #endif
 
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 300405369c37..e67d5da23792 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -139,10 +139,12 @@ static void qede_shutdown(struct pci_dev *pdev);
 static void qede_link_update(void *dev, struct qed_link_output *link);
 static void qede_schedule_recovery_handler(void *dev);
 static void qede_recovery_handler(struct qede_dev *edev);
+static void qede_schedule_hw_err_handler(void *dev,
+enum qed_hw_err_type err_type);
 static void qede_get_eth_tlv_data(void *edev, void *data);
 static void qede_get_generic_tlv_data(void *edev,
  struct qed_generic_tlvs *data);
-
+static void qede_generic_hw_err_handler(struct qede_dev *edev);
 #ifdef CONFIG_QED_SRIOV
 static int qede_set_vf_vlan(struct net_device *ndev, int vf, u16 vlan, u8 qos,
__be16 vlan_proto)
@@ -230,6 +232,7 @@ static struct qed_eth_cb_ops qede_ll_ops = {
 #endif
.link_update = qede_link_update,
.schedule_recovery_handler = qede_schedule_recovery_handler,
+   .schedule_hw_err_handler = qede_schedule_hw_err_handler,
.get_generic_tlv_data = qede_get_generic_tlv_data,
.get_protocol_tlv_data = qede_get_eth_tlv_data,
},
@@ -1009,6 +1012,8 @@ static void qede_sp_task(struct work_struct *work)
qede_process_arfs_filters(edev, false);
}
 #endif
+   if (test_and_clear_bit(QEDE_SP_HW_ERR, &edev->sp_flags))
+   qede_generic_hw_err_handler(edev);
__qede_unlock(edev);
 
if (test_and_clear_bit(QEDE_SP_AER, &edev->sp_flags)) {
@@ -2509,6 +2514,94 @@ static void qede_recovery_handler(struct qede_dev *edev)
qede_recovery_failed(edev);
 }
 
+static void qede_atomic_hw_err_handler(struct qede_dev *edev)
+{
+   DP_NOTICE(edev,
+ "Generic non-sleepable HW error handling started - err_flags 
0x%lx\n",
+ edev->err_flags);
+
+   /* Get a call trace of the flow that led to the error */
+   WARN_ON(test_bit(QEDE_ERR_WARN, &edev->err_flags));
+
+   DP_NOTICE(edev, "Generic non-sleepable HW error handling is done\n");
+}
+
+static void qede_generic_hw_err_handler(struct qede_dev *edev)
+{
+   struct qed_dev *cdev = edev->cdev;
+
+   DP_NOTICE(edev,
+ "Generic sleepable HW error handling started - err_flags 
0x%lx\n",
+ edev->err_flags);
+
+   /* Trigger a recovery process.
+* This is placed in the sleep requiring section just to make
+* sure it is the last one, and that all the other operations
+* were completed.
+*/
+   if (test_bit(QEDE_ERR_IS_RECOVERABLE, &edev->err_flags))
+   edev->ops->common->recovery_process(cdev);
+
+   clear_bit(QEDE_ERR_IS_HANDLED, &edev->err_flags);
+

[PATCH v2 net-next 01/11] net: qed: adding hw_err states and handling

2020-05-14 Thread Igor Russkikh

Here we introduce qed device error tracking flags and error types.

qed_hw_err_notify is the entry point for reporting errors.
It notifies higher level drivers (qede/qedr/etc) so they can handle
and recover from the error.

The list of possible errors comes from the hardware interfaces, but
could be extended in the future.
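
The two decisions made at the entry point, bounding the formatted message
and deciding whether to forward the error at all, can be sketched in
standalone C as below. format_err and should_notify are hypothetical
helpers and the enum is a shortened stand-in for qed_hw_err_type; the rule
that a fan failure is not masked by an ongoing recovery is taken from the
patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

#define ERR_MAX_STR_SIZE 256

enum hw_err_type { HW_ERR_FAN_FAIL, HW_ERR_HW_ATTN, HW_ERR_LAST };

/* Format a message into buf and return the number of bytes stored,
 * excluding the terminating NUL, clamped to the buffer capacity.
 */
static int format_err(char *buf, size_t size, const char *fmt, ...)
{
	va_list vl;
	int len;

	assert(size > 0);

	va_start(vl, fmt);
	len = vsnprintf(buf, size, fmt, vl);
	va_end(vl);

	if (len < 0) {
		buf[0] = '\0'; /* encoding error: report an empty message */
		return 0;
	}
	if ((size_t)len > size - 1)
		len = (int)(size - 1); /* output was truncated */
	return len;
}

/* A fan failure is reported even while recovery is in progress. */
static bool should_notify(bool recov_in_prog, enum hw_err_type type)
{
	return !recov_in_prog || type == HW_ERR_FAN_FAIL;
}
```

The clamp matters because vsnprintf returns the length the full message
would have had, not the number of bytes actually written.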

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed.h  |  2 ++
 drivers/net/ethernet/qlogic/qed/qed_hw.c   | 32 ++
 drivers/net/ethernet/qlogic/qed/qed_hw.h   | 15 ++
 drivers/net/ethernet/qlogic/qed/qed_main.c | 29 
 include/linux/qed/qed_if.h | 12 
 5 files changed, 90 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index fa41bf08a589..12c40ce3d876 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -1020,6 +1020,8 @@ u32 qed_unzip_data(struct qed_hwfn *p_hwfn,
   u32 input_len, u8 *input_buf,
   u32 max_size, u8 *unzip_buf);
 void qed_schedule_recovery_handler(struct qed_hwfn *p_hwfn);
+void qed_hw_error_occurred(struct qed_hwfn *p_hwfn,
+  enum qed_hw_err_type err_type);
 void qed_get_protocol_stats(struct qed_dev *cdev,
enum qed_mcp_protocol_type type,
union qed_mcp_protocol_stats *stats);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.c 
b/drivers/net/ethernet/qlogic/qed/qed_hw.c
index 4ab8cfaf63d1..90b777019cf5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.c
@@ -837,6 +837,38 @@ int qed_dmae_host2host(struct qed_hwfn *p_hwfn,
return rc;
 }
 
+void qed_hw_err_notify(struct qed_hwfn *p_hwfn,
+  struct qed_ptt *p_ptt,
+  enum qed_hw_err_type err_type, char *fmt, ...)
+{
+   char buf[QED_HW_ERR_MAX_STR_SIZE];
+   va_list vl;
+   int len;
+
+   if (fmt) {
+   va_start(vl, fmt);
+   len = vsnprintf(buf, QED_HW_ERR_MAX_STR_SIZE, fmt, vl);
+   va_end(vl);
+
+   if (len > QED_HW_ERR_MAX_STR_SIZE - 1)
+   len = QED_HW_ERR_MAX_STR_SIZE - 1;
+
+   DP_NOTICE(p_hwfn, "%s", buf);
+   }
+
+   /* Fan failure cannot be masked by handling of another HW error */
+   if (p_hwfn->cdev->recov_in_prog &&
+   err_type != QED_HW_ERR_FAN_FAIL) {
+   DP_VERBOSE(p_hwfn,
+  NETIF_MSG_DRV,
+  "Recovery is in progress. Avoid notifying about HW 
error %d.\n",
+  err_type);
+   return;
+   }
+
+   qed_hw_error_occurred(p_hwfn, err_type);
+}
+
 int qed_dmae_sanity(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt, const char *phase)
 {
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.h 
b/drivers/net/ethernet/qlogic/qed/qed_hw.h
index 505e94db939d..f5b109b04b66 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.h
@@ -315,4 +315,19 @@ int qed_init_fw_data(struct qed_dev *cdev,
 int qed_dmae_sanity(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt, const char *phase);
 
+#define QED_HW_ERR_MAX_STR_SIZE 256
+
+/**
+ * @brief qed_hw_err_notify - Notify upper layer driver and management FW
+ * about a HW error.
+ *
+ * @param p_hwfn
+ * @param p_ptt
+ * @param err_type
+ * @param fmt - debug data buffer to send to the MFW
+ * @param ... - buffer format args
+ */
+void qed_hw_err_notify(struct qed_hwfn *p_hwfn,
+  struct qed_ptt *p_ptt,
+  enum qed_hw_err_type err_type, char *fmt, ...);
 #endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 38a1d26ca9db..d7c9d94e4c59 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -2468,6 +2468,35 @@ void qed_schedule_recovery_handler(struct qed_hwfn 
*p_hwfn)
ops->schedule_recovery_handler(cookie);
 }
 
+char *qed_hw_err_type_descr[] = {
+   [QED_HW_ERR_FAN_FAIL]   = "Fan Failure",
+   [QED_HW_ERR_MFW_RESP_FAIL]  = "MFW Response Failure",
+   [QED_HW_ERR_HW_ATTN]= "HW Attention",
+   [QED_HW_ERR_DMAE_FAIL]  = "DMAE Failure",
+   [QED_HW_ERR_RAMROD_FAIL]= "Ramrod Failure",
+   [QED_HW_ERR_FW_ASSERT]  = "FW Assertion",
+   [QED_HW_ERR_LAST]   = "Unknown",
+};
+
+void qed_hw_error_occurred(struct qed_hwfn *p_hwfn,
+  enum qed_hw_err_type err_type)
+{
+   struct qed_common_cb_ops *ops = p_hwfn->cdev->protocol_ops.common;
+   void *cookie = p_hwfn->cdev->ops_cookie;
+   char *err_str;
+
+   if (err_type > QED_HW_ERR_LAST)
+   err_type =

[PATCH v2 net-next 09/11] net: qed: introduce critical fan failure handler

2020-05-14 Thread Igor Russkikh

Fan failure is signalled by firmware; the driver reacts to this error
through the newly introduced notification path. It will collect a dump
and shut down the device to prevent physical damage.
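
Since a device may expose more than one hardware function, only one of them
should raise the notification. A minimal sketch of that filter follows; the
device and hwfn structures and handle_fan_failure are hypothetical
simplifications, not the qed data model.

```c
#include <assert.h>
#include <stdbool.h>

struct hwfn;

/* Hypothetical device with one designated reporting function. */
struct device {
	struct hwfn *leading;
	int notifications;
};

struct hwfn {
	struct device *dev;
};

/* Report a fan failure once per device, from the leading function only. */
static bool handle_fan_failure(struct hwfn *fn)
{
	if (fn != fn->dev->leading)
		return false;

	fn->dev->notifications++; /* stand-in for the real notification */
	return true;
}
```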

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 21d53b00c2e6..ab042b835797 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12761,7 +12761,7 @@ enum MFW_DRV_MSG_TYPE {
MFW_DRV_MSG_GET_FCOE_STATS,
MFW_DRV_MSG_GET_ISCSI_STATS,
MFW_DRV_MSG_GET_RDMA_STATS,
-   MFW_DRV_MSG_BW_UPDATE10,
+   MFW_DRV_MSG_FAILURE_DETECTED,
MFW_DRV_MSG_TRANSCEIVER_STATE_CHANGE,
MFW_DRV_MSG_BW_UPDATE11,
MFW_DRV_MSG_RESERVED,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 62be13d49dd8..0058e804efc3 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1706,6 +1706,17 @@ static void qed_mcp_update_stag(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
&resp, &param);
 }
 
+static void qed_mcp_handle_fan_failure(struct qed_hwfn *p_hwfn,
+  struct qed_ptt *p_ptt)
+{
+   /* A single notification should be sent to upper driver in CMT mode */
+   if (p_hwfn != QED_LEADING_HWFN(p_hwfn->cdev))
+   return;
+
+   qed_hw_err_notify(p_hwfn, p_ptt, QED_HW_ERR_FAN_FAIL,
+ "Fan failure was detected on the network interface 
card and it's going to be shut down.\n");
+}
+
 void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 {
struct public_func shmem_info;
@@ -1852,6 +1863,9 @@ int qed_mcp_handle_events(struct qed_hwfn *p_hwfn,
case MFW_DRV_MSG_S_TAG_UPDATE:
qed_mcp_update_stag(p_hwfn, p_ptt);
break;
+   case MFW_DRV_MSG_FAILURE_DETECTED:
+   qed_mcp_handle_fan_failure(p_hwfn, p_ptt);
+   break;
case MFW_DRV_MSG_GET_TLV_REQ:
qed_mfw_tlv_req(p_hwfn);
break;
-- 
2.17.1

[PATCH v2 net-next 11/11] net: qed: fix bad formatting

2020-05-14 Thread Igor Russkikh

Fix bad code formatting in some adjacent code.

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 include/linux/qed/qed_if.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index 978e91e9ab65..48325d7790f8 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -821,12 +821,11 @@ enum qed_nvm_flash_cmd {
 
 struct qed_common_cb_ops {
void (*arfs_filter_op)(void *dev, void *fltr, u8 fw_rc);
-   void(*link_update)(void *dev,
-  struct qed_link_output   *link);
+   void (*link_update)(void *dev, struct qed_link_output *link);
void (*schedule_recovery_handler)(void *dev);
void (*schedule_hw_err_handler)(void *dev,
enum qed_hw_err_type err_type);
-   void(*dcbx_aen)(void *dev, struct qed_dcbx_get *get, u32 mib_type);
+   void (*dcbx_aen)(void *dev, struct qed_dcbx_get *get, u32 mib_type);
void (*get_generic_tlv_data)(void *dev, struct qed_generic_tlvs *data);
void (*get_protocol_tlv_data)(void *dev, void *data);
 };
-- 
2.17.1

[PATCH v2 net-next 10/11] net: qed: introduce critical hardware error handler

2020-05-14 Thread Igor Russkikh

The MCP may signal the driver about a generic critical failure.
The driver then has to collect mdump information (get_retain),
push it to the logs, and trigger a generic notification of a
"hardware attention" event.

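For illustration, the sketch below shows how a sub-command opcode can be
taken from the low bits of the mailbox parameter, and how retained dump data
might be gated on its valid field before logging. The helper names are
hypothetical, the opcode mask width follows the "param[3:0] - opcode"
comment in the patch, and gating on valid is an assumption about how the
retained data is meant to be used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MDUMP_OPCODE_MASK 0xfu  /* low four bits, per "param[3:0] - opcode" */
#define MDUMP_GET_RETAIN  0x07u /* as DRV_MSG_CODE_MDUMP_GET_RETAIN */

/* Mirrors struct mdump_retain_data_stc from the patch. */
struct mdump_retain_data {
	uint32_t valid;
	uint32_t epoch;
	uint32_t pf;
	uint32_t status;
};

/* Extract the sub-command opcode from a mailbox parameter word. */
static uint32_t mdump_opcode(uint32_t param)
{
	return param & MDUMP_OPCODE_MASK;
}

/* Assumption: retained data is only logged when marked valid. */
static bool mdump_retain_should_log(const struct mdump_retain_data *d)
{
	return d->valid != 0;
}
```
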
Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  28 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 113 ++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  13 +++
 3 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index ab042b835797..f00460d00cab 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12400,6 +12400,13 @@ struct load_rsp_stc {
 #define LOAD_RSP_FLAGS0_DRV_EXISTS  (0x1 << 0)
 };
 
+struct mdump_retain_data_stc {
+   u32 valid;
+   u32 epoch;
+   u32 pf;
+   u32 status;
+};
+
 union drv_union_data {
u32 ver_str[MCP_DRV_VER_STR_SIZE_DWORD];
struct mcp_mac wol_mac;
@@ -12488,6 +12495,8 @@ struct public_drv_mb {
 #define DRV_MSG_CODE_BIST_TEST 0x001e
 #define DRV_MSG_CODE_SET_LED_MODE  0x0020
 #define DRV_MSG_CODE_RESOURCE_CMD  0x0023
+/* Send crash dump commands with param[3:0] - opcode */
+#define DRV_MSG_CODE_MDUMP_CMD 0x0025
 #define DRV_MSG_CODE_GET_TLV_DONE  0x002f
 #define DRV_MSG_CODE_GET_ENGINE_CONFIG 0x0037
 #define DRV_MSG_CODE_GET_PPFID_BITMAP  0x4300
@@ -12519,6 +12528,21 @@ struct public_drv_mb {
 
 #define RESOURCE_DUMP  0
 
+/* DRV_MSG_CODE_MDUMP_CMD parameters */
+#define MDUMP_DRV_PARAM_OPCODE_MASK 0x000f
+#define DRV_MSG_CODE_MDUMP_ACK  0x01
+#define DRV_MSG_CODE_MDUMP_SET_VALUES   0x02
+#define DRV_MSG_CODE_MDUMP_TRIGGER  0x03
+#define DRV_MSG_CODE_MDUMP_GET_CONFIG   0x04
+#define DRV_MSG_CODE_MDUMP_SET_ENABLE   0x05
+#define DRV_MSG_CODE_MDUMP_CLEAR_LOGS   0x06
+#define DRV_MSG_CODE_MDUMP_GET_RETAIN   0x07
+#define DRV_MSG_CODE_MDUMP_CLR_RETAIN   0x08
+
+#define DRV_MSG_CODE_HW_DUMP_TRIGGER 0x0a
+#define DRV_MSG_CODE_MDUMP_GEN_MDUMP2   0x0b
+#define DRV_MSG_CODE_MDUMP_FREE_MDUMP2  0x0c
+
 #define DRV_MSG_CODE_GET_PF_RDMA_PROTOCOL  0x002b
 #define DRV_MSG_CODE_OS_WOL 0x002e
 
@@ -12697,6 +12721,8 @@ struct public_drv_mb {
 #define FW_MSG_CODE_DEBUG_NOT_ENABLED  0xb00a
 #define FW_MSG_CODE_DEBUG_DATA_SEND_OK 0xb00b
 
+#define FW_MSG_CODE_MDUMP_INVALID_CMD  0x0003
+
u32 fw_mb_param;
 #define FW_MB_PARAM_RESOURCE_ALLOC_VERSION_MAJOR_MASK  0x
 #define FW_MB_PARAM_RESOURCE_ALLOC_VERSION_MAJOR_SHIFT 16
@@ -12763,7 +12789,7 @@ enum MFW_DRV_MSG_TYPE {
MFW_DRV_MSG_GET_RDMA_STATS,
MFW_DRV_MSG_FAILURE_DETECTED,
MFW_DRV_MSG_TRANSCEIVER_STATE_CHANGE,
-   MFW_DRV_MSG_BW_UPDATE11,
+   MFW_DRV_MSG_CRITICAL_ERROR_OCCURRED,
MFW_DRV_MSG_RESERVED,
MFW_DRV_MSG_GET_TLV_REQ,
MFW_DRV_MSG_OEM_CFG_UPDATE,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 0058e804efc3..8a0bbc7d4b24 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1717,6 +1717,116 @@ static void qed_mcp_handle_fan_failure(struct qed_hwfn *p_hwfn,
  "Fan failure was detected on the network interface card and it's going to be shut down.\n");
 }
 
+struct qed_mdump_cmd_params {
+   u32 cmd;
+   void *p_data_src;
+   u8 data_src_size;
+   void *p_data_dst;
+   u8 data_dst_size;
+   u32 mcp_resp;
+};
+
+static int
+qed_mcp_mdump_cmd(struct qed_hwfn *p_hwfn,
+ struct qed_ptt *p_ptt,
+ struct qed_mdump_cmd_params *p_mdump_cmd_params)
+{
+   struct qed_mcp_mb_params mb_params;
+   int rc;
+
+   memset(&mb_params, 0, sizeof(mb_params));
+   mb_params.cmd = DRV_MSG_CODE_MDUMP_CMD;
+   mb_params.param = p_mdump_cmd_params->cmd;
+   mb_params.p_data_src = p_mdump_cmd_params->p_data_src;
+   mb_params.data_src_size = p_mdump_cmd_params->data_src_size;
+   mb_params.p_data_dst = p_mdump_cmd_params->p_data_dst;
+   mb_params.data_dst_size = p_mdump_cmd_params->data_dst_size;
+   rc = qed_mcp_cmd_and_union(p_hwfn, p_ptt, &mb_params);
+   if (rc)
+   return rc;
+
+   p_mdump_cmd_params->mcp_resp = mb_params.mcp_resp;
+
+   if (p_mdump_cmd_params->mcp_resp == FW_MSG_CODE_MDUMP_INVALID_CMD) {
+   DP_INFO(p_hwfn,
+   "The mdump sub command is unsupported by the MFW [mdump_cmd 0x%x]\n",
+   p_mdump_cmd_params->cmd);
+   rc = -EOPNOTSUPP;
+   } else if (p_mdump

[PATCH v2 net-next 05/11] net: qed: cleanup debug related declarations

2020-05-14 Thread Igor Russkikh

This is probably legacy code that ended up with duplicate declarations of
some fields. Clean this up by removing the copy and fixing the references.

Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/qlogic/qed/qed.h   | 11 +++--
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 26 ++---
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index 12c40ce3d876..07f6ef930b52 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -740,12 +740,6 @@ struct qed_dbg_feature {
u32 dumped_dwords;
 };
 
-struct qed_dbg_params {
-   struct qed_dbg_feature features[DBG_FEATURE_NUM];
-   u8 engine_for_debug;
-   bool print_data;
-};
-
 struct qed_dev {
u32 dp_module;
u8  dp_level;
@@ -872,17 +866,18 @@ struct qed_dev {
} protocol_ops;
void*ops_cookie;
 
-   struct qed_dbg_params   dbg_params;
-
 #ifdef CONFIG_QED_LL2
struct qed_cb_ll2_info  *ll2;
u8  ll2_mac_address[ETH_ALEN];
 #endif
struct qed_dbg_feature dbg_features[DBG_FEATURE_NUM];
+   u8 engine_for_debug;
bool disable_ilt_dump;
DECLARE_HASHTABLE(connections, 10);
const struct firmware   *firmware;
 
+   bool print_dbg_data;
+
u32 rdma_max_sge;
u32 rdma_max_inline;
u32 rdma_max_srq_sge;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index f4eebaabb6d0..57a0dab88431 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7453,7 +7453,7 @@ static enum dbg_status format_feature(struct qed_hwfn *p_hwfn,
  enum qed_dbg_features feature_idx)
 {
struct qed_dbg_feature *feature =
-   &p_hwfn->cdev->dbg_params.features[feature_idx];
+   &p_hwfn->cdev->dbg_features[feature_idx];
u32 text_size_bytes, null_char_pos, i;
enum dbg_status rc;
char *text_buf;
@@ -7502,7 +7502,7 @@ static enum dbg_status format_feature(struct qed_hwfn *p_hwfn,
text_buf[i] = '\n';
 
/* Dump printable feature to log */
-   if (p_hwfn->cdev->dbg_params.print_data)
+   if (p_hwfn->cdev->print_dbg_data)
qed_dbg_print_feature(text_buf, text_size_bytes);
 
/* Free the old dump_buf and point the dump_buf to the newly allocagted
@@ -7523,7 +7523,7 @@ static enum dbg_status qed_dbg_dump(struct qed_hwfn *p_hwfn,
enum qed_dbg_features feature_idx)
 {
struct qed_dbg_feature *feature =
-   &p_hwfn->cdev->dbg_params.features[feature_idx];
+   &p_hwfn->cdev->dbg_features[feature_idx];
u32 buf_size_dwords;
enum dbg_status rc;
 
@@ -7648,7 +7648,7 @@ static int qed_dbg_nvm_image(struct qed_dev *cdev, void *buffer,
 enum qed_nvm_images image_id)
 {
struct qed_hwfn *p_hwfn =
-   &cdev->hwfns[cdev->dbg_params.engine_for_debug];
+   &cdev->hwfns[cdev->engine_for_debug];
u32 len_rounded, i;
__be32 val;
int rc;
@@ -7780,7 +7780,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 {
u8 cur_engine, omit_engine = 0, org_engine;
struct qed_hwfn *p_hwfn =
-   &cdev->hwfns[cdev->dbg_params.engine_for_debug];
+   &cdev->hwfns[cdev->engine_for_debug];
struct dbg_tools_data *dev_data = &p_hwfn->dbg_info;
int grc_params[MAX_DBG_GRC_PARAMS], i;
u32 offset = 0, feature_size;
@@ -8000,7 +8000,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 int qed_dbg_all_data_size(struct qed_dev *cdev)
 {
struct qed_hwfn *p_hwfn =
-   &cdev->hwfns[cdev->dbg_params.engine_for_debug];
+   &cdev->hwfns[cdev->engine_for_debug];
u32 regs_len = 0, image_len = 0, ilt_len = 0, total_ilt_len = 0;
u8 cur_engine, org_engine;
 
@@ -8059,9 +8059,9 @@ int qed_dbg_feature(struct qed_dev *cdev, void *buffer,
enum qed_dbg_features feature, u32 *num_dumped_bytes)
 {
struct qed_hwfn *p_hwfn =
-   &cdev->hwfns[cdev->dbg_params.engine_for_debug];
+   &cdev->hwfns[cdev->engine_for_debug];
struct qed_dbg_feature *qed_feature =
-   &cdev->dbg_params.features[feature];
+   &cdev->dbg_features[feature];
enum dbg_status dbg_rc;
struct qed_ptt *p_ptt;
int rc = 0;
@@ -8084,7 +8084,7 @@ int qed_dbg_feature(struct qed_dev *cdev, void *buffer,
DP_VERBOSE(cdev, QED_MSG_DEBUG,
   "copying debugfs feature to external buffer\n");
memcpy(buffer, qed_feature->dump_buf, qed_feature->buf_size);
-

RE: [PATCH 11/18] maccess: remove strncpy_from_unsafe

2020-05-14 Thread David Laight

From: Daniel Borkmann
> Sent: 14 May 2020 00:59
> On 5/14/20 1:28 AM, Al Viro wrote:
> > On Thu, May 14, 2020 at 12:36:28AM +0200, Daniel Borkmann wrote:
> >
> >>> So on say s390 TASK_SIZE_USUALLy is (-PAGE_SIZE), which means we'd always
> >>> try the user copy first, which seems odd.
> >>>
> >>> I'd really like to hear from the bpf folks what the expected use case
> >>> is here, and if the typical argument is kernel or user memory.
> >>
> >> It's used for both. Given this is enabled on pretty much all program
> >> types, my assumption would be that usage is still more often on kernel
> >> memory than user one.
> >
> > Then it needs an argument telling it which one to use.  Look at sparc64.
> > Or s390.  Or parisc.  Et sodding cetera.
> >
> > The underlying model is that the kernel lives in a separate address space.
> > Yes, on x86 it's actually sharing the page tables with userland, but that's
> > not universal.  The same address can be both a valid userland one _and_
> > a valid kernel one.  You need to tell which one do you want.
> 
> Yes, see also 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and
> probe_read_{user, kernel}_str helpers"), and my other reply wrt
> bpf_trace_printk() on how to address this. All I'm trying to say is that
> both bpf_probe_read() and bpf_trace_printk() do exist in this form since
> early [e]bpf days for ~5yrs now and while broken on non-x86 there are a
> lot of users on x86 for this in the wild, so they need to have a chance
> to migrate over to the new facilities before they are fully removed.

If it's not a stupid question, why is a BPF program allowed to get
into a situation where it might have an invalid kernel address?

It all stinks of a hole that allows all of kernel memory to be read
and copied to userspace.

Now you might want to do something special so that BPF programs just
abort on OOPS instead of possibly panicking the kernel.
But that is different from a copy that expects to be passed garbage.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[PATCH net] pppoe: only process PADT targeted at local interfaces

2020-05-14 Thread Guillaume Nault

We don't want to disconnect a session because of a stray PADT arriving
while the interface is in promiscuous mode.
Furthermore, multicast and broadcast packets make no sense here, so
only PACKET_HOST is accepted.

Reported-by: David Balažic 
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault 
---
 drivers/net/ppp/pppoe.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index d760a36db28c..beedaad08255 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -490,6 +490,9 @@ static int pppoe_disc_rcv(struct sk_buff *skb, struct 
net_device *dev,
if (!skb)
goto out;
 
+   if (skb->pkt_type != PACKET_HOST)
+   goto abort;
+
if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
goto abort;
 
-- 
2.21.1
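
The filtering rule from the pppoe patch above can be modelled outside the kernel. This is a minimal sketch, not the driver code: the enum values mirror the PACKET_* constants from <linux/if_packet.h>, while the enum name and the helper padt_may_terminate() are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Values mirror the PACKET_* constants from <linux/if_packet.h>. */
enum pkt_type {
	PKT_HOST      = 0,	/* addressed to this interface */
	PKT_BROADCAST = 1,	/* link-layer broadcast */
	PKT_MULTICAST = 2,	/* link-layer multicast */
	PKT_OTHERHOST = 3,	/* for another host, seen in promiscuous mode */
};

/*
 * Hypothetical helper: decide whether a PADT carried in a frame of the
 * given type is allowed to tear down a session. Only frames addressed
 * to the local interface qualify, which matches the intent of the
 * "skb->pkt_type != PACKET_HOST" test in the patch.
 */
static bool padt_may_terminate(enum pkt_type type)
{
	return type == PKT_HOST;
}
```

A stray PADT captured in promiscuous mode is classified as "other host", so the helper rejects it and the session stays up.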

Re: remove kernel_setsockopt and kernel_getsockopt

2020-05-14 Thread Christoph Hellwig

On Thu, May 14, 2020 at 08:29:30AM +, David Laight wrote:
> You need to export functions that do most of the socket options
> for all protocols.

Only for those where we have users, and all those are covered.

Re: [PATCH 11/18] maccess: remove strncpy_from_unsafe

2020-05-14 Thread Daniel Borkmann


On 5/14/20 12:01 PM, David Laight wrote:
[...]

> If it's not a stupid question, why is a BPF program allowed to get
> into a situation where it might have an invalid kernel address?
>
> It all stinks of a hole that allows all of kernel memory to be read
> and copied to userspace.
>
> Now you might want to do something special so that BPF programs just
> abort on OOPS instead of possibly panicking the kernel.
> But that is different from a copy that expects to be passed garbage.


I suggest you read up on probe_kernel_read() and its uses in tracing in
general, looks like you haven't done that.

RE: remove kernel_setsockopt and kernel_getsockopt

2020-05-14 Thread David Laight

From: Christoph Hellwig
> Only for those where we have users, and all those are covered.

What do we tell all our users when our kernel SCTP code
no longer works?

It uses SO_REUSEADDR, SCTP_EVENTS, SCTP_NODELAY,
SCTP_STATUS, SCTP_INITMSG, IPV6_ONLY, SCTP_SOCKOPT_BINDX_ADD
and SO_LINGER.
We should probably use the CONNECTX function as well.

I doubt we are the only company with out-of-tree drivers
that use the kernel_socket interface.

David


Re: [PATCH 11/18] maccess: remove strncpy_from_unsafe

2020-05-14 Thread Daniel Borkmann


On 5/14/20 11:44 AM, Masami Hiramatsu wrote:

> On Wed, 13 May 2020 19:43:24 -0700
> Linus Torvalds  wrote:
>
> > On Wed, May 13, 2020 at 6:00 PM Masami Hiramatsu  wrote:
> > > > But we should likely at least disallow it entirely on platforms where
> > > > we really can't - or pick one hardcoded choice. On sparc, you really
> > > > _have_ to specify one or the other.
> > >
> > > OK. BTW, is there any way to detect the kernel/user space overlap on
> > > memory layout statically? If there is, I can do it. (I don't like
> > > "if (CONFIG_X86)" thing)
> > > Or, maybe we need CONFIG_ARCH_OVERLAP_ADDRESS_SPACE?
> >
> > I think it would be better to have a CONFIG variable that
> > architectures can just 'select' to show that they are ok with separate
> > kernel and user addresses.
> >
> > Because I don't think we have any way to say that right now as-is. You
> > can probably come up with hacky ways to approximate it, ie something
> > like
> >
> >  if (TASK_SIZE_MAX > PAGE_OFFSET)
> >   they overlap ..
> >
> > which would almost work, but..
>
> It seems TASK_SIZE_MAX is defined only on x86 and s390, what about
> comparing STACK_TOP_MAX with PAGE_OFFSET ?
> Anyway, I agree that the best way is introducing a CONFIG.


Agree, CONFIG knob that archs can select feels cleanest. Fwiw, I've cooked
up fixes for bpf side locally here and finishing up testing, will push out
later today.

Thanks,
Daniel
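
The "hacky" overlap approximation quoted in this thread can be written down as a small user-space model. This is only a sketch under stated assumptions: struct addr_layout, spaces_overlap() and classify() are hypothetical names, the layout values are supplied by the caller rather than taken from any architecture header, and the kernel itself would more likely use the CONFIG knob that the thread converges on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of an address-space layout. */
struct addr_layout {
	uint64_t task_size_max;	/* one past the highest user address */
	uint64_t page_offset;	/* lowest kernel address */
};

/* The approximation from the thread: the user range reaches into the
 * kernel range, so an address alone does not identify its space. */
static bool spaces_overlap(const struct addr_layout *l)
{
	return l->task_size_max > l->page_offset;
}

enum probe_target { PROBE_USER, PROBE_KERNEL, PROBE_AMBIGUOUS };

/* Guess which space an address belongs to, or report that the caller
 * has to state explicitly which one it means. */
static enum probe_target classify(const struct addr_layout *l, uint64_t addr)
{
	if (spaces_overlap(l))
		return PROBE_AMBIGUOUS;
	return addr < l->task_size_max ? PROBE_USER : PROBE_KERNEL;
}
```

On a layout where both ranges cover the same numeric addresses, every lookup ends up ambiguous, which is the argument for separate user and kernel helpers.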

Re: [PATCH stable-5.4.y] net: dsa: Do not make user port errors fatal

2020-05-14 Thread Greg KH

On Wed, May 13, 2020 at 12:55:46PM -0700, David Miller wrote:
> From: Florian Fainelli 
> Date: Wed, 13 May 2020 10:41:45 -0700
> 
> > commit 86f8b1c01a0a537a73d2996615133be63cdf75db upstream
> > 
> > Prior to 1d27732f411d ("net: dsa: setup and teardown ports"), we would
> > not treat failures to set-up an user port as fatal, but after this
> > commit we would, which is a regression for some systems where interfaces
> > may be declared in the Device Tree, but the underlying hardware may not
> > be present (pluggable daughter cards for instance).
> > 
> > Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
> > Signed-off-by: Florian Fainelli 
> > Reviewed-by: Andrew Lunn 
> > Signed-off-by: David S. Miller 
> 
> Greg, please queue this up.

Now queued up, thanks.

greg k-h

Re: [PATCH 29/33] rxrpc_sock_set_min_security_level

2020-05-14 Thread Christoph Hellwig

On Wed, May 13, 2020 at 02:13:07PM +0100, David Howells wrote:
> Christoph Hellwig  wrote:
> 
> > +int rxrpc_sock_set_min_security_level(struct sock *sk, unsigned int val);
> > +
> 
> Looks good - but you do need to add this to Documentation/networking/rxrpc.txt
> also, thanks.

That file doesn't exist, instead we now have a
Documentation/networking/rxrpc.rst in weird markup.  Where do you want this
to be added, and with what text?  Remember I don't really know what this
thing does, I just provide a shortcut.

[PATCH 1/3] net: stmmac: gmac3: add auxiliary snapshot support

2020-05-14 Thread Olivier Dautricourt

From: Artem Panfilov 

This patch adds support for time stamping external inputs for GMAC3.

The documentation defines 4 auxiliary snapshots ATSEN0 to ATSEN3 which
can be toggled by writing the Timestamp Control Register.

When the gmac detects a pps rising edge on one of its auxiliary inputs,
an interrupt of type GMAC_INT_STATUS_TSTAMP is raised.
We use this interrupt to generate a ptp clock event of type PTP_CLOCK_EXTTS
with the following content:

  - Time at which the event occurred, in ns.
  - All the extts for which the event was generated ( - )

Note from the documentation:
"When more than one bit is set at the same time, it means
that corresponding auxiliary triggers were sampled at the same clock"

When the GMAC writes its auxiliary snapshots to its fifo
and that fifo is full, it discards any new auxiliary snapshot until
we read the fifo. By reading on each interrupt, we are unlikely to
lose the 1pps external timestamps.

Events for one auxiliary input can be requested through the
PTP_EXTTS_REQUEST ioctl and read as already implemented in the uapi.

This patch introduces 2 functions:

stmmac_set_hw_tstamping and stmmac_get_hw_tstamping

Each time we initialize the timestamping, we read the current
value of PTP_TCR and patch in the new configuration without touching
the ATSENX flags, which are set independently by the user
through the ioctl.
This preserves the activated external events across each
initialization of the timestamping and does not force the user to redo
the ioctl.
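
The snapshot handling described above comes down to two small computations: combining the seconds and nanoseconds registers into one timestamp, and extracting the bitmap of auxiliary inputs from the status register. The following is a minimal sketch, not the driver code: the function names are hypothetical and the ATSSTN field position is a placeholder, not a value taken from the GMAC3 databook.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Placeholder field layout, assumed for illustration only. */
#define ATSSTN_SHIFT 16
#define ATSSTN_MASK  (0xfu << ATSSTN_SHIFT)

/* Combine the auxiliary snapshot seconds and nanoseconds registers
 * into a single 64-bit nanosecond timestamp. */
static uint64_t aux_snapshot_ns(uint32_t sec_reg, uint32_t nsec_reg)
{
	return (uint64_t)sec_reg * NSEC_PER_SEC + nsec_reg;
}

/* Extract the bitmap of auxiliary inputs that produced the snapshot.
 * More than one bit set means the triggers were sampled together. */
static uint32_t aux_snapshot_inputs(uint32_t status_reg)
{
	return (status_reg & ATSSTN_MASK) >> ATSSTN_SHIFT;
}
```

The cast to a 64-bit type before the multiplication matters: a 32-bit seconds value times one billion overflows 32 bits after about four seconds.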

Signed-off-by: Olivier Dautricourt 
Signed-off-by: Artem Panfilov 
---
 .../net/ethernet/stmicro/stmmac/dwmac1000.h   |  3 +-
 .../ethernet/stmicro/stmmac/dwmac1000_core.c  | 24 ++
 drivers/net/ethernet/stmicro/stmmac/hwif.h|  9 ++--
 .../ethernet/stmicro/stmmac/stmmac_hwtstamp.c | 10 -
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  7 +--
 .../net/ethernet/stmicro/stmmac/stmmac_ptp.c  | 44 +++
 .../net/ethernet/stmicro/stmmac/stmmac_ptp.h  | 20 +
 7 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index b70d44ac0990..5cff6c100258 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -41,8 +41,7 @@
 #define GMAC_INT_DISABLE_PCS   (GMAC_INT_DISABLE_RGMII | \
  GMAC_INT_DISABLE_PCSLINK | \
  GMAC_INT_DISABLE_PCSAN)
-#define GMAC_INT_DEFAULT_MASK  (GMAC_INT_DISABLE_TIMESTAMP | \
- GMAC_INT_DISABLE_PCS)
+#define GMAC_INT_DEFAULT_MASK  GMAC_INT_DISABLE_PCS
 
 /* PMT Control and Status */
 #define GMAC_PMT   0x002c
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index efc6ec1b8027..3895fe9396e5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -20,6 +20,7 @@
 #include "stmmac.h"
 #include "stmmac_pcs.h"
 #include "dwmac1000.h"
+#include "stmmac_ptp.h"
 
 static void dwmac1000_core_init(struct mac_device_info *hw,
struct net_device *dev)
@@ -300,9 +301,29 @@ static void dwmac1000_rgsmii(void __iomem *ioaddr, struct stmmac_extra_stats *x)
}
 }
 
+static void dwmac1000_ptp_isr(struct stmmac_priv *priv)
+{
+   struct ptp_clock_event event;
+   u32 reg_value;
+   u64 ns;
+
+   reg_value = readl(priv->ioaddr + PTP_GMAC3_TSR);
+
+   ns = readl(priv->ioaddr + PTP_GMAC3_AUXTSLO);
+   ns += readl(priv->ioaddr + PTP_GMAC3_AUXTSHI) * 1000000000ULL;
+
+   event.timestamp = ns;
+   event.type = PTP_CLOCK_EXTTS;
+   event.index = (reg_value & PTP_GMAC3_ATSSTN_MASK) >>
+   PTP_GMAC3_ATSSTN_SHIFT;
+   ptp_clock_event(priv->ptp_clock, &event);
+}
+
 static int dwmac1000_irq_status(struct mac_device_info *hw,
struct stmmac_extra_stats *x)
 {
+   struct stmmac_priv *priv =
+   container_of(x, struct stmmac_priv, xstats);
void __iomem *ioaddr = hw->pcsr;
u32 intr_status = readl(ioaddr + GMAC_INT_STATUS);
u32 intr_mask = readl(ioaddr + GMAC_INT_MASK);
@@ -324,6 +345,9 @@ static int dwmac1000_irq_status(struct mac_device_info *hw,
x->irq_receive_pmt_irq_n++;
}
 
+   if (intr_status & GMAC_INT_STATUS_TSTAMP)
+   dwmac1000_ptp_isr(priv);
+
/* MAC tx/rx EEE LPI entry/exit interrupts */
if (intr_status & GMAC_INT_STATUS_LPIIS) {
/* Clean LPI interrupt by reading the Reg 12 */
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index ffe2d63389b8..8fa63d059231 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro

[PATCH 0/3] Patch series for a PTP Grandmaster use case using stmmac/gmac3 ptp clock

2020-05-14 Thread Olivier Dautricourt

This patch series covers a use case where an embedded system is
disciplining an internal clock to a GNSS signal, which provides a
stable frequency, and wants to act as a PTP Grandmaster by disciplining
a ptp clock to this internal clock.

In our setup a 10 MHz oscillator is frequency adjusted so that a pps
derived from that oscillator is in phase with the pps generated by
a gnss receiver.

Another clock derived from the same disciplined oscillator is used as
ptp_clock for the ethernet mac.

The internal pps of the system is forwarded to one of the auxiliary inputs
of the MAC.

Initially the mac time registers are considered random.
We want the mac nanosecond field to be 0 on the auxiliary pps input edge.


PATCH 1/3: 
The stmmac gmac3 version used in the setup is patched to retrieve a
timestamp at the rising edge of the aux input and to forward
it to userspace.

* What matters here is that we get the subsecond offset between the aux 
edge and the edge of the PHC's pps. *


PATCH 2,3/3:

We want the ptp clock to be in time with our aux pps input.
Since the ptp clock is derived from the system oscillator, we
don't want to do frequency adjustments.

The stmmac driver is patched to allow setting the coarse correction
mode, which avoids adjusting the frequency of the mac continuously
(the default behavior) and instead applies a single
time adjustment.


We calculate the time difference between the mac and the internal
clock, and adjust the ptp clock time with the clock_adjtime syscall.


To summarize this in a user-space program:


#include 
#include 
#include 
#include 
#include 

#include 
#include 

#include 
#include 
#include 

#include 
#include 
#include 

#define NS_PER_SEC 1000000000LL

#define CLOCKFD 3

#define FD_TO_CLOCKID(fd) \
((clockid_t) ((((unsigned int) ~(fd)) << 3) | CLOCKFD))


static inline int clock_adjtime(clockid_t id, struct timex *tx)
{
return syscall(__NR_clock_adjtime, id, tx);
}

int main(void)
{
int fd;
struct timex tx = {0};
struct ifreq ifreq = {0};
struct hwtstamp_config cfg = {0};
struct ptp_extts_event event = {0};
struct ptp_extts_request extts_request = {
.index = 0,
.flags = PTP_RISING_EDGE | PTP_ENABLE_FEATURE
};

const char *iface = "eth0";
const char *ptp_dev = "/dev/ptp2";

strncpy(ifreq.ifr_name, iface, sizeof(ifreq.ifr_name) - 1);
ifreq.ifr_data = (void *) &cfg;
fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

if (fd < 0)
return 1;

if (ioctl(fd, SIOCGHWTSTAMP, &ifreq) < 0)
return 1;

// Activate coarse mode for stmmac
cfg.flags |= HWTSTAMP_FLAGS_ADJ_COARSE;
cfg.flags &= ~HWTSTAMP_FLAGS_ADJ_FINE;

if (ioctl(fd, SIOCSHWTSTAMP, &ifreq) < 0)
return 1;

fd = open(ptp_dev, O_RDWR);

if (fd < 0)
return 1;

// Enable extts input index 0
if (ioctl(fd, PTP_EXTTS_REQUEST, &extts_request) < 0)
return 1;

// Read extts
if (read(fd, &event, sizeof(event)) != sizeof(event))
return 1;

// Correct phc time subsecond: note that this does not correct the phc
// second count for concision. The delta is (event.t.nsec - NS_PER_SEC).
tx.modes = ADJ_SETOFFSET | ADJ_NANO;
tx.time.tv_sec = -1;
tx.time.tv_usec = event.t.nsec;

if (clock_adjtime(FD_TO_CLOCKID(fd), &tx))
return 1;

// Disable extts index 0
extts_request.index = 0;
extts_request.flags = 0;

if (ioctl(fd, PTP_EXTTS_REQUEST, &extts_request) < 0)
return 1;

return 0;
}
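
The program above passes tv_sec = -1 together with a positive nanosecond field. As far as I know, ADJ_SETOFFSET with ADJ_NANO expects the nanosecond part to stay in the range [0, 1e9) even for negative offsets, with the sign carried by tv_sec. A minimal sketch of that normalization follows; struct phc_offset and split_offset() are hypothetical names, not part of any kernel or libc API.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical pair matching what would go into timex.time when
 * ADJ_SETOFFSET | ADJ_NANO is used. */
struct phc_offset {
	int64_t sec;	/* may be negative */
	int64_t nsec;	/* always in [0, NSEC_PER_SEC) */
};

/* Split a signed nanosecond delta so that sec * 1e9 + nsec == delta_ns
 * while nsec stays non-negative. */
static struct phc_offset split_offset(int64_t delta_ns)
{
	struct phc_offset o;

	o.sec = delta_ns / NSEC_PER_SEC;
	o.nsec = delta_ns % NSEC_PER_SEC;
	if (o.nsec < 0) {
		/* C division truncates toward zero; borrow one second. */
		o.sec -= 1;
		o.nsec += NSEC_PER_SEC;
	}
	return o;
}
```

With a delta of (event.t.nsec - NS_PER_SEC), as stated in the program's comment, this yields sec = -1 and nsec = event.t.nsec, which is what the program writes into the timex structure.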


Artem Panfilov (1):
  net: stmmac: GMAC3: add auxiliary snapshot support

Olivier Dautricourt (2):
  net: uapi: Add HWTSTAMP_FLAGS_ADJ_FINE/ADJ_COARSE
  net: stmmac: Support coarse mode through ioctl

 .../net/ethernet/stmicro/stmmac/dwmac1000.h   |  3 +-
 .../ethernet/stmicro/stmmac/dwmac1000_core.c  | 24 ++
 drivers/net/ethernet/stmicro/stmmac/hwif.h|  9 ++--
 .../ethernet/stmicro/stmmac/stmmac_hwtstamp.c | 10 +++-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 21 ++---
 .../net/ethernet/stmicro/stmmac/stmmac_ptp.c  | 47 +++
 .../net/ethernet/stmicro/stmmac/stmmac_ptp.h  | 20 
 include/uapi/linux/net_tstamp.h   | 12 +
 net/core/dev_ioctl.c  |  3 --
 9 files changed, 133 insertions(+), 16 deletions(-)

-- 
2.17.1

[PATCH 3/3] net: stmmac: Support coarse mode through ioctl

2020-05-14 Thread Olivier Dautricourt

This commit enables coarse correction mode for stmmac driver.

Coarse mode allows the system time to be updated in a single step.
The required time adjustment is written to the Timestamp Update registers
while the Sub-second increment register is programmed with the period
of the clock, which is the precision of our correction.

Fine adjustment mode remains the default behavior of the driver.
To enable coarse mode for the stmmac driver, pass the
HWTSTAMP_FLAGS_ADJ_COARSE flag when calling the SIOCSHWTSTAMP ioctl.
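
The mode selection in this patch comes down to setting or clearing one control bit while leaving the rest of the register alone. Below is a minimal sketch of that read-modify-write step; the bit positions, the enum and tcr_for_mode() are illustrative assumptions and may not match the driver's real definitions.

```c
#include <stdint.h>

/* Illustrative bit positions; the driver's real definitions may differ. */
#define TCR_TSENA    (1u << 0)	/* timestamping enabled */
#define TCR_TSCFUPDT (1u << 1)	/* fine (addend based) update method */

enum adj_mode { ADJ_MODE_FINE = 1, ADJ_MODE_COARSE = 2 };

/* Return the control value for the requested mode, preserving every
 * bit that is unrelated to the update method. */
static uint32_t tcr_for_mode(uint32_t tcr, enum adj_mode mode)
{
	tcr |= TCR_TSENA;
	if (mode == ADJ_MODE_FINE)
		tcr |= TCR_TSCFUPDT;
	else
		tcr &= ~TCR_TSCFUPDT;
	return tcr;
}
```

Clearing the bit explicitly in the coarse branch matters because the value is read back from hardware first, so a previous fine-mode configuration would otherwise persist.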

Signed-off-by: Olivier Dautricourt 
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c   | 17 +
 .../net/ethernet/stmicro/stmmac/stmmac_ptp.c|  3 +++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c39fafe69b12..f46503b086f4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -541,9 +541,12 @@ static int stmmac_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
 netdev_dbg(priv->dev, "%s config flags:0x%x, tx_type:0x%x, rx_filter:0x%x\n",
   __func__, config.flags, config.tx_type, config.rx_filter);
 
-   /* reserved for future extensions */
-   if (config.flags)
-   return -EINVAL;
+   if (config.flags != HWTSTAMP_FLAGS_ADJ_COARSE) {
+   /* Defaulting to fine adjustment for compatibility */
+   netdev_dbg(priv->dev, "%s defaulting to fine adjustment mode\n",
+  __func__);
+   config.flags = HWTSTAMP_FLAGS_ADJ_FINE;
+   }
 
if (config.tx_type != HWTSTAMP_TX_OFF &&
config.tx_type != HWTSTAMP_TX_ON)
@@ -689,10 +692,16 @@ static int stmmac_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
stmmac_set_hw_tstamping(priv, priv->ptpaddr, 0);
else {
stmmac_get_hw_tstamping(priv, priv->ptpaddr, &value);
-   value |= (PTP_TCR_TSENA | PTP_TCR_TSCFUPDT | PTP_TCR_TSCTRLSSR |
+   value |= (PTP_TCR_TSENA |  PTP_TCR_TSCTRLSSR |
 tstamp_all | ptp_v2 | ptp_over_ethernet |
 ptp_over_ipv6_udp | ptp_over_ipv4_udp | ts_event_en |
 ts_master_en | snap_type_sel);
+
+   if (config.flags == HWTSTAMP_FLAGS_ADJ_FINE)
+   value |= PTP_TCR_TSCFUPDT;
+   else
+   value &= ~PTP_TCR_TSCFUPDT;
+
stmmac_set_hw_tstamping(priv, priv->ptpaddr, value);
 
/* program Sub Second Increment reg */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index 920f0f3ebbca..7fb318441015 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -27,6 +27,9 @@ static int stmmac_adjust_freq(struct ptp_clock_info *ptp, s32 ppb)
int neg_adj = 0;
u64 adj;
 
+   if (priv->tstamp_config.flags != HWTSTAMP_FLAGS_ADJ_FINE)
+   return -EPERM;
+
if (ppb < 0) {
neg_adj = 1;
ppb = -ppb;
-- 
2.17.1

[PATCH 2/3] net: uapi: Add HWTSTAMP_FLAGS_ADJ_FINE/ADJ_COARSE

2020-05-14 Thread Olivier Dautricourt

This commit allows a user to specify a flag value for configuring
timestamping through the hwtstamp_config structure.

New flags introduced:

HWTSTAMP_FLAGS_NONE = 0
No flag specified: as it has value 0, this selects the
default behavior for all the existing drivers and should not
break existing userland programs.

HWTSTAMP_FLAGS_ADJ_FINE = 1
Use the fine adjustment mode.
Fine adjustment mode is usually used for precise frequency adjustments.

HWTSTAMP_FLAGS_ADJ_COARSE = 2
Use the coarse adjustment mode
Coarse adjustment mode is usually used for direct phase correction.
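
Since the core-level check on the flags field is removed by this patch, each driver has to decide what to do with the three values. One possible driver-side policy can be sketched as below; the enum mirrors the values listed above, while check_adj_flags() and its has_coarse parameter are hypothetical and not part of this series.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the three values described in the commit message. */
enum hwts_adj_flags {
	HWTS_FLAGS_NONE       = 0,
	HWTS_FLAGS_ADJ_FINE   = 1,
	HWTS_FLAGS_ADJ_COARSE = 2,
};

/* Hypothetical policy: accept the default and fine mode everywhere,
 * accept coarse mode only when the hardware offers it, and reject
 * values outside the known set. */
static int check_adj_flags(int flags, bool has_coarse)
{
	switch (flags) {
	case HWTS_FLAGS_NONE:
	case HWTS_FLAGS_ADJ_FINE:
		return 0;
	case HWTS_FLAGS_ADJ_COARSE:
		return has_coarse ? 0 : -EOPNOTSUPP;
	default:
		return -EINVAL;
	}
}
```

Note that the values are enumerated rather than bit flags, so FINE and COARSE cannot be combined: the value 3 falls into the default branch.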

Signed-off-by: Olivier Dautricourt 
---
 include/uapi/linux/net_tstamp.h | 12 
 net/core/dev_ioctl.c|  3 ---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 7ed0b3d1c00a..0cfcd490228f 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -65,6 +65,18 @@ struct hwtstamp_config {
int rx_filter;
 };
 
+/* possible values for hwtstamp_config->flags */
+enum hwtsamp_flags {
+   /* No special flag specified */
+   HWTSTAMP_FLAGS_NONE,
+
+   /* Enable fine adjustment mode if the driver supports it */
+   HWTSTAMP_FLAGS_ADJ_FINE,
+
+   /* Enable coarse adjustment mode if the driver supports it */
+   HWTSTAMP_FLAGS_ADJ_COARSE,
+};
+
 /* possible values for hwtstamp_config->tx_type */
 enum hwtstamp_tx_types {
/*
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 547b587c1950..017671545d45 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -177,9 +177,6 @@ static int net_hwtstamp_validate(struct ifreq *ifr)
if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
return -EFAULT;
 
-   if (cfg.flags) /* reserved for future extensions */
-   return -EINVAL;
-
tx_type = cfg.tx_type;
rx_filter = cfg.rx_filter;
 
-- 
2.17.1

Re: [PATCH 20/33] ipv4: add ip_sock_set_recverr

2020-05-14 Thread Christoph Hellwig

On Wed, May 13, 2020 at 02:00:43PM -0700, Joe Perches wrote:
> On Wed, 2020-05-13 at 08:26 +0200, Christoph Hellwig wrote:
> > Add a helper to directly set the IP_RECVERR sockopt from kernel space
> > without going through a fake uaccess.
> 
> This seems used only with true as the second arg.
> Is there reason to have that argument at all?

Mostly to keep it symmetric with the sockopt.  I could probably remove
a few arguments in the series if we want to be strict.

Re: remove kernel_setsockopt and kernel_getsockopt

2020-05-14 Thread 'Christoph Hellwig'

On Thu, May 14, 2020 at 10:26:41AM +, David Laight wrote:
> From: Christoph Hellwig
> > Only for those were we have users, and all those are covered.
> 
> What do we tell all our users when our kernel SCTP code
> no longer works?

We only care about in-tree modules, just like for every other interface
in the kernel.

is it ok to always pull in sctp for dlm, was: Re: [PATCH 27/33] sctp: export sctp_setsockopt_bindx

2020-05-14 Thread Christoph Hellwig

On Wed, May 13, 2020 at 03:00:58PM -0300, Marcelo Ricardo Leitner wrote:
> On Wed, May 13, 2020 at 08:26:42AM +0200, Christoph Hellwig wrote:
> > And call it directly from dlm instead of going through kernel_setsockopt.
> 
> The advantage on using kernel_setsockopt here is that sctp module will
> only be loaded if dlm actually creates a SCTP socket.  With this
> change, sctp will be loaded on setups that may not be actually using
> it. It's a quite big module and might expose the system.
> 
> I'm okay with the SCTP changes, but I'll defer to DLM folks to whether
> that's too bad or what for DLM.

So for ipv6 I could just move the helpers inline as they were trivial
and avoid that issue.  But some of the sctp stuff really is way too
big for that, so the only other option would be to use symbol_get.

[PATCH net-next v4 07/33] xdp: xdp_frame add member frame_sz and handle in convert_to_xdp_frame

2020-05-14 Thread Jesper Dangaard Brouer

Use the hole in struct xdp_frame when adding member frame_sz, which keeps
the struct the same size (32 bytes).

Drivers ixgbe and sfc had bug cases where the necessary/expected
tailroom was not reserved. This can lead to some hard to catch memory
corruption issues. With the driver's frame_sz available, this can be
detected when the packet end (xdp->data_end) exceeds the
xdp_data_hard_end pointer, which accounts for the reserved tailroom.

When detecting this driver issue, simply fail the conversion with NULL,
which results in feedback to driver (failing xdp_do_redirect()) causing
driver to drop packet. Given the lack of consistent XDP stats, this can
be hard to troubleshoot. And given this is a driver bug, we want to
generate some more noise in form of a WARN stack dump (to ID the driver
code that inlined convert_to_xdp_frame).

Inlining the WARN macro is problematic, because it adds an asm
instruction (on Intel CPUs ud2) which influences instruction cache
prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this
and at the same time make identifying the function and line of this
inlined function easier.
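
The tailroom check that this patch adds can be modelled in plain C. This is a sketch under stated assumptions: struct buf_model, buf_hard_end() and tailroom_ok() are hypothetical names, and RESERVED_TAILROOM is an assumed constant standing in for the aligned size of skb_shared_info that the kernel computes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of the reserved area at the end of each frame. */
#define RESERVED_TAILROOM 320u

/* Hypothetical user-space stand-in for the relevant xdp_buff fields. */
struct buf_model {
	uint8_t *hard_start;	/* first byte of the frame memory */
	uint8_t *data_end;	/* one past the last packet byte */
	uint32_t frame_sz;	/* total size of the frame memory */
};

/* Last usable position for packet data: frame end minus the reserve. */
static uint8_t *buf_hard_end(const struct buf_model *b)
{
	return b->hard_start + b->frame_sz - RESERVED_TAILROOM;
}

/* True when the driver left the reserved tailroom untouched. */
static bool tailroom_ok(const struct buf_model *b)
{
	return b->data_end <= buf_hard_end(b);
}
```

With a 2048-byte frame the hard end sits at offset 1728, so a packet ending at offset 1500 passes while one ending at offset 1800 would be reported as a driver bug.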

Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toke Høiland-Jørgensen 
---
 include/net/xdp.h |   14 +-
 net/core/xdp.c|8 
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index a764af4ae0ea..3094fccf5a88 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -89,7 +89,8 @@ struct xdp_frame {
void *data;
u16 len;
u16 headroom;
-   u16 metasize;
+   u32 metasize:8;
+   u32 frame_sz:24;
/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
 * while mem info is valid on remote CPU.
 */
@@ -104,6 +105,10 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
frame->dev_rx = NULL;
 }
 
+/* Avoids inlining WARN macro in fast-path */
+void xdp_warn(const char *msg, const char *func, const int line);
+#define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
+
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
 
 /* Convert xdp_buff to xdp_frame */
@@ -124,6 +129,12 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
return NULL;
 
+   /* Catch if driver didn't reserve tailroom for skb_shared_info */
+   if (unlikely(xdp->data_end > xdp_data_hard_end(xdp))) {
+   XDP_WARN("Driver BUG: missing reserved tailroom");
+   return NULL;
+   }
+
/* Store info in top of packet */
xdp_frame = xdp->data_hard_start;
 
@@ -131,6 +142,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
xdp_frame->len  = xdp->data_end - xdp->data;
xdp_frame->headroom = headroom - sizeof(*xdp_frame);
xdp_frame->metasize = metasize;
+   xdp_frame->frame_sz = xdp->frame_sz;
 
/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
xdp_frame->mem = xdp->rxq->mem;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4c7ea85486af..490b8f5fa8ee 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -496,3 +497,10 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
return xdpf;
 }
 EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
+
+/* Used by XDP_WARN macro, to avoid inlining WARN() in fast-path */
+void xdp_warn(const char *msg, const char *func, const int line)
+{
+   WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
+};
+EXPORT_SYMBOL_GPL(xdp_warn);

[PATCH net-next v4 00/33] XDP extend with knowledge of frame size

2020-05-14 Thread Jesper Dangaard Brouer

(Patchset based on net-next due to all the driver updates)

V4:
- Fixup checkpatch.pl issues
- Collected more ACKs

V3:
- Fix issue on virtio_net patch spotted by Jason Wang
- Adjust name for variable in mlx5 patch
- Collected more ACKs

V2:
- Fix bug in mlx5 for XDP_PASS case
- Collected nitpicks and ACKs from mailing list

V1:
- Fix bug in dpaa2

XDP has evolved to support several frame sizes, but xdp_buff was not
updated with this information. As a side effect, the hard end of the
XDP frame data is unknown, which has limited the BPF-helper
bpf_xdp_adjust_tail to only shrinking the packet. This patchset
addresses this and adds packet tail extend/grow.

The purpose of the patchset is ALSO to reserve a memory area that can be
used for storing extra information, specifically for extending XDP with
multi-buffer support. One proposal is to use the same layout as
skb_shared_info, which is why this area is currently 320 bytes.

When converting xdp_frame to SKB (veth and cpumap), the full tailroom
area can now be used and SKB truesize is now correct. For most
drivers this results in a much larger tailroom in SKB "head" data
area. The network stack can now take advantage of this when doing SKB
coalescing. Thus, a good driver test is to use xdp_redirect_cpu from
samples/bpf/ and do some TCP stream testing.

Use-cases for tail grow/extend:
(1) IPsec / XFRM needs a tail extend[1][2].
(2) DNS-cache responses in XDP.
(3) HAProxy ALOHA would need it to convert to XDP.
(4) Add tail info e.g. timestamp and collect via tcpdump

[1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
[2] http://vger.kernel.org/netconf2019.html

Examples of how to access the tail area of an XDP packet are shown in the
XDP-tutorial example[3].

[3] https://github.com/xdp-project/xdp-tutorial/blob/master/experiment01-tailgrow/

---

Ilias Apalodimas (1):
  net: netsec: Add support for XDP frame size

Jesper Dangaard Brouer (32):
  xdp: add frame size to xdp_buff
  bnxt: add XDP frame size to driver
  sfc: add XDP frame size
  mvneta: add XDP frame size to driver
  net: XDP-generic determining XDP frame size
  xdp: xdp_frame add member frame_sz and handle in convert_to_xdp_frame
  xdp: cpumap redirect use frame_sz and increase skb_tailroom
  veth: adjust hard_start offset on redirect XDP frames
  veth: xdp using frame_sz in veth driver
  dpaa2-eth: add XDP frame size
  hv_netvsc: add XDP frame size to driver
  qlogic/qede: add XDP frame size to driver
  net: ethernet: ti: add XDP frame size to driver cpsw
  ena: add XDP frame size to amazon NIC driver
  mlx4: add XDP frame size and adjust max XDP MTU
  net: thunderx: add XDP frame size
  nfp: add XDP frame size to netronome driver
  tun: add XDP frame size
  vhost_net: also populate XDP frame size
  virtio_net: add XDP frame size in two code paths
  ixgbe: fix XDP redirect on archs with PAGE_SIZE above 4K
  ixgbe: add XDP frame size to driver
  ixgbevf: add XDP frame size to VF driver
  i40e: add XDP frame size to driver
  ice: add XDP frame size to driver
  xdp: for Intel AF_XDP drivers add XDP frame_sz
  mlx5: rx queue setup time determine frame_sz for XDP
  xdp: allow bpf_xdp_adjust_tail() to grow packet size
  xdp: clear grow memory in bpf_xdp_adjust_tail()
  bpf: add xdp.frame_sz in bpf_prog_test_run_xdp().
  selftests/bpf: adjust BPF selftest for xdp_adjust_tail
  selftests/bpf: xdp_adjust_tail add grow tail tests


 drivers/net/ethernet/amazon/ena/ena_netdev.c   |1 
 drivers/net/ethernet/amazon/ena/ena_netdev.h   |5 -
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |1 
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |1 
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c   |7 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   30 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c |2 
 drivers/net/ethernet/intel/ice/ice_txrx.c  |   34 --
 drivers/net/ethernet/intel/ice/ice_xsk.c   |2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   33 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c   |2 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |   34 --
 drivers/net/ethernet/marvell/mvneta.c  |   25 ++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |3 
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |1 
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |1 
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |1 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |6 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|2 
 .../net/ethernet/netronome/nfp/nfp_net_common.c|6 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c |1 
 drivers/net/ethernet/qlogic/qede/qede_main.c   |2 
 drivers/net/ethernet/sfc/rx.c  |1 
 drivers/net/ethernet/socionext/netsec.c|   30 +

[PATCH net-next v4 05/33] net: netsec: Add support for XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

From: Ilias Apalodimas 

This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Because xdp_adjust_tail can grow the
area accessible to the CPU (which it can possibly write into), the max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.

For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool has DMA-mapped the entire
page size.
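
The sync-length rule can be sketched as follows. This is a simplified
userspace model, not the driver code: model_sync_len() and its
parameters are hypothetical names, with both lengths measured from the
start of the packet area that the device may DMA into.

```c
/*
 * Length that must be synced for_device when the page is recycled.
 * len_before: bytes the CPU could touch before the BPF program ran.
 * len_after:  bytes the CPU could touch after the BPF program ran,
 *             which is larger if the program grew the tail.
 * Returns the larger of the two, so no CPU-written cache line is
 * left out of the sync.
 */
static unsigned int model_sync_len(unsigned int len_before,
				   unsigned int len_after)
{
	return len_after > len_before ? len_after : len_before;
}
```

For example, a 1500-byte packet grown to 1600 bytes needs a 1600-byte
sync, while one shrunk to 1400 bytes still needs the original 1500.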

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Lorenzo Bianconi 
---
 drivers/net/ethernet/socionext/netsec.c |   30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index a5a0fb60193a..e1f4be4b3d69 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -884,23 +884,28 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
  struct xdp_buff *xdp)
 {
struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
-   unsigned int len = xdp->data_end - xdp->data;
+   unsigned int sync, len = xdp->data_end - xdp->data;
u32 ret = NETSEC_XDP_PASS;
+   struct page *page;
int err;
u32 act;
 
act = bpf_prog_run_xdp(prog, xdp);
 
+   /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
+   sync = xdp->data_end - xdp->data_hard_start - NETSEC_RXBUF_HEADROOM;
+   sync = max(sync, len);
+
switch (act) {
case XDP_PASS:
ret = NETSEC_XDP_PASS;
break;
case XDP_TX:
ret = netsec_xdp_xmit_back(priv, xdp);
-   if (ret != NETSEC_XDP_TX)
-   page_pool_put_page(dring->page_pool,
-  virt_to_head_page(xdp->data), len,
-  true);
+   if (ret != NETSEC_XDP_TX) {
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(dring->page_pool, page, sync, true);
+   }
break;
case XDP_REDIRECT:
err = xdp_do_redirect(priv->ndev, xdp, prog);
@@ -908,9 +913,8 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
ret = NETSEC_XDP_REDIR;
} else {
ret = NETSEC_XDP_CONSUMED;
-   page_pool_put_page(dring->page_pool,
-  virt_to_head_page(xdp->data), len,
-  true);
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(dring->page_pool, page, sync, true);
}
break;
default:
@@ -921,8 +925,8 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
/* fall through -- handle aborts by dropping packet */
case XDP_DROP:
ret = NETSEC_XDP_CONSUMED;
-   page_pool_put_page(dring->page_pool,
-  virt_to_head_page(xdp->data), len, true);
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(dring->page_pool, page, sync, true);
break;
}
 
@@ -936,10 +940,14 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
struct netsec_rx_pkt_info rx_info;
enum dma_data_direction dma_dir;
struct bpf_prog *xdp_prog;
+   struct xdp_buff xdp;
u16 xdp_xmit = 0;
u32 xdp_act = 0;
int done = 0;
 
+   xdp.rxq = &dring->xdp_rxq;
+   xdp.frame_sz = PAGE_SIZE;
+
rcu_read_lock();
xdp_prog = READ_ONCE(priv->xdp_prog);
dma_dir = page_pool_get_dma_dir(dring->page_pool);
@@ -953,7 +961,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
struct sk_buff *skb = NULL;
u16 pkt_len, desc_len;
dma_addr_t dma_handle;
-   struct xdp_buff xdp;
void *buf_addr;
 
if (de->attr & (1U << NETSEC_RX_PKT_OWN_FIELD)) {
@@ -1002,7 +1009,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
xdp.data = desc->addr + NETSEC_RXBUF_HEADROOM;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + pkt_len;
-   xdp.rxq = &dring->xdp_rxq;
 
if (xdp_prog) {
xdp_result = netsec_run_xdp(priv, xdp_prog, &xdp);

[PATCH net-next v4 06/33] net: XDP-generic determining XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

The SKB "head" pointer points to the data area that contains
skb_shared_info, which can be found via skb_end_pointer(). Given that
xdp->data_hard_start has been established (basically pointing to
skb->head), the frame size is the distance between skb_end_pointer() and
data_hard_start, plus the size reserved for skb_shared_info.

Change the bpf_xdp_adjust_tail offset adjustment of skb->len to be a
positive number on grow and a negative number on shrink, as this seems
more natural when reading the code.
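
The sign convention can be illustrated with a small userspace model.
The struct model_skb type and the model_apply_tail_adjust() helper are
hypothetical stand-ins, not kernel APIs; only the arithmetic mirrors
the change described here.

```c
#include <stddef.h>

/* Minimal stand-in for the one sk_buff field this sketch needs. */
struct model_skb {
	unsigned int len;
};

/*
 * Apply a tail adjustment to the packet length.
 * off is positive when the tail grew and negative when it shrank,
 * so a single "+=" handles both directions.
 */
static void model_apply_tail_adjust(struct model_skb *skb,
				    const char *orig_data_end,
				    const char *new_data_end)
{
	ptrdiff_t off = new_data_end - orig_data_end;

	if (off != 0)
		skb->len += off;
}
```

With the old convention (orig minus new) the caller had to subtract,
which only reads naturally if the packet can never grow.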

Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toke Høiland-Jørgensen 
---
 net/core/dev.c |   14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4c91de39890a..f937a3ff668d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4617,6 +4617,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
xdp->data_meta = xdp->data;
xdp->data_end = xdp->data + hlen;
xdp->data_hard_start = skb->data - skb_headroom(skb);
+
+   /* SKB "head" area always have tailroom for skb_shared_info */
+   xdp->frame_sz  = (void *)skb_end_pointer(skb) - xdp->data_hard_start;
+   xdp->frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
orig_data_end = xdp->data_end;
orig_data = xdp->data;
eth = (struct ethhdr *)xdp->data;
@@ -4640,14 +4645,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
skb_reset_network_header(skb);
}
 
-   /* check if bpf_xdp_adjust_tail was used. it can only "shrink"
-* pckt.
-*/
-   off = orig_data_end - xdp->data_end;
+   /* check if bpf_xdp_adjust_tail was used */
+   off = xdp->data_end - orig_data_end;
if (off != 0) {
skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
-   skb->len -= off;
-
+   skb->len += off; /* positive on grow, negative on shrink */
}
 
/* check if XDP changed eth hdr such SKB needs update */

[PATCH net-next v4 03/33] sfc: add XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

This driver uses RX page-split when possible. It was recently fixed
in commit 86e85bf6981c ("sfc: fix XDP-redirect in this driver") to
add needed tailroom for XDP-redirect.

After the fix efx->rx_page_buf_step is the frame size, with enough
head and tail-room for XDP-redirect.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/sfc/rx.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 260352d97d9d..68c47a8c71df 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -308,6 +308,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + rx_buf->len;
xdp.rxq = &rx_queue->xdp_rxq_info;
+   xdp.frame_sz = efx->rx_page_buf_step;
 
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
rcu_read_unlock();

[PATCH net-next v4 02/33] bnxt: add XDP frame size to driver

2020-05-14 Thread Jesper Dangaard Brouer

This driver uses full PAGE_SIZE pages when XDP is enabled.

When XDP is used, the driver uses __bnxt_alloc_rx_page, which does a
full page DMA-map. Thus, xdp_adjust_tail grow is DMA compliant for the
XDP_TX action that does DMA-sync.

Cc: Michael Chan 
Cc: Andy Gospodarek 
Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Andy Gospodarek 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index c6f6f2033880..5e3b4a3b69ea 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -138,6 +138,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = *data_ptr + *len;
xdp.rxq = &rxr->xdp_rxq;
+   xdp.frame_sz = PAGE_SIZE; /* BNXT_RX_PAGE_MODE(bp) when XDP enabled */
orig_data = xdp.data;
 
rcu_read_lock();

[PATCH net-next v4 09/33] veth: adjust hard_start offset on redirect XDP frames

2020-05-14 Thread Jesper Dangaard Brouer

When native XDP redirects into a veth device, the frame arrives in the
xdp_frame structure. It is then processed in veth_xdp_rcv_one(),
which can run a new XDP bpf_prog on the packet. Doing so requires
converting xdp_frame to xdp_buff, but the tricky part is that
xdp_frame memory area is located in the top (data_hard_start) memory
area that xdp_buff will point into.

The current code tried to protect the xdp_frame area, by assigning
xdp_buff.data_hard_start past this memory. This results in 32 bytes
less headroom to expand into via BPF-helper bpf_xdp_adjust_head().

This protection step is actually not needed, because the BPF-helper
bpf_xdp_adjust_head() already reserves this area and doesn't allow the
BPF-prog to expand into it. Thus, it is safe to point data_hard_start
directly at the xdp_frame memory area.
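
The headroom difference can be checked with a short userspace model.
It assumes, per the description above, that bpf_xdp_adjust_head()
refuses to move data into the first sizeof(struct xdp_frame) bytes
after data_hard_start, and it takes that size to be the 32 bytes
mentioned earlier. MODEL_XDP_FRAME_SIZE and model_head_grow_room() are
hypothetical names.

```c
#include <stddef.h>

/* Assumed size of struct xdp_frame, from the "32 bytes" figure above. */
#define MODEL_XDP_FRAME_SIZE 32

/*
 * Bytes by which a BPF program may move the data pointer backwards,
 * given that the area right after data_hard_start is reserved.
 */
static ptrdiff_t model_head_grow_room(const char *data_hard_start,
				      const char *data)
{
	return data - (data_hard_start + MODEL_XDP_FRAME_SIZE);
}
```

Pointing data_hard_start at the xdp_frame itself, instead of past it,
gives back exactly MODEL_XDP_FRAME_SIZE bytes of usable headroom.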

Cc: Toshiaki Makita 
Fixes: 9fc8d518d9d5 ("veth: Handle xdp_frames in xdp napi ring")
Reported-by: Mao Wenan 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toshiaki Makita 
Acked-by: Toke Høiland-Jørgensen 
---
 drivers/net/veth.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index aece0e5eec8c..d5691bb84448 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -564,13 +564,15 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
struct veth_stats *stats)
 {
void *hard_start = frame->data - frame->headroom;
-   void *head = hard_start - sizeof(struct xdp_frame);
int len = frame->len, delta = 0;
struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
 
+   /* bpf_xdp_adjust_head() assures BPF cannot access xdp_frame area */
+   hard_start -= sizeof(struct xdp_frame);
+
rcu_read_lock();
xdp_prog = rcu_dereference(rq->xdp_prog);
if (likely(xdp_prog)) {
@@ -592,7 +594,6 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
break;
case XDP_TX:
orig_frame = *frame;
-   xdp.data_hard_start = head;
xdp.rxq->mem = frame->mem;
if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
trace_xdp_exception(rq->dev, xdp_prog, act);
@@ -605,7 +606,6 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
goto xdp_xmit;
case XDP_REDIRECT:
orig_frame = *frame;
-   xdp.data_hard_start = head;
xdp.rxq->mem = frame->mem;
if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
frame = &orig_frame;
@@ -629,7 +629,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
rcu_read_unlock();
 
headroom = sizeof(struct xdp_frame) + frame->headroom - delta;
-   skb = veth_build_skb(head, headroom, len, 0);
+   skb = veth_build_skb(hard_start, headroom, len, 0);
if (!skb) {
xdp_return_frame(frame);
stats->rx_drops++;

[PATCH net-next v4 04/33] mvneta: add XDP frame size to driver

2020-05-14 Thread Jesper Dangaard Brouer

The Marvell driver mvneta uses PAGE_SIZE frames, which makes it
really easy to convert. The driver updates rxq, and now also frame_sz,
once per NAPI call.

This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Because xdp_adjust_tail can grow the
area accessible to the CPU (which it can possibly write into), the max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.

For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool has DMA-mapped the entire
page size.

Cc: thomas.petazz...@bootlin.com
Acked-by: Lorenzo Bianconi 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/marvell/mvneta.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 51889770958d..37947949345c 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2148,12 +2148,17 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
   struct bpf_prog *prog, struct xdp_buff *xdp,
   struct mvneta_stats *stats)
 {
-   unsigned int len;
+   unsigned int len, sync;
+   struct page *page;
u32 ret, act;
 
len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
act = bpf_prog_run_xdp(prog, xdp);
 
+   /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
+   sync = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
+   sync = max(sync, len);
+
switch (act) {
case XDP_PASS:
stats->xdp_pass++;
@@ -2164,9 +2169,8 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
err = xdp_do_redirect(pp->dev, xdp, prog);
if (unlikely(err)) {
ret = MVNETA_XDP_DROPPED;
-   page_pool_put_page(rxq->page_pool,
-  virt_to_head_page(xdp->data), len,
-  true);
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(rxq->page_pool, page, sync, true);
} else {
ret = MVNETA_XDP_REDIR;
stats->xdp_redirect++;
@@ -2175,10 +2179,10 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
}
case XDP_TX:
ret = mvneta_xdp_xmit_back(pp, xdp);
-   if (ret != MVNETA_XDP_TX)
-   page_pool_put_page(rxq->page_pool,
-  virt_to_head_page(xdp->data), len,
-  true);
+   if (ret != MVNETA_XDP_TX) {
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(rxq->page_pool, page, sync, true);
+   }
break;
default:
bpf_warn_invalid_xdp_action(act);
@@ -2187,8 +2191,8 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
trace_xdp_exception(pp->dev, prog, act);
/* fall through */
case XDP_DROP:
-   page_pool_put_page(rxq->page_pool,
-  virt_to_head_page(xdp->data), len, true);
+   page = virt_to_head_page(xdp->data);
+   page_pool_put_page(rxq->page_pool, page, sync, true);
ret = MVNETA_XDP_DROPPED;
stats->xdp_drop++;
break;
@@ -2320,6 +2324,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
rcu_read_lock();
xdp_prog = READ_ONCE(pp->xdp_prog);
xdp_buf.rxq = &rxq->xdp_rxq;
+   xdp_buf.frame_sz = PAGE_SIZE;
 
/* Fairness NAPI loop */
while (rx_proc < budget && rx_proc < rx_todo) {

[PATCH net-next v4 01/33] xdp: add frame size to xdp_buff

2020-05-14 Thread Jesper Dangaard Brouer

XDP has evolved to support several frame sizes, but xdp_buff was not
updated with this information. The frame size (frame_sz) member of
xdp_buff is introduced to know the real size of the memory the frame is
delivered in.

When introducing this also make it clear that some tailroom is
reserved/required when creating SKBs using build_skb().

It would also have been an option to introduce a pointer to
data_hard_end (with reserved offset). The advantage of frame_sz is
that (like rxq) drivers only need to set up/assign this value once per
NAPI cycle. Due to XDP-generic (and some drivers) it's not possible to
store frame_sz inside xdp_rxq_info, because it varies per packet, as it
can be based on or depend on packet length.
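
How frame_sz lets code deduce data_hard_end can be modelled in plain
userspace C. This is a sketch under stated assumptions: struct
model_xdp_buff keeps only the fields needed here, MODEL_TAILROOM uses
the 320-byte figure quoted in the cover letter in place of
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), and the helper names
are hypothetical.

```c
#include <stddef.h>

/* Assumed reserved tailroom, from the cover letter's 320 bytes. */
#define MODEL_TAILROOM 320u

/* Reduced stand-in for struct xdp_buff. */
struct model_xdp_buff {
	char *data;
	char *data_end;
	char *data_hard_start;
	unsigned int frame_sz;	/* size of the memory backing the frame */
};

/* First byte past the area a BPF program may use. */
static char *model_data_hard_end(const struct model_xdp_buff *xdp)
{
	return xdp->data_hard_start + xdp->frame_sz - MODEL_TAILROOM;
}

/* Bytes by which the packet tail could still grow. */
static ptrdiff_t model_tail_grow_room(const struct model_xdp_buff *xdp)
{
	return model_data_hard_end(xdp) - xdp->data_end;
}
```

For a 4096-byte frame with 256 bytes of headroom and a 1500-byte
packet, this leaves 4096 - 320 - 256 - 1500 = 2020 bytes of grow room.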

V2: nitpick: deduct -> deduce

Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toke Høiland-Jørgensen 
---
 include/net/xdp.h |   13 +
 1 file changed, 13 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3cc6d5d84aa4..a764af4ae0ea 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -6,6 +6,8 @@
 #ifndef __LINUX_NET_XDP_H__
 #define __LINUX_NET_XDP_H__
 
+#include  /* skb_shared_info */
+
 /**
  * DOC: XDP RX-queue information
  *
@@ -70,8 +72,19 @@ struct xdp_buff {
void *data_hard_start;
unsigned long handle;
struct xdp_rxq_info *rxq;
+   u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
 };
 
+/* Reserve memory area at end-of data area.
+ *
+ * This macro reserves tailroom in the XDP buffer by limiting the
+ * XDP/BPF data access to data_hard_end.  Notice same area (and size)
+ * is used for XDP_PASS, when constructing the SKB via build_skb().
+ */
+#define xdp_data_hard_end(xdp) \
+   ((xdp)->data_hard_start + (xdp)->frame_sz - \
+SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 struct xdp_frame {
void *data;
u16 len;

[PATCH net-next v4 08/33] xdp: cpumap redirect use frame_sz and increase skb_tailroom

2020-05-14 Thread Jesper Dangaard Brouer

Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already has reserved room for skb_shared_info, simplifies
using build_skb significantly.

With this change we no longer lie about the SKB truesize, but more
importantly a significantly larger skb_tailroom is now provided, e.g. when
drivers use a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).
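
A rough userspace calculation shows the size of the effect. The
constants are assumptions for illustration: a 64-byte alignment
standing in for SKB_DATA_ALIGN and the 320-byte skb_shared_info area
from the cover letter. The helper names are hypothetical.

```c
/* Assumed alignment and skb_shared_info size (illustrative values). */
#define MODEL_CACHE_BYTES 64u
#define MODEL_SHINFO_SIZE 320u
#define MODEL_ALIGN(x) \
	(((x) + MODEL_CACHE_BYTES - 1) & ~(MODEL_CACHE_BYTES - 1))

/* Old approach: frame size guessed from packet length and headroom. */
static unsigned int model_old_frame_size(unsigned int len,
					 unsigned int headroom)
{
	return MODEL_ALIGN(len + headroom) + MODEL_ALIGN(MODEL_SHINFO_SIZE);
}

/* Linear tailroom left once skb_shared_info is placed at the end. */
static unsigned int model_skb_tailroom(unsigned int frame_size,
				       unsigned int len,
				       unsigned int headroom)
{
	return frame_size - MODEL_ALIGN(MODEL_SHINFO_SIZE) - headroom - len;
}
```

For a 1500-byte packet with 256 bytes of headroom, the guessed frame
size is 2112 bytes and leaves 36 bytes of tailroom, whereas a real
4096-byte frame_sz leaves 2020 bytes.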

Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toke Høiland-Jørgensen 
---
 kernel/bpf/cpumap.c |   21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3fe0b006d2d2..a71790dab12d 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -162,25 +162,10 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
/* Part of headroom was reserved to xdpf */
hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
 
-   /* build_skb need to place skb_shared_info after SKB end, and
-* also want to know the memory "truesize".  Thus, need to
-* know the memory frame size backing xdp_buff.
-*
-* XDP was designed to have PAGE_SIZE frames, but this
-* assumption is not longer true with ixgbe and i40e.  It
-* would be preferred to set frame_size to 2048 or 4096
-* depending on the driver.
-*   frame_size = 2048;
-*   frame_len  = frame_size - sizeof(*xdp_frame);
-*
-* Instead, with info avail, skb_shared_info in placed after
-* packet len.  This, unfortunately fakes the truesize.
-* Another disadvantage of this approach, the skb_shared_info
-* is not at a fixed memory location, with mixed length
-* packets, which is bad for cache-line hotness.
+   /* Memory size backing xdp_frame data already have reserved
+* room for build_skb to place skb_shared_info in tailroom.
 */
-   frame_size = SKB_DATA_ALIGN(xdpf->len + hard_start_headroom) +
-   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   frame_size = xdpf->frame_sz;
 
pkt_data_start = xdpf->data - hard_start_headroom;
skb = build_skb_around(skb, pkt_data_start, frame_size);

[PATCH net-next v4 10/33] veth: xdp using frame_sz in veth driver

2020-05-14 Thread Jesper Dangaard Brouer

The veth driver can run XDP in "native" mode in its own NAPI
handler, and since commit 9fc8d518d9d5 ("veth: Handle xdp_frames in
xdp napi ring") packets can come in two forms, either xdp_frame or
skb, calling veth_xdp_rcv_one() or veth_xdp_rcv_skb() respectively.

For packets to arrive in xdp_frame format, they will have been
redirected from an XDP native driver. In case of XDP_PASS or no
XDP-prog attached, the veth driver will allocate and create an SKB.

The current code in veth_xdp_rcv_one() (the xdp_frame case) had to guess
the frame truesize of the incoming xdp_frame when using
veth_build_skb(). With xdp_frame->frame_sz this is no longer
necessary.

Calculating the frame_sz in veth_xdp_rcv_skb() (the skb case) is done
similarly to the XDP-generic handling code in net/core/dev.c.

Cc: Toshiaki Makita 
Reviewed-by: Lorenzo Bianconi 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Toke Høiland-Jørgensen 
Acked-by: Toshiaki Makita 
---
 drivers/net/veth.c |   22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d5691bb84448..b586d2fa5551 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -405,10 +405,6 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 {
struct sk_buff *skb;
 
-   if (!buflen) {
-   buflen = SKB_DATA_ALIGN(headroom + len) +
-SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-   }
skb = build_skb(head, buflen);
if (!skb)
return NULL;
@@ -583,6 +579,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
xdp.data = frame->data;
xdp.data_end = frame->data + frame->len;
xdp.data_meta = frame->data - frame->metasize;
+   xdp.frame_sz = frame->frame_sz;
xdp.rxq = &rq->xdp_rxq;
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);
@@ -629,7 +626,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
rcu_read_unlock();
 
headroom = sizeof(struct xdp_frame) + frame->headroom - delta;
-   skb = veth_build_skb(hard_start, headroom, len, 0);
+   skb = veth_build_skb(hard_start, headroom, len, frame->frame_sz);
if (!skb) {
xdp_return_frame(frame);
stats->rx_drops++;
@@ -695,9 +692,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
goto drop;
}
 
-   nskb = veth_build_skb(head,
- VETH_XDP_HEADROOM + mac_len, skb->len,
- PAGE_SIZE);
+   nskb = veth_build_skb(head, VETH_XDP_HEADROOM + mac_len,
+ skb->len, PAGE_SIZE);
if (!nskb) {
page_frag_free(head);
goto drop;
@@ -715,6 +711,11 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
xdp.data_end = xdp.data + pktlen;
xdp.data_meta = xdp.data;
xdp.rxq = &rq->xdp_rxq;
+
+   /* SKB "head" area always have tailroom for skb_shared_info */
+   xdp.frame_sz = (void *)skb_end_pointer(skb) - xdp.data_hard_start;
+   xdp.frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
orig_data = xdp.data;
orig_data_end = xdp.data_end;
 
@@ -758,6 +759,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
}
rcu_read_unlock();
 
+   /* check if bpf_xdp_adjust_head was used */
delta = orig_data - xdp.data;
off = mac_len + delta;
if (off > 0)
@@ -765,9 +767,11 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
else if (off < 0)
__skb_pull(skb, -off);
skb->mac_header -= delta;
+
+   /* check if bpf_xdp_adjust_tail was used */
off = xdp.data_end - orig_data_end;
if (off != 0)
-   __skb_put(skb, off);
+   __skb_put(skb, off); /* positive on grow, negative on shrink */
skb->protocol = eth_type_trans(skb, rq->dev);
 
metalen = xdp.data - xdp.data_meta;

[PATCH net-next v4 13/33] qlogic/qede: add XDP frame size to driver

2020-05-14 Thread Jesper Dangaard Brouer

The qede driver uses a full page when XDP is enabled. The driver's value
in rx_buf_seg_size (struct qede_rx_queue) will be PAGE_SIZE when an
XDP bpf_prog is attached.

Cc: Ariel Elior 
Cc: gr-everest-linux...@marvell.com
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c   |1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index c6c20776b474..7598ebe0962a 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1066,6 +1066,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + *len;
xdp.rxq = &rxq->xdp_rxq;
+   xdp.frame_sz = rxq->rx_buf_seg_size; /* PAGE_SIZE when XDP enabled */
 
/* Queues always have a full reset currently, so for the time
 * being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 300405369c37..194bff3ae813 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1425,7 +1425,7 @@ static int qede_alloc_mem_rxq(struct qede_dev *edev, struct qede_rx_queue *rxq)
if (rxq->rx_buf_size + size > PAGE_SIZE)
rxq->rx_buf_size = PAGE_SIZE - size;
 
-   /* Segment size to spilt a page in multiple equal parts ,
+   /* Segment size to split a page in multiple equal parts,
 * unless XDP is used in which case we'd use the entire page.
 */
if (!edev->xdp_prog) {

[PATCH net-next v4 15/33] ena: add XDP frame size to amazon NIC driver

2020-05-14 Thread Jesper Dangaard Brouer

Frame size ENA_PAGE_SIZE is limited to 16K on systems with a PAGE_SIZE
larger than 16K. Change ENA_XDP_MAX_MTU to also take into account
the reserved tailroom.
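
The resulting MTU limit can be worked through with a small userspace
calculation. The Ethernet constants (14, 4, 4) and the 256-byte
XDP_PACKET_HEADROOM are the usual kernel values; the 320-byte tailroom
is the figure from the cover letter and stands in for
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)). All MODEL_ names and
helpers are hypothetical.

```c
#define MODEL_ETH_HLEN			14u
#define MODEL_ETH_FCS_LEN		4u
#define MODEL_VLAN_HLEN			4u
#define MODEL_XDP_PACKET_HEADROOM	256u
#define MODEL_TAILROOM			320u	/* assumed, see cover letter */
#define MODEL_ENA_MAX_PAGE		(16u * 1024u)

/* ENA_PAGE_SIZE model: PAGE_SIZE capped at 16K. */
static unsigned int model_ena_page_size(unsigned int page_size)
{
	return page_size > MODEL_ENA_MAX_PAGE ? MODEL_ENA_MAX_PAGE : page_size;
}

/* Max XDP MTU once headers, headroom and tailroom are subtracted. */
static unsigned int model_ena_xdp_max_mtu(unsigned int page_size)
{
	return model_ena_page_size(page_size) - MODEL_ETH_HLEN -
	       MODEL_ETH_FCS_LEN - MODEL_VLAN_HLEN -
	       MODEL_XDP_PACKET_HEADROOM - MODEL_TAILROOM;
}
```

Under these assumptions a 4K page allows an XDP MTU of 3498 bytes, and
a 64K page (capped to 16K) allows 15786 bytes.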

Cc: Arthur Kiyanovski 
Acked-by: Sameeh Jubran 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c |1 +
 drivers/net/ethernet/amazon/ena/ena_netdev.h |5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 2818965427e9..85b87ed02dd5 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1606,6 +1606,7 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, struct napi_struct *napi,
  "%s qid %d\n", __func__, rx_ring->qid);
res_budget = budget;
xdp.rxq = &rx_ring->xdp_rxq;
+   xdp.frame_sz = ENA_PAGE_SIZE;
 
do {
xdp_verdict = XDP_PASS;
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 7df67bf09b93..680099afcccf 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -151,8 +151,9 @@
  * The buffer size we share with the device is defined to be ENA_PAGE_SIZE
  */
 
-#define ENA_XDP_MAX_MTU (ENA_PAGE_SIZE - ETH_HLEN - ETH_FCS_LEN - \
-   VLAN_HLEN - XDP_PACKET_HEADROOM)
+#define ENA_XDP_MAX_MTU (ENA_PAGE_SIZE - ETH_HLEN - ETH_FCS_LEN -  \
+VLAN_HLEN - XDP_PACKET_HEADROOM -  \
+SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 
 #define ENA_IS_XDP_INDEX(adapter, index) (((index) >= (adapter)->xdp_first_ring) && \
((index) < (adapter)->xdp_first_ring + (adapter)->xdp_num_queues))

[PATCH net-next v4 12/33] hv_netvsc: add XDP frame size to driver

2020-05-14 Thread Jesper Dangaard Brouer

The hyperv NIC driver does memory allocation and copy even without XDP.
In XDP mode it will allocate a new page for each packet and copy over
the payload, before invoking the XDP BPF-prog.

The positive thing is that it's easy to determine the xdp.frame_sz.

The XDP implementation for hv_netvsc transparently passes xdp_prog
to the associated VF NIC. Many of the Azure VMs are using SRIOV, so the
majority of the data is actually processed directly on the VF driver's XDP
path. So the overhead of the synthetic data path (hv_netvsc) is minimal.

When XDP is enabled on this driver, XDP_PASS and XDP_TX will create the
SKB via build_skb (based on the newly allocated page). Now using XDP
frame_sz this will provide more skb_tailroom, which netstack can use for
SKB coalescing (e.g. tcp_try_coalesce -> skb_try_coalesce).

V3: Adjust patch desc to be more positive.

Cc: Wei Liu 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/hyperv/netvsc_bpf.c |1 +
 drivers/net/hyperv/netvsc_drv.c |2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index b86611041db6..1e0c024b0a93 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -49,6 +49,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
xdp_set_data_meta_invalid(xdp);
xdp->data_end = xdp->data + len;
xdp->rxq = &nvchan->xdp_rxq;
+   xdp->frame_sz = PAGE_SIZE;
xdp->handle = 0;
 
memcpy(xdp->data, data, len);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 5de57fc3ec60..6267f706e8ee 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -795,7 +795,7 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct 
net_device *net,
if (xbuf) {
unsigned int hdroom = xdp->data - xdp->data_hard_start;
unsigned int xlen = xdp->data_end - xdp->data;
-   unsigned int frag_size = netvsc_xdp_fraglen(hdroom + xlen);
+   unsigned int frag_size = xdp->frame_sz;
 
skb = build_skb(xbuf, frag_size);

[PATCH net-next v4 20/33] vhost_net: also populate XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff
has an embedded struct tun_xdp_hdr (located at xdp->data_hard_start)
which contains the buffer length 'buflen' (with tailroom for
skb_shared_info). Also storing this buflen in xdp->frame_sz does not
obsolete struct tun_xdp_hdr, as it also contains a struct
virtio_net_hdr with other information.

Cc: Jason Wang 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Michael S. Tsirkin 
Acked-by: Jason Wang 
---
 drivers/vhost/net.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2927f02cc7e1..516519dcc8ff 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -747,6 +747,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
xdp->data = buf + pad;
xdp->data_end = xdp->data + len;
hdr->buflen = buflen;
+   xdp->frame_sz = buflen;
 
--net->refcnt_bias;
alloc_frag->offset += buflen;

[PATCH net-next v4 17/33] net: thunderx: add XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

To help reviewers, these are the defines related to RCV_FRAG_LEN:

 #define DMA_BUFFER_LEN 1536 /* In multiples of 128bytes */
 #define RCV_FRAG_LEN   (SKB_DATA_ALIGN(DMA_BUFFER_LEN + NET_SKB_PAD) + \
 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
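
To make the resulting frame size concrete, here is a small userspace
sketch that evaluates the defines above. The values chosen for
CACHE_LINE, NET_SKB_PAD and SHINFO_SIZE are assumptions (they depend on
architecture and kernel config), and the function names are
hypothetical.

```c
#define CACHE_LINE          64u   /* assumed SMP_CACHE_BYTES */
#define NET_SKB_PAD         64u   /* assumed, arch dependent */
#define SHINFO_SIZE         320u  /* assumed sizeof(struct skb_shared_info) */
#define DMA_BUFFER_LEN      1536u /* from the driver defines above */
#define XDP_PACKET_HEADROOM 256u

/* Userspace stand-in for SKB_DATA_ALIGN(). */
static unsigned int data_align(unsigned int x)
{
	return (x + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* Mirrors the RCV_FRAG_LEN define under the assumed constants. */
static unsigned int rcv_frag_len(void)
{
	return data_align(DMA_BUFFER_LEN + NET_SKB_PAD) +
	       data_align(SHINFO_SIZE);
}

/* The frame size used by the patch: RCV_FRAG_LEN plus XDP headroom. */
static unsigned int xdp_frame_sz(void)
{
	return rcv_frag_len() + XDP_PACKET_HEADROOM;
}
```

With these assumed constants RCV_FRAG_LEN evaluates to 1920 bytes and
the frame size to 2176 bytes.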

Cc: Sunil Goutham 
Cc: Robert Richter 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index b4b33368698f..2ba0ce115e63 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -552,6 +552,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct 
bpf_prog *prog,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
xdp.rxq = &rq->xdp_rxq;
+   xdp.frame_sz = RCV_FRAG_LEN + XDP_PACKET_HEADROOM;
orig_data = xdp.data;
 
rcu_read_lock();

[PATCH net-next v4 21/33] virtio_net: add XDP frame size in two code paths

2020-05-14 Thread Jesper Dangaard Brouer

The virtio_net driver is running inside the guest-OS. There are two
XDP receive code-paths in virtio_net, namely receive_small() and
receive_mergeable(). The receive_big() function does not support XDP.

In receive_small() the frame size is available in buflen. The buffers
backing these frames are allocated in add_recvbuf_small() with the same
size, except for the headroom, but the tailroom has reserved room for
skb_shared_info. The headroom is encoded in the ctx pointer as a value.

In receive_mergeable() the frame size is more dynamic. There are two
basic cases: (1) the buffer size is based on an exponentially weighted
moving average (see DECLARE_EWMA) of packet length, or (2) in case
virtnet_get_headroom() returns any headroom, the buffer size is
PAGE_SIZE. The ctx pointer is this time used for encoding two values:
the buffer len "truesize" and the headroom. In case (1), if the rx
buffer size is underestimated, the packet will have been split over more
buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed at the top of
the buffer area). If that happens, the XDP path does a
xdp_linearize_page operation.

V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.

The code is really hard to follow, so here are some hints to reviewers.
The receive_mergeable() case gets frames that were allocated in
add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
and the 'buf' ptr is advanced by this headroom.  The headroom can only
be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
simple:

  static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
  {
return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
  }

As frame_sz is an offset size from xdp.data_hard_start, reviewers
should notice how this is calculated in receive_mergeable():

  int offset = buf - page_address(page);
  [...]
  data = page_address(xdp_page) + offset;
  xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;

The calculated offset will always be VIRTIO_XDP_HEADROOM when
reaching this code.  Thus, xdp.data_hard_start will be the page-start
address plus vi->hdr_len.  Given this, xdp.frame_sz needs to be
reduced by vi->hdr_len.

IMHO a followup patch should clean up this code to make it easier
to maintain and understand, but it is outside the scope of this
patchset.
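
The arithmetic above can be checked with a small userspace model that
tracks byte offsets from the start of the page instead of real
pointers. The struct and function names are hypothetical, and the
12-byte header length used in the note below is an assumption about a
typical mergeable-rxbuf setup.

```c
#include <stddef.h>

#define PAGE_SZ             4096u /* assumed PAGE_SIZE */
#define VIRTIO_XDP_HEADROOM 256u

/* Offsets are relative to the start of the page backing the buffer. */
struct xdp_layout {
	size_t data_hard_start;
	size_t data;
	unsigned int frame_sz;
};

/* Model of the receive_mergeable() setup: 'offset' is buf - page start,
 * 'alloc_sz' is the size of the allocation backing the frame.
 */
static struct xdp_layout mergeable_layout(size_t offset, unsigned int hdr_len,
					  unsigned int alloc_sz)
{
	struct xdp_layout x;

	x.data_hard_start = offset - VIRTIO_XDP_HEADROOM + hdr_len;
	x.data = offset + hdr_len;
	/* frame_sz counts from data_hard_start, so subtract hdr_len */
	x.frame_sz = alloc_sz - hdr_len;
	return x;
}
```

For offset == VIRTIO_XDP_HEADROOM and an assumed hdr_len of 12,
data_hard_start sits 12 bytes into the page and frame_sz is 4084, so
data_hard_start + frame_sz lands exactly on the end of the page.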

Cc: Jason Wang 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Michael S. Tsirkin 
Acked-by: Jason Wang 
---
 drivers/net/virtio_net.c |   15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 11f722460513..9e1b5d748586 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -689,6 +689,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
xdp.data_end = xdp.data + len;
xdp.data_meta = xdp.data;
xdp.rxq = &rq->xdp_rxq;
+   xdp.frame_sz = buflen;
orig_data = xdp.data;
act = bpf_prog_run_xdp(xdp_prog, &xdp);
stats->xdp_packets++;
@@ -797,10 +798,11 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
int offset = buf - page_address(page);
struct sk_buff *head_skb, *curr_skb;
struct bpf_prog *xdp_prog;
-   unsigned int truesize;
+   unsigned int truesize = mergeable_ctx_to_truesize(ctx);
unsigned int headroom = mergeable_ctx_to_headroom(ctx);
-   int err;
unsigned int metasize = 0;
+   unsigned int frame_sz;
+   int err;
 
head_skb = NULL;
stats->bytes += len - vi->hdr_len;
@@ -821,6 +823,11 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
if (unlikely(hdr->hdr.gso_type))
goto err_xdp;
 
+   /* Buffers with headroom use PAGE_SIZE as alloc size,
+* see add_recvbuf_mergeable() + get_mergeable_buf_len()
+*/
+   frame_sz = headroom ? PAGE_SIZE : truesize;
+
/* This happens when rx buffer size is underestimated
 * or headroom is not enough because of the buffer
 * was refilled before XDP is set. This should only
@@ -834,6 +841,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
  page, offset,
  VIRTIO_XDP_HEADROOM,
  &len);
+   frame_sz = PAGE_SIZE;
+
if (!xdp_page)
goto err_xdp;
offset = VIRTIO_XDP_HEADROOM;
@@ -850,6 +859,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
xdp.data_end = xdp.data + (len - vi->hdr_len);
xdp.data_meta = xdp.data;
xdp.rxq = &rq->xdp_rxq;
+   xdp.frame_sz = frame_sz - vi->hdr_len;
 
act = bpf_prog_run_xdp(xd

[PATCH net-next v4 19/33] tun: add XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

The tun driver has two code paths for running XDP (bpf_prog_run_xdp).
In both cases 'buflen' contains enough tailroom for skb_shared_info.

Cc: Jason Wang 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Michael S. Tsirkin 
Acked-by: Jason Wang 
---
 drivers/net/tun.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 44889eba1dbc..c54f967e2c66 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1671,6 +1671,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
xdp.rxq = &tfile->xdp_rxq;
+   xdp.frame_sz = buflen;
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);
if (act == XDP_REDIRECT || act == XDP_TX) {
@@ -2411,6 +2412,7 @@ static int tun_xdp_one(struct tun_struct *tun,
}
xdp_set_data_meta_invalid(xdp);
xdp->rxq = &tfile->xdp_rxq;
+   xdp->frame_sz = buflen;
 
act = bpf_prog_run_xdp(xdp_prog, xdp);
err = tun_xdp_act(tun, xdp_prog, xdp, act);

[PATCH net-next v4 22/33] ixgbe: fix XDP redirect on archs with PAGE_SIZE above 4K

2020-05-14 Thread Jesper Dangaard Brouer

The ixgbe driver uses a different memory model when compiled on archs
with PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page
in two halves, but instead increments rx_buffer->page_offset by the
truesize of the packet (which includes headroom and tailroom for
skb_shared_info).

This is done correctly in ixgbe_build_skb(), but ixgbe_rx_buffer_flip,
which is currently only called on XDP_TX and XDP_REDIRECT, forgets
to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT for
veth and cpumap.  Fix by adding the size of the skb_shared_info tailroom.

Maintainers notice: This fix has been queued to Jeff.
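
To illustrate why the missing tailroom matters, the sketch below models
the large-page truesize calculation before and after the fix and checks
whether the next RX buffer would start inside the area that
skb_shared_info occupies. CACHE_LINE and SHINFO_SIZE are assumed
values, and the function names are hypothetical.

```c
#define CACHE_LINE  64u   /* assumed SMP_CACHE_BYTES */
#define SHINFO_SIZE 320u  /* assumed sizeof(struct skb_shared_info) */

/* Userspace stand-in for SKB_DATA_ALIGN(). */
static unsigned int data_align(unsigned int x)
{
	return (x + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* build_skb case before the fix: no room accounted for shared info. */
static unsigned int truesize_old(unsigned int pad, unsigned int size)
{
	return data_align(pad + size);
}

/* build_skb case after the fix. */
static unsigned int truesize_fixed(unsigned int pad, unsigned int size)
{
	return data_align(pad + size) + data_align(SHINFO_SIZE);
}

/* Returns 1 when advancing page_offset by 'truesize' would place the
 * next buffer on top of this frame's skb_shared_info area.
 */
static int next_buf_overlaps_shinfo(unsigned int truesize, unsigned int pad,
				    unsigned int size)
{
	unsigned int shinfo_end = data_align(pad + size) +
				  data_align(SHINFO_SIZE);

	return truesize < shinfo_end;
}
```

With an assumed pad of 64 and a 1500-byte packet, the old calculation
advances by 1600 bytes and overlaps, while the fixed one advances by
1920 bytes and does not.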

Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect")
Cc: Jeff Kirsher 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 718931d951bc..ea6834bae04c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2254,7 +2254,8 @@ static void ixgbe_rx_buffer_flip(struct ixgbe_ring 
*rx_ring,
rx_buffer->page_offset ^= truesize;
 #else
unsigned int truesize = ring_uses_build_skb(rx_ring) ?
-   SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) :
+   SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) +
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
SKB_DATA_ALIGN(size);
 
rx_buffer->page_offset += truesize;

[PATCH net-next v4 11/33] dpaa2-eth: add XDP frame size

2020-05-14 Thread Jesper Dangaard Brouer

The dpaa2-eth driver reserves some headroom used for the hardware and
software annotation area in RX/TX buffers. Thus, xdp.data_hard_start
doesn't start at a page boundary.

When XDP is configured the area reserved via dpaa2_fd_get_offset(fd) is
448 bytes, of which XDP has reserved 256 bytes. As frame_sz is
calculated as an offset from xdp_buff.data_hard_start, an adjustment
from the full PAGE_SIZE == DPAA2_ETH_RX_BUF_RAW_SIZE is needed.
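
The adjustment works out as follows in a minimal userspace sketch. It
assumes a 4K PAGE_SIZE for DPAA2_ETH_RX_BUF_RAW_SIZE; the function names
are hypothetical and only mirror the expressions used in the diff
below.

```c
#define DPAA2_ETH_RX_BUF_RAW_SIZE 4096u /* equals PAGE_SIZE; 4K assumed */
#define XDP_PACKET_HEADROOM       256u

/* frame_sz seen by the BPF program: data_hard_start sits
 * (fd_offset - XDP_PACKET_HEADROOM) bytes into the raw buffer.
 */
static unsigned int frame_sz_for_bpf(unsigned int fd_offset)
{
	return DPAA2_ETH_RX_BUF_RAW_SIZE - (fd_offset - XDP_PACKET_HEADROOM);
}

/* On redirect data_hard_start is moved back to the buffer start, so
 * the full raw buffer becomes the frame.
 */
static unsigned int frame_sz_for_redirect(void)
{
	return DPAA2_ETH_RX_BUF_RAW_SIZE;
}
```

With the 448-byte offset quoted above, the BPF program sees a frame
size of 4096 - (448 - 256) = 3904 bytes, and a redirected frame gets
the full 4096 bytes.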

When doing XDP_REDIRECT, the driver doesn't need this reserved headroom
any longer and allows xdp_do_redirect() to use it. This is an advantage
for the driver's own ndo_xdp_xmit, as it uses part of this headroom for
itself.  The patch also adjusts frame_sz in this case.

The driver cannot support XDP data_meta, because it uses the headroom
just before xdp.data for struct dpaa2_eth_swa (DPAA2_ETH_SWA_SIZE=64)
when transmitting the packet. When transmitting a xdp_frame in
dpaa2_eth_xdp_xmit_frame (called via ndo_xdp_xmit) it uses this area to
store a pointer to xdp_frame and dma_size, which is used in TX
completion (free_tx_fd) to return the frame via xdp_return_frame().

Cc: Ioana Radulescu 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 0f3e842a4fd6..8c8d95aa1dfd 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -331,6 +331,9 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
xdp_set_data_meta_invalid(&xdp);
xdp.rxq = &ch->xdp_rxq;
 
+   xdp.frame_sz = DPAA2_ETH_RX_BUF_RAW_SIZE -
+   (dpaa2_fd_get_offset(fd) - XDP_PACKET_HEADROOM);
+
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
/* xdp.data pointer may have changed */
@@ -366,7 +369,11 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
dma_unmap_page(priv->net_dev->dev.parent, addr,
   DPAA2_ETH_RX_BUF_SIZE, DMA_BIDIRECTIONAL);
ch->buf_count--;
+
+   /* Allow redirect use of full headroom */
xdp.data_hard_start = vaddr;
+   xdp.frame_sz = DPAA2_ETH_RX_BUF_RAW_SIZE;
+
err = xdp_do_redirect(priv->net_dev, &xdp, xdp_prog);
if (unlikely(err))
ch->stats.xdp_drop++;

[PATCH net-next v4 14/33] net: ethernet: ti: add XDP frame size to driver cpsw

2020-05-14 Thread Jesper Dangaard Brouer

The driver code cpsw.c and cpsw_new.c both use page_pool
with default order-0 pages for their RX-pages.

Cc: Grygorii Strashko 
Cc: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c |1 +
 drivers/net/ethernet/ti/cpsw_new.c |1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 09f98fa2fb4e..ce0645ada6e7 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -406,6 +406,7 @@ static void cpsw_rx_handler(void *token, int len, int 
status)
 
xdp.data_hard_start = pa;
xdp.rxq = &priv->xdp_rxq[ch];
+   xdp.frame_sz = PAGE_SIZE;
 
port = priv->emac_port + cpsw->data.dual_emac;
ret = cpsw_run_xdp(priv, ch, &xdp, page, port);
diff --git a/drivers/net/ethernet/ti/cpsw_new.c 
b/drivers/net/ethernet/ti/cpsw_new.c
index dce49311d3d3..1247d35d42ef 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -348,6 +348,7 @@ static void cpsw_rx_handler(void *token, int len, int 
status)
 
xdp.data_hard_start = pa;
xdp.rxq = &priv->xdp_rxq[ch];
+   xdp.frame_sz = PAGE_SIZE;
 
ret = cpsw_run_xdp(priv, ch, &xdp, page, priv->emac_port);
if (ret != CPSW_XDP_PASS)

[PATCH net-next v4 16/33] mlx4: add XDP frame size and adjust max XDP MTU

2020-05-14 Thread Jesper Dangaard Brouer

In the mlx4 driver, the size of the memory backing the RX packet is
stored in frag_stride. For XDP mode this will be PAGE_SIZE (normally
4096). For normal mode frag_stride is 2048.

Also adjust MLX4_EN_MAX_XDP_MTU to take tailroom into account.
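
As a worked example of the MTU adjustment, the sketch below evaluates
the limit before and after reserving tailroom. PAGE_SZ, CACHE_LINE and
SHINFO_SIZE are assumed values (4K pages, x86_64-like sizes), and the
function names are hypothetical.

```c
#define PAGE_SZ             4096u /* assumed PAGE_SIZE */
#define CACHE_LINE          64u   /* assumed SMP_CACHE_BYTES */
#define SHINFO_SIZE         320u  /* assumed sizeof(struct skb_shared_info) */
#define ETH_HLEN            14u
#define VLAN_HLEN           4u
#define XDP_PACKET_HEADROOM 256u

/* Userspace stand-in for SKB_DATA_ALIGN(). */
static unsigned int data_align(unsigned int x)
{
	return (x + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* Limit before the patch: headers and XDP headroom only. */
static int max_xdp_mtu_old(void)
{
	return (int)(PAGE_SZ - ETH_HLEN - (2 * VLAN_HLEN) -
		     XDP_PACKET_HEADROOM);
}

/* Limit after the patch: also leave tailroom for skb_shared_info. */
static int max_xdp_mtu_new(void)
{
	return (int)(PAGE_SZ - ETH_HLEN - (2 * VLAN_HLEN) -
		     XDP_PACKET_HEADROOM - data_align(SHINFO_SIZE));
}
```

Under these assumptions the maximum XDP MTU drops from 3818 to 3498
bytes.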

Cc: Tariq Toukan 
Cc: Saeed Mahameed 
Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 43dcbd8214c6..5bd3cd37d50f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,7 +51,8 @@
 #include "en_port.h"
 
 #define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
-  XDP_PACKET_HEADROOM))
+   XDP_PACKET_HEADROOM -   \
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info))))
 
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 787139219813..8a10285b0e10 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -683,6 +683,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
rcu_read_lock();
xdp_prog = rcu_dereference(ring->xdp_prog);
xdp.rxq = &ring->xdp_rxq;
+   xdp.frame_sz = priv->frag_info[0].frag_stride;
doorbell_pending = 0;
 
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx

[PATCH net-next v4 18/33] nfp: add XDP frame size to netronome driver

2020-05-14 Thread Jesper Dangaard Brouer

The netronome nfp driver uses PAGE_SIZE when xdp_prog is set, but
xdp.data_hard_start begins at offset NFP_NET_RX_BUF_HEADROOM.
Thus, adjust for this when setting xdp.frame_sz, as it counts
from data_hard_start.

When doing XDP_TX this driver is smart and, instead of a full DMA-map,
does a DMA-sync with the packet length. As xdp_adjust_tail can now
grow the packet length, add checks to make sure that the grow size is
within the DMA-mapped size.
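
The bounds check described above can be modelled in userspace as
follows. The function name is hypothetical and the numeric values in
the note are made-up examples, not the driver's real buffer sizes.

```c
#include <stdbool.h>

/* Returns true when a packet of pkt_len bytes starting at dma_off still
 * fits inside the DMA-mapped part of the RX buffer.
 * fl_bufsz:  freelist buffer size
 * non_data:  bytes of the buffer that are not DMA-mapped packet data
 */
static bool xdp_tx_fits_dma(unsigned int fl_bufsz, unsigned int non_data,
			    unsigned int dma_off, unsigned int pkt_len)
{
	unsigned int dma_map_sz = fl_bufsz - non_data;

	/* Reject packets that xdp_adjust_tail grew beyond the DMA area */
	return pkt_len + dma_off <= dma_map_sz;
}
```

For example, with a hypothetical 2048-byte buffer of which 384 bytes
are non-data and a DMA offset of 256, a 1400-byte packet passes the
check, while one grown to 1500 bytes is rejected.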

Cc: Jakub Kicinski 
Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c|6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 9bfb3b077bc1..0e0cc3d58bdc 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1741,10 +1741,15 @@ nfp_net_tx_xdp_buf(struct nfp_net_dp *dp, struct 
nfp_net_rx_ring *rx_ring,
   struct nfp_net_rx_buf *rxbuf, unsigned int dma_off,
   unsigned int pkt_len, bool *completed)
 {
+   unsigned int dma_map_sz = dp->fl_bufsz - NFP_NET_RX_BUF_NON_DATA;
struct nfp_net_tx_buf *txbuf;
struct nfp_net_tx_desc *txd;
int wr_idx;
 
+   /* Reject if xdp_adjust_tail grow packet beyond DMA area */
+   if (pkt_len + dma_off > dma_map_sz)
+   return false;
+
if (unlikely(nfp_net_tx_full(tx_ring, 1))) {
if (!*completed) {
nfp_net_xdp_complete(tx_ring);
@@ -1817,6 +1822,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, 
int budget)
rcu_read_lock();
xdp_prog = READ_ONCE(dp->xdp_prog);
true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
+   xdp.frame_sz = PAGE_SIZE - NFP_NET_RX_BUF_HEADROOM;
xdp.rxq = &rx_ring->xdp_rxq;
tx_ring = r_vec->xdp_ring;

[PATCH net-next v4 24/33] ixgbevf: add XDP frame size to VF driver

2020-05-14 Thread Jesper Dangaard Brouer

This patch mirrors the changes to ixgbe in the previous patch.

This VF driver doesn't support XDP_REDIRECT, but correct tailroom is
still necessary for BPF-helper xdp_adjust_tail.  In legacy-mode +
larger PAGE_SIZE, due to lacking tailroom, we accept that
xdp_adjust_tail shrink doesn't work.

Cc: intel-wired-...@lists.osuosl.org
Cc: Jeff Kirsher 
Cc: Alexander Duyck 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   34 +
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4622c4ea2e46..a39e2cb384dd 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1095,19 +1095,31 @@ static struct sk_buff *ixgbevf_run_xdp(struct 
ixgbevf_adapter *adapter,
return ERR_PTR(-result);
 }
 
+static unsigned int ixgbevf_rx_frame_truesize(struct ixgbevf_ring *rx_ring,
+ unsigned int size)
+{
+   unsigned int truesize;
+
+#if (PAGE_SIZE < 8192)
+   truesize = ixgbevf_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
+#else
+   truesize = ring_uses_build_skb(rx_ring) ?
+   SKB_DATA_ALIGN(IXGBEVF_SKB_PAD + size) +
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+   SKB_DATA_ALIGN(size);
+#endif
+   return truesize;
+}
+
 static void ixgbevf_rx_buffer_flip(struct ixgbevf_ring *rx_ring,
   struct ixgbevf_rx_buffer *rx_buffer,
   unsigned int size)
 {
-#if (PAGE_SIZE < 8192)
-   unsigned int truesize = ixgbevf_rx_pg_size(rx_ring) / 2;
+   unsigned int truesize = ixgbevf_rx_frame_truesize(rx_ring, size);
 
+#if (PAGE_SIZE < 8192)
rx_buffer->page_offset ^= truesize;
 #else
-   unsigned int truesize = ring_uses_build_skb(rx_ring) ?
-   SKB_DATA_ALIGN(IXGBEVF_SKB_PAD + size) :
-   SKB_DATA_ALIGN(size);
-
rx_buffer->page_offset += truesize;
 #endif
 }
@@ -1125,6 +1137,11 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 
xdp.rxq = &rx_ring->xdp_rxq;
 
+   /* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
+#if (PAGE_SIZE < 8192)
+   xdp.frame_sz = ixgbevf_rx_frame_truesize(rx_ring, 0);
+#endif
+
while (likely(total_rx_packets < budget)) {
struct ixgbevf_rx_buffer *rx_buffer;
union ixgbe_adv_rx_desc *rx_desc;
@@ -1157,7 +1174,10 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
xdp.data_hard_start = xdp.data -
  ixgbevf_rx_offset(rx_ring);
xdp.data_end = xdp.data + size;
-
+#if (PAGE_SIZE > 4096)
+   /* At larger PAGE_SIZE, frame_sz depend on len size */
+   xdp.frame_sz = ixgbevf_rx_frame_truesize(rx_ring, size);
+#endif
skb = ixgbevf_run_xdp(adapter, rx_ring, &xdp);
}

[PATCH net-next v4 23/33] ixgbe: add XDP frame size to driver

2020-05-14 Thread Jesper Dangaard Brouer

This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning that for
normal MTU the frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still uses page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, the driver instead advances its rx_buffer->page_offset by the frame
size "truesize".

For XDP frame size calculations, this means that in PAGE_SIZE larger
than 4K mode the frame_sz changes on a per-packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.

The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).

There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There is zero headroom in this mode, whereas
headroom is a requirement for XDP-redirect to work. The conversion to
xdp_frame (convert_to_xdp_frame) will detect this insufficient space,
and the xdp_do_redirect() call will fail. This is deemed acceptable, as
it allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE, due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
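
The two memory models can be summarised in a userspace sketch that
takes the page size as a runtime parameter instead of a compile-time
#if. CACHE_LINE and SHINFO_SIZE are assumed values, the pad values in
the note are examples, and the function is a hypothetical model, not
the driver helper itself.

```c
#define CACHE_LINE  64u   /* assumed SMP_CACHE_BYTES */
#define SHINFO_SIZE 320u  /* assumed sizeof(struct skb_shared_info) */

/* Userspace stand-in for SKB_DATA_ALIGN(). */
static unsigned int data_align(unsigned int x)
{
	return (x + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* Model of the per-frame truesize:
 * page_size  - PAGE_SIZE of the architecture
 * rx_pg_size - size of the RX page allocation (PAGE_SIZE << order)
 * build_skb  - nonzero when the ring uses build_skb()
 * pad        - headroom placed in front of the packet
 * size       - received packet length
 */
static unsigned int rx_frame_truesize(unsigned int page_size,
				      unsigned int rx_pg_size, int build_skb,
				      unsigned int pad, unsigned int size)
{
	if (page_size < 8192)
		return rx_pg_size / 2;	/* page split: constant per ring */

	return build_skb ?
		data_align(pad + size) + data_align(SHINFO_SIZE) :
		data_align(size);	/* legacy-rx: no head/tailroom */
}
```

In this model a 4K-page system yields 2048 bytes per frame with order-0
pages and 4096 bytes with order-1 pages regardless of packet length,
whereas on a 64K-page system a 1500-byte packet yields 1920 bytes with
build_skb (assuming a 64-byte pad) and 1536 bytes in legacy-rx mode.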

Cc: intel-wired-...@lists.osuosl.org
Cc: Jeff Kirsher 
Cc: Alexander Duyck 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   34 +++--
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ea6834bae04c..eab5934b04f5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2244,20 +2244,30 @@ static struct sk_buff *ixgbe_run_xdp(struct 
ixgbe_adapter *adapter,
return ERR_PTR(-result);
 }
 
+static unsigned int ixgbe_rx_frame_truesize(struct ixgbe_ring *rx_ring,
+   unsigned int size)
+{
+   unsigned int truesize;
+
+#if (PAGE_SIZE < 8192)
+   truesize = ixgbe_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
+#else
+   truesize = ring_uses_build_skb(rx_ring) ?
+   SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) +
+   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+   SKB_DATA_ALIGN(size);
+#endif
+   return truesize;
+}
+
 static void ixgbe_rx_buffer_flip(struct ixgbe_ring *rx_ring,
 struct ixgbe_rx_buffer *rx_buffer,
 unsigned int size)
 {
+   unsigned int truesize = ixgbe_rx_frame_truesize(rx_ring, size);
 #if (PAGE_SIZE < 8192)
-   unsigned int truesize = ixgbe_rx_pg_size(rx_ring) / 2;
-
rx_buffer->page_offset ^= truesize;
 #else
-   unsigned int truesize = ring_uses_build_skb(rx_ring) ?
-   SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) +
-   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
-   SKB_DATA_ALIGN(size);
-
rx_buffer->page_offset += truesize;
 #endif
 }
@@ -2291,6 +2301,11 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 
xdp.rxq = &rx_ring->xdp_rxq;
 
+   /* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
+#if (PAGE_SIZE < 8192)
+   xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, 0);
+#endif
+
while (likely(total_rx_packets < budget)) {
union ixgbe_adv_rx_desc *rx_desc;
struct ixgbe_rx_buffer *rx_buffer;
@@ -2324,7 +2339,10 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
xdp.data_hard_start = xdp.data -
  ixgbe_rx_offset(rx_ring);
xdp.data_end = xdp.data + size;
-
+#if (PAGE_SIZE > 4096)
+   /* At larger PAGE_SIZE, frame_sz depend on len size */
+   xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, size);
+#endif
skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
}
