date:20200609

Re: [PATCH net] net: mvneta: do not redirect frames during reconfiguration

2020-06-09 Thread Lorenzo Bianconi

> On Tue, Jun 09, 2020 at 12:02:39AM +0200, Lorenzo Bianconi wrote:
> > Disable frames injection in mvneta_xdp_xmit routine during hw
> > re-configuration in order to avoid hardware hangs
> 
> Hi Lorenzo
> 
> Why does mvneta_tx() also not need the same protection?
> 
> Andrew

Hi Andrew,

So far I have not been able to trigger the issue in the legacy tx path.
I hit the problem adding the capability to attach an eBPF program to CPUMAP
entries [1]. In particular I am redirecting traffic to mvneta and concurrently
attaching/removing a XDP program to/from it.
I am not sure this can occur running mvneta_tx().
Moreover it seems a common pattern for .ndo_xdp_xmit() in other drivers
(e.g ixgbe, bnxt, mlx5)

Regards,
Lorenzo

[1] 
https://patchwork.ozlabs.org/project/netdev/cover/cover.1590960613.git.lore...@kernel.org/

signature.asc
Description: PGP signature

Re: [PATCHv2 ipsec] xfrm: fix a warning in xfrm_policy_insert_list

2020-06-09 Thread Xin Long

 a, .

On Mon, Jun 8, 2020 at 8:02 PM Tobias Brunner  wrote:
>
> Hi Steffen, Xin,
>
> This change could be problematic.  Actually, it's not really this one
> but the original one that causes the issue:
> > Fixes: 7cb8a93968e3 ("xfrm: Allow inserting policies with matching mark and 
> > different priorities")
>
> However, because the code in xfrm_policy_mark_match() treated policies
> with the same mark/mask equal without considering the priority before
> this change, it wasn't apparent.  The problem is that the code can now
> lead to duplicate policies, which can not correctly be removed or queried.
It's a problem that can't be removed, but not for being queried.

>
> That's because the priority is sent only in xfrm_userpolicy_info, which
> XFRM_MSG_NEWPOLICY and XFRM_MSG_UPDPOLICY expect, but not in
> xfrm_userpolicy_id, which is used to query and delete policies with
> XFRM_MSG_GETPOLICY and XFRM_MSG_DELPOLICY, respectively (the mark is
> sent with a separate attribute, which can be supplied to all commands).
>  So we can only query/delete the duplicate policy with the highest
> priority.  Such duplicates can even be created inadvertently via
> XFRM_MSG_UPDPOLICY if the priority of an existing policy should be
> changed, which worked fine so far.
priority doesn't exist in xfrm_userpolicy_id, and we can add
XFRMA_PRIORITY like XFRMA_MARK to fix it.
since we also take priority to make a policy unique, we can't update
this attribute.

>
> The latter currently happens when strongSwan e.g. replaces a trap or
> block policy with one that has templates assigned (those we install with
> a higher priority, by default), which uses XFRM_MSG_UPDPOLICY that
> doesn't update the existing policy anymore but creates a duplicate
> instead.  Since only one XFRM_MSG_DELPOLICY is sent later when the
> policy is deleted, a duplicate policy remains (and we couldn't even
> delete the exact policy we wanted - the trap might be removed by the
> user before the regular policy - due to the missing priority field in
> xfrm_userpolicy_id).
I see.

>
> I guess we could workaround this issue in strongSwan by installing
> policies that share the same mark and selector with the same priority,
> so only one instance is ever installed in the kernel.  But the inability
> to address the exact policy when querying/deleting still looks like a
> problem to me in general.
>
For deleting, yes, but for querying, I think it makes sense not to pass
the priority, and always get the policy with the highest priority.

We can separate the deleting path from the querying path when
XFRMA_PRIORITY attribute is set.

Is that enough for your case to only fix for the policy deleting?

[PATCH v3 2/3] netlink: add master/slave configuration support

2020-06-09 Thread Oleksij Rempel

This UAPI is needed for BroadR-Reach 100BASE-T1 devices. Due to lack of
auto-negotiation support, we needed to be able to configure the
MASTER-SLAVE role of the port manually or from an application in user
space.

The same UAPI can be used for 1000BASE-T or MultiGBASE-T devices to
force MASTER or SLAVE role. See IEEE 802.3-2018:
22.2.4.3.7 MASTER-SLAVE control register (Register 9)
22.2.4.3.8 MASTER-SLAVE status register (Register 10)
40.5.2 MASTER-SLAVE configuration resolution
45.2.1.185.1 MASTER-SLAVE config value (1.2100.14)
45.2.7.10 MultiGBASE-T AN control 1 register (Register 7.32)

The MASTER-SLAVE role affects the clock configuration:

---
When the  PHY is configured as MASTER, the PMA Transmit function shall
source TX_TCLK from a local clock source. When configured as SLAVE, the
PMA Transmit function shall source TX_TCLK from the clock recovered from
data stream provided by MASTER.

iMX6Q KSZ9031XXX
--\/---\/\
  ||   |||
 MAC  || PHY Slave |<-->| PHY Master |
  |<--- 125 MHz ---+-<--/  || \  |
--/\---/\/
   ^
\-TX_TCLK

---

Since some clock or link related issues are only reproducible in a
specific MASTER-SLAVE-role, MAC and PHY configuration, it is beneficial
to provide generic (not 100BASE-T1 specific) interface to the user space
for configuration flexibility and trouble shooting.

Signed-off-by: Oleksij Rempel 
---
 ethtool.8.in   | 19 
 ethtool.c  |  1 +
 netlink/desc-ethtool.c |  2 ++
 netlink/settings.c | 50 ++
 4 files changed, 72 insertions(+)

diff --git a/ethtool.8.in b/ethtool.8.in
index f8b09e0..dca7c7a 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -52,6 +52,10 @@
 .\"
 .ds MA 
\fIxx\fP\fB:\fP\fIyy\fP\fB:\fP\fIzz\fP\fB:\fP\fIaa\fP\fB:\fP\fIbb\fP\fB:\fP\fIcc\fP
 .\"
+.\"\(*MS - master-slave property
+.\"
+.ds MS 
\fBpreferred-master\fP|\fBpreferred-slave\fP|\fBforced-master\fP|\fBforced-slave\fP
+.\"
 .\"\(*PA - IP address
 .\"
 .ds PA \fIip-address\fP
@@ -255,6 +259,7 @@ ethtool \- query or control network driver and hardware 
settings
 .RB [ wol \ \fIN\fP[\fB/\fP\fIM\fP]
 .RB | \ wol \ \*(WO]
 .RB [ sopass \ \*(MA]
+.RB [ master-slave \ \*(MS]
 .RB [ msglvl
 .IR N\fP[/\fIM\fP] \ |
 .BI msglvl \ type
@@ -646,6 +651,20 @@ Sets full or half duplex mode.
 .A4 port tp aui bnc mii fibre da
 Selects device port.
 .TP
+.BR master-slave \ \*(MS
+Configure MASTER/SLAVE role of the PHY. When the PHY is configured as MASTER,
+the PMA Transmit function shall source TX_TCLK from a local clock source. When
+configured as SLAVE, the PMA Transmit function shall source TX_TCLK from the
+clock recovered from data stream provided by MASTER. Not all devices support 
this.
+.TS
+nokeep;
+lB l.
+preferred-master Prefer MASTER role on autonegotiation
+preferred-slave Prefer SLAVE role on autonegotiation
+forced-masterForce the PHY in MASTER role. Can be used without 
autonegotiation
+forced-slave Force the PHY in SLAVE role. Can be used without 
autonegotiation
+.TE
+.TP
 .A3 mdix auto on off
 Selects MDI-X mode for port. May be used to override the automatic
 detection feature of most adapters. An argument of \fBauto\fR means
diff --git a/ethtool.c b/ethtool.c
index 900880a..a6e9bfc 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -5196,6 +5196,7 @@ static const struct option args[] = {
  " [ wol %d[/%d] | p|u|m|b|a|g|s|f|d... 
]\n"
  " [ sopass %x:%x:%x:%x:%x:%x ]\n"
  " [ msglvl %d[/%d] | type on|off ... [--] 
]\n"
+ " [ master-slave 
master-preferred|slave-preferred|master-force|slave-force ]\n"
},
{
.opts   = "-a|--show-pause",
diff --git a/netlink/desc-ethtool.c b/netlink/desc-ethtool.c
index 76c6f13..b0a793c 100644
--- a/netlink/desc-ethtool.c
+++ b/netlink/desc-ethtool.c
@@ -85,6 +85,8 @@ static const struct pretty_nla_desc __linkmodes_desc[] = {
NLATTR_DESC_NESTED(ETHTOOL_A_LINKMODES_PEER, bitset),
NLATTR_DESC_U32(ETHTOOL_A_LINKMODES_SPEED),
NLATTR_DESC_U8(ETHTOOL_A_LINKMODES_DUPLEX),
+   NLATTR_DESC_U8(ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG),
+   NLATTR_DESC_U8(ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE),
 };
 
 static const struct pretty_nla_desc __linkstate_desc[] = {
diff --git a/netlink/settings.c b/netlink/settings.c
index 8be5a22..3a5a237 100644
--- a/netlink/settings.c
+++ b/netlink/settings.c
@@ -37,6 +37,21 @@ static const char *

[PATCH v3 1/3] update UAPI header copies

2020-06-09 Thread Oleksij Rempel

Update to net-dev:
dc0f3ed1973 ("net: phy: at803x: add cable diagnostics support for ATH9331 and 
ATH8032")

Signed-off-by: Oleksij Rempel 
---
 uapi/linux/ethtool.h |  25 ++-
 uapi/linux/ethtool_netlink.h | 326 +++
 uapi/linux/genetlink.h   |   2 +
 uapi/linux/if_link.h |   7 +-
 uapi/linux/net_tstamp.h  |   6 +
 uapi/linux/netlink.h | 103 +++
 uapi/linux/rtnetlink.h   |   6 +
 7 files changed, 471 insertions(+), 4 deletions(-)

diff --git a/uapi/linux/ethtool.h b/uapi/linux/ethtool.h
index d3dcb45..6074caa 100644
--- a/uapi/linux/ethtool.h
+++ b/uapi/linux/ethtool.h
@@ -594,6 +594,9 @@ struct ethtool_pauseparam {
  * @ETH_SS_LINK_MODES: link mode names
  * @ETH_SS_MSG_CLASSES: debug message class names
  * @ETH_SS_WOL_MODES: wake-on-lan modes
+ * @ETH_SS_SOF_TIMESTAMPING: SOF_TIMESTAMPING_* flags
+ * @ETH_SS_TS_TX_TYPES: timestamping Tx types
+ * @ETH_SS_TS_RX_FILTERS: timestamping Rx filters
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -608,6 +611,9 @@ enum ethtool_stringset {
ETH_SS_LINK_MODES,
ETH_SS_MSG_CLASSES,
ETH_SS_WOL_MODES,
+   ETH_SS_SOF_TIMESTAMPING,
+   ETH_SS_TS_TX_TYPES,
+   ETH_SS_TS_RX_FILTERS,
 
/* add new constants above here */
ETH_SS_COUNT
@@ -1521,8 +1527,7 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_40baseLR8_ER8_FR8_Full_BIT = 71,
ETHTOOL_LINK_MODE_40baseDR8_Full_BIT = 72,
ETHTOOL_LINK_MODE_40baseCR8_Full_BIT = 73,
-   ETHTOOL_LINK_MODE_FEC_LLRS_BIT   = 74,
-
+   ETHTOOL_LINK_MODE_FEC_LLRS_BIT   = 74,
/* must be last entry */
__ETHTOOL_LINK_MODE_MASK_NBITS
 };
@@ -1659,6 +1664,18 @@ static __inline__ int ethtool_validate_duplex(__u8 
duplex)
return 0;
 }
 
+#define MASTER_SLAVE_CFG_UNSUPPORTED   0
+#define MASTER_SLAVE_CFG_UNKNOWN   1
+#define MASTER_SLAVE_CFG_MASTER_PREFERRED  2
+#define MASTER_SLAVE_CFG_SLAVE_PREFERRED   3
+#define MASTER_SLAVE_CFG_MASTER_FORCE  4
+#define MASTER_SLAVE_CFG_SLAVE_FORCE   5
+#define MASTER_SLAVE_STATE_UNSUPPORTED 0
+#define MASTER_SLAVE_STATE_UNKNOWN 1
+#define MASTER_SLAVE_STATE_MASTER  2
+#define MASTER_SLAVE_STATE_SLAVE   3
+#define MASTER_SLAVE_STATE_ERR 4
+
 /* Which connector port. */
 #define PORT_TP0x00
 #define PORT_AUI   0x01
@@ -1897,7 +1914,9 @@ struct ethtool_link_settings {
__u8eth_tp_mdix_ctrl;
__s8link_mode_masks_nwords;
__u8transceiver;
-   __u8reserved1[3];
+   __u8master_slave_cfg;
+   __u8master_slave_state;
+   __u8reserved1[1];
__u32   reserved[7];
__u32   link_mode_masks[0];
/* layout of link_mode_masks fields:
diff --git a/uapi/linux/ethtool_netlink.h b/uapi/linux/ethtool_netlink.h
index ad6d3a0..051b35b 100644
--- a/uapi/linux/ethtool_netlink.h
+++ b/uapi/linux/ethtool_netlink.h
@@ -24,6 +24,23 @@ enum {
ETHTOOL_MSG_DEBUG_SET,
ETHTOOL_MSG_WOL_GET,
ETHTOOL_MSG_WOL_SET,
+   ETHTOOL_MSG_FEATURES_GET,
+   ETHTOOL_MSG_FEATURES_SET,
+   ETHTOOL_MSG_PRIVFLAGS_GET,
+   ETHTOOL_MSG_PRIVFLAGS_SET,
+   ETHTOOL_MSG_RINGS_GET,
+   ETHTOOL_MSG_RINGS_SET,
+   ETHTOOL_MSG_CHANNELS_GET,
+   ETHTOOL_MSG_CHANNELS_SET,
+   ETHTOOL_MSG_COALESCE_GET,
+   ETHTOOL_MSG_COALESCE_SET,
+   ETHTOOL_MSG_PAUSE_GET,
+   ETHTOOL_MSG_PAUSE_SET,
+   ETHTOOL_MSG_EEE_GET,
+   ETHTOOL_MSG_EEE_SET,
+   ETHTOOL_MSG_TSINFO_GET,
+   ETHTOOL_MSG_CABLE_TEST_ACT,
+   ETHTOOL_MSG_CABLE_TEST_TDR_ACT,
 
/* add new constants above here */
__ETHTOOL_MSG_USER_CNT,
@@ -43,6 +60,24 @@ enum {
ETHTOOL_MSG_DEBUG_NTF,
ETHTOOL_MSG_WOL_GET_REPLY,
ETHTOOL_MSG_WOL_NTF,
+   ETHTOOL_MSG_FEATURES_GET_REPLY,
+   ETHTOOL_MSG_FEATURES_SET_REPLY,
+   ETHTOOL_MSG_FEATURES_NTF,
+   ETHTOOL_MSG_PRIVFLAGS_GET_REPLY,
+   ETHTOOL_MSG_PRIVFLAGS_NTF,
+   ETHTOOL_MSG_RINGS_GET_REPLY,
+   ETHTOOL_MSG_RINGS_NTF,
+   ETHTOOL_MSG_CHANNELS_GET_REPLY,
+   ETHTOOL_MSG_CHANNELS_NTF,
+   ETHTOOL_MSG_COALESCE_GET_REPLY,
+   ETHTOOL_MSG_COALESCE_NTF,
+   ETHTOOL_MSG_PAUSE_GET_REPLY,
+   ETHTOOL_MSG_PAUSE_NTF,
+   ETHTOOL_MSG_EEE_GET_REPLY,
+   ETHTOOL_MSG_EEE_NTF,
+   ETHTOOL_MSG_TSINFO_GET_REPLY,
+   ETHTOOL_MSG_CABLE_TEST_NTF,
+   ETHTOOL_MSG_CABLE_TEST_TDR_NTF,
 
/* add new constants above here */
__ETHTOOL_MSG_KERNEL_CNT,
@@ -185,6 +220,8 @@ enum {
ETHTOOL_A_LINKMODES_PEER,   /* bitset */
ETHTOOL_A_LINKMODES_SPEED,  /* u32 */
ETHTOOL_A_LINKMODES_DUPLEX, /* u8 */
+   ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG,   /* u8 */
+

[PATCH v3 3/3] netlink: add LINKSTATE SQI support

2020-06-09 Thread Oleksij Rempel

Some PHYs provide Signal Quality Index (SQI) if the link is in active
state. This information can help to diagnose cable and system design
related issues.

Signed-off-by: Oleksij Rempel 
Reviewed-by: Florian Fainelli 
Reviewed-by: Michal Kubecek 
---
 netlink/desc-ethtool.c |  2 ++
 netlink/settings.c | 16 
 2 files changed, 18 insertions(+)

diff --git a/netlink/desc-ethtool.c b/netlink/desc-ethtool.c
index b0a793c..8f4c36b 100644
--- a/netlink/desc-ethtool.c
+++ b/netlink/desc-ethtool.c
@@ -93,6 +93,8 @@ static const struct pretty_nla_desc __linkstate_desc[] = {
NLATTR_DESC_INVALID(ETHTOOL_A_LINKSTATE_UNSPEC),
NLATTR_DESC_NESTED(ETHTOOL_A_LINKSTATE_HEADER, header),
NLATTR_DESC_BOOL(ETHTOOL_A_LINKSTATE_LINK),
+   NLATTR_DESC_U32(ETHTOOL_A_LINKSTATE_SQI),
+   NLATTR_DESC_U32(ETHTOOL_A_LINKSTATE_SQI_MAX),
 };
 
 static const struct pretty_nla_desc __debug_desc[] = {
diff --git a/netlink/settings.c b/netlink/settings.c
index 3a5a237..c360167 100644
--- a/netlink/settings.c
+++ b/netlink/settings.c
@@ -638,6 +638,22 @@ int linkstate_reply_cb(const struct nlmsghdr *nlhdr, void 
*data)
printf("\tLink detected: %s\n", val ? "yes" : "no");
}
 
+   if (tb[ETHTOOL_A_LINKSTATE_SQI]) {
+   uint32_t val = mnl_attr_get_u32(tb[ETHTOOL_A_LINKSTATE_SQI]);
+
+   print_banner(nlctx);
+   printf("\tSQI: %u", val);
+
+   if (tb[ETHTOOL_A_LINKSTATE_SQI_MAX]) {
+   uint32_t max;
+
+   max = mnl_attr_get_u32(tb[ETHTOOL_A_LINKSTATE_SQI_MAX]);
+   printf("/%u\n", max);
+   } else {
+   printf("\n");
+   }
+   }
+
return MNL_CB_OK;
 }
 
-- 
2.27.0

[PATCH v3 0/3] Add support for SQI and master-slave

2020-06-09 Thread Oleksij Rempel

This patch set is extending ethtool to make it more usable on automotive
PHYs like NXP TJA11XX.

They make use of new KAPI (currently in net-next, will go probably to the
kernel 5.8-rc1):
- PHY master-slave role configuration and status informaton. Mostly needed
  for 100Base-T1 PHYs due the lack of autonegatiation support.
- Signal Quality Index to investigate cable related issues.

changes v3:
- rename "Port mode" to "master-slave"
- use [preferred|forced]-[master|slave] for information and
  configuration

changes v2:
- add master-slave information to the "ethtool --help" and man page
- move KAPI update changes to the separate patch. 

Oleksij Rempel (3):
  update UAPI header copies
  netlink: add master/slave configuration support
  netlink: add LINKSTATE SQI support

 ethtool.8.in |  19 ++
 ethtool.c|   1 +
 netlink/desc-ethtool.c   |   4 +
 netlink/settings.c   |  66 +++
 uapi/linux/ethtool.h |  25 ++-
 uapi/linux/ethtool_netlink.h | 326 +++
 uapi/linux/genetlink.h   |   2 +
 uapi/linux/if_link.h |   7 +-
 uapi/linux/net_tstamp.h  |   6 +
 uapi/linux/netlink.h | 103 +++
 uapi/linux/rtnetlink.h   |   6 +
 11 files changed, 561 insertions(+), 4 deletions(-)

-- 
2.27.0

Re: fentry/fexit attach to EXT type XDP program does not work

2020-06-09 Thread Eelco Chaudron





On 8 Jun 2020, at 18:58, Yonghong Song wrote:


On 6/8/20 7:11 AM, Eelco Chaudron wrote:

I'm trying for a while to do a fentry/fexit trace an EXT program
attached to an XDP program. To make it easier to explain I've
created a test case (see patch below) to show the issue.

Without the changes to test_xdp_bpf2bpf.c I'll get the following 
error:


   libbpf: -- BEGIN DUMP LOG ---
   libbpf:
   arg#0 type is not a struct
   Unrecognized arg#0 type PTR
   ; int BPF_PROG(trace_on_entry, struct xdp_buff *xdp)
   0: (79) r6 = *(u64 *)(r1 +0)
   invalid bpf_context access off=0 size=8
   processed 1 insns (limit 100) max_states_per_insn 0 
total_states 0 peak_states 0 mark_read 0


   libbpf: -- END LOG --
   libbpf: failed to load program 'fentry/FUNC'
   libbpf: failed to load object 'test_xdp_bpf2bpf'
   libbpf: failed to load BPF skeleton 'test_xdp_bpf2bpf': -4007
   test_xdp_fentry_ext:FAIL:__load ftrace skeleton failed
   #91 xdp_fentry_ext:FAIL
   Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

With the change I get the following (but I do feel this change
should not be needed):

   libbpf: -- BEGIN DUMP LOG ---
   libbpf:
   Unrecognized arg#0 type PTR
   ; int trace_on_entry(struct xdp_buff *xdp)
   0: (bf) r6 = r1
   ; void *data = (void *)(long)xdp->data;
   1: (79) r1 = *(u64 *)(r6 +0)
   invalid bpf_context access off=0 size=8
   processed 2 insns (limit 100) max_states_per_insn 0 
total_states 0 peak_states 0 mark_read 0


   libbpf: -- END LOG --
   libbpf: failed to load program 'fentry/FUNC'
   libbpf: failed to load object 'test_xdp_bpf2bpf'
   libbpf: failed to load BPF skeleton 'test_xdp_bpf2bpf': -4007
   test_xdp_fentry_ext:FAIL:__load ftrace skeleton failed
   #91 xdp_fentry_ext:FAIL
   Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

Any idea what could be the case here? The same fentry/fexit attach
code works fine in the xdp_bpf2bpf.c tests case.





I think this is not supported now. That is, you cannot attach a fentry 
trace
to the EXT program. The current implementation for fentry program 
simply
trying to find and match the signature of freplace program which by 
default

is a pointer to void.

It is doable in that in kernel we could recognize to-be-attached 
program is
a freplace and further trace down to find the real signature. The 
related
kernel function is btf_get_prog_ctx_type(). You can try to implement 
by yourself

or I can have a patch for this once bpf-next opens.


I’m not familiar with this area of the code, so if you could prepare a 
patch that would nice.
You can also send it to me before bpf-next opens and I can verify it, 
and clean up the self-test so it can be included as well.

Re: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

2020-06-09 Thread Jakub Sitnicki

On Fri, Jun 05, 2020 at 10:46 AM CEST, dihu wrote:
> When user application calls read() with MSG_PEEK flag to read data
> of bpf sockmap socket, kernel panic happens at
> __tcp_bpf_recvmsg+0x12c/0x350. sk_msg is not removed from ingress_msg
> queue after read out under MSG_PEEK flag is set. Because it's not
> judged whether sk_msg is the last msg of ingress_msg queue, the next
> sk_msg may be the head of ingress_msg queue, whose memory address of
> sg page is invalid. So it's necessary to add check codes to prevent
> this problem.
>
> [20759.125457] BUG: kernel NULL pointer dereference, address:
> 0008
> [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: GE
> 5.4.32 #1
> [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS
> 4.1.12 06/18/2017
> [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
> [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
> [20759.276099] tcp_bpf_recvmsg+0x113/0x370
> [20759.281137] inet_recvmsg+0x55/0xc0
> [20759.285734] __sys_recvfrom+0xc8/0x130
> [20759.290566] ? __audit_syscall_entry+0x103/0x130
> [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
> [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
> [20759.307235] __x64_sys_recvfrom+0x24/0x30
> [20759.312226] do_syscall_64+0x55/0x1b0
> [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: dihu 
> ---
>  net/ipv4/tcp_bpf.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 5a05327..b82e4c3 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
> *psock,
>   } while (i != msg_rx->sg.end);
>  
>   if (unlikely(peek)) {
> + if (msg_rx == list_last_entry(&psock->ingress_msg,
> +   struct sk_msg, list))
> + break;
>   msg_rx = list_next_entry(msg_rx, list);
>   continue;
>   }

Acked-by: Jakub Sitnicki

[PATCH v3 00/15] Forward MSI-X vector enable error code in pci_alloc_irq_vectors_affinity()

2020-06-09 Thread Piotr Stankiewicz

The primary objective of this patch series is to change the behaviour
of pci_alloc_irq_vectors_affinity() such that it forwards the MSI-X enable
error code when appropriate. In the process, though, it was pointed out
that there are multiple places in the kernel which check/ask for message
signalled interrupts (MSI or MSI-X), which spawned the first patch adding
PCI_IRQ_MSI_TYPES. Finally the rest of the chain converts all users to
take advantage of PCI_IRQ_MSI_TYPES or PCI_IRQ_ALL_TYPES, as
appropriate.

Piotr Stankiewicz (15):
  PCI/MSI: Forward MSI-X vector enable error code in
pci_alloc_irq_vectors_affinity()
  PCI: Add macro for message signalled interrupt types
  PCI: Use PCI_IRQ_MSI_TYPES where appropriate
  ahci: Use PCI_IRQ_MSI_TYPES where appropriate
  crypto: inside-secure - Use PCI_IRQ_MSI_TYPES where appropriate
  dmaengine: dw-edma: Use PCI_IRQ_MSI_TYPES  where appropriate
  drm/amdgpu: Use PCI_IRQ_MSI_TYPES where appropriate
  IB/qib: Use PCI_IRQ_MSI_TYPES where appropriate
  media: ddbridge: Use PCI_IRQ_MSI_TYPES where appropriate
  vmw_vmci: Use PCI_IRQ_ALL_TYPES where appropriate
  mmc: sdhci: Use PCI_IRQ_MSI_TYPES where appropriate
  amd-xgbe: Use PCI_IRQ_MSI_TYPES where appropriate
  aquantia: atlantic: Use PCI_IRQ_ALL_TYPES where appropriate
  net: hns3: Use PCI_IRQ_MSI_TYPES where appropriate
  scsi: Use PCI_IRQ_MSI_TYPES and PCI_IRQ_ALL_TYPES where appropriate

 Documentation/PCI/msi-howto.rst   |  5 +++--
 drivers/ata/ahci.c|  2 +-
 drivers/crypto/inside-secure/safexcel.c   |  2 +-
 drivers/dma/dw-edma/dw-edma-pcie.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   | 11 +-
 drivers/infiniband/hw/qib/qib_pcie.c  |  6 +++--
 drivers/media/pci/ddbridge/ddbridge-main.c|  2 +-
 drivers/misc/vmw_vmci/vmci_guest.c|  3 +--
 drivers/mmc/host/sdhci-pci-gli.c  |  3 +--
 drivers/mmc/host/sdhci-pci-o2micro.c  |  3 +--
 drivers/net/ethernet/amd/xgbe/xgbe-pci.c  |  2 +-
 .../ethernet/aquantia/atlantic/aq_pci_func.c  |  4 +---
 .../hisilicon/hns3/hns3pf/hclge_main.c|  3 +--
 .../hisilicon/hns3/hns3vf/hclgevf_main.c  |  3 +--
 drivers/pci/msi.c | 22 ---
 drivers/pci/pcie/portdrv_core.c   |  4 ++--
 drivers/pci/switch/switchtec.c|  3 +--
 drivers/scsi/ipr.c|  5 +++--
 drivers/scsi/vmw_pvscsi.c |  2 +-
 include/linux/pci.h   |  4 ++--
 20 files changed, 37 insertions(+), 54 deletions(-)

-- 
2.17.2

[PATCH v3 12/15] amd-xgbe: Use PCI_IRQ_MSI_TYPES where appropriate

2020-06-09 Thread Piotr Stankiewicz

Seeing as there is shorthand available to use when asking for any type
of interrupt, or any type of message signalled interrupt, leverage it.

Signed-off-by: Piotr Stankiewicz 
Reviewed-by: Andy Shevchenko 
---
 drivers/net/ethernet/amd/xgbe/xgbe-pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
index 7b86240ecd5f..903bc5ef2518 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
@@ -133,7 +133,7 @@ static int xgbe_config_multi_msi(struct xgbe_prv_data 
*pdata)
pdata->tx_ring_count);
 
ret = pci_alloc_irq_vectors(pdata->pcidev, XGBE_MSI_MIN_COUNT,
-   vector_count, PCI_IRQ_MSI | PCI_IRQ_MSIX);
+   vector_count, PCI_IRQ_MSI_TYPES);
if (ret < 0) {
dev_info(pdata->dev, "multi MSI/MSI-X enablement failed\n");
return ret;
-- 
2.17.2

[PATCH v3 14/15] net: hns3: Use PCI_IRQ_MSI_TYPES where appropriate

2020-06-09 Thread Piotr Stankiewicz

Seeing as there is shorthand available to use when asking for any type
of interrupt, or any type of message signalled interrupt, leverage it.

Signed-off-by: Piotr Stankiewicz 
Reviewed-by: Andy Shevchenko 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c   | 3 +--
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 96bfad52630d..535d9f4a31f5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2327,8 +2327,7 @@ static int hclge_init_msi(struct hclge_dev *hdev)
int i;
 
vectors = pci_alloc_irq_vectors(pdev, HNAE3_MIN_VECTOR_NUM,
-   hdev->num_msi,
-   PCI_IRQ_MSI | PCI_IRQ_MSIX);
+   hdev->num_msi, PCI_IRQ_MSI_TYPES);
if (vectors < 0) {
dev_err(&pdev->dev,
"failed(%d) to allocate MSI/MSI-X vectors\n",
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 1b9578d0bd80..fbfd4e7b2817 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -2608,8 +2608,7 @@ static int hclgevf_init_msi(struct hclgevf_dev *hdev)
PCI_IRQ_MSIX);
else
vectors = pci_alloc_irq_vectors(pdev, HNAE3_MIN_VECTOR_NUM,
-   hdev->num_msi,
-   PCI_IRQ_MSI | PCI_IRQ_MSIX);
+   hdev->num_msi, 
PCI_IRQ_MSI_TYPES);
 
if (vectors < 0) {
dev_err(&pdev->dev,
-- 
2.17.2

[PATCH v3 13/15] aquantia: atlantic: Use PCI_IRQ_ALL_TYPES where appropriate

2020-06-09 Thread Piotr Stankiewicz

Seeing as there is shorthand available to use when asking for any type
of interrupt, or any type of message signalled interrupt, leverage it.

Signed-off-by: Piotr Stankiewicz 
Reviewed-by: Andy Shevchenko 
---
 drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
index 41c0f560f95b..bc228be9f0a3 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
@@ -299,9 +299,7 @@ static int aq_pci_probe(struct pci_dev *pdev,
numvecs += AQ_HW_SERVICE_IRQS;
/*enable interrupts */
 #if !AQ_CFG_FORCE_LEGACY_INT
-   err = pci_alloc_irq_vectors(self->pdev, 1, numvecs,
-   PCI_IRQ_MSIX | PCI_IRQ_MSI |
-   PCI_IRQ_LEGACY);
+   err = pci_alloc_irq_vectors(self->pdev, 1, numvecs, PCI_IRQ_ALL_TYPES);
 
if (err < 0)
goto err_hwinit;
-- 
2.17.2

[PATCH] net: stmmac: Fix RX Coalesce IOC always true issue

2020-06-09 Thread Biao Huang

Currently rx_count_frames in stmmac_rx_refill always 0, which leads to
use_rx_wd false, and IOC bit of rx_desc3 true forever. Fix it.

Fixes: 6fa9d691b91ac ("net: stmmac: Prevent divide-by-zero")
Signed-off-by: Biao Huang 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e6898fd5223f..87b529743fd0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3607,8 +3607,7 @@ static inline void stmmac_rx_refill(struct stmmac_priv 
*priv, u32 queue)
stmmac_refill_desc3(priv, rx_q, p);
 
rx_q->rx_count_frames++;
-   rx_q->rx_count_frames += priv->rx_coal_frames;
-   if (rx_q->rx_count_frames > priv->rx_coal_frames)
+   if (rx_q->rx_count_frames >= priv->rx_coal_frames)
rx_q->rx_count_frames = 0;
 
use_rx_wd = !priv->rx_coal_frames;
-- 
2.18.0

Re: [PATCH bpf 0/3] bpf: avoid using/returning file descriptor value zero

2020-06-09 Thread Jesper Dangaard Brouer

On Mon, 8 Jun 2020 18:34:10 -0700
Alexei Starovoitov  wrote:

> On Mon, Jun 08, 2020 at 06:51:12PM +0200, Jesper Dangaard Brouer wrote:
> > Make it easier to handle UAPI/kABI extensions by avoid BPF using/returning
> > file descriptor value zero. Use this in recent devmap extension to keep
> > older applications compatible with newer kernels.
> > 
> > For special type maps (e.g. devmap and cpumap) the map-value data-layout is
> > a configuration interface. This is a kernel Application Binary Interface
> > (kABI) that can only be tail extended. Thus, new members (and thus features)
> > can only be added to the end of this structure, and the kernel uses the
> > map->value_size from userspace to determine feature set 'version'.  
> 
> please drop these kabi references. As far as I know kabi is a redhat invention
> and I'm not even sure what exactly it means.
> 'struct bpf_devmap_val' is uapi. No need to invent new names for existing 
> concept.

Sure I can call it UAPI.

I was alluding to the difference between API and ABI, but it doesn't matter.
For the record, Red Hat didn't invent ABI (Application Binary Interface):
 https://en.wikipedia.org/wiki/Application_binary_interface


> > The recent extension of devmap with a bpf_prog.fd requires end-user to
> > supply the file-descriptor value minus-1 to communicate that the features
> > isn't used. This isn't compatible with the described kABI extension model.  
> 
> non-zero prog_fd requirement exists already in bpf syscall. It's not recent.
> So I don't think patch 1 is appropriate at this point. Certainly not
> for bpf tree. We can argue about it usefulness when bpf-next reopens.
> For now I think patches 2 and 3 are good to go.

Great.

> Don't delete 'enum sk_action' in patch 2 though.

Sorry, yes, that was a mistake.

> The rest looks good to me.

Thanks!
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [RFC PATCH net-next 04/10] ethtool: Add link extended state

2020-06-09 Thread Amit Cohen

On 07-Jun-20 21:17, Florian Fainelli wrote:
> 
> 
> On 6/7/2020 7:59 AM, Amit Cohen wrote:
>> Currently, drivers can only tell whether the link is up/down using
>> LINKSTATE_GET, but no additional information is given.
>>
>> Add attributes to LINKSTATE_GET command in order to allow drivers
>> to expose the user more information in addition to link state to ease
>> the debug process, for example, reason for link down state.
>>
>> Extended state consists of two attributes - ext_state and ext_substate.
>> The idea is to avoid 'vendor specific' states in order to prevent
>> drivers to use specific ext_state that can be in the future common
>> ext_state.
>>
>> The substates allows drivers to add more information to the common
>> ext_state. For example, vendor can expose 'Autoneg failure' as
>> ext_state and add 'No partner detected during force mode' as
>> ext_substate.
>>
>> If a driver cannot pinpoint the extended state with the substate
>> accuracy, it is free to expose only the extended state and omit the
>> substate attribute.
>>
>> Signed-off-by: Amit Cohen 
>> Reviewed-by: Petr Machata 
>> Reviewed-by: Jiri Pirko 
>> ---
>>  include/linux/ethtool.h  | 22 +
>>  include/uapi/linux/ethtool.h | 70 
>>  include/uapi/linux/ethtool_netlink.h |  2 +
>>  net/ethtool/linkstate.c  | 40 
>>  4 files changed, 134 insertions(+)
>>
>> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
>> index a23b26eab479..48ec542f4504 100644
>> --- a/include/linux/ethtool.h
>> +++ b/include/linux/ethtool.h
>> @@ -86,6 +86,22 @@ struct net_device;
>>  u32 ethtool_op_get_link(struct net_device *dev);
>>  int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info 
>> *eti);
>>  
>> +
>> +/**
>> + * struct ethtool_ext_state_info - link extended state and substate.
>> + */
>> +struct ethtool_ext_state_info {
>> +enum ethtool_ext_state ext_state;
>> +union {
>> +enum ethtool_ext_substate_autoneg autoneg;
>> +enum ethtool_ext_substate_link_training link_training;
>> +enum ethtool_ext_substate_link_logical_mismatch 
>> link_logical_mismatch;
>> +enum ethtool_ext_substate_bad_signal_integrity 
>> bad_signal_integrity;
>> +enum ethtool_ext_substate_cable_issue cable_issue;
>> +int __ext_substate;
>> +};
>> +};
>> +
>>  /**
>>   * ethtool_rxfh_indir_default - get default value for RX flow hash 
>> indirection
>>   * @index: Index in RX flow hash indirection table
>> @@ -245,6 +261,10 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 
>> *legacy_u32,
>>   * @get_link: Report whether physical link is up.  Will only be called if
>>   *  the netdev is up.  Should usually be set to ethtool_op_get_link(),
>>   *  which uses netif_carrier_ok().
>> + * @get_ext_state: Report link extended state. Should set ext_state and
>> + *  ext_substate (ext_substate of 0 means ext_substate is unknown,
>> + *  do not attach ext_substate attribute to netlink message). If not
>> + *  implemented, ext_state and ext_substate will not be sent to userspace.
> 
> For consistency with the other link-related operations, I would name
> this get_link_ext_state.
> 

ok.

>>   * @get_eeprom: Read data from the device EEPROM.
>>   *  Should fill in the magic field.  Don't need to check len for zero
>>   *  or wraparound.  Fill in the data argument with the eeprom values
>> @@ -384,6 +404,8 @@ struct ethtool_ops {
>>  void(*set_msglevel)(struct net_device *, u32);
>>  int (*nway_reset)(struct net_device *);
>>  u32 (*get_link)(struct net_device *);
>> +int (*get_ext_state)(struct net_device *,
>> + struct ethtool_ext_state_info *);
>>  int (*get_eeprom_len)(struct net_device *);
>>  int (*get_eeprom)(struct net_device *,
>>struct ethtool_eeprom *, u8 *);
>> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
>> index f4662b3a9e1e..830fa0d6aebe 100644
>> --- a/include/uapi/linux/ethtool.h
>> +++ b/include/uapi/linux/ethtool.h
>> @@ -579,6 +579,76 @@ struct ethtool_pauseparam {
>>  __u32   tx_pause;
>>  };
>>  
>> +/**
>> + * enum ethtool_ext_state - link extended state
>> + */
>> +enum ethtool_ext_state {
>> +ETHTOOL_EXT_STATE_AUTONEG_FAILURE,
>> +ETHTOOL_EXT_STATE_LINK_TRAINING_FAILURE,
>> +ETHTOOL_EXT_STATE_LINK_LOGICAL_MISMATCH,
>> +ETHTOOL_EXT_STATE_BAD_SIGNAL_INTEGRITY,
>> +ETHTOOL_EXT_STATE_NO_CABLE,
>> +ETHTOOL_EXT_STATE_CABLE_ISSUE,
>> +ETHTOOL_EXT_STATE_EEPROM_ISSUE,
> 
> Does the EEPROM issue would indicate for instance that it was not
> possile for the firmware/kernel to determine what transceiver
> capabilities are supported from e.g.: a SFP or SFF EEPROM, and therefore
> the link state could be down because of that. Is this the idea?
> 

We get this reason from firmware if the cable identifier is not spec compliant, 
mis

[PATCH v3 3/7] dev_printk: Add dev_dbg_level macro over dynamic one

2020-06-09 Thread Stanimir Varbanov

Add dev_dbg_level macro wrapper over dynamic debug one for
dev_dbg variants.

Signed-off-by: Stanimir Varbanov 
---
 include/linux/dev_printk.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/dev_printk.h b/include/linux/dev_printk.h
index 7b50551833e1..d639dc60d84d 100644
--- a/include/linux/dev_printk.h
+++ b/include/linux/dev_printk.h
@@ -112,15 +112,24 @@ void _dev_info(const struct device *dev, const char *fmt, 
...)
 #if defined(CONFIG_DYNAMIC_DEBUG)
 #define dev_dbg(dev, fmt, ...) \
dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
+#define dev_dbg_level(dev, lvl, fmt, ...)  \
+   dynamic_dev_dbg_level(dev, lvl, dev_fmt(fmt), ##__VA_ARGS__)
 #elif defined(DEBUG)
 #define dev_dbg(dev, fmt, ...) \
dev_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
+#define dev_dbg_level(dev, lvl, fmt, ...)  \
+   dev_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
 #else
 #define dev_dbg(dev, fmt, ...) \
 ({ \
if (0)  \
dev_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__); \
 })
+#define dev_dbg_level(dev, lvl, fmt, ...)  \
+({ \
+   if (0)  \
+   dev_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__); \
+})
 #endif
 
 #ifdef CONFIG_PRINTK
-- 
2.17.1

[PATCH v3 5/7] venus: Add debugfs interface to set firmware log level

2020-06-09 Thread Stanimir Varbanov

This will be useful when debugging specific issues related to
firmware HFI interface.

Signed-off-by: Stanimir Varbanov 
---
 drivers/media/platform/qcom/venus/Makefile|  2 +-
 drivers/media/platform/qcom/venus/core.c  |  5 
 drivers/media/platform/qcom/venus/core.h  |  3 +++
 drivers/media/platform/qcom/venus/dbgfs.c | 26 +++
 drivers/media/platform/qcom/venus/dbgfs.h | 12 +
 drivers/media/platform/qcom/venus/hfi_venus.c |  7 -
 6 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 drivers/media/platform/qcom/venus/dbgfs.c
 create mode 100644 drivers/media/platform/qcom/venus/dbgfs.h

diff --git a/drivers/media/platform/qcom/venus/Makefile 
b/drivers/media/platform/qcom/venus/Makefile
index 64af0bc1edae..dfc636865709 100644
--- a/drivers/media/platform/qcom/venus/Makefile
+++ b/drivers/media/platform/qcom/venus/Makefile
@@ -3,7 +3,7 @@
 
 venus-core-objs += core.o helpers.o firmware.o \
   hfi_venus.o hfi_msgs.o hfi_cmds.o hfi.o \
-  hfi_parser.o pm_helpers.o
+  hfi_parser.o pm_helpers.o dbgfs.o
 
 venus-dec-objs += vdec.o vdec_ctrls.o
 venus-enc-objs += venc.o venc_ctrls.o
diff --git a/drivers/media/platform/qcom/venus/core.c 
b/drivers/media/platform/qcom/venus/core.c
index 203c6538044f..bbb394ca4175 100644
--- a/drivers/media/platform/qcom/venus/core.c
+++ b/drivers/media/platform/qcom/venus/core.c
@@ -290,6 +290,10 @@ static int venus_probe(struct platform_device *pdev)
if (ret)
goto err_dev_unregister;
 
+   ret = venus_dbgfs_init(core);
+   if (ret)
+   goto err_dev_unregister;
+
return 0;
 
 err_dev_unregister:
@@ -337,6 +341,7 @@ static int venus_remove(struct platform_device *pdev)
v4l2_device_unregister(&core->v4l2_dev);
mutex_destroy(&core->pm_lock);
mutex_destroy(&core->lock);
+   venus_dbgfs_deinit(core);
 
return ret;
 }
diff --git a/drivers/media/platform/qcom/venus/core.h 
b/drivers/media/platform/qcom/venus/core.h
index 7118612673c9..b48782f9aa95 100644
--- a/drivers/media/platform/qcom/venus/core.h
+++ b/drivers/media/platform/qcom/venus/core.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 
+#include "dbgfs.h"
 #include "hfi.h"
 
 #define VIDC_CLKS_NUM_MAX  4
@@ -136,6 +137,7 @@ struct venus_caps {
  * @priv:  a private filed for HFI operations
  * @ops:   the core HFI operations
  * @work:  a delayed work for handling system fatal error
+ * @root:  debugfs root directory
  */
 struct venus_core {
void __iomem *base;
@@ -185,6 +187,7 @@ struct venus_core {
unsigned int codecs_count;
unsigned int core0_usage_count;
unsigned int core1_usage_count;
+   struct dentry *root;
 };
 
 struct vdec_controls {
diff --git a/drivers/media/platform/qcom/venus/dbgfs.c 
b/drivers/media/platform/qcom/venus/dbgfs.c
new file mode 100644
index ..a2465fe8e20b
--- /dev/null
+++ b/drivers/media/platform/qcom/venus/dbgfs.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Linaro Ltd.
+ */
+
+#include 
+
+#include "core.h"
+
+extern int venus_fw_debug;
+
+int venus_dbgfs_init(struct venus_core *core)
+{
+   core->root = debugfs_create_dir("venus", NULL);
+   if (IS_ERR(core->root))
+   return IS_ERR(core->root);
+
+   debugfs_create_x32("fw_level", 0644, core->root, &venus_fw_debug);
+
+   return 0;
+}
+
+void venus_dbgfs_deinit(struct venus_core *core)
+{
+   debugfs_remove_recursive(core->root);
+}
diff --git a/drivers/media/platform/qcom/venus/dbgfs.h 
b/drivers/media/platform/qcom/venus/dbgfs.h
new file mode 100644
index ..4e35bd7db15f
--- /dev/null
+++ b/drivers/media/platform/qcom/venus/dbgfs.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Linaro Ltd. */
+
+#ifndef __VENUS_DBGFS_H__
+#define __VENUS_DBGFS_H__
+
+struct venus_core;
+
+int venus_dbgfs_init(struct venus_core *core);
+void venus_dbgfs_deinit(struct venus_core *core);
+
+#endif
diff --git a/drivers/media/platform/qcom/venus/hfi_venus.c 
b/drivers/media/platform/qcom/venus/hfi_venus.c
index 0d8855014ab3..3a04b08ab85a 100644
--- a/drivers/media/platform/qcom/venus/hfi_venus.c
+++ b/drivers/media/platform/qcom/venus/hfi_venus.c
@@ -130,7 +130,7 @@ struct venus_hfi_device {
 };
 
 static bool venus_pkt_debug;
-static int venus_fw_debug = HFI_DEBUG_MSG_ERROR | HFI_DEBUG_MSG_FATAL;
+int venus_fw_debug = HFI_DEBUG_MSG_ERROR | HFI_DEBUG_MSG_FATAL;
 static bool venus_sys_idle_indicator;
 static bool venus_fw_low_power_mode = true;
 static int venus_hw_rsp_timeout = 1000;
@@ -1130,9 +1130,14 @@ static int venus_session_init(struct venus_inst *inst, 
u32 session_type,
  u32 codec)
 {
struct venus_hfi_device *hdev = to_hfi_priv(inst->core);
+   struct device *dev = hdev->core->dev;
struct hfi_session_init_pkt pkt;

[PATCH v3 6/7] venus: Make debug infrastructure more flexible

2020-06-09 Thread Stanimir Varbanov

Here we introduce few debug macros with levels (low, medium and
high) and debug macro for firmware. Enabling the particular level
will be done by dynamic debug with levels.

For example to enable debug messages with low level:
echo 'module venus_dec level 0x01 +p' > debugfs/dynamic_debug/control

If you want to enable all levels:
echo 'module venus_dec level 0x07 +p' > debugfs/dynamic_debug/control

All the features which dynamic debugging provide are preserved.

And finaly all dev_dbg are translated to VDBGX with appropriate
debug levels.

Signed-off-by: Stanimir Varbanov 
---
 drivers/media/platform/qcom/venus/core.h  |  5 ++
 drivers/media/platform/qcom/venus/helpers.c   |  2 +-
 drivers/media/platform/qcom/venus/hfi_msgs.c  | 30 -
 drivers/media/platform/qcom/venus/hfi_venus.c | 20 --
 .../media/platform/qcom/venus/pm_helpers.c|  3 +-
 drivers/media/platform/qcom/venus/vdec.c  | 63 +--
 drivers/media/platform/qcom/venus/venc.c  |  4 ++
 7 files changed, 96 insertions(+), 31 deletions(-)

diff --git a/drivers/media/platform/qcom/venus/core.h 
b/drivers/media/platform/qcom/venus/core.h
index b48782f9aa95..63eabf5ff96d 100644
--- a/drivers/media/platform/qcom/venus/core.h
+++ b/drivers/media/platform/qcom/venus/core.h
@@ -15,6 +15,11 @@
 #include "dbgfs.h"
 #include "hfi.h"
 
+#define VDBGL(fmt, args...)pr_debug_level(0x01, fmt, ##args)
+#define VDBGM(fmt, args...)pr_debug_level(0x02, fmt, ##args)
+#define VDBGH(fmt, args...)pr_debug_level(0x04, fmt, ##args)
+#define VDBGFW(fmt, args...)   pr_debug_level(0x08, fmt, ##args)
+
 #define VIDC_CLKS_NUM_MAX  4
 #define VIDC_VCODEC_CLKS_NUM_MAX   2
 #define VIDC_PMDOMAINS_NUM_MAX 3
diff --git a/drivers/media/platform/qcom/venus/helpers.c 
b/drivers/media/platform/qcom/venus/helpers.c
index 0143af7822b2..115a9a2af1d6 100644
--- a/drivers/media/platform/qcom/venus/helpers.c
+++ b/drivers/media/platform/qcom/venus/helpers.c
@@ -396,7 +396,7 @@ put_ts_metadata(struct venus_inst *inst, struct 
vb2_v4l2_buffer *vbuf)
}
 
if (slot == -1) {
-   dev_dbg(inst->core->dev, "%s: no free slot\n", __func__);
+   VDBGH("no free slot for timestamp\n");
return;
}
 
diff --git a/drivers/media/platform/qcom/venus/hfi_msgs.c 
b/drivers/media/platform/qcom/venus/hfi_msgs.c
index 279a9d6fe737..36986d402c96 100644
--- a/drivers/media/platform/qcom/venus/hfi_msgs.c
+++ b/drivers/media/platform/qcom/venus/hfi_msgs.c
@@ -138,10 +138,9 @@ static void event_sys_error(struct venus_core *core, u32 
event,
struct hfi_msg_event_notify_pkt *pkt)
 {
if (pkt)
-   dev_dbg(core->dev,
-   "sys error (session id:%x, data1:%x, data2:%x)\n",
-   pkt->shdr.session_id, pkt->event_data1,
-   pkt->event_data2);
+   VDBGH("sys error (session id: %x, data1: %x, data2: %x)\n",
+ pkt->shdr.session_id, pkt->event_data1,
+ pkt->event_data2);
 
core->core_ops->event_notify(core, event);
 }
@@ -152,8 +151,8 @@ event_session_error(struct venus_core *core, struct 
venus_inst *inst,
 {
struct device *dev = core->dev;
 
-   dev_dbg(dev, "session error: event id:%x, session id:%x\n",
-   pkt->event_data1, pkt->shdr.session_id);
+   VDBGH("session error: event id: %x, session id: %x\n",
+ pkt->event_data1, pkt->shdr.session_id);
 
if (!inst)
return;
@@ -236,8 +235,7 @@ static void hfi_sys_init_done(struct venus_core *core, 
struct venus_inst *inst,
 }
 
 static void
-sys_get_prop_image_version(struct device *dev,
-  struct hfi_msg_sys_property_info_pkt *pkt)
+sys_get_prop_image_version(struct hfi_msg_sys_property_info_pkt *pkt)
 {
int req_bytes;
 
@@ -247,26 +245,25 @@ sys_get_prop_image_version(struct device *dev,
/* bad packet */
return;
 
-   dev_dbg(dev, "F/W version: %s\n", (u8 *)&pkt->data[1]);
+   VDBGL("F/W version: %s\n", (u8 *)&pkt->data[1]);
 }
 
 static void hfi_sys_property_info(struct venus_core *core,
  struct venus_inst *inst, void *packet)
 {
struct hfi_msg_sys_property_info_pkt *pkt = packet;
-   struct device *dev = core->dev;
 
if (!pkt->num_properties) {
-   dev_dbg(dev, "%s: no properties\n", __func__);
+   VDBGM("no properties\n");
return;
}
 
switch (pkt->data[0]) {
case HFI_PROPERTY_SYS_IMAGE_VERSION:
-   sys_get_prop_image_version(dev, pkt);
+   sys_get_prop_image_version(pkt);
break;
default:
-   dev_dbg(dev, "%s: unknown property data\n", __func__);
+   VDBGM("unknown property data\n");
break;
}
 }
@@ -297,7 +294,7 @@ static void hfi_sys_ping_don

[PATCH v3 7/7] venus: Add a debugfs file for SSR trigger

2020-06-09 Thread Stanimir Varbanov

The SSR (SubSystem Restart) is used to simulate an error on FW
side of Venus. We support following type of triggers - fatal error,
div by zero and watchdog IRQ.

Signed-off-by: Stanimir Varbanov 
---
 drivers/media/platform/qcom/venus/dbgfs.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/drivers/media/platform/qcom/venus/dbgfs.c 
b/drivers/media/platform/qcom/venus/dbgfs.c
index a2465fe8e20b..59d52e5af64a 100644
--- a/drivers/media/platform/qcom/venus/dbgfs.c
+++ b/drivers/media/platform/qcom/venus/dbgfs.c
@@ -9,6 +9,35 @@
 
 extern int venus_fw_debug;
 
+static int trigger_ssr_open(struct inode *inode, struct file *file)
+{
+   file->private_data = inode->i_private;
+   return 0;
+}
+
+static ssize_t trigger_ssr_write(struct file *filp, const char __user *buf,
+size_t count, loff_t *ppos)
+{
+   struct venus_core *core = filp->private_data;
+   u32 ssr_type;
+   int ret;
+
+   ret = kstrtou32_from_user(buf, count, 4, &ssr_type);
+   if (ret)
+   return ret;
+
+   ret = hfi_core_trigger_ssr(core, ssr_type);
+   if (ret < 0)
+   return ret;
+
+   return count;
+}
+
+static const struct file_operations ssr_fops = {
+   .open = trigger_ssr_open,
+   .write = trigger_ssr_write,
+};
+
 int venus_dbgfs_init(struct venus_core *core)
 {
core->root = debugfs_create_dir("venus", NULL);
@@ -17,6 +46,8 @@ int venus_dbgfs_init(struct venus_core *core)
 
debugfs_create_x32("fw_level", 0644, core->root, &venus_fw_debug);
 
+   debugfs_create_file("trigger_ssr", 0200, core->root, core, &ssr_fops);
+
return 0;
 }
 
-- 
2.17.1

[PATCH v3 4/7] printk: Add pr_debug_level macro over dynamic one

2020-06-09 Thread Stanimir Varbanov

Introduce new pr_debug_level macro over dynamic_debug level one
to allow dynamic debugging to show only important messages.

Signed-off-by: Stanimir Varbanov 
---
 include/linux/printk.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index ceea84aa705b..2a6eca56010f 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -416,12 +416,18 @@ extern int kptr_restrict;
  */
 #define pr_debug(fmt, ...) \
dynamic_pr_debug(fmt, ##__VA_ARGS__)
+#define pr_debug_level(lvl, fmt, ...) \
+   dynamic_pr_debug_level(lvl, fmt, ##__VA_ARGS__)
 #elif defined(DEBUG)
 #define pr_debug(fmt, ...) \
printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_level(lvl, fmt, ...) \
+   printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define pr_debug(fmt, ...) \
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_level(lvl, fmt, ...) \
+   no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #endif
 
 /*
-- 
2.17.1

[PATCH v3 2/7] dynamic_debug: Group debug messages by level bitmask

2020-06-09 Thread Stanimir Varbanov

This will allow dynamic debug users and driver writers to group
debug messages by level bitmask.  The level bitmask should be a
hex number.

Done this functionality by extending dynamic debug metadata with
new level member and propagate it over all the users.  Also
introduce new dynamic_pr_debug_level and dynamic_dev_dbg_level
macros to be used by the drivers.

Cc: Jason Baron 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Petr Mladek 
Cc: Sergey Senozhatsky 
Cc: Steven Rostedt 
Cc: Greg Kroah-Hartman 

Suggested-by: Joe Perches 
Signed-off-by: Stanimir Varbanov 
---
 fs/btrfs/ctree.h  | 12 +---
 include/linux/acpi.h  |  3 +-
 include/linux/dev_printk.h|  3 +-
 include/linux/dynamic_debug.h | 55 ---
 include/linux/net.h   |  3 +-
 include/linux/printk.h|  3 +-
 lib/dynamic_debug.c   | 30 +++
 7 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 161533040978..f6a778789056 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3081,16 +3081,20 @@ void btrfs_printk(const struct btrfs_fs_info *fs_info, 
const char *fmt, ...);
 
 #if defined(CONFIG_DYNAMIC_DEBUG)
 #define btrfs_debug(fs_info, fmt, args...) \
-   _dynamic_func_call_no_desc(fmt, btrfs_printk,   \
+   _dynamic_func_call_no_desc(fmt, _DPRINTK_LEVEL_DEFAULT, \
+  btrfs_printk,\
   fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_in_rcu(fs_info, fmt, args...)  \
-   _dynamic_func_call_no_desc(fmt, btrfs_printk_in_rcu,\
+   _dynamic_func_call_no_desc(fmt, _DPRINTK_LEVEL_DEFAULT, \
+  btrfs_printk_in_rcu, \
   fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl_in_rcu(fs_info, fmt, args...)   \
-   _dynamic_func_call_no_desc(fmt, btrfs_printk_rl_in_rcu, \
+   _dynamic_func_call_no_desc(fmt, _DPRINTK_LEVEL_DEFAULT, \
+  btrfs_printk_rl_in_rcu,  \
   fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl(fs_info, fmt, args...)  \
-   _dynamic_func_call_no_desc(fmt, btrfs_printk_ratelimited,   \
+   _dynamic_func_call_no_desc(fmt, _DPRINTK_LEVEL_DEFAULT, \
+  btrfs_printk_ratelimited,\
   fs_info, KERN_DEBUG fmt, ##args)
 #elif defined(DEBUG)
 #define btrfs_debug(fs_info, fmt, args...) \
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index d661cd0ee64d..6e51438f7635 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1043,7 +1043,8 @@ void __acpi_handle_debug(struct _ddebug *descriptor, 
acpi_handle handle, const c
 #else
 #if defined(CONFIG_DYNAMIC_DEBUG)
 #define acpi_handle_debug(handle, fmt, ...)\
-   _dynamic_func_call(fmt, __acpi_handle_debug,\
+   _dynamic_func_call(fmt, _DPRINTK_LEVEL_DEFAULT, \
+  __acpi_handle_debug, \
   handle, pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define acpi_handle_debug(handle, fmt, ...)\
diff --git a/include/linux/dev_printk.h b/include/linux/dev_printk.h
index 5aad06b4ca7b..7b50551833e1 100644
--- a/include/linux/dev_printk.h
+++ b/include/linux/dev_printk.h
@@ -188,7 +188,8 @@ do {
\
static DEFINE_RATELIMIT_STATE(_rs,  \
  DEFAULT_RATELIMIT_INTERVAL,   \
  DEFAULT_RATELIMIT_BURST); \
-   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt,  \
+ _DPRINTK_LEVEL_DEFAULT);  \
if (DYNAMIC_DEBUG_BRANCH(descriptor) && \
__ratelimit(&_rs))  \
__dynamic_dev_dbg(&descriptor, dev, dev_fmt(fmt),   \
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index 4cf02ecd67de..95e97260c517 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -38,6 +38,8 @@ struct _ddebug {
 #define _DPRINTK_FLAGS_DEFAULT 0
 #endif
unsigned int flags:8;
+#define _DPRINTK_LEVEL_DEFAULT 0
+   unsigned int level:5;
 #ifdef CONFIG_JUMP_LABEL
union {
struct static_key_true dd_key_true;
@@ -78,7 +80,7 @@ void

[PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Stanimir Varbanov

This adds description of the level bitmask feature.

Cc: Jonathan Corbet  (maintainer:DOCUMENTATION)

Signed-off-by: Stanimir Varbanov 
---
 Documentation/admin-guide/dynamic-debug-howto.rst | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst 
b/Documentation/admin-guide/dynamic-debug-howto.rst
index 0dc2eb8e44e5..c2b751fc8a17 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -208,6 +208,12 @@ line
line -1605  // the 1605 lines from line 1 to line 1605
line 1600-  // all lines from line 1600 to the end of the file
 
+level
+The given level will be a bitmask ANDed with the level of the each 
``pr_debug()``
+callsite. This will allow to group debug messages and show only those of 
the
+same level.  The -p flag takes precedence over the given level. Note that 
we can
+have up to five groups of debug messages.
+
 The flags specification comprises a change operation followed
 by one or more flag characters.  The change operation is one
 of the characters::
@@ -346,6 +352,10 @@ Examples
   // add module, function to all enabled messages
   nullarbor:~ # echo -n '+mf' > /dynamic_debug/control
 
+  // enable all messages in file with 0x01 level bitmask
+  nullarbor:~ # echo -n 'file foo.c level 0x01 +p' >
+/dynamic_debug/control
+
   // boot-args example, with newlines and comments for readability
   Kernel command line: ...
 // see whats going on in dyndbg=value processing
-- 
2.17.1

[PATCH v3 0/7] Venus dynamic debug

2020-06-09 Thread Stanimir Varbanov

Hello,

Here is the third version of dynamic debug improvements in Venus
driver.  As has been suggested on previous version by Joe [1] I've
made the relevant changes in dynamic debug core to handle leveling
as more generic way and not open-code/workaround it in the driver.

About changes:
 - added change in the dynamic_debug and in documentation
 - added respective pr_debug_level and dev_dbg_level

regards,
Stan

[1] https://lkml.org/lkml/2020/5/21/668

Stanimir Varbanov (7):
  Documentation: dynamic-debug: Add description of level bitmask
  dynamic_debug: Group debug messages by level bitmask
  dev_printk: Add dev_dbg_level macro over dynamic one
  printk: Add pr_debug_level macro over dynamic one
  venus: Add debugfs interface to set firmware log level
  venus: Make debug infrastructure more flexible
  venus: Add a debugfs file for SSR trigger

 .../admin-guide/dynamic-debug-howto.rst   | 10 +++
 drivers/media/platform/qcom/venus/Makefile|  2 +-
 drivers/media/platform/qcom/venus/core.c  |  5 ++
 drivers/media/platform/qcom/venus/core.h  |  8 +++
 drivers/media/platform/qcom/venus/dbgfs.c | 57 +
 drivers/media/platform/qcom/venus/dbgfs.h | 12 
 drivers/media/platform/qcom/venus/helpers.c   |  2 +-
 drivers/media/platform/qcom/venus/hfi_msgs.c  | 30 -
 drivers/media/platform/qcom/venus/hfi_venus.c | 27 ++--
 .../media/platform/qcom/venus/pm_helpers.c|  3 +-
 drivers/media/platform/qcom/venus/vdec.c  | 63 +--
 drivers/media/platform/qcom/venus/venc.c  |  4 ++
 fs/btrfs/ctree.h  | 12 ++--
 include/linux/acpi.h  |  3 +-
 include/linux/dev_printk.h| 12 +++-
 include/linux/dynamic_debug.h | 55 +++-
 include/linux/net.h   |  3 +-
 include/linux/printk.h|  9 ++-
 lib/dynamic_debug.c   | 30 +
 19 files changed, 289 insertions(+), 58 deletions(-)
 create mode 100644 drivers/media/platform/qcom/venus/dbgfs.c
 create mode 100644 drivers/media/platform/qcom/venus/dbgfs.h

-- 
2.17.1

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Matthew Wilcox

On Tue, Jun 09, 2020 at 01:45:58PM +0300, Stanimir Varbanov wrote:
> +level
> +The given level will be a bitmask ANDed with the level of the each 
> ``pr_debug()``
> +callsite. This will allow to group debug messages and show only those of 
> the
> +same level.  The -p flag takes precedence over the given level. Note 
> that we can
> +have up to five groups of debug messages.

That doesn't sound like a "level".  printk has levels.  If you ask for
"level 3" messages, you get messages from levels 0, 1, 2, and 3.  These
seem like "types" or "groups" or something.

> +  // enable all messages in file with 0x01 level bitmask
> +  nullarbor:~ # echo -n 'file foo.c level 0x01 +p' >
> +/dynamic_debug/control

Re: [PATCH v3 5/7] venus: Add debugfs interface to set firmware log level

2020-06-09 Thread Greg Kroah-Hartman

On Tue, Jun 09, 2020 at 01:46:02PM +0300, Stanimir Varbanov wrote:
> +int venus_dbgfs_init(struct venus_core *core)
> +{
> + core->root = debugfs_create_dir("venus", NULL);
> + if (IS_ERR(core->root))
> + return IS_ERR(core->root);

You really do not care, and obviously did not test this on a system with
CONFIG_DEBUG_FS disabled :)

Just make the call to debugfs, and move on, feed it into other debugfs
calls, all is good.

This function should just return 'void', no need to care about this at
all.

> + ret = venus_sys_set_debug(hdev, venus_fw_debug);
> + if (ret)
> + dev_warn(dev, "setting fw debug msg ON failed (%d)\n", ret);

Why do you care about this "error"?

thanks,

greg k-h

Re: [PATCH v3 4/7] printk: Add pr_debug_level macro over dynamic one

2020-06-09 Thread Greg Kroah-Hartman

On Tue, Jun 09, 2020 at 01:46:01PM +0300, Stanimir Varbanov wrote:
> Introduce new pr_debug_level macro over dynamic_debug level one
> to allow dynamic debugging to show only important messages.

What does "important messages" mean???

Re: [PATCH v3 0/7] Venus dynamic debug

2020-06-09 Thread Matthew Wilcox

On Tue, Jun 09, 2020 at 01:45:57PM +0300, Stanimir Varbanov wrote:
> Here is the third version of dynamic debug improvements in Venus
> driver.  As has been suggested on previous version by Joe [1] I've
> made the relevant changes in dynamic debug core to handle leveling
> as more generic way and not open-code/workaround it in the driver.
> 
> About changes:
>  - added change in the dynamic_debug and in documentation
>  - added respective pr_debug_level and dev_dbg_level

Honestly, this seems like you want to use tracepoints, not dynamic debug.

Re: [PATCH v3 6/7] venus: Make debug infrastructure more flexible

2020-06-09 Thread Greg Kroah-Hartman

On Tue, Jun 09, 2020 at 01:46:03PM +0300, Stanimir Varbanov wrote:
> Here we introduce few debug macros with levels (low, medium and
> high) and debug macro for firmware. Enabling the particular level
> will be done by dynamic debug with levels.
> 
> For example to enable debug messages with low level:
> echo 'module venus_dec level 0x01 +p' > debugfs/dynamic_debug/control
> 
> If you want to enable all levels:
> echo 'module venus_dec level 0x07 +p' > debugfs/dynamic_debug/control
> 
> All the features which dynamic debugging provide are preserved.
> 
> And finaly all dev_dbg are translated to VDBGX with appropriate
> debug levels.
> 
> Signed-off-by: Stanimir Varbanov 
> ---
>  drivers/media/platform/qcom/venus/core.h  |  5 ++
>  drivers/media/platform/qcom/venus/helpers.c   |  2 +-
>  drivers/media/platform/qcom/venus/hfi_msgs.c  | 30 -
>  drivers/media/platform/qcom/venus/hfi_venus.c | 20 --
>  .../media/platform/qcom/venus/pm_helpers.c|  3 +-
>  drivers/media/platform/qcom/venus/vdec.c  | 63 +--
>  drivers/media/platform/qcom/venus/venc.c  |  4 ++
>  7 files changed, 96 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/media/platform/qcom/venus/core.h 
> b/drivers/media/platform/qcom/venus/core.h
> index b48782f9aa95..63eabf5ff96d 100644
> --- a/drivers/media/platform/qcom/venus/core.h
> +++ b/drivers/media/platform/qcom/venus/core.h
> @@ -15,6 +15,11 @@
>  #include "dbgfs.h"
>  #include "hfi.h"
>  
> +#define VDBGL(fmt, args...)  pr_debug_level(0x01, fmt, ##args)
> +#define VDBGM(fmt, args...)  pr_debug_level(0x02, fmt, ##args)
> +#define VDBGH(fmt, args...)  pr_debug_level(0x04, fmt, ##args)
> +#define VDBGFW(fmt, args...) pr_debug_level(0x08, fmt, ##args)
> +
>  #define VIDC_CLKS_NUM_MAX4
>  #define VIDC_VCODEC_CLKS_NUM_MAX 2
>  #define VIDC_PMDOMAINS_NUM_MAX   3
> diff --git a/drivers/media/platform/qcom/venus/helpers.c 
> b/drivers/media/platform/qcom/venus/helpers.c
> index 0143af7822b2..115a9a2af1d6 100644
> --- a/drivers/media/platform/qcom/venus/helpers.c
> +++ b/drivers/media/platform/qcom/venus/helpers.c
> @@ -396,7 +396,7 @@ put_ts_metadata(struct venus_inst *inst, struct 
> vb2_v4l2_buffer *vbuf)
>   }
>  
>   if (slot == -1) {
> - dev_dbg(inst->core->dev, "%s: no free slot\n", __func__);
> + VDBGH("no free slot for timestamp\n");

So you just lost the information that dev_dbg() gave you with regards to
the device/driver/instance creating that message?

Ick, no, don't do that.

And why is this driver somehow "special" compared to all the rest of
the kernel?  Why is the current dev_dbg() control not sufficient that
you need to change the core for just this tiny thing?

thanks,

greg k-h

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Greg Kroah-Hartman

On Tue, Jun 09, 2020 at 01:45:58PM +0300, Stanimir Varbanov wrote:
> This adds description of the level bitmask feature.
> 
> Cc: Jonathan Corbet  (maintainer:DOCUMENTATION)
> 
> Signed-off-by: Stanimir Varbanov 
> ---
>  Documentation/admin-guide/dynamic-debug-howto.rst | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst 
> b/Documentation/admin-guide/dynamic-debug-howto.rst
> index 0dc2eb8e44e5..c2b751fc8a17 100644
> --- a/Documentation/admin-guide/dynamic-debug-howto.rst
> +++ b/Documentation/admin-guide/dynamic-debug-howto.rst
> @@ -208,6 +208,12 @@ line
>   line -1605  // the 1605 lines from line 1 to line 1605
>   line 1600-  // all lines from line 1600 to the end of the file
>  
> +level
> +The given level will be a bitmask ANDed with the level of the each 
> ``pr_debug()``
> +callsite. This will allow to group debug messages and show only those of 
> the
> +same level.  The -p flag takes precedence over the given level. Note 
> that we can
> +have up to five groups of debug messages.

As was pointed out, this isn't a "level", it's some arbitrary type of
"grouping".

But step back, why?  What is wrong with the existing control of dynamic
debug messages that you want to add another type of arbitrary grouping
to it?  And who defines that grouping?  Will it be
driver/subsystem/arch/author specific?  Or kernel-wide?

This feels like it could easily get out of hand really quickly.

Why not just use tracepoints if you really want to be fine-grained?

thanks,

greg k-h

Re: [EXT] Re: [PATCH v2 net-next 00/10] net: ocelot: VCAP IS1 and ES0 support

2020-06-09 Thread Vladimir Oltean

Hi Xiaoliang,

On Tue, 2 Jun 2020 at 11:50, Xiaoliang Yang  wrote:
>
> Hi Vladimir,
>
> On Tus, 2 Jun 2020 at 16:04,
> > First of all, net-next has just closed yesterday and will be closed for the 
> > following 2 weeks:
> > https://eur01.safelinks.protection.outlook.com/?url=http:%2F%2Fvger.kernel.org%2F~davem%2Fnet-next.html&data=02%7C01%
> >  
> > 7Cxiaoliang.yang_1%40nxp.com%7C2fad4495dabc4f4ca5fd08d806cb70af%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C637266818117666386&sdata=ziVybWb4HzYXanehF5KwNv5RJL%2BZz6NeFvrZWg657B8%3D&reserved=0
> >
> > Secondly, could you give an example of how different chains could express 
> > the fact that rules are executed in parallel between the IS1,
> > IS2 and ES0 TCAMs?
> >
>
> Different TCAMs are not running in parallel, they have flow order: 
> IS1->IS2->ES0. Using goto chain to express the flow order.
> For example:
> tc qdisc add dev swp0 ingress
> tc filter add dev swp0 chain 0 protocol 802.1Q parent : flower 
> skip_sw vlan_id 1 vlan_prio 1 action vlan modify id 2 priority 2 action goto 
> chain 1
> tc filter add dev swp0 chain 1 protocol 802.1Q parent : flower 
> skip_sw vlan_id 2 vlan_prio 2 action drop
> In this example, package with (vid=1,pcp=1) vlan tag will be modified to 
> (vid=2,pcp=2) vlan tag on IS1, then will be dropped on IS2.
>
> If there is no rule match on IS1, it will still lookup on IS2. We can set a 
> rule on chain 0 to express this:
> tc filter add dev swp0 chain 0 parent : flower skip_sw action 
> goto chain 1
>
> In addition, VSC9959 chip has PSFP and "Sequence Generation recovery" modules 
> are running after IS2, the flow order like this: IS1->IS2->PSFP-> "Sequence 
> Generation recovery" ->ES0, we can also add chains like this to express these 
> two modules in future.
>

I've been pondering over what is a good abstraction for 802.1CB and I
don't think that it would be a tc action. After reading Annex C "Frame
Replication and Elimination for Reliability in systems" in
8021CB-2017, I think maybe it should be modeled as a stacked net
device a la hsr, but with the ability to add its own stream filtering
rules and actions (a la bridge fdb).
But for the PSFP policers, in principle I think you are correct, we
could designate a static chain id for those.

> BTW, where should I sent patches to due to net-next closed?
>

You can keep sending patches just as you did. There's nothing wrong
with doing that as long as you're only doing it for the feedback (RFC
== Request For Comments).
Since David receives a lot of patches and the backlog builds up very
quickly, he just rejects patches sent to net-next during the merge
window instead of queuing them up.
Patches that are bugfixes (not the case here, just in general) can be
sent to the net tree at all times (even during the merge window).
In all cases, the mailing list is the same, just the --subject-prefix
is different (net, net-next, rfc).

> Thanks,
> Xiaoliang Yang

Thanks,
-Vladimir

[no subject]

2020-06-09 Thread Gaurav Singh

Please find the patch below.

Thanks and regards,
Gaurav.

>From Gaurav Singh  # This line is ignored.
From: Gaurav Singh 
Reply-To: 
Subject: 
In-Reply-To:

[PATCH] bpf: alloc_record_per_cpu Add null check after malloc

2020-06-09 Thread Gaurav Singh

Signed-off-by: Gaurav Singh 
---
 samples/bpf/xdp_rxq_info_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
index 4fe47502ebed..490b07b7df78 100644
--- a/samples/bpf/xdp_rxq_info_user.c
+++ b/samples/bpf/xdp_rxq_info_user.c
@@ -202,11 +202,11 @@ static struct datarec *alloc_record_per_cpu(void)
 
size = sizeof(struct datarec) * nr_cpus;
array = malloc(size);
-   memset(array, 0, size);
if (!array) {
fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus);
exit(EXIT_FAIL_MEM);
}
+   memset(array, 0, size);
return array;
 }
 
-- 
2.17.1

Re: your mail

2020-06-09 Thread Greg KH

On Tue, Jun 09, 2020 at 07:38:38AM -0400, Gaurav Singh wrote:
> Please find the patch below.
> 
> Thanks and regards,
> Gaurav.
> 
> >From Gaurav Singh  # This line is ignored.
> From: Gaurav Singh 
> Reply-To: 
> Subject: 
> In-Reply-To: 
> 
> 

I think something went wrong in your submission, just use 'git
send-email' to send the patch out.

thanks,

greg k-h

Re: [PATCH] bpf: alloc_record_per_cpu Add null check after malloc

2020-06-09 Thread Greg KH

On Tue, Jun 09, 2020 at 07:38:39AM -0400, Gaurav Singh wrote:
> Signed-off-by: Gaurav Singh 
> ---
>  samples/bpf/xdp_rxq_info_user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

I know I don't take patches without any changelog text, maybe other
maintainers are more lax...

[PATCH] bpf: alloc_record_per_cpu Add null check after malloc

2020-06-09 Thread Gaurav Singh

The memset call is made right after malloc call. To fix this, add the null 
check right after malloc and then do memset.

Signed-off-by: Gaurav Singh 
---
 samples/bpf/xdp_rxq_info_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
index 4fe47502ebed..490b07b7df78 100644
--- a/samples/bpf/xdp_rxq_info_user.c
+++ b/samples/bpf/xdp_rxq_info_user.c
@@ -202,11 +202,11 @@ static struct datarec *alloc_record_per_cpu(void)
 
size = sizeof(struct datarec) * nr_cpus;
array = malloc(size);
-   memset(array, 0, size);
if (!array) {
fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus);
exit(EXIT_FAIL_MEM);
}
+   memset(array, 0, size);
return array;
 }
 
-- 
2.17.1

Re: [PATCH] net: ethernet: mvneta: add support for 2.5G DRSGMII mode

2020-06-09 Thread Sascha Hauer

On Mon, Jun 08, 2020 at 05:08:01PM +0100, Russell King - ARM Linux admin wrote:
> On Mon, Jun 08, 2020 at 09:47:16AM +0200, Sascha Hauer wrote:
> > The Marvell MVNETA Ethernet controller supports a 2.5 Gbps SGMII mode
> > called DRSGMII.
> > 
> > This patch adds a corresponding phy-mode string 'drsgmii' and parses it
> > from DT. The MVNETA then configures the SERDES protocol value
> > accordingly.
> > 
> > It was successfully tested on a MV78460 connected to a FPGA.
> 
> Digging around, this is Armada XP?  Which SoCs is this mode supported?
> There's no mention of DRSGMII in the A38x nor A37xx documentation which
> are later than Armada XP.

It's an Armada XP MV78460 in my case. I have no idea what other SoCs
this mode is supported on.

> 
> What exactly is "drsgmii"?  It can't be "double-rate" SGMII because that
> would give you 2Gbps max instead of the 1Gbps, but this gives 2.5Gbps,
> so I'm really not sure using "drsgmii" is a good idea.  It may be what
> Marvell call it, but we really need to know if there's some vendor
> neutral way to refer to it.

The abbreviation really is for "Double Rated SGMII". It seems it has 2.5
times the clock rate than ordinary SGMII. Another term I found is HSGMII
(High serial gigabit media-independent interface) which also has
2.5Gbps.

Anyway, I just learned from the paragraph you added to
Documentation/networking/phy.rst that 1000BASEX differs from SGMII in
the format of the control word. As we have a fixed link to a FPGA the
control word seems to be unused, at least the Port MAC Control Register0
PortType setting bit doesn't change anything. So I can equally well use
the existing 2500BASEX mode.

Sascha

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |

Re: [PATCH bpf-next V1] bpf: devmap dynamic map-value area based on BTF

2020-06-09 Thread Jesper Dangaard Brouer

On Fri, 5 Jun 2020 09:58:26 -0700
Alexei Starovoitov  wrote:

> On Fri, Jun 5, 2020 at 1:23 AM Jesper Dangaard Brouer  
> wrote:
> >
> > Great. If we can remove this requirement of -1 init (and let zero mean
> > feature isn't used), then I'm all for exposing expose in uapi/bpf.h.  
> 
> Not having it in bpf.h doesn't magically make it invisible.
> It's uapi because user space C sources rely on its fixed format.
> vmlinux.h contains all kernel types. both uapi and kernel internal.
> devmap selftest taking uapi 'struct bpf_devmap_val' from vmlinux.h is
> an awful hack.
> I prefer to keep vmlinux.h usage to bpf programs only.
> User space C code should interface with kernel via proper uapi headers.
> When vmlinux.h is used by bpf C program it's completely different from
> user space C code doing the same, because llvm emits relocations for
> bpf prog and libbpf adjusts them.
> So doing 'foo->bar' in bpf C is specific to target kernel, whereas
> user C code 'foo->bar' is a hard constant which bakes it into uapi
> that kernel has to keep backwards compatible.

Thank you for taking time to explain this.
Much appreciated, I agree with everything above.


> If in some distant future we teach both gcc and clang to do bpf-style
> relocations for x86 and teach ld.so to adjust them then we can dream
> about very differently looking kernel/user interfaces.
> Right now any struct used by user C code and passed into kernel is uapi.

I like this future vision.

I guess this patch is premature, as it operates in the same problem
space. It tried to address uapi flexbility, by letting userspace define
the uapi layout via BTF at map_create() time, and let kernel-side
validate BTF-info and restrict possible struct member names, which are
remapped to offsets inside the kernel.

I'll instead wait for the future...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH] bpf: alloc_record_per_cpu Add null check after malloc

2020-06-09 Thread Jesper Dangaard Brouer

On Tue,  9 Jun 2020 08:08:03 -0400
Gaurav Singh  wrote:

> The memset call is made right after malloc call. To fix this, add the null 
> check right after malloc and then do memset.
> 

Did you read the section about how long lines should be in desc?


> Signed-off-by: Gaurav Singh 
> ---
>  samples/bpf/xdp_rxq_info_user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
> index 4fe47502ebed..490b07b7df78 100644
> --- a/samples/bpf/xdp_rxq_info_user.c
> +++ b/samples/bpf/xdp_rxq_info_user.c
> @@ -202,11 +202,11 @@ static struct datarec *alloc_record_per_cpu(void)
>  
>   size = sizeof(struct datarec) * nr_cpus;
>   array = malloc(size);
> - memset(array, 0, size);
>   if (!array) {
>   fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus);
>   exit(EXIT_FAIL_MEM);
>   }
> + memset(array, 0, size);
>   return array;
>  }

Looking at code, this bug happen in more places. Please fix up all locations.

I think this fix should go through the "bpf" tree.
Please read:
 
https://github.com/torvalds/linux/blob/master/Documentation/bpf/bpf_devel_QA.rst

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH v3 2/7] dynamic_debug: Group debug messages by level bitmask

2020-06-09 Thread Petr Mladek

On Tue 2020-06-09 13:45:59, Stanimir Varbanov wrote:
> This will allow dynamic debug users and driver writers to group
> debug messages by level bitmask.  The level bitmask should be a
> hex number.
> 
> Done this functionality by extending dynamic debug metadata with
> new level member and propagate it over all the users.  Also
> introduce new dynamic_pr_debug_level and dynamic_dev_dbg_level
> macros to be used by the drivers.

Could you please provide more details?

What is the use case?
What is the exact meaning of the level value?
How the levels will get defined?

Dynamic debug is used for messages with KERN_DEBUG log level.
Is this another dimension of the message leveling?

Given that the filter is defined by bits, it is rather grouping
by context or so.


> diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
> index 8f199f403ab5..5d28d388f6dd 100644
> --- a/lib/dynamic_debug.c
> +++ b/lib/dynamic_debug.c
> @@ -55,6 +55,7 @@ struct ddebug_query {
>   const char *function;
>   const char *format;
>   unsigned int first_lineno, last_lineno;
> + unsigned int level;
>  };
>  
>  struct ddebug_iter {
> @@ -187,6 +188,18 @@ static int ddebug_change(const struct ddebug_query 
> *query,
>  
>   nfound++;
>  
> +#ifdef CONFIG_JUMP_LABEL
> + if (query->level && query->level & dp->level) {
> + if (flags & _DPRINTK_FLAGS_PRINT)
> + 
> static_branch_enable(&dp->key.dd_key_true);
> + else
> + 
> static_branch_disable(&dp->key.dd_key_true);
> + } else if (query->level &&
> +flags & _DPRINTK_FLAGS_PRINT) {
> + static_branch_disable(&dp->key.dd_key_true);
> + continue;
> + }
> +#endif

This looks like a hack in the existing code:

  + It is suspicious that "continue" is only in one branch. It means
that static_branch_enable/disable() might get called 2nd time
by the code below. Or newflags are not stored when there is a change.

  + It changes the behavior and the below vpr_info("changed ...")
is not called.

Or do I miss anything?

>   newflags = (dp->flags & mask) | flags;
>   if (newflags == dp->flags)
>   continue;

Best Regards,
Petr

[PATCH 05/17] drivers: net: Fix trivial spelling

2020-06-09 Thread Kieran Bingham

The word 'descriptor' is misspelled throughout the tree.

Fix it up accordingly:
decriptors -> descriptors

Signed-off-by: Kieran Bingham 
---
 drivers/net/wan/lmc/lmc_main.c| 2 +-
 drivers/net/wireless/ath/ath10k/usb.c | 2 +-
 drivers/net/wireless/ath/ath6kl/usb.c | 2 +-
 drivers/net/wireless/cisco/airo.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
index a20f467ca48a..842def21e089 100644
--- a/drivers/net/wan/lmc/lmc_main.c
+++ b/drivers/net/wan/lmc/lmc_main.c
@@ -2063,7 +2063,7 @@ static void lmc_driver_timeout(struct net_device *dev, 
unsigned int txqueue)
 /*
  * Chip seems to have locked up
  * Reset it
- * This whips out all our decriptor
+ * This whips out all our descriptor
  * table and starts from scartch
  */
 
diff --git a/drivers/net/wireless/ath/ath10k/usb.c 
b/drivers/net/wireless/ath/ath10k/usb.c
index b7daf344d012..05a620ff6fe2 100644
--- a/drivers/net/wireless/ath/ath10k/usb.c
+++ b/drivers/net/wireless/ath/ath10k/usb.c
@@ -824,7 +824,7 @@ static int ath10k_usb_setup_pipe_resources(struct ath10k 
*ar,
 
ath10k_dbg(ar, ATH10K_DBG_USB, "usb setting up pipes using 
interface\n");
 
-   /* walk decriptors and setup pipes */
+   /* walk descriptors and setup pipes */
for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
endpoint = &iface_desc->endpoint[i].desc;
 
diff --git a/drivers/net/wireless/ath/ath6kl/usb.c 
b/drivers/net/wireless/ath/ath6kl/usb.c
index 53b66e9434c9..5372e948e761 100644
--- a/drivers/net/wireless/ath/ath6kl/usb.c
+++ b/drivers/net/wireless/ath/ath6kl/usb.c
@@ -311,7 +311,7 @@ static int ath6kl_usb_setup_pipe_resources(struct 
ath6kl_usb *ar_usb)
 
ath6kl_dbg(ATH6KL_DBG_USB, "setting up USB Pipes using interface\n");
 
-   /* walk decriptors and setup pipes */
+   /* walk descriptors and setup pipes */
for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
endpoint = &iface_desc->endpoint[i].desc;
 
diff --git a/drivers/net/wireless/cisco/airo.c 
b/drivers/net/wireless/cisco/airo.c
index 827bb6d74815..e3ad77666ba8 100644
--- a/drivers/net/wireless/cisco/airo.c
+++ b/drivers/net/wireless/cisco/airo.c
@@ -2450,7 +2450,7 @@ static void mpi_unmap_card(struct pci_dev *pci)
 
 /*
  *  This routine assumes that descriptors have been setup .
- *  Run at insmod time or after reset  when the decriptors
+ *  Run at insmod time or after reset when the descriptors
  *  have been initialized . Returns 0 if all is well nz
  *  otherwise . Does not allocate memory but sets up card
  *  using previously allocated descriptors.
-- 
2.25.1

[PATCH 00/17] spelling.txt: /decriptors/descriptors/

2020-06-09 Thread Kieran Bingham

I wouldn't normally go through spelling fixes, but I caught sight of
this typo twice, and then foolishly grepped the tree for it, and saw how
pervasive it was.

so here I am ... fixing a typo globally... but with an addition in
scripts/spelling.txt so it shouldn't re-appear ;-)

Cc: linux-arm-ker...@lists.infradead.org (moderated list:TI DAVINCI MACHINE 
SUPPORT)
Cc: linux-ker...@vger.kernel.org (open list)
Cc: linux...@vger.kernel.org (open list:DEVICE FREQUENCY EVENT (DEVFREQ-EVENT))
Cc: linux-g...@vger.kernel.org (open list:GPIO SUBSYSTEM)
Cc: dri-de...@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: linux-r...@vger.kernel.org (open list:HFI1 DRIVER)
Cc: linux-in...@vger.kernel.org (open list:INPUT (KEYBOARD, MOUSE, JOYSTICK, 
TOUCHSCREEN)...)
Cc: linux-...@lists.infradead.org (open list:NAND FLASH SUBSYSTEM)
Cc: netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
Cc: ath...@lists.infradead.org (open list:QUALCOMM ATHEROS ATH10K WIRELESS 
DRIVER)
Cc: linux-wirel...@vger.kernel.org (open list:NETWORKING DRIVERS (WIRELESS))
Cc: linux-s...@vger.kernel.org (open list:IBM Power Virtual FC Device Drivers)
Cc: linuxppc-...@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
64-BIT))
Cc: linux-...@vger.kernel.org (open list:USB SUBSYSTEM)
Cc: virtualizat...@lists.linux-foundation.org (open list:VIRTIO CORE AND NET 
DRIVERS)
Cc: linux...@kvack.org (open list:MEMORY MANAGEMENT)


Kieran Bingham (17):
  arch: arm: mach-davinci: Fix trivial spelling
  drivers: infiniband: Fix trivial spelling
  drivers: gpio: Fix trivial spelling
  drivers: mtd: nand: raw: Fix trivial spelling
  drivers: net: Fix trivial spelling
  drivers: scsi: Fix trivial spelling
  drivers: usb: Fix trivial spelling
  drivers: gpu: drm: Fix trivial spelling
  drivers: regulator: Fix trivial spelling
  drivers: input: joystick: Fix trivial spelling
  drivers: infiniband: Fix trivial spelling
  drivers: devfreq: Fix trivial spelling
  include: dynamic_debug.h: Fix trivial spelling
  kernel: trace: Fix trivial spelling
  mm: Fix trivial spelling
  regulator: gpio: Fix trivial spelling
  scripts/spelling.txt: Add descriptors correction

 arch/arm/mach-davinci/board-da830-evm.c  | 2 +-
 drivers/devfreq/devfreq-event.c  | 4 ++--
 drivers/gpio/TODO| 2 +-
 drivers/gpu/drm/drm_dp_helper.c  | 2 +-
 drivers/infiniband/hw/hfi1/iowait.h  | 2 +-
 drivers/infiniband/hw/hfi1/ipoib_tx.c| 2 +-
 drivers/infiniband/hw/hfi1/verbs_txreq.h | 2 +-
 drivers/input/joystick/spaceball.c   | 2 +-
 drivers/mtd/nand/raw/mxc_nand.c  | 2 +-
 drivers/mtd/nand/raw/nand_bbt.c  | 2 +-
 drivers/net/wan/lmc/lmc_main.c   | 2 +-
 drivers/net/wireless/ath/ath10k/usb.c| 2 +-
 drivers/net/wireless/ath/ath6kl/usb.c| 2 +-
 drivers/net/wireless/cisco/airo.c| 2 +-
 drivers/regulator/fixed.c| 2 +-
 drivers/regulator/gpio-regulator.c   | 2 +-
 drivers/scsi/ibmvscsi/ibmvfc.c   | 2 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c | 2 +-
 drivers/scsi/qla2xxx/qla_inline.h| 2 +-
 drivers/scsi/qla2xxx/qla_iocb.c  | 6 +++---
 drivers/usb/core/of.c| 2 +-
 include/drm/drm_dp_helper.h  | 2 +-
 include/linux/dynamic_debug.h| 2 +-
 kernel/trace/trace_events.c  | 2 +-
 mm/balloon_compaction.c  | 4 ++--
 scripts/spelling.txt | 1 +
 26 files changed, 30 insertions(+), 29 deletions(-)

-- 
2.25.1

Re: [PATCH v2 net-next 03/10] net: mscc: ocelot: allocated rules to different hardware VCAP TCAMs by chain index

2020-06-09 Thread Vladimir Oltean

Hi Allan,

On Mon, 8 Jun 2020 at 16:56, Allan W. Nielsen
 wrote:
>
> On 03.06.2020 13:04, Vladimir Oltean wrote:
> >On Tue, 2 Jun 2020 at 11:38, Allan W. Nielsen
> > wrote:
> >>
> >> Hi Xiaoliang,
> >>
> >> Happy to see that you are moving in the directions of multi chain - this
> >> seems ilke a much better fit to me.
> >>
> >>
> >> On 02.06.2020 13:18, Xiaoliang Yang wrote:
> >> >There are three hardware TCAMs for ocelot chips: IS1, IS2 and ES0. Each
> >> >one supports different actions. The hardware flow order is: IS1->IS2->ES0.
> >> >
> >> >This patch add three blocks to store rules according to chain index.
> >> >chain 0 is offloaded to IS1, chain 1 is offloaded to IS2, and egress chain
> >> >0 is offloaded to ES0.
> >>
> >> Using "static" allocation to to say chain-X goes to TCAM Y, also seems
> >> like the right approach to me. Given the capabilities of the HW, this
> >> will most likely be the easiest scheme to implement and to explain to
> >> the end-user.
> >>
> >> But I think we should make some adjustments to this mapping schema.
> >>
> >> Here are some important "things" I would like to consider when defining
> >> this schema:
> >>
> >> - As you explain, we have 3 TCAMs (IS1, IS2 and ES0), but we have 3
> >>parallel lookups in IS1 and 2 parallel lookups in IS2 - and also these
> >>TCAMs has a wide verity of keys.
> >>
> >> - We can utilize these multiple parallel lookups such that it seems like
> >>they are done in serial (that is if they do not touch the same
> >>actions), but as they are done in parallel they can not influence each
> >>other.
> >>
> >> - We can let IS1 influence the IS2 lookup (like the GOTO actions was
> >>intended to be used).
> >>
> >> - The chip also has other QoS classification facilities which sits
> >>before the TCAM (take a look at 3.7.3 QoS, DP, and DSCP Classification
> >>in vsc7514 datasheet). It we at some point in time want to enable
> >>this, then I think we need to do that in the same tc-flower framework.
> >>
> >> Here is my initial suggestion for an alternative chain-schema:
> >>
> >> Chain 0:   The default chain - today this is in IS2. If we proceed
> >> with this as is - then this will change.
> >> Chain 1-:  These are offloaded by "basic" classification.
> >> Chain 1-1: These are offloaded in IS1
> >> Chain 1: Lookup-0 in IS1, and here we could limit 
> >> the
> >>  action to do QoS related stuff (priority
> >>  update)
> >> Chain 11000: Lookup-1 in IS1, here we could do VLAN
> >>  stuff
> >> Chain 12000: Lookup-2 in IS1, here we could apply the
> >>  "PAG" which is essentially a GOTO.
> >>
> >> Chain 2-2: These are offloaded in IS2
> >> Chain 2-20255: Lookup-0 in IS2, where CHAIN-ID -
> >>2 is the PAG value.
> >> Chain 21000-21000: Lookup-1 in IS2.
> >>
> >> All these chains should be optional - users should only need to
> >> configure the chains they need. To make this work, we need to configure
> >> both the desired actions (could be priority update) and the goto action.
> >> Remember in HW, all packets goes through this process, while in SW they
> >> only follow the "goto" path.
> >>
> >> An example could be (I have not tested this yet - sorry):
> >>
> >> tc qdisc add dev eth0 ingress
> >>
> >> # Activate lookup 11000. We can not do any other rules in chain 0, also
> >> # this implicitly means that we do not want any chains <11000.
> >> tc filter add dev eth0 parent : chain 0
> >> action
> >> matchall goto 11000
> >>
> >> tc filter add dev eth0 parent : chain 11000 \
> >> flower src_mac 00:01:00:00:00:00/00:ff:00:00:00:00 \
> >> action \
> >> vlan modify id 1234 \
> >> pipe \
> >> goto 20001
> >>
> >> tc filter add dev eth0 parent : chain 20001 ...
> >>
> >> Maybe it would be an idea to create some use-cases, implement them in a
> >> test which can pass with today's SW, and then once we have a common
> >> understanding of what we want, we can implement it?
> >>
> >> /Allan
> >>
> >> >Using action goto chain to express flow order as follows:
> >> >tc filter add dev swp0 chain 0 parent : flower skip_sw \
> >> >action goto chain 1
> >> >
> >> >Signed-off-by: Xiaoliang Yang 
> >> >---
> >> > drivers/net/ethernet/mscc/ocelot_ace.c| 51 +++
> >> > drivers/net/ethernet/mscc/ocelot_ace.h|  7 ++--
> >> > drivers/net/ethernet/mscc/ocelot_flower.c | 46 +---
> >> > include/soc/mscc/ocelot.h |  2 +-
> >> > include/soc/mscc/ocelot_vcap.h|  4 +-
> >> > 5 files changed, 81 insertions(+), 29 deletions(-)
> >
> >> /Allan
> >
> >What would be the advantage, from a user perspective

Re: [PATCH] net: ethernet: mvneta: add support for 2.5G DRSGMII mode

2020-06-09 Thread Sascha Hauer

On Mon, Jun 08, 2020 at 04:57:37PM +0200, Andrew Lunn wrote:
> On Mon, Jun 08, 2020 at 09:47:16AM +0200, Sascha Hauer wrote:
> > The Marvell MVNETA Ethernet controller supports a 2.5 Gbps SGMII mode
> > called DRSGMII.
> > 
> > This patch adds a corresponding phy-mode string 'drsgmii' and parses it
> > from DT. The MVNETA then configures the SERDES protocol value
> > accordingly.
> > 
> > It was successfully tested on a MV78460 connected to a FPGA.
> 
> Hi Sascha
> 
> Is this really overclocked SGMII, or 2500BaseX? How does it differ
> from 2500BaseX, which mvneta already supports?

I think it is overclocked SGMII or 2500BaseX depending on the Port MAC
Control Register0 PortType setting bit.
As said to Russell we have a fixed link so nobody really cares if it's
SGMII or 2500BaseX. This boils down the patch to fixing the Serdes
configuration setting for 2500BaseX.

Sascha


-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |

Re: [PATCH net] net: mvneta: do not redirect frames during reconfiguration

2020-06-09 Thread Andrew Lunn

On Tue, Jun 09, 2020 at 09:41:10AM +0200, Lorenzo Bianconi wrote:
> > On Tue, Jun 09, 2020 at 12:02:39AM +0200, Lorenzo Bianconi wrote:
> > > Disable frames injection in mvneta_xdp_xmit routine during hw
> > > re-configuration in order to avoid hardware hangs
> > 
> > Hi Lorenzo
> > 
> > Why does mvneta_tx() also not need the same protection?
> > 
> > Andrew
> 
> Hi Andrew,
> 
> So far I have not been able to trigger the issue in the legacy tx path.

Even if you have not hit the issue, do you still think it is possible?
If it is hard to trigger, maybe it is worth protecting against it,
just in case.

> I hit the problem adding the capability to attach an eBPF program to CPUMAP
> entries [1]. In particular I am redirecting traffic to mvneta and concurrently
> attaching/removing a XDP program to/from it.
> I am not sure this can occur running mvneta_tx().
> Moreover it seems a common pattern for .ndo_xdp_xmit() in other drivers
> (e.g ixgbe, bnxt, mlx5)

I was wondering if this should be solved at a higher level. And if you
say there are more MAC drivers with this issue, maybe it should. Not
sure how though. It seems like MTU change and rx mode change wound
need to be protected, which at a higher level is harder to do. What
exactly do you need to protect, in a generic way?

 Andrew

[PATCH] net: mvneta: Fix Serdes configuration for 2.5Gbps modes

2020-06-09 Thread Sascha Hauer

The Marvell MVNETA Ethernet controller supports a 2.5Gbps SGMII mode
called DRSGMII. Depending on the Port MAC Control Register0 PortType
setting this seems to be either an overclocked SGMII mode or 2500BaseX.

This patch adds the necessary Serdes Configuration setting for the
2.5Gbps modes. There is no phy interface mode define for overclocked
SGMII, so only 2500BaseX is handled for now.

As phy_interface_mode_is_8023z() returns true for both
PHY_INTERFACE_MODE_1000BASEX and PHY_INTERFACE_MODE_2500BASEX we
explicitly test for 1000BaseX instead of using
phy_interface_mode_is_8023z() to differentiate the different
possibilities.

Signed-off-by: Sascha Hauer 
---
 drivers/net/ethernet/marvell/mvneta.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 51889770958d8..3b13048931412 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -109,6 +109,7 @@
 #define MVNETA_SERDES_CFG   0x24A0
 #define  MVNETA_SGMII_SERDES_PROTO  0x0cc7
 #define  MVNETA_QSGMII_SERDES_PROTO 0x0667
+#define  MVNETA_DRSGMII_SERDES_PROTO0x1107
 #define MVNETA_TYPE_PRIO 0x24bc
 #define  MVNETA_FORCE_UNIBIT(21)
 #define MVNETA_TXQ_CMD_1 0x24e4
@@ -4966,8 +4967,10 @@ static int mvneta_port_power_up(struct mvneta_port *pp, 
int phy_mode)
if (phy_mode == PHY_INTERFACE_MODE_QSGMII)
mvreg_write(pp, MVNETA_SERDES_CFG, MVNETA_QSGMII_SERDES_PROTO);
else if (phy_mode == PHY_INTERFACE_MODE_SGMII ||
-phy_interface_mode_is_8023z(phy_mode))
+phy_mode == PHY_INTERFACE_MODE_1000BASEX)
mvreg_write(pp, MVNETA_SERDES_CFG, MVNETA_SGMII_SERDES_PROTO);
+   else if (phy_mode == PHY_INTERFACE_MODE_2500BASEX)
+   mvreg_write(pp, MVNETA_SERDES_CFG, MVNETA_DRSGMII_SERDES_PROTO);
else if (!phy_interface_mode_is_rgmii(phy_mode))
return -EINVAL;
 
-- 
2.27.0

Re: [PATCH] net: ethernet: mvneta: add support for 2.5G DRSGMII mode

2020-06-09 Thread Andrew Lunn

On Tue, Jun 09, 2020 at 02:55:35PM +0200, Sascha Hauer wrote:
> On Mon, Jun 08, 2020 at 04:57:37PM +0200, Andrew Lunn wrote:
> > On Mon, Jun 08, 2020 at 09:47:16AM +0200, Sascha Hauer wrote:
> > > The Marvell MVNETA Ethernet controller supports a 2.5 Gbps SGMII mode
> > > called DRSGMII.
> > > 
> > > This patch adds a corresponding phy-mode string 'drsgmii' and parses it
> > > from DT. The MVNETA then configures the SERDES protocol value
> > > accordingly.
> > > 
> > > It was successfully tested on a MV78460 connected to a FPGA.
> > 
> > Hi Sascha
> > 
> > Is this really overclocked SGMII, or 2500BaseX? How does it differ
> > from 2500BaseX, which mvneta already supports?
> 
> I think it is overclocked SGMII or 2500BaseX depending on the Port MAC
> Control Register0 PortType setting bit.
> As said to Russell we have a fixed link so nobody really cares if it's
> SGMII or 2500BaseX. This boils down the patch to fixing the Serdes
> configuration setting for 2500BaseX.

Hi Sascha

Does 2500BaseX work for your use case? Since this drsmgii mode is not
well defined, i would prefer to not add it, unless it is really
needed.

Andrew

Re: [PATCH] net: ethernet: mvneta: add support for 2.5G DRSGMII mode

2020-06-09 Thread Sascha Hauer

Hi Andrew,

On Tue, Jun 09, 2020 at 03:12:16PM +0200, Andrew Lunn wrote:
> On Tue, Jun 09, 2020 at 02:55:35PM +0200, Sascha Hauer wrote:
> > On Mon, Jun 08, 2020 at 04:57:37PM +0200, Andrew Lunn wrote:
> > > On Mon, Jun 08, 2020 at 09:47:16AM +0200, Sascha Hauer wrote:
> > > > The Marvell MVNETA Ethernet controller supports a 2.5 Gbps SGMII mode
> > > > called DRSGMII.
> > > > 
> > > > This patch adds a corresponding phy-mode string 'drsgmii' and parses it
> > > > from DT. The MVNETA then configures the SERDES protocol value
> > > > accordingly.
> > > > 
> > > > It was successfully tested on a MV78460 connected to a FPGA.
> > > 
> > > Hi Sascha
> > > 
> > > Is this really overclocked SGMII, or 2500BaseX? How does it differ
> > > from 2500BaseX, which mvneta already supports?
> > 
> > I think it is overclocked SGMII or 2500BaseX depending on the Port MAC
> > Control Register0 PortType setting bit.
> > As said to Russell we have a fixed link so nobody really cares if it's
> > SGMII or 2500BaseX. This boils down the patch to fixing the Serdes
> > configuration setting for 2500BaseX.
> 
> Hi Sascha
> 
> Does 2500BaseX work for your use case? Since this drsmgii mode is not
> well defined, i would prefer to not add it, unless it is really
> needed.

Yes, it does, see updated patch I just sent.

Sascha

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |

Re: [PATCH] net: mvneta: Fix Serdes configuration for 2.5Gbps modes

2020-06-09 Thread Andrew Lunn

On Tue, Jun 09, 2020 at 03:11:52PM +0200, Sascha Hauer wrote:
> The Marvell MVNETA Ethernet controller supports a 2.5Gbps SGMII mode
> called DRSGMII. Depending on the Port MAC Control Register0 PortType
> setting this seems to be either an overclocked SGMII mode or 2500BaseX.
> 
> This patch adds the necessary Serdes Configuration setting for the
> 2.5Gbps modes. There is no phy interface mode define for overclocked
> SGMII, so only 2500BaseX is handled for now.
> 
> As phy_interface_mode_is_8023z() returns true for both
> PHY_INTERFACE_MODE_1000BASEX and PHY_INTERFACE_MODE_2500BASEX we
> explicitly test for 1000BaseX instead of using
> phy_interface_mode_is_8023z() to differentiate the different
> possibilities.

Hi Sascha

This seems like it should have a Fixes: tag, and be submitted to the
net tree. Please see the Networking FAQ.

Otherwise it looks O.K.

Andrew

[PATCH v2 2/2] net: phy: mscc: handle the clkout control on some phy variants

2020-06-09 Thread Heiko Stuebner

From: Heiko Stuebner 

At least VSC8530/8531/8540/8541 contain a clock output that can emit
a predefined rate of 25, 50 or 125MHz.

This may then feed back into the network interface as source clock.
So follow the example the at803x already set and introduce a
vsc8531,clk-out-frequency property to set that output.

Signed-off-by: Heiko Stuebner 
---
Hi Andrew,

I didn't change the property yet, do you have a suggestion on
how to name it though? Going by the other examples in the
ethernet-phy.yamls, something like enet-phy-clock-out-frequency ?


 .../bindings/net/mscc-phy-vsc8531.txt |  3 +
 drivers/net/phy/mscc/mscc.h   |  9 ++
 drivers/net/phy/mscc/mscc_main.c  | 87 +--
 3 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt 
b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
index 5ff37c68c941..4a1f50ae48e1 100644
--- a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
+++ b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
@@ -1,6 +1,8 @@
 * Microsemi - vsc8531 Giga bit ethernet phy
 
 Optional properties:
+- vsc8531,clk-out-frequency: Clock output frequency in Hertz.
+ Should be one of 2500, 5000, 12500
 - vsc8531,vddmac   : The vddmac in mV. Allowed values is listed
  in the first row of Table 1 (below).
  This property is only used in combination
@@ -63,6 +65,7 @@ Example:
 
 vsc8531_0: ethernet-phy@0 {
 compatible = "ethernet-phy-id0007.0570";
+vsc8531,clk-out-frequency = <12500>;
 vsc8531,vddmac = <3300>;
 vsc8531,edge-slowdown  = <7>;
 vsc8531,led-0-mode = ;
diff --git a/drivers/net/phy/mscc/mscc.h b/drivers/net/phy/mscc/mscc.h
index 414e3b31bb1f..c8c395a041c2 100644
--- a/drivers/net/phy/mscc/mscc.h
+++ b/drivers/net/phy/mscc/mscc.h
@@ -218,6 +218,13 @@ enum rgmii_clock_delay {
 #define INT_MEM_DATA_M   0x00ff
 #define INT_MEM_DATA(x)  (INT_MEM_DATA_M & (x))
 
+#define MSCC_CLKOUT_CNTL 13
+#define CLKOUT_ENABLEBIT(15)
+#define CLKOUT_FREQ_MASK GENMASK(14, 13)
+#define CLKOUT_FREQ_25M  (0x0 << 13)
+#define CLKOUT_FREQ_50M  (0x1 << 13)
+#define CLKOUT_FREQ_125M (0x2 << 13)
+
 #define MSCC_PHY_PROC_CMD18
 #define PROC_CMD_NCOMPLETED  0x8000
 #define PROC_CMD_FAILED  0x4000
@@ -361,6 +368,8 @@ struct vsc8531_private {
 */
unsigned int base_addr;
 
+   u32 clkout_rate;
+
 #if IS_ENABLED(CONFIG_MACSEC)
/* MACsec fields:
 * - One SecY per device (enforced at the s/w implementation level)
diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c
index 54cac9c295ad..012d5019d7c8 100644
--- a/drivers/net/phy/mscc/mscc_main.c
+++ b/drivers/net/phy/mscc/mscc_main.c
@@ -432,6 +432,18 @@ static int vsc85xx_dt_led_mode_get(struct phy_device 
*phydev,
return led_mode;
 }
 
+static void vsc8531_dt_clkout_rate_get(struct phy_device *phydev)
+{
+   struct vsc8531_private *priv = phydev->priv;
+   struct device *dev = &phydev->mdio.dev;
+   struct device_node *of_node = dev->of_node;
+
+   if (!of_node)
+   return;
+
+   of_property_read_u32(of_node, "vsc8531,clk-out-frequency",
+&priv->clkout_rate);
+}
 #else
 static int vsc85xx_edge_rate_magic_get(struct phy_device *phydev)
 {
@@ -444,6 +456,10 @@ static int vsc85xx_dt_led_mode_get(struct phy_device 
*phydev,
 {
return default_mode;
 }
+
+static void vsc8531_dt_clkout_rate_get(struct phy_device *phydev)
+{
+}
 #endif /* CONFIG_OF_MDIO */
 
 static int vsc85xx_dt_led_modes_get(struct phy_device *phydev,
@@ -1540,6 +1556,37 @@ static int vsc85xx_config_init(struct phy_device *phydev)
return 0;
 }
 
+static int vsc8531_config_init(struct phy_device *phydev)
+{
+   struct vsc8531_private *vsc8531 = phydev->priv;
+   u16 val;
+   int rc;
+
+   rc = vsc85xx_config_init(phydev);
+   if (rc)
+   return rc;
+
+   switch (vsc8531->clkout_rate) {
+   case 0:
+   val = 0;
+   break;
+   case 2500:
+   val = CLKOUT_FREQ_25M | CLKOUT_ENABLE;
+   break;
+   case 5000:
+   val = CLKOUT_FREQ_50M | CLKOUT_ENABLE;
+   break;
+   case 12500:
+   val = CLKOUT_FREQ_125M | CLKOUT_ENABLE;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return phy_write_paged(phydev, MSCC_PHY_PAGE_EXTENDED_GPIO,
+  MSCC_CLKOUT_CNTL, val);
+}
+
 static int vsc8584_did_interrupt(struct phy_device *phydev)
 {

[PATCH v2 1/2] net: phy: mscc: move shared probe code into a helper

2020-06-09 Thread Heiko Stuebner

From: Heiko Stuebner 

The different probe functions share a lot of code, so move the
common parts into a helper to reduce duplication.

Signed-off-by: Heiko Stuebner 
---
changes in v2:
- new patch as suggested by Andrew

 drivers/net/phy/mscc/mscc_main.c | 97 +---
 1 file changed, 40 insertions(+), 57 deletions(-)

diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c
index c8aa6d905d8e..54cac9c295ad 100644
--- a/drivers/net/phy/mscc/mscc_main.c
+++ b/drivers/net/phy/mscc/mscc_main.c
@@ -1983,12 +1983,11 @@ static int vsc85xx_read_status(struct phy_device 
*phydev)
return genphy_read_status(phydev);
 }
 
-static int vsc8514_probe(struct phy_device *phydev)
+static int vsc85xx_probe_helper(struct phy_device *phydev,
+   u32 *leds, int num_leds, u16 led_modes,
+   const struct vsc85xx_hw_stat *stats, int nstats)
 {
struct vsc8531_private *vsc8531;
-   u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
-  VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
-  VSC8531_DUPLEX_COLLISION};
 
vsc8531 = devm_kzalloc(&phydev->mdio.dev, sizeof(*vsc8531), GFP_KERNEL);
if (!vsc8531)
@@ -1996,46 +1995,46 @@ static int vsc8514_probe(struct phy_device *phydev)
 
phydev->priv = vsc8531;
 
-   vsc8531->nleds = 4;
-   vsc8531->supp_led_modes = VSC85XX_SUPP_LED_MODES;
-   vsc8531->hw_stats = vsc85xx_hw_stats;
-   vsc8531->nstats = ARRAY_SIZE(vsc85xx_hw_stats);
+   vsc8531->nleds = num_leds;
+   vsc8531->supp_led_modes = led_modes;
+   vsc8531->hw_stats = stats;
+   vsc8531->nstats = nstats;
vsc8531->stats = devm_kcalloc(&phydev->mdio.dev, vsc8531->nstats,
  sizeof(u64), GFP_KERNEL);
if (!vsc8531->stats)
return -ENOMEM;
 
-   return vsc85xx_dt_led_modes_get(phydev, default_mode);
+   return vsc85xx_dt_led_modes_get(phydev, leds);
 }
 
-static int vsc8574_probe(struct phy_device *phydev)
+static int vsc8514_probe(struct phy_device *phydev)
 {
-   struct vsc8531_private *vsc8531;
u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
   VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
   VSC8531_DUPLEX_COLLISION};
 
-   vsc8531 = devm_kzalloc(&phydev->mdio.dev, sizeof(*vsc8531), GFP_KERNEL);
-   if (!vsc8531)
-   return -ENOMEM;
-
-   phydev->priv = vsc8531;
+   return vsc85xx_probe_helper(phydev, default_mode,
+   ARRAY_SIZE(default_mode),
+   VSC85XX_SUPP_LED_MODES,
+   vsc85xx_hw_stats,
+   ARRAY_SIZE(vsc85xx_hw_stats));
+}
 
-   vsc8531->nleds = 4;
-   vsc8531->supp_led_modes = VSC8584_SUPP_LED_MODES;
-   vsc8531->hw_stats = vsc8584_hw_stats;
-   vsc8531->nstats = ARRAY_SIZE(vsc8584_hw_stats);
-   vsc8531->stats = devm_kcalloc(&phydev->mdio.dev, vsc8531->nstats,
- sizeof(u64), GFP_KERNEL);
-   if (!vsc8531->stats)
-   return -ENOMEM;
+static int vsc8574_probe(struct phy_device *phydev)
+{
+   u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
+  VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
+  VSC8531_DUPLEX_COLLISION};
 
-   return vsc85xx_dt_led_modes_get(phydev, default_mode);
+   return vsc85xx_probe_helper(phydev, default_mode,
+   ARRAY_SIZE(default_mode),
+   VSC8584_SUPP_LED_MODES,
+   vsc8584_hw_stats,
+   ARRAY_SIZE(vsc8584_hw_stats));
 }
 
 static int vsc8584_probe(struct phy_device *phydev)
 {
-   struct vsc8531_private *vsc8531;
u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
   VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
   VSC8531_DUPLEX_COLLISION};
@@ -2045,28 +2044,17 @@ static int vsc8584_probe(struct phy_device *phydev)
return -ENOTSUPP;
}
 
-   vsc8531 = devm_kzalloc(&phydev->mdio.dev, sizeof(*vsc8531), GFP_KERNEL);
-   if (!vsc8531)
-   return -ENOMEM;
-
-   phydev->priv = vsc8531;
-
-   vsc8531->nleds = 4;
-   vsc8531->supp_led_modes = VSC8584_SUPP_LED_MODES;
-   vsc8531->hw_stats = vsc8584_hw_stats;
-   vsc8531->nstats = ARRAY_SIZE(vsc8584_hw_stats);
-   vsc8531->stats = devm_kcalloc(&phydev->mdio.dev, vsc8531->nstats,
- sizeof(u64), GFP_KERNEL);
-   if (!vsc8531->stats)
-   return -ENOMEM;
-
-   return vsc85xx_dt_led_modes_get(phydev, default_mode);
+   return vsc85xx_probe_helper(phydev, default_mode,
+   ARRAY_SIZE(default_mode),
+   VSC8584_SUPP_LED_MODES,
+   vsc

[PATCH bpf V2 1/2] bpf: devmap adjust uapi for attach bpf program

2020-06-09 Thread Jesper Dangaard Brouer

V2:
- Defer changing BPF-syscall to start at file-descriptor 1
- Use {} to zero initialise struct.

The recent commit fbee97feed9b ("bpf: Add support to attach bpf program to a
devmap entry"), introduced ability to attach (and run) a separate XDP
bpf_prog for each devmap entry. A bpf_prog is added via a file-descriptor.
As zero were a valid FD, not using the feature requires using value minus-1.
The UAPI is extended via tail-extending struct bpf_devmap_val and using
map->value_size to determine the feature set.

This will break older userspace applications not using the bpf_prog feature.
Consider an old userspace app that is compiled against newer kernel
uapi/bpf.h, it will not know that it need to initialise the member
bpf_prog.fd to minus-1. Thus, users will be forced to update source code to
get program running on newer kernels.

This patch remove the minus-1 checks, and have zero mean feature isn't used.

Followup patches either for kernel or libbpf should handle and avoid
returning file-descriptor zero in the first place.

Fixes: fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")
Signed-off-by: Jesper Dangaard Brouer 
---
 include/uapi/linux/bpf.h |   13 +
 kernel/bpf/devmap.c  |   17 -
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c65b374a5090..19684813faae 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3761,6 +3761,19 @@ struct xdp_md {
__u32 egress_ifindex;  /* txq->dev->ifindex */
 };
 
+/* DEVMAP map-value layout
+ *
+ * The struct data-layout of map-value is a configuration interface.
+ * New members can only be added to the end of this structure.
+ */
+struct bpf_devmap_val {
+   __u32 ifindex;   /* device index */
+   union {
+   int   fd;  /* prog fd on map write */
+   __u32 id;  /* prog id on map read */
+   } bpf_prog;
+};
+
 enum sk_action {
SK_DROP = 0,
SK_PASS,
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 854b09beb16b..899a30a67cc1 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -60,15 +60,6 @@ struct xdp_dev_bulk_queue {
unsigned int count;
 };
 
-/* DEVMAP values */
-struct bpf_devmap_val {
-   u32 ifindex;   /* device index */
-   union {
-   int fd;  /* prog fd on map write */
-   u32 id;  /* prog id on map read */
-   } bpf_prog;
-};
-
 struct bpf_dtab_netdev {
struct net_device *dev; /* must be first member, due to tracepoint */
struct hlist_node index_hlist;
@@ -618,7 +609,7 @@ static struct bpf_dtab_netdev *__dev_map_alloc_node(struct 
net *net,
if (!dev->dev)
goto err_out;
 
-   if (val->bpf_prog.fd >= 0) {
+   if (val->bpf_prog.fd > 0) {
prog = bpf_prog_get_type_dev(val->bpf_prog.fd,
 BPF_PROG_TYPE_XDP, false);
if (IS_ERR(prog))
@@ -652,8 +643,8 @@ static int __dev_map_update_elem(struct net *net, struct 
bpf_map *map,
 void *key, void *value, u64 map_flags)
 {
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-   struct bpf_devmap_val val = { .bpf_prog.fd = -1 };
struct bpf_dtab_netdev *dev, *old_dev;
+   struct bpf_devmap_val val = {};
u32 i = *(u32 *)key;
 
if (unlikely(map_flags > BPF_EXIST))
@@ -669,7 +660,7 @@ static int __dev_map_update_elem(struct net *net, struct 
bpf_map *map,
if (!val.ifindex) {
dev = NULL;
/* can not specify fd if ifindex is 0 */
-   if (val.bpf_prog.fd != -1)
+   if (val.bpf_prog.fd > 0)
return -EINVAL;
} else {
dev = __dev_map_alloc_node(net, dtab, &val, i);
@@ -699,8 +690,8 @@ static int __dev_map_hash_update_elem(struct net *net, 
struct bpf_map *map,
 void *key, void *value, u64 map_flags)
 {
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-   struct bpf_devmap_val val = { .bpf_prog.fd = -1 };
struct bpf_dtab_netdev *dev, *old_dev;
+   struct bpf_devmap_val val = {};
u32 idx = *(u32 *)key;
unsigned long flags;
int err = -EEXIST;

[PATCH bpf V2 0/2] bpf: adjust uapi for devmap prior to kernel release

2020-06-09 Thread Jesper Dangaard Brouer

For special type maps (e.g. devmap and cpumap) the map-value data-layout is
a configuration interface. This is uapi that can only be tail extended.
Thus, new members (and thus features) can only be added to the end of this
structure, and the kernel uses the map->value_size from userspace to
determine feature set 'version'.

For this kind of uapi to be extensible and backward compatible, is it common
that new members/fields (that represent a new feature) in the struct are
initialized as zero, which indicate that the feature isn't used. This makes
it possible to write userspace applications that are unaware of new kernel
features, but just include latest uapi headers, zero-init struct and
populate features it knows about.

The recent extension of devmap with a bpf_prog.fd requires end-user to
supply the file-descriptor value minus-1 to communicate that the features
isn't used. This isn't compatible with the described kABI extension model.

V2: Drop patch-1 that changed BPF-syscall to start at file-descriptor 1

---

Jesper Dangaard Brouer (2):
  bpf: devmap adjust uapi for attach bpf program
  bpf: selftests and tools use struct bpf_devmap_val from uapi


 include/uapi/linux/bpf.h   |   13 +
 kernel/bpf/devmap.c|   17 -
 tools/include/uapi/linux/bpf.h |   13 +
 .../selftests/bpf/prog_tests/xdp_devmap_attach.c   |8 
 .../selftests/bpf/progs/test_xdp_devmap_helpers.c  |2 +-
 .../bpf/progs/test_xdp_with_devmap_helpers.c   |3 +--
 6 files changed, 32 insertions(+), 24 deletions(-)

--

[PATCH bpf V2 2/2] bpf: selftests and tools use struct bpf_devmap_val from uapi

2020-06-09 Thread Jesper Dangaard Brouer

Sync tools uapi bpf.h header file and update selftests that use
struct bpf_devmap_val.

Signed-off-by: Jesper Dangaard Brouer 
---
 tools/include/uapi/linux/bpf.h |   13 +
 .../selftests/bpf/prog_tests/xdp_devmap_attach.c   |8 
 .../selftests/bpf/progs/test_xdp_devmap_helpers.c  |2 +-
 .../bpf/progs/test_xdp_with_devmap_helpers.c   |3 +--
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c65b374a5090..19684813faae 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3761,6 +3761,19 @@ struct xdp_md {
__u32 egress_ifindex;  /* txq->dev->ifindex */
 };
 
+/* DEVMAP map-value layout
+ *
+ * The struct data-layout of map-value is a configuration interface.
+ * New members can only be added to the end of this structure.
+ */
+struct bpf_devmap_val {
+   __u32 ifindex;   /* device index */
+   union {
+   int   fd;  /* prog fd on map write */
+   __u32 id;  /* prog id on map read */
+   } bpf_prog;
+};
+
 enum sk_action {
SK_DROP = 0,
SK_PASS,
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c 
b/tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c
index d19dbd668f6a..88ef3ec8ac4c 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c
@@ -8,14 +8,6 @@
 
 #define IFINDEX_LO 1
 
-struct bpf_devmap_val {
-   u32 ifindex;   /* device index */
-   union {
-   int fd;  /* prog fd on map write */
-   u32 id;  /* prog id on map read */
-   } bpf_prog;
-};
-
 void test_xdp_with_devmap_helpers(void)
 {
struct test_xdp_with_devmap_helpers *skel;
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c 
b/tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c
index e5c0f131c8a7..b360ba2bd441 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c
@@ -2,7 +2,7 @@
 /* fails to load without expected_attach_type = BPF_XDP_DEVMAP
  * because of access to egress_ifindex
  */
-#include "vmlinux.h"
+#include 
 #include 
 
 SEC("xdp_dm_log")
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c 
b/tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c
index deef0e050863..330811260123 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
-
-#include "vmlinux.h"
+#include 
 #include 
 
 struct {

[PATCH net 2/3] i40e: protect ring accesses with READ- and WRITE_ONCE

2020-06-09 Thread Ciara Loftus

READ_ONCE should be used when reading rings prior to accessing the
statistics pointer. Introduce this as well as the corresponding WRITE_ONCE
usage when allocating and freeing the rings, to ensure protected access.

Signed-off-by: Ciara Loftus 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 29 ++---
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5d807c8004f8..56ecd6c3f236 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -439,11 +439,15 @@ static void i40e_get_netdev_stats_struct(struct 
net_device *netdev,
i40e_get_netdev_stats_struct_tx(ring, stats);
 
if (i40e_enabled_xdp_vsi(vsi)) {
-   ring++;
+   ring = READ_ONCE(vsi->xdp_rings[i]);
+   if (!ring)
+   continue;
i40e_get_netdev_stats_struct_tx(ring, stats);
}
 
-   ring++;
+   ring = READ_ONCE(vsi->rx_rings[i]);
+   if (!ring)
+   continue;
do {
start   = u64_stats_fetch_begin_irq(&ring->syncp);
packets = ring->stats.packets;
@@ -787,6 +791,8 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
for (q = 0; q < vsi->num_queue_pairs; q++) {
/* locate Tx ring */
p = READ_ONCE(vsi->tx_rings[q]);
+   if (!p)
+   continue;
 
do {
start = u64_stats_fetch_begin_irq(&p->syncp);
@@ -800,8 +806,11 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
tx_linearize += p->tx_stats.tx_linearize;
tx_force_wb += p->tx_stats.tx_force_wb;
 
-   /* Rx queue is part of the same block as Tx queue */
-   p = &p[1];
+   /* locate Rx ring */
+   p = READ_ONCE(vsi->rx_rings[q]);
+   if (!p)
+   continue;
+
do {
start = u64_stats_fetch_begin_irq(&p->syncp);
packets = p->stats.packets;
@@ -10824,10 +10833,10 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
if (vsi->tx_rings && vsi->tx_rings[0]) {
for (i = 0; i < vsi->alloc_queue_pairs; i++) {
kfree_rcu(vsi->tx_rings[i], rcu);
-   vsi->tx_rings[i] = NULL;
-   vsi->rx_rings[i] = NULL;
+   WRITE_ONCE(vsi->tx_rings[i], NULL);
+   WRITE_ONCE(vsi->rx_rings[i], NULL);
if (vsi->xdp_rings)
-   vsi->xdp_rings[i] = NULL;
+   WRITE_ONCE(vsi->xdp_rings[i], NULL);
}
}
 }
@@ -10861,7 +10870,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
ring->itr_setting = pf->tx_itr_default;
-   vsi->tx_rings[i] = ring++;
+   WRITE_ONCE(vsi->tx_rings[i], ring++);
 
if (!i40e_enabled_xdp_vsi(vsi))
goto setup_rx;
@@ -10879,7 +10888,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
ring->flags = I40E_TXR_FLAGS_WB_ON_ITR;
set_ring_xdp(ring);
ring->itr_setting = pf->tx_itr_default;
-   vsi->xdp_rings[i] = ring++;
+   WRITE_ONCE(vsi->xdp_rings[i], ring++);
 
 setup_rx:
ring->queue_index = i;
@@ -10892,7 +10901,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
ring->size = 0;
ring->dcb_tc = 0;
ring->itr_setting = pf->rx_itr_default;
-   vsi->rx_rings[i] = ring;
+   WRITE_ONCE(vsi->rx_rings[i], ring);
}
 
return 0;
-- 
2.17.1

[PATCH net 3/3] ice: protect ring accesses with WRITE_ONCE

2020-06-09 Thread Ciara Loftus

The READ_ONCE macro is used when reading rings prior to accessing the
statistics pointer. The corresponding WRITE_ONCE usage when allocating and
freeing the rings to ensure protected access was not in place. Introduce
this.

Signed-off-by: Ciara Loftus 
---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 8 
 drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 28b46cc9f5cb..2e3a39cea2c0 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1194,7 +1194,7 @@ static void ice_vsi_clear_rings(struct ice_vsi *vsi)
for (i = 0; i < vsi->alloc_txq; i++) {
if (vsi->tx_rings[i]) {
kfree_rcu(vsi->tx_rings[i], rcu);
-   vsi->tx_rings[i] = NULL;
+   WRITE_ONCE(vsi->tx_rings[i], NULL);
}
}
}
@@ -1202,7 +1202,7 @@ static void ice_vsi_clear_rings(struct ice_vsi *vsi)
for (i = 0; i < vsi->alloc_rxq; i++) {
if (vsi->rx_rings[i]) {
kfree_rcu(vsi->rx_rings[i], rcu);
-   vsi->rx_rings[i] = NULL;
+   WRITE_ONCE(vsi->rx_rings[i], NULL);
}
}
}
@@ -1235,7 +1235,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
ring->vsi = vsi;
ring->dev = dev;
ring->count = vsi->num_tx_desc;
-   vsi->tx_rings[i] = ring;
+   WRITE_ONCE(vsi->tx_rings[i], ring);
}
 
/* Allocate Rx rings */
@@ -1254,7 +1254,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
ring->netdev = vsi->netdev;
ring->dev = dev;
ring->count = vsi->num_rx_desc;
-   vsi->rx_rings[i] = ring;
+   WRITE_ONCE(vsi->rx_rings[i], ring);
}
 
return 0;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 082825e3cb39..4cbd49c87568 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1702,7 +1702,7 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
xdp_ring->netdev = NULL;
xdp_ring->dev = dev;
xdp_ring->count = vsi->num_tx_desc;
-   vsi->xdp_rings[i] = xdp_ring;
+   WRITE_ONCE(vsi->xdp_rings[i], xdp_ring);
if (ice_setup_tx_ring(xdp_ring))
goto free_xdp_rings;
ice_set_ring_xdp(xdp_ring);
-- 
2.17.1

[PATCH net 1/3] ixgbe: protect ring accesses with READ- and WRITE_ONCE

2020-06-09 Thread Ciara Loftus

READ_ONCE should be used when reading rings prior to accessing the
statistics pointer. Introduce this as well as the corresponding WRITE_ONCE
usage when allocating and freeing the rings, to ensure protected access.

Signed-off-by: Ciara Loftus 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  | 12 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 14 +++---
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index fd9f5d41b594..2e35c5706cf1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -921,7 +921,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter 
*adapter,
ring->queue_index = txr_idx;
 
/* assign ring to adapter */
-   adapter->tx_ring[txr_idx] = ring;
+   WRITE_ONCE(adapter->tx_ring[txr_idx], ring);
 
/* update count and index */
txr_count--;
@@ -948,7 +948,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter 
*adapter,
set_ring_xdp(ring);
 
/* assign ring to adapter */
-   adapter->xdp_ring[xdp_idx] = ring;
+   WRITE_ONCE(adapter->xdp_ring[xdp_idx], ring);
 
/* update count and index */
xdp_count--;
@@ -991,7 +991,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter 
*adapter,
ring->queue_index = rxr_idx;
 
/* assign ring to adapter */
-   adapter->rx_ring[rxr_idx] = ring;
+   WRITE_ONCE(adapter->rx_ring[rxr_idx], ring);
 
/* update count and index */
rxr_count--;
@@ -1020,13 +1020,13 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter 
*adapter, int v_idx)
 
ixgbe_for_each_ring(ring, q_vector->tx) {
if (ring_is_xdp(ring))
-   adapter->xdp_ring[ring->queue_index] = NULL;
+   WRITE_ONCE(adapter->xdp_ring[ring->queue_index], NULL);
else
-   adapter->tx_ring[ring->queue_index] = NULL;
+   WRITE_ONCE(adapter->tx_ring[ring->queue_index], NULL);
}
 
ixgbe_for_each_ring(ring, q_vector->rx)
-   adapter->rx_ring[ring->queue_index] = NULL;
+   WRITE_ONCE(adapter->rx_ring[ring->queue_index], NULL);
 
adapter->q_vector[v_idx] = NULL;
napi_hash_del(&q_vector->napi);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f162b8b8f345..97a423ecf808 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7051,7 +7051,10 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
}
 
for (i = 0; i < adapter->num_rx_queues; i++) {
-   struct ixgbe_ring *rx_ring = adapter->rx_ring[i];
+   struct ixgbe_ring *rx_ring = READ_ONCE(adapter->rx_ring[i]);
+
+   if (!rx_ring)
+   continue;
non_eop_descs += rx_ring->rx_stats.non_eop_descs;
alloc_rx_page += rx_ring->rx_stats.alloc_rx_page;
alloc_rx_page_failed += rx_ring->rx_stats.alloc_rx_page_failed;
@@ -7072,15 +7075,20 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
packets = 0;
/* gather some stats to the adapter struct that are per queue */
for (i = 0; i < adapter->num_tx_queues; i++) {
-   struct ixgbe_ring *tx_ring = adapter->tx_ring[i];
+   struct ixgbe_ring *tx_ring = READ_ONCE(adapter->tx_ring[i]);
+
+   if (!tx_ring)
+   continue;
restart_queue += tx_ring->tx_stats.restart_queue;
tx_busy += tx_ring->tx_stats.tx_busy;
bytes += tx_ring->stats.bytes;
packets += tx_ring->stats.packets;
}
for (i = 0; i < adapter->num_xdp_queues; i++) {
-   struct ixgbe_ring *xdp_ring = adapter->xdp_ring[i];
+   struct ixgbe_ring *xdp_ring = READ_ONCE(adapter->xdp_ring[i]);
 
+   if (!xdp_ring)
+   continue;
restart_queue += xdp_ring->tx_stats.restart_queue;
tx_busy += xdp_ring->tx_stats.tx_busy;
bytes += xdp_ring->stats.bytes;
-- 
2.17.1

Re: [PATCH bpf V2 1/2] bpf: devmap adjust uapi for attach bpf program

2020-06-09 Thread David Ahern

On 6/9/20 7:31 AM, Jesper Dangaard Brouer wrote:
> This patch remove the minus-1 checks, and have zero mean feature isn't used.
> 

For consistency this should apply to other XDP fd uses as well -- like
IFLA_XDP_EXPECTED_FD and IFLA_XDP_FD.

Re: [PATCH bpf] libbpf: Fix BTF-to-C conversion of noreturn function pointers

2020-06-09 Thread Jean-Philippe Brucker

On Mon, Jun 08, 2020 at 04:50:37PM -0700, Andrii Nakryiko wrote:
> On Mon, Jun 8, 2020 at 8:23 AM Jean-Philippe Brucker
>  wrote:
> >
> > When trying to convert the BTF for a function pointer marked "noreturn"
> > to C code, bpftool currently generates a syntax error. This happens with
> > the exit() pointer in drivers/firmware/efi/libstub/efistub.h, in an
> > arm64 vmlinux. When dealing with this declaration:
> >
> > efi_status_t __noreturn (__efiapi *exit)(...);
> >
> > bpftool produces the following output:
> >
> > efi_status_tvolatile  (*exit)(...);
> 
> 
> I'm curious where this volatile is coming from, I don't see it in
> __efiapi. But even if it's there, shouldn't it be inside parens
> instead:
> 
> efi_status_t (volatile *exit)(...);

It's the __noreturn attribute that becomes "volatile", not the __efiapi.
My reproducer is:

  struct my_struct {
  void __attribute__((noreturn)) (*fn)(int);
  };
  struct my_struct a;

When generating DWARF info for this, GCC inserts a DW_TAG_volatile_type.
Clang doesn't add a volatile tag, it just omits the noreturn qualifier.
>From what I could find, it's due to legacy "noreturn" support in GCC [1]:
before version 2.5 the only way to declare a noreturn function was to
declare it volatile.

[1] https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html

Given that not all compilers turn "noreturn" into "volatile", and that I
haven't managed to insert any other modifier (volatile/const/restrict) in
this location (the efistub example above is the only issue on an
allyesconfig kernel), I was considering simply removing this call to
btf_dump_emit_mods(). But I'm not confident enough that it won't ever be
necessary.

> > Fix the error by inserting the space before the function modifier.
> >
> > Fixes: 351131b51c7a ("libbpf: add btf_dump API for BTF-to-C conversion")
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> 
> Can you please add tests for this case into selftests (probably
> progs/btf_dump_test_case_syntax.c?) So it's clear what's the input and
> what's the expected output.

Those tests are built with clang, which doesn't emit the "volatile"
modifier. Should I add a separate test for GCC?

Thanks,
Jean

[PATCH RFC net-next 4/6] selftests/net: so_txtime: support txonly/rxonly modes

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Allow running the test across two machines, to test nic hw offload.

Add options
-A: receiver address
-r: receive only
-t: transmit only
-T: SO_RCVTIMEO value

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/so_txtime.c | 60 -
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/net/so_txtime.c 
b/tools/testing/selftests/net/so_txtime.c
index 383bac05ac32..fa748e4209c0 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -28,9 +28,13 @@
 #include 
 #include 
 
+static const char *cfg_addr;
 static int cfg_clockid = CLOCK_TAI;
 static boolcfg_do_ipv4;
 static boolcfg_do_ipv6;
+static boolcfg_rxonly;
+static int cfg_timeout_sec;
+static boolcfg_txonly;
 static uint16_tcfg_port= 8000;
 static int cfg_variance_us = 4000;
 
@@ -238,8 +242,12 @@ static int setup_rx(struct sockaddr *addr, socklen_t alen)
if (fd == -1)
error(1, errno, "socket r");
 
-   if (bind(fd, addr, alen))
-   error(1, errno, "bind");
+   if (!cfg_txonly)
+   if (bind(fd, addr, alen))
+   error(1, errno, "bind");
+
+   if (cfg_timeout_sec)
+   tv.tv_sec = cfg_timeout_sec;
 
if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
error(1, errno, "setsockopt rcv timeout");
@@ -260,13 +268,18 @@ static void do_test(struct sockaddr *addr, socklen_t alen)
 
glob_tstart = gettime_ns();
 
-   for (i = 0; i < cfg_num_pkt; i++)
-   do_send_one(fdt, &cfg_in[i]);
-   for (i = 0; i < cfg_num_pkt; i++)
-   if (do_recv_one(fdr, &cfg_out[i]))
-   do_recv_errqueue_timeout(fdt);
+   if (!cfg_rxonly) {
+   for (i = 0; i < cfg_num_pkt; i++)
+   do_send_one(fdt, &cfg_in[i]);
+   }
 
-   do_recv_verify_empty(fdr);
+   if (!cfg_txonly) {
+   for (i = 0; i < cfg_num_pkt; i++)
+   if (do_recv_one(fdr, &cfg_out[i]))
+   do_recv_errqueue_timeout(fdt);
+
+   do_recv_verify_empty(fdr);
+   }
 
if (close(fdr))
error(1, errno, "close r");
@@ -308,7 +321,7 @@ static void parse_opts(int argc, char **argv)
 {
int c, ilen, olen;
 
-   while ((c = getopt(argc, argv, "46c:")) != -1) {
+   while ((c = getopt(argc, argv, "46A:c:rtT:")) != -1) {
switch (c) {
case '4':
cfg_do_ipv4 = true;
@@ -316,6 +329,9 @@ static void parse_opts(int argc, char **argv)
case '6':
cfg_do_ipv6 = true;
break;
+   case 'A':
+   cfg_addr = optarg;
+   break;
case 'c':
if (!strcmp(optarg, "tai"))
cfg_clockid = CLOCK_TAI;
@@ -325,13 +341,27 @@ static void parse_opts(int argc, char **argv)
else
error(1, 0, "unknown clock id %s", optarg);
break;
+   case 'r':
+   cfg_rxonly = true;
+   break;
+   case 't':
+   cfg_txonly = true;
+   break;
+   case 'T':
+   cfg_timeout_sec = strtol(optarg, NULL, 0);
+   break;
default:
error(1, 0, "parse error at %d", optind);
}
}
 
if (argc - optind != 2)
-   error(1, 0, "Usage: %s [-46] -c   ", argv[0]);
+   error(1, 0, "Usage: %s [-46rt] [-A addr] [-c clock] [-T 
timeout]  ", argv[0]);
+
+   if (cfg_rxonly && cfg_txonly)
+   error(1, 0, "Select rx-only or tx-only, not both");
+   if (cfg_addr && cfg_do_ipv4 && cfg_do_ipv6)
+   error(1, 0, "Cannot run both IPv4 and IPv6 when passing 
address");
 
ilen = parse_io(argv[optind], cfg_in);
olen = parse_io(argv[optind + 1], cfg_out);
@@ -349,7 +379,10 @@ int main(int argc, char **argv)
 
addr6.sin6_family = AF_INET6;
addr6.sin6_port = htons(cfg_port);
-   addr6.sin6_addr = in6addr_loopback;
+   if (!cfg_addr)
+   addr6.sin6_addr = in6addr_loopback;
+   else if (inet_pton(AF_INET6, cfg_addr, &addr6.sin6_addr) != 1)
+   error(1, 0, "ipv6 parse error: %s", cfg_addr);
 
cfg_errq_level = SOL_IPV6;
cfg_errq_type = IPV6_RECVERR;
@@ -362,7 +395,10 @@ int main(int argc, char **argv)
 
addr4.sin_family = AF_INET;
addr4.sin_port = htons(cfg_port);
-   addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+

[PATCH RFC net-next 1/6] net: multiple release time SO_TXTIME

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Pace transmission of segments in a UDP GSO datagram.

Batching datagram protocol stack traversals with UDP_SEGMENT saves
significant cycles for large data transfers.

But GSO packets are sent at once. Pacing traffic to internet clients
often requires sending just a few MSS per msec pacing interval.

SO_TXTIME allows delivery of packets at a later time. Extend it
to allow pacing the segments in a UDP GSO packet, to be able to build
larger GSO datagrams.

Add SO_TXTIME flag SOF_TXTIME_MULTI_RELEASE. This reinterprets the
lower 8 bits of the 64-bit release timestamp as

  - bits 4..7: release time interval in usec
  - bits 0..3: number of segments sent per period

So a timestamp of 0x148 means

  - 0x100 initial timestamp in Qdisc selected clocksource
  - every 4 usec release N MSS
  - N is 8

A subsequent qdisc change will pace the individual segments.

Packet transmission can race with the socket option. This is safe.
For predictable behavior, it is up to the caller to not toggle the
feature while packets on a socket are in flight.

Signed-off-by: Willem de Bruijn 
---
 include/linux/netdevice.h   |  1 +
 include/net/sock.h  |  3 ++-
 include/uapi/linux/net_tstamp.h |  3 ++-
 net/core/dev.c  | 44 +
 net/core/sock.c |  4 +++
 5 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a96e9c4ec36..15ea976dd446 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4528,6 +4528,7 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
  netdev_features_t features, bool tx_path);
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
netdev_features_t features);
+struct sk_buff *skb_gso_segment_txtime(struct sk_buff *skb);
 
 struct netdev_bonding_info {
ifslave slave;
diff --git a/include/net/sock.h b/include/net/sock.h
index c53cc42b5ab9..491e389b3570 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -493,7 +493,8 @@ struct sock {
u8  sk_clockid;
u8  sk_txtime_deadline_mode : 1,
sk_txtime_report_errors : 1,
-   sk_txtime_unused : 6;
+   sk_txtime_multi_release : 1,
+   sk_txtime_unused : 5;
 
struct socket   *sk_socket;
void*sk_user_data;
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 7ed0b3d1c00a..ca1ae3b6f601 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -162,8 +162,9 @@ struct scm_ts_pktinfo {
 enum txtime_flags {
SOF_TXTIME_DEADLINE_MODE = (1 << 0),
SOF_TXTIME_REPORT_ERRORS = (1 << 1),
+   SOF_TXTIME_MULTI_RELEASE = (1 << 2),
 
-   SOF_TXTIME_FLAGS_LAST = SOF_TXTIME_REPORT_ERRORS,
+   SOF_TXTIME_FLAGS_LAST = SOF_TXTIME_MULTI_RELEASE,
SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_FLAGS_LAST - 1) |
 SOF_TXTIME_FLAGS_LAST
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 061496a1f640..5058083375fb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3377,6 +3377,50 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(__skb_gso_segment);
 
+struct sk_buff *skb_gso_segment_txtime(struct sk_buff *skb)
+{
+   int mss_per_ival, mss_in_cur_ival;
+   struct sk_buff *segs, *seg;
+   struct skb_shared_info *sh;
+   u64 step_ns, tstamp;
+
+   if (!skb->sk || !sk_fullsock(skb->sk) ||
+   !skb->sk->sk_txtime_multi_release)
+   return NULL;
+
+   /* extract multi release variables mss and stepsize */
+   mss_per_ival = skb->tstamp & 0xF;
+   step_ns = ((skb->tstamp >> 4) & 0xF) * NSEC_PER_MSEC;
+   tstamp = skb->tstamp;
+
+   if (mss_per_ival == 0)
+   return NULL;
+
+   /* skip multi-release if total segs can be sent at once */
+   sh = skb_shinfo(skb);
+   if (sh->gso_segs <= mss_per_ival)
+   return NULL;
+
+   segs = skb_gso_segment(skb, NETIF_F_SG | NETIF_F_HW_CSUM);
+   if (IS_ERR_OR_NULL(segs))
+   return segs;
+
+   mss_in_cur_ival = 0;
+
+   for (seg = segs; seg; seg = seg->next) {
+   seg->tstamp = tstamp & ~0xFF;
+
+   mss_in_cur_ival++;
+   if (mss_in_cur_ival == mss_per_ival) {
+   tstamp += step_ns;
+   mss_in_cur_ival = 0;
+   }
+   }
+
+   return segs;
+}
+EXPORT_SYMBOL_GPL(skb_gso_segment_txtime);
+
 /* Take action when hardware reception checksum errors are detected. */
 #ifdef CONFIG_BUG
 void netdev_rx_csum_fault(struct net_device *dev, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..7036b8855154

[PATCH RFC net-next 0/6] multi release pacing for UDP GSO

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

UDP segmentation offload with UDP_SEGMENT can significantly reduce the
transmission cycle cost per byte for protocols like QUIC.

Pacing offload with SO_TXTIME can improve accuracy and cycle cost of
pacing for such userspace protocols further.

But the maximum GSO size built is limited by the pacing rate. As msec
pacing interval, for many Internet clients results in at most a few
segments per datagram.

The pros and cons were captured in a recent CloudFlare article,
specifically mentioning

  "But it does not yet support specifying different times for each
  packet when GSO is used, as there is no way to define multiple
  timestamps for packets that need to be segmented (each segmented
  packet essentially ends up being sent at the same time anyway)."

  https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/

We have been evaluating such a mechanism for multiple release times
per UDP GSO packets. Since it sounds like it may of interest to
others, too, it may be a while before we have all the data I'd like
and it's more quiet on the list now that the merge window is open,
sharing a WIP version.

The basic approach is to specify

1. initial early release time (in nsec)
2. interval between subsequent release times (in msec)
3. number of segments to release at each release time

One implementation concern is where to store the additional two fields
in the skb. Given that msec granularity is the Internet pacing speed,
for now repurpose the two lowest 4B nibbles in skb->tstamp to hold the
interval and segment count. I'm aware that this does not win a prize
for elegance.

Patch 1 adds the socket option and basic segmentation function to
  adjust the skb->tstamp of the individual segments.

Patch 2 extends this with support for build GSO segs. Build one GSO
   segment per interval if the hardware can offload (USO) and thus
   we are segmenting only to maintain pacing rate.

Patch 3 wires the segmentation up to the FQ qdisc on enqueue, so that
   segments will be scheduled for delivery at their adjusted time.

Patch 4..6 extend existing tests to experiment with the feature

Patch 4 allows testing so_txtime across hardware (for USO)
Patch 5 extends the so_txtime test with support for gso and mr-pacing
Patch 6 extends the udpgso bench to support pacing and mr-pacing

Some known limitations:

- the aforementioned storage in skb->tstamp.

- exposing this constraint through the SO_TXTIME interface.
  it is cleaner to add new fields to the cmsg, at nsec resolution.

- the fq_enqueue path adds a branch to the hot path.
  a static branch would avoid that.

- a few udp specific assumptions in a net/core datapath.
  notably the hw_features. this can be derived from gso_type.

Willem de Bruijn (6):
  net: multiple release time SO_TXTIME
  net: build gso segs in multi release time SO_TXTIME
  net_sched: sch_fq: multiple release time support
  selftests/net: so_txtime: support txonly/rxonly modes
  selftests/net: so_txtime: add gso and multi release pacing
  selftests/net: upgso bench: add pacing with SO_TXTIME

 include/linux/netdevice.h |   1 +
 include/net/sock.h|   3 +-
 include/uapi/linux/net_tstamp.h   |   3 +-
 net/core/dev.c|  71 +
 net/core/sock.c   |   4 +
 net/sched/sch_fq.c|  33 -
 tools/testing/selftests/net/so_txtime.c   | 136 ++
 tools/testing/selftests/net/so_txtime.sh  |   7 +
 .../testing/selftests/net/so_txtime_multi.sh  |  68 +
 .../selftests/net/udpgso_bench_multi.sh   |  65 +
 tools/testing/selftests/net/udpgso_bench_tx.c |  72 +-
 11 files changed, 431 insertions(+), 32 deletions(-)
 create mode 100755 tools/testing/selftests/net/so_txtime_multi.sh
 create mode 100755 tools/testing/selftests/net/udpgso_bench_multi.sh

-- 
2.27.0.278.ge193c7cf3a9-goog

[PATCH RFC net-next 3/6] net_sched: sch_fq: multiple release time support

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Optionally segment skbs on FQ enqueue, to later send segments at
their individual delivery time.

Segmentation on enqueue is new for FQ, but already happens in TBF,
CAKE and netem.

This slow patch should probably be behind a static_branch.

Signed-off-by: Willem de Bruijn 
---
 net/sched/sch_fq.c | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 8f06a808c59a..a5e2c35bb557 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -439,8 +439,8 @@ static bool fq_packet_beyond_horizon(const struct sk_buff 
*skb,
return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
 }
 
-static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
- struct sk_buff **to_free)
+static int __fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+   struct sk_buff **to_free)
 {
struct fq_sched_data *q = qdisc_priv(sch);
struct fq_flow *f;
@@ -496,6 +496,35 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
return NET_XMIT_SUCCESS;
 }
 
+static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+   struct sk_buff *segs, *next;
+   int ret;
+
+   if (likely(!skb_is_gso(skb) || !skb->sk ||
+  !skb->sk->sk_txtime_multi_release))
+   return __fq_enqueue(skb, sch, to_free);
+
+   segs = skb_gso_segment_txtime(skb);
+   if (IS_ERR(segs))
+   return qdisc_drop(skb, sch, to_free);
+   if (!segs)
+   return __fq_enqueue(skb, sch, to_free);
+
+   consume_skb(skb);
+
+   ret = NET_XMIT_DROP;
+   skb_list_walk_safe(segs, segs, next) {
+   skb_mark_not_on_list(segs);
+   qdisc_skb_cb(segs)->pkt_len = segs->len;
+   if (__fq_enqueue(segs, sch, to_free) == NET_XMIT_SUCCESS)
+   ret = NET_XMIT_SUCCESS;
+   }
+
+   return ret;
+}
+
 static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 {
unsigned long sample;
-- 
2.27.0.278.ge193c7cf3a9-goog

[PATCH RFC net-next 2/6] net: build gso segs in multi release time SO_TXTIME

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

When sending multiple segments per interval and the device supports
hardware segmentation, build one GSO segment per interval.

Signed-off-by: Willem de Bruijn 
---
 net/core/dev.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5058083375fb..05f538f0f631 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3379,7 +3379,9 @@ EXPORT_SYMBOL(__skb_gso_segment);
 
 struct sk_buff *skb_gso_segment_txtime(struct sk_buff *skb)
 {
+   const netdev_features_t hw_features = NETIF_F_GSO_UDP_L4;
int mss_per_ival, mss_in_cur_ival;
+   u16 gso_size_orig, gso_segs_orig;
struct sk_buff *segs, *seg;
struct skb_shared_info *sh;
u64 step_ns, tstamp;
@@ -3401,13 +3403,27 @@ struct sk_buff *skb_gso_segment_txtime(struct sk_buff 
*skb)
if (sh->gso_segs <= mss_per_ival)
return NULL;
 
+   /* update gso size and segs to build 1 GSO packet per ival */
+   gso_size_orig = sh->gso_size;
+   gso_segs_orig = sh->gso_segs;
+   if (mss_per_ival > 1 && skb->dev->features & hw_features) {
+   sh->gso_size *= mss_per_ival;
+   sh->gso_segs = DIV_ROUND_UP(sh->gso_segs, mss_per_ival);
+   mss_per_ival = 1;
+   }
+
segs = skb_gso_segment(skb, NETIF_F_SG | NETIF_F_HW_CSUM);
-   if (IS_ERR_OR_NULL(segs))
+   if (IS_ERR_OR_NULL(segs)) {
+   sh->gso_size = gso_size_orig;
+   sh->gso_segs = gso_segs_orig;
return segs;
+   }
 
mss_in_cur_ival = 0;
 
for (seg = segs; seg; seg = seg->next) {
+   unsigned int data_len, data_off;
+
seg->tstamp = tstamp & ~0xFF;
 
mss_in_cur_ival++;
@@ -3415,6 +3431,17 @@ struct sk_buff *skb_gso_segment_txtime(struct sk_buff 
*skb)
tstamp += step_ns;
mss_in_cur_ival = 0;
}
+
+   data_off = skb_checksum_start_offset(skb) +
+  skb->csum_offset + sizeof(__sum16);
+   data_len = seg->len - data_off;
+
+   if (data_len > gso_size_orig) {
+   sh = skb_shinfo(seg);
+   sh->gso_type = skb_shinfo(skb)->gso_type;
+   sh->gso_size = gso_size_orig;
+   sh->gso_segs = DIV_ROUND_UP(data_len, gso_size_orig);
+   }
}
 
return segs;
-- 
2.27.0.278.ge193c7cf3a9-goog

[PATCH RFC net-next 5/6] selftests/net: so_txtime: add gso and multi release pacing

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Support sending more than 1B payload and passing segment size.

- add option '-s' to send more than 1B payload
- add option '-m' to configure mss

If size exceeds mss, enable UDP_SEGMENT.

Optionally also allow configuring multi release time.

- add option '-M' to set release interval (msec)
- add option '-N' to set release segment count

Both options have to be specified, or neither.

Add a testcase to so_txtime.sh over loopback.
Add a testscript so_txtime_multi.sh that operates over veth using
txonly/rxonly mode.

Also fix a small sock_extended_err parsing bug, mixing up
ee_code and ee_errno

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/so_txtime.c   | 80 +++
 tools/testing/selftests/net/so_txtime.sh  |  7 ++
 .../testing/selftests/net/so_txtime_multi.sh  | 68 
 3 files changed, 141 insertions(+), 14 deletions(-)
 create mode 100755 tools/testing/selftests/net/so_txtime_multi.sh

diff --git a/tools/testing/selftests/net/so_txtime.c 
b/tools/testing/selftests/net/so_txtime.c
index fa748e4209c0..da4ef411693c 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,7 +33,11 @@ static const char *cfg_addr;
 static int cfg_clockid = CLOCK_TAI;
 static boolcfg_do_ipv4;
 static boolcfg_do_ipv6;
+static uint8_t cfg_mr_num;
+static uint8_t cfg_mr_ival;
+static int cfg_mss = 1400;
 static boolcfg_rxonly;
+static uint16_tcfg_size= 1;
 static int cfg_timeout_sec;
 static boolcfg_txonly;
 static uint16_tcfg_port= 8000;
@@ -66,6 +71,7 @@ static uint64_t gettime_ns(void)
 
 static void do_send_one(int fdt, struct timed_send *ts)
 {
+   static char buf[1 << 16];
char control[CMSG_SPACE(sizeof(uint64_t))];
struct msghdr msg = {0};
struct iovec iov = {0};
@@ -73,8 +79,10 @@ static void do_send_one(int fdt, struct timed_send *ts)
uint64_t tdeliver;
int ret;
 
-   iov.iov_base = &ts->data;
-   iov.iov_len = 1;
+   memset(buf, ts->data, cfg_size);
+
+   iov.iov_base = buf;
+   iov.iov_len = cfg_size;
 
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
@@ -85,6 +93,11 @@ static void do_send_one(int fdt, struct timed_send *ts)
msg.msg_controllen = sizeof(control);
 
tdeliver = glob_tstart + ts->delay_us * 1000;
+   if (cfg_mr_ival) {
+   tdeliver &= ~0xFF;
+   tdeliver |= cfg_mr_ival << 4;
+   tdeliver |= cfg_mr_num;
+   }
 
cm = CMSG_FIRSTHDR(&msg);
cm->cmsg_level = SOL_SOCKET;
@@ -104,30 +117,41 @@ static void do_send_one(int fdt, struct timed_send *ts)
 static bool do_recv_one(int fdr, struct timed_send *ts)
 {
int64_t tstop, texpect;
+   int total = 0;
char rbuf[2];
int ret;
 
-   ret = recv(fdr, rbuf, sizeof(rbuf), 0);
+read_again:
+   ret = recv(fdr, rbuf, sizeof(rbuf), MSG_TRUNC);
if (ret == -1 && errno == EAGAIN)
-   return true;
+   goto timedout;
if (ret == -1)
error(1, errno, "read");
-   if (ret != 1)
-   error(1, 0, "read: %dB", ret);
 
tstop = (gettime_ns() - glob_tstart) / 1000;
texpect = ts->delay_us >= 0 ? ts->delay_us : 0;
 
-   fprintf(stderr, "payload:%c delay:%lld expected:%lld (us)\n",
-   rbuf[0], (long long)tstop, (long long)texpect);
+   fprintf(stderr, "payload:%c delay:%lld expected:%lld (us) -- 
read=%d,len=%d,total=%d\n",
+   rbuf[0], (long long)tstop, (long long)texpect,
+   total, ret, cfg_size);
 
if (rbuf[0] != ts->data)
error(1, 0, "payload mismatch. expected %c", ts->data);
 
-   if (labs(tstop - texpect) > cfg_variance_us)
+   total += ret;
+   if (total < cfg_size)
+   goto read_again;
+
+   /* measure latency if all data arrives in a single datagram (not GSO) */
+   if (ret == cfg_size && labs(tstop - texpect) > cfg_variance_us)
error(1, 0, "exceeds variance (%d us)", cfg_variance_us);
 
return false;
+
+timedout:
+   if (total != 0 && total != cfg_size)
+   error(1, 0, "timeout mid-read");
+   return true;
 }
 
 static void do_recv_verify_empty(int fdr)
@@ -168,7 +192,9 @@ static void do_recv_errqueue_timeout(int fdt)
break;
if (ret == -1)
error(1, errno, "errqueue");
-   if (msg.msg_flags != MSG_ERRQUEUE)
+   if (ret != sizeof(data))
+   error(1, errno, "insufficient data");
+   if (msg.msg_flags & ~(MSG_ERRQUEUE | MSG_TRUNC))
error(1, 0, "errqueue:

[PATCH RFC net-next 6/6] selftests/net: upgso bench: add pacing with SO_TXTIME

2020-06-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Enable passing an SCM_TXTIME ('-x'), optionally configured to use
multi release (SOF_TXTIME_MULTI_RELEASE) ('-X').

With multi release, the segments that make up a single udp gso packet
can be released over time, as opposed to burst at once.

Repurpose the lower 8 bits of the 64 bit timestamp for this purpose.
- bits 4..7: delay between transmit periods in msec
- bits 0..3: number of segments sent per period

Also add an optional delay in usec between sendmsg calls ('-d'). To
reduce throughput sufficiently to observe pacing delay introduced.

Added udpgso_bench_multi.sh, derived from so_txtime_multi.sh, to run
tests over veth.

Also fix up a minor issue where sizeof the wrong field was used
(mss, instead of gso_size). They happen to both be uint16_t.

Tested:
./udpgso_bench_multi.sh

or manually ran across two hosts:

# sender
#
# -S 6000 -s 1400:  6000B buffer encoding 1400B datagrams
# -d 20:200 msec delay between sendmsg
# -x 0x989611:  ~100 nsec first delay + 1 MSS every 1 msec

./udpgso_bench_tx -6 -D fd00::1 \
-l 1 -s 6000 -S 1400 -v
-d 20 -x 0x989611 -X

# receiver

tcpdump -n -i eth1 -c 100 udp and port 8000 &
sleep 0.2
./udpgso_bench_rx

16:29:45.146855 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.147798 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.148797 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.149797 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.150796 IP6 host1.40803 > host2.8000: UDP, length 400
16:29:45.347056 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.348000 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.349000 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.34 IP6 host1.40803 > host2.8000: UDP, length 1400
16:29:45.350999 IP6 host1.40803 > host2.8000: UDP, length 400

Signed-off-by: Willem de Bruijn 
---
 .../selftests/net/udpgso_bench_multi.sh   | 65 +
 tools/testing/selftests/net/udpgso_bench_tx.c | 72 +--
 2 files changed, 133 insertions(+), 4 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgso_bench_multi.sh

diff --git a/tools/testing/selftests/net/udpgso_bench_multi.sh 
b/tools/testing/selftests/net/udpgso_bench_multi.sh
new file mode 100755
index ..c29f75aec759
--- /dev/null
+++ b/tools/testing/selftests/net/udpgso_bench_multi.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Regression tests for the SO_TXTIME interface
+
+readonly ns_prefix="ns-sotxtime-"
+readonly ns1="${ns_prefix}1"
+readonly ns2="${ns_prefix}2"
+
+readonly ns1_v4=192.168.1.1
+readonly ns2_v4=192.168.1.2
+readonly ns1_v6=fd::1
+readonly ns2_v6=fd::2
+
+set -eu
+
+cleanup() {
+   ip netns del "${ns2}"
+   ip netns del "${ns1}"
+}
+
+setup() {
+   ip netns add "${ns1}"
+   ip netns add "${ns2}"
+
+   ip link add dev veth1 mtu 1500 netns "${ns1}" type veth \
+ peer name veth2 mtu 1500 netns "${ns2}"
+
+   ip -netns "${ns1}" link set veth1 up
+   ip -netns "${ns2}" link set veth2 up
+
+   ip -netns "${ns1}" -4 addr add "${ns1_v4}/24" dev veth1
+   ip -netns "${ns2}" -4 addr add "${ns2_v4}/24" dev veth2
+   ip -netns "${ns1}" -6 addr add "${ns1_v6}/64" dev veth1 nodad
+   ip -netns "${ns2}" -6 addr add "${ns2_v6}/64" dev veth2 nodad
+
+   ip netns exec "${ns1}" tc qdisc add dev veth1 root fq
+}
+
+run_test() {
+   ip netns exec "${ns2}" tcpdump -q -n -i veth2 udp &
+   ip netns exec "${ns2}" ./udpgso_bench_rx &
+   sleep 0.1
+   ip netns exec "${ns1}" ./udpgso_bench_tx $@
+   pkill -P $$
+}
+
+run_test_46() {
+   run_test -4 -D "${ns2_v4}" $@
+   run_test -6 -D "${ns2_v6}" $@
+}
+
+trap cleanup EXIT
+setup
+
+echo "gso + pacing"
+TEST_ARGS="-l 1 -s 3500 -S 1000 -v -d 20 -x 100"
+run_test_46 ${TEST_ARGS} -x 100
+
+echo "gso + multi release pacing"
+run_test_46 ${TEST_ARGS} -X -x 0x989611
+run_test_46 ${TEST_ARGS} -X -x 0x9896A2
+
+# Does not validate pacing delay yet. Check manually.
+echo "Ok. Executed tests."
diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c 
b/tools/testing/selftests/net/udpgso_bench_tx.c
index 17512a43885e..264222c2b94e 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "../kselftest.h"
@@ -56,6 +57,7 @@
 static boolcfg_cache_trash;
 static int cfg_cpu = -1;
 static int cfg_connected   = true;
+static int cfg_delay_us;
 static int cfg_family  = PF_UNSPEC;
 static uint16_tcfg_mss;
 static int cfg_payload_len = (1472 * 42);
@@ -65,6 +67,8 @@ static bool   cfg_poll;
 static boolcfg_segment;
 static boolcfg_sendmmsg;
 static boolcfg_tcp;
+static uint64_tcfg_txtime;
+static b

[PATCH] dccp: Fix possible memleak in dccp_init and dccp_fini

2020-06-09 Thread Wang Hai

There are some memory leaks in dccp_init() and dccp_fini().

In dccp_fini() and the error handling path in dccp_init(), free lhash2
is missing. Add inet_hashinfo2_free_mod() to do it.

If inet_hashinfo2_init_mod() failed in dccp_init(),
percpu_counter_destroy() should be called to destroy dccp_orphan_count.
It need to goto out_free_percpu when inet_hashinfo2_init_mod() failed.

Fixes: c92c81df93df ("net: dccp: fix kernel crash on module load")
Reported-by: Hulk Robot 
Signed-off-by: Wang Hai 
---
 include/net/inet_hashtables.h | 6 ++
 net/dccp/proto.c  | 7 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ad64ba6..9256097 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -185,6 +185,12 @@ static inline spinlock_t *inet_ehash_lockp(
 
 int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo);
 
+static inline void inet_hashinfo2_free_mod(struct inet_hashinfo *h)
+{
+   kfree(h->lhash2);
+   h->lhash2 = NULL;
+}
+
 static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 {
kvfree(hashinfo->ehash_locks);
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 4af8a98..c13b660 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1139,14 +1139,14 @@ static int __init dccp_init(void)
inet_hashinfo_init(&dccp_hashinfo);
rc = inet_hashinfo2_init_mod(&dccp_hashinfo);
if (rc)
-   goto out_fail;
+   goto out_free_percpu;
rc = -ENOBUFS;
dccp_hashinfo.bind_bucket_cachep =
kmem_cache_create("dccp_bind_bucket",
  sizeof(struct inet_bind_bucket), 0,
  SLAB_HWCACHE_ALIGN, NULL);
if (!dccp_hashinfo.bind_bucket_cachep)
-   goto out_free_percpu;
+   goto out_free_hashinfo2;
 
/*
 * Size and allocate the main established and bind bucket
@@ -1242,6 +1242,8 @@ static int __init dccp_init(void)
free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order);
 out_free_bind_bucket_cachep:
kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep);
+out_free_hashinfo2:
+   inet_hashinfo2_free_mod(&dccp_hashinfo);
 out_free_percpu:
percpu_counter_destroy(&dccp_orphan_count);
 out_fail:
@@ -1265,6 +1267,7 @@ static void __exit dccp_fini(void)
kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep);
dccp_ackvec_exit();
dccp_sysctl_exit();
+   inet_hashinfo2_free_mod(&dccp_hashinfo);
percpu_counter_destroy(&dccp_orphan_count);
 }
 
-- 
1.8.3.1

Re: [PATCHv2 ipsec] xfrm: fix a warning in xfrm_policy_insert_list

2020-06-09 Thread Tobias Brunner

Hi Xin,

>> I guess we could workaround this issue in strongSwan by installing
>> policies that share the same mark and selector with the same priority,
>> so only one instance is ever installed in the kernel.  But the inability
>> to address the exact policy when querying/deleting still looks like a
>> problem to me in general.
>>
> For deleting, yes, but for querying, I think it makes sense not to pass
> the priority, and always get the policy with the highest priority.

While I agree it's less of a problem (at least for strongSwan),  it
should be possible to query the exact policy one wants.  Because as far
as I understand, the whole point of Steffen's original patch was that
all duplicate policies could get used concurrently, depending on the
marks and masks on them and the traffic, so all of them must be queryable.

But I actually think the previous check that viewed policies with the
exact same mark and value as duplicates made sense, because those will
never be used concurrently.  It would at least fix the default behavior
with strongSwan (users have to configure marks/masks manually).

> We can separate the deleting path from the querying path when
> XFRMA_PRIORITY attribute is set.
> 
> Is that enough for your case to only fix for the policy deleting?

While such an attribute could be part of a solution, it does not fix the
regression your patch created.  The kernel behavior changed and a
userland modification is required to get back to something resembling
the previous behavior (without an additional kernel patch we'll actually
not be able to restore the previous behavior, where we separated
different types of policies into priority classes).  That is, current
and old strongSwan versions could create lots of duplicate/lingering
policies, which is not good.

A problem with such an attribute is how userland would learn when to use
it.  We could query the kernel version, but patches might get
backported.  So how can we know the kernel will create duplicates when
we update a policy and change the priority, which we then have to delete
(or even can delete with such a new attribute)?  Do we have to do a
runtime check (e.g. install two duplicate policies with different
priorities and delete twice to see if the second attempt results in an
error)?  With marks it's relatively easy as users have to configure them
explicitly and they work or they don't depending on the kernel version.
 But here it's not so easy as the IKE daemon uses priorities extensively
already.

Like the marks it might work somehow if the new attribute also had to be
passed in the message that creates a policy (marks have to be passed
with every message, including querying them).  While that's not super
ideal as we'd have two priority values in these messages (and have to
keep track of them in the kernel state), there is some precedent with
the anti-replay config for SAs (which can be passed via xfrm_usersa_info
struct or as separate attribute with more options for ESN).  Userland
would still have to learn somehow that the kernel understands the new
attribute and duplicate policies with different priorities are possible.
 But if there was any advantage in using this, we could perhaps later
add an option for users to enable it.  At least the current behavior
would not change (i.e. older strongSwan versions would continue to run
on newer kernels without modifications).

Regards,
Tobias

RE: [PATCH] net: stmmac: Fix RX Coalesce IOC always true issue

2020-06-09 Thread Jose Abreu

From: Biao Huang 
Date: Jun/09/2020, 10:41:33 (UTC+00:00)

> - rx_q->rx_count_frames += priv->rx_coal_frames;
> - if (rx_q->rx_count_frames > priv->rx_coal_frames)
> + if (rx_q->rx_count_frames >= priv->rx_coal_frames)

This is no right. If you want to RX IC bit to not always be set you need 
to change coalesce parameters using ethtool.

---
Thanks,
Jose Miguel Abreu

RE: [PATCH net-next 7/8] net: phy: Add Synopsys DesignWare XPCS MDIO module

2020-06-09 Thread Jose Abreu

From: Russell King - ARM Linux admin 
Date: Jun/05/2020, 18:10:34 (UTC+00:00)

> This is incorrect - you should not mask the link partner's advertisement
> with our advertisement like this; consider the table in 802.3 for
> resolving the pause modes, where simply doing a bitwise-and operation
> between the two advertisements would severely restrict the resulting
> resolution to either symmetric pause or nothing at all.
> 
> You want to do this when you resolve the speed, but only _temporarily_
> in order to resolve the speed - you do not want to write back the
> result to state->lp_advertising.

Thanks for bringing this up. Indeed I believe I did this in order to 
resolve the speed, as you said.
 
> You may wish to fix that.

Added to the list of my pending fixes. Thanks!

---
Thanks,
Jose Miguel Abreu

Re: [PATCH net] net: mvneta: do not redirect frames during reconfiguration

2020-06-09 Thread Lorenzo Bianconi

> On Tue, Jun 09, 2020 at 09:41:10AM +0200, Lorenzo Bianconi wrote:
> > > On Tue, Jun 09, 2020 at 12:02:39AM +0200, Lorenzo Bianconi wrote:
> > > > Disable frames injection in mvneta_xdp_xmit routine during hw
> > > > re-configuration in order to avoid hardware hangs
> > > 
> > > Hi Lorenzo
> > > 
> > > Why does mvneta_tx() also not need the same protection?
> > > 
> > > Andrew
> > 
> > Hi Andrew,
> > 
> > So far I have not been able to trigger the issue in the legacy tx path.
> 
> Even if you have not hit the issue, do you still think it is possible?
> If it is hard to trigger, maybe it is worth protecting against it,
> just in case.

The issue occurs putting the device down while it is still transmitting. In
particular mvneta_port_down() fails to stop tx (TIMEOUT for TX stopped 
status=...)
and the device is not able to recover.
The above pattern can occur with XDP because if we remove the program from a
running interface, we will put the interface down for DMA buffers
reconfiguration while mvneta_xdp_xmit() is concurrently running on a remote
cpu.
Looking at the code I do not think it can occurs in the legacy tx path
(mvneta_tx()) since __dev_close() (trigger by userspace) will run
dev_deactivate_many() before running mvneta_stop().

> 
> > I hit the problem adding the capability to attach an eBPF program to CPUMAP
> > entries [1]. In particular I am redirecting traffic to mvneta and 
> > concurrently
> > attaching/removing a XDP program to/from it.
> > I am not sure this can occur running mvneta_tx().
> > Moreover it seems a common pattern for .ndo_xdp_xmit() in other drivers
> > (e.g ixgbe, bnxt, mlx5)
> 
> I was wondering if this should be solved at a higher level. And if you
> say there are more MAC drivers with this issue, maybe it should. Not
> sure how though. It seems like MTU change and rx mode change wound
> need to be protected, which at a higher level is harder to do. What
> exactly do you need to protect, in a generic way?

Yes, we can think about it but I guess we should fix the issue first since it
is already there and it will be easy to backport the fix, agree?

Regards,
Lorenzo

> 
>  Andrew

signature.asc
Description: PGP signature

Re: [PATCH RFC net-next 3/6] net_sched: sch_fq: multiple release time support

2020-06-09 Thread Eric Dumazet




On 6/9/20 7:09 AM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Optionally segment skbs on FQ enqueue, to later send segments at
> their individual delivery time.
> 
> Segmentation on enqueue is new for FQ, but already happens in TBF,
> CAKE and netem.
> 
> This slow patch should probably be behind a static_branch.
> 
> Signed-off-by: Willem de Bruijn 
> ---
>  net/sched/sch_fq.c | 33 +++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index 8f06a808c59a..a5e2c35bb557 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -439,8 +439,8 @@ static bool fq_packet_beyond_horizon(const struct sk_buff 
> *skb,
>   return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
>  }
>  
> -static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> -   struct sk_buff **to_free)
> +static int __fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> + struct sk_buff **to_free)
>  {
>   struct fq_sched_data *q = qdisc_priv(sch);
>   struct fq_flow *f;
> @@ -496,6 +496,35 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc 
> *sch,
>   return NET_XMIT_SUCCESS;
>  }
>  
> +static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> +   struct sk_buff **to_free)
> +{
> + struct sk_buff *segs, *next;
> + int ret;
> +
> + if (likely(!skb_is_gso(skb) || !skb->sk ||

You also need to check sk_fullsock(skb->sk), otherwise KMSAN might be unhappy.

> +!skb->sk->sk_txtime_multi_release))
> + return __fq_enqueue(skb, sch, to_free);
> +
> + segs = skb_gso_segment_txtime(skb);
> + if (IS_ERR(segs))
> + return qdisc_drop(skb, sch, to_free);
> + if (!segs)
> + return __fq_enqueue(skb, sch, to_free);
> +
> + consume_skb(skb);

   This needs to be qdisc_drop(skb, sch, to_free) if queue is full, see below.

> +
> + ret = NET_XMIT_DROP;
> + skb_list_walk_safe(segs, segs, next) {
> + skb_mark_not_on_list(segs);
> + qdisc_skb_cb(segs)->pkt_len = segs->len;

This seems to under-estimate bytes sent. See qdisc_pkt_len_init() for details.

> + if (__fq_enqueue(segs, sch, to_free) == NET_XMIT_SUCCESS)
> + ret = NET_XMIT_SUCCESS;
> + }

if (unlikely(ret == NET_XMIT_DROP))
qdisc_drop(skb, sch, to_free);
else
consume_skb(skb);

> +
> + return ret;
> +}
> +
>  static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>  {
>   unsigned long sample;
>

Re: [PATCH RFC net-next 3/6] net_sched: sch_fq: multiple release time support

2020-06-09 Thread Eric Dumazet




On 6/9/20 8:00 AM, Eric Dumazet wrote:
> 
> 
> On 6/9/20 7:09 AM, Willem de Bruijn wrote:
>> From: Willem de Bruijn 
>>
>> Optionally segment skbs on FQ enqueue, to later send segments at
>> their individual delivery time.
>>
>> Segmentation on enqueue is new for FQ, but already happens in TBF,
>> CAKE and netem.
>>
>> This slow patch should probably be behind a static_branch.
>>
>> Signed-off-by: Willem de Bruijn 
>> ---
>>  net/sched/sch_fq.c | 33 +++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
>> index 8f06a808c59a..a5e2c35bb557 100644
>> --- a/net/sched/sch_fq.c
>> +++ b/net/sched/sch_fq.c
>> @@ -439,8 +439,8 @@ static bool fq_packet_beyond_horizon(const struct 
>> sk_buff *skb,
>>  return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
>>  }
>>  
>> -static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>> -  struct sk_buff **to_free)
>> +static int __fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>> +struct sk_buff **to_free)
>>  {
>>  struct fq_sched_data *q = qdisc_priv(sch);
>>  struct fq_flow *f;
>> @@ -496,6 +496,35 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc 
>> *sch,
>>  return NET_XMIT_SUCCESS;
>>  }
>>  
>> +static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>> +  struct sk_buff **to_free)
>> +{
>> +struct sk_buff *segs, *next;
>> +int ret;
>> +
>> +if (likely(!skb_is_gso(skb) || !skb->sk ||
> 
> You also need to check sk_fullsock(skb->sk), otherwise KMSAN might be unhappy.
> 
>> +   !skb->sk->sk_txtime_multi_release))
>> +return __fq_enqueue(skb, sch, to_free);
>> +
>> +segs = skb_gso_segment_txtime(skb);
>> +if (IS_ERR(segs))
>> +return qdisc_drop(skb, sch, to_free);
>> +if (!segs)
>> +return __fq_enqueue(skb, sch, to_free);
>> +
>> +consume_skb(skb);
> 
>This needs to be qdisc_drop(skb, sch, to_free) if queue is full, see below.
> 
>> +
>> +ret = NET_XMIT_DROP;
>> +skb_list_walk_safe(segs, segs, next) {
>> +skb_mark_not_on_list(segs);
>> +qdisc_skb_cb(segs)->pkt_len = segs->len;
> 
> This seems to under-estimate bytes sent. See qdisc_pkt_len_init() for details.
> 
>> +if (__fq_enqueue(segs, sch, to_free) == NET_XMIT_SUCCESS)
>> +ret = NET_XMIT_SUCCESS;
>> +}
> 
> if (unlikely(ret == NET_XMIT_DROP))
> qdisc_drop(skb, sch, to_free);

Maybe not qdisc_drop() (qdisc drop counters are already updated)
but at least kfree_skb() so that drop monitor is not fooled.

> else
> consume_skb(skb);
> 
>> +
>> +return ret;
>> +}
>> +
>>  static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>>  {
>>  unsigned long sample;
>>
> 
>

Re: [PATCH bpf V2 1/2] bpf: devmap adjust uapi for attach bpf program

2020-06-09 Thread Jesper Dangaard Brouer

On Tue, 9 Jun 2020 07:47:06 -0600
David Ahern  wrote:

> On 6/9/20 7:31 AM, Jesper Dangaard Brouer wrote:
> > This patch remove the minus-1 checks, and have zero mean feature isn't used.
> >   
> 
> For consistency this should apply to other XDP fd uses as well -- like
> IFLA_XDP_EXPECTED_FD and IFLA_XDP_FD.

I agree, but I choose to limit the scope as this is for bpf-tree.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH] ath10k: Wait until copy complete is actually done before completing

2020-06-09 Thread Douglas Anderson

On wcn3990 we have "per_ce_irq = true".  That makes the
ath10k_ce_interrupt_summary() function always return 0xfff. The
ath10k_ce_per_engine_service_any() function will see this and think
that _all_ copy engines have an interrupt.  Without checking, the
ath10k_ce_per_engine_service() assumes that if it's called that the
"copy complete" (cc) interrupt fired.  This combination seems bad.

Let's add a check to make sure that the "copy complete" interrupt
actually fired in ath10k_ce_per_engine_service().

This might fix a hard-to-reproduce failure where it appears that the
copy complete handlers run before the copy is really complete.
Specifically a symptom was that we were seeing this on a Qualcomm
sc7180 board:
  arm-smmu 1500.iommu: Unhandled context fault:
  fsr=0x402, iova=0x7fdd45780, fsynr=0x30003, cbfrsynra=0xc1, cb=10

Even on platforms that don't have wcn3990 this still seems like it
would be a sane thing to do.  Specifically the current IRQ handler
comments indicate that there might be other misc interrupt sources
firing that need to be cleared.  If one of those sources was the one
that caused the IRQ handler to be called it would also be important to
double-check that the interrupt we cared about actually fired.

Signed-off-by: Douglas Anderson 
---

 drivers/net/wireless/ath/ath10k/ce.c | 30 +++-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/ce.c 
b/drivers/net/wireless/ath/ath10k/ce.c
index 294fbc1e89ab..ffdd4b995f33 100644
--- a/drivers/net/wireless/ath/ath10k/ce.c
+++ b/drivers/net/wireless/ath/ath10k/ce.c
@@ -481,6 +481,15 @@ static inline void 
ath10k_ce_engine_int_status_clear(struct ath10k *ar,
ath10k_ce_write32(ar, ce_ctrl_addr + wm_regs->addr, mask);
 }
 
+static inline bool ath10k_ce_engine_int_status_check(struct ath10k *ar,
+u32 ce_ctrl_addr,
+unsigned int mask)
+{
+   struct ath10k_hw_ce_host_wm_regs *wm_regs = ar->hw_ce_regs->wm_regs;
+
+   return ath10k_ce_read32(ar, ce_ctrl_addr + wm_regs->addr) & mask;
+}
+
 /*
  * Guts of ath10k_ce_send.
  * The caller takes responsibility for any needed locking.
@@ -1301,19 +1310,22 @@ void ath10k_ce_per_engine_service(struct ath10k *ar, 
unsigned int ce_id)
 
spin_lock_bh(&ce->ce_lock);
 
-   /* Clear the copy-complete interrupts that will be handled here. */
-   ath10k_ce_engine_int_status_clear(ar, ctrl_addr,
- wm_regs->cc_mask);
+   if (ath10k_ce_engine_int_status_check(ar, ctrl_addr,
+ wm_regs->cc_mask)) {
+   /* Clear before handling */
+   ath10k_ce_engine_int_status_clear(ar, ctrl_addr,
+ wm_regs->cc_mask);
 
-   spin_unlock_bh(&ce->ce_lock);
+   spin_unlock_bh(&ce->ce_lock);
 
-   if (ce_state->recv_cb)
-   ce_state->recv_cb(ce_state);
+   if (ce_state->recv_cb)
+   ce_state->recv_cb(ce_state);
 
-   if (ce_state->send_cb)
-   ce_state->send_cb(ce_state);
+   if (ce_state->send_cb)
+   ce_state->send_cb(ce_state);
 
-   spin_lock_bh(&ce->ce_lock);
+   spin_lock_bh(&ce->ce_lock);
+   }
 
/*
 * Misc CE interrupts are not being handled, but still need
-- 
2.27.0.278.ge193c7cf3a9-goog

[PATCH] Do not assign in if condition wg_noise_handshake_consume_initiation()

2020-06-09 Thread Frank Werner-Krippendorf

Fixes an error condition reported by checkpatch.pl which caused by
assigning a variable in an if condition in
wg_noise_handshake_consume_initiation().

Signed-off-by: Frank Werner-Krippendorf 
---
 drivers/net/wireguard/noise.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index 626433690abb..9524a15a62a6 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -617,8 +617,9 @@ wg_noise_handshake_consume_initiation(struct 
message_handshake_initiation *src,
memcpy(handshake->hash, hash, NOISE_HASH_LEN);
memcpy(handshake->chaining_key, chaining_key, NOISE_HASH_LEN);
handshake->remote_index = src->sender_index;
+   initiation_consumption = ktime_get_coarse_boottime_ns();
if ((s64)(handshake->last_initiation_consumption -
-   (initiation_consumption = ktime_get_coarse_boottime_ns())) < 0)
+   initiation_consumption) < 0)
handshake->last_initiation_consumption = initiation_consumption;
handshake->state = HANDSHAKE_CONSUMED_INITIATION;
up_write(&handshake->lock);
-- 
2.20.1

Fw: [Bug 208107] New: hard lock with CONFIG_CGROUP_NET_PRIO enabled

2020-06-09 Thread Stephen Hemminger

Begin forwarded message:

Date: Mon, 08 Jun 2020 21:53:11 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 208107] New: hard lock with CONFIG_CGROUP_NET_PRIO enabled

https://bugzilla.kernel.org/show_bug.cgi?id=208107

Bug ID: 208107
   Summary: hard lock with CONFIG_CGROUP_NET_PRIO enabled
   Product: Networking
   Version: 2.5
Kernel Version: 5.4.42
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: c...@neo-zeon.de
Regression: No

A bug was introduced with commit fc800ec491c39e42b65df72dc9ede3bb2d4a3755 where
my NFS server (which supports Kerberos) will lock up the system after a client
has mounted volumes export by the NFS server.

Unsetting CONFIG_CGROUP_NET_PRIO or reverting
fc800ec491c39e42b65df72dc9ede3bb2d4a3755 resolves the issue.

I get nothing on the console or in the logs once this occurs. The system is
locked until it finally restarts on its own or through user intervention. The
client machine I used for testing has CONFIG_CGROUP_NET_PRIO enabled in the
kernel, but it causes no issues there.

I suspect the Kerberos functionality has no impact on this issue, but I'm
presently unable to disable it.

I'm not actually using the functionality provided by CONFIG_CGROUP_NET_PRIO, so
I've disabled it for now.

All kernels since v5.4.42 are affected. By disabling this feature or reverting
fc800ec491c39e42b65df72dc9ede3bb2d4a3755 allows 5.4.2+ to work. I'm currently
running stable with v5.4.45.

The 5.5 and 5.6 series are unsurprisingly affected, and I suspect both series
would work with either of the fixes above.

Some info about the NFS server:
Debian 10 Buster
AMD EPYC 3251 8-core processor
128GB memory
Intel X540 10GB NIC PCIE nic (only 1 port is used)
2x Intel Corporation I350 Gigabit onboard NICs (both are used)

All NIC's are used in individual bridge interfaces for a total of 3 interfacs
(br0, br1, br2) to facilitate virtualization.

More info furnished upon request.

-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: [PATCH v3 0/7] Venus dynamic debug

2020-06-09 Thread Randy Dunlap

On 6/9/20 4:13 AM, Matthew Wilcox wrote:
> On Tue, Jun 09, 2020 at 01:45:57PM +0300, Stanimir Varbanov wrote:
>> Here is the third version of dynamic debug improvements in Venus
>> driver.  As has been suggested on previous version by Joe [1] I've
>> made the relevant changes in dynamic debug core to handle leveling
>> as more generic way and not open-code/workaround it in the driver.
>>
>> About changes:
>>  - added change in the dynamic_debug and in documentation
>>  - added respective pr_debug_level and dev_dbg_level
> 
> Honestly, this seems like you want to use tracepoints, not dynamic debug.
> 

Also see this patch series:
https://lore.kernel.org/lkml/20200605162645.289174-1-jim.cro...@gmail.com/
[PATCH 00/16] dynamic_debug: cleanups, 2 features

It adds/expands dynamic debug flags quite a bit.

-- 
~Randy

Re: [PATCH v3 0/7] Venus dynamic debug

2020-06-09 Thread Joe Perches

On Tue, 2020-06-09 at 13:45 +0300, Stanimir Varbanov wrote:
> Hello,
> 
> Here is the third version of dynamic debug improvements in Venus
> driver.  As has been suggested on previous version by Joe [1] I've
> made the relevant changes in dynamic debug core to handle leveling
> as more generic way and not open-code/workaround it in the driver.
> 
> About changes:
>  - added change in the dynamic_debug and in documentation
>  - added respective pr_debug_level and dev_dbg_level
> 
> regards,
> Stan
> 
> [1] https://lkml.org/lkml/2020/5/21/668

This capability is already clumsily used by many
drivers that use a module level "debug" flag.

$ git grep -P "MODULE_PARM.*\bdebug\b"|wc -l
501

That's a _lot_ of homebrewed mechanisms.

Making dynamic debug have this as a feature would
help consolidate and standardize the capability.

ftrace is definitely useful, but not quite as
lightweight and doesn't have the typical uses.

Re: TCP_DEFER_ACCEPT wakes up without data

2020-06-09 Thread Christoph Paasch

Hello,

On Sun, Jun 7, 2020 at 3:00 AM Florian Westphal  wrote:
>
> Eric Dumazet  wrote:
> > > Sure! TCP_DEFER_ACCEPT delays the creation of the socket until data
> > > has been sent by the client *or* the specified time has expired upon
> > > which a last SYN/ACK is sent and if the client replies with an ACK the
> > > socket will be created and presented to the accept()-call. In the
> > > latter case it means that the app gets a socket that does not have any
> > > data to be read - which goes against the intention of TCP_DEFER_ACCEPT
> > > (man-page says: "Allow a listener to be awakened only when data
> > > arrives on the socket.").
> > >
> > > In the original thread the proposal was to kill the connection with a
> > > TCP-RST when the specified timeout expired (after the final SYN/ACK).
> > >
> > > Thus, my question in my first email whether there is a specific reason
> > > to not do this.
> > >
> > > API-breakage does not seem to me to be a concern here. Apps that are
> > > setting DEFER_ACCEPT probably would not expect to get a socket that
> > > does not have data to read.
> >
> > Thanks for the summary ;)
> >
> > I disagree.
> >
> > A server might have two modes :
> >
> > 1) A fast path, expecting data from user in a small amount of time, from 
> > peers not too far away.
> >
> > 2) A slow path, for clients far away. Server can implement strategies to 
> > control number of sockets
> > that have been accepted() but not yet active (no data yet received).

to add to that: There are indeed scenarios where TCP-SYN/... without
payload go through fine but as soon as the packet-size increases
WiFi/Cell has problems because of smaller grants given by the
AP/tower. But even those connections should be able to get the data
through within a "reasonable" timeframe. Anything beyond that
timeframe will anyways have such a bad user-experience that it is
pointless to continue.

So, a use-case here would be where the user is in such a slow network
and a TCP-split proxy is deployed (such proxies are very common in
cellular networks). That proxy will be ACKing the server's SYN/ACK
retransmission at the end of the defer-accept period, while the client
is still trying very hard to get the data through to the proxy (or
even, the client might have gone totally out-of-service).

For those kinds of scenarios it would make sense to have a different
DEFER_ACCEPT-behavior (maybe with a separate socket-option as Florian
suggested).

Christoph

>
> So we can't change DEFER_ACCEPT behaviour.
> Any objections to adding TCP_DEFER_ACCEPT2 with the behaviour outlined
> by Christoph?

Re: [PATCH v3 0/7] Venus dynamic debug

2020-06-09 Thread Joe Perches

(adding Jim Cromie and comments)

On Tue, 2020-06-09 at 09:03 -0700, Randy Dunlap wrote:
> On 6/9/20 4:13 AM, Matthew Wilcox wrote:
> > On Tue, Jun 09, 2020 at 01:45:57PM +0300, Stanimir Varbanov wrote:
> > > Here is the third version of dynamic debug improvements in Venus
> > > driver.  As has been suggested on previous version by Joe [1] I've
> > > made the relevant changes in dynamic debug core to handle leveling
> > > as more generic way and not open-code/workaround it in the driver.
> > > 
> > > About changes:
> > >  - added change in the dynamic_debug and in documentation
> > >  - added respective pr_debug_level and dev_dbg_level
> > 
> > Honestly, this seems like you want to use tracepoints, not dynamic debug.

Tracepoints are a bit heavy and do not have any class
or grouping mechanism.

debug_class is likely a better name than debug_level

> Also see this patch series:
> https://lore.kernel.org/lkml/20200605162645.289174-1-jim.cro...@gmail.com/
> [PATCH 00/16] dynamic_debug: cleanups, 2 features
> 
> It adds/expands dynamic debug flags quite a bit.

Yes, and thanks Randy and Jim and Stanimir

I haven't gone through Jim's proposal enough yet.
It's unfortunate these patches series conflict.

And for Jim, a link to Stanimir's patch series:
https://lore.kernel.org/lkml/20200609104604.1594-1-stanimir.varba...@linaro.org/

RE: [PATCH bpf 1/2] bpf, sockhash: Fix memory leak when unlinking sockets in sock_hash_free

2020-06-09 Thread John Fastabend

Jakub Sitnicki wrote:
> When sockhash gets destroyed while sockets are still linked to it, we will
> walk the bucket lists and delete the links. However, we are not freeing the
> list elements after processing them, leaking the memory.
> 
> The leak can be triggered by close()'ing a sockhash map when it still
> contains sockets, and observed with kmemleak:
> 
>   unreferenced object 0x888116e86f00 (size 64):
> comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
> hex dump (first 32 bytes):
>   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff  ...A.i/.
> backtrace:
>   [] sock_hash_update_common+0x4ca/0x760
>   [] sock_hash_update_elem+0x1d2/0x200
>   [<5e2c23de>] __do_sys_bpf+0x2046/0x2990
>   [] do_syscall_64+0xad/0x9a0
>   [<0d96f263>] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> Fix it by freeing the list element when we're done with it.
> 
> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> Signed-off-by: Jakub Sitnicki 
> ---
>  net/core/sock_map.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 00a26cf2cfe9..ea46f07a22d8 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -1031,6 +1031,7 @@ static void sock_hash_free(struct bpf_map *map)
>   sock_map_unref(elem->sk, elem);
>   rcu_read_unlock();
>   release_sock(elem->sk);
> + sock_hash_free_elem(htab, elem);
>   }
>   }
>  
> -- 
> 2.25.4
> 

In Cilium we pin the map and never release it thanks for catching this.

Acked-by: John Fastabend

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Joe Perches

On Tue, 2020-06-09 at 13:16 +0200, Greg Kroah-Hartman wrote:
> What is wrong with the existing control of dynamic
> debug messages that you want to add another type of arbitrary grouping
> to it? 

There is no existing grouping mechanism.

Many drivers and some subsystems used an internal one
before dynamic debug.

$ git grep "MODULE_PARM.*\bdebug\b"|wc -l
501

This is an attempt to unify those homebrew mechanisms.

Stanimir attempted to add one for his driver via a
driver specific standardized format substring for level.

> And who defines that grouping?

Individual driver authors

> Will it be driver/subsystem/arch/author specific?  Or kernel-wide?

driver specific

> This feels like it could easily get out of hand really quickly.

Likely not.  A question might be how useful all these
old debugging printks are today and if it's reasonable
to just delete them.

> Why not just use tracepoints if you really want to be fine-grained?

Weight and lack of class/group capability

RE: [PATCH bpf 2/2] bpf, sockhash: Synchronize delete from bucket list on map free

2020-06-09 Thread John Fastabend

Jakub Sitnicki wrote:
> We can end up modifying the sockhash bucket list from two CPUs when a
> sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
> that is in the sockhash is unlinking itself from it on another CPU
> it (sock_hash_delete_from_link).
> 
> This results in accessing a list element that is in an undefined state as
> reported by KASAN:
> 
> | ==
> | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
> | Write of size 8 at addr dead0122 by task kworker/2:1/95
> |
> | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 
> 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> | Workqueue: events bpf_map_free_deferred
> | Call Trace:
> |  dump_stack+0x97/0xe0
> |  ? sock_hash_free+0x13c/0x280
> |  __kasan_report.cold+0x5/0x40
> |  ? mark_lock+0xbc1/0xc00
> |  ? sock_hash_free+0x13c/0x280
> |  kasan_report+0x38/0x50
> |  ? sock_hash_free+0x152/0x280
> |  sock_hash_free+0x13c/0x280
> |  bpf_map_free_deferred+0xb2/0xd0
> |  ? bpf_map_charge_finish+0x50/0x50
> |  ? rcu_read_lock_sched_held+0x81/0xb0
> |  ? rcu_read_lock_bh_held+0x90/0x90
> |  process_one_work+0x59a/0xac0
> |  ? lock_release+0x3b0/0x3b0
> |  ? pwq_dec_nr_in_flight+0x110/0x110
> |  ? rwlock_bug.part.0+0x60/0x60
> |  worker_thread+0x7a/0x680
> |  ? _raw_spin_unlock_irqrestore+0x4c/0x60
> |  kthread+0x1cc/0x220
> |  ? process_one_work+0xac0/0xac0
> |  ? kthread_create_on_node+0xa0/0xa0
> |  ret_from_fork+0x24/0x30
> | ==
> 
> Fix it by reintroducing spin-lock protected critical section around the
> code that removes the elements from the bucket on sockhash free.
> 
> To do that we also need to defer processing of removed elements, until out
> of atomic context so that we can unlink the socket from the map when
> holding the sock lock.
> 
> Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from 
> sock_{hash|map}_free")
> Reported-by: Eric Dumazet 
> Signed-off-by: Jakub Sitnicki 
> ---
>  net/core/sock_map.c | 23 +--
>  1 file changed, 21 insertions(+), 2 deletions(-)

Thanks.

Acked-by: John Fastabend

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Stephen Hemminger

On Sun, 07 Jun 2020 16:45:32 -0700 (PDT)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Sun, 7 Jun 2020 15:30:19 -0700
> 
> > Open source projects have been working hard to remove the terms master and 
> > slave
> > in API's and documentation. Apparently, Linux hasn't gotten the message.
> > It would make sense not to introduce new instances.  
> 
> Would you also be against, for example, the use of the terminology
> expressing the "death" of allocated registers in a compiler backend,
> for example?
> 
> How far do you plan take this resistence of terminology when it
> clearly has a well defined usage and meaning in a specific technical
> realm which is entirely disconnected to what the terms might imply,
> meaning wise, in other realms?
> 
> And if you are going to say not to use this terminology, you must
> suggest a reasonable (and I do mean _reasonable_) well understood
> and _specific_ replacement.
> 
> Thank you.

How many times have you or Linus argued about variable naming.
Yes, words do matter and convey a lot of implied connotation and meaning.

Most projects and standards bodies are taking a stance on fixing the
language. The IETF is has proposed making changes as well.

There are a very specific set of trigger words and terms that
should be fixed. Most of these terms do have better alternatives.

A common example is that master/slave is unclear and would be clearer
as primary/secondary or active/backup or controller/worker.

Most of networking is based on standards. When the standards wording changes
(and it will happen soon); then Linux should also change the wording in the
source, api and documentation.

See:

[0] - 
,

[1] - 
[2] - 
[3] - , 

[4] - 
[5] - 
[6] - 
https://mail.gnome.org/archives/desktop-devel-list/2019-April/msg00049.html
[7] - https://www.ietf.org/archive/id/draft-knodel-terminology-01.txt

[RFC PATCH bpf-next 2/2] i40e: avoid xdp_do_redirect() call when "redirect_tail_call" is set

2020-06-09 Thread Björn Töpel

From: Björn Töpel 

If an XDP program, where all the bpf_redirect_map() calls are tail
calls (as defined by the previous commit), the driver does not need to
explicitly call xdp_do_redirect().

The driver checks the active XDP program, and notifies the BPF helper
indirectly via xdp_set_redirect_tailcall().

This is just a naive, as-simple-as-possible implementation, calling
xdp_set_redirect_tailcall() for each packet.

For the AF_XDP rxdrop benchmark the pps went from 21.5 to 23.2 Mpps.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 --
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 14 --
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f9555c847f73..0d8dbdc4f47e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2186,6 +2186,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
int err, result = I40E_XDP_PASS;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
+   bool impl_redir = false;
u32 act;
 
rcu_read_lock();
@@ -2194,6 +2195,11 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
if (!xdp_prog)
goto xdp_out;
 
+   if (xdp_prog->redirect_tail_call) {
+   impl_redir = true;
+   xdp_set_redirect_tailcall(rx_ring->xdp_prog, xdp,
+ rx_ring->netdev);
+   }
prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
act = bpf_prog_run_xdp(xdp_prog, xdp);
@@ -2205,8 +2211,12 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
break;
case XDP_REDIRECT:
-   err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+   if (impl_redir) {
+   result = I40E_XDP_REDIR;
+   } else {
+   err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+   }
break;
default:
bpf_warn_invalid_xdp_action(act);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 7276580cbe64..0826f551649f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -146,6 +146,7 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, 
struct xdp_buff *xdp)
int err, result = I40E_XDP_PASS;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
+   bool impl_redir = false;
u32 act;
 
rcu_read_lock();
@@ -153,6 +154,11 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, 
struct xdp_buff *xdp)
 * this path is enabled by setting an XDP program.
 */
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+   if (xdp_prog->redirect_tail_call) {
+   impl_redir = true;
+   xdp_set_redirect_tailcall(rx_ring->xdp_prog, xdp,
+ rx_ring->netdev);
+   }
act = bpf_prog_run_xdp(xdp_prog, xdp);
 
switch (act) {
@@ -163,8 +169,12 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, 
struct xdp_buff *xdp)
result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
break;
case XDP_REDIRECT:
-   err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+   if (impl_redir) {
+   result = I40E_XDP_REDIR;
+   } else {
+   err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+   result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+   }
break;
default:
bpf_warn_invalid_xdp_action(act);
-- 
2.25.1

[RFC PATCH bpf-next 1/2] bpf, xdp: add naive bpf_redirect_map() tail call detection

2020-06-09 Thread Björn Töpel

From: Björn Töpel 

The bpf_redirect_map() BPF helper is used to redirect a packet to the
endpoint referenced by a map element. Currently, there is no
restrictment how the helper is called, e.g.

  ret = bpf_redirect_map(...);
  ...
  ret = bpf_redirect_map(...);
  ... // e.g. modify packet
  return ret;

is a valid, albeit weird, program.

The unconstrained nature of BPF (redirect) helpers has implications
for the implementation of the XDP_REDIRECT call. Today, the redirect
flow for a packet is along the lines of:

  for_each_packet() {
act = bpf_prog_run_xdp(xdp_prog, xdp); // does bpf_redirect_map() call
if (act == XDP_REDIRECT)
  ret = xdp_do_redirect(...);
  }
  xdp_do_flush();

The bpf_redirect_map() helper populates a per-cpu structure, which is
picked up by the xdp_do_redirect() function that also performs the
redirect action.

What if the verifier could detect that certain programs have certain
constraints, so the functiontionality performed by xdp_do_redirect()
could be done directly from bpf_redirect_map()?

Here, the verifier is taught to detect program where all
bpf_redirect_map() calls, are tail calls. If a function call is a last
action performed in another function, it is said to be a tail call --
not to be confused with the BPF tail call helper.

This implementation is just a PoC, and not complete. It manages to
detect tail calls by a naive peephole scheme.

For the following program (the built in AF_XDP libbpf program):

  ...
  call bpf_redirect_map
  if w0 != 0 goto pc+X (goto exit)
  ...
  call bpf_redirect_map
  exit

the verifier detects that all calls are tail calls.

TODO:
 * Stateful tail call detection, instead of the current peephole one.
 * Add all helpers capable of returning XDP_REDIRECT, and not just
   bpf_redirect_map().

Signed-off-by: Björn Töpel 
---
 include/linux/bpf_verifier.h |  2 ++
 include/linux/filter.h   | 19 +++-
 kernel/bpf/verifier.c| 53 +
 net/core/filter.c| 57 
 4 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ca08db4ffb5f..e41fd9264c94 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -411,6 +411,8 @@ struct bpf_verifier_env {
u32 peak_states;
/* longest register parentage chain walked for liveness marking */
u32 longest_mark_read_walk;
+   bool all_redirect_map_tail;
+   u32 redirect_map_call_cnt;
 };
 
 __printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..cdbc2ca8db4f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -534,7 +534,8 @@ struct bpf_prog {
is_func:1,  /* program is a bpf function */
kprobe_override:1, /* Do we override a kprobe? 
*/
has_callchain_buf:1, /* callchain buffer 
allocated? */
-   enforce_expected_attach_type:1; /* Enforce 
expected_attach_type checking at attach time */
+   enforce_expected_attach_type:1, /* Enforce 
expected_attach_type checking at attach time */
+   redirect_tail_call:1;
enum bpf_prog_type  type;   /* Type of BPF program */
enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
@@ -613,6 +614,9 @@ struct bpf_redirect_info {
void *tgt_value;
struct bpf_map *map;
u32 kern_flags;
+   struct bpf_prog *xdp_prog_redirect_tailcall; // if !NULL, we can do 
"everything" from BPF context.
+   struct xdp_buff *xdp;
+   struct net_device *dev;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
@@ -620,6 +624,19 @@ DECLARE_PER_CPU(struct bpf_redirect_info, 
bpf_redirect_info);
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
 
+
+static inline void xdp_set_redirect_tailcall(struct bpf_prog *xdp_prog,
+struct xdp_buff *xdp,
+struct net_device *dev)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->xdp_prog_redirect_tailcall = xdp_prog;
+   ri->xdp = xdp;
+   ri->dev = dev;
+}
+
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5c7bbaac81ef..27a4fd5c8b8b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4594,6 +4594,54 @@ static int check_reference_leak(struct bpf_verifier_env 
*env

[RFC PATCH bpf-next 0/2] bpf_redirect_map() tail call detection and xdp_do_redirect() avoidance

2020-06-09 Thread Björn Töpel

I hacked a quick PoC, based on the input from my earlier post [1].

Quick recap; For certain XDP programs, would it be possible to get rid
of the xdp_do_redirect() call (and the per-cpu write/read), and
instead perform the action directly from the BPF helper? If so, that
would potentially make the XDP_REDIRECT faster/less complex.

This PoC/RFC teach the verifier to detect when an XDP program is
structured such that all bpf_redirect_map() calls are tail calls. A
driver can then probe the BPF program, use the new
xdp_set_redirect_tailcall() function, and avoid the xdp_do_redirect()
call. This was Toke's suggestion, instead of my XDP_CONSUMED idea,
adding a new action.

To perform the xdp_do_redirect() tasks from the bpf_redirect_map()
helper, additional parameters are needed (XDP context, netdev, and XDP
program). They are passed via the bpf_redirect_into per-cpu structure
using the xdp_set_redirect_tailcall() function.

Note that the code is broken for programs mixing bpf_redirect() and
bpf_redirect_map()!

There are more details in the commits.

Is this a good idea? I have only measured for AF_XDP redirects, but
all XDP_REDIRECT targets should benefit. For AF_XDP the rxdrop
scenario went from 21.5 to 23.2 Mpps on my machine.

Next steps would be implement proper tail calls (suggested by John and
Alexei).

Getting input on my (horrible) verifier peephole hack, and a better
way to detect tail calls would be very welcome!

Disregard naming and style. I'm mostly interested what people think
about the concept, and if it's worth working on.

Next steps:
  * Better tail call detection in the verifier, instead of the naive
peephole version in patch 1.
  * Implement proper tail calls (constrained, indirect jump BPF
instruction).


Cheers,
Björn

[1] 
https://lore.kernel.org/bpf/caj+hfnidbgwtlinlqohwocumoyyrcag454gggkcbseqpsa1...@mail.gmail.com/

Björn Töpel (2):
  bpf, xdp: add naive bpf_redirect_map() tail call detection
  i40e: avoid xdp_do_redirect() call when "redirect_tail_call" is set

 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 14 -
 include/linux/bpf_verifier.h|  2 +
 include/linux/filter.h  | 19 ++-
 kernel/bpf/verifier.c   | 53 +++
 net/core/filter.c   | 57 +
 6 files changed, 154 insertions(+), 5 deletions(-)


base-commit: cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2
-- 
2.25.1

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Edward Cree

On 09/06/2020 17:58, Joe Perches wrote:
> On Tue, 2020-06-09 at 13:16 +0200, Greg Kroah-Hartman wrote:
>> What is wrong with the existing control of dynamic
>> debug messages that you want to add another type of arbitrary grouping
>> to it? 
> There is no existing grouping mechanism.
>
> Many drivers and some subsystems used an internal one
> before dynamic debug.
>
> $ git grep "MODULE_PARM.*\bdebug\b"|wc -l
> 501
>
> This is an attempt to unify those homebrew mechanisms.
In network drivers, this is probablyusing the existing groupings
 defined by netif_level() - see NETIF_MSG_DRV and friends.  Note
 that those groups are orthogonal to the level, i.e. they control
 netif_err() etc. as well, not just debug messages.
Certainly in the case of sfc, and I'd imagine for many other net
 drivers too, the 'debug' modparam is setting the default for
 net_dev->msg_enable, which can be changed after probe with
 ethtool.
It doesn't look like the proposed mechanism subsumes that (we have
 rather more than 5 groups, and it's not clear how you'd connect
 it to the existing msg_enable (which uapi must be maintained); if
 you don't have a way to do this, better exclude drivers/net/ from
 your grep|wc because you won't be unifying those - in my tree
 that's 119 hits.

-ed

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Joe Perches

On Tue, 2020-06-09 at 18:42 +0100, Edward Cree wrote:
> On 09/06/2020 17:58, Joe Perches wrote:
> > On Tue, 2020-06-09 at 13:16 +0200, Greg Kroah-Hartman wrote:
> > > What is wrong with the existing control of dynamic
> > > debug messages that you want to add another type of arbitrary grouping
> > > to it? 
> > There is no existing grouping mechanism.
> > 
> > Many drivers and some subsystems used an internal one
> > before dynamic debug.
> > 
> > $ git grep "MODULE_PARM.*\bdebug\b"|wc -l
> > 501
> > 
> > This is an attempt to unify those homebrew mechanisms.
> In network drivers, this is probablyusing the existing groupings
>  defined by netif_level() - see NETIF_MSG_DRV and friends.  Note
>  that those groups are orthogonal to the level, i.e. they control
>  netif_err() etc. as well, not just debug messages.

These are _not_ netif_ control flags.  Some are though.

For instance:

$ git grep "MODULE_PARM.*\bdebug\b" drivers/net | head -10
drivers/net/ethernet/3com/3c509.c:MODULE_PARM_DESC(debug, "debug level (0-6)");
drivers/net/ethernet/3com/3c515.c:MODULE_PARM_DESC(debug, "3c515 debug level 
(0-6)");
drivers/net/ethernet/3com/3c59x.c:MODULE_PARM_DESC(debug, "3c59x debug level 
(0-6)");
drivers/net/ethernet/adaptec/starfire.c:MODULE_PARM_DESC(debug, "Debug level 
(0-6)");
drivers/net/ethernet/allwinner/sun4i-emac.c:MODULE_PARM_DESC(debug, "debug 
message flags");
drivers/net/ethernet/altera/altera_tse_main.c:MODULE_PARM_DESC(debug, "Message 
Level (-1: default, 0: no output, 16: all)");
drivers/net/ethernet/amazon/ena/ena_netdev.c:MODULE_PARM_DESC(debug, "Debug 
level (0=none,...,16=all)");
drivers/net/ethernet/amd/atarilance.c:MODULE_PARM_DESC(lance_debug, "atarilance 
debug level (0-3)");
drivers/net/ethernet/amd/lance.c:MODULE_PARM_DESC(lance_debug, "LANCE/PCnet 
debug level (0-7)");
drivers/net/ethernet/amd/pcnet32.c:MODULE_PARM_DESC(debug, DRV_NAME " debug 
level");

These are all level/class output controls.

> Certainly in the case of sfc, and I'd imagine for many other net
>  drivers too, the 'debug' modparam is setting the default for
>  net_dev->msg_enable, which can be changed after probe with
>  ethtool.

True.

> It doesn't look like the proposed mechanism subsumes that (we have
>  rather more than 5 groups, and it's not clear how you'd connect
>  it to the existing msg_enable (which uapi must be maintained); if
>  you don't have a way to do this, better exclude drivers/net/ from
>  your grep|wc because you won't be unifying those - in my tree
>  that's 119 hits.

Likely not.

I agree it'd be useful to attach the modparam control flag
to the dynamic debug use somehow.

cheers, Joe

Re: [PATCH] bpf/sockmap: fix kernel panic at __tcp_bpf_recvmsg

2020-06-09 Thread Alexei Starovoitov

On Tue, Jun 9, 2020 at 2:04 AM Jakub Sitnicki  wrote:
>
> On Fri, Jun 05, 2020 at 10:46 AM CEST, dihu wrote:
> > When user application calls read() with MSG_PEEK flag to read data
> > of bpf sockmap socket, kernel panic happens at
> > __tcp_bpf_recvmsg+0x12c/0x350. sk_msg is not removed from ingress_msg
> > queue after read out under MSG_PEEK flag is set. Because it's not
> > judged whether sk_msg is the last msg of ingress_msg queue, the next
> > sk_msg may be the head of ingress_msg queue, whose memory address of
> > sg page is invalid. So it's necessary to add check codes to prevent
> > this problem.
> >
> > [20759.125457] BUG: kernel NULL pointer dereference, address:
> > 0008
> > [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: GE
> > 5.4.32 #1
> > [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS
> > 4.1.12 06/18/2017
> > [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
> > [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
> > [20759.276099] tcp_bpf_recvmsg+0x113/0x370
> > [20759.281137] inet_recvmsg+0x55/0xc0
> > [20759.285734] __sys_recvfrom+0xc8/0x130
> > [20759.290566] ? __audit_syscall_entry+0x103/0x130
> > [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
> > [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
> > [20759.307235] __x64_sys_recvfrom+0x24/0x30
> > [20759.312226] do_syscall_64+0x55/0x1b0
> > [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Signed-off-by: dihu 
> > ---
> >  net/ipv4/tcp_bpf.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> > index 5a05327..b82e4c3 100644
> > --- a/net/ipv4/tcp_bpf.c
> > +++ b/net/ipv4/tcp_bpf.c
> > @@ -64,6 +64,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
> > *psock,
> >   } while (i != msg_rx->sg.end);
> >
> >   if (unlikely(peek)) {
> > + if (msg_rx == list_last_entry(&psock->ingress_msg,
> > +   struct sk_msg, list))
> > + break;
> >   msg_rx = list_next_entry(msg_rx, list);
> >   continue;
> >   }
>
> Acked-by: Jakub Sitnicki 

Applied. Thanks

Re: [PATCH bpf 2/2] bpf, sockhash: Synchronize delete from bucket list on map free

2020-06-09 Thread Alexei Starovoitov

On Tue, Jun 9, 2020 at 10:51 AM John Fastabend  wrote:
>
> Jakub Sitnicki wrote:
> > We can end up modifying the sockhash bucket list from two CPUs when a
> > sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
> > that is in the sockhash is unlinking itself from it on another CPU
> > it (sock_hash_delete_from_link).
> >
> > This results in accessing a list element that is in an undefined state as
> > reported by KASAN:
> >
> > | ==
> > | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
> > | Write of size 8 at addr dead0122 by task kworker/2:1/95
> > |
> > | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 
> > 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
> > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > | Workqueue: events bpf_map_free_deferred
> > | Call Trace:
> > |  dump_stack+0x97/0xe0
> > |  ? sock_hash_free+0x13c/0x280
> > |  __kasan_report.cold+0x5/0x40
> > |  ? mark_lock+0xbc1/0xc00
> > |  ? sock_hash_free+0x13c/0x280
> > |  kasan_report+0x38/0x50
> > |  ? sock_hash_free+0x152/0x280
> > |  sock_hash_free+0x13c/0x280
> > |  bpf_map_free_deferred+0xb2/0xd0
> > |  ? bpf_map_charge_finish+0x50/0x50
> > |  ? rcu_read_lock_sched_held+0x81/0xb0
> > |  ? rcu_read_lock_bh_held+0x90/0x90
> > |  process_one_work+0x59a/0xac0
> > |  ? lock_release+0x3b0/0x3b0
> > |  ? pwq_dec_nr_in_flight+0x110/0x110
> > |  ? rwlock_bug.part.0+0x60/0x60
> > |  worker_thread+0x7a/0x680
> > |  ? _raw_spin_unlock_irqrestore+0x4c/0x60
> > |  kthread+0x1cc/0x220
> > |  ? process_one_work+0xac0/0xac0
> > |  ? kthread_create_on_node+0xa0/0xa0
> > |  ret_from_fork+0x24/0x30
> > | ==
> >
> > Fix it by reintroducing spin-lock protected critical section around the
> > code that removes the elements from the bucket on sockhash free.
> >
> > To do that we also need to defer processing of removed elements, until out
> > of atomic context so that we can unlink the socket from the map when
> > holding the sock lock.
> >
> > Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from 
> > sock_{hash|map}_free")
> > Reported-by: Eric Dumazet 
> > Signed-off-by: Jakub Sitnicki 
> > ---
> >  net/core/sock_map.c | 23 +--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
>
> Thanks.
>
> Acked-by: John Fastabend 

Applied both to bpf tree.

FYI I see this splat:
 ./test_sockmap
# 1/ 6  sockmap::txmsg test passthrough:OK
# 2/ 6  sockmap::txmsg test redirect:OK
# 3/ 6  sockmap::txmsg test drop:OK
# 4/ 6  sockmap::txmsg test ingress redirect:OK
[   19.180397]
[   19.180633] =
[   19.181042] WARNING: suspicious RCU usage
[   19.181517] 5.7.0-07177-g75e68e5bf2c7 #688 Not tainted
[   19.182048] -
[   19.182570] include/linux/skmsg.h:284 suspicious
rcu_dereference_check() usage!
[   19.183341]
[   19.183341] other info that might help us debug this:
[   19.183341]
[   19.184215]
[   19.184215] rcu_scheduler_active = 2, debug_locks = 1
[   19.184875] 1 lock held by test_sockmap/291:
[   19.185356]  #0: 8881315b5b20 (sk_lock-AF_INET){+.+.}-{0:0},
at: tls_sw_recvmsg+0x128/0x810
[   19.186315]
[   19.186315] stack backtrace:
[   19.186774] CPU: 0 PID: 291 Comm: test_sockmap Not tainted
5.7.0-07177-g75e68e5bf2c7 #688
[   19.188497] Call Trace:
[   19.188757]  dump_stack+0x71/0xa0
[   19.189140]  sk_psock_skb_redirect.isra.0+0xa6/0xf0
[   19.189651]  sk_psock_tls_strp_read+0x1cc/0x250
[   19.190142]  tls_sw_recvmsg+0x6e4/0x810
[   19.190584]  inet_recvmsg+0x55/0x1d0
[   19.190963]  sys_recvmsg+0x73/0x130
[   19.191365]  ? import_iovec+0x27/0xd0
[   19.191800]  ? copy_msghdr_from_user+0x4c/0x70
[   19.192271]  ___sys_recvmsg+0x68/0x90
[   19.192682]  ? __might_fault+0x36/0x80
[   19.193078]  ? __audit_syscall_exit+0x242/0x2b0
[   19.193576]  __sys_recvmsg+0x46/0x80
[   19.193964]  do_syscall_64+0x4a/0x1b0
[   19.194355]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[   19.194924] RIP: 0033:0x7fc915cc4490

Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Edward Cree

On 09/06/2020 18:56, Joe Perches wrote:
> These are _not_ netif_ control flags. Some are though.
> For instance:
>
> $ git grep "MODULE_PARM.*\bdebug\b" drivers/net | head -10
> [...]
>
> These are all level/class output controls.
TIL, thanks!  I should have looked deeperrather than assuming
 they were all like ours.

Though judging just by that grep output, it also looks like
 quite a lot of those won't fit into 5 groups either, so some
 rethink may still be needed...

-ed

Re: [PATCH bpf] bpf: Reset data_meta before running programs attached to devmap entry

2020-06-09 Thread Alexei Starovoitov

On Mon, Jun 8, 2020 at 8:17 AM David Ahern  wrote:
>
> This is a new context that does not handle metadata at the moment, so
> mark data_meta invalid.
>
> Fixes: fbee97feed9b ("bpf: Add support to attach bpf program to a devmap 
> entry")
> Signed-off-by: David Ahern 

Applied. Thanks

Re: [PATCH bpf] bpf: cgroup: allow multi-attach program to replace itself

2020-06-09 Thread Alexei Starovoitov

On Mon, Jun 8, 2020 at 9:22 AM Lorenz Bauer  wrote:
>
> When using BPF_PROG_ATTACH to attach a program to a cgroup in
> BPF_F_ALLOW_MULTI mode, it is not possible to replace a program
> with itself. This is because the check for duplicate programs
> doesn't take the replacement program into account.
>
> Replacing a program with itself might seem weird, but it has
> some uses: first, it allows resetting the associated cgroup storage.
> Second, it makes the API consistent with the non-ALLOW_MULTI usage,
> where it is possible to replace a program with itself. Third, it
> aligns BPF_PROG_ATTACH with bpf_link, where replacing itself is
> also supported.
>
> Sice this code has been refactored a few times this change will
> only apply to v5.7 and later. Adjustments could be made to
> commit 1020c1f24a94 ("bpf: Simplify __cgroup_bpf_attach") and
> commit d7bf2c10af05 ("bpf: allocate cgroup storage entries on attaching bpf 
> programs")
> as well as commit 324bda9e6c5a ("bpf: multi program support for cgroup+bpf")
>
> Signed-off-by: Lorenz Bauer 
> Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program 
> attachment")

Applied. Thanks

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Andrew Lunn

Hi Stephen

> A common example is that master/slave is unclear and would be clearer
> as primary/secondary or active/backup or controller/worker.

802.3, cause 32.1.2, 2015 version:

   A 100BASE-T2 PHY can be configured either as a master PHY or as a
   slave PHY. The master-slave relationship between two stations
   sharing a link segment is established during Auto-Negotiation (see
   Clause 28, 32.5, Annex 28C, and 32.5.2). The master PHY uses an
   external clock to determine the timing of transmitter and receiver
   operations. The slave PHY recovers the clock from the received
   signal and uses it to determine the timing of transmitter
   operations, i.e., it performs loop timing, as illustrated in Figure
   32–2.

In this case, i would say master/slave is very clearly defined.

Given these definitions, would you like to propose alternatives?

Do you have any insights has to how the IEEE 802.3 standard will be
changed?

  Andrew

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread David Miller

From: Stephen Hemminger 
Date: Tue, 9 Jun 2020 10:19:35 -0700

> Yes, words do matter and convey a lot of implied connotation and
> meaning.

What is your long term plan?  Will you change all of the UAPI for
bonding for example?

Or will we have a partial solution to the problem?

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Edward Cree

Disclaimer: *definitely* not speaking for my employer.

On 09/06/2020 18:19, Stephen Hemminger wrote:
> How many times have you or Linus argued about variable naming.
> Yes, words do matter and convey a lot of implied connotation and meaning.
Connotation, unlike denotation, is widely variable.  I would aver
 that for most people who are triggered or offended by the technical
 use of master/slave or similar terms, this is a *learned behaviour*
 that occurs because they have been taught that they should be
 offended by these words.  Are these people also incapable of using
 those words to describe actual historical slavery?
There is a difference between stating "X relates to Y as a slave to
 a master" and "You relate to me (or ought to do so) as a slave to a
 master".  The former is a merely factual claim which is often true
 of technical entities (and in the past or in certain parts of the
 world today is true of some humans, however morally wrong); the
 latter is an assertion of power.  It is only the latter which is,
 or at any rate should be, offensive; the attempt to ban the former
 rests on an equivocation between positive and normative uses.
Anyone who can't put connotation aside and stick to denotation
 (a) needs to grow up
 (b) is ill-suited to working with computers.

> Most projects and standards bodies are taking a stance on fixing the
> language. The IETF is has proposed making changes as well.
An expired Internet-Draft authored by a human-rights charity and a
 media studies postgrad student does not constitute "the IETF has
 proposed".  (Besides, it's historically inaccurate; it claims that
 the word "robot" means slave, when in fact it means the labour owed
 by a _serf_ ("robotnik").  And it cites Orwell with apparently *no*
 sense of irony whatsoever... he was arguing _against_ Newspeak, not
 for it!)

> A common example is that master/slave is unclear and would be clearer
> as primary/secondary or active/backup or controller/worker.
So why isn't controller/worker just as offensive, given all those
 labourers throughout history who have suffered under abusive
 managers?  Or does a word need a tenuous connection to race before
 it can be ritually purified from the language?
And is there really an EE anywhere who finds the terminology of
 master and slave clocks unclear?  I suspect very few would gain a
 better understanding from any of your suggested alternatives.

> Most of networking is based on standards. When the standards wording changes
> (and it will happen soon); then Linux should also change the wording in the
> source, api and documentation.
Rather, it seems that this is an attempt to change Linux in order
 to _de facto_ change the standard, thereby creating pressure on
 standards bodies to change it _de jure_ to match.  Yet, in the
 real world, engineers use and understand the current terminology;
 the push for language purification bears but little reference to
 anything outside of itself.

In conclusion, I'd like to quote from Henry Spencer's Ten
 Commandments for C Programmers (Annotated Edition) [1]:
> As a lamentable side issue, there has been some unrest from the
> fanatics of the Pronoun Gestapo over the use of the word "man"
> in [the eighth] Commandment, for they believe that great efforts
> and loud shouting devoted to the ritual purification of the
> language will somehow redound to the benefit of the downtrodden
> (whose real and grievous woes tendeth to get lost amidst all that
> thunder and fury).

Grumpily yours,
-ed

[1] https://www.lysator.liu.se/c/ten-commandments.html

Re: [PATCH bpf V2 0/2] bpf: adjust uapi for devmap prior to kernel release

2020-06-09 Thread Alexei Starovoitov

On Tue, Jun 09, 2020 at 03:31:41PM +0200, Jesper Dangaard Brouer wrote:
> For special type maps (e.g. devmap and cpumap) the map-value data-layout is
> a configuration interface. This is uapi that can only be tail extended.
> Thus, new members (and thus features) can only be added to the end of this
> structure, and the kernel uses the map->value_size from userspace to
> determine feature set 'version'.
> 
> For this kind of uapi to be extensible and backward compatible, is it common
> that new members/fields (that represent a new feature) in the struct are
> initialized as zero, which indicate that the feature isn't used. This makes
> it possible to write userspace applications that are unaware of new kernel
> features, but just include latest uapi headers, zero-init struct and
> populate features it knows about.
> 
> The recent extension of devmap with a bpf_prog.fd requires end-user to
> supply the file-descriptor value minus-1 to communicate that the features
> isn't used. This isn't compatible with the described kABI extension model.

Applied to bpf tree without this cover letter, because I don't want
folks to read above and start using kabi terminology liks this.
I've never seen a definition of kabi. I've heard redhat has something, but
I don't know what it is and really not interested to find out.
Studying amd64 psABI, sparc psABI, gABI was enough of time sink.
When folks use ABI they really mean binary. 
Old binaries that use devmap_val will work as-is with newer kernel.
There is no binary breakage due to devmap_val.
Whereas what you describe above is what will happen if something gets
recompiled. It's an API quirk. And arguable not an UAPI breakage.
UAPI structs have to initialized.
There is a struct and there is initializer for it.
Like if you did 'spinlock_t lock;' and it got broken with new kernel
it's programmers fault. It's not uapi and certainly not abi issue.
DEFINE_SPINLOCK() should have been used.
Same thing with user space.
'struct bpf_devmap_val' would be ok from uapi pov even with -1.
It's just much more convenient to have zero init. Less error prone, etc.

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Kees Cook

On Tue, Jun 09, 2020 at 11:36:33AM -0700, David Miller wrote:
> From: Stephen Hemminger 
> Date: Tue, 9 Jun 2020 10:19:35 -0700
> 
> > Yes, words do matter and convey a lot of implied connotation and
> > meaning.
> 
> What is your long term plan?  Will you change all of the UAPI for
> bonding for example?
> 
> Or will we have a partial solution to the problem?

As a first step, let's stop _adding_ this language.

Given what I've seen from other communities and what I know of the kernel
community, I don't think we're going to get consensus on some massive
global search/replace any time soon. However, I think we can get started
on making this change with just stopping further introductions. (I view
this like any other treewide change: stop new badness from getting
added, and chip away as old ones as we can until it's all gone.)

-- 
Kees Cook

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Williams, Dan J

On Tue, 2020-06-09 at 11:36 -0700, David Miller wrote:
> From: Stephen Hemminger 
> Date: Tue, 9 Jun 2020 10:19:35 -0700
> 
> > Yes, words do matter and convey a lot of implied connotation and
> > meaning.
> 
> What is your long term plan?  Will you change all of the UAPI for
> bonding for example?

The long term plan in my view includes talking with standards bodies to
move new content to, for example, master/subordinate. In other words,
practical forward steps, not retroactively changing interfaces.

> Or will we have a partial solution to the problem?

Partial steps toward better inclusion are meaningful.

The root problem is of course much larger than a few changes to
technical terminology could ever solve, but solving *that* problem is
not the goal. The goal is to make the kernel community as productive as
possible and if the antropomorphic association of "slave" can be
replaced with a term that maintains technical meaning it makes kernel-
work that much more approachable.

I recall one of Ingo's explanations of how broken whitespace can make
patches that much harder to read and take him out of the "zone" of
reviewing code. "Slave" is already having that speed bump affect on the
community. If we can replace it without losing meaning and improving
the effectiveness of contributors I think the Linux kernel project is
better for it.

1 2 >

1 - 100 of 184 matches

Mail list logo