[PATCH net-next v2 01/10] drop_monitor: Split tracing enable / disable to different functions

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

Subsequent patches will need to enable / disable tracing based on the
configured alerting mode.

Reduce the nesting level and prepare for the introduction of this
functionality by splitting the tracing enable / disable operations into
two different functions.

Signed-off-by: Ido Schimmel 
---
 net/core/drop_monitor.c | 79 ++---
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 4deb86f990f1..8b9b0b899ebc 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -241,11 +241,58 @@ static void trace_napi_poll_hit(void *ignore, struct 
napi_struct *napi,
rcu_read_unlock();
 }
 
+static int net_dm_trace_on_set(struct netlink_ext_ack *extack)
+{
+   int rc;
+
+   if (!try_module_get(THIS_MODULE)) {
+   NL_SET_ERR_MSG_MOD(extack, "Failed to take reference on 
module");
+   return -ENODEV;
+   }
+
+   rc = register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+   if (rc) {
+   NL_SET_ERR_MSG_MOD(extack, "Failed to connect probe to 
kfree_skb() tracepoint");
+   goto err_module_put;
+   }
+
+   rc = register_trace_napi_poll(trace_napi_poll_hit, NULL);
+   if (rc) {
+   NL_SET_ERR_MSG_MOD(extack, "Failed to connect probe to 
napi_poll() tracepoint");
+   goto err_unregister_trace;
+   }
+
+   return 0;
+
+err_unregister_trace:
+   unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+err_module_put:
+   module_put(THIS_MODULE);
+   return rc;
+}
+
+static void net_dm_trace_off_set(void)
+{
+   struct dm_hw_stat_delta *new_stat, *temp;
+
+   unregister_trace_napi_poll(trace_napi_poll_hit, NULL);
+   unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+
+   tracepoint_synchronize_unregister();
+
+   list_for_each_entry_safe(new_stat, temp, &hw_stats_list, list) {
+   if (new_stat->dev == NULL) {
+   list_del_rcu(&new_stat->list);
+   kfree_rcu(new_stat, rcu);
+   }
+   }
+
+   module_put(THIS_MODULE);
+}
+
 static int set_all_monitor_traces(int state, struct netlink_ext_ack *extack)
 {
int rc = 0;
-   struct dm_hw_stat_delta *new_stat = NULL;
-   struct dm_hw_stat_delta *temp;
 
if (state == trace_state) {
NL_SET_ERR_MSG_MOD(extack, "Trace state already set to 
requested state");
@@ -254,34 +301,10 @@ static int set_all_monitor_traces(int state, struct 
netlink_ext_ack *extack)
 
switch (state) {
case TRACE_ON:
-   if (!try_module_get(THIS_MODULE)) {
-   NL_SET_ERR_MSG_MOD(extack, "Failed to take reference on 
module");
-   rc = -ENODEV;
-   break;
-   }
-
-   rc |= register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
-   rc |= register_trace_napi_poll(trace_napi_poll_hit, NULL);
+   rc = net_dm_trace_on_set(extack);
break;
-
case TRACE_OFF:
-   rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
-   rc |= unregister_trace_napi_poll(trace_napi_poll_hit, NULL);
-
-   tracepoint_synchronize_unregister();
-
-   /*
-* Clean the device list
-*/
-   list_for_each_entry_safe(new_stat, temp, &hw_stats_list, list) {
-   if (new_stat->dev == NULL) {
-   list_del_rcu(&new_stat->list);
-   kfree_rcu(new_stat, rcu);
-   }
-   }
-
-   module_put(THIS_MODULE);
-
+   net_dm_trace_off_set();
break;
default:
rc = 1;
-- 
2.21.0



[PATCH net-next v2 00/10] drop_monitor: Capture dropped packets and metadata

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

So far drop monitor supported only one mode of operation in which a
summary of recent packet drops is periodically sent to user space as a
netlink event. The event only includes the drop location (program
counter) and number of drops in the last interval.

While this mode of operation allows one to understand if the system is
dropping packets, it is not sufficient if a more detailed analysis is
required. Both the packet itself and related metadata are missing.

This patchset extends drop monitor with another mode of operation where
the packet - potentially truncated - and metadata (e.g., drop location,
timestamp, netdev) are sent to user space as a netlink event. Thanks to
the extensible nature of netlink, more metadata can be added in the
future.

To avoid performing expensive operations in the context in which
kfree_skb() is called, the dropped skbs are cloned and queued on per-CPU
skb drop list. The list is then processed in process context (using a
workqueue), where the netlink messages are allocated, prepared and
finally sent to user space.

A follow-up patchset will integrate drop monitor with devlink and allow
the latter to call into drop monitor to report hardware drops. In the
future, XDP drops can be added as well, thereby making drop monitor the
go-to netlink channel for diagnosing all packet drops.

Example usage with patched dropwatch [1] can be found here [2]. Example
dissection of drop monitor netlink events with patched wireshark [3] can
be found here [4]. I will submit both changes upstream after the kernel
changes are accepted. Another change worth making is adding a dropmon
pseudo interface to libpcap, similar to the nflog interface [5]. This
will allow users to specifically listen on dropmon traffic instead of
capturing all netlink packets via the nlmon netdev.

Patches #1-#5 prepare the code towards the actual changes in later
patches.

Patch #6 adds another mode of operation to drop monitor in which the
dropped packet itself is notified to user space along with metadata.

Patch #7 allows users to truncate reported packets to a specific length,
in case only the headers are of interest. The original length of the
packet is added as metadata to the netlink notification.

Patch #8 allows user to query the current configuration of drop monitor
(e.g., alert mode, truncation length).

Patches #9-#10 allow users to tune the length of the per-CPU skb drop
list according to their needs.

Changes since v1 [6]:
* Add skb protocol as metadata. This allows user space to correctly
  dissect the packet instead of blindly assuming it is an Ethernet
  packet

Changes since RFC [7]:
* Limit the length of the per-CPU skb drop list and make it configurable
* Do not use the hysteresis timer in packet alert mode
* Introduce alert mode operations in a separate patch and only then
  introduce the new alert mode
* Use 'skb->skb_iif' instead of 'skb->dev' because the latter is inside
  a union with 'dev_scratch' and therefore not guaranteed to point to a
  valid netdev
* Return '-EBUSY' instead of '-EOPNOTSUPP' when trying to configure drop
  monitor while it is monitoring
* Did not change schedule_work() in favor of schedule_work_on() as I did
  not observe a change in number of tail drops

[1] https://github.com/idosch/dropwatch/tree/packet-mode
[2] 
https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile1-txt
[3] https://github.com/idosch/wireshark/tree/drop-monitor-v2
[4] 
https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile2-txt
[5] 
https://github.com/the-tcpdump-group/libpcap/blob/master/pcap-netfilter-linux.c
[6] https://patchwork.ozlabs.org/cover/1143443/
[7] https://patchwork.ozlabs.org/cover/1135226/

Ido Schimmel (10):
  drop_monitor: Split tracing enable / disable to different functions
  drop_monitor: Initialize timer and work item upon tracing enable
  drop_monitor: Reset per-CPU data before starting to trace
  drop_monitor: Require CAP_NET_ADMIN for drop monitor configuration
  drop_monitor: Add alert mode operations
  drop_monitor: Add packet alert mode
  drop_monitor: Allow truncation of dropped packets
  drop_monitor: Add a command to query current configuration
  drop_monitor: Make drop queue length configurable
  drop_monitor: Expose tail drop counter

 include/uapi/linux/net_dropmon.h |  51 +++
 net/core/drop_monitor.c  | 599 +--
 2 files changed, 613 insertions(+), 37 deletions(-)

-- 
2.21.0



[PATCH net-next v2 10/10] drop_monitor: Expose tail drop counter

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

Previous patch made the length of the per-CPU skb drop list
configurable. Expose a counter that shows how many packets could not be
enqueued to this list.

This allows users determine the desired queue length.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  10 +++
 net/core/drop_monitor.c  | 101 +++
 2 files changed, 111 insertions(+)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 1d0bdb1ba954..405b31cbf723 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -56,6 +56,8 @@ enum {
NET_DM_CMD_PACKET_ALERT,
NET_DM_CMD_CONFIG_GET,
NET_DM_CMD_CONFIG_NEW,
+   NET_DM_CMD_STATS_GET,
+   NET_DM_CMD_STATS_NEW,
_NET_DM_CMD_MAX,
 };
 
@@ -80,6 +82,7 @@ enum net_dm_attr {
NET_DM_ATTR_TRUNC_LEN,  /* u32 */
NET_DM_ATTR_ORIG_LEN,   /* u32 */
NET_DM_ATTR_QUEUE_LEN,  /* u32 */
+   NET_DM_ATTR_STATS,  /* nested */
 
__NET_DM_ATTR_MAX,
NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1
@@ -103,4 +106,11 @@ enum {
NET_DM_ATTR_PORT_MAX = __NET_DM_ATTR_PORT_MAX - 1
 };
 
+enum {
+   NET_DM_ATTR_STATS_DROPPED,  /* u64 */
+
+   __NET_DM_ATTR_STATS_MAX,
+   NET_DM_ATTR_STATS_MAX = __NET_DM_ATTR_STATS_MAX - 1
+};
+
 #endif
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index eb3c34d69ea9..39e094907391 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -51,12 +51,18 @@ static int trace_state = TRACE_OFF;
  */
 static DEFINE_MUTEX(net_dm_mutex);
 
+struct net_dm_stats {
+   u64 dropped;
+   struct u64_stats_sync syncp;
+};
+
 struct per_cpu_dm_data {
spinlock_t  lock;   /* Protects 'skb' and 'send_timer' */
struct sk_buff  *skb;
struct sk_buff_head drop_queue;
struct work_struct  dm_alert_work;
struct timer_list   send_timer;
+   struct net_dm_stats stats;
 };
 
 struct dm_hw_stat_delta {
@@ -300,6 +306,9 @@ static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
 
 unlock_free:
spin_unlock_irqrestore(&data->drop_queue.lock, flags);
+   u64_stats_update_begin(&data->stats.syncp);
+   data->stats.dropped++;
+   u64_stats_update_end(&data->stats.syncp);
consume_skb(nskb);
 }
 
@@ -732,6 +741,93 @@ static int net_dm_cmd_config_get(struct sk_buff *skb, 
struct genl_info *info)
return rc;
 }
 
+static void net_dm_stats_read(struct net_dm_stats *stats)
+{
+   int cpu;
+
+   memset(stats, 0, sizeof(*stats));
+   for_each_possible_cpu(cpu) {
+   struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
+   struct net_dm_stats *cpu_stats = &data->stats;
+   unsigned int start;
+   u64 dropped;
+
+   do {
+   start = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
+   dropped = cpu_stats->dropped;
+   } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, start));
+
+   stats->dropped += dropped;
+   }
+}
+
+static int net_dm_stats_put(struct sk_buff *msg)
+{
+   struct net_dm_stats stats;
+   struct nlattr *attr;
+
+   net_dm_stats_read(&stats);
+
+   attr = nla_nest_start(msg, NET_DM_ATTR_STATS);
+   if (!attr)
+   return -EMSGSIZE;
+
+   if (nla_put_u64_64bit(msg, NET_DM_ATTR_STATS_DROPPED,
+ stats.dropped, NET_DM_ATTR_PAD))
+   goto nla_put_failure;
+
+   nla_nest_end(msg, attr);
+
+   return 0;
+
+nla_put_failure:
+   nla_nest_cancel(msg, attr);
+   return -EMSGSIZE;
+}
+
+static int net_dm_stats_fill(struct sk_buff *msg, struct genl_info *info)
+{
+   void *hdr;
+   int rc;
+
+   hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
+ &net_drop_monitor_family, 0, NET_DM_CMD_STATS_NEW);
+   if (!hdr)
+   return -EMSGSIZE;
+
+   rc = net_dm_stats_put(msg);
+   if (rc)
+   goto nla_put_failure;
+
+   genlmsg_end(msg, hdr);
+
+   return 0;
+
+nla_put_failure:
+   genlmsg_cancel(msg, hdr);
+   return -EMSGSIZE;
+}
+
+static int net_dm_cmd_stats_get(struct sk_buff *skb, struct genl_info *info)
+{
+   struct sk_buff *msg;
+   int rc;
+
+   msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+   if (!msg)
+   return -ENOMEM;
+
+   rc = net_dm_stats_fill(msg, info);
+   if (rc)
+   goto free_msg;
+
+   return genlmsg_reply(msg, info);
+
+free_msg:
+   nlmsg_free(msg);
+   return rc;
+}
+
 static int dropmon_net_event(struct notifier_block *ev_block,
 unsigned long event, void *ptr)
 {
@@ -799,6 +895,10 @@ static const struct genl_ops dropmon_ops[] = {
   

[PATCH net-next v2 03/10] drop_monitor: Reset per-CPU data before starting to trace

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

The function reset_per_cpu_data() allocates and prepares a new skb for
the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is
stored in the per-CPU 'data' variable and the old is returned.

The function is invoked during module initialization and from the
workqueue, before an alert is sent. This means that it is possible to
receive an alert with stale data, if we stopped tracing when the
hysteresis timer ('data->send_timer') was pending.

Instead of invoking the function during module initialization, invoke it
just before we start tracing and ensure we get a fresh skb.

This also allows us to remove the calls to initialize the timer and the
work item from the module initialization path, since both could have
been triggered by the error paths of reset_per_cpu_data().

Signed-off-by: Ido Schimmel 
---
 net/core/drop_monitor.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index b266dc1660ed..1cf4988de591 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -252,9 +252,16 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
 
for_each_possible_cpu(cpu) {
struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
+   struct sk_buff *skb;
 
INIT_WORK(&data->dm_alert_work, send_dm_alert);
timer_setup(&data->send_timer, sched_send_work, 0);
+   /* Allocate a new per-CPU skb for the summary alert message and
+* free the old one which might contain stale data from
+* previous tracing.
+*/
+   skb = reset_per_cpu_data(data);
+   consume_skb(skb);
}
 
rc = register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
@@ -475,10 +482,7 @@ static int __init init_net_drop_monitor(void)
 
for_each_possible_cpu(cpu) {
data = &per_cpu(dm_cpu_data, cpu);
-   INIT_WORK(&data->dm_alert_work, send_dm_alert);
-   timer_setup(&data->send_timer, sched_send_work, 0);
spin_lock_init(&data->lock);
-   reset_per_cpu_data(data);
}
 
goto out;
-- 
2.21.0



[PATCH net-next v2 04/10] drop_monitor: Require CAP_NET_ADMIN for drop monitor configuration

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

Currently, the configure command does not do anything but return an
error. Subsequent patches will enable the command to change various
configuration options such as alert mode and packet truncation.

Similar to other netlink-based configuration channels, make sure only
users with the CAP_NET_ADMIN capability set can execute this command.

Signed-off-by: Ido Schimmel 
---
 net/core/drop_monitor.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 1cf4988de591..cd2f3069f34e 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -409,6 +409,7 @@ static const struct genl_ops dropmon_ops[] = {
.cmd = NET_DM_CMD_CONFIG,
.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = net_dm_cmd_config,
+   .flags = GENL_ADMIN_PERM,
},
{
.cmd = NET_DM_CMD_START,
-- 
2.21.0



[PATCH net-next v2 08/10] drop_monitor: Add a command to query current configuration

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

Users should be able to query the current configuration of drop monitor
before they start using it. Add a command to query the existing
configuration which currently consists of alert mode and packet
truncation length.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  2 ++
 net/core/drop_monitor.c  | 48 
 2 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 5cd7eb1f66ba..3b765a8428b5 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -54,6 +54,8 @@ enum {
NET_DM_CMD_START,
NET_DM_CMD_STOP,
NET_DM_CMD_PACKET_ALERT,
+   NET_DM_CMD_CONFIG_GET,
+   NET_DM_CMD_CONFIG_NEW,
_NET_DM_CMD_MAX,
 };
 
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 9f884adaa85f..135638474ab8 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -676,6 +676,50 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
return -EOPNOTSUPP;
 }
 
+static int net_dm_config_fill(struct sk_buff *msg, struct genl_info *info)
+{
+   void *hdr;
+
+   hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
+ &net_drop_monitor_family, 0, NET_DM_CMD_CONFIG_NEW);
+   if (!hdr)
+   return -EMSGSIZE;
+
+   if (nla_put_u8(msg, NET_DM_ATTR_ALERT_MODE, net_dm_alert_mode))
+   goto nla_put_failure;
+
+   if (nla_put_u32(msg, NET_DM_ATTR_TRUNC_LEN, net_dm_trunc_len))
+   goto nla_put_failure;
+
+   genlmsg_end(msg, hdr);
+
+   return 0;
+
+nla_put_failure:
+   genlmsg_cancel(msg, hdr);
+   return -EMSGSIZE;
+}
+
+static int net_dm_cmd_config_get(struct sk_buff *skb, struct genl_info *info)
+{
+   struct sk_buff *msg;
+   int rc;
+
+   msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+   if (!msg)
+   return -ENOMEM;
+
+   rc = net_dm_config_fill(msg, info);
+   if (rc)
+   goto free_msg;
+
+   return genlmsg_reply(msg, info);
+
+free_msg:
+   nlmsg_free(msg);
+   return rc;
+}
+
 static int dropmon_net_event(struct notifier_block *ev_block,
 unsigned long event, void *ptr)
 {
@@ -738,6 +782,10 @@ static const struct genl_ops dropmon_ops[] = {
.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = net_dm_cmd_trace,
},
+   {
+   .cmd = NET_DM_CMD_CONFIG_GET,
+   .doit = net_dm_cmd_config_get,
+   },
 };
 
 static int net_dm_nl_pre_doit(const struct genl_ops *ops,
-- 
2.21.0



[PATCH net-next v2 07/10] drop_monitor: Allow truncation of dropped packets

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

When sending dropped packets to user space it is not always necessary to
copy the entire packet as usually only the headers are of interest.

Allow user to specify the truncation length and add the original length
of the packet as additional metadata to the netlink message.

By default no truncation is performed.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  2 ++
 net/core/drop_monitor.c  | 19 +++
 2 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index cfaaf75371b8..5cd7eb1f66ba 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -75,6 +75,8 @@ enum net_dm_attr {
NET_DM_ATTR_PROTO,  /* u16 */
NET_DM_ATTR_PAYLOAD,/* binary */
NET_DM_ATTR_PAD,
+   NET_DM_ATTR_TRUNC_LEN,  /* u32 */
+   NET_DM_ATTR_ORIG_LEN,   /* u32 */
 
__NET_DM_ATTR_MAX,
NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index ba765832413b..9f884adaa85f 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -77,6 +77,7 @@ static unsigned long dm_hw_check_delta = 2*HZ;
 static LIST_HEAD(hw_stats_list);
 
 static enum net_dm_alert_mode net_dm_alert_mode = NET_DM_ALERT_MODE_SUMMARY;
+static u32 net_dm_trunc_len;
 
 struct net_dm_alert_ops {
void (*kfree_skb_probe)(void *ignore, struct sk_buff *skb,
@@ -334,6 +335,8 @@ static size_t net_dm_packet_report_size(size_t payload_len)
   net_dm_in_port_size() +
   /* NET_DM_ATTR_TIMESTAMP */
   nla_total_size(sizeof(struct timespec)) +
+  /* NET_DM_ATTR_ORIG_LEN */
+  nla_total_size(sizeof(u32)) +
   /* NET_DM_ATTR_PROTO */
   nla_total_size(sizeof(u16)) +
   /* NET_DM_ATTR_PAYLOAD */
@@ -391,6 +394,9 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, 
struct sk_buff *skb,
nla_put(msg, NET_DM_ATTR_TIMESTAMP, sizeof(ts), &ts))
goto nla_put_failure;
 
+   if (nla_put_u32(msg, NET_DM_ATTR_ORIG_LEN, skb->len))
+   goto nla_put_failure;
+
if (!payload_len)
goto out;
 
@@ -429,6 +435,8 @@ static void net_dm_packet_report(struct sk_buff *skb)
 
/* Ensure packet fits inside a single netlink attribute */
payload_len = min_t(size_t, skb->len, NET_DM_MAX_PACKET_SIZE);
+   if (net_dm_trunc_len)
+   payload_len = min_t(size_t, net_dm_trunc_len, payload_len);
 
msg = nlmsg_new(net_dm_packet_report_size(payload_len), GFP_KERNEL);
if (!msg)
@@ -627,6 +635,14 @@ static int net_dm_alert_mode_set(struct genl_info *info)
return 0;
 }
 
+static void net_dm_trunc_len_set(struct genl_info *info)
+{
+   if (!info->attrs[NET_DM_ATTR_TRUNC_LEN])
+   return;
+
+   net_dm_trunc_len = nla_get_u32(info->attrs[NET_DM_ATTR_TRUNC_LEN]);
+}
+
 static int net_dm_cmd_config(struct sk_buff *skb,
struct genl_info *info)
 {
@@ -642,6 +658,8 @@ static int net_dm_cmd_config(struct sk_buff *skb,
if (rc)
return rc;
 
+   net_dm_trunc_len_set(info);
+
return 0;
 }
 
@@ -700,6 +718,7 @@ static int dropmon_net_event(struct notifier_block 
*ev_block,
 static const struct nla_policy net_dm_nl_policy[NET_DM_ATTR_MAX + 1] = {
[NET_DM_ATTR_UNSPEC] = { .strict_start_type = NET_DM_ATTR_UNSPEC + 1 },
[NET_DM_ATTR_ALERT_MODE] = { .type = NLA_U8 },
+   [NET_DM_ATTR_TRUNC_LEN] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops dropmon_ops[] = {
-- 
2.21.0



[PATCH net-next v2 09/10] drop_monitor: Make drop queue length configurable

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

In packet alert mode, each CPU holds a list of dropped skbs that need to
be processed in process context and sent to user space. To avoid
exhausting the system's memory the maximum length of this queue is
currently set to 1000.

Allow users to tune the length of this queue according to their needs.
The configured length is reported to user space when drop monitor
configuration is queried.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  1 +
 net/core/drop_monitor.c  | 19 ---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 3b765a8428b5..1d0bdb1ba954 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -79,6 +79,7 @@ enum net_dm_attr {
NET_DM_ATTR_PAD,
NET_DM_ATTR_TRUNC_LEN,  /* u32 */
NET_DM_ATTR_ORIG_LEN,   /* u32 */
+   NET_DM_ATTR_QUEUE_LEN,  /* u32 */
 
__NET_DM_ATTR_MAX,
NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 135638474ab8..eb3c34d69ea9 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -78,6 +78,7 @@ static LIST_HEAD(hw_stats_list);
 
 static enum net_dm_alert_mode net_dm_alert_mode = NET_DM_ALERT_MODE_SUMMARY;
 static u32 net_dm_trunc_len;
+static u32 net_dm_queue_len = 1000;
 
 struct net_dm_alert_ops {
void (*kfree_skb_probe)(void *ignore, struct sk_buff *skb,
@@ -93,8 +94,6 @@ struct net_dm_skb_cb {
 
 #define NET_DM_SKB_CB(__skb) ((struct net_dm_skb_cb *)&((__skb)->cb[0]))
 
-#define NET_DM_QUEUE_LEN 1000
-
 static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
 {
size_t al;
@@ -289,7 +288,7 @@ static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
data = this_cpu_ptr(&dm_cpu_data);
 
spin_lock_irqsave(&data->drop_queue.lock, flags);
-   if (skb_queue_len(&data->drop_queue) < NET_DM_QUEUE_LEN)
+   if (skb_queue_len(&data->drop_queue) < net_dm_queue_len)
__skb_queue_tail(&data->drop_queue, nskb);
else
goto unlock_free;
@@ -643,6 +642,14 @@ static void net_dm_trunc_len_set(struct genl_info *info)
net_dm_trunc_len = nla_get_u32(info->attrs[NET_DM_ATTR_TRUNC_LEN]);
 }
 
+static void net_dm_queue_len_set(struct genl_info *info)
+{
+   if (!info->attrs[NET_DM_ATTR_QUEUE_LEN])
+   return;
+
+   net_dm_queue_len = nla_get_u32(info->attrs[NET_DM_ATTR_QUEUE_LEN]);
+}
+
 static int net_dm_cmd_config(struct sk_buff *skb,
struct genl_info *info)
 {
@@ -660,6 +667,8 @@ static int net_dm_cmd_config(struct sk_buff *skb,
 
net_dm_trunc_len_set(info);
 
+   net_dm_queue_len_set(info);
+
return 0;
 }
 
@@ -691,6 +700,9 @@ static int net_dm_config_fill(struct sk_buff *msg, struct 
genl_info *info)
if (nla_put_u32(msg, NET_DM_ATTR_TRUNC_LEN, net_dm_trunc_len))
goto nla_put_failure;
 
+   if (nla_put_u32(msg, NET_DM_ATTR_QUEUE_LEN, net_dm_queue_len))
+   goto nla_put_failure;
+
genlmsg_end(msg, hdr);
 
return 0;
@@ -763,6 +775,7 @@ static const struct nla_policy 
net_dm_nl_policy[NET_DM_ATTR_MAX + 1] = {
[NET_DM_ATTR_UNSPEC] = { .strict_start_type = NET_DM_ATTR_UNSPEC + 1 },
[NET_DM_ATTR_ALERT_MODE] = { .type = NLA_U8 },
[NET_DM_ATTR_TRUNC_LEN] = { .type = NLA_U32 },
+   [NET_DM_ATTR_QUEUE_LEN] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops dropmon_ops[] = {
-- 
2.21.0



[PATCH net-next v2 02/10] drop_monitor: Initialize timer and work item upon tracing enable

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

The timer and work item are currently initialized once during module
init, but subsequent patches will need to associate different functions
with the work item, based on the configured alert mode.

Allow subsequent patches to make that change by initializing and
de-initializing these objects during tracing enable and disable.

This also guarantees that once the request to disable tracing returns,
no more netlink notifications will be generated.

Signed-off-by: Ido Schimmel 
---
 net/core/drop_monitor.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 8b9b0b899ebc..b266dc1660ed 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -243,13 +243,20 @@ static void trace_napi_poll_hit(void *ignore, struct 
napi_struct *napi,
 
 static int net_dm_trace_on_set(struct netlink_ext_ack *extack)
 {
-   int rc;
+   int cpu, rc;
 
if (!try_module_get(THIS_MODULE)) {
NL_SET_ERR_MSG_MOD(extack, "Failed to take reference on 
module");
return -ENODEV;
}
 
+   for_each_possible_cpu(cpu) {
+   struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
+
+   INIT_WORK(&data->dm_alert_work, send_dm_alert);
+   timer_setup(&data->send_timer, sched_send_work, 0);
+   }
+
rc = register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
if (rc) {
NL_SET_ERR_MSG_MOD(extack, "Failed to connect probe to 
kfree_skb() tracepoint");
@@ -274,12 +281,23 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
 static void net_dm_trace_off_set(void)
 {
struct dm_hw_stat_delta *new_stat, *temp;
+   int cpu;
 
unregister_trace_napi_poll(trace_napi_poll_hit, NULL);
unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
 
tracepoint_synchronize_unregister();
 
+   /* Make sure we do not send notifications to user space after request
+* to stop tracing returns.
+*/
+   for_each_possible_cpu(cpu) {
+   struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
+
+   del_timer_sync(&data->send_timer);
+   cancel_work_sync(&data->dm_alert_work);
+   }
+
list_for_each_entry_safe(new_stat, temp, &hw_stats_list, list) {
if (new_stat->dev == NULL) {
list_del_rcu(&new_stat->list);
@@ -481,14 +499,10 @@ static void exit_net_drop_monitor(void)
/*
 * Because of the module_get/put we do in the trace state change path
 * we are guarnateed not to have any current users when we get here
-* all we need to do is make sure that we don't have any running timers
-* or pending schedule calls
 */
 
for_each_possible_cpu(cpu) {
data = &per_cpu(dm_cpu_data, cpu);
-   del_timer_sync(&data->send_timer);
-   cancel_work_sync(&data->dm_alert_work);
/*
 * At this point, we should have exclusive access
 * to this struct and can free the skb inside it
-- 
2.21.0



[PATCH net-next v2 06/10] drop_monitor: Add packet alert mode

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

So far drop monitor supported only one alert mode in which a summary of
locations in which packets were recently dropped was sent to user space.

This alert mode is sufficient in order to understand that packets were
dropped, but lacks information to perform a more detailed analysis.

Add a new alert mode in which the dropped packet itself is passed to
user space along with metadata: The drop location (as program counter
and resolved symbol), ingress netdevice and drop timestamp. More
metadata can be added in the future.

To avoid performing expensive operations in the context in which
kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and
queued on per-CPU skb drop list. Then, in process context the netlink
message is allocated, prepared and finally sent to user space.

The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting
the system's memory. Subsequent patches will make this limit
configurable and also add a counter that indicates how many skbs were
tail dropped.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  27 +++
 net/core/drop_monitor.c  | 280 ++-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 0fecdedeb6ca..cfaaf75371b8 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -53,6 +53,7 @@ enum {
NET_DM_CMD_CONFIG,
NET_DM_CMD_START,
NET_DM_CMD_STOP,
+   NET_DM_CMD_PACKET_ALERT,
_NET_DM_CMD_MAX,
 };
 
@@ -63,12 +64,38 @@ enum {
  */
 #define NET_DM_GRP_ALERT 1
 
+enum net_dm_attr {
+   NET_DM_ATTR_UNSPEC,
+
+   NET_DM_ATTR_ALERT_MODE, /* u8 */
+   NET_DM_ATTR_PC, /* u64 */
+   NET_DM_ATTR_SYMBOL, /* string */
+   NET_DM_ATTR_IN_PORT,/* nested */
+   NET_DM_ATTR_TIMESTAMP,  /* struct timespec */
+   NET_DM_ATTR_PROTO,  /* u16 */
+   NET_DM_ATTR_PAYLOAD,/* binary */
+   NET_DM_ATTR_PAD,
+
+   __NET_DM_ATTR_MAX,
+   NET_DM_ATTR_MAX = __NET_DM_ATTR_MAX - 1
+};
+
 /**
  * enum net_dm_alert_mode - Alert mode.
  * @NET_DM_ALERT_MODE_SUMMARY: A summary of recent drops is sent to user space.
+ * @NET_DM_ALERT_MODE_PACKET: Each dropped packet is sent to user space along
+ *with metadata.
  */
 enum net_dm_alert_mode {
NET_DM_ALERT_MODE_SUMMARY,
+   NET_DM_ALERT_MODE_PACKET,
+};
+
+enum {
+   NET_DM_ATTR_PORT_NETDEV_IFINDEX,/* u32 */
+
+   __NET_DM_ATTR_PORT_MAX,
+   NET_DM_ATTR_PORT_MAX = __NET_DM_ATTR_PORT_MAX - 1
 };
 
 #endif
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 9cd2f662cb9e..ba765832413b 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -54,6 +54,7 @@ static DEFINE_MUTEX(net_dm_mutex);
 struct per_cpu_dm_data {
spinlock_t  lock;   /* Protects 'skb' and 'send_timer' */
struct sk_buff  *skb;
+   struct sk_buff_head drop_queue;
struct work_struct  dm_alert_work;
struct timer_list   send_timer;
 };
@@ -85,6 +86,14 @@ struct net_dm_alert_ops {
void (*work_item_func)(struct work_struct *work);
 };
 
+struct net_dm_skb_cb {
+   void *pc;
+};
+
+#define NET_DM_SKB_CB(__skb) ((struct net_dm_skb_cb *)&((__skb)->cb[0]))
+
+#define NET_DM_QUEUE_LEN 1000
+
 static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
 {
size_t al;
@@ -257,8 +266,214 @@ static const struct net_dm_alert_ops 
net_dm_alert_summary_ops = {
.work_item_func = send_dm_alert,
 };
 
+static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
+ struct sk_buff *skb,
+ void *location)
+{
+   ktime_t tstamp = ktime_get_real();
+   struct per_cpu_dm_data *data;
+   struct sk_buff *nskb;
+   unsigned long flags;
+
+   nskb = skb_clone(skb, GFP_ATOMIC);
+   if (!nskb)
+   return;
+
+   NET_DM_SKB_CB(nskb)->pc = location;
+   /* Override the timestamp because we care about the time when the
+* packet was dropped.
+*/
+   nskb->tstamp = tstamp;
+
+   data = this_cpu_ptr(&dm_cpu_data);
+
+   spin_lock_irqsave(&data->drop_queue.lock, flags);
+   if (skb_queue_len(&data->drop_queue) < NET_DM_QUEUE_LEN)
+   __skb_queue_tail(&data->drop_queue, nskb);
+   else
+   goto unlock_free;
+   spin_unlock_irqrestore(&data->drop_queue.lock, flags);
+
+   schedule_work(&data->dm_alert_work);
+
+   return;
+
+unlock_free:
+   spin_unlock_irqrestore(&data->drop_queue.lock, flags);
+   consume_skb(nskb);
+}
+
+static void net_dm_packet_trace_napi_poll_hit(void *ignore,
+   

[PATCH net-next v2 05/10] drop_monitor: Add alert mode operations

2019-08-11 Thread Ido Schimmel
From: Ido Schimmel 

The next patch is going to add another alert mode in which the dropped
packet is notified to user space, instead of only a summary of recent
drops.

Abstract the differences between the modes by adding alert mode
operations. The operations are selected based on the currently
configured mode and associated with the probes and the work item just
before tracing starts.

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/net_dropmon.h |  9 
 net/core/drop_monitor.c  | 38 +++-
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 5edbd0a675fd..0fecdedeb6ca 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -62,4 +62,13 @@ enum {
  * Our group identifiers
  */
 #define NET_DM_GRP_ALERT 1
+
+/**
+ * enum net_dm_alert_mode - Alert mode.
+ * @NET_DM_ALERT_MODE_SUMMARY: A summary of recent drops is sent to user space.
+ */
+enum net_dm_alert_mode {
+   NET_DM_ALERT_MODE_SUMMARY,
+};
+
 #endif
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index cd2f3069f34e..9cd2f662cb9e 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -75,6 +75,16 @@ static int dm_delay = 1;
 static unsigned long dm_hw_check_delta = 2*HZ;
 static LIST_HEAD(hw_stats_list);
 
+static enum net_dm_alert_mode net_dm_alert_mode = NET_DM_ALERT_MODE_SUMMARY;
+
+struct net_dm_alert_ops {
+   void (*kfree_skb_probe)(void *ignore, struct sk_buff *skb,
+   void *location);
+   void (*napi_poll_probe)(void *ignore, struct napi_struct *napi,
+   int work, int budget);
+   void (*work_item_func)(struct work_struct *work);
+};
+
 static struct sk_buff *reset_per_cpu_data(struct per_cpu_dm_data *data)
 {
size_t al;
@@ -241,10 +251,23 @@ static void trace_napi_poll_hit(void *ignore, struct 
napi_struct *napi,
rcu_read_unlock();
 }
 
+static const struct net_dm_alert_ops net_dm_alert_summary_ops = {
+   .kfree_skb_probe= trace_kfree_skb_hit,
+   .napi_poll_probe= trace_napi_poll_hit,
+   .work_item_func = send_dm_alert,
+};
+
+static const struct net_dm_alert_ops *net_dm_alert_ops_arr[] = {
+   [NET_DM_ALERT_MODE_SUMMARY] = &net_dm_alert_summary_ops,
+};
+
 static int net_dm_trace_on_set(struct netlink_ext_ack *extack)
 {
+   const struct net_dm_alert_ops *ops;
int cpu, rc;
 
+   ops = net_dm_alert_ops_arr[net_dm_alert_mode];
+
if (!try_module_get(THIS_MODULE)) {
NL_SET_ERR_MSG_MOD(extack, "Failed to take reference on 
module");
return -ENODEV;
@@ -254,7 +277,7 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
struct sk_buff *skb;
 
-   INIT_WORK(&data->dm_alert_work, send_dm_alert);
+   INIT_WORK(&data->dm_alert_work, ops->work_item_func);
timer_setup(&data->send_timer, sched_send_work, 0);
/* Allocate a new per-CPU skb for the summary alert message and
 * free the old one which might contain stale data from
@@ -264,13 +287,13 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
consume_skb(skb);
}
 
-   rc = register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+   rc = register_trace_kfree_skb(ops->kfree_skb_probe, NULL);
if (rc) {
NL_SET_ERR_MSG_MOD(extack, "Failed to connect probe to 
kfree_skb() tracepoint");
goto err_module_put;
}
 
-   rc = register_trace_napi_poll(trace_napi_poll_hit, NULL);
+   rc = register_trace_napi_poll(ops->napi_poll_probe, NULL);
if (rc) {
NL_SET_ERR_MSG_MOD(extack, "Failed to connect probe to 
napi_poll() tracepoint");
goto err_unregister_trace;
@@ -279,7 +302,7 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
return 0;
 
 err_unregister_trace:
-   unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+   unregister_trace_kfree_skb(ops->kfree_skb_probe, NULL);
 err_module_put:
module_put(THIS_MODULE);
return rc;
@@ -288,10 +311,13 @@ static int net_dm_trace_on_set(struct netlink_ext_ack 
*extack)
 static void net_dm_trace_off_set(void)
 {
struct dm_hw_stat_delta *new_stat, *temp;
+   const struct net_dm_alert_ops *ops;
int cpu;
 
-   unregister_trace_napi_poll(trace_napi_poll_hit, NULL);
-   unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
+   ops = net_dm_alert_ops_arr[net_dm_alert_mode];
+
+   unregister_trace_napi_poll(ops->napi_poll_probe, NULL);
+   unregister_trace_kfree_skb(ops->kfree_skb_probe, NULL);
 
tracepoint_synchronize_unregister();
 
-- 
2.21.0



[PATCH net] mlxsw: spectrum_ptp: Keep unmatched entries in a linked list

2019-08-11 Thread Ido Schimmel
From: Petr Machata 

To identify timestamps for matching with their packets, Spectrum-1 uses a
five-tuple of (port, direction, domain number, message type, sequence ID).
If there are several clients from the same domain behind a single port
sending Delay_Req's, the only thing differentiating these packets, as far
as Spectrum-1 is concerned, is the sequence ID. Should sequence IDs between
individual clients be similar, conflicts may arise. That is not a problem
to hardware, which will simply deliver timestamps on a first comes, first
served basis.

However the driver uses a simple hash table to store the unmatched pieces.
When a new conflicting piece arrives, it pushes out the previously stored
one, which if it is a packet, is delivered without timestamp. Later on as
the corresponding timestamps arrive, the first one is mismatched to the
second packet, and the second one is never matched and eventually is GCd.

To correct this issue, instead of using a simple rhashtable, use rhltable
to keep the unmatched entries.

Previously, a found unmatched entry would always be removed from the hash
table. That is not the case anymore--an incompatible entry is left in the
hash table. Therefore removal from the hash table cannot be used to confirm
the validity of the looked-up pointer, instead the lookup would simply need
to be redone. Therefore move it inside the critical section. This
simplifies a lot of the code.

Fixes: 8748642751ed ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, 
SIOCSHWTSTAMP ioctls")
Reported-by: Alex Veber 
Signed-off-by: Petr Machata 
Signed-off-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_ptp.c| 138 +++---
 1 file changed, 55 insertions(+), 83 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
index 63b07edd9d81..38bb1cfe4e8c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
@@ -29,7 +29,7 @@
 
 struct mlxsw_sp_ptp_state {
struct mlxsw_sp *mlxsw_sp;
-   struct rhashtable unmatched_ht;
+   struct rhltable unmatched_ht;
spinlock_t unmatched_lock; /* protects the HT */
struct delayed_work ht_gc_dw;
u32 gc_cycle;
@@ -45,7 +45,7 @@ struct mlxsw_sp1_ptp_key {
 
 struct mlxsw_sp1_ptp_unmatched {
struct mlxsw_sp1_ptp_key key;
-   struct rhash_head ht_node;
+   struct rhlist_head ht_node;
struct rcu_head rcu;
struct sk_buff *skb;
u64 timestamp;
@@ -359,7 +359,7 @@ static int mlxsw_sp_ptp_parse(struct sk_buff *skb,
 /* Returns NULL on successful insertion, a pointer on conflict, or an ERR_PTR 
on
  * error.
  */
-static struct mlxsw_sp1_ptp_unmatched *
+static int
 mlxsw_sp1_ptp_unmatched_save(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp1_ptp_key key,
 struct sk_buff *skb,
@@ -368,41 +368,51 @@ mlxsw_sp1_ptp_unmatched_save(struct mlxsw_sp *mlxsw_sp,
int cycles = MLXSW_SP1_PTP_HT_GC_TIMEOUT / MLXSW_SP1_PTP_HT_GC_INTERVAL;
struct mlxsw_sp_ptp_state *ptp_state = mlxsw_sp->ptp_state;
struct mlxsw_sp1_ptp_unmatched *unmatched;
-   struct mlxsw_sp1_ptp_unmatched *conflict;
+   int err;
 
unmatched = kzalloc(sizeof(*unmatched), GFP_ATOMIC);
if (!unmatched)
-   return ERR_PTR(-ENOMEM);
+   return -ENOMEM;
 
unmatched->key = key;
unmatched->skb = skb;
unmatched->timestamp = timestamp;
unmatched->gc_cycle = mlxsw_sp->ptp_state->gc_cycle + cycles;
 
-   conflict = rhashtable_lookup_get_insert_fast(&ptp_state->unmatched_ht,
-   &unmatched->ht_node,
-   mlxsw_sp1_ptp_unmatched_ht_params);
-   if (conflict)
+   err = rhltable_insert(&ptp_state->unmatched_ht, &unmatched->ht_node,
+ mlxsw_sp1_ptp_unmatched_ht_params);
+   if (err)
kfree(unmatched);
 
-   return conflict;
+   return err;
 }
 
 static struct mlxsw_sp1_ptp_unmatched *
 mlxsw_sp1_ptp_unmatched_lookup(struct mlxsw_sp *mlxsw_sp,
-  struct mlxsw_sp1_ptp_key key)
+  struct mlxsw_sp1_ptp_key key, int *p_length)
 {
-   return rhashtable_lookup(&mlxsw_sp->ptp_state->unmatched_ht, &key,
-mlxsw_sp1_ptp_unmatched_ht_params);
+   struct mlxsw_sp1_ptp_unmatched *unmatched, *last = NULL;
+   struct rhlist_head *tmp, *list;
+   int length = 0;
+
+   list = rhltable_lookup(&mlxsw_sp->ptp_state->unmatched_ht, &key,
+  mlxsw_sp1_ptp_unmatched_ht_params);
+   rhl_for_each_entry_rcu(unmatched, tmp, list, ht_node) {
+   last = unmatched;
+   length++;
+   }
+
+   *p_length = length;
+   return last;
 }
 
 static int
 mlxsw_sp1_ptp_unmatched_remo

Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Sergei Shtylyov

Hello!

   Just noticed a comment typo...

On 11.08.2019 6:18, Marek Behún wrote:


Commit 88d6272acaaa ("net: phy: avoid unneeded MDIO reads in
genphy_read_status") broke fixed link DSA port registration in
dsa_port_fixed_link_register_of: the genphy_read_status does not do what
it is supposed to and the following adjust_link is given wrong
parameters.

This causes a regression on Turris Omnia, where the mvneta driver for
the interface connected to the switch reports crc errors, for some
reason.

I realize this fix is not ideal, something else could change in genphy
functions which could cause DSA fixed-link port to break again.
Hopefully DSA fixed-link port functionality will be converted to phylink
API soon.

Signed-off-by: Marek Behún 
Cc: Heiner Kallweit 
Cc: Sebastian Reichel 
Cc: Vivien Didelot 
Cc: Andrew Lunn 
Cc: Florian Fainelli 
Cc: David S. Miller 
---
  net/dsa/port.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 363eab6df51b..c424ebb373e1 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -485,6 +485,17 @@ static int dsa_port_fixed_link_register_of(struct dsa_port 
*dp)
phydev->interface = mode;
  
  	genphy_config_init(phydev);

+
+   /*
+* Commit 88d6272acaaa caused genphy_read_status not to do it's work if

    its.


+* autonegotiation is enabled and link status did not change. This is
+* the case for fixed_phy. By setting phydev->link = 0 before the call
+* to genphy_read_status we force it to read and fill in the parameters.
+*
+* Hopefully this dirty hack will be removed soon by converting DSA
+* fixed link ports to phylink API.
+*/
+   phydev->link = 0;
genphy_read_status(phydev);
  
  	if (ds->ops->adjust_link)


MBR, Sergei



Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain index

2019-08-11 Thread Paul Blakey

On 8/8/2019 11:53 PM, Pravin Shelar wrote:
> On Wed, Aug 7, 2019 at 5:08 AM Paul Blakey  wrote:
>> Offloaded OvS datapath rules are translated one to one to tc rules,
>> for example the following simplified OvS rule:
>>
>> recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) 
>> actions:ct(),recirc(2)
>>
>> Will be translated to the following tc rule:
>>
>> $ tc filter add dev dev1 ingress \
>>  prio 1 chain 0 proto ip \
>>  flower tcp ct_state -trk \
>>  action ct pipe \
>>  action goto chain 2
>>
>> Received packets will first travel though tc, and if they aren't stolen
>> by it, like in the above rule, they will continue to OvS datapath.
>> Since we already did some actions (action ct in this case) which might
>> modify the packets, and updated action stats, we would like to continue
>> the proccessing with the correct recirc_id in OvS (here recirc_id(2))
>> where we left off.
>>
>> To support this, introduce a new skb extension for tc, which
>> will be used for translating tc chain to ovs recirc_id to
>> handle these miss cases. Last tc chain index will be set
>> by tc goto chain action and read by OvS datapath.
>>
>> Signed-off-by: Paul Blakey 
>> Signed-off-by: Vlad Buslov 
>> Acked-by: Jiri Pirko 
>> ---
>>   include/linux/skbuff.h| 13 +
>>   include/net/sch_generic.h |  5 -
>>   net/core/skbuff.c |  6 ++
>>   net/openvswitch/flow.c|  9 +
>>   net/sched/Kconfig | 13 +
>>   net/sched/act_api.c   |  1 +
>>   net/sched/cls_api.c   | 12 
>>   7 files changed, 58 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 3aef8d8..fb2a792 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -279,6 +279,16 @@ struct nf_bridge_info {
>>   };
>>   #endif
>>
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +/* Chain in tc_skb_ext will be used to share the tc chain with
>> + * ovs recirc_id. It will be set to the current chain by tc
>> + * and read by ovs to recirc_id.
>> + */
>> +struct tc_skb_ext {
>> +   __u32 chain;
>> +};
>> +#endif
>> +
>>   struct sk_buff_head {
>>  /* These two members must be first. */
>>  struct sk_buff  *next;
>> @@ -4050,6 +4060,9 @@ enum skb_ext_id {
>>   #ifdef CONFIG_XFRM
>>  SKB_EXT_SEC_PATH,
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   TC_SKB_EXT,
>> +#endif
>>  SKB_EXT_NUM, /* must be last */
>>   };
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 6b6b012..871feea 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -275,7 +275,10 @@ struct tcf_result {
>>  unsigned long   class;
>>  u32 classid;
>>  };
>> -   const struct tcf_proto *goto_tp;
>> +   struct {
>> +   const struct tcf_proto *goto_tp;
>> +   u32 goto_index;
>> +   };
>>
>>  /* used in the skb_tc_reinsert function */
>>  struct {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index ea8e8d3..2b40b5a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4087,6 +4087,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff 
>> *skb)
>>   #ifdef CONFIG_XFRM
>>  [SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   [TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
>> +#endif
>>   };
>>
>>   static __always_inline unsigned int skb_ext_total_length(void)
>> @@ -4098,6 +4101,9 @@ static __always_inline unsigned int 
>> skb_ext_total_length(void)
>>   #ifdef CONFIG_XFRM
>>  skb_ext_type_len[SKB_EXT_SEC_PATH] +
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   skb_ext_type_len[TC_SKB_EXT] +
>> +#endif
>>  0;
>>   }
>>
>> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
>> index bc89e16..0287ead 100644
>> --- a/net/openvswitch/flow.c
>> +++ b/net/openvswitch/flow.c
>> @@ -816,6 +816,9 @@ static int key_extract_mac_proto(struct sk_buff *skb)
>>   int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   struct sk_buff *skb, struct sw_flow_key *key)
>>   {
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   struct tc_skb_ext *tc_ext;
>> +#endif
>>  int res, err;
>>
>>  /* Extract metadata from packet. */
>> @@ -848,7 +851,13 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
>> *tun_info,
>>  if (res < 0)
>>  return res;
>>  key->mac_proto = res;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   tc_ext = skb_ext_find(skb, TC_SKB_EXT);
>> +   key->recirc_id = tc_ext ? tc_ext->chain : 0;
>> +#else
>>  key->recirc_id = 0;
>> +#endif
>>
> 

Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain index

2019-08-11 Thread Paul Blakey

On 8/8/2019 11:53 PM, Pravin Shelar wrote:
> On Wed, Aug 7, 2019 at 5:08 AM Paul Blakey  wrote:
>> Offloaded OvS datapath rules are translated one to one to tc rules,
>> for example the following simplified OvS rule:
>>
>> recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) 
>> actions:ct(),recirc(2)
>>
>> Will be translated to the following tc rule:
>>
>> $ tc filter add dev dev1 ingress \
>>  prio 1 chain 0 proto ip \
>>  flower tcp ct_state -trk \
>>  action ct pipe \
>>  action goto chain 2
>>
>> Received packets will first travel though tc, and if they aren't stolen
>> by it, like in the above rule, they will continue to OvS datapath.
>> Since we already did some actions (action ct in this case) which might
>> modify the packets, and updated action stats, we would like to continue
>> the proccessing with the correct recirc_id in OvS (here recirc_id(2))
>> where we left off.
>>
>> To support this, introduce a new skb extension for tc, which
>> will be used for translating tc chain to ovs recirc_id to
>> handle these miss cases. Last tc chain index will be set
>> by tc goto chain action and read by OvS datapath.
>>
>> Signed-off-by: Paul Blakey 
>> Signed-off-by: Vlad Buslov 
>> Acked-by: Jiri Pirko 
>> ---
>>   include/linux/skbuff.h| 13 +
>>   include/net/sch_generic.h |  5 -
>>   net/core/skbuff.c |  6 ++
>>   net/openvswitch/flow.c|  9 +
>>   net/sched/Kconfig | 13 +
>>   net/sched/act_api.c   |  1 +
>>   net/sched/cls_api.c   | 12 
>>   7 files changed, 58 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 3aef8d8..fb2a792 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -279,6 +279,16 @@ struct nf_bridge_info {
>>   };
>>   #endif
>>
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +/* Chain in tc_skb_ext will be used to share the tc chain with
>> + * ovs recirc_id. It will be set to the current chain by tc
>> + * and read by ovs to recirc_id.
>> + */
>> +struct tc_skb_ext {
>> +   __u32 chain;
>> +};
>> +#endif
>> +
>>   struct sk_buff_head {
>>  /* These two members must be first. */
>>  struct sk_buff  *next;
>> @@ -4050,6 +4060,9 @@ enum skb_ext_id {
>>   #ifdef CONFIG_XFRM
>>  SKB_EXT_SEC_PATH,
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   TC_SKB_EXT,
>> +#endif
>>  SKB_EXT_NUM, /* must be last */
>>   };
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 6b6b012..871feea 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -275,7 +275,10 @@ struct tcf_result {
>>  unsigned long   class;
>>  u32 classid;
>>  };
>> -   const struct tcf_proto *goto_tp;
>> +   struct {
>> +   const struct tcf_proto *goto_tp;
>> +   u32 goto_index;
>> +   };
>>
>>  /* used in the skb_tc_reinsert function */
>>  struct {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index ea8e8d3..2b40b5a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4087,6 +4087,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff 
>> *skb)
>>   #ifdef CONFIG_XFRM
>>  [SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   [TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
>> +#endif
>>   };
>>
>>   static __always_inline unsigned int skb_ext_total_length(void)
>> @@ -4098,6 +4101,9 @@ static __always_inline unsigned int 
>> skb_ext_total_length(void)
>>   #ifdef CONFIG_XFRM
>>  skb_ext_type_len[SKB_EXT_SEC_PATH] +
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   skb_ext_type_len[TC_SKB_EXT] +
>> +#endif
>>  0;
>>   }
>>
>> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
>> index bc89e16..0287ead 100644
>> --- a/net/openvswitch/flow.c
>> +++ b/net/openvswitch/flow.c
>> @@ -816,6 +816,9 @@ static int key_extract_mac_proto(struct sk_buff *skb)
>>   int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   struct sk_buff *skb, struct sw_flow_key *key)
>>   {
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   struct tc_skb_ext *tc_ext;
>> +#endif
>>  int res, err;
>>
>>  /* Extract metadata from packet. */
>> @@ -848,7 +851,13 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
>> *tun_info,
>>  if (res < 0)
>>  return res;
>>  key->mac_proto = res;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +   tc_ext = skb_ext_find(skb, TC_SKB_EXT);
>> +   key->recirc_id = tc_ext ? tc_ext->chain : 0;
>> +#else
>>  key->recirc_id = 0;
>> +#endif
>>
> 

Re: [PATCH] xen-netback: no need to check return value of debugfs_create functions

2019-08-11 Thread Wei Liu
On Sat, Aug 10, 2019 at 12:31:08PM +0200, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: Wei Liu 
> Cc: Paul Durrant 
> Cc: xen-de...@lists.xenproject.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 

Acked-by: Wei Liu 


Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Heiner Kallweit
On 11.08.2019 05:39, Andrew Lunn wrote:
> On Sun, Aug 11, 2019 at 05:18:57AM +0200, Marek Behún wrote:
>> Commit 88d6272acaaa ("net: phy: avoid unneeded MDIO reads in
>> genphy_read_status") broke fixed link DSA port registration in
>> dsa_port_fixed_link_register_of: the genphy_read_status does not do what
>> it is supposed to and the following adjust_link is given wrong
>> parameters.
> 
> Hi Marek
> 
> Which parameters are incorrect?
> 
> In fixed_phy.c, __fixed_phy_register() there is:
> 
> /* propagate the fixed link values to struct phy_device */
> phy->link = status->link;
> if (status->link) {
> phy->speed = status->speed;
> phy->duplex = status->duplex;
> phy->pause = status->pause;
> phy->asym_pause = status->asym_pause;
> }
> 
> Are we not initialising something? Or is the initialisation done here
> getting reset sometime afterwards?
> 
In addition to Andrew's question:
We talk about this DT config: armada-385-turris-omnia.dts ?
Which kernel version are you using?

> Thanks
>   Andrew
> 
Heiner


[BUG] fec mdio times out under system stress

2019-08-11 Thread Russell King - ARM Linux admin
Hi Fabio,

When I woke up this morning, I found that one of the Hummingboards
had gone offline (as in, lost network link) during the night.
Investigating, I find that the system had gone into OOM, and at
that time, triggered an unrelated:

[4111697.698776] fec 2188000.ethernet eth0: MDIO read timeout
[4111697.712996] MII_DATA: 0x6006796d
[4111697.729415] MII_SPEED: 0x001a
[4111697.745232] IEVENT: 0x
[4111697.745242] IMASK: 0x0a8000aa
[4111698.002233] Atheros 8035 ethernet 2188000.ethernet-1:00: PHY state change 
RUNNING -> HALTED
[4111698.009882] fec 2188000.ethernet eth0: Link is Down

This is on a dual-core iMX6.

It looks like the read actually completed (since MII_DATA contains
the register data) but we somehow lost the interrupt (or maybe
received the interrupt after wait_for_completion_timeout() timed
out.)

>From what I can see, the OOM events happened on CPU1, CPU1 was
allocated the FEC interrupt, and the PHY polling that suffered the
MDIO timeout was on CPU0.

Given that IEVENT is zero, it seems that CPU1 had read serviced the
interrupt, but it is not clear how far through processing that it
was - it may be that fec_enet_interrupt() had been delayed by the
OOM condition.

This seems rather fragile - as the system slowing down due to OOM
triggers the network to completely collapse by phylib taking the
PHY offline, making the system inaccessible except through the
console.

In my case, even serial console wasn't operational (except for
magic sysrq).  Not sure what agetty was playing at... so the only
way I could recover any information from the system was to connect
the HDMI and plug in a USB keyboard.

Any thoughts on how FEC MDIO accesses could be made more robust?

Maybe phylib should retry a number of times - but with read-sensitive
registers, if the read has already completed successfully, and its
just a problem with the FEC MDIO hardware, that could cause issues.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Marek Behun
OK guys, something is terribly wrong here.

I bisected to the commit mentioned (88d6272acaaa), looked around at the
genphy functions, tried adding the link=0 workaround and it did work,
so I though this was the issue.

What I realized now is that before the commit 88d6272acaaa things
worked because of two bugs, which negated each other. This commit caused
one of this bugs not to fire, and thus the second bug was not negated.

What actually happened before the commit that broke it is this:
  - after the fixed_phy is created, the parameters are corrent
  - genphy_read_status breaks the parameters:
 - first it sets the parameters to unknown (SPEED_UNKNOWN,
   DUPLEX_UNKNOWN)
 - then read the registers, which are simulated for fixed_phy
 - then it uses phy-core.c:phy_resolve_aneg_linkmode function, which
   looks for correct settings by bit-anding the ->advertising and
   ->lp_advertigins bit arrays. But in fixed_phy, ->lp_advertising
   is set to zero, so the parameters are left at SPEED_UNKNOWN, ...
   (this is the first bug)
  - then adjust_link is called, which then goes to
mv88e6xxx_port_setup_mac, where there is a test if it should change
something:
   if (state.link == link && state.speed == speed &&
   state.duplex == duplex)
   return 0;
  - since current speed on the switch port (state.speed) is SPEED_1000,
and new speed is SPEED_UNKNOWN, this test fails, and so the rest of
this function is called, which makes the port work
(the if test is the second bug)

After the commit that broke things:
  - after the fixed_phy is created, the parameters are corrent
  - genphy_read_status doesn't change them
  - mv88e6xxx_port_setup_mac does nothing, since the if condition above
is true

So, there are two things that are broken:
 - the test in mv88e6xxx_port_setup_mac whether there is to be a change
   should be more sophisticated
 - fixed_phy should also simulate the lp_advertising register

What do you think of this?

Marek

On Sun, 11 Aug 2019 13:35:20 +0200
Heiner Kallweit  wrote:

> On 11.08.2019 05:39, Andrew Lunn wrote:
> > On Sun, Aug 11, 2019 at 05:18:57AM +0200, Marek Behún wrote:  
> >> Commit 88d6272acaaa ("net: phy: avoid unneeded MDIO reads in
> >> genphy_read_status") broke fixed link DSA port registration in
> >> dsa_port_fixed_link_register_of: the genphy_read_status does not do what
> >> it is supposed to and the following adjust_link is given wrong
> >> parameters.  
> > 
> > Hi Marek
> > 
> > Which parameters are incorrect?
> > 
> > In fixed_phy.c, __fixed_phy_register() there is:
> > 
> > /* propagate the fixed link values to struct phy_device */
> > phy->link = status->link;
> > if (status->link) {
> > phy->speed = status->speed;
> > phy->duplex = status->duplex;
> > phy->pause = status->pause;
> > phy->asym_pause = status->asym_pause;
> > }
> > 
> > Are we not initialising something? Or is the initialisation done here
> > getting reset sometime afterwards?
> >   
> In addition to Andrew's question:
> We talk about this DT config: armada-385-turris-omnia.dts ?
> Which kernel version are you using?
> 
> > Thanks
> > Andrew
> >   
> Heiner



Re: [BUG] fec mdio times out under system stress

2019-08-11 Thread Russell King - ARM Linux admin
On Sun, Aug 11, 2019 at 02:37:07PM +0100, Russell King - ARM Linux admin wrote:
> Hi Fabio,
> 
> When I woke up this morning, I found that one of the Hummingboards
> had gone offline (as in, lost network link) during the night.
> Investigating, I find that the system had gone into OOM, and at
> that time, triggered an unrelated:
> 
> [4111697.698776] fec 2188000.ethernet eth0: MDIO read timeout
> [4111697.712996] MII_DATA: 0x6006796d
> [4111697.729415] MII_SPEED: 0x001a
> [4111697.745232] IEVENT: 0x
> [4111697.745242] IMASK: 0x0a8000aa
> [4111698.002233] Atheros 8035 ethernet 2188000.ethernet-1:00: PHY state 
> change RUNNING -> HALTED
> [4111698.009882] fec 2188000.ethernet eth0: Link is Down
> 
> This is on a dual-core iMX6.
> 
> It looks like the read actually completed (since MII_DATA contains
> the register data) but we somehow lost the interrupt (or maybe
> received the interrupt after wait_for_completion_timeout() timed
> out.)
> 
> From what I can see, the OOM events happened on CPU1, CPU1 was
> allocated the FEC interrupt, and the PHY polling that suffered the
> MDIO timeout was on CPU0.
> 
> Given that IEVENT is zero, it seems that CPU1 had read serviced the
> interrupt, but it is not clear how far through processing that it
> was - it may be that fec_enet_interrupt() had been delayed by the
> OOM condition.
> 
> This seems rather fragile - as the system slowing down due to OOM
> triggers the network to completely collapse by phylib taking the
> PHY offline, making the system inaccessible except through the
> console.
> 
> In my case, even serial console wasn't operational (except for
> magic sysrq).  Not sure what agetty was playing at... so the only
> way I could recover any information from the system was to connect
> the HDMI and plug in a USB keyboard.
> 
> Any thoughts on how FEC MDIO accesses could be made more robust?
> 
> Maybe phylib should retry a number of times - but with read-sensitive
> registers, if the read has already completed successfully, and its
> just a problem with the FEC MDIO hardware, that could cause issues.

I should also note for the phylib people:

After phylib has received an error, and has entered the HALTED state,
downing the interface produces this warning, which seems rather unfair
as the FEC doesn't know that its PHY suffered an error - it is merely
reversing the phy_start() that happened when the interface was opened.

[4144039.099786] [ cut here ]
[4144039.109001] WARNING: CPU: 1 PID: 25842 at drivers/net/phy/phy.c:835 
fec_enet_close+0x14/0x148
[4144039.124366] called from state HALTED
[4144039.132626] Modules linked in: 8021q brcmfmac brcmutil imx_thermal
snd_soc_imx_spdif snd_soc_imx_audmux nvmem_imx_ocotp cfg80211
snd_soc_sgtl5000 imx_sdma
virt_dma rc_cec snd_soc_fsl_ssi snd_soc_fsl_spdif coda imx_pcm_dma
v4l2_mem2mem
imx_vdoa etnaviv videobuf2_dma_contig gpu_sched dw_hdmi_cec
dw_hdmi_ahb_audio imx6q_cpufreq caamrng caam_jr caam error ip_tables
x_tables [last unloaded: evbug][4144039.177249] CPU: 1 PID: 25842 Comm:
ip Tainted: GW 5.1.0+ #319
[4144039.189477] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[4144039.201025] [] (unwind_backtrace) from [] 
(show_stack+0x10/0x14)
[4144039.213769] [] (show_stack) from [] 
(dump_stack+0x9c/0xd4)
[4144039.225959] [] (dump_stack) from [] (__warn+0xf8/0x124)
[4144039.237828] [] (__warn) from [] 
(warn_slowpath_fmt+0x38/0x48)
[4144039.250275] [] (warn_slowpath_fmt) from [] 
(fec_enet_close+0x14/0x148)
[4144039.263558] [] (fec_enet_close) from [] 
(__dev_close_many+0x88/0xf0)
[4144039.276639] [] (__dev_close_many) from [] 
(__dev_change_flags+0xa4/0x1a0)
[4144039.290130] [] (__dev_change_flags) from [] 
(dev_change_flags+0x18/0x48)
[4144039.303558] [] (dev_change_flags) from [] 
(do_setlink+0x29c/0x990)
[4144039.316517] [] (do_setlink) from [] 
(__rtnl_newlink+0x4a4/0x72c)
[4144039.329282] [] (__rtnl_newlink) from [] 
(rtnl_newlink+0x40/0x60)
[4144039.342023] [] (rtnl_newlink) from [] 
(rtnetlink_rcv_msg+0x244/0x470)
[4144039.355224] [] (rtnetlink_rcv_msg) from [] 
(netlink_rcv_skb+0xe0/0xf4)
[4144039.368492] [] (netlink_rcv_skb) from [] 
(netlink_unicast+0x170/0x1b8)
[4144039.381758] [] (netlink_unicast) from [] 
(netlink_sendmsg+0x2b8/0x340)
[4144039.395023] [] (netlink_sendmsg) from [] 
(sock_sendmsg+0x14/0x24)
[4144039.407825] [] (sock_sendmsg) from [] 
(___sys_sendmsg+0x200/0x214)
[4144039.420714] [] (___sys_sendmsg) from [] 
(__sys_sendmsg+0x40/0x6c)
[4144039.433506] [] (__sys_sendmsg) from [] 
(ret_fast_syscall+0x0/0x28)
[4144039.446366] Exception stack(0xe1933fa8 to 0xe1933ff0)
[4144039.456329] 3fa0:   bea086f0 bea086bc 0003 bea086d0 
 
[4144039.469464] 3fc0: bea086f0 bea086bc  0128 0062a278 bea086d0 
5d5011f3 0062a000
[4144039.482590] 3fe0: 0128 bea08680 b6dfc4cb b6d7d6f6
[4144039.492652] irq event stamp: 0
[4144039.500650] hardirqs last  enabled at (0): [<>]   (null)
[4144039.511628] h

Re: [BUG] fec mdio times out under system stress

2019-08-11 Thread Vladimir Oltean
Hi Russell, Fabio,

On Sun, 11 Aug 2019 at 16:42, Russell King - ARM Linux admin
 wrote:
>
> Hi Fabio,
>
> When I woke up this morning, I found that one of the Hummingboards
> had gone offline (as in, lost network link) during the night.
> Investigating, I find that the system had gone into OOM, and at
> that time, triggered an unrelated:
>
> [4111697.698776] fec 2188000.ethernet eth0: MDIO read timeout
> [4111697.712996] MII_DATA: 0x6006796d
> [4111697.729415] MII_SPEED: 0x001a
> [4111697.745232] IEVENT: 0x
> [4111697.745242] IMASK: 0x0a8000aa
> [4111698.002233] Atheros 8035 ethernet 2188000.ethernet-1:00: PHY state 
> change RUNNING -> HALTED
> [4111698.009882] fec 2188000.ethernet eth0: Link is Down
>
> This is on a dual-core iMX6.
>
> It looks like the read actually completed (since MII_DATA contains
> the register data) but we somehow lost the interrupt (or maybe
> received the interrupt after wait_for_completion_timeout() timed
> out.)
>
> From what I can see, the OOM events happened on CPU1, CPU1 was
> allocated the FEC interrupt, and the PHY polling that suffered the
> MDIO timeout was on CPU0.
>
> Given that IEVENT is zero, it seems that CPU1 had read serviced the
> interrupt, but it is not clear how far through processing that it
> was - it may be that fec_enet_interrupt() had been delayed by the
> OOM condition.
>
> This seems rather fragile - as the system slowing down due to OOM
> triggers the network to completely collapse by phylib taking the
> PHY offline, making the system inaccessible except through the
> console.
>
> In my case, even serial console wasn't operational (except for
> magic sysrq).  Not sure what agetty was playing at... so the only
> way I could recover any information from the system was to connect
> the HDMI and plug in a USB keyboard.
>
> Any thoughts on how FEC MDIO accesses could be made more robust?
>

I think a better question is why is the FEC MDIO controller configured
to emit interrupts anyway (especially since the API built on top does
not benefit in any way from this)? Hubert (copied) sent an interesting
email very recently where he pointed out that this is one of the main
sources of jitter when reading the PTP clock on a Marvell switch
attached over MDIO.

> Maybe phylib should retry a number of times - but with read-sensitive
> registers, if the read has already completed successfully, and its
> just a problem with the FEC MDIO hardware, that could cause issues.
>
> Thanks.
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up

Regards,
-Vladimir


[PATCH net-next 1/2] net: dsa: mv88e6xxx: fix RGMII-ID port setup

2019-08-11 Thread Marek Behún
The mv88e6xxx_port_setup_mac looks if one of the {link, speed, duplex}
parameters is being changed from the current setting, and if not, does
not do anything. This test is wrong in some situations: this method also
has the mode argument, which can also be changed.

For example on Turris Omnia, the mode is PHY_INTERFACE_MODE_RGMII_ID,
which has to be set byt the ->port_set_rgmii_delay method. The test does
not look if mode is being changed (in fact there is currently no method
to determine port mode as phy_interface_t type).

The simplest solution seems to be to drop this test altogether and
simply do the setup when requested.

Signed-off-by: Marek Behún 
Cc: Heiner Kallweit 
Cc: Sebastian Reichel 
Cc: Vivien Didelot 
Cc: Andrew Lunn 
Cc: Florian Fainelli 
Cc: David S. Miller 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 2e8b1ab2c6f7..aae63f6515b3 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -420,15 +420,6 @@ int mv88e6xxx_port_setup_mac(struct mv88e6xxx_chip *chip, 
int port, int link,
if (err)
return err;
 
-   /* Has anything actually changed? We don't expect the
-* interface mode to change without one of the other
-* parameters also changing
-*/
-   if (state.link == link &&
-   state.speed == speed &&
-   state.duplex == duplex)
-   return 0;
-
/* Port's MAC control must not be changed unless the link is down */
err = chip->info->ops->port_set_link(chip, port, 0);
if (err)
-- 
2.21.0



[PATCH net-next 2/2] net: fixed_phy: set is_gigabit_capable member when needed

2019-08-11 Thread Marek Behún
The fixed_phy driver does not set the phydev->is_gigabit_capable member
when the fixed_phy is gigabit capable. This in turn causes
phy_device.c:genphy_read_status function to return unknown phy
parameters (SPEED_UNKNOWN, DUPLEX_UNKNOWN, ...), if the fixed_phy is
created with SPEED_1000.

Signed-off-by: Marek Behún 
Cc: Heiner Kallweit 
Cc: Sebastian Reichel 
Cc: Vivien Didelot 
Cc: Andrew Lunn 
Cc: Florian Fainelli 
Cc: David S. Miller 
---
 drivers/net/phy/fixed_phy.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index 3ffe46df249e..424b02ad7b7b 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -286,6 +286,7 @@ static struct phy_device *__fixed_phy_register(unsigned int 
irq,
 phy->supported);
linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
 phy->supported);
+   phy->is_gigabit_capable = 1;
/* fall through */
case SPEED_100:
linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
-- 
2.21.0



Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Marek Behun
I have sent two new patches, each fixing one of the bugs that negated
each other.

Marek


Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Vladimir Oltean
Hi Marek,

On Sun, 11 Aug 2019 at 17:06, Marek Behun  wrote:
>
> OK guys, something is terribly wrong here.
>
> I bisected to the commit mentioned (88d6272acaaa), looked around at the
> genphy functions, tried adding the link=0 workaround and it did work,
> so I though this was the issue.
>
> What I realized now is that before the commit 88d6272acaaa things
> worked because of two bugs, which negated each other. This commit caused
> one of this bugs not to fire, and thus the second bug was not negated.
>
> What actually happened before the commit that broke it is this:
>   - after the fixed_phy is created, the parameters are corrent
>   - genphy_read_status breaks the parameters:
>  - first it sets the parameters to unknown (SPEED_UNKNOWN,
>DUPLEX_UNKNOWN)
>  - then read the registers, which are simulated for fixed_phy
>  - then it uses phy-core.c:phy_resolve_aneg_linkmode function, which
>looks for correct settings by bit-anding the ->advertising and
>->lp_advertigins bit arrays. But in fixed_phy, ->lp_advertising
>is set to zero, so the parameters are left at SPEED_UNKNOWN, ...
>(this is the first bug)
>   - then adjust_link is called, which then goes to
> mv88e6xxx_port_setup_mac, where there is a test if it should change
> something:
>if (state.link == link && state.speed == speed &&
>state.duplex == duplex)
>return 0;
>   - since current speed on the switch port (state.speed) is SPEED_1000,
> and new speed is SPEED_UNKNOWN, this test fails, and so the rest of
> this function is called, which makes the port work
> (the if test is the second bug)
>
> After the commit that broke things:
>   - after the fixed_phy is created, the parameters are corrent
>   - genphy_read_status doesn't change them
>   - mv88e6xxx_port_setup_mac does nothing, since the if condition above
> is true
>
> So, there are two things that are broken:
>  - the test in mv88e6xxx_port_setup_mac whether there is to be a change
>should be more sophisticated
>  - fixed_phy should also simulate the lp_advertising register
>
> What do you think of this?
>

I don't know. But I think Heiner was asking you what kernel you're on
because of what you said here:

> Hopefully DSA fixed-link port functionality will be converted to phylink
> API soon.

The DSA fixed-link port functionality *has* been converted to phylink.
See:
- 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=0e27921816ad9
- 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7fb5a711545d7d25fe9726a9ad277474dd83bd06


> Marek
>
> On Sun, 11 Aug 2019 13:35:20 +0200
> Heiner Kallweit  wrote:
>
> > On 11.08.2019 05:39, Andrew Lunn wrote:
> > > On Sun, Aug 11, 2019 at 05:18:57AM +0200, Marek Behún wrote:
> > >> Commit 88d6272acaaa ("net: phy: avoid unneeded MDIO reads in
> > >> genphy_read_status") broke fixed link DSA port registration in
> > >> dsa_port_fixed_link_register_of: the genphy_read_status does not do what
> > >> it is supposed to and the following adjust_link is given wrong
> > >> parameters.
> > >
> > > Hi Marek
> > >
> > > Which parameters are incorrect?
> > >
> > > In fixed_phy.c, __fixed_phy_register() there is:
> > >
> > > /* propagate the fixed link values to struct phy_device */
> > > phy->link = status->link;
> > > if (status->link) {
> > > phy->speed = status->speed;
> > > phy->duplex = status->duplex;
> > > phy->pause = status->pause;
> > > phy->asym_pause = status->asym_pause;
> > > }
> > >
> > > Are we not initialising something? Or is the initialisation done here
> > > getting reset sometime afterwards?
> > >
> > In addition to Andrew's question:
> > We talk about this DT config: armada-385-turris-omnia.dts ?
> > Which kernel version are you using?
> >
> > > Thanks
> > > Andrew
> > >
> > Heiner
>

Regards,
-Vladimir


Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Andrew Lunn
On Sun, Aug 11, 2019 at 04:04:04PM +0200, Marek Behun wrote:
> OK guys, something is terribly wrong here.
> 
> I bisected to the commit mentioned (88d6272acaaa), looked around at the
> genphy functions, tried adding the link=0 workaround and it did work,
> so I though this was the issue.
> 
> What I realized now is that before the commit 88d6272acaaa things
> worked because of two bugs, which negated each other. This commit caused
> one of this bugs not to fire, and thus the second bug was not negated.
> 
> What actually happened before the commit that broke it is this:
>   - after the fixed_phy is created, the parameters are corrent
>   - genphy_read_status breaks the parameters:
>  - first it sets the parameters to unknown (SPEED_UNKNOWN,
>DUPLEX_UNKNOWN)
>  - then read the registers, which are simulated for fixed_phy
>  - then it uses phy-core.c:phy_resolve_aneg_linkmode function, which
>looks for correct settings by bit-anding the ->advertising and
>->lp_advertigins bit arrays. But in fixed_phy, ->lp_advertising
>is set to zero, so the parameters are left at SPEED_UNKNOWN, ...
>(this is the first bug)
>   - then adjust_link is called, which then goes to
> mv88e6xxx_port_setup_mac, where there is a test if it should change
> something:
>if (state.link == link && state.speed == speed &&
>state.duplex == duplex)
>return 0;
>   - since current speed on the switch port (state.speed) is SPEED_1000,
> and new speed is SPEED_UNKNOWN, this test fails, and so the rest of
> this function is called, which makes the port work
> (the if test is the second bug)
> 
> After the commit that broke things:
>   - after the fixed_phy is created, the parameters are corrent
>   - genphy_read_status doesn't change them
>   - mv88e6xxx_port_setup_mac does nothing, since the if condition above
> is true
> 
> So, there are two things that are broken:
>  - the test in mv88e6xxx_port_setup_mac whether there is to be a change
>should be more sophisticated
>  - fixed_phy should also simulate the lp_advertising register
> 
> What do you think of this?

Marek

This is the sort of information i like. I was having trouble
understanding what was really wrong and how it was fixed by your
previous patch.

So setting the emulated lp_advertise to advertise makes a lot of sense
for fixed phy. And it is something worthy of stable.

As for mv88e6xxx_port_setup_mac(), which parameter is actually
important here? My assumption was, if one of the other parameters
changes, it would not happen alone. The link would also go down, and
later come up again, etc. But it seems that assumption is wrong.

At a guess, it is the RGMII delays. That would explain CRC errors in
frames received by the master interface.

   Andrew


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: fix RGMII-ID port setup

2019-08-11 Thread Andrew Lunn
On Sun, Aug 11, 2019 at 05:08:11PM +0200, Marek Behún wrote:
> The mv88e6xxx_port_setup_mac looks if one of the {link, speed, duplex}
> parameters is being changed from the current setting, and if not, does
> not do anything. This test is wrong in some situations: this method also
> has the mode argument, which can also be changed.
> 
> For example on Turris Omnia, the mode is PHY_INTERFACE_MODE_RGMII_ID,
> which has to be set byt the ->port_set_rgmii_delay method. The test does
> not look if mode is being changed (in fact there is currently no method
> to determine port mode as phy_interface_t type).
> 
> The simplest solution seems to be to drop this test altogether and
> simply do the setup when requested.

Hi Marek

Unfortunately, that code is there for a reason. phylink can call the
->mac_config() method once per second. It is documented that
mac_config() should only reconfigure what, if anything, has changed.
And mv88e6xxx_port_setup_mac() needs to disable the port in order to
change anything. So the change you propose here, under some
conditions, will cause the port to be disabled/enables once per
second.

We need to fix this by expanding the test, not removing it.  My
current _guess_ would be, we need to add a ops->port_get_rgmii_delay()
so we can see if that is what needs configuring.

   Andrew


Re: [PATCH net-next 2/2] net: fixed_phy: set is_gigabit_capable member when needed

2019-08-11 Thread Andrew Lunn
On Sun, Aug 11, 2019 at 05:08:12PM +0200, Marek Behún wrote:
> The fixed_phy driver does not set the phydev->is_gigabit_capable member
> when the fixed_phy is gigabit capable.

Neither does any other PHY driver. It should be possible to tell if a
PHY supports 1G by looking at register values. If this does not work
for fixed_link, it means we are missing something in the emulation.
That is what we should be fixing.

Also, this change has nothing to do the lp_advertise, what you
previously said the problem was. At the moment, i don't get the
feeling you have really dug all the way down and really understand the
root causes.

 Andrew


Re: [BUG] fec mdio times out under system stress

2019-08-11 Thread Andrew Lunn
> I think a better question is why is the FEC MDIO controller configured
> to emit interrupts anyway (especially since the API built on top does
> not benefit in any way from this)? Hubert (copied) sent an interesting
> email very recently where he pointed out that this is one of the main
> sources of jitter when reading the PTP clock on a Marvell switch
> attached over MDIO.

Hi Vladimir

One reason is runtime power management.

For a write operation, you could set it going and return
immediately. Many drivers do, and then when the next read/write
operation comes along, they poll in a loop waiting for the device to
go idle, if it was still busy from the last operation.

However, FEC supports runtime PM. When an MDIO read/write call is
made, it calls runtime PM functions to indicate the device is active.
Once it has completed the MDIO transaction, it calls runtime PM
functions to indicate the device is inactive. These transitions can
cause clocks to be enabled/disabled. If we don't wait around for the
operation to complete, the clock could be disabled too early, and bad
things would happen.

You could replace the interrupt with a sleeping poll, but my guess
would be, that has more jitter than using an interrupt.

  Andrew


Re: [PATCH net-next 2/2] net: fixed_phy: set is_gigabit_capable member when needed

2019-08-11 Thread Marek Behun
On Sun, 11 Aug 2019 17:40:01 +0200
Andrew Lunn  wrote:

> On Sun, Aug 11, 2019 at 05:08:12PM +0200, Marek Behún wrote:
> > The fixed_phy driver does not set the phydev->is_gigabit_capable member
> > when the fixed_phy is gigabit capable.  
> 
> Neither does any other PHY driver. It should be possible to tell if a
> PHY supports 1G by looking at register values. If this does not work
> for fixed_link, it means we are missing something in the emulation.
> That is what we should be fixing.
> 
> Also, this change has nothing to do the lp_advertise, what you
> previously said the problem was. At the moment, i don't get the
> feeling you have really dug all the way down and really understand the
> root causes.
> 
>  Andrew

Andrew,
is_gigabit_capable is otherwise set only in the phy_probe function.
This function is not called at all for the DSA cpu port fixed_link phy.
Why is that? But I guess it is not important anymore, if CPU and DSA
were converted to phylink in net-next. I shall test it and let you know.
In any case, sorry for the spam.
Marek


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: fix RGMII-ID port setup

2019-08-11 Thread Marek Behun
On Sun, 11 Aug 2019 17:31:53 +0200
Andrew Lunn  wrote:

> On Sun, Aug 11, 2019 at 05:08:11PM +0200, Marek Behún wrote:
> > The mv88e6xxx_port_setup_mac looks if one of the {link, speed, duplex}
> > parameters is being changed from the current setting, and if not, does
> > not do anything. This test is wrong in some situations: this method also
> > has the mode argument, which can also be changed.
> > 
> > For example on Turris Omnia, the mode is PHY_INTERFACE_MODE_RGMII_ID,
> > which has to be set byt the ->port_set_rgmii_delay method. The test does
> > not look if mode is being changed (in fact there is currently no method
> > to determine port mode as phy_interface_t type).
> > 
> > The simplest solution seems to be to drop this test altogether and
> > simply do the setup when requested.  
> 
> Hi Marek
> 
> Unfortunately, that code is there for a reason. phylink can call the
> ->mac_config() method once per second. It is documented that  
> mac_config() should only reconfigure what, if anything, has changed.
> And mv88e6xxx_port_setup_mac() needs to disable the port in order to
> change anything. So the change you propose here, under some
> conditions, will cause the port to be disabled/enables once per
> second.
> 
> We need to fix this by expanding the test, not removing it.  My
> current _guess_ would be, we need to add a ops->port_get_rgmii_delay()
> so we can see if that is what needs configuring.
> 
>Andrew

Hi Andrew,
what if we added a phy_mode member to struct mv88e6xxx_port, and either
set it to PHY_INTERFACE_MODE_NA in mv88e6xxx_setup, or implemented
methods for converting the switch register values to
PHY_INTERFACE_MODE_* values.
This way we could just add port.mode == mode check to port_setup_mac
method.
I am willing to implement this if you think this could work.

Marek


Re: [PATCH net-next v2 1/1] net: dsa: fix fixed-link port registration

2019-08-11 Thread Marek Behun
On Sun, 11 Aug 2019 18:16:11 +0300
Vladimir Oltean  wrote:

> The DSA fixed-link port functionality *has* been converted to phylink.
> See:
> - 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=0e27921816ad9
> - 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7fb5a711545d7d25fe9726a9ad277474dd83bd06
>

/o\ my bad. I did not realize that I was working on 5.2 :(. Sorry.


Re: [BUG] fec mdio times out under system stress

2019-08-11 Thread Andrew Lunn
On Sun, Aug 11, 2019 at 02:37:07PM +0100, Russell King - ARM Linux admin wrote:
> Hi Fabio,
> 
> When I woke up this morning, I found that one of the Hummingboards
> had gone offline (as in, lost network link) during the night.
> Investigating, I find that the system had gone into OOM, and at
> that time, triggered an unrelated:
> 
> [4111697.698776] fec 2188000.ethernet eth0: MDIO read timeout
> [4111697.712996] MII_DATA: 0x6006796d
> [4111697.729415] MII_SPEED: 0x001a
> [4111697.745232] IEVENT: 0x
> [4111697.745242] IMASK: 0x0a8000aa
> [4111698.002233] Atheros 8035 ethernet 2188000.ethernet-1:00: PHY state 
> change RUNNING -> HALTED
> [4111698.009882] fec 2188000.ethernet eth0: Link is Down
> 
> This is on a dual-core iMX6.
> 
> It looks like the read actually completed (since MII_DATA contains
> the register data) but we somehow lost the interrupt (or maybe
> received the interrupt after wait_for_completion_timeout() timed
> out.)

Hi Russell

The timeout is quite short,

#define FEC_MII_TIMEOUT 3 /* us */

Looking at the Vybrid datasheet, there does not appear to be any way
to determine if the hardware is busy other than waiting for the
interrupt. There is no 'busy' bit which gets cleared on completion.

So about the only option is to make the timeout bigger.

   Andrew


Re: [BUG] fec mdio times out under system stress

2019-08-11 Thread Andrew Lunn
> Maybe phylib should retry a number of times - but with read-sensitive
> registers, if the read has already completed successfully, and its
> just a problem with the FEC MDIO hardware, that could cause issues.

Hi Russell

At the bus level, MDIO cannot fail. The bits get clocked out, and the
bits get clocked in. There is no way for the PHY to stretch the clock
as I2C slaves can. There is nothing like the USB NACK, try again
later.

If something fails, it fails at a higher level, which means it is a
driver issue. In this case, the interrupt got delayed, after the timer
interrupt.

The FEC is also quite unusual in using an interrupt. Most MDIO drivers
poll. And if time gets 'stretched' because the system is too busy,
generally, the right thing happens anyway.

So i don't think it is phylib job to work around this issue.

   Andrew


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: fix RGMII-ID port setup

2019-08-11 Thread Andrew Lunn
> what if we added a phy_mode member to struct mv88e6xxx_port, and either
> set it to PHY_INTERFACE_MODE_NA in mv88e6xxx_setup, or implemented
> methods for converting the switch register values to
> PHY_INTERFACE_MODE_* values.

We should read the switch registers. I think you can set the defaults
using strapping pins. And in general, the driver always reads state
from the hardware rather than caching it.

 Andrew


Re: [PATCH net-next 2/2] net: fixed_phy: set is_gigabit_capable member when needed

2019-08-11 Thread Heiner Kallweit
On 11.08.2019 18:08, Marek Behun wrote:
> On Sun, 11 Aug 2019 17:40:01 +0200
> Andrew Lunn  wrote:
> 
>> On Sun, Aug 11, 2019 at 05:08:12PM +0200, Marek Behún wrote:
>>> The fixed_phy driver does not set the phydev->is_gigabit_capable member
>>> when the fixed_phy is gigabit capable.  
>>
>> Neither does any other PHY driver. It should be possible to tell if a
>> PHY supports 1G by looking at register values. If this does not work
>> for fixed_link, it means we are missing something in the emulation.
>> That is what we should be fixing.
>>
>> Also, this change has nothing to do the lp_advertise, what you
>> previously said the problem was. At the moment, i don't get the
>> feeling you have really dug all the way down and really understand the
>> root causes.
>>
>>  Andrew
> 
> Andrew,
> is_gigabit_capable is otherwise set only in the phy_probe function.
> This function is not called at all for the DSA cpu port fixed_link phy.
> Why is that? But I guess it is not important anymore, if CPU and DSA
> were converted to phylink in net-next. I shall test it and let you know.
> In any case, sorry for the spam.
> Marek
> 

A fixed phy should be bound to the genphy driver, but that happens later
in phy_attach_direct(). Maybe the fixed phy device of a CPU port gets
never attached to a net device? This would explain why phy_probe()
doesn't get called.
Following idea: We could swphy let return PHY ID 0x
(instead of current value 0x). Then the fixed phy device would
be bound to the genphy driver immediately at device registration time.
As a consequence phy_probe() would call genphy_read_abilities() that
sets supported modes properly. This should allow to remove the manual
adding of supported modes in dsa_port_fixed_link_register_of().

One more thing regarding genphy_read_abilities() and fixed phys in general:
swphy sets bit BMSR_ESTATEN, then genphy_read_abilities() reads
MII_ESTATUS to check if and which 1Gbps modes are supported.
MII_ESTATUS however isn't defined in swphy. We're just fortunate
that therefore the default value 0x is returned and both
1Gbps modes seem to be supported.

Last but not least I think the calls to genphy_config_init() and
genphy_read_status() in dsa_port_fixed_link_register_of() are
both not needed and could be removed.

Heiner


Re: [PATCH net-next v2 00/10] drop_monitor: Capture dropped packets and metadata

2019-08-11 Thread David Miller
From: Ido Schimmel 
Date: Sun, 11 Aug 2019 10:35:45 +0300

> From: Ido Schimmel 
> 
> So far drop monitor supported only one mode of operation in which a
> summary of recent packet drops is periodically sent to user space as a
> netlink event. The event only includes the drop location (program
> counter) and number of drops in the last interval.
> 
> While this mode of operation allows one to understand if the system is
> dropping packets, it is not sufficient if a more detailed analysis is
> required. Both the packet itself and related metadata are missing.
> 
> This patchset extends drop monitor with another mode of operation where
> the packet - potentially truncated - and metadata (e.g., drop location,
> timestamp, netdev) are sent to user space as a netlink event. Thanks to
> the extensible nature of netlink, more metadata can be added in the
> future.
 ...

Series applied.



pull-request: bpf 2019-08-11

2019-08-11 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) x64 JIT code generation fix for backward-jumps to 1st insn, from Alexei.

2) Fix buggy multi-closing of BTF file descriptor in libbpf, from Andrii.

3) Fix libbpf_num_possible_cpus() to make it thread safe, from Takshak.

4) Fix bpftool to dump an error if pinning fails, from Jakub.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 0a062ba725cdad3b167782179ee914a8402a0184:

  Merge tag 'mlx5-fixes-2019-07-25' of 
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (2019-07-26 14:26:41 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 4f7aafd78aeaf18a4f6dea9415df60e745c9dfa7:

  Merge branch 'bpf-bpftool-pinning-error-msg' (2019-08-09 17:38:53 +0200)


Alexei Starovoitov (2):
  bpf: fix x64 JIT code generation for jmp to 1st insn
  selftests/bpf: tests for jmp to 1st insn

Andrii Nakryiko (2):
  libbpf: fix erroneous multi-closing of BTF FD
  libbpf: set BTF FD for prog only when there is supported .BTF.ext data

Daniel Borkmann (1):
  Merge branch 'bpf-bpftool-pinning-error-msg'

Jakub Kicinski (2):
  tools: bpftool: fix error message (prog -> object)
  tools: bpftool: add error message on pin failure

Takshak Chahande (1):
  libbpf : make libbpf_num_possible_cpus function thread safe

 arch/x86/net/bpf_jit_comp.c   |  9 
 tools/bpf/bpftool/common.c|  8 +--
 tools/lib/bpf/libbpf.c| 33 +++
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++
 4 files changed, 57 insertions(+), 21 deletions(-)


Re: pull-request: bpf 2019-08-11

2019-08-11 Thread David Miller
From: Daniel Borkmann 
Date: Sun, 11 Aug 2019 21:58:34 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) x64 JIT code generation fix for backward-jumps to 1st insn, from Alexei.
> 
> 2) Fix buggy multi-closing of BTF file descriptor in libbpf, from Andrii.
> 
> 3) Fix libbpf_num_possible_cpus() to make it thread safe, from Takshak.
> 
> 4) Fix bpftool to dump an error if pinning fails, from Jakub.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames

2019-08-11 Thread Michal Kubecek
On Sat, Aug 10, 2019 at 12:39:31PM -0700, Roopa Prabhu wrote:
> On Sat, Aug 10, 2019 at 8:50 AM Michal Kubecek  wrote:
> >
> > On Sat, Aug 10, 2019 at 06:46:57AM -0700, Roopa Prabhu wrote:
> > > On Fri, Aug 9, 2019 at 8:46 AM Michal Kubecek  wrote:
> > > >
> > > > On Fri, Aug 09, 2019 at 08:40:25AM -0700, Roopa Prabhu wrote:
> > > > > to that point, I am also not sure why we have a new API For multiple
> > > > > names. I mean why support more than two names  (existing old name and
> > > > > a new name to remove the length limitation) ?
> > > >
> > > > One use case is to allow "predictable names" from udev/systemd to work
> > > > the way do for e.g. block devices, see
> > > >
> > > >   http://lkml.kernel.org/r/20190628162716.gf29...@unicorn.suse.cz
> > > >
> > >
> > > thanks for the link. don't know the details about alternate block
> > > device names. Does user-space generate multiple and assign them to a
> > > kernel object as proposed in this series ?. is there a limit to number
> > > of names ?. my understanding of 'predictable names' was still a single
> > > name but predictable structure to the name.
> >
> > It is a single name but IMHO mostly because we can only have one name.
> > For block devices, udev uses symlinks to create multiple aliases based
> > on different naming schemes, e.g.
> >
> > mike@lion:~> find -L /dev/disk/ -samefile /dev/sda2 -exec ls -l {} +
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T3114933-part2 -> ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/scsi-SATA_WDC_WD30EFRX-68A_WD-WMC1T3114933-part2 -> 
> > ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/scsi-SATA_WDC_WD30EFRX-68_WD-WMC1T3114933-part2 -> 
> > ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/scsi-0ATA_WDC_WD30EFRX-68A_WD-WMC1T3114933-part2 -> 
> > ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/scsi-1ATA_WDC_WD30EFRX-68AX9N0_WD-WMC1T3114933-part2 -> 
> > ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/scsi-350014ee6589cfea0-part2 -> ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-id/wwn-0x50014ee6589cfea0-part2 -> ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 /dev/disk/by-partlabel/root2 -> 
> > ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-partuuid/71affb47-a93b-40fd-8986-d2e227e1b39d -> ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-path/pci-:00:11.0-ata-1-part2 -> ../../sda2
> > lrwxrwxrwx 1 root root 10 srp  5 21:47 
> > /dev/disk/by-path/pci-:00:11.0-scsi-0:0:0:0-part2 -> ../../sda2
> >
> > Few years ago, udev even dropped support for renaming block and
> > character devices (NAME="...") so that it now keeps kernel name and only
> > creates symlinks to it. Recent versions only allow NAME="..." for
> > network devices.
> 
> ok thanks for the details. This looks like names that are structured
> on hardware info which could fall into devlinks scope and they point
> to a single name.
> We should think about keeping them under devlink (by-id, by-mac etc).
> It already can recognize network interfaces by id.

Not all of them are hardware based, there are also links based on
filesystem label or UUID. But my point is rather that udev creates
multiple links so that any of them can be used in any place where
a block device is to be identified.

As network devices can have only one name, udev drops kernel provided
name completely and replaces it with name following one naming scheme.
Thus we have to know which naming scheme is going to be used and make
sure it does not change. With multiple alternative names, we could also
have all udev provided names at once (and also the original one from
kernel).

Michal


Re: [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test

2019-08-11 Thread Yonghong Song


On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Add a test that calls setsockopt on the listener socket which triggers
> BPF program. This BPF program writes to the sk storage and sets
> clone flag. Make sure that sk storage is cloned for a newly
> accepted connection.
> 
> We have two cloned maps in the tests to make sure we hit both cases
> in bpf_sk_storage_clone: first element (sk_storage_alloc) and
> non-first element(s) (selem_link_map).
> 
> Cc: Martin KaFai Lau 
> Cc: Yonghong Song 
> Signed-off-by: Stanislav Fomichev 

Acked-by: Yonghong Song 


Re: [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero

2019-08-11 Thread Yonghong Song


On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
> indicate that it's caller's responsibility to do proper locking.
> Create and export bpf_map_inc_not_zero wrapper that properly
> locks map_idr_lock. Will be used in the next commit to
> hold a map while cloning a socket.
> 
> Cc: Martin KaFai Lau 
> Cc: Yonghong Song 
> Signed-off-by: Stanislav Fomichev 

Acked-by: Yonghong Song 


Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()

2019-08-11 Thread Yonghong Song


On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Add new helper bpf_sk_storage_clone which optionally clones sk storage
> and call it from sk_clone_lock.
> 
> Cc: Martin KaFai Lau 
> Cc: Yonghong Song 
> Signed-off-by: Stanislav Fomichev 

Acked-by: Yonghong Song 


Re: [v4,1/4] tools: bpftool: add net attach command to attach XDP on interface

2019-08-11 Thread Y Song
On Fri, Aug 9, 2019 at 6:35 AM Daniel T. Lee  wrote:
>
> By this commit, using `bpftool net attach`, user can attach XDP prog on
> interface. New type of enum 'net_attach_type' has been made, as stated at
> cover-letter, the meaning of 'attach' is, prog will be attached on interface.
>
> With 'overwrite' option at argument, attached XDP program could be replaced.
> Added new helper 'net_parse_dev' to parse the network device at argument.
>
> BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.
>
> Signed-off-by: Daniel T. Lee 
> ---
>  tools/bpf/bpftool/net.c | 136 +---
>  1 file changed, 129 insertions(+), 7 deletions(-)
>
> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 67e99c56bc88..74cc346c36cd 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -55,6 +55,35 @@ struct bpf_attach_info {
> __u32 flow_dissector_id;
>  };
>
> +enum net_attach_type {
> +   NET_ATTACH_TYPE_XDP,
> +   NET_ATTACH_TYPE_XDP_GENERIC,
> +   NET_ATTACH_TYPE_XDP_DRIVER,
> +   NET_ATTACH_TYPE_XDP_OFFLOAD,
> +};
> +
> +static const char * const attach_type_strings[] = {
> +   [NET_ATTACH_TYPE_XDP]   = "xdp",
> +   [NET_ATTACH_TYPE_XDP_GENERIC]   = "xdpgeneric",
> +   [NET_ATTACH_TYPE_XDP_DRIVER]= "xdpdrv",
> +   [NET_ATTACH_TYPE_XDP_OFFLOAD]   = "xdpoffload",
> +};
> +
> +const size_t net_attach_type_size = ARRAY_SIZE(attach_type_strings);
> +
> +static enum net_attach_type parse_attach_type(const char *str)
> +{
> +   enum net_attach_type type;
> +
> +   for (type = 0; type < net_attach_type_size; type++) {
> +   if (attach_type_strings[type] &&
> +   is_prefix(str, attach_type_strings[type]))
> +   return type;
> +   }
> +
> +   return net_attach_type_size;
> +}
> +
>  static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb)
>  {
> struct bpf_netdev_t *netinfo = cookie;
> @@ -223,6 +252,97 @@ static int query_flow_dissector(struct bpf_attach_info 
> *attach_info)
> return 0;
>  }
>
> +static int net_parse_dev(int *argc, char ***argv)
> +{
> +   int ifindex;
> +
> +   if (is_prefix(**argv, "dev")) {
> +   NEXT_ARGP();
> +
> +   ifindex = if_nametoindex(**argv);
> +   if (!ifindex)
> +   p_err("invalid devname %s", **argv);
> +
> +   NEXT_ARGP();
> +   } else {
> +   p_err("expected 'dev', got: '%s'?", **argv);
> +   return -1;
> +   }
> +
> +   return ifindex;
> +}
> +
> +static int do_attach_detach_xdp(int progfd, enum net_attach_type attach_type,
> +   int ifindex, bool overwrite)
> +{
> +   __u32 flags = 0;
> +
> +   if (!overwrite)
> +   flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> +   if (attach_type == NET_ATTACH_TYPE_XDP_GENERIC)
> +   flags |= XDP_FLAGS_SKB_MODE;
> +   if (attach_type == NET_ATTACH_TYPE_XDP_DRIVER)
> +   flags |= XDP_FLAGS_DRV_MODE;
> +   if (attach_type == NET_ATTACH_TYPE_XDP_OFFLOAD)
> +   flags |= XDP_FLAGS_HW_MODE;
> +
> +   return bpf_set_link_xdp_fd(ifindex, progfd, flags);
> +}
> +
> +static int do_attach(int argc, char **argv)
> +{
> +   enum net_attach_type attach_type;
> +   int progfd, ifindex, err = 0;
> +   bool overwrite = false;
> +
> +   /* parse attach args */
> +   if (!REQ_ARGS(5))
> +   return -EINVAL;
> +
> +   attach_type = parse_attach_type(*argv);
> +   if (attach_type == net_attach_type_size) {
> +   p_err("invalid net attach/detach type: %s", *argv);
> +   return -EINVAL;
> +   }
> +   NEXT_ARG();
> +
> +   progfd = prog_parse_fd(&argc, &argv);
> +   if (progfd < 0)
> +   return -EINVAL;
> +
> +   ifindex = net_parse_dev(&argc, &argv);
> +   if (ifindex < 1) {
> +   close(progfd);
> +   return -EINVAL;
> +   }
> +
> +   if (argc) {
> +   if (is_prefix(*argv, "overwrite")) {
> +   overwrite = true;
> +   } else {
> +   p_err("expected 'overwrite', got: '%s'?", *argv);
> +   close(progfd);
> +   return -EINVAL;
> +   }
> +   }
> +
> +   /* attach xdp prog */
> +   if (is_prefix("xdp", attach_type_strings[attach_type]))
> +   err = do_attach_detach_xdp(progfd, attach_type, ifindex,
> +  overwrite);
> +
> +   if (err < 0) {
> +   p_err("interface %s attach failed: %s",
> + attach_type_strings[attach_type], strerror(errno));
> +   return err;
> +   }

I tried the below example,

-bash-4.4$ sudo ./bpftool net attach x pinned /sys/fs/bpf/xdp_example
dev v1
-bash-4.4$ sudo ./bpftool net attach x

Re: [v4,2/4] tools: bpftool: add net detach command to detach XDP on interface

2019-08-11 Thread Y Song
On Fri, Aug 9, 2019 at 6:35 AM Daniel T. Lee  wrote:
>
> By this commit, using `bpftool net detach`, the attached XDP prog can
> be detached. Detaching the BPF prog will be done through libbpf
> 'bpf_set_link_xdp_fd' with the progfd set to -1.
>
> Signed-off-by: Daniel T. Lee 
> ---
>  tools/bpf/bpftool/net.c | 42 -
>  1 file changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 74cc346c36cd..ef1e576c6dba 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -343,6 +343,43 @@ static int do_attach(int argc, char **argv)
> return 0;
>  }
>
> +static int do_detach(int argc, char **argv)
> +{
> +   enum net_attach_type attach_type;
> +   int progfd, ifindex, err = 0;
> +
> +   /* parse detach args */
> +   if (!REQ_ARGS(3))
> +   return -EINVAL;
> +
> +   attach_type = parse_attach_type(*argv);
> +   if (attach_type == net_attach_type_size) {
> +   p_err("invalid net attach/detach type: %s", *argv);
> +   return -EINVAL;
> +   }
> +   NEXT_ARG();
> +
> +   ifindex = net_parse_dev(&argc, &argv);
> +   if (ifindex < 1)
> +   return -EINVAL;
> +
> +   /* detach xdp prog */
> +   progfd = -1;
> +   if (is_prefix("xdp", attach_type_strings[attach_type]))
> +   err = do_attach_detach_xdp(progfd, attach_type, ifindex, 
> NULL);
> +
> +   if (err < 0) {
> +   p_err("interface %s detach failed: %s",
> + attach_type_strings[attach_type], strerror(errno));
> +   return err;
> +   }

Similar to previous patch, here we should use "strerror(-err)".
With this fixed, you can add my ack:
Acked-by: Yonghong Song 

> +
> +   if (json_output)
> +   jsonw_null(json_wtr);
> +
> +   return 0;
> +}
> +
[...]


Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames

2019-08-11 Thread David Ahern
On 8/10/19 12:30 AM, Jiri Pirko wrote:
> Could you please write me an example message of add/remove?

altnames are for existing netdevs, yes? existing netdevs have an id and
a name - 2 existing references for identifying the existing netdev for
which an altname will be added. Even using the altname as the main
'handle' for a setlink change, I see no reason why the GETLINK api can
not take an the IFLA_ALT_IFNAME and return the full details of the
device if the altname is unique.

So, what do the new RTM commands give you that you can not do with
RTM_*LINK?


Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames

2019-08-11 Thread David Ahern
On 8/11/19 7:34 PM, David Ahern wrote:
> On 8/10/19 12:30 AM, Jiri Pirko wrote:
>> Could you please write me an example message of add/remove?
> 
> altnames are for existing netdevs, yes? existing netdevs have an id and
> a name - 2 existing references for identifying the existing netdev for
> which an altname will be added. Even using the altname as the main
> 'handle' for a setlink change, I see no reason why the GETLINK api can
> not take an the IFLA_ALT_IFNAME and return the full details of the
> device if the altname is unique.
> 
> So, what do the new RTM commands give you that you can not do with
> RTM_*LINK?
> 


To put this another way, the ALT_NAME is an attribute of an object - a
LINK. It is *not* a separate object which requires its own set of
commands for manipulating.


Re: [PATCH net] ipv4/route: do not check saddr dev if iif is LOOPBACK_IFINDEX

2019-08-11 Thread David Miller
From: David Ahern 
Date: Thu, 1 Aug 2019 22:16:00 -0600

> On 8/1/19 10:13 PM, Hangbin Liu wrote:
>> On Thu, Aug 01, 2019 at 01:51:25PM -0600, David Ahern wrote:
>>> On 8/1/19 2:29 AM, Hangbin Liu wrote:
 Jianlin reported a bug that for IPv4, ip route get from src_addr would fail
 if src_addr is not an address on local system.

 \# ip route get 1.1.1.1 from 2.2.2.2
 RTNETLINK answers: Invalid argument
>>>
>>> so this is a forwarding lookup in which case iif should be set. Based on
>> 
>> with out setting iif in userspace, the kernel set iif to lo by default.
> 
> right, it presumes locally generated traffic.
>> 
>>> the above 'route get' inet_rtm_getroute is doing a lookup as if it is
>>> locally generated traffic.
>> 
>> yeah... but what about the IPv6 part. That cause a different behavior in
>> userspace.
> 
> just one of many, many annoying differences between v4 and v6. We could
> try to catalog it.

I think we just have to accept this difference because this change would
change behavior for all route lookups, not just those done by ip route get.


Re: [PATCH net] netdevsim: Restore per-network namespace accounting for fib entries

2019-08-11 Thread David Miller
From: David Ahern 
Date: Tue,  6 Aug 2019 12:15:17 -0700

> From: David Ahern 
> 
> Prior to the commit in the fixes tag, the resource controller in netdevsim
> tracked fib entries and rules per network namespace. Restore that behavior.
> 
> Fixes: 5fc494225c1e ("netdevsim: create devlink instance per netdevsim 
> instance")
> Signed-off-by: David Ahern 

Applied, thanks for bringing this to our attention and fixing it David.

Jiri, I disagree you on every single possible level.

If you didn't like how netdevsim worked in this area the opportunity to do
something about it was way back when it went in.

No matter how completely busted or disagreeable an interface is, once we have
committed it to a release (and in particular people are knowingly using and
depending upon it) you cannot break it.

It doesn't matter how much you disagree with something, you cannot break it
when it's out there and actively in use.

Do you have any idea how much stuff I'd like to break because I think the
design turned out to be completely wrong?  But I can't.


Re: [PATCHv2 net 0/2] Add netdev_level_ratelimited to avoid netdev msg flush

2019-08-11 Thread David Miller
From: Hangbin Liu 
Date: Fri,  9 Aug 2019 08:29:39 +0800

> ibmveth 3003 env3: h_multicast_ctrl rc=4 when adding an entry to the 
> filter table

You need to root cause and fix the reason this message appears so much.

Once I let you rate limit the message you will have zero incentive to
fix the real problem and fix it.


RE: [PATCH net-next v3 2/2] qed: Add driver API for flashing the config attributes.

2019-08-11 Thread Sudarsana Reddy Kalluru
> -Original Message-
> From: Ariel Elior 
> Sent: Monday, August 5, 2019 8:00 PM
> To: Sudarsana Reddy Kalluru ; David Miller
> 
> Cc: netdev@vger.kernel.org; Michal Kalderon 
> Subject: RE: [PATCH net-next v3 2/2] qed: Add driver API for flashing the
> config attributes.
> 
> > From: Sudarsana Reddy Kalluru
> > Sent: Tuesday, July 30, 2019 6:36 AM
> > To: David Miller 
> >
> > > -Original Message-
> > > From: David Miller 
> > > Sent: Monday, July 29, 2019 11:34 PM
> > > To: Sudarsana Reddy Kalluru 
> > > Cc: netdev@vger.kernel.org; Michal Kalderon
> ;
> > > Ariel Elior 
> > > Subject: Re: [PATCH net-next v3 2/2] qed: Add driver API for
> > > flashing the config attributes.
> > >
> > > From: Sudarsana Reddy Kalluru 
> > > Date: Sat, 27 Jul 2019 18:55:49 -0700
> > >
> > > > @@ -2268,6 +2330,9 @@ static int qed_nvm_flash(struct qed_dev
> > > > *cdev,
> > > const char *name)
> > > > rc = qed_nvm_flash_image_access(cdev, &data,
> > > > &check_resp);
> > > > break;
> > > > +   case QED_NVM_FLASH_CMD_NVM_CFG_ID:
> > > > +   rc = qed_nvm_flash_cfg_write(cdev, &data);
> > > > +   break;
> >
> > > > default:
> > > > DP_ERR(cdev, "Unknown command %08x\n",
> > > cmd_type);
> > >
> > > I don't see how any existing portable interface can cause this new
> > > code to actually be used.
> > >
> > > You have to explain this to me.
> > The API qed_nvm_flash() is used to flash the user provided data (e.g.,
> > Management FW) to the required partitions of the adapter.
> >- Format of the input file would be - file signature info, followed
> > by one or more data sets.
> >- Each data set is represented with the header followed by its contents.
> > Header captures info such as command name (e.g., FILE_START), data
> > size etc., which specifies how to handle the data.
> > The API qed_nvm_flash() validates the user provided input file, parses
> > the data sets and handles each accordingly. Here one of the data sets
> > (preferably the last one) could be nvm-attributes page (with cmd-id =
> > QED_NVM_FLASH_CMD_NVM_CHANGE).
> 
> This is basically an expansion of our existing ethtool -f implementation.
> The management FW has exposed an additional method of configuring some
> of the nvram options, and this makes use of that. The new code will come
> into use when newer FW files which contain configuration directives
> employing this API will be provided to ethtool -f.
> 
> thanks,
> Ariel

Dave,
The series appears as "changes requested" in patchwork. Please let us know 
if any modifications need to be incorporated on this series?

Thanks,
Sudarsana


Re: [patch net-next] netdevsim: register couple of devlink params

2019-08-11 Thread David Miller
From: Jiri Pirko 
Date: Fri,  9 Aug 2019 13:05:12 +0200

> From: Jiri Pirko 
> 
> Register couple of devlink params, one generic, one driver-specific.
> Make the values available over debugfs.
> 
> Example:
> $ echo "111" > /sys/bus/netdevsim/new_device
> $ devlink dev param
> netdevsim/netdevsim111:
>   name max_macs type generic
> values:
>   cmode driverinit value 32
>   name test1 type driver-specific
> values:
>   cmode driverinit value true
> $ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
> 32
> $ cat /sys/kernel/debug/netdevsim/netdevsim111/test1
> Y
> $ devlink dev param set netdevsim/netdevsim111 name max_macs cmode driverinit 
> value 16
> $ devlink dev param set netdevsim/netdevsim111 name test1 cmode driverinit 
> value false
> $ devlink dev reload netdevsim/netdevsim111
> $ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
> 16
> $ cat /sys/kernel/debug/netdevsim/netdevsim111/test1
> 
> Signed-off-by: Jiri Pirko 

Applied, thanks Jiri.


Re: [PATCH net-next v2 0/4] net: phy: realtek: add support for integrated 2.5Gbps PHY in RTL8125

2019-08-11 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 9 Aug 2019 20:41:58 +0200

> This series adds support for the integrated 2.5Gbps PHY in RTL8125.
> First three patches add necessary functionality to phylib.
> 
> Changes in v2:
> - added patch 1
> - changed patch 4 to use a fake PHY ID that is injected by the
>   network driver. This allows to use a dedicated PHY driver.

Series applied, thanks Heiner.


Re: [PATCH net-next] r8169: inline rtl8169_free_rx_databuff

2019-08-11 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 9 Aug 2019 22:59:07 +0200

> rtl8169_free_rx_databuff is used in only one place, so let's inline it.
> We can improve the loop because rtl8169_init_ring zero's RX_databuff
> before calling rtl8169_rx_fill, and rtl8169_rx_fill fills
> Rx_databuff starting from index 0.
> 
> Signed-off-by: Heiner Kallweit 

Applied, thanks Heiner.


Re: [PATCH net-next 0/7] net: dsa: mv88e6xxx: prepare Wait Bit operation

2019-08-11 Thread David Miller
From: Vivien Didelot 
Date: Fri,  9 Aug 2019 18:47:52 -0400

> The Remote Management Interface has its own implementation of a Wait
> Bit operation, which requires a bit number and a value to wait for.
> 
> In order to prepare the introduction of this implementation, rework the
> code waiting for bits and masks in mv88e6xxx to match this signature.
> 
> This has the benefit to unify the implementation of wait routines while
> removing obsolete wait and update functions and also reducing the code.

Series applied, thanks.


Re: [PATCH] xen-netback: no need to check return value of debugfs_create functions

2019-08-11 Thread David Miller
From: Greg Kroah-Hartman 
Date: Sat, 10 Aug 2019 12:31:08 +0200

> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: Wei Liu 
> Cc: Paul Durrant 
> Cc: xen-de...@lists.xenproject.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 

Applied to net-next.


Re: [PATCH] caif: no need to check return value of debugfs_create functions

2019-08-11 Thread David Miller
From: Greg Kroah-Hartman 
Date: Sat, 10 Aug 2019 12:42:43 +0200

> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: Richard Fontana 
> Cc: Steve Winslow 
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 

Applied.


Re: [PATCH net] mlxsw: spectrum_ptp: Keep unmatched entries in a linked list

2019-08-11 Thread David Miller
From: Ido Schimmel 
Date: Sun, 11 Aug 2019 10:48:37 +0300

> From: Petr Machata 
> 
> To identify timestamps for matching with their packets, Spectrum-1 uses a
> five-tuple of (port, direction, domain number, message type, sequence ID).
> If there are several clients from the same domain behind a single port
> sending Delay_Req's, the only thing differentiating these packets, as far
> as Spectrum-1 is concerned, is the sequence ID. Should sequence IDs between
> individual clients be similar, conflicts may arise. That is not a problem
> to hardware, which will simply deliver timestamps on a first comes, first
> served basis.
> 
> However the driver uses a simple hash table to store the unmatched pieces.
> When a new conflicting piece arrives, it pushes out the previously stored
> one, which if it is a packet, is delivered without timestamp. Later on as
> the corresponding timestamps arrive, the first one is mismatched to the
> second packet, and the second one is never matched and eventually is GCd.
> 
> To correct this issue, instead of using a simple rhashtable, use rhltable
> to keep the unmatched entries.
> 
> Previously, a found unmatched entry would always be removed from the hash
> table. That is not the case anymore--an incompatible entry is left in the
> hash table. Therefore removal from the hash table cannot be used to confirm
> the validity of the looked-up pointer, instead the lookup would simply need
> to be redone. Therefore move it inside the critical section. This
> simplifies a lot of the code.
> 
> Fixes: 8748642751ed ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, 
> SIOCSHWTSTAMP ioctls")
> Reported-by: Alex Veber 
> Signed-off-by: Petr Machata 
> Signed-off-by: Ido Schimmel 

Applied.