
Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-07-31 Thread Song Liu
Hi Andy,

> On Jul 30, 2019, at 1:20 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jul 27, 2019 at 11:20 AM Song Liu  wrote:
>> 
>> Hi Andy,
>> 
>> 
> 
> Well, yes. sys_bpf() is pretty powerful.
> 
> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> the meanwhile, such users should not take down the whole system easily
> by accident, e.g., with rm -rf /.
 
 That’s easy, though — bpftool could learn to read /etc/bpfusers before 
 allowing ruid != 0.
>>> 
>>> This is a great idea! fscaps + /etc/bpfusers should do the trick.
>> 
>> After some discussions and more thinking on this, I have some concerns
>> with the user space only approach.
>> 
>> IIUC, your proposal for user space only approach is like:
>> 
>> 1. bpftool (and other tools) check /etc/bpfusers and only do
>>   setuid for allowed users:
>> 
>>int main()
>>{
>>if (/* uid in /etc/bpfusers */)
>>setuid(0);
>>sys_bpf(...);
>>}
>> 
>> 2. bpftool (and other tools) is installed with CAP_SETUID:
>> 
>>setcap cap_setuid=e+p /bin/bpftool
>> 
> 
> You have this a bit backwards.  You wouldn't use CAP_SETUID.  You
> would use the setuid *mode* bit, i.e. chmod 4111 (or 4100 and use ACLs
> to further lock it down).  Or you could use setcap cap_sys_admin=p,
> although the details vary.  It works a bit like this:
> 
> First, if you are running with elevated privilege due to SUID or
> fscaps, the kernel and glibc offer you a degree of protection: you are
> protected from ptrace(), LD_PRELOAD, etc.  You are *not* protected
> from yourself.  For example, you may be running in a context in which
> an attacker has malicious values in your environment variables, CWD,
> etc.  So you need to carefully decide whether you are willing to run
> with elevated privilege on behalf of the user, which you learn like
> this:
> 
> uid_t real_uid = getuid();
> 
> Your decision may depend on command-line arguments as well (i.e.
> you might want to allow tracing but not filtering, say).  Once you've
> made this decision, the details vary:
> 
> For SUID, you either continue to run with euid == 0, or you drop
> privilege using something like:
> 
> if (setresuid(real_uid, real_uid, real_uid) != 0) {
> /* optionally print an error to stderr */
> exit(1);
> }
> 
> For fscaps, if you want to be privileged, you use something like
> capng_update(); capng_apply() to set CAP_SYS_ADMIN to be effective
> when you want privilege.  If you want to be unprivileged (because
> bpfusers says so, for example), you could use capng_update() to drop
> CAP_SYS_ADMIN entirely and see if the calls still work without
> privilege.  But this is a little bit awkward, since you don't directly
> know whether the user that invoked you in the first place had
> CAP_SYS_ADMIN to begin with.
> 
> In general, SUID is a bit easier to work with.
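The SUID recipe described above can be condensed into a small userspace sketch. This is only an illustration of the pattern, not bpftool code: uid_in_bpfusers() is a hypothetical stand-in for parsing /etc/bpfusers, and the stub version allows nobody, so the program always drops privilege.

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Hypothetical policy check -- a real tool would parse /etc/bpfusers;
 * this stub allows nobody, so we always take the drop-privilege path. */
static int uid_in_bpfusers(uid_t uid)
{
	(void)uid;
	return 0;
}

/* Drop real, effective, and saved uid to the invoking user's real uid.
 * Returns 0 on success; always succeeds when already unprivileged. */
static int drop_to_real_uid(void)
{
	uid_t real_uid = getuid();

	return setresuid(real_uid, real_uid, real_uid);
}

/* The SUID decision: keep euid == 0 only for allowed users. */
static int decide_privilege(void)
{
	uid_t real_uid = getuid();	/* who actually invoked us */

	if (uid_in_bpfusers(real_uid))
		return 1;		/* stay privileged; go call sys_bpf() */

	if (drop_to_real_uid() != 0)
		return -1;		/* refuse to continue half-privileged */

	return 0;			/* run on; sys_bpf() will now fail */
}
```

After decide_privilege() returns 0, all three uids equal the real uid, so the LD_PRELOAD/ptrace protections are no longer needed and sys_bpf() behaves as for an ordinary user.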

Thanks a lot for these explanations. I learned a lot today from reading
this email and Googling.

> 
>> This approach is not ideal, because we need to trust the tool to give
>> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
>> or use other root only sys calls after setuid(0).
> 
> How?  The whole SUID mechanism is designed fairly carefully to prevent
> this.  /bin/sudo is likely to be SUID on your system, but you can't
> just "hack" it to become root.

I guess "hacked" was not the right description. I should have used 
"buggy" instead. 

The SUID mechanism is great for small, solid tools like sudo, passwd,
mount, etc. The user or sysadmin can trust such a tool to always do the
right thing. On the other hand, I think SUID is not ideal for complex
tools that are under heavy development. As you mentioned, SUID doesn't
protect against oneself, so it won't protect the system from a buggy
tool that uses its elevated privilege on something it should not.

I think SUID is good for bpftool, because it is easy to use SUID
correctly there. But it is not easy to use SUID correctly in
systemd.

systemd and similar user space daemons use sys_bpf() to manage cgroup 
bpf programs and maps (BPF_MAP_TYPE_*CGROUP*, BPF_PROG_TYPE_CGROUP_*, 
BPF_CGROUP_*). The daemon runs in the background, and handles user 
requests to attach/detach BPF programs. If the daemon uses SUID or a
similar mechanism, it has to use euid == 0 for sys_bpf(). However,
such daemons are usually pretty complicated. It is not easy to prove 
the daemon won't misuse euid == 0. 

Does this make sense so far? I will discuss more about cgroup in the 
next email. 

Thanks,
Song

Re: [PATCH net-next v4 6/6] net: mscc: PTP Hardware Clock (PHC) support

2019-07-31 Thread antoine.ten...@bootlin.com
Hello Saeed,

On Fri, Jul 26, 2019 at 08:52:10PM +, Saeed Mahameed wrote:
> On Thu, 2019-07-25 at 16:27 +0200, Antoine Tenart wrote:
> >  
> > dev->stats.tx_packets++;
> > dev->stats.tx_bytes += skb->len;
> > -   dev_kfree_skb_any(skb);
> > +
> > +   if (ocelot->ptp && shinfo->tx_flags & SKBTX_HW_TSTAMP &&
> > +   port->ptp_cmd == IFH_REW_OP_TWO_STEP_PTP) {
> > +   struct ocelot_skb *oskb =
> > +   kzalloc(sizeof(struct ocelot_skb), GFP_ATOMIC);
> > +
> 
> Device drivers normally pre allocate descriptor info ring array to
> avoid dynamic atomic allocations of private data on data path.

This depends on the driver. It's a good thing to do, but I don't think
it's required here for a first version; we can improve this later.

> > +   oskb->skb = skb;
> > +   oskb->id = port->ts_id % 4;
> > +   port->ts_id++;
> 
> missing skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; ?
> see 3.1 Hardware Timestamping Implementation: Device Drivers
> https://www.kernel.org/doc/Documentation/networking/timestamping.txt

Right, I'll add this.

> > +void ocelot_get_hwtimestamp(struct ocelot *ocelot, struct timespec64
> > *ts)
> > +{
> > +   unsigned long flags;
> > +   u32 val;
> > +
> > +   spin_lock_irqsave(&ocelot->ptp_clock_lock, flags);
> > +
> > +   /* Read current PTP time to get seconds */
> > +   val = ocelot_read_rix(ocelot, PTP_PIN_CFG, TOD_ACC_PIN);
> > +
> > +   val &= ~(PTP_PIN_CFG_SYNC | PTP_PIN_CFG_ACTION_MASK |
> > PTP_PIN_CFG_DOM);
> > +   val |= PTP_PIN_CFG_ACTION(PTP_PIN_ACTION_SAVE);
> > +   ocelot_write_rix(ocelot, val, PTP_PIN_CFG, TOD_ACC_PIN);
> > +   ts->tv_sec = ocelot_read_rix(ocelot, PTP_PIN_TOD_SEC_LSB,
> > TOD_ACC_PIN);
> > +
> > +   /* Read packet HW timestamp from FIFO */
> > +   val = ocelot_read(ocelot, SYS_PTP_TXSTAMP);
> > +   ts->tv_nsec = SYS_PTP_TXSTAMP_PTP_TXSTAMP(val);
> > +
> > +   /* Sec has incremented since the ts was registered */
> > +   if ((ts->tv_sec & 0x1) != !!(val &
> > SYS_PTP_TXSTAMP_PTP_TXSTAMP_SEC))
> > +   ts->tv_sec--;
> > +
> > +   spin_unlock_irqrestore(&ocelot->ptp_clock_lock, flags);
> > +}
> > +EXPORT_SYMBOL(ocelot_get_hwtimestamp);
> > +
> 
> Why EXPORT_SYMBOL? this is the last patch and it is touching one
> driver.

Because the function is used in ocelot_board.c, which can be part of
another module (the mscc/ driver provides 2 modules). You can find
other examples of this in this driver.

> > @@ -98,7 +100,11 @@ static irqreturn_t ocelot_xtr_irq_handler(int
> > irq, void *arg)
> > int sz, len, buf_len;
> > u32 ifh[4];
> > u32 val;
> > -   struct frame_info info;
> > +   struct frame_info info = {};
> > +   struct timespec64 ts;
> > +   struct skb_shared_hwtstamps *shhwtstamps;
> > +   u64 tod_in_ns;
> > +   u64 full_ts_in_ns;
> 
> reverse xmas tree.

OK.

> > @@ -145,6 +151,22 @@ static irqreturn_t ocelot_xtr_irq_handler(int
> > irq, void *arg)
> > break;
> > }
> >  
> > +   if (ocelot->ptp) {
> > +   ocelot_ptp_gettime64(&ocelot->ptp_info, &ts);
> > +
> > +   tod_in_ns = ktime_set(ts.tv_sec, ts.tv_nsec);
> > +   if ((tod_in_ns & 0xffffffff) < info.timestamp)
> > +   full_ts_in_ns = (((tod_in_ns >> 32) -
> > 1) << 32) |
> > +   info.timestamp;
> > +   else
> > +   full_ts_in_ns = (tod_in_ns &
> > GENMASK_ULL(63, 32)) |
> > +   info.timestamp;
> > +
> > +   shhwtstamps = skb_hwtstamps(skb);
> > +   memset(shhwtstamps, 0, sizeof(struct
> > skb_shared_hwtstamps));
> > +   shhwtstamps->hwtstamp = full_ts_in_ns;
> 
> the right way to set the timestamp is by calling: 
> skb_tstamp_tx(skb, &tstamp);

I'll fix this.
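The 32-bit reconstruction in the hunk above can be checked in isolation. This is a userspace sketch of the same arithmetic as the patch, not driver code: tod_in_ns stands for the current PTP time as a 64-bit nanosecond count, and hw_ts_low for the 32 timestamp bits extracted from the IFH.

```c
#include <stdint.h>

/* Rebuild the full 64-bit RX timestamp from the 32 low bits captured
 * by hardware plus the current time-of-day.  If the low word of the
 * current time is already smaller than the captured value, the low
 * word has wrapped since capture, so borrow one from the high word. */
static uint64_t full_rx_ts_ns(uint64_t tod_in_ns, uint32_t hw_ts_low)
{
	if ((uint32_t)tod_in_ns < hw_ts_low)
		return (((tod_in_ns >> 32) - 1) << 32) | hw_ts_low;

	return (tod_in_ns & ~0xffffffffULL) | hw_ts_low;
}
```

This only works if less than 2^32 ns (about 4.3 s) elapse between the hardware capturing the timestamp and the driver reading the time-of-day, which is the same assumption the patch makes.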

> > +static irqreturn_t ocelot_ptp_rdy_irq_handler(int irq, void *arg)
> > +{
> > +   int budget = OCELOT_PTP_QUEUE_SZ;
> > +   struct ocelot *ocelot = arg;
> > +
> > +   do {
> > +   struct skb_shared_hwtstamps shhwtstamps;
> > +   struct list_head *pos, *tmp;
> > +   struct sk_buff *skb = NULL;
> > +   struct ocelot_skb *entry;
> > +   struct ocelot_port *port;
> > +   struct timespec64 ts;
> > +   u32 val, id, txport;
> > +
> > +   /* Prevent from infinite loop */
> > +   if (unlikely(!--budget))
> > +   break;
> 
> when budget gets to 1 you break, while you still have 1 to go :)
> 
> I assume OCELOT_PTP_QUEUE_SZ > 0, just make this the loop condition and
> avoid infinite loops by design.

That's right, I'll fix it :)
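The off-by-one can be seen in a tiny userspace model (a sketch, not the driver loop): the `!--budget` form exits with one unit of budget unspent, while making the budget the loop condition processes exactly `budget` items.

```c
/* Mirrors the patch: break out when --budget hits zero.
 * With a budget of N this handles only N - 1 items. */
static int process_with_break(int budget)
{
	int processed = 0;

	do {
		if (!--budget)
			break;
		processed++;	/* handle one timestamp */
	} while (1);

	return processed;
}

/* Suggested shape: the budget is the loop condition, so all N
 * items are handled and an endless loop is impossible by design. */
static int process_with_condition(int budget)
{
	int processed = 0;

	while (budget-- > 0)
		processed++;	/* handle one timestamp */

	return processed;
}
```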

> >  static int mscc_ocelot_probe(struct platform_device *pdev)
> >  {
> > -   int err, irq;
> > unsigned int i;
> > +   int err, irq_xtr, irq_ptp_rdy;
> > struct device_node *np = pdev->dev.of_node;
> > struct device_

Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-07-31 Thread Song Liu



> On Jul 30, 2019, at 1:24 PM, Andy Lutomirski  wrote:
> 
> On Mon, Jul 29, 2019 at 10:07 PM Song Liu  wrote:
>> 
>> Hi Andy,
>> 
>>> On Jul 27, 2019, at 11:20 AM, Song Liu  wrote:
>>> 
>>> Hi Andy,
>>> 
>>> 

[...]

>>> 
>> 
>> I would like more comments on this.
>> 
>> Currently, bpf permission is more or less "root or nothing", which we
>> would like to change.
>> 
>> The short term goal is to separate bpf from root, in other words, it is
>> "all or nothing". Special user space utilities, such as systemd, would
>> benefit from this. Once this is implemented, systemd can call sys_bpf()
>> when it is not running as root.
> 
> As generally nasty as Linux capabilities are, this sounds like a good
> use for CAP_BPF_ADMIN.

I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make 
existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

> 
> But what do you have in mind?  Isn't non-root systemd mostly just the
> user systemd session?  That should *not* have bpf() privileges until
> bpf() is improved such that you can't use it to compromise the system.

cgroup bpf is the major use case here. A less important use case is to 
run bpf selftests without being root. 

> 
>> 
>> In longer term, it may be useful to provide finer grain permission of
>> sys_bpf(). For example, sys_bpf() should be aware of containers; and
>> user may only have access to certain bpf maps. Let's call this
>> "fine grain" capability.
>> 
>> 
>> Since we are seeing new use cases every year, we will need many
>> iterations to implement the fine grain permission. I think we need an
>> API that is flexible enough to cover different types of permission
>> control.
>> 
>> For example, bpf_with_cap() can be flexible:
>> 
>>bpf_with_cap(cmd, attr, size, perm_fd);
>> 
>> We can get different types of permission via different combinations of
>> arguments:
>> 
>>A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>>this is "all or nothing" permission.
>> 
>>A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>>commands to this specific cgroup.
>> 
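The two permission modes could then be requested roughly like this (pseudocode in C syntax; bpf_with_cap() and both perm_fd sources are the API being proposed here, not an existing interface, and the cgroup path is elided as in the proposal):

```
/* "All or nothing": an fd to /dev/bpf unlocks every bpf command. */
perm_fd = open("/dev/bpf", O_RDWR);
err = bpf_with_cap(BPF_PROG_LOAD, &attr, sizeof(attr), perm_fd);

/* "Fine grain": a cgroup-scoped fd would only allow commands that
 * target that specific cgroup. */
perm_fd = open("/sys/fs/cgroup/.../bpf.xxx", O_RDONLY);
err = bpf_with_cap(BPF_PROG_ATTACH, &attr, sizeof(attr), perm_fd);
```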
> 
> I don't see why you need to invent a whole new mechanism for this.
> The entire cgroup ecosystem outside bpf() does just fine using the
> write permission on files in cgroupfs to control access.  Why can't
> bpf() do the same thing?

It is easier to use write permission for BPF_PROG_ATTACH. But it is
not easy to do the same for other bpf commands: BPF_PROG_LOAD and
BPF_MAP_*. Many of these commands don't have a target concept. Maybe
we should add a target concept to all of them, but that is a much
bigger project. OTOH, the "all or nothing" model allows all these
commands at once.

Well, that being said, I will look more into using write permission 
in cgroupfs. 

Thanks again for all these comments and suggestions. Please let us 
know your future thoughts and insights. 

Song


[PATCH net-next 0/6] flow_offload: add indr-block in nf_table_offload

2019-07-31 Thread wenxu
From: wenxu 

This patch series makes nftables support vlan and tunnel device
offload through the indr-block architecture.

The first four patches move the tc indr block infrastructure to
flow_offload and rename it to flow-indr-block.
Because the new flow-indr-block can't get the tcf_block directly,
the fifth patch provides a callback list to get the flow_block of
each subsystem immediately when a device registers and contains a
block.
The last patch makes nf_tables_offload support flow-indr-block.

wenxu (6):
  cls_api: modify the tc_indr_block_ing_cmd parameters.
  cls_api: replace block with flow_block in tc_indr_block_dev
  cls_api: add flow_indr_block_call function
  flow_offload: move tc indirect block to flow offload
  flow_offload: support get flow_block immediately
  netfilter: nf_tables_offload: support indr block call

 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  10 +-
 .../net/ethernet/netronome/nfp/flower/offload.c|  11 +-
 include/net/flow_offload.h |  48 
 include/net/netfilter/nf_tables_offload.h  |   2 +
 include/net/pkt_cls.h  |  35 ---
 include/net/sch_generic.h  |   3 -
 net/core/flow_offload.c| 251 
 net/netfilter/nf_tables_api.c  |   7 +
 net/netfilter/nf_tables_offload.c  | 156 +++--
 net/sched/cls_api.c| 255 -
 10 files changed, 497 insertions(+), 281 deletions(-)

-- 
1.8.3.1



[PATCH net-next 4/6] flow_offload: move tc indirect block to flow offload

2019-07-31 Thread wenxu
From: wenxu 

Move the tc indirect block infrastructure to flow_offload and rename
it to flow indirect block, so that nf_tables can use the indr block
architecture.

Signed-off-by: wenxu 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  10 +-
 .../net/ethernet/netronome/nfp/flower/offload.c|  11 +-
 include/net/flow_offload.h |  31 +++
 include/net/pkt_cls.h  |  35 ---
 include/net/sch_generic.h  |   3 -
 net/core/flow_offload.c| 218 +++
 net/sched/cls_api.c| 241 +
 7 files changed, 265 insertions(+), 284 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 7f747cb..074573b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -785,9 +785,9 @@ static int mlx5e_rep_indr_register_block(struct 
mlx5e_rep_priv *rpriv,
 {
int err;
 
-   err = __tc_indr_block_cb_register(netdev, rpriv,
- mlx5e_rep_indr_setup_tc_cb,
- rpriv);
+   err = __flow_indr_block_cb_register(netdev, rpriv,
+   mlx5e_rep_indr_setup_tc_cb,
+   rpriv);
if (err) {
struct mlx5e_priv *priv = netdev_priv(rpriv->netdev);
 
@@ -800,8 +800,8 @@ static int mlx5e_rep_indr_register_block(struct 
mlx5e_rep_priv *rpriv,
 static void mlx5e_rep_indr_unregister_block(struct mlx5e_rep_priv *rpriv,
struct net_device *netdev)
 {
-   __tc_indr_block_cb_unregister(netdev, mlx5e_rep_indr_setup_tc_cb,
- rpriv);
+   __flow_indr_block_cb_unregister(netdev, mlx5e_rep_indr_setup_tc_cb,
+   rpriv);
 }
 
 static int mlx5e_nic_rep_netdevice_event(struct notifier_block *nb,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index e209f15..7b490db 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1479,16 +1479,17 @@ int nfp_flower_reg_indir_block_handler(struct nfp_app 
*app,
return NOTIFY_OK;
 
if (event == NETDEV_REGISTER) {
-   err = __tc_indr_block_cb_register(netdev, app,
- nfp_flower_indr_setup_tc_cb,
- app);
+   err = __flow_indr_block_cb_register(netdev, app,
+   nfp_flower_indr_setup_tc_cb,
+   app);
if (err)
nfp_flower_cmsg_warn(app,
 "Indirect block reg failed - %s\n",
 netdev->name);
} else if (event == NETDEV_UNREGISTER) {
-   __tc_indr_block_cb_unregister(netdev,
- nfp_flower_indr_setup_tc_cb, app);
+   __flow_indr_block_cb_unregister(netdev,
+   nfp_flower_indr_setup_tc_cb,
+   app);
}
 
return NOTIFY_OK;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 00b9aab..c8d60a6 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct flow_match {
struct flow_dissector   *dissector;
@@ -366,4 +367,34 @@ static inline void flow_block_init(struct flow_block 
*flow_block)
INIT_LIST_HEAD(&flow_block->cb_list);
 }
 
+typedef int flow_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
+ enum tc_setup_type type, void *type_data);
+
+typedef void flow_indr_block_ing_cmd_t(struct net_device *dev,
+  struct flow_block *flow_block,
+  flow_indr_block_bind_cb_t *cb,
+  void *cb_priv,
+  enum flow_block_command command);
+
+int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb,
+ void *cb_ident);
+
+void __flow_indr_block_cb_unregister(struct net_device *dev,
+flow_indr_block_bind_cb_t *cb,
+void *cb_ident);
+
+int flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+   flow_indr_block_bind_cb_t *cb, void *cb_ident);
+
+void flow_indr_block_c

[PATCH net-next 2/6] cls_api: replace block with flow_block in tc_indr_block_dev

2019-07-31 Thread wenxu
From: wenxu 

This patch makes tc_indr_block_dev independent of the tc subsystem.

Signed-off-by: wenxu 
---
 net/sched/cls_api.c | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2e3b58d..f9643fa 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -574,7 +574,7 @@ struct tc_indr_block_dev {
struct net_device *dev;
unsigned int refcnt;
struct list_head cb_list;
-   struct tcf_block *block;
+   struct flow_block *flow_block;
 };
 
 struct tc_indr_block_cb {
@@ -597,6 +597,14 @@ struct tc_indr_block_cb {
  tc_indr_setup_block_ht_params);
 }
 
+static void tc_indr_get_default_block(struct tc_indr_block_dev *indr_dev)
+{
+   struct tcf_block *block = tc_dev_ingress_block(indr_dev->dev);
+
+   if (block)
+   indr_dev->flow_block = &block->flow_block;
+}
+
 static struct tc_indr_block_dev *tc_indr_block_dev_get(struct net_device *dev)
 {
struct tc_indr_block_dev *indr_dev;
@@ -611,7 +619,7 @@ static struct tc_indr_block_dev 
*tc_indr_block_dev_get(struct net_device *dev)
 
INIT_LIST_HEAD(&indr_dev->cb_list);
indr_dev->dev = dev;
-   indr_dev->block = tc_dev_ingress_block(dev);
+   tc_indr_get_default_block(indr_dev);
if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
   tc_indr_setup_block_ht_params)) {
kfree(indr_dev);
@@ -678,11 +686,14 @@ static int tcf_block_setup(struct tcf_block *block,
   struct flow_block_offload *bo);
 
 static void tc_indr_block_ing_cmd(struct net_device *dev,
- struct tcf_block *block,
+ struct flow_block *flow_block,
  tc_indr_block_bind_cb_t *cb,
  void *cb_priv,
  enum flow_block_command command)
 {
+   struct tcf_block *block = flow_block ? container_of(flow_block,
+   struct tcf_block,
+   flow_block) : NULL;
struct flow_block_offload bo = {
.command= command,
.binder_type= FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS,
@@ -694,7 +705,7 @@ static void tc_indr_block_ing_cmd(struct net_device *dev,
if (!block)
return;
 
-   bo.block = &block->flow_block;
+   bo.block = flow_block;
 
cb(dev, cb_priv, TC_SETUP_BLOCK, &bo);
 
@@ -717,7 +728,7 @@ int __tc_indr_block_cb_register(struct net_device *dev, 
void *cb_priv,
if (err)
goto err_dev_put;
 
-   tc_indr_block_ing_cmd(dev, indr_dev->block, cb, cb_priv,
+   tc_indr_block_ing_cmd(dev, indr_dev->flow_block, cb, cb_priv,
  FLOW_BLOCK_BIND);
return 0;
 
@@ -750,13 +761,14 @@ void __tc_indr_block_cb_unregister(struct net_device *dev,
if (!indr_dev)
return;
 
	indr_block_cb = tc_indr_block_cb_lookup(indr_dev, cb, cb_ident);
if (!indr_block_cb)
return;
 
/* Send unbind message if required to free any block cbs. */
-   tc_indr_block_ing_cmd(dev, indr_dev->block, cb, indr_block_cb->cb_priv,
- FLOW_BLOCK_UNBIND);
+   tc_indr_block_ing_cmd(dev, indr_dev->flow_block, indr_block_cb->cb,
+ indr_block_cb->cb_priv, FLOW_BLOCK_UNBIND);
tc_indr_block_cb_del(indr_block_cb);
tc_indr_block_dev_put(indr_dev);
 }
@@ -792,7 +804,8 @@ static void tc_indr_block_call(struct tcf_block *block, 
struct net_device *dev,
if (!indr_dev)
return;
 
-   indr_dev->block = command == FLOW_BLOCK_BIND ? block : NULL;
+   indr_dev->flow_block = command == FLOW_BLOCK_BIND ?
+ &block->flow_block : NULL;
 
list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
-- 
1.8.3.1



[PATCH net-next 1/6] cls_api: modify the tc_indr_block_ing_cmd parameters.

2019-07-31 Thread wenxu
From: wenxu 

This patch makes tc_indr_block_ing_cmd stop accessing struct
tc_indr_block_dev and struct tc_indr_block_cb directly.

Signed-off-by: wenxu 
---
 net/sched/cls_api.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3565d9a..2e3b58d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -677,26 +677,28 @@ static void tc_indr_block_cb_del(struct tc_indr_block_cb 
*indr_block_cb)
 static int tcf_block_setup(struct tcf_block *block,
   struct flow_block_offload *bo);
 
-static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
- struct tc_indr_block_cb *indr_block_cb,
+static void tc_indr_block_ing_cmd(struct net_device *dev,
+ struct tcf_block *block,
+ tc_indr_block_bind_cb_t *cb,
+ void *cb_priv,
  enum flow_block_command command)
 {
struct flow_block_offload bo = {
.command= command,
.binder_type= FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS,
-   .net= dev_net(indr_dev->dev),
-   .block_shared   = tcf_block_non_null_shared(indr_dev->block),
+   .net= dev_net(dev),
+   .block_shared   = tcf_block_non_null_shared(block),
};
INIT_LIST_HEAD(&bo.cb_list);
 
-   if (!indr_dev->block)
+   if (!block)
return;
 
-   bo.block = &indr_dev->block->flow_block;
+   bo.block = &block->flow_block;
 
-   indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
- &bo);
-   tcf_block_setup(indr_dev->block, &bo);
+   cb(dev, cb_priv, TC_SETUP_BLOCK, &bo);
+
+   tcf_block_setup(block, &bo);
 }
 
 int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
@@ -715,7 +717,8 @@ int __tc_indr_block_cb_register(struct net_device *dev, 
void *cb_priv,
if (err)
goto err_dev_put;
 
-   tc_indr_block_ing_cmd(indr_dev, indr_block_cb, FLOW_BLOCK_BIND);
+   tc_indr_block_ing_cmd(dev, indr_dev->block, cb, cb_priv,
+ FLOW_BLOCK_BIND);
return 0;
 
 err_dev_put:
@@ -752,7 +755,8 @@ void __tc_indr_block_cb_unregister(struct net_device *dev,
return;
 
/* Send unbind message if required to free any block cbs. */
-   tc_indr_block_ing_cmd(indr_dev, indr_block_cb, FLOW_BLOCK_UNBIND);
+   tc_indr_block_ing_cmd(dev, indr_dev->block, cb, indr_block_cb->cb_priv,
+ FLOW_BLOCK_UNBIND);
tc_indr_block_cb_del(indr_block_cb);
tc_indr_block_dev_put(indr_dev);
 }
-- 
1.8.3.1



[PATCH net-next 6/6] netfilter: nf_tables_offload: support indr block call

2019-07-31 Thread wenxu
From: wenxu 

Make nftables support the indr-block call, so that nftables can
offload to vlan and tunnel devices.

nft add table netdev firewall
nft add chain netdev firewall aclout { type filter hook ingress offload device 
mlx_pf0vf0 priority - 300 \; }
nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0
nft add chain netdev firewall aclin { type filter hook ingress device vlan0 
priority - 300 \; }
nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0

Signed-off-by: wenxu 
---
 include/net/netfilter/nf_tables_offload.h |   2 +
 net/netfilter/nf_tables_api.c |   7 ++
 net/netfilter/nf_tables_offload.c | 156 +-
 3 files changed, 141 insertions(+), 24 deletions(-)

diff --git a/include/net/netfilter/nf_tables_offload.h 
b/include/net/netfilter/nf_tables_offload.h
index 3196663..ac69087 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -63,6 +63,8 @@ struct nft_flow_rule {
 struct nft_flow_rule *nft_flow_rule_create(const struct nft_rule *rule);
 void nft_flow_rule_destroy(struct nft_flow_rule *flow);
 int nft_flow_rule_offload_commit(struct net *net);
+bool nft_indr_get_default_block(struct net_device *dev,
+   struct flow_indr_block_info *info);
 
 #define NFT_OFFLOAD_MATCH(__key, __base, __field, __len, __reg)
\
(__reg)->base_offset=   \
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 605a7cf..6a1d0b2 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -7593,6 +7593,11 @@ static void __net_exit nf_tables_exit_net(struct net 
*net)
.exit   = nf_tables_exit_net,
 };
 
+static struct flow_indr_get_block_entry get_block_entry = {
+   .get_block_cb = nft_indr_get_default_block,
+   .list = LIST_HEAD_INIT(get_block_entry.list),
+};
+
 static int __init nf_tables_module_init(void)
 {
int err;
@@ -7624,6 +7629,7 @@ static int __init nf_tables_module_init(void)
goto err5;
 
nft_chain_route_init();
+   flow_indr_add_default_block_cb(&get_block_entry);
return err;
 err5:
rhltable_destroy(&nft_objname_ht);
@@ -7640,6 +7646,7 @@ static int __init nf_tables_module_init(void)
 
 static void __exit nf_tables_module_exit(void)
 {
+   flow_indr_del_default_block_cb(&get_block_entry);
nfnetlink_subsys_unregister(&nf_tables_subsys);
unregister_netdevice_notifier(&nf_tables_flowtable_notifier);
nft_chain_filter_fini();
diff --git a/net/netfilter/nf_tables_offload.c 
b/net/netfilter/nf_tables_offload.c
index 64f5fd5..59c9629 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -171,24 +171,114 @@ static int nft_flow_offload_unbind(struct 
flow_block_offload *bo,
return 0;
 }
 
+static int nft_block_setup(struct nft_base_chain *basechain,
+  struct flow_block_offload *bo,
+  enum flow_block_command cmd)
+{
+   int err;
+
+   switch (cmd) {
+   case FLOW_BLOCK_BIND:
+   err = nft_flow_offload_bind(bo, basechain);
+   break;
+   case FLOW_BLOCK_UNBIND:
+   err = nft_flow_offload_unbind(bo, basechain);
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   err = -EOPNOTSUPP;
+   }
+
+   return err;
+}
+
+static int nft_block_offload_cmd(struct nft_base_chain *chain,
+struct net_device *dev,
+enum flow_block_command cmd)
+{
+   struct netlink_ext_ack extack = {};
+   struct flow_block_offload bo = {};
+   int err;
+
+   bo.net = dev_net(dev);
+   bo.block = &chain->flow_block;
+   bo.command = cmd;
+   bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
+   bo.extack = &extack;
+   INIT_LIST_HEAD(&bo.cb_list);
+
+   err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
+   if (err < 0)
+   return err;
+
+   return nft_block_setup(chain, &bo, cmd);
+}
+
+static void nft_indr_block_ing_cmd(struct net_device *dev,
+  struct flow_block *flow_block,
+  flow_indr_block_bind_cb_t *cb,
+  void *cb_priv,
+  enum flow_block_command cmd)
+{
+   struct netlink_ext_ack extack = {};
+   struct flow_block_offload bo = {};
+   struct nft_base_chain *chain;
+
+   if (!flow_block)
+   return;
+
+   chain = container_of(flow_block, struct nft_base_chain, flow_block);
+
+   bo.net = dev_net(dev);
+   bo.block = flow_block;
+   bo.command = cmd;
+   bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
+   bo.extack = &extack;
+   INIT_LIST_HEAD(&bo.cb_list);
+
+   cb(dev, cb_priv, 

[PATCH net-next 5/6] flow_offload: support get flow_block immediately

2019-07-31 Thread wenxu
From: wenxu 

The new flow-indr-block can't get the tcf_block directly, so provide
a callback list to find the flow_block immediately when a device
registers and contains an ingress block.

Signed-off-by: wenxu 
---
 include/net/flow_offload.h | 17 +
 net/core/flow_offload.c| 33 +
 net/sched/cls_api.c| 44 
 3 files changed, 94 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index c8d60a6..db04e3f 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -376,6 +376,23 @@ typedef void flow_indr_block_ing_cmd_t(struct net_device 
*dev,
   void *cb_priv,
   enum flow_block_command command);
 
+struct flow_indr_block_info {
+   struct flow_block *flow_block;
+   flow_indr_block_ing_cmd_t *ing_cmd_cb;
+};
+
+typedef bool flow_indr_get_default_block_t(struct net_device *dev,
+   struct flow_indr_block_info *info);
+
+struct flow_indr_get_block_entry {
+   flow_indr_get_default_block_t *get_block_cb;
+   struct list_headlist;
+};
+
+void flow_indr_add_default_block_cb(struct flow_indr_get_block_entry *entry);
+
+void flow_indr_del_default_block_cb(struct flow_indr_get_block_entry *entry);
+
 int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
  flow_indr_block_bind_cb_t *cb,
  void *cb_ident);
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index a1fdfa4..8ff7a75b 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -282,6 +282,8 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
 }
 EXPORT_SYMBOL(flow_block_cb_setup_simple);
 
+static LIST_HEAD(get_default_block_cb_list);
+
 static struct rhashtable indr_setup_block_ht;
 
 struct flow_indr_block_cb {
@@ -313,6 +315,24 @@ struct flow_indr_block_dev {
  flow_indr_setup_block_ht_params);
 }
 
+static void flow_get_default_block(struct flow_indr_block_dev *indr_dev)
+{
+   struct flow_indr_get_block_entry *entry_cb;
+   struct flow_indr_block_info info;
+
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(entry_cb, &get_default_block_cb_list, list) {
+   if (entry_cb->get_block_cb(indr_dev->dev, &info)) {
+   indr_dev->flow_block = info.flow_block;
+   indr_dev->ing_cmd_cb = info.ing_cmd_cb;
+   break;
+   }
+   }
+
+   rcu_read_unlock();
+}
+
 static struct flow_indr_block_dev *
 flow_indr_block_dev_get(struct net_device *dev)
 {
@@ -328,6 +348,7 @@ struct flow_indr_block_dev {
 
INIT_LIST_HEAD(&indr_dev->cb_list);
indr_dev->dev = dev;
+   flow_get_default_block(indr_dev);
if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
   flow_indr_setup_block_ht_params)) {
kfree(indr_dev);
@@ -492,6 +513,18 @@ void flow_indr_block_call(struct flow_block *flow_block,
 }
 EXPORT_SYMBOL_GPL(flow_indr_block_call);
 
+void flow_indr_add_default_block_cb(struct flow_indr_get_block_entry *entry)
+{
+   list_add_tail_rcu(&entry->list, &get_default_block_cb_list);
+}
+EXPORT_SYMBOL_GPL(flow_indr_add_default_block_cb);
+
+void flow_indr_del_default_block_cb(struct flow_indr_get_block_entry *entry)
+{
+   list_del_rcu(&entry->list);
+}
+EXPORT_SYMBOL_GPL(flow_indr_del_default_block_cb);
+
 static int __init init_flow_indr_rhashtable(void)
 {
return rhashtable_init(&indr_setup_block_ht,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index bd5e591..8bf918c 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -576,6 +576,43 @@ static void tc_indr_block_ing_cmd(struct net_device *dev,
tcf_block_setup(block, &bo);
 }
 
+static struct tcf_block *tc_dev_ingress_block(struct net_device *dev)
+{
+   const struct Qdisc_class_ops *cops;
+   struct Qdisc *qdisc;
+
+   if (!dev_ingress_queue(dev))
+   return NULL;
+
+   qdisc = dev_ingress_queue(dev)->qdisc_sleeping;
+   if (!qdisc)
+   return NULL;
+
+   cops = qdisc->ops->cl_ops;
+   if (!cops)
+   return NULL;
+
+   if (!cops->tcf_block)
+   return NULL;
+
+   return cops->tcf_block(qdisc, TC_H_MIN_INGRESS, NULL);
+}
+
+static bool tc_indr_get_default_block(struct net_device *dev,
+ struct flow_indr_block_info *info)
+{
+   struct tcf_block *block = tc_dev_ingress_block(dev);
+
+   if (block) {
+   info->flow_block = &block->flow_block;
+   info->ing_cmd_cb = tc_indr_block_ing_cmd;
+
+   return true;
+   }
+
+   return false;
+}
+
 static void tc_indr_block_call(struct tcf_block *block, str

[PATCH net-next 3/6] cls_api: add flow_indr_block_call function

2019-07-31 Thread wenxu
From: wenxu 

This patch makes tc_indr_block_call() stop accessing struct tc_indr_block_cb
and struct tc_indr_block_dev directly

Signed-off-by: wenxu 
---
 net/sched/cls_api.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index f9643fa..617b098 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -783,13 +783,30 @@ void tc_indr_block_cb_unregister(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(tc_indr_block_cb_unregister);
 
+static void flow_indr_block_call(struct flow_block *flow_block,
+struct net_device *dev,
+struct flow_block_offload *bo,
+enum flow_block_command command)
+{
+   struct tc_indr_block_cb *indr_block_cb;
+   struct tc_indr_block_dev *indr_dev;
+
+   indr_dev = tc_indr_block_dev_lookup(dev);
+   if (!indr_dev)
+   return;
+
+   indr_dev->flow_block = command == FLOW_BLOCK_BIND ? flow_block : NULL;
+
+   list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
+   indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
+ bo);
+}
+
 static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
   struct tcf_block_ext_info *ei,
   enum flow_block_command command,
   struct netlink_ext_ack *extack)
 {
-   struct tc_indr_block_cb *indr_block_cb;
-   struct tc_indr_block_dev *indr_dev;
struct flow_block_offload bo = {
.command= command,
.binder_type= ei->binder_type,
@@ -800,17 +817,7 @@ static void tc_indr_block_call(struct tcf_block *block, 
struct net_device *dev,
};
INIT_LIST_HEAD(&bo.cb_list);
 
-   indr_dev = tc_indr_block_dev_lookup(dev);
-   if (!indr_dev)
-   return;
-
-   indr_dev->flow_block = command == FLOW_BLOCK_BIND ?
- &block->flow_block : NULL;
-
-   list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
-   indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
- &bo);
-
+   flow_indr_block_call(&block->flow_block, dev, &bo, command);
tcf_block_setup(block, &bo);
 }
 
-- 
1.8.3.1



[PATCH] net/ethernet/qlogic/qed: force the string buffer NULL-terminated

2019-07-31 Thread Wang Xiayang
strncpy() does not ensure NULL-termination when the input string
size equals the destination buffer size of 30.
The output string is passed to qed_int_deassertion_aeu_bit(),
which calls DP_INFO() and relies on NULL-termination.

Use strlcpy() instead. The other conditional branch above strncpy()
needs no fix, as snprintf() ensures NULL-termination.

This issue is identified by a Coccinelle script.

Signed-off-by: Wang Xiayang 
---
 drivers/net/ethernet/qlogic/qed/qed_int.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c 
b/drivers/net/ethernet/qlogic/qed/qed_int.c
index 4e8118a08654..9f5113639eaf 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.c
@@ -1093,7 +1093,7 @@ static int qed_int_deassertion(struct qed_hwfn  *p_hwfn,
snprintf(bit_name, 30,
 p_aeu->bit_name, num);
else
-   strncpy(bit_name,
+   strlcpy(bit_name,
p_aeu->bit_name, 30);
 
/* We now need to pass bitmask in its
-- 
2.11.0



Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm

2019-07-31 Thread Song Liu



> On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko  
> wrote:
> 
> On Tue, Jul 30, 2019 at 10:19 PM Song Liu  wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko  
>>> wrote:
>>> 
>>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu  wrote:
 
 
 
> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko  wrote:
> 
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
> 
> Signed-off-by: Andrii Nakryiko 

[...]

 
>>> 
>>> I just picked the most succinct and non-repetitive form. It's
>>> immediately apparent which type it's implicitly converted to, so I
>>> felt there is no need to repeat it. Also, just (void *) is much
>>> shorter. :)
>> 
>> _All_ other code in btf.c converts the pointer to the target type.
> 
> Most in libbpf.c doesn't, though. Also, I try to preserve pointer
> constness for uses that don't modify BTF types (pretty much all of
> them in libbpf), so it becomes really verbose, despite extremely short
> variable names:
> 
> const struct btf_member *m = (const struct btf_member *)(t + 1);

I don't think being verbose is a big problem here. Overusing 
(void *) feels like a bigger problem. 

> 
> Add one or two levels of nestedness and you are wrapping this line.
> 
>> In some cases, it is not apparent which type it is converted to,
>> for example:
>> 
>> +   m = (void *)(targ_type + 1);
>> 
>> I would suggest we do implicit conversion whenever possible.
> 
> Implicit conversion (`m = targ_type + 1;`) is a compilation error,
> that won't work.

I misused "implicit" here. I actually meant to say

m = (const struct btf_member *)(t + 1);





Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP

2019-07-31 Thread Jesper Dangaard Brouer
On Tue, 30 Jul 2019 18:52:40 -0700
Alexei Starovoitov  wrote:

> On Tue, Jul 30, 2019 at 06:21:44PM -0700, Jakub Kicinski wrote:
> > > > Duplicating the same features in bpftool will only diminish the
> > > > incentive for moving iproute2 to libbpf. 
> > > 
> > > not at all. why do you think so?  
> > 
> > Because iproute2's BPF has fallen behind so the simplest thing is to
> > just contribute to bpftool. But iproute2 is the tool set for Linux
> > networking, we can't let it bit rot :(  
> 
> where were you when a lot of libbpf was copy pasted into iproute2 ?!
> Now it diverged a lot and it's difficult to move iproute2 back to the main
> train which is libbpf.

I hope we all agree that libbpf is the way forward.  I really hope we
can convert iproute2 into using libbpf.  It is challenging because
iproute2 ELF files use another map-layout, and I guess we need to stay
backward compatible.

> Same thing with at least 5 copy-pastes of samples/bpf/bpf_load.c
> that spawned a bunch of different bpf loaders.

I'm also to blame here... as I ran with bpf_load.c and based my
prototype-kernel github project on it.  And it seems that it has
gotten enough Google entropy, that people find that first. E.g. I have
pull requests that add new tools, which I have not merged, as I want to
switch to libbpf first.

Recently I've added a README[1] that warn about this issue, and point
people to xdp-tutorial[2] instead.

[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/README.rst
[2] https://github.com/xdp-project/xdp-tutorial

 Quote[1]:
   "WARNING: This directory contains its own out-of-date BPF
ELF-loader (in bpf_load.c). Instead people should use libbpf."

> > IMHO vaguely competent users of Linux networking will know ip link.
> > If they are not vaguely competent, they should not attach XDP programs
> > to interfaces by hand...  
> 
> I'm a prime example of moderately competent linux user who
> doesn't know iproute2. I'm not joking.
> I don't know tc syntax and always use my cheat sheet of ip/tc commands
> to do anything.

I agree... I also need to lookup the TC commands used for attaching BPF
programs every time.  It is a pain, and I've often made mistakes that
cause several tc filter's to get attached, which leads to strange
behavior.  (AFAIK you have to remember to use both prio and handle when
removing the old, before adding the new).

 
> > > 
> > > bpftool must be able to introspect every aspect of bpf programming.
> > > That includes detaching and attaching anywhere.
> > > Anyone doing 'bpftool p s' should be able to switch off particular
> > > prog id without learning ten different other tools.  
> > 
> > I think the fact that we already have an implementation in iproute2,
> > which is at the risk of bit rot is more important to me that the
> > hypothetical scenario where everyone knows to just use bpftool (for
> > XDP, for TC it's still iproute2 unless there's someone crazy enough 
> > to reimplement the TC functionality :))  
> 
> I think you're missing the point that iproute2 is still necessary
> to configure it.
> bpftool should be able to attach/detach from anything.
> But configuring that thing (whether it's cgroup or tc/xdp) is
> a job of corresponding apis and tools.
> 
> > I'm not sure we can settle our differences over email :)
> > I have tremendous respect for all the maintainers I CCed here, 
> > if nobody steps up to agree with me I'll concede the bpftool net
> > battle entirely :)  
> 
> we can keep arguing forever. Respect to ease-of-use only comes
> from the pain of operational experience. I don't think I can
> convey that pain in the email either.

Let me try to convey some pain...

Regarding operational experience.  I have a customer using this:
 https://github.com/xdp-project/xdp-cpumap-tc

The project combines TC and XDP (with CPUMAP redirect).  It has been a
*HUGE* pain that we needed to use iproute2 tc commands to attach and
load TC BPF-programs.

(Issue#1)
The iproute2 tc BPF loader uses another ELF-layout for maps.  This was
worked around, by keeping XDP and TC BPF-progs in different C-file
using different struct's for maps.  And avoiding to use map features,
that iproute2 doesn't know about... although we found a workaround for
using more advanced map features via load-order and pinning see issue#3
sub-problem(2).

(Issue#2)
As this is a C-prog I would really prefer using a library, instead of
having to call cmdline utilities, see C-code doing this[3].  This led
to several issues.  E.g. they had installed too old version of iproute2
on their systems, after installing newer version (in /usr/local) it was
not first in $PATH for root.  Load order also matters see issue#3.

[3] 
https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/common_user.c#L381-L493

(Issue#3)
Next to get TC and XDP programs to cooperate, pinned maps via bpffs is
used.  And iproute2 dictates that pinned maps are located in directory
/sys/fs/bpf/tc/globals/.  What could go

Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface

2019-07-31 Thread Jesper Dangaard Brouer
On Wed, 31 Jul 2019 03:48:20 +0900
"Daniel T. Lee"  wrote:

> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 67e99c56bc88..d3a4f18b5b95 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -55,6 +55,35 @@ struct bpf_attach_info {
>   __u32 flow_dissector_id;
>  };
>  
> +enum net_load_type {
> + NET_LOAD_TYPE_XDP,
> + NET_LOAD_TYPE_XDP_GENERIC,
> + NET_LOAD_TYPE_XDP_DRIVE,

Why "DRIVE" ?
Why not "DRIVER" ?

> + NET_LOAD_TYPE_XDP_OFFLOAD,
> + __MAX_NET_LOAD_TYPE
> +};
> +
> +static const char * const load_type_strings[] = {
> + [NET_LOAD_TYPE_XDP] = "xdp",
> + [NET_LOAD_TYPE_XDP_GENERIC] = "xdpgeneric",
> + [NET_LOAD_TYPE_XDP_DRIVE] = "xdpdrv",
> + [NET_LOAD_TYPE_XDP_OFFLOAD] = "xdpoffload",
> + [__MAX_NET_LOAD_TYPE] = NULL,
> +};



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface

2019-07-31 Thread Jesper Dangaard Brouer
On Wed, 31 Jul 2019 03:48:20 +0900
"Daniel T. Lee"  wrote:

> By this commit, using `bpftool net load`, a user can load an XDP prog on
> an interface. A new enum 'net_load_type' has been added; as stated in the
> cover letter, the meaning of 'load' is that the prog will be loaded on the
> interface.

Why the keyword "load" ?
Why not "attach" (and "detach")?

For BPF there is a clear distinction between the "load" and "attach"
steps.  I know this is under subcommand "net", but to follow the
conversion of other subcommands e.g. "prog" there are both "load" and
"attach" commands.


> BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'.

Again this is a "set" operation, not a "load" operation.

> Signed-off-by: Daniel T. Lee 

[...]
>  static int do_show(int argc, char **argv)
>  {
>   struct bpf_attach_info attach_info = {};
> @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
>  
>   fprintf(stderr,
>   "Usage: %s %s { show | list } [dev ]\n"
> + "   %s %s load PROG LOAD_TYPE \n"

The "PROG" here does it correspond to the 'bpftool prog' syntax?:

 PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }

>   "   %s %s help\n"
> + "\n"
> + "   " HELP_SPEC_PROGRAM "\n"
> + "   LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload 
> }\n"
>   "Note: Only xdp and tc attachments are supported now.\n"
>   "  For progs attached to cgroups, use \"bpftool cgroup\"\n"
>   "  to dump program attachments. For program types\n"
>   "  sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
>   "  consult iproute2.\n",


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH net-next 1/2] selftests: mlxsw: Fix local variable declarations in DSCP tests

2019-07-31 Thread Petr Machata
These two tests pollute the global scope in some places and, conversely,
contain unnecessary local declarations in others. Fix them.

Signed-off-by: Petr Machata 
---
 .../testing/selftests/drivers/net/mlxsw/qos_dscp_bridge.sh  | 6 --
 .../testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh  | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_bridge.sh 
b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_bridge.sh
index 40f16f2a3afd..5cbff8038f84 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_bridge.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_bridge.sh
@@ -36,8 +36,6 @@ source $lib_dir/lib.sh
 
 h1_create()
 {
-   local dscp;
-
simple_if_init $h1 192.0.2.1/28
tc qdisc add dev $h1 clsact
dscp_capture_install $h1 10
@@ -67,6 +65,7 @@ h2_destroy()
 dscp_map()
 {
local base=$1; shift
+   local prio
 
for prio in {0..7}; do
echo app=$prio,5,$((base + prio))
@@ -138,6 +137,7 @@ dscp_ping_test()
local prio=$1; shift
local dev_10=$1; shift
local dev_20=$1; shift
+   local key
 
local dscp_10=$(((prio + 10) << 2))
local dscp_20=$(((prio + 20) << 2))
@@ -175,6 +175,8 @@ dscp_ping_test()
 
 test_dscp()
 {
+   local prio
+
for prio in {0..7}; do
dscp_ping_test v$h1 192.0.2.1 192.0.2.2 $prio $h1 $h2
done
diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh 
b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
index 9faf02e32627..f25e3229e1cc 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
@@ -52,8 +52,6 @@ reprioritize()
 
 h1_create()
 {
-   local dscp;
-
simple_if_init $h1 192.0.2.1/28
tc qdisc add dev $h1 clsact
dscp_capture_install $h1 0
@@ -87,6 +85,7 @@ h2_destroy()
 dscp_map()
 {
local base=$1; shift
+   local prio
 
for prio in {0..7}; do
echo app=$prio,5,$((base + prio))
@@ -156,6 +155,7 @@ dscp_ping_test()
local reprio=$1; shift
local dev1=$1; shift
local dev2=$1; shift
+   local i
 
local prio2=$($reprio $prio)   # ICMP Request egress prio
local prio3=$($reprio $prio2)  # ICMP Response egress prio
@@ -205,6 +205,7 @@ __test_update()
 {
local update=$1; shift
local reprio=$1; shift
+   local prio
 
sysctl_restore net.ipv4.ip_forward_update_priority
sysctl_set net.ipv4.ip_forward_update_priority $update
-- 
2.20.1



[PATCH net-next 0/2] mlxsw: Test coverage for DSCP leftover fix

2019-07-31 Thread Petr Machata
This patch set fixes some global scope pollution issues in the DSCP tests
(in patch #1), and then proceeds (in patch #2) to add a new test for
checking whether, after DSCP prioritization rules are removed from a port,
DSCP is consistently rewritten to zero, instead of the last removed rule
still staying in effect.

Petr Machata (2):
  selftests: mlxsw: Fix local variable declarations in DSCP tests
  selftests: mlxsw: Add a test for leftover DSCP rule

 .../drivers/net/mlxsw/qos_dscp_bridge.sh  |  6 +++--
 .../drivers/net/mlxsw/qos_dscp_router.sh  | 24 +--
 2 files changed, 26 insertions(+), 4 deletions(-)

-- 
2.20.1



[PATCH net-next 2/2] selftests: mlxsw: Add a test for leftover DSCP rule

2019-07-31 Thread Petr Machata
Commit dedfde2fe1c4 ("mlxsw: spectrum_dcb: Configure DSCP map as the last
rule is removed") fixed a problem in mlxsw where last DSCP rule to be
removed remained in effect when DSCP rewrite was applied.

Add a selftest that covers this problem.

Signed-off-by: Petr Machata 
---
 .../drivers/net/mlxsw/qos_dscp_router.sh  | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh 
b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
index f25e3229e1cc..c745ce3befee 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_dscp_router.sh
@@ -31,6 +31,7 @@ ALL_TESTS="
ping_ipv4
test_update
test_no_update
+   test_dscp_leftover
 "
 
 lib_dir=$(dirname $0)/../../../net/forwarding
@@ -50,6 +51,11 @@ reprioritize()
echo ${reprio[$in]}
 }
 
+zero()
+{
+echo 0
+}
+
 h1_create()
 {
simple_if_init $h1 192.0.2.1/28
@@ -225,6 +231,19 @@ test_no_update()
__test_update 0 echo
 }
 
+# Test that when the last APP rule is removed, the prio->DSCP map is properly
+# set to zeroes, and that the last APP rule does not stay active in the ASIC.
+test_dscp_leftover()
+{
+   lldptool -T -i $swp2 -V APP -d $(dscp_map 0) >/dev/null
+   lldpad_app_wait_del
+
+   __test_update 0 zero
+
+   lldptool -T -i $swp2 -V APP $(dscp_map 0) >/dev/null
+   lldpad_app_wait_set $swp2
+}
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.20.1



Re: [PATCH net-next] be2net: disable bh with spin_lock in be_process_mcc

2019-07-31 Thread Denis Kirjanov
On 7/30/19, Willem de Bruijn  wrote:
> On Tue, Jul 30, 2019 at 7:33 AM Denis Kirjanov 
> wrote:
>>
>> Signed-off-by: Denis Kirjanov 
>
> This is a partial revert of the previous change to these lines in 2012
> in commit 072a9c486004 ("netpoll: revert 6bdb7fe3104 and fix be_poll()
> instead").
>
> The commit message is empty. Can you give some context as to why this
> is needed and correct?

The idea was just to do some cleanup. Now I've checked that be_process_mcc
is invoked in 3 different places, always with BHs disabled, except for
the be_poll function; but since be_poll is invoked from softirq context
with BHs already disabled, the change won't hurt.


>


Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread Mark Brown
On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:

Today's -next fails to build an ARM allmodconfig due to:

> allmodconfig (arm, gcc-8) — FAIL, 1 error, 40 warnings, 0 section mismatches
> 
> Errors:
> drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
> function 'writeq'; did you mean 'writel'? 
> [-Werror=implicit-function-declaration]

as a result of the changes that introduced:

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 64BIT 
&& HAS_IOMEM [=y] && OF_MDIO [=m]
  Selected by [m]:
  - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES 
[=y] || COMPILE_TEST [=y])

which is triggered by the staging OCTEON_ETHERNET driver which misses a
64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
(staging/octeon: Allow test build on !MIPS).


signature.asc
Description: PGP signature


Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread Greg Kroah-Hartman
On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
> 
> Today's -next fails to build an ARM allmodconfig due to:
> 
> > allmodconfig (arm, gcc-8) — FAIL, 1 error, 40 warnings, 0 section mismatches
> > 
> > Errors:
> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
> > function 'writeq'; did you mean 'writel'? 
> > [-Werror=implicit-function-declaration]
> 
> as a result of the changes that introduced:
> 
> WARNING: unmet direct dependencies detected for MDIO_OCTEON
>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 
> 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
>   Selected by [m]:
>   - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES 
> [=y] || COMPILE_TEST [=y])
> 
> which is triggered by the staging OCTEON_ETHERNET driver which misses a
> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
> (staging/octeon: Allow test build on !MIPS).

A patch was posted for this, but it needs to go through the netdev tree
as that's where the offending patches are coming from.

thanks,

greg k-h


Re: [PATCH net] drop_monitor: Add missing uAPI file to MAINTAINERS file

2019-07-31 Thread Neil Horman
On Wed, Jul 31, 2019 at 09:38:19AM +0300, Ido Schimmel wrote:
> From: Ido Schimmel 
> 
> Fixes: 6e43650cee64 ("add maintainer for network drop monitor kernel service")
> Signed-off-by: Ido Schimmel 
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9f5b8bd4faf9..b540794cbd91 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11137,6 +11137,7 @@ L:netdev@vger.kernel.org
>  S:   Maintained
>  W:   https://fedorahosted.org/dropwatch/
>  F:   net/core/drop_monitor.c
> +F:   include/uapi/linux/net_dropmon.h
>  
>  NETWORKING DRIVERS
>  M:   "David S. Miller" 
> -- 
> 2.21.0
> 
> 
Acked-by: Neil Horman 



[PATCH 0/8] netfilter fixes for net

2019-07-31 Thread Pablo Neira Ayuso
Hi,

The following patchset contains Netfilter fixes for your net tree:

1) memleak in ebtables from the error path for the 32/64 compat layer,
   from Florian Westphal.

2) Fix inverted meta ifname/ifidx matching when no interface is set
   on either from the input/output path, from Phil Sutter.

3) Remove goto label in nft_meta_bridge, also from Phil.

4) Missing include guard in xt_connlabel, from Masahiro Yamada.

5) Two patch to fix ipset destination MAC matching coming from
   Stephano Brivio, via Jozsef Kadlecsik.

6) Fix set rename and listing concurrency problem, from Shijie Luo.
   Patch also coming via Jozsef Kadlecsik.

7) ebtables 32/64 compat missing base chain policy in rule count,
   from Florian Westphal.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!



The following changes since commit 0cea0e1148fe134a4a3aaf0b1496f09241fb943a:

  net: phy: sfp: hwmon: Fix scaling of RX power (2019-07-21 11:51:50 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 7cdc4412284777c76c919e2ab33b3b8dbed18559:

  Merge branch 'master' of git://blackhole.kfki.hu/nf (2019-07-30 13:39:20 
+0200)


Florian Westphal (1):
  netfilter: ebtables: also count base chain policies

Jozsef Kadlecsik (1):
  netfilter: ipset: Fix rename concurrency with listing

Masahiro Yamada (1):
  netfilter: add include guard to xt_connlabel.h

Pablo Neira Ayuso (1):
  Merge branch 'master' of git://blackhole.kfki.hu/nf

Phil Sutter (2):
  netfilter: nf_tables: Make nft_meta expression more robust
  netfilter: nft_meta_bridge: Eliminate 'out' label

Stefano Brivio (2):
  netfilter: ipset: Actually allow destination MAC address for hash:ip,mac 
sets too
  netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and 
hash:ip,mac sets

Wenwen Wang (1):
  netfilter: ebtables: fix a memory leak bug in compat

 include/uapi/linux/netfilter/xt_connlabel.h |  6 ++
 net/bridge/netfilter/ebtables.c | 32 ++---
 net/bridge/netfilter/nft_meta_bridge.c  | 10 ++---
 net/netfilter/ipset/ip_set_bitmap_ipmac.c   |  2 +-
 net/netfilter/ipset/ip_set_core.c   |  2 +-
 net/netfilter/ipset/ip_set_hash_ipmac.c |  6 +-
 net/netfilter/nft_meta.c| 16 ---
 7 files changed, 35 insertions(+), 39 deletions(-)


[PATCH 1/8] netfilter: ebtables: fix a memory leak bug in compat

2019-07-31 Thread Pablo Neira Ayuso
From: Wenwen Wang 

In compat_do_replace(), a temporary buffer is allocated through vmalloc()
to hold entries copied from the user space. The buffer address is firstly
saved to 'newinfo->entries', and later on assigned to 'entries_tmp'. Then
the entries in this temporary buffer are copied to the internal kernel
structure through compat_copy_entries(). If this copy process fails,
compat_do_replace() should be terminated. However, the allocated temporary
buffer is not freed on this path, leading to a memory leak.

To fix the bug, free the buffer before returning from compat_do_replace().

Signed-off-by: Wenwen Wang 
Reviewed-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/ebtables.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 963dfdc14827..fd84b48e48b5 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -2261,8 +2261,10 @@ static int compat_do_replace(struct net *net, void 
__user *user,
state.buf_kern_len = size64;
 
ret = compat_copy_entries(entries_tmp, tmp.entries_size, &state);
-   if (WARN_ON(ret < 0))
+   if (WARN_ON(ret < 0)) {
+   vfree(entries_tmp);
goto out_unlock;
+   }
 
vfree(entries_tmp);
tmp.entries_size = size64;
-- 
2.11.0



[PATCH 3/8] netfilter: nft_meta_bridge: Eliminate 'out' label

2019-07-31 Thread Pablo Neira Ayuso
From: Phil Sutter 

The label is used just once and the code it points at is not reused, no
point in keeping it.

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/nft_meta_bridge.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/bridge/netfilter/nft_meta_bridge.c 
b/net/bridge/netfilter/nft_meta_bridge.c
index a98dec2cf0cf..1804e867f715 100644
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ b/net/bridge/netfilter/nft_meta_bridge.c
@@ -57,13 +57,11 @@ static void nft_meta_bridge_get_eval(const struct nft_expr 
*expr,
return;
}
default:
-   goto out;
+   return nft_meta_get_eval(expr, regs, pkt);
}
 
strncpy((char *)dest, br_dev ? br_dev->name : "", IFNAMSIZ);
return;
-out:
-   return nft_meta_get_eval(expr, regs, pkt);
 err:
regs->verdict.code = NFT_BREAK;
 }
-- 
2.11.0



[PATCH 7/8] netfilter: ipset: Fix rename concurrency with listing

2019-07-31 Thread Pablo Neira Ayuso
From: Jozsef Kadlecsik 

Shijie Luo reported that when stress-testing ipset with multiple concurrent
create, rename, flush, list, destroy commands, it can result in

ipset : Broken LIST kernel message: missing DATA part!

error messages and broken list results. The problem was that the rename
operation was not properly handled with respect to listing. The patch fixes
the issue.

Reported-by: Shijie Luo 
Signed-off-by: Jozsef Kadlecsik 
---
 net/netfilter/ipset/ip_set_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_core.c 
b/net/netfilter/ipset/ip_set_core.c
index 2e151856ad99..e64d5f9a89dd 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1161,7 +1161,7 @@ static int ip_set_rename(struct net *net, struct sock 
*ctnl,
return -ENOENT;
 
write_lock_bh(&ip_set_ref_lock);
-   if (set->ref != 0) {
+   if (set->ref != 0 || set->ref_netlink != 0) {
ret = -IPSET_ERR_REFERENCED;
goto out;
}
-- 
2.11.0



[PATCH 2/8] netfilter: nf_tables: Make nft_meta expression more robust

2019-07-31 Thread Pablo Neira Ayuso
From: Phil Sutter 

nft_meta_get_eval()'s tendency to bail out setting NFT_BREAK verdict in
situations where required data is missing leads to unexpected behaviour
with inverted checks like so:

| meta iifname != eth0 accept

This rule will never match if there is no input interface (or it is not
known) which is not intuitive and, what's worse, breaks consistency of
iptables-nft with iptables-legacy.

Fix this by falling back to placing a value in dreg which never matches
(avoiding accidental matches), i.e. zero for interface index and an
empty string for interface name.

Signed-off-by: Phil Sutter 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/nft_meta_bridge.c |  6 +-
 net/netfilter/nft_meta.c   | 16 
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/net/bridge/netfilter/nft_meta_bridge.c 
b/net/bridge/netfilter/nft_meta_bridge.c
index bed66f536b34..a98dec2cf0cf 100644
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ b/net/bridge/netfilter/nft_meta_bridge.c
@@ -30,13 +30,9 @@ static void nft_meta_bridge_get_eval(const struct nft_expr 
*expr,
switch (priv->key) {
case NFT_META_BRI_IIFNAME:
br_dev = nft_meta_get_bridge(in);
-   if (!br_dev)
-   goto err;
break;
case NFT_META_BRI_OIFNAME:
br_dev = nft_meta_get_bridge(out);
-   if (!br_dev)
-   goto err;
break;
case NFT_META_BRI_IIFPVID: {
u16 p_pvid;
@@ -64,7 +60,7 @@ static void nft_meta_bridge_get_eval(const struct nft_expr 
*expr,
goto out;
}
 
-   strncpy((char *)dest, br_dev->name, IFNAMSIZ);
+   strncpy((char *)dest, br_dev ? br_dev->name : "", IFNAMSIZ);
return;
 out:
return nft_meta_get_eval(expr, regs, pkt);
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index f1b1d948c07b..f69afb9ff3cb 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -60,24 +60,16 @@ void nft_meta_get_eval(const struct nft_expr *expr,
*dest = skb->mark;
break;
case NFT_META_IIF:
-   if (in == NULL)
-   goto err;
-   *dest = in->ifindex;
+   *dest = in ? in->ifindex : 0;
break;
case NFT_META_OIF:
-   if (out == NULL)
-   goto err;
-   *dest = out->ifindex;
+   *dest = out ? out->ifindex : 0;
break;
case NFT_META_IIFNAME:
-   if (in == NULL)
-   goto err;
-   strncpy((char *)dest, in->name, IFNAMSIZ);
+   strncpy((char *)dest, in ? in->name : "", IFNAMSIZ);
break;
case NFT_META_OIFNAME:
-   if (out == NULL)
-   goto err;
-   strncpy((char *)dest, out->name, IFNAMSIZ);
+   strncpy((char *)dest, out ? out->name : "", IFNAMSIZ);
break;
case NFT_META_IIFTYPE:
if (in == NULL)
-- 
2.11.0



[PATCH 8/8] netfilter: ebtables: also count base chain policies

2019-07-31 Thread Pablo Neira Ayuso
From: Florian Westphal 

ebtables doesn't include the base chain policies in the rule count,
so we need to add them manually when we call into the x_tables core
to allocate space for the compat offset table.

This led syzbot to trigger:
WARNING: CPU: 1 PID: 9012 at net/netfilter/x_tables.c:649
xt_compat_add_offset.cold+0x11/0x36 net/netfilter/x_tables.c:649

Reported-by: syzbot+276ddebab3382bbf7...@syzkaller.appspotmail.com
Fixes: 2035f3ff8eaa ("netfilter: ebtables: compat: un-break 32bit setsockopt 
when no rules are present")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/ebtables.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index fd84b48e48b5..c8177a89f52c 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1770,20 +1770,28 @@ static int compat_calc_entry(const struct ebt_entry *e,
return 0;
 }
 
+static int ebt_compat_init_offsets(unsigned int number)
+{
+   if (number > INT_MAX)
+   return -EINVAL;
+
+   /* also count the base chain policies */
+   number += NF_BR_NUMHOOKS;
+
+   return xt_compat_init_offsets(NFPROTO_BRIDGE, number);
+}
 
 static int compat_table_info(const struct ebt_table_info *info,
 struct compat_ebt_replace *newinfo)
 {
unsigned int size = info->entries_size;
const void *entries = info->entries;
+   int ret;
 
newinfo->entries_size = size;
-   if (info->nentries) {
-   int ret = xt_compat_init_offsets(NFPROTO_BRIDGE,
-info->nentries);
-   if (ret)
-   return ret;
-   }
+   ret = ebt_compat_init_offsets(info->nentries);
+   if (ret)
+   return ret;
 
return EBT_ENTRY_ITERATE(entries, size, compat_calc_entry, info,
entries, newinfo);
@@ -2234,11 +2242,9 @@ static int compat_do_replace(struct net *net, void __user *user,
 
xt_compat_lock(NFPROTO_BRIDGE);
 
-   if (tmp.nentries) {
-   ret = xt_compat_init_offsets(NFPROTO_BRIDGE, tmp.nentries);
-   if (ret < 0)
-   goto out_unlock;
-   }
+   ret = ebt_compat_init_offsets(tmp.nentries);
+   if (ret < 0)
+   goto out_unlock;
 
ret = compat_copy_entries(entries_tmp, tmp.entries_size, &state);
if (ret < 0)
-- 
2.11.0



[PATCH 5/8] netfilter: ipset: Actually allow destination MAC address for hash:ip,mac sets too

2019-07-31 Thread Pablo Neira Ayuso
From: Stefano Brivio 

In commit 8cc4ccf58379 ("ipset: Allow matching on destination MAC address
for mac and ipmac sets"), ipset.git commit 1543514c46a7, I removed the
KADT check that prevents matching on destination MAC addresses for
hash:mac sets, but forgot to remove the same check for hash:ip,mac set.

Drop this check: functionality is now documented in man pages and there's
no reason to restrict to source MAC address matching anymore.

Reported-by: Chen Yi 
Fixes: 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets")
Signed-off-by: Stefano Brivio 
Signed-off-by: Jozsef Kadlecsik 
---
 net/netfilter/ipset/ip_set_hash_ipmac.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_ipmac.c b/net/netfilter/ipset/ip_set_hash_ipmac.c
index faf59b6a998f..eb1443408320 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmac.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmac.c
@@ -89,10 +89,6 @@ hash_ipmac4_kadt(struct ip_set *set, const struct sk_buff *skb,
struct hash_ipmac4_elem e = { .ip = 0, { .foo[0] = 0, .foo[1] = 0 } };
struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
-/* MAC can be src only */
-   if (!(opt->flags & IPSET_DIM_TWO_SRC))
-   return 0;
-
if (skb_mac_header(skb) < skb->head ||
(skb_mac_header(skb) + ETH_HLEN) > skb->data)
return -EINVAL;
-- 
2.11.0



[PATCH 4/8] netfilter: add include guard to xt_connlabel.h

2019-07-31 Thread Pablo Neira Ayuso
From: Masahiro Yamada 

Add a header include guard just in case.

Signed-off-by: Masahiro Yamada 
Acked-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/xt_connlabel.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/uapi/linux/netfilter/xt_connlabel.h 
b/include/uapi/linux/netfilter/xt_connlabel.h
index 2312f0ec07b2..323f0dfc2a4e 100644
--- a/include/uapi/linux/netfilter/xt_connlabel.h
+++ b/include/uapi/linux/netfilter/xt_connlabel.h
@@ -1,4 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef _UAPI_XT_CONNLABEL_H
+#define _UAPI_XT_CONNLABEL_H
+
 #include 
 
 #define XT_CONNLABEL_MAXBIT 127
@@ -11,3 +15,5 @@ struct xt_connlabel_mtinfo {
__u16 bit;
__u16 options;
 };
+
+#endif /* _UAPI_XT_CONNLABEL_H */
-- 
2.11.0



[PATCH 6/8] netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets

2019-07-31 Thread Pablo Neira Ayuso
From: Stefano Brivio 

In commit 8cc4ccf58379 ("ipset: Allow matching on destination MAC address
for mac and ipmac sets"), ipset.git commit 1543514c46a7, I added to the
KADT functions for sets matching on MAC addresses the copy of source or
destination MAC address depending on the configured match.

This was done correctly for hash:mac, but for hash:ip,mac and
bitmap:ip,mac, copying and pasting the same code block presents an
obvious problem: in these two set types, the MAC address is the second
dimension, not the first one, and we are actually selecting the MAC
address depending on whether the first dimension (IP address) specifies
source or destination.

Fix this by checking for the IPSET_DIM_TWO_SRC flag in option flags.

This way, mixing source and destination matches for the two dimensions
of ip,mac set types works as expected. With this setup:

  ip netns add A
  ip link add veth1 type veth peer name veth2 netns A
  ip addr add 192.0.2.1/24 dev veth1
  ip -net A addr add 192.0.2.2/24 dev veth2
  ip link set veth1 up
  ip -net A link set veth2 up

  dst=$(ip netns exec A cat /sys/class/net/veth2/address)

  ip netns exec A ipset create test_bitmap bitmap:ip,mac range 192.0.0.0/16
  ip netns exec A ipset add test_bitmap 192.0.2.1,${dst}
  ip netns exec A iptables -A INPUT -m set ! --match-set test_bitmap src,dst -j DROP

  ip netns exec A ipset create test_hash hash:ip,mac
  ip netns exec A ipset add test_hash 192.0.2.1,${dst}
  ip netns exec A iptables -A INPUT -m set ! --match-set test_hash src,dst -j DROP

ipset correctly matches a test packet:

  # ping -c1 192.0.2.2 >/dev/null
  # echo $?
  0

Reported-by: Chen Yi 
Fixes: 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets")
Signed-off-by: Stefano Brivio 
Signed-off-by: Jozsef Kadlecsik 
---
 net/netfilter/ipset/ip_set_bitmap_ipmac.c | 2 +-
 net/netfilter/ipset/ip_set_hash_ipmac.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_bitmap_ipmac.c b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
index ca7ac4a25ada..1d4e63326e68 100644
--- a/net/netfilter/ipset/ip_set_bitmap_ipmac.c
+++ b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
@@ -226,7 +226,7 @@ bitmap_ipmac_kadt(struct ip_set *set, const struct sk_buff *skb,
 
e.id = ip_to_id(map, ip);
 
-   if (opt->flags & IPSET_DIM_ONE_SRC)
+   if (opt->flags & IPSET_DIM_TWO_SRC)
ether_addr_copy(e.ether, eth_hdr(skb)->h_source);
else
ether_addr_copy(e.ether, eth_hdr(skb)->h_dest);
diff --git a/net/netfilter/ipset/ip_set_hash_ipmac.c b/net/netfilter/ipset/ip_set_hash_ipmac.c
index eb1443408320..24d8f4df4230 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmac.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmac.c
@@ -93,7 +93,7 @@ hash_ipmac4_kadt(struct ip_set *set, const struct sk_buff *skb,
(skb_mac_header(skb) + ETH_HLEN) > skb->data)
return -EINVAL;
 
-   if (opt->flags & IPSET_DIM_ONE_SRC)
+   if (opt->flags & IPSET_DIM_TWO_SRC)
ether_addr_copy(e.ether, eth_hdr(skb)->h_source);
else
ether_addr_copy(e.ether, eth_hdr(skb)->h_dest);
-- 
2.11.0



[PATCH nf,v2] netfilter: nf_tables: map basechain priority to hardware priority

2019-07-31 Thread Pablo Neira Ayuso
This patch adds initial support for offloading basechains using the
priority range from -8192 to 8191.

The software priority -8192 is mapped to the hardware priority
0xC000 + 1. tcf_auto_prio() uses 0xC000 if the user specifies no
priority, then it subtracts 1 for each new tcf_proto object. This patch
reserves the hardware priority range from 0xC000 to 0xffff for
netfilter.

The software to hardware priority mapping is not exposed to userspace,
so it should be possible to update this / extend the range of supported
priorities later on.

Signed-off-by: Pablo Neira Ayuso 
---
v2: fix indent in priority assignment - Marcelo Leitner.
Missing << 16 bitshift.

 include/net/netfilter/nf_tables_offload.h |  2 ++
 net/netfilter/nf_tables_api.c |  4 
 net/netfilter/nf_tables_offload.c | 31 ---
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_tables_offload.h b/include/net/netfilter/nf_tables_offload.h
index 3196663a10e3..2d497394021e 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -73,4 +73,6 @@ int nft_flow_rule_offload_commit(struct net *net);
(__reg)->key= __key;\
memset(&(__reg)->mask, 0xff, (__reg)->len);
 
+u16 nft_chain_offload_priority(struct nft_base_chain *basechain);
+
 #endif
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 605a7cfe7ca7..9cf0fecf5cb9 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1662,6 +1662,10 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
 
chain->flags |= NFT_BASE_CHAIN | flags;
basechain->policy = NF_ACCEPT;
+   if (chain->flags & NFT_CHAIN_HW_OFFLOAD &&
+   !nft_chain_offload_priority(basechain))
+   return -EOPNOTSUPP;
+
flow_block_init(&basechain->flow_block);
} else {
chain = kzalloc(sizeof(*chain), GFP_KERNEL);
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 64f5fd5f240e..a79efd49c8a8 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -103,10 +103,11 @@ void nft_offload_update_dependency(struct nft_offload_ctx *ctx,
 }
 
 static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
-__be16 proto,
-   struct netlink_ext_ack *extack)
+__be16 proto, int priority,
+struct netlink_ext_ack *extack)
 {
common->protocol = proto;
+   common->prio = priority << 16;
common->extack = extack;
 }
 
@@ -124,6 +125,28 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
return 0;
 }
 
+/* Available priorities for hardware offload range: -8192..8191 */
+#define NFT_BASECHAIN_OFFLOAD_PRIO_MAX (SHRT_MAX / 4)
+#define NFT_BASECHAIN_OFFLOAD_PRIO_MIN (SHRT_MIN / 4)
+#define NFT_BASECHAIN_OFFLOAD_PRIO_RANGE   (USHRT_MAX / 4)
+/* tcf_auto_prio() uses 0xC000 as base, then subtract one for each new chain. */
+#define NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE (0xC000 + 1)
+
+u16 nft_chain_offload_priority(struct nft_base_chain *basechain)
+{
+   u16 prio;
+
+   if (basechain->ops.priority < NFT_BASECHAIN_OFFLOAD_PRIO_MIN ||
+   basechain->ops.priority > NFT_BASECHAIN_OFFLOAD_PRIO_MAX)
+   return 0;
+
+   /* map netfilter chain priority to hardware priority. */
+   prio = basechain->ops.priority + NFT_BASECHAIN_OFFLOAD_PRIO_MAX +
+   NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE;
+
+   return prio;
+}
+
 static int nft_flow_offload_rule(struct nft_trans *trans,
 enum flow_cls_command command)
 {
@@ -142,7 +165,9 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
if (flow)
proto = flow->proto;
 
-   nft_flow_offload_common_init(&cls_flow.common, proto, &extack);
+   nft_flow_offload_common_init(&cls_flow.common, proto,
+nft_chain_offload_priority(basechain),
+&extack);
cls_flow.command = command;
cls_flow.cookie = (unsigned long) rule;
if (flow)
-- 
2.11.0



Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-07-31 Thread Neil Horman
On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > fallthrough may become a pseudo reserved keyword so this only use of
> > > fallthrough is better renamed to allow it.
> > > 
> > > Signed-off-by: Joe Perches 
> > Are you referring to the __attribute__((fallthrough)) statement that gcc
> > supports?  If so the compiler should by all rights be able to differentiate
> > between a null statement attribute and a explicit goto and label without the
> > need for renaming here.  Or are you referring to something else?
> 
> Hi.
> 
> I sent after this a patch that adds
> 
> # define fallthrough __attribute__((__fallthrough__))
> 
> https://lore.kernel.org/patchwork/patch/1108577/
> 
> So this rename is a prerequisite to adding this #define.
> 
Why not just define __fallthrough instead, like we do for all the other
attributes we alias (e.g. __read_mostly, __protected_by, __unused, __exception,
etc.)?

Neil

> > > diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> []
> > > @@ -2152,7 +2152,7 @@ static enum sctp_ierror sctp_verify_param(struct 
> > > net *net,
> > >   case SCTP_PARAM_SET_PRIMARY:
> > >   if (net->sctp.addip_enable)
> > >   break;
> > > - goto fallthrough;
> > > + goto unhandled;
> 
> etc...
> 
> 
> 


Re: [Intel-wired-lan] [PATCH bpf-next v4 03/11] libbpf: add flags to umem config

2019-07-31 Thread Björn Töpel
On Tue, 30 Jul 2019 at 19:43, Kevin Laatz  wrote:
>
> This patch adds a 'flags' field to the umem_config and umem_reg structs.
> This will allow for more options to be added for configuring umems.
>
> The first use for the flags field is to add a flag for unaligned chunks
> mode. These flags can either be user-provided or filled with a default.
>
> Signed-off-by: Kevin Laatz 
> Signed-off-by: Ciara Loftus 
>
> ---
> v2:
>   - Removed the headroom check from this patch. It has moved to the
> previous patch.
>
> v4:
>   - modified chunk flag define
> ---
>  tools/include/uapi/linux/if_xdp.h | 9 +++--
>  tools/lib/bpf/xsk.c   | 3 +++
>  tools/lib/bpf/xsk.h   | 2 ++
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/tools/include/uapi/linux/if_xdp.h 
> b/tools/include/uapi/linux/if_xdp.h
> index faaa5ca2a117..a691802d7915 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -17,6 +17,10 @@
>  #define XDP_COPY   (1 << 1) /* Force copy-mode */
>  #define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */
>
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 
> XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
> +
>  struct sockaddr_xdp {
> __u16 sxdp_family;
> __u16 sxdp_flags;
> @@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
>  #define XDP_OPTIONS8
>
>  struct xdp_umem_reg {
> -   __u64 addr; /* Start of packet data area */
> -   __u64 len; /* Length of packet data area */
> +   __u64 addr; /* Start of packet data area */
> +   __u64 len:48;   /* Length of packet data area */
> +   __u64 flags:16; /* Flags for umem */

So, the flags member moved from struct sockaddr_xdp to struct
xdp_umem_reg. Makes sense. However, I'm not a fan of the bitfield. Why
not just add the flags member after the last member (headroom) and
deal with it in xsk.c/xsk_setsockopt and libbpf? The bitfield
preserves the size, but makes it hard to read/error prone IMO.


Björn


> __u32 chunk_size;
> __u32 headroom;
>  };
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index 5007b5d4fd2c..5e7e4d420ee0 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -116,6 +116,7 @@ static void xsk_set_umem_config(struct xsk_umem_config 
> *cfg,
> cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
> cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
> +   cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
> return;
> }
>
> @@ -123,6 +124,7 @@ static void xsk_set_umem_config(struct xsk_umem_config 
> *cfg,
> cfg->comp_size = usr_cfg->comp_size;
> cfg->frame_size = usr_cfg->frame_size;
> cfg->frame_headroom = usr_cfg->frame_headroom;
> +   cfg->flags = usr_cfg->flags;
>  }
>
>  static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> @@ -182,6 +184,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void 
> *umem_area, __u64 size,
> mr.len = size;
> mr.chunk_size = umem->config.frame_size;
> mr.headroom = umem->config.frame_headroom;
> +   mr.flags = umem->config.flags;
>
> err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> if (err) {
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index 833a6e60d065..44a03d8c34b9 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -170,12 +170,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket 
> *xsk);
>  #define XSK_UMEM__DEFAULT_FRAME_SHIFT12 /* 4096 bytes */
>  #define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
>  #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
> +#define XSK_UMEM__DEFAULT_FLAGS 0
>
>  struct xsk_umem_config {
> __u32 fill_size;
> __u32 comp_size;
> __u32 frame_size;
> __u32 frame_headroom;
> +   __u32 flags;
>  };
>
>  /* Flags for the libbpf_flags field. */
> --
> 2.17.1
>
> ___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


[PATCH] net: mediatek: Drop unneeded dependency on NET_VENDOR_MEDIATEK

2019-07-31 Thread Geert Uytterhoeven
The whole block is protected by "if NET_VENDOR_MEDIATEK", so there is
no need for individual driver config symbols to duplicate this
dependency.

Signed-off-by: Geert Uytterhoeven 
---
 drivers/net/ethernet/mediatek/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
index 263cd0909fe0de39..1f7fff81f24dbb96 100644
--- a/drivers/net/ethernet/mediatek/Kconfig
+++ b/drivers/net/ethernet/mediatek/Kconfig
@@ -9,7 +9,6 @@ if NET_VENDOR_MEDIATEK
 
 config NET_MEDIATEK_SOC
tristate "MediaTek SoC Gigabit Ethernet support"
-   depends on NET_VENDOR_MEDIATEK
select PHYLIB
---help---
  This driver supports the gigabit ethernet MACs in the
-- 
2.17.1



[PATCH v2] isdn: hfcsusb: Fix mISDN driver crash caused by transfer buffer on the stack

2019-07-31 Thread Juliana Rodrigueiro
Since Linux 4.9 it is not possible to use buffers on the stack for DMA
transfers.

During usb probe the driver crashes with "transfer buffer is on stack" message.

This fix kmalloc()s a buffer to be used in "read_reg_atomic", a macro
that calls "usb_control_msg" under the hood.

Kernel 4.19 backtrace:

usb_hcd_submit_urb+0x3e5/0x900
? sched_clock+0x9/0x10
? log_store+0x203/0x270
? get_random_u32+0x6f/0x90
? cache_alloc_refill+0x784/0x8a0
usb_submit_urb+0x3b4/0x550
usb_start_wait_urb+0x4e/0xd0
usb_control_msg+0xb8/0x120
hfcsusb_probe+0x6bc/0xb40 [hfcsusb]
usb_probe_interface+0xc2/0x260
really_probe+0x176/0x280
driver_probe_device+0x49/0x130
__driver_attach+0xa9/0xb0
? driver_probe_device+0x130/0x130
bus_for_each_dev+0x5a/0x90
driver_attach+0x14/0x20
? driver_probe_device+0x130/0x130
bus_add_driver+0x157/0x1e0
driver_register+0x51/0xe0
usb_register_driver+0x5d/0x120
? 0xf81ed000
hfcsusb_drv_init+0x17/0x1000 [hfcsusb]
do_one_initcall+0x44/0x190
? free_unref_page_commit+0x6a/0xd0
do_init_module+0x46/0x1c0
load_module+0x1dc1/0x2400
sys_init_module+0xed/0x120
do_fast_syscall_32+0x7a/0x200
entry_SYSENTER_32+0x6b/0xbe

Signed-off-by: Juliana Rodrigueiro 
---
Changes in v2:
   - Ordered the local variables declaration from longest to shortest.

 drivers/isdn/hardware/mISDN/hfcsusb.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/isdn/hardware/mISDN/hfcsusb.c b/drivers/isdn/hardware/mISDN/hfcsusb.c
index 6d05946b445e..fe1117a21f6b 100644
--- a/drivers/isdn/hardware/mISDN/hfcsusb.c
+++ b/drivers/isdn/hardware/mISDN/hfcsusb.c
@@ -1704,13 +1704,23 @@ hfcsusb_stop_endpoint(struct hfcsusb *hw, int channel)
 static int
 setup_hfcsusb(struct hfcsusb *hw)
 {
+   void *dmabuf = kmalloc(sizeof(u_char), GFP_KERNEL);
u_char b;
+   int ret;
 
if (debug & DBG_HFC_CALL_TRACE)
printk(KERN_DEBUG "%s: %s\n", hw->name, __func__);
 
+   if (!dmabuf)
+   return -ENOMEM;
+
+   ret = read_reg_atomic(hw, HFCUSB_CHIP_ID, dmabuf);
+
+   memcpy(&b, dmabuf, sizeof(u_char));
+   kfree(dmabuf);
+
/* check the chip id */
-   if (read_reg_atomic(hw, HFCUSB_CHIP_ID, &b) != 1) {
+   if (ret != 1) {
printk(KERN_DEBUG "%s: %s: cannot read chip id\n",
   hw->name, __func__);
return 1;
-- 
2.17.2



Re: [PATCH bpf-next 1/9] selftests/bpf: prevent headers to be compiled as C code

2019-07-31 Thread Ilya Leoshkevich
> Am 27.07.2019 um 20:53 schrieb Andrii Nakryiko :
> 
> On Fri, Jul 26, 2019 at 3:01 PM Stanislav Fomichev  wrote:
>> 
>> On 07/26, Andrii Nakryiko wrote:
>>> On Fri, Jul 26, 2019 at 2:21 PM Stanislav Fomichev  wrote:
 
 On 07/26, Andrii Nakryiko wrote:
> Apparently listing header as a normal dependency for a binary output
> makes it go through compilation as if it was C code. This currently
> works without a problem, but in subsequent commits causes problems for
> differently generated test.h for test_progs. Marking those headers as
> order-only dependency solves the issue.
 Are you sure it will not result in a situation where
 test_progs/test_maps is not regenerated if tests.h is updated.
 
 If I read the following doc correctly, order deps make sense for
 directories only:
 https://www.gnu.org/software/make/manual/html_node/Prerequisite-Types.html
 
 Can you maybe double check it with:
 * make
 * add new prog_tests/test_something.c
 * make
 to see if the binary is regenerated with test_something.c?
>>> 
>>> Yeah, tested that, it triggers test_progs rebuild.
>>> 
>>> Ordering is still preserved, because test.h is dependency of
>>> test_progs.c, which is dependency of test_progs binary, so that's why
>>> it works.
>>> 
>>> As to why .h file is compiled as C file, I have no idea and ideally
>>> that should be fixed somehow.
>> I guess that's because it's a prerequisite and we have a target that
>> puts all prerequisites when calling CC:
>> 
>> test_progs: a.c b.c tests.h
>>gcc a.c b.c tests.h -o test_progs
>> 
>> So gcc compiles each input file.
> 
> If that's really a definition of the rule, then it seems not exactly
> correct. What if some of the prerequisites are some object files,
> directories, etc. I'd say the rule should only include .c files. I'll
> add it to my TODO list to go and check how all this is defined
> somewhere, but for now I'm leaving everything as is in this patch.
> 

I believe it’s an implicit built-in rule, which is defined by make
itself here:

https://git.savannah.gnu.org/cgit/make.git/tree/default.c?h=4.2.1#n131

The observed behavior arises because these rules use $^ all over the
place. My 2c is that it would be better to make the rule explicit,
because it would cost just an extra line, but it would be immediately
obvious why the code is correct w.r.t. rebuilds.

Best regards,
Ilya


[PATCH 8/8] net: samsung: Spelling s/case/cause/

2019-07-31 Thread Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven 
---
 drivers/net/ethernet/samsung/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/samsung/Kconfig b/drivers/net/ethernet/samsung/Kconfig
index 027938017579130f..e92a178a76df0849 100644
--- a/drivers/net/ethernet/samsung/Kconfig
+++ b/drivers/net/ethernet/samsung/Kconfig
@@ -11,7 +11,7 @@ config NET_VENDOR_SAMSUNG
  say Y.
 
  Note that the answer to this question does not directly affect
- the kernel: saying N will just case the configurator to skip all
+ the kernel: saying N will just cause the configurator to skip all
  the questions about Samsung chipsets. If you say Y, you will be asked
  for your specific chipset/driver in the following questions.
 
-- 
2.17.1



Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload

2019-07-31 Thread Boris Pismenny
Hi Jakub,

On 7/31/2019 12:12 AM, Jakub Kicinski wrote:
> sk_validate_xmit_skb() and drivers depend on the sk member of
> struct sk_buff to identify segments requiring encryption.
> Any operation which removes or does not preserve the original TLS
> socket such as skb_orphan() or skb_clone() will cause clear text
> leaks.
>
> Make the TCP socket underlying an offloaded TLS connection
> mark all skbs as decrypted, if TLS TX is in offload mode.
> Then in sk_validate_xmit_skb() catch skbs which have no socket
> (or a socket with no validation) and decrypted flag set.
>
> Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> sk->sk_validate_xmit_skb are slightly interchangeable right now,
> they all imply TLS offload. The new checks are guarded by
> CONFIG_TLS_DEVICE because that's the option guarding the
> sk_buff->decrypted member.
>
> Second, smaller issue with orphaning is that it breaks
> the guarantee that packets will be delivered to device
> queues in-order. All TLS offload drivers depend on that
> scheduling property. This means skb_orphan_partial()'s
> trick of preserving partial socket references will cause
> issues in the drivers. We need a full orphan, and as a
> result netem delay/throttling will cause all TLS offload
> skbs to be dropped.
>
> Reusing the sk_buff->decrypted flag also protects from
> leaking clear text when incoming, decrypted skb is redirected
> (e.g. by TC).

Nice!

>
> Signed-off-by: Jakub Kicinski 
> ---
> I'm sending this for net-next because of lack of confidence
> in my own abilities. It should apply cleanly to net... :)
>
>   Documentation/networking/tls-offload.rst |  9 
>   include/net/sock.h   | 28 +++-
>   net/core/skbuff.c|  3 +++
>   net/core/sock.c  | 22 ++-
>   net/ipv4/tcp.c   |  4 +++-
>   net/ipv4/tcp_output.c|  3 +++
>   net/tls/tls_device.c |  2 ++
>   7 files changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/networking/tls-offload.rst 
> b/Documentation/networking/tls-offload.rst
> index 048e5ca44824..2bc3ab5515d8 100644
> --- a/Documentation/networking/tls-offload.rst
> +++ b/Documentation/networking/tls-offload.rst
> @@ -499,15 +499,6 @@ offloads, old connections will remain active after flags 
> are cleared.
>   Known bugs
>   ==
>   
> -skb_orphan() leaks clear text
> --
> -
> -Currently drivers depend on the :c:member:`sk` member of
> -:c:type:`struct sk_buff ` to identify segments requiring
> -encryption. Any operation which removes or does not preserve the socket
> -association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
> -will cause the driver to miss the packets and lead to clear text leaks.
> -
>   Redirects leak clear text
>   -
Doesn't this patch cover the redirect case as well?
>   
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 228db3998e46..90f3f552c789 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -814,6 +814,7 @@ enum sock_flags {
>   SOCK_TXTIME,
>   SOCK_XDP, /* XDP is attached */
>   SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> + SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
>   };
>   
>   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
> SOCK_TIMESTAMPING_RX_SOFTWARE))
> @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
>   return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
>   }
>   
> +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff 
> *skb)
> +{
> +#ifdef CONFIG_TLS_DEVICE
> + skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> +#endif
> +}

Have you considered the following alternative to calling
sk_set_tx_crypto(): add a new MSG_DECRYPTE flag that will set the
skb->decrypted bit in the do_tcp_sendpages() function.

Then, the rest of the TCP code can propagate this bit from the existing skb.

> +
> +static inline bool sk_tx_crypto_match(const struct sock *sk,
> +   const struct sk_buff *skb)
> +{
> +#ifdef CONFIG_TLS_DEVICE
> + return sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT) == !!skb->decrypted;
> +#else
> + return true;
> +#endif
> +}
> +
>   /* Checks if this SKB belongs to an HW offloaded socket
>* and whether any SW fallbacks are required based on dev.
> + * Check decrypted mark in case skb_orphan() cleared socket.
>*/
>   static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
>  struct net_device *dev)
> @@ -2489,8 +2508,15 @@ static inline struct sk_buff 
> *sk_validate_xmit_skb(struct sk_buff *skb,
>   #ifdef CONFIG_SOCK_VALIDATE_XMIT
>   struct sock *sk = skb->sk;
>   
> - if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
> + if (sk && 

Re: [Intel-wired-lan] [PATCH bpf-next v4 03/11] libbpf: add flags to umem config

2019-07-31 Thread Björn Töpel
On Tue, 30 Jul 2019 at 19:43, Kevin Laatz  wrote:
>
> This patch adds a 'flags' field to the umem_config and umem_reg structs.
> This will allow for more options to be added for configuring umems.
>
> The first use for the flags field is to add a flag for unaligned chunks
> mode. These flags can either be user-provided or filled with a default.
>
> Signed-off-by: Kevin Laatz 
> Signed-off-by: Ciara Loftus 
>
> ---
> v2:
>   - Removed the headroom check from this patch. It has moved to the
> previous patch.
>
> v4:
>   - modified chunk flag define
> ---
>  tools/include/uapi/linux/if_xdp.h | 9 +++--
>  tools/lib/bpf/xsk.c   | 3 +++
>  tools/lib/bpf/xsk.h   | 2 ++
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/tools/include/uapi/linux/if_xdp.h 
> b/tools/include/uapi/linux/if_xdp.h
> index faaa5ca2a117..a691802d7915 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -17,6 +17,10 @@
>  #define XDP_COPY   (1 << 1) /* Force copy-mode */
>  #define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */
>
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 
> XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
> +
>  struct sockaddr_xdp {
> __u16 sxdp_family;
> __u16 sxdp_flags;
> @@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
>  #define XDP_OPTIONS8
>
>  struct xdp_umem_reg {
> -   __u64 addr; /* Start of packet data area */
> -   __u64 len; /* Length of packet data area */
> +   __u64 addr; /* Start of packet data area */
> +   __u64 len:48;   /* Length of packet data area */
> +   __u64 flags:16; /* Flags for umem */
> __u32 chunk_size;
> __u32 headroom;
>  };
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index 5007b5d4fd2c..5e7e4d420ee0 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -116,6 +116,7 @@ static void xsk_set_umem_config(struct xsk_umem_config 
> *cfg,
> cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
> cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
> +   cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
> return;
> }
>
> @@ -123,6 +124,7 @@ static void xsk_set_umem_config(struct xsk_umem_config 
> *cfg,
> cfg->comp_size = usr_cfg->comp_size;
> cfg->frame_size = usr_cfg->frame_size;
> cfg->frame_headroom = usr_cfg->frame_headroom;
> +   cfg->flags = usr_cfg->flags;
>  }
>
>  static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> @@ -182,6 +184,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void 
> *umem_area, __u64 size,
> mr.len = size;
> mr.chunk_size = umem->config.frame_size;
> mr.headroom = umem->config.frame_headroom;
> +   mr.flags = umem->config.flags;
>
> err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> if (err) {
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index 833a6e60d065..44a03d8c34b9 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -170,12 +170,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket 
> *xsk);
>  #define XSK_UMEM__DEFAULT_FRAME_SHIFT12 /* 4096 bytes */
>  #define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
>  #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
> +#define XSK_UMEM__DEFAULT_FLAGS 0
>
>  struct xsk_umem_config {
> __u32 fill_size;
> __u32 comp_size;
> __u32 frame_size;
> __u32 frame_headroom;
> +   __u32 flags;

And the flags addition here, unfortunately, requires symbol versioning
of xsk_umem__create(). That'll be the first in libbpf! :-)


Björn

>  };
>
>  /* Flags for the libbpf_flags field. */
> --
> 2.17.1
>
> ___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


Re: [PATCH net 0/2] mlxsw: Two small fixes

2019-07-31 Thread David Miller
From: Ido Schimmel 
Date: Wed, 31 Jul 2019 09:33:13 +0300

> From: Ido Schimmel 
> 
> Patch #1 from Jiri fixes the error path of the module initialization
> function. Found during manual code inspection.
> 
> Patch #2 from Petr further reduces the default shared buffer pool sizes
> in order to work around a problem that was originally described in
> commit e891ce1dd2a5 ("mlxsw: spectrum_buffers: Reduce pool size on
> Spectrum-2").

Series applied and queued up for -stable, thanks.


Re: [PATCH net] drop_monitor: Add missing uAPI file to MAINTAINERS file

2019-07-31 Thread David Miller
From: Ido Schimmel 
Date: Wed, 31 Jul 2019 09:38:19 +0300

> From: Ido Schimmel 
> 
> Fixes: 6e43650cee64 ("add maintainer for network drop monitor kernel service")
> Signed-off-by: Ido Schimmel 

Applied.


RE: [PATCH net-next v4 0/4] enetc: Add mdio bus driver for the PCIe MDIO endpoint

2019-07-31 Thread Claudiu Manoil
>-Original Message-
>From: David Miller 
>Sent: Tuesday, July 30, 2019 7:54 PM
>To: Claudiu Manoil 
>Cc: and...@lunn.ch; robh...@kernel.org; Leo Li ;
>Alexandru Marginean ;
>netdev@vger.kernel.org; devicet...@vger.kernel.org; linux-arm-
>ker...@lists.infradead.org; linux-ker...@vger.kernel.org
>Subject: Re: [PATCH net-next v4 0/4] enetc: Add mdio bus driver for the PCIe
>MDIO endpoint
>
>From: David Miller 
>Date: Tue, 30 Jul 2019 09:44:36 -0700 (PDT)
>
>> From: Claudiu Manoil 
>> Date: Tue, 30 Jul 2019 12:45:15 +0300
>>
>>> First patch fixes a sparse issue and cleans up accessors to avoid
>>> casting to __iomem.
>>> Second patch just registers the PCIe endpoint device containing
>>> the MDIO registers as a standalone MDIO bus driver, to allow
>>> an alternative way to control the MDIO bus.  The same code used
>>> by the ENETC ports (eth controllers) to manage MDIO via local
>>> registers applies and is reused.
>>>
>>> Bindings are provided for the new MDIO node, similarly to ENETC
>>> port nodes bindings.
>>>
>>> Last patch enables the ENETC port 1 and its RGMII PHY on the
>>> LS1028A QDS board, where the MDIO muxing configuration relies
>>> on the MDIO support provided in the first patch.
>>  ...
>>
>> Series applied, thank you.
>
>Actually this doesn't compile, I had to revert:
>

Sorry, I overlooked the module part.  Turns out I have to define a separate
module for this driver (mdio), refactor common code between the mdio module
and the enetc-pf module, clean up the Makefile...  and do more checks.


Re: [PATCH net-next 0/2] mlxsw: Test coverage for DSCP leftover fix

2019-07-31 Thread David Miller
From: Petr Machata 
Date: Wed, 31 Jul 2019 10:30:25 +

> This patch set fixes some global scope pollution issues in the DSCP tests
> (in patch #1), and then proceeds (in patch #2) to add a new test for
> checking whether, after DSCP prioritization rules are removed from a port,
> DSCP rewrites consistently to zero, instead of the last removed rule still
> staying in effect.

Series applied.


Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread David Miller
From: Greg Kroah-Hartman 
Date: Wed, 31 Jul 2019 13:35:22 +0200

> On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
>> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
>> 
>> Today's -next fails to build an ARM allmodconfig due to:
>> 
>> > allmodconfig (arm, gcc-8) ― FAIL, 1 error, 40 warnings, 0 section 
>> > mismatches
>> > 
>> > Errors:
>> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
>> > function 'writeq'; did you mean 'writel'? 
>> > [-Werror=implicit-function-declaration]
>> 
>> as a result of the changes that introduced:
>> 
>> WARNING: unmet direct dependencies detected for MDIO_OCTEON
>>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 
>> 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
>>   Selected by [m]:
>>   - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES 
>> [=y] || COMPILE_TEST [=y])
>> 
>> which is triggered by the staging OCTEON_ETHERNET driver which misses a
>> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
>> (staging/octeon: Allow test build on !MIPS).
> 
> A patch was posted for this, but it needs to go through the netdev tree
> as that's where the offending patches are coming from.

I didn't catch that, was netdev CC:'d?


Re: [PATCH 0/8] netfilter fixes for net

2019-07-31 Thread David Miller
From: Pablo Neira Ayuso 
Date: Wed, 31 Jul 2019 13:51:49 +0200

> The following patchset contains Netfilter fixes for your net tree:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks.


Re: [PATCH] net: mediatek: Drop unneeded dependency on NET_VENDOR_MEDIATEK

2019-07-31 Thread David Miller
From: Geert Uytterhoeven 
Date: Wed, 31 Jul 2019 15:12:02 +0200

> The whole block is protected by "if NET_VENDOR_MEDIATEK", so there is
> no need for individual driver config symbols to duplicate this
> dependency.
> 
> Signed-off-by: Geert Uytterhoeven 

Applied.


Re: [PATCH v2] isdn: hfcsusb: Fix mISDN driver crash caused by transfer buffer on the stack

2019-07-31 Thread David Miller
From: Juliana Rodrigueiro 
Date: Wed, 31 Jul 2019 15:17:23 +0200

> Since linux 4.9 it is not possible to use buffers on the stack for DMA 
> transfers.
> 
> During usb probe the driver crashes with "transfer buffer is on stack" 
> message.
> 
> This fix k-allocates a buffer to be used on "read_reg_atomic", which is a 
> macro
> that calls "usb_control_msg" under the hood.
> 
> Kernel 4.19 backtrace:
 ...
> Signed-off-by: Juliana Rodrigueiro 

Applied.


Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload

2019-07-31 Thread Willem de Bruijn
On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski
 wrote:
>
> sk_validate_xmit_skb() and drivers depend on the sk member of
> struct sk_buff to identify segments requiring encryption.
> Any operation which removes or does not preserve the original TLS
> socket such as skb_orphan() or skb_clone() will cause clear text
> leaks.
>
> Make the TCP socket underlying an offloaded TLS connection
> mark all skbs as decrypted, if TLS TX is in offload mode.
> Then in sk_validate_xmit_skb() catch skbs which have no socket
> (or a socket with no validation) and decrypted flag set.
>
> Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> sk->sk_validate_xmit_skb are slightly interchangeable right now,
> they all imply TLS offload. The new checks are guarded by
> CONFIG_TLS_DEVICE because that's the option guarding the
> sk_buff->decrypted member.
>
> Second, smaller issue with orphaning is that it breaks
> the guarantee that packets will be delivered to device
> queues in-order. All TLS offload drivers depend on that
> scheduling property. This means skb_orphan_partial()'s
> trick of preserving partial socket references will cause
> issues in the drivers. We need a full orphan, and as a
> result netem delay/throttling will cause all TLS offload
> skbs to be dropped.
>
> Reusing the sk_buff->decrypted flag also protects from
> leaking clear text when incoming, decrypted skb is redirected
> (e.g. by TC).
>
> Signed-off-by: Jakub Kicinski 
> ---
> I'm sending this for net-next because of lack of confidence
> in my own abilities. It should apply cleanly to net... :)
>
>  Documentation/networking/tls-offload.rst |  9 
>  include/net/sock.h   | 28 +++-
>  net/core/skbuff.c|  3 +++
>  net/core/sock.c  | 22 ++-
>  net/ipv4/tcp.c   |  4 +++-
>  net/ipv4/tcp_output.c|  3 +++
>  net/tls/tls_device.c |  2 ++
>  7 files changed, 55 insertions(+), 16 deletions(-)

>  Redirects leak clear text
>  -
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 228db3998e46..90f3f552c789 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -814,6 +814,7 @@ enum sock_flags {
> SOCK_TXTIME,
> SOCK_XDP, /* XDP is attached */
> SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> +   SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
> SOCK_TIMESTAMPING_RX_SOFTWARE))
> @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
> return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
>  }
>
> +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff 
> *skb)

nit: skb_.. instead of sk_.. ?

> +{
> +#ifdef CONFIG_TLS_DEVICE
> +   skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> +#endif
> +}

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0b788df5a75b..9e92684479b9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3794,6 +3794,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
> skb_reserve(nskb, headroom);
> __skb_put(nskb, doffset);
> +#ifdef CONFIG_TLS_DEVICE
> +   nskb->decrypted = head_skb->decrypted;
> +#endif

decrypted is between header_start and headers_end, so
__copy_skb_header just below should take care of this?


> diff --git a/net/core/sock.c b/net/core/sock.c
> index d57b0cc995a0..b0c10b518e65 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock 
> *sk)
>  }
>  EXPORT_SYMBOL(skb_set_owner_w);
>
> +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> +{
> +#ifdef CONFIG_TLS_DEVICE
> +   /* Drivers depend on in-order delivery for crypto offload,
> +* partial orphan breaks out-of-order-OK logic.
> +*/
> +   if (skb->decrypted)
> +   return false;
> +#endif
> +#ifdef CONFIG_INET
> +   if (skb->destructor == tcp_wfree)
> +   return true;
> +#endif
> +   return skb->destructor == sock_wfree;
> +}
> +

Just insert the skb->decrypted check into skb_orphan_partial for less
code churn?

I also think that this is an independent concern from leaking plain
text, so perhaps could be a separate patch.


Re: [PATCH net v2] ipvs: Improve robustness to the ipvs sysctl

2019-07-31 Thread hujunwei
Hello, Julian

On 2019/7/31 3:29, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Tue, 30 Jul 2019, hujunwei wrote:
> 
>> From: Junwei Hu 
>>
>> The ipvs module parse the user buffer and save it to sysctl,
>> then check if the value is valid. invalid value occurs
>> over a period of time.
>> Here, I add a variable, struct ctl_table tmp, used to read
>> the value from the user buffer, and save only when it is valid.
>> I delete proc_do_sync_mode and use extra1/2 in table for the
>> proc_dointvec_minmax call.
>>
>> Fixes: f73181c8288f ("ipvs: add support for sync threads")
>> Signed-off-by: Junwei Hu 
> 
>   Looks good to me, thanks!
> 
> Acked-by: Julian Anastasov 
> 
>   BTW, why ip_vs_zero_all everywhere? Due to old git version?
> 

I will update the git version and send the patch v3.

Regards,
Junwei



Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread Greg KH
On Wed, Jul 31, 2019 at 08:48:24AM -0700, David Miller wrote:
> From: Greg Kroah-Hartman 
> Date: Wed, 31 Jul 2019 13:35:22 +0200
> 
> > On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
> >> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
> >> 
> >> Today's -next fails to build an ARM allmodconfig due to:
> >> 
> >> > allmodconfig (arm, gcc-8) ― FAIL, 1 error, 40 warnings, 0 section 
> >> > mismatches
> >> > 
> >> > Errors:
> >> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
> >> > function 'writeq'; did you mean 'writel'? 
> >> > [-Werror=implicit-function-declaration]
> >> 
> >> as a result of the changes that introduced:
> >> 
> >> WARNING: unmet direct dependencies detected for MDIO_OCTEON
> >>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 
> >> 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
> >>   Selected by [m]:
> >>   - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && 
> >> NETDEVICES [=y] || COMPILE_TEST [=y])
> >> 
> >> which is triggered by the staging OCTEON_ETHERNET driver which misses a
> >> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
> >> (staging/octeon: Allow test build on !MIPS).
> > 
> > A patch was posted for this, but it needs to go through the netdev tree
> > as that's where the offending patches are coming from.
> 
> I didn't catch that, was netdev CC:'d?

Nope, just you :(

I'll resend it now and cc: netdev.

thanks,

greg k-h


[PATCH net v3] ipvs: Improve robustness to the ipvs sysctl

2019-07-31 Thread hujunwei
From: Junwei Hu 

The ipvs module parses the user buffer and saves the result to sysctl
before checking whether the value is valid, so an invalid value can be
observable for a period of time.
Here, I add a variable, struct ctl_table tmp, used to read the value
from the user buffer, and save it to the live sysctl only when it is
valid. I also delete proc_do_sync_mode and use extra1/2 in the table
for the proc_dointvec_minmax call.

Fixes: f73181c8288f ("ipvs: add support for sync threads")
Signed-off-by: Junwei Hu 
Acked-by: Julian Anastasov 
---
V1->V2:
- delete proc_do_sync_mode and use proc_dointvec_minmax call.
V2->V3:
- update git version
---
 net/netfilter/ipvs/ip_vs_ctl.c | 69 +-
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 060565e7d227..72189559a1cd 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int write,
int val = *valp;
int rc;

-   rc = proc_dointvec(table, write, buffer, lenp, ppos);
+   struct ctl_table tmp = {
+   .data = &val,
+   .maxlen = sizeof(int),
+   .mode = table->mode,
+   };
+
+   rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
if (write && (*valp != val)) {
-   if ((*valp < 0) || (*valp > 3)) {
-   /* Restore the correct value */
-   *valp = val;
+   if (val < 0 || val > 3) {
+   rc = -EINVAL;
} else {
+   *valp = val;
update_defense_level(ipvs);
}
}
@@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int 
write,
int *valp = table->data;
int val[2];
int rc;
+   struct ctl_table tmp = {
+   .data = &val,
+   .maxlen = table->maxlen,
+   .mode = table->mode,
+   };

-   /* backup the value first */
memcpy(val, valp, sizeof(val));
-
-   rc = proc_dointvec(table, write, buffer, lenp, ppos);
-   if (write && (valp[0] < 0 || valp[1] < 0 ||
-   (valp[0] >= valp[1] && valp[1]))) {
-   /* Restore the correct value */
-   memcpy(valp, val, sizeof(val));
-   }
-   return rc;
-}
-
-static int
-proc_do_sync_mode(struct ctl_table *table, int write,
-void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-   int *valp = table->data;
-   int val = *valp;
-   int rc;
-
-   rc = proc_dointvec(table, write, buffer, lenp, ppos);
-   if (write && (*valp != val)) {
-   if ((*valp < 0) || (*valp > 1)) {
-   /* Restore the correct value */
-   *valp = val;
-   }
+   rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+   if (write) {
+   if (val[0] < 0 || val[1] < 0 ||
+   (val[0] >= val[1] && val[1]))
+   rc = -EINVAL;
+   else
+   memcpy(valp, val, sizeof(val));
}
return rc;
 }
@@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write,
int val = *valp;
int rc;

-   rc = proc_dointvec(table, write, buffer, lenp, ppos);
+   struct ctl_table tmp = {
+   .data = &val,
+   .maxlen = sizeof(int),
+   .mode = table->mode,
+   };
+
+   rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
if (write && (*valp != val)) {
-   if (*valp < 1 || !is_power_of_2(*valp)) {
-   /* Restore the correct value */
+   if (val < 1 || !is_power_of_2(val))
+   rc = -EINVAL;
+   else
*valp = val;
-   }
}
return rc;
 }
@@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = {
.procname   = "sync_version",
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_do_sync_mode,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = SYSCTL_ZERO,
+   .extra2 = SYSCTL_ONE,
},
{
.procname   = "sync_ports",
-- 
2.21.GIT



Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread Nathan Chancellor
On Wed, Jul 31, 2019 at 06:00:43PM +0200, Greg KH wrote:
> On Wed, Jul 31, 2019 at 08:48:24AM -0700, David Miller wrote:
> > From: Greg Kroah-Hartman 
> > Date: Wed, 31 Jul 2019 13:35:22 +0200
> > 
> > > On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
> > >> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
> > >> 
> > >> Today's -next fails to build an ARM allmodconfig due to:
> > >> 
> > >> > allmodconfig (arm, gcc-8) ― FAIL, 1 error, 40 warnings, 0 section 
> > >> > mismatches
> > >> > 
> > >> > Errors:
> > >> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration 
> > >> > of function 'writeq'; did you mean 'writel'? 
> > >> > [-Werror=implicit-function-declaration]
> > >> 
> > >> as a result of the changes that introduced:
> > >> 
> > >> WARNING: unmet direct dependencies detected for MDIO_OCTEON
> > >>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] 
> > >> && 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
> > >>   Selected by [m]:
> > >>   - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && 
> > >> NETDEVICES [=y] || COMPILE_TEST [=y])
> > >> 
> > >> which is triggered by the staging OCTEON_ETHERNET driver which misses a
> > >> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
> > >> (staging/octeon: Allow test build on !MIPS).
> > > 
> > > A patch was posted for this, but it needs to go through the netdev tree
> > > as that's where the offending patches are coming from.
> > 
> > I didn't catch that, was netdev CC:'d?
> 
> Nope, just you :(
> 
> I'll resend it now and cc: netdev.
> 
> thanks,
> 
> greg k-h

If it is this patch:

https://lore.kernel.org/netdev/20190731160219.ga2...@kroah.com/

It doesn't resolve that issue. I applied it and tested on next-20190731.

$ make -j$(nproc) -s ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- O=out distclean 
allyesconfig drivers/net/phy/

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y] && 64BIT 
&& HAS_IOMEM [=y] && OF_MDIO [=y]
  Selected by [y]:
  - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC || COMPILE_TEST 
[=y]) && NETDEVICES [=y]
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
../drivers/net/phy/mdio-octeon.c:48:3: warning: cast from pointer to integer of 
different size [-Wpointer-to-int-cast]
   48 |   (u64)devm_ioremap(&pdev->dev, mdio_phys, regsize);
  |   ^
In file included from ../drivers/net/phy/mdio-octeon.c:14:
../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
function 'writeq'; did you mean 'writeb'? 
[-Werror=implicit-function-declaration]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
  |^~
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro 
'oct_mdio_writeq'
   56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
  |  ^~~
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer 
of different size [-Wint-to-pointer-cast]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
  |^
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro 
'oct_mdio_writeq'
   56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
  |  ^~~
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer 
of different size [-Wint-to-pointer-cast]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
  |^
../drivers/net/phy/mdio-octeon.c:77:2: note: in expansion of macro 
'oct_mdio_writeq'
   77 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
  |  ^~~
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_remove':
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer 
of different size [-Wint-to-pointer-cast]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
  |^
../drivers/net/phy/mdio-octeon.c:91:2: note: in expansion of macro 
'oct_mdio_writeq'
   91 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
  |  ^~

Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)

2019-07-31 Thread David Miller
From: Nathan Chancellor 
Date: Wed, 31 Jul 2019 09:35:09 -0700

> In file included from ../drivers/net/phy/mdio-octeon.c:14:
> ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of 
> function 'writeq'; did you mean 'writeb'? 
> [-Werror=implicit-function-declaration]
>   111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)

One of the hi-lo, lo-hi writeq/readq headers has to be included.


Re: [PATCH bpf-next 1/9] selftests/bpf: prevent headers to be compiled as C code

2019-07-31 Thread Andrii Nakryiko
On Wed, Jul 31, 2019 at 6:21 AM Ilya Leoshkevich  wrote:
>
> > Am 27.07.2019 um 20:53 schrieb Andrii Nakryiko :
> >
> > On Fri, Jul 26, 2019 at 3:01 PM Stanislav Fomichev  wrote:
> >>
> >> On 07/26, Andrii Nakryiko wrote:
> >>> On Fri, Jul 26, 2019 at 2:21 PM Stanislav Fomichev  
> >>> wrote:
> 
>  On 07/26, Andrii Nakryiko wrote:
> > Apprently listing header as a normal dependency for a binary output
> > makes it go through compilation as if it was C code. This currently
> > works without a problem, but in subsequent commits causes problems for
> > differently generated test.h for test_progs. Marking those headers as
> > order-only dependency solves the issue.
>  Are you sure it will not result in a situation where
>  test_progs/test_maps is not regenerated if tests.h is updated.
> 
>  If I read the following doc correctly, order deps make sense for
>  directories only:
>  https://www.gnu.org/software/make/manual/html_node/Prerequisite-Types.html
> 
>  Can you maybe double check it with:
>  * make
>  * add new prog_tests/test_something.c
>  * make
>  to see if the binary is regenerated with test_something.c?
> >>>
> >>> Yeah, tested that, it triggers test_progs rebuild.
> >>>
> >>> Ordering is still preserved, because test.h is dependency of
> >>> test_progs.c, which is dependency of test_progs binary, so that's why
> >>> it works.
> >>>
> >>> As to why .h file is compiled as C file, I have no idea and ideally
> >>> that should be fixed somehow.
> >> I guess that's because it's a prerequisite and we have a target that
> >> puts all prerequisites when calling CC:
> >>
> >> test_progs: a.c b.c tests.h
> >>gcc a.c b.c tests.h -o test_progs
> >>
> >> So gcc compiles each input file.
> >
> > If that's really a definition of the rule, then it seems not exactly
> > correct. What if some of the prerequisites are some object files,
> > directories, etc. I'd say the rule should only include .c files. I'll
> > add it to my TODO list to go and check how all this is defined
> > somewhere, but for now I'm leaving everything as is in this patch.
> >
>
> I believe it’s an implicit built-in rule, which is defined by make
> itself here:
>
> https://git.savannah.gnu.org/cgit/make.git/tree/default.c?h=4.2.1#n131
>
> The observed behavior arises because these rules use $^ all over the
> place. My 2c is that it would be better to make the rule explicit,
> because it would cost just an extra line, but it would be immediately
> obvious why the code is correct w.r.t. rebuilds.

Thanks for pointing this out, Ilya! I agree, I'd rather have a simple
explicit rule in Makefile that having to guess where this is coming
from and why it doesn't work as you'd expect it to. If no one else
adds that first, I'll eventually get to this.

>
> Best regards,
> Ilya
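
An explicit rule of the kind being discussed might look like the sketch below — the header stays in the prerequisite list so edits to it still trigger a rebuild, but $(filter %.c,$^) keeps it off the compiler command line, unlike make's built-in rule, which expands all of $^. File names here are illustrative, not the actual selftests/bpf rule:

```make
# Hypothetical sketch: tests.h triggers rebuilds but is never fed to $(CC).
test_progs: test_progs.c prog_tests/tests.h
	$(CC) $(CFLAGS) $(filter %.c,$^) -o $@
```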


Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm

2019-07-31 Thread Andrii Nakryiko
On Wed, Jul 31, 2019 at 1:30 AM Song Liu  wrote:
>
>
>
> > On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko  
> > wrote:
> >
> > On Tue, Jul 30, 2019 at 10:19 PM Song Liu  wrote:
> >>
> >>
> >>
> >>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko  
> >>> wrote:
> >>>
> >>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu  wrote:
> 
> 
> 
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko  wrote:
> >
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko 
>
> [...]
>
> 
> >>>
> >>> I just picked the most succinct and non-repetitive form. It's
> >>> immediately apparent which type it's implicitly converted to, so I
> >>> felt there is no need to repeat it. Also, just (void *) is much
> >>> shorter. :)
> >>
> >> _All_ other code in btf.c converts the pointer to the target type.
> >
> > Most in libbpf.c doesn't, though. Also, I try to preserve pointer
> > constness for uses that don't modify BTF types (pretty much all of
> > them in libbpf), so it becomes really verbose, despite extremely short
> > variable names:
> >
> > const struct btf_member *m = (const struct btf_member *)(t + 1);
>
> I don't think being verbose is a big problem here. Overusing

"Problem" is too big and strong a word to describe this :). The verbose
form hurts readability and often quite artificially forces either
wrapping the line or splitting the declaration and assignment. Void *,
on the other hand, is short and usually sits on the same line as the
target type declaration; if not, you only have to find the local
variable declaration to double-check the type when unsure.

Using (void *) + implicit cast to target pointer type is not
unprecedented in libbpf:

$ rg ' = \((const )?struct \w+ \*\)' tools/lib/bpf/ | wc -l
52
$ rg ' = \((const )?void \*\)' tools/lib/bpf/  | wc -l
35

52 vs 35 is majority overall, but not by a landslide.

> (void *) feels like a bigger problem.

Why do you feel it's a problem? void * conveys that we have a piece of
memory that we will need to reinterpret as some concrete pointer type.
That's what we are doing, skipping btf_type and then interpreting
memory after common btf_type prefix is some other type, depending on
actual BTF kind. I don't think void * is misleading in any way.

In any case, if you still feel strongly about this after all my
arguments, please let me know and I will convert them in this patch
set. It's not like I'm opposed to use duplicate type names (though it
does feel sort of Java-like before it got limited type inference),
it's just in practice it leads to unnecessarily verbose code which
doesn't really improve anything.

>
> >
> > Add one or two levels of nestedness and you are wrapping this line.
> >
> >> In some cases, it is not apparent which type it is converted to,
> >> for example:
> >>
> >> +   m = (void *)(targ_type + 1);
> >>
> >> I would suggest we do implicit conversion whenever possible.
> >
> > Implicit conversion (`m = targ_type + 1;`) is a compilation error,
> > that won't work.
>
> I misused "implicit" here. I actually meant to say
>
> m = (const struct btf_member *)(t + 1);

Ah, so you meant explicit, yep. It's either `void *` or `const struct
something *` then.

>
>
>


Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl

2019-07-31 Thread Julian Anastasov


Hello,

On Thu, 1 Aug 2019, hujunwei wrote:

> From: Junwei Hu 
> 
> The ipvs module parse the user buffer and save it to sysctl,
> then check if the value is valid. invalid value occurs
> over a period of time.
> Here, I add a variable, struct ctl_table tmp, used to read
> the value from the user buffer, and save only when it is valid.
> I delete proc_do_sync_mode and use extra1/2 in table for the
> proc_dointvec_minmax call.
> 
> Fixes: f73181c8288f ("ipvs: add support for sync threads")
> Signed-off-by: Junwei Hu 
> Acked-by: Julian Anastasov 

Yep, Acked-by: Julian Anastasov 

> ---
> V1->V2:
> - delete proc_do_sync_mode and use proc_dointvec_minmax call.
> V2->V3:
> - update git version
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 69 +-
>  1 file changed, 35 insertions(+), 34 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 060565e7d227..72189559a1cd 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int 
> write,
>   int val = *valp;
>   int rc;
> 
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = sizeof(int),
> + .mode = table->mode,
> + };
> +
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
>   if (write && (*valp != val)) {
> - if ((*valp < 0) || (*valp > 3)) {
> - /* Restore the correct value */
> - *valp = val;
> + if (val < 0 || val > 3) {
> + rc = -EINVAL;
>   } else {
> + *valp = val;
>   update_defense_level(ipvs);
>   }
>   }
> @@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int 
> write,
>   int *valp = table->data;
>   int val[2];
>   int rc;
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = table->maxlen,
> + .mode = table->mode,
> + };
> 
> - /* backup the value first */
>   memcpy(val, valp, sizeof(val));
> -
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> - if (write && (valp[0] < 0 || valp[1] < 0 ||
> - (valp[0] >= valp[1] && valp[1]))) {
> - /* Restore the correct value */
> - memcpy(valp, val, sizeof(val));
> - }
> - return rc;
> -}
> -
> -static int
> -proc_do_sync_mode(struct ctl_table *table, int write,
> -  void __user *buffer, size_t *lenp, loff_t *ppos)
> -{
> - int *valp = table->data;
> - int val = *valp;
> - int rc;
> -
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> - if (write && (*valp != val)) {
> - if ((*valp < 0) || (*valp > 1)) {
> - /* Restore the correct value */
> - *valp = val;
> - }
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> + if (write) {
> + if (val[0] < 0 || val[1] < 0 ||
> + (val[0] >= val[1] && val[1]))
> + rc = -EINVAL;
> + else
> + memcpy(valp, val, sizeof(val));
>   }
>   return rc;
>  }
> @@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write,
>   int val = *valp;
>   int rc;
> 
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = sizeof(int),
> + .mode = table->mode,
> + };
> +
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
>   if (write && (*valp != val)) {
> - if (*valp < 1 || !is_power_of_2(*valp)) {
> - /* Restore the correct value */
> + if (val < 1 || !is_power_of_2(val))
> + rc = -EINVAL;
> + else
>   *valp = val;
> - }
>   }
>   return rc;
>  }
> @@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = {
>   .procname   = "sync_version",
>   .maxlen = sizeof(int),
>   .mode   = 0644,
> - .proc_handler   = proc_do_sync_mode,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
>   },
>   {
>   .procname   = "sync_ports",
> -- 
> 2.21.GIT

Regards

--
Julian Anastasov 


Re: [PATCH bpf-next v4 04/11] xsk: add support to allow unaligned chunk placement

2019-07-31 Thread Jonathan Lemon




On 30 Jul 2019, at 1:53, Kevin Laatz wrote:


Currently, addresses are chunk size aligned. This means we are very
restricted in terms of where we can place chunks within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be
placed at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).

This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing
with aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.

Signed-off-by: Kevin Laatz 
Signed-off-by: Ciara Loftus 
Signed-off-by: Bruce Richardson 

---
v2:
  - Add checks for the flags coming from userspace
  - Fix how we get chunk_size in xsk_diag.c
  - Add defines for masking the new descriptor format
  - Modified the rx functions to use new descriptor format
  - Modified the tx functions to use new descriptor format

v3:
  - Add helper function to do address/offset masking/addition

v4:
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - move offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - removed next_pg_contig field from xdp_umem_page struct. Using low 12
    bits of addr to store flags instead.
  - other minor changes based on review comments
---
 include/net/xdp_sock.h  | 40 ++-
 include/uapi/linux/if_xdp.h | 14 ++-
 net/xdp/xdp_umem.c  | 18 ++---
 net/xdp/xsk.c   | 79 +
 net/xdp/xsk_diag.c  |  2 +-
 net/xdp/xsk_queue.h | 69 
 6 files changed, 188 insertions(+), 34 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..a755e8ab6cac 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -16,6 +16,13 @@
 struct net_device;
 struct xsk_queue;

+/* Masks for xdp_umem_page flags.
+ * The low 12-bits of the addr will be 0 since this is the page address, so we
+ * can use them for flags.
+ */
+#define XSK_NEXT_PG_CONTIG_SHIFT 0
+#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
+
 struct xdp_umem_page {
void *addr;
dma_addr_t dma;
@@ -48,6 +55,7 @@ struct xdp_umem {
bool zc;
spinlock_t xsk_list_lock;
struct list_head xsk_list;
+   u16 flags;
 };

 struct xdp_sock {
@@ -98,12 +106,21 @@ struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);

 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+   unsigned long page_addr;
+
+   addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+   addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+   page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
+
+   return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
 }

 static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
-	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+   addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+   addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+
+   return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
 }

 /* Reuse-queue aware version of FILL queue helpers */
@@ -144,6 +161,19 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)

	rq->handles[rq->length++] = addr;
 }
+
+/* Handle the offset appropriately depending on aligned or unaligned mode.
+ * For unaligned mode, we store the offset in the upper 16-bits of the address.
+ * For aligned mode, we simply add the offset to the address.
+ */
+static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 address,
+					 u64 offset)
+{
+   if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
+   return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+   else
+   return address + offset;
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -241,6 +271,12 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
 {
 }

+static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 handle,
+					 u64 offset)
+{
+   return 0;
+}
+
 #endif /* CONFIG_XDP_SOCKETS */

 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index faaa5ca2a117..4a5490651b22 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -17,6 +17,10 @@
 #define XDP_COPY   (1 << 1) /* Force copy-mode */
 #define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */

+/* Flags fo

Re: [PATCH bpf-next v4 07/11] mlx5e: modify driver for handling offsets

2019-07-31 Thread Jonathan Lemon




On 30 Jul 2019, at 1:53, Kevin Laatz wrote:

With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz 

---
v3:
  - Use new helper function to handle offset

v4:
  - fixed headroom addition to handle. Using xsk_umem_adjust_headroom()
    now.
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c| 8 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index b0b982cf69bb..d5245893d2c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
  void *va, u16 *rx_headroom, u32 *len, bool xsk)
 {
struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+   struct xdp_umem *umem = rq->umem;
struct xdp_buff xdp;
u32 act;
int err;
@@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
xdp.rxq = &rq->xdp_rxq;

act = bpf_prog_run_xdp(prog, &xdp);
-   if (xsk)
-   xdp.handle += xdp.data - xdp.data_hard_start;
+   if (xsk) {
+   u64 off = xdp.data - xdp.data_hard_start;
+
+   xdp.handle = xsk_umem_handle_offset(umem, xdp.handle, off);


Shouldn't this be xdp_umem_adjust_offset()?



+   }
switch (act) {
case XDP_PASS:
*rx_headroom = xdp.data - xdp.data_hard_start;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 6a55573ec8f2..7c49a66d28c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
if (!xsk_umem_peek_addr_rq(umem, &handle))
return -ENOMEM;

-   dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+   dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
+ rq->buff.umem_headroom);
dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);

 	/* No need to add headroom to the DMA address. In striding RQ case, we

--
2.17.1


Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload

2019-07-31 Thread Jakub Kicinski
On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > sk_validate_xmit_skb() and drivers depend on the sk member of
> > struct sk_buff to identify segments requiring encryption.
> > Any operation which removes or does not preserve the original TLS
> > socket such as skb_orphan() or skb_clone() will cause clear text
> > leaks.
> >
> > Make the TCP socket underlying an offloaded TLS connection
> > mark all skbs as decrypted, if TLS TX is in offload mode.
> > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > (or a socket with no validation) and decrypted flag set.
> >
> > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > they all imply TLS offload. The new checks are guarded by
> > CONFIG_TLS_DEVICE because that's the option guarding the
> > sk_buff->decrypted member.
> >
> > Second, smaller issue with orphaning is that it breaks
> > the guarantee that packets will be delivered to device
> > queues in-order. All TLS offload drivers depend on that
> > scheduling property. This means skb_orphan_partial()'s
> > trick of preserving partial socket references will cause
> > issues in the drivers. We need a full orphan, and as a
> > result netem delay/throttling will cause all TLS offload
> > skbs to be dropped.
> >
> > Reusing the sk_buff->decrypted flag also protects from
> > leaking clear text when incoming, decrypted skb is redirected
> > (e.g. by TC).
> >
> > Signed-off-by: Jakub Kicinski 

> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..b0c10b518e65 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct 
> > sock *sk)
> >  }
> >  EXPORT_SYMBOL(skb_set_owner_w);
> >
> > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +   /* Drivers depend on in-order delivery for crypto offload,
> > +* partial orphan breaks out-of-order-OK logic.
> > +*/
> > +   if (skb->decrypted)
> > +   return false;
> > +#endif
> > +#ifdef CONFIG_INET
> > +   if (skb->destructor == tcp_wfree)
> > +   return true;
> > +#endif
> > +   return skb->destructor == sock_wfree;
> > +}
> > +  
> 
> Just insert the skb->decrypted check into skb_orphan_partial for less
> code churn?

Okie.. skb_orphan_partial() is a little ugly but will do.

> I also think that this is an independent concern from leaking plain
> text, so perhaps could be a separate patch.

Do you mean the out-of-order stuff is a separate concern?

It is, I had them separate at the first try, but GSO code looks at
the destructor and IIRC only copies the socket if it's still tcp_wfree.
If we let partial orphan be, we have to do hairy temporary stuff in
tcp_gso_segment(). It's easier to just deal with partial orphan here.


Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface

2019-07-31 Thread Daniel T. Lee
On Wed, Jul 31, 2019 at 7:08 PM Jesper Dangaard Brouer
 wrote:
>
> On Wed, 31 Jul 2019 03:48:20 +0900
> "Daniel T. Lee"  wrote:
>
> > By this commit, using `bpftool net load`, user can load XDP prog on
> > interface. New type of enum 'net_load_type' has been made, as stated at
> > cover-letter, the meaning of 'load' is, prog will be loaded on interface.
>
> Why the keyword "load" ?
> Why not "attach" (and "detach")?
>
> For BPF there is a clear distinction between the "load" and "attach"
> steps.  I know this is under subcommand "net", but to follow the
> conversion of other subcommands e.g. "prog" there are both "load" and
> "attach" commands.
>
>
> > BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'.
>
> Again this is a "set" operation, not a "load" operation.

From the earlier cover letter, I thought using the same word 'load' might cause
confusion, since an XDP program is not considered a 'bpf_attach_type' and can't
be attached with 'BPF_PROG_ATTACH'.

But, according to the feedback from you and Andrii Nakryiko, replacing
the word 'load' with 'attach' would be clearer and more consistent.

> > Signed-off-by: Daniel T. Lee 
>
> [...]
> >  static int do_show(int argc, char **argv)
> >  {
> >   struct bpf_attach_info attach_info = {};
> > @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
> >
> >   fprintf(stderr,
> >   "Usage: %s %s { show | list } [dev ]\n"
> > + "   %s %s load PROG LOAD_TYPE \n"
>
> The "PROG" here does it correspond to the 'bpftool prog' syntax?:
>
>  PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }
>

Yes. By using the same 'prog_parse_fd' from 'bpftool prog',
user can 'attach' XDP prog with id, pinned file or tag.

> >   "   %s %s help\n"
> > + "\n"
> > + "   " HELP_SPEC_PROGRAM "\n"
> > + "   LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload 
> > }\n"
> >   "Note: Only xdp and tc attachments are supported now.\n"
> >   "  For progs attached to cgroups, use \"bpftool 
> > cgroup\"\n"
> >   "  to dump program attachments. For program types\n"
> >   "  sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> >   "  consult iproute2.\n",
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

And about the enum 'NET_LOAD_TYPE_XDP_DRIVE',
'DRIVER' is clearer.

Will change it right away.

Thanks for the review.


Re: [PATCH bpf-next v4 09/11] samples/bpf: add buffer recycling for unaligned chunks to xdpsock

2019-07-31 Thread Jonathan Lemon




On 30 Jul 2019, at 1:53, Kevin Laatz wrote:

This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16-bits for the offset and the lower 48-bits for the addr.

Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on its own).

Signed-off-by: Kevin Laatz 
Signed-off-by: Bruce Richardson 

---
v2:
  - Removed unused defines
  - Fix buffer recycling for unaligned case
  - Remove --buf-size (--frame-size merged before this)
  - Modifications to use the new descriptor format for buffer recycling
---
 samples/bpf/xdpsock_user.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 756b00eb1afe..62b2059cd0e3 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -475,6 +475,7 @@ static void kick_tx(struct xsk_socket_info *xsk)

 static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 {
+   struct xsk_umem_info *umem = xsk->umem;
u32 idx_cq = 0, idx_fq = 0;
unsigned int rcvd;
size_t ndescs;
@@ -487,22 +488,21 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
xsk->outstanding_tx;

/* re-add completed Tx buffers */
-   rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
+   rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
if (rcvd > 0) {
unsigned int i;
int ret;

-   ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+   ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
while (ret != rcvd) {
if (ret < 0)
exit_with_error(-ret);
-   ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
-&idx_fq);
+   ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
}
+
for (i = 0; i < rcvd; i++)
-   *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
-   *xsk_ring_cons__comp_addr(&xsk->umem->cq,
- idx_cq++);
+   *xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
+   *xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);

xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
xsk_ring_cons__release(&xsk->umem->cq, rcvd);
@@ -549,7 +549,11 @@ static void rx_drop(struct xsk_socket_info *xsk)
for (i = 0; i < rcvd; i++) {
u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
-   char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+   u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+
+   addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+   char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+   addr + offset);


The mask constants should not be part of the api - this should be
hidden behind an accessor.

Something like:
  u64 addr = xsk_umem__get_addr(xsk->umem, handle);





hex_dump(pkt, len, addr);
*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
@@ -655,7 +659,9 @@ static void l2fwd(struct xsk_socket_info *xsk)
  idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
 idx_rx++)->len;
-   char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+   u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+   char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+   (addr & XSK_UNALIGNED_BUF_ADDR_MASK) + offset);

swap_mac_addresses(pkt);

--
2.17.1


Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload

2019-07-31 Thread Jakub Kicinski
On Wed, 31 Jul 2019 13:57:26 +, Boris Pismenny wrote:
> > diff --git a/Documentation/networking/tls-offload.rst 
> > b/Documentation/networking/tls-offload.rst
> > index 048e5ca44824..2bc3ab5515d8 100644
> > --- a/Documentation/networking/tls-offload.rst
> > +++ b/Documentation/networking/tls-offload.rst
> > @@ -499,15 +499,6 @@ offloads, old connections will remain active after 
> > flags are cleared.
> >   Known bugs
> >   ==
> >   
> > -skb_orphan() leaks clear text
> > --
> > -
> > -Currently drivers depend on the :c:member:`sk` member of
> > -:c:type:`struct sk_buff ` to identify segments requiring
> > -encryption. Any operation which removes or does not preserve the socket
> > -association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
> > -will cause the driver to miss the packets and lead to clear text leaks.
> > -
> >   Redirects leak clear text
> >   -  
> Doesn't this patch cover the redirect case as well?

Ah, good catch! I thought this entry was for socket redirect, which
will be a separate fix, but it seems we don't have an entry for that
one. 

> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 228db3998e46..90f3f552c789 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -814,6 +814,7 @@ enum sock_flags {
> > SOCK_TXTIME,
> > SOCK_XDP, /* XDP is attached */
> > SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > +   SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
> >   };
> >   
> >   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
> > SOCK_TIMESTAMPING_RX_SOFTWARE))
> > @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
> > return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> >   }
> >   
> > +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff 
> > *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +   skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> > +#endif
> > +}  
> 
> Have you considered the following alternative to calling 
> sk_set_tx_crypto() - Add a new MSG_DECRYPTE flag that will set the 
> skb->decrypted bit in the do_tcp_sendpages function.
> 
> Then, the rest of the TCP code can propagate this bit from the existing skb.

That'd also work marking the socket as crypto seemed easy enough. I was
planning on using sk_rx_crypto_match() for socket redirect also, so
it'd be symmetrical. Is there a preference for using the internal flags?

> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index f62f0e7e3cdd..dee93eab02fe 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
> > *page, int offset,
> > if (!skb)
> > goto wait_for_memory;
> >   
> > +   sk_set_tx_crypto(sk, skb);
> > skb_entail(sk, skb);
> > copy = size_goal;
> > }
> > @@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
> > *page, int offset,
> >   
> > i = skb_shinfo(skb)->nr_frags;
> > can_coalesce = skb_can_coalesce(skb, i, page, offset);
> > -   if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > +   if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
> > +   !sk_tx_crypto_match(sk, skb)) {  
> 
> I see that sk_tx_crypto_match is called only here to handle cases where 
> the socket expected crypto offload, while the skb is not marked 
> decrypted. AFAIU, this should not be possible, because we set the 
> skb->eor bit for the last plaintext skb before sending any traffic that 
> expects crypto offload.

Ack, I missed the tcp_skb_can_collapse_to() above.


[PATCH] net: mdio-octeon: Fix build error and Kconfig warning

2019-07-31 Thread Nathan Chancellor
arm allyesconfig warns:

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
&& 64BIT && HAS_IOMEM [=y] && OF_MDIO [=y]
  Selected by [y]:
  - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC &&
NETDEVICES [=y] || COMPILE_TEST [=y])

and errors:

In file included from ../drivers/net/phy/mdio-octeon.c:14:
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
function 'writeq'; did you mean 'writeb'?
[-Werror=implicit-function-declaration]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
      |                                    ^~~~~~
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro
'oct_mdio_writeq'
   56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
      |  ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

This allows MDIO_OCTEON to be built with COMPILE_TEST as well and
includes the proper header for readq/writeq. This does not address
the several -Wint-to-pointer-cast and -Wpointer-to-int-cast warnings
that appeared as a result of commit 171a9bae68c7 ("staging/octeon:
Allow test build on !MIPS") in these files.

Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
Reported-by: kbuild test robot 
Reported-by: Mark Brown 
Reported-by: Randy Dunlap 
Signed-off-by: Nathan Chancellor 
---
 drivers/net/phy/Kconfig   | 2 +-
 drivers/net/phy/mdio-cavium.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 20f14c5fbb7e..ed2edf4b5b0e 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -159,7 +159,7 @@ config MDIO_MSCC_MIIM
 
 config MDIO_OCTEON
tristate "Octeon and some ThunderX SOCs MDIO buses"
-   depends on 64BIT
+   depends on 64BIT || COMPILE_TEST
depends on HAS_IOMEM && OF_MDIO
select MDIO_CAVIUM
help
diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
index ed5f9bb5448d..b7f89ad27465 100644
--- a/drivers/net/phy/mdio-cavium.h
+++ b/drivers/net/phy/mdio-cavium.h
@@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
return cvmx_read_csr(addr);
 }
 #else
+#include 
+
 #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
 #define oct_mdio_readq(addr)   readq((void *)addr)
 #endif
-- 
2.22.0



Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload

2019-07-31 Thread Willem de Bruijn
On Wed, Jul 31, 2019 at 2:12 PM Jakub Kicinski
 wrote:
>
> On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> > On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > > sk_validate_xmit_skb() and drivers depend on the sk member of
> > > struct sk_buff to identify segments requiring encryption.
> > > Any operation which removes or does not preserve the original TLS
> > > socket such as skb_orphan() or skb_clone() will cause clear text
> > > leaks.
> > >
> > > Make the TCP socket underlying an offloaded TLS connection
> > > mark all skbs as decrypted, if TLS TX is in offload mode.
> > > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > > (or a socket with no validation) and decrypted flag set.
> > >
> > > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > > they all imply TLS offload. The new checks are guarded by
> > > CONFIG_TLS_DEVICE because that's the option guarding the
> > > sk_buff->decrypted member.
> > >
> > > Second, smaller issue with orphaning is that it breaks
> > > the guarantee that packets will be delivered to device
> > > queues in-order. All TLS offload drivers depend on that
> > > scheduling property. This means skb_orphan_partial()'s
> > > trick of preserving partial socket references will cause
> > > issues in the drivers. We need a full orphan, and as a
> > > result netem delay/throttling will cause all TLS offload
> > > skbs to be dropped.
> > >
> > > Reusing the sk_buff->decrypted flag also protects from
> > > leaking clear text when incoming, decrypted skb is redirected
> > > (e.g. by TC).
> > >
> > > Signed-off-by: Jakub Kicinski 
>
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..b0c10b518e65 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct 
> > > sock *sk)
> > >  }
> > >  EXPORT_SYMBOL(skb_set_owner_w);
> > >
> > > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > > +{
> > > +#ifdef CONFIG_TLS_DEVICE
> > > +   /* Drivers depend on in-order delivery for crypto offload,
> > > +* partial orphan breaks out-of-order-OK logic.
> > > +*/
> > > +   if (skb->decrypted)
> > > +   return false;
> > > +#endif
> > > +#ifdef CONFIG_INET
> > > +   if (skb->destructor == tcp_wfree)
> > > +   return true;
> > > +#endif
> > > +   return skb->destructor == sock_wfree;
> > > +}
> > > +
> >
> > Just insert the skb->decrypted check into skb_orphan_partial for less
> > code churn?
>
> Okie.. skb_orphan_partial() is a little ugly but will do.
>
> > I also think that this is an independent concern from leaking plain
> > text, so perhaps could be a separate patch.

Just a suggestion and very much depending on how much uglier it
becomes otherwise ;)

> Do you mean the out-of-order stuff is a separate concern?
>
> It is, I had them separate at the first try, but GSO code looks at
> the destructor and IIRC only copies the socket if it's still tcp_wfree.
> If we let partial orphan be, we have to do hairy temporary stuff in
> tcp_gso_segment(). It's easier to just deal with partial orphan here.

Okay, sounds good.


Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

2019-07-31 Thread Andy Lutomirski
On Wed, Jul 31, 2019 at 1:10 AM Song Liu  wrote:
>
>
>
> > On Jul 30, 2019, at 1:24 PM, Andy Lutomirski  wrote:
> >
> > On Mon, Jul 29, 2019 at 10:07 PM Song Liu  wrote:
> >>
> >> Hi Andy,
> >>
> >>> On Jul 27, 2019, at 11:20 AM, Song Liu  wrote:
> >>>
> >>> Hi Andy,
> >>>
> >>>
>
> [...]
>
> >>>
> >>
> >> I would like more comments on this.
> >>
> >> Currently, bpf permission is more or less "root or nothing", which we
> >> would like to change.
> >>
> >> The short term goal is to separate bpf from root, in other words, it is
> >> "all or nothing". Special user space utilities, such as systemd, would
> >> benefit from this. Once this is implemented, systemd can call sys_bpf()
> >> when it is not running as root.
> >
> > As generally nasty as Linux capabilities are, this sounds like a good
> > use for CAP_BPF_ADMIN.
>
> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

It's been done before -- it's not that hard.  IMO the main tricky bit
would be to be somewhat careful about defining exactly what
CAP_BPF_ADMIN does.

> > I don't see why you need to invent a whole new mechanism for this.
> > The entire cgroup ecosystem outside bpf() does just fine using the
> > write permission on files in cgroupfs to control access.  Why can't
> > bpf() do the same thing?
>
> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> we should have target concept for all these commands. But that is a
> much bigger project. OTOH, "all or nothing" model allows all these
> commands at once.

For BPF_PROG_LOAD, I admit I've never understood why permission is
required at all.  I think that CAP_SYS_ADMIN or similar should be
needed to get is_priv in the verifier, but I think that should mainly
be useful for tracing, and that requires lots of privilege anyway.
BPF_MAP_* is probably the trickiest part.  One solution would be some
kind of bpffs, but I'm sure other solutions are possible.


Re: [PATCH net-next 0/6] flow_offload: add indr-block in nf_table_offload

2019-07-31 Thread Jiri Pirko
Wed, Jul 31, 2019 at 10:12:27AM CEST, we...@ucloud.cn wrote:
>From: wenxu 
>
>This patch series makes nftables offload support vlan and
>tunnel device offload through the indr-block architecture.
>
>The first four patches move the tc indr block to flow offload and
>rename it to flow-indr-block, because the new flow-indr-block can't
>get the tcf_block directly. The fifth patch provides a callback list
>to get the flow_block of each subsystem immediately when the device
>registers and contains a block.
>The last patch makes nf_tables_offload support flow-indr-block.
>
>wenxu (6):
>  cls_api: modify the tc_indr_block_ing_cmd parameters.
>  cls_api: replace block with flow_block in tc_indr_block_dev
>  cls_api: add flow_indr_block_call function
>  flow_offload: move tc indirect block to flow offload
>  flow_offload: support get flow_block immediately
>  netfilter: nf_tables_offload: support indr block call

Wenxu, this is V5. Please repost it as V5. Also please provide a per-patch
changelog, as you did with the previous versions. Thanks!


Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm

2019-07-31 Thread Song Liu



> On Jul 31, 2019, at 10:18 AM, Andrii Nakryiko  
> wrote:
> 
> On Wed, Jul 31, 2019 at 1:30 AM Song Liu  wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko  
>>> wrote:
>>> 
>>> On Tue, Jul 30, 2019 at 10:19 PM Song Liu  wrote:
 
 
 
> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko  
> wrote:
> 
> On Tue, Jul 30, 2019 at 5:39 PM Song Liu  wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko  wrote:
>>> 
>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>> Every instruction that needs to be relocated has corresponding
>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>> to match recorded "local" relocation spec against potentially many
>>> compatible "target" types, creating corresponding spec. Details of the
>>> algorithm are noted in corresponding comments in the code.
>>> 
>>> Signed-off-by: Andrii Nakryiko 
>> 
>> [...]
>> 
>> 
> 
> I just picked the most succinct and non-repetitive form. It's
> immediately apparent which type it's implicitly converted to, so I
> felt there is no need to repeat it. Also, just (void *) is much
> shorter. :)
 
 _All_ other code in btf.c converts the pointer to the target type.
>>> 
>>> Most in libbpf.c doesn't, though. Also, I try to preserve pointer
>>> constness for uses that don't modify BTF types (pretty much all of
>>> them in libbpf), so it becomes really verbose, despite extremely short
>>> variable names:
>>> 
>>> const struct btf_member *m = (const struct btf_member *)(t + 1);
>> 
>> I don't think being verbose is a big problem here. Overusing
> 
> "Problem" is too big and strong a word to describe this :). It hurts
> readability and will often quite artificially force either wrapping
> the line or unnecessarily splitting declaration and assignment. Void *,
> on the other hand, is short and usually on the same line as the target
> type declaration; if not, you'll have to find the local variable
> declaration to double-check the type, if you are unsure.
> 
> Using (void *) + implicit cast to target pointer type is not
> unprecedented in libbpf:
> 
> $ rg ' = \((const )?struct \w+ \*\)' tools/lib/bpf/ | wc -l
> 52
> $ rg ' = \((const )?void \*\)' tools/lib/bpf/  | wc -l
> 35
> 
> 52 vs 35 is majority overall, but not by a landslide.
> 
>> (void *) feels like a bigger problem.
> 
> Why do you feel it's a problem? void * conveys that we have a piece of
> memory that we will need to reinterpret as some concrete pointer type.
> That's what we are doing, skipping btf_type and then interpreting
> memory after common btf_type prefix is some other type, depending on
> actual BTF kind. I don't think void * is misleading in any way.

(void *) hides some problem. For example:

struct type_a *ptr_a = NULL;
struct type_b *ptr_b = NULL;

/* we want this */
ptr_a = (struct type_a *)data;
ptr_b = (struct type_b *)(data + offset);

/* typo, should be ptr_b, compiler will complain */
ptr_a = (struct type_b *)(data + offset);

/* typo, should be ptr_b, compiler will ignore */
ptr_a = (void *)(data + offset);

Such a typo is not rare, and it may be really painful to debug.

That being said, I think we have spent too much time on this. I will 
let you make the final call. Either way:

Acked-by: Song Liu 

Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread Jiri Pirko
Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicin...@netronome.com wrote:
>On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> All devlink instances are created in init_net and stay there for a
>> lifetime. Allow user to be able to move devlink instances into
>> namespaces.
>> 
>> Signed-off-by: Jiri Pirko 
>
>I'm no namespace expert, but seems reasonable, so FWIW:
>
>Acked-by: Jakub Kicinski 
>
>If I read things right we will only send the devlink instance
>notification to other namespaces when it moves, but not
>notifications for sub-objects like ports. Is the expectation 
>that the user space dumps the objects it cares about or will
>the other notifications be added as needed (or option 3 - I 
>misread the code).

You are correct. I plan to take care of the notifications of all objects
in the follow-up patchset. But I can do it in this one if needed. Up to
you.


>
>I was also wondering if moving the devlink instance during a 
>multi-part transaction could be an issue. But AFAIU it should 
>be fine - the flashing doesn't release the instance lock, and 
>health stuff should complete correctly even if devlink is moved
>half way through?

Yes, I don't see any issue there either.



Re: [PATCH bpf 1/2] bpf: fix x64 JIT code generation for jmp to 1st insn

2019-07-31 Thread Song Liu



> On Jul 30, 2019, at 6:38 PM, Alexei Starovoitov  wrote:
> 
> Introduction of bounded loops exposed old bug in x64 JIT.
> JIT maintains the array of offsets to the end of all instructions to
> compute jmp offsets.
> addrs[0] - offset of the end of the 1st insn (that includes prologue).
> addrs[1] - offset of the end of the 2nd insn.
> JIT didn't keep the offset of the beginning of the 1st insn,
> since classic BPF didn't have backward jumps and valid extended BPF
> couldn't have a branch to 1st insn, because it didn't allow loops.
> With bounded loops it's possible to construct a valid program that
> jumps backwards to the 1st insn.
> Fix JIT by computing:
> addrs[0] - offset of the end of prologue == start of the 1st insn.
> addrs[1] - offset of the end of 1st insn.
> 
> Reported-by: syzbot+35101610ff3e83119...@syzkaller.appspotmail.com
> Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
> Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
> Signed-off-by: Alexei Starovoitov 

Acked-by: Song Liu 

Do we need a similar fix for x86_32? 

Thanks,
Song
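
For reference, the displacement computation that this indexing fix makes correct can be sketched like so (a minimal illustration, not the kernel's JIT code; the byte offsets are made up):

```c
#include <assert.h>

/* With the fix, addrs[0] is the end of the prologue (== start of the
 * 1st insn) and addrs[i] is the end of the i-th instruction (1-based).
 * A jump in insn i with BPF offset `off` targets insn i + off + 1,
 * whose start is addrs[i + off]; the x86 displacement is measured from
 * the end of the jump instruction itself. */
static int jmp_displacement(const int *addrs, int i, int off)
{
        return addrs[i + off] - addrs[i];
}

/* toy layout: 16-byte prologue, then two 4-byte instructions */
static const int example_addrs[] = { 16, 20, 24 };
```

Without the extra addrs[0] slot, a backward jump to the 1st insn had no valid offset to subtract from, which is exactly the case the bounded-loop programs exposed.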

> ---
> arch/x86/net/bpf_jit_comp.c | 7 ---
> 1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index eaaed5bfc4a4..a56c95805732 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -390,8 +390,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
> u8 *image,
> 
>   emit_prologue(&prog, bpf_prog->aux->stack_depth,
> bpf_prog_was_classic(bpf_prog));
> + addrs[0] = prog - temp;
> 
> - for (i = 0; i < insn_cnt; i++, insn++) {
> + for (i = 1; i <= insn_cnt; i++, insn++) {
>   const s32 imm32 = insn->imm;
>   u32 dst_reg = insn->dst_reg;
>   u32 src_reg = insn->src_reg;
> @@ -1105,7 +1106,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *prog)
>   extra_pass = true;
>   goto skip_init_addrs;
>   }
> - addrs = kmalloc_array(prog->len, sizeof(*addrs), GFP_KERNEL);
> + addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
>   if (!addrs) {
>   prog = orig_prog;
>   goto out_addrs;
> @@ -1115,7 +1116,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *prog)
>* Before first pass, make a rough estimation of addrs[]
>* each BPF instruction is translated to less than 64 bytes
>*/
> - for (proglen = 0, i = 0; i < prog->len; i++) {
> + for (proglen = 0, i = 0; i <= prog->len; i++) {
>   proglen += 64;
>   addrs[i] = proglen;
>   }
> -- 
> 2.20.0
> 



Re: [PATCH bpf 2/2] selftests/bpf: tests for jmp to 1st insn

2019-07-31 Thread Song Liu



> On Jul 30, 2019, at 6:38 PM, Alexei Starovoitov  wrote:
> 
> Add 2 tests that check JIT code generation to jumps to 1st insn.
> 1st test is similar to syzbot reproducer.
> The backwards branch is never taken at runtime.
> 2nd test has branch to 1st insn that executes.
> The test is written as two bpf functions, since it's not possible
> to construct valid single bpf program that jumps to 1st insn.
> 
> Signed-off-by: Alexei Starovoitov 

Acked-by: Song Liu 
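
For reference, the second test's expected retval of 55 is the subprogram summing 10 down to 1; an equivalent C sketch of that bpf loop:

```c
#include <assert.h>

/* C equivalent of the second test's bpf subprogram: r1 = 10, r2 = 0,
 * then loop r2 += r1; r1 -= 1 until r1 == 0, returning r2. */
static int loop_sum(int n)
{
        int acc = 0;

        while (n != 0) {
                acc += n;
                n -= 1;
        }
        return acc;
}
```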

> ---
> tools/testing/selftests/bpf/verifier/loops1.c | 28 +++
> 1 file changed, 28 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/verifier/loops1.c 
> b/tools/testing/selftests/bpf/verifier/loops1.c
> index 5e980a5ab69d..1fc4e61e9f9f 100644
> --- a/tools/testing/selftests/bpf/verifier/loops1.c
> +++ b/tools/testing/selftests/bpf/verifier/loops1.c
> @@ -159,3 +159,31 @@
>   .errstr = "loop detected",
>   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
> },
> +{
> + "not-taken loop with back jump to 1st insn",
> + .insns = {
> + BPF_MOV64_IMM(BPF_REG_0, 123),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 4, -2),
> + BPF_EXIT_INSN(),
> + },
> + .result = ACCEPT,
> + .prog_type = BPF_PROG_TYPE_XDP,
> + .retval = 123,
> +},
> +{
> + "taken loop with back jump to 1st insn",
> + .insns = {
> + BPF_MOV64_IMM(BPF_REG_1, 10),
> + BPF_MOV64_IMM(BPF_REG_2, 0),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 1),
> + BPF_EXIT_INSN(),
> + BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_1),
> + BPF_ALU64_IMM(BPF_SUB, BPF_REG_1, 1),
> + BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -3),
> + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
> + BPF_EXIT_INSN(),
> + },
> + .result = ACCEPT,
> + .prog_type = BPF_PROG_TYPE_XDP,
> + .retval = 55,
> +},
> -- 
> 2.20.0
> 



Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread David Ahern
On 7/31/19 1:26 PM, Jiri Pirko wrote:
> Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicin...@netronome.com wrote:
>> On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> All devlink instances are created in init_net and stay there for a
>>> lifetime. Allow user to be able to move devlink instances into
>>> namespaces.
>>>
>>> Signed-off-by: Jiri Pirko 
>>
>> I'm no namespace expert, but seems reasonable, so FWIW:
>>
>> Acked-by: Jakub Kicinski 
>>
>> If I read things right we will only send the devlink instance
>> notification to other namespaces when it moves, but not
>> notifications for sub-objects like ports. Is the expectation 
>> that the user space dumps the objects it cares about or will
>> the other notifications be added as needed (or option 3 - I 
>> misread the code).
> 
> You are correct. I plan to take care of the notifications of all objects
> in the follow-up patchset. But I can do it in this one if needed. Up to
> you.
> 

seems like it should be a part of this one. If a devlink instance
changes namespaces, all of its associated resources should as well.

Also, seems like you are missing a 'can this devlink instance be moved'
check. e.g., what happens if a resource controller has been configured
for the devlink instance and it is moved to a namespace whose existing
config exceeds those limits?
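
A minimal sketch of the kind of pre-move check meant here (the names are illustrative, not the devlink API): before reparenting an instance, compare the destination namespace's current occupancy against each configured resource limit and refuse the move when it is already exceeded.

```c
#include <assert.h>

/* Hypothetical stand-in for a devlink resource controller limit. */
struct resource_limit {
        unsigned long size_max;         /* configured limit, e.g. max FIB entries */
};

/* Return 1 if the instance may move, 0 if the destination namespace
 * already holds more objects than the controller permits. */
static int can_move_instance(const struct resource_limit *limit,
                             unsigned long dest_ns_occupancy)
{
        return dest_ns_occupancy <= limit->size_max;
}
```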


Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail

2019-07-31 Thread Petko Manolov
On 19-07-31 22:10:39, Petko Manolov wrote:
> On 19-07-30 15:13:57, Denis Kirjanov wrote:
> > get_registers() may fail with -ENOMEM and in this
> > case we can read a garbage from the status variable tmp.
> > 
> > Reported-by: syzbot+3499a83b2d062ae40...@syzkaller.appspotmail.com
> > Signed-off-by: Denis Kirjanov 
> > ---
> >  drivers/net/usb/pegasus.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> > index 6d25dea5ad4b..f7d117d80cfb 100644
> > --- a/drivers/net/usb/pegasus.c
> > +++ b/drivers/net/usb/pegasus.c
> > @@ -282,7 +282,7 @@ static void mdio_write(struct net_device *dev, int 
> > phy_id, int loc, int val)
> >  static int read_eprom_word(pegasus_t *pegasus, __u8 index, __u16 *retdata)
> >  {
> > int i;
> > -   __u8 tmp;
> > +   __u8 tmp = 0;
> > __le16 retdatai;
> > int ret;
> 
> Unfortunately this patch does not fix anything.  Even if get_registers() fails 
> with -ENOMEM the "for" loop will cover for it and will exit only if the 
> operation was successful or the device got disconnected.  Please read the 
> code carefully.
> 
> So while the patch is harmless it isn't solving a problem.

Actually I am wrong - if "tmp" contains a random value it may accidentally 
have the EPROM_DONE bit set.

Dave, please apply the patch.


thanks,
Petko
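
A minimal sketch of the failure mode being discussed (illustrative, not the pegasus driver code; the EPROM_DONE bit value is an assumption): if the register read fails and leaves the status byte uninitialized, a stale EPROM_DONE bit can end the polling loop even though no read completed. Starting from 0 requires at least one real successful read.

```c
#include <assert.h>

#define EPROM_DONE 0x04         /* bit position is illustrative only */

static int poll_eprom_done(int (*read_status)(unsigned char *out),
                           int max_tries)
{
        unsigned char tmp = 0;  /* the fix: start from a known value */
        int i;

        for (i = 0; i < max_tries; i++) {
                if (read_status(&tmp))
                        continue;       /* failed read leaves tmp untouched */
                if (tmp & EPROM_DONE)
                        return 0;       /* genuinely done */
        }
        return -1;                      /* timed out */
}

/* mock readers for illustration */
static int failing_read(unsigned char *out) { (void)out; return -1; }
static int done_read(unsigned char *out)    { *out = EPROM_DONE; return 0; }
```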


Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail

2019-07-31 Thread Petko Manolov
On 19-07-30 15:13:57, Denis Kirjanov wrote:
> get_registers() may fail with -ENOMEM and in this
> case we can read a garbage from the status variable tmp.
> 
> Reported-by: syzbot+3499a83b2d062ae40...@syzkaller.appspotmail.com
> Signed-off-by: Denis Kirjanov 
> ---
>  drivers/net/usb/pegasus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> index 6d25dea5ad4b..f7d117d80cfb 100644
> --- a/drivers/net/usb/pegasus.c
> +++ b/drivers/net/usb/pegasus.c
> @@ -282,7 +282,7 @@ static void mdio_write(struct net_device *dev, int 
> phy_id, int loc, int val)
>  static int read_eprom_word(pegasus_t *pegasus, __u8 index, __u16 *retdata)
>  {
>   int i;
> - __u8 tmp;
> + __u8 tmp = 0;
>   __le16 retdatai;
>   int ret;

Unfortunately this patch does not fix anything.  Even if get_registers() fails 
with -ENOMEM the "for" loop will cover for it and will exit only if the 
operation was successful or the device got disconnected.  Please read the 
code carefully.

So while the patch is harmless it isn't solving a problem.


cheers,
Petko


Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread Jiri Pirko
Wed, Jul 31, 2019 at 09:41:10PM CEST, dsah...@gmail.com wrote:
>On 7/31/19 1:26 PM, Jiri Pirko wrote:
>> Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicin...@netronome.com wrote:
>>> On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
 From: Jiri Pirko 

 All devlink instances are created in init_net and stay there for a
 lifetime. Allow user to be able to move devlink instances into
 namespaces.

 Signed-off-by: Jiri Pirko 
>>>
>>> I'm no namespace expert, but seems reasonable, so FWIW:
>>>
>>> Acked-by: Jakub Kicinski 
>>>
>>> If I read things right we will only send the devlink instance
>>> notification to other namespaces when it moves, but not
>>> notifications for sub-objects like ports. Is the expectation 
>>> that the user space dumps the objects it cares about or will
>>> the other notifications be added as needed (or option 3 - I 
>>> misread the code).
>> 
>> You are correct. I plan to take care of the notifications of all objects
>> in the follow-up patchset. But I can do it in this one if needed. Up to
>> you.
>> 
>
>seems like it should be a part of this one. If a devlink instance
>changes namespaces, all of its associated resources should as well.

Okay. Will do.


>
>Also, seems like you are missing a 'can this devlink instance be moved'

I don't see why not.


>check. e.g., what happens if a resource controller has been configured
>for the devlink instance and it is moved to a namespace whose existing
>config exceeds those limits?

It's moved with all the values. The whole instance is moved.



Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread David Ahern
On 7/31/19 1:45 PM, Jiri Pirko wrote:
>> check. e.g., what happens if a resource controller has been configured
>> for the devlink instance and it is moved to a namespace whose existing
>> config exceeds those limits?
> 
> It's moved with all the values. The whole instance is moved.
> 

The values are moved, but the FIB in a namespace could already contain
more routes than the devlink instance allows.



Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread David Ahern
On 7/31/19 1:46 PM, David Ahern wrote:
> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>> check. e.g., what happens if a resource controller has been configured
>>> for the devlink instance and it is moved to a namespace whose existing
>>> config exceeds those limits?
>>
>> It's moved with all the values. The whole instance is moved.
>>
> 
> The values are moved, but the FIB in a namespace could already contain
> more routes than the devlink instance allows.
> 

From a quick test your recent refactoring to netdevsim broke the
resource controller. It was, and is intended to be, per network namespace.


Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces

2019-07-31 Thread David Ahern
On 7/31/19 1:58 PM, David Ahern wrote:
> On 7/31/19 1:46 PM, David Ahern wrote:
>> On 7/31/19 1:45 PM, Jiri Pirko wrote:
 check. e.g., what happens if a resource controller has been configured
 for the devlink instance and it is moved to a namespace whose existing
 config exceeds those limits?
>>>
>>> It's moved with all the values. The whole instance is moved.
>>>
>>
>> The values are moved, but the FIB in a namespace could already contain
>> more routes than the devlink instance allows.
>>
> 
> From a quick test your recent refactoring to netdevsim broke the
> resource controller. It was, and is intended to be, per network namespace.
> 

Specifically this commit:

commit 5fc494225c1eb81309cc4c91f183cd30e4edb674
Author: Jiri Pirko 
Date:   Thu Apr 25 15:59:42 2019 +0200

netdevsim: create devlink instance per netdevsim instance

Currently there is one devlink instance created per network namespace.
That is quite odd considering the fact that devlink instance should
represent an ASIC. The following patches are going to move the devlink
instance even more down to a bus device, but until then, have one
devlink instance per netdevsim instance. Struct nsim_devlink is
introduced to hold fib setting.

The changes in the fib code are only related to holding the
configuration per devlink instance instead of network namespace.

broke the intent of the resource controller clearly stated in the
original commit

commit 37923ed6b8cea94d7d76038e2f72c57a0b45daab
Author: David Ahern 
Date:   Tue Mar 27 18:22:00 2018 -0700

netdevsim: Add simple FIB resource controller via devlink

...
Currently, devlink only supports initial namespace. Code is in place to
adapt netdevsim to a per namespace controller once the network namespace
issues are resolved.

And discussed here based on the RFC and v1 of the original patchset:

https://lore.kernel.org/netdev/03eade79-1727-3a31-8e31-a0a7f51b7...@cumulusnetworks.com/

and

https://lore.kernel.org/netdev/a916f016-5d8b-3f56-0a84-95d1712be...@cumulusnetworks.com/

This is a regression that needs to be fixed.


[net-next 10/16] ice: Update number of VF queue before setting VSI resources

2019-07-31 Thread Jeff Kirsher
From: Akeem G Abodunrin 

In case there is a request from a VF to change its number of queues, and
the request was successful, we need to update the number of queues
configured on the VF before updating the corresponding VSI for that VF,
especially the LAN Tx queue tree and TC update; otherwise, we would
continue to use the old value of vf->num_vf_qs for the allocated Tx/Rx
queues.

Signed-off-by: Akeem G Abodunrin 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c 
b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
index 5d24b539648f..00aa6364124a 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
@@ -567,11 +567,6 @@ static int ice_alloc_vf_res(struct ice_vf *vf)
int tx_rx_queue_left;
int status;
 
-   /* setup VF VSI and necessary resources */
-   status = ice_alloc_vsi_res(vf);
-   if (status)
-   goto ice_alloc_vf_res_exit;
-
/* Update number of VF queues, in case VF had requested for queue
 * changes
 */
@@ -581,6 +576,11 @@ static int ice_alloc_vf_res(struct ice_vf *vf)
vf->num_req_qs != vf->num_vf_qs)
vf->num_vf_qs = vf->num_req_qs;
 
+   /* setup VF VSI and necessary resources */
+   status = ice_alloc_vsi_res(vf);
+   if (status)
+   goto ice_alloc_vf_res_exit;
+
if (vf->trusted)
set_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
else
-- 
2.21.0



[net-next 00/16][pull request] 100GbE Intel Wired LAN Driver Updates 2019-07-31

2019-07-31 Thread Jeff Kirsher
This series contains updates to ice driver only.

Paul adds support for reporting what the link partner is advertising for
flow control settings.

Jake fixes the hardware statistics, which are prone to rollover since the
statistic registers are either 32 or 40 bits wide, depending on which
register is being read; a 64-bit software statistic stores the hardware
statistics so values are tracked past the point of rollover.  He also
fixes an issue with the locking of the control queues, where locks were
being destroyed at run time.
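
The rollover handling described above can be sketched as follows (a minimal illustration under assumed names, not the driver's actual ice_stat_update40): compute the delta modulo the hardware counter width and accumulate it into a 64-bit software counter that effectively never rolls over.

```c
#include <assert.h>
#include <stdint.h>

#define STAT_WIDTH_BITS 40
#define STAT_MASK       ((1ULL << STAT_WIDTH_BITS) - 1)

/* Accumulate a 40-bit hardware counter into a 64-bit software total. */
static void stat_update40(uint64_t hw_now, uint64_t *hw_prev,
                          uint64_t *sw_total)
{
        /* unsigned subtraction handles the wrap; mask back to 40 bits */
        *sw_total += (hw_now - *hw_prev) & STAT_MASK;
        *hw_prev = hw_now;
}
```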

Tony fixes an issue that was created when interrupt tracking was
refactored and the call to ice_vsi_setup_vector_base() was removed from
the PF VSI instead of the VF VSI.  Adds a check before trying to
configure a port to ensure that media is attached.

Brett fixes an issue in the receive queue configuration where prefena
(Prefetch Enable) was being set to 0 which caused the hardware to only
fetch descriptors when there are none free in the cache for a received
packet.  Updates the driver to only bump the receive tail once per
napi_poll call, instead of the current model of bumping the tail up to 4
times per napi_poll call.  Adds statistics for receive drops at the port
level to ethtool/netlink.  Cleans up duplicate code in the allocation of
receive buffer code.

Akeem updates the driver to ensure that VFs stay disabled until the
setup or reset is completed.  Modifies the driver to use the allocated
number of transmit queues per VSI to set up the scheduling tree versus
using the total number of available transmit queues.  Also fixes the
driver to update the total number of configured queues after a
successful VF request to change its number of queues, before updating
the corresponding VSI for that VF.  Removes flags that are no longer
needed.

The following are changes since commit 6a7ce95d752efa86a1a383385d4f8035c224dc3d:
  staging/octeon: Fix build error without CONFIG_NETDEVICES
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Akeem G Abodunrin (5):
  ice: Disable VFs until reset is completed
  ice: Set up Tx scheduling tree based on alloc VSI Tx queues
  ice: Update number of VF queue before setting VSI resources
  ice: Don't return error for disabling LAN Tx queue that does not exist
  ice: Remove flag to track VF interrupt status

Brett Creeley (5):
  ice: Always set prefena when configuring an Rx queue
  ice: Only bump Rx tail and release buffers once per napi_poll
  ice: Add stats for Rx drops at the port level
  ice: Remove duplicate code in ice_alloc_rx_bufs
  ice: Remove unnecessary flag ICE_FLAG_MSIX_ENA

Jacob Keller (2):
  ice: track hardware stat registers past rollover
  ice: separate out control queue lock creation

Paul Greenwalt (1):
  ice: add lp_advertising flow control support

Tony Nguyen (3):
  ice: Move vector base setup to PF VSI
  ice: Do not configure port with no media
  ice: Bump version number

 drivers/net/ethernet/intel/ice/ice.h  |   2 +-
 drivers/net/ethernet/intel/ice/ice_common.c   |  72 +--
 drivers/net/ethernet/intel/ice/ice_common.h   |   6 +-
 drivers/net/ethernet/intel/ice/ice_controlq.c | 112 +++--
 drivers/net/ethernet/intel/ice/ice_ethtool.c  | 104 +++--
 .../net/ethernet/intel/ice/ice_hw_autogen.h   |  31 +-
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h|   1 +
 drivers/net/ethernet/intel/ice/ice_lib.c  | 128 +++---
 drivers/net/ethernet/intel/ice/ice_main.c | 416 ++
 drivers/net/ethernet/intel/ice/ice_txrx.c |  60 +--
 .../net/ethernet/intel/ice/ice_virtchnl_pf.c  |  21 +-
 .../net/ethernet/intel/ice/ice_virtchnl_pf.h  |   5 -
 12 files changed, 527 insertions(+), 431 deletions(-)

-- 
2.21.0



[net-next 16/16] ice: Bump version number

2019-07-31 Thread Jeff Kirsher
From: Tony Nguyen 

Update driver version to 0.7.5

Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 8a8f9170e219..c26e6a102dac 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -9,7 +9,7 @@
 #include "ice_lib.h"
 #include "ice_dcb_lib.h"
 
-#define DRV_VERSION"0.7.4-k"
+#define DRV_VERSION"0.7.5-k"
 #define DRV_SUMMARY"Intel(R) Ethernet Connection E800 Series Linux Driver"
 const char ice_drv_ver[] = DRV_VERSION;
 static const char ice_driver_string[] = DRV_SUMMARY;
-- 
2.21.0



[net-next 05/16] ice: separate out control queue lock creation

2019-07-31 Thread Jeff Kirsher
From: Jacob Keller 

The ice_init_all_ctrlq and ice_shutdown_all_ctrlq functions create and
destroy the locks used to protect the send and receive process of each
control queue.

This is problematic, as the driver may use these functions to shutdown
and re-initialize the control queues at run time. For example, it may do
this in response to a device reset.

If the driver failed to recover from a reset, it might leave the control
queues offline. In this case, the locks will no longer be initialized.
A later call to ice_sq_send_cmd will then attempt to acquire a lock that
has been destroyed.

It is incorrect behavior to access a lock that has been destroyed.

Indeed, ice_aq_send_cmd already tries to avoid accessing an offline
control queue, but the check occurs inside the lock.

The root of the problem is that the locks are destroyed at run time.

Modify ice_init_all_ctrlq and ice_shutdown_all_ctrlq such that they no
longer create or destroy the locks.

Introduce new functions, ice_create_all_ctrlq and ice_destroy_all_ctrlq.
Call these functions in ice_init_hw and ice_deinit_hw.

Now, the control queue locks will remain valid for the life of the
driver, and will not be destroyed until the driver unloads.

This also allows removing a duplicate check of the sq.count and
rq.count values when shutting down the controlqs. The ice_shutdown_ctrlq
function already checks this value under the lock. Previously
commit dec64ff10ed9 ("ice: use [sr]q.count when checking if queue is
initialized") needed this check to happen outside the lock, because it
prevented duplicate attempts at destroying the locks.

The driver may now safely use ice_init_all_ctrlq and
ice_shutdown_all_ctrlq while handling reset events, without causing the
locks to be invalid.
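
A minimal sketch of the lifecycle split described in this commit message (illustrative names, not the driver code): create/destroy run once per driver lifetime, while init/shutdown may run many times across resets, so a command issued between resets always finds a valid lock.

```c
#include <assert.h>

struct ctrlq {
        int lock_valid; /* stands in for mutex_init/mutex_destroy state */
        int online;     /* queue initialized and usable */
};

static void ctrlq_create(struct ctrlq *cq)   { cq->lock_valid = 1; cq->online = 0; }
static void ctrlq_init(struct ctrlq *cq)     { cq->online = 1; }
static void ctrlq_shutdown(struct ctrlq *cq) { cq->online = 0; }
static void ctrlq_destroy(struct ctrlq *cq)  { cq->lock_valid = 0; }

/* sending must take the lock first; using a destroyed lock is a bug */
static int ctrlq_send(struct ctrlq *cq)
{
        if (!cq->lock_valid)
                return -2;              /* use-after-destroy, the old bug */
        return cq->online ? 0 : -1;     /* graceful failure when offline */
}
```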

Signed-off-by: Jacob Keller 
Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_common.c   |   6 +-
 drivers/net/ethernet/intel/ice/ice_common.h   |   2 +
 drivers/net/ethernet/intel/ice/ice_controlq.c | 112 ++
 3 files changed, 91 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 01e5ecaaa322..5f9dc76699d2 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -740,7 +740,7 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 
ice_get_itr_intrl_gran(hw);
 
-   status = ice_init_all_ctrlq(hw);
+   status = ice_create_all_ctrlq(hw);
if (status)
goto err_unroll_cqinit;
 
@@ -855,7 +855,7 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 err_unroll_alloc:
devm_kfree(ice_hw_to_dev(hw), hw->port_info);
 err_unroll_cqinit:
-   ice_shutdown_all_ctrlq(hw);
+   ice_destroy_all_ctrlq(hw);
return status;
 }
 
@@ -881,7 +881,7 @@ void ice_deinit_hw(struct ice_hw *hw)
 
/* Attempt to disable FW logging before shutting down control queues */
ice_cfg_fw_log(hw, false);
-   ice_shutdown_all_ctrlq(hw);
+   ice_destroy_all_ctrlq(hw);
 
/* Clear VSI contexts if not already cleared */
ice_clear_all_vsi_ctx(hw);
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h 
b/drivers/net/ethernet/intel/ice/ice_common.h
index 68218e63afc2..e376d1eadba4 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -17,8 +17,10 @@ enum ice_status ice_init_hw(struct ice_hw *hw);
 void ice_deinit_hw(struct ice_hw *hw);
 enum ice_status ice_check_reset(struct ice_hw *hw);
 enum ice_status ice_reset(struct ice_hw *hw, enum ice_reset_req req);
+enum ice_status ice_create_all_ctrlq(struct ice_hw *hw);
 enum ice_status ice_init_all_ctrlq(struct ice_hw *hw);
 void ice_shutdown_all_ctrlq(struct ice_hw *hw);
+void ice_destroy_all_ctrlq(struct ice_hw *hw);
 enum ice_status
 ice_clean_rq_elem(struct ice_hw *hw, struct ice_ctl_q_info *cq,
  struct ice_rq_event_info *e, u16 *pending);
diff --git a/drivers/net/ethernet/intel/ice/ice_controlq.c 
b/drivers/net/ethernet/intel/ice/ice_controlq.c
index e91ac4df0242..2353166c654e 100644
--- a/drivers/net/ethernet/intel/ice/ice_controlq.c
+++ b/drivers/net/ethernet/intel/ice/ice_controlq.c
@@ -310,7 +310,7 @@ ice_cfg_rq_regs(struct ice_hw *hw, struct ice_ctl_q_info 
*cq)
  * @cq: pointer to the specific Control queue
  *
  * This is the main initialization routine for the Control Send Queue
- * Prior to calling this function, drivers *MUST* set the following fields
+ * Prior to calling this function, the driver *MUST* set the following fields
  * in the cq->structure:
  * - cq->num_sq_entries
  * - cq->sq_buf_size
@@ -369,7 +369,7 @@ static enum ice_status ice_init_sq(struct ice_hw *hw, 
struct ice_ctl_q_info *cq)
  * @cq: pointer to the specific Control queue
  *
  * The main initialization routine for the Admin Receive (Event) Queue.
- * Prior to calling this function, dri

[net-next 07/16] ice: Disable VFs until reset is completed

2019-07-31 Thread Jeff Kirsher
From: Akeem G Abodunrin 

This patch adds code to clear the VFs' enable status until reset is
completed and the Tx/Rx rings are set up. Without this patch, the code
flow requests Tx queues to be disabled after reset, especially a PFR,
where the VF VSI Tx rings have already been wiped off in the NVM,
resulting in an adminq error from the call to disable the LAN Tx queues
in the ice_reset_all_vfs function.

Signed-off-by: Akeem G Abodunrin 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 91334d1e83ed..d13f56803658 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -488,6 +488,7 @@ static void
 ice_prepare_for_reset(struct ice_pf *pf)
 {
struct ice_hw *hw = &pf->hw;
+   u8 i;
 
/* already prepared for reset */
if (test_bit(__ICE_PREPARED_FOR_RESET, pf->state))
@@ -497,6 +498,10 @@ ice_prepare_for_reset(struct ice_pf *pf)
if (ice_check_sq_alive(hw, &hw->mailboxq))
ice_vc_notify_reset(pf);
 
+   /* Disable VFs until reset is completed */
+   for (i = 0; i < pf->num_alloc_vfs; i++)
+   clear_bit(ICE_VF_STATE_ENA, pf->vf[i].vf_states);
+
/* disable the VSIs and their queues that are not already DOWN */
ice_pf_dis_all_vsi(pf, false);
 
-- 
2.21.0



[net-next 01/16] ice: add lp_advertising flow control support

2019-07-31 Thread Jeff Kirsher
From: Paul Greenwalt 

Add support for reporting link partner advertising when
ETHTOOL_GLINKSETTINGS is defined. Get pause param reports the configured
Tx/Rx pause; ethtool then issues the ETHTOOL_GSET ioctl, and
ice_get_settings_link_up reports the negotiated Tx/Rx pause. Negotiated
pause frames are reported per IEEE 802.3-2005 Table 28B-3.

$ ethtool --show-pause ens6f0
Pause parameters for ens6f0:
Autonegotiate:  on
RX: on
TX: on
RX negotiated:  on
TX negotiated:  on

$ ethtool ens6f0
Settings for ens6f0:
Supported ports: [ FIBRE ]
Supported link modes:   25000baseCR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None BaseR RS
Advertised link modes:  25000baseCR/Full
Advertised pause frame use: Symmetric Receive-only
Advertised auto-negotiation: Yes
Advertised FEC modes: None BaseR RS
Link partner advertised link modes:  Not reported
Link partner advertised pause frame use: Symmetric
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 25000Mb/s
Duplex: Full
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Current message level: 0x0007 (7)
   drv probe link
Link detected: yes

When ETHTOOL_GLINKSETTINGS is not defined, get pause param reports the
negotiated Tx/Rx pause.
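
A minimal sketch of the negotiated-pause mapping this patch implements (the enum and bit names are illustrative stand-ins for the driver's and ethtool's): full pause advertises Pause only, Tx-only pause advertises Pause + Asym_Pause, and Rx-only pause advertises Asym_Pause only, matching the switch added to ice_get_settings_link_up.

```c
#include <assert.h>

enum fc_mode { FC_NONE, FC_RX_PAUSE, FC_TX_PAUSE, FC_FULL };

#define LM_PAUSE        (1u << 0)
#define LM_ASYM_PAUSE   (1u << 1)

/* Map the negotiated flow-control mode to lp_advertising link modes. */
static unsigned int fc_to_lp_advertising(enum fc_mode mode)
{
        switch (mode) {
        case FC_FULL:
                return LM_PAUSE;
        case FC_TX_PAUSE:
                return LM_PAUSE | LM_ASYM_PAUSE;
        case FC_RX_PAUSE:
                return LM_ASYM_PAUSE;
        default:
                return 0;
        }
}
```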

Signed-off-by: Paul Greenwalt 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 104 +--
 1 file changed, 72 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c 
b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 52083a63dee6..d3ba535bd65a 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -1716,6 +1716,7 @@ ice_get_settings_link_up(struct ethtool_link_ksettings 
*ks,
 struct net_device *netdev)
 {
struct ice_netdev_priv *np = netdev_priv(netdev);
+   struct ice_port_info *pi = np->vsi->port_info;
struct ethtool_link_ksettings cap_ksettings;
struct ice_link_status *link_info;
struct ice_vsi *vsi = np->vsi;
@@ -2040,6 +2041,33 @@ ice_get_settings_link_up(struct ethtool_link_ksettings 
*ks,
break;
}
ks->base.duplex = DUPLEX_FULL;
+
+   if (link_info->an_info & ICE_AQ_AN_COMPLETED)
+   ethtool_link_ksettings_add_link_mode(ks, lp_advertising,
+Autoneg);
+
+   /* Set flow control negotiated Rx/Tx pause */
+   switch (pi->fc.current_mode) {
+   case ICE_FC_FULL:
+   ethtool_link_ksettings_add_link_mode(ks, lp_advertising, Pause);
+   break;
+   case ICE_FC_TX_PAUSE:
+   ethtool_link_ksettings_add_link_mode(ks, lp_advertising, Pause);
+   ethtool_link_ksettings_add_link_mode(ks, lp_advertising,
+Asym_Pause);
+   break;
+   case ICE_FC_RX_PAUSE:
+   ethtool_link_ksettings_add_link_mode(ks, lp_advertising,
+Asym_Pause);
+   break;
+   case ICE_FC_PFC:
+   /* fall through */
+   default:
+   ethtool_link_ksettings_del_link_mode(ks, lp_advertising, Pause);
+   ethtool_link_ksettings_del_link_mode(ks, lp_advertising,
+Asym_Pause);
+   break;
+   }
 }
 
 /**
@@ -2078,9 +2106,12 @@ ice_get_link_ksettings(struct net_device *netdev,
struct ice_aqc_get_phy_caps_data *caps;
struct ice_link_status *hw_link_info;
struct ice_vsi *vsi = np->vsi;
+   enum ice_status status;
+   int err = 0;
 
ethtool_link_ksettings_zero_link_mode(ks, supported);
ethtool_link_ksettings_zero_link_mode(ks, advertising);
+   ethtool_link_ksettings_zero_link_mode(ks, lp_advertising);
hw_link_info = &vsi->port_info->phy.link_info;
 
/* set speed and duplex */
@@ -2125,48 +2156,36 @@ ice_get_link_ksettings(struct net_device *netdev,
/* flow control is symmetric and always supported */
ethtool_link_ksettings_add_link_mode(ks, supported, Pause);
 
-   switch (vsi->port_info->fc.req_mode) {
-   case ICE_FC_FULL:
+   caps = devm_kzalloc(&vsi->back->pdev->dev, sizeof(*caps), GFP_KERNEL);
+   if (!caps)
+   return -ENOMEM;
+
+   status = ice_aq_get_phy_caps(vsi->port_info, false,
+ICE_AQC_REPORT_SW_CFG, caps, NULL);
+   if (status) {
+   err = -EIO;
+   goto done;
+   }
+
+   /* Set th

[net-next 11/16] ice: Add stats for Rx drops at the port level

2019-07-31 Thread Jeff Kirsher
From: Brett Creeley 

Currently we are not reporting dropped counts at the port level to
ethtool or netlink. This was found when debugging Rx dropped issues
and the total packets sent did not equal the total packets received
minus the rx_dropped, which was very confusing. To determine dropped
counts at the port level we need to read the PRTRPB_RDPC register.
To fix reporting we will store the dropped counts in the PF's
rx_discards. This will be reported to netlink by storing it in the
PF VSI's rx_missed_errors signaling that the receiver missed the
packet. Also, we will report this to ethtool in the rx_dropped.nic
field.

Signed-off-by: Brett Creeley 
Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h | 1 +
 drivers/net/ethernet/intel/ice/ice_main.c   | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h 
b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index 3250dfc2..87652d722a30 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -337,5 +337,6 @@
 #define VSIQF_HLUT_MAX_INDEX   15
 #define VFINT_DYN_CTLN(_i) (0x3800 + ((_i) * 4))
 #define VFINT_DYN_CTLN_CLEARPBA_M  BIT(1)
+#define PRTRPB_RDPC0x000AC260
 
 #endif /* _ICE_HW_AUTOGEN_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index d13f56803658..e4be9aa79337 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3297,6 +3297,8 @@ static void ice_update_vsi_stats(struct ice_vsi *vsi)
cur_ns->rx_errors = pf->stats.crc_errors +
pf->stats.illegal_bytes;
cur_ns->rx_length_errors = pf->stats.rx_len_errors;
+   /* record drops from the port level */
+   cur_ns->rx_missed_errors = pf->stats.eth.rx_discards;
}
 }
 
@@ -3330,6 +3332,10 @@ static void ice_update_pf_stats(struct ice_pf *pf)
  &prev_ps->eth.rx_broadcast,
  &cur_ps->eth.rx_broadcast);
 
+   ice_stat_update32(hw, PRTRPB_RDPC, pf->stat_prev_loaded,
+ &prev_ps->eth.rx_discards,
+ &cur_ps->eth.rx_discards);
+
ice_stat_update40(hw, GLPRT_GOTCL(pf_id), pf->stat_prev_loaded,
  &prev_ps->eth.tx_bytes,
  &cur_ps->eth.tx_bytes);
-- 
2.21.0



[net-next 13/16] ice: Don't return error for disabling LAN Tx queue that does not exist

2019-07-31 Thread Jeff Kirsher
From: Akeem G Abodunrin 

Since Tx rings are being managed by FW/NVM, Tx rings might not have been
set up, or the driver may have already wiped them off. In that case, the
call to disable a LAN Tx queue returns that the queue does not exist.
This patch makes sure we don't return an unnecessary error for such a
scenario.

Signed-off-by: Akeem G Abodunrin 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index e28478215810..2add10e02280 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2148,6 +2148,9 @@ ice_vsi_stop_tx_rings(struct ice_vsi *vsi, enum 
ice_disq_rst_src rst_src,
if (status == ICE_ERR_RESET_ONGOING) {
dev_dbg(&pf->pdev->dev,
"Reset in progress. LAN Tx queues already 
disabled\n");
+   } else if (status == ICE_ERR_DOES_NOT_EXIST) {
+   dev_dbg(&pf->pdev->dev,
"LAN Tx queues do not exist, nothing to 
disable\n");
} else if (status) {
dev_err(&pf->pdev->dev,
"Failed to disable LAN Tx queues, error: %d\n",
-- 
2.21.0



[net-next 14/16] ice: Remove unnecessary flag ICE_FLAG_MSIX_ENA

2019-07-31 Thread Jeff Kirsher
From: Brett Creeley 

This flag is not needed, yet it is checked every time we re-enable
interrupts in the hotpath, so remove it. Also remove ice_vsi_req_irq()
because it was a wrapper function for ice_vsi_req_irq_msix() whose sole
purpose was checking the ICE_FLAG_MSIX_ENA flag.

Signed-off-by: Brett Creeley 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice.h  |  1 -
 drivers/net/ethernet/intel/ice/ice_lib.c  | 71 +-
 drivers/net/ethernet/intel/ice/ice_main.c | 74 ++-
 drivers/net/ethernet/intel/ice/ice_txrx.c |  4 +-
 4 files changed, 48 insertions(+), 102 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h 
b/drivers/net/ethernet/intel/ice/ice.h
index 596b09a905aa..794d97460fc7 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -329,7 +329,6 @@ struct ice_q_vector {
 } cacheline_internodealigned_in_smp;
 
 enum ice_pf_flags {
-   ICE_FLAG_MSIX_ENA,
ICE_FLAG_FLTR_SYNC,
ICE_FLAG_RSS_ENA,
ICE_FLAG_SRIOV_ENA,
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 2add10e02280..6e34c40e7840 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1129,12 +1129,7 @@ static int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
return -EEXIST;
}
 
-   if (test_bit(ICE_FLAG_MSIX_ENA, pf->flags)) {
-   num_q_vectors = vsi->num_q_vectors;
-   } else {
-   err = -EINVAL;
-   goto err_out;
-   }
+   num_q_vectors = vsi->num_q_vectors;
 
for (v_idx = 0; v_idx < num_q_vectors; v_idx++) {
err = ice_vsi_alloc_q_vector(vsi, v_idx);
@@ -1180,9 +1175,6 @@ static int ice_vsi_setup_vector_base(struct ice_vsi *vsi)
return -EEXIST;
}
 
-   if (!test_bit(ICE_FLAG_MSIX_ENA, pf->flags))
-   return -ENOENT;
-
num_q_vectors = vsi->num_q_vectors;
/* reserve slots from OS requested IRQs */
vsi->base_vector = ice_get_res(pf, pf->irq_tracker, num_q_vectors,
@@ -2605,39 +2597,36 @@ void ice_vsi_free_irq(struct ice_vsi *vsi)
 {
struct ice_pf *pf = vsi->back;
int base = vsi->base_vector;
+   int i;
 
-   if (test_bit(ICE_FLAG_MSIX_ENA, pf->flags)) {
-   int i;
-
-   if (!vsi->q_vectors || !vsi->irqs_ready)
-   return;
+   if (!vsi->q_vectors || !vsi->irqs_ready)
+   return;
 
-   ice_vsi_release_msix(vsi);
-   if (vsi->type == ICE_VSI_VF)
-   return;
+   ice_vsi_release_msix(vsi);
+   if (vsi->type == ICE_VSI_VF)
+   return;
 
-   vsi->irqs_ready = false;
-   ice_for_each_q_vector(vsi, i) {
-   u16 vector = i + base;
-   int irq_num;
+   vsi->irqs_ready = false;
+   ice_for_each_q_vector(vsi, i) {
+   u16 vector = i + base;
+   int irq_num;
 
-   irq_num = pf->msix_entries[vector].vector;
+   irq_num = pf->msix_entries[vector].vector;
 
-   /* free only the irqs that were actually requested */
-   if (!vsi->q_vectors[i] ||
-   !(vsi->q_vectors[i]->num_ring_tx ||
- vsi->q_vectors[i]->num_ring_rx))
-   continue;
+   /* free only the irqs that were actually requested */
+   if (!vsi->q_vectors[i] ||
+   !(vsi->q_vectors[i]->num_ring_tx ||
+ vsi->q_vectors[i]->num_ring_rx))
+   continue;
 
-   /* clear the affinity notifier in the IRQ descriptor */
-   irq_set_affinity_notifier(irq_num, NULL);
+   /* clear the affinity notifier in the IRQ descriptor */
+   irq_set_affinity_notifier(irq_num, NULL);
 
-   /* clear the affinity_mask in the IRQ descriptor */
-   irq_set_affinity_hint(irq_num, NULL);
-   synchronize_irq(irq_num);
-   devm_free_irq(&pf->pdev->dev, irq_num,
- vsi->q_vectors[i]);
-   }
+   /* clear the affinity_mask in the IRQ descriptor */
+   irq_set_affinity_hint(irq_num, NULL);
+   synchronize_irq(irq_num);
+   devm_free_irq(&pf->pdev->dev, irq_num,
+ vsi->q_vectors[i]);
}
 }
 
@@ -2816,15 +2805,13 @@ void ice_vsi_dis_irq(struct ice_vsi *vsi)
}
 
/* disable each interrupt */
-   if (test_bit(ICE_FLAG_MSIX_ENA, pf->flags)) {
-   ice_for_each_q_vector(vsi, i)
-   wr32(hw, GLINT_DYN_CTL(vsi->q_vectors[i]->reg_idx), 0);
+   

[net-next 15/16] ice: Remove flag to track VF interrupt status

2019-07-31 Thread Jeff Kirsher
From: Akeem G Abodunrin 

As a result of refactoring the VF VSI interrupt code, there is no need
to track its configuration status with the ICE_VF_STATE_CFG_INTR flag.
In fact, the flag is not checked anywhere in the code right now, so this
patch removes the dead code associated with it.

Signed-off-by: Akeem G Abodunrin 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c | 11 ---
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.h |  5 -
 2 files changed, 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c 
b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
index 00aa6364124a..ce01cbe70ea4 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
@@ -297,13 +297,6 @@ void ice_free_vfs(struct ice_pf *pf)
if (test_bit(ICE_VF_STATE_INIT, pf->vf[i].vf_states)) {
/* disable VF qp mappings */
ice_dis_vf_mappings(&pf->vf[i]);
-
-   /* Set this state so that assigned VF vectors can be
-* reclaimed by PF for reuse in ice_vsi_release(). No
-* need to clear this bit since pf->vf array is being
-* freed anyways after this for loop
-*/
-   set_bit(ICE_VF_STATE_CFG_INTR, pf->vf[i].vf_states);
ice_free_vf_res(&pf->vf[i]);
}
}
@@ -551,7 +544,6 @@ static int ice_alloc_vsi_res(struct ice_vf *vf)
 * expect vector assignment to be changed unless there is a request for
 * more vectors.
 */
-   clear_bit(ICE_VF_STATE_CFG_INTR, vf->vf_states);
 ice_alloc_vsi_res_exit:
ice_free_fltr_list(&pf->pdev->dev, &tmp_add_list);
return status;
@@ -1283,9 +1275,6 @@ static int ice_alloc_vfs(struct ice_pf *pf, u16 
num_alloc_vfs)
/* assign default capabilities */
set_bit(ICE_VIRTCHNL_VF_CAP_L2, &vfs[i].vf_caps);
vfs[i].spoofchk = true;
-
-   /* Set this state so that PF driver does VF vector assignment */
-   set_bit(ICE_VF_STATE_CFG_INTR, vfs[i].vf_states);
}
pf->num_alloc_vfs = num_alloc_vfs;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.h 
b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.h
index c3ca522c245a..ada69120ff38 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.h
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.h
@@ -30,11 +30,6 @@ enum ice_vf_states {
ICE_VF_STATE_DIS,
ICE_VF_STATE_MC_PROMISC,
ICE_VF_STATE_UC_PROMISC,
-   /* state to indicate if PF needs to do vector assignment for VF.
-* This needs to be set during first time VF initialization or later
-* when VF asks for more Vectors through virtchnl OP.
-*/
-   ICE_VF_STATE_CFG_INTR,
ICE_VF_STATES_NBITS
 };
 
-- 
2.21.0



[net-next 12/16] ice: Remove duplicate code in ice_alloc_rx_bufs

2019-07-31 Thread Jeff Kirsher
From: Brett Creeley 

Currently if the call to ice_alloc_mapped_page() fails we jump to the
no_bufs label, possibly call ice_release_rx_desc(), and return true
indicating that there is more work to do. In the success case we just
fall out of the while loop, possibly call ice_release_rx_desc(), and
return false saying we exhausted cleaned_count. This flow can be
improved by breaking if ice_alloc_mapped_page() fails; the flow
outside of the while loop is then the same for the failure and success case.
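The control-flow change the commit message describes can be shown with a
minimal stand-alone sketch (the allocator stub and names below are
illustrative, not the driver's): allocation failure now breaks out of the
refill loop, both paths share one exit, and the return value simply reports
whether buffers are still owed.

```c
#include <assert.h>

/* Stand-in for ice_alloc_mapped_page(): succeeds while pages remain. */
static int alloc_page_stub(int *avail_pages)
{
	if (*avail_pages <= 0)
		return 0;		/* simulate allocation failure */
	(*avail_pages)--;
	return 1;
}

/* Refill up to cleaned_count buffers; returns nonzero if work remains.
 * On failure we break instead of jumping to a separate no_bufs label,
 * so the post-loop cleanup (descriptor release) is written only once.
 */
static int refill_bufs(unsigned int cleaned_count, int avail_pages)
{
	do {
		if (!alloc_page_stub(&avail_pages))
			break;		/* work remains for next poll */
		cleaned_count--;
	} while (cleaned_count);

	/* single shared exit path for success and failure */
	return cleaned_count != 0;
}
```

This mirrors the patch's `return !!cleaned_count;` idea: the caller only
needs to know whether the requested count was fully replenished.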

Signed-off-by: Brett Creeley 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c 
b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 0c459305c12f..020dac283f07 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -478,8 +478,9 @@ bool ice_alloc_rx_bufs(struct ice_ring *rx_ring, u16 
cleaned_count)
bi = &rx_ring->rx_buf[ntu];
 
do {
+   /* if we fail here, we have work remaining */
if (!ice_alloc_mapped_page(rx_ring, bi))
-   goto no_bufs;
+   break;
 
/* sync the buffer for use by the device */
dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
@@ -510,16 +511,7 @@ bool ice_alloc_rx_bufs(struct ice_ring *rx_ring, u16 
cleaned_count)
if (rx_ring->next_to_use != ntu)
ice_release_rx_desc(rx_ring, ntu);
 
-   return false;
-
-no_bufs:
-   if (rx_ring->next_to_use != ntu)
-   ice_release_rx_desc(rx_ring, ntu);
-
-   /* make sure to come back via polling to try again after
-* allocation failure
-*/
-   return true;
+   return !!cleaned_count;
 }
 
 /**
-- 
2.21.0



[net-next 03/16] ice: Move vector base setup to PF VSI

2019-07-31 Thread Jeff Kirsher
From: Tony Nguyen 

When interrupt tracking was refactored, during rebuild, the call to
ice_vsi_setup_vector_base() was inadvertently removed from the PF VSI
instead of being removed from the VF VSI. During reset, the failure to
properly setup the vector base generates a call trace. Correct this so
that resets/rebuilds properly complete.

Fixes: cbe66bfee6a0 ("ice: Refactor interrupt tracking")
Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index e9e8340b1ab7..01f38abd4277 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2978,6 +2978,10 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
if (ret)
goto err_rings;
 
+   ret = ice_vsi_setup_vector_base(vsi);
+   if (ret)
+   goto err_vectors;
+
ret = ice_vsi_set_q_vectors_reg_idx(vsi);
if (ret)
goto err_vectors;
@@ -2999,10 +3003,6 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
if (ret)
goto err_rings;
 
-   ret = ice_vsi_setup_vector_base(vsi);
-   if (ret)
-   goto err_vectors;
-
ret = ice_vsi_set_q_vectors_reg_idx(vsi);
if (ret)
goto err_vectors;
-- 
2.21.0



[net-next 02/16] ice: track hardware stat registers past rollover

2019-07-31 Thread Jeff Kirsher
From: Jacob Keller 

Currently, ice_stat_update32 and ice_stat_update40 will limit the
value of the software statistic to 32 or 40 bits wide, depending on
which register is being read.

This means that if a driver is running for a long time, the displayed
software register values will roll over to zero at 40 bits or 32 bits.

This occurs because the functions directly assign the difference between
the previous value and current value of the hardware statistic.

Instead, add this value to the current software statistic, and then
update the previous value.

In this way, each time ice_stat_update40 or ice_stat_update32 are
called, they will increment the software tracking value by the
difference of the hardware register from its last read. The software
tracking value will correctly count up until it overflows a u64.

The only requirement is that the ice_stat_update functions be called at
least once each time the hardware register overflows.
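The accumulation scheme described above can be sketched outside the driver
as follows; the helper name, widths, and structure here are illustrative
assumptions, not the ice code itself. Each read adds the delta since the
previous read to a 64-bit software counter, handling rollover of a 40-bit
hardware register.

```c
#include <assert.h>
#include <stdint.h>

#define STAT_WIDTH_BITS 40
#define STAT_MASK ((1ULL << STAT_WIDTH_BITS) - 1)

/* Fold one hardware register sample into a 64-bit software counter. */
static void stat_update(uint64_t hw_raw, int prev_loaded,
			uint64_t *prev, uint64_t *cur)
{
	uint64_t new_data = hw_raw & STAT_MASK;

	if (!prev_loaded) {
		/* first read: record the offset so stats count from zero */
		*prev = new_data;
		return;
	}

	if (new_data >= *prev)
		*cur += new_data - *prev;
	else
		/* hardware counter wrapped past 2^40 since last read */
		*cur += (new_data + (1ULL << STAT_WIDTH_BITS)) - *prev;

	/* prepare for the next delta computation */
	*prev = new_data;
}
```

As in the patch, correctness only requires that the function be called at
least once per hardware rollover period, since the delta is computed modulo
the register width.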

While we're fixing ice_stat_update40, modify it to use rd64 instead of
two calls to rd32. Additionally, drop the now unnecessary hireg
function parameter.

Signed-off-by: Jacob Keller 
Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_common.c   | 57 +++-
 drivers/net/ethernet/intel/ice/ice_common.h   |  4 +-
 .../net/ethernet/intel/ice/ice_hw_autogen.h   | 30 ---
 drivers/net/ethernet/intel/ice/ice_lib.c  | 40 -
 drivers/net/ethernet/intel/ice/ice_main.c | 90 ---
 5 files changed, 91 insertions(+), 130 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 2e0731c1e1a3..4be3559de207 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -3240,40 +3240,44 @@ void ice_replay_post(struct ice_hw *hw)
 /**
  * ice_stat_update40 - read 40 bit stat from the chip and update stat values
  * @hw: ptr to the hardware info
- * @hireg: high 32 bit HW register to read from
- * @loreg: low 32 bit HW register to read from
+ * @reg: offset of 64 bit HW register to read from
  * @prev_stat_loaded: bool to specify if previous stats are loaded
  * @prev_stat: ptr to previous loaded stat value
  * @cur_stat: ptr to current stat value
  */
 void
-ice_stat_update40(struct ice_hw *hw, u32 hireg, u32 loreg,
- bool prev_stat_loaded, u64 *prev_stat, u64 *cur_stat)
+ice_stat_update40(struct ice_hw *hw, u32 reg, bool prev_stat_loaded,
+ u64 *prev_stat, u64 *cur_stat)
 {
-   u64 new_data;
-
-   new_data = rd32(hw, loreg);
-   new_data |= ((u64)(rd32(hw, hireg) & 0x)) << 32;
+   u64 new_data = rd64(hw, reg) & (BIT_ULL(40) - 1);
 
/* device stats are not reset at PFR, they likely will not be zeroed
-* when the driver starts. So save the first values read and use them as
-* offsets to be subtracted from the raw values in order to report stats
-* that count from zero.
+* when the driver starts. Thus, save the value from the first read
+* without adding to the statistic value so that we report stats which
+* count up from zero.
 */
-   if (!prev_stat_loaded)
+   if (!prev_stat_loaded) {
*prev_stat = new_data;
+   return;
+   }
+
+   /* Calculate the difference between the new and old values, and then
+* add it to the software stat value.
+*/
if (new_data >= *prev_stat)
-   *cur_stat = new_data - *prev_stat;
+   *cur_stat += new_data - *prev_stat;
else
/* to manage the potential roll-over */
-   *cur_stat = (new_data + BIT_ULL(40)) - *prev_stat;
-   *cur_stat &= 0xFFULL;
+   *cur_stat += (new_data + BIT_ULL(40)) - *prev_stat;
+
+   /* Update the previously stored value to prepare for next read */
+   *prev_stat = new_data;
 }
 
 /**
  * ice_stat_update32 - read 32 bit stat from the chip and update stat values
  * @hw: ptr to the hardware info
- * @reg: HW register to read from
+ * @reg: offset of HW register to read from
  * @prev_stat_loaded: bool to specify if previous stats are loaded
  * @prev_stat: ptr to previous loaded stat value
  * @cur_stat: ptr to current stat value
@@ -3287,17 +3291,26 @@ ice_stat_update32(struct ice_hw *hw, u32 reg, bool 
prev_stat_loaded,
new_data = rd32(hw, reg);
 
/* device stats are not reset at PFR, they likely will not be zeroed
-* when the driver starts. So save the first values read and use them as
-* offsets to be subtracted from the raw values in order to report stats
-* that count from zero.
+* when the driver starts. Thus, save the value from the first read
+* without adding to the statistic value so that we report stats which
+* count up from zero.
 */
-   if (!prev_stat_loaded)
+  

[net-next 09/16] ice: Set up Tx scheduling tree based on alloc VSI Tx queues

2019-07-31 Thread Jeff Kirsher
From: Akeem G Abodunrin 

This patch uses the allocated number of Tx queues per VSI to set up its
scheduling tree instead of using the total number of available Tx queues.
Only PF VSIs have a number of allocated Tx queues equal to the number
of available Tx queues; other VSIs are configured with a different
number of queues.

Signed-off-by: Akeem G Abodunrin 
Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 01f38abd4277..e28478215810 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2511,7 +2511,7 @@ ice_vsi_setup(struct ice_pf *pf, struct ice_port_info *pi,
 
/* configure VSI nodes based on number of queues and TC's */
for (i = 0; i < vsi->tc_cfg.numtc; i++)
-   max_txqs[i] = pf->num_lan_tx;
+   max_txqs[i] = vsi->alloc_txq;
 
status = ice_cfg_vsi_lan(vsi->port_info, vsi->idx, vsi->tc_cfg.ena_tc,
 max_txqs);
@@ -3020,7 +3020,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
 
/* configure VSI nodes based on number of queues and TC's */
for (i = 0; i < vsi->tc_cfg.numtc; i++)
-   max_txqs[i] = pf->num_lan_tx;
+   max_txqs[i] = vsi->alloc_txq;
 
status = ice_cfg_vsi_lan(vsi->port_info, vsi->idx, vsi->tc_cfg.ena_tc,
 max_txqs);
@@ -3137,7 +3137,7 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
if (ena_tc & BIT(i))
num_tc++;
/* populate max_txqs per TC */
-   max_txqs[i] = pf->num_lan_tx;
+   max_txqs[i] = vsi->alloc_txq;
}
 
vsi->tc_cfg.ena_tc = ena_tc;
-- 
2.21.0



[net-next 04/16] ice: Always set prefena when configuring an Rx queue

2019-07-31 Thread Jeff Kirsher
From: Brett Creeley 

Currently we are always setting prefena to 0. This is causing the
hardware to only fetch descriptors when there are none free in the cache
for a received packet instead of prefetching when it has used the last
descriptor regardless of incoming packets. Fix this by allowing the
hardware to prefetch Rx descriptors.

Signed-off-by: Brett Creeley 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_common.c| 9 -
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h | 1 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 4be3559de207..01e5ecaaa322 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1078,6 +1078,7 @@ static const struct ice_ctx_ele ice_rlan_ctx_info[] = {
ICE_CTX_STORE(ice_rlan_ctx, tphdata_ena,1,  195),
ICE_CTX_STORE(ice_rlan_ctx, tphhead_ena,1,  196),
ICE_CTX_STORE(ice_rlan_ctx, lrxqthresh, 3,  198),
+   ICE_CTX_STORE(ice_rlan_ctx, prefena,1,  201),
{ 0 }
 };
 
@@ -1088,7 +1089,8 @@ static const struct ice_ctx_ele ice_rlan_ctx_info[] = {
  * @rxq_index: the index of the Rx queue
  *
  * Converts rxq context from sparse to dense structure and then writes
- * it to HW register space
+ * it to HW register space and enables the hardware to prefetch descriptors
+ * instead of only fetching them on demand
  */
 enum ice_status
 ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
@@ -1096,6 +1098,11 @@ ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx 
*rlan_ctx,
 {
u8 ctx_buf[ICE_RXQ_CTX_SZ] = { 0 };
 
+   if (!rlan_ctx)
+   return ICE_ERR_BAD_PTR;
+
+   rlan_ctx->prefena = 1;
+
ice_set_ctx((u8 *)rlan_ctx, ctx_buf, ice_rlan_ctx_info);
return ice_copy_rxq_ctx_to_hw(hw, ctx_buf, rxq_index);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h 
b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
index 510a8c900e61..57ea6811fe2c 100644
--- a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
+++ b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
@@ -290,6 +290,7 @@ struct ice_rlan_ctx {
u8 tphdata_ena;
u8 tphhead_ena;
u16 lrxqthresh; /* bigger than needed, see above for reason */
+   u8 prefena; /* NOTE: normally must be set to 1 at init */
 };
 
 struct ice_ctx_ele {
-- 
2.21.0



[net-next 08/16] ice: Only bump Rx tail and release buffers once per napi_poll

2019-07-31 Thread Jeff Kirsher
From: Brett Creeley 

Currently we bump the Rx tail and release/give buffers to hardware every
16 descriptors. This causes us to bump Rx tail up to 4 times per
napi_poll call. Also we are always bumping tail on an odd index and this
is a problem because hardware ignores the lower 3 bits in the QRX_TAIL
register. This is making it so hardware sees tail bumps only every 8
descriptors. Instead let's only bump Rx tail once per napi_poll if
the value aligns with hardware's expectations of the lower 3 bits being
cleared. Also only release/give Rx buffers once per napi_poll call.
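The alignment rule the commit message relies on can be sketched in
isolation (names are illustrative, and the real code issues a write
barrier before the MMIO write): since hardware ignores the low 3 bits of
QRX_TAIL, only values aligned down to an 8-descriptor boundary are worth
writing.

```c
#include <assert.h>
#include <stdint.h>

/* Write tail only when the 8-aligned value differs from the previous
 * next_to_use; returns 1 if a tail write would have been issued.
 */
static int maybe_bump_tail(uint16_t prev_ntu, uint16_t val,
			   uint16_t *tail_written)
{
	val &= ~0x7;		/* hardware ignores the lower 3 bits */
	if (prev_ntu != val) {
		/* in the driver, wmb() precedes writel(val, tail) here */
		*tail_written = val;
		return 1;
	}
	return 0;		/* nothing meaningful to tell hardware */
}
```

So bumping tail at, say, index 13 when hardware last saw 8 is a no-op,
which is why the patch skips such writes entirely.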

Signed-off-by: Brett Creeley 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 42 +++
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c 
b/drivers/net/ethernet/intel/ice/ice_txrx.c
index dd7392f293bf..0c459305c12f 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -377,18 +377,28 @@ int ice_setup_rx_ring(struct ice_ring *rx_ring)
  */
 static void ice_release_rx_desc(struct ice_ring *rx_ring, u32 val)
 {
+   u16 prev_ntu = rx_ring->next_to_use;
+
rx_ring->next_to_use = val;
 
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch. (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
+   /* QRX_TAIL will be updated with any tail value, but hardware ignores
+* the lower 3 bits. This makes it so we only bump tail on meaningful
+* boundaries. Also, this allows us to bump tail on intervals of 8 up to
+* the budget depending on the current traffic load.
 */
-   wmb();
-   writel(val, rx_ring->tail);
+   val &= ~0x7;
+   if (prev_ntu != val) {
+   /* Force memory writes to complete before letting h/w
+* know there are new descriptors to fetch. (Only
+* applicable for weak-ordered memory model archs,
+* such as IA-64).
+*/
+   wmb();
+   writel(val, rx_ring->tail);
+   }
 }
 
 /**
@@ -445,7 +455,13 @@ ice_alloc_mapped_page(struct ice_ring *rx_ring, struct 
ice_rx_buf *bi)
  * @rx_ring: ring to place buffers on
  * @cleaned_count: number of buffers to replace
  *
- * Returns false if all allocations were successful, true if any fail
+ * Returns false if all allocations were successful, true if any fail. 
Returning
+ * true signals to the caller that we didn't replace cleaned_count buffers and
+ * there is more work to do.
+ *
+ * First, try to clean "cleaned_count" Rx buffers. Then refill the cleaned Rx
+ * buffers. Then bump tail at most one time. Grouping like this lets us avoid
+ * multiple tail writes per call.
  */
 bool ice_alloc_rx_bufs(struct ice_ring *rx_ring, u16 cleaned_count)
 {
@@ -990,7 +1006,7 @@ static int ice_clean_rx_irq(struct ice_ring *rx_ring, int 
budget)
 {
unsigned int total_rx_bytes = 0, total_rx_pkts = 0;
u16 cleaned_count = ICE_DESC_UNUSED(rx_ring);
-   bool failure = false;
+   bool failure;
 
/* start the loop to process Rx packets bounded by 'budget' */
while (likely(total_rx_pkts < (unsigned int)budget)) {
@@ -1002,13 +1018,6 @@ static int ice_clean_rx_irq(struct ice_ring *rx_ring, 
int budget)
u16 vlan_tag = 0;
u8 rx_ptype;
 
-   /* return some buffers to hardware, one at a time is too slow */
-   if (cleaned_count >= ICE_RX_BUF_WRITE) {
-   failure = failure ||
- ice_alloc_rx_bufs(rx_ring, cleaned_count);
-   cleaned_count = 0;
-   }
-
/* get the Rx desc from Rx ring based on 'next_to_clean' */
rx_desc = ICE_RX_DESC(rx_ring, rx_ring->next_to_clean);
 
@@ -1085,6 +1094,9 @@ static int ice_clean_rx_irq(struct ice_ring *rx_ring, int 
budget)
total_rx_pkts++;
}
 
+   /* return up to cleaned_count buffers to hardware */
+   failure = ice_alloc_rx_bufs(rx_ring, cleaned_count);
+
/* update queue and vector specific stats */
u64_stats_update_begin(&rx_ring->syncp);
rx_ring->stats.pkts += total_rx_pkts;
-- 
2.21.0



[net-next 06/16] ice: Do not configure port with no media

2019-07-31 Thread Jeff Kirsher
From: Tony Nguyen 

The firmware reports an error when trying to configure a port with no
media. Instead of always configuring the port, check for media before
attempting to configure it. In the absence of media, turn off link and
poll for media to become available before re-enabling link.

Move ice_force_phys_link_state() up to avoid forward declaration.

Signed-off-by: Tony Nguyen 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice.h  |   1 +
 drivers/net/ethernet/intel/ice/ice_main.c | 239 ++
 2 files changed, 158 insertions(+), 82 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h 
b/drivers/net/ethernet/intel/ice/ice.h
index 9ee6b3c0..596b09a905aa 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -337,6 +337,7 @@ enum ice_pf_flags {
ICE_FLAG_DCB_CAPABLE,
ICE_FLAG_DCB_ENA,
ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA,
+   ICE_FLAG_NO_MEDIA,
ICE_FLAG_ENABLE_FW_LLDP,
ICE_FLAG_ETHTOOL_CTXT,  /* set when ethtool holds RTNL lock */
ICE_PF_FLAGS_NBITS  /* must be last */
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index f490e65c64bc..91334d1e83ed 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -810,6 +810,20 @@ ice_link_event(struct ice_pf *pf, struct ice_port_info 
*pi, bool link_up,
if (!vsi || !vsi->port_info)
return -EINVAL;
 
+   /* turn off PHY if media was removed */
+   if (!test_bit(ICE_FLAG_NO_MEDIA, pf->flags) &&
+   !(pi->phy.link_info.link_info & ICE_AQ_MEDIA_AVAILABLE)) {
+   set_bit(ICE_FLAG_NO_MEDIA, pf->flags);
+
+   result = ice_aq_set_link_restart_an(pi, false, NULL);
+   if (result) {
+   dev_dbg(&pf->pdev->dev,
+   "Failed to set link down, VSI %d error %d\n",
+   vsi->vsi_num, result);
+   return result;
+   }
+   }
+
ice_vsi_link_event(vsi, link_up);
ice_print_link_msg(vsi, link_up);
 
@@ -1314,6 +1328,124 @@ static void ice_handle_mdd_event(struct ice_pf *pf)
}
 }
 
+/**
+ * ice_force_phys_link_state - Force the physical link state
+ * @vsi: VSI to force the physical link state to up/down
+ * @link_up: true/false indicates to set the physical link to up/down
+ *
+ * Force the physical link state by getting the current PHY capabilities from
+ * hardware and setting the PHY config based on the determined capabilities. If
+ * link changes a link event will be triggered because both the Enable 
Automatic
+ * Link Update and LESM Enable bits are set when setting the PHY capabilities.
+ *
+ * Returns 0 on success, negative on failure
+ */
+static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up)
+{
+   struct ice_aqc_get_phy_caps_data *pcaps;
+   struct ice_aqc_set_phy_cfg_data *cfg;
+   struct ice_port_info *pi;
+   struct device *dev;
+   int retcode;
+
+   if (!vsi || !vsi->port_info || !vsi->back)
+   return -EINVAL;
+   if (vsi->type != ICE_VSI_PF)
+   return 0;
+
+   dev = &vsi->back->pdev->dev;
+
+   pi = vsi->port_info;
+
+   pcaps = devm_kzalloc(dev, sizeof(*pcaps), GFP_KERNEL);
+   if (!pcaps)
+   return -ENOMEM;
+
+   retcode = ice_aq_get_phy_caps(pi, false, ICE_AQC_REPORT_SW_CFG, pcaps,
+ NULL);
+   if (retcode) {
+   dev_err(dev,
+   "Failed to get phy capabilities, VSI %d error %d\n",
+   vsi->vsi_num, retcode);
+   retcode = -EIO;
+   goto out;
+   }
+
+   /* No change in link */
+   if (link_up == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) &&
+   link_up == !!(pi->phy.link_info.link_info & ICE_AQ_LINK_UP))
+   goto out;
+
+   cfg = devm_kzalloc(dev, sizeof(*cfg), GFP_KERNEL);
+   if (!cfg) {
+   retcode = -ENOMEM;
+   goto out;
+   }
+
+   cfg->phy_type_low = pcaps->phy_type_low;
+   cfg->phy_type_high = pcaps->phy_type_high;
+   cfg->caps = pcaps->caps | ICE_AQ_PHY_ENA_AUTO_LINK_UPDT;
+   cfg->low_power_ctrl = pcaps->low_power_ctrl;
+   cfg->eee_cap = pcaps->eee_cap;
+   cfg->eeer_value = pcaps->eeer_value;
+   cfg->link_fec_opt = pcaps->link_fec_options;
+   if (link_up)
+   cfg->caps |= ICE_AQ_PHY_ENA_LINK;
+   else
+   cfg->caps &= ~ICE_AQ_PHY_ENA_LINK;
+
+   retcode = ice_aq_set_phy_cfg(&vsi->back->hw, pi->lport, cfg, NULL);
+   if (retcode) {
+   dev_err(dev, "Failed to set phy config, VSI %d error %d\n",
+   vsi->vsi_num, retcode);
+   retcode = -EIO;
+   }
+
+   devm_kfr

[PATCH net] net: phy: fix race in genphy_update_link

2019-07-31 Thread Heiner Kallweit
In phy_start_aneg() autoneg is started, and immediately after that
link and autoneg status are read. As reported in [0] it can happen that
at time of this read the PHY has reset the "aneg complete" bit but not
yet the "link up" bit, what can result in a false link-up detection.
To fix this don't report link as up if we're in aneg mode and PHY
doesn't signal "aneg complete".

[0] https://marc.info/?t=15641350993&r=1&w=2
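The check the patch adds can be illustrated with a small stand-alone
sketch (a simplification of the phylib logic, not the actual
genphy_update_link() code): when autonegotiation is enabled but the PHY
has not yet set "aneg complete", a still-set "link up" bit is stale and
must not be trusted. The BMSR bit positions below are the standard
Clause 22 definitions.

```c
#include <assert.h>

#define BMSR_LSTATUS_BIT	(1 << 2)	/* link status */
#define BMSR_ANEGCOMPLETE_BIT	(1 << 5)	/* autoneg complete */

/* Decide whether to report link up from a BMSR read. */
static int link_is_up(unsigned int bmsr, int autoneg_enabled)
{
	int link = !!(bmsr & BMSR_LSTATUS_BIT);
	int aneg_done = !!(bmsr & BMSR_ANEGCOMPLETE_BIT);

	/* aneg restarted: "aneg complete" cleared but "link up" not yet */
	if (autoneg_enabled && !aneg_done)
		link = 0;

	return link;
}
```

In forced (non-autoneg) mode the link bit is still honored as-is, matching
the conditional guard in the patch.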

Fixes: 4950c2ba49cc ("net: phy: fix autoneg mismatch case in 
genphy_read_status")
Reported-by: liuyonglong 
Tested-by: liuyonglong 
Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy_device.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 6b5cb87f3..7ddd91df9 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1774,6 +1774,12 @@ int genphy_update_link(struct phy_device *phydev)
phydev->link = status & BMSR_LSTATUS ? 1 : 0;
phydev->autoneg_complete = status & BMSR_ANEGCOMPLETE ? 1 : 0;
 
+   /* Consider the case that autoneg was started and "aneg complete"
+* bit has been reset, but "link up" bit not yet.
+*/
+   if (phydev->autoneg == AUTONEG_ENABLE && !phydev->autoneg_complete)
+   phydev->link = 0;
+
return 0;
 }
 EXPORT_SYMBOL(genphy_update_link);
-- 
2.22.0



Re: [patch net-next 0/3] net: devlink: Finish network namespace support

2019-07-31 Thread David Ahern
On 7/30/19 12:08 AM, Jiri Pirko wrote:
> Mon, Jul 29, 2019 at 10:17:25PM CEST, dsah...@gmail.com wrote:
>> On 7/27/19 3:44 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Devlink from the beginning counts with network namespaces, but the
>>> instances has been fixed to init_net. The first patch allows user
>>> to move existing devlink instances into namespaces:
>>>
>>
>> so you intend for an asic, for example, to have multiple devlink
>> instances where each instance governs a set of related ports (e.g.,
>> ports that share a set of hardware resources) and those instances can be
>> managed from distinct network namespaces?
> 
> No, no multiple devlink instances for asic intended.

So it should be allowed for an asic to have resources split across
network namespaces. e.g., something like this:

   namespace 1 |  namespace 2  | ... | namespace N
   |   | |
  { ports 1 }  |   { ports 2 } | ... | { ports N }
   |   | |
   devlink 1   |devlink 2  | ... |  devlink N
  =====================================================
   driver



Re: [PATCH] net: mdio-octeon: Fix build error and Kconfig warning

2019-07-31 Thread Randy Dunlap
On 7/31/19 11:50 AM, Nathan Chancellor wrote:
> arm allyesconfig warns:
> 
> WARNING: unmet direct dependencies detected for MDIO_OCTEON
>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
> && 64BIT && HAS_IOMEM [=y] && OF_MDIO [=y]
>   Selected by [y]:
>   - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC &&
> NETDEVICES [=y] || COMPILE_TEST [=y])
> 
> and errors:
> 
> In file included from ../drivers/net/phy/mdio-octeon.c:14:
> ../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
> ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
> function 'writeq'; did you mean 'writeb'?
> [-Werror=implicit-function-declaration]
>   111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
>   |^~
> ../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro
> 'oct_mdio_writeq'
>56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
>   |  ^~~
> cc1: some warnings being treated as errors
> 
> This allows MDIO_OCTEON to be built with COMPILE_TEST as well and
> includes the proper header for readq/writeq. This does not address
> the several -Wint-to-pointer-cast and -Wpointer-to-int-cast warnings
> that appeared as a result of commit 171a9bae68c7 ("staging/octeon:
> Allow test build on !MIPS") in these files.
> 
> Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
> Reported-by: kbuild test robot 
> Reported-by: Mark Brown 
> Reported-by: Randy Dunlap 
> Signed-off-by: Nathan Chancellor 


With today's linux-next (20190731), after applying Greg's "depends on
NETDEVICES" patch and this patch, I am still seeing a Kconfig warning and
build errors (building for i386):

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 
(64BIT [=n] || COMPILE_TEST [=y]) && HAS_IOMEM [=y] && OF_MDIO [=n]
  Selected by [m]:
  - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC || COMPILE_TEST 
[=y]) && NETDEVICES [=y]

ERROR: "cavium_mdiobus_write" [drivers/net/phy/mdio-octeon.ko] undefined!
ERROR: "cavium_mdiobus_read" [drivers/net/phy/mdio-octeon.ko] undefined!


kernel .config file is attached.

Am I missing another patch?

thanks.

> ---
>  drivers/net/phy/Kconfig   | 2 +-
>  drivers/net/phy/mdio-cavium.h | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 20f14c5fbb7e..ed2edf4b5b0e 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -159,7 +159,7 @@ config MDIO_MSCC_MIIM
>  
>  config MDIO_OCTEON
>   tristate "Octeon and some ThunderX SOCs MDIO buses"
> - depends on 64BIT
> + depends on 64BIT || COMPILE_TEST
>   depends on HAS_IOMEM && OF_MDIO
>   select MDIO_CAVIUM
>   help
> diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
> index ed5f9bb5448d..b7f89ad27465 100644
> --- a/drivers/net/phy/mdio-cavium.h
> +++ b/drivers/net/phy/mdio-cavium.h
> @@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
>   return cvmx_read_csr(addr);
>  }
>  #else
> +#include 
> +
>  #define oct_mdio_writeq(val, addr)   writeq(val, (void *)addr)
>  #define oct_mdio_readq(addr) readq((void *)addr)
>  #endif
> 


-- 
~Randy


cavium.i386.config
Description: application/config
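The warning quoted above is the generic "select ignores dependencies" problem: `select MDIO_OCTEON` forces the symbol on even when MDIO_OCTEON's own `depends on` line (e.g. OF_MDIO, shown as [=n]) is unmet. A hedged Kconfig sketch of the invariant, using the symbols from the quoted output; the actual fix in the tree may differ:

```kconfig
# Sketch only: a symbol that uses "select" must itself guarantee every
# dependency of the symbol it selects, otherwise Kconfig emits
# "unmet direct dependencies". Here OCTEON_ETHERNET would need to
# carry (or imply) what MDIO_OCTEON depends on.
config OCTEON_ETHERNET
	tristate "Cavium Networks Octeon Ethernet support"
	depends on (CAVIUM_OCTEON_SOC && NETDEVICES) || COMPILE_TEST
	depends on HAS_IOMEM && OF_MDIO
	select MDIO_OCTEON
```

Alternatively, the `select` can be dropped in favor of a plain `depends on MDIO_OCTEON`, which avoids the problem entirely at the cost of requiring the user to enable it explicitly.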


Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP

2019-07-31 Thread David Ahern
On 7/30/19 7:21 PM, Jakub Kicinski wrote:
> 
 If bpftool was taught to do equivalent of 'ip link' that would be
 very different story and I would be opposed to that.  
>>> Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
>>> of a judgement call.  
>> bpftool must be able to introspect every aspect of bpf programming.
>> That includes detaching and attaching anywhere.
>> Anyone doing 'bpftool p s' should be able to switch off particular
>> prog id without learning ten different other tools.
> I think the fact that we already have an implementation in iproute2,
> which is at the risk of bit rot is more important to me that the
> hypothetical scenario where everyone knows to just use bpftool (for
> XDP, for TC it's still iproute2 unless there's someone crazy enough 
> to reimplement the TC functionality :))

Apparently the iproute2 version has bit rot, which is a shame.

> 
> I'm not sure we can settle our differences over email :)
> I have tremendous respect for all the maintainers I CCed here, 
> if nobody steps up to agree with me I'll concede the bpftool net
> battle entirely :)

bpftool started as an introspection tool and has turned into a one stop
shop for all things ebpf. I am mixed on whether that is a good thing or
a bad thing.


  1   2   >