[PATCH v3 00/10] of_net: Add NVMEM support to of_get_mac_address

2019-05-03 Thread Petr Štetiar
Hi,

this patch series is a continuation of my previous attempt[1], where I
tried to wire the MTD layer into of_get_mac_address so it would be possible
to load MAC addresses from various NVMEMs such as EEPROMs etc.

The predecessor of this patch, which used the MTD layer directly, originated
in OpenWrt some time ago and already covers about 497 use cases in 357
device tree files.

During the review process of my 1st attempt I was told that I shouldn't be
using MTD directly, but should rather use the new NVMEM subsystem, and
during the review process of v2 I was told that I should handle the
EPROBE_DEFER error as well, so this v3 patch series tries to accommodate
all these remarks.

The first patch wires NVMEM support directly into of_get_mac_address, as
it's obvious that adding NVMEM support into every other driver would
mean adding a lot of repetitive code. This patch allows us to configure MAC
addresses of various devices like Ethernet and wireless adapters directly
from of_get_mac_address, which is used by quite a lot of drivers in the
tree already.

The second patch simply updates the documentation with the NVMEM bits and
cleans up all current binding documentation referencing any of the MAC
address related properties.

The third and fourth patches simply remove duplicate NVMEM code which is no
longer needed, as the first patch has wired NVMEM support directly into
of_get_mac_address.

Patches 5-10 convert all current users of of_get_mac_address to the new
ERR_PTR-encoded error value, as of_get_mac_address can now return a valid
pointer, NULL, or an ERR_PTR.
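
Just as an illustration (not part of the series, the driver variable names
are placeholders), a converted caller now looks roughly like this,
mirroring what patches 5-10 do:

	mac_addr = of_get_mac_address(np);
	if (!IS_ERR_OR_NULL(mac_addr))
		ether_addr_copy(ndev->dev_addr, mac_addr);
	else
		eth_hw_addr_random(ndev);	/* no usable MAC address found */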

Just to give a better picture, this patch series plus one simple patch[2] on
top of it allows me to configure the 8Devices Carambola2 board's MAC
addresses with the following DTS (simplified):

 &spi {
 	flash@0 {
 		partitions {
 			art: partition@ff {
 				label = "art";
 				reg = <0xff 0x01>;
 				read-only;

 				nvmem-cells {
 					compatible = "nvmem-cells";
 					#address-cells = <1>;
 					#size-cells = <1>;

 					eth0_addr: eth-mac-addr@0 {
 						reg = <0x0 0x6>;
 					};

 					eth1_addr: eth-mac-addr@6 {
 						reg = <0x6 0x6>;
 					};

 					wmac_addr: wifi-mac-addr@1002 {
 						reg = <0x1002 0x6>;
 					};
 				};
 			};
 		};
 	};
 };

 &eth0 {
 	nvmem-cells = <&eth0_addr>;
 	nvmem-cell-names = "mac-address";
 };

 &eth1 {
 	nvmem-cells = <&eth1_addr>;
 	nvmem-cell-names = "mac-address";
 };

 &wmac {
 	nvmem-cells = <&wmac_addr>;
 	nvmem-cell-names = "mac-address";
 };


1. https://patchwork.ozlabs.org/patch/1086628/
2. https://patchwork.ozlabs.org/patch/890738/

-- ynezz

Petr Štetiar (10):
  of_net: add NVMEM support to of_get_mac_address
  dt-bindings: doc: reflect new NVMEM of_get_mac_address behaviour
  net: macb: support of_get_mac_address new ERR_PTR error
  net: davinci: support of_get_mac_address new ERR_PTR error
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: support of_get_mac_address new ERR_PTR error
  net: wireless: support of_get_mac_address new ERR_PTR error
  staging: octeon-ethernet: support of_get_mac_address new ERR_PTR error
  ARM: Kirkwood: support of_get_mac_address new ERR_PTR error
  powerpc: tsi108: support of_get_mac_address new ERR_PTR error

 .../devicetree/bindings/net/altera_tse.txt |  5 +-
 Documentation/devicetree/bindings/net/amd-xgbe.txt |  5 +-
 .../devicetree/bindings/net/brcm,amac.txt  |  4 +-
 Documentation/devicetree/bindings/net/cpsw.txt |  4 +-
 .../devicetree/bindings/net/davinci_emac.txt   |  5 +-
 Documentation/devicetree/bindings/net/dsa/dsa.txt  |  5 +-
 Documentation/devicetree/bindings/net/ethernet.txt |  6 +-
 .../devicetree/bindings/net/hisilicon-femac.txt|  4 +-
 .../bindings/net/hisilicon-hix5hd2-gmac.txt|  4 +-
 .../devicetree/bindings/net/keystone-netcp.txt | 10 ++--
 Documentation/devicetree/bindings/net/macb.txt |  5 +-
 .../devicetree/bindings/net/marvell-pxa168.txt |  4 +-
 .../devicetree/bindings/net/microchip,enc28j60.txt |  3 +-
 .../devicetree/bindings/net/microchip,lan78xx.txt  |  5 +-
 .../devicetree/bindings/net/qca,qca7000.txt|  4 +-
 .../devicetree/bindings/net/samsung-sxgbe.txt  |  4 +-
 .../bindings/net/snps,dwc-qos-ethernet.txt |  5 +-
 .../bindings/net/socionext,uniphier-ave4.txt   |  4 +-
 .../devicetree/bind

[PATCH v3 01/10] of_net: add NVMEM support to of_get_mac_address

2019-05-03 Thread Petr Štetiar
Many embedded devices have information such as MAC addresses stored
inside NVMEMs like EEPROMs and so on. Currently there are only two
drivers in the tree which benefit from NVMEM bindings.

Adding support for NVMEM into every other driver would mean adding a lot
of repetitive code. This patch allows us to configure MAC addresses in
various devices like ethernet and wireless adapters directly from
of_get_mac_address, which is already used by almost every driver in the
tree.

The predecessor of this patch, which used the MTD layer directly, originated
in OpenWrt some time ago and already covers about 497 use cases in 357
device tree files.

Cc: Alban Bedel 
Signed-off-by: Felix Fietkau 
Signed-off-by: John Crispin 
Signed-off-by: Petr Štetiar 
---

 Changes since v1:

  * moved handling of nvmem after mac-address and local-mac-address properties

 Changes since v2:

  * moved of_get_mac_addr_nvmem after of_get_mac_addr(np, "address") call
  * replaced kzalloc, kmemdup and kfree with their devm variants
  * introduced of_has_nvmem_mac_addr helper which checks if the DT node has an
nvmem cell named `mac-address`
  * of_get_mac_address now returns ERR_PTR encoded error value

 drivers/of/of_net.c | 65 ++---
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c
index d820f3e..258ceb8 100644
--- a/drivers/of/of_net.c
+++ b/drivers/of/of_net.c
@@ -8,8 +8,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 /**
  * of_get_phy_mode - Get phy mode for given device_node
@@ -47,12 +49,59 @@ static const void *of_get_mac_addr(struct device_node *np, 
const char *name)
return NULL;
 }
 
+static const void *of_get_mac_addr_nvmem(struct device_node *np)
+{
+   int ret;
+   u8 mac[ETH_ALEN];
+   struct property *pp;
+   struct platform_device *pdev = of_find_device_by_node(np);
+
+   if (!pdev)
+   return ERR_PTR(-ENODEV);
+
+   ret = nvmem_get_mac_address(&pdev->dev, &mac);
+   if (ret)
+   return ERR_PTR(ret);
+
+   pp = devm_kzalloc(&pdev->dev, sizeof(*pp), GFP_KERNEL);
+   if (!pp)
+   return ERR_PTR(-ENOMEM);
+
+   pp->name = "nvmem-mac-address";
+   pp->length = ETH_ALEN;
+   pp->value = devm_kmemdup(&pdev->dev, mac, ETH_ALEN, GFP_KERNEL);
+   if (!pp->value) {
+   ret = -ENOMEM;
+   goto free;
+   }
+
+   ret = of_add_property(np, pp);
+   if (ret)
+   goto free;
+
+   return pp->value;
+free:
+   devm_kfree(&pdev->dev, pp->value);
+   devm_kfree(&pdev->dev, pp);
+
+   return ERR_PTR(ret);
+}
+
+static inline bool of_has_nvmem_mac_addr(struct device_node *np)
+{
+   int index = of_property_match_string(np, "nvmem-cell-names",
+"mac-address");
+   return of_parse_phandle(np, "nvmem-cells", index) != NULL;
+}
+
 /**
  * Search the device tree for the best MAC address to use.  'mac-address' is
  * checked first, because that is supposed to contain to "most recent" MAC
  * address. If that isn't set, then 'local-mac-address' is checked next,
- * because that is the default address.  If that isn't set, then the obsolete
- * 'address' is checked, just in case we're using an old device tree.
+ * because that is the default address. If that isn't set, then the obsolete
+ * 'address' is checked, just in case we're using an old device tree. If any
+ * of the above isn't set, then try to get MAC address from nvmem cell named
+ * 'mac-address'.
  *
  * Note that the 'address' property is supposed to contain a virtual address of
  * the register set, but some DTS files have redefined that property to be the
@@ -64,6 +113,9 @@ static const void *of_get_mac_addr(struct device_node *np, 
const char *name)
  * addresses.  Some older U-Boots only initialized 'local-mac-address'.  In
  * this case, the real MAC is in 'local-mac-address', and 'mac-address' exists
  * but is all zeros.
+ *
+ * Return: Will be a valid pointer on success, NULL in case there wasn't
+ * 'mac-address' nvmem cell node found, and ERR_PTR in case of error.
 */
 const void *of_get_mac_address(struct device_node *np)
 {
@@ -77,6 +129,13 @@ const void *of_get_mac_address(struct device_node *np)
if (addr)
return addr;
 
-   return of_get_mac_addr(np, "address");
+   addr = of_get_mac_addr(np, "address");
+   if (addr)
+   return addr;
+
+   if (!of_has_nvmem_mac_addr(np))
+   return NULL;
+
+   return of_get_mac_addr_nvmem(np);
 }
 EXPORT_SYMBOL(of_get_mac_address);
-- 
1.9.1



[PATCH v3 04/10] net: davinci: support of_get_mac_address new ERR_PTR error

2019-05-03 Thread Petr Štetiar
NVMEM support was added directly to of_get_mac_address, and it uses
nvmem_get_mac_address under the hood, so we can remove the driver's own
call to it. As of_get_mac_address can now return NULL and ERR_PTR-encoded
error values, adjust to that as well.

Signed-off-by: Petr Štetiar 
---

 Changes since v2:

 * ERR_PTR handling

 drivers/net/ethernet/ti/davinci_emac.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_emac.c 
b/drivers/net/ethernet/ti/davinci_emac.c
index 57450b1..4229ef0 100644
--- a/drivers/net/ethernet/ti/davinci_emac.c
+++ b/drivers/net/ethernet/ti/davinci_emac.c
@@ -1714,7 +1714,7 @@ static struct net_device_stats 
*emac_dev_getnetstats(struct net_device *ndev)
 
if (!is_valid_ether_addr(pdata->mac_addr)) {
mac_addr = of_get_mac_address(np);
-   if (mac_addr)
+   if (!IS_ERR_OR_NULL(mac_addr))
ether_addr_copy(pdata->mac_addr, mac_addr);
}
 
@@ -1912,15 +1912,11 @@ static int davinci_emac_probe(struct platform_device 
*pdev)
ether_addr_copy(ndev->dev_addr, priv->mac_addr);
 
if (!is_valid_ether_addr(priv->mac_addr)) {
-   /* Try nvmem if MAC wasn't passed over pdata or DT. */
-   rc = nvmem_get_mac_address(&pdev->dev, priv->mac_addr);
-   if (rc) {
-   /* Use random MAC if still none obtained. */
-   eth_hw_addr_random(ndev);
-   memcpy(priv->mac_addr, ndev->dev_addr, ndev->addr_len);
-   dev_warn(&pdev->dev, "using random MAC addr: %pM\n",
-priv->mac_addr);
-   }
+   /* Use random MAC if still none obtained. */
+   eth_hw_addr_random(ndev);
+   memcpy(priv->mac_addr, ndev->dev_addr, ndev->addr_len);
+   dev_warn(&pdev->dev, "using random MAC addr: %pM\n",
+priv->mac_addr);
}
 
ndev->netdev_ops = &emac_netdev_ops;
-- 
1.9.1



Possible refcount bug in ip6_expire_frag_queue()?

2019-05-03 Thread Stefan Bader
In commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3 "ipv6: frags:
rewrite ip6_expire_frag_queue()" this function got changed to
be like ip_expire() (after dropping a clone there).
This was backported to 4.4.y stable (amongst other stable trees)
in v4.4.174.

Since then we have received reports that in environments with heavy IPv6
load, the kernel crashes about every 2-3 hours with the following trace: [1].

The crash is triggered by the skb_shared(skb) check in
pskb_expand_head(). Comparing ip6_expire_frag_queue() and
ip_expire(), the ipv6 code does a skb_get() which increments that
refcount while the ipv4 code does not seem to do that.

Could it be that ip6_expire_frag_queue() should not
call skb_get() when using the first skb of the frag queue for
the ICMP message?
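
To make the suspicion concrete, this is roughly the sequence I mean
(simplified from my reading of the 4.4.y code, not a tested fix):

	/* ip6_expire_frag_queue(), simplified */
	head = fq->q.fragments;
	...
	skb_get(head);		/* refcount of head is now > 1 */
	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
	/*
	 * icmp6_send() -> icmpv6_route_lookup() -> __xfrm_decode_session()
	 * -> __pskb_pull_tail() -> pskb_expand_head(), which hits
	 * BUG_ON(skb_shared(skb)) because of the extra reference taken
	 * above.  ip_expire() on the IPv4 side does not seem to take that
	 * extra reference before building the ICMP message.
	 */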

Thanks,
Stefan



[1]
[296583.091021] kernel BUG at 
/build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
[296583.091734] Call Trace:
[296583.091749]  [] __pskb_pull_tail+0x50/0x350
[296583.091764]  [] _decode_session6+0x26a/0x400
[296583.091779]  [] __xfrm_decode_session+0x39/0x50
[296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
[296583.091809]  [] icmp6_send+0x5e1/0x940
[296583.091823]  [] ? __netif_receive_skb+0x18/0x60
[296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
[296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
[296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091893]  [] icmpv6_send+0x21/0x30
[296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
[296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
[nf_defrag_ipv6]
[296583.091938]  [] call_timer_fn+0x37/0x140
[296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091968]  [] run_timer_softirq+0x234/0x330
[296583.091982]  [] __do_softirq+0x109/0x2b0
[296583.091995]  [] irq_exit+0xa5/0xb0
[296583.092008]  [] smp_apic_timer_interrupt+0x50/0x70
[296583.092023]  [] apic_timer_interrupt+0xcc/0xe0
[296583.092037]  
[296583.092044]  [] ? cpuidle_enter_state+0x11e/0x2d0
[296583.092060]  [] cpuidle_enter+0x17/0x20
[296583.092073]  [] call_cpuidle+0x32/0x60
[296583.092086]  [] ? cpuidle_select+0x19/0x20
[296583.092099]  [] cpu_startup_entry+0x296/0x360
[296583.092114]  [] start_secondary+0x177/0x1b0
[296583.092878] Code: 75 1a 41 8b 87 cc 00 00 00 49 03 87 d0 00 00 00 e9 e2 fe 
ff ff b8 f4 ff ff ff eb bc 4c 89 ef e8 f4 99 ab ff b8 f4 ff ff ff eb ad <0f> 0b 
90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89
[296583.094510] RIP  [] pskb_expand_head+0x243/0x250
[296583.095302]  RSP 
[296583.099491] ---[ end trace 4262f47656f8ba9f ]---


[PATCH v2 7/8] netlink: add infrastructure to expose policies to userspace

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Add, and use in generic netlink, helpers to dump out a netlink
policy to userspace, including all the range validation data,
nested policies etc.

This lets userspace discover what the kernel understands.

For families/commands other than generic netlink, the helpers
need to be used directly in an appropriate command, or we can
add some infrastructure (a new netlink family) that those can
register their policies with for introspection. I'm not that
familiar with non-generic netlink, so that's left out for now.

The data exposed to userspace also includes min and max length
for binary/string data. I've done that instead of letting the
userspace tools figure out whether min/max is intended based
on the type, so that we can extend this later in the kernel; we
might want to just use the range data, for example.

Because of this, I opted to not directly expose the NLA_*
values, even if some of them are already exposed via BPF, as
with min/max length we don't need to have different types here
for NLA_BINARY/NLA_MIN_LEN/NLA_EXACT_LEN, we just make them
all NL_ATTR_TYPE_BINARY with min/max length optionally set.

Similarly, we don't really need NLA_MSECS, and perhaps can
remove it in the future - but not if we encode it into the
userspace API now. It gets mapped to NL_ATTR_TYPE_U64 here.

Note that the exposing here corresponds to the strict policy
interpretation, and NLA_UNSPEC items are omitted entirely.
To get those, change them to NLA_MIN_LEN which behaves in
exactly the same way, but is exposed.
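
As an illustration only (the attribute name is made up), converting such an
entry would look roughly like this:

	/* before: NLA_UNSPEC with a minimum payload length, omitted from the dump */
	[MY_ATTR_COOKIE] = { .type = NLA_UNSPEC, .len = 8 },

	/* after: same validation behaviour, but exposed as NL_ATTR_TYPE_BINARY
	 * with a minimum length of 8
	 */
	[MY_ATTR_COOKIE] = NLA_POLICY_MIN_LEN(8),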

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h  |   6 +
 include/uapi/linux/genetlink.h |   2 +
 include/uapi/linux/netlink.h   | 103 +++
 net/netlink/Makefile   |   2 +-
 net/netlink/genetlink.c|  77 +
 net/netlink/policy.c   | 308 +
 6 files changed, 497 insertions(+), 1 deletion(-)
 create mode 100644 net/netlink/policy.c

diff --git a/include/net/netlink.h b/include/net/netlink.h
index c2b4bc819784..e298838a57dc 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1900,4 +1900,10 @@ void nla_get_range_unsigned(const struct nla_policy *pt,
 void nla_get_range_signed(const struct nla_policy *pt,
  struct netlink_range_validation_signed *range);
 
+int netlink_policy_dump_start(const struct nla_policy *policy,
+ unsigned int maxtype,
+ unsigned long *state);
+bool netlink_policy_dump_loop(unsigned long *state);
+int netlink_policy_dump_write(struct sk_buff *skb, unsigned long state);
+
 #endif
diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
index 877f7fa95466..9c0636ec2286 100644
--- a/include/uapi/linux/genetlink.h
+++ b/include/uapi/linux/genetlink.h
@@ -48,6 +48,7 @@ enum {
CTRL_CMD_NEWMCAST_GRP,
CTRL_CMD_DELMCAST_GRP,
CTRL_CMD_GETMCAST_GRP, /* unused */
+   CTRL_CMD_GETPOLICY,
__CTRL_CMD_MAX,
 };
 
@@ -62,6 +63,7 @@ enum {
CTRL_ATTR_MAXATTR,
CTRL_ATTR_OPS,
CTRL_ATTR_MCAST_GROUPS,
+   CTRL_ATTR_POLICY,
__CTRL_ATTR_MAX,
 };
 
diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 0a4d73317759..eac8a6a648ea 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -249,4 +249,107 @@ struct nla_bitfield32 {
__u32 selector;
 };
 
+/*
+ * policy descriptions - it's specific to each family how this is used
+ * Normally, it should be retrieved via a dump inside another attribute
+ * specifying where it applies.
+ */
+
+/**
+ * enum netlink_attribute_type - type of an attribute
+ * @NL_ATTR_TYPE_INVALID: unused
+ * @NL_ATTR_TYPE_FLAG: flag attribute (present/not present)
+ * @NL_ATTR_TYPE_U8: 8-bit unsigned attribute
+ * @NL_ATTR_TYPE_U16: 16-bit unsigned attribute
+ * @NL_ATTR_TYPE_U32: 32-bit unsigned attribute
+ * @NL_ATTR_TYPE_U64: 64-bit unsigned attribute
+ * @NL_ATTR_TYPE_S8: 8-bit signed attribute
+ * @NL_ATTR_TYPE_S16: 16-bit signed attribute
+ * @NL_ATTR_TYPE_S32: 32-bit signed attribute
+ * @NL_ATTR_TYPE_S64: 64-bit signed attribute
+ * @NL_ATTR_TYPE_BINARY: binary data, min/max length may be specified
+ * @NL_ATTR_TYPE_STRING: string, min/max length may be specified
+ * @NL_ATTR_TYPE_NUL_STRING: NUL-terminated string,
+ * min/max length may be specified
+ * @NL_ATTR_TYPE_NESTED: nested, i.e. the content of this attribute
+ * consists of sub-attributes. The nested policy and maxtype
+ * inside may be specified.
+ * @NL_ATTR_TYPE_NESTED_ARRAY: nested array, i.e. the content of this
+ * attribute contains sub-attributes whose type is irrelevant
+ * (just used to separate the array entries) and each such array
+ * entry has attributes again, the policy for those inner ones
+ * and the corresponding maxtype may be specified.
+ * @NL_ATTR_TYPE_BITFIELD32: &struct nla_bitfield32 attribute
+ */
+enum netlink_attribute_type {
+   NL_A

[PATCH v2 8/8] netlink: limit recursion depth in policy validation

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Now that we have nested policies, we can theoretically
recurse forever parsing attributes if a (sub-)policy
refers back to a higher level one. This is a situation
that has happened in nl80211, and we've avoided it there
by not linking it.

Add some code to netlink parsing to limit recursion depth,
allowing us to safely change nl80211 to actually link the
nested policy, which in turn allows some code cleanups.

Signed-off-by: Johannes Berg 
---
 lib/nlattr.c   | 46 +++---
 net/wireless/nl80211.c | 10 -
 net/wireless/nl80211.h |  2 --
 net/wireless/pmsr.c|  3 +--
 4 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/lib/nlattr.c b/lib/nlattr.c
index 3db7a6984cb0..ef06645de56c 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -44,6 +44,20 @@ static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
[NLA_S64]   = sizeof(s64),
 };
 
+/*
+ * Nested policies might refer back to the original
+ * policy in some cases, and userspace could try to
+ * abuse that and recurse by nesting in the right
+ * ways. Limit recursion to avoid this problem.
+ */
+#define MAX_POLICY_RECURSION_DEPTH 10
+
+static int __nla_validate_parse(const struct nlattr *head, int len, int 
maxtype,
+   const struct nla_policy *policy,
+   unsigned int validate,
+   struct netlink_ext_ack *extack,
+   struct nlattr **tb, unsigned int depth);
+
 static int validate_nla_bitfield32(const struct nlattr *nla,
   const u32 valid_flags_mask)
 {
@@ -70,7 +84,7 @@ static int validate_nla_bitfield32(const struct nlattr *nla,
 static int nla_validate_array(const struct nlattr *head, int len, int maxtype,
  const struct nla_policy *policy,
  struct netlink_ext_ack *extack,
- unsigned int validate)
+ unsigned int validate, unsigned int depth)
 {
const struct nlattr *entry;
int rem;
@@ -87,8 +101,9 @@ static int nla_validate_array(const struct nlattr *head, int 
len, int maxtype,
return -ERANGE;
}
 
-   ret = __nla_validate(nla_data(entry), nla_len(entry),
-maxtype, policy, validate, extack);
+   ret = __nla_validate_parse(nla_data(entry), nla_len(entry),
+  maxtype, policy, validate, extack,
+  NULL, depth + 1);
if (ret < 0)
return ret;
}
@@ -280,7 +295,7 @@ static int nla_validate_int_range(const struct nla_policy 
*pt,
 
 static int validate_nla(const struct nlattr *nla, int maxtype,
const struct nla_policy *policy, unsigned int validate,
-   struct netlink_ext_ack *extack)
+   struct netlink_ext_ack *extack, unsigned int depth)
 {
u16 strict_start_type = policy[0].strict_start_type;
const struct nla_policy *pt;
@@ -375,9 +390,10 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
if (attrlen < NLA_HDRLEN)
goto out_err;
if (pt->nested_policy) {
-   err = __nla_validate(nla_data(nla), nla_len(nla), 
pt->len,
-pt->nested_policy, validate,
-extack);
+   err = __nla_validate_parse(nla_data(nla), nla_len(nla),
+  pt->len, pt->nested_policy,
+  validate, extack, NULL,
+  depth + 1);
if (err < 0) {
/*
 * return directly to preserve the inner
@@ -400,7 +416,7 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
 
err = nla_validate_array(nla_data(nla), nla_len(nla),
 pt->len, pt->nested_policy,
-extack, validate);
+extack, validate, depth);
if (err < 0) {
/*
 * return directly to preserve the inner
@@ -472,11 +488,17 @@ static int __nla_validate_parse(const struct nlattr 
*head, int len, int maxtype,
const struct nla_policy *policy,
unsigned int validate,
struct netlink_ext_ack *extack,
-   struct nlattr **tb)
+   struct nlattr **tb, unsigned int depth)
 {
  

[PATCH v2 3/8] netlink: extend policy range validation

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Using a pointer to a struct indicating the min/max values,
extend the ability to do range validation for arbitrary
values. Small values in the s16 range can be kept in the
policy directly.
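
For example (made-up attribute, just to illustrate the intended use of the
new macro):

	static struct netlink_range_validation my_len_range = {
		.min = 1,
		.max = 1ULL << 40,	/* doesn't fit into the s16 min/max fields */
	};

	static const struct nla_policy my_policy[MY_ATTR_MAX + 1] = {
		[MY_ATTR_LEN] = NLA_POLICY_FULL_RANGE(NLA_U64, &my_len_range),
	};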

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h |  45 +
 lib/nlattr.c  | 112 ++
 2 files changed, 136 insertions(+), 21 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 0dd4546fb68c..2b91a15803b0 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -189,11 +189,20 @@ enum {
 
 #define NLA_TYPE_MAX (__NLA_TYPE_MAX - 1)
 
+struct netlink_range_validation {
+   u64 min, max;
+};
+
+struct netlink_range_validation_signed {
+   s64 min, max;
+};
+
 enum nla_policy_validation {
NLA_VALIDATE_NONE,
NLA_VALIDATE_RANGE,
NLA_VALIDATE_MIN,
NLA_VALIDATE_MAX,
+   NLA_VALIDATE_RANGE_PTR,
NLA_VALIDATE_FUNCTION,
 };
 
@@ -271,6 +280,22 @@ enum nla_policy_validation {
  * of s16 - do that as usual in the code instead.
  * Use the NLA_POLICY_MIN(), NLA_POLICY_MAX() and
  * NLA_POLICY_RANGE() macros.
+ *NLA_U8,
+ *NLA_U16,
+ *NLA_U32,
+ *NLA_U64  If the validation_type field instead is set to
+ * NLA_VALIDATE_RANGE_PTR, `range' must be a pointer
+ * to a struct netlink_range_validation that indicates
+ * the min/max values.
+ * Use NLA_POLICY_FULL_RANGE().
+ *NLA_S8,
+ *NLA_S16,
+ *NLA_S32,
+ *NLA_S64  If the validation_type field instead is set to
+ * NLA_VALIDATE_RANGE_PTR, `range_signed' must be a
+ * pointer to a struct netlink_range_validation_signed
+ * that indicates the min/max values.
+ * Use NLA_POLICY_FULL_RANGE_SIGNED().
  *All otherUnused - but note that it's a union
  *
  * Meaning of `validate' field, use via NLA_POLICY_VALIDATE_FN:
@@ -299,6 +324,8 @@ struct nla_policy {
const u32 bitfield32_valid;
const char *reject_message;
const struct nla_policy *nested_policy;
+   struct netlink_range_validation *range;
+   struct netlink_range_validation_signed *range_signed;
struct {
s16 min, max;
};
@@ -345,6 +372,12 @@ struct nla_policy {
{ .type = NLA_BITFIELD32, .bitfield32_valid = valid }
 
 #define __NLA_ENSURE(condition) BUILD_BUG_ON_ZERO(!(condition))
+#define NLA_ENSURE_UINT_TYPE(tp)   \
+   (__NLA_ENSURE(tp == NLA_U8 || tp == NLA_U16 ||  \
+ tp == NLA_U32 || tp == NLA_U64) + tp)
+#define NLA_ENSURE_SINT_TYPE(tp)   \
+   (__NLA_ENSURE(tp == NLA_S8 || tp == NLA_S16  || \
+ tp == NLA_S32 || tp == NLA_S64) + tp)
 #define NLA_ENSURE_INT_TYPE(tp)\
(__NLA_ENSURE(tp == NLA_S8 || tp == NLA_U8 ||   \
  tp == NLA_S16 || tp == NLA_U16 || \
@@ -363,6 +396,18 @@ struct nla_policy {
.max = _max \
 }
 
+#define NLA_POLICY_FULL_RANGE(tp, _range) {\
+   .type = NLA_ENSURE_UINT_TYPE(tp),   \
+   .validation_type = NLA_VALIDATE_RANGE_PTR,  \
+   .range = _range,\
+}
+
+#define NLA_POLICY_FULL_RANGE_SIGNED(tp, _range) { \
+   .type = NLA_ENSURE_SINT_TYPE(tp),   \
+   .validation_type = NLA_VALIDATE_RANGE_PTR,  \
+   .range_signed = _range, \
+}
+
 #define NLA_POLICY_MIN(tp, _min) { \
.type = NLA_ENSURE_INT_TYPE(tp),\
.validation_type = NLA_VALIDATE_MIN,\
diff --git a/lib/nlattr.c b/lib/nlattr.c
index c546db7c72dd..b549b290d3fa 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -96,17 +96,33 @@ static int nla_validate_array(const struct nlattr *head, 
int len, int maxtype,
return 0;
 }
 
-static int nla_validate_int_range(const struct nla_policy *pt,
- const struct nlattr *nla,
- struct netlink_ext_ack *extack)
+static int nla_validate_int_range_unsigned(const struct nla_policy *pt,
+  const struct nlattr *nla,
+  struct netlink_ext_ack *extack)
 {
-   bool validate_min, validate_max;
-   s64 value;
+   struct netlink_range_validation _range = {
+   .min = 0,
+   .max = U64_MAX,
+   }, *range = &_range;
+   u64 value;
 
-   validate_min = pt->validation_type == NLA_VALIDATE_RANGE ||
-  pt->validation_type =

[PATCH v2 4/8] netlink: allow NLA_MSECS to have range validation

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Since NLA_MSECS is really equivalent to NLA_U64, allow
it to have range validation as well.
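
For instance, something like this becomes possible (illustration only, the
attribute is made up):

	static struct netlink_range_validation timeout_range = {
		.min = 1,
		.max = 120 * MSEC_PER_SEC,	/* at most two minutes */
	};

	[MY_ATTR_TIMEOUT] = NLA_POLICY_FULL_RANGE(NLA_MSECS, &timeout_range),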

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h | 6 --
 lib/nlattr.c  | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 2b91a15803b0..2b035bf8daf6 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -374,7 +374,8 @@ struct nla_policy {
 #define __NLA_ENSURE(condition) BUILD_BUG_ON_ZERO(!(condition))
 #define NLA_ENSURE_UINT_TYPE(tp)   \
(__NLA_ENSURE(tp == NLA_U8 || tp == NLA_U16 ||  \
- tp == NLA_U32 || tp == NLA_U64) + tp)
+ tp == NLA_U32 || tp == NLA_U64 || \
+ tp == NLA_MSECS) + tp)
 #define NLA_ENSURE_SINT_TYPE(tp)   \
(__NLA_ENSURE(tp == NLA_S8 || tp == NLA_S16  || \
  tp == NLA_S32 || tp == NLA_S64) + tp)
@@ -382,7 +383,8 @@ struct nla_policy {
(__NLA_ENSURE(tp == NLA_S8 || tp == NLA_U8 ||   \
  tp == NLA_S16 || tp == NLA_U16 || \
  tp == NLA_S32 || tp == NLA_U32 || \
- tp == NLA_S64 || tp == NLA_U64) + tp)
+ tp == NLA_S64 || tp == NLA_U64 || \
+ tp == NLA_MSECS) + tp)
 #define NLA_ENSURE_NO_VALIDATION_PTR(tp)   \
(__NLA_ENSURE(tp != NLA_BITFIELD32 &&   \
  tp != NLA_REJECT &&   \
diff --git a/lib/nlattr.c b/lib/nlattr.c
index b549b290d3fa..c8789de96046 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -135,6 +135,7 @@ static int nla_validate_int_range_unsigned(const struct 
nla_policy *pt,
value = nla_get_u32(nla);
break;
case NLA_U64:
+   case NLA_MSECS:
value = nla_get_u64(nla);
break;
default:
@@ -211,6 +212,7 @@ static int nla_validate_int_range(const struct nla_policy 
*pt,
case NLA_U16:
case NLA_U32:
case NLA_U64:
+   case NLA_MSECS:
return nla_validate_int_range_unsigned(pt, nla, extack);
case NLA_S8:
case NLA_S16:
-- 
2.17.2



[PATCH v2 5/8] netlink: remove NLA_EXACT_LEN_WARN

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Use a validation type instead, so we can later expose
the NLA_* values to userspace for policy descriptions.

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h | 15 ---
 lib/nlattr.c  | 16 ++--
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 2b035bf8daf6..3c3bbd2ae2dc 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -182,7 +182,6 @@ enum {
NLA_BITFIELD32,
NLA_REJECT,
NLA_EXACT_LEN,
-   NLA_EXACT_LEN_WARN,
NLA_MIN_LEN,
__NLA_TYPE_MAX,
 };
@@ -204,6 +203,7 @@ enum nla_policy_validation {
NLA_VALIDATE_MAX,
NLA_VALIDATE_RANGE_PTR,
NLA_VALIDATE_FUNCTION,
+   NLA_VALIDATE_WARN_TOO_LONG,
 };
 
 /**
@@ -237,10 +237,10 @@ enum nla_policy_validation {
  * just like "All other"
  *NLA_BITFIELD32   Unused
  *NLA_REJECT   Unused
- *NLA_EXACT_LENAttribute must have exactly this length, otherwise
- * it is rejected.
- *NLA_EXACT_LEN_WARN   Attribute should have exactly this length, a warning
- * is logged if it is longer, shorter is rejected.
+ *NLA_EXACT_LENAttribute should have exactly this length, otherwise
+ * it is rejected or warned about, the latter happening
+ * if and only if the `validation_type' is set to
+ * NLA_VALIDATE_WARN_TOO_LONG.
  *NLA_MIN_LEN  Minimum length of attribute payload
  *All otherMinimum length of attribute payload
  *
@@ -353,8 +353,9 @@ struct nla_policy {
 };
 
 #define NLA_POLICY_EXACT_LEN(_len) { .type = NLA_EXACT_LEN, .len = _len }
-#define NLA_POLICY_EXACT_LEN_WARN(_len){ .type = NLA_EXACT_LEN_WARN, \
- .len = _len }
+#define NLA_POLICY_EXACT_LEN_WARN(_len) \
+   { .type = NLA_EXACT_LEN, .len = _len, \
+ .validation_type = NLA_VALIDATE_WARN_TOO_LONG, }
 #define NLA_POLICY_MIN_LEN(_len)   { .type = NLA_MIN_LEN, .len = _len }
 
 #define NLA_POLICY_ETH_ADDRNLA_POLICY_EXACT_LEN(ETH_ALEN)
diff --git a/lib/nlattr.c b/lib/nlattr.c
index c8789de96046..05761d2a74cc 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -245,7 +245,9 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
BUG_ON(pt->type > NLA_TYPE_MAX);
 
if ((nla_attr_len[pt->type] && attrlen != nla_attr_len[pt->type]) ||
-   (pt->type == NLA_EXACT_LEN_WARN && attrlen != pt->len)) {
+   (pt->type == NLA_EXACT_LEN &&
+pt->validation_type == NLA_VALIDATE_WARN_TOO_LONG &&
+attrlen != pt->len)) {
pr_warn_ratelimited("netlink: '%s': attribute type %d has an 
invalid length.\n",
current->comm, type);
if (validate & NL_VALIDATE_STRICT_ATTRS) {
@@ -256,11 +258,6 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
}
 
switch (pt->type) {
-   case NLA_EXACT_LEN:
-   if (attrlen != pt->len)
-   goto out_err;
-   break;
-
case NLA_REJECT:
if (extack && pt->reject_message) {
NL_SET_BAD_ATTR(extack, nla);
@@ -373,6 +370,13 @@ static int validate_nla(const struct nlattr *nla, int 
maxtype,
goto out_err;
break;
 
+   case NLA_EXACT_LEN:
+   if (pt->validation_type != NLA_VALIDATE_WARN_TOO_LONG) {
+   if (attrlen != pt->len)
+   goto out_err;
+   break;
+   }
+   /* fall through */
default:
if (pt->len)
minlen = pt->len;
-- 
2.17.2



[PATCH v2 0/8] netlink policy export and recursive validation

2019-05-03 Thread Johannes Berg
Here's (finally, sorry) the respin with the range/range_signed assignment
fixed up.

I've now included the validation recursion protection so it's clear that
it applies on top of the other patches only.

johannes




[PATCH v2 1/8] nl80211: fix NL80211_ATTR_FTM_RESPONDER policy

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

The nested policy here should be established using the
NLA_POLICY_NESTED() macro so the length is properly
filled in.

Fixes: 81e54d08d9d8 ("cfg80211: support FTM responder configuration/statistics")
Signed-off-by: Johannes Berg 
---
 net/wireless/nl80211.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index fffe4b371e23..f40a004ec6f2 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -538,10 +538,8 @@ const struct nla_policy nl80211_policy[NUM_NL80211_ATTR] = 
{
[NL80211_ATTR_HE_CAPABILITY] = { .type = NLA_BINARY,
 .len = NL80211_HE_MAX_CAPABILITY_LEN },
 
-   [NL80211_ATTR_FTM_RESPONDER] = {
-   .type = NLA_NESTED,
-   .validation_data = nl80211_ftm_responder_policy,
-   },
+   [NL80211_ATTR_FTM_RESPONDER] =
+   NLA_POLICY_NESTED(nl80211_ftm_responder_policy),
[NL80211_ATTR_TIMEOUT] = NLA_POLICY_MIN(NLA_U32, 1),
[NL80211_ATTR_PEER_MEASUREMENTS] =
NLA_POLICY_NESTED(nl80211_pmsr_attr_policy),
-- 
2.17.2



[PATCH v2 6/8] netlink: factor out policy range helpers

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Add helpers to get the policy's signed/unsigned range
validation data.
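
The helpers return the effective bounds regardless of which validation type
the policy entry uses, roughly like this (sketch):

	struct netlink_range_validation range;

	/* works for NLA_VALIDATE_RANGE, _RANGE_PTR, _MIN, _MAX or none */
	nla_get_range_unsigned(&policy[type], &range);
	/* range.min / range.max now hold the limits to enforce or report */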

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h |  5 +++
 lib/nlattr.c  | 95 +--
 2 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 3c3bbd2ae2dc..c2b4bc819784 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1895,4 +1895,9 @@ static inline bool nla_is_last(const struct nlattr *nla, 
int rem)
return nla->nla_len == rem;
 }
 
+void nla_get_range_unsigned(const struct nla_policy *pt,
+   struct netlink_range_validation *range);
+void nla_get_range_signed(const struct nla_policy *pt,
+ struct netlink_range_validation_signed *range);
+
 #endif
diff --git a/lib/nlattr.c b/lib/nlattr.c
index 05761d2a74cc..3db7a6984cb0 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -96,25 +96,39 @@ static int nla_validate_array(const struct nlattr *head, 
int len, int maxtype,
return 0;
 }
 
-static int nla_validate_int_range_unsigned(const struct nla_policy *pt,
-  const struct nlattr *nla,
-  struct netlink_ext_ack *extack)
+void nla_get_range_unsigned(const struct nla_policy *pt,
+   struct netlink_range_validation *range)
 {
-   struct netlink_range_validation _range = {
-   .min = 0,
-   .max = U64_MAX,
-   }, *range = &_range;
-   u64 value;
-
WARN_ON_ONCE(pt->min < 0 || pt->max < 0);
 
+   range->min = 0;
+
+   switch (pt->type) {
+   case NLA_U8:
+   range->max = U8_MAX;
+   break;
+   case NLA_U16:
+   range->max = U16_MAX;
+   break;
+   case NLA_U32:
+   range->max = U32_MAX;
+   break;
+   case NLA_U64:
+   case NLA_MSECS:
+   range->max = U64_MAX;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   return;
+   }
+
switch (pt->validation_type) {
case NLA_VALIDATE_RANGE:
range->min = pt->min;
range->max = pt->max;
break;
case NLA_VALIDATE_RANGE_PTR:
-   range = pt->range;
+   *range = *pt->range;
break;
case NLA_VALIDATE_MIN:
range->min = pt->min;
@@ -122,7 +136,17 @@ static int nla_validate_int_range_unsigned(const struct 
nla_policy *pt,
case NLA_VALIDATE_MAX:
range->max = pt->max;
break;
+   default:
+   break;
}
+}
+
+static int nla_validate_int_range_unsigned(const struct nla_policy *pt,
+  const struct nlattr *nla,
+  struct netlink_ext_ack *extack)
+{
+   struct netlink_range_validation range;
+   u64 value;
 
switch (pt->type) {
case NLA_U8:
@@ -142,7 +166,9 @@ static int nla_validate_int_range_unsigned(const struct 
nla_policy *pt,
return -EINVAL;
}
 
-   if (value < range->min || value > range->max) {
+   nla_get_range_unsigned(pt, &range);
+
+   if (value < range.min || value > range.max) {
NL_SET_ERR_MSG_ATTR(extack, nla,
"integer out of range");
return -ERANGE;
@@ -151,15 +177,30 @@ static int nla_validate_int_range_unsigned(const struct 
nla_policy *pt,
return 0;
 }
 
-static int nla_validate_int_range_signed(const struct nla_policy *pt,
-const struct nlattr *nla,
-struct netlink_ext_ack *extack)
+void nla_get_range_signed(const struct nla_policy *pt,
+ struct netlink_range_validation_signed *range)
 {
-   struct netlink_range_validation_signed _range = {
-   .min = S64_MIN,
-   .max = S64_MAX,
-   }, *range = &_range;
-   s64 value;
+   switch (pt->type) {
+   case NLA_S8:
+   range->min = S8_MIN;
+   range->max = S8_MAX;
+   break;
+   case NLA_S16:
+   range->min = S16_MIN;
+   range->max = S16_MAX;
+   break;
+   case NLA_S32:
+   range->min = S32_MIN;
+   range->max = S32_MAX;
+   break;
+   case NLA_S64:
+   range->min = S64_MIN;
+   range->max = S64_MAX;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   return;
+   }
 
switch (pt->validation_type) {
case NLA_VALIDATE_RANGE:
@@ -167,7 +208,7 @@ static int nla_validate_int_range_signed(const struct 
nla_policy *pt,
range->max = pt->max;
break;
case NLA_VALIDATE_RANGE_PT

[PATCH v2 2/8] netlink: remove type-unsafe validation_data pointer

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

In the netlink policy, we currently have a void *validation_data
that's pointing to different things:
 * a u32 value for bitfield32,
 * the netlink policy for nested/nested array
 * the string for NLA_REJECT

Remove the pointer and place appropriate type-safe items in the
union instead.

While at it, completely dissolve the pointer for the bitfield32
case and just put the value there directly.
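
As an example of the bitfield32 case (attribute and flag names invented):

	/* before: type-unsafe void pointer to a u32 holding the valid flags */
	[MY_ATTR_FLAGS] = { .type = NLA_BITFIELD32,
			    .validation_data = &my_valid_flags },

	/* after: the valid-flags value lives directly in the policy entry */
	[MY_ATTR_FLAGS] = { .type = NLA_BITFIELD32,
			    .bitfield32_valid = MY_VALID_FLAGS },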

Signed-off-by: Johannes Berg 
---
 include/net/netlink.h | 55 ---
 lib/nlattr.c  | 20 
 net/sched/act_api.c   |  4 +---
 3 files changed, 42 insertions(+), 37 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 679f649748d4..0dd4546fb68c 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -217,7 +217,7 @@ enum nla_policy_validation {
  *NLA_NESTED,
  *NLA_NESTED_ARRAY Length verification is done by checking len of
  * nested header (or empty); len field is used if
- * validation_data is also used, for the max attr
+ * nested_policy is also used, for the max attr
  * number in the nested policy.
  *NLA_U8, NLA_U16,
  *NLA_U32, NLA_U64,
@@ -235,27 +235,25 @@ enum nla_policy_validation {
  *NLA_MIN_LEN  Minimum length of attribute payload
  *All otherMinimum length of attribute payload
  *
- * Meaning of `validation_data' field:
+ * Meaning of validation union:
  *NLA_BITFIELD32   This is a 32-bit bitmap/bitselector attribute and
- * validation data must point to a u32 value of valid
- * flags
- *NLA_REJECT   This attribute is always rejected and validation 
data
+ * `bitfield32_valid' is the u32 value of valid flags
+ *NLA_REJECT   This attribute is always rejected and 
`reject_message'
  * may point to a string to report as the error instead
  * of the generic one in extended ACK.
- *NLA_NESTED   Points to a nested policy to validate, must also set
- * `len' to the max attribute number.
+ *NLA_NESTED   `nested_policy' to a nested policy to validate, must
+ * also set `len' to the max attribute number. Use the
+ * provided NLA_POLICY_NESTED() macro.
  * Note that nla_parse() will validate, but of course 
not
  * parse, the nested sub-policies.
- *NLA_NESTED_ARRAY Points to a nested policy to validate, must also set
- * `len' to the max attribute number. The difference to
- * NLA_NESTED is the structure - NLA_NESTED has the
- * nested attributes directly inside, while an array 
has
- * the nested attributes at another level down and the
- * attributes directly in the nesting don't matter.
- *All otherUnused - but note that it's a union
- *
- * Meaning of `min' and `max' fields, use via NLA_POLICY_MIN, NLA_POLICY_MAX
- * and NLA_POLICY_RANGE:
+ *NLA_NESTED_ARRAY `nested_policy' points to a nested policy to 
validate,
+ * must also set `len' to the max attribute number. Use
+ * the provided NLA_POLICY_NESTED_ARRAY() macro.
+ * The difference to NLA_NESTED is the structure:
+ * NLA_NESTED has the nested attributes directly inside
+ * while an array has the nested attributes at another
+ * level down and the attribute types directly in the
+ * nesting don't matter.
  *NLA_U8,
  *NLA_U16,
  *NLA_U32,
@@ -263,14 +261,16 @@ enum nla_policy_validation {
  *NLA_S8,
  *NLA_S16,
  *NLA_S32,
- *NLA_S64  These are used depending on the validation_type
- * field, if that is min/max/range then the minimum,
- * maximum and both are used (respectively) to check
+ *NLA_S64  The `min' and `max' fields are used depending on the
+ * validation_type field, if that is min/max/range then
+ * the min, max or both are used (respectively) to 
check
  * the value of the integer attribute.
  * Note that in the interest of code simplicity and
  * struct size both limits are s16, so you cannot
  * enforce a range that doesn't fall within the range
  * of s16 - do that as usual in the code instead.
+ * Use the NLA_POLICY_MIN(), NLA_POLICY_MAX() and
+ * NLA_POLICY_RAN

[PATCH v6 bpf-next 05/17] bpf: verifier: insert BPF_ZEXT according to zext analysis result

2019-05-03 Thread Jiong Wang
After previous patches, verifier has marked those instructions that really
need zero extension on dst_reg.

It is then for all back-ends to decide how to use such information to
eliminate unnecessary zero extension code-gen during JIT compilation.

One approach is:
  1. Verifier insert explicit zero extension for those instructions that
 need zero extension.
  2. All JIT back-ends do NOT generate zero extension for sub-register
 write any more.

The good thing about this approach is that there is no major change to the
JIT back-end interface, and all back-ends could get this optimization.

However, only those back-ends that do not have hardware zero extension
want this optimization. For back-ends like x86_64 and AArch64, there is
hardware support, so zext insertion should be disabled.

This patch introduces a new target hook "bpf_jit_hardware_zext" which
defaults to true, meaning the underlying hardware will do zero extension
implicitly, therefore zext insertion by the verifier will be disabled. Once
a back-end overrides this hook to return false, the verifier will insert
BPF_ZEXT to clear the high 32 bits of definitions when necessary.

Offload targets do not use this native target hook; instead, they can
get the optimization results using bpf_prog_offload_ops.finalize.
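
So a back-end without hardware zero extension would, roughly, just add:

	/* e.g. in a 32-bit arch's JIT: override the __weak default so the
	 * verifier inserts BPF_ZEXT where the analysis says it is needed
	 */
	bool bpf_jit_hardware_zext(void)
	{
		return false;
	}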

Reviewed-by: Jakub Kicinski 
Signed-off-by: Jiong Wang 
---
 include/linux/bpf.h|  1 +
 include/linux/filter.h |  1 +
 kernel/bpf/core.c  |  8 
 kernel/bpf/verifier.c  | 40 
 4 files changed, 50 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11a5fb9..cf3c3f3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -373,6 +373,7 @@ struct bpf_prog_aux {
u32 id;
u32 func_cnt; /* used by non-func prog as the number of func progs */
u32 func_idx; /* 0 for non-func prog, the index in func array for func 
prog */
+   bool verifier_zext; /* Zero extensions has been inserted by verifier. */
bool offload_requested;
struct bpf_prog **func;
void *jit_data; /* JIT specific data. arch dependent */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fb0edad..8750657 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -821,6 +821,7 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
+bool bpf_jit_hardware_zext(void);
 bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ee8703d..9754346 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2095,6 +2095,14 @@ bool __weak bpf_helper_changes_pkt_data(void *func)
return false;
 }
 
+/* Return TRUE is the target hardware of JIT will do zero extension to high 
bits
+ * when writing to low 32-bit of one register. Otherwise, return FALSE.
+ */
+bool __weak bpf_jit_hardware_zext(void)
+{
+   return true;
+}
+
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b43e8a2..999da02 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7648,6 +7648,37 @@ static int opt_remove_nops(struct bpf_verifier_env *env)
return 0;
 }
 
+static int opt_subreg_zext_lo32(struct bpf_verifier_env *env)
+{
+   struct bpf_insn_aux_data *aux = env->insn_aux_data;
+   struct bpf_insn *insns = env->prog->insnsi;
+   int i, delta = 0, len = env->prog->len;
+   struct bpf_insn zext_patch[2];
+   struct bpf_prog *new_prog;
+
+   zext_patch[1] = BPF_ALU32_IMM(BPF_ZEXT, 0, 0);
+   for (i = 0; i < len; i++) {
+   int adj_idx = i + delta;
+   struct bpf_insn insn;
+
+   if (!aux[adj_idx].zext_dst)
+   continue;
+
+   insn = insns[adj_idx];
+   zext_patch[0] = insn;
+   zext_patch[1].dst_reg = insn.dst_reg;
+   new_prog = bpf_patch_insn_data(env, adj_idx, zext_patch, 2);
+   if (!new_prog)
+   return -ENOMEM;
+   env->prog = new_prog;
+   insns = new_prog->insnsi;
+   aux = env->insn_aux_data;
+   delta += 2;
+   }
+
+   return 0;
+}
+
 /* convert load instructions that access fields of a context type into a
  * sequence of instructions that access fields of the underlying structure:
  * struct __sk_buff-> struct sk_buff
@@ -8499,6 +8530,15 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr,
if (ret == 0)
ret = fixup_bpf_calls(env);
 
+   /* do 32-bit optimization after insn patching has done so those patched
+* insns could be handled correctly.
+*/
+   if (ret == 0 && !bpf_jit_hardware_zext() &&
+   !bpf_pro

[PATCH v6 bpf-next 02/17] bpf: verifier: mark verified-insn with sub-register zext flag

2019-05-03 Thread Jiong Wang
eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen.

The x86-64 and arm64 ISAs have the same semantics, so the corresponding JIT
back-ends don't need to do extra work. However, 32-bit arches (arm, nfp
etc.) and some other 64-bit arches (powerpc, sparc etc.) need an explicit
zero extension sequence to meet this semantic.

This is important, because for code like the following:

  u64_value = (u64) u32_value
  ... other uses of u64_value

the compiler could exploit the semantics described above and omit the zero
extension when extending u32_value to u64_value. Hardware, runtime, or BPF
JIT back-ends are then responsible for guaranteeing it. Some benchmarks show
that ~40% of total insns are sub-register writes, meaning ~40% extra
code-gen (it could go even higher for some arches which require two shifts
for zero extension) because the JIT back-end needs to do extra code-gen for
all such instructions.

However, this is not always necessary in case u32_value is never cast into
a u64, which is quite common in real-life programs. So it would be really
good if we could identify those places where such a type cast happens, and
only do zero extension for them, not for the others. This could save a lot
of BPF code-gen.
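
As a rough illustration of what the analysis is after (written with the
insn macros, not actual verifier output):

	BPF_MOV32_IMM(BPF_REG_1, 5),			/* sub-register def of r1 */
	BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),	/* 64-bit read of r1: the
							 * def above needs zext */

	BPF_MOV32_IMM(BPF_REG_2, 5),			/* sub-register def of r2 */
	BPF_ALU32_REG(BPF_ADD, BPF_REG_0, BPF_REG_2),	/* only low 32 bits of r2
							 * read: no zext needed */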

Algo:
 - Split read flags into READ32 and READ64.

 - Record indices of instructions that do sub-register def (write). And
   these indices need to stay with reg state so path pruning and bpf
   to bpf function call could be handled properly.

   These indices are kept up to date while doing insn walk.

 - A full register read on an active sub-register def marks the def insn as
   needing zero extension on dst register.

 - A new sub-register write overrides the old one.

   A new full register write makes the register free of zero extension on
   dst register.

 - When propagating read64 during path pruning, also marks def insns whose
   defs are hanging active sub-register.

Reviewed-by: Jakub Kicinski 
Signed-off-by: Jiong Wang 
---
 include/linux/bpf_verifier.h |  14 ++-
 kernel/bpf/verifier.c| 213 ---
 2 files changed, 211 insertions(+), 16 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1305ccb..6a0b12c 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -36,9 +36,11 @@
  */
 enum bpf_reg_liveness {
REG_LIVE_NONE = 0, /* reg hasn't been read or written this branch */
-   REG_LIVE_READ, /* reg was read, so we're sensitive to initial value */
-   REG_LIVE_WRITTEN, /* reg was written first, screening off later reads */
-   REG_LIVE_DONE = 4, /* liveness won't be updating this register anymore 
*/
+   REG_LIVE_READ32 = 0x1, /* reg was read, so we're sensitive to initial 
value */
+   REG_LIVE_READ64 = 0x2, /* likewise, but full 64-bit content matters */
+   REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
+   REG_LIVE_WRITTEN = 0x4, /* reg was written first, screening off later 
reads */
+   REG_LIVE_DONE = 0x8, /* liveness won't be updating this register 
anymore */
 };
 
 struct bpf_reg_state {
@@ -131,6 +133,11 @@ struct bpf_reg_state {
 * pointing to bpf_func_state.
 */
u32 frameno;
+   /* Tracks subreg definition. The stored value is the insn_idx of the
+* writing insn. This is safe because subreg_def is used before any insn
+* patching which only happens after main verification finished.
+*/
+   s32 subreg_def;
enum bpf_reg_liveness live;
 };
 
@@ -232,6 +239,7 @@ struct bpf_insn_aux_data {
int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
int sanitize_stack_off; /* stack slot to be cleared */
bool seen; /* this insn was processed by the verifier */
+   bool zext_dst; /* this insn zero extend dst reg */
u8 alu_state; /* used in combination with alu_limit */
unsigned int orig_idx; /* original instruction index */
 };
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 07ab563..43ea665 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -981,6 +981,7 @@ static void mark_reg_not_init(struct bpf_verifier_env *env,
__mark_reg_not_init(regs + regno);
 }
 
+#define DEF_NOT_SUBREG (-1)
 static void init_reg_state(struct bpf_verifier_env *env,
   struct bpf_func_state *state)
 {
@@ -991,6 +992,7 @@ static void init_reg_state(struct bpf_verifier_env *env,
mark_reg_not_init(env, regs, i);
regs[i].live = REG_LIVE_NONE;
regs[i].parent = NULL;
+   regs[i].subreg_def = DEF_NOT_SUBREG;
}
 
/* frame pointer */
@@ -1136,7 +1138,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
  */
 static int mark_reg_read(struct bpf_verifier_env *env,
 

[PATCH v6 bpf-next 06/17] bpf: introduce new bpf prog load flags "BPF_F_TEST_RND_HI32"

2019-05-03 Thread Jiong Wang
x86_64 and AArch64 are perhaps the two arches running the bpf testsuite
most frequently, however the zero extension insertion pass is not enabled
for them because of their hardware support.

It is critical to guarantee the pass's correctness as it is supposed to be
enabled by default for a couple of other arches, for example PowerPC,
SPARC, arm, NFP etc. Therefore, it would be very useful if there were a way
to test this pass on, for example, x86_64.

The test methodology employed by this set is "poisoning" useless bits. The
high 32 bits of a definition are randomized if they are identified as not
used by any later instruction. Such randomization is only enabled in testing
mode, which is gated by the new bpf prog load flag "BPF_F_TEST_RND_HI32".
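
From userspace this just means setting the new flag at program load time,
roughly like this (sketch, error handling omitted):

	union bpf_attr attr = {};

	attr.prog_type  = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns      = (__u64)(unsigned long)insns;
	attr.insn_cnt   = insn_cnt;
	attr.license    = (__u64)(unsigned long)"GPL";
	attr.prog_flags = BPF_F_TEST_RND_HI32;	/* enable hi32 poisoning */

	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));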

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jiong Wang 
---
 include/uapi/linux/bpf.h   | 18 ++
 kernel/bpf/syscall.c   |  4 +++-
 tools/include/uapi/linux/bpf.h | 18 ++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 22ccdf4..1bf32c3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -263,6 +263,24 @@ enum bpf_attach_type {
  */
 #define BPF_F_ANY_ALIGNMENT(1U << 1)
 
+/* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose.
+ * Verifier does sub-register def/use analysis and identifies instructions 
whose
+ * def only matters for low 32-bit, high 32-bit is never referenced later
+ * through implicit zero extension. Therefore verifier notifies JIT back-ends
+ * that it is safe to ignore clearing high 32-bit for these instructions. This
+ * saves some back-ends a lot of code-gen. However such optimization is not
+ * necessary on some arches, for example x86_64, arm64 etc, whose JIT back-ends
+ * hence hasn't used verifier's analysis result. But, we really want to have a
+ * way to be able to verify the correctness of the described optimization on
+ * x86_64 on which testsuites are frequently exercised.
+ *
+ * So, this flag is introduced. Once it is set, verifier will randomize high
+ * 32-bit for those instructions who has been identified as safe to ignore 
them.
+ * Then, if verifier is not doing correct analysis, such randomization will
+ * regress tests to expose bugs.
+ */
+#define BPF_F_TEST_RND_HI32(1U << 2)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * two extensions:
  *
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ad3ccf8..ec1b42c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1601,7 +1601,9 @@ static int bpf_prog_load(union bpf_attr *attr, union 
bpf_attr __user *uattr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
 
-   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
+   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT |
+BPF_F_ANY_ALIGNMENT |
+BPF_F_TEST_RND_HI32))
return -EINVAL;
 
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 22ccdf4..1bf32c3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -263,6 +263,24 @@ enum bpf_attach_type {
  */
 #define BPF_F_ANY_ALIGNMENT(1U << 1)
 
+/* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose.
+ * Verifier does sub-register def/use analysis and identifies instructions 
whose
+ * def only matters for low 32-bit, high 32-bit is never referenced later
+ * through implicit zero extension. Therefore verifier notifies JIT back-ends
+ * that it is safe to ignore clearing high 32-bit for these instructions. This
+ * saves some back-ends a lot of code-gen. However such optimization is not
+ * necessary on some arches, for example x86_64, arm64 etc, whose JIT back-ends
+ * hence hasn't used verifier's analysis result. But, we really want to have a
+ * way to be able to verify the correctness of the described optimization on
+ * x86_64 on which testsuites are frequently exercised.
+ *
+ * So, this flag is introduced. Once it is set, verifier will randomize high
+ * 32-bit for those instructions who has been identified as safe to ignore 
them.
+ * Then, if verifier is not doing correct analysis, such randomization will
+ * regress tests to expose bugs.
+ */
+#define BPF_F_TEST_RND_HI32(1U << 2)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * two extensions:
  *
-- 
2.7.4



[PATCH v6 bpf-next 00/17] bpf: eliminate zero extensions for sub-register writes

2019-05-03 Thread Jiong Wang
v6:
  - Fixed s390 kbuild test robot error. (kbuild)
  - Make comment style in backends patches more consistent.

v5:
  - Adjusted several test_verifier helpers to make them works on hosts
w and w/o hardware zext. (Naveen)
  - Make sure zext flag not set when verifier by-passed, for example,
libtest_bpf.ko. (Naveen)
  - Conservatively mark bpf main return value as 64-bit. (Alexei)
  - Make sure read flag is either READ64 or READ32, not the mix of both.
(Alexei)
  - Merged patch 1 and 2 in v4. (Alexei)
  - Fixed kbuild test robot warning on NFP. (kbuild)
  - Proposed new BPF_ZEXT insn to have optimal code-gen for various JIT
back-ends.
  - Conservatively set zext flags for patched insns.
  - Fixed return value zext for helper function calls.
  - Also adjusted test_verifier scalability unit test to avoid triggering
too many insn patches which would hang the computer.
  - re-tested on x86 host with llvm 9.0, no regression on test_verifier,
test_progs, test_progs_32.
  - re-tested offload target (nfp), no regression on local testsuite.

v4:
  - added the two missing fixes which addresses two Jakub's reviewes in v3.
  - rebase on top of bpf-next.

v3:
  - remove redundant check in "propagate_liveness_reg". (Jakub)
  - add extra check in "mark_reg_read" to prune more search. (Jakub)
  - re-implemented "prog_flags" passing mechanism, removed use of
global switch inside libbpf.
  - enabled high 32-bit randomization beyond "test_verifier" and
"test_progs". Now it should have been enabled for all possible
tests. Re-run all tests, haven't noticed regression.
  - remove RFC tag.

v2:
  - rebased on top of bpf-next master.
  - added comments for what is sub-register def index. (Edward, Alexei)
  - removed patch 1 which turns bit mask from enum to macro. (Alexei)
  - removed sysctl/bpf_jit_32bit_opt. (Alexei)
  - merged sub-register def insn index into reg state. (Alexei)
  - change test methodology (Alexei):
  + instead of simple unit tests on x86_64, for which this optimization
isn't enabled due to hardware support, poison the high
32 bits of those defs identified as safe to do so. This lets
the correctness of this patch set be checked when the daily bpf selftests
run, which deliver a very stressful test on host machines like x86_64.
  + hi32 poisoning is gated by a new BPF_F_TEST_RND_HI32 prog flags.
  + BPF_F_TEST_RND_HI32 is enabled for all tests of "test_progs" and
"test_verifier", the latter needs minor tweak on two unit tests,
please see the patch for the change.
  + introduced a new global variable "libbpf_test_mode" into libbpf.
once it is set to true, it will set BPF_F_TEST_RND_HI32 for all
later PROG_LOAD syscalls; the goal is to ease enabling hi32
poisoning on the existing testsuite.
we could also introduce new APIs, for example "bpf_prog_test_load",
then use -Dbpf_prog_load=bpf_prog_test_load to migrate tests under
test_progs, but there are several load APIs, and such a new API needs
some changes to structures like "struct bpf_prog_load_attr".
  + removed old unit tests. They are based on insn scanning and require
quite a few generic test_verifier code changes. Given that hi32
randomization offers good test coverage, the unit tests don't add much
extra test value.
  - enhanced register width check ("is_reg64") when record sub-register
write, now, it returns more accurate width.
  - Re-run all tests under "test_progs" and "test_verifier" on x86_64, no
regression. Fixed a couple of bugs exposed:
  1. ctx field size transformation was not taken into account.
  2. insn patching could cause loss of the original aux data, which is
 important for ctx field conversion.
  3. return value for propagate_liveness was wrong and caused a
 regression in processed insn number.
  4. helper call args weren't handled properly, so path pruning could
 cause 64-bit read info in the pruned path to be lost.
  - Re-run Cilium bpf prog for processed-insn-number benchmarking, no
regression.

v1:
  - Fixed the missing handling of callee-saved registers for bpf-to-bpf
calls; sub-register defs therefore moved to the frame state. (Jakub Kicinski)
  - Removed redundant "cross_reg". (Jakub Kicinski)
  - Various coding styles & grammar fixes. (Jakub Kicinski, Quentin Monnet)

The eBPF ISA specification requires the high 32 bits to be cleared whenever
a low 32-bit sub-register is written. This applies to the destination
register of ALU32/LD_H/B/W etc. JIT back-ends must guarantee this semantic
when doing code-gen.

The x86-64 and arm64 ISAs have the same semantics, so the corresponding JIT
back-ends don't need to do extra work. However, 32-bit arches (arm, nfp,
etc.) and some other 64-bit arches (powerpc, sparc, etc.) need an explicit
zero-extension sequence to meet this semantic.

This is important, because for C code like the following:

  u64_value = (u64) u32_value
  ... other uses of u64_val
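
To make the requirement concrete, here is a minimal illustrative sketch (not
code from this series): on a 32-bit host a 64-bit BPF register can be
modelled as a lo/hi pair, and every ALU32 write must then clear the hi half.

    #include <linux/types.h>

    /* Illustrative model of the required semantic only. */
    struct bpf_reg_pair {
            u32 lo;
            u32 hi;
    };

    static void alu32_add(struct bpf_reg_pair *dst, u32 imm)
    {
            dst->lo += imm; /* the 32-bit operation itself */
            dst->hi = 0;    /* the explicit zero extension a JIT must emit */
    }

The per-arch patches later in this series remove exactly this kind of
clearing when the verifier has already proven it unnecessary.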

[PATCH v6 bpf-next 04/17] bpf: introduce new alu insn BPF_ZEXT for explicit zero extension

2019-05-03 Thread Jiong Wang
This patch introduces a new alu32 insn, BPF_ZEXT, and allocates the unused
opcode 0xe0 to it.

Compared with the other alu32 insns, zero extension of the low 32 bits is
the only semantic of this instruction. It also allows various JIT back-ends
to do optimal zero-extension code-gen.

BPF_ZEXT is supposed to be encoded with BPF_ALU only, and is supposed to be
generated by the later 32-bit optimization pass inside the verifier, and
only for those arches that do not support implicit hardware zero extension.

It is not supposed to be used in user programs directly at the moment.
Therefore, there is no need to recognize it inside the generic verification
code. It just needs to be supported for execution by the interpreter or the
related JIT back-ends.
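
As a hedged sketch of where this insn comes from (mirroring the zext_patch[]
array introduced later in this series), a verifier-generated explicit zero
extension of dst_reg would be built roughly as:

    /* Sketch only: the later verifier pass patches an explicit zero
     * extension in right after a 32-bit def of dst_reg.  "insn" stands for
     * the original instruction being patched; BPF_ALU32_IMM() is the
     * standard insn constructor from include/linux/filter.h. */
    struct bpf_insn patch[2];

    patch[0] = *insn;                                     /* original 32-bit def */
    patch[1] = BPF_ALU32_IMM(BPF_ZEXT, insn->dst_reg, 0); /* dst = (u32) dst     */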

Signed-off-by: Jiong Wang 
---
 Documentation/networking/filter.txt | 10 ++
 include/uapi/linux/bpf.h|  3 +++
 kernel/bpf/core.c   |  4 
 tools/include/uapi/linux/bpf.h  |  3 +++
 4 files changed, 20 insertions(+)

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index 319e5e0..1cb3e42 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -903,6 +903,16 @@ If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], 
BPF_OP(code) is one of:
   BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
   BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
   BPF_END   0xd0  /* eBPF only: endianness conversion */
+  BPF_ZEXT  0xe0  /* eBPF BPF_ALU only: zero-extends low 32-bit */
+
+Compared with BPF_ALU | BPF_MOV, which zero-extends the low 32 bits implicitly,
+BPF_ALU | BPF_ZEXT zero-extends the low 32 bits explicitly. Such zero extension
+is not the main semantic of the former, but it is for the latter. Therefore, a
+JIT optimizer may optimize out the zero extension for the former when concluded
+safe to do so, but should never do so for the latter. The LLVM compiler won't
+generate BPF_ZEXT, and hand-written assembly is not supposed to use it. The
+verifier's 32-bit optimization pass, which removes zero extension semantics
+from the other BPF_ALU instructions, is the only place that generates it.
 
 If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 72336ba..22ccdf4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -32,6 +32,9 @@
 #define BPF_FROM_LEBPF_TO_LE
 #define BPF_FROM_BEBPF_TO_BE
 
+/* zero extend low 32-bit */
+#define BPF_ZEXT   0xe0
+
 /* jmp encodings */
 #define BPF_JNE0x50/* jump != */
 #define BPF_JLT0xa0/* LT is unsigned, '<' */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2792eda..ee8703d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1152,6 +1152,7 @@ EXPORT_SYMBOL_GPL(__bpf_call_base);
INSN_2(ALU, NEG),   \
INSN_3(ALU, END, TO_BE),\
INSN_3(ALU, END, TO_LE),\
+   INSN_2(ALU, ZEXT),  \
/*   Immediate based. */\
INSN_3(ALU, ADD,  K),   \
INSN_3(ALU, SUB,  K),   \
@@ -1352,6 +1353,9 @@ static u64 ___bpf_prog_run(u64 *regs, const struct 
bpf_insn *insn, u64 *stack)
ALU64_NEG:
DST = -DST;
CONT;
+   ALU_ZEXT:
+   DST = (u32) DST;
+   CONT;
ALU_MOV_X:
DST = (u32) SRC;
CONT;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 72336ba..22ccdf4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -32,6 +32,9 @@
 #define BPF_FROM_LEBPF_TO_LE
 #define BPF_FROM_BEBPF_TO_BE
 
+/* zero extend low 32-bit */
+#define BPF_ZEXT   0xe0
+
 /* jmp encodings */
 #define BPF_JNE0x50/* jump != */
 #define BPF_JLT0xa0/* LT is unsigned, '<' */
-- 
2.7.4



[PATCH v6 bpf-next 10/17] selftests: bpf: enable hi32 randomization for all tests

2019-05-03 Thread Jiong Wang
The previous libbpf patch allows the user to specify "prog_flags" for the
bpf program load APIs. To enable high 32-bit randomization for a test, we
need to set BPF_F_TEST_RND_HI32 in "prog_flags".

To enable such randomization for all tests, we need to make sure all places
are passing BPF_F_TEST_RND_HI32. Changing them one by one is not convenient;
also, it would be better if a test could be switched back to "normal"
running mode without code changes.

Given the program load APIs used across bpf selftests are mostly:
  bpf_prog_load:  load from file
  bpf_load_program:   load from raw insns

A test_stub.c is implemented for the bpf selftests; it offers two functions
for testing purposes:

  bpf_prog_test_load
  bpf_test_load_program

They are the same as "bpf_prog_load" and "bpf_load_program", except they
also set BPF_F_TEST_RND_HI32. Given the *_xattr functions are the APIs to
customize any "prog_flags", it makes little sense to put these two
functions into libbpf.

Then, the following CFLAGS are passed to compilations for host programs:
  -Dbpf_prog_load=bpf_prog_test_load
  -Dbpf_load_program=bpf_test_load_program

They migrate the used load APIs to the test version, hence enable high
32-bit randomization for these tests without changing source code.

Besides all these, there are several testcases using "bpf_prog_load_attr"
directly; their call sites are updated to pass BPF_F_TEST_RND_HI32.
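
A hedged sketch of what such a stub could look like (function and field
names follow the libbpf xattr API extended earlier in this series; not
necessarily the exact file as committed):

    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    /* Wrap the stock file loader and force the test flag on. */
    int bpf_prog_test_load(const char *file, enum bpf_prog_type type,
                           struct bpf_object **pobj, int *prog_fd)
    {
            struct bpf_prog_load_attr attr = {
                    .file       = file,
                    .prog_type  = type,
                    .prog_flags = BPF_F_TEST_RND_HI32,
            };

            return bpf_prog_load_xattr(&attr, pobj, prog_fd);
    }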

Signed-off-by: Jiong Wang 
---
 tools/testing/selftests/bpf/Makefile   | 10 +++---
 .../selftests/bpf/prog_tests/bpf_verif_scale.c |  1 +
 tools/testing/selftests/bpf/test_sock_addr.c   |  1 +
 tools/testing/selftests/bpf/test_sock_fields.c |  1 +
 tools/testing/selftests/bpf/test_socket_cookie.c   |  1 +
 tools/testing/selftests/bpf/test_stub.c| 40 ++
 tools/testing/selftests/bpf/test_verifier.c|  2 +-
 7 files changed, 51 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_stub.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 66f2dca..3f2c131 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -15,7 +15,9 @@ LLC   ?= llc
 LLVM_OBJCOPY   ?= llvm-objcopy
 LLVM_READELF   ?= llvm-readelf
 BTF_PAHOLE ?= pahole
-CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) 
$(GENFLAGS) -I../../../include
+CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) 
$(GENFLAGS) -I../../../include \
+ -Dbpf_prog_load=bpf_prog_test_load \
+ -Dbpf_load_program=bpf_test_load_program
 LDLIBS += -lcap -lelf -lrt -lpthread
 
 # Order correspond to 'make run_tests' order
@@ -78,9 +80,9 @@ $(OUTPUT)/test_maps: map_tests/*.c
 
 BPFOBJ := $(OUTPUT)/libbpf.a
 
-$(TEST_GEN_PROGS): $(BPFOBJ)
+$(TEST_GEN_PROGS): test_stub.o $(BPFOBJ)
 
-$(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
+$(TEST_GEN_PROGS_EXTENDED): test_stub.o $(OUTPUT)/libbpf.a
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_skb_cgroup_id_user: cgroup_helpers.c
@@ -176,7 +178,7 @@ $(ALU32_BUILD_DIR)/test_progs_32: test_progs.c 
$(OUTPUT)/libbpf.a\
$(ALU32_BUILD_DIR)/urandom_read
$(CC) $(TEST_PROGS_CFLAGS) $(CFLAGS) \
-o $(ALU32_BUILD_DIR)/test_progs_32 \
-   test_progs.c trace_helpers.c prog_tests/*.c \
+   test_progs.c test_stub.c trace_helpers.c prog_tests/*.c \
$(OUTPUT)/libbpf.a $(LDLIBS)
 
 $(ALU32_BUILD_DIR)/test_progs_32: $(PROG_TESTS_H)
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c 
b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index 23b159d..2623d15 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -22,6 +22,7 @@ static int check_load(const char *file)
attr.file = file;
attr.prog_type = BPF_PROG_TYPE_SCHED_CLS;
attr.log_level = 4;
+   attr.prog_flags = BPF_F_TEST_RND_HI32;
err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
bpf_object__close(obj);
if (err)
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c 
b/tools/testing/selftests/bpf/test_sock_addr.c
index 3f110ea..5d0c4f0 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -745,6 +745,7 @@ static int load_path(const struct sock_addr_test *test, 
const char *path)
attr.file = path;
attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
attr.expected_attach_type = test->expected_attach_type;
+   attr.prog_flags = BPF_F_TEST_RND_HI32;
 
if (bpf_prog_load_xattr(&attr, &obj, &prog_fd)) {
if (test->expected_result != LOAD_REJECT)
diff --git a/tools/testing/selftests/bpf/test_sock_fields.c 
b/tools/testing/selftests/bpf/test_sock_fields.c
index e089477..f0fc103 100644
--- a/tools/testing/selftests/bpf/

[PATCH v6 bpf-next 16/17] riscv: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Acked-by: Björn Töpel 
Signed-off-by: Jiong Wang 
---
 arch/riscv/net/bpf_jit_comp.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/riscv/net/bpf_jit_comp.c b/arch/riscv/net/bpf_jit_comp.c
index 80b12aa..3074c9b 100644
--- a/arch/riscv/net/bpf_jit_comp.c
+++ b/arch/riscv/net/bpf_jit_comp.c
@@ -731,6 +731,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
 {
bool is64 = BPF_CLASS(insn->code) == BPF_ALU64 ||
BPF_CLASS(insn->code) == BPF_JMP;
+   struct bpf_prog_aux *aux = ctx->prog->aux;
int rvoff, i = insn - ctx->prog->insnsi;
u8 rd = -1, rs = -1, code = insn->code;
s16 off = insn->off;
@@ -739,11 +740,15 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
init_regs(&rd, &rs, insn, ctx);
 
switch (code) {
+   /* dst = (u32) dst */
+   case BPF_ALU | BPF_ZEXT:
+   emit_zext_32(rd, ctx);
+   break;
/* dst = src */
case BPF_ALU | BPF_MOV | BPF_X:
case BPF_ALU64 | BPF_MOV | BPF_X:
emit(is64 ? rv_addi(rd, rs, 0) : rv_addiw(rd, rs, 0), ctx);
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
 
@@ -771,19 +776,19 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
case BPF_ALU | BPF_MUL | BPF_X:
case BPF_ALU64 | BPF_MUL | BPF_X:
emit(is64 ? rv_mul(rd, rd, rs) : rv_mulw(rd, rd, rs), ctx);
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_DIV | BPF_X:
case BPF_ALU64 | BPF_DIV | BPF_X:
emit(is64 ? rv_divu(rd, rd, rs) : rv_divuw(rd, rd, rs), ctx);
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_MOD | BPF_X:
case BPF_ALU64 | BPF_MOD | BPF_X:
emit(is64 ? rv_remu(rd, rd, rs) : rv_remuw(rd, rd, rs), ctx);
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_LSH | BPF_X:
@@ -867,7 +872,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
case BPF_ALU | BPF_MOV | BPF_K:
case BPF_ALU64 | BPF_MOV | BPF_K:
emit_imm(rd, imm, ctx);
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
 
@@ -882,7 +887,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
emit(is64 ? rv_add(rd, rd, RV_REG_T1) :
 rv_addw(rd, rd, RV_REG_T1), ctx);
}
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_SUB | BPF_K:
@@ -895,7 +900,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
emit(is64 ? rv_sub(rd, rd, RV_REG_T1) :
 rv_subw(rd, rd, RV_REG_T1), ctx);
}
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_AND | BPF_K:
@@ -906,7 +911,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
emit_imm(RV_REG_T1, imm, ctx);
emit(rv_and(rd, rd, RV_REG_T1), ctx);
}
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_OR | BPF_K:
@@ -917,7 +922,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
emit_imm(RV_REG_T1, imm, ctx);
emit(rv_or(rd, rd, RV_REG_T1), ctx);
}
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_XOR | BPF_K:
@@ -928,7 +933,7 @@ static int emit_insn(const struct bpf_insn *insn, struct 
rv_jit_context *ctx,
emit_imm(RV_REG_T1, imm, ctx);
emit(rv_xor(rd, rd, RV_REG_T1), ctx);
}
-   if (!is64)
+   if (!is64 && !aux->verifier_zext)
emit_zext_32(rd, ctx);
break;
case BPF_ALU | BPF_MUL | BPF_K:
@@ -936,7 +941,7 @@ static int emit_insn(const struct bpf_insn *insn, struct

[PATCH v6 bpf-next 07/17] bpf: verifier: randomize high 32-bit when BPF_F_TEST_RND_HI32 is set

2019-05-03 Thread Jiong Wang
This patch randomizes the high 32 bits of a definition when
BPF_F_TEST_RND_HI32 is set.

It does this once the flag is set, regardless of whether there is hardware
zero-extension support or not, because this is a test feature and we want
to deliver the most stressful test.
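
The patched-in sequence, mirrored from rnd_hi32_patch[] in the diff below,
looks roughly like this (Rd and rnd are placeholders):

    /* For an insn that defines only the low 32 bits of Rd and whose high
     * bits are considered safe to clobber, the verifier appends:
     *
     *   <original 32-bit def of Rd>
     *   BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, rnd)   // AX  = random 32-bit value
     *   BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32)    // AX <<= 32
     *   BPF_ALU64_REG(BPF_OR, Rd, BPF_REG_AX)     // Rd |= AX  (poison hi32)
     */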

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jiong Wang 
---
 kernel/bpf/verifier.c | 69 +++
 1 file changed, 58 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 999da02..31ffbef 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7648,32 +7648,79 @@ static int opt_remove_nops(struct bpf_verifier_env *env)
return 0;
 }
 
-static int opt_subreg_zext_lo32(struct bpf_verifier_env *env)
+static int opt_subreg_zext_lo32_rnd_hi32(struct bpf_verifier_env *env,
+const union bpf_attr *attr)
 {
+   struct bpf_insn *patch, zext_patch[2], rnd_hi32_patch[4];
struct bpf_insn_aux_data *aux = env->insn_aux_data;
+   int i, patch_len, delta = 0, len = env->prog->len;
struct bpf_insn *insns = env->prog->insnsi;
-   int i, delta = 0, len = env->prog->len;
-   struct bpf_insn zext_patch[2];
struct bpf_prog *new_prog;
+   bool rnd_hi32;
+
+   rnd_hi32 = attr->prog_flags & BPF_F_TEST_RND_HI32;
 
zext_patch[1] = BPF_ALU32_IMM(BPF_ZEXT, 0, 0);
+   rnd_hi32_patch[1] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, 0);
+   rnd_hi32_patch[2] = BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32);
+   rnd_hi32_patch[3] = BPF_ALU64_REG(BPF_OR, 0, BPF_REG_AX);
for (i = 0; i < len; i++) {
int adj_idx = i + delta;
struct bpf_insn insn;
 
-   if (!aux[adj_idx].zext_dst)
+   insn = insns[adj_idx];
+   if (!aux[adj_idx].zext_dst) {
+   u8 code, class;
+   u32 imm_rnd;
+
+   if (!rnd_hi32)
+   continue;
+
+   code = insn.code;
+   class = BPF_CLASS(code);
+   if (insn_no_def(&insn))
+   continue;
+
+   /* NOTE: arg "reg" (the fourth one) is only used for
+*   BPF_STX which has been ruled out in above
+*   check, it is safe to pass NULL here.
+*/
+   if (is_reg64(env, &insn, insn.dst_reg, NULL, DST_OP)) {
+   if (class == BPF_LD &&
+   BPF_MODE(code) == BPF_IMM)
+   i++;
+   continue;
+   }
+
+   /* ctx load could be transformed into wider load. */
+   if (class == BPF_LDX &&
+   aux[adj_idx].ptr_type == PTR_TO_CTX)
+   continue;
+
+   imm_rnd = get_random_int();
+   rnd_hi32_patch[0] = insn;
+   rnd_hi32_patch[1].imm = imm_rnd;
+   rnd_hi32_patch[3].dst_reg = insn.dst_reg;
+   patch = rnd_hi32_patch;
+   patch_len = 4;
+   goto apply_patch_buffer;
+   }
+
+   if (bpf_jit_hardware_zext())
continue;
 
-   insn = insns[adj_idx];
zext_patch[0] = insn;
zext_patch[1].dst_reg = insn.dst_reg;
-   new_prog = bpf_patch_insn_data(env, adj_idx, zext_patch, 2);
+   patch = zext_patch;
+   patch_len = 2;
+apply_patch_buffer:
+   new_prog = bpf_patch_insn_data(env, adj_idx, patch, patch_len);
if (!new_prog)
return -ENOMEM;
env->prog = new_prog;
insns = new_prog->insnsi;
aux = env->insn_aux_data;
-   delta += 2;
+   delta += patch_len - 1;
}
 
return 0;
@@ -8533,10 +8580,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr,
/* do 32-bit optimization after insn patching has done so those patched
 * insns could be handled correctly.
 */
-   if (ret == 0 && !bpf_jit_hardware_zext() &&
-   !bpf_prog_is_dev_bound(env->prog->aux)) {
-   ret = opt_subreg_zext_lo32(env);
-   env->prog->aux->verifier_zext = !ret;
+   if (ret == 0 && !bpf_prog_is_dev_bound(env->prog->aux)) {
+   ret = opt_subreg_zext_lo32_rnd_hi32(env, attr);
+   env->prog->aux->verifier_zext =
+   bpf_jit_hardware_zext() ? false : !ret;
}
 
if (ret == 0)
-- 
2.7.4



[PATCH v6 bpf-next 01/17] bpf: verifier: offer more accurate helper function arg and return type

2019-05-03 Thread Jiong Wang
A BPF helper call transfers execution from eBPF insns to native functions,
while the verifier insn walker only walks eBPF insns. So, the verifier can
only know argument and return value types from explicit helper function
prototype descriptions.

For 32-bit optimization, it is important to know whether argument (register
use from eBPF insn) and return value (register define from external
function) is 32-bit or 64-bit, so corresponding registers could be
zero-extended correctly.

Arguments are register uses; we conservatively treat all of them as 64-bit
by default, while the following new bpf_arg_type values are added so we can
start to mark frequently used helper functions with more accurate argument
types.

  ARG_CONST_SIZE32
  ARG_CONST_SIZE32_OR_ZERO
  ARG_ANYTHING32

A few helper functions that show up frequently inside Cilium bpf programs
are updated to use these new types.

Return values are register defs; we need to know their accurate width for
correct zero extension. Given most of the helper functions returning
integers return a 32-bit value, a new RET_INTEGER64 is added to mark those
functions that return a 64-bit value. All related helper functions are
updated.
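
As a concrete shape of such an annotation (mirroring the bpf_ktime_get_ns
change in the diff below):

    /* A helper returning a full 64-bit value is annotated with the new
     * RET_INTEGER64, so the verifier does not treat its return register as
     * a zero-extendable 32-bit def. */
    const struct bpf_func_proto bpf_ktime_get_ns_proto = {
            .func       = bpf_ktime_get_ns,
            .gpl_only   = true,
            .ret_type   = RET_INTEGER64,
    };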

Signed-off-by: Jiong Wang 
---
 include/linux/bpf.h  |  6 +-
 kernel/bpf/core.c|  2 +-
 kernel/bpf/helpers.c | 10 +-
 kernel/bpf/verifier.c| 15 ++-
 kernel/trace/bpf_trace.c |  4 ++--
 net/core/filter.c| 38 +++---
 6 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9a21848..11a5fb9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -198,9 +198,12 @@ enum bpf_arg_type {
 
ARG_CONST_SIZE, /* number of bytes accessed from memory */
ARG_CONST_SIZE_OR_ZERO, /* number of bytes accessed from memory or 0 */
+   ARG_CONST_SIZE32,   /* Likewise, but size fits into 32-bit */
+   ARG_CONST_SIZE32_OR_ZERO,   /* Ditto */
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+   ARG_ANYTHING32, /* Likewise, but it is a 32-bit argument */
ARG_PTR_TO_SPIN_LOCK,   /* pointer to bpf_spin_lock */
ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */
ARG_PTR_TO_INT, /* pointer to int */
@@ -210,7 +213,8 @@ enum bpf_arg_type {
 
 /* type of values returned from helper functions */
 enum bpf_return_type {
-   RET_INTEGER,/* function returns integer */
+   RET_INTEGER,/* function returns 32-bit integer */
+   RET_INTEGER64,  /* function returns 64-bit integer */
RET_VOID,   /* function doesn't return anything */
RET_PTR_TO_MAP_VALUE,   /* returns a pointer to map elem value 
*/
RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value 
or NULL */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ace8c22..2792eda 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2067,7 +2067,7 @@ const struct bpf_func_proto bpf_tail_call_proto = {
.ret_type   = RET_VOID,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_CONST_MAP_PTR,
-   .arg3_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_ANYTHING32,
 };
 
 /* Stub for JITs that only support cBPF. eBPF programs are interpreted.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4266ffd..60f6e31 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -157,7 +157,7 @@ BPF_CALL_0(bpf_ktime_get_ns)
 const struct bpf_func_proto bpf_ktime_get_ns_proto = {
.func   = bpf_ktime_get_ns,
.gpl_only   = true,
-   .ret_type   = RET_INTEGER,
+   .ret_type   = RET_INTEGER64,
 };
 
 BPF_CALL_0(bpf_get_current_pid_tgid)
@@ -173,7 +173,7 @@ BPF_CALL_0(bpf_get_current_pid_tgid)
 const struct bpf_func_proto bpf_get_current_pid_tgid_proto = {
.func   = bpf_get_current_pid_tgid,
.gpl_only   = false,
-   .ret_type   = RET_INTEGER,
+   .ret_type   = RET_INTEGER64,
 };
 
 BPF_CALL_0(bpf_get_current_uid_gid)
@@ -193,7 +193,7 @@ BPF_CALL_0(bpf_get_current_uid_gid)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto = {
.func   = bpf_get_current_uid_gid,
.gpl_only   = false,
-   .ret_type   = RET_INTEGER,
+   .ret_type   = RET_INTEGER64,
 };
 
 BPF_CALL_2(bpf_get_current_comm, char *, buf, u32, size)
@@ -221,7 +221,7 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
.gpl_only   = false,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_UNINIT_MEM,
-   .arg2_type  = ARG_CONST_SIZE,
+   .arg2_type  = ARG_CONST_SIZE32,
 };
 
 #if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
@@ -331,7 +331,7 @@ BPF_CALL_0(bpf_get_current_

[PATCH v6 bpf-next 13/17] s390: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Signed-off-by: Jiong Wang 
---
 arch/s390/net/bpf_jit_comp.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 51dd026..8315b2e 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -299,9 +299,11 @@ static inline void reg_set_seen(struct bpf_jit *jit, u32 
b1)
 
 #define EMIT_ZERO(b1)  \
 ({ \
-   /* llgfr %dst,%dst (zero extend to 64 bit) */   \
-   EMIT4(0xb916, b1, b1);  \
-   REG_SET_SEEN(b1);   \
+   if (!fp->aux->verifier_zext) {  \
+   /* llgfr %dst,%dst (zero extend to 64 bit) */   \
+   EMIT4(0xb916, b1, b1);  \
+   REG_SET_SEEN(b1);   \
+   }   \
 })
 
 /*
@@ -515,6 +517,13 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
struct bpf_prog *fp, int i
jit->seen |= SEEN_REG_AX;
switch (insn->code) {
/*
+* BPF_ZEXT
+*/
+   case BPF_ALU | BPF_ZEXT: /* dst = (u32) dst */
+   /* llgfr %dst,%dst */
+   EMIT4(0xb916, dst_reg, dst_reg);
+   break;
+   /*
 * BPF_MOV
 */
case BPF_ALU | BPF_MOV | BPF_X: /* dst = (u32) src */
@@ -1282,6 +1291,11 @@ static int bpf_jit_prog(struct bpf_jit *jit, struct 
bpf_prog *fp)
return 0;
 }
 
+bool bpf_jit_hardware_zext(void)
+{
+   return false;
+}
+
 /*
  * Compile eBPF program "fp"
  */
-- 
2.7.4



[PATCH v6 bpf-next 03/17] bpf: verifier: mark patched-insn with sub-register zext flag

2019-05-03 Thread Jiong Wang
Patched insns do not go through generic verification, therefore they don't
have zero-extension information collected during insn walking.

We don't bother analyzing them at the moment; for any sub-register def that
comes from them, just conservatively mark it as needing zero extension.

Signed-off-by: Jiong Wang 
---
 kernel/bpf/verifier.c | 37 +
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 43ea665..b43e8a2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1303,6 +1303,24 @@ static bool is_reg64(struct bpf_verifier_env *env, 
struct bpf_insn *insn,
return true;
 }
 
+/* Return TRUE if INSN doesn't have explicit value define. */
+static bool insn_no_def(struct bpf_insn *insn)
+{
+   u8 class = BPF_CLASS(insn->code);
+
+   return (class == BPF_JMP || class == BPF_JMP32 ||
+   class == BPF_STX || class == BPF_ST);
+}
+
+/* Return TRUE if INSN has defined any 32-bit value explicitly. */
+static bool insn_has_def32(struct bpf_verifier_env *env, struct bpf_insn *insn)
+{
+   if (insn_no_def(insn))
+   return false;
+
+   return !is_reg64(env, insn, insn->dst_reg, NULL, DST_OP);
+}
+
 static void mark_insn_zext(struct bpf_verifier_env *env,
   struct bpf_reg_state *reg)
 {
@@ -7306,14 +7324,23 @@ static void convert_pseudo_ld_imm64(struct 
bpf_verifier_env *env)
  * insni[off, off + cnt).  Adjust corresponding insn_aux_data by copying
  * [0, off) and [off, end) to new locations, so the patched range stays zero
  */
-static int adjust_insn_aux_data(struct bpf_verifier_env *env, u32 prog_len,
-   u32 off, u32 cnt)
+static int adjust_insn_aux_data(struct bpf_verifier_env *env,
+   struct bpf_prog *new_prog, u32 off, u32 cnt)
 {
struct bpf_insn_aux_data *new_data, *old_data = env->insn_aux_data;
+   struct bpf_insn *insn = new_prog->insnsi;
+   u32 prog_len;
int i;
 
+   /* aux info at OFF always needs adjustment, no matter fast path
+* (cnt == 1) is taken or not. There is no guarantee INSN at OFF is the
+* original insn at old prog.
+*/
+   old_data[off].zext_dst = insn_has_def32(env, insn + off + cnt - 1);
+
if (cnt == 1)
return 0;
+   prog_len = new_prog->len;
new_data = vzalloc(array_size(prog_len,
  sizeof(struct bpf_insn_aux_data)));
if (!new_data)
@@ -7321,8 +7348,10 @@ static int adjust_insn_aux_data(struct bpf_verifier_env 
*env, u32 prog_len,
memcpy(new_data, old_data, sizeof(struct bpf_insn_aux_data) * off);
memcpy(new_data + off + cnt - 1, old_data + off,
   sizeof(struct bpf_insn_aux_data) * (prog_len - off - cnt + 1));
-   for (i = off; i < off + cnt - 1; i++)
+   for (i = off; i < off + cnt - 1; i++) {
new_data[i].seen = true;
+   new_data[i].zext_dst = insn_has_def32(env, insn + i);
+   }
env->insn_aux_data = new_data;
vfree(old_data);
return 0;
@@ -7355,7 +7384,7 @@ static struct bpf_prog *bpf_patch_insn_data(struct 
bpf_verifier_env *env, u32 of
env->insn_aux_data[off].orig_idx);
return NULL;
}
-   if (adjust_insn_aux_data(env, new_prog->len, off, len))
+   if (adjust_insn_aux_data(env, new_prog, off, len))
return NULL;
adjust_subprog_starts(env, off, len);
return new_prog;
-- 
2.7.4



[PATCH v6 bpf-next 14/17] sparc: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Cc: David S. Miller 
Signed-off-by: Jiong Wang 
---
 arch/sparc/net/bpf_jit_comp_64.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index 65428e7..8bac761 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c
@@ -905,6 +905,10 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
ctx->saw_frame_pointer = true;
 
switch (code) {
+   /* dst = (u32) dst */
+   case BPF_ALU | BPF_ZEXT:
+   emit_alu_K(SRL, dst, 0, ctx);
+   break;
/* dst = src */
case BPF_ALU | BPF_MOV | BPF_X:
emit_alu3_K(SRL, src, 0, dst, ctx);
@@ -1144,7 +1148,8 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
break;
 
do_alu32_trunc:
-   if (BPF_CLASS(code) == BPF_ALU)
+   if (BPF_CLASS(code) == BPF_ALU &&
+   !ctx->prog->aux->verifier_zext)
emit_alu_K(SRL, dst, 0, ctx);
break;
 
@@ -1432,6 +1437,11 @@ static void jit_fill_hole(void *area, unsigned int size)
*ptr++ = 0x91d02005; /* ta 5 */
 }
 
+bool bpf_jit_hardware_zext(void)
+{
+   return false;
+}
+
 struct sparc64_jit_data {
struct bpf_binary_header *header;
u8 *image;
-- 
2.7.4



[PATCH v6 bpf-next 15/17] x32: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Cc: Wang YanQing 
Signed-off-by: Jiong Wang 
---
 arch/x86/net/bpf_jit_comp32.c | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 0d9cdff..16c4f4e 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -567,7 +567,7 @@ static inline void emit_ia32_alu_r(const bool is64, const 
bool hi, const u8 op,
 static inline void emit_ia32_alu_r64(const bool is64, const u8 op,
 const u8 dst[], const u8 src[],
 bool dstk,  bool sstk,
-u8 **pprog)
+u8 **pprog, const struct bpf_prog_aux *aux)
 {
u8 *prog = *pprog;
 
@@ -575,7 +575,7 @@ static inline void emit_ia32_alu_r64(const bool is64, const 
u8 op,
if (is64)
emit_ia32_alu_r(is64, true, op, dst_hi, src_hi, dstk, sstk,
&prog);
-   else
+   else if (!aux->verifier_zext)
emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
*pprog = prog;
 }
@@ -666,7 +666,8 @@ static inline void emit_ia32_alu_i(const bool is64, const 
bool hi, const u8 op,
 /* ALU operation (64 bit) */
 static inline void emit_ia32_alu_i64(const bool is64, const u8 op,
 const u8 dst[], const u32 val,
-bool dstk, u8 **pprog)
+bool dstk, u8 **pprog,
+const struct bpf_prog_aux *aux)
 {
u8 *prog = *pprog;
u32 hi = 0;
@@ -677,7 +678,7 @@ static inline void emit_ia32_alu_i64(const bool is64, const 
u8 op,
emit_ia32_alu_i(is64, false, op, dst_lo, val, dstk, &prog);
if (is64)
emit_ia32_alu_i(is64, true, op, dst_hi, hi, dstk, &prog);
-   else
+   else if (!aux->verifier_zext)
emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
 
*pprog = prog;
@@ -1642,6 +1643,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
 
switch (code) {
/* ALU operations */
+   /* dst = (u32) dst */
+   case BPF_ALU | BPF_ZEXT:
+   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+   break;
/* dst = src */
case BPF_ALU | BPF_MOV | BPF_K:
case BPF_ALU | BPF_MOV | BPF_X:
@@ -1690,11 +1695,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int 
*addrs, u8 *image,
switch (BPF_SRC(code)) {
case BPF_X:
emit_ia32_alu_r64(is64, BPF_OP(code), dst,
- src, dstk, sstk, &prog);
+ src, dstk, sstk, &prog,
+ bpf_prog->aux);
break;
case BPF_K:
emit_ia32_alu_i64(is64, BPF_OP(code), dst,
- imm32, dstk, &prog);
+ imm32, dstk, &prog,
+ bpf_prog->aux);
break;
}
break;
@@ -1713,7 +1720,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
false, &prog);
break;
}
-   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+   if (!bpf_prog->aux->verifier_zext)
+   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
break;
case BPF_ALU | BPF_LSH | BPF_X:
case BPF_ALU | BPF_RSH | BPF_X:
@@ -1733,7 +1741,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
  &prog);
break;
}
-   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+   if (!bpf_prog->aux->verifier_zext)
+   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
break;
/* dst = dst / src(imm) */
/* dst = dst % src(imm) */
@@ -1755,7 +1764,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
&prog);
break;
}
-   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+   if (!bpf_prog->aux->verifier_zext)
+   emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
break;
case BPF_ALU64 | BPF_DIV

[PATCH v6 bpf-next 12/17] powerpc: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Cc: Naveen N. Rao 
Cc: Sandipan Das 
Signed-off-by: Jiong Wang 
---
 arch/powerpc/net/bpf_jit_comp64.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 21a1dcd..9fef73dc 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -557,9 +557,15 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
goto bpf_alu32_trunc;
break;
 
+   /*
+* ZEXT
+*/
+   case BPF_ALU | BPF_ZEXT:
+   PPC_RLWINM(dst_reg, dst_reg, 0, 0, 31);
+   break;
 bpf_alu32_trunc:
/* Truncate to 32-bits */
-   if (BPF_CLASS(code) == BPF_ALU)
+   if (BPF_CLASS(code) == BPF_ALU && !fp->aux->verifier_zext)
PPC_RLWINM(dst_reg, dst_reg, 0, 0, 31);
break;
 
@@ -1046,6 +1052,11 @@ struct powerpc64_jit_data {
struct codegen_context ctx;
 };
 
+bool bpf_jit_hardware_zext(void)
+{
+   return false;
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 {
u32 proglen;
-- 
2.7.4



[PATCH v6 bpf-next 08/17] libbpf: add "prog_flags" to bpf_program/bpf_prog_load_attr/bpf_load_program_attr

2019-05-03 Thread Jiong Wang
libbpf doesn't allow passing "prog_flags" during bpf program load in a
couple of load-related APIs: "bpf_load_program_xattr", "load_program" and
"bpf_prog_load_xattr".

It makes sense to allow passing "prog_flags", which is useful for
customizing program loading.
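
A hedged usage sketch of the new field (the program type below and the
caller-supplied insns are illustrative only):

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    /* Pass the new prog_flags through the raw-insns loader. */
    static int load_insns_with_rnd_hi32(const struct bpf_insn *insns, size_t cnt)
    {
            struct bpf_load_program_attr attr = {
                    .prog_type  = BPF_PROG_TYPE_SOCKET_FILTER,
                    .insns      = insns,
                    .insns_cnt  = cnt,
                    .license    = "GPL",
                    .prog_flags = BPF_F_TEST_RND_HI32, /* new field from this patch */
            };

            return bpf_load_program_xattr(&attr, NULL, 0);
    }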

Reviewed-by: Jakub Kicinski 
Signed-off-by: Jiong Wang 
---
 tools/lib/bpf/bpf.c| 1 +
 tools/lib/bpf/bpf.h| 1 +
 tools/lib/bpf/libbpf.c | 3 +++
 tools/lib/bpf/libbpf.h | 1 +
 4 files changed, 6 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 955191c..f79ec49 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -254,6 +254,7 @@ int bpf_load_program_xattr(const struct 
bpf_load_program_attr *load_attr,
if (load_attr->name)
memcpy(attr.prog_name, load_attr->name,
   min(strlen(load_attr->name), BPF_OBJ_NAME_LEN - 1));
+   attr.prog_flags = load_attr->prog_flags;
 
fd = sys_bpf_prog_load(&attr, sizeof(attr));
if (fd >= 0)
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9593fec..ff42ca0 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -87,6 +87,7 @@ struct bpf_load_program_attr {
const void *line_info;
__u32 line_info_cnt;
__u32 log_level;
+   __u32 prog_flags;
 };
 
 /* Flags to direct loading requirements */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 11a65db..debca21 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -184,6 +184,7 @@ struct bpf_program {
void *line_info;
__u32 line_info_rec_size;
__u32 line_info_cnt;
+   __u32 prog_flags;
 };
 
 enum libbpf_map_type {
@@ -1949,6 +1950,7 @@ load_program(struct bpf_program *prog, struct bpf_insn 
*insns, int insns_cnt,
load_attr.line_info_rec_size = prog->line_info_rec_size;
load_attr.line_info_cnt = prog->line_info_cnt;
load_attr.log_level = prog->log_level;
+   load_attr.prog_flags = prog->prog_flags;
if (!load_attr.insns || !load_attr.insns_cnt)
return -EINVAL;
 
@@ -3394,6 +3396,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
  expected_attach_type);
 
prog->log_level = attr->log_level;
+   prog->prog_flags = attr->prog_flags;
if (!first_prog)
first_prog = prog;
}
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index c5ff005..5abc237 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -320,6 +320,7 @@ struct bpf_prog_load_attr {
enum bpf_attach_type expected_attach_type;
int ifindex;
int log_level;
+   int prog_flags;
 };
 
 LIBBPF_API int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
-- 
2.7.4



[PATCH v6 bpf-next 11/17] arm: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
Cc: Shubham Bansal 
Signed-off-by: Jiong Wang 
---
 arch/arm/net/bpf_jit_32.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index c8bfbbf..a6f78c8 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -736,7 +736,8 @@ static inline void emit_a32_alu_r64(const bool is64, const 
s8 dst[],
 
/* ALU operation */
emit_alu_r(rd[1], rs, true, false, op, ctx);
-   emit_a32_mov_i(rd[0], 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(rd[0], 0, ctx);
}
 
arm_bpf_put_reg64(dst, rd, ctx);
@@ -758,8 +759,9 @@ static inline void emit_a32_mov_r64(const bool is64, const 
s8 dst[],
  struct jit_ctx *ctx) {
if (!is64) {
emit_a32_mov_r(dst_lo, src_lo, ctx);
-   /* Zero out high 4 bytes */
-   emit_a32_mov_i(dst_hi, 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   /* Zero out high 4 bytes */
+   emit_a32_mov_i(dst_hi, 0, ctx);
} else if (__LINUX_ARM_ARCH__ < 6 &&
   ctx->cpu_architecture < CPU_ARCH_ARMv5TE) {
/* complete 8 byte move */
@@ -1060,17 +1062,20 @@ static inline void emit_ldx_r(const s8 dst[], const s8 
src,
case BPF_B:
/* Load a Byte */
emit(ARM_LDRB_I(rd[1], rm, off), ctx);
-   emit_a32_mov_i(rd[0], 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(rd[0], 0, ctx);
break;
case BPF_H:
/* Load a HalfWord */
emit(ARM_LDRH_I(rd[1], rm, off), ctx);
-   emit_a32_mov_i(rd[0], 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(rd[0], 0, ctx);
break;
case BPF_W:
/* Load a Word */
emit(ARM_LDR_I(rd[1], rm, off), ctx);
-   emit_a32_mov_i(rd[0], 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(rd[0], 0, ctx);
break;
case BPF_DW:
/* Load a Double Word */
@@ -1352,6 +1357,10 @@ static int build_insn(const struct bpf_insn *insn, 
struct jit_ctx *ctx)
switch (code) {
/* ALU operations */
 
+   /* dst = (u32) dst */
+   case BPF_ALU | BPF_ZEXT:
+   emit_a32_mov_i(dst_hi, 0, ctx);
+   break;
/* dst = src */
case BPF_ALU | BPF_MOV | BPF_K:
case BPF_ALU | BPF_MOV | BPF_X:
@@ -1438,7 +1447,8 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
}
emit_udivmod(rd_lo, rd_lo, rt, ctx, BPF_OP(code));
arm_bpf_put_reg32(dst_lo, rd_lo, ctx);
-   emit_a32_mov_i(dst_hi, 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(dst_hi, 0, ctx);
break;
case BPF_ALU64 | BPF_DIV | BPF_K:
case BPF_ALU64 | BPF_DIV | BPF_X:
@@ -1453,7 +1463,8 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
return -EINVAL;
if (imm)
emit_a32_alu_i(dst_lo, imm, ctx, BPF_OP(code));
-   emit_a32_mov_i(dst_hi, 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(dst_hi, 0, ctx);
break;
/* dst = dst << imm */
case BPF_ALU64 | BPF_LSH | BPF_K:
@@ -1488,7 +1499,8 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
/* dst = ~dst */
case BPF_ALU | BPF_NEG:
emit_a32_alu_i(dst_lo, 0, ctx, BPF_OP(code));
-   emit_a32_mov_i(dst_hi, 0, ctx);
+   if (!ctx->prog->aux->verifier_zext)
+   emit_a32_mov_i(dst_hi, 0, ctx);
break;
/* dst = ~dst (64 bit) */
case BPF_ALU64 | BPF_NEG:
@@ -1838,6 +1850,11 @@ void bpf_jit_compile(struct bpf_prog *prog)
/* Nothing to do here. We support Internal BPF. */
 }
 
+bool bpf_jit_hardware_zext(void)
+{
+   return false;
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
struct bpf_prog *tmp, *orig_prog = prog;
-- 
2.7.4



[PATCH v6 bpf-next 09/17] selftests: bpf: adjust several test_verifier helpers for insn insertion

2019-05-03 Thread Jiong Wang
  - bpf_fill_ld_abs_vlan_push_pop:
Prevent zext from happening inside the PUSH_CNT loop. This could happen
because of BPF_LD_ABS (32-bit def) + BPF_JMP (64-bit use), or BPF_LD_ABS +
EXIT (64-bit use of R0). So, change BPF_JMP to BPF_JMP32 and redefine
R0 at the exit path to cut off the data flow from inside the loop (see
the sketch after this list).

  - bpf_fill_jump_around_ld_abs:
Jump range is limited to 16 bits. Every ld_abs is replaced by 6 insns,
but on arches like arm, ppc, etc., there will be one BPF_ZEXT inserted
to extend the error value of the inlined ld_abs sequence, which then
contains 7 insns. So, set the divisor to 7 so the testcase can work
on all arches.

  - bpf_fill_scale1/bpf_fill_scale2:
Both contain ~1M BPF_ALU32_IMM insns, which will trigger ~1M insn patcher
calls because of hi32 randomization later when BPF_F_TEST_RND_HI32 is
set for bpf selftests. The insn patcher is not efficient; ~1M calls to
it will hang the computer. So, change to BPF_ALU64_IMM to avoid hi32
randomization.
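
A hedged sketch of the data flow being cut off in the first item above ("L"
is a placeholder jump offset; the macros are the standard insn constructors
used throughout the selftests):

    /* before: a 32-bit def immediately followed by a 64-bit use */
    BPF_LD_ABS(BPF_B, 0),                        /* 32-bit def of R0          */
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0x34, L),    /* 64-bit use -> zext of R0  */

    /* after this patch: the use is 32-bit, so no zext is forced */
    BPF_LD_ABS(BPF_B, 0),                        /* 32-bit def of R0          */
    BPF_JMP32_IMM(BPF_JNE, BPF_REG_0, 0x34, L),  /* 32-bit use -> no zext     */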

Signed-off-by: Jiong Wang 
---
 tools/testing/selftests/bpf/test_verifier.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index ccd896b..3dcdfd4 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -138,32 +138,36 @@ static void bpf_fill_ld_abs_vlan_push_pop(struct bpf_test 
*self)
 loop:
for (j = 0; j < PUSH_CNT; j++) {
insn[i++] = BPF_LD_ABS(BPF_B, 0);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0x34, len - i - 2);
+   /* jump to error label */
+   insn[i] = BPF_JMP32_IMM(BPF_JNE, BPF_REG_0, 0x34, len - i - 3);
i++;
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
insn[i++] = BPF_MOV64_IMM(BPF_REG_2, 1);
insn[i++] = BPF_MOV64_IMM(BPF_REG_3, 2);
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
 BPF_FUNC_skb_vlan_push),
-   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, len - i - 2);
+   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, len - i - 3);
i++;
}
 
for (j = 0; j < PUSH_CNT; j++) {
insn[i++] = BPF_LD_ABS(BPF_B, 0);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0x34, len - i - 2);
+   insn[i] = BPF_JMP32_IMM(BPF_JNE, BPF_REG_0, 0x34, len - i - 3);
i++;
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
 BPF_FUNC_skb_vlan_pop),
-   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, len - i - 2);
+   insn[i] = BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, len - i - 3);
i++;
}
if (++k < 5)
goto loop;
 
-   for (; i < len - 1; i++)
-   insn[i] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 0xbef);
+   for (; i < len - 3; i++)
+   insn[i] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0xbef);
+   insn[len - 3] = BPF_JMP_A(1);
+   /* error label */
+   insn[len - 2] = BPF_MOV32_IMM(BPF_REG_0, 0);
insn[len - 1] = BPF_EXIT_INSN();
self->prog_len = len;
 }
@@ -171,8 +175,13 @@ static void bpf_fill_ld_abs_vlan_push_pop(struct bpf_test 
*self)
 static void bpf_fill_jump_around_ld_abs(struct bpf_test *self)
 {
struct bpf_insn *insn = self->fill_insns;
-   /* jump range is limited to 16 bit. every ld_abs is replaced by 6 insns 
*/
-   unsigned int len = (1 << 15) / 6;
+   /* jump range is limited to 16 bit. every ld_abs is replaced by 6 insns,
+* but on arches like arm, ppc etc, there will be one BPF_ZEXT inserted
+* to extend the error value of the inlined ld_abs sequence which then
+* contains 7 insns. so, set the dividend to 7 so the testcase could
+* work on all arches.
+*/
+   unsigned int len = (1 << 15) / 7;
int i = 0;
 
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
@@ -230,7 +239,7 @@ static void bpf_fill_scale1(struct bpf_test *self)
 * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
 */
while (i < MAX_TEST_INSNS - 1025)
-   insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
+   insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
self->retval = 42;
@@ -261,7 +270,7 @@ static void bpf_fill_scale2(struct bpf_test *self)
 * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT
 */
while (i < MAX_TEST_INSNS - 1025)
-   insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
+   insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INS

[PATCH v6 bpf-next 17/17] nfp: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang
This patch eliminates zero-extension code-gen for instructions including
both alu and load/store. The only exception is ctx load, because the
offload target doesn't go through the host ctx conversion logic, so we do
a customized load and ignore the zext flag set by the verifier.

Reviewed-by: Jakub Kicinski 
Signed-off-by: Jiong Wang 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 115 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |   2 +
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  12 +++
 3 files changed, 81 insertions(+), 48 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index f272247..634bae0 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -612,6 +612,13 @@ static void wrp_immed(struct nfp_prog *nfp_prog, swreg 
dst, u32 imm)
 }
 
 static void
+wrp_zext(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, u8 dst)
+{
+   if (meta->flags & FLAG_INSN_DO_ZEXT)
+   wrp_immed(nfp_prog, reg_both(dst + 1), 0);
+}
+
+static void
 wrp_immed_relo(struct nfp_prog *nfp_prog, swreg dst, u32 imm,
   enum nfp_relo_type relo)
 {
@@ -847,7 +854,8 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
 }
 
 static int
-data_ld(struct nfp_prog *nfp_prog, swreg offset, u8 dst_gpr, int size)
+data_ld(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, swreg offset,
+   u8 dst_gpr, int size)
 {
unsigned int i;
u16 shift, sz;
@@ -870,14 +878,15 @@ data_ld(struct nfp_prog *nfp_prog, swreg offset, u8 
dst_gpr, int size)
wrp_mov(nfp_prog, reg_both(dst_gpr + i), reg_xfer(i));
 
if (i < 2)
-   wrp_immed(nfp_prog, reg_both(dst_gpr + 1), 0);
+   wrp_zext(nfp_prog, meta, dst_gpr);
 
return 0;
 }
 
 static int
-data_ld_host_order(struct nfp_prog *nfp_prog, u8 dst_gpr,
-  swreg lreg, swreg rreg, int size, enum cmd_mode mode)
+data_ld_host_order(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+  u8 dst_gpr, swreg lreg, swreg rreg, int size,
+  enum cmd_mode mode)
 {
unsigned int i;
u8 mask, sz;
@@ -900,33 +909,34 @@ data_ld_host_order(struct nfp_prog *nfp_prog, u8 dst_gpr,
wrp_mov(nfp_prog, reg_both(dst_gpr + i), reg_xfer(i));
 
if (i < 2)
-   wrp_immed(nfp_prog, reg_both(dst_gpr + 1), 0);
+   wrp_zext(nfp_prog, meta, dst_gpr);
 
return 0;
 }
 
 static int
-data_ld_host_order_addr32(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
- u8 dst_gpr, u8 size)
+data_ld_host_order_addr32(struct nfp_prog *nfp_prog, struct nfp_insn_meta 
*meta,
+ u8 src_gpr, swreg offset, u8 dst_gpr, u8 size)
 {
-   return data_ld_host_order(nfp_prog, dst_gpr, reg_a(src_gpr), offset,
- size, CMD_MODE_32b);
+   return data_ld_host_order(nfp_prog, meta, dst_gpr, reg_a(src_gpr),
+ offset, size, CMD_MODE_32b);
 }
 
 static int
-data_ld_host_order_addr40(struct nfp_prog *nfp_prog, u8 src_gpr, swreg offset,
- u8 dst_gpr, u8 size)
+data_ld_host_order_addr40(struct nfp_prog *nfp_prog, struct nfp_insn_meta 
*meta,
+ u8 src_gpr, swreg offset, u8 dst_gpr, u8 size)
 {
swreg rega, regb;
 
addr40_offset(nfp_prog, src_gpr, offset, ®a, ®b);
 
-   return data_ld_host_order(nfp_prog, dst_gpr, rega, regb,
+   return data_ld_host_order(nfp_prog, meta, dst_gpr, rega, regb,
  size, CMD_MODE_40b_BA);
 }
 
 static int
-construct_data_ind_ld(struct nfp_prog *nfp_prog, u16 offset, u16 src, u8 size)
+construct_data_ind_ld(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+ u16 offset, u16 src, u8 size)
 {
swreg tmp_reg;
 
@@ -942,10 +952,12 @@ construct_data_ind_ld(struct nfp_prog *nfp_prog, u16 
offset, u16 src, u8 size)
emit_br_relo(nfp_prog, BR_BLO, BR_OFF_RELO, 0, RELO_BR_GO_ABORT);
 
/* Load data */
-   return data_ld(nfp_prog, imm_b(nfp_prog), 0, size);
+   return data_ld(nfp_prog, meta, imm_b(nfp_prog), 0, size);
 }
 
-static int construct_data_ld(struct nfp_prog *nfp_prog, u16 offset, u8 size)
+static int
+construct_data_ld(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+ u16 offset, u8 size)
 {
swreg tmp_reg;
 
@@ -956,7 +968,7 @@ static int construct_data_ld(struct nfp_prog *nfp_prog, u16 
offset, u8 size)
 
/* Load data */
tmp_reg = re_load_imm_any(nfp_prog, offset, imm_b(nfp_prog));
-   return data_ld(nfp_prog, tmp_reg, 0, size);
+   return data_ld(nfp_prog, meta, tmp_reg, 0, size);
 }
 
 static int
@@ -1193,7 +1205,7 @@ mem_op_stack(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta,
}
 

[PATCH net-next] drivers: net: davinci_mdio: fix return value check in davinci_mdio_probe()

2019-05-03 Thread Wei Yongjun
In case of error, the function devm_ioremap() returns a NULL pointer, not
ERR_PTR(). The IS_ERR() test in the return value check should therefore be
replaced with a NULL test.

Fixes: 03f66f067560 ("net: ethernet: ti: davinci_mdio: use devm_ioremap()")
Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/ti/davinci_mdio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_mdio.c 
b/drivers/net/ethernet/ti/davinci_mdio.c
index 11642721c123..38b7f6d35759 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -398,8 +398,8 @@ static int davinci_mdio_probe(struct platform_device *pdev)
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
data->regs = devm_ioremap(dev, res->start, resource_size(res));
-   if (IS_ERR(data->regs))
-   return PTR_ERR(data->regs);
+   if (!data->regs)
+   return -ENOMEM;
 
davinci_mdio_init_clk(data);





Re: Possible refcount bug in ip6_expire_frag_queue()?

2019-05-03 Thread Eric Dumazet
On Fri, May 3, 2019 at 5:17 AM Stefan Bader  wrote:
>
> In commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3 "ipv6: frags:
> rewrite ip6_expire_frag_queue()" this function got changed to
> be like ip_expire() (after dropping a clone there).
> This was backported to 4.4.y stable (amongst other stable trees)
> in v4.4.174.
>
> Since then we got reports that in environments with heavy ipv6 load,
> the kernel crashes about every 2-3hrs with the following trace: [1].
>
> The crash is triggered by the skb_shared(skb) check in
> pskb_expand_head(). Comparing ip6_expire_frag_queue() and
> ip_expire(), the ipv6 code does a skb_get() which increments that
> refcount while the ipv4 code does not seem to do that.
>
> Would it be possible that ip6_expire_frag_queue() should not
> call skb_get() when using the first skb of the frag queue for
> the icmp message?

Hi Stefan

The bug should also trigger in latest/current trees as I can see, right ?

The skb_get() in current linux kernel seems unnecessary since we
remove the head skb thanks
to the call to inet_frag_pull_head(). We did remove the skb_get() in
IPv4, but not in IPv6. [1]

But in 4.4.stable this is not happening.

To fix the issue (remove the skb_get()), we would need to remove the
head from fq->q.fragments

[1]
In IPv4, the skb_get() removal was done in commit
fa0f527358bd900ef92f925878ed6bfbd51305cc
("ip: use rb trees for IP frag queue.")

I will send the following fix

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 
28aa9b30aeceac9a86ee6754e4b5809be115e947..d3152811b8962705a508b3fd31d2157dd19ae8e5
100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -94,11 +94,9 @@ ip6frag_expire_frag_queue(struct net *net, struct
frag_queue *fq)
goto out;

head->dev = dev;
-   skb_get(head);
spin_unlock(&fq->q.lock);

icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-   kfree_skb(head);
goto out_rcu_unlock;

 out:


>
> Thanks,
> Stefan
>
>
>
> [1]
> [296583.091021] kernel BUG at 
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> [296583.091764]  [] _decode_session6+0x26a/0x400
> [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809]  [] icmp6_send+0x5e1/0x940
> [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091893]  [] icmpv6_send+0x21/0x30
> [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> [nf_defrag_ipv6]
> [296583.091938]  [] call_timer_fn+0x37/0x140
> [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091968]  [] run_timer_softirq+0x234/0x330
> [296583.091982]  [] __do_softirq+0x109/0x2b0
> [296583.091995]  [] irq_exit+0xa5/0xb0
> [296583.092008]  [] smp_apic_timer_interrupt+0x50/0x70
> [296583.092023]  [] apic_timer_interrupt+0xcc/0xe0
> [296583.092037]  
> [296583.092044]  [] ? cpuidle_enter_state+0x11e/0x2d0
> [296583.092060]  [] cpuidle_enter+0x17/0x20
> [296583.092073]  [] call_cpuidle+0x32/0x60
> [296583.092086]  [] ? cpuidle_select+0x19/0x20
> [296583.092099]  [] cpu_startup_entry+0x296/0x360
> [296583.092114]  [] start_secondary+0x177/0x1b0
> [296583.092878] Code: 75 1a 41 8b 87 cc 00 00 00 49 03 87 d0 00 00 00 e9 e2 
> fe ff ff b8 f4 ff ff ff eb bc 4c 89 ef e8 f4 99 ab ff b8 f4 ff ff ff eb ad 
> <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89
> [296583.094510] RIP  [] pskb_expand_head+0x243/0x250
> [296583.095302]  RSP 
> [296583.099491] ---[ end trace 4262f47656f8ba9f ]---


Re: Possible refcount bug in ip6_expire_frag_queue()?

2019-05-03 Thread Eric Dumazet
On Fri, May 3, 2019 at 7:12 AM Eric Dumazet  wrote:
>
> On Fri, May 3, 2019 at 5:17 AM Stefan Bader  
> wrote:
> >
> > In commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3 "ipv6: frags:
> > rewrite ip6_expire_frag_queue()" this function got changed to
> > be like ip_expire() (after dropping a clone there).
> > This was backported to 4.4.y stable (amongst other stable trees)
> > in v4.4.174.
> >
> > Since then we got reports that in environments with heavy ipv6 load,
> > the kernel crashes about every 2-3hrs with the following trace: [1].
> >
> > The crash is triggered by the skb_shared(skb) check in
> > pskb_expand_head(). Comparing ip6_expire_frag_queue() and
> > ip_expire(), the ipv6 code does a skb_get() which increments that
> > refcount while the ipv4 code does not seem to do that.
> >
> > Would it be possible that ip6_expire_frag_queue() should not
> > call skb_get() when using the first skb of the frag queue for
> > the icmp message?
>
> Hi Stefan
>
> The bug should also trigger in latest/current trees as I can see, right ?
>
> The skb_get() in current linux kernel seems unnecessary since we
> remove the head skb thanks
> to the call to inet_frag_pull_head(). We did remove the skb_get() in
> IPv4, but not in IPv6. [1]
>
> But in 4.4.stable this is not happening.
>
> To fix the issue (remove the skb_get()), we would need to remove the
> head from fq->q.fragments
>
> [1]
> In IPv4, the skb_get() removal was done in commit
> fa0f527358bd900ef92f925878ed6bfbd51305cc
> ("ip: use rb trees for IP frag queue.")
>
> I will send the following fix
>
> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> index 
> 28aa9b30aeceac9a86ee6754e4b5809be115e947..d3152811b8962705a508b3fd31d2157dd19ae8e5
> 100644
> --- a/include/net/ipv6_frag.h
> +++ b/include/net/ipv6_frag.h
> @@ -94,11 +94,9 @@ ip6frag_expire_frag_queue(struct net *net, struct
> frag_queue *fq)
> goto out;
>
> head->dev = dev;
> -   skb_get(head);
> spin_unlock(&fq->q.lock);
>
> icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
> -   kfree_skb(head);

Oh well, we want to keep the kfree_skb() of course.

Only the skb_get(head) needs to be removed (this would fix memory leak
I presume...  :/ )

> goto out_rcu_unlock;
>
>  out:
>
>
> >
> > Thanks,
> > Stefan
> >
> >
> >
> > [1]
> > [296583.091021] kernel BUG at 
> > /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> > [296583.091734] Call Trace:
> > [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> > [296583.091764]  [] _decode_session6+0x26a/0x400
> > [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> > [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> > [296583.091809]  [] icmp6_send+0x5e1/0x940
> > [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> > [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> > [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 
> > [ixgbe]
> > [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> > [nf_defrag_ipv6]
> > [296583.091893]  [] icmpv6_send+0x21/0x30
> > [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> > [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> > [nf_defrag_ipv6]
> > [296583.091938]  [] call_timer_fn+0x37/0x140
> > [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> > [nf_defrag_ipv6]
> > [296583.091968]  [] run_timer_softirq+0x234/0x330
> > [296583.091982]  [] __do_softirq+0x109/0x2b0
> > [296583.091995]  [] irq_exit+0xa5/0xb0
> > [296583.092008]  [] smp_apic_timer_interrupt+0x50/0x70
> > [296583.092023]  [] apic_timer_interrupt+0xcc/0xe0
> > [296583.092037]  
> > [296583.092044]  [] ? cpuidle_enter_state+0x11e/0x2d0
> > [296583.092060]  [] cpuidle_enter+0x17/0x20
> > [296583.092073]  [] call_cpuidle+0x32/0x60
> > [296583.092086]  [] ? cpuidle_select+0x19/0x20
> > [296583.092099]  [] cpu_startup_entry+0x296/0x360
> > [296583.092114]  [] start_secondary+0x177/0x1b0
> > [296583.092878] Code: 75 1a 41 8b 87 cc 00 00 00 49 03 87 d0 00 00 00 e9 e2 
> > fe ff ff b8 f4 ff ff ff eb bc 4c 89 ef e8 f4 99 ab ff b8 f4 ff ff ff eb ad 
> > <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89
> > [296583.094510] RIP  [] pskb_expand_head+0x243/0x250
> > [296583.095302]  RSP 
> > [296583.099491] ---[ end trace 4262f47656f8ba9f ]---


[patch net-next] devlink: add warning in case driver does not set port type

2019-05-03 Thread Jiri Pirko
From: Jiri Pirko 

Prevent misbehavior of drivers that do not set the port type for a longer
period of time. Drivers should always set the port type; WARN if that does
not happen.

Note that it is perfectly fine to temporarily not have the type set
during initialization and port type change.

Signed-off-by: Jiri Pirko 
---
 include/net/devlink.h |  2 ++
 net/core/devlink.c| 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 1c4adfb4195a..151eb930d329 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -64,6 +65,7 @@ struct devlink_port {
enum devlink_port_type desired_type;
void *type_dev;
struct devlink_port_attrs attrs;
+   struct delayed_work type_warn_dw;
 };
 
 struct devlink_sb_pool_info {
diff --git a/net/core/devlink.c b/net/core/devlink.c
index d43bc52b8840..2515f7269ed0 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -5390,6 +5391,27 @@ void devlink_free(struct devlink *devlink)
 }
 EXPORT_SYMBOL_GPL(devlink_free);
 
+static void devlink_port_type_warn(struct work_struct *work)
+{
+   WARN(true, "Type was not set for devlink port.");
+}
+
+#define DEVLINK_PORT_TYPE_WARN_TIMEOUT (HZ * 30)
+
+static void devlink_port_type_warn_schedule(struct devlink_port *devlink_port)
+{
+   /* Schedule a work to WARN in case driver does not set port
+* type within timeout.
+*/
+   schedule_delayed_work(&devlink_port->type_warn_dw,
+ DEVLINK_PORT_TYPE_WARN_TIMEOUT);
+}
+
+static void devlink_port_type_warn_cancel(struct devlink_port *devlink_port)
+{
+   cancel_delayed_work_sync(&devlink_port->type_warn_dw);
+}
+
 /**
  * devlink_port_register - Register devlink port
  *
@@ -5419,6 +5441,8 @@ int devlink_port_register(struct devlink *devlink,
list_add_tail(&devlink_port->list, &devlink->port_list);
INIT_LIST_HEAD(&devlink_port->param_list);
mutex_unlock(&devlink->lock);
+   INIT_DELAYED_WORK(&devlink_port->type_warn_dw, &devlink_port_type_warn);
+   devlink_port_type_warn_schedule(devlink_port);
devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_NEW);
return 0;
 }
@@ -5433,6 +5457,7 @@ void devlink_port_unregister(struct devlink_port 
*devlink_port)
 {
struct devlink *devlink = devlink_port->devlink;
 
+   devlink_port_type_warn_cancel(devlink_port);
devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_DEL);
mutex_lock(&devlink->lock);
list_del(&devlink_port->list);
@@ -5446,6 +5471,7 @@ static void __devlink_port_type_set(struct devlink_port 
*devlink_port,
 {
if (WARN_ON(!devlink_port->registered))
return;
+   devlink_port_type_warn_cancel(devlink_port);
spin_lock(&devlink_port->type_lock);
devlink_port->type = type;
devlink_port->type_dev = type_dev;
@@ -5519,6 +5545,7 @@ EXPORT_SYMBOL_GPL(devlink_port_type_ib_set);
 void devlink_port_type_clear(struct devlink_port *devlink_port)
 {
__devlink_port_type_set(devlink_port, DEVLINK_PORT_TYPE_NOTSET, NULL);
+   devlink_port_type_warn_schedule(devlink_port);
 }
 EXPORT_SYMBOL_GPL(devlink_port_type_clear);
 
-- 
2.17.2



[PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Eric Dumazet
Since ip6frag_expire_frag_queue() now pulls the head skb
from frag queue, we should no longer use skb_get(), since
this leads to an skb leak.

Stefan Bader initially reported a problem in 4.4.stable [1] caused
by the skb_get(), so this patch should also fix this issue.

[296583.091021] kernel BUG at 
/build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
[296583.091734] Call Trace:
[296583.091749]  [] __pskb_pull_tail+0x50/0x350
[296583.091764]  [] _decode_session6+0x26a/0x400
[296583.091779]  [] __xfrm_decode_session+0x39/0x50
[296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
[296583.091809]  [] icmp6_send+0x5e1/0x940
[296583.091823]  [] ? __netif_receive_skb+0x18/0x60
[296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
[296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
[296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091893]  [] icmpv6_send+0x21/0x30
[296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
[296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
[nf_defrag_ipv6]
[296583.091938]  [] call_timer_fn+0x37/0x140
[296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091968]  [] run_timer_softirq+0x234/0x330
[296583.091982]  [] __do_softirq+0x109/0x2b0

Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
Signed-off-by: Eric Dumazet 
Reported-by: Stfan Bader 
Cc: Peter Oskolkov 
Cc: Florian Westphal 
---
 include/net/ipv6_frag.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 
28aa9b30aeceac9a86ee6754e4b5809be115e947..1f77fb4dc79df6bc4e41d6d2f4d49ace32082ca4
 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue 
*fq)
goto out;
 
head->dev = dev;
-   skb_get(head);
spin_unlock(&fq->q.lock);
 
icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-- 
2.21.0.1020.gf2820cf01a-goog



Re: Possible refcount bug in ip6_expire_frag_queue()?

2019-05-03 Thread Eric Dumazet
On Fri, May 3, 2019 at 7:17 AM Eric Dumazet  wrote:
>
> On Fri, May 3, 2019 at 7:12 AM Eric Dumazet  wrote:
> >

> > I will send the following fix
> >
> > diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> > index 
> > 28aa9b30aeceac9a86ee6754e4b5809be115e947..d3152811b8962705a508b3fd31d2157dd19ae8e5
> > 100644
> > --- a/include/net/ipv6_frag.h
> > +++ b/include/net/ipv6_frag.h
> > @@ -94,11 +94,9 @@ ip6frag_expire_frag_queue(struct net *net, struct
> > frag_queue *fq)
> > goto out;
> >
> > head->dev = dev;
> > -   skb_get(head);
> > spin_unlock(&fq->q.lock);
> >
> > icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
> > -   kfree_skb(head);
>
> Oh well, we want to keep the kfree_skb() of course.
>
> Only the skb_get(head) needs to be removed (this would fix memory leak
> I presume...  :/ )

Official submission :

https://patchwork.ozlabs.org/patch/1094854/ ip6: fix skb leak in
ip6frag_expire_frag_queue()

Thanks a lot Stefan for bringing this issue to our attention!


Re: [linux-sunxi] [PATCH v4 5/9] arm64: dts: allwinner: orange-pi-3: Enable ethernet

2019-05-03 Thread Jagan Teki
On Sat, Apr 13, 2019 at 10:24 PM megous via linux-sunxi
 wrote:
>
> From: Ondrej Jirman 
>
> Orange Pi 3 has two regulators that power the Realtek RTL8211E. According
> to the phy datasheet, both regulators need to be enabled at the same time,
> but we can only specify a single phy-supply in the DT.
>
> This can be achieved by making one regulator depedning on the other via
> vin-supply. While it's not a technically correct description of the
> hardware, it achieves the purpose.
>
> All values of RX/TX delay were tested exhaustively and a middle one of the
> working values was chosen.
>
> Signed-off-by: Ondrej Jirman 
> ---
>  .../dts/allwinner/sun50i-h6-orangepi-3.dts| 44 +++
>  1 file changed, 44 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/allwinner/sun50i-h6-orangepi-3.dts 
> b/arch/arm64/boot/dts/allwinner/sun50i-h6-orangepi-3.dts
> index 17d496990108..6d6b1f66796d 100644
> --- a/arch/arm64/boot/dts/allwinner/sun50i-h6-orangepi-3.dts
> +++ b/arch/arm64/boot/dts/allwinner/sun50i-h6-orangepi-3.dts
> @@ -15,6 +15,7 @@
>
> aliases {
> serial0 = &uart0;
> +   ethernet0 = &emac;
> };
>
> chosen {
> @@ -44,6 +45,27 @@
> regulator-max-microvolt = <500>;
> regulator-always-on;
> };
> +
> +   /*
> +* The board uses 2.5V RGMII signalling. Power sequence to enable
> +* the phy is to enable GMAC-2V5 and GMAC-3V3 (aldo2) power rails
> +* at the same time and to wait 100ms.
> +*/
> +   reg_gmac_2v5: gmac-2v5 {
> +   compatible = "regulator-fixed";
> +   regulator-name = "gmac-2v5";
> +   regulator-min-microvolt = <250>;
> +   regulator-max-microvolt = <250>;
> +   startup-delay-us = <10>;
> +   enable-active-high;
> +   gpio = <&pio 3 6 GPIO_ACTIVE_HIGH>; /* PD6 */
> +
> +   /* The real parent of gmac-2v5 is reg_vcc5v, but we need to
> +* enable two regulators to power the phy. This is one way
> +* to achieve that.
> +*/
> > +   vin-supply = <&reg_aldo2>; /* GMAC-3V3 */

The actual output supply pin name is GMAC-3V, which has an input of
VCC3V3-MAC (i.e. aldo2). If we want to stay consistent with the schematics,
it would be better to use the same name, IMHO.


Re: [RFC HACK] xfrm: make state refcounting percpu

2019-05-03 Thread Eric Dumazet



On 5/3/19 2:07 AM, Steffen Klassert wrote:
> On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
>> I'm not sure this is a good idea to begin with, refcount
>> is right next to state spinlock which is taken for both tx and rx ops,
>> plus this complicates debugging quite a bit.
> 
> 


For some reason I have not received Florian's response.

Florian, when the percpu counters are in nominal mode,
the updates are only in percpu memory, so the cache line containing struct 
percpu_ref in the
main object is not dirtied.

(This field/object can be placed in a read-mostly location of the structure if 
needed)

I would definitely try to use this.
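
For reference, a minimal sketch of the pattern I mean; this is just standard
percpu_ref usage, not code from this series, and the function names below
(example_release, example_init) are made up:

	#include <linux/gfp.h>
	#include <linux/percpu-refcount.h>

	static void example_release(struct percpu_ref *ref)
	{
		/* called once the last reference is gone after percpu_ref_kill() */
	}

	static int example_init(struct percpu_ref *ref)
	{
		int err = percpu_ref_init(ref, example_release, 0, GFP_KERNEL);

		if (err)
			return err;

		/* In nominal (percpu) mode these touch only this CPU's counter,
		 * so the cache line holding struct percpu_ref is not dirtied.
		 */
		percpu_ref_get(ref);
		percpu_ref_put(ref);

		/* Switching to atomic mode (e.g. on teardown) folds the counters. */
		percpu_ref_kill(ref);
		return 0;
	}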


Re: Possible refcount bug in ip6_expire_frag_queue()?

2019-05-03 Thread Stefan Bader
On 03.05.19 13:49, Eric Dumazet wrote:
> On Fri, May 3, 2019 at 7:17 AM Eric Dumazet  wrote:
>>
>> On Fri, May 3, 2019 at 7:12 AM Eric Dumazet  wrote:
>>>
> 
>>> I will send the following fix
>>>
>>> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
>>> index 
>>> 28aa9b30aeceac9a86ee6754e4b5809be115e947..d3152811b8962705a508b3fd31d2157dd19ae8e5
>>> 100644
>>> --- a/include/net/ipv6_frag.h
>>> +++ b/include/net/ipv6_frag.h
>>> @@ -94,11 +94,9 @@ ip6frag_expire_frag_queue(struct net *net, struct
>>> frag_queue *fq)
>>> goto out;
>>>
>>> head->dev = dev;
>>> -   skb_get(head);
>>> spin_unlock(&fq->q.lock);
>>>
>>> icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
>>> -   kfree_skb(head);
>>
>> Oh well, we want to keep the kfree_skb() of course.
>>
>> Only the skb_get(head) needs to be removed (this would fix memory leak
>> I presume...  :/ )
> 
> Official submission :
> 
> https://patchwork.ozlabs.org/patch/1094854/ ip6: fix skb leak in
> ip6frag_expire_frag_queue()
> 
> Thanks a lot Stefan for bringing this issue to our attention!
> 
Thank you Eric for the quick response.

-Stefan





[PATCH net] net: atm: clean up a range check

2019-05-03 Thread Dan Carpenter
The code works fine, but the problem is that the check for negatives is a
no-op:

if (arg < 0)
i = 0;

The "i" value isn't used.  We immediately overwrite it with:

i = array_index_nospec(arg, MAX_LEC_ITF);

The array_index_nospec() macro returns zero if "arg" is out of bounds so
this works, but the dead code is confusing and it doesn't look very
intentional.
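
For anyone unfamiliar with the helper, here is a rough illustration of its
intended use; the array and size names are made up, this is not code from the
patch below:

	#include <linux/nospec.h>

	#define EXAMPLE_SIZE 32
	static void *example_table[EXAMPLE_SIZE];

	static void *example_lookup(int idx)
	{
		if (idx < 0 || idx >= EXAMPLE_SIZE)
			return NULL;
		/* Clamp idx so that, even if the CPU speculates past the
		 * bounds check above, the access stays inside example_table[];
		 * out-of-range values are forced to 0.
		 */
		idx = array_index_nospec(idx, EXAMPLE_SIZE);
		return example_table[idx];
	}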

Signed-off-by: Dan Carpenter 
---
This applies to net, but it's just a clean up.

 net/atm/lec.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/atm/lec.c b/net/atm/lec.c
index ad4f829193f0..a0311493b01b 100644
--- a/net/atm/lec.c
+++ b/net/atm/lec.c
@@ -726,9 +726,7 @@ static int lecd_attach(struct atm_vcc *vcc, int arg)
struct lec_priv *priv;
 
if (arg < 0)
-   i = 0;
-   else
-   i = arg;
+   arg = 0;
if (arg >= MAX_LEC_ITF)
return -EINVAL;
i = array_index_nospec(arg, MAX_LEC_ITF);
-- 
2.18.0



[PATCH 1/2 net-next] net: ll_temac: Fix a NULL vs IS_ERR() check in temac_open()

2019-05-03 Thread Dan Carpenter
The phy_connect() function doesn't return NULL pointers.  It returns
error pointers on error, so I have updated the check.

Fixes: 8425c41d1ef7 ("net: ll_temac: Extend support to non-device-tree 
platforms")
Signed-off-by: Dan Carpenter 
---
 drivers/net/ethernet/xilinx/ll_temac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/xilinx/ll_temac_main.c 
b/drivers/net/ethernet/xilinx/ll_temac_main.c
index 1003ee14c833..bcb97fbf5b54 100644
--- a/drivers/net/ethernet/xilinx/ll_temac_main.c
+++ b/drivers/net/ethernet/xilinx/ll_temac_main.c
@@ -927,9 +927,9 @@ static int temac_open(struct net_device *ndev)
} else if (strlen(lp->phy_name) > 0) {
phydev = phy_connect(lp->ndev, lp->phy_name, temac_adjust_link,
 lp->phy_interface);
-   if (!phydev) {
+   if (IS_ERR(phydev)) {
dev_err(lp->dev, "phy_connect() failed\n");
-   return -ENODEV;
+   return PTR_ERR(phydev);
}
phy_start(phydev);
}
-- 
2.18.0



[PATCH 2/2 net-next] net: ll_temac: remove an unnecessary condition

2019-05-03 Thread Dan Carpenter
The "pdata->mdio_bus_id" is unsigned so this condition is always true.
This patch just removes it.

Signed-off-by: Dan Carpenter 
---
 drivers/net/ethernet/xilinx/ll_temac_mdio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/xilinx/ll_temac_mdio.c 
b/drivers/net/ethernet/xilinx/ll_temac_mdio.c
index c2a11703bc6d..a4667326f745 100644
--- a/drivers/net/ethernet/xilinx/ll_temac_mdio.c
+++ b/drivers/net/ethernet/xilinx/ll_temac_mdio.c
@@ -99,7 +99,7 @@ int temac_mdio_setup(struct temac_local *lp, struct 
platform_device *pdev)
of_address_to_resource(np, 0, &res);
snprintf(bus->id, MII_BUS_ID_SIZE, "%.8llx",
 (unsigned long long)res.start);
-   } else if (pdata && pdata->mdio_bus_id >= 0) {
+   } else if (pdata) {
snprintf(bus->id, MII_BUS_ID_SIZE, "%.8llx",
 pdata->mdio_bus_id);
}
-- 
2.18.0



Re: [PATCH net-next] ipmr: Do not define MAXVIFS twice

2019-05-03 Thread Nikolay Aleksandrov
On 03/05/2019 01:23, David Ahern wrote:
> From: David Ahern 
> 
> b70432f7319eb refactored mroute code to make it common between ipv4
> and ipv6. In the process, MAXVIFS got defined a second time: the
> first is in the uapi file linux/mroute.h. A second one was created
> presumably for IPv6 but it is not needed. Remove it and have
> mroute_base.h include the uapi file directly since it is shared.
> 
> include/linux/mroute.h can not be included in mroute_base.h because
> it contains a reference to mr_mfc which is defined in mroute_base.h.
> 
> Signed-off-by: David Ahern 
> ---
>  include/linux/mroute_base.h | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
> index 34de06b426ef..c5a389f81e91 100644
> --- a/include/linux/mroute_base.h
> +++ b/include/linux/mroute_base.h
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -90,13 +91,6 @@ static inline int mr_call_vif_notifiers(struct net *net,
>   return call_fib_notifiers(net, event_type, &info.info);
>  }
>  
> -#ifndef MAXVIFS
> -/* This one is nasty; value is defined in uapi using different symbols for
> - * mroute and morute6 but both map into same 32.
> - */
> -#define MAXVIFS  32
> -#endif
> -
>  #define VIF_EXISTS(_mrt, _idx) (!!((_mrt)->vif_table[_idx].dev))
>  
>  /* mfc_flags:
> 

It's in fact a mess: ipv6 defines MAXMIFS (notice the *M*), which must match
MAX*V*IFS due to the MAXVIFS use in ipmr_base, for one (possibly other places
too). Maybe this value should be set on initialization per family in the
future, because if it gets out of sync between v4 and v6, bad things will
follow. :)

Acked-by: Nikolay Aleksandrov 


Re: [PATCH v6 bpf-next 13/17] s390: bpf: eliminate zero extension code-gen

2019-05-03 Thread Heiko Carstens
On Fri, May 03, 2019 at 11:42:40AM +0100, Jiong Wang wrote:
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Signed-off-by: Jiong Wang 
> ---
>  arch/s390/net/bpf_jit_comp.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)

When sending patches which affect s390, could you please add Martin
and me on cc to _all_ patches? We now received only the cover-letter
plus one patch. It's always hard in such circumstances to figure out if
the code is doing the right thing.

Usually I end up looking up the missing patches within other mailing
lists, however I haven't subscribed the bpf and netdev mailing lists.

The extra e-mail volume because of being added to CC really doesn't
matter at all.



[PATCH] net: ucc_geth - fix Oops when changing number of buffers in the ring

2019-05-03 Thread Christophe Leroy
When changing the number of buffers in the RX ring while the interface
is running, the following Oops is encountered due to the new number
of buffers being taken into account immediately, while their allocation
only happens when the device is opened.

[   69.882706] Unable to handle kernel paging request for data at address 
0xf100
[   69.890172] Faulting instruction address: 0xc033e164
[   69.895122] Oops: Kernel access of bad area, sig: 11 [#1]
[   69.900494] BE PREEMPT CMPCPRO
[   69.907120] CPU: 0 PID: 0 Comm: swapper Not tainted 
4.14.115-6-g179ade8ce3-dirty #269
[   69.915956] task: c0684310 task.stack: c06da000
[   69.920470] NIP:  c033e164 LR: c02e44d0 CTR: c02e41fc
[   69.925504] REGS: dfff1e20 TRAP: 0300   Not tainted  
(4.14.115-6-g179ade8ce3-dirty)
[   69.934161] MSR:  9032   CR: 22004428  XER: 2000
[   69.940869] DAR: f100 DSISR: 2000
[   69.940869] GPR00: c0352d70 dfff1ed0 c0684310 f0a4 0040 dfff1f68 
 001f
[   69.940869] GPR08: df53f410 1cc00040 0021 c0781640 42004424 100c82b6 
f0a4 df53f5b0
[   69.940869] GPR16: df53f6c0 c05daf84 0040  0040 c0782be4 
 0001
[   69.940869] GPR24:  df53f400 01b0 df53f410 df53f000 003f 
df708220 1cc00044
[   69.978348] NIP [c033e164] skb_put+0x0/0x5c
[   69.982528] LR [c02e44d0] ucc_geth_poll+0x2d4/0x3f8
[   69.987384] Call Trace:
[   69.989830] [dfff1ed0] [c02e4554] ucc_geth_poll+0x358/0x3f8 (unreliable)
[   69.996522] [dfff1f20] [c0352d70] net_rx_action+0x248/0x30c
[   70.002099] [dfff1f80] [c04e93e4] __do_softirq+0xfc/0x310
[   70.007492] [dfff1fe0] [c0021124] irq_exit+0xd0/0xd4
[   70.012458] [dfff1ff0] [c000e7e0] call_do_irq+0x24/0x3c
[   70.017683] [c06dbe80] [c0006bac] do_IRQ+0x64/0xc4
[   70.022474] [c06dbea0] [c001097c] ret_from_except+0x0/0x14
[   70.027964] --- interrupt: 501 at rcu_idle_exit+0x84/0x90
[   70.027964] LR = rcu_idle_exit+0x74/0x90
[   70.037585] [c06dbf60] [2000] 0x2000 (unreliable)
[   70.042984] [c06dbf80] [c004bb0c] do_idle+0xb4/0x11c
[   70.047945] [c06dbfa0] [c004bd14] cpu_startup_entry+0x18/0x1c
[   70.053682] [c06dbfb0] [c05fb034] start_kernel+0x370/0x384
[   70.059153] [c06dbff0] [3438] 0x3438
[   70.063062] Instruction dump:
[   70.066023] 38a0 3880 90010014 4bfff015 80010014 7c0803a6 3123 
7c691910
[   70.073767] 38210010 4e800020 3860 4e800020 <80e3005c> 80c30098 3107 
7d083910
[   70.081690] ---[ end trace be7ccd9c1e1a9f12 ]---

This patch forbids the modification of the number of buffers in the
ring while the interface is running.

Fixes: ac421852b3a0 ("ucc_geth: add ethtool support")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 drivers/net/ethernet/freescale/ucc_geth_ethtool.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth_ethtool.c 
b/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
index 0beee2cc2ddd..722b6de24816 100644
--- a/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
+++ b/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
@@ -252,14 +252,12 @@ uec_set_ringparam(struct net_device *netdev,
return -EINVAL;
}
 
+   if (netif_running(netdev))
+   return -EBUSY;
+
ug_info->bdRingLenRx[queue] = ring->rx_pending;
ug_info->bdRingLenTx[queue] = ring->tx_pending;
 
-   if (netif_running(netdev)) {
-   /* FIXME: restart automatically */
-   netdev_info(netdev, "Please re-open the interface\n");
-   }
-
return ret;
 }
 
-- 
2.13.3



Re: [PATCH v6 bpf-next 13/17] s390: bpf: eliminate zero extension code-gen

2019-05-03 Thread Eric Dumazet



On 5/3/19 9:41 AM, Heiko Carstens wrote:
> On Fri, May 03, 2019 at 11:42:40AM +0100, Jiong Wang wrote:
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Signed-off-by: Jiong Wang 
>> ---
>>  arch/s390/net/bpf_jit_comp.c | 20 +---
>>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> When sending patches which affect s390, could you please add Martin
> and me on cc to _all_ patches? We now received only the cover-letter
> plus one patch. It's always hard in such circumstances to figure out if
> the code is doing the right thing.
> 
>
One possible way is to use the --signed-off-by-cc option of git send-email

   --[no-]signed-off-by-cc
   If this is set, add emails found in Signed-off-by: or Cc: lines to 
the cc list.
   Default is the value of sendemail.signedoffbycc configuration value; 
if that is
   unspecified, default to --signed-off-by-cc.



Re: [PATCH v6 bpf-next 13/17] s390: bpf: eliminate zero extension code-gen

2019-05-03 Thread Jiong Wang


Heiko Carstens writes:

> On Fri, May 03, 2019 at 11:42:40AM +0100, Jiong Wang wrote:
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Signed-off-by: Jiong Wang 
>> ---
>>  arch/s390/net/bpf_jit_comp.c | 20 +---
>>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> When sending patches which affect s390, could you please add Martin
> and me on cc to _all_ patches? We now received only the cover-letter
> plus one patch. It's always hard in such circumstances to figure out if
> the code is doing the right thing.

OK, will do it next time.

I will just CC back-end maintainers on all patches, including patches for the
other back-ends, to make the information complete.

Regards,
Jiong

> Usually I end up looking up the missing patches within other mailing
> lists, however I haven't subscribed the bpf and netdev mailing lists.
>
> The extra e-mail volume because of being added to CC really doesn't
> matter at all.


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Set STP disable state in port_disable

2019-05-03 Thread Vivien Didelot
On Wed,  1 May 2019 00:08:30 +0200, Andrew Lunn  wrote:
> When requested to disable a port, set the port STP state to disabled.
> This fully disables the port and should save some power.
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot 


Re: [PATCH net-next 2/2] net: dsa :mv88e6xxx: Disable unused ports

2019-05-03 Thread Vivien Didelot
On Wed,  1 May 2019 00:08:31 +0200, Andrew Lunn  wrote:
> If the NO_CPU strap is set, the switch starts in 'dumb hub' mode, with
> all ports enabled. Ports which are then actively used are reconfigured
> as required when the driver starts. However unused ports are left
> alone. Change this to disable them, and turn off any SERDES
> interface. This could save some power and so reduce the temperature a
> bit.
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot 


[PATCH v4 00/10] of_net: Add NVMEM support to of_get_mac_address

2019-05-03 Thread Petr Štetiar
Hi,

this patch series is a continuation of my previous attempt[1], where I
tried to wire the MTD layer into of_get_mac_address, so it would be possible
to load MAC addresses from various NVMEMs such as EEPROMs etc.

The predecessor of this patch, which used the MTD layer directly, originated
in OpenWrt some time ago and already supports about 497 use cases in 357
device tree files.

During the review process of my 1st attempt I was told that I shouldn't be
using MTD directly, but that I should rather use the new NVMEM subsystem.
During the review process of v2 I was told that I should handle the
EPROBE_DEFER error as well, and during the review process of v3 I was told
that returning pointer/NULL/ERR_PTR is considered wrong API design, so this
v4 patch series tries to accommodate all these previous remarks.

The first patch wires NVMEM support directly into of_get_mac_address, as
it's obvious that adding support for NVMEM into every other driver would
mean adding a lot of repetitive code.
addresses in various devices like ethernet and wireless adapters directly
from of_get_mac_address, which is used by quite a lot of drivers in the
tree already.

The second patch simply updates the documentation with the NVMEM bits and
cleans up all current binding documentation referencing any of the MAC
address related properties.

The third and fourth patches simply remove duplicate NVMEM code which is
no longer needed, as the first patch has wired NVMEM support directly into
of_get_mac_address.

Patches 5-10 convert all current users of of_get_mac_address to the new
ERR_PTR-encoded error value, as of_get_mac_address can now return a valid
pointer, NULL or an ERR_PTR.
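
To illustrate the consumer-side pattern the conversion patches switch to,
here is a rough sketch only; the driver function is made up, and the exact
return convention is defined by the first patch:

	#include <linux/err.h>
	#include <linux/etherdevice.h>
	#include <linux/of_net.h>

	static int example_set_mac(struct net_device *ndev,
				   struct device_node *np)
	{
		const void *mac = of_get_mac_address(np);

		if (IS_ERR(mac))
			return PTR_ERR(mac);	/* may be -EPROBE_DEFER when the
						 * NVMEM cell is not ready yet */

		ether_addr_copy(ndev->dev_addr, mac);
		return 0;
	}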

Just for a better picture: this patch series, plus one simple patch[2] on top
of it, allows me to configure the 8Devices Carambola2 board's MAC addresses
with the following DTS (simplified):

 &spi {
flash@0 {
partitions {
art: partition@ff {
label = "art";
reg = <0xff 0x01>;
read-only;

nvmem-cells {
compatible = "nvmem-cells";
#address-cells = <1>;
#size-cells = <1>;

eth0_addr: eth-mac-addr@0 {
reg = <0x0 0x6>;
};

eth1_addr: eth-mac-addr@6 {
reg = <0x6 0x6>;
};

wmac_addr: wifi-mac-addr@1002 {
reg = <0x1002 0x6>;
};
};
};
};
};
 };

 &eth0 {
nvmem-cells = <&eth0_addr>;
nvmem-cell-names = "mac-address";
 };

 &eth1 {
nvmem-cells = <&eth1_addr>;
nvmem-cell-names = "mac-address";
 };

 &wmac {
nvmem-cells = <&wmac_addr>;
nvmem-cell-names = "mac-address";
 };


1. https://patchwork.ozlabs.org/patch/1086628/
2. https://patchwork.ozlabs.org/patch/890738/

-- ynezz

Petr Štetiar (10):
  of_net: add NVMEM support to of_get_mac_address
  dt-bindings: doc: reflect new NVMEM of_get_mac_address behaviour
  net: macb: support of_get_mac_address new ERR_PTR error
  net: davinci: support of_get_mac_address new ERR_PTR error
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: support of_get_mac_address new ERR_PTR error
  net: wireless: support of_get_mac_address new ERR_PTR error
  staging: octeon-ethernet: support of_get_mac_address new ERR_PTR error
  ARM: Kirkwood: support of_get_mac_address new ERR_PTR error
  powerpc: tsi108: support of_get_mac_address new ERR_PTR error

 .../devicetree/bindings/net/altera_tse.txt |  5 +-
 Documentation/devicetree/bindings/net/amd-xgbe.txt |  5 +-
 .../devicetree/bindings/net/brcm,amac.txt  |  4 +-
 Documentation/devicetree/bindings/net/cpsw.txt |  4 +-
 .../devicetree/bindings/net/davinci_emac.txt   |  5 +-
 Documentation/devicetree/bindings/net/dsa/dsa.txt  |  5 +-
 Documentation/devicetree/bindings/net/ethernet.txt |  6 ++-
 .../devicetree/bindings/net/hisilicon-femac.txt|  4 +-
 .../bindings/net/hisilicon-hix5hd2-gmac.txt|  4 +-
 .../devicetree/bindings/net/keystone-netcp.txt | 10 ++--
 Documentation/devicetree/bindings/net/macb.txt |  5 +-
 .../devicetree/bindings/net/marvell-pxa168.txt |  4 +-
 .../devicetree/bindings/net/microchip,enc28j60.txt |  3 +-
 .../devicetree/bindings/net/microchip,lan78xx.txt  |  5 +-
 .../devicetree/bindings/net/qca,qca7000.txt|  4 +-
 .../devicetree/bindings/net/samsung-sxgbe.txt  |  4 +-
 .../bindings/net

[RFC PATCH net-next 0/3] flow_offload: Re-add various features that disappeared

2019-05-03 Thread Edward Cree
When the flow_offload infrastructure was added, a couple of things that
 were previously possible for drivers to support in TC offload were not
 plumbed through, perhaps because the drivers in the tree did not fully
 or correctly implement them.
The main issue was with statistics; in TC (and in the previous offload
 API) statistics are per-action, though generally only on 'delivery'
 actions like mirred, ok and shot.  Actions also have an index, which
 may be supplied by the user, which allows the sharing of entities such
 as counters between multiple rules.  The existing driver implementations
 did not support this, however, instead allocating a single counter per
 rule.  The flow_offload API did not support this either, as (a) the
 action index never reached the driver, and (b) the TC_CLSFLOWER_STATS
 callback was only able to return a single set of stats which were added
 to all counters for actions on the rule.  Patch #1 of this series fixes
 (a) by storing tcfa_index in a new action_index member of struct
 flow_action_entry, while patch #2 fixes (b) by adding a new callback,
 TC_CLSFLOWER_STATS_BYINDEX, which retrieves statistics for a specified
 action_index rather than by rule (although the rule cookie is still   
 passed as well).
Patch #3 adds flow_rule_match_cvlan(), analogous to
 flow_rule_match_vlan() but accessing FLOW_DISSECTOR_KEY_CVLAN instead
 of FLOW_DISSECTOR_KEY_VLAN, to allow offloading inner VLAN matches.
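
As an illustration only (not part of this series), a driver's parsing code
could consume the new helper roughly as follows, using the existing
flow_match_vlan layout:

	struct flow_match_vlan cvlan;

	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_CVLAN)) {
		flow_rule_match_cvlan(rule, &cvlan);
		/* cvlan.key / cvlan.mask describe the inner (customer) VLAN
		 * tag: vlan_id, vlan_priority and vlan_tpid.
		 */
	}
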
This patch series does not include any users of these new interfaces;   
 the driver in which I hope to use them does not yet exist upstream as  
 it is for hardware which is still under development.  However I've CCed
 developers of various other drivers that implement TC offload, in case
 any of them want to implement support.  Otherwise I imagine that David
 won't be willing to take this without a user, in which case I'll save
 it to submit alongside the aforementioned unfinished driver (hence the
 RFC tags for now).

Edward Cree (3):
  flow_offload: copy tcfa_index into flow_action_entry
  flow_offload: restore ability to collect separate stats per action
  flow_offload: support CVLAN match

 include/net/flow_offload.h |  3 +++
 include/net/pkt_cls.h  |  2 ++
 net/core/flow_offload.c|  7 +++
 net/sched/cls_api.c|  1 +
 net/sched/cls_flower.c | 30 ++
 5 files changed, 43 insertions(+)


[PATCH net-next 2/4] net: use indirect calls helpers for L3 handler hooks

2019-05-03 Thread Paolo Abeni
So that we avoid another indirect call per RX packet in the common
case.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c  | 6 +-
 net/ipv6/ip6_input.c | 7 ++-
 net/ipv6/tcp_ipv6.c  | 3 ++-
 net/ipv6/udp.c   | 3 ++-
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 1132d6d1796a..8d78de4b0304 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -130,6 +130,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -188,6 +189,8 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }
 
+INDIRECT_CALLABLE_DECLARE(int udp_rcv(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int tcp_v4_rcv(struct sk_buff *));
 void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
protocol)
 {
const struct net_protocol *ipprot;
@@ -205,7 +208,8 @@ void ip_protocol_deliver_rcu(struct net *net, struct 
sk_buff *skb, int protocol)
}
nf_reset(skb);
}
-   ret = ipprot->handler(skb);
+   ret = INDIRECT_CALL_2(ipprot->handler, tcp_v4_rcv, udp_rcv,
+ skb);
if (ret < 0) {
protocol = -ret;
goto resubmit;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index c7ed2b6d5a1d..adf06159837f 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -316,6 +317,9 @@ void ipv6_list_rcv(struct list_head *head, struct 
packet_type *pt,
ip6_sublist_rcv(&sublist, curr_dev, curr_net);
 }
 
+INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int tcp_v6_rcv(struct sk_buff *));
+
 /*
  * Deliver the packet to the host
  */
@@ -391,7 +395,8 @@ void ip6_protocol_deliver_rcu(struct net *net, struct 
sk_buff *skb, int nexthdr,
!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard;
 
-   ret = ipprot->handler(skb);
+   ret = INDIRECT_CALL_2(ipprot->handler, tcp_v6_rcv, udpv6_rcv,
+ skb);
if (ret > 0) {
if (ipprot->flags & INET6_PROTO_FINAL) {
/* Not an extension header, most likely UDP
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 82018bdce863..d58bf84e0f9a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1435,7 +1436,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const 
struct ipv6hdr *hdr,
skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
 }
 
-static int tcp_v6_rcv(struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 {
struct sk_buff *skb_to_free;
int sdif = inet6_sdif(skb);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2464fba569b4..b3fcafaf5576 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1021,7 +1022,7 @@ static void udp_v6_early_demux(struct sk_buff *skb)
}
 }
 
-static __inline__ int udpv6_rcv(struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
 {
return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
 }
-- 
2.20.1



[PATCH net-next 3/4] net: use indirect calls helpers at early demux stage

2019-05-03 Thread Paolo Abeni
So that we avoid another indirect call per RX packet, if
early demux is enabled.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c  | 5 -
 net/ipv6/ip6_input.c | 5 -
 net/ipv6/tcp_ipv6.c  | 2 +-
 net/ipv6/udp.c   | 2 +-
 4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 8d78de4b0304..ed97724c5e33 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -309,6 +309,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb, 
struct net_device *dev)
return true;
 }
 
+INDIRECT_CALLABLE_DECLARE(int udp_v4_early_demux(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int tcp_v4_early_demux(struct sk_buff *));
 static int ip_rcv_finish_core(struct net *net, struct sock *sk,
  struct sk_buff *skb, struct net_device *dev)
 {
@@ -326,7 +328,8 @@ static int ip_rcv_finish_core(struct net *net, struct sock 
*sk,
 
ipprot = rcu_dereference(inet_protos[protocol]);
if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
-   err = edemux(skb);
+   err = INDIRECT_CALL_2(edemux, tcp_v4_early_demux,
+ udp_v4_early_demux, skb);
if (unlikely(err))
goto drop_error;
/* must reload iph, skb->head might have changed */
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index adf06159837f..b50b1af1f530 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -48,6 +48,8 @@
 #include 
 #include 
 
+INDIRECT_CALLABLE_DECLARE(void udp_v6_early_demux(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(void tcp_v6_early_demux(struct sk_buff *));
 static void ip6_rcv_finish_core(struct net *net, struct sock *sk,
struct sk_buff *skb)
 {
@@ -58,7 +60,8 @@ static void ip6_rcv_finish_core(struct net *net, struct sock 
*sk,
 
ipprot = rcu_dereference(inet6_protos[ipv6_hdr(skb)->nexthdr]);
if (ipprot && (edemux = READ_ONCE(ipprot->early_demux)))
-   edemux(skb);
+   INDIRECT_CALL_2(edemux, tcp_v6_early_demux,
+   udp_v6_early_demux, skb);
}
if (!skb_valid_dst(skb))
ip6_route_input(skb);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d58bf84e0f9a..beaf28456301 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1655,7 +1655,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
goto discard_it;
 }
 
-static void tcp_v6_early_demux(struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb)
 {
const struct ipv6hdr *hdr;
const struct tcphdr *th;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b3fcafaf5576..07fa579dfb96 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -981,7 +981,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net,
return NULL;
 }
 
-static void udp_v6_early_demux(struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
 {
struct net *net = dev_net(skb->dev);
const struct udphdr *uh;
-- 
2.20.1



[PATCH net-next 0/4] net: extend indirect calls helper usage

2019-05-03 Thread Paolo Abeni
This series applies the indirect calls helper introduced with commit 
283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect 
calls of builtin") to more hooks inside the network stack.

Overall this avoids up to 4 indirect calls for each RX packet,
giving a small but measurable gain in TCP_RR workloads and 5% under UDP
flood.
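
For context, the helpers essentially turn the indirect call into a
compare-and-branch against the expected builtin handlers; simplified from
include/linux/indirect_call_wrapper.h, they look roughly like:

	#define INDIRECT_CALL_1(f, f1, ...)					\
		({								\
			likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__);	\
		})

	#define INDIRECT_CALL_2(f, f2, f1, ...)					\
		({								\
			likely(f == f2) ? f2(__VA_ARGS__) :			\
					  INDIRECT_CALL_1(f, f1, __VA_ARGS__);	\
		})

When retpolines are not enabled, the wrappers simply collapse back to the
plain indirect call.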

Paolo Abeni (4):
  net: use indirect calls helpers for ptype hook
  net: use indirect calls helpers for L3 handler hooks
  net: use indirect calls helpers at early demux stage
  net: use indirect calls helpers at the socket layer

 net/core/dev.c   |  6 --
 net/ipv4/ip_input.c  | 11 +--
 net/ipv6/ip6_input.c | 12 ++--
 net/ipv6/tcp_ipv6.c  |  5 +++--
 net/ipv6/udp.c   |  5 +++--
 net/socket.c | 20 
 6 files changed, 45 insertions(+), 14 deletions(-)

-- 
2.20.1



[PATCH net-next 4/4] net: use indirect calls helpers at the socket layer

2019-05-03 Thread Paolo Abeni
This avoids an indirect call per {send,recv}msg syscall in
the common (IPv6 or IPv4 socket) case.

Signed-off-by: Paolo Abeni 
---
 net/socket.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index a180e1a9ff23..472fbefa5d9b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -108,6 +109,13 @@
 #include 
 #include 
 
+/* proto_ops for ipv4 and ipv6 use the same {recv,send}msg function */
+#if IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET4(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#else
+#define INDIRECT_CALL_INET4(f, f1, ...) f(__VA_ARGS__)
+#endif
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 unsigned int sysctl_net_busy_read __read_mostly;
 unsigned int sysctl_net_busy_poll __read_mostly;
@@ -645,10 +653,12 @@ EXPORT_SYMBOL(__sock_tx_timestamp);
  * Sends @msg through @sock, passing through LSM.
  * Returns the number of bytes sent, or an error code.
  */
-
+INDIRECT_CALLABLE_DECLARE(int inet_sendmsg(struct socket *, struct msghdr *,
+  size_t));
 static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
 {
-   int ret = sock->ops->sendmsg(sock, msg, msg_data_left(msg));
+   int ret = INDIRECT_CALL_INET4(sock->ops->sendmsg, inet_sendmsg, sock,
+ msg, msg_data_left(msg));
BUG_ON(ret == -EIOCBQUEUED);
return ret;
 }
@@ -874,11 +884,13 @@ EXPORT_SYMBOL_GPL(__sock_recv_ts_and_drops);
  * Receives @msg from @sock, passing through LSM. Returns the total number
  * of bytes received, or an error.
  */
-
+INDIRECT_CALLABLE_DECLARE(int inet_recvmsg(struct socket *, struct msghdr *,
+  size_t , int ));
 static inline int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
 int flags)
 {
-   return sock->ops->recvmsg(sock, msg, msg_data_left(msg), flags);
+   return INDIRECT_CALL_INET4(sock->ops->recvmsg, inet_recvmsg, sock, msg,
+  msg_data_left(msg), flags);
 }
 
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, int flags)
-- 
2.20.1



[PATCH net-next 1/4] net: use indirect calls helpers for ptype hook

2019-05-03 Thread Paolo Abeni
This avoids an indirect call per RX IPv6/IPv4 packet.
Note that we don't want to use the indirect calls helper for taps.

Signed-off-by: Paolo Abeni 
---
 net/core/dev.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 22f2640f559a..108ac8137b9b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4987,7 +4987,8 @@ static int __netif_receive_skb_one_core(struct sk_buff 
*skb, bool pfmemalloc)
 
ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
if (pt_prev)
-   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   ret = INDIRECT_CALL_INET(pt_prev->func, ipv6_rcv, ip_rcv, skb,
+skb->dev, pt_prev, orig_dev);
return ret;
 }
 
@@ -5033,7 +5034,8 @@ static inline void __netif_receive_skb_list_ptype(struct 
list_head *head,
else
list_for_each_entry_safe(skb, next, head, list) {
skb_list_del_init(skb);
-   pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   INDIRECT_CALL_INET(pt_prev->func, ipv6_rcv, ip_rcv, skb,
+  skb->dev, pt_prev, orig_dev);
}
 }
 
-- 
2.20.1



[RFC PATCH net-next 1/3] flow_offload: copy tcfa_index into flow_action_entry

2019-05-03 Thread Edward Cree
Required for support of shared counters (and possibly other shared per-
 action entities in future).

Signed-off-by: Edward Cree 
---
 include/net/flow_offload.h | 1 +
 net/sched/cls_api.c| 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index d035183c8d03..6f59cdaf6eb6 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -135,6 +135,7 @@ enum flow_action_mangle_base {
 
 struct flow_action_entry {
enum flow_action_id id;
+   u32 action_index;
union {
u32 chain_index;/* FLOW_ACTION_GOTO */
struct net_device   *dev;   /* FLOW_ACTION_REDIRECT 
*/
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 263c2ec082c9..835f3129c24f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3193,6 +3193,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
struct flow_action_entry *entry;
 
entry = &flow_action->entries[j];
+   entry->action_index = act->tcfa_index;
if (is_tcf_gact_ok(act)) {
entry->id = FLOW_ACTION_ACCEPT;
} else if (is_tcf_gact_shot(act)) {


Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Eric Dumazet
On Fri, May 3, 2019 at 10:55 AM Nicolas Dichtel
 wrote:
>
> Le 03/05/2019 à 13:47, Eric Dumazet a écrit :
> > Since ip6frag_expire_frag_queue() now pulls the head skb
> > from frag queue, we should no longer use skb_get(), since
> > this leads to an skb leak.
> >
> > Stefan Bader initially reported a problem in 4.4.stable [1] caused
> > by the skb_get(), so this patch should also fix this issue.
> >
> > 296583.091021] kernel BUG at 
> > /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> > [296583.091734] Call Trace:
> > [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> > [296583.091764]  [] _decode_session6+0x26a/0x400
> > [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> > [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> > [296583.091809]  [] icmp6_send+0x5e1/0x940
> > [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> > [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> > [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 
> > [ixgbe]
> > [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> > [nf_defrag_ipv6]
> > [296583.091893]  [] icmpv6_send+0x21/0x30
> > [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> > [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> > [nf_defrag_ipv6]
> > [296583.091938]  [] call_timer_fn+0x37/0x140
> > [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> > [nf_defrag_ipv6]
> > [296583.091968]  [] run_timer_softirq+0x234/0x330
> > [296583.091982]  [] __do_softirq+0x109/0x2b0
> >
> > Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> > Signed-off-by: Eric Dumazet 
> > Reported-by: Stfan Bader 
> nit: the 'e' is missing in Stefan ;-)

Indeed, copy/paste error, thanks.


Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Nicolas Dichtel
Le 03/05/2019 à 13:47, Eric Dumazet a écrit :
> Since ip6frag_expire_frag_queue() now pulls the head skb
> from frag queue, we should no longer use skb_get(), since
> this leads to an skb leak.
> 
> Stefan Bader initially reported a problem in 4.4.stable [1] caused
> by the skb_get(), so this patch should also fix this issue.
> 
> 296583.091021] kernel BUG at 
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> [296583.091764]  [] _decode_session6+0x26a/0x400
> [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809]  [] icmp6_send+0x5e1/0x940
> [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091893]  [] icmpv6_send+0x21/0x30
> [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> [nf_defrag_ipv6]
> [296583.091938]  [] call_timer_fn+0x37/0x140
> [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091968]  [] run_timer_softirq+0x234/0x330
> [296583.091982]  [] __do_softirq+0x109/0x2b0
> 
> Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> Signed-off-by: Eric Dumazet 
> Reported-by: Stfan Bader 
nit: the 'e' is missing in Stefan ;-)


[RFC PATCH net-next 2/3] flow_offload: restore ability to collect separate stats per action

2019-05-03 Thread Edward Cree
Introduce a new offload command TC_CLSFLOWER_STATS_BYINDEX, similar to
 the existing TC_CLSFLOWER_STATS but specifying an action_index (the
 tcfa_index of the action), which is called for each stats-having action
 on the rule.  Drivers should implement either, but not both, of these
 commands.
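
A hypothetical fragment of a driver-side handler, just to show where the
per-action index arrives; struct my_priv, struct my_counter and
my_counter_find() are made up for illustration and are not part of this
series:

	static int example_stats_byindex(struct my_priv *priv,
					 struct tc_cls_flower_offload *f)
	{
		struct my_counter *cnt = my_counter_find(priv, f->action_index);

		if (!cnt)
			return -ENOENT;

		/* Counters are keyed on the action index, so rules sharing an
		 * action (same tcfa_index) report from the same counter.
		 */
		flow_stats_update(&f->stats, cnt->bytes, cnt->packets,
				  cnt->lastused);
		return 0;
	}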

Signed-off-by: Edward Cree 
---
 include/net/pkt_cls.h  |  2 ++
 net/sched/cls_flower.c | 30 ++
 2 files changed, 32 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index d5e7a1af346f..0e33c52c23a8 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -762,6 +762,7 @@ enum tc_fl_command {
TC_CLSFLOWER_REPLACE,
TC_CLSFLOWER_DESTROY,
TC_CLSFLOWER_STATS,
+   TC_CLSFLOWER_STATS_BYINDEX,
TC_CLSFLOWER_TMPLT_CREATE,
TC_CLSFLOWER_TMPLT_DESTROY,
 };
@@ -773,6 +774,7 @@ struct tc_cls_flower_offload {
struct flow_rule *rule;
struct flow_stats stats;
u32 classid;
+   u32 action_index; /* for TC_CLSFLOWER_STATS_BYINDEX */
 };
 
 static inline struct flow_rule *
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index f6685fc53119..be339cd6a86e 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -474,6 +474,10 @@ static void fl_hw_update_stats(struct tcf_proto *tp, 
struct cls_fl_filter *f,
 {
struct tc_cls_flower_offload cls_flower = {};
struct tcf_block *block = tp->chain->block;
+#ifdef CONFIG_NET_CLS_ACT
+   struct tc_action *a;
+   int i;
+#endif
 
if (!rtnl_held)
rtnl_lock();
@@ -489,6 +493,32 @@ static void fl_hw_update_stats(struct tcf_proto *tp, 
struct cls_fl_filter *f,
  cls_flower.stats.pkts,
  cls_flower.stats.lastused);
 
+#ifdef CONFIG_NET_CLS_ACT
+   for (i = 0; i < f->exts.nr_actions; i++) {
+   a = f->exts.actions[i];
+
+   if (!a->ops->stats_update)
+   continue;
+   memset(&cls_flower, 0, sizeof(cls_flower));
+   tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, 
NULL);
+   cls_flower.command = TC_CLSFLOWER_STATS_BYINDEX;
+   cls_flower.cookie = (unsigned long) f;
+   cls_flower.classid = f->res.classid;
+   cls_flower.action_index = a->tcfa_index;
+
+   tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false);
+
+   /* Some ->stats_update() use percpu variables and must thus be
+* called with preemption disabled.
+*/
+   preempt_disable();
+   a->ops->stats_update(a, cls_flower.stats.bytes,
+cls_flower.stats.pkts,
+cls_flower.stats.lastused, true);
+   preempt_enable();
+   }
+#endif
+
if (!rtnl_held)
rtnl_unlock();
 }


[RFC PATCH net-next 3/3] flow_offload: support CVLAN match

2019-05-03 Thread Edward Cree
Plumb it through from the flow_dissector.

Signed-off-by: Edward Cree 
---
 include/net/flow_offload.h | 2 ++
 net/core/flow_offload.c| 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 6f59cdaf6eb6..48847ee7aa3a 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -71,6 +71,8 @@ void flow_rule_match_eth_addrs(const struct flow_rule *rule,
   struct flow_match_eth_addrs *out);
 void flow_rule_match_vlan(const struct flow_rule *rule,
  struct flow_match_vlan *out);
+void flow_rule_match_cvlan(const struct flow_rule *rule,
+  struct flow_match_vlan *out);
 void flow_rule_match_ipv4_addrs(const struct flow_rule *rule,
struct flow_match_ipv4_addrs *out);
 void flow_rule_match_ipv6_addrs(const struct flow_rule *rule,
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index c3a00eac4804..5ce7d47a960e 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -54,6 +54,13 @@ void flow_rule_match_vlan(const struct flow_rule *rule,
 }
 EXPORT_SYMBOL(flow_rule_match_vlan);
 
+void flow_rule_match_cvlan(const struct flow_rule *rule,
+  struct flow_match_vlan *out)
+{
+   FLOW_DISSECTOR_MATCH(rule, FLOW_DISSECTOR_KEY_CVLAN, out);
+}
+EXPORT_SYMBOL(flow_rule_match_cvlan);
+
 void flow_rule_match_ipv4_addrs(const struct flow_rule *rule,
struct flow_match_ipv4_addrs *out)
 {


[PATCH v2 net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Eric Dumazet
Since ip6frag_expire_frag_queue() now pulls the head skb
from frag queue, we should no longer use skb_get(), since
this leads to an skb leak.

Stefan Bader initially reported a problem in 4.4.stable [1] caused
by the skb_get(), so this patch should also fix this issue.

[296583.091021] kernel BUG at 
/build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
[296583.091734] Call Trace:
[296583.091749]  [] __pskb_pull_tail+0x50/0x350
[296583.091764]  [] _decode_session6+0x26a/0x400
[296583.091779]  [] __xfrm_decode_session+0x39/0x50
[296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
[296583.091809]  [] icmp6_send+0x5e1/0x940
[296583.091823]  [] ? __netif_receive_skb+0x18/0x60
[296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
[296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
[296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091893]  [] icmpv6_send+0x21/0x30
[296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
[296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
[nf_defrag_ipv6]
[296583.091938]  [] call_timer_fn+0x37/0x140
[296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
[nf_defrag_ipv6]
[296583.091968]  [] run_timer_softirq+0x234/0x330
[296583.091982]  [] __do_softirq+0x109/0x2b0

Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
Signed-off-by: Eric Dumazet 
Reported-by: Stefan Bader 
Cc: Peter Oskolkov 
Cc: Florian Westphal 
---
v2: fixed typo in Stefan email in the Reported-by: tag

 include/net/ipv6_frag.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 
28aa9b30aeceac9a86ee6754e4b5809be115e947..1f77fb4dc79df6bc4e41d6d2f4d49ace32082ca4
 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue 
*fq)
goto out;
 
head->dev = dev;
-   skb_get(head);
spin_unlock(&fq->q.lock);
 
icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-- 
2.21.0.1020.gf2820cf01a-goog



Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Peter Oskolkov
On Fri, May 3, 2019 at 4:47 AM Eric Dumazet  wrote:
>
> Since ip6frag_expire_frag_queue() now pulls the head skb
> from frag queue, we should no longer use skb_get(), since
> this leads to an skb leak.
>
> Stefan Bader initially reported a problem in 4.4.stable [1] caused
> by the skb_get(), so this patch should also fix this issue.
>
> 296583.091021] kernel BUG at 
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749]  [] __pskb_pull_tail+0x50/0x350
> [296583.091764]  [] _decode_session6+0x26a/0x400
> [296583.091779]  [] __xfrm_decode_session+0x39/0x50
> [296583.091795]  [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809]  [] icmp6_send+0x5e1/0x940
> [296583.091823]  [] ? __netif_receive_skb+0x18/0x60
> [296583.091838]  [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858]  [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091893]  [] icmpv6_send+0x21/0x30
> [296583.091906]  [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921]  [] nf_ct_frag6_expire+0x1f/0x30 
> [nf_defrag_ipv6]
> [296583.091938]  [] call_timer_fn+0x37/0x140
> [296583.091951]  [] ? nf_ct_net_exit+0x50/0x50 
> [nf_defrag_ipv6]
> [296583.091968]  [] run_timer_softirq+0x234/0x330
> [296583.091982]  [] __do_softirq+0x109/0x2b0
>
> Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> Signed-off-by: Eric Dumazet 
> Reported-by: Stfan Bader 
> Cc: Peter Oskolkov 
> Cc: Florian Westphal 
> ---
>  include/net/ipv6_frag.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> index 
> 28aa9b30aeceac9a86ee6754e4b5809be115e947..1f77fb4dc79df6bc4e41d6d2f4d49ace32082ca4
>  100644
> --- a/include/net/ipv6_frag.h
> +++ b/include/net/ipv6_frag.h
> @@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct 
> frag_queue *fq)
> goto out;
>
> head->dev = dev;
> -   skb_get(head);

This skb_get was introduced by commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
"ipv6: frags: rewrite ip6_expire_frag_queue()", and the rbtree patch
is not in 4.4, where the bug was reported.
Shouldn't the "Fixes" tag also reference the original patch?


> spin_unlock(&fq->q.lock);
>
> icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
> --
> 2.21.0.1020.gf2820cf01a-goog
>


[PATCH ipsec-next 1/6] xfrm: remove init_tempsel indirection from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
Simple initialization, handle it in the caller.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |  2 --
 net/ipv4/xfrm4_state.c | 19 --
 net/ipv6/xfrm6_state.c | 21 
 net/xfrm/xfrm_state.c  | 56 --
 4 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index eb5018b1cf9c..9f97e6c1f3ee 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -353,8 +353,6 @@ struct xfrm_state_afinfo {
const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
 
int (*init_flags)(struct xfrm_state *x);
-   void(*init_tempsel)(struct xfrm_selector *sel,
-   const struct flowi *fl);
void(*init_temprop)(struct xfrm_state *x,
const struct xfrm_tmpl *tmpl,
const xfrm_address_t *daddr,
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index 80c40b4981bb..da0fd9556d57 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -22,24 +22,6 @@ static int xfrm4_init_flags(struct xfrm_state *x)
return 0;
 }
 
-static void
-__xfrm4_init_tempsel(struct xfrm_selector *sel, const struct flowi *fl)
-{
-   const struct flowi4 *fl4 = &fl->u.ip4;
-
-   sel->daddr.a4 = fl4->daddr;
-   sel->saddr.a4 = fl4->saddr;
-   sel->dport = xfrm_flowi_dport(fl, &fl4->uli);
-   sel->dport_mask = htons(0x);
-   sel->sport = xfrm_flowi_sport(fl, &fl4->uli);
-   sel->sport_mask = htons(0x);
-   sel->family = AF_INET;
-   sel->prefixlen_d = 32;
-   sel->prefixlen_s = 32;
-   sel->proto = fl4->flowi4_proto;
-   sel->ifindex = fl4->flowi4_oif;
-}
-
 static void
 xfrm4_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
   const xfrm_address_t *daddr, const xfrm_address_t *saddr)
@@ -77,7 +59,6 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {
.eth_proto  = htons(ETH_P_IP),
.owner  = THIS_MODULE,
.init_flags = xfrm4_init_flags,
-   .init_tempsel   = __xfrm4_init_tempsel,
.init_temprop   = xfrm4_init_temprop,
.output = xfrm4_output,
.output_finish  = xfrm4_output_finish,
diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 5bdca3d5d6b7..0e19ded3e33b 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -21,26 +21,6 @@
 #include 
 #include 
 
-static void
-__xfrm6_init_tempsel(struct xfrm_selector *sel, const struct flowi *fl)
-{
-   const struct flowi6 *fl6 = &fl->u.ip6;
-
-   /* Initialize temporary selector matching only
-* to current session. */
-   *(struct in6_addr *)&sel->daddr = fl6->daddr;
-   *(struct in6_addr *)&sel->saddr = fl6->saddr;
-   sel->dport = xfrm_flowi_dport(fl, &fl6->uli);
-   sel->dport_mask = htons(0x);
-   sel->sport = xfrm_flowi_sport(fl, &fl6->uli);
-   sel->sport_mask = htons(0x);
-   sel->family = AF_INET6;
-   sel->prefixlen_d = 128;
-   sel->prefixlen_s = 128;
-   sel->proto = fl6->flowi6_proto;
-   sel->ifindex = fl6->flowi6_oif;
-}
-
 static void
 xfrm6_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
   const xfrm_address_t *daddr, const xfrm_address_t *saddr)
@@ -173,7 +153,6 @@ static struct xfrm_state_afinfo xfrm6_state_afinfo = {
.proto  = IPPROTO_IPV6,
.eth_proto  = htons(ETH_P_IPV6),
.owner  = THIS_MODULE,
-   .init_tempsel   = __xfrm6_init_tempsel,
.init_temprop   = xfrm6_init_temprop,
.tmpl_sort  = __xfrm6_tmpl_sort,
.state_sort = __xfrm6_state_sort,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index ed25eb81aabe..f93c6dc57754 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -767,6 +767,43 @@ void xfrm_sad_getinfo(struct net *net, struct 
xfrmk_sadinfo *si)
 }
 EXPORT_SYMBOL(xfrm_sad_getinfo);
 
+static void
+__xfrm4_init_tempsel(struct xfrm_selector *sel, const struct flowi *fl)
+{
+   const struct flowi4 *fl4 = &fl->u.ip4;
+
+   sel->daddr.a4 = fl4->daddr;
+   sel->saddr.a4 = fl4->saddr;
+   sel->dport = xfrm_flowi_dport(fl, &fl4->uli);
+   sel->dport_mask = htons(0x);
+   sel->sport = xfrm_flowi_sport(fl, &fl4->uli);
+   sel->sport_mask = htons(0x);
+   sel->family = AF_INET;
+   sel->prefixlen_d = 32;
+   sel->prefixlen_s = 32;
+   sel->proto = fl4->flowi4_proto;
+   sel->ifindex = fl4->flowi4_oif;
+}
+
+static void
+__xfrm6_init_tempsel(struct xfrm_selector *sel, const struct flowi *fl)
+{
+   const struct flowi6 *fl6 = &fl->u.ip6;
+
+   /* Initialize temp

[PATCH ipsec-next 4/6] xfrm: remove state and template sort indirections from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
There is no module dependency; placing this in xfrm_state.c avoids the
need for an indirection.

This also removes the state spinlock -- I don't see why we would need
to hold it during sorting.

This in turn allows removing the 'net' argument passed to
xfrm_tmpl_sort.  Last, remove the EXPORT_SYMBOL; there are no modular
callers.

For the CONFIG_IPV6=m case, the vmlinux size increase is about 300 bytes.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |  18 +++---
 net/ipv6/xfrm6_state.c |  98 ---
 net/xfrm/xfrm_policy.c |   2 +-
 net/xfrm/xfrm_state.c  | 129 -
 4 files changed, 110 insertions(+), 137 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 5d1c2bdee91e..5f35f79eb661 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -352,8 +352,6 @@ struct xfrm_state_afinfo {
const struct xfrm_type  *type_map[IPPROTO_MAX];
const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
 
-   int (*tmpl_sort)(struct xfrm_tmpl **dst, struct 
xfrm_tmpl **src, int n);
-   int (*state_sort)(struct xfrm_state **dst, struct 
xfrm_state **src, int n);
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
int (*output_finish)(struct sock *sk, struct 
sk_buff *skb);
int (*extract_input)(struct xfrm_state *x,
@@ -1483,21 +1481,19 @@ struct xfrm_state *xfrm_state_lookup_byaddr(struct net 
*net, u32 mark,
u8 proto,
unsigned short family);
 #ifdef CONFIG_XFRM_SUB_POLICY
-int xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
-  unsigned short family, struct net *net);
-int xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
+void xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
unsigned short family);
+void xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
+unsigned short family);
 #else
-static inline int xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl 
**src,
-int n, unsigned short family, struct net *net)
+static inline void xfrm_tmpl_sort(struct xfrm_tmpl **d, struct xfrm_tmpl **s,
+ int n, unsigned short family)
 {
-   return -ENOSYS;
 }
 
-static inline int xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state 
**src,
- int n, unsigned short family)
+static inline void xfrm_state_sort(struct xfrm_state **d, struct xfrm_state 
**s,
+  int n, unsigned short family)
 {
-   return -ENOSYS;
 }
 #endif
 
diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index aa5d2c52cc31..1782ebb22dd3 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -21,102 +21,6 @@
 #include 
 #include 
 
-/* distribution counting sort function for xfrm_state and xfrm_tmpl */
-static int
-__xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
-{
-   int count[XFRM_MAX_DEPTH] = { };
-   int class[XFRM_MAX_DEPTH];
-   int i;
-
-   for (i = 0; i < n; i++) {
-   int c;
-   class[i] = c = cmp(src[i]);
-   count[c]++;
-   }
-
-   for (i = 2; i < maxclass; i++)
-   count[i] += count[i - 1];
-
-   for (i = 0; i < n; i++) {
-   dst[count[class[i] - 1]++] = src[i];
-   src[i] = NULL;
-   }
-
-   return 0;
-}
-
-/*
- * Rule for xfrm_state:
- *
- * rule 1: select IPsec transport except AH
- * rule 2: select MIPv6 RO or inbound trigger
- * rule 3: select IPsec transport AH
- * rule 4: select IPsec tunnel
- * rule 5: others
- */
-static int __xfrm6_state_sort_cmp(void *p)
-{
-   struct xfrm_state *v = p;
-
-   switch (v->props.mode) {
-   case XFRM_MODE_TRANSPORT:
-   if (v->id.proto != IPPROTO_AH)
-   return 1;
-   else
-   return 3;
-#if IS_ENABLED(CONFIG_IPV6_MIP6)
-   case XFRM_MODE_ROUTEOPTIMIZATION:
-   case XFRM_MODE_IN_TRIGGER:
-   return 2;
-#endif
-   case XFRM_MODE_TUNNEL:
-   case XFRM_MODE_BEET:
-   return 4;
-   }
-   return 5;
-}
-
-static int
-__xfrm6_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n)
-{
-   return __xfrm6_sort((void **)dst, (void **)src, n,
-   __xfrm6_state_sort_cmp, 6);
-}
-
-/*
- * Rule for xfrm_tmpl:
- *
- * rule 1: select IPsec transport
- * rule 2: select MIPv6 RO or inbound trigger
- * rule 3: select IPsec tunnel
- * rule 4: others
- */
-static int __xfrm6_tmpl_sort_cmp(void *p)
-{
-   struct xfrm_tmpl *v = p;
-   switch (v->mode) {
-   case XFRM_MODE_TRANSPORT

[PATCH ipsec-next 0/6] xfrm: reduce xfrm_state_afinfo size

2019-05-03 Thread Florian Westphal
xfrm_state_afinfo is a very large struct; it's over 4 kbytes on 64-bit systems.

The size comes from two arrays to store the l4 protocol type pointers
(esp, ah, ipcomp and so on).

There are only a handful of those, so just use pointers for protocols
that we implement instead of mostly-empty arrays.

This also removes the template init/sort related indirections.
Structure size goes down to 120 bytes on x86_64.
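
[For a rough feel of where the 4k comes from -- assuming 8-byte pointers
and IPPROTO_MAX == 256 as in the kernel; illustrative arithmetic only,
not taken from the patches themselves:]

	/* Per-afinfo cost of the two lookup arrays on 64-bit: */
	const struct xfrm_type         *type_map[IPPROTO_MAX];         /* 256 * 8 = 2048 bytes */
	const struct xfrm_type_offload *type_offload_map[IPPROTO_MAX]; /* 256 * 8 = 2048 bytes */
	/* The arrays alone account for 4096 bytes; a handful of named
	 * per-protocol pointers plus the remaining callbacks fit in the
	 * ~120 bytes quoted above.
	 */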

 include/net/xfrm.h  |   49 ++---
 net/ipv4/ah4.c  |3 
 net/ipv4/esp4.c |3 
 net/ipv4/esp4_offload.c |4 
 net/ipv4/ipcomp.c   |3 
 net/ipv4/xfrm4_state.c  |   45 -
 net/ipv4/xfrm4_tunnel.c |3 
 net/ipv6/ah6.c  |4 
 net/ipv6/esp6.c |3 
 net/ipv6/esp6_offload.c |4 
 net/ipv6/ipcomp6.c  |3 
 net/ipv6/mip6.c |6 
 net/ipv6/xfrm6_state.c  |  137 
 net/xfrm/xfrm_input.c   |   24 +-
 net/xfrm/xfrm_policy.c  |2 
 net/xfrm/xfrm_state.c   |  400 +++-
 16 files changed, 343 insertions(+), 350 deletions(-)

Florian Westphal (6):
  xfrm: remove init_tempsel indirection from xfrm_state_afinfo
  xfrm: remove init_temprop indirection from xfrm_state_afinfo
  xfrm: remove init_flags indirection from xfrm_state_afinfo
  xfrm: remove state and template sort indirections from xfrm_state_afinfo
  xfrm: remove eth_proto value from xfrm_state_afinfo
  xfrm: remove type and offload_type map from xfrm_state_afinfo




[PATCH ipsec-next 5/6] xfrm: remove eth_proto value from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
xfrm_prepare_input needs to look up the state afinfo backend again to fetch
the address family's ethernet protocol value.

There are only two address families, so a switch statement is simpler.
While at it, use u8 for family and proto and remove the owner member --
it's not used anywhere.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |  6 ++
 net/ipv4/xfrm4_state.c |  2 --
 net/ipv6/xfrm6_state.c |  2 --
 net/xfrm/xfrm_input.c  | 24 
 4 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 5f35f79eb661..6ae52baa0ce7 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -345,10 +345,8 @@ void km_state_expired(struct xfrm_state *x, int hard, u32 
portid);
 int __xfrm_state_delete(struct xfrm_state *x);
 
 struct xfrm_state_afinfo {
-   unsigned intfamily;
-   unsigned intproto;
-   __be16  eth_proto;
-   struct module   *owner;
+   u8  family;
+   u8  proto;
const struct xfrm_type  *type_map[IPPROTO_MAX];
const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
 
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index 62c96da38b4e..f8ed3c3bb928 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -34,8 +34,6 @@ int xfrm4_extract_header(struct sk_buff *skb)
 static struct xfrm_state_afinfo xfrm4_state_afinfo = {
.family = AF_INET,
.proto  = IPPROTO_IPIP,
-   .eth_proto  = htons(ETH_P_IP),
-   .owner  = THIS_MODULE,
.output = xfrm4_output,
.output_finish  = xfrm4_output_finish,
.extract_input  = xfrm4_extract_input,
diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 1782ebb22dd3..78daadecbdef 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -40,8 +40,6 @@ int xfrm6_extract_header(struct sk_buff *skb)
 static struct xfrm_state_afinfo xfrm6_state_afinfo = {
.family = AF_INET6,
.proto  = IPPROTO_IPV6,
-   .eth_proto  = htons(ETH_P_IPV6),
-   .owner  = THIS_MODULE,
.output = xfrm6_output,
.output_finish  = xfrm6_output_finish,
.extract_input  = xfrm6_extract_input,
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 314973aaa414..8a00cc94c32c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -359,28 +359,28 @@ static int xfrm_prepare_input(struct xfrm_state *x, 
struct sk_buff *skb)
afinfo = xfrm_state_afinfo_get_rcu(x->outer_mode.family);
if (likely(afinfo))
err = afinfo->extract_input(x, skb);
+   rcu_read_unlock();
 
-   if (err) {
-   rcu_read_unlock();
+   if (err)
return err;
-   }
 
if (x->sel.family == AF_UNSPEC) {
inner_mode = xfrm_ip2inner_mode(x, 
XFRM_MODE_SKB_CB(skb)->protocol);
-   if (!inner_mode) {
-   rcu_read_unlock();
+   if (!inner_mode)
return -EAFNOSUPPORT;
-   }
}
 
-   afinfo = xfrm_state_afinfo_get_rcu(inner_mode->family);
-   if (unlikely(!afinfo)) {
-   rcu_read_unlock();
-   return -EAFNOSUPPORT;
+   switch (inner_mode->family) {
+   case AF_INET:
+   skb->protocol = htons(ETH_P_IP);
+   break;
+   case AF_INET6:
+   skb->protocol = htons(ETH_P_IPV6);
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   break;
}
 
-   skb->protocol = afinfo->eth_proto;
-   rcu_read_unlock();
return xfrm_inner_mode_encap_remove(x, inner_mode, skb);
 }
 
-- 
2.21.0



[PATCH ipsec-next 6/6] xfrm: remove type and offload_type map from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
Only a handful of xfrm_types exist, no need to have 512 pointers for them.

Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms.

Also, the unregister function doesn't need to return an error; no
caller does anything useful with it.

Just place a WARN_ON() where needed instead.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h  |  16 +++-
 net/ipv4/ah4.c  |   3 +-
 net/ipv4/esp4.c |   3 +-
 net/ipv4/esp4_offload.c |   4 +-
 net/ipv4/ipcomp.c   |   3 +-
 net/ipv4/xfrm4_tunnel.c |   3 +-
 net/ipv6/ah6.c  |   4 +-
 net/ipv6/esp6.c |   3 +-
 net/ipv6/esp6_offload.c |   4 +-
 net/ipv6/ipcomp6.c  |   3 +-
 net/ipv6/mip6.c |   6 +-
 net/xfrm/xfrm_state.c   | 179 
 12 files changed, 150 insertions(+), 81 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 6ae52baa0ce7..939c2a07514a 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -347,8 +347,16 @@ int __xfrm_state_delete(struct xfrm_state *x);
 struct xfrm_state_afinfo {
u8  family;
u8  proto;
-   const struct xfrm_type  *type_map[IPPROTO_MAX];
-   const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
+
+   const struct xfrm_type_offload *type_offload_esp;
+
+   const struct xfrm_type  *type_esp;
+   const struct xfrm_type  *type_ipip;
+   const struct xfrm_type  *type_ipip6;
+   const struct xfrm_type  *type_comp;
+   const struct xfrm_type  *type_ah;
+   const struct xfrm_type  *type_routing;
+   const struct xfrm_type  *type_dstopts;
 
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
int (*output_finish)(struct sock *sk, struct 
sk_buff *skb);
@@ -400,7 +408,7 @@ struct xfrm_type {
 };
 
 int xfrm_register_type(const struct xfrm_type *type, unsigned short family);
-int xfrm_unregister_type(const struct xfrm_type *type, unsigned short family);
+void xfrm_unregister_type(const struct xfrm_type *type, unsigned short family);
 
 struct xfrm_type_offload {
char*description;
@@ -412,7 +420,7 @@ struct xfrm_type_offload {
 };
 
 int xfrm_register_type_offload(const struct xfrm_type_offload *type, unsigned 
short family);
-int xfrm_unregister_type_offload(const struct xfrm_type_offload *type, 
unsigned short family);
+void xfrm_unregister_type_offload(const struct xfrm_type_offload *type, 
unsigned short family);
 
 static inline int xfrm_af2proto(unsigned int family)
 {
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index c01fa791260d..16f8ef06e40c 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -589,8 +589,7 @@ static void __exit ah4_fini(void)
 {
if (xfrm4_protocol_deregister(&ah4_protocol, IPPROTO_AH) < 0)
pr_info("%s: can't remove protocol\n", __func__);
-   if (xfrm_unregister_type(&ah_type, AF_INET) < 0)
-   pr_info("%s: can't remove xfrm type\n", __func__);
+   xfrm_unregister_type(&ah_type, AF_INET);
 }
 
 module_init(ah4_init);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 10e809b296ec..c43fbc93cab3 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -1055,8 +1055,7 @@ static void __exit esp4_fini(void)
 {
if (xfrm4_protocol_deregister(&esp4_protocol, IPPROTO_ESP) < 0)
pr_info("%s: can't remove protocol\n", __func__);
-   if (xfrm_unregister_type(&esp_type, AF_INET) < 0)
-   pr_info("%s: can't remove xfrm type\n", __func__);
+   xfrm_unregister_type(&esp_type, AF_INET);
 }
 
 module_init(esp4_init);
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index b61a8ff558f9..ba5c23e328e0 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -313,9 +313,7 @@ static int __init esp4_offload_init(void)
 
 static void __exit esp4_offload_exit(void)
 {
-   if (xfrm_unregister_type_offload(&esp_type_offload, AF_INET) < 0)
-   pr_info("%s: can't remove xfrm type offload\n", __func__);
-
+   xfrm_unregister_type_offload(&esp_type_offload, AF_INET);
inet_del_offload(&esp4_offload, IPPROTO_ESP);
 }
 
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 9119d012ba46..ee03f0a55152 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -190,8 +190,7 @@ static void __exit ipcomp4_fini(void)
 {
if (xfrm4_protocol_deregister(&ipcomp4_protocol, IPPROTO_COMP) < 0)
pr_info("%s: can't remove protocol\n", __func__);
-   if (xfrm_unregister_type(&ipcomp_type, AF_INET) < 0)
-   pr_info("%s: can't remove xfrm type\n", __func__);
+   xfrm_unregister_type(&ipcomp_type, AF_INET);
 }
 
 module_init(ipcomp4_init);
diff --git a/net/ipv4/xfrm4_tunnel.c b/net/ipv4/xfrm4_tunnel.c
index 06347dbd32c1..9754d69f021b 100644
--- a/net/ipv4/xfrm4_tunnel.c
+++ b/net/ipv4/

[PATCH ipsec-next 3/6] xfrm: remove init_flags indirection from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
There is only one implementation of this function; just call it directly.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |  1 -
 net/ipv4/xfrm4_state.c |  8 
 net/xfrm/xfrm_state.c  | 17 +++--
 3 files changed, 3 insertions(+), 23 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 8ac6d4d617cc..5d1c2bdee91e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -352,7 +352,6 @@ struct xfrm_state_afinfo {
const struct xfrm_type  *type_map[IPPROTO_MAX];
const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
 
-   int (*init_flags)(struct xfrm_state *x);
int (*tmpl_sort)(struct xfrm_tmpl **dst, struct 
xfrm_tmpl **src, int n);
int (*state_sort)(struct xfrm_state **dst, struct 
xfrm_state **src, int n);
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index 018448e222af..62c96da38b4e 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -15,13 +15,6 @@
 #include 
 #include 
 
-static int xfrm4_init_flags(struct xfrm_state *x)
-{
-   if (xs_net(x)->ipv4.sysctl_ip_no_pmtu_disc)
-   x->props.flags |= XFRM_STATE_NOPMTUDISC;
-   return 0;
-}
-
 int xfrm4_extract_header(struct sk_buff *skb)
 {
const struct iphdr *iph = ip_hdr(skb);
@@ -43,7 +36,6 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {
.proto  = IPPROTO_IPIP,
.eth_proto  = htons(ETH_P_IP),
.owner  = THIS_MODULE,
-   .init_flags = xfrm4_init_flags,
.output = xfrm4_output,
.output_finish  = xfrm4_output_finish,
.extract_input  = xfrm4_extract_input,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 42662d9e8b5e..636b055570e7 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -2256,25 +2256,14 @@ int xfrm_state_mtu(struct xfrm_state *x, int mtu)
 
 int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload)
 {
-   const struct xfrm_state_afinfo *afinfo;
const struct xfrm_mode *inner_mode;
const struct xfrm_mode *outer_mode;
int family = x->props.family;
int err;
 
-   err = -EAFNOSUPPORT;
-   afinfo = xfrm_state_get_afinfo(family);
-   if (!afinfo)
-   goto error;
-
-   err = 0;
-   if (afinfo->init_flags)
-   err = afinfo->init_flags(x);
-
-   rcu_read_unlock();
-
-   if (err)
-   goto error;
+   if (family == AF_INET &&
+   xs_net(x)->ipv4.sysctl_ip_no_pmtu_disc)
+   x->props.flags |= XFRM_STATE_NOPMTUDISC;
 
err = -EPROTONOSUPPORT;
 
-- 
2.21.0



[PATCH ipsec-next 2/6] xfrm: remove init_temprop indirection from xfrm_state_afinfo

2019-05-03 Thread Florian Westphal
Same as the previous patch: just place this in the caller; there is no
need for an indirection just for a structure initialization.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |  4 
 net/ipv4/xfrm4_state.c | 16 
 net/ipv6/xfrm6_state.c | 16 
 net/xfrm/xfrm_state.c  | 27 ---
 4 files changed, 20 insertions(+), 43 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 9f97e6c1f3ee..8ac6d4d617cc 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -353,10 +353,6 @@ struct xfrm_state_afinfo {
const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
 
int (*init_flags)(struct xfrm_state *x);
-   void(*init_temprop)(struct xfrm_state *x,
-   const struct xfrm_tmpl *tmpl,
-   const xfrm_address_t *daddr,
-   const xfrm_address_t *saddr);
int (*tmpl_sort)(struct xfrm_tmpl **dst, struct 
xfrm_tmpl **src, int n);
int (*state_sort)(struct xfrm_state **dst, struct 
xfrm_state **src, int n);
int (*output)(struct net *net, struct sock *sk, 
struct sk_buff *skb);
diff --git a/net/ipv4/xfrm4_state.c b/net/ipv4/xfrm4_state.c
index da0fd9556d57..018448e222af 100644
--- a/net/ipv4/xfrm4_state.c
+++ b/net/ipv4/xfrm4_state.c
@@ -22,21 +22,6 @@ static int xfrm4_init_flags(struct xfrm_state *x)
return 0;
 }
 
-static void
-xfrm4_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
-  const xfrm_address_t *daddr, const xfrm_address_t *saddr)
-{
-   x->id = tmpl->id;
-   if (x->id.daddr.a4 == 0)
-   x->id.daddr.a4 = daddr->a4;
-   x->props.saddr = tmpl->saddr;
-   if (x->props.saddr.a4 == 0)
-   x->props.saddr.a4 = saddr->a4;
-   x->props.mode = tmpl->mode;
-   x->props.reqid = tmpl->reqid;
-   x->props.family = AF_INET;
-}
-
 int xfrm4_extract_header(struct sk_buff *skb)
 {
const struct iphdr *iph = ip_hdr(skb);
@@ -59,7 +44,6 @@ static struct xfrm_state_afinfo xfrm4_state_afinfo = {
.eth_proto  = htons(ETH_P_IP),
.owner  = THIS_MODULE,
.init_flags = xfrm4_init_flags,
-   .init_temprop   = xfrm4_init_temprop,
.output = xfrm4_output,
.output_finish  = xfrm4_output_finish,
.extract_input  = xfrm4_extract_input,
diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 0e19ded3e33b..aa5d2c52cc31 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -21,21 +21,6 @@
 #include 
 #include 
 
-static void
-xfrm6_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
-  const xfrm_address_t *daddr, const xfrm_address_t *saddr)
-{
-   x->id = tmpl->id;
-   if (ipv6_addr_any((struct in6_addr *)&x->id.daddr))
-   memcpy(&x->id.daddr, daddr, sizeof(x->sel.daddr));
-   memcpy(&x->props.saddr, &tmpl->saddr, sizeof(x->props.saddr));
-   if (ipv6_addr_any((struct in6_addr *)&x->props.saddr))
-   memcpy(&x->props.saddr, saddr, sizeof(x->props.saddr));
-   x->props.mode = tmpl->mode;
-   x->props.reqid = tmpl->reqid;
-   x->props.family = AF_INET6;
-}
-
 /* distribution counting sort function for xfrm_state and xfrm_tmpl */
 static int
 __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
@@ -153,7 +138,6 @@ static struct xfrm_state_afinfo xfrm6_state_afinfo = {
.proto  = IPPROTO_IPV6,
.eth_proto  = htons(ETH_P_IPV6),
.owner  = THIS_MODULE,
-   .init_temprop   = xfrm6_init_temprop,
.tmpl_sort  = __xfrm6_tmpl_sort,
.state_sort = __xfrm6_state_sort,
.output = xfrm6_output,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index f93c6dc57754..42662d9e8b5e 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -810,8 +810,6 @@ xfrm_init_tempstate(struct xfrm_state *x, const struct 
flowi *fl,
const xfrm_address_t *daddr, const xfrm_address_t *saddr,
unsigned short family)
 {
-   struct xfrm_state_afinfo *afinfo = xfrm_state_afinfo_get_rcu(family);
-
switch (family) {
case AF_INET:
__xfrm4_init_tempsel(&x->sel, fl);
@@ -821,13 +819,28 @@ xfrm_init_tempstate(struct xfrm_state *x, const struct 
flowi *fl,
break;
}
 
-   if (family != tmpl->encap_family)
-   afinfo = xfrm_state_afinfo_get_rcu(tmpl->encap_family);
+   x->id = tmpl->id;
 
-   if (!afinfo)
-   return;
+   switch (tmpl->encap_family) {
+   case AF_INET:
+   if (x->id
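
[The hunk above is truncated in this archive. Based on the per-family
helpers removed earlier in this patch (xfrm4_init_temprop and
xfrm6_init_temprop), the inlined encap_family handling plausibly
continues along the lines of the sketch below -- a readability aid, not
the literal patch text:]

	switch (tmpl->encap_family) {
	case AF_INET:
		if (x->id.daddr.a4 == 0)
			x->id.daddr.a4 = daddr->a4;
		x->props.saddr = tmpl->saddr;
		if (x->props.saddr.a4 == 0)
			x->props.saddr.a4 = saddr->a4;
		break;
	case AF_INET6:
		if (ipv6_addr_any((struct in6_addr *)&x->id.daddr))
			memcpy(&x->id.daddr, daddr, sizeof(x->sel.daddr));
		memcpy(&x->props.saddr, &tmpl->saddr, sizeof(x->props.saddr));
		if (ipv6_addr_any((struct in6_addr *)&x->props.saddr))
			memcpy(&x->props.saddr, saddr, sizeof(x->props.saddr));
		break;
	}
	x->props.mode = tmpl->mode;
	x->props.reqid = tmpl->reqid;
	x->props.family = tmpl->encap_family;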

Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Eric Dumazet
On Fri, May 3, 2019 at 11:33 AM Peter Oskolkov  wrote:
>
> This skb_get was introduced by commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
> "ipv6: frags: rewrite ip6_expire_frag_queue()", and the rbtree patch
> is not in 4.4, where the bug is reported at.
> Shouldn't the "Fixes" tag also reference the original patch?

No, this patch really fixes a memory leak.

Fact that it also fixes the XFRM issue is secondary, since all your
patches are being backported in stable
trees anyway for other reasons.

There is no need to list all commits and give a complete context for a
bug fix like this one,
this would be quite noisy.


[PATCH v2 net] neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmit

2019-05-03 Thread David Ahern
From: David Ahern 

Commit cd9ff4de0107 changed the key for IFF_POINTOPOINT devices to
INADDR_ANY, but neigh_xmit which is used for MPLS encapsulations was not
updated to use the altered key. The result is that every packet Tx does
a lookup on the gateway address, which does not find an entry; a new one
is created, only to find the existing one in the table right before the
insert, since arp_constructor was updated to reset the primary key. This
is seen in the allocs and destroys counters:
ip -s -4 ntable show | head -10 | grep alloc

which increase for each packet, showing the unnecessary overhead.

Fix by having neigh_xmit use __ipv4_neigh_lookup_noref for NEIGH_ARP_TABLE.
Define __ipv4_neigh_lookup_noref in case CONFIG_INET is not set.

v2
- define __ipv4_neigh_lookup_noref in case CONFIG_INET is not set as
  reported by kbuild test robot

Fixes: cd9ff4de0107 ("ipv4: Make neigh lookup keys for loopback/point-to-point 
devices be INADDR_ANY")
Reported-by: Alan Maguire 
Signed-off-by: David Ahern 

Signed-off-by: David Ahern 
---
 include/net/arp.h| 8 
 net/core/neighbour.c | 9 -
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 977aabfcdc03..c8f580a0e6b1 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -18,6 +18,7 @@ static inline u32 arp_hashfn(const void *pkey, const struct 
net_device *dev, u32
return val * hash_rnd[0];
 }
 
+#ifdef CONFIG_INET
 static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device 
*dev, u32 key)
 {
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
@@ -25,6 +26,13 @@ static inline struct neighbour 
*__ipv4_neigh_lookup_noref(struct net_device *dev
 
return ___neigh_lookup_noref(&arp_tbl, neigh_key_eq32, arp_hashfn, 
&key, dev);
 }
+#else
+static inline
+struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev, u32 key)
+{
+   return NULL;
+}
+#endif
 
 static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, 
u32 key)
 {
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 30f6fd8f68e0..0ba5018ccb7f 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2982,7 +2983,13 @@ int neigh_xmit(int index, struct net_device *dev,
if (!tbl)
goto out;
rcu_read_lock_bh();
-   neigh = __neigh_lookup_noref(tbl, addr, dev);
+   if (index == NEIGH_ARP_TABLE) {
+   u32 key = *((u32 *)addr);
+
+   neigh = __ipv4_neigh_lookup_noref(dev, key);
+   } else {
+   neigh = __neigh_lookup_noref(tbl, addr, dev);
+   }
if (!neigh)
neigh = __neigh_create(tbl, addr, dev, false);
err = PTR_ERR(neigh);
-- 
2.11.0



Re: [RFC HACK] xfrm: make state refcounting percpu

2019-05-03 Thread Florian Westphal
Eric Dumazet  wrote:
 On 5/3/19 2:07 AM, Steffen Klassert wrote:
> > On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> >> I'm not sure this is a good idea to begin with, refcount
> >> is right next to state spinlock which is taken for both tx and rx ops,
> >> plus this complicates debugging quite a bit.
> > 
> > 
> 
> 
> For some reason I have not received Florian response.
> 
> Florian, when the percpu counters are in nominal mode,
> the updates are only in percpu memory, so the cache line containing struct 
> percpu_ref in the
> main object is not dirtied.

Yes, I understand this.  We'll still serialize anyway due to the
spinlock.

Given that Vakul says the state refcount isn't the main problem and
Steffen suggests inserting multiple states instead, I don't think
working on this any further makes sense.

Thanks for the pcpu counter infra pointer though, I had not seen it before.


Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Peter Oskolkov
On Fri, May 3, 2019 at 8:52 AM Eric Dumazet  wrote:
>
> On Fri, May 3, 2019 at 11:33 AM Peter Oskolkov  wrote:
> >
> > This skb_get was introduced by commit 
> > 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
> > "ipv6: frags: rewrite ip6_expire_frag_queue()", and the rbtree patch
> > is not in 4.4, where the bug is reported at.
> > Shouldn't the "Fixes" tag also reference the original patch?
>
> No, this bug really fixes a memory leak.
>
> Fact that it also fixes the XFRM issue is secondary, since all your
> patches are being backported in stable
> trees anyway for other reasons.

There are no plans to backport rbtree patches to 4.4 and earlier at
the moment, afaik.

>
> There is no need to list all commits and give a complete context for a
> bug fix like this one,
> this would be quite noisy.


bpftool doc man page build failure

2019-05-03 Thread Yonghong Song
Quentin,

I hit the following errors with latest bpf-next.

-bash-4.4$ make man
   GEN  bpftool-perf.8
   GEN  bpftool-map.8
   GEN  bpftool.8
   GEN  bpftool-net.8
   GEN  bpftool-feature.8
   GEN  bpftool-prog.8
   GEN  bpftool-cgroup.8
   GEN  bpftool-btf.8
   GEN  bpf-helpers.rst
Parsed description of 111 helper function(s)
Traceback (most recent call last):
   File "../../../../scripts/bpf_helpers_doc.py", line 421, in 
 printer.print_all()
   File "../../../../scripts/bpf_helpers_doc.py", line 187, in print_all
 self.print_one(helper)
   File "../../../../scripts/bpf_helpers_doc.py", line 378, in print_one
 self.print_proto(helper)
   File "../../../../scripts/bpf_helpers_doc.py", line 356, in print_proto
 proto = helper.proto_break_down()
   File "../../../../scripts/bpf_helpers_doc.py", line 56, in 
proto_break_down
 'type' : capture.group(1),
AttributeError: 'NoneType' object has no attribute 'group'
make: *** [bpf-helpers.rst] Error 1
-bash-4.4$ pwd
/home/yhs/work/net-next/tools/bpf/bpftool/Documentation
-bash-4.4$

Maybe a format issue in the comments with some recent helpers?

Thanks,

Yonghong


Re: bpftool doc man page build failure

2019-05-03 Thread Quentin Monnet
2019-05-03 16:21 UTC+ ~ Yonghong Song 
> Quentin,
> 
> I hit the following errors with latest bpf-next.
> 
> -bash-4.4$ make man
>GEN  bpftool-perf.8
>GEN  bpftool-map.8
>GEN  bpftool.8
>GEN  bpftool-net.8
>GEN  bpftool-feature.8
>GEN  bpftool-prog.8
>GEN  bpftool-cgroup.8
>GEN  bpftool-btf.8
>GEN  bpf-helpers.rst
> Parsed description of 111 helper function(s)
> Traceback (most recent call last):
>File "../../../../scripts/bpf_helpers_doc.py", line 421, in 
>  printer.print_all()
>File "../../../../scripts/bpf_helpers_doc.py", line 187, in print_all
>  self.print_one(helper)
>File "../../../../scripts/bpf_helpers_doc.py", line 378, in print_one
>  self.print_proto(helper)
>File "../../../../scripts/bpf_helpers_doc.py", line 356, in print_proto
>  proto = helper.proto_break_down()
>File "../../../../scripts/bpf_helpers_doc.py", line 56, in 
> proto_break_down
>  'type' : capture.group(1),
> AttributeError: 'NoneType' object has no attribute 'group'
> make: *** [bpf-helpers.rst] Error 1
> -bash-4.4$ pwd
> /home/yhs/work/net-next/tools/bpf/bpftool/Documentation
> -bash-4.4$
> 
> Maybe a format issue in the comments with some recent helpers?
> 
> Thanks,
> 
> Yonghong
> 

Hi Yonghong,

Thanks for the notice! Yes, I observed the same thing not long ago. It
seems that the Python script breaks on the "unsigned long" pointer
argument for strtoul(): the script only accepts "const" or "struct" for
types made of several words, not "unsigned".

I'll fix the script so it can take any word and send a patch next week,
along with some other clean-up fixes for the doc.

Best regards,
Quentin


Re: bpftool doc man page build failure

2019-05-03 Thread Yonghong Song


On 5/3/19 9:54 AM, Quentin Monnet wrote:
> 2019-05-03 16:21 UTC+ ~ Yonghong Song 
>> Quentin,
>>
>> I hit the following errors with latest bpf-next.
>>
>> -bash-4.4$ make man
>> GEN  bpftool-perf.8
>> GEN  bpftool-map.8
>> GEN  bpftool.8
>> GEN  bpftool-net.8
>> GEN  bpftool-feature.8
>> GEN  bpftool-prog.8
>> GEN  bpftool-cgroup.8
>> GEN  bpftool-btf.8
>> GEN  bpf-helpers.rst
>> Parsed description of 111 helper function(s)
>> Traceback (most recent call last):
>> File "../../../../scripts/bpf_helpers_doc.py", line 421, in 
>>   printer.print_all()
>> File "../../../../scripts/bpf_helpers_doc.py", line 187, in print_all
>>   self.print_one(helper)
>> File "../../../../scripts/bpf_helpers_doc.py", line 378, in print_one
>>   self.print_proto(helper)
>> File "../../../../scripts/bpf_helpers_doc.py", line 356, in print_proto
>>   proto = helper.proto_break_down()
>> File "../../../../scripts/bpf_helpers_doc.py", line 56, in
>> proto_break_down
>>   'type' : capture.group(1),
>> AttributeError: 'NoneType' object has no attribute 'group'
>> make: *** [bpf-helpers.rst] Error 1
>> -bash-4.4$ pwd
>> /home/yhs/work/net-next/tools/bpf/bpftool/Documentation
>> -bash-4.4$
>>
>> Maybe a format issue in the comments with some recent helpers?
>>
>> Thanks,
>>
>> Yonghong
>>
> 
> Hi Yonghong,
> 
> Thanks for the notice! Yes, I observed the same thing not long ago. It
> seems that the Python script breaks on the "unsigned long" pointer
> argument for strtoul(): the script only accepts "const" or "struct" for
> types made of several words, not "unsigned".
> 
> I'll fix the script so it can take any word and send a patch next week,
> along with some other clean-up fixes for the doc.

Thanks!

> 
> Best regards,
> Quentin
> 


Re: 32-bit zext JIT efficiency (Was Re: [PATCH bpf-next] selftests/bpf: two scale tests)

2019-05-03 Thread Jiong Wang


Jiong Wang writes:

>> > if you can craft a test that shows patch_insn issue before your set,
>> > then it's ok to hack bpf_fill_scale1 to use alu64.
>>
>> As described above, does the test_verifier 732 + jit blinding looks 
>> convincing?
>>
>> > I would also prefer to go with option 2 (new zext insn) for JITs.
>>
>> Got it.
>
> I followed option 2 and have sent out v5 with latests changes/fixes:

I have taken a second look at the various back-ends and noticed one new
issue: some arches are not consistent on implicit zext. For example, on
s390 an alu32 move can be JITed to the single instruction "llgfr", which
does an implicit zext, but an alu32 move on PowerPC needs an explicit
zext. On riscv, all BPF_ALU | BPF_K operations need a zext, but some
BPF_ALU | BPF_X operations do not. So while these arches are generally
better off with verifier zext insertion enabled, the verifier still
inserts some unnecessary zexts for them, case by case.

Also, 64-bit arches like PowerPC, s390 etc. normally have zero-extending
loads, so a narrowed load doesn't need an extra zext insn, but on 32-bit
arches like arm a narrowed load always needs an explicit zext.

All these differences exist because BPF_ALU32 or BPF_LDX + B | H | W
insns are eventually mapped to diverse back-ends which do not have
consistent ISA semantics.

Given all this, it looks like passing the analysis info down to the
back-ends and letting them make the decision becomes the better choice
again?

Regards,
Jiong

> The major changes are:
>   - introduced BPF_ZEXT, even though it doesn't resolve insn patch 
> in-efficient,
> but could let JIT back-ends do optimal code-gen, and the change is small,
> so perhap just better to support it in this set.
>   - while look insn patch code, I feel patched-insn need to be conservatiely
> marked if any insn inside patch buffer define sub-register.
>   - Also fixed helper function return value handling bug. I am thinking helper
> function should have accurate return value type description, otherwise
> there could be bug. For example arm32 back-end just executes the native
> helper functions and doesn't do anything special on the return value. So
> a function returns u32 would only set native reg r0, not r1 in the pair.
> Then if the outside eBPF insn is casting it into u64, there needs to be
> zext.
>   - adjusted test_verifier to make sure it could pass on hosts w and w/o hw
> zext.
>
> For more info, please see the cover letter and patch description at v5.
>
> Thanks.
> Regards,
> Jiong



Re: [PATCH net] ip6: fix skb leak in ip6frag_expire_frag_queue()

2019-05-03 Thread Eric Dumazet



On 5/3/19 11:58 AM, Peter Oskolkov wrote:
> On Fri, May 3, 2019 at 8:52 AM Eric Dumazet  wrote:
>>
>> On Fri, May 3, 2019 at 11:33 AM Peter Oskolkov  wrote:
>>>
>>> This skb_get was introduced by commit 
>>> 05c0b86b9696802fd0ce5676a92a63f1b455bdf3
>>> "ipv6: frags: rewrite ip6_expire_frag_queue()", and the rbtree patch
>>> is not in 4.4, where the bug is reported at.
>>> Shouldn't the "Fixes" tag also reference the original patch?
>>
>> No, this bug really fixes a memory leak.
>>
>> Fact that it also fixes the XFRM issue is secondary, since all your
>> patches are being backported in stable
>> trees anyway for other reasons.
> 
> There are no plans to backport rbtree patches to 4.4 and earlier at
> the moment, afaik.
> 

No problem, I mentioned to Stefan what needs to be done.

(removing the head skb, removing the skb_get())



Re: [PATCH net-next 2/2] net: dsa :mv88e6xxx: Disable unused ports

2019-05-03 Thread Florian Fainelli
On 4/30/19 3:08 PM, Andrew Lunn wrote:
> If the NO_CPU strap is set, the switch starts in 'dumb hub' mode, with
> all ports enable. Ports which are then actively used are reconfigured
> as required when the driver starts. However unused ports are left
> alone. Change this to disable them, and turn off any SERDES
> interface. This could save some power and so reduce the temperature a
> bit.
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Set STP disable state in port_disable

2019-05-03 Thread Florian Fainelli
On 4/30/19 3:08 PM, Andrew Lunn wrote:
> When requested to disable a port, set the port STP state to disabled.
> This fully disables the port and should save some power.
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net] um: vector netdev: adjust to xmit_more API change

2019-05-03 Thread Johannes Berg
From: Johannes Berg 

Replace skb->xmit_more usage by netdev_xmit_more().

Fixes: 4f296edeb9d4 ("drivers: net: aurora: use netdev_xmit_more helper")
Signed-off-by: Johannes Berg 
---
 arch/um/drivers/vector_kern.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c
index 596e7056f376..e190e4ca52e1 100644
--- a/arch/um/drivers/vector_kern.c
+++ b/arch/um/drivers/vector_kern.c
@@ -1043,7 +1043,7 @@ static int vector_net_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
vector_send(vp->tx_queue);
return NETDEV_TX_OK;
}
-   if (skb->xmit_more) {
+   if (netdev_xmit_more()) {
mod_timer(&vp->tl, vp->coalesce);
return NETDEV_TX_OK;
}
-- 
2.17.2



Re: [PATCH bpf] libbpf: add libbpf_util.h to header install.

2019-05-03 Thread William Tu
On Thu, May 2, 2019 at 1:18 PM Y Song  wrote:
>
> On Thu, May 2, 2019 at 11:34 AM William Tu  wrote:
> >
> > The libbpf_util.h is used by xsk.h, so add it to
> > the install headers.
>
> Can we try to change code a little bit to avoid exposing libbpf_util.h?
> Originally libbpf_util.h is considered as libbpf internal.
> I am not strongly against this patch. But would really like to see
> whether we have an alternative not exposing libbpf_util.h.
>

The commit b7e3a28019c92ff ("libbpf: remove dependency on barrier.h in xsk.h")
adds a dependency on libbpf_util.h to xsk.h.
How about we move the libbpf_smp_* helpers into xsk.h, since they are
only used by xsk.h?

Regards,
William

> >
> > Reported-by: Ben Pfaff 
> > Signed-off-by: William Tu 
> > ---
> >  tools/lib/bpf/Makefile | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> > index c6c06bc6683c..f91639bf5650 100644
> > --- a/tools/lib/bpf/Makefile
> > +++ b/tools/lib/bpf/Makefile
> > @@ -230,6 +230,7 @@ install_headers:
> > $(call do_install,bpf.h,$(prefix)/include/bpf,644); \
> > $(call do_install,libbpf.h,$(prefix)/include/bpf,644); \
> > $(call do_install,btf.h,$(prefix)/include/bpf,644); \
> > +   $(call do_install,libbpf_util.h,$(prefix)/include/bpf,644); 
> > \
> > $(call do_install,xsk.h,$(prefix)/include/bpf,644);
> >
> >  install_pkgconfig: $(PC_FILE)
> > --
> > 2.7.4
> >


RE: Hyperv netvsc - regression for 32-PAE kernel

2019-05-03 Thread Dexuan Cui
> From: linux-hyperv-ow...@vger.kernel.org
>  On Behalf Of Michael Kelley
> Sent: Thursday, May 2, 2019 3:24 PM
> To: Juliana Rodrigueiro ;
> linux-hyp...@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Subject: RE: Hyperv netvsc - regression for 32-PAE kernel
> 
> From: Juliana Rodrigueiro  Sent: Thursday,
> May 2, 2019 9:14 AM
> >
> > So I got to the following commit:
> >
> > commit 6ba34171bcbd10321c6cf554e0c1144d170f9d1a
> > Author: Michael Kelley 
> > Date:   Thu Aug 2 03:08:24 2018 +
> >
> > Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
> >
> > slow_virt_to_phys() is only implemented for arch/x86.
> > Remove its use in arch independent Hyper-V drivers, and
> > replace with test for vmalloc() address followed by
> > appropriate v-to-p function. This follows the typical
> > pattern of other drivers and avoids the need to implement
> > slow_virt_to_phys() for Hyper-V on ARM64.
> >
> > Signed-off-by: Michael Kelley 
> > Signed-off-by: K. Y. Srinivasan 
> > Signed-off-by: Greg Kroah-Hartman 
> >
> > The catch is that slow_virt_to_phys has a special trick implemented in order
> > to keep specifically 32-PAE kernel working, it is explained in a comment
> > inside the function.
> >
> > Reverting this commit makes the kernel 4.19 32-bit PAE work again. However
> I
> > believe a better solution might exist.
> >
> > Comments are very much appreciated.
> >
> 
> Julie -- thanks for tracking down the cause of the issue.  I'll try to
> look at this tomorrow and propose a solution.
> 
> Michael Kelley

Hi Juliana,
Can you please try the below one-line patch? 

It should fix the issue.

Thanks,
-- Dexuan

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 23381c4..aaaee5f 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -38,7 +38,7 @@

 static unsigned long virt_to_hvpfn(void *addr)
 {
-   unsigned long paddr;
+   phys_addr_t paddr;

if (is_vmalloc_addr(addr))
paddr = page_to_phys(vmalloc_to_page(addr)) +
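
[Why the one-liner matters on 32-bit PAE: with PAE enabled, physical
addresses are 64 bits wide while unsigned long stays 32 bits, so storing
the result of page_to_phys() in an unsigned long silently drops the
upper bits before the PFN shift. A minimal sketch with hypothetical
helper names, assuming phys_addr_t is 64-bit on the PAE kernel:]

	#include <linux/mm.h>	/* PAGE_SHIFT; phys_addr_t via linux/types.h */

	static unsigned long pfn_truncated(phys_addr_t paddr)
	{
		unsigned long p = paddr;	/* high 32 bits lost on 32-bit PAE */

		return p >> PAGE_SHIFT;		/* wrong PFN for memory above 4 GiB */
	}

	static unsigned long pfn_correct(phys_addr_t paddr)
	{
		return paddr >> PAGE_SHIFT;	/* shift the full 64-bit value first */
	}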




Re: [PATCH bpf] libbpf: add libbpf_util.h to header install.

2019-05-03 Thread Y Song
On Fri, May 3, 2019 at 12:54 PM William Tu  wrote:
>
> On Thu, May 2, 2019 at 1:18 PM Y Song  wrote:
> >
> > On Thu, May 2, 2019 at 11:34 AM William Tu  wrote:
> > >
> > > The libbpf_util.h is used by xsk.h, so add it to
> > > the install headers.
> >
> > Can we try to change code a little bit to avoid exposing libbpf_util.h?
> > Originally libbpf_util.h is considered as libbpf internal.
> > I am not strongly against this patch. But would really like to see
> > whether we have an alternative not exposing libbpf_util.h.
> >
>
> The commit b7e3a28019c92ff ("libbpf: remove dependency on barrier.h in xsk.h")
> adds the dependency of libbpf_util.h to xsk.h.
> How about we move the libbpf_smp_* into the xsk.h, since they are
> used only by xsk.h.

Okay. It looks like the libbpf_smp_* helpers are used in some static
inline functions which are also API functions.

Probably having libbpf_smp_* in libbpf_util.h is a better choice as these
primitives can be used by other .c files in tools/lib/bpf.

On the other hand, exposing the macros pr_warning(), pr_info() and
pr_debug() may not be a bad thing, as users can use them with the same
debug level used by libbpf itself.

Ack your original patch:
Acked-by: Yonghong Song 

>
> Regards,
> William
>
> > >
> > > Reported-by: Ben Pfaff 
> > > Signed-off-by: William Tu 
> > > ---
> > >  tools/lib/bpf/Makefile | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> > > index c6c06bc6683c..f91639bf5650 100644
> > > --- a/tools/lib/bpf/Makefile
> > > +++ b/tools/lib/bpf/Makefile
> > > @@ -230,6 +230,7 @@ install_headers:
> > > $(call do_install,bpf.h,$(prefix)/include/bpf,644); \
> > > $(call do_install,libbpf.h,$(prefix)/include/bpf,644); \
> > > $(call do_install,btf.h,$(prefix)/include/bpf,644); \
> > > +   $(call 
> > > do_install,libbpf_util.h,$(prefix)/include/bpf,644); \
> > > $(call do_install,xsk.h,$(prefix)/include/bpf,644);
> > >
> > >  install_pkgconfig: $(PC_FILE)
> > > --
> > > 2.7.4
> > >


[PATCH] net: dsa: mv88e6xxx: refine SMI support

2019-05-03 Thread Vivien Didelot
The Marvell SOHO switches have several ways to access the internal
registers. One of them is the System Management Interface (SMI), which
uses the MDC and MDIO pins and has direct and indirect variants.

In preparation for adding support for other register accesses, move
the SMI code into its own files. At the same time, refine the code
to make it clear that the indirect variant is implemented using the
direct variant accessing only two registers for command and data.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/Makefile |   1 +
 drivers/net/dsa/mv88e6xxx/chip.c   | 172 ++---
 drivers/net/dsa/mv88e6xxx/chip.h   |  11 --
 drivers/net/dsa/mv88e6xxx/smi.c| 158 ++
 drivers/net/dsa/mv88e6xxx/smi.h|  41 +++
 5 files changed, 211 insertions(+), 172 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/smi.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/smi.h

diff --git a/drivers/net/dsa/mv88e6xxx/Makefile 
b/drivers/net/dsa/mv88e6xxx/Makefile
index 50de304abe2f..e85755dde90b 100644
--- a/drivers/net/dsa/mv88e6xxx/Makefile
+++ b/drivers/net/dsa/mv88e6xxx/Makefile
@@ -12,3 +12,4 @@ mv88e6xxx-objs += phy.o
 mv88e6xxx-objs += port.o
 mv88e6xxx-$(CONFIG_NET_DSA_MV88E6XXX_PTP) += ptp.o
 mv88e6xxx-objs += serdes.o
+mv88e6xxx-objs += smi.o
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 489a899c80b6..4c0d06686d53 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -43,6 +43,7 @@
 #include "port.h"
 #include "ptp.h"
 #include "serdes.h"
+#include "smi.h"
 
 static void assert_reg_lock(struct mv88e6xxx_chip *chip)
 {
@@ -52,156 +53,17 @@ static void assert_reg_lock(struct mv88e6xxx_chip *chip)
}
 }
 
-/* The switch ADDR[4:1] configuration pins define the chip SMI device address
- * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
- *
- * When ADDR is all zero, the chip uses Single-chip Addressing Mode, assuming 
it
- * is the only device connected to the SMI master. In this mode it responds to
- * all 32 possible SMI addresses, and thus maps directly the internal devices.
- *
- * When ADDR is non-zero, the chip uses Multi-chip Addressing Mode, allowing
- * multiple devices to share the SMI interface. In this mode it responds to 
only
- * 2 registers, used to indirectly access the internal SMI devices.
- */
-
-static int mv88e6xxx_smi_read(struct mv88e6xxx_chip *chip,
- int addr, int reg, u16 *val)
-{
-   if (!chip->smi_ops)
-   return -EOPNOTSUPP;
-
-   return chip->smi_ops->read(chip, addr, reg, val);
-}
-
-static int mv88e6xxx_smi_write(struct mv88e6xxx_chip *chip,
-  int addr, int reg, u16 val)
-{
-   if (!chip->smi_ops)
-   return -EOPNOTSUPP;
-
-   return chip->smi_ops->write(chip, addr, reg, val);
-}
-
-static int mv88e6xxx_smi_single_chip_read(struct mv88e6xxx_chip *chip,
- int addr, int reg, u16 *val)
-{
-   int ret;
-
-   ret = mdiobus_read_nested(chip->bus, addr, reg);
-   if (ret < 0)
-   return ret;
-
-   *val = ret & 0x;
-
-   return 0;
-}
-
-static int mv88e6xxx_smi_single_chip_write(struct mv88e6xxx_chip *chip,
-  int addr, int reg, u16 val)
-{
-   int ret;
-
-   ret = mdiobus_write_nested(chip->bus, addr, reg, val);
-   if (ret < 0)
-   return ret;
-
-   return 0;
-}
-
-static const struct mv88e6xxx_bus_ops mv88e6xxx_smi_single_chip_ops = {
-   .read = mv88e6xxx_smi_single_chip_read,
-   .write = mv88e6xxx_smi_single_chip_write,
-};
-
-static int mv88e6xxx_smi_multi_chip_wait(struct mv88e6xxx_chip *chip)
-{
-   int ret;
-   int i;
-
-   for (i = 0; i < 16; i++) {
-   ret = mdiobus_read_nested(chip->bus, chip->sw_addr, SMI_CMD);
-   if (ret < 0)
-   return ret;
-
-   if ((ret & SMI_CMD_BUSY) == 0)
-   return 0;
-   }
-
-   return -ETIMEDOUT;
-}
-
-static int mv88e6xxx_smi_multi_chip_read(struct mv88e6xxx_chip *chip,
-int addr, int reg, u16 *val)
-{
-   int ret;
-
-   /* Wait for the bus to become free. */
-   ret = mv88e6xxx_smi_multi_chip_wait(chip);
-   if (ret < 0)
-   return ret;
-
-   /* Transmit the read command. */
-   ret = mdiobus_write_nested(chip->bus, chip->sw_addr, SMI_CMD,
-  SMI_CMD_OP_22_READ | (addr << 5) | reg);
-   if (ret < 0)
-   return ret;
-
-   /* Wait for the read command to complete. */
-   ret = mv88e6xxx_smi_multi_chip_wait(chip);
-   if (ret < 0)
-   return ret;
-
-   /* Read the data. */
-   ret = mdiobus_read_nested(chip->bus, chip->sw_addr, SMI_DATA);
-   if (ret < 0)
-   return ret;
-
- 
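
[The new smi.c added by this patch is cut off above. To illustrate the
"indirect implemented via the direct variant" point from the commit
message, here is a rough sketch built from the multi-chip code removed
from chip.c above -- function names are illustrative, not necessarily
those used in the actual smi.c:]

	/* Direct variant: plain MDIO read/write of one switch register. */
	static int smi_direct_read(struct mv88e6xxx_chip *chip,
				   int dev, int reg, u16 *val)
	{
		int ret = mdiobus_read_nested(chip->bus, dev, reg);

		if (ret < 0)
			return ret;
		*val = ret & 0xffff;
		return 0;
	}

	static int smi_direct_write(struct mv88e6xxx_chip *chip,
				    int dev, int reg, u16 val)
	{
		return mdiobus_write_nested(chip->bus, dev, reg, val);
	}

	static int smi_direct_wait(struct mv88e6xxx_chip *chip, int dev, int reg)
	{
		u16 cmd;
		int err, i;

		for (i = 0; i < 16; i++) {
			err = smi_direct_read(chip, dev, reg, &cmd);
			if (err)
				return err;
			if (!(cmd & SMI_CMD_BUSY))
				return 0;
		}
		return -ETIMEDOUT;
	}

	/* Indirect variant: built entirely on the direct one, touching only
	 * the SMI_CMD and SMI_DATA registers at the strapped address.
	 */
	static int smi_indirect_read(struct mv88e6xxx_chip *chip,
				     int dev, int reg, u16 *val)
	{
		int err;

		/* Wait for the bus to become free. */
		err = smi_direct_wait(chip, chip->sw_addr, SMI_CMD);
		if (err)
			return err;

		/* Issue the read command and wait for it to complete. */
		err = smi_direct_write(chip, chip->sw_addr, SMI_CMD,
				       SMI_CMD_OP_22_READ | (dev << 5) | reg);
		if (err)
			return err;

		err = smi_direct_wait(chip, chip->sw_addr, SMI_CMD);
		if (err)
			return err;

		/* Fetch the result from the data register. */
		return smi_direct_read(chip, chip->sw_addr, SMI_DATA, val);
	}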

Re: [PATCH] net: dsa: mv88e6xxx: refine SMI support

2019-05-03 Thread Florian Fainelli
On 5/3/19 3:49 PM, Vivien Didelot wrote:
> The Marvell SOHO switches have several ways to access the internal
> registers. One of them being the System Management Interface (SMI),
> using the MDC and MDIO pins, with direct and indirect variants.
> 
> In preparation for adding support for other register accesses, move
> the SMI code into its own files. At the same time, refine the code
> to make it clear that the indirect variant is implemented using the
> direct variant accessing only two registers for command and data.
> 
> Signed-off-by: Vivien Didelot 
> ---

With some nits below:

Reviewed-by: Florian Fainelli 

[snip]

>   assert_reg_lock(chip);
>  
> - err = mv88e6xxx_smi_read(chip, addr, reg, val);
> + if (chip->smi_ops)
> + err = chip->smi_ops->read(chip, addr, reg, val);
> + else

You might want to check for smi_ops && smi_ops->read here to be safe.
You could also keep that code unchanged, and just make
mv88e6xxx_smi_read() an inline helper within smi.h:

static inline int mv88e6xxx_smi_read(struct mv88e6xxx_chip *chip,
				     int addr, int reg, u16 *val)
{
if (chip->smi_ops && chip->smi_ops->read)
return chip->smi_ops->read(chip, addr, reg, val);

return -EOPNOTSUPP;
}

> + err = -EOPNOTSUPP;
> +
>   if (err)
>   return err;
>  
> @@ -217,7 +79,11 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int 
> addr, int reg, u16 val)
>  
>   assert_reg_lock(chip);
>  
> - err = mv88e6xxx_smi_write(chip, addr, reg, val);
> + if (chip->smi_ops)
> + err = chip->smi_ops->write(chip, addr, reg, val);
> + else

Same here, you might want to check smi_ops && smi_ops->write to avoid
de-referencing a potentially NULL pointer.
-- 
Florian


[net-next v2 01/11] i40e: Fix for allowing too many MDD events on VF

2019-05-03 Thread Jeff Kirsher
From: Carolyn Wyborny 

This patch changes the driver behavior when detecting a VF MDD event.
It now disables the VF after one event, which indicates a hw detected
problem in the VF.  Before this change, the PF would allow a couple of
events before doing the reset.

Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 65c2b9d2652b..b52a9d5644b8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9767,6 +9767,9 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
vf->num_mdd_events++;
dev_info(&pf->pdev->dev, "TX driver issue detected on 
VF %d\n",
 i);
+   dev_info(&pf->pdev->dev,
+"Use PF Control I/F to re-enable the VF\n");
+   set_bit(I40E_VF_STATE_DISABLED, &vf->vf_states);
}
 
reg = rd32(hw, I40E_VP_MDET_RX(i));
@@ -9775,11 +9778,6 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
vf->num_mdd_events++;
dev_info(&pf->pdev->dev, "RX driver issue detected on 
VF %d\n",
 i);
-   }
-
-   if (vf->num_mdd_events > I40E_DEFAULT_NUM_MDD_EVENTS_ALLOWED) {
-   dev_info(&pf->pdev->dev,
-"Too many MDD events on VF %d, disabled\n", i);
dev_info(&pf->pdev->dev,
 "Use PF Control I/F to re-enable the VF\n");
set_bit(I40E_VF_STATE_DISABLED, &vf->vf_states);
-- 
2.20.1



[net-next v2 10/11] i40e: print PCI vendor and device ID during probe

2019-05-03 Thread Jeff Kirsher
From: Stefan Assmann 

Printing each device's PCI vendor and device ID has the advantage of
easily revealing what hardware we're dealing with exactly. It's no
longer necessary to match the PCI bus information to the lspci output.

This helps with bug reports where no lspci output is available.

Output before
i40e :08:00.0: fw 6.1.49420 api 1.7 nvm 6.80 0x80003c64 1.2007.0
and after
i40e :08:00.0: fw 6.1.49420 api 1.7 nvm 6.80 0x80003c64 1.2007.0 
[8086:1572] [8086:0004]

Signed-off-by: Stefan Assmann 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9ea0556c8962..c2673d2cef8e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14073,11 +14073,12 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
}
i40e_get_oem_version(hw);
 
-   /* provide nvm, fw, api versions */
-   dev_info(&pdev->dev, "fw %d.%d.%05d api %d.%d nvm %s\n",
+   /* provide nvm, fw, api versions, vendor:device id, subsys 
vendor:device id */
+   dev_info(&pdev->dev, "fw %d.%d.%05d api %d.%d nvm %s [%04x:%04x] 
[%04x:%04x]\n",
 hw->aq.fw_maj_ver, hw->aq.fw_min_ver, hw->aq.fw_build,
 hw->aq.api_maj_ver, hw->aq.api_min_ver,
-i40e_nvm_version_str(hw));
+i40e_nvm_version_str(hw), hw->vendor_id, hw->device_id,
+hw->subsystem_vendor_id, hw->subsystem_device_id);
 
if (hw->aq.api_maj_ver == I40E_FW_API_VERSION_MAJOR &&
hw->aq.api_min_ver > I40E_FW_MINOR_VERSION(hw))
-- 
2.20.1



[net-next v2 09/11] i40e: fix misleading message about promisc setting on un-trusted VF

2019-05-03 Thread Jeff Kirsher
From: Harshitha Ramamurthy 

A refactor of the i40e_vc_config_promiscuous_mode_msg function moved
the check for an un-trusted VF into another function. We have to lie to
an un-trusted VF that its request to set promiscuous mode is
successful even when it is not, because we don't want the VF to find
out its trust status this way. With the refactor, we were running into
a case where even though we were not setting promiscuous mode for an
un-trusted VF, we still printed a misleading message that it was
successful.

This patch fixes that by ensuring that a success message is printed
on the host side only when the promiscuous mode change has been
successful.

Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 28 +++
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 925ca880bea3..8a6fb9c03955 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1112,15 +1112,6 @@ static i40e_status 
i40e_config_vf_promiscuous_mode(struct i40e_vf *vf,
if (!i40e_vc_isvalid_vsi_id(vf, vsi_id) || !vsi)
return I40E_ERR_PARAM;
 
-   if (!test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps) &&
-   (allmulti || alluni)) {
-   dev_err(&pf->pdev->dev,
-   "Unprivileged VF %d is attempting to configure 
promiscuous mode\n",
-   vf->vf_id);
-   /* Lie to the VF on purpose. */
-   return 0;
-   }
-
if (vf->port_vlan_id) {
aq_ret = i40e_aq_set_vsi_mc_promisc_on_vlan(hw, vsi->seid,
allmulti,
@@ -1997,8 +1988,21 @@ static int i40e_vc_config_promiscuous_mode_msg(struct 
i40e_vf *vf, u8 *msg)
bool allmulti = false;
bool alluni = false;
 
-   if (!test_bit(I40E_VF_STATE_ACTIVE, &vf->vf_states))
-   return I40E_ERR_PARAM;
+   if (!test_bit(I40E_VF_STATE_ACTIVE, &vf->vf_states)) {
+   aq_ret = I40E_ERR_PARAM;
+   goto err_out;
+   }
+   if (!test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
+   dev_err(&pf->pdev->dev,
+   "Unprivileged VF %d is attempting to configure 
promiscuous mode\n",
+   vf->vf_id);
+
+   /* Lie to the VF on purpose, because this is an error we can
+* ignore. Unprivileged VF is not a virtual channel error.
+*/
+   aq_ret = 0;
+   goto err_out;
+   }
 
/* Multicast promiscuous handling*/
if (info->flags & FLAG_VF_MULTICAST_PROMISC)
@@ -2032,7 +2036,7 @@ static int i40e_vc_config_promiscuous_mode_msg(struct 
i40e_vf *vf, u8 *msg)
clear_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states);
}
}
-
+err_out:
/* send the response to the VF */
return i40e_vc_send_resp_to_vf(vf,
   VIRTCHNL_OP_CONFIG_PROMISCUOUS_MODE,
-- 
2.20.1



[net-next v2 00/11][pull request] 40GbE Intel Wired LAN Driver Updates 2019-05-03

2019-05-03 Thread Jeff Kirsher
This series contains updates to the i40e driver only.

Carolyn changes the driver behavior to now disable the VF after one MDD
event instead of allowing a couple of MDD events before doing the reset.

Aleksandr changes the driver to only report an error when a VF tries to
remove a VLAN while a port VLAN is configured, unless it is VLAN 0.  He
also extends the LLDP support to keep the current LLDP state persistent
across a power cycle.

Maciej fixes the checksum calculation due to firmware changes, which
requires the driver to perform a double shadow RAM dump in some cases.

Adam adds advertising support for 40GBase_LR4, 40GBase_CR4 and fibre in
the driver.

Jake cleans up a check that is not needed and was producing a warning in
GCC 8.

Harshitha fixes a misleading message by ensuring that a success message
is only printed on the host side when the promiscuous mode change has
been successful.

Stefan Assmann adds the vendor id and device id to the dmesg log entry
during probe to help with bug reports when lspci output may not be
available.

Alice and Piotr add recovery mode support in the i40e driver, which is
needed for migrating from a structured to a flat firmware image.

v2: Removed patch 1 "i40e: replace switch-statement to speed-up
retpoline-enabled builds" from the series since it is no longer
needed.  Also updated the last patch in the series that introduces
recovery mode support, to include a more detailed patch description
and removed code not intended for the upstream kernel.

The following are changes since commit 8ef988b914bd449458eb2174febb67b0f137b33c:
  Merge branch 'NXP-SJA1105-DSA-driver'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Adam Ludkiewicz (1):
  i40e: Report advertised link modes on 40GBase_LR4, CR4 and fibre

Aleksandr Loktionov (2):
  i40e: remove error msg when vf with port vlan tries to remove vlan 0
  i40e: Further implementation of LLDP

Alice Michael (2):
  i40e: update version number
  i40e: Introduce recovery mode support

Carolyn Wyborny (2):
  i40e: Fix for allowing too many MDD events on VF
  i40e: change behavior on PF in response to MDD event

Harshitha Ramamurthy (1):
  i40e: fix misleading message about promisc setting on un-trusted VF

Jacob Keller (1):
  i40e: remove out-of-range comparisons in i40e_validate_cloud_filter

Maciej Paczkowski (1):
  i40e: ShadowRAM checksum calculation change

Stefan Assmann (1):
  i40e: print PCI vendor and device ID during probe

 drivers/net/ethernet/intel/i40e/i40e.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_adminq.c |   5 +
 .../net/ethernet/intel/i40e/i40e_adminq_cmd.h |  20 +-
 drivers/net/ethernet/intel/i40e/i40e_common.c |  62 +++-
 .../net/ethernet/intel/i40e/i40e_debugfs.c|   4 +-
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  28 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 341 +++---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c|  29 +-
 .../net/ethernet/intel/i40e/i40e_prototype.h  |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h   |   1 +
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c|  35 +-
 11 files changed, 451 insertions(+), 83 deletions(-)

-- 
2.20.1



[net-next v2 04/11] i40e: ShadowRAM checksum calculation change

2019-05-03 Thread Jeff Kirsher
From: Maciej Paczkowski 

Due to changes in the FW, the SW is required to perform a double SR dump in
some cases.

The implementation adds two new steps to the update NVM checksum function:
* recalculate the checksum and check whether the checksum in the NVM is correct
* if the checksum in the NVM is not correct, update it again

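For illustration, here is a minimal, compilable sketch of the new two-pass
flow. The helpers below are toy stand-ins for i40e_calc_nvm_checksum() and
i40e_write_nvm_aq(), not the driver code itself, and error handling is
reduced to a bare return value:

  #include <stdint.h>
  #include <stdio.h>

  static uint16_t shadow_checksum = 0x1234;  /* pretend shadow RAM checksum */
  static int writes;

  static uint16_t calc_checksum(void)        /* i40e_calc_nvm_checksum() stand-in */
  {
          return shadow_checksum;
  }

  static int write_checksum(uint16_t sum)    /* i40e_write_nvm_aq() stand-in */
  {
          printf("wrote checksum 0x%04x\n", sum);
          if (++writes == 1)
                  shadow_checksum = 0x5678;  /* SR dump changed shadow RAM */
          return 0;
  }

  static int update_nvm_checksum(void)
  {
          uint16_t checksum = calc_checksum();
          uint16_t checksum_sr;
          int ret = write_checksum(checksum);

          if (ret)
                  return ret;

          /* second pass: recalculate after the write and update again
           * only if the recalculated checksum differs
           */
          checksum_sr = calc_checksum();
          if (checksum_sr != checksum)
                  ret = write_checksum(checksum_sr);
          return ret;
  }

  int main(void)
  {
          return update_nvm_checksum();
  }
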
Signed-off-by: Maciej Paczkowski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 29 +++---
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 0299e5bbb902..ee89779a9a6f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -574,13 +574,34 @@ static i40e_status i40e_calc_nvm_checksum(struct i40e_hw 
*hw,
 i40e_status i40e_update_nvm_checksum(struct i40e_hw *hw)
 {
i40e_status ret_code;
-   u16 checksum;
+   u16 checksum, checksum_sr;
__le16 le_sum;
 
ret_code = i40e_calc_nvm_checksum(hw, &checksum);
-   if (!ret_code) {
-   le_sum = cpu_to_le16(checksum);
-   ret_code = i40e_write_nvm_aq(hw, 0x00, I40E_SR_SW_CHECKSUM_WORD,
+   if (ret_code)
+   return ret_code;
+
+   le_sum = cpu_to_le16(checksum);
+   ret_code = i40e_write_nvm_aq(hw, 0x00, I40E_SR_SW_CHECKSUM_WORD,
+1, &le_sum, true);
+   if (ret_code)
+   return ret_code;
+
+   /* Due to changes in FW the SW is required to perform double SR-dump
+* in some cases. SR-dump is the process when internal shadow RAM is
+* dumped into flash bank. It is triggered by setting "last_command"
+* argument in i40e_write_nvm_aq function call.
+* Since FW 1.8 we need to calculate SR checksum again and update it
+* in flash if it is not equal to previously computed checksum.
+* This situation would occur only in FW >= 1.8
+*/
+   ret_code = i40e_calc_nvm_checksum(hw, &checksum_sr);
+   if (ret_code)
+   return ret_code;
+   if (checksum_sr != checksum) {
+   le_sum = cpu_to_le16(checksum_sr);
+   ret_code = i40e_write_nvm_aq(hw, 0x00,
+I40E_SR_SW_CHECKSUM_WORD,
 1, &le_sum, true);
}
 
-- 
2.20.1



[net-next v2 05/11] i40e: Report advertised link modes on 40GBase_LR4, CR4 and fibre

2019-05-03 Thread Jeff Kirsher
From: Adam Ludkiewicz 

Add assignments for advertising 40GBase_LR4, 40GBase_CR4 and fibre.

Signed-off-by: Adam Ludkiewicz 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 9eaea1bee4a1..0d923c13c9a1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -541,9 +541,12 @@ static void i40e_phy_type_to_ethtool(struct i40e_pf *pf,
ethtool_link_ksettings_add_link_mode(ks, advertising,
 40000baseSR4_Full);
}
-   if (phy_types & I40E_CAP_PHY_TYPE_40GBASE_LR4)
+   if (phy_types & I40E_CAP_PHY_TYPE_40GBASE_LR4) {
ethtool_link_ksettings_add_link_mode(ks, supported,
 40000baseLR4_Full);
+   ethtool_link_ksettings_add_link_mode(ks, advertising,
+40000baseLR4_Full);
+   }
if (phy_types & I40E_CAP_PHY_TYPE_40GBASE_KR4) {
ethtool_link_ksettings_add_link_mode(ks, supported,
 40000baseLR4_Full);
@@ -723,6 +726,8 @@ static void i40e_get_settings_link_up(struct i40e_hw *hw,
case I40E_PHY_TYPE_40GBASE_AOC:
ethtool_link_ksettings_add_link_mode(ks, supported,
 40000baseCR4_Full);
+   ethtool_link_ksettings_add_link_mode(ks, advertising,
+40000baseCR4_Full);
break;
case I40E_PHY_TYPE_40GBASE_SR4:
ethtool_link_ksettings_add_link_mode(ks, supported,
@@ -733,6 +738,8 @@ static void i40e_get_settings_link_up(struct i40e_hw *hw,
case I40E_PHY_TYPE_40GBASE_LR4:
ethtool_link_ksettings_add_link_mode(ks, supported,
 40000baseLR4_Full);
+   ethtool_link_ksettings_add_link_mode(ks, advertising,
+40000baseLR4_Full);
break;
case I40E_PHY_TYPE_25GBASE_SR:
case I40E_PHY_TYPE_25GBASE_LR:
@@ -1038,6 +1045,7 @@ static int i40e_get_link_ksettings(struct net_device 
*netdev,
break;
case I40E_MEDIA_TYPE_FIBER:
ethtool_link_ksettings_add_link_mode(ks, supported, FIBRE);
+   ethtool_link_ksettings_add_link_mode(ks, advertising, FIBRE);
ks->base.port = PORT_FIBRE;
break;
case I40E_MEDIA_TYPE_UNKNOWN:
-- 
2.20.1



[net-next v2 08/11] i40e: update version number

2019-05-03 Thread Jeff Kirsher
From: Alice Michael 

Just bumping the version number appropriately.

Signed-off-by: Alice Michael 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 54c172c50479..9ea0556c8962 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -27,7 +27,7 @@ static const char i40e_driver_string[] =
 
 #define DRV_VERSION_MAJOR 2
 #define DRV_VERSION_MINOR 8
-#define DRV_VERSION_BUILD 10
+#define DRV_VERSION_BUILD 20
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD) DRV_KERN
-- 
2.20.1



[net-next v2 06/11] i40e: Further implementation of LLDP

2019-05-03 Thread Jeff Kirsher
From: Aleksandr Loktionov 

This code implements the driver changes necessary for LLDP agent
support. i40e_aq_start_lldp() and i40e_aq_stop_lldp() are modified to
take an additional boolean parameter indicating whether the LLDP state
should be persistent across power cycles (existing callers pass false).

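As a rough, compilable illustration of the persistent stop variant (the
command values come from the hunk below; lldp_stop_command() is a made-up
helper, not the driver's, and the real i40e_aq_stop_lldp() also checks the
I40E_HW_FLAG_FW_LLDP_PERSISTENT capability bit before using it):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define I40E_AQ_LLDP_AGENT_STOP          0x0
  #define I40E_AQ_LLDP_AGENT_SHUTDOWN      0x1
  #define I40E_AQ_LLDP_AGENT_STOP_PERSIST  0x2

  /* Pick the Stop LLDP command byte; "persist" selects the variant
   * that survives a power cycle.
   */
  static uint8_t lldp_stop_command(bool shutdown_agent, bool persist)
  {
          uint8_t cmd = I40E_AQ_LLDP_AGENT_STOP;

          if (shutdown_agent)
                  cmd |= I40E_AQ_LLDP_AGENT_SHUTDOWN;
          if (persist)
                  cmd |= I40E_AQ_LLDP_AGENT_STOP_PERSIST;
          return cmd;
  }

  int main(void)
  {
          printf("stop (volatile):   0x%x\n", lldp_stop_command(false, false));
          printf("stop (persistent): 0x%x\n", lldp_stop_command(false, true));
          return 0;
  }
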
Signed-off-by: Aleksandr Loktionov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.c |  5 ++
 .../net/ethernet/intel/i40e/i40e_adminq_cmd.h | 20 --
 drivers/net/ethernet/intel/i40e/i40e_common.c | 62 ++-
 .../net/ethernet/intel/i40e/i40e_debugfs.c|  4 +-
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  4 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 +-
 .../net/ethernet/intel/i40e/i40e_prototype.h  |  8 ++-
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  1 +
 8 files changed, 93 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.c 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
index 45f6adc8ff2f..243dcd4bec19 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
@@ -608,6 +608,11 @@ i40e_status i40e_init_adminq(struct i40e_hw *hw)
 hw->aq.api_min_ver >= 7))
hw->flags |= I40E_HW_FLAG_802_1AD_CAPABLE;
 
+   if (hw->aq.api_maj_ver > 1 ||
+   (hw->aq.api_maj_ver == 1 &&
+hw->aq.api_min_ver >= 8))
+   hw->flags |= I40E_HW_FLAG_FW_LLDP_PERSISTENT;
+
if (hw->aq.api_maj_ver > I40E_FW_API_VERSION_MAJOR) {
ret_code = I40E_ERR_FIRMWARE_API_VERSION;
goto init_adminq_free_arq;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 522058a7d4be..abcf79eb3261 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -261,6 +261,7 @@ enum i40e_admin_queue_opc {
i40e_aqc_opc_get_cee_dcb_cfg= 0x0A07,
i40e_aqc_opc_lldp_set_local_mib = 0x0A08,
i40e_aqc_opc_lldp_stop_start_spec_agent = 0x0A09,
+   i40e_aqc_opc_lldp_restore   = 0x0A0A,
 
/* Tunnel commands */
i40e_aqc_opc_add_udp_tunnel = 0x0B00,
@@ -2498,18 +2499,19 @@ I40E_CHECK_CMD_LENGTH(i40e_aqc_lldp_update_tlv);
 /* Stop LLDP (direct 0x0A05) */
 struct i40e_aqc_lldp_stop {
u8  command;
-#define I40E_AQ_LLDP_AGENT_STOP   0x0
-#define I40E_AQ_LLDP_AGENT_SHUTDOWN   0x1
+#define I40E_AQ_LLDP_AGENT_STOP   0x0
+#define I40E_AQ_LLDP_AGENT_SHUTDOWN   0x1
+#define I40E_AQ_LLDP_AGENT_STOP_PERSIST   0x2
u8  reserved[15];
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_lldp_stop);
 
 /* Start LLDP (direct 0x0A06) */
-
 struct i40e_aqc_lldp_start {
u8  command;
-#define I40E_AQ_LLDP_AGENT_START   0x1
+#define I40E_AQ_LLDP_AGENT_START   0x1
+#define I40E_AQ_LLDP_AGENT_START_PERSIST   0x2
u8  reserved[15];
 };
 
@@ -2633,6 +2635,16 @@ struct i40e_aqc_lldp_stop_start_specific_agent {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_lldp_stop_start_specific_agent);
 
+/* Restore LLDP Agent factory settings (direct 0x0A0A) */
+struct i40e_aqc_lldp_restore {
+   u8  command;
+#define I40E_AQ_LLDP_AGENT_RESTORE_NOT 0x0
+#define I40E_AQ_LLDP_AGENT_RESTORE 0x1
+   u8  reserved[15];
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_lldp_restore);
+
 /* Add Udp Tunnel command and completion (direct 0x0B00) */
 struct i40e_aqc_add_udp_tunnel {
__le16  udp_port;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index dd6b3b3ac5c6..e7d500f92a90 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -3623,15 +3623,55 @@ i40e_status i40e_aq_cfg_lldp_mib_change_event(struct 
i40e_hw *hw,
return status;
 }
 
+/**
+ * i40e_aq_restore_lldp
+ * @hw: pointer to the hw struct
+ * @setting: pointer to factory setting variable or NULL
+ * @restore: True if factory settings should be restored
+ * @cmd_details: pointer to command details structure or NULL
+ *
+ * Restore LLDP Agent factory settings if @restore set to True. In other case
+ * only returns factory setting in AQ response.
+ **/
+enum i40e_status_code
+i40e_aq_restore_lldp(struct i40e_hw *hw, u8 *setting, bool restore,
+struct i40e_asq_cmd_details *cmd_details)
+{
+   struct i40e_aq_desc desc;
+   struct i40e_aqc_lldp_restore *cmd =
+   (struct i40e_aqc_lldp_restore *)&desc.params.raw;
+   i40e_status status;
+
+   if (!(hw->flags & I40E_HW_FLAG_FW_LLDP_PERSISTENT)) {
+   i40e_debug(hw, I40E_DEBUG_ALL,
+  "Restore LLDP not supported by current FW 
version.\n");
+   return I40E_ERR_DEVICE_NOT_SUPPORTED;
+   }
+
+   i40e_fill_default_direct_cmd_desc(&desc, i40e_aqc_opc_lldp_restor

[net-next v2 03/11] i40e: remove error msg when vf with port vlan tries to remove vlan 0

2019-05-03 Thread Jeff Kirsher
From: Aleksandr Loktionov 

A VF's attempt to delete VLAN 0 when a port VLAN is configured is harmless;
in this case the PF driver simply does nothing.  If the VF tries to remove
VLANs other than VLAN 0 while a port VLAN is configured, it still produces
an error as before.

Signed-off-by: Aleksandr Loktionov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 71cd159e7902..24628de8e624 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2766,7 +2766,8 @@ static int i40e_vc_remove_vlan_msg(struct i40e_vf *vf, u8 
*msg)
 
vsi = pf->vsi[vf->lan_vsi_idx];
if (vsi->info.pvid) {
-   aq_ret = I40E_ERR_PARAM;
+   if (vfl->num_elements > 1 || vfl->vlan_id[0])
+   aq_ret = I40E_ERR_PARAM;
goto error_param;
}
 
-- 
2.20.1



[net-next v2 07/11] i40e: remove out-of-range comparisons in i40e_validate_cloud_filter

2019-05-03 Thread Jeff Kirsher
From: Jacob Keller 

The function i40e_validate_cloud_filter checks that the destination and
source port numbers are valid by attempting to ensure that each number is
non-zero and no larger than 0xFFFF. However, the dst_port and src_port
variables are of type __be16, which by definition cannot hold a value
larger than 0xFFFF.

Since these values cannot be wider than 2 bytes, the check to see whether
they exceed 0xFFFF is meaningless.

One might consider these checks a form of defensive coding, in case the
type were later changed. However, these checks also byte-swap the value
before comparison using be16_to_cpu, which truncates the value to 16 bits
anyway. Additionally, changing the type would require updating the opcodes
to support a new data layout for these virtchnl commands.

Remove the check to silence the -Wtype-limits warning produced by GCC 8.

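A tiny user-space sketch of why the comparison can never be true; uint16_t
stands in for __be16 here, and gcc -Wextra (which enables -Wtype-limits)
flags exactly this kind of always-false check:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint16_t port = 0xFFFF;    /* largest value a 16-bit field can hold */

          if (port > 0xFFFF)         /* always false for a 16-bit type */
                  puts("unreachable");
          else
                  puts("a 16-bit port can never exceed 0xFFFF");
          return 0;
  }
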
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 24628de8e624..925ca880bea3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -3129,7 +3129,7 @@ static int i40e_validate_cloud_filter(struct i40e_vf *vf,
}
 
if (mask.dst_port & data.dst_port) {
-   if (!data.dst_port || be16_to_cpu(data.dst_port) > 0xFFFF) {
+   if (!data.dst_port) {
dev_info(&pf->pdev->dev, "VF %d: Invalid Dest port\n",
 vf->vf_id);
goto err;
@@ -3137,7 +3137,7 @@ static int i40e_validate_cloud_filter(struct i40e_vf *vf,
}
 
if (mask.src_port & data.src_port) {
-   if (!data.src_port || be16_to_cpu(data.src_port) > 0xFFFF) {
+   if (!data.src_port) {
dev_info(&pf->pdev->dev, "VF %d: Invalid Source port\n",
 vf->vf_id);
goto err;
-- 
2.20.1



[net-next v2 11/11] i40e: Introduce recovery mode support

2019-05-03 Thread Jeff Kirsher
From: Alice Michael 

This patch introduces "recovery mode" to the i40e driver. It is
part of a new Any2Any idea of upgrading the firmware. In this
approach, it is required for the driver to have support for
"transition firmware", that is used for migrating from structured
to flat firmware image. In this new, very basic mode, i40e driver
must be able to handle particular IOCTL calls from the NVM Update
Tool and run a small set of AQ commands.

These additional AQ commands are part of the interface used by
the NVMUpdate tool, which contains all of the logic needed to
issue them.  The end-user experience remains the same: the
NVMUpdate tool is still used to update the NVM contents.

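To sketch the idea with illustrative names only (struct toy_ops below is a
stand-in, not an i40e structure): while the recovery-mode state bit is set,
the driver exposes a reduced ethtool op set so that the NVMUpdate tool's
EEPROM accesses keep working while everything else stays disabled; the real
selection is done in i40e_set_ethtool_ops() in the diff below:

  #include <stdbool.h>
  #include <stdio.h>

  struct toy_ops { const char *desc; };

  static const struct toy_ops full_ops     = { "full ethtool op set" };
  static const struct toy_ops recovery_ops = { "get/set_eeprom and get_eeprom_len only" };

  /* Mirrors the test_bit(__I40E_RECOVERY_MODE, pf->state) selection
   * made in i40e_set_ethtool_ops().
   */
  static const struct toy_ops *pick_ops(bool recovery_mode)
  {
          return recovery_mode ? &recovery_ops : &full_ops;
  }

  int main(void)
  {
          printf("recovery mode: %s\n", pick_ops(true)->desc);
          printf("normal mode:   %s\n", pick_ops(false)->desc);
          return 0;
  }
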
Signed-off-by: Alice Michael 
Signed-off-by: Piotr Marczak 
Tested-by: Don Buchholz 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h|   1 +
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  14 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 310 --
 3 files changed, 294 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index c4afb852cb57..7ce42040b851 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -149,6 +149,7 @@ enum i40e_state_t {
__I40E_CLIENT_L2_CHANGE,
__I40E_CLIENT_RESET,
__I40E_VIRTCHNL_OP_PENDING,
+   __I40E_RECOVERY_MODE,
/* This must be last as it determines the size of the BITMAP */
__I40E_STATE_SIZE__,
 };
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 32e137499063..2c81afbd7c58 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -5141,6 +5141,12 @@ static int i40e_get_module_eeprom(struct net_device 
*netdev,
return 0;
 }
 
+static const struct ethtool_ops i40e_ethtool_recovery_mode_ops = {
+   .set_eeprom = i40e_set_eeprom,
+   .get_eeprom_len = i40e_get_eeprom_len,
+   .get_eeprom = i40e_get_eeprom,
+};
+
 static const struct ethtool_ops i40e_ethtool_ops = {
.get_drvinfo= i40e_get_drvinfo,
.get_regs_len   = i40e_get_regs_len,
@@ -5189,5 +5195,11 @@ static const struct ethtool_ops i40e_ethtool_ops = {
 
 void i40e_set_ethtool_ops(struct net_device *netdev)
 {
-   netdev->ethtool_ops = &i40e_ethtool_ops;
+   struct i40e_netdev_priv *np = netdev_priv(netdev);
+   struct i40e_pf  *pf = np->vsi->back;
+
+   if (!test_bit(__I40E_RECOVERY_MODE, pf->state))
+   netdev->ethtool_ops = &i40e_ethtool_ops;
+   else
+   netdev->ethtool_ops = &i40e_ethtool_recovery_mode_ops;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c2673d2cef8e..fa1b2cfd359e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -46,6 +46,10 @@ static int i40e_setup_pf_filter_control(struct i40e_pf *pf);
 static void i40e_prep_for_reset(struct i40e_pf *pf, bool lock_acquired);
 static int i40e_reset(struct i40e_pf *pf);
 static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
+static int i40e_setup_misc_vector_for_recovery_mode(struct i40e_pf *pf);
+static int i40e_restore_interrupt_scheme(struct i40e_pf *pf);
+static bool i40e_check_recovery_mode(struct i40e_pf *pf);
+static int i40e_init_recovery_mode(struct i40e_pf *pf, struct i40e_hw *hw);
 static void i40e_fdir_sb_setup(struct i40e_pf *pf);
 static int i40e_veb_get_bw_info(struct i40e_veb *veb);
 static int i40e_get_capabilities(struct i40e_pf *pf,
@@ -278,8 +282,9 @@ struct i40e_vsi *i40e_find_vsi_from_id(struct i40e_pf *pf, 
u16 id)
  **/
 void i40e_service_event_schedule(struct i40e_pf *pf)
 {
-   if (!test_bit(__I40E_DOWN, pf->state) &&
-   !test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state))
+   if ((!test_bit(__I40E_DOWN, pf->state) &&
+!test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state)) ||
+ test_bit(__I40E_RECOVERY_MODE, pf->state))
queue_work(i40e_wq, &pf->service_task);
 }
 
@@ -4019,7 +4024,8 @@ static irqreturn_t i40e_intr(int irq, void *data)
 enable_intr:
/* re-enable interrupt causes */
wr32(hw, I40E_PFINT_ICR0_ENA, ena_mask);
-   if (!test_bit(__I40E_DOWN, pf->state)) {
+   if (!test_bit(__I40E_DOWN, pf->state) ||
+   test_bit(__I40E_RECOVERY_MODE, pf->state)) {
i40e_service_event_schedule(pf);
i40e_irq_dynamic_enable_icr0(pf);
}
@@ -9409,6 +9415,7 @@ static int i40e_reset(struct i40e_pf *pf)
  **/
 static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
 {
+   int old_recovery_mode_bit = test_bit(__I40E_RECOVERY_MODE, pf->state);
struct i40e_vsi *vsi = pf->

[net-next v2 02/11] i40e: change behavior on PF in response to MDD event

2019-05-03 Thread Jeff Kirsher
From: Carolyn Wyborny 

Tx MDD events reported on the PF are the result of the PF
misconfiguring a descriptor, not of "bad actions" by anything
else.  There is no need to reset in this case; if the event
results in a Tx hang, the Tx hang check will take care of it.

Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b52a9d5644b8..3e15df1d5f52 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9696,7 +9696,6 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
 {
struct i40e_hw *hw = &pf->hw;
bool mdd_detected = false;
-   bool pf_mdd_detected = false;
struct i40e_vf *vf;
u32 reg;
int i;
@@ -9742,19 +9741,12 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
reg = rd32(hw, I40E_PF_MDET_TX);
if (reg & I40E_PF_MDET_TX_VALID_MASK) {
wr32(hw, I40E_PF_MDET_TX, 0xFFFF);
-   dev_info(&pf->pdev->dev, "TX driver issue detected, PF 
reset issued\n");
-   pf_mdd_detected = true;
+   dev_dbg(&pf->pdev->dev, "TX driver issue detected on 
PF\n");
}
reg = rd32(hw, I40E_PF_MDET_RX);
if (reg & I40E_PF_MDET_RX_VALID_MASK) {
wr32(hw, I40E_PF_MDET_RX, 0xFFFF);
-   dev_info(&pf->pdev->dev, "RX driver issue detected, PF 
reset issued\n");
-   pf_mdd_detected = true;
-   }
-   /* Queue belongs to the PF, initiate a reset */
-   if (pf_mdd_detected) {
-   set_bit(__I40E_PF_RESET_REQUESTED, pf->state);
-   i40e_service_event_schedule(pf);
+   dev_dbg(&pf->pdev->dev, "RX driver issue detected on 
PF\n");
}
}
 
-- 
2.20.1


