Re: [PATCH] ptp_clock: future-proofing drivers against PTP subsystem becoming optional

2016-09-21 Thread Edward Cree
On 21/09/16 00:25, Nicolas Pitre wrote:
> Drivers must be ready to accept NULL from ptp_clock_register() if the
> PTP clock subsystem is configured out.
>
> This patch documents that and ensures that all drivers cope well
> with a NULL return.
>
> Signed-off-by: Nicolas Pitre 
> Reviewed-by: Eugenia Emantayev 
[...]
> diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
> index c771e0af4e..f105a170b4 100644
> --- a/drivers/net/ethernet/sfc/ptp.c
> +++ b/drivers/net/ethernet/sfc/ptp.c
> @@ -1269,13 +1269,13 @@ int efx_ptp_probe(struct efx_nic *efx, struct efx_channel *channel)
>   if (IS_ERR(ptp->phc_clock)) {
>   rc = PTR_ERR(ptp->phc_clock);
>   goto fail3;
> - }
> -
> - INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
> - ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
> - if (!ptp->pps_workwq) {
> - rc = -ENOMEM;
> - goto fail4;
> + } else if (ptp->phc_clock) {
> + INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
> + ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
> + if (!ptp->pps_workwq) {
> + rc = -ENOMEM;
> + goto fail4;
> + }
>   }
>   }
>   ptp->nic_ts_enabled = false;
For the sfc change:
Acked-by: Edward Cree 


Re: [PATCH net-next 1/2] docs: Add rest label the_canonical_path_format

2018-07-25 Thread Edward Cree
On 25/07/18 03:50, Tobin C. Harding wrote:
> In preparation to convert Documentation/network/netdev-FAQ.rst to
> restructured text format we would like to be able to reference 'the
> canonical patch format' section.
>
> Add rest label: 'the_canonical_path_format'.
Here and in the Subject, 'patch' is typoed as 'path'.


Re: [PATCH] bpf, doc: Correct one wrong value in "Register value tracking"

2018-01-24 Thread Edward Cree
On 24/01/18 07:48, Wang YanQing wrote:
> If we then OR this with 0x40, then the value of the 6th bit (0th is the first
> bit) becomes known, so the right mask is 0xbf instead of 0xcf.
>
> Signed-off-by: Wang YanQing 
> ---
>  Documentation/networking/filter.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index 8781485..a4508ec 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -1134,7 +1134,7 @@ The verifier's knowledge about the variable offset consists of:
>  mask and value; no bit should ever be 1 in both.  For example, if a byte is read
>  into a register from memory, the register's top 56 bits are known zero, while
>  the low 8 are unknown - which is represented as the tnum (0x0; 0xff).  If we
> -then OR this with 0x40, we get (0x40; 0xcf), then if we add 1 we get (0x0;
> +then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
>  0x1ff), because of potential carries.
>  Besides arithmetic, the register state can also be updated by conditional
>  branches.  For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch

Acked-by: Edward Cree 



Re: [RFC PATCH net-next] sfc: efx_default_channel_want_txqs() can be static

2018-01-26 Thread Edward Cree
On 26/01/18 01:03, kbuild test robot wrote:
> Fixes: 2935e3c38228 ("sfc: on 8000 series use TX queues for TX timestamps")
> Signed-off-by: Fengguang Wu 
Acked-by: Edward Cree 

Dave, can you take this directly or do you need it reposted without RFC tags?  
I'm not sure what the procedure is for robopatches.
> ---
>  efx.c |2 +-
>  ptp.c |4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 456866b0..16757cf 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -896,7 +896,7 @@ void efx_schedule_slow_fill(struct efx_rx_queue *rx_queue)
>   mod_timer(&rx_queue->slow_fill, jiffies + msecs_to_jiffies(100));
>  }
>  
> -bool efx_default_channel_want_txqs(struct efx_channel *channel)
> +static bool efx_default_channel_want_txqs(struct efx_channel *channel)
>  {
>   return channel->channel - channel->efx->tx_channel_offset <
>   channel->efx->n_tx_channels;
> diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
> index 433d29d..3e2c5b1 100644
> --- a/drivers/net/ethernet/sfc/ptp.c
> +++ b/drivers/net/ethernet/sfc/ptp.c
> @@ -366,7 +366,7 @@ bool efx_ptp_use_mac_tx_timestamps(struct efx_nic *efx)
>  /* PTP 'extra' channel is still a traffic channel, but we only create TX queues
>   * if PTP uses MAC TX timestamps, not if PTP uses the MC directly to transmit.
>   */
> -bool efx_ptp_want_txqs(struct efx_channel *channel)
> +static bool efx_ptp_want_txqs(struct efx_channel *channel)
>  {
>   return efx_ptp_use_mac_tx_timestamps(channel->efx);
>  }
> @@ -2146,7 +2146,7 @@ static int efx_phc_enable(struct ptp_clock_info *ptp,
>   return 0;
>  }
>  
> -const struct efx_channel_type efx_ptp_channel_type = {
> +static const struct efx_channel_type efx_ptp_channel_type = {
>   .handle_no_channel  = efx_ptp_handle_no_channel,
>   .pre_probe  = efx_ptp_probe_channel,
>   .post_remove= efx_ptp_remove_channel,




[PATCH net-next] sfc: mark some unexported symbols as static

2018-01-26 Thread Edward Cree
From: kbuild test robot 

efx_default_channel_want_txqs() is only used in efx.c, while
 efx_ptp_want_txqs() and efx_ptp_channel_type (a struct) are only used
 in ptp.c.  In all cases these symbols should be static.

Fixes: 2935e3c38228 ("sfc: on 8000 series use TX queues for TX timestamps")
Signed-off-by: Fengguang Wu 
[ec...@solarflare.com: rewrote commit message]
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/efx.c | 2 +-
 drivers/net/ethernet/sfc/ptp.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 456866b05641..16757cfc5b29 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -896,7 +896,7 @@ void efx_schedule_slow_fill(struct efx_rx_queue *rx_queue)
mod_timer(&rx_queue->slow_fill, jiffies + msecs_to_jiffies(100));
 }
 
-bool efx_default_channel_want_txqs(struct efx_channel *channel)
+static bool efx_default_channel_want_txqs(struct efx_channel *channel)
 {
return channel->channel - channel->efx->tx_channel_offset <
channel->efx->n_tx_channels;
diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index 433d29d6bc95..3e2c5b11b5ef 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -366,7 +366,7 @@ bool efx_ptp_use_mac_tx_timestamps(struct efx_nic *efx)
 /* PTP 'extra' channel is still a traffic channel, but we only create TX queues
  * if PTP uses MAC TX timestamps, not if PTP uses the MC directly to transmit.
  */
-bool efx_ptp_want_txqs(struct efx_channel *channel)
+static bool efx_ptp_want_txqs(struct efx_channel *channel)
 {
return efx_ptp_use_mac_tx_timestamps(channel->efx);
 }
@@ -2146,7 +2146,7 @@ static int efx_phc_enable(struct ptp_clock_info *ptp,
return 0;
 }
 
-const struct efx_channel_type efx_ptp_channel_type = {
+static const struct efx_channel_type efx_ptp_channel_type = {
.handle_no_channel  = efx_ptp_handle_no_channel,
.pre_probe  = efx_ptp_probe_channel,
.post_remove= efx_ptp_remove_channel,


Re: [PATCH] kernel:bpf Remove structure passing and assignment to save stack and no coping structures

2018-01-16 Thread Edward Cree
On 13/01/18 22:03, Karim Eshapa wrote:
> Use pointers to structures as arguments to functions instead of copying
> structures, to use less stack. Also move TNUM(_v, _m) to the
> tnum.h file to be used in different files for creating anonymous structures
> statically.
>
> Signed-off-by: Karim Eshapa 
NACK (some reasons inline).
> Thanks,
> Karim
> ---
>  include/linux/tnum.h  |  4 +++-
>  kernel/bpf/tnum.c | 14 +++---
>  kernel/bpf/verifier.c | 11 ++-
>  3 files changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/tnum.h b/include/linux/tnum.h
> index 0d2d3da..72938a0 100644
> --- a/include/linux/tnum.h
> +++ b/include/linux/tnum.h
> @@ -13,6 +13,8 @@ struct tnum {
>  };
>  
>  /* Constructors */
> +/* Statically tnum constant */
> +#define TNUM(_v, _m) (struct tnum){.value = _v, .mask = _m}
This shouldn't be in the 'public' API, because it's dealing in the internals
 of the tnum struct in ways that calling code shouldn't have to worry about.
Instead, callers should use functions like tnum_const() to construct tnums.
>  /* Represent a known constant as a tnum. */
>  struct tnum tnum_const(u64 value);
>  /* A completely unknown value */
> @@ -26,7 +28,7 @@ struct tnum tnum_lshift(struct tnum a, u8 shift);
>  /* Shift a tnum right (by a fixed shift) */
>  struct tnum tnum_rshift(struct tnum a, u8 shift);
>  /* Add two tnums, return @a + @b */
> -struct tnum tnum_add(struct tnum a, struct tnum b);
> +void tnum_add(struct tnum *res, struct tnum *a, struct tnum *b);
I would expect the old tnum_add to be inlined by the compiler.  Moreover,
 the arguments and return value are clearly separate, whereas in your new
 version they could (and often will) alias, thus the function body has to
 be careful not to write the result until it has finished reading the args.
I wouldn't be surprised if your versions actually _increased_ total stack
 usage by confusing the compiler's inliner and liveness analysis.
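To make the aliasing point concrete, compare the two shapes side by side
 (a standalone userspace sketch, not the kernel code; the struct layout is
 copied from include/linux/tnum.h):

#include <stdint.h>

struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Pass-by-value: a and b are private copies, so writes to the result can
 * never be observed through the inputs - no aliasing is possible, and the
 * compiler is free to inline this and keep everything in registers.
 */
static struct tnum tnum_add_byval(struct tnum a, struct tnum b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;
	uint64_t mu = chi | a.mask | b.mask;

	return (struct tnum){ .value = sv & ~mu, .mask = mu };
}

/* Pointer version: a call like tnum_add_byref(&acc, &acc, &x) - exactly
 * what the patch does in hma() - makes res alias a, so the body must
 * finish reading a->value and a->mask before writing through res.
 */
static void tnum_add_byref(struct tnum *res, const struct tnum *a,
			   const struct tnum *b)
{
	uint64_t sm = a->mask + b->mask;
	uint64_t sv = a->value + b->value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;
	uint64_t mu = chi | a->mask | b->mask;

	res->value = sv & ~mu;	/* all reads of *a and *b are done above */
	res->mask = mu;
}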
>  /* Subtract two tnums, return @a - @b */
>  struct tnum tnum_sub(struct tnum a, struct tnum b);
>  /* Bitwise-AND, return @a & @b */
> diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
> index 1f4bf68..89e3182 100644
> --- a/kernel/bpf/tnum.c
> +++ b/kernel/bpf/tnum.c
> @@ -8,7 +8,6 @@
>  #include 
>  #include 
>  
> -#define TNUM(_v, _m) (struct tnum){.value = _v, .mask = _m}
>  /* A completely unknown value */
>  const struct tnum tnum_unknown = { .value = 0, .mask = -1 };
>  
> @@ -43,16 +42,17 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
>   return TNUM(a.value >> shift, a.mask >> shift);
>  }
>  
> -struct tnum tnum_add(struct tnum a, struct tnum b)
> +void tnum_add(struct tnum *res, struct tnum *a, struct tnum *b)
>  {
>   u64 sm, sv, sigma, chi, mu;
>  
> - sm = a.mask + b.mask;
> - sv = a.value + b.value;
> + sm = a->mask + b->mask;
> + sv = a->value + b->value;
>   sigma = sm + sv;
>   chi = sigma ^ sv;
> - mu = chi | a.mask | b.mask;
> - return TNUM(sv & ~mu, mu);
> + mu = chi | a->mask | b->mask;
> + res->value = (sv & ~mu);
> + res->mask = mu;
>  }
>  
>  struct tnum tnum_sub(struct tnum a, struct tnum b)
> @@ -102,7 +102,7 @@ static struct tnum hma(struct tnum acc, u64 value, u64 mask)
>  {
>   while (mask) {
>   if (mask & 1)
> - acc = tnum_add(acc, TNUM(0, value));
> + tnum_add(&acc, &acc, &TNUM(0, value));
This is much less readable than the original, since instead of using the
 assignment operator, the destination is just the first argument - not
 nearly as self-documenting.

-Ed


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-12 Thread Edward Cree
On 09/03/18 18:58, Alexei Starovoitov wrote:
> It's not waiting for the whole thing, because once bpfilter starts it
> stays running/sleeping because it's stateful.
So, this has been bugging me a bit.
If bpfilter takes a signal and crashes, all that state goes away.
Does that mean your iptables/netfilter config just got forgotten and next
 time you run iptables it disappears, so you have to re-apply it all again?
> It needs normal
> malloc-ed memory to keep the state of iptable->bpf translation that
> it will use later during subsequent translation calls.
> Theoretically it can use bpf maps pinned in kernel memory to keep
> this state, but then it's non-swappable. It's better to keep bpfilter
> state in its own user memory.
Perhaps the state should live in swappable kernel memory (e.g. a tmpfs
 thing, which bpfilter could access through a mount).  It'd be read-only
 to userspace, listing the existing rules (in untranslated form), and be
 updated to reflect the new rule after bpfilter has supplied the updated
 translation.
Then bpfilter can cache things if it wants, but the kernel remains the
 ultimate arbiter of the state and maintains it over a bpfilter crash.

Sound reasonable?

-Ed


[PATCH bpf] bpf: fix off-by-one error in adjust_subprog_starts

2018-11-16 Thread Edward Cree
When patching in a new sequence for the first insn of a subprog, the start
 of that subprog does not change (it's the first insn of the sequence), so
 adjust_subprog_starts should check start <= off (rather than < off).
Also added a test to test_verifier.c (it's essentially the syz reproducer).

Fixes: cc8b0b92a169 ("bpf: introduce function calls (function boundaries)")
Reported-by: syzbot+4fc427c7af994b094...@syzkaller.appspotmail.com
Signed-off-by: Edward Cree 
---
I'm assuming I don't need to get a Signed-off-by from syzkaller to use its
 reproducer like this; I'm not an expert on the copyright niceties of works
 written by bots.

 kernel/bpf/verifier.c   |  2 +-
 tools/testing/selftests/bpf/test_verifier.c | 19 +++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1971ca325fb4..6dd419550aba 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5650,7 +5650,7 @@ static void adjust_subprog_starts(struct bpf_verifier_env *env, u32 off, u32 len
return;
/* NOTE: fake 'exit' subprog should be updated as well. */
for (i = 0; i <= env->subprog_cnt; i++) {
-   if (env->subprog_info[i].start < off)
+   if (env->subprog_info[i].start <= off)
continue;
env->subprog_info[i].start += len - 1;
}
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 6f61df62f690..550b7e46bf4a 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -13896,6 +13896,25 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
},
+   {
+   "calls: ctx read at start of subprog",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 5),
+   BPF_JMP_REG(BPF_JSGT, BPF_REG_0, BPF_REG_0, 0),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
+   .errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+   .result_unpriv = REJECT,
+   .result = ACCEPT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)


Re: [PATCH v2 0/4] Static calls

2018-12-07 Thread Edward Cree
Sorry if this has been pointed out before (it's a very long thread), but
 in the out-of-line implementation, it appears that static_call_update()
 never alters key->func.  Am I right in thinking that this should be
 fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
 arch_static_call_transform() on line 159 of include/linux/static_call.h?

Some background (why does key->func matter for the
 CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
 combining these static calls with the 'indirect call wrappers' notion
 that Paolo Abeni has been working on [1], using runtime instrumentation
 to determine a list of potential callees.  (This allows us to cope with
 cases where the callees are in modules, or where different workloads may
 use different sets of callees for a given call site, neither of which is
 handled by Paolo's approach).
The core of my design looks something like:

static int dynamic_call_xyz(int (*func)(some_args), some_args)
{
if (func == dynamic_call_xyz_1.func)
return static_call(dynamic_call_xyz_1, some_args);
if (func == dynamic_call_xyz_2.func)
return static_call(dynamic_call_xyz_2, some_args);
return (*func)(some_args);
}

albeit with a bunch of extra (and currently rather ugly) stuff to collect
 the statistics needed to decide what to put in the static call keys, and
 mechanisms (RCU in my current case) to ensure that the static call isn't
 changed between checking its .func and actually calling it.

-Ed

PS: not on list, please keep me in CC.

[1] https://lwn.net/Articles/773985/


Re: [PATCH v2 0/4] Static calls

2018-12-07 Thread Edward Cree
On 07/12/18 16:06, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
>  in the out-of-line implementation, it appears that static_call_update()
>  never alters key->func.  Am I right in thinking that this should be
>  fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>  arch_static_call_transform() on line 159 of include/linux/static_call.h?
On further examination, it's worse than that.

Why does the CONFIG_HAVE_STATIC_CALL_OUTLINE static_call_update() not
 call __static_call_update()?  It contains nothing but a BUILD_BUG_ON,
 which isn't likely to update anything.

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 05:59, Nadav Amit wrote:
> Thanks for cc’ing me. (I didn’t know about the other patch-sets.)
Well in my case, that's because I haven't posted any yet.  (Will follow up
 shortly with what I currently have, though it's not pretty.)

Looking at your patches, it seems you've got a much more developed learning
 mechanism.  Mine on the other hand is brutally simple but runs continuously
 (i.e. after we patch we immediately enter the next 'relearning' phase);
 since it never does anything but prod a handful of percpu variables, this
 shouldn't be too costly.

Also, you've got the macrology for making all indirect calls use this,
 whereas at present I just have an open-coded instance on a single call site
 (I went with deliver_skb in the networking stack).

So I think where we probably want to go from here is:
 1) get Josh's static_calls in.  AIUI Linus seems to prefer the out-of-line
    approach; I'd say ditch the inline version (at least for now).
 2) build a relpolines patch series that uses
   i) static_calls for the text-patching part
  ii) as much of Nadav's macrology as is applicable
 iii) either my or Nadav's learning mechanism; we can experiment with both,
  bikeshed it incessantly etc.

Seem reasonable?

-Ed


[RFC/WIP PATCH 0/2] dynamic calls

2018-12-12 Thread Edward Cree
A fix to the static_calls series (on which this series depends), and a really
 hacky proof-of-concept of runtime-patched branch trees of static_calls to
 avoid indirect calls / retpolines in the hot-path.  Rather than any generally
 applicable machinery, the patch just open-codes it for one call site (the
 pt_prev->func() call in deliver_skb and __netif_receive_skb_one_core()); it
 should however be possible to make a macro that takes a 'name' parameter and
 expands to the whole thing.  Also the _update() function could be shared and
 get something useful from its work_struct, rather than needing a separate
 copy of the function for every indirect call site.

Performance testing so far has been somewhat inconclusive; I applied this on
 net-next, hacked up my Kconfig to use out-of-line static calls on x86-64, and
 ran some 1-byte UDP stream tests with the DUT receiving.
On a single stream test, I saw packet rate go up by 7%, from 470Kpps to
 504Kpps, with a considerable reduction in variance; however, CPU usage
 increased by a larger factor: (packet rate / RX cpu) is a much lower-variance
 measurement and went down by 13%.  This however may be because it often got
 into a state where, while patching the calls (and thus sending all callers
 down the slow path) we continue to gather stats and see enough calls to
 trigger another update; as there's no code to detect and skip an update that
 doesn't change anything, we get into a tight loop of redoing updates.  I am
 working on this & plan to change it to not collect any stats while an update
 is actually in progress.
On a 4-stream test, the variance I saw was too high to draw any conclusions;
 the packet rate went down about 2½% but this was not statistically
 significant (and the fastest run I saw was with dynamic calls present).

Edward Cree (2):
  static_call: fix out-of-line static call implementation
  net: core: rather hacky PoC implementation of dynamic calls

 include/linux/static_call.h |   6 +-
 net/core/dev.c  | 222 +++-
 2 files changed, 221 insertions(+), 7 deletions(-)



[RFC PATCH 1/2] static_call: fix out-of-line static call implementation

2018-12-12 Thread Edward Cree
Actually call __static_call_update() from static_call_update(), and fix the
 former so it can actually compile.  Also make it update key.func.

Signed-off-by: Edward Cree 
---
 include/linux/static_call.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 6daff586c97d..38d6c1e4c85d 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -153,16 +153,18 @@ struct static_call_key {
 
 #define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
 
-#define __static_call_update(key, func)   \
+#define __static_call_update(key, _func)   \
 ({ \
cpus_read_lock();   \
-   arch_static_call_transform(NULL, key->tramp, func); \
+   arch_static_call_transform(NULL, key.tramp, _func); \
+   WRITE_ONCE(key.func, _func);\
cpus_read_unlock(); \
 })
 
 #define static_call_update(key, func)  \
 ({ \
BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key)));   \
+   __static_call_update(key, func);\
 })
 
 #define EXPORT_STATIC_CALL(key)   \



[RFC PATCH 2/2] net: core: rather hacky PoC implementation of dynamic calls

2018-12-12 Thread Edward Cree
Uses runtime instrumentation of callees from an indirect call site
 (deliver_skb, and also __netif_receive_skb_one_core()) to populate an
 indirect-call-wrapper branch tree.  Essentially we're doing indirect
 branch prediction in software because the hardware can't be trusted to
 get it right; this is sad.

It's also full of printk()s right now to display what it's doing for
 debugging purposes; obviously those wouldn't be quite the same in a
 finished version.

Signed-off-by: Edward Cree 
---
 net/core/dev.c | 222 +++--
 1 file changed, 217 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 04a6b7100aac..f69c110c34e3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -1935,14 +1936,223 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
-static inline int deliver_skb(struct sk_buff *skb,
- struct packet_type *pt_prev,
- struct net_device *orig_dev)
+static void deliver_skb_update(struct work_struct *unused);
+
+static DECLARE_WORK(deliver_skb_update_work, deliver_skb_update);
+
+typedef int (*deliver_skb_func)(struct sk_buff *, struct net_device *, struct packet_type *, struct net_device *);
+
+struct deliver_skb_candidate {
+   deliver_skb_func func;
+   unsigned long hit_count;
+};
+
+static DEFINE_PER_CPU(struct deliver_skb_candidate[4], deliver_skb_candidates);
+
+static DEFINE_PER_CPU(unsigned long, deliver_skb_miss_count);
+
+/* Used to route around the dynamic version when we're changing it, as well as
+ * as a fallback if none of our static calls match.
+ */
+static int do_deliver_skb(struct sk_buff *skb,
+ struct packet_type *pt_prev,
+ struct net_device *orig_dev)
+{
+   struct deliver_skb_candidate *cands = *this_cpu_ptr(&deliver_skb_candidates);
+   deliver_skb_func func = pt_prev->func;
+   unsigned long total_count;
+   int i;
+
+   for (i = 0; i < 4; i++)
+   if (func == cands[i].func) {
+   cands[i].hit_count++;
+   break;
+   }
+   if (i == 4) /* no match */
+   for (i = 0; i < 4; i++)
+   if (!cands[i].func) {
+   cands[i].func = func;
+   cands[i].hit_count = 1;
+   break;
+   }
+   if (i == 4) /* no space */
+   (*this_cpu_ptr(&deliver_skb_miss_count))++;
+
+   total_count = *this_cpu_ptr(&deliver_skb_miss_count);
+   for (i = 0; i < 4; i++)
+   total_count += cands[i].hit_count;
+   if (total_count > 1000) /* Arbitrary threshold */
+   schedule_work(&deliver_skb_update_work);
+   return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+DEFINE_STATIC_CALL(dispatch_deliver_skb, do_deliver_skb);
+
+static int dummy_deliver_skb(struct sk_buff *skb, struct net_device *dev,
+struct packet_type *pt_prev,
+struct net_device *orig_dev)
+{
+   WARN_ON_ONCE(1); /* shouldn't ever actually get here */
+   return do_deliver_skb(skb, pt_prev, orig_dev);
+}
+
+DEFINE_STATIC_CALL(dynamic_deliver_skb_1, dummy_deliver_skb);
+DEFINE_STATIC_CALL(dynamic_deliver_skb_2, dummy_deliver_skb);
+
+static DEFINE_PER_CPU(unsigned long, dds1_hit_count);
+static DEFINE_PER_CPU(unsigned long, dds2_hit_count);
+
+static int dynamic_deliver_skb(struct sk_buff *skb,
+  struct packet_type *pt_prev,
+  struct net_device *orig_dev)
+{
+   deliver_skb_func func = pt_prev->func;
+
+   if (func == dynamic_deliver_skb_1.func) {
+   (*this_cpu_ptr(&dds1_hit_count))++;
+   return static_call(dynamic_deliver_skb_1, skb, skb->dev,
+  pt_prev, orig_dev);
+   }
+   if (func == dynamic_deliver_skb_2.func) {
+   (*this_cpu_ptr(&dds2_hit_count))++;
+   return static_call(dynamic_deliver_skb_2, skb, skb->dev,
+  pt_prev, orig_dev);
+   }
+   return do_deliver_skb(skb, pt_prev, orig_dev);
+}
+
+DEFINE_MUTEX(deliver_skb_update_lock);
+
+static void deliver_skb_add_cand(struct deliver_skb_candidate *top,
+size_t ncands,
+struct deliver_skb_candidate next)
+{
+   struct deliver_skb_candidate old;
+   int i;
+
+   for (i = 0; i < ncands; i++) {
+   if (next.hit_count > top[i].hit_count) {
+   /* Swap next with top[i], so that the old top[i] can
+ 

Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 18:14, Nadav Amit wrote:
> Second, (2i) is not very intuitive for me. Using the out-of-line static
> calls seems to me as less performant than the inline (potentially, I didn’t
> check).
>
> Anyhow, the use of out-of-line static calls seems to me as
> counter-intuitive. I think (didn’t measure) that it may add more overhead
> than it saves due to the additional call, ret, and so on
AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
 additional call and ret.  So I wouldn't expect it to be too expensive.
More to the point, it seems like it's easier to get right than the inline
 version, and if we get the inline version working later we can introduce it
 without any API change, much as Josh's existing patches have both versions
 behind a Kconfig switch.

> I tried to avoid reading to
> compared target from memory and therefore used an immediate. This should
> prevent data cache misses and even when the data is available is faster by
> one cycle. But it requires the patching of both the “cmp %target-reg, imm”
> and “call rel-target” to be patched “atomically”. So the static-calls
> mechanism wouldn’t be sufficient.
The approach I took to deal with that (since though I'm doing a read from
 memory, it's key->func in .data rather than the jmp immediate in .text) was
 to have another static_call (though a plain static_key could also be used)
 to 'skip' the fast-path while it's actually being patched.  Then, since all
 my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
 after switching off the fast-path to make sure no threads were still in it.
I'm not sure how that would be generalised to all cases, though; we don't
 want to force every indirect call to take the rcu_read_lock as that means
 no callee can ever synchronize_rcu().  I guess we could have our own
 separate RCU read lock just for indirect call patching?  (What does kgraft
 do?)
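For reference, my update worker ends up doing roughly the following (a
 sketch using the key names from my PoC patch posted today; best_func and
 next_func stand in for whatever the per-cpu stats selected):

	/* Route every caller around the fast path... */
	static_call_update(dispatch_deliver_skb, do_deliver_skb);
	/* ...wait for readers still inside the fast path to drain... */
	synchronize_rcu();
	/* ...now it's safe to repatch the candidate slots... */
	__static_call_update(dynamic_deliver_skb_1, best_func);
	__static_call_update(dynamic_deliver_skb_2, next_func);
	/* ...and switch the fast path back on. */
	static_call_update(dispatch_deliver_skb, dynamic_deliver_skb);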

> Based on Josh’s previous feedback, I thought of improving the learning using
> some hysteresis. Anyhow, note that there are quite a few cases in which you
> wouldn’t want optpolines. The question is whether in general it would be an
> opt-in or opt-out mechanism.
I was working on the assumption that it would be opt-in, wrapping a macro
 around indirect calls that are known to have a fairly small number of hot
 targets.  There are plenty of indirect calls in the kernel that are only
 called once in a blue moon, e.g. in control-plane operations like ethtool;
 we don't really need to bulk up .text with trampolines for all of them.

-Ed


Re: [PATCH 6/7] selftest/bpf: remove redundant parenthesis

2018-12-12 Thread Edward Cree
On 12/12/18 19:04, Jakub Kicinski wrote:
> On Tue, 11 Dec 2018 20:56:06 +0900, Alice Ferrazzi wrote:
>> Signed-off-by: Alice Ferrazzi 
>> ---
>>  tools/testing/selftests/bpf/test_offload.py | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
>> index 0f9130ebfd2c..b06cc0eea0eb 100755
>> --- a/tools/testing/selftests/bpf/test_offload.py
>> +++ b/tools/testing/selftests/bpf/test_offload.py
>> @@ -140,7 +140,7 @@ def cmd_result(proc, include_stderr=False, fail=False):
>>  
>>  
>>  def rm(f):
>> -cmd("rm -f %s" % (f))
>> +cmd("rm -f %s" % f)
>>  if f in files:
>>  files.remove(f)
>>  
> Is this in PEP8, too?
I don't know, but it shouldn't be.
If f is a sequence type, both the old and new code can break here,
 throwing a TypeError.  It should be cmd("rm -f %s" % (f,)).  The
 presence of the brackets suggests to me that that's what the
 original author intended.
Now, it's unlikely that we'd ever want to pass a list or tuple
 here, since 'rm' wouldn't understand the result, but the proper
 way to deal with that is an assertion with a meaningful message,
 since the TypeError here will have the non-obvious message "not
 all arguments converted during string formatting".

-Ed


Re: [PATCH v2 0/4] Static calls

2018-12-12 Thread Edward Cree
On 12/12/18 21:15, Nadav Amit wrote:
>> On Dec 12, 2018, at 10:33 AM, Edward Cree  wrote:
>>
>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>>  additional call and ret.  So I wouldn't expect it to be too expensive.
>> More to the point, it seems like it's easier to get right than the inline
>>  version, and if we get the inline version working later we can introduce it
>>  without any API change, much as Josh's existing patches have both versions
>>  behind a Kconfig switch.
> I see. For my outlined blocks I used the opposite approach - a call followed
> by jmp
That's what Josh did too.  I.e. caller calls the trampoline, which jmps to the
 callee; later it rets, taking it back to the caller.  Perhaps I wasn't clear.
The point is that there's still only one call and one ret.

>> I was working on the assumption that it would be opt-in, wrapping a macro
>>  around indirect calls that are known to have a fairly small number of hot
>>  targets.  There are plenty of indirect calls in the kernel that are only
>>  called once in a blue moon, e.g. in control-plane operations like ethtool;
>>  we don't really need to bulk up .text with trampolines for all of them.
> On the other hand, I’m not sure the static_call interface is so intuitive.
> And extending it into “dynamic_call” might be even worse. As I initially
> used an opt-in approach, I can tell you that it was very exhausting.
Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
 to make it opt-out.
One advantage of the explicit opt-in dynamic_call, though, which can be seen
 in my patch is that multiple call sites can share the same learning-state,
 if they're expected to call the same set of functions.  An opt-out approach
 would automatically give each indirect call statement its own individual BTB.
Either way, I think the question is orthogonal to what the trampolines
 themselves look like (and even to the inline vs outline question).

-Ed


Re: [PATCH] ethernet/sfc: mark state UNINIT after unregister

2015-06-15 Thread Edward Cree
On 12/06/15 19:51, Jarod Wilson wrote:
> Without this change, modprobe -r sfc hits the BUG_ON() in
> efx_pci_remove_main(). Best as I can tell, this was just an oversight,
> efx->state gets set to STATE_UNINIT in the error path of
> efx_register_netdev() just after unregister_netdevice(), and the same
> should happen in efx_unregister_netdev() after its unregister_netdevice()
> call. Now I can load and unload no problem.
>
> CC: Solarflare linux maintainers 
> CC: net...@vger.kernel.org
> Signed-off-by: Jarod Wilson 
> ---
>  drivers/net/ethernet/sfc/efx.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 0c42ed9..f3eaade 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -2448,6 +2448,7 @@ static void efx_unregister_netdev(struct efx_nic *efx)
>  #endif
>   device_remove_file(&efx->pci_dev->dev, &dev_attr_phy_type);
>   unregister_netdev(efx->net_dev);
> + efx->state = STATE_UNINIT;
>   }
>  }
>  
This isn't quite the right place; efx->state changes are supposed to be
serialised by the RTNL lock.
Our out-of-tree driver has this in efx_pci_remove, just after the 
efx_disable_interrupts(efx) call and before rtnl_unlock() (see patch below).  
I'd suggest that's the change we should make, but I haven't tested it yet.

For reference, the "oversight" was in my 
e7fef9b45ae188066bb6eb3dde8310d33c2f7d5e "sfc: add sysfs entry to control MCDI 
tracing" which accidentally took our out-of-tree version of 
efx_unregister_netdev().  Before that the code was as in Jarod's patch.

-Ed

8<

sfc: mark state UNINIT after unregister

Without this change, modprobe -r sfc hits the BUG_ON() in
efx_pci_remove_main().

Reported-by: Jarod Wilson 
Fixes: e7fef9b45ae188066bb6eb3dde8310d33c2f7d5e
---
 drivers/net/ethernet/sfc/efx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 0c42ed9..67bdaf3 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -2920,6 +2920,7 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
 efx_dissociate(efx);
 dev_close(efx->net_dev);
 efx_disable_interrupts(efx);
+efx->state = STATE_UNINIT;
 rtnl_unlock();
 
 if (efx->type->sriov_fini)
--


Re: [PATCH] net: vxge: fix spelling mistake in macro VXGE_HW_ERR_PRIVILAGED_OPEARATION

2018-05-22 Thread Edward Cree
On 22/05/18 16:36, Colin King wrote:
> From: Colin Ian King 
>
> Rename VXGE_HW_ERR_PRIVILAGED_OPEARATION to VXGE_HW_ERR_PRIVILAGED_OPERATION
> to fix spelling mistake.
>
> Signed-off-by: Colin Ian King 
"Privilaged" doesn't look right either, maybe fix both at once?
 -> VXGE_HW_PRIVILEGED_OPERATION

-Ed


Re: [PATCH] ethtool: fix a potential missing-check bug

2018-04-30 Thread Edward Cree
On 30/04/18 02:31, Wenwen Wang wrote:
> In ethtool_get_rxnfc(), the object "info" is firstly copied from
> user-space. If the FLOW_RSS flag is set in the member field flow_type of
> "info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
> user-space because FLOW_RSS is newer and has new definition, as mentioned
> in the comment. However, given that the user data resides in user-space, a
> malicious user can race to change the data after the first copy. By doing
> so, the user can inject inconsistent data. For example, in the second
> copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
> In the following execution, "info" will be used in the function
> ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
> information leakage since ops->get_rxnfc() will prepare various types of
> data according to flow_type, and the prepared data will be eventually
> copied to user-space. This inconsistent data may also cause undefined
> behaviors based on how ops->get_rxnfc() is implemented.
I'm not sure there's actually an issue here, since the only purpose of the
 FLOW_RSS check is to avoid faulting/trampling user memory when the user
 process only has the short version of 'info'.
If userland subsequently removes the FLOW_RSS flag, then all that will
 happen is that info_size is larger than it ought to be; that cannot affect
 ops->get_rxnfc() since it isn't passed; the op should already be treating
 'info' as unsafe/user-controlled.
The only way this could lead to information leakage would be if in the non-
 FLOW_RSS case ops->get_rxnfc() was writing things it shouldn't into the
 latter part of 'info' and was getting away with it so far as that was
 never copied_to_user; that seems improbable to me, but I suppose you might
 want to do the check anyway as belt-and-braces security.
(A cleaner approach might be to only copy_from_user() the extra region of
 'info', leaving the previously-copied part alone.  That way each byte is
 only copied_from_user once and thus cannot change.)
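Something like this, say (an untested sketch of that last idea, using the
 same 'info', 'info_size' and 'useraddr' variables as the function being
 patched):

	/* First fetch: only the short, pre-FLOW_RSS layout. */
	if (copy_from_user(&info, useraddr, info_size))
		return -EFAULT;
	if (info.flow_type & FLOW_RSS) {
		/* Fetch only the bytes we don't already have, so the
		 * flow_type we checked above cannot change under us.
		 */
		if (copy_from_user((char *)&info + info_size,
				   useraddr + info_size,
				   sizeof(info) - info_size))
			return -EFAULT;
		info_size = sizeof(info);
	}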

-Ed

> This patch re-verifies the flow_type field of "info" after the second copy.
> If the value is not as expected, an error code will be returned.
>
> Signed-off-by: Wenwen Wang 
> ---
>  net/core/ethtool.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 03416e6..a121034 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1032,6 +1032,8 @@ static noinline_for_stack int ethtool_get_rxnfc(struct net_device *dev,
>   info_size = sizeof(info);
>   if (copy_from_user(&info, useraddr, info_size))
>   return -EFAULT;
> + if (!(info.flow_type & FLOW_RSS))
> + return -EINVAL;
>   }
>  
>   if (info.cmd == ETHTOOL_GRXCLSRLALL) {




Re: [PATCH] drivers: net: ethernet: sun: Fix couple of spells in the file sunhme.c

2021-02-08 Thread Edward Cree
On 05/02/2021 12:47, Bhaskar Chowdhury wrote:
> 
> 
> s/fuck/mess/
> s/fucking/s/
> 
> Signed-off-by: Bhaskar Chowdhury 

Right or wrong, these are not "spelling fixes".
Please do not misrepresent your patch in your Subject: line.
(Also, subsystem prefix should probably just be "net: sunhme:".)

-ed


Re: [linux-next:master 13398/13940] drivers/net/ethernet/sfc/ef100_nic.c:610: undefined reference to `__umoddi3'

2020-08-11 Thread Edward Cree
On 10/08/2020 16:51, Guenter Roeck wrote:
> On Thu, Aug 06, 2020 at 07:17:43PM +0100, Edward Cree wrote:
>> Maybe I should add a
>>
>> static inline u32 mod_u64(u64 dividend, u32 divisor)
>> {
>>     return do_div(dividend, divisor);
>> }
> Your proposed function is an exact replicate of do_div()
No, because do_div() is a macro that modifies 'dividend', whereas by
 wrapping it in an inline function mod_u64() implicitly creates a
 local variable.  Thus do_div() cannot be used on a constant, whereas
 mod_u64() can.
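That is, roughly (a sketch; do_div() evaluates to the remainder and writes
 the quotient back into its first argument):

	u64 n = EFX_MIN_DMAQ_SIZE;
	u32 rem = do_div(n, (u32)reader->value); /* ok: n is a modifiable lvalue */

	/* do_div(EFX_MIN_DMAQ_SIZE, reader->value) won't compile, because the
	 * macro assigns the quotient back to its first argument; mod_u64()
	 * hides the temporary, so a constant dividend is fine:
	 */
	rem = mod_u64(EFX_MIN_DMAQ_SIZE, (u32)reader->value);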
> You could try something like
>
>   if (reader->value > EFX_MIN_DMAQ_SIZE || EFX_MIN_DMAQ_SIZE % (u32)reader->value)
I considered that.  It's ugly, so while it will work I think it's
 worth looking to see if there's a better way.
> If EFX_MIN_DMAQ_SIZE is indeed known to be a power of 2, you could also use
> the knowledge that a 2^n value can only be divided by a smaller 2^n value,
> meaning that reader->value must have exactly one bit set. This would also
> avoid divide-by-0 issues if reader->value can be 0.
>
>   if (reader->value > EFX_MIN_DMAQ_SIZE || hweight64(reader->value) != 1)
This is also ugly and I don't like relying on the power-of-twoness —
 it just feels fragile.

But you're right to point out that there's a div/0 issue, and if I'm
 going to have to check for that, then ugliness is unavoidable.  So
 I think the least painful option available is probably

    if (!reader->value || reader->value > EFX_MIN_DMAQ_SIZE ||
    EFX_MIN_DMAQ_SIZE % (u32)reader->value)

 which only assumes EFX_MIN_DMAQ_SIZE <= U32_MAX, an assumption I'm
 comfortable with baking in.
I'll put together a formal patch with that.

Thanks for the help.

-ed


Re: [linux-next:master 13398/13940] drivers/net/ethernet/sfc/ef100_nic.c:610: undefined reference to `__umoddi3'

2020-08-06 Thread Edward Cree
On 06/08/2020 00:48, kernel test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> head:   d15fe4ec043588beee823781602ddb51d0bc84c8
> commit: adcfc3482813fa2c34e5902005853f79c2aa [13398/13940] sfc_ef100: read Design Parameters at probe time
> config: microblaze-randconfig-r032-20200805 (attached as .config)
> compiler: microblaze-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> git checkout adcfc3482813fa2c34e5902005853f79c2aa
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=microblaze
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All errors (new ones prefixed by >>):
>
>    microblaze-linux-ld: drivers/net/ethernet/sfc/ef100_nic.o: in function `ef100_process_design_param':
>>> drivers/net/ethernet/sfc/ef100_nic.c:610: undefined reference to `__umoddi3'
>    605	/* Our TXQ and RXQ sizes are always power-of-two and thus divisible by
>    606	 * EFX_MIN_DMAQ_SIZE, so we just need to check that
>    607	 * EFX_MIN_DMAQ_SIZE is divisible by GRANULARITY.
>    608	 * This is very unlikely to fail.
>    609	 */
>  > 610	if (EFX_MIN_DMAQ_SIZE % reader->value) {
So, this is (unsigned long) % (u64), which I guess doesn't go quite
 as smoothly on 32-bit microcontrollers (though the thought of plugging
 a 100-gig smartNIC into a microblaze boggles the mind a little ;).
And none of the math64.h functions seem to have the shape we want —
 div_u64_rem() wants to write the remainder through a pointer, and
 do_div() wants to modify the dividend (which is a constant in this
 case).  So whatever I do, it's gonna be ugly :(

Maybe I should add a

static inline u32 mod_u64(u64 dividend, u32 divisor)
{
    return do_div(dividend, divisor);
}

to include/linux/math64.h?  At least that way the ugly is centralised
 in the header file.


Re: [PATCH net-next 0/5] skbuff: introduce skbuff_heads bulking and reusing

2021-01-12 Thread Edward Cree
Without wishing to weigh in on whether this caching is a good idea...
Wouldn't it be simpler, rather than having two separate "alloc" and "flush"
 caches, to have a single larger cache, such that whenever it becomes full
 we bulk flush the top half, and when it's empty we bulk alloc the bottom
 half?  That should mean fewer branches, fewer instructions etc. than
 having to decide which cache to act upon every time.
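Something like this, roughly (a sketch of the single-cache idea; the size
 and names are invented here, and it ignores the irq/NAPI-context split
 that the real patch has to care about):

#define HEAD_CACHE_SIZE 64

struct head_cache {
	void *slots[HEAD_CACHE_SIZE];
	unsigned int count;
};

static struct sk_buff *head_cache_get(struct head_cache *c)
{
	if (!c->count)	/* empty: bulk-alloc the bottom half */
		c->count = kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC,
						 HEAD_CACHE_SIZE / 2, c->slots);
	return c->count ? c->slots[--c->count] : NULL;
}

static void head_cache_put(struct head_cache *c, struct sk_buff *skb)
{
	if (c->count == HEAD_CACHE_SIZE) {	/* full: bulk-flush the top half */
		kmem_cache_free_bulk(skbuff_head_cache, HEAD_CACHE_SIZE / 2,
				     &c->slots[HEAD_CACHE_SIZE / 2]);
		c->count = HEAD_CACHE_SIZE / 2;
	}
	c->slots[c->count++] = skb;
}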

-ed


Re: [PATCH 10/11] pragma once: delete few backslashes

2021-03-04 Thread Edward Cree
On 28/02/2021 17:05, Alexey Dobriyan wrote:
> From 251ca5673886b5bb0a42004944290b9d2b267a4a Mon Sep 17 00:00:00 2001
> From: Alexey Dobriyan 
> Date: Fri, 19 Feb 2021 13:37:24 +0300
> Subject: [PATCH 10/11] pragma once: delete few backslashes
> 
> Some macros contain one backslash too many and end up being the last
> macro in a header file. When #pragma once conversion script truncates
> the last #endif and whitespace before it, such backslash triggers
> a warning about "OMG file ends up in a backslash-newline".
> 
> Needless to say I don't want to handle another case in my script,
> so delete useless backslashes instead.
> 
> Signed-off-by: Alexey Dobriyan 
> ---
>  arch/arc/include/asm/cacheflush.h  | 2 +-
>  drivers/net/ethernet/mellanox/mlxsw/item.h | 2 +-
>  include/linux/once.h   | 2 +-
>  include/media/drv-intf/exynos-fimc.h   | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arc/include/asm/cacheflush.h b/arch/arc/include/asm/cacheflush.h
> index e201b4b1655a..46704c341b17 100644
> --- a/arch/arc/include/asm/cacheflush.h
> +++ b/arch/arc/include/asm/cacheflush.h
> @@ -112,6 +112,6 @@ do {  \
>  } while (0)
>  
>  #define copy_from_user_page(vma, page, vaddr, dst, src, len) \
> - memcpy(dst, src, len);  \
> + memcpy(dst, src, len)

This changebar also removes a semicolon.
It looks plausibly correct, but the commit message ought to mention it.

-ed


Re: [PATCH] net: flow_offload: remove trailing semicolon in macro definition

2020-11-30 Thread Edward Cree
On 27/11/2020 19:37, t...@redhat.com wrote:
> From: Tom Rix 
> 
> The macro use will already have a semicolon.
> 
> Signed-off-by: Tom Rix 
> ---
>  net/core/flow_offload.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
> index d4474c812b64..59ddfd3f3876 100644
> --- a/net/core/flow_offload.c
> +++ b/net/core/flow_offload.c
> @@ -32,7 +32,7 @@ EXPORT_SYMBOL(flow_rule_alloc);
>   struct flow_dissector *__d = (__m)->dissector;  \
>   \
>   (__out)->key = skb_flow_dissector_target(__d, __type, (__m)->key);  \
> - (__out)->mask = skb_flow_dissector_target(__d, __type, (__m)->mask); \
> + (__out)->mask = skb_flow_dissector_target(__d, __type, (__m)->mask) \

Strictly speaking, shouldn't this macro have a do {} while (0)
 around it anyway?

-ed


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-25 Thread Edward Cree
On 25/11/2020 00:32, Miguel Ojeda wrote:
> I have said *authoring* lines of *this* kind takes a minute per line.
> Specifically: lines fixing the fallthrough warning mechanically and
> repeatedly where the compiler tells you to, and doing so full-time for
> a month.

> It is useful since it makes intent clear.
To make the intent clear, you have to first be certain that you
 understand the intent; otherwise by adding either a break or a
 fallthrough to suppress the warning you are just destroying the
 information that "the intent of this code is unknown".
Figuring out the intent of a piece of unfamiliar code takes more
 than 1 minute; just because
case foo:
thing;
case bar:
break;
 produces identical code to
case foo:
thing;
break;
case bar:
break;
 doesn't mean that *either* is correct — maybe the author meant
 to write
case foo:
return thing;
case bar:
break;
 and by inserting that break you've destroyed the marker that
 would direct someone who knew what the code was about to look
 at that point in the code and spot the problem.
Thus, you *always* have to look at more than just the immediate
 mechanical context of the code, to make a proper judgement that
 yes, this was the intent.  If you think that that sort of thing
 can be done in an *average* time of one minute, then I hope you
 stay away from code I'm responsible for!
One minute would be an optimistic target for code that, as the
 maintainer, one is already somewhat familiar with.  For code
 that you're seeing for the first time, as is usually the case
 with the people doing these mechanical fix-a-warning patches,
 it's completely unrealistic.

A warning is only useful because it makes you *think* about the
 code.  If you suppress the warning without doing that thinking,
 then you made the warning useless; and if the warning made you
 think about code that didn't *need* it, then the warning was
 useless from the start.

So make your mind up: does Clang's stricter -Wimplicit-fallthrough
 flag up code that needs thought (in which case the fixes take
 effort both to author and to review) or does it flag up code
 that can be mindlessly "fixed" (in which case the warning is
 worthless)?  Proponents in this thread seem to be trying to
 have it both ways.

-ed


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-25 Thread Edward Cree
On 24/11/2020 21:25, Kees Cook wrote:
> I still think this isn't right -- it's a case statement that runs off
> the end without an explicit flow control determination.

Proves too much — for instance
case foo:
case bar:
thing;
break;
 doesn't require a fallthrough; after case foo:, and afaik
 no-one is suggesting it should.  Yet it, too, is "a case
 statement that runs off the end without an explicit flow
 control determination".

-ed


Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues

2021-01-21 Thread Edward Cree
On 20/01/2021 21:27, Ivan Babrou wrote:
> Without this change the driver tries to allocate too many queues,
> breaching the number of available msi-x interrupts on machines
> with many logical cpus and default adapter settings:
> 
> Insufficient resources for 12 XDP event queues (24 other channels, max 32)
> 
> Which in turn triggers EINVAL on XDP processing:
> 
> sfc :86:00.0 ext0: XDP TX failed (-22)
> 
> Signed-off-by: Ivan Babrou 

Acked-by: Edward Cree 


Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues

2020-12-15 Thread Edward Cree
On 15/12/2020 09:43, Jesper Dangaard Brouer wrote:
> On Mon, 14 Dec 2020 17:29:06 -0800
> Ivan Babrou  wrote:
> 
>> Without this change the driver tries to allocate too many queues,
>> breaching the number of available msi-x interrupts on machines
>> with many logical cpus and default adapter settings:
>>
>> Insufficient resources for 12 XDP event queues (24 other channels, max 32)
>>
>> Which in turn triggers EINVAL on XDP processing:
>>
>> sfc :86:00.0 ext0: XDP TX failed (-22)
> 
> I have a similar QA report with XDP_REDIRECT:
>   sfc :05:00.0 ens1f0np0: XDP redirect failed (-22)
> 
> Here we are back to the issue we discussed with ixgbe, that NIC / msi-x
> interrupts hardware resources are not enough on machines with many
> logical cpus.
> 
> After this fix, what will happen if (cpu >= efx->xdp_tx_queue_count) ?
Same as happened before: the "failed -22".  But this fix will make that
 less likely to happen, because it ties more TXQs to each EVQ, and it's
 the EVQs that are in short supply.
(Strictly speaking, I believe the limitation is a software one, that
 comes from the driver's channel structures having been designed a
 decade ago when 32 cpus ought to be enough for anybody... AFAIR the
 hardware is capable of giving us something like 1024 evqs if we ask
 for them, it just might not have that many msi-x vectors for us.)
Anyway, the patch looks correct, so
Acked-by: Edward Cree 

-ed


Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues

2020-12-16 Thread Edward Cree
On 16/12/2020 08:45, Jesper Dangaard Brouer wrote:
> So, what I hear is that this fix is just pampering over the real issue.
Yes, it is, but it's better than nothing in the meantime while we work
 out the complete fix.

> I suggest that you/we detect the situation, and have a code path that
> will take a lock (per 16 packets bulk) and solve the issue.
Imho that would _also_ paper over the issue, because it would mean the
 system degraded to a lower performance mode of operation, while still
 appearing to support XDP_TX.  I think that that in general should not
 happen unless there is a way for the user to determine at runtime
 whether it has/should happen.  Perhaps Marek's new XDP feature flags
 could include a "tx-lockless" flag to indicate this?

-ed


Idea for reducing sysfs memory usage

2016-02-16 Thread Edward Cree

Sorry if this has been suggested before, but if so I couldn't find it.
Short version: could a sysfs dir reference a list of default attributes
rather than having to instantiate them all?

Long version:

If I'm understanding correctly, when creating a sysfs dir for a kobject, we
call populate_dir() and create a sysfs file for each of its default_attrs.
This leads to some memory allocations, which apparently can really add up
when, for instance, using a large number of netdevices (mostly software
devices like VLANs and bonds whose sysfs files will probably never actually
be used anyway).  At netdev1.1, a use case was presented for thousands -
perhaps even hundreds of thousands - of such netdevices, at which point the
memory overhead for all those sysfs files becomes quite significant.

Thus I was wondering if it would be feasible to replace the instantiation of
these default attributes with, in principle, a pointer to the default_attr
list, which already contains the methods needed to show or store the
attribute, and need not be duplicated for each sysfs dir referring to it.
In practice it would probably make sense to build a fast-lookup data
structure from the list when the latter is first created, since the list is
(as I understand it) never changed.

Then for instance, a lookup on the directory would search both the default
attrs list and the rbtree of runtime-added nodes.  I don't think it would
ever be necessary for runtime code to _remove_ a default attribute from a
dir, but if it were, then perhaps a negative entry could be added to the
rbtree, and that tree could always be searched first - then, if the negative
entry was found, the search of the default list would be skipped.
Of course, all the other operations on the directory would similarly have to
be modified to take account of there being two places in which an attribute
could be defined, but I believe (perhaps naïvely) that this would not be
excessively painful.

This could potentially save memory whenever large numbers of directories of
the same 'type' are created in sysfs; not just netdevices, but also for
instance PCI devices and their 'power' subdirectories.  While it would
slightly slow down sysfs accesses, I can't quite imagine a situation in
which the speed of sysfs would be critical to system performance - though
perhaps that is just my ignorance.

Is there an obvious reason why I can't do this, or should I dive in and try
to implement it?  (Or should I first just try to estimate how much memory it
would save, to find out whether it would be worth the complexity?)

Not subscribed to list; please Cc on replies.

--
-ed


Re: Idea for reducing sysfs memory usage

2016-02-16 Thread Edward Cree

On 16/02/16 23:55, Greg Kroah-Hartman wrote:
> On Tue, Feb 16, 2016 at 11:46:49PM +, Edward Cree wrote:
>> Sorry if this has been suggested before, but if so I couldn't find it.
>> Short version: could a sysfs dir reference a list of default attributes
>> rather than having to instantiate them all?
> Shorter version, why do you think it is?  :)
>
> Have you done some testing of the amount of memory that sysfs entries
> consume and found any problems with it?

Two reasons:
a) in his netdev1.1 talk "Scaling the Number of Network Interfaces on Linux",
   David Ahern claimed a memory overhead of (iirc) about 45kB per netdevice,
   of which he attributed (again, iirc) about 20kB to sysfs entries.  He also
   indicated that this was a problem for his use case.  (My apologies to
   David if I've misrepresented him.  CCed him so he can correct me.)
b) my reading of the code suggested it was allocating stuff for every call to
   sysfs_create_file() in the loop in populate_dir().
Having re-read __kernfs_new_node() and struct kernfs_node, I now realise I
misinterpreted them - the name isn't being allocated at all (kstrdup_const())
and the struct kernfs_node consists chiefly (if not entirely) of fields
specific to the individual file rather than shareable between multiple
instances.  So there isn't any memory we can save here.

Sorry for the noise.

--
-ed


Re: Idea for reducing sysfs memory usage

2016-02-16 Thread Edward Cree

On 17/02/16 00:47, Greg Kroah-Hartman wrote:
> How many sysfs entries are you creating for that 20kb?  And how did you
> measure it?  If you don't access the files, the backing store is not
> allocated, saving you a lot of memory.

Thinking about this some more, could we not do the same thing with the
struct kernfs_nodes, i.e. only allocate them when someone first accesses
the file?  Or simpler, defer allocation of all the files in the dir until
someone first touches the directory?  Of course it would add a little
latency to that first access, but (barring differences in cache warmth) it
would subtract the same amount of time from the initial dir creation, and
in the case where no-one ever looks at the directory, it would save the
memory.
I did find a patch series from 2009 doing something vaguely similar[1], but
a) it looks like it wasn't applied and b) it appears to involve a function
pointer in struct sysfs_elem_dir to say how to populate the directory.
All we need here is to get our kobj (which we have in priv member of struct
kernfs_node) and call populate_dir().  (And remove whatever flag or mark we
used to say "not populated yet".  And some locking will be needed.)

Again, no hard numbers on how much memory this would save, nor evidence
that the "no-one ever touches the dir" case happens in practice...

[1] http://thread.gmane.org/gmane.linux.kernel/904418
--
-ed


Re: We've released a generic netlink python library -- gnlpy

2015-05-21 Thread Edward Cree
On 21/05/15 00:33, Cong Wang wrote:
> On Wed, May 20, 2015 at 2:10 PM, Alex Gartrell  wrote:
>> Hey everyone,
>>
>> tl;dr; pure python generic netlink library with simple clients for ipvs and
>> taskstats here: https://github.com/facebook/gnlpy
> libnl should have python support for generic netlink too:
>
> $ ls python/netlink/genl/
> capi.i  __init__.py  Makefile  Makefile.am  Makefile.in
>
> but I never use its python module so not sure if it works... ;)
There is also pyroute2.netlink.generic, which does work but it's a bit
clunky imho.


Re: [PATCH 1/2] KVM: MMU: fix SMAP virtualization

2015-05-26 Thread Edward Cree
>> This breaks older compilers that can't initialize anon structures.
> 
> How old ? Even gcc 3.1 says you can use unnamed struct/union fields and
> 3.2 is the minimum version required to compile the kernel as mentioned
> in the README.
> 
> We could simply just name the structure, but I doubt this is the
> only place in the kernel code where it's being used this way :)

This appears to be GCC bug #10676, see 
Says it was fixed in 4.6, but I believe the kernel supports GCCs much older
than that (back to 3.2).  I personally hit it on 4.4.7, the version shipped
with RHEL6.6.
So I think the kernel code has to change, probably by naming the structure.
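
To illustrate the failure mode (a minimal stand-alone example, not the
kvm code itself):

	union attrs {
		struct {		/* anonymous struct member */
			unsigned low : 16;
			unsigned high : 16;
		};
		unsigned word;
	};
	/* pre-4.6 GCC (bug 10676) rejects initialising through the
	 * anonymous member:
	 */
	static union attrs a = { .low = 1 };

	union named_attrs {
		struct {
			unsigned low : 16;
			unsigned high : 16;
		} s;			/* named: fine on old GCC too */
		unsigned word;
	};
	static union named_attrs b = { .s = { .low = 1 } };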


Re: [PATCH v2 2/3] if_link: Add VF multicast promiscuous control

2015-02-20 Thread Edward Cree
On 20/02/15 01:00, Hiroshi Shimamoto wrote:
> From: Hiroshi Shimamoto 
>
> Add netlink directives and ndo entry to allow VF multicast promiscuous mode.
>
> The administrator wants to allow dedicatedly multicast promiscuous per VF.
If I'm properly understanding, this seems to be an ixgbe-specific option
to work around an ixgbe limitation; is it really appropriate to
implement as a generic net_device_op?
What would this ndo mean to a driver which can support thousands of
multicast groups without MC promisc?  Is it expected to limit the number
of MC groups when this is set to disallow?  Or just fulfil the letter of
the option but not its spirit?  The option doesn't seem to have
well-defined semantics outside of ixgbe.
I would suggest that the right place for this sort of driver-specific
device control is in sysfs.

I'm also a little perplexed as to why anyone would need to disallow
this; what security, or even administrative convenience, is gained by
allowing a VF to join 30 multicast groups but not multicast promiscuous
mode?  Especially as, afaik, there are no restrictions on which
multicast groups are joined, so the VF can receive any particular
multicast traffic it cares about.


Re: [PATCH v2 2/3] if_link: Add VF multicast promiscuous control

2015-02-23 Thread Edward Cree
On 20/02/15 21:05, Skidmore, Donald C wrote:
> If a vender specific interface is objectionable maybe a simpler and more 
> generic interface would be for the PF to be able to set a given VF into 
> "trusted" mode... I admit exactly what 'trusted' meant would vary from vender 
> to vender, but it would be a way for the driver to know it could allow 
> configurations such as this.  Just an idea, since we seem to be getting more 
> requests for things such as this.
That's an even worse idea; now you have a generic interface with
completely undefined semantics.
The right way to do this, imho, is to use one of the standard interfaces
for driver-specific gubbins - e.g. sysfs, genetlink or even (whisper it)
ioctls - and put your 'VF promisc mode' setting there.  That way you
have a vendor-specific interface with vendor-specified semantics.
Of those options, I'd recommend sysfs as the best fit.
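
For instance, something along these lines (a sketch with made-up names;
the PF state field is hypothetical, not ixgbe code):

	struct my_pf {
		bool vf_mc_promisc_allowed;	/* hypothetical PF state */
	};

	static ssize_t vf_mc_promisc_show(struct device *dev,
					  struct device_attribute *attr,
					  char *buf)
	{
		struct my_pf *pf = dev_get_drvdata(dev);

		return sprintf(buf, "%d\n", pf->vf_mc_promisc_allowed);
	}

	static ssize_t vf_mc_promisc_store(struct device *dev,
					   struct device_attribute *attr,
					   const char *buf, size_t count)
	{
		struct my_pf *pf = dev_get_drvdata(dev);
		bool allow;

		if (kstrtobool(buf, &allow))
			return -EINVAL;
		pf->vf_mc_promisc_allowed = allow;
		return count;
	}
	static DEVICE_ATTR_RW(vf_mc_promisc);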


Re: [PATCH net] MAINTAINERS: net: update sfc maintainers

2016-04-25 Thread Edward Cree
On 25/04/16 17:42, Bert Kenward wrote:
> Add myself and Edward Cree as maintainers.
> Remove Shradha Shah, who is on extended leave.
>
> Cc: David S. Miller 
> Cc: Edward Cree 
> Cc: Shradha Shah 
> Signed-off-by: Bert Kenward 
Acked-by: Edward Cree 
> ---
>  MAINTAINERS | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8491336..17ad615 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10014,7 +10014,8 @@ F:	drivers/infiniband/hw/ocrdma/
>  
>  SFC NETWORK DRIVER
>  M:   Solarflare linux maintainers 
> -M:   Shradha Shah 
> +M:   Edward Cree 
> +M:   Bert Kenward 
>  L:   net...@vger.kernel.org
>  S:   Supported
>  F:   drivers/net/ethernet/sfc/


Re: [PATCH net-next 2/2] net: core: increase the default size of GRO_NORMAL skb lists to flush

2019-10-10 Thread Edward Cree
On 10/10/2019 15:42, Alexander Lobakin wrote:
> Commit 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL
> skbs") have introduced a sysctl variable gro_normal_batch for defining
> a limit for listified Rx of GRO_NORMAL skbs. The initial value of 8 is
> purely arbitrary and has been chosen, I believe, as a minimal safe
> default.
8 was chosen by performance tests on my setup with v1 of that patch;
 see https://www.spinics.net/lists/netdev/msg585001.html .
Sorry for not including that info in the final version of the patch.
While I didn't re-do tests on varying gro_normal_batch on the final
 version, I think changing it needs more evidence than just "we tested
 it; it's better".  In particular, increasing the batch size should be
 accompanied by demonstration that latency isn't increased in e.g. a
 multi-stream ping-pong test.

> However, several tests show that it's rather suboptimal and doesn't
> allow to take a full advantage of listified processing. The best and
> the most balanced results have been achieved with a batches of 16 skbs
> per flush.
> So double the default value to give a yet another boost for Rx path.

> It remains configurable via sysctl anyway, so may be fine-tuned for
> each hardware.
I see this as a reason to leave the default as it is; the combination
 of your tests and mine have established that the optimal size does
 vary (I found 16 to be 2% slower than 8 with my setup), so any
 tweaking of the default is likely only worthwhile if we have data
 over lots of different hardware combinations.
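
(For anyone wanting to experiment on their own setup, that's just e.g.
	sysctl -w net.core.gro_normal_batch=16
 assuming the sysctl name from the patch under discussion.)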

> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a33f56b439ce..4f60444bb766 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4189,7 +4189,7 @@ int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
>  int dev_rx_weight __read_mostly = 64;
>  int dev_tx_weight __read_mostly = 64;
>  /* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
> -int gro_normal_batch __read_mostly = 8;
> +int gro_normal_batch __read_mostly = 16;
>  
>  /* Called with irq disabled */
>  static inline void napi_schedule(struct softnet_data *sd,



Re: [PATCH net-next1/2] net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()

2019-10-10 Thread Edward Cree
On 10/10/2019 15:42, Alexander Lobakin wrote:
> Commit 323ebb61e32b4 ("net: use listified RX for handling GRO_NORMAL
> skbs") made use of listified skb processing for the users of
> napi_gro_frags().
> The same technique can be used in a way more common napi_gro_receive()
> to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers,
> including gro_cells and mac80211 users.
>
> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/dev.c | 49 +
>  1 file changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8bc3dce71fc0..a33f56b439ce 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5884,6 +5884,26 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
>  }
>  EXPORT_SYMBOL(gro_find_complete_by_type);
>  
> +/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
> +static void gro_normal_list(struct napi_struct *napi)
> +{
> + if (!napi->rx_count)
> + return;
> + netif_receive_skb_list_internal(&napi->rx_list);
> + INIT_LIST_HEAD(&napi->rx_list);
> + napi->rx_count = 0;
> +}
> +
> +/* Queue one GRO_NORMAL SKB up for list processing.  If batch size exceeded,
> + * pass the whole batch up to the stack.
> + */
> +static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
> +{
> + list_add_tail(&skb->list, &napi->rx_list);
> + if (++napi->rx_count >= gro_normal_batch)
> + gro_normal_list(napi);
> +}
> +
>  static void napi_skb_free_stolen_head(struct sk_buff *skb)
>  {
>   skb_dst_drop(skb);
> @@ -5891,12 +5911,13 @@ static void napi_skb_free_stolen_head(struct sk_buff *skb)
>   kmem_cache_free(skbuff_head_cache, skb);
>  }
>  
> -static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
> +static gro_result_t napi_skb_finish(struct napi_struct *napi,
> + struct sk_buff *skb,
> + gro_result_t ret)
Any reason why the argument order here is changed around?

-Ed
>  {
>   switch (ret) {
>   case GRO_NORMAL:
> - if (netif_receive_skb_internal(skb))
> - ret = GRO_DROP;
> + gro_normal_one(napi, skb);
>   break;
>  
>   case GRO_DROP:
> @@ -5928,7 +5949,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>  
>   skb_gro_reset_offset(skb);
>  
> - ret = napi_skb_finish(dev_gro_receive(napi, skb), skb);
> + ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
>   trace_napi_gro_receive_exit(ret);
>  
>   return ret;
> @@ -5974,26 +5995,6 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
>  }
>  EXPORT_SYMBOL(napi_get_frags);
>  
> -/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
> -static void gro_normal_list(struct napi_struct *napi)
> -{
> - if (!napi->rx_count)
> - return;
> - netif_receive_skb_list_internal(&napi->rx_list);
> - INIT_LIST_HEAD(&napi->rx_list);
> - napi->rx_count = 0;
> -}
> -
> -/* Queue one GRO_NORMAL SKB up for list processing.  If batch size exceeded,
> - * pass the whole batch up to the stack.
> - */
> -static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
> -{
> - list_add_tail(&skb->list, &napi->rx_list);
> - if (++napi->rx_count >= gro_normal_batch)
> - gro_normal_list(napi);
> -}
> -
>  static gro_result_t napi_frags_finish(struct napi_struct *napi,
> struct sk_buff *skb,
> gro_result_t ret)



Re: [PATCH net-next1/2] net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()

2019-10-11 Thread Edward Cree
>> On 10/10/2019 15:42, Alexander Lobakin wrote:
>>> Commit 323ebb61e32b4 ("net: use listified RX for handling GRO_NORMAL
>>> skbs") made use of listified skb processing for the users of
>>> napi_gro_frags().
>>> The same technique can be used in a way more common napi_gro_receive()
>>> to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers,
>>> including gro_cells and mac80211 users.
>>>
>>> Signed-off-by: Alexander Lobakin 
Acked-by: Edward Cree 
but I think this needs review from the socionext folks as well.


Re: [RFC PATCH v2 0/4] dynamic indirect call promotion

2019-02-15 Thread Edward Cree
On 05/02/19 08:50, Nadav Amit wrote:
>> In cases where RCU cannot be used (e.g. because some callees need to RCU
>> synchronise), it might be possible to add a variant that uses
>> synchronize_rcu_tasks() when updating, but this series does not attempt this.
> I wonder why.
Mainly because I have yet to convince myself that it's the Right Thing.
Note also the following (from kernel/rcu/update.c):

/*
 * This is a very specialized primitive, intended only for a few uses in
 * tracing and other situations requiring manipulation of function
 * preambles and profiling hooks.  The synchronize_rcu_tasks() function
 * is not (yet) intended for heavy use from multiple CPUs.
 */

> This seems like an easy solution, and according to Josh, Steven
> Rostedt and the documentation appears to be valid.
Will it hurt performance, though, if we end up (say) having rcu-tasks-
 based synchronisation for updates on every indirect call in the kernel?
(As would result from a plugin-based opt-out approach.)

> As I stated before, I think that the best solution is to use a GCC plugin,
> [...] Such a solution will not enable the calling code to be
> written in C and would require a plugin for each architecture.
I'm afraid I don't see why.  If we use the static_calls infrastructure,
 but then do a source-level transformation in the compiler plugin to turn
 indirect calls into dynamic_calls, it should be possible to create an
 opt-out system without any arch-specific code in the plugin (the arch-
 specific stuff being all in the static_calls code).
Any reason that can't be done?  (Note: I don't know much about GCC
 internals, maybe there's something obvious that stops a plugin doing
 things like that.)

> Feel free to try my code and give me feedback. I did not get a feedback on my
> last version. Is there a fundamental problem with my plugin? Did you try it 
> and got bad results, or perhaps it did not build?
I didn't test your patches yet, because I was busy trying to get mine
 working and ready to post (and also with unrelated work).  But now that
 that's done, next time I have cycles spare for indirect call stuff I
 guess testing (and reviewing) your approach will be next on my list.

> Why do you prefer an approach
> which requires annotation of the callers, instead of something that is much
> more transparent?
I'm concerned about the overhead (in both time and memory) of running
 learning on every indirect call site (including ones that aren't in a
 hot-path, and ones which have such a wide variety of callees that
 promotion really doesn't help) throughout the whole kernel.  Also, an
 annotating programmer knows the locking/rcu context and can thus tell
 whether a given dynamic_call should use synchronise_rcu_tasks(),
 synchronise_rcu(), or perhaps something else (if e.g. the call always
 happens under a mutex, then the updater work could take that mutex).

The real answer, though, is that I don't so much prefer this approach,
 as think that both should be tried "publicly" and evaluated by more
 developers than just us three.  There's a reason this series is
 marked RFC ;-)


-Ed


[RFC PATCH v2 0/4] dynamic indirect call promotion

2019-02-01 Thread Edward Cree
This series introduces 'dynamic_calls', branch trees of static calls (updated
 at runtime using text patching), to avoid making indirect calls to common
 targets.  The basic mechanism is
	if (func == static_key_1.target)
		call_static_key_1(args);
	else if (func == static_key_2.target)
		call_static_key_2(args);
	/* ... */
	else
		(*func)(args); /* typically leads to a retpoline nowadays */
 with some additional statistics-gathering to allow periodic relearning of
 branch targets.  Creating and calling a dynamic call table are each a single
 line in the consuming code, although they expand to a nontrivial amount of
 data and text in the kernel image.
This is essentially indirect branch prediction, performed in software because
 we can't trust hardware to get it right.  While the processor may speculate
 into the function calls, this is *probably* OK since they are known to be
 functions that frequently get called in this path, and thus are much less
 likely to contain side-channel information leaks than a completely arbitrary
 target PC from a branch target buffer.  Moreover, when the speculation is
 accurate we positively want to be able to speculate into the callee.
The branch target statistics are collected with percpu variables, counting
 both 'hits' on the existing branch targets and 'misses', divided into counts
 for up to four specific targets (first-come-first-served) and a catch-all
 miss count used once that table is full.
When the total number of specific misses on a cpu reaches 1000, work is
 triggered which adds up counts across all CPUs and chooses the two most-
 popular call targets to patch into the call path.
If instead the catch-all miss count reaches 1000, the counts and specific
 targets for that cpu are discarded, since either the target is too
 unpredictable (lots of low-frequency callees rather than a few dominating
 ones) or the targets that populated the table were by chance unpopular ones.
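
In toy form, the recording logic is something like the following (a
 deliberately simplified, single-threaded model; the real series uses
 percpu data and a workqueue, and the names here are invented):

	#define CANDIDATES	4
	#define MISS_THRESHOLD	1000

	struct candidate { void *func; unsigned long hits; };
	static struct candidate cand[CANDIDATES];
	static unsigned long catchall_misses;

	static void record_miss(void *func)
	{
		int i;

		for (i = 0; i < CANDIDATES; i++) {
			if (cand[i].func == func) {	/* known candidate */
				cand[i].hits++;
				return;
			}
			if (!cand[i].func) {	/* free slot: first come, first served */
				cand[i].func = func;
				cand[i].hits = 1;
				return;
			}
		}
		/* Table full: count against the catch-all; if that gets too
		 * big, throw the table away and start learning afresh.
		 * (The threshold-triggered relearning work is omitted here.)
		 */
		if (++catchall_misses >= MISS_THRESHOLD) {
			for (i = 0; i < CANDIDATES; i++) {
				cand[i].func = NULL;
				cand[i].hits = 0;
			}
			catchall_misses = 0;
		}
	}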
To ensure that the static key target does not change between the if () check
 and the call, the whole dynamic_call must take place in an RCU read-side
 critical section (which, since the callee does not know it is being called in
 this manner, then lasts at least until the callee returns), and the patching
 at re-learning time is done with the help of a static_key to switch callers
 off the dynamic_call path and RCU synchronisation to ensure none are still on
 it.  In cases where RCU cannot be used (e.g. because some callees need to RCU
 synchronise), it might be possible to add a variant that uses
 synchronize_rcu_tasks() when updating, but this series does not attempt this.

The dynamic_calls created by this series are opt-in, partly because of the
 abovementioned rcu_read_lock requirement.

My attempts to measure the performance impact of dynamic_calls have been
 inconclusive; the effects on an RX-side UDP packet rate test were within
 ±1.5% and nowhere near statistical significance (p around 0.2-0.3 with n=6
 in a Welch t-test).  This could mean that dynamic_calls are ineffective,
 but it could also mean that many more sites need converting before any gain
 shows up, or it could just mean that my testing was insufficiently sensitive
 or measuring the wrong thing.  Given these poor results, this series is
 clearly not 'ready', hence the RFC tags, but hopefully it will inform the
 discussion in this area.

As before, this series depends on Josh's "static calls" patch series (v3 this
 time).  My testing was done with out-of-line static calls, since the inline
 implementation lead to crashes; I have not yet determined whether they were
 the fault of my patch or of the static calls series.

Edward Cree (4):
  static_call: add indirect call promotion (dynamic_call) infrastructure
  net: core: use a dynamic_call for pt_prev->func() in RX path
  net: core: use a dynamic_call for dst_input
  net: core: use a dynamic_call for pt_prev->list_func() in list RX path

 include/linux/dynamic_call.h | 300 +++
 include/net/dst.h            |   5 +-
 init/Kconfig                 |  11 ++
 kernel/Makefile              |   1 +
 kernel/dynamic_call.c        | 131 +++
 net/core/dev.c               |  18 ++-
 net/core/dst.c               |   2 +
 7 files changed, 463 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/dynamic_call.h
 create mode 100644 kernel/dynamic_call.c



[RFC PATCH v2 1/4] static_call: add indirect call promotion (dynamic_call) infrastructure

2019-02-01 Thread Edward Cree
Uses runtime instrumentation of callees from an indirect call site to
 populate an indirect-call-wrapper branch tree.  Essentially we're doing
 indirect branch prediction in software because the hardware can't be
 trusted to get it right; this is sad.
Calls to these trampolines must take place within an RCU read-side
 critical section.  This is necessary because we use RCU synchronisation
 to ensure that no CPUs are running the fast path while we patch it;
 otherwise they could be between checking a static_call's func and
 actually calling it, and end up calling the wrong function.  The use
 of RCU as the synchronisation method means that dynamic_calls cannot be
 used for functions which call synchronize_rcu(), thus the mechanism has
 to be opt-in rather than being automatically applied to all indirect
 calls in the kernel.

Enabled by new CONFIG_DYNAMIC_CALLS, which defaults to off (and depends
 on a static_call implementation being available).

Signed-off-by: Edward Cree 
---
 include/linux/dynamic_call.h | 300 +++
 init/Kconfig                 |  11 ++
 kernel/Makefile              |   1 +
 kernel/dynamic_call.c        | 131 +++
 4 files changed, 443 insertions(+)
 create mode 100644 include/linux/dynamic_call.h
 create mode 100644 kernel/dynamic_call.c

diff --git a/include/linux/dynamic_call.h b/include/linux/dynamic_call.h
new file mode 100644
index ..2e84543c0c8b
--- /dev/null
+++ b/include/linux/dynamic_call.h
@@ -0,0 +1,300 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_DYNAMIC_CALL_H
+#define _LINUX_DYNAMIC_CALL_H
+
+/*
+ * Dynamic call (optpoline) support
+ *
+ * Dynamic calls use code patching and runtime learning to promote indirect
+ * calls into direct calls using the static_call machinery.  They give the
+ * flexibility of function pointers, but with improved performance.  This is
+ * especially important for cases where retpolines would otherwise be used, as
+ * retpolines can significantly impact performance.
+ * The two callees learned to be most common will be made through static_calls,
+ * while for any other callee the trampoline will fall back to an indirect call
+ * (or a retpoline, if those are enabled).
+ * Patching of newly learned callees into the fast-path relies on RCU to ensure
+ * the fast-path is not in use on any CPU; thus the calls must be made under
+ * the RCU read lock.
+ *
+ *
+ * A dynamic call table must be defined in file scope with
+ * DYNAMIC_CALL_$NR(ret, name, type1, ..., type$NR);
+ * where $NR is from 1 to 4, ret is the return type of the function and type1
+ * through type$NR are the argument types.  Then, calls can be made through a
+ * matching function pointer 'func' with
+ * x = dynamic_name(func, arg1, ..., arg$NR);
+ * which will behave equivalently to
+ * (*func)(arg1, ..., arg$NR);
+ * except hopefully with higher performance.  It is allowed for multiple
+ * callsites to use the same dynamic call table, in which case they will share
+ * statistics for learning.  This will perform well as long as the callsites
+ * typically have the same set of common callees.
+ *
+ * Usage example:
+ *
+ * struct foo {
+ * int x;
+ * int (*f)(int);
+ * }
+ * DYNAMIC_CALL_1(int, handle_foo, int);
+ *
+ * int handle_foo(struct foo *f)
+ * {
+ * return dynamic_handle_foo(f->f, f->x);
+ * }
+ *
+ * This should behave the same as if the function body were changed to:
+ * return (f->f)(f->x);
+ * but potentially with improved performance.
+ */
+
+#define DEFINE_DYNAMIC_CALL_1(_ret, _name, _type1)			       \
+_ret dynamic_##_name(_ret (*func)(_type1), _type1 arg1);
+
+#define DEFINE_DYNAMIC_CALL_2(_ret, _name, _type1, _type2)		       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2), _type1 arg1, _type2 arg2);
+
+#define DEFINE_DYNAMIC_CALL_3(_ret, _name, _type1, _type2, _type3)	       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3), _type1 arg1,       \
+		     _type2 arg2, _type3 arg3);
+
+#define DEFINE_DYNAMIC_CALL_4(_ret, _name, _type1, _type2, _type3, _type4)    \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3, _type4), _type1 arg1,\
+		     _type2 arg2, _type3 arg3, _type4 arg4);
+
+#ifdef CONFIG_DYNAMIC_CALLS
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Number of callees from the slowpath to track on each CPU */
+#define DYNAMIC_CALL_CANDIDATES	4
+/*
+ * Number of fast-path callees; to change this, much of the macrology below
+ * must also be changed.
+ */
+#define DYNAMIC_CALL_BRANCHES  2
+struct dynamic_call_candidate {
+   void *func;
+   unsigned long hit_count;
+};
+struct dynamic_call_percpu {
+   struct dynamic_call_candidate candidates[DYNAMIC_CALL_CANDIDATES];
+   unsigned long hit_count[DYNAMIC_CALL_BRANCH

[RFC PATCH v2 2/4] net: core: use a dynamic_call for pt_prev->func() in RX path

2019-02-01 Thread Edward Cree
Typically a small number of callees, such as ip[v6]_rcv or packet_rcv,
 will cover most packets.

Signed-off-by: Edward Cree 
---
 net/core/dev.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8e276e0192a1..7b38a33689d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -146,6 +146,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -1949,6 +1950,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
+DYNAMIC_CALL_4(int, deliver_skb, struct sk_buff *, struct net_device *,
+  struct packet_type *, struct net_device *);
+
 static inline int deliver_skb(struct sk_buff *skb,
  struct packet_type *pt_prev,
  struct net_device *orig_dev)
@@ -1956,7 +1960,7 @@ static inline int deliver_skb(struct sk_buff *skb,
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
return -ENOMEM;
refcount_inc(&skb->users);
-   return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   return dynamic_deliver_skb(pt_prev->func, skb, skb->dev, pt_prev, orig_dev);
 }
 
 static inline void deliver_ptype_list_skb(struct sk_buff *skb,
@@ -4970,7 +4974,8 @@ static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
 
ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
if (pt_prev)
-   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   ret = dynamic_deliver_skb(pt_prev->func, skb, skb->dev, pt_prev,
+ orig_dev);
return ret;
 }
 
@@ -5015,7 +5020,8 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
pt_prev->list_func(head, pt_prev, orig_dev);
else
list_for_each_entry_safe(skb, next, head, list)
-   pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   dynamic_deliver_skb(pt_prev->func, skb, skb->dev,
+   pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)



[RFC PATCH v2 4/4] net: core: use a dynamic_call for pt_prev->list_func() in list RX path

2019-02-01 Thread Edward Cree
There are currently only two possible callees, ip_list_rcv and
 ipv6_list_rcv.  Even when more are added, most packets will typically
 follow one of a small number of callees on any given system.

Signed-off-by: Edward Cree 
---
 net/core/dev.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7b38a33689d8..ecf41618a279 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5006,6 +5006,9 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
+DYNAMIC_CALL_3(void, deliver_skb_list, struct list_head *, struct packet_type *,
+  struct net_device *);
+
 static inline void __netif_receive_skb_list_ptype(struct list_head *head,
  struct packet_type *pt_prev,
  struct net_device *orig_dev)
@@ -5017,7 +5020,8 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
if (list_empty(head))
return;
if (pt_prev->list_func != NULL)
-   pt_prev->list_func(head, pt_prev, orig_dev);
+   dynamic_deliver_skb_list(pt_prev->list_func, head, pt_prev,
+orig_dev);
else
list_for_each_entry_safe(skb, next, head, list)
dynamic_deliver_skb(pt_prev->func, skb, skb->dev,


[RFC PATCH v2 3/4] net: core: use a dynamic_call for dst_input

2019-02-01 Thread Edward Cree
Typically there will be a small number of callees, such as ip_local_deliver
 or ip_forward, which will cover most packets.

Signed-off-by: Edward Cree 
---
 include/net/dst.h | 5 -
 net/core/dst.c    | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 6cf0870414c7..5dd838b9a7d2 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -444,10 +445,12 @@ static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb)
return skb_dst(skb)->output(net, sk, skb);
 }
 
+DEFINE_DYNAMIC_CALL_1(int, dst_input, struct sk_buff *);
+
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
 {
-   return skb_dst(skb)->input(skb);
+   return dynamic_dst_input(skb_dst(skb)->input, skb);
 }
 
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
diff --git a/net/core/dst.c b/net/core/dst.c
index 81ccf20e2826..a00a75bab84e 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -342,3 +342,5 @@ void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
free_percpu(md_dst);
 }
 EXPORT_SYMBOL_GPL(metadata_dst_free_percpu);
+
+DYNAMIC_CALL_1(int, dst_input, struct sk_buff *);



Re: [PATCH] rds: ib: force endiannes annotation

2019-04-29 Thread Edward Cree
On 29/04/2019 12:18, Nicholas Mc Guire wrote:
> On Mon, Apr 29, 2019 at 12:00:06PM +0100, Edward Cree wrote:
>> Again, a __force cast doesn't seem necessary here.  It looks like the
>>  code is just using the wrong types; if all of src, dst and uncongested
>>  were __le64 instead of uint64_t, and the last two lines replaced with
>>  rds_cong_map_updated(map, le64_to_cpu(uncongested)); then the semantics
>>  would be kept with neither sparse errors nor __force.
>>
>> __force is almost never necessary and mostly just masks other bugs or
>>  endianness confusion in the surrounding code.  Instead of adding a
>>  __force, either fix the code to be sparse-clean or leave the sparse
>>  warning in place so that future developers know there's something not
>>  right.
>>
> changing uncongested to __le64 is not an option here - it would only move
> the sparse warnings to those other locations where the ports that 
> became uncongested are being or'ed into uncongested.
That's why I say to change *src and *dst too.  Sparse won't mind the
 conversion from void * to __le64 * when they're assigned, and the only
 operations we do on them...
> uncongested |= ~(*src) & *dst;
> *dst++ = *src++;
... are some bitwise ops on the values (bitwise ops are legal in any
 endianness) and incrementation of the pointers (which cares only about
 the pointee size, not type).
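
Concretely, something like this (a sketch under the types assumed above,
 not the actual rds patch):

	static void cong_copy(struct rds_cong_map *map, void *srcp,
			      void *dstp, unsigned int n)
	{
		__le64 *src = srcp, *dst = dstp;  /* no sparse complaint */
		__le64 uncongested = 0;
		unsigned int i;

		for (i = 0; i < n; i++) {
			uncongested |= ~(*src) & *dst;  /* endian-agnostic */
			*dst++ = *src++;
		}
		/* one conversion at the boundary; no __force anywhere */
		rds_cong_map_updated(map, le64_to_cpu(uncongested));
	}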

-Ed


Re: [PATCH] net: sfc: falcon: convert to i2c_new_dummy_device

2019-07-25 Thread Edward Cree
On 24/07/2019 23:47, David Miller wrote:
> From: Wolfram Sang 
> Date: Mon, 22 Jul 2019 19:26:35 +0200
>
>> Move from i2c_new_dummy() to i2c_new_dummy_device(). So, we now get an
>> ERRPTR which we use in error handling.
>>
>> Signed-off-by: Wolfram Sang 

Subject & description are incomplete, you're also changing i2c_new_device()
 to i2c_new_client_device().
Other than that,
Acked-by: Edward Cree 

> Solarflare folks, please review/test.
>
> Thank you.
Falcon isn't likely to get tested by us, I think we only have about three
 of them left in the building, two of which are in display cabinets ;-)
We end-of-lifed this hardware a couple of years ago, maybe it should be
 downgraded from 'supported' to 'odd fixes'.  Or even moved to staging,
 like that qlogic driver recently was.


Re: [PATCH net-next 03/10] sfc: Use dev_get_drvdata where possible

2019-07-24 Thread Edward Cree
On 24/07/2019 12:26, Chuhong Yuan wrote:
> Instead of using to_pci_dev + pci_get_drvdata,
> use dev_get_drvdata to make code simpler.
>
> Signed-off-by: Chuhong Yuan 
Acked-by: Edward Cree 

> ---
>  drivers/net/ethernet/sfc/ef10.c |  4 ++--
>  drivers/net/ethernet/sfc/efx.c  | 10 +-
>  2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
> index 16d6952c312a..0ec13f520e90 100644
> --- a/drivers/net/ethernet/sfc/ef10.c
> +++ b/drivers/net/ethernet/sfc/ef10.c
> @@ -508,7 +508,7 @@ static ssize_t efx_ef10_show_link_control_flag(struct device *dev,
>  struct device_attribute *attr,
>  char *buf)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>  
>   return sprintf(buf, "%d\n",
>  ((efx->mcdi->fn_flags) &
> @@ -520,7 +520,7 @@ static ssize_t efx_ef10_show_primary_flag(struct device *dev,
> struct device_attribute *attr,
> char *buf)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>  
>   return sprintf(buf, "%d\n",
>  ((efx->mcdi->fn_flags) &
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index ab58b837df47..2fef7402233e 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -2517,7 +2517,7 @@ static struct notifier_block efx_netdev_notifier = {
>  static ssize_t
>  show_phy_type(struct device *dev, struct device_attribute *attr, char *buf)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>   return sprintf(buf, "%d\n", efx->phy_type);
>  }
>  static DEVICE_ATTR(phy_type, 0444, show_phy_type, NULL);
> @@ -2526,7 +2526,7 @@ static DEVICE_ATTR(phy_type, 0444, show_phy_type, NULL);
>  static ssize_t show_mcdi_log(struct device *dev, struct device_attribute *attr,
>   char *buf)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>   struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
>  
>   return scnprintf(buf, PAGE_SIZE, "%d\n", mcdi->logging_enabled);
> @@ -2534,7 +2534,7 @@ static ssize_t show_mcdi_log(struct device *dev, struct device_attribute *attr,
>  static ssize_t set_mcdi_log(struct device *dev, struct device_attribute *attr,
>   const char *buf, size_t count)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>   struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
>   bool enable = count > 0 && *buf != '0';
>  
> @@ -3654,7 +3654,7 @@ static int efx_pci_sriov_configure(struct pci_dev *dev, int num_vfs)
>  
>  static int efx_pm_freeze(struct device *dev)
>  {
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>  
>   rtnl_lock();
>  
> @@ -3675,7 +3675,7 @@ static int efx_pm_freeze(struct device *dev)
>  static int efx_pm_thaw(struct device *dev)
>  {
>   int rc;
> - struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> + struct efx_nic *efx = dev_get_drvdata(dev);
>  
>   rtnl_lock();
>  



Re: [PATCH v3 2/6] static_call: Add basic static call infrastructure

2019-01-10 Thread Edward Cree
On 09/01/19 22:59, Josh Poimboeuf wrote:
> Static calls are a replacement for global function pointers.  They use
> code patching to allow direct calls to be used instead of indirect
> calls.  They give the flexibility of function pointers, but with
> improved performance.  This is especially important for cases where
> retpolines would otherwise be used, as retpolines can significantly
> impact performance.
>
> The concept and code are an extension of previous work done by Ard
> Biesheuvel and Steven Rostedt:
>
>   https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
>   https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org
>
> There are two implementations, depending on arch support:
>
> 1) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
> 2) basic function pointers
>
> For more details, see the comments in include/linux/static_call.h.
>
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/Kconfig  |   3 +
>  include/linux/static_call.h   | 135 ++
>  include/linux/static_call_types.h |  13 +++
>  3 files changed, 151 insertions(+)
>  create mode 100644 include/linux/static_call.h
>  create mode 100644 include/linux/static_call_types.h
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 4cfb6de48f79..7e469a693da3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -885,6 +885,9 @@ config HAVE_ARCH_PREL32_RELOCATIONS
> architectures, and don't require runtime relocation on relocatable
> kernels.
>  
> +config HAVE_STATIC_CALL
> + bool
> +
>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> new file mode 100644
> index ..9e85c479cd11
> --- /dev/null
> +++ b/include/linux/static_call.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STATIC_CALL_H
> +#define _LINUX_STATIC_CALL_H
> +
> +/*
> + * Static call support
> + *
> + * Static calls use code patching to hard-code function pointers into direct
> + * branch instructions.  They give the flexibility of function pointers, but
> + * with improved performance.  This is especially important for cases where
> + * retpolines would otherwise be used, as retpolines can significantly impact
> + * performance.
> + *
> + *
> + * API overview:
> + *
> + *   DECLARE_STATIC_CALL(key, func);
> + *   DEFINE_STATIC_CALL(key, func);
> + *   static_call(key, args...);
> + *   static_call_update(key, func);
> + *
> + *
> + * Usage example:
> + *
> + *   # Start with the following functions (with identical prototypes):
> + *   int func_a(int arg1, int arg2);
> + *   int func_b(int arg1, int arg2);
> + *
> + *   # Define a 'my_key' reference, associated with func_a() by default
> + *   DEFINE_STATIC_CALL(my_key, func_a);
> + *
> + *   # Call func_a()
> + *   static_call(my_key, arg1, arg2);
> + *
> + *   # Update 'my_key' to point to func_b()
> + *   static_call_update(my_key, func_b);
> + *
> + *   # Call func_b()
> + *   static_call(my_key, arg1, arg2);
> + *
> + *
> + * Implementation details:
> + *
> + *This requires some arch-specific code (CONFIG_HAVE_STATIC_CALL).
> + *Otherwise basic indirect calls are used (with function pointers).
> + *
> + *Each static_call() site calls into a trampoline associated with the key.
> + *The trampoline has a direct branch to the default function.  Updates to a
> + *key will modify the trampoline's branch destination.
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL
> +#include 
> +extern void arch_static_call_transform(void *site, void *tramp, void *func);
> +#endif
> +
> +
> +#define DECLARE_STATIC_CALL(key, func)					\
> + extern struct static_call_key key;  \
> + extern typeof(func) STATIC_CALL_TRAMP(key)
> +
> +
> +#if defined(CONFIG_HAVE_STATIC_CALL)
> +
> +struct static_call_key {
> + void *func, *tramp;
> +};
> +
> +#define DEFINE_STATIC_CALL(key, _func)					\
> + DECLARE_STATIC_CALL(key, _func);\
> + struct static_call_key key = {  \
> + .func = _func,  \
> + .tramp = STATIC_CALL_TRAMP(key),\
> + };  \
> + ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +static inline void __static_call_update(struct static_call_key *key, void *func)
> +{
> + cpus_read_lock();
> + WRITE_ONCE(key->func, func);
> + arch_static_call_transform(NULL, key->tramp, func);
> + cpus_read_unlock();
> +}
> +
> +#define static_call_update(key, func)					\
> +({

Re: Code of Conduct: Let's revamp it.

2018-09-18 Thread Edward Cree

The new Code of Conduct makes me feel threatened and uncomfortable.

No, really.  As a person with (diagnosed) Asperger's, I'm a member of,
 objectively, a marginalised minority.  Effectively (i.e. this is a massive
 oversimplification), I was born without the hard-wired circuitry for social
 interactions that is normally a part of the human brain; consequently I
 have to run a slow and inaccurate software simulation when interacting
 with 'normal' people.

In nearly all the communities I participate in, this is a constantly
 limiting factor for me.  But there is one world that is blessedly free of
 such things: the world of open-source software.  It is one of the last
 places where my particular neurodiversity does _not_ mark me out as Other,
 does _not_ force me to carefully watch what I say and present a falsely
 constructed façade in place of my real identity.  For here, we care not
 for 'feelings'; either the code is good or it is bad, and in the latter
 case we say so directly and bluntly.  Not only does this mean that I don't
 have to guard my tongue when critiquing someone else's patch, far more
 importantly it means I can understand what's being said when _my_ patches
 are criticised.  (Almost all of my best ideas and patches have been born
 out of someone telling me I'm wrong.)

The Linux kernel community is a place without office politics, without
 subtle subtexts, without primate dominance dynamics.  A place where
 criticism _can_ be gracefully accepted _without_ having to worry that
 admitting to being wrong will lower one's status.  A place where I, and
 people like me, can feel at home, and maybe even create something of value.

And the Contributor Covenant looks very much like the camel's nose of an
 attempt to take that place, that community, away from me.  To replace it
 with an Orwellian nightmare where I must forever second-guess what is safe
 to say.  (First they came for "master/slave replication", and I did not
 speak up because I was not a DBA.)

I cannot speak for my employer (hence why I am posting this from my
 personal address), but to the extent that my rôle as a contributor to the
 networking subsystem, and as co-maintainer of the sfc driver, gives me any
 standing in a _personal_ capacity, I absolutely cannot sign up to this
 'Pledge' nor accept the 'Responsibilities' to police the speech of others
 that it makes a duty of maintainership, and I urge the project leadership
 to revert its adoption.

Some elements of the Code are unobjectionable; sexual advances, for
 instance, have no place on the lkml (though they may at, say, a
 conference, and not everyone can reliably predict whether they are
 unwelcome), and the ability of kernel developers to accept constructive
 criticism is one of the strengths that has made Linux what it is.  But far
 too many of its provisions rely on ill-defined terms, and thus give those
 charged with interpreting those terms the power to destroy livelihoods.
 By placing a corporate body (the LF) in the position of arbiter, an avenue
 is opened for commercial pressure to be applied; and the legalistic
 phrasing of the Code practically invites rules-lawyering whereby the most
 abusive may twist it into a weapon to further their abuse.

If the Code were reduced to something more like the old Code of Conflict,
 reminding people to 'be liberal in what they accept and conservative in
 what they emit', and clarifying that patch submissions should be judged by
 the _code_ and not by any characteristics or beliefs of the submitter (I
 don't think the enumerated list of protected classes is helpful, as a
 legalistic abuser can always slip into a crack between them), I think the
 sting would be drawn.  Probably the CoConflict would make a better base
 from which to draft such a document.

(A note for the irony-challenged: where I use Progressive terms-of-art,
 such as 'marginalised', 'Other' and 'identity', in the above, I am
 endeavouring to show that this alleged push for 'inclusiveness' fails on
 its own terms; I am _not_ accepting the theory behind those terms nor
 suggesting that, in reality, the kernel community owes me any special
 treatment on account of my 'diversity'.)



Re: Code of Conduct: Let's revamp it.

2018-09-19 Thread Edward Cree
On 19/09/18 15:18, Jonathan Corbet wrote:
> I'd like to address just this part, speaking only for myself.
> The LF is not in the position of arbitrating anything here.  The body
> charged with that is the LF Technical Advisory Board, which is a
> different thing.

Thank you for clarifying that.

Jon, you're a good person, and I trust you not to harass developers.
 I'm not expecting you, or gregkh, or hpa to launch a dastardly plot
 to boot an innocent developer off the project.

But there are too many ways this can go wrong, maybe not now or next
 week but in five or ten years, when maybe a different kind of person
 is on the TAB, or maybe external pressure is brought to bear on TAB
 members.  (Some people are speculating that pressure has already
 been brought to bear on Linus, although I'd like to stress that if
 he feels that changing the way he communicates is best for the
 project then I for one thoroughly respect that.)

Or maybe some manipulative extrovert will manage to whip up a media
 storm about some developer's innocuous remarks, the court of public
 opinion will immediately convict, other developers will be caught
 in a preference falsification cascade as no-one wants to take the
 risk of defending them.  Are you, and the other members of the TAB,
 strong enough to hold back the lynch mobs when you know the guy
 hasn't done anything wrong?  Do you really want to take on the
 responsibility to do that, perhaps time and time again?  And can
 you be sure that you won't fall for a misleading framing by some
 charismatic sociopath of a fight you missed the start of?

Given that possibility, I think it is important for the kernel
 community to continue to send a strong signal that divisive identity
 politics are not welcome here; and I think that to adopt a Code of
 Conduct that contains in its opening sentence a laundry-list of
 protected classes (and was written by a, shall we say, divisive
 figure with a history of identity politics advocacy) sends precisely
 the opposite signal.

Linux is too important to civilisation to allow it to be turned into
 just another battlefield for the American culture war; commit
 8a104f8b5867 seems like an invitation to both armies to take up
 positions on our fertile farmland.  The only rule that gives no
 comfort to discriminatory political abusers (on either side) is—
 ye shall judge by their code alone.


Re: Code of Conduct: Let's revamp it.

2018-09-19 Thread Edward Cree
On 20/09/18 02:16, Olof Johansson wrote:
> I would be very surprised if any of my peers on the TAB ever had those
> intentions, and I know I would not have them myself.
In case my references to individuals made it unclear: I have no reason to
 suspect _any_ of the present TAB members would; everything I know about
 them points to them being good, honest and principled people.

> I can
> see how that kind of environment _could_ be implemented with the same
> code of conduct as a base, but [...] I know I
> would fight strongly against that.
It is definitely reassuring to hear you say that.

> There is a list in the first paragraph, but the preceding words say
> that it should be a *harassment-free experience for everyone*. That
> part of the paragraph is to me the most important part.
It certainly _should_ be the most important; IMHO the sentence should end
 after 'everyone'.

> Your above argument that the Code of Conduct is problematic because of
> who wrote it seems to contradict your statement that we shall judge by
> code (or text) alone.
I think there are important differences between code to be run by CPUs
 and a Code to be run by humans.  And when the author goes on a victory
 lap on Twitter and declares the Code to be "a political document", is
 it any surprise I'm worried?

Applying extra scrutiny to a political document written by someone with a
 history of stirring up political antagonism is like checking the locking
 extra-carefully in a patch from a developer who has a history of letting
 locking bugs slip through.


Re: Code of Conduct: Let's revamp it.

2018-09-20 Thread Edward Cree
On 20/09/18 10:27, unconditionedwitn...@redchan.it wrote:
> Contributors can, at any time, rescind the license grant regarding their
> property via written notice to those whom they are rescinding the grant
> from (regarding their property (code)).

I know others have already said it, but:
This is legally nonsense.  The only way I can revoke someone's rights to
 my code under the GPL is if they violate the terms of the GPL.  If I
 were to do so otherwise, then _I_ would be in violation for having
 distributed derived works of the kernel without a GPL, not to mention
 the obvious reliance/estoppel problems.

Moreover, even if I _could_ revoke the license, I wouldn't want to do
 so; it would be ridiculously petty in itself and the precedent it would
 set would be destructive to the entire open-source community, about
 which I care deeply.  It is _because_ Linux and other open-source
 projects are so important to humanity that I spoke up about what I
 perceive as a threat to it.

In short, "unconditionedwitness", please shut up.  You're not helping.


Re: [PATCH v2 net-next 2/4] net: add skeleton of bpfilter kernel module

2018-05-03 Thread Edward Cree
On 03/05/18 05:36, Alexei Starovoitov wrote:
> bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
> and user mode helper code that is embedded into bpfilter.ko
>
> The steps to build bpfilter.ko are the following:
> - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
> - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
>   is converted into bpfilter_umh.o object file
>   with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
>   Example:
>   $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
>   4cf8 T _binary_net_bpfilter_bpfilter_umh_end
>   4cf8 A _binary_net_bpfilter_bpfilter_umh_size
>    T _binary_net_bpfilter_bpfilter_umh_start
> - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
>
> bpfilter_kern.c is a normal kernel module code that calls
> the fork_usermode_blob() helper to execute part of its own data
> as a user mode process.
>
> Notice that _binary_net_bpfilter_bpfilter_umh_start - end
> is placed into .init.rodata section, so it's freed as soon as __init
> function of bpfilter.ko is finished.
> As part of __init the bpfilter.ko does first request/reply action
> via two unix pipe provided by fork_usermode_blob() helper to
> make sure that umh is healthy. If not it will kill it via pid.
>
> Later bpfilter_process_sockopt() will be called from bpfilter hooks
> in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
>
> If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
> kill umh as well.
>
> Signed-off-by: Alexei Starovoitov 
> ---
>  include/linux/bpfilter.h  | 15 +++
>  include/uapi/linux/bpfilter.h | 21 ++
>  net/Kconfig   |  2 +
>  net/Makefile  |  1 +
>  net/bpfilter/Kconfig  | 17 
>  net/bpfilter/Makefile | 24 +++
>  net/bpfilter/bpfilter_kern.c  | 93 +++
>  net/bpfilter/main.c   | 63 +
>  net/bpfilter/msgfmt.h | 17 
>  net/ipv4/Makefile |  2 +
>  net/ipv4/bpfilter/Makefile|  2 +
>  net/ipv4/bpfilter/sockopt.c   | 42 +++
>  net/ipv4/ip_sockglue.c| 17 
>  13 files changed, 316 insertions(+)
>  create mode 100644 include/linux/bpfilter.h
>  create mode 100644 include/uapi/linux/bpfilter.h
>  create mode 100644 net/bpfilter/Kconfig
>  create mode 100644 net/bpfilter/Makefile
>  create mode 100644 net/bpfilter/bpfilter_kern.c
>  create mode 100644 net/bpfilter/main.c
>  create mode 100644 net/bpfilter/msgfmt.h
>  create mode 100644 net/ipv4/bpfilter/Makefile
>  create mode 100644 net/ipv4/bpfilter/sockopt.c
>
> diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
> new file mode 100644
> index ..687b1760bb9f
> --- /dev/null
> +++ b/include/linux/bpfilter.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_BPFILTER_H
> +#define _LINUX_BPFILTER_H
> +
> +#include 
> +
> +struct sock;
> +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
> + unsigned int optlen);
> +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
> + int *optlen);
> +extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
> +char __user *optval,
> +unsigned int optlen, bool is_set);
> +#endif
> diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
> new file mode 100644
> index ..2ec3cc99ea4c
> --- /dev/null
> +++ b/include/uapi/linux/bpfilter.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_LINUX_BPFILTER_H
> +#define _UAPI_LINUX_BPFILTER_H
> +
> +#include 
> +
> +enum {
> + BPFILTER_IPT_SO_SET_REPLACE = 64,
> + BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
> + BPFILTER_IPT_SET_MAX,
> +};
> +
> +enum {
> + BPFILTER_IPT_SO_GET_INFO = 64,
> + BPFILTER_IPT_SO_GET_ENTRIES = 65,
> + BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
> + BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
> + BPFILTER_IPT_GET_MAX,
> +};
> +
> +#endif /* _UAPI_LINUX_BPFILTER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index b62089fb1332..ed6368b306fa 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
>  
>  endif
>  
> +source "net/bpfilter/Kconfig"
> +
>  source "net/dccp/Kconfig"
>  source "net/sctp/Kconfig"
>  source "net/rds/Kconfig"
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..7f982b7682bd 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_TLS)   += tls/
>  obj-$(CONFIG_XFRM)   += xfrm/
>  obj-$(CONFIG_UNIX)   += unix/
>  obj-$(CONFIG_NET)+= ipv6/
> +obj-$(CONFIG_BPFILTER)   += bpfilter/
>  obj-$(CONFIG_PACKET) += packet/

Re: [PATCH][net-next] sfc: fix an off-by-one compare on an array size

2017-01-31 Thread Edward Cree
On 31/01/17 16:30, Colin King wrote:
> From: Colin Ian King 
>
> encap_type should be checked to see if it is greater or equal to
> the size of array map to fix an off-by-one array size check. This
> fixes an array overrun read as detected by static analysis by
> CoverityScan, CID#1398883 ("Out-of-bounds-read")
>
> Fixes: 9b41080125176841e ("sfc: insert catch-all filters for encapsulated traffic")
> Signed-off-by: Colin Ian King 
Good catch.

Acked-by: Edward Cree 
> ---
>  drivers/net/ethernet/sfc/ef10.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
> index 8bec938..0475f18 100644
> --- a/drivers/net/ethernet/sfc/ef10.c
> +++ b/drivers/net/ethernet/sfc/ef10.c
> @@ -5080,7 +5080,7 @@ static int efx_ef10_filter_insert_def(struct efx_nic *efx,
>  
>   /* quick bounds check (BCAST result impossible) */
>   BUILD_BUG_ON(EFX_EF10_BCAST != 0);
> - if (encap_type > ARRAY_SIZE(map) || map[encap_type] == 0) {
> + if (encap_type >= ARRAY_SIZE(map) || map[encap_type] == 0) {
>   WARN_ON(1);
>   return -EINVAL;
>   }




Re: [PATCH] ethtool: fix a kernel infoleak in ethtool_get_pauseparam

2016-06-01 Thread Edward Cree
On 01/06/16 15:39, Kangjie Lu wrote:
> The field autoneg of pauseparam is not initialized in some
> implementations of get_pauseparam(), but the whole object is
> copied to userland.
>
> Signed-off-by: Kangjie Lu 
> ---
>  net/core/ethtool.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index f426c5a..84544bd 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1723,7 +1723,10 @@ static noinline_for_stack int ethtool_set_channels(struct net_device *dev,
>  
>  static int ethtool_get_pauseparam(struct net_device *dev, void __user *useraddr)
>  {
> - struct ethtool_pauseparam pauseparam = { ETHTOOL_GPAUSEPARAM };
AIUI an incomplete compound initialiser will fill all unspecified fields
with zeroes of the appropriate type.  So this patch is unnecessary.

Per C99, §6.7.8.21:
> If there are fewer initializers in a brace-enclosed list than there are 
> elements or members of an aggregate [...] the remainder of the aggregate 
> shall be initialized implicitly the same as objects that have static storage 
> duration.
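
A one-file demonstration, with a stand-in struct rather than the real
ethtool_pauseparam:

	#include <assert.h>

	struct pause { unsigned cmd, autoneg, rx_pause, tx_pause; };

	int main(void)
	{
		struct pause p = { 1 };	/* cmd = 1; everything else 0 */

		assert(p.autoneg == 0 && p.rx_pause == 0 && p.tx_pause == 0);
		return 0;
	}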

-Ed

> + struct ethtool_pauseparam pauseparam;
> +
> + memset(&pauseparam, 0, sizeof(pauseparam));
> + pauseparam.cmd = ETHTOOL_GPAUSEPARAM;
>  
>   if (!dev->ethtool_ops->get_pauseparam)
>   return -EOPNOTSUPP;



Re: [PATCH v3 3/4] ptp_clock: allow for it to be optional

2016-11-08 Thread Edward Cree
On 07/11/16 22:14, Nicolas Pitre wrote:
> In order to break the hard dependency between the PTP clock subsystem and
> ethernet drivers capable of being clock providers, this patch provides
> simple PTP stub functions to allow linkage of those drivers into the
> kernel even when the PTP subsystem is configured out. Drivers must be
> ready to accept NULL from ptp_clock_register() in that case.
>
> And to make it possible for PTP to be configured out, the select statement
> in those driver's Kconfig menu entries is converted to the new "imply"
> statement. This way the PTP subsystem may have Kconfig dependencies of
> its own, such as POSIX_TIMERS, without having to make those ethernet
> drivers unavailable if POSIX timers are configured out. And when support
> for POSIX timers is selected again then the default config option for PTP
> clock support will automatically be adjusted accordingly.
>
> The pch_gbe driver is a bit special as it relies on extra code in
> drivers/ptp/ptp_pch.c. Therefore we let the make process descend into
> drivers/ptp/ even if PTP_1588_CLOCK is unselected.
>
> Signed-off-by: Nicolas Pitre 
> Reviewed-by: Josh Triplett 
> Acked-by: Richard Cochran 
For the sfc change:
Acked-by: Edward Cree 

And I accept that "suggests" isn't needed and the current "imply" semantics
are probably fine.  If not we can always change it or resurrect "suggests".
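
For reference, the NULL-tolerant registration pattern drivers need is
 roughly this (a sketch with illustrative names, not the exact sfc code):

	ptp->clock = ptp_clock_register(&ptp->caps, &efx->pci_dev->dev);
	if (IS_ERR(ptp->clock))
		return PTR_ERR(ptp->clock);	/* a real error */
	if (!ptp->clock)
		/* PTP subsystem configured out; carry on without a clock */
		return 0;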


Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Edward Cree
On 20/10/16 18:04, Nicolas Pitre wrote:
> On Thu, 20 Oct 2016, Edward Cree wrote:
>> Also, I don't think having any FOO=y should preclude BAZ=m.  Suppose both
>> FOO and FOO2 imply BAZ, FOO=y and FOO2=m.
> Some people didn't like the fact that you could turn a driver from m to 
> y and silently lose some features if they were provided by a subsystem 
> that also used to be m, which arguably is not the same as being 
> explicitly disabled.  With "select" this is not a problem as the target 
> symbol is also promoted to y in that case, so I wanted to preserve that 
> property.
Right, but that's an argument for pushing the subsystem's default to y,
not for preventing changing the subsystem back to m afterwards.
>> Then if BAZ-features are only
>> desired for driver FOO2, BAz=m makes sense.
> In that case it would make more sense to add a config option related to 
> FOO asking if BAZ features are desired for that driver (there is already 
> one occurrence of that with PTP).  Or you could simply drop the "imply" 
> statement from the FOO config entry.
But the desire is a property of the user, not of the driver.  If you're
willing to add CONFIG_FOO_BAZ to every combination of (driver, subsystem)
then "imply" becomes unnecessary, doesn't it?  Conversely, if you *don't*
want to have to do that, then "imply" needs to only ever deal in defaults,
not in limitations.
>> There is also the case of drivers with the ability to detect at runtime
>> whether BAZ is present, rather than making the decision at build time, but
>> I'm not sure how common that is.
> Right now that's how PTP support is done.  Drivers can optimize things 
> at build time, but most of them simply cope with a NULL return from 
> ptp_clock_register().  Hence the imply statement becomes a big 
> configuration hint rather than some hard build dependency.
Right, so those drivers can use PTP if they're y and PTP is m, as long as
the PTP module is loaded when they probe.  But current "imply" semantics
won't allow that...

I think that Josh's suggestion (have the UI warn you if you set BAZ to m
while FOO=y) is the right approach, but I also think it should be done
now rather than at some unspecified future time.  Otherwise you forbid
potentially valid configs.

-Ed


Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Edward Cree
On 20/10/16 19:29, Nicolas Pitre wrote:
> On Thu, 20 Oct 2016, Edward Cree wrote:
>> But the desire is a property of the user, not of the driver.  If you're
>> willing to add CONFIG_FOO_BAZ to every combination of (driver, subsystem)
>> then "imply" becomes unnecessary, doesn't it?
> Absolutely.  And if that's something that inspires you please be my 
> guest.  So far, though, this apparently didn't inspire the majority of 
> driver authors who preferred to have a smaller set of config options and 
> forcefully pull in the BAZ features with a "select".  But "select" comes 
> with its set of evils which "imply" is meant to overcome.
It really doesn't inspire me either ;)
I was using it as a way to set up the converse, rather than as any kind of
serious suggestion.
And I agree that "imply", as it stands, is an improvement over "select" for
these kinds of cases.  I just think it could be further improved.
>> Conversely, if you *don't*
>> want to have to do that, then "imply" needs to only ever deal in defaults,
>> not in limitations.
> As I explained, It still has to prevent BAZ=m if FOO moves from m to y 
> otherwise this would effectively have the same result as BAZ=n in 
> practice and that is not what people expect if BAZ actually isn't n in 
> your .config file.  That's why "select" also has that particular 
> semantic.
If FOO "moves from" m to y (by which I'm assuming you mean a call to
sym_set_tristate_value()), then by all means set BAZ=y.  But the user
should then still be allowed to move BAZ from y to m without having to
change FOO (though hopefully they will get warned about it somehow).
> Here "imply" is meant to be a weaker form of "select".  If you prefer 
> not to have that limitation imposed by either "select" and "imply" then 
> simply don't use them at all.  Nothing forces you to use any of them if 
> your code can cope with any config combination.
I'm interpreting "imply" as being more a way of saying "if you want FOO you
probably want BAZ as well".  But maybe that should be yet another new
keyword if it's so different from what you want "imply" to be.  "suggests",
perhaps.
>> Right, so those drivers can use PTP if they're y and PTP is m, as long 
>> as the PTP module is loaded when they probe.
> Not at the moment. There is no way for PTP to dynamically signal to 
> interested drivers its presence at run time.  And drivers, when 
> built-in, typically probe their hardware during the boot process even 
> before you have the chance to load any module. If that ever changes, 
> then the imply or select statement could simply be dropped.
At least for PCIe devices, the driver can be un- and re-bound (despite
being built-in) through sysfs.  So you (already) can re-probe them after
loading PTP.  So driver=y && PTP=m is valid, today.
>> I think that Josh's suggestion (have the UI warn you if you set BAZ to m
>> while FOO=y) is the right approach, but I also think it should be done
>> now rather than at some unspecified future time.
> Please advocate this with kconfig UI authors.  My recursion stack is 
> already quite deep.
If I'm reading your patch correctly, your symbol.c additions enforce this
restriction, and AFAIK the UI can't override that by saying "Yeah I warned
the user and he said it was fine".
The kconfig UI API would need to change; sym_set_tristate_value() could
grow an 'override-imply' flag, for instance.

But I'm not married to the idea; I don't have a real-world use-case this
interferes with, so if you still think I'm wrong, feel free to ignore me :)

-Ed


Re: [net-next][PATCH] RDS: keep data type consistent in the user visible header

2017-02-21 Thread Edward Cree
On 21/02/17 14:29, David Laight wrote:
>> The entire file should use the proper "__uX" kernel types
>> rather than the uint* ones.
> The uint* ones are part of the C standard :-)
>
>   David
>
... which is exactly why we can't use them.

http://yarchive.net/comp/linux/kernel_headers.html#17

Visibility of <stdint.h> types is defined by a twisty maze of standards, all
different (POSIX, SuS, BSD_SOURCE, XOPEN_SOURCE...).

__u8 and friends are in the reserved system namespace, so we can safely use 
them in user headers.
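
E.g. a user-visible header can safely do (illustrative struct, not
 quoted from RDS):

	#include <linux/types.h>

	struct example_info {
		__u32	id;
		__u64	value;
	};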

-Ed



Re: sfc: process RX event inner checksum flags

2017-02-10 Thread Edward Cree
On 10/02/17 16:14, Colin Ian King wrote:
> Hi there,
>
>
> not sure if this is a bug, or intentional, but CoverityScan picked up a
> mismatch in arguments when calling efx_ef10_handle_rx_event_error() with
> commit "sfc: process RX event inner checksum flags"

It's a bug, thanks for the catch (and thanks Coverity as well).

Will send a patch shortly.

-Ed



Re: [PATCH 0732/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Edward Cree
On 02/08/16 12:40, Baole Ni wrote:
> I find that the developers often just specified the numeric value
> when calling a macro which is defined with a parameter for access permission.
> As we know, these numeric value for access permission have had the 
> corresponding macro,
> and that using macro can improve the robustness and readability of the code,
> thus, I suggest replacing the numeric parameter with the macro.
NAK.  To anyone with enough Unix experience to be contributing to the kernel,
the octal values are *easier* to read: they're compact, and usually just take
one of a few stereotyped values anyway (mostly 0444 or 0644).  The macros are
full of fluffy noise and take longer to read.
(Also, if you're sending a 1,285-patch series, you should probably reconsider
the choices that have brought you to this point.)
-Ed
> Signed-off-by: Chuansheng Liu 
> Signed-off-by: Baole Ni 
> ---
>  drivers/net/ethernet/sfc/ef10.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
> index 1f30912..fcf06c5 100644
> --- a/drivers/net/ethernet/sfc/ef10.c
> +++ b/drivers/net/ethernet/sfc/ef10.c
> @@ -275,9 +275,9 @@ static ssize_t efx_ef10_show_primary_flag(struct device *dev,
>  ? 1 : 0);
>  }
>  
> -static DEVICE_ATTR(link_control_flag, 0444, efx_ef10_show_link_control_flag,
> +static DEVICE_ATTR(link_control_flag, S_IRUSR | S_IRGRP | S_IROTH, efx_ef10_show_link_control_flag,
>  NULL);
> -static DEVICE_ATTR(primary_flag, 0444, efx_ef10_show_primary_flag, NULL);
> +static DEVICE_ATTR(primary_flag, S_IRUSR | S_IRGRP | S_IROTH, efx_ef10_show_primary_flag, NULL);
>  
>  static int efx_ef10_probe(struct efx_nic *efx)
>  {




[RFC] sysfs: defer instantiation of default attrs

2016-02-20 Thread Edward Cree
Ok, here's a rather ugly proof-of-concept patch.  I've gone through some
 contortions to avoid double-locking the kernfs_mutex; I should really
 try to think of a better approach.
However, (within /sys/class/net) it only seems to do anything for the
 queue kobjects, not the netdevice or statistics; those appear to be done
 by a different method (attribute groups rather than default attributes).
It should be possible to do something similar for attribute groups, but I
 will need to spend some more time thinking about it.

Testing this in a VM, and with udevd disabled (being too lazy to do it
 properly), created 1024 bridges.  On ls'ing
 /sys/class/net/bridge*/queues/*/, saw free memory drop by 64kB,
 suggesting that much had been saved by deferral.  It's not very much,
 hopefully deferring attribute groups will do better.

Changes behaviour if someone tries to add a node that clashes with one of
 the default attrs.  Previously the add would fail with EEXIST, now it
 succeeds while later the deferred populate will fail to add the default
 attr node, pr_warn(), and carry on (no point passing the EEXIST back to
 the caller, as there's nothing they can do about it).
I've gone with that as it's simple and it doesn't seem likely to matter -
 name collisions should be found promptly either way - but if it's a
 problem I could change it.

Signed-off-by: Edward Cree 
---
 fs/kernfs/dir.c | 100 +---
 fs/kernfs/file.c|  53 ++-
 fs/kernfs/kernfs-internal.h |   2 +
 fs/sysfs/dir.c  |   9 
 fs/sysfs/file.c |  32 ++
 fs/sysfs/mount.c|   2 +
 fs/sysfs/sysfs.h|   1 +
 include/linux/kernfs.h  |  11 +
 include/linux/kobject.h |   2 +
 include/linux/sysfs.h   |  12 +-
 lib/kobject.c   |  48 ++---
 11 files changed, 194 insertions(+), 78 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b774..8ad0531 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -577,26 +577,17 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
return kn;
 }
 
-/**
- * kernfs_add_one - add kernfs_node to parent without warning
- * @kn: kernfs_node to be added
- *
- * The caller must already have initialized @kn->parent.  This
- * function increments nlink of the parent's inode if @kn is a
- * directory and link into the children list of the parent.
- *
- * RETURNS:
- * 0 on success, -EEXIST if entry with the given name already
- * exists.
- */
-int kernfs_add_one(struct kernfs_node *kn)
+static void __kernfs_activate(struct kernfs_node *kn, bool locked);
+
+int __kernfs_add_one(struct kernfs_node *kn, bool locked)
 {
struct kernfs_node *parent = kn->parent;
struct kernfs_iattrs *ps_iattr;
bool has_ns;
int ret;
 
-   mutex_lock(&kernfs_mutex);
+   if (!locked)
+   mutex_lock(&kernfs_mutex);
 
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -627,7 +618,8 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattrs->ia_ctime = ps_iattrs->ia_mtime = CURRENT_TIME;
}
 
-   mutex_unlock(&kernfs_mutex);
+   if (!locked)
+   mutex_unlock(&kernfs_mutex);
 
/*
 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -637,14 +629,44 @@ int kernfs_add_one(struct kernfs_node *kn)
 * trigger deactivation.
 */
if (!(kernfs_root(kn)->flags & KERNFS_ROOT_CREATE_DEACTIVATED))
-   kernfs_activate(kn);
+   __kernfs_activate(kn, locked);
return 0;
 
 out_unlock:
-   mutex_unlock(&kernfs_mutex);
+   if (!locked)
+   mutex_unlock(&kernfs_mutex);
return ret;
 }
 
+
+/**
+ * kernfs_add_one - add kernfs_node to parent without warning
+ * @kn: kernfs_node to be added
+ *
+ * The caller must already have initialized @kn->parent.  This
+ * function increments nlink of the parent's inode if @kn is a
+ * directory and link into the children list of the parent.
+ *
+ * RETURNS:
+ * 0 on success, -EEXIST if entry with the given name already
+ * exists.
+ */
+int kernfs_add_one(struct kernfs_node *kn)
+{
+   return __kernfs_add_one(kn, false);
+}
+
+/* Caller must ensure kernfs_mutex held */
+void kernfs_ensure_populated(struct kernfs_node *kn)
+{
+   if (unlikely(kn->flags & KERNFS_UNPOPULATED)) {
+   struct kernfs_root *kr = kernfs_root(kn);
+
+   kr->populate(kn);
+   kn->flags &= ~KERNFS_UNPOPULATED;
+   }
+}
+
 /**
  * kernfs_find_ns - find kernfs_node with the given name
  * @parent: kernfs_node to search under
@@ -664,6 +686,8 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node

Re: [PATCH v2 net-next] net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()

2020-06-25 Thread Edward Cree
On 24/06/2020 22:06, Jason A. Donenfeld wrote:
> Hi Alexander,
>
> This patch introduced a behavior change around GRO_DROP:
>
> napi_skb_finish used to sometimes return GRO_DROP:
>
>> -static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
>> +static gro_result_t napi_skb_finish(struct napi_struct *napi,
>> +struct sk_buff *skb,
>> +gro_result_t ret)
>>  {
>>  switch (ret) {
>>  case GRO_NORMAL:
>> -if (netif_receive_skb_internal(skb))
>> -ret = GRO_DROP;
>> +gro_normal_one(napi, skb);
>>
> But under your change, gro_normal_one and the various calls that makes
> never propagates its return value, and so GRO_DROP is never returned to
> the caller, even if something drops it.
This followed the pattern set by napi_frags_finish(), and is
 intentional: gro_normal_one() usually defers processing of
 the skb to the end of the napi poll, so by the time we know
 that the network stack has dropped it, the caller has long
 since returned.
In fact the RX will be handled by netif_receive_skb_list_internal(),
 which can't return NET_RX_SUCCESS vs. NET_RX_DROP, because it's
 handling many skbs which might not all have the same verdict.
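
The batching shape is roughly this (a from-memory sketch, not a
 verbatim quote of net/core/dev.c):

	/* Queue one GRO_NORMAL skb; flush the list once it's big enough */
	static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
	{
		list_add_tail(&skb->list, &napi->rx_list);
		if (++napi->rx_count >= gro_normal_batch) {
			netif_receive_skb_list_internal(&napi->rx_list);
			INIT_LIST_HEAD(&napi->rx_list);
			napi->rx_count = 0;
		}
	}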

When originally doing this work I felt this was OK because
 almost no-one was sensitive to the return value — almost the
 only callers that were were in our own sfc driver, and then
 only for making bogus decisions about interrupt moderation.
Alexander just followed my lead, so don't blame him ;-)

> For some context, I'm consequently mulling over this change in my code,
> since checking for GRO_DROP now constitutes dead code:
Incidentally, it's only dead because dev_gro_receive() can't
 return GRO_DROP either.  If it could, napi_skb_finish()
 would pass that on.  And napi_gro_frags() (which AIUI is the
 better API for some performance reasons that I can't remember)
 can still return GRO_DROP too.

However, I think that incrementing your rx_dropped stat when
 the network stack chose to drop the packet is the wrong
 thing to do anyway (IMHO rx_dropped is for "there was a
 packet on the wire but either the hardware or the driver was
 unable to receive it"), so I'd say go ahead and remove the
 check.

HTH
-ed


Re: [RFC PATCH net-next] sfc: replace in_interrupt() usage

2020-09-29 Thread Edward Cree
> On Mon, Sep 28 2020 at 21:05, Edward Cree wrote:
>> Only compile-tested so far, because I'm waiting for my kernel to
>>  finish rebuilding with CONFIG_DEBUG_ATOMIC_SLEEP

I've now tested and confirmed that the might_sleep warning goes
 away with this patch.

Thomas, do you want to pull it into v2 of your series, or should
 I submit it separately to David?

-ed


Re: [patch 15/35] net: sfc: Replace in_interrupt() usage.

2020-09-28 Thread Edward Cree
On 27/09/2020 20:49, Thomas Gleixner wrote:
> Note, that the fixes tag is empty as it is unclear which of the commits to
> blame.
Seems like it should be
Fixes: f00bf2305cab ("sfc: don't update stats on VF when called in atomic context")
 since that adds the in_interrupt() check and the code concerned
 doesn't seem to have changed a great deal since.

Anyway, this fix looks correct, and you can have my
Acked-by: Edward Cree 
 but I think it might be cleaner to avoid having to have this unused
 can_sleep argument on all the NICs that don't need it, by instead
 adding an update_stats_atomic() member to struct efx_nic_type, which
 could be set to the same as update_stats() for everything except
 EF10 VFs which would just do the call to efx_update_sw_stats().
I'll send an rfc patch embodying the above shortly...

-ed


[RFC PATCH net-next] sfc: replace in_interrupt() usage

2020-09-28 Thread Edward Cree
efx_ef10_try_update_nic_stats_vf() used in_interrupt() to figure out
 whether it is safe to sleep (for MCDI) or not.
The only caller from which it was not is efx_net_stats(), which can be
 invoked under dev_base_lock from net-sysfs::netstat_show().
So add a new update_stats_atomic() method to struct efx_nic_type, and
 call it from efx_net_stats(), removing the need for
 efx_ef10_try_update_nic_stats_vf() to behave differently for this case
 (which it wasn't doing correctly anyway).
For all nic_types other than EF10 VF, this method is NULL and so we
 call the regular update_stats() methods, which are happy with being
 called from atomic contexts.

Fixes: f00bf2305cab ("sfc: don't update stats on VF when called in atomic context")
Reported-by: Sebastian Andrzej Siewior 
Signed-off-by: Edward Cree 
---
Only compile-tested so far, because I'm waiting for my kernel to
 finish rebuilding with CONFIG_DEBUG_ATOMIC_SLEEP which I'm hoping
 is the right thing to detect the bug in the existing code.
I also wasn't quite sure how to give credit to the thorough analysis
 in the commit message of Sebastian's patch.  I don't think we have
 a Whatever-by: tag to cover that, do we?
And this doesn't include your GFP_KERNEL change, which should
 probably go in separately if you take this.

 drivers/net/ethernet/sfc/ef10.c   | 22 +-
 drivers/net/ethernet/sfc/efx_common.c |  2 +-
 drivers/net/ethernet/sfc/net_driver.h |  5 +
 drivers/net/ethernet/sfc/nic_common.h |  7 +++
 4 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index c9df2e96ebe4..b702ba5986dc 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -1871,15 +1871,6 @@ static int efx_ef10_try_update_nic_stats_vf(struct efx_nic *efx)
 
spin_unlock_bh(&efx->stats_lock);
 
-   if (in_interrupt()) {
-   /* If in atomic context, cannot update stats.  Just update the
-* software stats and return so the caller can continue.
-*/
-   spin_lock_bh(&efx->stats_lock);
-   efx_update_sw_stats(efx, stats);
-   return 0;
-   }
-
efx_ef10_get_stat_mask(efx, mask);
 
rc = efx_nic_alloc_buffer(efx, &stats_buf, dma_len, GFP_ATOMIC);
@@ -1938,6 +1929,18 @@ static size_t efx_ef10_update_stats_vf(struct efx_nic *efx, u64 *full_stats,
return efx_ef10_update_stats_common(efx, full_stats, core_stats);
 }
 
+static size_t efx_ef10_update_stats_atomic_vf(struct efx_nic *efx, u64 *full_stats,
+					      struct rtnl_link_stats64 *core_stats)
+{
+   struct efx_ef10_nic_data *nic_data = efx->nic_data;
+
+   /* In atomic context, cannot update HW stats.  Just update the
+* software stats and return so the caller can continue.
+*/
+   efx_update_sw_stats(efx, nic_data->stats);
+   return efx_ef10_update_stats_common(efx, full_stats, core_stats);
+}
+
 static void efx_ef10_push_irq_moderation(struct efx_channel *channel)
 {
struct efx_nic *efx = channel->efx;
@@ -3998,6 +4001,7 @@ const struct efx_nic_type efx_hunt_a0_vf_nic_type = {
.finish_flr = efx_port_dummy_op_void,
.describe_stats = efx_ef10_describe_stats,
.update_stats = efx_ef10_update_stats_vf,
+   .update_stats_atomic = efx_ef10_update_stats_atomic_vf,
.start_stats = efx_port_dummy_op_void,
.pull_stats = efx_port_dummy_op_void,
.stop_stats = efx_port_dummy_op_void,
diff --git a/drivers/net/ethernet/sfc/efx_common.c b/drivers/net/ethernet/sfc/efx_common.c
index c256db241570..72a3f0e09f52 100644
--- a/drivers/net/ethernet/sfc/efx_common.c
+++ b/drivers/net/ethernet/sfc/efx_common.c
@@ -602,7 +602,7 @@ void efx_net_stats(struct net_device *net_dev, struct rtnl_link_stats64 *stats)
struct efx_nic *efx = netdev_priv(net_dev);
 
spin_lock_bh(&efx->stats_lock);
-   efx->type->update_stats(efx, NULL, stats);
+   efx_nic_update_stats_atomic(efx, NULL, stats);
spin_unlock_bh(&efx->stats_lock);
 }
 
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 47aa753e64bd..9f7dfdf708cf 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -1172,6 +1172,9 @@ struct efx_udp_tunnel {
  * @describe_stats: Describe statistics for ethtool
  * @update_stats: Update statistics not provided by event handling.
  * Either argument may be %NULL.
+ * @update_stats_atomic: Update statistics while in atomic context, if that
+ * is more limiting than @update_stats.  Otherwise, leave %NULL and
+ * driver core will call @update_stats.
  * @start_stats: Start the regular fetching of statistics
  * @pull_stats: Pull stats from the NIC and wait until they arrive.
 

Re: [PATCH] [net-next] sfc: avoid an unused-variable warning

2020-05-28 Thread Edward Cree
On 28/05/2020 20:49, David Miller wrote:
> From: Arnd Bergmann 
> Date: Wed, 27 May 2020 15:41:06 +0200
>
>> 'nic_data' is no longer used outside of the #ifdef block
>> in efx_ef10_set_mac_address:
>>
>> drivers/net/ethernet/sfc/ef10.c:3231:28: error: unused variable 'nic_data' [-Werror,-Wunused-variable]
>> struct efx_ef10_nic_data *nic_data = efx->nic_data;
>>
>> Move the variable into a local scope.
>>
>> Fixes: dfcabb078847 ("sfc: move vport_id to struct efx_nic")
>> Signed-off-by: Arnd Bergmann 
> Applied, thanks.
Sorry I didn't see the original patch (I think it disappeared
 into a mail outage).  Fix is good, but I think we can improve
 the code further by moving the declaration down another block,
 into the 'if (rc == -EPERM)', at which point it is no longer
 shadowed by the other nic_data declaration in the
 'else if (!rc)' block.
Alternatively, we could rename the latter to 'pf_nic_data' or
 something.  Any opinions/preferences on which way to go?  We
 could even do both...
When I make up my mind I'll spin an incremental patch.

-ed


Re: [PATCH][next] sfc: fix dereference of table before it is null checked

2020-05-13 Thread Edward Cree
On 12/05/2020 18:13, Colin King wrote:
> From: Colin Ian King 
>
> Currently pointer table is being dereferenced on a null check of
> table->must_restore_filters before it is being null checked, leading
> to a potential null pointer dereference issue.  Fix this by null
> checking table before dereferencing it when checking for a null
> table->must_restore_filters.
>
> Addresses-Coverity: ("Dereference before null check")
> Fixes: e4fe938cff04 ("sfc: move 'must restore' flags out of ef10-specific 
> nic_data")
> Signed-off-by: Colin Ian King 
Acked-by: Edward Cree 


Re: [PATCH] net: sfc: fix possible buffer overflow caused by bad DMA value in efx_siena_sriov_vfdi()

2020-08-03 Thread Edward Cree
On 02/08/2020 16:39, Jia-Ju Bai wrote:
> To fix this problem, "req->op" is assigned to a local variable, and then
> the driver accesses this variable instead of "req->op".
>
> Signed-off-by: Jia-Ju Bai 
Not sure how necessary this is (or even if anyone's still using Siena
 SR-IOV, since it needed a specially-patched libvirt to work), but I
 don't see any reason to refuse.
> diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
> index 83dcfcae3d4b..21a8482cbb3b 100644
> --- a/drivers/net/ethernet/sfc/siena_sriov.c
> +++ b/drivers/net/ethernet/sfc/siena_sriov.c
> @@ -875,6 +875,7 @@ static void efx_siena_sriov_vfdi(struct work_struct *work)
>   struct vfdi_req *req = vf->buf.addr;
>   struct efx_memcpy_req copy[2];
>   int rc;
> + u32 op = req->op;
Could you maybe fix up the xmas tree (reverse-Christmas-tree declaration order) here, rather than making it worse?
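
Something like (sketch):

	struct vfdi_req *req = vf->buf.addr;
	struct efx_memcpy_req copy[2];
	u32 op = req->op;
	int rc;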

Also, you didn't specify in your Subject line which tree this is for.

-ed


Re: [PATCH 5.4 086/129] net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()

2020-09-08 Thread Edward Cree
On 08/09/2020 16:25, Greg Kroah-Hartman wrote:
> From: Alexander Lobakin 
>
> commit 6570bc79c0dfff0f228b7afd2de720fb4e84d61d upstream.
>
> Commit 323ebb61e32b4 ("net: use listified RX for handling GRO_NORMAL
> skbs") made use of listified skb processing for the users of
> napi_gro_frags().
> The same technique can be used in a way more common napi_gro_receive()
> to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers
> including gro_cells and mac80211 users.
> This slightly changes the return value in cases where skb is being
> dropped by the core stack, but it seems to have no impact on related
> drivers' functionality.
> gro_normal_batch is left untouched as it's very individual for every
> single system configuration and might be tuned in manual order to
> achieve an optimal performance.
>
> Signed-off-by: Alexander Lobakin 
> Acked-by: Edward Cree 
> Signed-off-by: David S. Miller 
> Signed-off-by: Hyunsoon Kim 
> Signed-off-by: Greg Kroah-Hartman 
I'm not quite sure why this is stable material (it's a performance
 enhancement, rather than a fix).  But if you do want to take it,
 make sure you've also got
c80794323e82 ("net: Fix packet reordering caused by GRO and listified RX cooperation")
b167191e2a85 ("net: wireless: intel: iwlwifi: fix GRO_NORMAL packet stalling")
 in your tree, particularly the latter as without it this commit
 triggers a severe regression in iwlwifi.

-ed


Re: [PATCH] sfc_ef100: Fix build failure on powerpc

2020-08-13 Thread Edward Cree
On 13/08/2020 15:39, Christophe Leroy wrote:
> ppc6xx_defconfig fails building sfc.ko module, complaining
> about the lack of _umoddi3 symbol.
>
> This is due to the following test
>
>   if (EFX_MIN_DMAQ_SIZE % reader->value) {
>
> Because reader->value is u64.
Already fixed in net.git by 41077c990266 ("sfc: fix ef100 design-param checking").
But thanks anyway.
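
For the record, the usual way to avoid a 64-bit modulo on 32-bit
 arches is something like this (illustrative, not the actual fix):

	u64 rem;

	div64_u64_rem(EFX_MIN_DMAQ_SIZE, reader->value, &rem);
	if (rem)
		return -ERANGE;	/* value must divide the minimum queue size */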


Re: [PATCH AUTOSEL 5.9 054/147] sfc: add and use efx_tx_send_pending in tx.c

2020-10-27 Thread Edward Cree
On 26/10/2020 23:47, Sasha Levin wrote:
> From: Edward Cree 
>
> [ Upstream commit 1c0544d24927e4fad04f858216b8ea767a3bd123 ]
>
> Instead of using efx_tx_queue_partner(), which relies on the assumption
>  that tx_queues_per_channel is 2, efx_tx_send_pending() iterates over
>  txqs with efx_for_each_channel_tx_queue().
That assumption was valid for the code as of v5.9; this change was only
 needed to support the extra queues that were added for encap offloads.
Thus, this patch shouldn't be backported, unless -stable is also planning
 to backport that feature (e.g. 0ce8df661456, 24b2c3751aa3), which I
 doubt (it's nearly 20 patches, and can't be considered a bugfix).

-ed


Re: [PATCH 2/7] sfc: drop unnecessary list_empty

2020-07-27 Thread Edward Cree
On 26/07/2020 11:58, Julia Lawall wrote:
> list_for_each_safe is able to handle an empty list.
> The only effect of avoiding the loop is not initializing the
> index variable.
> Drop list_empty tests in cases where these variables are not
> used.
Sure, why not.
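
Indeed, the loop body simply never runs on an empty list, e.g. (the
 _entry variant, with illustrative names):

	struct efx_thing *item, *next;

	list_for_each_entry_safe(item, next, &efx->foo_list, list)
		kfree(item);
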
Acked-by: Edward Cree 


Re: [PATCH net-next v3 1/2] hinic: add support to handle hw abnormal event

2020-07-24 Thread Edward Cree
On 23/07/2020 20:08, David Miller wrote:
> From: Luo bin 
> Date: Thu, 23 Jul 2020 22:40:37 +0800
>
>> +static int hinic_fw_reporter_dump(struct devlink_health_reporter *reporter,
>> +  struct devlink_fmsg *fmsg, void *priv_ctx,
>> +  struct netlink_ext_ack *extack)
>> +{
>> +struct hinic_mgmt_watchdog_info *watchdog_info;
>> +int err;
>> +
>> +if (priv_ctx) {
>> +watchdog_info = priv_ctx;
>> +err = mgmt_watchdog_report_show(fmsg, watchdog_info);
>> +if (err)
>> +return err;
>> +}
>> +
>> +return 0;
>> +}
> This 'watchdog_info' variable is completely unnecessary, just pass
> 'priv_ctx' as-is into mgmt_watchdog_report_show().
Looks like the 'err' variable is unnecessary too...
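
i.e. the whole thing can collapse to (sketch):

	static int hinic_fw_reporter_dump(struct devlink_health_reporter *reporter,
					  struct devlink_fmsg *fmsg, void *priv_ctx,
					  struct netlink_ext_ack *extack)
	{
		if (priv_ctx)
			return mgmt_watchdog_report_show(fmsg, priv_ctx);
		return 0;
	}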

-ed


Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Edward Cree
On 09/06/2020 17:58, Joe Perches wrote:
> On Tue, 2020-06-09 at 13:16 +0200, Greg Kroah-Hartman wrote:
>> What is wrong with the existing control of dynamic
>> debug messages that you want to add another type of arbitrary grouping
>> to it? 
> There is no existing grouping mechanism.
>
> Many drivers and some subsystems used an internal one
> before dynamic debug.
>
> $ git grep "MODULE_PARM.*\bdebug\b"|wc -l
> 501
>
> This is an attempt to unify those homebrew mechanisms.
In network drivers, this is probably using the existing groupings
 defined by netif_level() - see NETIF_MSG_DRV and friends.  Note
 that those groups are orthogonal to the level, i.e. they control
 netif_err() etc. as well, not just debug messages.
Certainly in the case of sfc, and I'd imagine for many other net
 drivers too, the 'debug' modparam is setting the default for
 net_dev->msg_enable, which can be changed after probe with
 ethtool.
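
E.g. (illustrative, not quoted from sfc):

	/* gated on NETIF_MSG_PROBE in efx->msg_enable; severity is debug */
	netif_dbg(efx, probe, efx->net_dev, "creating channel %d\n", i);
	/* same group mechanism, error severity */
	netif_err(efx, probe, efx->net_dev, "probe failed\n");
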
It doesn't look like the proposed mechanism subsumes that (we have
 rather more than 5 groups, and it's not clear how you'd connect
 it to the existing msg_enable, whose uapi must be maintained); if
 you don't have a way to do this, better exclude drivers/net/ from
 your grep|wc, because you won't be unifying those - in my tree
 that's 119 hits.

-ed


Re: [PATCH v3 1/7] Documentation: dynamic-debug: Add description of level bitmask

2020-06-09 Thread Edward Cree
On 09/06/2020 18:56, Joe Perches wrote:
> These are _not_ netif_ control flags. Some are though.
> For instance:
>
> $ git grep "MODULE_PARM.*\bdebug\b" drivers/net | head -10
> [...]
>
> These are all level/class output controls.
TIL, thanks!  I should have looked deeper rather than assuming
 they were all like ours.

Though judging just by that grep output, it also looks like
 quite a lot of those won't fit into 5 groups either, so some
 rethink may still be needed...

-ed


Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Edward Cree
Disclaimer: *definitely* not speaking for my employer.

On 09/06/2020 18:19, Stephen Hemminger wrote:
> How many times have you or Linus argued about variable naming.
> Yes, words do matter and convey a lot of implied connotation and meaning.
Connotation, unlike denotation, is widely variable.  I would aver
 that for most people who are triggered or offended by the technical
 use of master/slave or similar terms, this is a *learned behaviour*
 that occurs because they have been taught that they should be
 offended by these words.  Are these people also incapable of using
 those words to describe actual historical slavery?
There is a difference between stating "X relates to Y as a slave to
 a master" and "You relate to me (or ought to do so) as a slave to a
 master".  The former is a merely factual claim which is often true
 of technical entities (and in the past or in certain parts of the
 world today is true of some humans, however morally wrong); the
 latter is an assertion of power.  It is only the latter which is,
 or at any rate should be, offensive; the attempt to ban the former
 rests on an equivocation between positive and normative uses.
Anyone who can't put connotation aside and stick to denotation
 (a) needs to grow up
 (b) is ill-suited to working with computers.

> Most projects and standards bodies are taking a stance on fixing the
> language. The IETF is has proposed making changes as well.
An expired Internet-Draft authored by a human-rights charity and a
 media studies postgrad student does not constitute "the IETF has
 proposed".  (Besides, it's historically inaccurate; it claims that
 the word "robot" means slave, when in fact it means the labour owed
 by a _serf_ ("robotnik").  And it cites Orwell with apparently *no*
 sense of irony whatsoever... he was arguing _against_ Newspeak, not
 for it!)

> A common example is that master/slave is unclear and would be clearer
> as primary/secondary or active/backup or controller/worker.
So why isn't controller/worker just as offensive, given all those
 labourers throughout history who have suffered under abusive
 managers?  Or does a word need a tenuous connection to race before
 it can be ritually purified from the language?
And is there really an EE anywhere who finds the terminology of
 master and slave clocks unclear?  I suspect very few would gain a
 better understanding from any of your suggested alternatives.

> Most of networking is based on standards. When the standards wording changes
> (and it will happen soon); then Linux should also change the wording in the
> source, api and documentation.
Rather, it seems that this is an attempt to change Linux in order
 to _de facto_ change the standard, thereby creating pressure on
 standards bodies to change it _de jure_ to match.  Yet, in the
 real world, engineers use and understand the current terminology;
 the push for language purification bears but little reference to
 anything outside of itself.

In conclusion, I'd like to quote from Henry Spencer's Ten
 Commandments for C Programmers (Annotated Edition) [1]:
> As a lamentable side issue, there has been some unrest from the
> fanatics of the Pronoun Gestapo over the use of the word "man"
> in [the eighth] Commandment, for they believe that great efforts
> and loud shouting devoted to the ritual purification of the
> language will somehow redound to the benefit of the downtrodden
> (whose real and grievous woes tendeth to get lost amidst all that
> thunder and fury).

Grumpily yours,
-ed

[1] https://www.lysator.liu.se/c/ten-commandments.html


[PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-06-27 Thread Edward Cree
This series simplifies alignment tracking, generalises bounds tracking and
 fixes some bounds-tracking bugs in the BPF verifier.  Pointer arithmetic on
 packet pointers, stack pointers, map value pointers and context pointers has
 been unified, and bounds on these pointers are only checked when the pointer
 is dereferenced.
Operations on pointers which destroy all relation to the original pointer
 (such as multiplies and shifts) are disallowed if !env->allow_ptr_leaks,
 otherwise they convert the pointer to an unknown scalar and feed it to the
 normal scalar arithmetic handling.
Pointer types have been unified with the corresponding adjusted-pointer types
 where those existed (e.g. PTR_TO_MAP_VALUE[_ADJ] or FRAME_PTR vs
 PTR_TO_STACK); similarly, CONST_IMM and UNKNOWN_VALUE have been unified into
 SCALAR_VALUE.
Pointer types (except CONST_PTR_TO_MAP, PTR_TO_MAP_VALUE_OR_NULL and
 PTR_TO_PACKET_END, which do not allow arithmetic) have a 'fixed offset' and
 a 'variable offset'; the former is used when e.g. adding an immediate or a
 known-constant register, as long as it does not overflow.  Otherwise the
 latter is used, and any operation creating a new variable offset creates a
 new 'id' (and, for PTR_TO_PACKET, clears the 'range').
SCALAR_VALUEs use the 'variable offset' fields to track the range of possible
 values; the 'fixed offset' should never be set on a scalar.
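
For intuition: a tnum is a (value; mask) pair of known-one and unknown
 bits, and the ops on it are cheap bitwise identities; e.g. OR looks
 roughly like this (a sketch of the idea, see patch 2 for the real
 kernel/bpf/tnum.c):

	struct tnum tnum_or(struct tnum a, struct tnum b)
	{
		u64 v = a.value | b.value;	/* bits known set in either */
		u64 mu = a.mask | b.mask;	/* bits unknown in either */

		/* a result bit stays unknown only if it isn't known set */
		return TNUM(v, mu & ~v);
	}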

As of patch 12/12, all tests of tools/testing/selftests/bpf/test_verifier
 and tools/testing/selftests/bpf/test_align pass.

v3: added a few more tests; removed RFC tags.

v2: fixed nfp build, made test_align pass again and extended it with a few
 new tests (though still need to add more).

Edward Cree (12):
  selftests/bpf: add test for mixed signed and unsigned bounds checks
  bpf/verifier: rework value tracking
  nfp: change bpf verifier hooks to match new verifier data structures
  bpf/verifier: track signed and unsigned min/max values
  bpf/verifier: more concise register state logs for constant var_off
  selftests/bpf: change test_verifier expectations
  selftests/bpf: rewrite test_align
  selftests/bpf: add a test to test_align
  selftests/bpf: add test for bogus operations on pointers
  selftests/bpf: don't try to access past MAX_PACKET_OFF in
test_verifier
  selftests/bpf: add tests for subtraction & negative numbers
  selftests/bpf: variable offset negative tests

 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |   24 +-
 include/linux/bpf.h   |   34 +-
 include/linux/bpf_verifier.h  |   56 +-
 include/linux/tnum.h  |   81 +
 kernel/bpf/Makefile   |2 +-
 kernel/bpf/tnum.c |  180 ++
 kernel/bpf/verifier.c | 1943 -
 tools/testing/selftests/bpf/test_align.c  |  462 -
 tools/testing/selftests/bpf/test_verifier.c   |  293 ++--
 9 files changed, 2034 insertions(+), 1041 deletions(-)
 create mode 100644 include/linux/tnum.h
 create mode 100644 kernel/bpf/tnum.c



[PATCH v3 net-next 01/12] selftests/bpf: add test for mixed signed and unsigned bounds checks

2017-06-27 Thread Edward Cree
Currently fails due to bug in verifier bounds handling.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c0af019..fceed67 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -5279,6 +5279,32 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.prog_type = BPF_PROG_TYPE_LWT_IN,
},
+   {
+   "bounds checks mixing signed and unsigned",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 7),
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, -8),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_10, -16),
+   BPF_MOV64_IMM(BPF_REG_2, -1),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_2, 3),
+   BPF_JMP_IMM(BPF_JSGT, BPF_REG_1, 1, 2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+   BPF_ST_MEM(BPF_B, BPF_REG_0, 0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map1 = { 3 },
+   .errstr_unpriv = "R0 pointer arithmetic prohibited",
+   .errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
+   .result = REJECT,
+   .result_unpriv = REJECT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)



[PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-06-27 Thread Edward Cree
Tracks value alignment by means of tracking known & unknown bits.
Tightens some min/max value checks and fixes a couple of bugs therein.
If pointer leaks are allowed, and adjust_ptr_min_max_vals returns -EACCES,
 treat the pointer as an unknown scalar and try again, because we might be
 able to conclude something about the result (e.g. pointer & 0x40 is either
 0 or 0x40).

Signed-off-by: Edward Cree 
---
 include/linux/bpf.h  |   34 +-
 include/linux/bpf_verifier.h |   40 +-
 include/linux/tnum.h |   79 ++
 kernel/bpf/Makefile  |2 +-
 kernel/bpf/tnum.c|  163 
 kernel/bpf/verifier.c| 1692 --
 6 files changed, 1235 insertions(+), 775 deletions(-)
 create mode 100644 include/linux/tnum.h
 create mode 100644 kernel/bpf/tnum.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index deca4e7..0fc3bbc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -116,35 +116,25 @@ enum bpf_access_type {
 };
 
 /* types of values stored in eBPF registers */
+/* Pointer types represent:
+ * pointer
+ * pointer + imm
+ * pointer + (u16) var
+ * pointer + (u16) var + imm
+ * if (range > 0) then [ptr, ptr + range - off) is safe to access
+ * if (id > 0) means that some 'var' was added
+ * if (off > 0) means that 'imm' was added
+ */
 enum bpf_reg_type {
NOT_INIT = 0,/* nothing was written into register */
-   UNKNOWN_VALUE,   /* reg doesn't contain a valid pointer */
+   SCALAR_VALUE,/* reg doesn't contain a valid pointer */
PTR_TO_CTX,  /* reg points to bpf_context */
CONST_PTR_TO_MAP,/* reg points to struct bpf_map */
PTR_TO_MAP_VALUE,/* reg points to map element value */
PTR_TO_MAP_VALUE_OR_NULL,/* points to map elem value or NULL */
-   FRAME_PTR,   /* reg == frame_pointer */
-   PTR_TO_STACK,/* reg == frame_pointer + imm */
-   CONST_IMM,   /* constant integer value */
-
-   /* PTR_TO_PACKET represents:
-* skb->data
-* skb->data + imm
-* skb->data + (u16) var
-* skb->data + (u16) var + imm
-* if (range > 0) then [ptr, ptr + range - off) is safe to access
-* if (id > 0) means that some 'var' was added
-* if (off > 0) menas that 'imm' was added
-*/
-   PTR_TO_PACKET,
+   PTR_TO_STACK,/* reg == frame_pointer + offset */
+   PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
-
-   /* PTR_TO_MAP_VALUE_ADJ is used for doing pointer math inside of a map
-* elem value.  We only allow this if we can statically verify that
-* access from this register are going to fall within the size of the
-* map element.
-*/
-   PTR_TO_MAP_VALUE_ADJ,
 };
 
 struct bpf_prog;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 621076f..ca7e2ce 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -9,6 +9,7 @@
 
 #include  /* for enum bpf_reg_type */
 #include  /* for MAX_BPF_STACK */
+#include 
 
  /* Just some arbitrary values so we can safely do math without overflowing and
   * are obviously wrong for any sort of memory access.
@@ -19,30 +20,39 @@
 struct bpf_reg_state {
enum bpf_reg_type type;
union {
-   /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
-   s64 imm;
-
-   /* valid when type == PTR_TO_PACKET* */
-   struct {
-   u16 off;
-   u16 range;
-   };
+   /* valid when type == PTR_TO_PACKET */
+   u32 range;
 
/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
 *   PTR_TO_MAP_VALUE_OR_NULL
 */
struct bpf_map *map_ptr;
};
+   /* Fixed part of pointer offset, pointer types only */
+   s32 off;
+   /* Used to find other pointers with the same variable offset, so they
+* can share range knowledge.
+* Exception: for PTR_TO_MAP_VALUE_OR_NULL this is used to share which
+* map value we came from, when one is tested for != NULL.  Note that
+* this overloading means that we can't do pointer arithmetic on a
+* PTR_TO_MAP_VALUE_OR_NULL, we have to NULL-check it _first_.
+*/
u32 id;
+   /* These three fields must be last.  See states_equal() */
+   /* For scalar types (SCALAR_VALUE), this represents our knowledge of
+* the actual value.
+* For pointer types, this represents the variable part of the offset
+* from the pointed-to object, and is shared with all bpf_reg_states
+* with the same id as us.
+ 

[PATCH v3 net-next 03/12] nfp: change bpf verifier hooks to match new verifier data structures

2017-06-27 Thread Edward Cree
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 24 +--
 kernel/bpf/tnum.c |  1 +
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index d696ba4..5b783a9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -79,28 +79,32 @@ nfp_bpf_check_exit(struct nfp_prog *nfp_prog,
   const struct bpf_verifier_env *env)
 {
const struct bpf_reg_state *reg0 = &env->cur_state.regs[0];
+   u64 imm;
 
if (nfp_prog->act == NN_ACT_XDP)
return 0;
 
-   if (reg0->type != CONST_IMM) {
-   pr_info("unsupported exit state: %d, imm: %llx\n",
-   reg0->type, reg0->imm);
+   if (!(reg0->type == SCALAR_VALUE && tnum_is_const(reg0->var_off))) {
+   char tn_buf[48];
+
+   tnum_strn(tn_buf, sizeof(tn_buf), reg0->var_off);
+   pr_info("unsupported exit state: %d, var_off: %s\n",
+   reg0->type, tn_buf);
return -EINVAL;
}
 
-   if (nfp_prog->act != NN_ACT_DIRECT &&
-   reg0->imm != 0 && (reg0->imm & ~0U) != ~0U) {
+   imm = reg0->var_off.value;
+   if (nfp_prog->act != NN_ACT_DIRECT && imm != 0 && (imm & ~0U) != ~0U) {
pr_info("unsupported exit state: %d, imm: %llx\n",
-   reg0->type, reg0->imm);
+   reg0->type, imm);
return -EINVAL;
}
 
-   if (nfp_prog->act == NN_ACT_DIRECT && reg0->imm <= TC_ACT_REDIRECT &&
-   reg0->imm != TC_ACT_SHOT && reg0->imm != TC_ACT_STOLEN &&
-   reg0->imm != TC_ACT_QUEUED) {
+   if (nfp_prog->act == NN_ACT_DIRECT && imm <= TC_ACT_REDIRECT &&
+   imm != TC_ACT_SHOT && imm != TC_ACT_STOLEN &&
+   imm != TC_ACT_QUEUED) {
pr_info("unsupported exit state: %d, imm: %llx\n",
-   reg0->type, reg0->imm);
+   reg0->type, imm);
return -EINVAL;
}
 
diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
index 803bd0d..92eeeb1 100644
--- a/kernel/bpf/tnum.c
+++ b/kernel/bpf/tnum.c
@@ -141,6 +141,7 @@ int tnum_strn(char *str, size_t size, struct tnum a)
 {
return snprintf(str, size, "(%#llx; %#llx)", a.value, a.mask);
 }
+EXPORT_SYMBOL_GPL(tnum_strn);
 
 int tnum_sbin(char *str, size_t size, struct tnum a)
 {



[PATCH v3 net-next 04/12] bpf/verifier: track signed and unsigned min/max values

2017-06-27 Thread Edward Cree
Allows us to, sometimes, combine information from a signed check of one
 bound and an unsigned check of the other.
We now track the full range of possible values, rather than restricting
 ourselves to [0, 1<<30) and considering anything beyond that as
 unknown.  While this is probably not necessary, it makes the code more
 straightforward and symmetrical between signed and unsigned bounds.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |  22 +-
 include/linux/tnum.h |   2 +
 kernel/bpf/tnum.c|  16 +
 kernel/bpf/verifier.c| 727 ++-
 4 files changed, 471 insertions(+), 296 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ca7e2ce..84c6576 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -11,11 +11,15 @@
 #include  /* for MAX_BPF_STACK */
 #include 
 
- /* Just some arbitrary values so we can safely do math without overflowing and
-  * are obviously wrong for any sort of memory access.
-  */
-#define BPF_REGISTER_MAX_RANGE (1024 * 1024 * 1024)
-#define BPF_REGISTER_MIN_RANGE -1
+/* Maximum variable offset umax_value permitted when resolving memory accesses.
+ * In practice this is far bigger than any realistic pointer offset; this limit
+ * ensures that umax_value + (int)off + (int)size cannot overflow a u64.
+ */
+#define BPF_MAX_VAR_OFF	(1ULL << 31)
+/* Maximum variable size permitted for ARG_CONST_SIZE[_OR_ZERO].  This ensures
+ * that converting umax_value to int cannot overflow.
+ */
+#define BPF_MAX_VAR_SIZ	INT_MAX
 
 struct bpf_reg_state {
enum bpf_reg_type type;
@@ -38,7 +42,7 @@ struct bpf_reg_state {
 * PTR_TO_MAP_VALUE_OR_NULL, we have to NULL-check it _first_.
 */
u32 id;
-   /* These three fields must be last.  See states_equal() */
+   /* These five fields must be last.  See states_equal() */
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
 * the actual value.
 * For pointer types, this represents the variable part of the offset
@@ -51,8 +55,10 @@ struct bpf_reg_state {
 * These refer to the same value as var_off, not necessarily the actual
 * contents of the register.
 */
-   s64 min_value; /* minimum possible (s64)value */
-   u64 max_value; /* maximum possible (u64)value */
+   s64 smin_value; /* minimum possible (s64)value */
+   s64 smax_value; /* maximum possible (s64)value */
+   u64 umin_value; /* minimum possible (u64)value */
+   u64 umax_value; /* maximum possible (u64)value */
 };
 
 enum bpf_stack_slot_type {
diff --git a/include/linux/tnum.h b/include/linux/tnum.h
index a0b07bf..0d2d3da 100644
--- a/include/linux/tnum.h
+++ b/include/linux/tnum.h
@@ -17,6 +17,8 @@ struct tnum {
 struct tnum tnum_const(u64 value);
 /* A completely unknown value */
 extern const struct tnum tnum_unknown;
+/* A value that's unknown except that @min <= value <= @max */
+struct tnum tnum_range(u64 min, u64 max);
 
 /* Arithmetic and logical ops */
 /* Shift a tnum left (by a fixed shift) */
diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
index 92eeeb1..1f4bf68 100644
--- a/kernel/bpf/tnum.c
+++ b/kernel/bpf/tnum.c
@@ -17,6 +17,22 @@ struct tnum tnum_const(u64 value)
return TNUM(value, 0);
 }
 
+struct tnum tnum_range(u64 min, u64 max)
+{
+   u64 chi = min ^ max, delta;
+   u8 bits = fls64(chi);
+
+   /* special case, needed because 1ULL << 64 is undefined */
+   if (bits > 63)
+   return tnum_unknown;
+   /* e.g. if chi = 4, bits = 3, delta = (1<<3) - 1 = 7.
+* if chi = 0, bits = 0, delta = (1<<0) - 1 = 0, so we return
+*  constant min (since min == max).
+*/
+   delta = (1ULL << bits) - 1;
+   return TNUM(min & ~delta, delta);
+}
+
 struct tnum tnum_lshift(struct tnum a, u8 shift)
 {
return TNUM(a.value << shift, a.mask << shift);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 82823f1..d45c1d1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -234,12 +234,20 @@ static void print_verifier_state(struct bpf_verifier_state *state)
verbose(",ks=%d,vs=%d",
reg->map_ptr->key_size,
reg->map_ptr->value_size);
-   if (reg->min_value != BPF_REGISTER_MIN_RANGE)
-   verbose(",min_value=%lld",
-   (long long)reg->min_value);
-   if (reg->max_value != BPF_REGISTER_MAX_RANGE)
-   verbose(",max_value=%llu",
-   (unsigned long long)reg->max_value);
+   if (reg->smin_value != reg->u

[PATCH v3 net-next 06/12] selftests/bpf: change test_verifier expectations

2017-06-27 Thread Edward Cree
Some of the verifier's error messages have changed, and some constructs
 that previously couldn't be verified are now accepted.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 226 ++--
 1 file changed, 116 insertions(+), 110 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index fceed67..210a031 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -421,7 +421,7 @@ static struct bpf_test tests[] = {
BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr_unpriv = "R1 pointer arithmetic",
+   .errstr_unpriv = "R1 subtraction from stack pointer",
.result_unpriv = REJECT,
.errstr = "R1 invalid mem access",
.result = REJECT,
@@ -603,8 +603,9 @@ static struct bpf_test tests[] = {
BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_2, -4),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned stack access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"invalid map_fd for function call",
@@ -650,8 +651,9 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "misaligned access",
+   .errstr = "misaligned value access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"sometimes access memory with incorrect alignment",
@@ -672,6 +674,7 @@ static struct bpf_test tests[] = {
.errstr = "R0 invalid mem access",
.errstr_unpriv = "R0 leaks addr",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"jump test 1",
@@ -1215,8 +1218,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[0]) + 1),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"check __sk_buff->hash, offset 0, half store not permitted",
@@ -1319,8 +1323,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[0]) + 2),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"check cb access: word, unaligned 2",
@@ -1330,8 +1335,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[4]) + 1),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"check cb access: word, unaligned 3",
@@ -1341,8 +1347,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[4]) + 2),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"check cb access: word, unaligned 4",
@@ -1352,8 +1359,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[4]) + 3),
BPF_EXIT_INSN(),
},
-   .errstr = "misaligned access",
+   .errstr = "misaligned context access",
.result = REJECT,
+   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"check cb access: double",
@@ -1379,8 +1387,9 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, cb[1])),
BPF_EXIT_INSN(),
},
-   .errstr = "mis

[PATCH v3 net-next 05/12] bpf/verifier: more concise register state logs for constant var_off

2017-06-27 Thread Edward Cree
Signed-off-by: Edward Cree 
---
 kernel/bpf/verifier.c | 46 +++---
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d45c1d1..3e1df75 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -234,25 +234,33 @@ static void print_verifier_state(struct bpf_verifier_state *state)
verbose(",ks=%d,vs=%d",
reg->map_ptr->key_size,
reg->map_ptr->value_size);
-   if (reg->smin_value != reg->umin_value &&
-   reg->smin_value != S64_MIN)
-   verbose(",smin_value=%lld",
-   (long long)reg->smin_value);
-   if (reg->smax_value != reg->umax_value &&
-   reg->smax_value != S64_MAX)
-   verbose(",smax_value=%lld",
-   (long long)reg->smax_value);
-   if (reg->umin_value != 0)
-   verbose(",umin_value=%llu",
-   (unsigned long long)reg->umin_value);
-   if (reg->umax_value != U64_MAX)
-   verbose(",umax_value=%llu",
-   (unsigned long long)reg->umax_value);
-   if (!tnum_is_unknown(reg->var_off)) {
-   char tn_buf[48];
-
-   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
-   verbose(",var_off=%s", tn_buf);
+   if (tnum_is_const(reg->var_off)) {
+   /* Typically an immediate SCALAR_VALUE, but
+* could be a pointer whose offset is too big
+* for reg->off
+*/
+   verbose(",imm=%llx", reg->var_off.value);
+   } else {
+   if (reg->smin_value != reg->umin_value &&
+   reg->smin_value != S64_MIN)
+   verbose(",smin_value=%lld",
+   (long long)reg->smin_value);
+   if (reg->smax_value != reg->umax_value &&
+   reg->smax_value != S64_MAX)
+   verbose(",smax_value=%lld",
+   (long long)reg->smax_value);
+   if (reg->umin_value != 0)
+   verbose(",umin_value=%llu",
+   (unsigned long long)reg->umin_value);
+   if (reg->umax_value != U64_MAX)
+   verbose(",umax_value=%llu",
+   (unsigned long long)reg->umax_value);
+   if (!tnum_is_unknown(reg->var_off)) {
+   char tn_buf[48];
+
+   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+   verbose(",var_off=%s", tn_buf);
+   }
}
verbose(")");
}



[PATCH v3 net-next 08/12] selftests/bpf: add a test to test_align

2017-06-27 Thread Edward Cree
New test adds 14 to the unknown value before adding to the packet pointer,
 meaning there's no 'fixed offset' field and instead we add into the
 var_off, yielding a '4n+2' value.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_align.c | 67 
 1 file changed, 67 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_align.c b/tools/testing/selftests/bpf/test_align.c
index 031bba8..5165d8e 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -374,6 +374,73 @@ static struct bpf_align_test tests[] = {
{33, "R5=pkt(id=4,off=18,r=22,umax_value=2040,var_off=(0x0; 0x7fc))"},
},
},
+   {
+   .descr = "packet variable offset 2",
+   .insns = {
+   /* Create an unknown offset, (4n+2)-aligned */
+   LOAD_UNKNOWN(BPF_REG_6),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 14),
+   /* Add it to the packet pointer */
+   BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+   /* Check bounds and perform a read */
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_5, 0),
+   /* Make a (4n) offset from the value we just read */
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_6, 0xff),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+   /* Add it to the packet pointer */
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+   /* Check bounds and perform a read */
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_5, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .matches = {
+   /* Calculated offset in R6 has unknown value, but known
+* alignment of 4.
+*/
+   {8, "R2=pkt(id=0,off=0,r=8,imm=0)"},
+   {8, "R6=inv(id=0,umax_value=1020,var_off=(0x0; 
0x3fc))"},
+   /* Adding 14 makes R6 be (4n+2) */
+   {9, 
"R6=inv(id=0,umin_value=14,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   /* Packet pointer has (4n+2) offset */
+   {11, 
"R5=pkt(id=1,off=0,r=0,umin_value=14,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   {13, 
"R4=pkt(id=1,off=4,r=0,umin_value=14,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   /* At the time the word size load is performed from R5,
+* its total fixed offset is NET_IP_ALIGN + reg->off (0)
+* which is 2.  Then the variable offset is (4n+2), so
+* the total offset is 4-byte aligned and meets the
+* load's requirements.
+*/
+   {15, 
"R5=pkt(id=1,off=0,r=4,umin_value=14,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   /* Newly read value in R6 was shifted left by 2, so has
+* known alignment of 4.
+*/
+   {18, "R6=inv(id=0,umax_value=1020,var_off=(0x0; 
0x3fc))"},
+   /* Added (4n) to packet pointer's (4n+2) var_off, giving
+* another (4n+2).
+*/
+   {19, 
"R5=pkt(id=2,off=0,r=0,umin_value=14,umax_value=2054,var_off=(0x2; 0xffc))"},
+   {21, 
"R4=pkt(id=2,off=4,r=0,umin_value=14,umax_value=2054,var_off=(0x2; 0xffc))"},
+   /* At the time the word size load is performed from R5,
+* its total fixed offset is NET_IP_ALIGN + reg->off (0)
+* which is 2.  Then the variable offset is (4n+2), so
+* the total offset is 4-byte aligned and meets the
+* load's requirements.
+*/
+   {23, 
"R5=pkt(id=2,off=0,r=4,umin_value=14,umax_value=2054,var_off=(0x2; 0xffc))"},
+   },
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)

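A quick standalone check of the alignment argument in those comments (not
 part of the test; NET_IP_ALIGN is 2 on most architectures):

#include <assert.h>

int main(void)
{
	const int net_ip_align = 2;	/* the fixed part of the offset */
	int n;

	for (n = 0; n < 1000; n++) {
		int var_off = 4 * n + 2;	/* a (4n+2) variable offset */

		/* fixed 2 plus (4n+2) is (4n+4): 4-byte aligned */
		assert((net_ip_align + var_off) % 4 == 0);
	}
	return 0;
}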


[PATCH v3 net-next 07/12] selftests/bpf: rewrite test_align

2017-06-27 Thread Edward Cree
Expectations have changed, as has the format of the logged state.
To make the tests easier to read, add a line-matching framework so that
 each match need only quote the register it cares about.  (Multiple
 matches may refer to the same line, but matches must be listed in
 order of increasing line.)

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_align.c | 225 ++-
 1 file changed, 132 insertions(+), 93 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_align.c 
b/tools/testing/selftests/bpf/test_align.c
index bccebd9..031bba8 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -27,6 +27,11 @@
 #define MAX_INSNS  512
 #define MAX_MATCHES16
 
+struct bpf_reg_match {
+   unsigned int line;
+   const char *match;
+};
+
 struct bpf_align_test {
const char *descr;
struct bpf_insn insns[MAX_INSNS];
@@ -36,10 +41,14 @@ struct bpf_align_test {
REJECT
} result;
enum bpf_prog_type prog_type;
-   const char *matches[MAX_MATCHES];
+   /* Matches must be in order of increasing line */
+   struct bpf_reg_match matches[MAX_MATCHES];
 };
 
 static struct bpf_align_test tests[] = {
+   /* Four tests of known constants.  These aren't staggeringly
+* interesting since we track exact values now.
+*/
{
.descr = "mov",
.insns = {
@@ -53,11 +62,13 @@ static struct bpf_align_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.matches = {
-   "1: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 
R10=fp",
-   "2: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 
R10=fp",
-   "3: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 
R10=fp",
-   "4: R1=ctx 
R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
-   "5: R1=ctx 
R3=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
+   {1, "R1=ctx(id=0,off=0,imm=0)"},
+   {1, "R10=fp0"},
+   {1, "R3=inv2"},
+   {2, "R3=inv4"},
+   {3, "R3=inv8"},
+   {4, "R3=inv16"},
+   {5, "R3=inv32"},
},
},
{
@@ -79,17 +90,19 @@ static struct bpf_align_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.matches = {
-   "1: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R10=fp",
-   "2: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 
R10=fp",
-   "3: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 
R10=fp",
-   "4: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 
R10=fp",
-   "5: R1=ctx 
R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
-   "6: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R10=fp",
-   "7: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R4=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
-   "8: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R4=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
-   "9: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R4=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
-   "10: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R4=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
-   "11: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 
R4=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+   {1, "R1=ctx(id=0,off=0,imm=0)"},
+   {1, "R10=fp0"},
+   {1, "R3=inv1"},
+   {2, "R3=inv2"},
+   {3, "R3=inv4"},
+   {4, "R3=inv8"},
+   {5, "R3=inv16"},
+   {6, "R3=inv1"},
+   {7, "R4=inv32"},
+   {8, "R4=inv16"},
+   {9, "R4=inv8"},
+   {10, "R4=inv4"},
+   {11, "R4=inv2"},
},
},
{
@@ -106,12 +119,14 @@ static struct bpf_align_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.matches = {
-   "1: R1=ctx R3=imm4,min_value=4,max_value=

[PATCH v3 net-next 10/12] selftests/bpf: don't try to access past MAX_PACKET_OFF in test_verifier

2017-06-27 Thread Edward Cree
"direct packet access: test2" was potentially reading four bytes from
 pkt + 0xffff, which could take it past the verifier's limit, causing
 the program to be rejected.
Increase the shifts by one so that R2 is now mask 0x7fff instead of
 mask 0xffff.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 210a031..7df3c34 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2330,8 +2330,8 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, data)),
BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_4),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
-   BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 48),
-   BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 48),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 49),
+   BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 49),
BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_2),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_3),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 8),

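To spell out the arithmetic (a standalone sketch, not part of the patch):
 shifting left then right by the same amount k keeps only the low 64-k bits,
 so bumping the shifts from 48 to 49 halves the reachable offset range:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t x = 0xdeadbeefcafef00dULL;	/* arbitrary "unknown" value */
	uint64_t off48 = (x << 48) >> 48;	/* mask 0xffff: up to 65535 */
	uint64_t off49 = (x << 49) >> 49;	/* mask 0x7fff: up to 32767 */

	printf("%#llx %#llx\n", (unsigned long long)off48,
	       (unsigned long long)off49);
	return 0;
}

With mask 0xffff, pkt + R2 + 4 can reach past MAX_PACKET_OFF (0xffff);
 with mask 0x7fff it cannot.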


[PATCH v3 net-next 11/12] selftests/bpf: add tests for subtraction & negative numbers

2017-06-27 Thread Edward Cree
Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_align.c | 104 +++
 1 file changed, 104 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_align.c 
b/tools/testing/selftests/bpf/test_align.c
index dfd96c6..6bc2ceb 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -497,6 +497,110 @@ static struct bpf_align_test tests[] = {
{16, "R6=pkt(id=1,off=0,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 0x7ffffffffffffffc))"},
}
},
+   {
+   .descr = "variable subtraction",
+   .insns = {
+   /* Create an unknown offset, (4n+2)-aligned */
+   LOAD_UNKNOWN(BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_6),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 14),
+   /* Create another unknown, (4n)-aligned, and subtract
+* it from the first one
+*/
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_7, 2),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_6, BPF_REG_7),
+   /* Bounds-check the result */
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_6, 0, 1),
+   BPF_EXIT_INSN(),
+   /* Add it to the packet pointer */
+   BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+   /* Check bounds and perform a read */
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_5, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .matches = {
+   /* Calculated offset in R6 has unknown value, but known
+* alignment of 4.
+*/
+   {7, "R2=pkt(id=0,off=0,r=8,imm=0)"},
+   {9, "R6=inv(id=0,umax_value=1020,var_off=(0x0; 
0x3fc))"},
+   /* Adding 14 makes R6 be (4n+2) */
+   {10, 
"R6=inv(id=0,umin_value=14,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   /* New unknown value in R7 is (4n) */
+   {11, "R7=inv(id=0,umax_value=1020,var_off=(0x0; 
0x3fc))"},
+   /* Subtracting it from R6 blows our unsigned bounds */
+   {12, "R6=inv(id=0,smin_value=-1006,smax_value=1034,var_off=(0x2; 0xfffffffffffffffc))"},
+   /* Checked s>= 0 */
+   {14, 
"R6=inv(id=0,umin_value=2,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   /* At the time the word size load is performed from R5,
+* its total fixed offset is NET_IP_ALIGN + reg->off (0)
+* which is 2.  Then the variable offset is (4n+2), so
+* the total offset is 4-byte aligned and meets the
+* load's requirements.
+*/
+   {20, 
"R5=pkt(id=1,off=0,r=4,umin_value=2,umax_value=1034,var_off=(0x2; 0x7fc))"},
+   },
+   },
+   {
+   .descr = "pointer variable subtraction",
+   .insns = {
+   /* Create an unknown offset, (4n+2)-aligned and bounded
+* to [14,74]
+*/
+   LOAD_UNKNOWN(BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_6),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_6, 0xf),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 14),
+   /* Subtract it from the packet pointer */
+   BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_5, BPF_REG_6),
+   /* Create another unknown, (4n)-aligned and >= 74.
+* That in fact means >= 76, since 74 % 4 == 2
+*/
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_7, 2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 76),
+   /* Add it to the packet pointer */
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_7),
+   /* Check bounds and perform a read */
+   BPF_MOV64_REG(BPF_REG_4, 

[PATCH v3 net-next 09/12] selftests/bpf: add test for bogus operations on pointers

2017-06-27 Thread Edward Cree
Tests non-add/sub operations (AND, LSH) on pointers decaying them to
 unknown scalars.
Also tests that a pkt_ptr add which could potentially overflow is rejected
 (find_good_pkt_pointers ignores it and doesn't give us any reg->range).

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_align.c | 66 +++-
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_align.c 
b/tools/testing/selftests/bpf/test_align.c
index 5165d8e..dfd96c6 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -441,6 +441,62 @@ static struct bpf_align_test tests[] = {
{23, 
"R5=pkt(id=2,off=0,r=4,umin_value=14,umax_value=2054,var_off=(0x2; 0xffc))"},
},
},
+   {
+   .descr = "dubious pointer arithmetic",
+   .insns = {
+   PREP_PKT_POINTERS,
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   /* ptr & const => unknown & const */
+   BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_5, 0x40),
+   /* ptr << const => unknown << const */
+   BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_5, 2),
+   /* We have a (4n) value.  Let's make a packet offset
+* out of it.  First add 14, to make it a (4n+2)
+*/
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 14),
+   /* Then make sure it's nonnegative */
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_5, 0, 1),
+   BPF_EXIT_INSN(),
+   /* Add it to packet pointer */
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_6, BPF_REG_5),
+   /* Check bounds and perform a read */
+   BPF_MOV64_REG(BPF_REG_4, BPF_REG_6),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_6, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .result = REJECT,
+   .matches = {
+   {4, "R5=pkt(id=0,off=0,r=0,imm=0)"},
+   /* ptr & 0x40 == either 0 or 0x40 */
+   {5, "R5=inv(id=0,umax_value=64,var_off=(0x0; 0x40))"},
+   /* ptr << 2 == unknown, (4n) */
+   {7, "R5=inv(id=0,smax_value=9223372036854775804,umax_value=18446744073709551612,var_off=(0x0; 0xfffffffffffffffc))"},
+   /* (4n) + 14 == (4n+2).  We blow our bounds, because
+* the add could overflow.
+*/
+   {8, "R5=inv(id=0,var_off=(0x2; 0xfffc))"},
+   /* Checked s>=0 */
+   {10, "R5=inv(id=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 0x7ffffffffffffffc))"},
+   /* packet pointer + nonnegative (4n+2) */
+   {12, "R6=pkt(id=1,off=0,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 0x7ffffffffffffffc))"},
+   {14, "R4=pkt(id=1,off=4,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 0x7ffffffffffffffc))"},
+   /* NET_IP_ALIGN + (4n+2) == (4n), alignment is fine.
+* We checked the bounds, but it might have been able
+* to overflow if the packet pointer started in the
+* upper half of the address space.
+* So we did not get a 'range' on R6, and the access
+* attempt will fail.
+*/
+   {16, "R6=pkt(id=1,off=0,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 0x7ffffffffffffffc))"},
+   }
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
@@ -470,10 +526,15 @@ static int do_test_single(struct bpf_align_test *test)
fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
 prog, prog_len, 1, "GPL", 0,
 bpf_vlog, sizeof(bpf_vlog));
-   if (fd_prog < 0) {
+   if (fd_prog < 0 && test->result != REJECT) {
printf("Failed to load program.\n");
printf("%s",

[PATCH v3 net-next 12/12] selftests/bpf: variable offset negative tests

2017-06-27 Thread Edward Cree
Variable ctx accesses and stack accesses aren't allowed, because we can't
 determine what type of value will be read.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 7df3c34..471fbee 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -5311,6 +5311,47 @@ static struct bpf_test tests[] = {
.errstr = "R0 min value is negative, either use unsigned index 
or do a if (index >=0) check.",
.result = REJECT,
},
+   {
+   "variable-offset ctx access",
+   .insns = {
+   /* Get an unknown value */
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
+   /* Make it small and 4-byte aligned */
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 4),
+   /* add it to skb.  We now have either &skb->len or
+* &skb->pkt_type, but we don't know which
+*/
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
+   /* dereference it */
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "variable ctx access var_off=(0x0; 0x4)",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_LWT_IN,
+   },
+   {
+   "variable-offset stack access",
+   .insns = {
+   /* Fill the top 8 bytes of the stack */
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   /* Get an unknown value */
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
+   /* Make it small and 4-byte aligned */
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 4),
+   BPF_ALU64_IMM(BPF_SUB, BPF_REG_2, 8),
+   /* add it to fp.  We now have either fp-4 or fp-8, but
+* we don't know which
+*/
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_10),
+   /* dereference it */
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "variable stack access var_off=(0xfff8; 
0x4)",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_LWT_IN,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)


Re: [PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-06-28 Thread Edward Cree
On 28/06/17 14:50, Daniel Borkmann wrote:
> Hi Edward,
>
> Did you also have a chance in the meantime to look at reducing complexity
> along with your unification? I did run the cilium test suite with your
> latest set from here and current # worst case processed insns that
> verifier has to go through for cilium progs increases from ~53k we have
> right now to ~76k. I'm a bit worried that this quickly gets us close to
> the upper ~98k max limit starting to reject programs again. Alternative
> is to bump the complexity limit again in near future once run into it,
> but preferably there's a way to optimize it along with the rewrite? Do
> you see any possibilities worth exploring? 
The trouble, I think, is that as we're now tracking more information about
 each register value, we're less able to prune branches.  But often that
 information is not actually being used in reaching the exit state.  So it
 seems like the way to tackle this would be to track what information is
 used — or at least, which registers are read from (including e.g. writing
 through them or passing them to helper calls) — in reaching a safe state.
 Then only registers which are used are required to match for pruning.
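Roughly what I have in mind, as a toy model (every name below is invented
 for illustration; nothing like this exists in the verifier today):

#include <stdbool.h>

#define TOY_MAX_REG 11

struct toy_reg {
	long long min_value, max_value;
	bool read;	/* was this reg read between here and the exit? */
};

/* A pruning hit only needs the *used* registers of the current state to
 * fit within the bounds the old, known-safe state was verified with.
 */
static bool toy_states_prunable(const struct toy_reg *old_regs,
				const struct toy_reg *cur_regs)
{
	int i;

	for (i = 0; i < TOY_MAX_REG; i++) {
		if (!old_regs[i].read)
			continue;	/* value never mattered downstream */
		if (cur_regs[i].min_value < old_regs[i].min_value ||
		    cur_regs[i].max_value > old_regs[i].max_value)
			return false;
	}
	return true;
}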
But that tracking would presumably have to propagate backwards through the
 verifier stack, and I'm not sure how easily that could be done.  Someone
 (was it you?) was talking about replacing the current DAG walking and
 pruning with some kind of basic-block thing, which would help with this.
Summary: I think it could be done, but I haven't looked into the details
 of implementation yet; if it's not actually breaking your programs (yet),
 maybe leave it for a followup patch series?

-Ed


Re: [PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-06-28 Thread Edward Cree
On 28/06/17 16:15, Daniel Borkmann wrote:
> On 06/27/2017 02:56 PM, Edward Cree wrote:
>> Tracks value alignment by means of tracking known & unknown bits.
>> Tightens some min/max value checks and fixes a couple of bugs therein.
>
> You mean the one in relation to patch 1/12? Would be good to elaborate
> here since otherwise this gets forgotten few weeks later.
That wasn't the only one; there were also some in the new min/max value
 calculation for ALU ops.  For instance, in subtraction we were taking
 the new bounds as [min-min, max-max] instead of [min-max, max-min].
I can't remember what else there was and there might also have been some
 that I missed but that got incidentally fixed by the rewrite.  But I
 guess I should change "checks" to "checks and updates" in the above?
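
To spell that out as a standalone sketch (simplified: the verifier code
 also clamps to BPF_REGISTER_MIN_RANGE/BPF_REGISTER_MAX_RANGE and has to
 worry about overflow):

struct range { long long min, max; };

/* Correct interval subtraction: the smallest possible result pairs our
 * minimum with the other side's maximum, and vice versa.
 */
static struct range range_sub(struct range a, struct range b)
{
	struct range r = { a.min - b.max, a.max - b.min };

	return r;
}

With two unknown bytes a = b = [0, 255], the old [min-min, max-max] rule
 gives [0, 0], i.e. "known to be 0"; range_sub() gives the true [-255, 255].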
> Could you also document all the changes that verifier will then start
> allowing for after the patch?
Maybe not the changes, because the old verifier had a lot of special
 cases, but I could, and probably should, document the new behaviour
 (maybe in Documentation/networking/filter.txt, that already has a bit
 of description of the verifier).
> [...]
>>   /* check whether memory at (regno + off) is accessible for t = (read | 
>> write)
>> @@ -899,52 +965,79 @@ static int check_mem_access(struct bpf_verifier_env 
>> *env, int insn_idx, u32 regn
>>   struct bpf_reg_state *reg = &state->regs[regno];
>>   int size, err = 0;
>>
>> -if (reg->type == PTR_TO_STACK)
>> -off += reg->imm;
>> -
>>   size = bpf_size_to_bytes(bpf_size);
>>   if (size < 0)
>>   return size;
>>
> [...]
>> -if (reg->type == PTR_TO_MAP_VALUE ||
>> -reg->type == PTR_TO_MAP_VALUE_ADJ) {
>> +/* for access checks, reg->off is just part of off */
>> +off += reg->off;
>
> Could you elaborate on why removing the reg->type == PTR_TO_STACK?
Previously bpf_reg_state had a member 'imm' which, for PTR_TO_STACK, was
 a fixed offset, so we had to add it in to the offset.  Now we instead
 have reg->off and it's generic to all pointerish types, so we don't need
 special handling of PTR_TO_STACK here.
> Also in context of below PTR_TO_CTX.
>
> [...]
>>   } else if (reg->type == PTR_TO_CTX) {
>> -enum bpf_reg_type reg_type = UNKNOWN_VALUE;
>> +enum bpf_reg_type reg_type = SCALAR_VALUE;
>>
>>   if (t == BPF_WRITE && value_regno >= 0 &&
>>   is_pointer_value(env, value_regno)) {
>>   verbose("R%d leaks addr into ctx\n", value_regno);
>>   return -EACCES;
>>   }
>> +/* ctx accesses must be at a fixed offset, so that we can
>> + * determine what type of data were returned.
>> + */
>> +if (!tnum_is_const(reg->var_off)) {
>> +char tn_buf[48];
>> +
>> +tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
>> +verbose("variable ctx access var_off=%s off=%d size=%d",
>> +tn_buf, off, size);
>> +return -EACCES;
>> +}
>> +off += reg->var_off.value;
>
> ... f.e. in PTR_TO_CTX case the only access that is currently
> allowed is LDX/STX with fixed offset from insn->off, which is
> passed as off param to check_mem_access(). Can you elaborate on
> off += reg->var_off.value? Meaning we make this more dynamic
> as long as access is known const?
So, I can't actually figure out how to construct a pointer with a known
 variable offset, but future changes to the verifier (like learning from
 comparing two pointers with the same base) could make it possible.  The
 situation we're handling here is where our register holds ctx + x,
 where x is also known to be some constant value k, and currently I don't
 know if that's possible except for the trivial case of k==0, and the edge
 case where k is too big to fit in the s32 reg->off (in which case the
 check_ctx_access will presumably reject it).
Stepping back a bit, each register holding a pointer type has two offsets,
 reg->off and reg->var_off, and the latter is a tnum representing
 knowledge about a value that's not necessarily exactly known.  But
 tnum_is_const checks that it _is_ exactly known.
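For reference, a tnum is just a (value; mask) pair; a rough userspace
 mirror of the shape of the kernel's struct tnum (simplified here):

#include <stdbool.h>
#include <stdint.h>

struct tnum {
	uint64_t value;	/* value of the known bits */
	uint64_t mask;	/* bits set here are unknown; value has 0 there */
};

static bool tnum_is_const(struct tnum a)
{
	return !a.mask;	/* no unknown bits: exactly one possible value */
}

/* e.g. var_off=(0x2; 0x7fc) from the logs above: bit 1 known set,
 * bits 2-10 unknown, all other bits known clear.
 */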
There is another case that we allow now through the reg->off handling:
 adding a constant to a pointer and then dereferencing it.
So, with r1=ctx, instead of r2 = *(r1 + off), you can write
r3 = r1 + off
r2 = *(r3 + 0)
 if for some reason that suits you better.  But in that case, because off
 is a known value (either an immediate, or a register whose value is
 exactly known), that value gets added to r3->off rather than r3->var_off;
 see adjust_ptr_min_max_vals().

Hope that's clear,
-Ed


Re: [PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-06-28 Thread Edward Cree
On 28/06/17 18:09, Daniel Borkmann wrote:
> Could you elaborate on this one? If I understand it correctly, then
> the scalar += pointer case would mean the following: given I have one
> of the allowed pointer types in adjust_ptr_min_max_vals() then the
> prior scalar type inherits the ptr type/id. I would then 'destroy' the
> pointer value so we get a -EACCES on it. We mark the tmp off_reg as
> scalar type, but shouldn't also actual dst_reg be marked as such
> like in below pointer += scalar case, such that we undo the prior
> ptr_type inheritance?
Good catch.  The intent was that adjust_ptr_min_max_vals() wouldn't mark
 dst_reg's type/id in the case when it returned -EACCES, but indeed there
 are some such paths, and rather than changing those it may be easier to
 change the type/id back to scalar/0.  I don't think we need to go as far
 as calling __mark_reg_unknown() on it though, its bounds and align
 shouldn't have been screwed up by adjust_ptr_min_max_vals().

-Ed


Re: [iovisor-dev] [PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-07-12 Thread Edward Cree
On 07/07/17 18:45, Nadav Amit wrote:
> For me changes such as:
>
>>  if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
>> -dst_reg->min_value -= min_val;
>> +dst_reg->min_value -= max_val;
>
> are purely cryptic. What happened here? Was there a bug before and if so
> what are its implications? Why can’t it be in a separate patch?
In this specific case, there was a bug before: if (say) src and dst were
 both unknown bytes (so range 0 to 255), it would compute the new min and max
 to be 0, so it would think the result is known to be 0.  But that's wrong,
 because it could be anything from -255 to +255.  The bug's implications are
 that it could be used to construct an out-of-range offset to (say) a map
 pointer which the verifier would think was in-range and thus accept.
It might be possible to put it in a separate patch, but in general (not
 necessarily the case here) isolated fixes to range handling break some of
 the existing regression tests.  That is why I ended up doing patch #4,
 because I couldn't find a small fix for the patch #1 test without breaking
 others.  Essentially, this patch started out as just the tnum tracking to
 replace imm and align, and then rolled in everything I had to do to get the
 regression tests to pass again.  So some of those things are fixes that
 could go in earlier patches, but because of the order I wrote it in I'd have
 to disentangle them.  I can do that if it's necessary, but I'm not sure it'd
 really make the patch that much more readable so I'd rather avoid that work
 if I can get away with it...
> I also think that changes such as:
>> -s64 min_value;
>> -u64 max_value;
> [snip]
>> +s64 min_value; /* minimum possible (s64)value */
>> +u64 max_value; /* maximum possible (u64)value */
> Should have been avoided. Personally, I find this comment redundant (to say
> the least). It does not help to reduce the diff size.
The comment is meaningful, though perhaps badly phrased.  It's an attempt to
 define the semantics of these fields (which previously was unclear); e.g. the
 first one means "minimum value when interpreted as signed", i.e. the (s64) in
 the comment is a cast.
Apparently those weren't the semantics the original author intended, but I'm
 not sure the original semantics were well-defined and I certainly don't
 understand them well enough to define them, hence why I defined my own here
 (and then redefined them in patch #4).
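A one-line illustration of what those casts mean (standalone, not
 verifier code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t bits = UINT64_MAX;	/* one 64-bit register pattern... */

	/* ...is -1 when read as signed, 2^64 - 1 when read as unsigned;
	 * min_value/max_value bound those two interpretations.
	 */
	printf("as s64: %lld\n", (long long)(int64_t)bits);
	printf("as u64: %llu\n", (unsigned long long)bits);
	return 0;
}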
> In this regard, I think that refactoring should have been done first and not
> together with the logic changes. As another example, changing UNKNOWN_VALUE
> to SCALAR_VALUE should have been a separate, easy to understand patch.
But SCALAR_VALUE is the union UNKNOWN_VALUE *or* CONST_IMM, and merging those
 together means all of the ripping-out of evaluate_reg_alu() and friends and
 thus depends on much of the rest of the patch.
>> The latter is also needed for correctness in computing reg->range;
>> if 'pkt_ptr + offset' could conceivably overflow, then the result
>> could be < pkt_end without being a valid pointer into the packet.
>> We thus rely on the assumption that the packet pointer will never be
>> within MAX_PACKET_OFF of the top of the address space.  (This
>> assumption is, again, carried over from the existing verifier.)
> I understand the limitations (I think). I agree that CONST being spillable
> is not directly related. As for the possible packet offsets/range:
> intentionally or not you do make some changes that push the 64k packet size
> limit even deeper into the code. While the packet size should be limited to
> avoid overflow, IIUC the requirement is that:
>
>   64 > log(n_insn) + log(MAX_PACKET_OFF) + 1
I don't think that's right, unless you also make each addition to a packet-
 pointer require a max_value <= MAX_PACKET_OFF.  It's also a very loose bound
 because it assumes every instruction is such an add.
I think it makes far more sense to do it the way I have done, where the bounds
 are tracked all the way through the arithmetic and then only checked against
 MAX_PACKET_OFF when doing the access (and when doing a test against a
 PTR_TO_PACKET_END, i.e. find_good_pkt_pointers(), though for some reason I
 only added that check in patch #4).
That way we can allow things like (for the sake of example) adding $BIG_NUMBER
 to a packet pointer and then subtracting it again.
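 e.g. an illustrative fragment in the style of the selftests above (the
 constant is arbitrary):

	BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),		/* r5 = pkt */
	BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 0x7fffffff),	/* way past MAX_PACKET_OFF... */
	BPF_ALU64_IMM(BPF_SUB, BPF_REG_5, 0x7fffffff),	/* ...but net offset is 0 */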
> Such an assertion may be staticly checked (using BUILD_BUG_ON), but I don’t
> think should propagate into the entire code, especially in a non consistent
> way. For example:
>
>> struct bpf_reg_state {
>>  enum bpf_reg_type type;
>>  union {
>> -/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE 
>> */
>> -s64 imm;
>> -
>> -/* valid when type == PTR_TO_PACKET* */
>> -struct {
>> -u16 off;
>> -u16 range;
>> -};
>> +/* valid when type == PTR_TO_PACKET */

Re: [PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-06-30 Thread Edward Cree
On 28/06/17 22:37, Alexei Starovoitov wrote:
> Increasing the limit is must have, since pruning suffered so much.
> Going from 53k to 76k is pretty substantial.
> What is the % increase for tests in selftests/ ?
When I tried to measure the test_verifier tests, they changed hardly at
 all, only a couple of percent iirc.  But that's with (a) only the
 accepted progs get measured, since rejected don't print the #insns line
 - and most of the tests in test_verifier are rejected; and (b) those
 test progs are pretty small, usually with only a couple of jumps and
 not much chance for pruning to occur.  So it's really not a great test
 case for pruning effectiveness.
I haven't measured the test_progs ones, because I *still* haven't gotten
 around to actually setting up a BPF toolchain (it doesn't help that I'm
 building everything on a test server that gets reimaged every night to
 run our nightly tests...).
> I think we need to pin point exactly the reason.
> Saying we just track more data is not enough.
> We've tried v2 set on our load balancer and also saw ~20% increase.
> I don't remember the absolute numbers.
> These jumps don't make me comfortable with these extra tracking.
> Can you try to roll back ptr&const and full negative/positive tracking
> and see whether it gets back to what we had before?
The ptr&const bit shouldn't be relevant unless your programs are actually
 doing that (i.e. ops on pointers other than +/-), which seems surprising.
 But if you really are, then it's not too hard to roll it back - just
 need to change how adjust_reg_min_max_vals() deals with EACCES.
For a version without full negative/positive tracking, just take the first
 3 patches; some of the selftests will fail but hopefully your progs will
 still be accepted.  If not, we can try jbacik's patch (off-list response
 to v2).  I will followup this email with a patch to apply on top of the
 first 3 that does that and rolls back ptr&const.
> If tnum is causing it that would be reasonable trade off to make,
> but if it's full neg/pos tracking that has no use today other than
> (the whole thing is cleaner) I would rather drop it then.
Well, the full neg/pos tracking was a result of needing to fix the bug I
 found with patch #1, and not wanting to confuse myself with the min/max
 range while doing my signed/unsigned tracking.  But if we can make it
 work with jbacik's approach of 'remember whether our current bounds came
 from a signed or an unsigned compare', then we can drop or delay the
 full neg/pos tracking unless/until pruning is sorted out.

-Ed


[TEST PATCH] bpf/verifier: roll back ptr&const handling, and fix signed bounds

2017-06-30 Thread Edward Cree
I'm far from sure that my adjust_reg_min_max_vals() code remains correct
 in the face of maybe-signed-maybe-not bounds, even with the checks I've
 added in.  So this probably has security leaks in it; but it should
 suffice for measuring the effect on pruning.

The treat_as_signed part is loosely based on a patch by Josef Bacik .

Build-tested only.  Applies on top of patches 1-3.

Signed-off-by: Edward Cree 
---
 include/linux/bpf_verifier.h |   5 +-
 kernel/bpf/verifier.c| 179 ++-
 2 files changed, 111 insertions(+), 73 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ca7e2ce..ad02a9d 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -51,8 +51,9 @@ struct bpf_reg_state {
 * These refer to the same value as var_off, not necessarily the actual
 * contents of the register.
 */
-   s64 min_value; /* minimum possible (s64)value */
-   u64 max_value; /* maximum possible (u64)value */
+   bool treat_as_signed; /* Do below limits come from a JSGT/JSGE? */
+   s64 min_value;
+   u64 max_value;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 82823f1..b59c09b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -474,6 +474,7 @@ static void __mark_reg_known_zero(struct bpf_reg_state *reg)
reg->var_off = tnum_const(0);
reg->min_value = 0;
reg->max_value = 0;
+   reg->treat_as_signed = false;
 }
 
 static void mark_reg_known_zero(struct bpf_reg_state *regs, u32 regno)
@@ -497,6 +498,7 @@ static void __mark_reg_unknown(struct bpf_reg_state *reg)
reg->var_off = tnum_unknown;
reg->min_value = BPF_REGISTER_MIN_RANGE;
reg->max_value = BPF_REGISTER_MAX_RANGE;
+   reg->treat_as_signed = false;
 }
 
 static void mark_reg_unknown(struct bpf_reg_state *regs, u32 regno)
@@ -1087,6 +1089,7 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
state->regs[value_regno].var_off.value |
state->regs[value_regno].var_off.mask,
BPF_REGISTER_MAX_RANGE);
+   state->regs[value_regno].treat_as_signed = false;
}
return err;
 }
@@ -1601,6 +1604,9 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
reg->min_value > BPF_REGISTER_MAX_RANGE)
reg->min_value = BPF_REGISTER_MIN_RANGE;
+   /* Unsigned cannot be < 0 */
+   if (!reg->treat_as_signed && reg->min_value < 0)
+   reg->min_value = 0;
 }
 
 static void coerce_reg_to_32(struct bpf_reg_state *reg)
@@ -1612,10 +1618,12 @@ static void coerce_reg_to_32(struct bpf_reg_state *reg)
reg->var_off = tnum_cast(reg->var_off, 4);
/* Did value become known?  Then update bounds */
if (tnum_is_const(reg->var_off)) {
-   if ((s64)reg->var_off.value > BPF_REGISTER_MIN_RANGE)
+   /* We're treating as unsigned, so min is always >= 0 */
+   if (reg->var_off.value < BPF_REGISTER_MAX_RANGE) {
reg->min_value = reg->var_off.value;
-   if (reg->var_off.value < BPF_REGISTER_MAX_RANGE)
reg->max_value = reg->var_off.value;
+   }
+   reg->treat_as_signed = false;
}
 }
 
@@ -1637,6 +1645,18 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
u8 opcode = BPF_OP(insn->code);
u32 dst = insn->dst_reg;
 
+   /* If we can cross the sign boundary, don't trust our bounds.
+* Normally the program should have given us both upper and lower
+* bounds (e.g. two signed checks or one unsigned upper bound), in
+* which case this won't fire.
+*/
+   if (off_reg->treat_as_signed) {
+   if (min_val == BPF_REGISTER_MIN_RANGE)
+   max_val = BPF_REGISTER_MAX_RANGE;
+   } else {
+   if (max_val == BPF_REGISTER_MAX_RANGE)
+   min_val = BPF_REGISTER_MIN_RANGE;
+   }
	dst_reg = &regs[dst];
 
if (WARN_ON_ONCE(known && (min_val != max_val))) {
@@ -1677,6 +1697,11 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
 */
dst_reg->type = ptr_reg->type;
dst_reg->id = ptr_reg->id;
+   /* We also inherit the signedness of our offset.
+* XXX If we already had an offset of a different signedness, this
+* might lead to trouble!
+*/
+   dst_reg->treat_as_signed = off_reg->treat_as_signed;
 
switch (opcode) {
case BPF_ADD:
