[PATCH bpf-next v8 00/10] bpf: add bpf_get_stack helper

2018-04-28 Thread Yonghong Song
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from user perspective.

This patch set implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through a
shared map or bpf_perf_event_output.
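
For a digest illustration, a minimal sketch of a tracing program using
the new helper (names are illustrative; the declarations are assumed to
come from the selftests' bpf_helpers.h):

	/* illustrative sketch only -- not part of this series */
	SEC("kprobe/sys_write")
	int get_stack_example(struct pt_regs *ctx)
	{
		__u64 stack[32];	/* 256 bytes, within the bpf stack limit */
		int len;

		/* fetch up to 32 kernel frames into stack[] */
		len = bpf_get_stack(ctx, stack, sizeof(stack), 0);
		if (len < 0)
			return 0;	/* negative error, e.g. -EFAULT */

		/* len bytes of instruction pointers are now visible for
		 * in-kernel processing, or can be forwarded to user space
		 * with bpf_perf_event_output()
		 */
		return 0;
	}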

Patches #1 and #2 implement the core kernel support.
Patch #4 removes two never-hit branches in the verifier.
Patches #3 and #5 are two verifier improvements that make
bpf programming easier. Patch #6 syncs the new helper
to tools headers. Patch #7 moves perf_event polling code
and ksym lookup code from samples/bpf to
tools/testing/selftests/bpf. Patch #8 adds a verifier
test in tools/bpf for the new verifier change.
Patches #9 and #10 add tests for raw tracepoint progs
and tracepoint progs respectively.

Changelogs:
  v7 -> v8:
. rebase on top of latest bpf-next
. simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
. rewrite the description of bpf_get_stack() in uapi bpf.h
  based on new format.
  v6 -> v7:
. do perf callchain buffer allocation inside the
  verifier. so if the prog->has_callchain_buf is set,
  it is guaranteed that the buffer has been allocated.
. change condition "trace_nr <= skip" to "trace_nr < skip"
  so that for zero size buffer, return 0 instead of -EFAULT
  v5 -> v6:
. after refining return register smax_value and umax_value
  for helpers bpf_get_stack and bpf_probe_read_str,
  bounds and var_off of the return register are further refined.
. added missing commit message for tools header sync commit.
. removed one unnecessary empty line.
  v4 -> v5:
. relied on dst_reg->var_off to refine umin_val/umax_val
  in verifier handling BPF_ARSH value range tracking,
  suggested by Edward.
  v3 -> v4:
. fixed a bug when meta ptr is set to NULL in check_func_arg.
. introduced tnum_arshift and added detailed comments for
  the underlying implementation
. avoided using VLA in tools/bpf test_progs.
  v2 -> v3:
. used meta to track helper memory size argument
. implemented range checking for ARSH in verifier
. moved perf event polling and ksym related functions
  from samples/bpf to tools/bpf
. added test to compare build id's between bpf_get_stackid
  and bpf_get_stack
  v1 -> v2:
. fixed compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (10):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose trace functions to selftests
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
  tools/bpf: add a test for bpf_get_stack with tracepoint prog

 include/linux/bpf.h|   1 +
 include/linux/filter.h |   3 +-
 include/linux/tnum.h   |   4 +-
 include/uapi/linux/bpf.h   |  42 -
 kernel/bpf/core.c  |   5 +
 kernel/bpf/stackmap.c  |  80 -
 kernel/bpf/tnum.c  |  10 ++
 kernel/bpf/verifier.c  |  80 -
 kernel/trace/bpf_trace.c   |  50 +-
 samples/bpf/Makefile   |  11 +-
 samples/bpf/bpf_load.c |  63 ---
 samples/bpf/bpf_load.h |   7 -
 samples/bpf/offwaketime_user.c |   1 +
 samples/bpf/sampleip_user.c|   1 +
 samples/bpf/spintest_user.c|   1 +
 samples/bpf/trace_event_user.c |   1 +
 samples/bpf/trace_output_user.c| 125 ++
 tools/include/uapi/linux/bpf.h |  42 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 +++
 tools/testing/selftests/bpf/test_progs.c   | 192 -
 .../selftests/bpf/test_stacktrace_build_id.c   |  20 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  |  19 +-
 tools/testing/selftests/bpf/test_verifier.c|  45 +
 tools/testing/selftests/bpf/trace_helpers.c| 186 +++

[PATCH bpf-next v8 02/10] bpf: add bpf_get_stack helper

2018-04-28 Thread Yonghong Song
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through a
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 42 --
 kernel/bpf/core.c|  5 
 kernel/bpf/stackmap.c| 67 
 kernel/bpf/verifier.c| 19 ++
 kernel/trace/bpf_trace.c | 50 +++-
 7 files changed, 183 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..c553f6f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -692,6 +692,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..64899c0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -468,7 +468,8 @@ struct bpf_prog {
			dst_needed:1,   /* Do we need dst entry? */
			blinded:1,  /* Was blinded */
			is_func:1,  /* program is a bpf function */
-			kprobe_override:1; /* Do we override a kprobe? */
+			kprobe_override:1, /* Do we override a kprobe? */
+			has_callchain_buf:1; /* callchain buffer allocated? */
enum bpf_prog_type  type;   /* Type of BPF program */
enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..1afb606 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1767,6 +1767,40 @@ union bpf_attr {
  * **CONFIG_XFRM** configuration option.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
+ * Description
+ * Return a user or a kernel stack in bpf program provided buffer.
+ * To achieve this, the helper needs *regs*, which is a pointer
+ * to the context on which the tracing program is executed.
+ * To store the stacktrace, the bpf program provides *buf* with
+ * a nonnegative *size*.
+ *
+ * The last argument, *flags*, holds the number of stack frames to
+ * skip (from 0 to 255), masked with
+ * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * the following flags:
+ *
+ * **BPF_F_USER_STACK**
+ * Collect a user space stack instead of a kernel stack.
+ * **BPF_F_USER_BUILD_ID**
+ * Collect buildid+offset instead of ips for user stack,
+ * only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ * **bpf_get_stack**\ () can collect up to
+ * **PERF_MAX_STACK_DEPTH** kernel and user frames, subject
+ * to a sufficiently large buffer size. Note that
+ * this limit can be controlled with the **sysctl** program, and
+ * that it should be manually increased in order to profile long
+ * user stacks (such as stacks for Java programs). To do so, use:
+ *
+ * ::
+ *
+ * # sysctl kernel.perf_event_max_stack=
+ *
+ * Return
+ * A non-negative value equal to or less than *size* on success, or
+ * a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1835,7 +1869,8 @@ union bpf_attr {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1869,11 +1904,14 @@ enum bpf_func_id
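
[Digest note: a hedged sketch of how the *flags* argument documented
above is composed; the frame count is illustrative.]

	/* skip the two innermost frames and collect a user stack */
	u64 flags = (2 & BPF_F_SKIP_FIELD_MASK) | BPF_F_USER_STACK;
	int len = bpf_get_stack(ctx, buf, sizeof(buf), flags);

	/* for buildid+offset records instead of raw ips, the buffer
	 * must hold struct bpf_stack_build_id entries
	 */
	u64 bid_flags = BPF_F_USER_STACK | BPF_F_USER_BUILD_ID;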

[PATCH bpf-next v8 04/10] bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals

2018-04-28 Thread Yonghong Song
In the verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
the function returns early. The src_known == false branches
in the BPF_LSH/BPF_RSH handling are therefore never hit; remove them.
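
For context, the early return referred to above sits near the top of
adjust_scalar_min_max_vals() and looks roughly like this (a paraphrase,
not part of the diff below):

	if (!src_known &&
	    opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
		/* BPF_LSH/BPF_RSH with an unknown src register bail out
		 * here, so the src_known == false branches further down
		 * are dead code.
		 */
		__mark_reg_unknown(dst_reg);
		return 0;
	}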

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 988400e..6e3f859 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2940,10 +2940,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
		dst_reg->umin_value <<= umin_val;
		dst_reg->umax_value <<= umax_val;
	}
-	if (src_known)
-		dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val);
-	else
-		dst_reg->var_off = tnum_lshift(tnum_unknown, umin_val);
+	dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val);
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
@@ -2971,11 +2968,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
	 */
	dst_reg->smin_value = S64_MIN;
	dst_reg->smax_value = S64_MAX;
-	if (src_known)
-		dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val);
-	else
-		dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
+	dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val);
dst_reg->umin_value >>= umax_val;
dst_reg->umax_value >>= umin_val;
/* We may learn something more from the var_off */
-- 
2.9.5



[PATCH bpf-next v8 06/10] tools/bpf: add bpf_get_stack helper to tools headers

2018-04-28 Thread Yonghong Song
The tools header file bpf.h is synced with kernel uapi bpf.h.
The new helper is also added to bpf_helpers.h.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 42 +--
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da77a93..1afb606 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1767,6 +1767,40 @@ union bpf_attr {
  * **CONFIG_XFRM** configuration option.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
+ * Description
+ * Return a user or a kernel stack in bpf program provided buffer.
+ * To achieve this, the helper needs *regs*, which is a pointer
+ * to the context on which the tracing program is executed.
+ * To store the stacktrace, the bpf program provides *buf* with
+ * a nonnegative *size*.
+ *
+ * The last argument, *flags*, holds the number of stack frames to
+ * skip (from 0 to 255), masked with
+ * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * the following flags:
+ *
+ * **BPF_F_USER_STACK**
+ * Collect a user space stack instead of a kernel stack.
+ * **BPF_F_USER_BUILD_ID**
+ * Collect buildid+offset instead of ips for user stack,
+ * only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ * **bpf_get_stack**\ () can collect up to
+ * **PERF_MAX_STACK_DEPTH** kernel and user frames, subject
+ * to a sufficiently large buffer size. Note that
+ * this limit can be controlled with the **sysctl** program, and
+ * that it should be manually increased in order to profile long
+ * user stacks (such as stacks for Java programs). To do so, use:
+ *
+ * ::
+ *
+ * # sysctl kernel.perf_event_max_stack=
+ *
+ * Return
+ * A non-negative value equal to or less than *size* on success, or
+ * a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1835,7 +1869,8 @@ union bpf_attr {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1869,11 +1904,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 69d7b91..265f8e0 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -101,6 +101,8 @@ static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
 static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
 int size, int flags) =
(void *) BPF_FUNC_skb_get_xfrm_state;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+   (void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



[PATCH bpf-next v8 08/10] tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH

2018-04-28 Thread Yonghong Song
The test_verifier already has a few ARSH test cases.
This patch adds a new test case which takes advantage of newly
improved verifier behavior for bpf_get_stack and ARSH.
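
Roughly, the instruction sequence in the test below encodes this C
pattern (a paraphrase; variable names are illustrative):

	/* buf points at a map value of sizeof(struct test_val) bytes */
	usize = bpf_get_stack(ctx, buf, sizeof(struct test_val),
			      BPF_F_USER_STACK);	/* flags = 256 */
	if (usize <= 0)
		return 0;	/* the LSH/ARSH pair sign-extends usize first */

	/* the verifier can now bound both the offset and the size */
	ksize = bpf_get_stack(ctx, buf + usize,
			      sizeof(struct test_val) - usize, 0);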

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_verifier.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 165e9dd..1acafe26 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11680,6 +11680,51 @@ static struct bpf_test tests[] = {
.errstr = "BPF_XADD stores into R2 packet",
.prog_type = BPF_PROG_TYPE_XDP,
},
+   {
+   "bpf_get_stack return R0 within range",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 28),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+   BPF_MOV64_IMM(BPF_REG_9, sizeof(struct test_val)),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_3, sizeof(struct test_val)),
+   BPF_MOV64_IMM(BPF_REG_4, 256),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
+   BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_8, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_8, 32),
+   BPF_JMP_REG(BPF_JSLT, BPF_REG_1, BPF_REG_8, 16),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_9),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 32),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_1),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_5, sizeof(struct test_val)),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_5),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_1, 4),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_9),
+   BPF_MOV64_IMM(BPF_REG_4, 0),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map2 = { 4 },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5



[PATCH bpf-next v8 01/10] bpf: change prototype for stack_map_get_build_id_offset

2018-04-28 Thread Yonghong Song
This patch does not change any functionality. The function prototype
is changed so that the same function can be reused later.

Signed-off-by: Yonghong Song 
---
 kernel/bpf/stackmap.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
- struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
  u64 *ips, u32 trace_nr, bool user)
 {
int i;
struct vm_area_struct *vma;
-   struct bpf_stack_build_id *id_offs;
-
-   bucket->nr = trace_nr;
-   id_offs = (struct bpf_stack_build_id *)bucket->data;
 
/*
 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
-   stack_map_get_build_id_offset(map, new_bucket, ips,
- trace_nr, user);
+   new_bucket->nr = trace_nr;
+   stack_map_get_build_id_offset(
+   (struct bpf_stack_build_id *)new_bucket->data,
+   ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5



[PATCH bpf-next v8 05/10] bpf/verifier: improve register value range tracking with ARSH

2018-04-28 Thread Yonghong Song
When a helper like bpf_get_stack returns an int value
that is later used for arithmetic computation, LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0)
With this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With the better range of "R8", later on when "R8" is added to another
register, e.g., a map pointer or a scalar-value register, a better
register range can be derived and a verifier failure may be avoided.

In a later example,
..
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
..
Without improved ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and the
program will be rejected by the verifier.
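
As a worked check of the new tnum_arshift() against the log above
(a sketch using the kernel's TNUM() notation):

	/* after "r8 <<= 32": var_off = (0x0; 0x3ff00000000) */
	struct tnum a = TNUM(0x0, 0x3ff00000000ULL);

	/* "r8 s>>= 32" -> tnum_arshift(a, 32):
	 *   value = (s64)0x0           >> 32 = 0x0
	 *   mask  = (s64)0x3ff00000000 >> 32 = 0x3ff
	 * giving var_off = (0x0; 0x3ff), exactly the refined range
	 * the verifier now prints at insn 57.
	 */
	struct tnum b = tnum_arshift(a, 32);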

Signed-off-by: Yonghong Song 
---
 include/linux/tnum.h  |  4 +++-
 kernel/bpf/tnum.c | 10 ++
 kernel/bpf/verifier.c | 23 +++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/tnum.h b/include/linux/tnum.h
index 0d2d3da..c7dc2b5 100644
--- a/include/linux/tnum.h
+++ b/include/linux/tnum.h
@@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max);
 /* Arithmetic and logical ops */
 /* Shift a tnum left (by a fixed shift) */
 struct tnum tnum_lshift(struct tnum a, u8 shift);
-/* Shift a tnum right (by a fixed shift) */
+/* Shift (rsh) a tnum right (by a fixed shift) */
 struct tnum tnum_rshift(struct tnum a, u8 shift);
+/* Shift (arsh) a tnum right (by a fixed min_shift) */
+struct tnum tnum_arshift(struct tnum a, u8 min_shift);
 /* Add two tnums, return @a + @b */
 struct tnum tnum_add(struct tnum a, struct tnum b);
 /* Subtract two tnums, return @a - @b */
diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
index 1f4bf68..938d412 100644
--- a/kernel/bpf/tnum.c
+++ b/kernel/bpf/tnum.c
@@ -43,6 +43,16 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
return TNUM(a.value >> shift, a.mask >> shift);
 }
 
+struct tnum tnum_arshift(struct tnum a, u8 min_shift)
+{
+   /* if a.value is negative, arithmetic shifting by minimum shift
+* will have larger negative offset compared to more shifting.
+* If a.value is nonnegative, arithmetic shifting by minimum shift
+* will have larger positive offset compared to more shifting.
+*/
+   return TNUM((s64)a.value >> min_shift, (s64)a.mask >> min_shift);
+}
+
 struct tnum tnum_add(struct tnum a, struct tnum b)
 {
u64 sm, sv, sigma, chi, mu;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6e3f859..712d865 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2974,6 +2974,29 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
+   case BPF_ARSH:
+   if (umax_val >= insn_bitness) {
+   /* Shifts greater than 31 or 63 are undefined.
+* This includes shifts by a negative number.
+*/
+   mark_reg_unknown(env, regs, insn->dst_reg);
+   break;
+   }
+
+   /* Upon reaching here, src_known is true and
+* umax_val is equal to umin_val.
+*/
+   dst_reg->smin_value >>= umin_val;
+   dst_reg->smax_value >>= umin_val;
+   dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
+
+   /* blow away the dst_reg umin_value/umax_value and rely on
+* dst_reg var_off to refine the result.
+*/
+   dst_reg->umin_value = 0;
+   dst_reg->umax_value = U64_MAX;
+   __update_reg_bounds(dst_reg);
+   break;
default:
mark_reg_unknown(env, regs, insn->dst_reg);
break;
-- 
2.9.5



[PATCH bpf-next v8 07/10] samples/bpf: move common-purpose trace functions to selftests

2018-04-28 Thread Yonghong Song
There is no functionality change in this patch. The common-purpose
trace functions, including perf_event polling and ksym lookup,
are moved from trace_output_user.c and bpf_load.c to
selftests/bpf/trace_helpers.c so that these functions can
be reused later in selftests.
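
A hedged sketch of how a selftest can consume the moved helpers
(the kernel address is hypothetical):

	#include <stdio.h>
	#include "trace_helpers.h"

	int main(void)
	{
		long addr = 0xffffffff81000000L;	/* hypothetical ip */
		struct ksym *sym;

		if (load_kallsyms())	/* parses /proc/kallsyms once */
			return 1;

		sym = ksym_search(addr);	/* nearest symbol <= addr */
		printf("%lx -> %s\n", addr, sym->name);
		return 0;
	}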

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile|  11 +-
 samples/bpf/bpf_load.c  |  63 --
 samples/bpf/bpf_load.h  |   7 --
 samples/bpf/offwaketime_user.c  |   1 +
 samples/bpf/sampleip_user.c |   1 +
 samples/bpf/spintest_user.c |   1 +
 samples/bpf/trace_event_user.c  |   1 +
 samples/bpf/trace_output_user.c | 125 +++
 tools/testing/selftests/bpf/trace_helpers.c | 186 
 tools/testing/selftests/bpf/trace_helpers.h |  24 
 10 files changed, 238 insertions(+), 182 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b853581..5e31770 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -49,6 +49,7 @@ hostprogs-y += xdp_adjust_tail
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
+TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 sock_example-objs := sock_example.o $(LIBBPF)
@@ -65,10 +66,10 @@ tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
 tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
 map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
 test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
 test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
@@ -82,8 +83,8 @@ xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
 test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
   test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index feca497..a27ef3c 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -648,66 +648,3 @@ void read_trace_pipe(void)
}
}
 }
-
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
-
-static int ksym_cmp(const void *p1, const void *p2)
-{
-   return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
-}
-
-int load_kallsyms(void)
-{
-   FILE *f = fopen("/proc/kallsyms", "r");
-   char func[256], buf[256];
-   char symbol;
-   void *addr;
-   int i = 0;
-
-   if (!f)
-   return -ENOENT;
-
-   while (!feof(f)) {
-   if (!fgets(buf, sizeof(buf), f))
-   break;
-   if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
-   break;
-   if (!addr)
-   continue;
-   syms[i].addr = (long) addr;
-   syms[i].name = strdup(func);
-   i++;
-   }
-   sym_cnt = i;
-   qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
-   return 0;
-}
-
-struct ksym *ksym_search(long key)
-{
-   int start = 0, end = sym_cnt;
-   int result;
-
-   while (start < end) {
-   size_t mid = start + (end - start) / 2;
-
-   result = key - syms[mid].addr;
-   if (result < 0)
-   end = mid;
-   else if (result > 0)
-   start = mid + 1;
-   else
-   return &syms[mid];
-   }
-
-   if (start >= 1 && syms[start - 1].addr < key &&
-   key < syms[st

[PATCH bpf-next v8 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Yonghong Song
The test attaches a raw_tracepoint program to sched/sched_switch.
It tests getting stacks for user space, kernel space, and user
space with a build_id request. It also tests getting user
and kernel stacks into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application checks that the kernel function for raw_tracepoint,
___bpf_prog_run, is part of the stack.
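
A sketch of the user-space check described above (paraphrased; the
variable names are illustrative):

	/* scan the kernel ips reported by the bpf program for the
	 * interpreter frame that runs the raw_tracepoint program
	 */
	int found = 0, i;

	for (i = 0; i < num_kern_ips; i++) {
		struct ksym *ks = ksym_search(kern_stack[i]);

		if (ks && strcmp(ks->name, "___bpf_prog_run") == 0) {
			found = 1;
			break;
		}
	}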

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 +
 tools/testing/selftests/bpf/test_progs.c   | 122 +
 3 files changed, 227 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b64a7a3..9d76218 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
-	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
+	test_get_stack_rawtp.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -58,6 +59,7 @@ $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
+$(OUTPUT)/test_progs: trace_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
new file mode 100644
index 0000000..ba1dcf9
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK_RAWTP 100
+struct stack_trace_t {
+   int pid;
+   int kern_stack_size;
+   int user_stack_size;
+   int user_stack_buildid_size;
+   __u64 kern_stack[MAX_STACK_RAWTP];
+   __u64 user_stack[MAX_STACK_RAWTP];
+   struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(__u32),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(struct stack_trace_t),
+   .max_entries = 1,
+};
+
+/* Allocate per-cpu space twice what is needed. For the code below
+ *   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ *   if (usize < 0)
+ * return 0;
+ *   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+ *
+ * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64), the
+ * verifier will complain that access "raw_data + usize"
+ * with size "max_len - usize" may be out of bound.
+ * The maximum "raw_data + usize" is "raw_data + max_len"
+ * and the maximum "max_len - usize" is "max_len", so the verifier
+ * concludes that the maximum buffer access range is
+ * "raw_data[0...max_len * 2 - 1]" and hence rejects the program.
+ *
+ * Doubling the to-be-used max buffer size can fix this verifier
+ * issue and avoid complicated C program massaging.
+ * This is an acceptable workaround since there is only one entry here.
+ */
+struct bpf_map_def SEC("maps") rawdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
+   .max_entries = 1,
+};
+
+SEC("tracepoint/sched/sched_switch")
+int bpf_prog1(void *ctx)
+{
+   int max_len, max_buildid_len, usize, ksize, total_size;
+   struct stack_trace_t *data;
+   void *raw_data;
+   __u32 key = 0;
+
+   data = bpf_map_lookup_elem(&stackdata_map, &key);
+   if (!data)
+   return 0;
+
+   max_len = MAX_STACK_RAWTP * sizeof(__u64);
+   max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
+   data->pid = bpf_get_current_pid_tgid();
+   data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+   data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+   BPF_F_USER_STACK);
+   data->user_stack_buildid_size = bpf_get_stack(
+   ctx, data->user_stack_buildid, max_buildid_len,
+   BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+  

[PATCH bpf-next v8 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog

2018-04-28 Thread Yonghong Song
The test_stacktrace_map and test_stacktrace_build_id tests are
enhanced to call bpf_get_stack in the bpf program to get the
stack trace as well.  The stack traces from bpf_get_stack
and bpf_get_stackid are compared to ensure that, for the
same stack (as represented by the same hash), their ip addresses
or build ids are the same.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_progs.c   | 70 --
 .../selftests/bpf/test_stacktrace_build_id.c   | 20 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  | 19 +-
 3 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index c148a55..664db67 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -897,11 +897,47 @@ static int compare_map_keys(int map1_fd, int map2_fd)
return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len)
+{
+   __u32 key, next_key, *cur_key_p, *next_key_p;
+   char *val_buf1, *val_buf2;
+   int i, err = 0;
+
+   val_buf1 = malloc(stack_trace_len);
+   val_buf2 = malloc(stack_trace_len);
+   cur_key_p = NULL;
+   next_key_p = &key;
+   while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+   err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+   if (err)
+   goto out;
+   err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+   if (err)
+   goto out;
+   for (i = 0; i < stack_trace_len; i++) {
+   if (val_buf1[i] != val_buf2[i]) {
+   err = -1;
+   goto out;
+   }
+   }
+   key = *next_key_p;
+   cur_key_p = &key;
+   next_key_p = &next_key;
+   }
+   if (errno != ENOENT)
+   err = -1;
+
+out:
+   free(val_buf1);
+   free(val_buf2);
+   return err;
+}
+
 static void test_stacktrace_map()
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_map.o";
-   int bytes, efd, err, pmu_fd, prog_fd;
+   int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
struct perf_event_attr attr = {};
__u32 key, val, duration = 0;
struct bpf_object *obj;
@@ -957,6 +993,10 @@ static void test_stacktrace_map()
if (stackmap_fd < 0)
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (stack_amap_fd < 0)
+   goto disable_pmu;
+
/* give some time for bpf program run */
sleep(1);
 
@@ -978,6 +1018,12 @@ static void test_stacktrace_map()
  "err %d errno %d\n", err, errno))
goto disable_pmu_noerr;
 
+   stack_trace_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+   err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
+   if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu_noerr;
+
goto disable_pmu_noerr;
 disable_pmu:
error_cnt++;
@@ -1071,9 +1117,9 @@ static int extract_build_id(char *build_id, size_t size)
 
 static void test_stacktrace_build_id(void)
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_build_id.o";
-   int bytes, efd, err, pmu_fd, prog_fd;
+   int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
struct perf_event_attr attr = {};
__u32 key, previous_key, val, duration = 0;
struct bpf_object *obj;
@@ -1138,6 +1184,11 @@ static void test_stacktrace_build_id(void)
  err, errno))
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
   == 0);
assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> 
/dev/null") == 0);
@@ -1189,8 +1240,15 @@ static void test_stacktrace_build_id(void)
previous_key = key;
} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
 
-   CHECK(build_id_matches < 1, "build id match",
- "Didn't find expected build ID from the map");
+   if (CHECK(build_id_matches < 1, "build id match",
+ "Didn't find expected build ID from the map"))
+   goto disable_pmu;
+
+ 

[PATCH bpf-next v8 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper

2018-04-28 Thread Yonghong Song
The special property of the return values for helpers bpf_get_stack
and bpf_probe_read_str is captured in the verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid conditions such as "retval > bufsize"
in the bpf program. For example, for the code below,
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0 || usize > max_len)
return 0;
The verifier may have the following errors:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (bf) r2 = r1
57: (77) r2 >>= 32
58: (25) if r2 > 0x31f goto pc+33
 R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
 umax_value=18446744069414584320,
 var_off=(0x0; 0xffffffff00000000))
 R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0) R9=inv800 R10=fp0,call_-1
59: (1f) r9 -= r8
60: (c7) r1 s>>= 32
61: (bf) r2 = r7
62: (0f) r2 += r1
math between map_value pointer and register with unbounded
min value is not allowed
The failure is due to an llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for the condition while later on "r1"
is used for the map_ptr operation. The verifier is not able to track such
an instruction sequence effectively.

Without the "usize > max_len" condition, there is no such llvm
optimization and the generated code below passes the verifier:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (b7) r1 = 0
54: (bf) r8 = r0
55: (67) r8 <<= 32
56: (c7) r8 s>>= 32
57: (6d) if r1 s> r8 goto pc+24
 R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
 R1=inv0 R6=ctx(id=0,off=0,imm=0)
 R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
 R10=fp0,call_-1
58: (bf) r2 = r7
59: (0f) r2 += r8
60: (1f) r9 -= r8
61: (bf) r1 = r6

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 253f6bd..988400e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -165,6 +165,8 @@ struct bpf_call_arg_meta {
bool pkt_access;
int regno;
int access_size;
+   s64 msize_smax_value;
+   u64 msize_umax_value;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -1985,6 +1987,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
} else if (arg_type_is_mem_size(arg_type)) {
bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
 
+   /* remember the mem_size which may be used later
+* to refine return values.
+*/
+   meta->msize_smax_value = reg->smax_value;
+   meta->msize_umax_value = reg->umax_value;
+
/* The register is SCALAR_VALUE; the access check
 * happens using its boundaries.
 */
@@ -2324,6 +2332,23 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
return 0;
 }
 
+static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
+  int func_id,
+  struct bpf_call_arg_meta *meta)
+{
+   struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
+
+   if (ret_type != RET_INTEGER ||
+   (func_id != BPF_FUNC_get_stack &&
+func_id != BPF_FUNC_probe_read_str))
+   return;
+
+   ret_reg->smax_value = meta->msize_smax_value;
+   ret_reg->umax_value = meta->msize_umax_value;
+   __reg_deduce_bounds(ret_reg);
+   __reg_bound_offset(ret_reg);
+}
+
 static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
 {
const struct bpf_func_proto *fn = NULL;
@@ -2447,6 +2472,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
return -EINVAL;
}
 
+   do_refine_retval_range(regs, fn->ret_type, func_id, &meta);
+
err = check_map_func_compatibility(env, meta.map_ptr, func_id);
if (err)
return err;
-- 
2.9.5



[PATCH bpf-next] bpf: Allow bpf_current_task_under_cgroup in interrupt

2018-04-28 Thread Teng Qin
Currently, the bpf_current_task_under_cgroup helper has a check where,
if the BPF program is running in_interrupt(), it returns -EINVAL. This
prevents the helper from being used in many useful scenarios,
particularly BPF programs attached to Perf Events.

This commit removes the check. Tested in a few NMI (Perf Event) and
softirq contexts; the helper returns the correct result.
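
For illustration, a hedged sketch of the kind of perf_event program
this unblocks (map and section names are illustrative):

	struct bpf_map_def SEC("maps") cgrp_arr = {
		.type = BPF_MAP_TYPE_CGROUP_ARRAY,
		.key_size = sizeof(__u32),
		.value_size = sizeof(__u32),
		.max_entries = 1,
	};

	SEC("perf_event")
	int on_sample(struct bpf_perf_event_data *ctx)
	{
		/* runs in NMI context; returns 1 when the current task is
		 * under the cgroup at index 0 -- previously -EINVAL here
		 */
		if (bpf_current_task_under_cgroup(&cgrp_arr, 0) != 1)
			return 0;

		/* ... sample only tasks in the target cgroup ... */
		return 0;
	}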
---
 kernel/trace/bpf_trace.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 56ba0f2..f94890c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -474,8 +474,6 @@ BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx)
struct bpf_array *array = container_of(map, struct bpf_array, map);
struct cgroup *cgrp;
 
-   if (unlikely(in_interrupt()))
-   return -EINVAL;
if (unlikely(idx >= array->map.max_entries))
return -E2BIG;
 
-- 
2.9.5



[PATCH net-next] can: dev: use skb_put_zero to simplify code

2018-04-28 Thread YueHaibing
Use the helper skb_put_zero() to replace the pattern of skb_put() plus memset().
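
For reference, skb_put_zero() behaves like the open-coded pair it
replaces (a sketch of the include/linux/skbuff.h helper):

	static inline void *skb_put_zero(struct sk_buff *skb, unsigned int len)
	{
		void *tmp = skb_put(skb, len);	/* reserve len bytes at the tail */

		memset(tmp, 0, len);		/* and zero them */
		return tmp;
	}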

Signed-off-by: YueHaibing 
---
 drivers/net/can/dev.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index b177956..d8140a9 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -649,8 +649,7 @@ struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf)
can_skb_prv(skb)->ifindex = dev->ifindex;
can_skb_prv(skb)->skbcnt = 0;
 
-   *cf = skb_put(skb, sizeof(struct can_frame));
-   memset(*cf, 0, sizeof(struct can_frame));
+   *cf = skb_put_zero(skb, sizeof(struct can_frame));
 
return skb;
 }
@@ -678,8 +677,7 @@ struct sk_buff *alloc_canfd_skb(struct net_device *dev,
can_skb_prv(skb)->ifindex = dev->ifindex;
can_skb_prv(skb)->skbcnt = 0;
 
-   *cfd = skb_put(skb, sizeof(struct canfd_frame));
-   memset(*cfd, 0, sizeof(struct canfd_frame));
+   *cfd = skb_put_zero(skb, sizeof(struct canfd_frame));
 
return skb;
 }
-- 
2.7.0




Re: [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:57PM CEST, sridhar.samudr...@intel.com wrote:
>This feature bit can be used by hypervisor to indicate virtio_net device to
>act as a standby for another device with the same MAC address.
>
>VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
>
>Signed-off-by: Sridhar Samudrala 
>---
> drivers/net/virtio_net.c| 2 +-
> include/uapi/linux/virtio_net.h | 3 +++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index 3b5991734118..51a085b1a242 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
>   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
>   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>-  VIRTIO_NET_F_SPEED_DUPLEX
>+  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY


This is not part of current qemu master (head
6f0c4706b35dead265509115ddbd2a8d1af516c1).
Where can I find the qemu code?

Also, I think it makes sense to push HW (qemu HW in this case) first
and only then the driver.



> 
> static unsigned int features[] = {
>   VIRTNET_FEATURES,
>diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>index 5de6ed37695b..a3715a3224c1 100644
>--- a/include/uapi/linux/virtio_net.h
>+++ b/include/uapi/linux/virtio_net.h
>@@ -57,6 +57,9 @@
>* Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> 
>+#define VIRTIO_NET_F_STANDBY62/* Act as standby for another device
>+   * with the same MAC.
>+   */
> #define VIRTIO_NET_F_SPEED_DUPLEX 63  /* Device set linkspeed and duplex */
> 
> #ifndef VIRTIO_NET_NO_LEGACY
>-- 
>2.14.3
>


net: smsc95xx: alignment issues

2018-04-28 Thread Stefan Wahren
Hi,
after connecting a Raspberry Pi 1 B to my local network, I'm seeing alignment
issues under /proc/cpu/alignment:

User:   0
System: 142 (_decode_session4+0x12c/0x3c8)
Skipped:0
Half:   0
Word:   0
DWord:  127
Multi:  15
User faults:2 (fixup)

I've also seen outputs with _csum_ipv6_magic.

Kernel config: bcm2835_defconfig
Reproducible kernel trees: current linux-next, 4.17-rc2 and 4.14.37 (I didn't
test older versions)

Please tell if you need more information to narrow down this issue.

Best regards
Stefan


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>net_failover_create()
>net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via

'standby' and 'primary' netdevs are not associated with the same 'pci'
device.
"Primary" is the VF netdevice and "standby" is virtio_net. Each
associated with different pci device.


>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>net_failover_register()
>net_failover_unregister()

[...]


Re: [PATCH net-next v8 2/4] net: Introduce generic failover module

2018-04-28 Thread Dan Carpenter
Hi Sridhar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
url:
https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180427-183842

smatch warnings:
net/core/net_failover.c:229 net_failover_change_mtu() error: we previously assumed 'primary_dev' could be null (see line 219)
net/core/net_failover.c:279 net_failover_vlan_rx_add_vid() error: we previously assumed 'primary_dev' could be null (see line 269)

# https://github.com/0day-ci/linux/commit/5a5f2e3efcb699867db79543dfebe764927b9c93
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 5a5f2e3efcb699867db79543dfebe764927b9c93
vim +/primary_dev +229 net/core/net_failover.c

5a5f2e3e Sridhar Samudrala 2018-04-25  211  
5a5f2e3e Sridhar Samudrala 2018-04-25  212  static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
5a5f2e3e Sridhar Samudrala 2018-04-25  213  {
5a5f2e3e Sridhar Samudrala 2018-04-25  214  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  215  	struct net_device *primary_dev, *standby_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  216  	int ret = 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  217  
5a5f2e3e Sridhar Samudrala 2018-04-25  218  	primary_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25 @219  	if (primary_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  220  		ret = dev_set_mtu(primary_dev, new_mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  221  		if (ret)
5a5f2e3e Sridhar Samudrala 2018-04-25  222  			return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  223  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  224  
5a5f2e3e Sridhar Samudrala 2018-04-25  225  	standby_dev = rcu_dereference(nfo_info->standby_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  226  	if (standby_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  227  		ret = dev_set_mtu(standby_dev, new_mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  228  		if (ret) {
5a5f2e3e Sridhar Samudrala 2018-04-25 @229  			dev_set_mtu(primary_dev, dev->mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  230  			return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  231  		}
5a5f2e3e Sridhar Samudrala 2018-04-25  232  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  233  
5a5f2e3e Sridhar Samudrala 2018-04-25  234  	dev->mtu = new_mtu;
5a5f2e3e Sridhar Samudrala 2018-04-25  235  
5a5f2e3e Sridhar Samudrala 2018-04-25  236  	return 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  237  }
5a5f2e3e Sridhar Samudrala 2018-04-25  238  
5a5f2e3e Sridhar Samudrala 2018-04-25  239  static void net_failover_set_rx_mode(struct net_device *dev)
5a5f2e3e Sridhar Samudrala 2018-04-25  240  {
5a5f2e3e Sridhar Samudrala 2018-04-25  241  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  242  	struct net_device *slave_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  243  
5a5f2e3e Sridhar Samudrala 2018-04-25  244  	rcu_read_lock();
5a5f2e3e Sridhar Samudrala 2018-04-25  245  
5a5f2e3e Sridhar Samudrala 2018-04-25  246  	slave_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  247  	if (slave_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  248  		dev_uc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  249  		dev_mc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  250  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  251  
5a5f2e3e Sridhar Samudrala 2018-04-25  252  	slave_dev = rcu_dereference(nfo_info->standby_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  253  	if (slave_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  254  		dev_uc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  255  		dev_mc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  256  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  257  
5a5f2e3e Sridhar Samudrala 2018-04-25  258  	rcu_read_unlock();
5a5f2e3e Sridhar Samudrala 2018-04-25  259  }
5a5f2e3e Sridhar Samudrala 2018-04-25  260  
5a5f2e3e Sridhar Samudrala 2018-04-25  261  static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
5a5f2e3e Sridhar Samudrala 2018-04-25  262  					u16 vid)
5a5f2e3e Sridhar Samudrala 2018-04-25  263  {
5a5f2e3e Sridhar Samudrala 2018-04-25  264  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  265  	struct net_device *primary_dev, *standby_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  266  	int ret = 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  267  
5a5f2e3e Sridhar Samudrala 2018-04-25  268  	primary_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudral
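
For the first warning, one possible fix sketch (not from the thread)
is to guard the rollback call the same way as the initial use:

	standby_dev = rcu_dereference(nfo_info->standby_dev);
	if (standby_dev) {
		ret = dev_set_mtu(standby_dev, new_mtu);
		if (ret) {
			if (primary_dev)	/* may be NULL, per line 219 */
				dev_set_mtu(primary_dev, dev->mtu);
			return ret;
		}
	}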

Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF

Why? Both datapaths could be enabled at the same time. Why would the
loop on the hypervisor side be a problem? This is not an issue for
bonding/team either.


>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to

"reset the MAC filter of the VF" - you mean "set the VF mac"?


>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>It uses the generic failover framework that provides 2 functions to create
>and destroy a master failover netdev. When STANDBY feature is enabled, an
>additional netdev(failover netdev) is created that acts as a master device
>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>is marked as 'standby' netdev and a passthru device with the same MAC is
>registered as 'primary' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

[...]



Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>net_failover_create()
>net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via
>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>net_failover_register()
>net_failover_unregister()
>

First of all, I like this v9 very much. Nice progress!
Couple of notes inlined.


>Signed-off-by: Sridhar Samudrala 
>---
> include/linux/netdevice.h  |  16 +
> include/net/net_failover.h |  62 
> net/Kconfig|  10 +
> net/core/Makefile  |   1 +
> net/core/net_failover.c| 892 +
> 5 files changed, 981 insertions(+)
> create mode 100644 include/net/net_failover.h
> create mode 100644 net/core/net_failover.c

[...]


>+static int net_failover_slave_register(struct net_device *slave_dev)
>+{
>+  struct net_failover_info *nfo_info;
>+  struct net_failover_ops *nfo_ops;
>+  struct net_device *failover_dev;
>+  bool slave_is_standby;
>+  u32 orig_mtu;
>+  int err;
>+
>+  ASSERT_RTNL();
>+
>+  failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
>+  if (!failover_dev)
>+  goto done;
>+
>+  if (failover_dev->type != slave_dev->type)
>+  goto done;
>+
>+  if (nfo_ops && nfo_ops->slave_register)
>+  return nfo_ops->slave_register(slave_dev, failover_dev);
>+
>+  nfo_info = netdev_priv(failover_dev);
>+  slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);

No parentheses needed.


>+  if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
>+  rtnl_dereference(nfo_info->primary_dev)) {
>+  netdev_err(failover_dev, "%s attempting to register as slave 
>dev when %s already present\n",
>+ slave_dev->name,
>+ slave_is_standby ? "standby" : "primary");
>+  goto done;
>+  }
>+
>+  /* We want to allow only a direct attached VF device as a primary
>+   * netdev. As there is no easy way to check for a VF device, restrict
>+   * this to a pci device.
>+   */
>+  if (!slave_is_standby && (!slave_dev->dev.parent ||
>+!dev_is_pci(slave_dev->dev.parent)))

Yeah, this is good for now.


>+  goto done;
>+
>+  if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
>+  vlan_uses_dev(failover_dev)) {
>+  netdev_err(failover_dev, "Device %s is VLAN challenged and 
>failover device has VLAN set up\n",
>+ failover_dev->name);
>+  goto done;
>+  }
>+
>+  /* Align MTU of slave with failover dev */
>+  orig_mtu = slave_dev->mtu;
>+  err = dev_set_mtu(slave_dev, failover_dev->mtu);
>+  if (err) {
>+  netdev_err(failover_dev, "unable to change mtu of %s to %u 
>register failed\n",
>+ slave_dev->name, failover_dev->mtu);
>+  goto done;
>+  }
>+
>+  dev_hold(slave_dev);
>+
>+  if (netif_running(failover_dev)) {
>+  err = dev_open(slave_dev);
>+  if (err && (err != -EBUSY)) {
>+  netdev_err(failover_dev, "Opening slave %s failed 
>err:%d\n",
>+ slave_dev->name, err);
>+  goto err_dev_open;
>+  }
>+  }
>+
>+  netif_addr_lock_bh(failover_dev);
>+  dev_uc_sync_multiple(slave_dev, failover_dev);
>+  dev_mc_sync_multiple(slave_dev, failover_dev);
>+  netif_addr_unlock_bh(failover_dev);
>+
>+  err = vlan_vids_ad

[PATCH] mwifiex: fix spelling mistake: "capabilties" -> "capabilities"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in function names and text strings

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/marvell/mwifiex/sta_event.c | 10 +-
 drivers/net/wireless/marvell/mwifiex/uap_event.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/sta_event.c b/drivers/net/wireless/marvell/mwifiex/sta_event.c
index 93dfb76cd8a6..01636c1b7447 100644
--- a/drivers/net/wireless/marvell/mwifiex/sta_event.c
+++ b/drivers/net/wireless/marvell/mwifiex/sta_event.c
@@ -27,9 +27,9 @@
 
#define MWIFIEX_IBSS_CONNECT_EVT_FIX_SIZE	12
 
-static int mwifiex_check_ibss_peer_capabilties(struct mwifiex_private *priv,
-  struct mwifiex_sta_node *sta_ptr,
-  struct sk_buff *event)
+static int mwifiex_check_ibss_peer_capabilities(struct mwifiex_private *priv,
+   struct mwifiex_sta_node *sta_ptr,
+   struct sk_buff *event)
 {
int evt_len, ele_len;
u8 *curr;
@@ -42,7 +42,7 @@ static int mwifiex_check_ibss_peer_capabilties(struct mwifiex_private *priv,
evt_len = event->len;
curr = event->data;
 
-   mwifiex_dbg_dump(priv->adapter, EVT_D, "ibss peer capabilties:",
+   mwifiex_dbg_dump(priv->adapter, EVT_D, "ibss peer capabilities:",
 event->data, event->len);
 
skb_push(event, MWIFIEX_IBSS_CONNECT_EVT_FIX_SIZE);
@@ -933,7 +933,7 @@ int mwifiex_process_sta_event(struct mwifiex_private *priv)
ibss_sta_addr);
sta_ptr = mwifiex_add_sta_entry(priv, ibss_sta_addr);
if (sta_ptr && adapter->adhoc_11n_enabled) {
-   mwifiex_check_ibss_peer_capabilties(priv, sta_ptr,
+   mwifiex_check_ibss_peer_capabilities(priv, sta_ptr,
 adapter->event_skb);
if (sta_ptr->is_11n_enabled)
for (i = 0; i < MAX_NUM_TID; i++)
diff --git a/drivers/net/wireless/marvell/mwifiex/uap_event.c b/drivers/net/wireless/marvell/mwifiex/uap_event.c
index e8c8728db15a..5c57efe24f6a 100644
--- a/drivers/net/wireless/marvell/mwifiex/uap_event.c
+++ b/drivers/net/wireless/marvell/mwifiex/uap_event.c
@@ -23,8 +23,8 @@
 
#define MWIFIEX_BSS_START_EVT_FIX_SIZE	12
 
-static int mwifiex_check_uap_capabilties(struct mwifiex_private *priv,
-struct sk_buff *event)
+static int mwifiex_check_uap_capabilities(struct mwifiex_private *priv,
+ struct sk_buff *event)
 {
int evt_len;
u8 *curr;
@@ -38,7 +38,7 @@ static int mwifiex_check_uap_capabilties(struct mwifiex_private *priv,
evt_len = event->len;
curr = event->data;
 
-   mwifiex_dbg_dump(priv->adapter, EVT_D, "uap capabilties:",
+   mwifiex_dbg_dump(priv->adapter, EVT_D, "uap capabilities:",
 event->data, event->len);
 
skb_push(event, MWIFIEX_BSS_START_EVT_FIX_SIZE);
@@ -194,7 +194,7 @@ int mwifiex_process_uap_event(struct mwifiex_private *priv)
   ETH_ALEN);
if (priv->hist_data)
mwifiex_hist_data_reset(priv);
-   mwifiex_check_uap_capabilties(priv, adapter->event_skb);
+   mwifiex_check_uap_capabilities(priv, adapter->event_skb);
break;
case EVENT_UAP_MIC_COUNTERMEASURES:
/* For future development */
-- 
2.17.0



Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>It uses the generic failover framework that provides 2 functions to create
>and destroy a master failover netdev. When STANDBY feature is enabled, an
>additional netdev(failover netdev) is created that acts as a master device
>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>is marked as 'standby' netdev and a passthru device with the same MAC is
>registered as 'primary' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>

When I enabled the standby feature (hardcoded), I have 2 netdevices now:
4: ens3:  mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5054:ff:feb2:a7f1/64 scope link 
   valid_lft forever preferred_lft forever
5: ens3n_sby:  mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5054:ff:feb2:a7f1/64 scope link 
   valid_lft forever preferred_lft forever

However, it seems to confuse my initscripts on Fedora:
[root@test1 ~]# ifup ens3
./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
expected
./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
operator expected

Determining IP information for ens3
ens3n_sby...Cannot find device "ens3n_sby.pid"
Cannot find device "ens3n_sby.lease"
 failed.

I tried to change the standby device mac:
ip link set ens3n_sby addr 52:54:00:b2:a7:f2
[root@test1 ~]# ifup ens3

Determining IP information for ens3... done.
[root@test1 ~]#

But now the network does not work. I think that the mac change on the
standby device should probably be refused, no?

When I change the mac back, all works fine.


Now I try to change mac of the failover master:
[root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
RTNETLINK answers: Operation not supported

I did expect that to work. I would expect this to change the mac of
the master and of both the standby and primary slaves.


Now I tried to add a primary pci device. I don't have any fancy VF on my
test setup, but I expected the good old 8139cp to work:
[root@test1 ~]# ethtool -i ens9
driver: 8139cp

[root@test1 ~]# ip link set ens9 addr 52:54:00:b2:a7:f1

I see no message in dmesg, so I guess the failover module did not
enslave this netdev. The mac change is not monitored. I would expect
that it is, and that whenever a device changes its mac to the failover
one, it should be enslaved, and whenever it changes its mac back to
something else, it should be released - the primary one, of course.



[...]

>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>+size_t len)
>+{
>+  struct virtnet_info *vi = netdev_priv(dev);
>+  int ret;
>+
>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
>+  return -EOPNOTSUPP;
>+
>+  ret = snprintf(buf, len, "_sby");

please avoid the "_".

[...]


[PATCH] qed: fix spelling mistake: "checksumed" -> "checksummed"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in DP_INFO message text

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/qlogic/qed/qed_ll2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
index 74fc626b1ec1..38502815d681 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
@@ -2370,7 +2370,7 @@ static int qed_ll2_start_xmit(struct qed_dev *cdev, struct sk_buff *skb)
u8 flags = 0;
 
if (unlikely(skb->ip_summed != CHECKSUM_NONE)) {
-   DP_INFO(cdev, "Cannot transmit a checksumed packet\n");
+   DP_INFO(cdev, "Cannot transmit a checksummed packet\n");
return -EINVAL;
}
 
-- 
2.17.0



[PATCH] liquidio: fix spelling mistake: "mac_tx_multi_collison" -> "mac_tx_multi_collision"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in oct_stats_strings text

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index 9926a12dd805..000e7d40e2ad 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -120,7 +120,7 @@ static const char oct_stats_strings[][ETH_GSTRING_LEN] = {
"mac_tx_ctl_packets",
"mac_tx_total_collisions",
"mac_tx_one_collision",
-   "mac_tx_multi_collison",
+   "mac_tx_multi_collision",
"mac_tx_max_collision_fail",
"mac_tx_max_deferal_fail",
"mac_tx_fifo_err",
-- 
2.17.0



[PATCH] net: ethernet: ucc: fix spelling mistake: "tx-late-collsion" -> "tx-late-collision"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in tx_fw_stat_gstrings text

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/freescale/ucc_geth_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth_ethtool.c b/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
index 4df282ed22c7..0beee2cc2ddd 100644
--- a/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
+++ b/drivers/net/ethernet/freescale/ucc_geth_ethtool.c
@@ -61,7 +61,7 @@ static const char hw_stat_gstrings[][ETH_GSTRING_LEN] = {
 static const char tx_fw_stat_gstrings[][ETH_GSTRING_LEN] = {
"tx-single-collision",
"tx-multiple-collision",
-   "tx-late-collsion",
+   "tx-late-collision",
"tx-aborted-frames",
"tx-lost-frames",
"tx-carrier-sense-errors",
-- 
2.17.0



Re: [PATCHv2 net] bridge: check iface upper dev when setting master via ioctl

2018-04-28 Thread Nikolay Aleksandrov

On 27/04/18 15:59, Hangbin Liu wrote:

When we set a bond slave's master to bridge via ioctl, we only check
the IFF_BRIDGE_PORT flag. Although we will find the slave's real master
at netdev_master_upper_dev_link() later, it already does some settings
and allocates some resources. It would be better to return as early
as possible.

v1 -> v2:
use netdev_master_upper_dev_get() instead of netdev_has_any_upper_dev()
to check if we have a master, because not all upper devs are masters,
e.g. vlan device.

Reported-by: syzbot+de73361ee4971b6e6...@syzkaller.appspotmail.com
Signed-off-by: Hangbin Liu 
---
  net/bridge/br_if.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 82c1a6f..5bb6681 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -518,8 +518,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
return -ELOOP;
}
  
-	/* Device is already being bridged */
-	if (br_port_exists(dev))
+	/* Device has master upper dev */
+	if (netdev_master_upper_dev_get(dev))
return -EBUSY;
  
  	/* No bridging devices that dislike that (e.g. wireless) */




Acked-by: Nikolay Aleksandrov 


Re: [net-next] ipv6: sr: Extract the right key values for "seg6_make_flowlabel"

2018-04-28 Thread Ahmed Abdelsalam
On Fri, 27 Apr 2018 13:59:07 -0400 (EDT)
David Miller  wrote:

> From: Ahmed Abdelsalam 
> Date: Thu, 26 Apr 2018 16:11:11 +0200
> 
> > @@ -119,6 +119,9 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct 
> > ipv6_sr_hdr *osrh, int proto)
> > int hdrlen, tot_len, err;
> > __be32 flowlabel;
> >  
> > +   inner_hdr = ipv6_hdr(skb);
> 
> You have to make this assignment after, not before, the skb_cow_header()
> call.  Otherwise this point can be pointing to freed up memory.

Ok! 
I fixed and sent you a v2 of the patch. 

-- 
Ahmed Abdelsalam 


[net-next v2] ipv6: sr: extract the right key values for "seg6_make_flowlabel"

2018-04-28 Thread Ahmed Abdelsalam
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the
flowlabel from a given skb. It relies on skb_get_hash() which eventually
calls __skb_flow_dissect() to extract the flow_keys struct values from
the skb.

In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(),
skb_reset_network_header(), and skb_mac_header_rebuild() will result in
a flow_keys struct with all key values set to zero.

This patch calls seg6_make_flowlabel() before resetting the headers of skb
to get the right key values.

Extracted key values are based on the type of the inner packet, as follows:
1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet.
2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port
3) L2 traffic: depends on what kind of traffic is carried in the L2
frame. IPv6 and IPv4 traffic work as discussed in 1) and 2)

Here is a hex dump of struct flow_keys for IPv4 and IPv6 traffic:
10.100.1.100: 47302 > 30.0.0.2: 5001
0000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
0010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
0020: 0a 64 01 64

fc00:a1:a > b2::2
0000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
0020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
0030: 00 00 00 00 00 00 00 00 00 00 00 0a

Signed-off-by: Ahmed Abdelsalam 
---
 net/ipv6/seg6_iptunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 9898926..eab39bd 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -127,6 +127,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
return err;
 
inner_hdr = ipv6_hdr(skb);
+   flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
 
skb_push(skb, tot_len);
skb_reset_network_header(skb);
@@ -138,7 +139,6 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 * decapsulation will overwrite inner hlim with outer hlim
 */
 
-   flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
if (skb->protocol == htons(ETH_P_IPV6)) {
ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
 flowlabel);
-- 
2.1.4



Re: [linux-sunxi] Re: [PATCH 1/5] dt-bindings: allow dwmac-sun8i to use other devices' exported regmap

2018-04-28 Thread Chen-Yu Tsai
Hi Rob,

On Tue, Apr 17, 2018 at 7:17 AM, Icenowy Zheng  wrote:
>
>
> On April 17, 2018, 2:47:45 AM GMT+08:00, Rob Herring  wrote:
>>On Wed, Apr 11, 2018 at 10:16:37PM +0800, Icenowy Zheng wrote:
>>> On some Allwinner SoCs the EMAC clock register needed by dwmac-sun8i is
>>> in another device's memory space. In this situation dwmac-sun8i can use
>>> a regmap exported by the other device with only the EMAC clock register.
>>
>>If this is a clock, then why not use the clock binding?
>
> EMAC clock register is only the datasheet name. It contains
> MII mode selection and delay chain configuration.

As Icenowy already mentioned, this is likely a misnomer.

The register contains controls on how to route the TX and RX clock
lines, and also what interface mode to use. The former includes things
like the delays mentioned in the device tree binding, and also whether
to invert the signals or not. The latter influences whether the TXC
line is an input or an output (or maybe what decoding module to send
all the signals to). On the H3/H5, it even contains controls for the
embedded PHY.

The settings only make sense to the MAC. To expose it as a generic
clock line would not be a good fit. You can look at what we did for
sun7i-a20-gmac, which is not pretty. All other DWMAC platforms that
were introduced after sun7i-a20-gmac also use a syscon, instead of
clocks, even though they probably cover the same set of RXC/TXC
controls.

ChenYu

>>
>>>
>>> Document this situation in the dwmac-sun8i device tree binding
>>> documentation.
>>>
>>> Signed-off-by: Icenowy Zheng 
>>> ---
>>>  Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
>>b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
>>> index 3d6d5fa0c4d5..0c5f63a80617 100644
>>> --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
>>> +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
>>> @@ -20,8 +20,9 @@ Required properties:
>>>  - phy-handle: See ethernet.txt
>>>  - #address-cells: shall be 1
>>>  - #size-cells: shall be 0
>>> -- syscon: A phandle to the syscon of the SoC with one of the following
>>> - compatible string:
>>> +- syscon: A phandle to a device which exports the EMAC clock register as a
>>> + regmap or to the syscon of the SoC with one of the following compatible
>>> + string:
>>>- allwinner,sun8i-h3-system-controller
>>>- allwinner,sun8i-v3s-system-controller
>>>- allwinner,sun50i-a64-system-controller
>>> --
>>> 2.15.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe devicetree"
>>in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>___
>>linux-arm-kernel mailing list
>>linux-arm-ker...@lists.infradead.org
>>http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
> --
> You received this message because you are subscribed to the Google Groups 
> "linux-sunxi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to linux-sunxi+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.


Re: [PATCH bpf-next] bpf: Allow bpf_current_task_under_cgroup in interrupt

2018-04-28 Thread Alexei Starovoitov

On 4/28/18 12:32 AM, Teng Qin wrote:

Currently, the bpf_current_task_under_cgroup helper has a check where if
the BPF program is running in_interrupt(), it will return -EINVAL. This
prevents the helper from being used in many useful scenarios, particularly
BPF programs attached to Perf Events.

This commit removes the check. Tested in a few NMI (Perf Event) and
softirq contexts; the helper returns the correct result.
---
 kernel/trace/bpf_trace.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 56ba0f2..f94890c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -474,8 +474,6 @@ BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, 
map, u32, idx)
struct bpf_array *array = container_of(map, struct bpf_array, map);
struct cgroup *cgrp;

-   if (unlikely(in_interrupt()))
-   return -EINVAL;
if (unlikely(idx >= array->map.max_entries))
return -E2BIG;



looks good, but SOB is missing. Please respin.
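
For context, here is a minimal sketch of the kind of perf_event program this
change enables; the map and section names are illustrative, not taken from
the patch, and the includes follow the selftests convention of the time:

	/* sketch: sample only tasks inside a target cgroup, even from NMI */
	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"	/* from tools/testing/selftests/bpf */

	struct bpf_map_def SEC("maps") cgroup_map = {
		.type = BPF_MAP_TYPE_CGROUP_ARRAY,
		.key_size = sizeof(__u32),
		.value_size = sizeof(__u32),
		.max_entries = 1,
	};

	SEC("perf_event")
	int on_sample(void *ctx)
	{
		/* returns 1 if current runs in the cgroup stored at index 0;
		 * before this patch it returned -EINVAL under in_interrupt() */
		if (bpf_current_task_under_cgroup(&cgroup_map, 0) != 1)
			return 0;
		/* ... collect per-cgroup samples here ... */
		return 0;
	}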



Re: [Cake] [PATCH iproute2-next v7] Add support for cake qdisc

2018-04-28 Thread Toke Høiland-Jørgensen
Stephen Hemminger  writes:

> On Fri, 27 Apr 2018 21:57:20 +0200
> Toke Høiland-Jørgensen  wrote:
>
>> sch_cake is intended to squeeze the most bandwidth and latency out of even
>> the slowest ISP links and routers, while presenting an API simple enough
>> that even an ISP can configure it.
>> 
>> Example of use on a cable ISP uplink:
>> 
>> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>> 
>> To shape a cable download link (ifb and tc-mirred setup elided)
>> 
>> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash 
>> besteffort
>> 
>> Cake is filled with:
>> 
>> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
>> * A novel "triple-isolate" mode (the default) which balances per-host
>>   and per-flow FQ even through NAT.
>> * A deficit-based shaper that can also be used in an unlimited mode.
>> * 8 way set associative hashing to reduce flow collisions to a minimum.
>> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
>> * Support for zeroing diffserv markings for entering and exiting traffic.
>> * Support for interacting well with Docsis 3.0 shaper framing.
>> * Support for DSL framing types and shapers.
>> * Support for ack filtering.
>> * Extensive statistics for measuring loss, ecn markings, and latency variation.
>> 
>> Various versions have been baking as an out-of-tree build for
>> kernel versions going back to 3.10, as the embedded router world has been
>> running a few years behind mainline Linux. A stable version has been
>> generally available on lede-17.01 and later.
>> 
>> sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
>> in the sqm-scripts, with sane defaults and vastly simpler configuration.
>> 
>> Cake's principal author is Jonathan Morton, with contributions from
>> Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
>> Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
>> and Loganaden Velvindron.
>> 
>> Testing from Pete Heist, Georgios Amanakis, and the many other members of
>> the c...@lists.bufferbloat.net mailing list.
>> 
>> Signed-off-by: Dave Taht 
>> Signed-off-by: Toke Høiland-Jørgensen 
>> ---
>> Changelog:
>> v7:
>>   - Move the target/interval presets to a table and check that only
>> one is passed.
>> 
>> v6:
>>   - Identical to v5 because apparently I don't git so well... :/
>> 
>> v5:
>>   - Print the SPLIT_GSO flag
>>   - Switch to print_u64() for JSON output
>>   - Fix a format string for mpu option output
>> 
>> v4:
>>   - Switch stats parsing to use nested netlink attributes
>>   - Tweaks to JSON stats output keys
>> 
>> v3:
>>   - Remove accidentally included test flag
>> 
>> v2:
>>   - Updated netlink config ABI
>>   - Remove diffserv-llt mode
>>   - Various tweaks and clean-ups of stats output
>>  man/man8/tc-cake.8 | 632 ++
>>  man/man8/tc.8  |   1 +
>>  tc/Makefile|   1 +
>>  tc/q_cake.c| 748 +
>>  4 files changed, 1382 insertions(+)
>>  create mode 100644 man/man8/tc-cake.8
>>  create mode 100644 tc/q_cake.c
>
> Looks good to me, when cake makes it into net-next.

Awesome, thanks!

-Toke


Re: [PATCH net-next v5] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-28 Thread Toke Høiland-Jørgensen
Toke Høiland-Jørgensen  writes:

> +static inline struct tcphdr *cake_get_tcphdr(struct sk_buff *skb)
> +{
> + struct ipv6hdr *ipv6h;
> + struct iphdr *iph;
> + struct tcphdr *th;
> +
> +
> + switch (skb->protocol) {
> + case cpu_to_be16(ETH_P_IP):

As someone was kind enough to point out off-list, skb->protocol doesn't
actually contain the protocol number of the inner protocol, so this
doesn't work for 6in4 encapsulation. Will try again...

-Toke
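
For anyone puzzled by why this breaks: with 6in4, the outer header is IPv4
carrying protocol 41, so a check along these lines (a sketch, not from the
patch) is what would be needed to spot the tunneled case:

	/* sketch: skb->protocol only reflects the *outer* header */
	if (skb->protocol == htons(ETH_P_IP) &&
	    ip_hdr(skb)->protocol == IPPROTO_IPV6) {
		/* 6in4: the TCP header sits behind an inner IPv6 header,
		 * so the ETH_P_IP branch would mis-parse the offsets */
	}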


Re: [PATCH bpf-next v8 05/10] bpf/verifier: improve register value range tracking with ARSH

2018-04-28 Thread Alexei Starovoitov
On Sat, Apr 28, 2018 at 12:02:00AM -0700, Yonghong Song wrote:
> When helpers like bpf_get_stack returns an int value
> and later on used for arithmetic computation, the LSH and ARSH
> operations are often required to get proper sign extension into
> 64-bit. For example, without this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0)
> With this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
> With better range of "R8", later on when "R8" is added to other register,
> e.g., a map pointer or scalar-value register, the better register
> range can be derived and verifier failure may be avoided.
> 
> In our later example,
> ..
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0)
> return 0;
> ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> ..
> Without improving ARSH value range tracking, the register representing
> "max_len - usize" will have smin_value equal to S64_MIN and will be
> rejected by verifier.
> 
> Signed-off-by: Yonghong Song 

Acked-by: Alexei Starovoitov 
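
The "r8 <<= 32; r8 s>>= 32" pair in the log above is the usual sign-extension
idiom. In C it corresponds to something like this sketch (illustrative only,
not from the patch):

	/* sign-extend the low 32 bits of r into 64 bits; clang lowers
	 * "(s64)(int)r" to BPF_LSH by 32 followed by BPF_ARSH by 32 */
	static long long sext32(unsigned long long r)
	{
		r <<= 32;                  /* BPF_LSH: value moves to bits 63..32 */
		return (long long)r >> 32; /* BPF_ARSH: replicates bit 63 down */
	}

Before this patch the verifier lost all bounds on the ARSH step; with it, the
umax_value and var_off survive the round trip, as the second log shows.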



Re: [PATCH bpf-next v8 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Alexei Starovoitov
On Sat, Apr 28, 2018 at 12:02:04AM -0700, Yonghong Song wrote:
> The test attached a raw_tracepoint program to sched/sched_switch.
> It tested getting the stack for user space, kernel space, and user
> space with a build_id request. It also tested getting the user
> and kernel stacks into the same buffer with back-to-back
> bpf_get_stack helper calls.
> 
> Whenever the kernel stack is available, the user space
> application will check to ensure that the kernel function
> for raw_tracepoint ___bpf_prog_run is part of the stack.
> 
> Signed-off-by: Yonghong Song 
...
> +static int get_stack_print_output(void *data, int size)
> +{
> + bool good_kern_stack = false, good_user_stack = false;
> + const char *expected_func = "___bpf_prog_run";

so the test works with interpreter only?
I guess that's ok for now, but needs to fixed for
configs with CONFIG_BPF_JIT_ALWAYS_ON=y



Re: [PATCH bpf-next v8 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Y Song
On Sat, Apr 28, 2018 at 9:56 AM, Alexei Starovoitov
 wrote:
> On Sat, Apr 28, 2018 at 12:02:04AM -0700, Yonghong Song wrote:
>> The test attached a raw_tracepoint program to sched/sched_switch.
>> It tested getting the stack for user space, kernel space, and user
>> space with a build_id request. It also tested getting the user
>> and kernel stacks into the same buffer with back-to-back
>> bpf_get_stack helper calls.
>>
>> Whenever the kernel stack is available, the user space
>> application will check to ensure that the kernel function
>> for raw_tracepoint ___bpf_prog_run is part of the stack.
>>
>> Signed-off-by: Yonghong Song 
> ...
>> +static int get_stack_print_output(void *data, int size)
>> +{
>> + bool good_kern_stack = false, good_user_stack = false;
>> + const char *expected_func = "___bpf_prog_run";
>
> so the test works with interpreter only?
> I guess that's ok for now, but needs to fixed for
> configs with CONFIG_BPF_JIT_ALWAYS_ON=y

I did not test CONFIG_BPF_JIT_ALWAYS_ON=y.
I can have a followup patch for this if the patch set does not need respin.


Re: [PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats

2018-04-28 Thread Michael Chan
On Fri, Apr 27, 2018 at 8:15 PM, Zumeng Chen  wrote:

> diff --git a/drivers/net/ethernet/broadcom/tg3.h 
> b/drivers/net/ethernet/broadcom/tg3.h
> index 3b5e98e..6727d93 100644
> --- a/drivers/net/ethernet/broadcom/tg3.h
> +++ b/drivers/net/ethernet/broadcom/tg3.h
> @@ -3352,6 +3352,7 @@ struct tg3 {
> struct pci_dev  *pdev_peer;
>
> struct tg3_hw_stats *hw_stats;
> +   boolhw_stats_flag;

You can just add another bit to enum TG3_FLAGS for this purpose.

While this scheme will probably work, I think a better and more
elegant way to fix this is to use RCU.

> dma_addr_t  stats_mapping;
> struct work_struct  reset_task;
>
> --
> 2.9.3
>
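
A minimal sketch of the RCU scheme being suggested (the hw_stats field name
comes from the quoted diff; everything else is illustrative and assumes
tp->hw_stats gains an __rcu annotation):

	/* reader side, e.g. the stats path (sketch) */
	struct tg3_hw_stats *stats;

	rcu_read_lock();
	stats = rcu_dereference(tp->hw_stats);
	if (stats)
		/* ... read counters out of *stats ... */;
	rcu_read_unlock();

	/* writer side, e.g. tg3_halt() (sketch): unpublish first instead
	 * of memset()ing memory a concurrent reader may still be using */
	old = tp->hw_stats;
	RCU_INIT_POINTER(tp->hw_stats, NULL);
	synchronize_rcu();	/* after this, no reader can still see 'old' */
	/* now clearing or freeing 'old' cannot race with readers */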


Re: [PATCH bpf-next v8 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Alexei Starovoitov
On Sat, Apr 28, 2018 at 11:17:30AM -0700, Y Song wrote:
> On Sat, Apr 28, 2018 at 9:56 AM, Alexei Starovoitov
>  wrote:
> > On Sat, Apr 28, 2018 at 12:02:04AM -0700, Yonghong Song wrote:
> >> The test attached a raw_tracepoint program to sched/sched_switch.
> >> It tested getting the stack for user space, kernel space, and user
> >> space with a build_id request. It also tested getting the user
> >> and kernel stacks into the same buffer with back-to-back
> >> bpf_get_stack helper calls.
> >>
> >> Whenever the kernel stack is available, the user space
> >> application will check to ensure that the kernel function
> >> for raw_tracepoint ___bpf_prog_run is part of the stack.
> >>
> >> Signed-off-by: Yonghong Song 
> > ...
> >> +static int get_stack_print_output(void *data, int size)
> >> +{
> >> + bool good_kern_stack = false, good_user_stack = false;
> >> + const char *expected_func = "___bpf_prog_run";
> >
> > so the test works with interpreter only?
> > I guess that's ok for now, but needs to fixed for
> > configs with CONFIG_BPF_JIT_ALWAYS_ON=y
> 
> I did not test CONFIG_BPF_JIT_ALWAYS_ON=y.
> I can have a followup patch for this if the patch set does not need respin.

I was thinking to apply the set and do the fix in the follow up,
but testing it with jit_enable=1 I don't see it failing,
so something is wrong with the test.
Also, get_stack_raw_tp_action() keeps spawning new 'dd' processes in the
background which are not killed after the test stops.
Please fix both issues in respin.



Re: [PATCH net-next 1/2 v3] uevent: add alloc_uevent_skb() helper

2018-04-28 Thread Christian Brauner
On Fri, Apr 27, 2018 at 11:39:44AM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > This patch adds alloc_uevent_skb() in preparation for follow up patches.
> >
> > Signed-off-by: Christian Brauner 
> > ---
> >  lib/kobject_uevent.c | 39 ++-
> >  1 file changed, 26 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..c3cb110f663b 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -296,6 +296,31 @@ static void cleanup_uevent_env(struct subprocess_info *info)
> >  }
> >  #endif
> >  
> > +static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
> > +   const char *action_string,
> > +   const char *devpath)
> > +{
> > +   struct sk_buff *skb = NULL;
> > +   char *scratch;
> > +   size_t len;
> > +
> > +   /* allocate message with maximum possible size */
> > +   len = strlen(action_string) + strlen(devpath) + 2;
> > +   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> > +   if (!skb)
> > +   return NULL;
> > +
> > +   /* add header */
> > +   scratch = skb_put(skb, len);
> > +   sprintf(scratch, "%s@%s", action_string, devpath);
> > +
> > +   skb_put_data(skb, env->buf, env->buflen);
> > +
> > +   NETLINK_CB(skb).dst_group = 1;
> 
> nit:
>  We might want to explicitly set NETLINK_CB(skb).portid to 0 and
>  NETLINK_CB(skb).creds.uid to GLOBAL_ROOT_UID and
>  NETLINK_CB(skb).creds.gid to GLOBAL_ROOT_GID here
>  just to make it clear this is happening.
> 
>  It is not a problem because they __alloc_skb memsets to 0 the
>  fields of struct sk_buff that it does not initialize.  And these
>  are the zero values.
> 
>  Still it would be nice to be able to look at the code and quickly
>  see these are the values being set.

Don't really mind adding it. Ok, non-functional changes added to the new
version. But then let's set "portid" too:

parms = &NETLINK_CB(skb);
parms->creds.uid = GLOBAL_ROOT_UID;
parms->creds.gid = GLOBAL_ROOT_GID;
parms->dst_group = 1;
parms->portid = 0;

Christian


Re: [PATCH net-next 2/2 v3] netns: restrict uevents

2018-04-28 Thread Christian Brauner
On Fri, Apr 27, 2018 at 11:30:26AM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> > ---
> >  lib/kobject_uevent.c | 140 ++-
> >  1 file changed, 99 insertions(+), 41 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index c3cb110f663b..d8ce5e6d83af 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> >  
> > +static int uevent_net_broadcast_tagged(struct sock *usk,
> > +  struct kobj_uevent_env *env,
> > +  const char *action_string,
> > +  const char *devpath)
> > +{
> > +   struct user_namespace *owning_user_ns = sock_net(usk)->user_ns;
> > +   struct sk_buff *skb = NULL;
> > +   int ret;
> > +
> > +   skb = alloc_uevent_skb(env, action_string, devpath);
> > +   if (!skb)
> > +   return -ENOMEM;
> > +
> > +   /* fix credentials */
> > +   if (owning_user_ns != &init_user_ns) {
> 
> Nit: This test is just a performance optimization as such is not
>   necessary.  That is we can safely unconditionally set the
>   credentials this way.

alloc_uevent_skb() will now set

parms = &NETLINK_CB(skb);
parms->creds.uid = GLOBAL_ROOT_UID;
parms->creds.gid = GLOBAL_ROOT_GID;
parms->dst_group = 1;
parms->portid = 0;

explicitly. So repeating that initialization unconditionally here does
not make sense to me. Also, this hits map_uid_down() in user_namespace.c,
which is a known hotpath (remember the extensive testing we did back for
the uidmap limit bump from 5 to 340). And even though it might not
matter much in this case, there's no need to hit this code. The condition
also makes it obvious that only non-initial user namespace uevent sockets
need fixing.

Christian

> 
> > +   struct netlink_skb_parms *parms = &NETLINK_CB(skb);
> > +   kuid_t root_uid;
> > +   kgid_t root_gid;
> > +
> > +   /* fix uid */
> > +   root_uid = make_kuid(owning_user_ns, 0);
> > +   if (!uid_valid(root_uid))
> > +   root_uid = GLOBAL_ROOT_UID;
> > +   parms->creds.uid = root_uid;
> > +
> > +   /* fix gid */
> > +   root_gid = make_kgid(owning_user_ns, 0);
> > +   if (!gid_valid(root_gid))
> > +   root_gid = GLOBAL_ROOT_GID;
> > +   parms->creds.gid = root_gid;
> > +   }
> > +
> > +   ret = netlink_broadcast(usk, skb, 0, 1, GFP_KERNEL);
> > +   /* ENOBUFS should be handled in userspace */
> > +   if (ret == -ENOBUFS || ret == -ESRCH)
> > +   ret = 0;
> > +
> > +   return ret;
> > +}
> > +#endif


[PATCH net-next 1/2 v4] uevent: add alloc_uevent_skb() helper

2018-04-28 Thread Christian Brauner
This patch adds alloc_uevent_skb() in preparation for follow up patches.

Signed-off-by: Christian Brauner 
---
v3->v4:
* non-functional changes:
  initialize some variables again explicitly to make it obvious to
  readers that they are correctly set
v2->v3:
* new approach: patch added
v1->v2:
* different approach in different patchset
v0->v1:
* different approach in different patchset
---
 lib/kobject_uevent.c | 47 
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..649bf60a9440 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -296,6 +297,38 @@ static void cleanup_uevent_env(struct subprocess_info *info)
 }
 #endif
 
+#ifdef CONFIG_NET
+static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
+   const char *action_string,
+   const char *devpath)
+{
+   struct netlink_skb_parms *parms;
+   struct sk_buff *skb = NULL;
+   char *scratch;
+   size_t len;
+
+   /* allocate message with maximum possible size */
+   len = strlen(action_string) + strlen(devpath) + 2;
+   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+   if (!skb)
+   return NULL;
+
+   /* add header */
+   scratch = skb_put(skb, len);
+   sprintf(scratch, "%s@%s", action_string, devpath);
+
+   skb_put_data(skb, env->buf, env->buflen);
+
+   parms = &NETLINK_CB(skb);
+   parms->creds.uid = GLOBAL_ROOT_UID;
+   parms->creds.gid = GLOBAL_ROOT_GID;
+   parms->dst_group = 1;
+   parms->portid = 0;
+
+   return skb;
+}
+#endif
+
 static int kobject_uevent_net_broadcast(struct kobject *kobj,
struct kobj_uevent_env *env,
const char *action_string,
@@ -314,22 +347,10 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
continue;
 
if (!skb) {
-   /* allocate message with the maximum possible size */
-   size_t len = strlen(action_string) + strlen(devpath) + 2;
-   char *scratch;
-
retval = -ENOMEM;
-   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+   skb = alloc_uevent_skb(env, action_string, devpath);
if (!skb)
continue;
-
-   /* add header */
-   scratch = skb_put(skb, len);
-   sprintf(scratch, "%s@%s", action_string, devpath);
-
-   skb_put_data(skb, env->buf, env->buflen);
-
-   NETLINK_CB(skb).dst_group = 1;
}
 
retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
-- 
2.17.0



[PATCH net-next 2/2 v4] netns: restrict uevents

2018-04-28 Thread Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch simplifies and fixes a couple of things:
- Split codepath for sending uevents by kobject namespace tags:
  1. Untagged kobjects - uevent_net_broadcast_untagged():
 Untagged kobjects will be broadcast into all uevent sockets recorded
 in uevent_sock_list, i.e. into all network namespaces owned by the
 initial user namespace.
  2. Tagged kobjects - uevent_net_broadcast_tagged():
 Tagged kobjects will only be broadcast into the network namespace they
 were tagged with.
  Handling of tagged kobjects in 2. does not cause any semantic changes.
  This is just splitting out the filtering logic that was handled by
  kobj_bcast_filter() before.
  Handling of untagged kobjects in 1. will cause a semantic change. The
  reasons why this is needed and ok have been discussed in [1]. Here is a
  short summary:
  - Userspace ignores uevents from network namespaces that are not owned by
the initial user namespace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning
user namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway.
I have a) verified that this is true for every udev implementation out
there b) that this behavior has been present in all udev
implementations from the very beginning.
  - Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless
to them. Mainly, because non-initial user namespaces cannot easily
manage devices unless they have a privileged host-process helping them
out. This means that there will be a thundering herd of activity when
there shouldn't be any.
  - Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So every time a uevent is generated the mutex is taken on
the list. The mutex is held *from the creation of the uevent (memory
allocation, string creation etc.) until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket
that has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a
lot of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we
significantly increase performance in this codepath.
  - Injecting uevents:
There's a valid argument that containers might be interested in
receiving device events especially if they are delegated to them by a
privileged userspace process. One prime example are SR-IOV enabled
devices that are explicitly designed to be handed off to other users
such as VMs or containers.
This use-case can now be correctly handled since
commit 692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can
let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace
process make a decision what uevents should be sent. This removes the
need to blindly broadcast uevents into all user namespaces and provides
a performant and safe solution to this problem.
  - Filtering logic:
This patch filters by *owning user namespace of the 
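
The heart of the filtering change can be sketched as follows; this is a
reconstruction from the description above, not the actual patch hunks:

	/* sketch: only uevent sockets whose network namespace is owned by
	 * the initial user namespace stay on uevent_sock_list and receive
	 * broadcasts of untagged kobjects */
	if (sock_net(uevent_sock)->user_ns != &init_user_ns)
		continue;	/* userspace there ignores uid 65534 anyway */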

[PATCH net-next 0/2 v4] netns: uevent filtering

2018-04-28 Thread Christian Brauner
Hey everyone,

This is the new approach to uevent filtering as discussed (see the
threads in [1], [2], and [3]). It only contains *non-functional
changes*.

This series deals with fixing up the uevent filtering logic:
- uevent filtering logic is simplified
- locking time on uevent_sock_list is minimized
- tagged and untagged kobjects are handled in separate codepaths
- permissions for userspace are fixed for network device uevents in
  network namespaces owned by non-initial user namespaces
  Udev is now able to see those events correctly which it wasn't before.
  For example, moving a physical device into a network namespace not
  owned by the initial user namespaces before gave:

  root@xen1:~# udevadm --debug monitor -k
  calling: monitor
  monitor will print the received events for:
  KERNEL - the kernel uevent

  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored

  and now after the discussion and solution in [3] correctly gives:

  root@xen1:~# udevadm --debug monitor -k
  calling: monitor
  monitor will print the received events for:
  KERNEL - the kernel uevent

  KERNEL[625.301042] add    /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
  KERNEL[625.301109] move   /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
  KERNEL[625.301138] move   /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
  KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)

Thanks!
Christian

[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738

Christian Brauner (2):
  uevent: add alloc_uevent_skb() helper
  netns: restrict uevents

 lib/kobject_uevent.c | 180 ++-
 1 file changed, 128 insertions(+), 52 deletions(-)

-- 
2.17.0



Re: [PATCH net-next v2 4/7] net: mscc: Add initial Ocelot switch support

2018-04-28 Thread kbuild test robot
Hi Alexandre,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Alexandre-Belloni/Microsemi-Ocelot-Ethernet-switch-support/20180429-024136
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sh 

All warnings (new ones prefixed by >>):

   In file included from include/linux/swab.h:5:0,
from include/uapi/linux/byteorder/little_endian.h:13,
from include/linux/byteorder/little_endian.h:5,
from arch/sh/include/uapi/asm/byteorder.h:6,
from arch/sh/include/asm/bitops.h:12,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/net/ethernet/mscc/ocelot_board.c:7:
   drivers/net/ethernet/mscc/ocelot_board.c: In function 'ocelot_parse_ifh':
   drivers/net/ethernet/mscc/ocelot_board.c:23:27: error: '_be32' undeclared 
(first use in this function); did you mean '__be32'?
  ifh[i] = ntohl((__force _be32)ifh[i]);
  ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
   include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
'__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
   include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
'___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
   drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
   drivers/net/ethernet/mscc/ocelot_board.c:23:27: note: each undeclared 
identifier is reported only once for each function it appears in
  ifh[i] = ntohl((__force _be32)ifh[i]);
  ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
   include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
'__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
   include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
'___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
   drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
   drivers/net/ethernet/mscc/ocelot_board.c:23:33: error: expected ')' before 
'ifh'
  ifh[i] = ntohl((__force _be32)ifh[i]);
^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
   include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
'__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
   include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
'___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
   drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
   drivers/net/ethernet/mscc/ocelot_board.c:23:33: error: expected ')' before 
'ifh'
  ifh[i] = ntohl((__force _be32)ifh[i]);
^
   include/uapi/linux/swab.h:18:12: note: in definition of macro 
'___constant_swab32'
 (((__u32)(x) & (__u32)0x00ffUL) << 24) |  \
   ^
>> include/uapi/linux/byteorder/little_endian.h:40:26: note: in expansion of 
>> macro '__swab32'
#define __be32_to_cpu(x) __swab32((__force __u32)(__be32)(x))
 ^~~~
   include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
'__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
   include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
'___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
   drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
   drivers/net/ethernet/mscc/ocelot_board.c:23:33: error: expected ')' before 
'ifh'
  ifh[i] = ntohl((__force _be32)ifh[i]);
^
   include/uapi/linux/swab.h:19:12: note: in definition of macro 
'___constant_swab32'
 (((__u32)(x) & (__u32)0xff00
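
The diagnostic boils down to a missing underscore in the new driver; per
gcc's own hint, the cast presumably should read (illustrative):

	/* drivers/net/ethernet/mscc/ocelot_board.c:23 */
	ifh[i] = ntohl((__force __be32)ifh[i]);	/* __be32, not _be32 */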

Re: [PATCH bpf-next v8 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Yonghong Song



On 4/28/18 12:06 PM, Alexei Starovoitov wrote:

On Sat, Apr 28, 2018 at 11:17:30AM -0700, Y Song wrote:

On Sat, Apr 28, 2018 at 9:56 AM, Alexei Starovoitov
 wrote:

On Sat, Apr 28, 2018 at 12:02:04AM -0700, Yonghong Song wrote:

The test attached a raw_tracepoint program to sched/sched_switch.
It tested getting the stack for user space, kernel space, and user
space with a build_id request. It also tested getting the user
and kernel stacks into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that the kernel function
for raw_tracepoint ___bpf_prog_run is part of the stack.

Signed-off-by: Yonghong Song 

...

+static int get_stack_print_output(void *data, int size)
+{
+ bool good_kern_stack = false, good_user_stack = false;
+ const char *expected_func = "___bpf_prog_run";


so the test works with interpreter only?
I guess that's ok for now, but needs to fixed for
configs with CONFIG_BPF_JIT_ALWAYS_ON=y


I did not test CONFIG_BPF_JIT_ALWAYS_ON=y.
I can have a followup patch for this if the patch set does not need respin.


I was thinking to apply the set and do the fix in the follow up,
but testing it with jit_enable=1 I don't see it failing,
so something is wrong with the test.


Yes, it is because of the return value test

if (CHECK(err < 0, "perf_event_poller", "err %d errno %d\n", err,
...

the "err < 0" is not right as all the return values are nonnegative.



Also get_stack_raw_tp_action() keeps spawning new 'dd' in the background
which is not killed after test stops.
Please fix both issues in respin.


I will fix both and resend the patch.


[PATCH net-next 0/8] r8169: further improvements w/o functional change

2018-04-28 Thread Heiner Kallweit
This series aims at further improving and simplifying the code w/o
any intended functional changes.

Series was tested on: RTL8169sb, RTL8168d, RTL8168e-vl

Heiner Kallweit (8):
  r8169: remove unneeded call to __rtl8169_set_features in rtl_open
  r8169: improve rtl8169_set_features
  r8169: replace magic number for INTT mask with a constant
  r8169: improve CPlusCmd handling
  r8169: improve handling of CPCMD quirk mask
  r8169: simplify rtl_hw_start_8169
  r8169: remove calls to rtl_set_rx_mode
  r8169: move common initializations to tp->hw_start

 drivers/net/ethernet/realtek/r8169.c | 184 ++-
 1 file changed, 42 insertions(+), 142 deletions(-)

-- 
2.17.0



[PATCH net-next 2/8] r8169: improve rtl8169_set_features

2018-04-28 Thread Heiner Kallweit
__rtl8169_set_features is used in rtl8169_set_features only, so we
can inline it. In addition:
- Remove the check (features ^ dev->features); __netdev_update_features
  already checks that the requested features differ from the current ones.
- Don't mask out unsupported flags, there's no benefit in it.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 18 --
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index d2656224..411d12be 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1935,12 +1935,14 @@ static netdev_features_t rtl8169_fix_features(struct net_device *dev,
return features;
 }
 
-static void __rtl8169_set_features(struct net_device *dev,
-  netdev_features_t features)
+static int rtl8169_set_features(struct net_device *dev,
+   netdev_features_t features)
 {
struct rtl8169_private *tp = netdev_priv(dev);
u32 rx_config;
 
+   rtl_lock_work(tp);
+
rx_config = RTL_R32(tp, RxConfig);
if (features & NETIF_F_RXALL)
rx_config |= (AcceptErr | AcceptRunt);
@@ -1963,24 +1965,12 @@ static void __rtl8169_set_features(struct net_device *dev,
 
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
RTL_R16(tp, CPlusCmd);
-}
 
-static int rtl8169_set_features(struct net_device *dev,
-   netdev_features_t features)
-{
-   struct rtl8169_private *tp = netdev_priv(dev);
-
-   features &= NETIF_F_RXALL | NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_RX;
-
-   rtl_lock_work(tp);
-   if (features ^ dev->features)
-   __rtl8169_set_features(dev, features);
rtl_unlock_work(tp);
 
return 0;
 }
 
-
 static inline u32 rtl8169_tx_vlan_tag(struct sk_buff *skb)
 {
return (skb_vlan_tag_present(skb)) ?
-- 
2.17.0




[PATCH net-next 1/8] r8169: remove unneeded call to __rtl8169_set_features in rtl_open

2018-04-28 Thread Heiner Kallweit
RxChkSum and RxVlan aren't touched outside __rtl8169_set_features
(except in probe), so they are always in sync with dev->features.
And the RxConfig flags are set in rtl_set_rx_mode() which is
called via dev_set_rx_mode() from __dev_open().
Therefore we can safely remove this call.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index a5d00ee9..d2656224 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7637,8 +7637,6 @@ static int rtl_open(struct net_device *dev)
 
rtl8169_init_phy(dev, tp);
 
-   __rtl8169_set_features(dev, dev->features);
-
rtl_pll_power_up(tp);
 
rtl_hw_start(tp);
-- 
2.17.0




[PATCH net-next 3/8] r8169: replace magic number for INTT mask with a constant

2018-04-28 Thread Heiner Kallweit
Use a proper constant for INTT bit mask.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 411d12be..1dd189dd 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -599,6 +599,7 @@ enum rtl_register_content {
RxChkSum= (1 << 5),
PCIDAC  = (1 << 4),
PCIMulRW= (1 << 3),
+#define INTT_MASK  GENMASK(1, 0)
INTT_0  = 0x,   // 8168
INTT_1  = 0x0001,   // 8168
INTT_2  = 0x0002,   // 8168
@@ -2344,7 +2345,7 @@ static int rtl_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
if (IS_ERR(ci))
return PTR_ERR(ci);
 
-   scale = &ci->scalev[RTL_R16(tp, CPlusCmd) & 3];
+   scale = &ci->scalev[RTL_R16(tp, CPlusCmd) & INTT_MASK];
 
/* read IntrMitigate and adjust according to scale */
for (w = RTL_R16(tp, IntrMitigate); w; w >>= RTL_COALESCE_SHIFT, p++) {
@@ -2443,7 +2444,7 @@ static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
 
RTL_W16(tp, IntrMitigate, swab16(w));
 
-   tp->cp_cmd = (tp->cp_cmd & ~3) | cp01;
+   tp->cp_cmd = (tp->cp_cmd & ~INTT_MASK) | cp01;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
RTL_R16(tp, CPlusCmd);
 
-- 
2.17.0




[PATCH net-next 7/8] r8169: remove calls to rtl_set_rx_mode

2018-04-28 Thread Heiner Kallweit
__dev_open() calls the ndo_set_rx_mode callback anyway, so we don't
have to do it here too.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 8c816f6c..c7b9301a 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5445,8 +5445,6 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
 
RTL_W32(tp, RxMissed, 0);
 
-   rtl_set_rx_mode(tp->dev);
-
/* no early-rx interrupts */
RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
 }
@@ -6361,8 +6359,6 @@ static void rtl_hw_start_8168(struct rtl8169_private *tp)
 
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
 
-   rtl_set_rx_mode(tp->dev);
-
RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
 }
 
@@ -6554,8 +6550,6 @@ static void rtl_hw_start_8101(struct rtl8169_private *tp)
 
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
 
-   rtl_set_rx_mode(tp->dev);
-
RTL_R8(tp, IntrMask);
 
RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
-- 
2.17.0




[PATCH net-next 4/8] r8169: improve CPlusCmd handling

2018-04-28 Thread Heiner Kallweit
tp->cp_cmd is supposed to reflect the current value of the CPlusCmd
register. Several (quite old) changes however directly change this
register w/o updating tp->cp_cmd. Also we have places in the code
reading this register where we could use the cached value.

In addition:
- Properly initialize tp->cp_cmd with the register value.
- In rtl_hw_start_8169 remove one setting of PCIMulRW because it's
  set unconditionally anyway a few lines later.
- In rtl_hw_start_8168 properly mask out the INTT bits before
  setting INTT_1. So far we rely on both bits being zero.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 42 +++-
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1dd189dd..868dee7d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1962,8 +1962,6 @@ static int rtl8169_set_features(struct net_device *dev,
else
tp->cp_cmd &= ~RxVlan;
 
-   tp->cp_cmd |= RTL_R16(tp, CPlusCmd) & ~(RxVlan | RxChkSum);
-
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
RTL_R16(tp, CPlusCmd);
 
@@ -2345,7 +2343,7 @@ static int rtl_get_coalesce(struct net_device *dev, 
struct ethtool_coalesce *ec)
if (IS_ERR(ci))
return PTR_ERR(ci);
 
-   scale = &ci->scalev[RTL_R16(tp, CPlusCmd) & INTT_MASK];
+   scale = &ci->scalev[tp->cp_cmd & INTT_MASK];
 
/* read IntrMitigate and adjust according to scale */
for (w = RTL_R16(tp, IntrMitigate); w; w >>= RTL_COALESCE_SHIFT, p++) {
@@ -4841,7 +4839,7 @@ static void r8168_pll_power_down(struct rtl8169_private 
*tp)
 
if ((tp->mac_version == RTL_GIGA_MAC_VER_23 ||
 tp->mac_version == RTL_GIGA_MAC_VER_24) &&
-   (RTL_R16(tp, CPlusCmd) & ASF)) {
+   (tp->cp_cmd & ASF)) {
return;
}
 
@@ -5321,15 +5319,6 @@ static void rtl_set_rx_tx_desc_registers(struct 
rtl8169_private *tp)
RTL_W32(tp, RxDescAddrLow, ((u64) tp->RxPhyAddr) & DMA_BIT_MASK(32));
 }
 
-static u16 rtl_rw_cpluscmd(struct rtl8169_private *tp)
-{
-   u16 cmd;
-
-   cmd = RTL_R16(tp, CPlusCmd);
-   RTL_W16(tp, CPlusCmd, cmd);
-   return cmd;
-}
-
 static void rtl_set_rx_max_size(struct rtl8169_private *tp)
 {
/* Low hurts. Let's disable the filtering. */
@@ -5415,10 +5404,8 @@ static void rtl_set_rx_mode(struct net_device *dev)
 
 static void rtl_hw_start_8169(struct rtl8169_private *tp)
 {
-   if (tp->mac_version == RTL_GIGA_MAC_VER_05) {
-   RTL_W16(tp, CPlusCmd, RTL_R16(tp, CPlusCmd) | PCIMulRW);
+   if (tp->mac_version == RTL_GIGA_MAC_VER_05)
pci_write_config_byte(tp->pci_dev, PCI_CACHE_LINE_SIZE, 0x08);
-   }
 
RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
if (tp->mac_version == RTL_GIGA_MAC_VER_01 ||
@@ -5439,7 +5426,7 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
tp->mac_version == RTL_GIGA_MAC_VER_04)
rtl_set_rx_tx_config_registers(tp);
 
-   tp->cp_cmd |= rtl_rw_cpluscmd(tp) | PCIMulRW;
+   tp->cp_cmd |= PCIMulRW;
 
if (tp->mac_version == RTL_GIGA_MAC_VER_02 ||
tp->mac_version == RTL_GIGA_MAC_VER_03) {
@@ -5671,7 +5658,8 @@ static void rtl_hw_start_8168bb(struct rtl8169_private 
*tp)
 {
RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
 
-   RTL_W16(tp, CPlusCmd, RTL_R16(tp, CPlusCmd) & ~R8168_CPCMD_QUIRK_MASK);
+   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 
if (tp->dev->mtu <= ETH_DATA_LEN) {
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B |
@@ -5699,7 +5687,8 @@ static void __rtl_hw_start_8168cp(struct rtl8169_private 
*tp)
 
rtl_disable_clock_request(tp);
 
-   RTL_W16(tp, CPlusCmd, RTL_R16(tp, CPlusCmd) & ~R8168_CPCMD_QUIRK_MASK);
+   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
 static void rtl_hw_start_8168cp_1(struct rtl8169_private *tp)
@@ -5728,7 +5717,8 @@ static void rtl_hw_start_8168cp_2(struct rtl8169_private 
*tp)
if (tp->dev->mtu <= ETH_DATA_LEN)
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B);
 
-   RTL_W16(tp, CPlusCmd, RTL_R16(tp, CPlusCmd) & ~R8168_CPCMD_QUIRK_MASK);
+   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
 static void rtl_hw_start_8168cp_3(struct rtl8169_private *tp)
@@ -5745,7 +5735,8 @@ static void rtl_hw_start_8168cp_3(struct rtl8169_private 
*tp)
if (tp->dev->mtu <= ETH_DATA_LEN)
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B);
 
-   RTL_W16(tp, CPlusCmd, RTL_R16(tp, CPlusCmd) & ~R8168_CPCMD_QUIRK_MASK);
+   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
 static void rtl_hw_start_8168c_1(struct rtl8169_private *tp

[PATCH net-next 8/8] r8169: move common initializations to tp->hw_start

2018-04-28 Thread Heiner Kallweit
The chip-specific init code includes quite a few calls which are
identical for all chips. So move these calls to tp->hw_start().

In addition, move rtl_set_rx_max_size() up a little to make sure it's
defined before it's used. Unfortunately the diff generated by git
is a little hard to read.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 74 +++-
 1 file changed, 19 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index c7b9301a..66f10d11 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5301,10 +5301,10 @@ static void rtl_set_rx_tx_config_registers(struct 
rtl8169_private *tp)
(InterFrameGap << TxInterFrameGapShift));
 }
 
-static void rtl_hw_start(struct  rtl8169_private *tp)
+static void rtl_set_rx_max_size(struct rtl8169_private *tp)
 {
-   tp->hw_start(tp);
-   rtl_irq_enable_all(tp);
+   /* Low hurts. Let's disable the filtering. */
+   RTL_W16(tp, RxMaxSize, R8169_RX_BUF_SIZE + 1);
 }
 
 static void rtl_set_rx_tx_desc_registers(struct rtl8169_private *tp)
@@ -5320,10 +5320,23 @@ static void rtl_set_rx_tx_desc_registers(struct 
rtl8169_private *tp)
RTL_W32(tp, RxDescAddrLow, ((u64) tp->RxPhyAddr) & DMA_BIT_MASK(32));
 }
 
-static void rtl_set_rx_max_size(struct rtl8169_private *tp)
+static void rtl_hw_start(struct  rtl8169_private *tp)
 {
-   /* Low hurts. Let's disable the filtering. */
-   RTL_W16(tp, RxMaxSize, R8169_RX_BUF_SIZE + 1);
+   RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
+
+   tp->hw_start(tp);
+
+   rtl_set_rx_max_size(tp);
+   rtl_set_rx_tx_desc_registers(tp);
+   rtl_set_rx_tx_config_registers(tp);
+   RTL_W8(tp, Cfg9346, Cfg9346_Lock);
+
+   /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
+   RTL_R8(tp, IntrMask);
+   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+   /* no early-rx interrupts */
+   RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
+   rtl_irq_enable_all(tp);
 }
 
 static void rtl8169_set_magic_reg(struct rtl8169_private *tp, unsigned 
mac_version)
@@ -5408,12 +5421,8 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
if (tp->mac_version == RTL_GIGA_MAC_VER_05)
pci_write_config_byte(tp->pci_dev, PCI_CACHE_LINE_SIZE, 0x08);
 
-   RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
-
RTL_W8(tp, EarlyTxThres, NoEarlyTx);
 
-   rtl_set_rx_max_size(tp);
-
tp->cp_cmd |= PCIMulRW;
 
if (tp->mac_version == RTL_GIGA_MAC_VER_02 ||
@@ -5433,20 +5442,7 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
 */
RTL_W16(tp, IntrMitigate, 0x0000);
 
-   rtl_set_rx_tx_desc_registers(tp);
-   rtl_set_rx_tx_config_registers(tp);
-
-   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
-
-   RTL_W8(tp, Cfg9346, Cfg9346_Lock);
-
-   /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
-   RTL_R8(tp, IntrMask);
-
RTL_W32(tp, RxMissed, 0);
-
-   /* no early-rx interrupts */
-   RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
 }
 
 static void rtl_csi_write(struct rtl8169_private *tp, int addr, int value)
@@ -6227,12 +6223,8 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private 
*tp)
 
 static void rtl_hw_start_8168(struct rtl8169_private *tp)
 {
-   RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
-
RTL_W8(tp, MaxTxPacketSize, TxPacketMax);
 
-   rtl_set_rx_max_size(tp);
-
tp->cp_cmd &= ~INTT_MASK;
tp->cp_cmd |= PktCntrDisable | INTT_1;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
@@ -6245,12 +6237,6 @@ static void rtl_hw_start_8168(struct rtl8169_private *tp)
tp->event_slow &= ~RxOverflow;
}
 
-   rtl_set_rx_tx_desc_registers(tp);
-
-   rtl_set_rx_tx_config_registers(tp);
-
-   RTL_R8(tp, IntrMask);
-
switch (tp->mac_version) {
case RTL_GIGA_MAC_VER_11:
rtl_hw_start_8168bb(tp);
@@ -6354,12 +6340,6 @@ static void rtl_hw_start_8168(struct rtl8169_private *tp)
   tp->dev->name, tp->mac_version);
break;
}
-
-   RTL_W8(tp, Cfg9346, Cfg9346_Lock);
-
-   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
-
-   RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
 }
 
 static void rtl_hw_start_8102e_1(struct rtl8169_private *tp)
@@ -6495,19 +6475,11 @@ static void rtl_hw_start_8101(struct rtl8169_private 
*tp)
pcie_capability_set_word(tp->pci_dev, PCI_EXP_DEVCTL,
 PCI_EXP_DEVCTL_NOSNOOP_EN);
 
-   RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
-
RTL_W8(tp, MaxTxPacketSize, TxPacketMax);
 
-   rtl_set_rx_max_size(tp);
-
tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 
-   rtl_set_rx_tx_desc_registers(tp);
-
-   rtl_set_rx_tx_config_registers(tp);
-
 

[PATCH net-next 5/8] r8169: improve handling of CPCMD quirk mask

2018-04-28 Thread Heiner Kallweit
Both quirk masks are the same, so we can merge them. The quirk mask
covers most of the register's bits, so it's actually easier to define
a mask with the bits to keep.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 35 ++--
 1 file changed, 7 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 868dee7d..cf7a7db5 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -690,6 +690,7 @@ enum rtl_rx_desc_bit {
 };
 
 #define RsvdMask   0x3fffc000
+#define CPCMD_QUIRK_MASK   (Normal_mode | RxVlan | RxChkSum | INTT_MASK)
 
 struct TxDesc {
__le32 opts1;
@@ -5643,22 +5644,11 @@ static void rtl_pcie_state_l2l3_enable(struct 
rtl8169_private *tp, bool enable)
RTL_W8(tp, Config3, data);
 }
 
-#define R8168_CPCMD_QUIRK_MASK (\
-   EnableBist | \
-   Mac_dbgo_oe | \
-   Force_half_dup | \
-   Force_rxflow_en | \
-   Force_txflow_en | \
-   Cxpl_dbg_sel | \
-   ASF | \
-   PktCntrDisable | \
-   Mac_dbgo_sel)
-
 static void rtl_hw_start_8168bb(struct rtl8169_private *tp)
 {
RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
 
-   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 
if (tp->dev->mtu <= ETH_DATA_LEN) {
@@ -5687,7 +5677,7 @@ static void __rtl_hw_start_8168cp(struct rtl8169_private 
*tp)
 
rtl_disable_clock_request(tp);
 
-   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
@@ -5717,7 +5707,7 @@ static void rtl_hw_start_8168cp_2(struct rtl8169_private 
*tp)
if (tp->dev->mtu <= ETH_DATA_LEN)
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B);
 
-   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
@@ -5735,7 +5725,7 @@ static void rtl_hw_start_8168cp_3(struct rtl8169_private 
*tp)
if (tp->dev->mtu <= ETH_DATA_LEN)
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B);
 
-   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
@@ -5793,7 +5783,7 @@ static void rtl_hw_start_8168d(struct rtl8169_private *tp)
if (tp->dev->mtu <= ETH_DATA_LEN)
rtl_tx_performance_tweak(tp, PCI_EXP_DEVCTL_READRQ_4096B);
 
-   tp->cp_cmd &= ~R8168_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 }
 
@@ -6394,17 +6384,6 @@ static void rtl_hw_start_8168(struct rtl8169_private *tp)
RTL_W16(tp, MultiIntr, RTL_R16(tp, MultiIntr) & 0xf000);
 }
 
-#define R810X_CPCMD_QUIRK_MASK (\
-   EnableBist | \
-   Mac_dbgo_oe | \
-   Force_half_dup | \
-   Force_rxflow_en | \
-   Force_txflow_en | \
-   Cxpl_dbg_sel | \
-   ASF | \
-   PktCntrDisable | \
-   Mac_dbgo_sel)
-
 static void rtl_hw_start_8102e_1(struct rtl8169_private *tp)
 {
static const struct ephy_info e_info_8102e_1[] = {
@@ -6544,7 +6523,7 @@ static void rtl_hw_start_8101(struct rtl8169_private *tp)
 
rtl_set_rx_max_size(tp);
 
-   tp->cp_cmd &= ~R810X_CPCMD_QUIRK_MASK;
+   tp->cp_cmd &= CPCMD_QUIRK_MASK;
RTL_W16(tp, CPlusCmd, tp->cp_cmd);
 
rtl_set_rx_tx_desc_registers(tp);
-- 
2.17.0




[PATCH net-next 6/8] r8169: simplify rtl_hw_start_8169

2018-04-28 Thread Heiner Kallweit
Currently done:
- if mac_version in (01, 02, 03, 04)
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
- if mac_version in (01, 02, 03, 04)
rtl_set_rx_tx_config_registers(tp);
- if mac_version not in (01, 02, 03, 04)
RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
rtl_set_rx_tx_config_registers(tp);

So we do exactly the same thing independent of the chip version and can
simplify the code.

In addition, remove the call to rtl_init_rxcfg(): it's already called in
rtl_init_one() and the bits it sets are never touched later.
rtl_hw_start_8168/8101 don't include this call either.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 22 ++
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index cf7a7db5..8c816f6c 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5409,24 +5409,11 @@ static void rtl_hw_start_8169(struct rtl8169_private 
*tp)
pci_write_config_byte(tp->pci_dev, PCI_CACHE_LINE_SIZE, 0x08);
 
RTL_W8(tp, Cfg9346, Cfg9346_Unlock);
-   if (tp->mac_version == RTL_GIGA_MAC_VER_01 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_02 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_03 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_04)
-   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
-
-   rtl_init_rxcfg(tp);
 
RTL_W8(tp, EarlyTxThres, NoEarlyTx);
 
rtl_set_rx_max_size(tp);
 
-   if (tp->mac_version == RTL_GIGA_MAC_VER_01 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_02 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_03 ||
-   tp->mac_version == RTL_GIGA_MAC_VER_04)
-   rtl_set_rx_tx_config_registers(tp);
-
tp->cp_cmd |= PCIMulRW;
 
if (tp->mac_version == RTL_GIGA_MAC_VER_02 ||
@@ -5447,14 +5434,9 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
RTL_W16(tp, IntrMitigate, 0x0000);
 
rtl_set_rx_tx_desc_registers(tp);
+   rtl_set_rx_tx_config_registers(tp);
 
-   if (tp->mac_version != RTL_GIGA_MAC_VER_01 &&
-   tp->mac_version != RTL_GIGA_MAC_VER_02 &&
-   tp->mac_version != RTL_GIGA_MAC_VER_03 &&
-   tp->mac_version != RTL_GIGA_MAC_VER_04) {
-   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
-   rtl_set_rx_tx_config_registers(tp);
-   }
+   RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
 
RTL_W8(tp, Cfg9346, Cfg9346_Lock);
 
-- 
2.17.0




Re: [PATCH net-next] net: phy: Fix modular PHYLIB build

2018-04-28 Thread Marcelo Ricardo Leitner
On Fri, Apr 27, 2018 at 12:41:49PM -0700, Florian Fainelli wrote:
> After commit c59530d0d5dc ("net: Move PHY statistics code into PHY
> library helpers") we made net/core/ethtool.c reference symbols which are
> part of the library which can be modular. David introduced a temporary
> fix with 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
> which would prevent such modularity.
>
> This is not desirable of course, so instead, just inline the functions
> into include/linux/phy.h to keep both options available.
>
> Fixes: c59530d0d5dc ("net: Move PHY statistics code into PHY library helpers")
> Fixes: 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
> Signed-off-by: Florian Fainelli 

Confirmed that this also fixes the build when CONFIG_PHYLIB is off.
Thanks.


Re: [PATCH net-next v2 4/7] net: mscc: Add initial Ocelot switch support

2018-04-28 Thread kbuild test robot
Hi Alexandre,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Alexandre-Belloni/Microsemi-Ocelot-Ethernet-switch-support/20180429-024136
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   drivers/net/ethernet/mscc/ocelot_board.c:23:26: sparse: Expected ) at end of 
cast operator
   drivers/net/ethernet/mscc/ocelot_board.c:23:26: sparse: got _be32
   drivers/net/ethernet/mscc/ocelot_board.c:23:26: sparse: cast from unknown 
type
   In file included from include/linux/swab.h:5:0,
from include/uapi/linux/byteorder/little_endian.h:13,
from include/linux/byteorder/little_endian.h:5,
from arch/x86/include/uapi/asm/byteorder.h:5,
from include/asm-generic/bitops/le.h:6,
from arch/x86/include/asm/bitops.h:521,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/net/ethernet/mscc/ocelot_board.c:7:
   drivers/net/ethernet/mscc/ocelot_board.c: In function 'ocelot_parse_ifh':
>> drivers/net/ethernet/mscc/ocelot_board.c:23:27: error: '_be32' undeclared 
>> (first use in this function); did you mean '__be32'?
  ifh[i] = ntohl((__force _be32)ifh[i]);
  ^
   include/uapi/linux/swab.h:114:54: note: in definition of macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
 ^
>> include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
>> '__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
>> include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
>> '___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
>> drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
>> 'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
   drivers/net/ethernet/mscc/ocelot_board.c:23:27: note: each undeclared 
identifier is reported only once for each function it appears in
  ifh[i] = ntohl((__force _be32)ifh[i]);
  ^
   include/uapi/linux/swab.h:114:54: note: in definition of macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
 ^
>> include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
>> '__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
>> include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
>> '___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
>> drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
>> 'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
>> drivers/net/ethernet/mscc/ocelot_board.c:23:33: error: expected ')' before 
>> 'ifh'
  ifh[i] = ntohl((__force _be32)ifh[i]);
^
   include/uapi/linux/swab.h:114:54: note: in definition of macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
 ^
>> include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
>> '__be32_to_cpu'
#define ___ntohl(x) __be32_to_cpu(x)
^
>> include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
>> '___ntohl'
#define ntohl(x) ___ntohl(x)
 ^~~~
>> drivers/net/ethernet/mscc/ocelot_board.c:23:12: note: in expansion of macro 
>> 'ntohl'
  ifh[i] = ntohl((__force _be32)ifh[i]);
   ^
--
   In file included from include/linux/swab.h:5:0,
from include/uapi/linux/byteorder/little_endian.h:13,
from include/linux/byteorder/little_endian.h:5,
from arch/x86/include/uapi/asm/byteorder.h:5,
from include/asm-generic/bitops/le.h:6,
from arch/x86/include/asm/bitops.h:521,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/net//ethernet/mscc/ocelot_board.c:7:
   drivers/net//ethernet/mscc/ocelot_board.c: In function 'ocelot_parse_ifh':
   drivers/net//ethernet/mscc/ocelot_board.c:23:27: error: '_be32' undeclared 
(first use in this function); did you mean '__be32'?
  ifh[i] = ntohl((__force _be32)ifh[i]);
  ^
   include/uapi/linux/swab.h:114:54: note: in definition of macro '__swab32'
#

Re: [PATCH net-next] net: phy: Fix modular PHYLIB build

2018-04-28 Thread David Miller
From: Florian Fainelli 
Date: Fri, 27 Apr 2018 12:41:49 -0700

> After commit c59530d0d5dc ("net: Move PHY statistics code into PHY
> library helpers") we made net/core/ethtool.c reference symbols which are
> part of the library which can be modular. David introduced a temporary
> fix with 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
> which would prevent such modularity.
> 
> This is not desirable of course, so instead, just inline the functions
> into include/linux/phy.h to keep both options available.
> 
> Fixes: c59530d0d5dc ("net: Move PHY statistics code into PHY library helpers")
> Fixes: 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
> Signed-off-by: Florian Fainelli 

Applied, thanks Florian.


[PATCH] can: cc770: fix spelling mistake: "comptibility" -> "compatibility"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in module parameter description text

Signed-off-by: Colin Ian King 
---
 drivers/net/can/cc770/cc770.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/cc770/cc770.c b/drivers/net/can/cc770/cc770.c
index d4dd4da23997..da636a22c542 100644
--- a/drivers/net/can/cc770/cc770.c
+++ b/drivers/net/can/cc770/cc770.c
@@ -73,7 +73,7 @@ MODULE_PARM_DESC(msgobj15_eff, "Extended 29-bit frames for 
message object 15 "
 
 static int i82527_compat;
 module_param(i82527_compat, int, 0444);
-MODULE_PARM_DESC(i82527_compat, "Strict Intel 82527 comptibility mode "
+MODULE_PARM_DESC(i82527_compat, "Strict Intel 82527 compatibility mode "
 "without using additional functions");
 
 /*
-- 
2.17.0



[PATCH] wireless: ipw2100: fix spelling mistake: "decsribed" -> "described"

2018-04-28 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in comment and in the ord_data text

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/intel/ipw2x00/ipw2100.c | 2 +-
 drivers/net/wireless/intel/ipw2x00/ipw2100.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2100.c 
b/drivers/net/wireless/intel/ipw2x00/ipw2100.c
index 236b52423506..7c4f550a1475 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2100.c
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2100.c
@@ -3732,7 +3732,7 @@ IPW2100_ORD(STAT_TX_HOST_REQUESTS, "requested Host Tx's 
(MSDU)"),
IPW2100_ORD(ASSOCIATED_AP_PTR,
"0 if not associated, else pointer to AP table 
entry"),
IPW2100_ORD(AVAILABLE_AP_CNT,
-   "AP's decsribed in the AP table"),
+   "AP's described in the AP table"),
IPW2100_ORD(AP_LIST_PTR, "Ptr to list of available APs"),
IPW2100_ORD(STAT_AP_ASSNS, "associations"),
IPW2100_ORD(STAT_ASSN_FAIL, "association failures"),
diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2100.h 
b/drivers/net/wireless/intel/ipw2x00/ipw2100.h
index 193947865efd..ce3e35f6b60f 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2100.h
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2100.h
@@ -1009,7 +1009,7 @@ typedef enum _ORDINAL_TABLE_1 {   // NS - means Not 
Supported by FW
IPW_ORD_STAT_PERCENT_RETRIES,   // current calculation of % missed tx 
retries
IPW_ORD_ASSOCIATED_AP_PTR,  // If associated, this is ptr to the 
associated
// AP table entry. set to 0 if not associated
-   IPW_ORD_AVAILABLE_AP_CNT,   // # of AP's decsribed in the AP table
+   IPW_ORD_AVAILABLE_AP_CNT,   // # of AP's described in the AP table
IPW_ORD_AP_LIST_PTR,// Ptr to list of available APs
IPW_ORD_STAT_AP_ASSNS,  // # of associations
IPW_ORD_STAT_ASSN_FAIL, // # of association failures
-- 
2.17.0



[PATCH bpf-next 2/2] bpf: Sync bpf.h to tools/

2018-04-28 Thread Andrey Ignatov
The patch syncs bpf.h to tools/.

Signed-off-by: Andrey Ignatov 
---
 tools/include/uapi/linux/bpf.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da77a93..730f448 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1361,7 +1361,7 @@ union bpf_attr {
  * Return
  * 0
  *
- * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int 
optname, char *optval, int optlen)
+ * int bpf_setsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, 
char *optval, int optlen)
  * Description
  * Emulate a call to **setsockopt()** on the socket associated to
  * *bpf_socket*, which must be a full socket. The *level* at
@@ -1435,7 +1435,7 @@ union bpf_attr {
  * Return
  * **SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sock_map_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
+ * int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, 
void *key, u64 flags)
  * Description
  * Add an entry to, or update a *map* referencing sockets. The
  * *skops* is used as a new value for the entry associated to
@@ -1533,7 +1533,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct 
bpf_perf_event_value *buf, u32 buf_size)
+ * int bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct 
bpf_perf_event_value *buf, u32 buf_size)
  * Description
 * For an eBPF program attached to a perf event, retrieve the
  * value of the event counter associated to *ctx* and store it in
@@ -1544,7 +1544,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int 
optname, char *optval, int optlen)
+ * int bpf_getsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, 
char *optval, int optlen)
  * Description
  * Emulate a call to **getsockopt()** on the socket associated to
  * *bpf_socket*, which must be a full socket. The *level* at
@@ -1588,7 +1588,7 @@ union bpf_attr {
  * Return
  * 0
  *
- * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops_kern *bpf_sock, int 
argval)
+ * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
  * Description
  * Attempt to set the value of the **bpf_sock_ops_cb_flags** field
  * for the full TCP socket associated to *bpf_sock_ops* to
@@ -1721,7 +1721,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int 
addr_len)
+ * int bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
  * Description
  * Bind the socket associated to *ctx* to the address pointed by
  * *addr*, of length *addr_len*. This allows for making outgoing
-- 
2.9.5



[PATCH bpf-next 0/2] Fix BPF helpers documentation

2018-04-28 Thread Andrey Ignatov
BPF helpers documentation in UAPI refers to kernel ctx structures where
it should refer to the user-visible ones. Fix it.

Andrey Ignatov (2):
  bpf: Fix helpers ctx struct types in uapi doc
  bpf: Sync bpf.h to tools/

 include/uapi/linux/bpf.h   | 12 ++--
 tools/include/uapi/linux/bpf.h | 12 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

-- 
2.9.5



[PATCH bpf-next 1/2] bpf: Fix helpers ctx struct types in uapi doc

2018-04-28 Thread Andrey Ignatov
Helpers may operate on two types of ctx structures: user-visible ones
(e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
(e.g. `struct bpf_sock_ops_kern`) in the kernel implementation.

UAPI documentation must refer only to user-visible structures.

The patch replaces references to `_kern` structures in the BPF helpers
description with the corresponding user-visible structures.
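
[ Editor's illustration, not part of the patch: a BPF program is always
  written against the user-visible type, which is exactly what the UAPI
  documentation should name. A minimal sketch, assuming the SEC() macro
  and the helper declaration from the selftests' bpf_helpers.h: ]

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int sockops_prog(struct bpf_sock_ops *skops)
  {
  	/* the ctx argument is struct bpf_sock_ops, never the *_kern type */
  	bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTO_CB_FLAG);
  	return 1;
  }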

Signed-off-by: Andrey Ignatov 
---
 include/uapi/linux/bpf.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..730f448 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1361,7 +1361,7 @@ union bpf_attr {
  * Return
  * 0
  *
- * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int 
optname, char *optval, int optlen)
+ * int bpf_setsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, 
char *optval, int optlen)
  * Description
  * Emulate a call to **setsockopt()** on the socket associated to
  * *bpf_socket*, which must be a full socket. The *level* at
@@ -1435,7 +1435,7 @@ union bpf_attr {
  * Return
  * **SK_PASS** on success, or **SK_DROP** on error.
  *
- * int bpf_sock_map_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
+ * int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, 
void *key, u64 flags)
  * Description
  * Add an entry to, or update a *map* referencing sockets. The
  * *skops* is used as a new value for the entry associated to
@@ -1533,7 +1533,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct 
bpf_perf_event_value *buf, u32 buf_size)
+ * int bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct 
bpf_perf_event_value *buf, u32 buf_size)
  * Description
 * For an eBPF program attached to a perf event, retrieve the
  * value of the event counter associated to *ctx* and store it in
@@ -1544,7 +1544,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int 
optname, char *optval, int optlen)
+ * int bpf_getsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, 
char *optval, int optlen)
  * Description
  * Emulate a call to **getsockopt()** on the socket associated to
  * *bpf_socket*, which must be a full socket. The *level* at
@@ -1588,7 +1588,7 @@ union bpf_attr {
  * Return
  * 0
  *
- * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops_kern *bpf_sock, int 
argval)
+ * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval)
  * Description
  * Attempt to set the value of the **bpf_sock_ops_cb_flags** field
  * for the full TCP socket associated to *bpf_sock_ops* to
@@ -1721,7 +1721,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int 
addr_len)
+ * int bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len)
  * Description
  * Bind the socket associated to *ctx* to the address pointed by
  * *addr*, of length *addr_len*. This allows for making outgoing
-- 
2.9.5



Re: ip6-in-ip{4,6} ipsec tunnel issues with 1280 MTU

2018-04-28 Thread David Ahern
On 4/27/18 9:44 AM, Ashwanth Goli wrote:
> On 2018-04-27 20:18, David Ahern wrote:
>> On 4/27/18 5:02 AM, Ashwanth Goli wrote:
>>> On 2018-04-26 17:21, Paolo Abeni wrote:
 Hi,

 [fixed CC list]

 On Wed, 2018-04-25 at 21:43 +0530, Ashwanth Goli wrote:
> Hi Pablo,

 Actually I'm Paolo, but yours is a recurring mistake ;)

> I am noticing an issue similar to the one reported by Alexis Perez
> [Regression for ip6-in-ip4 IPsec tunnel in 4.14.16]
>
> In my IPsec setup outer MTU is set to 1280, ip6_setup_cork sees an MTU
> less than IPV6_MIN_MTU because of the tunnel headers. -EINVAL is being
> returned as a result of the MTU check that got added with below patch.
>>
>> If you know you are running ipsec over the link why are you setting the
>> outer MTU to 1280? RFC 2460 suggests the fragmentation of packets for
>> links with MTU < 1280 should be done below the IPv6 layer:
>>
>> 5. Packet Size Issues
>>
>>    IPv6 requires that every link in the internet have an MTU of 1280
>>    octets or greater.  On any link that cannot convey a 1280-octet
>>    packet in one piece, link-specific fragmentation and reassembly must
>>    be provided at a layer below IPv6.
>>
>>    Links that have a configurable MTU (for example, PPP links [RFC-
>>    1661]) must be configured to have an MTU of at least 1280 octets; it
>>    is recommended that they be configured with an MTU of 1500 octets or
>>    greater, to accommodate possible encapsulations (i.e., tunneling)
>>    without incurring IPv6-layer fragmentation.
> 
> But is this not breaking point (b) from section 7.1 of RFC2473 since the
> inner packet can be smaller than 1280.
> 
> https://tools.ietf.org/html/rfc2473#section-7.1

I don't think so.

Given how Linux works with ipsec (or my understanding of it), your
proposed change seems ok to me.


[PATCH bpf-next] bpf: remove tracepoints from bpf core

2018-04-28 Thread Alexei Starovoitov
tracepoints to bpf core were added as a way to provide introspection
to bpf programs and maps, but after some time it became clear that
this approach is inadequate, so prog_id, map_id and corresponding
get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
introduced and fully adopted by bpftool and other applications.
The tracepoints in bpf core started to rot, causing syzbot warnings:
WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
Kernel panic - not syncing: panic_on_warn set ...
perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
map_update_elem kernel/bpf/syscall.c:597 [inline]
SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
Hence this patch deletes tracepoints in bpf core.
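
[ For context, a minimal sketch of the introspection path that replaced
  these tracepoints, using the syscall wrappers from tools/lib/bpf
  (header path and error handling simplified): ]

  #include <unistd.h>
  #include <linux/types.h>
  #include "bpf/bpf.h"	/* tools/lib/bpf wrappers */

  static void list_prog_ids(void)
  {
  	__u32 id = 0;

  	/* walk all loaded programs via BPF_PROG_GET_NEXT_ID */
  	while (!bpf_prog_get_next_id(id, &id)) {
  		int fd = bpf_prog_get_fd_by_id(id);

  		if (fd < 0)
  			continue;
  		/* bpf_obj_get_info_by_fd(fd, ...) reports name, tag, etc. */
  		close(fd);
  	}
  }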

Reported-by: Eric Biggers 
Reported-by: syzbot 

Signed-off-by: Alexei Starovoitov 
---
 MAINTAINERS|   1 -
 include/linux/bpf_trace.h  |   1 -
 include/trace/events/bpf.h | 355 -
 kernel/bpf/core.c  |   6 -
 kernel/bpf/inode.c |  16 +-
 kernel/bpf/syscall.c   |  11 --
 6 files changed, 1 insertion(+), 389 deletions(-)
 delete mode 100644 include/trace/events/bpf.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a52800867850..537fd17a211b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2727,7 +2727,6 @@ F:Documentation/networking/filter.txt
 F: Documentation/bpf/
 F: include/linux/bpf*
 F: include/linux/filter.h
-F: include/trace/events/bpf.h
 F: include/trace/events/xdp.h
 F: include/uapi/linux/bpf*
 F: include/uapi/linux/filter.h
diff --git a/include/linux/bpf_trace.h b/include/linux/bpf_trace.h
index e6fe98ae3794..ddf896abcfb6 100644
--- a/include/linux/bpf_trace.h
+++ b/include/linux/bpf_trace.h
@@ -2,7 +2,6 @@
 #ifndef __LINUX_BPF_TRACE_H__
 #define __LINUX_BPF_TRACE_H__
 
-#include 
 #include 
 
 #endif /* __LINUX_BPF_TRACE_H__ */
diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
deleted file mode 100644
index 150185647e6b..000000000000
--- a/include/trace/events/bpf.h
+++ /dev/null
@@ -1,355 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM bpf
-
-#if !defined(_TRACE_BPF_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_BPF_H
-
-/* These are only used within the BPF_SYSCALL code */
-#ifdef CONFIG_BPF_SYSCALL
-
-#include 
-#include 
-#include 
-#include 
-
-#define __PROG_TYPE_MAP(FN)\
-   FN(SOCKET_FILTER)   \
-   FN(KPROBE)  \
-   FN(SCHED_CLS)   \
-   FN(SCHED_ACT)   \
-   FN(TRACEPOINT)  \
-   FN(XDP) \
-   FN(PERF_EVENT)  \
-   FN(CGROUP_SKB)  \
-   FN(CGROUP_SOCK) \
-   FN(LWT_IN)  \
-   FN(LWT_OUT) \
-   FN(LWT_XMIT)
-
-#define __MAP_TYPE_MAP(FN) \
-   FN(HASH)\
-   FN(ARRAY)   \
-   FN(PROG_ARRAY)  \
-   FN(PERF_EVENT_ARRAY)\
-   FN(PERCPU_HASH) \
-   FN(PERCPU_ARRAY)\
-   FN(STACK_TRACE) \
-   FN(CGROUP_ARRAY)\
-   FN(LRU_HASH)\
-   FN(LRU_PERCPU_HASH) \
-   FN(LPM_TRIE)
-
-#define __PROG_TYPE_TP_FN(x)   \
-   TRACE_DEFINE_ENUM(BPF_PROG_TYPE_##x);
-#define __PROG_TYPE_SYM_FN(x)  \
-   { BPF_PROG_TYPE_##x, #x },
-#define __PROG_TYPE_SYM_TAB\
-   __PROG_TYPE_MAP(__PROG_TYPE_SYM_FN) { -1, 0 }
-__PROG_TYPE_MAP(__PROG_TYPE_TP_FN)
-
-#define __MAP_TYPE_TP_FN(x)\
-   TRACE_DEFINE_ENUM(BPF_MAP_TYPE_##x);
-#define __MAP_TYPE_SYM_FN(x)   \
-   { BPF_MAP_TYPE_##x, #x },
-#define __MAP_TYPE_SYM_TAB \
-   __MAP_TYPE_MAP(__MAP_TYPE_SYM_FN) { -1, 0 }
-__MAP_TYPE_MAP(__MAP_TYPE_TP_FN)
-
-DECLARE_EVENT_CLASS(bpf_prog_event,
-
-   TP_PROTO(const struct bpf_prog *prg),
-
-   TP_ARGS(prg),
-
-   TP_STRUCT__entry(
-   __array(u8, prog_tag, 8)
-   __field(u32, type)
-   ),
-
-   TP_fast_assign(
-   BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(prg->tag));
-   memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag));
-   __entry->type = prg->type;
-   ),
-
-   TP_printk("prog=%s type=%s",
- __print_hex_str(__entry->prog_tag, 8),
- __print_symbolic(__entry->type, __PROG_TYPE_SYM_TAB))
-);
-
-DEFINE_EVENT(bpf_prog_event, bpf_prog_get_type,
-
-   TP_PROTO(const struct bpf_prog *prg),
-
-   TP_ARGS(prg)
-);
-
-DEFINE_EVENT(bpf_prog_event, bpf_prog_put_rcu,
-
-   TP_PROTO(const struct bpf_prog *prg),
-
-   TP_ARGS(prg)
-);
-
-TRACE_EVENT(bpf_prog_load,
-
-   TP_PROTO(const struct bpf_prog *prg, int ufd),
-
-   TP_ARGS(prg, ufd),
-
-   TP_STRUCT__entry(
-   __array(u8, prog_tag, 8)
-   __field(u32, type)
-   __field(int, ufd)
-   ),
-
-   TP_fast_assign(

Re: [llc_ui_release] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-04-28 Thread Linus Torvalds
On Sat, Apr 28, 2018 at 7:12 PM Fengguang Wu  wrote:

> FYI this happens in mainline kernel 4.17.0-rc2.
> It looks like a new regression.

> It occurs in 5 out of 5 boots.

> [main] 375 sockets created based on info from socket cachefile.
> [main] Generating file descriptors
> [main] Added 83 filenames from /dev
> udevd[507]: failed to execute '/sbin/modprobe' '/sbin/modprobe -bv
platform:regulatory': No such file or directory
> [  372.057947] caif:caif_disconnect_client(): nothing to disconnect
> [  372.082415] BUG: unable to handle kernel NULL pointer dereference at
0000000000000004

I think this is fixed by commit 3a04ce7130a7 ("llc: fix NULL pointer deref
for SOCK_ZAPPED")

Linus


[PATCH bpf-next v3 0/4] Hash support for sock

2018-04-28 Thread John Fastabend
In the original sockmap implementation we got away with using an
array, similar to devmap. However, unlike devmap, where an ifindex
has a nice 1:1 mapping into the map, we have found some use cases
with sockets that need to be referenced using longer keys.

This series adds support for a sockhash map, reusing as much of
the sockmap code as possible. I decided to add sockhash-specific
helpers rather than trying to generalize the existing helpers,
because (a) they have sockmap in the name and (b) the keys are of
different types. I prefer to be explicit here rather than play
type games or do something else tricky.
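
[ Editor's note: concretely, the new helpers mirror the sockmap ones but
  take a void * key instead of a u32 index; signatures sketched from the
  uapi additions in this series: ]

  int bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map,
  			u32 key, u64 flags);
  int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map,
  			 void *key, u64 flags);
  int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map,
  			  void *key, u64 flags);
  int bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map,
  			 void *key, u64 flags);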

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
its own patch, and fixup the test patch stats to catch the
renamed file and provide only diffs +/- on that.

John Fastabend (4):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: bpftool, support for sockhash
  bpf: selftest additions for SOCKHASH

 include/linux/bpf.h|   8 +
 include/linux/bpf_types.h  |   1 +
 include/linux/filter.h |   3 +-
 include/net/tcp.h  |   3 +-
 include/uapi/linux/bpf.h   |   6 +-
 kernel/bpf/core.c  |   1 +
 kernel/bpf/sockmap.c   | 638 ++---
 kernel/bpf/verifier.c  |  14 +-
 net/core/filter.c  |  89 ++-
 tools/bpf/bpftool/map.c|   1 +
 tools/include/uapi/linux/bpf.h |   6 +-
 tools/testing/selftests/bpf/Makefile   |   3 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |   4 +
 tools/testing/selftests/bpf/test_sockmap.c |  27 +-
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   |   6 +-
 15 files changed, 695 insertions(+), 115 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => 
test_sockmap_kern.h} (98%)

-- 
1.9.1



[PATCH bpf-next v3 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap

2018-04-28 Thread John Fastabend
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper code to
work for both bpf map types backing sockets: the currently supported
array-backed sockmap and the new hash-backed map type, sockhash.

Most of the fallout comes from three changes:

  - Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
u32.
  - Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.

Signed-off-by: John Fastabend 
---
 include/linux/filter.h |   3 +-
 include/net/tcp.h  |   3 +-
 kernel/bpf/sockmap.c   | 148 +
 net/core/filter.c  |  31 +++
 4 files changed, 98 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..31cdfe8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -512,9 +512,8 @@ struct sk_msg_buff {
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
struct sk_buff *skb;
struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
} header;   /* For incoming skbs */
struct {
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
void *data_end;
} bpf;
};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..8bda881 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-   struct bpf_map map;
-   struct sock **sock_map;
+struct bpf_sock_progs {
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+   struct bpf_map map;
+   struct sock **sock_map;
+   struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -456,7 +460,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff 
*md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
return ((_rc == SK_PASS) ?
-  (md->map ? __SK_REDIRECT : __SK_PASS) :
+  (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
   __SK_DROP);
 }
 
@@ -1088,7 +1092,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 * when we orphan the skb so that we don't have the possibility
 * to reference a stale map.
 */
-   TCP_SKB_CB(skb)->bpf.map = NULL;
+   TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
skb->sk = psock->sock;
bpf_compute_data_pointers(skb);
preempt_disable();
@@ -1098,7 +1102,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 
/* Moving return codes from UAPI namespace into internal namespace */
return rc == SK_PASS ?
-   (TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+   (TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP;
 }
 
@@ -1368,7 +1372,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-   struct bpf_stab *stab,
struct bpf_prog *verdict,
struct bpf_prog *parse)
 {
@@ -1446,14 +1449,13 @@ static void smap_gc_work(struct work_struct *w)
kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
- struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
struct smap_psock *psock;
 
psock = kzalloc_node(sizeof(struct smap_psock),
 GFP_ATOMIC | __GFP_NOWARN,
-stab->map.numa_node);
+node);
if (!psock)
return ERR_PTR(-ENOMEM);
 
@@ -1658,40 +1660,26 @@ static int sock_map_delete_elem(struct bpf_map *map, 
void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use READ_ONCE and xchg operation

[PATCH bpf-next v3 2/4] bpf: sockmap, add hash map support

2018-04-28 Thread John Fastabend
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap, which also uses four-byte keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example, users may want to use the 5-tuple
of the socket as the lookup key.

To support such use cases, add hash map support.
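
[ Editor's illustration, not part of the patch: a 5-tuple keyed sockhash
  as it could look from a BPF program. The SEC()/bpf_map_def conventions
  are assumed from the selftests' bpf_helpers.h and bpf_endian.h; the
  key layout and map name are hypothetical: ]

  #include <linux/bpf.h>
  #include <linux/in.h>
  #include "bpf_helpers.h"
  #include "bpf_endian.h"

  struct tuple_key {			/* hypothetical 5-tuple key */
  	__u32 saddr;
  	__u32 daddr;
  	__u16 sport;
  	__u16 dport;
  	__u8  proto;
  };

  struct bpf_map_def SEC("maps") sock_hash = {
  	.type		= BPF_MAP_TYPE_SOCKHASH,
  	.key_size	= sizeof(struct tuple_key),
  	.value_size	= sizeof(int),
  	.max_entries	= 1024,
  };

  SEC("sockops")
  int add_sock(struct bpf_sock_ops *skops)
  {
  	struct tuple_key key = {
  		.saddr = skops->local_ip4,
  		.daddr = skops->remote_ip4,
  		.sport = skops->local_port,	/* ctx field is host order */
  		.dport = bpf_ntohl(skops->remote_port), /* network order in ctx */
  		.proto = IPPROTO_TCP,
  	};

  	/* new helper introduced by this patch */
  	bpf_sock_hash_update(skops, &sock_hash, &key, BPF_ANY);
  	return 0;
  }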

Signed-off-by: John Fastabend 
---
 include/linux/bpf.h   |   8 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   6 +-
 kernel/bpf/core.c |   1 +
 kernel/bpf/sockmap.c  | 494 --
 kernel/bpf/verifier.c |  14 +-
 net/core/filter.c |  58 ++
 7 files changed, 564 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..add768a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -661,6 +661,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map 
*map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && 
defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 
key)
@@ -668,6 +669,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct 
bpf_map *map, u32 key)
return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+   void *key)
+{
+   return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
struct bpf_prog *prog,
u32 type)
@@ -693,6 +700,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf..3101118 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..5cb983d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1835,7 +1836,10 @@ struct bpf_stack_build_id {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(sock_hash_update),   \
+   FN(msg_redirect_hash),  \
+   FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..5917cc1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1782,6 +1782,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8bda881..08eb3a5 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -60,6 +60,28 @@ struct bpf_stab {
struct bpf_sock_progs progs;
 };
 
+struct bucket {
+   struct hlist_head head;
+   raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+   struct bpf_map map;
+   struct bucket *buckets;
+   atomic_t count;
+   u32 n_buckets;
+   u32 elem_size;
+   struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+   struct rcu_head rcu;
+   struct hlist_node hash_node;
+   u32 hash;
+   struct sock *sk;
+   char key[0];
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -67,6 +89,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
struct list_head list;
struct sock **entry;
+   struct htab_elem *hash_link;
+   struct bpf_htab *htab;
 };
 
 struct smap_psock {
@@ -195,6 +219,1

[PATCH bpf-next v3 3/4] bpf: selftest additions for SOCKHASH

2018-04-28 Thread John Fastabend
This runs the existing SOCKMAP tests with the SOCKHASH map type. To do
this we push the programs into an include file and build two BPF
programs, one for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h |  6 -
 tools/testing/selftests/bpf/Makefile   |  3 ++-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |  4 
 tools/testing/selftests/bpf/test_sockmap.c | 27 --
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   |  6 ++---
 5 files changed, 34 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => 
test_sockmap_kern.h} (98%)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da77a93..5cb983d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1835,7 +1836,10 @@ struct bpf_stack_build_id {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(sock_hash_update),   \
+   FN(msg_redirect_hash),  \
+   FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index b64a7a3..03f9bf3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
+   test_sockhash_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c 
b/tools/testing/selftests/bpf/test_sockhash_kern.c
new file mode 100644
index 0000000..3bf4ad4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
+#include "./test_sockmap_kern.h"
diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 29c022d..df7afc7 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -47,7 +47,8 @@
#define S1_PORT 10000
 #define S2_PORT 10001
 
-#define BPF_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKHASH_FILENAME "test_sockhash_kern.o"
 #define CG_PATH "/sockmap"
 
 /* global sockets */
@@ -1260,9 +1261,8 @@ static int test_start_end(int cgrp)
BPF_PROG_TYPE_SK_MSG,
 };
 
-static int populate_progs(void)
+static int populate_progs(char *bpf_file)
 {
-   char *bpf_file = BPF_FILENAME;
struct bpf_program *prog;
struct bpf_object *obj;
int i = 0;
@@ -1306,11 +1306,11 @@ static int populate_progs(void)
return 0;
 }
 
-static int test_suite(void)
+static int __test_suite(char *bpf_file)
 {
int cg_fd, err;
 
-   err = populate_progs();
+   err = populate_progs(bpf_file);
if (err < 0) {
fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
return err;
@@ -1347,17 +1347,30 @@ static int test_suite(void)
 
 out:
printf("Summary: %i PASSED %i FAILED\n", passed, failed);
+   cleanup_cgroup_environment();
close(cg_fd);
return err;
 }
 
+static int test_suite(void)
+{
+   int err;
+
+   err = __test_suite(BPF_SOCKMAP_FILENAME);
+   if (err)
+   goto out;
+   err = __test_suite(BPF_SOCKHASH_FILENAME);
+out:
+   return err;
+}
+
 int main(int argc, char **argv)
 {
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int iov_count = 1, length = 1024, rate = 1;
struct sockmap_options options = {0};
int opt, longindex, err, cg_fd = 0;
-   char *bpf_file = BPF_FILENAME;
+   char *bpf_file = BPF_SOCKMAP_FILENAME;
int test = PING_PONG;
 
if (setrlimit(RLIMIT_MEMLOCK, &r)) {
@@ -1438,7 +1451,7 @@ int main(int argc, char **argv)
return -1;
}
 
-   

[PATCH bpf-next v3 4/4] bpf: bpftool, support for sockhash

2018-04-28 Thread John Fastabend
This adds the SOCKHASH map type to bpftool so that we get correct
pretty printing.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index a6cdb64..4420b1a 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -67,6 +67,7 @@
[BPF_MAP_TYPE_DEVMAP]   = "devmap",
[BPF_MAP_TYPE_SOCKMAP]  = "sockmap",
[BPF_MAP_TYPE_CPUMAP]   = "cpumap",
+   [BPF_MAP_TYPE_SOCKHASH] = "sockhash",
 };
 
 static unsigned int get_possible_cpus(void)
-- 
1.9.1



Re: [bpf-next PATCH v2 3/3] bpf: selftest additions for SOCKHASH

2018-04-28 Thread John Fastabend
On 04/27/2018 05:10 PM, Alexei Starovoitov wrote:
> On Fri, Apr 27, 2018 at 04:24:43PM -0700, John Fastabend wrote:
>> This runs existing SOCKMAP tests with SOCKHASH map type. To do this
>> we push programs into include file and build two BPF programs. One
>> for SOCKHASH and one for SOCKMAP.
>>
>> We then run the entire test suite with each type.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  tools/testing/selftests/bpf/Makefile |3 
>>  tools/testing/selftests/bpf/test_sockhash_kern.c |4 
>>  tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
>>  tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
>> --
>>  tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
>> ++
>>  5 files changed, 368 insertions(+), 346 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
>>  create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h
> 
> Looks like it was mainly a rename of test_sockmap_kern.c into .h
> but commit doesn't show it as such.
> Can you redo it with 'git mv' ?
> 

Sure, my scripts didn't have the --find-renames. Anyways should be
better now in v3. Also pushed tools updates into selftest patch.

Thanks,
John


Re: [PATCH net-next 2/2 v4] netns: restrict uevents

2018-04-28 Thread Eric W. Biederman

> + /* fix credentials */
> + if (owning_user_ns != &init_user_ns) {
> + struct netlink_skb_parms *parms = &NETLINK_CB(skb);
> + kuid_t root_uid;
> + kgid_t root_gid;
> +
> + /* fix uid */
> + root_uid = make_kuid(owning_user_ns, 0);
> + if (!uid_valid(root_uid))
> + root_uid = GLOBAL_ROOT_UID;
> + parms->creds.uid = root_uid;
> +
> + /* fix gid */
> + root_gid = make_kgid(owning_user_ns, 0);
> + if (!gid_valid(root_gid))
> + root_gid = GLOBAL_ROOT_GID;
> + parms->creds.gid = root_gid;

One last nit:

You can only make the assignment if the uid is valid,
leaving it GLOBAL_ROOT_UID if the composed uid is invalid.
AKA

/* fix uid */
root_uid = make_kuid(owning_user_ns, 0);
if (uid_valid(root_uid))
parms->creds.uid = root_uid;

/* fix gid */
root_gid = make_kgid(owning_user_ns, 0);
if (gid_valid(root_gid))
parms->creds.gid = root_gid;


One line shorter and I think a little clearer.  I suspect
it even results in better code.

Eric


Re: [PATCH net-next 0/2 v4] netns: uevent filtering

2018-04-28 Thread Eric W. Biederman
Christian Brauner  writes:

> Hey everyone,
>
> This is the new approach to uevent filtering as discussed (see the
> threads in [1], [2], and [3]). It only contains *non-functional
> changes*.
>
> This series deals with fixing up the uevent filtering logic:
> - uevent filtering logic is simplified
> - locking time on uevent_sock_list is minimized
> - tagged and untagged kobjects are handled in separate codepaths
> - permissions for userspace are fixed for network device uevents in
>   network namespaces owned by non-initial user namespaces
>   Udev is now able to see those events correctly, which it couldn't before.
>   For example, moving a physical device into a network namespace not
>   owned by the initial user namespaces before gave:
>
>   root@xen1:~# udevadm --debug monitor -k
>   calling: monitor
>   monitor will print the received events for:
>   KERNEL - the kernel uevent
>
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>
>   and now after the discussion and solution in [3] correctly gives:
>
>   root@xen1:~# udevadm --debug monitor -k
>   calling: monitor
>   monitor will print the received events for:
>   KERNEL - the kernel uevent
>
>   KERNEL[625.301042] add    /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
>   KERNEL[625.301109] move   /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
>   KERNEL[625.301138] move   /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
>   KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
>
> Thanks!
> Christian
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> [2]: https://lkml.org/lkml/2018/4/26/767
> [3]: https://lkml.org/lkml/2018/4/26/738

Again, overall ack.  One last nit that might be worth addressing.

Acked-by: "Eric W. Biederman" 

Eric


Re: [PATCH bpf-next] bpf: remove tracepoints from bpf core

2018-04-28 Thread David Miller
From: Alexei Starovoitov 
Date: Sat, 28 Apr 2018 19:56:37 -0700

> tracepoints to bpf core were added as a way to provide introspection
> to bpf programs and maps, but after some time it became clear that
> this approach is inadequate, so prog_id, map_id and corresponding
> get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
> introduced and fully adopted by bpftool and other applications.
> The tracepoints in bpf core started to rot and cause syzbot warnings:
> WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
> Kernel panic - not syncing: panic_on_warn set ...
> perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
> trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
> map_update_elem kernel/bpf/syscall.c:597 [inline]
> SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
> Hence this patch deletes tracepoints in bpf core.
> 
> Reported-by: Eric Biggers 
> Reported-by: syzbot 
> 
> Signed-off-by: Alexei Starovoitov 

Acked-by: David S. Miller 


[PATCH bpf-next v9 07/10] samples/bpf: move common-purpose trace functions to selftests

2018-04-28 Thread Yonghong Song
There is no functionality change in this patch. The common-purpose
trace functions, including perf_event polling and ksym lookup,
are moved from trace_output_user.c and bpf_load.c to
selftests/bpf/trace_helpers.c so that these functions can
be reused later in selftests.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile|  11 +-
 samples/bpf/bpf_load.c  |  63 --
 samples/bpf/bpf_load.h  |   7 --
 samples/bpf/offwaketime_user.c  |   1 +
 samples/bpf/sampleip_user.c |   1 +
 samples/bpf/spintest_user.c |   1 +
 samples/bpf/trace_event_user.c  |   1 +
 samples/bpf/trace_output_user.c | 110 ++---
 tools/testing/selftests/bpf/trace_helpers.c | 180 
 tools/testing/selftests/bpf/trace_helpers.h |  23 
 10 files changed, 223 insertions(+), 175 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b853581..5e31770 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -49,6 +49,7 @@ hostprogs-y += xdp_adjust_tail
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
+TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 sock_example-objs := sock_example.o $(LIBBPF)
@@ -65,10 +66,10 @@ tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
 tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
 map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
 test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
 test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
@@ -82,8 +83,8 @@ xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
 test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
   test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index feca497..a27ef3c 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -648,66 +648,3 @@ void read_trace_pipe(void)
}
}
 }
-
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
-
-static int ksym_cmp(const void *p1, const void *p2)
-{
-   return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
-}
-
-int load_kallsyms(void)
-{
-   FILE *f = fopen("/proc/kallsyms", "r");
-   char func[256], buf[256];
-   char symbol;
-   void *addr;
-   int i = 0;
-
-   if (!f)
-   return -ENOENT;
-
-   while (!feof(f)) {
-   if (!fgets(buf, sizeof(buf), f))
-   break;
-   if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
-   break;
-   if (!addr)
-   continue;
-   syms[i].addr = (long) addr;
-   syms[i].name = strdup(func);
-   i++;
-   }
-   sym_cnt = i;
-   qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
-   return 0;
-}
-
-struct ksym *ksym_search(long key)
-{
-   int start = 0, end = sym_cnt;
-   int result;
-
-   while (start < end) {
-   size_t mid = start + (end - start) / 2;
-
-   result = key - syms[mid].addr;
-   if (result < 0)
-   end = mid;
-   else if (result > 0)
-   start = mid + 1;
-   else
-   return &syms[mid];
-   }
-
-   if (start >= 1 && syms[start - 1].addr < key &&
-   key < syms[star

[PATCH bpf-next v9 06/10] tools/bpf: add bpf_get_stack helper to tools headers

2018-04-28 Thread Yonghong Song
The tools header file bpf.h is synced with kernel uapi bpf.h.
The new helper is also added to bpf_helpers.h.
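
A minimal usage sketch of the new declaration (hypothetical tracing
program, not part of this patch), also exercising the
BPF_F_USER_BUILD_ID flag defined below:

	struct bpf_stack_build_id ids[32];
	int n;

	/* collect the user stack as buildid+offset records */
	n = bpf_get_stack(ctx, ids, sizeof(ids),
			  BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
	if (n < 0)
		return 0;	/* negative error, e.g. no user context */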

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 42 +--
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da77a93..1afb606 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1767,6 +1767,40 @@ union bpf_attr {
  * **CONFIG_XFRM** configuration option.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
+ * Description
+ * Return a user or a kernel stack in bpf program provided buffer.
+ * To achieve this, the helper needs *ctx*, which is a pointer
+ * to the context on which the tracing program is executed.
+ * To store the stacktrace, the bpf program provides *buf* with
+ * a nonnegative *size*.
+ *
+ * The last argument, *flags*, holds the number of stack frames to
+ * skip (from 0 to 255), masked with
+ * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * the following flags:
+ *
+ * **BPF_F_USER_STACK**
+ * Collect a user space stack instead of a kernel stack.
+ * **BPF_F_USER_BUILD_ID**
+ * Collect buildid+offset instead of ips for user stack,
+ * only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ * **bpf_get_stack**\ () can collect up to
+ * **PERF_MAX_STACK_DEPTH** kernel and user frames, subject
+ * to a sufficiently large buffer size. Note that
+ * this limit can be controlled with the **sysctl** program, and
+ * that it should be manually increased in order to profile long
+ * user stacks (such as stacks for Java programs). To do so, use:
+ *
+ * ::
+ *
+ * # sysctl kernel.perf_event_max_stack=
+ *
+ * Return
+ * a non-negative value equal to or less than size on success, or
+ * a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1835,7 +1869,8 @@ union bpf_attr {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1869,11 +1904,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 69d7b91..265f8e0 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -101,6 +101,8 @@ static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
 static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
 int size, int flags) =
(void *) BPF_FUNC_skb_get_xfrm_state;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+   (void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



[PATCH bpf-next v9 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper

2018-04-28 Thread Yonghong Song
The special property of return values for the helpers bpf_get_stack
and bpf_probe_read_str is captured in the verifier.
Both helpers return either a negative error code or
a length, which is equal to or smaller than the buffer
size argument. With this additional information, the
verifier can avoid requiring conditions such as "retval > bufsize"
in the bpf program. For example, for the code below,
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0 || usize > max_len)
return 0;
The verifier may have the following errors:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (bf) r2 = r1
57: (77) r2 >>= 32
58: (25) if r2 > 0x31f goto pc+33
 R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
 umax_value=18446744069414584320,
 var_off=(0x0; 0x))
 R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0) R9=inv800 R10=fp0,call_-1
59: (1f) r9 -= r8
60: (c7) r1 s>>= 32
61: (bf) r2 = r7
62: (0f) r2 += r1
math between map_value pointer and register with unbounded
min value is not allowed
The failure is due to an llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for the condition while later on "r1"
is used for the map_ptr operation. The verifier is not able to track such
an instruction sequence effectively.

Without the "usize > max_len" condition, there is no such llvm
optimization and the generated code below passes the verifier:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (b7) r1 = 0
54: (bf) r8 = r0
55: (67) r8 <<= 32
56: (c7) r8 s>>= 32
57: (6d) if r1 s> r8 goto pc+24
 R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
 R1=inv0 R6=ctx(id=0,off=0,imm=0)
 R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
 R10=fp0,call_-1
58: (bf) r2 = r7
59: (0f) r2 += r8
60: (1f) r9 -= r8
61: (bf) r1 = r6
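
For reference, a minimal C sketch of the source pattern that now
verifies with just the negative check (hypothetical names; raw_data is
assumed to point into a map value of size max_len, as in the log above):

	usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
	if (usize < 0)
		return 0;
	/* the verifier now knows 0 <= usize <= max_len, so this
	 * pointer arithmetic stays inside the map value (the second
	 * call also relies on the ARSH tracking from patch #5)
	 */
	ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);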

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 253f6bd..988400e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -165,6 +165,8 @@ struct bpf_call_arg_meta {
bool pkt_access;
int regno;
int access_size;
+   s64 msize_smax_value;
+   u64 msize_umax_value;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -1985,6 +1987,12 @@ static int check_func_arg(struct bpf_verifier_env *env, 
u32 regno,
} else if (arg_type_is_mem_size(arg_type)) {
bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
 
+   /* remember the mem_size which may be used later
+* to refine return values.
+*/
+   meta->msize_smax_value = reg->smax_value;
+   meta->msize_umax_value = reg->umax_value;
+
/* The register is SCALAR_VALUE; the access check
 * happens using its boundaries.
 */
@@ -2324,6 +2332,23 @@ static int prepare_func_exit(struct bpf_verifier_env 
*env, int *insn_idx)
return 0;
 }
 
+static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
+  int func_id,
+  struct bpf_call_arg_meta *meta)
+{
+   struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
+
+   if (ret_type != RET_INTEGER ||
+   (func_id != BPF_FUNC_get_stack &&
+func_id != BPF_FUNC_probe_read_str))
+   return;
+
+   ret_reg->smax_value = meta->msize_smax_value;
+   ret_reg->umax_value = meta->msize_umax_value;
+   __reg_deduce_bounds(ret_reg);
+   __reg_bound_offset(ret_reg);
+}
+
 static int check_helper_call(struct bpf_verifier_env *env, int func_id, int 
insn_idx)
 {
const struct bpf_func_proto *fn = NULL;
@@ -2447,6 +2472,8 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
return -EINVAL;
}
 
+   do_refine_retval_range(regs, fn->ret_type, func_id, &meta);
+
err = check_map_func_compatibility(env, meta.map_ptr, func_id);
if (err)
return err;
-- 
2.9.5



[PATCH bpf-next v9 05/10] bpf/verifier: improve register value range tracking with ARSH

2018-04-28 Thread Yonghong Song
When a helper like bpf_get_stack returns an int value
that is later used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64 bits. For example, without this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
56: (c7) r8 s>>= 32
57: R8=inv(id=0)
With this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
56: (c7) r8 s>>= 32
57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
..
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
..
Without improved ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and the program
will be rejected by the verifier.
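
To make the shift pair concrete, a stand-alone C sketch (illustration
only, not part of the patch) of the sign extension that the compiler
emits for a 32-bit signed helper return value:

	/* sign-extend the low 32 bits of r into 64 bits, the way the
	 * BPF_LSH/BPF_ARSH pair at insns 55-56 above does
	 */
	static long sign_extend_low32(unsigned long r)
	{
		r <<= 32;		/* BPF_LSH: low word to the top */
		return (long)r >> 32;	/* BPF_ARSH: replicate sign bit */
	}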

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 include/linux/tnum.h  |  4 +++-
 kernel/bpf/tnum.c | 10 ++
 kernel/bpf/verifier.c | 23 +++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/tnum.h b/include/linux/tnum.h
index 0d2d3da..c7dc2b5 100644
--- a/include/linux/tnum.h
+++ b/include/linux/tnum.h
@@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max);
 /* Arithmetic and logical ops */
 /* Shift a tnum left (by a fixed shift) */
 struct tnum tnum_lshift(struct tnum a, u8 shift);
-/* Shift a tnum right (by a fixed shift) */
+/* Shift (rsh) a tnum right (by a fixed shift) */
 struct tnum tnum_rshift(struct tnum a, u8 shift);
+/* Shift (arsh) a tnum right (by a fixed min_shift) */
+struct tnum tnum_arshift(struct tnum a, u8 min_shift);
 /* Add two tnums, return @a + @b */
 struct tnum tnum_add(struct tnum a, struct tnum b);
 /* Subtract two tnums, return @a - @b */
diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
index 1f4bf68..938d412 100644
--- a/kernel/bpf/tnum.c
+++ b/kernel/bpf/tnum.c
@@ -43,6 +43,16 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
return TNUM(a.value >> shift, a.mask >> shift);
 }
 
+struct tnum tnum_arshift(struct tnum a, u8 min_shift)
+{
+   /* if a.value is negative, arithmetic shifting by minimum shift
+* will have larger negative offset compared to more shifting.
+* If a.value is nonnegative, arithmetic shifting by minimum shift
+* will have larger positive offset compared to more shifting.
+*/
+   return TNUM((s64)a.value >> min_shift, (s64)a.mask >> min_shift);
+}
+
 struct tnum tnum_add(struct tnum a, struct tnum b)
 {
u64 sm, sv, sigma, chi, mu;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6e3f859..712d865 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2974,6 +2974,29 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
+   case BPF_ARSH:
+   if (umax_val >= insn_bitness) {
+   /* Shifts greater than 31 or 63 are undefined.
+* This includes shifts by a negative number.
+*/
+   mark_reg_unknown(env, regs, insn->dst_reg);
+   break;
+   }
+
+   /* Upon reaching here, src_known is true and
+* umax_val is equal to umin_val.
+*/
+   dst_reg->smin_value >>= umin_val;
+   dst_reg->smax_value >>= umin_val;
+   dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
+
+   /* blow away the dst_reg umin_value/umax_value and rely on
+* dst_reg var_off to refine the result.
+*/
+   dst_reg->umin_value = 0;
+   dst_reg->umax_value = U64_MAX;
+   __update_reg_bounds(dst_reg);
+   break;
default:
mark_reg_unknown(env, regs, insn->dst_reg);
break;
-- 
2.9.5



[PATCH bpf-next v9 00/10] bpf: add bpf_get_stack helper

2018-04-28 Thread Yonghong Song
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from the user's perspective.

This patch implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Patches #1 and #2 implemented the core kernel support.
Patch #3 removes two never-hit branches in verifier.
Patches #4 and #5 are two verifier improvements to make
bpf programming easier. Patch #6 synced the new helper
to tools headers. Patch #7 moved perf_event polling code
and ksym lookup code from samples/bpf to
tools/testing/selftests/bpf. Patch #8 added a verifier
test in tools/bpf for new verifier change.
Patches #9 and #10 added tests for raw tracepoint prog
and tracepoint prog respectively.

Changelogs:
  v8 -> v9:
. make function perf_event_mmap (in trace_helpers.c) extern
  to decouple perf_event_mmap and perf_event_poller.
. add jit-enabled handling for kernel stack verification
  in Patch #9. Since we do not have a good way to
  verify a jit-enabled kernel stack, just return true if
  the kernel stack is not empty.
. In patch #9, use raw_syscalls/sys_enter instead of
  sched/sched_switch, and remove the
  "task 1 dd if=/dev/zero of=/dev/null" command, which left
  a dangling process behind after the program exited.

  v7 -> v8:
. rebase on top of latest bpf-next
. simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
. rewrite the description of bpf_get_stack() in uapi bpf.h
  based on new format.
  v6 -> v7:
. do perf callchain buffer allocation inside the
  verifier. so if the prog->has_callchain_buf is set,
  it is guaranteed that the buffer has been allocated.
. change condition "trace_nr <= skip" to "trace_nr < skip"
  so that for zero size buffer, return 0 instead of -EFAULT
  v5 -> v6:
. after refining return register smax_value and umax_value
  for helpers bpf_get_stack and bpf_probe_read_str,
  bounds and var_off of the return register are further refined.
. added missing commit message for tools header sync commit.
. removed one unnecessary empty line.
  v4 -> v5:
. relied on dst_reg->var_off to refine umin_val/umax_val
  in verifier handling BPF_ARSH value range tracking,
  suggested by Edward.
  v3 -> v4:
. fixed a bug when meta ptr is set to NULL in check_func_arg.
. introduced tnum_arshift and added detailed comments for
  the underlying implementation
. avoided using VLA in tools/bpf test_progs.
  v2 -> v3:
. used meta to track helper memory size argument
. implemented range checking for ARSH in verifier
. moved perf event polling and ksym related functions
  from samples/bpf to tools/bpf
. added test to compare build id's between bpf_get_stackid
  and bpf_get_stack
  v1 -> v2:
. fixed compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (10):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose trace functions to selftests
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
  tools/bpf: add a test for bpf_get_stack with tracepoint prog

 include/linux/bpf.h|   1 +
 include/linux/filter.h |   3 +-
 include/linux/tnum.h   |   4 +-
 include/uapi/linux/bpf.h   |  42 +++-
 kernel/bpf/core.c  |   5 +
 kernel/bpf/stackmap.c  |  80 ++-
 kernel/bpf/tnum.c  |  10 +
 kernel/bpf/verifier.c  |  80 ++-
 kernel/trace/bpf_trace.c   |  50 -
 samples/bpf/Makefile   |  11 +-
 samples/bpf/bpf_load.c |  63 --
 samples/bpf/bpf_load.h |   7 -
 samples/bpf/offwaketime_user.c |   1 +
 samples/bpf/sampleip_user.c|   1 +
 samples/bpf/spintest_user.c|   1 +
 samples/bpf/trace_event_user.c |   1 +
 samples/bpf/trace_output_user.c| 110 +-
 tools/include/uapi/linux/bpf.h  

[PATCH bpf-next v9 01/10] bpf: change prototype for stack_map_get_build_id_offset

2018-04-28 Thread Yonghong Song
This patch does not change any functionality. The function prototype
is changed so that the same function can be reused later.

Signed-off-by: Yonghong Song 
---
 kernel/bpf/stackmap.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct 
*vma,
return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
- struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
  u64 *ips, u32 trace_nr, bool user)
 {
int i;
struct vm_area_struct *vma;
-   struct bpf_stack_build_id *id_offs;
-
-   bucket->nr = trace_nr;
-   id_offs = (struct bpf_stack_build_id *)bucket->data;
 
/*
 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct 
bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
-   stack_map_get_build_id_offset(map, new_bucket, ips,
- trace_nr, user);
+   new_bucket->nr = trace_nr;
+   stack_map_get_build_id_offset(
+   (struct bpf_stack_build_id *)new_bucket->data,
+   ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5



[PATCH bpf-next v9 08/10] tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH

2018-04-28 Thread Yonghong Song
The test_verifier already has a few ARSH test cases.
This patch adds a new test case which takes advantage of newly
improved verifier behavior for bpf_get_stack and ARSH.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_verifier.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 165e9dd..1acafe26 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11680,6 +11680,51 @@ static struct bpf_test tests[] = {
.errstr = "BPF_XADD stores into R2 packet",
.prog_type = BPF_PROG_TYPE_XDP,
},
+   {
+   "bpf_get_stack return R0 within range",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 28),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+   BPF_MOV64_IMM(BPF_REG_9, sizeof(struct test_val)),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_3, sizeof(struct test_val)),
+   BPF_MOV64_IMM(BPF_REG_4, 256),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
+   BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_8, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_8, 32),
+   BPF_JMP_REG(BPF_JSLT, BPF_REG_1, BPF_REG_8, 16),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_9),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 32),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_1),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_5, sizeof(struct test_val)),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_5),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_1, 4),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_9),
+   BPF_MOV64_IMM(BPF_REG_4, 0),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map2 = { 4 },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5



[PATCH bpf-next v9 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-28 Thread Yonghong Song
The test attaches a raw_tracepoint program to raw_syscalls/sys_enter.
It gets stacks for user space, kernel space, and user
space with the build_id request. It also gets the user
and kernel stacks into the same buffer with back-to-back
bpf_get_stack helper calls.

If jit is not enabled, the user space application will check
to ensure that the kernel function for raw_tracepoint
___bpf_prog_run is part of the stack.

If jit is enabled, we do not have a reliable way to
verify the kernel stack, so we just assume the kernel stack
is good when the kernel stack size is greater than 0.
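
A hedged condensation of that user-space check (the real code lives in
test_progs.c below; ksym_search and struct ksym come from
trace_helpers.c):

	static int kern_stack_is_good(__u64 *ips, int nr, int jit_enabled)
	{
		int i;

		if (jit_enabled)
			return nr > 0;	/* any non-empty stack passes */
		for (i = 0; i < nr; i++) {
			struct ksym *ks = ksym_search(ips[i]);

			if (ks && strcmp(ks->name, "___bpf_prog_run") == 0)
				return 1;
		}
		return 0;
	}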

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 
 tools/testing/selftests/bpf/test_progs.c   | 172 +++--
 3 files changed, 266 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index b64a7a3..9d76218 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
+   test_get_stack_rawtp.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -58,6 +59,7 @@ $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
+$(OUTPUT)/test_progs: trace_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c 
b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
new file mode 100644
index 000..f6d9f23
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK_RAWTP 100
+struct stack_trace_t {
+   int pid;
+   int kern_stack_size;
+   int user_stack_size;
+   int user_stack_buildid_size;
+   __u64 kern_stack[MAX_STACK_RAWTP];
+   __u64 user_stack[MAX_STACK_RAWTP];
+   struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(__u32),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(struct stack_trace_t),
+   .max_entries = 1,
+};
+
+/* Allocate twice the needed per-cpu space. For the code below
+ *   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ *   if (usize < 0)
+ * return 0;
+ *   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+ *
+ * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64),
+ * verifier will complain that access "raw_data + usize"
+ * with size "max_len - usize" may be out of bound.
+ * The maximum "raw_data + usize" is "raw_data + max_len"
+ * and the maximum "max_len - usize" is "max_len", verifier
+ * concludes that the maximum buffer access range is
+ * "raw_data[0...max_len * 2 - 1]" and hence reject the program.
+ *
+ * Doubling the to-be-used max buffer size can fix this verifier
+ * issue and avoid complicated C programming massaging.
+ * This is an acceptable workaround since there is one entry here.
+ */
+struct bpf_map_def SEC("maps") rawdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
+   .max_entries = 1,
+};
+
+SEC("tracepoint/raw_syscalls/sys_enter")
+int bpf_prog1(void *ctx)
+{
+   int max_len, max_buildid_len, usize, ksize, total_size;
+   struct stack_trace_t *data;
+   void *raw_data;
+   __u32 key = 0;
+
+   data = bpf_map_lookup_elem(&stackdata_map, &key);
+   if (!data)
+   return 0;
+
+   max_len = MAX_STACK_RAWTP * sizeof(__u64);
+   max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
+   data->pid = bpf_get_current_pid_tgid();
+   data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+   data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+   BPF_F_USER_STACK);
+   data->user_stack_build

[PATCH bpf-next v9 02/10] bpf: add bpf_get_stack helper

2018-04-28 Thread Yonghong Song
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from the user's perspective.

This patch implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
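
As a quick illustration of the flags layout documented in the uapi
comment below (hypothetical call, not from this patch): the low byte of
flags holds the skip count, and the higher bits select the mode:

	__u64 buf[128];
	int n;

	/* kernel stack, skipping the first 3 frames */
	n = bpf_get_stack(ctx, buf, sizeof(buf), 3 & BPF_F_SKIP_FIELD_MASK);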

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 42 --
 kernel/bpf/core.c|  5 
 kernel/bpf/stackmap.c| 67 
 kernel/bpf/verifier.c| 19 ++
 kernel/trace/bpf_trace.c | 50 +++-
 7 files changed, 183 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..c553f6f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -692,6 +692,7 @@ extern const struct bpf_func_proto 
bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..64899c0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -468,7 +468,8 @@ struct bpf_prog {
dst_needed:1,   /* Do we need dst entry? */
blinded:1,  /* Was blinded */
is_func:1,  /* program is a bpf function */
-   kprobe_override:1; /* Do we override a kprobe? 
*/
+   kprobe_override:1, /* Do we override a kprobe? 
*/
+   has_callchain_buf:1; /* callchain buffer 
allocated? */
enum bpf_prog_type  type;   /* Type of BPF program */
enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..1afb606 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1767,6 +1767,40 @@ union bpf_attr {
  * **CONFIG_XFRM** configuration option.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
+ * Description
+ * Return a user or a kernel stack in bpf program provided buffer.
+ * To achieve this, the helper needs *ctx*, which is a pointer
+ * to the context on which the tracing program is executed.
+ * To store the stacktrace, the bpf program provides *buf* with
+ * a nonnegative *size*.
+ *
+ * The last argument, *flags*, holds the number of stack frames to
+ * skip (from 0 to 255), masked with
+ * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * the following flags:
+ *
+ * **BPF_F_USER_STACK**
+ * Collect a user space stack instead of a kernel stack.
+ * **BPF_F_USER_BUILD_ID**
+ * Collect buildid+offset instead of ips for user stack,
+ * only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ * **bpf_get_stack**\ () can collect up to
+ * **PERF_MAX_STACK_DEPTH** kernel and user frames, subject
+ * to a sufficiently large buffer size. Note that
+ * this limit can be controlled with the **sysctl** program, and
+ * that it should be manually increased in order to profile long
+ * user stacks (such as stacks for Java programs). To do so, use:
+ *
+ * ::
+ *
+ * # sysctl kernel.perf_event_max_stack=
+ *
+ * Return
+ * a non-negative value equal to or less than size on success, or
+ * a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1835,7 +1869,8 @@ union bpf_attr {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1869,11 +1904,14 @@ enum bpf_func_id

[PATCH bpf-next v9 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog

2018-04-28 Thread Yonghong Song
The test_stacktrace_map and test_stacktrace_build_id tests are
enhanced to call bpf_get_stack in the bpf program to get the
stack trace as well.  The stack traces from bpf_get_stack
and bpf_get_stackid are compared to ensure that for the
same stack, represented by the same hash, the ip addresses
or build id's are the same.
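
Roughly, the bpf-side enhancement looks like this hedged sketch
(stack_amap is an assumed array map indexed by the returned stackid; see
the diffstat below for the touched programs):

	key = bpf_get_stackid(ctx, &stackmap, 0);
	if ((int)key >= 0) {
		/* store the raw trace at the same index so user space
		 * can compare it against the stackmap entry
		 */
		stack_p = bpf_map_lookup_elem(&stack_amap, &key);
		if (stack_p)
			bpf_get_stack(ctx, stack_p, max_len, 0);
	}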

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_progs.c   | 70 --
 .../selftests/bpf/test_stacktrace_build_id.c   | 20 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  | 19 +-
 3 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 0ddbf34..aa336f0 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -906,11 +906,47 @@ static int compare_map_keys(int map1_fd, int map2_fd)
return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len)
+{
+   __u32 key, next_key, *cur_key_p, *next_key_p;
+   char *val_buf1, *val_buf2;
+   int i, err = 0;
+
+   val_buf1 = malloc(stack_trace_len);
+   val_buf2 = malloc(stack_trace_len);
+   cur_key_p = NULL;
+   next_key_p = &key;
+   while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+   err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+   if (err)
+   goto out;
+   err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+   if (err)
+   goto out;
+   for (i = 0; i < stack_trace_len; i++) {
+   if (val_buf1[i] != val_buf2[i]) {
+   err = -1;
+   goto out;
+   }
+   }
+   key = *next_key_p;
+   cur_key_p = &key;
+   next_key_p = &next_key;
+   }
+   if (errno != ENOENT)
+   err = -1;
+
+out:
+   free(val_buf1);
+   free(val_buf2);
+   return err;
+}
+
 static void test_stacktrace_map()
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_map.o";
-   int bytes, efd, err, pmu_fd, prog_fd;
+   int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
struct perf_event_attr attr = {};
__u32 key, val, duration = 0;
struct bpf_object *obj;
@@ -966,6 +1002,10 @@ static void test_stacktrace_map()
if (stackmap_fd < 0)
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (stack_amap_fd < 0)
+   goto disable_pmu;
+
/* give some time for bpf program run */
sleep(1);
 
@@ -987,6 +1027,12 @@ static void test_stacktrace_map()
  "err %d errno %d\n", err, errno))
goto disable_pmu_noerr;
 
+   stack_trace_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+   err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
+   if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu_noerr;
+
goto disable_pmu_noerr;
 disable_pmu:
error_cnt++;
@@ -1080,9 +1126,9 @@ static int extract_build_id(char *build_id, size_t size)
 
 static void test_stacktrace_build_id(void)
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_build_id.o";
-   int bytes, efd, err, pmu_fd, prog_fd;
+   int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
struct perf_event_attr attr = {};
__u32 key, previous_key, val, duration = 0;
struct bpf_object *obj;
@@ -1147,6 +1193,11 @@ static void test_stacktrace_build_id(void)
  err, errno))
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
   == 0);
assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> 
/dev/null") == 0);
@@ -1198,8 +1249,15 @@ static void test_stacktrace_build_id(void)
previous_key = key;
} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
 
-   CHECK(build_id_matches < 1, "build id match",
- "Didn't find expected build ID from the map");
+   if (CHECK(build_id_matches < 1, "build id match",
+ "Didn't find expected build ID from the map"))
+   goto disable_pmu;
+
+

[PATCH bpf-next v9 04/10] bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals

2018-04-28 Thread Yonghong Song
In the verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
an early return happens before the opcode switch. So remove
the dead branches handling BPF_LSH/BPF_RSH when src_known is false.
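
The guard responsible for that early return looks roughly like this
(hedged sketch condensed from adjust_scalar_min_max_vals, shown only to
make the dead-code argument explicit):

	/* shifts with an unknown source register never reach the
	 * BPF_LSH/BPF_RSH cases: they bail out here first
	 */
	if (!src_known &&
	    opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
		__mark_reg_unknown(dst_reg);
		return 0;
	}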

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 988400e..6e3f859 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2940,10 +2940,7 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
dst_reg->umin_value <<= umin_val;
dst_reg->umax_value <<= umax_val;
}
-   if (src_known)
-   dst_reg->var_off = tnum_lshift(dst_reg->var_off, 
umin_val);
-   else
-   dst_reg->var_off = tnum_lshift(tnum_unknown, umin_val);
+   dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val);
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
@@ -2971,11 +2968,7 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
 */
dst_reg->smin_value = S64_MIN;
dst_reg->smax_value = S64_MAX;
-   if (src_known)
-   dst_reg->var_off = tnum_rshift(dst_reg->var_off,
-  umin_val);
-   else
-   dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
+   dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val);
dst_reg->umin_value >>= umax_val;
dst_reg->umax_value >>= umin_val;
/* We may learn something more from the var_off */
-- 
2.9.5



[PATCH bpf-next v2] bpf: Allow bpf_current_task_under_cgroup in interrupt

2018-04-28 Thread Teng Qin
Currently, the bpf_current_task_under_cgroup helper has a check where, if
the BPF program is running in_interrupt(), it will return -EINVAL. This
prevents the helper from being used in many useful scenarios, particularly
BPF programs attached to Perf Events.

This commit removes the check. Tested in a few NMI (Perf Event) and
softirq contexts; the helper returns the correct result.

Signed-off-by: Teng Qin 
---
 kernel/trace/bpf_trace.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 56ba0f2..f94890c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -474,8 +474,6 @@ BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, 
map, u32, idx)
struct bpf_array *array = container_of(map, struct bpf_array, map);
struct cgroup *cgrp;
 
-   if (unlikely(in_interrupt()))
-   return -EINVAL;
if (unlikely(idx >= array->map.max_entries))
return -E2BIG;
 
-- 
2.9.5



Re: [PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats

2018-04-28 Thread Zumeng Chen

On 2018-04-29 02:36, Michael Chan wrote:

On Fri, Apr 27, 2018 at 8:15 PM, Zumeng Chen  wrote:


diff --git a/drivers/net/ethernet/broadcom/tg3.h 
b/drivers/net/ethernet/broadcom/tg3.h
index 3b5e98e..6727d93 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3352,6 +3352,7 @@ struct tg3 {
 struct pci_dev  *pdev_peer;

 struct tg3_hw_stats *hw_stats;
+   bool hw_stats_flag;

You can just add another bit to enum TG3_FLAGS for this purpose.
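
For reference, a hedged sketch of that suggestion using the driver's
existing tg3_flag helpers (the flag name here is hypothetical):

	/* in enum TG3_FLAGS (tg3.h): */
	TG3_FLAG_HW_STATS_VALID,

	/* producer/consumer sides: */
	tg3_flag_set(tp, HW_STATS_VALID);
	...
	if (tg3_flag(tp, HW_STATS_VALID))
		memcpy(stats, tp->hw_stats, sizeof(*stats));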


Right, it's a good idea, I didn't notice it, I'll send V2 with that later.



While this scheme will probably work, I think a better and more
elegant way to fix this is to use RCU.


IMHO, RCU is not necessary here: there are only two simple consumers,
there are no frequent ops on tg3_halt, and no new lock is involved either.

Cheers,
Zumeng



 dma_addr_t  stats_mapping;
 struct work_struct  reset_task;

--
2.9.3