Re: [PATCH v3 1/2] kretprobe: produce sane stack traces

2018-11-08 Thread Aleksa Sarai
On 2018-11-08, Aleksa Sarai  wrote:
> I will attach what I have at the moment to hopefully explain what the
> issue I've found is (re-using the kretprobe architecture but with the
> shadow-stack idea).

Here is the patch I have at the moment (it works, except for the
question I have about how to handle the top-level pt_regs -- I've marked
that code with XXX).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH


--8<-

Since the return address is modified by kretprobe, the various unwinders
can produce invalid and confusing stack traces. ftrace mostly solved
this problem by teaching each unwinder how to find the original return
address for stack trace purposes. This same technique can be applied to
kretprobes by simply adding a pointer to where the return address was
replaced in the stack, and then looking up the relevant
kretprobe_instance when a stack trace is requested.

[WIP: This is currently broken because the *first entry* will not be
  overwritten since it looks like the stack pointer is different
  when we are provided pt_regs. All other addresses are correctly
  handled.]

Signed-off-by: Aleksa Sarai 
---
 arch/x86/events/core.c |  6 +++-
 arch/x86/include/asm/ptrace.h  |  1 +
 arch/x86/kernel/kprobes/core.c |  5 ++--
 arch/x86/kernel/stacktrace.c   | 10 +--
 arch/x86/kernel/unwind_frame.c |  2 ++
 arch/x86/kernel/unwind_guess.c |  5 +++-
 arch/x86/kernel/unwind_orc.c   |  2 ++
 include/linux/kprobes.h| 15 +-
 kernel/kprobes.c   | 55 ++
 9 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index de32741d041a..d71062702179 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2371,7 +2371,11 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
+   /* XXX: Currently broken -- stack_addr(regs) doesn't match entry. */
+   addr = regs->ip;
+   //addr = ftrace_graph_ret_addr(current, &state.graph_idx, addr, 
stack_addr(regs));
+   addr = kretprobe_ret_addr(current, addr, stack_addr(regs));
+   if (perf_callchain_store(entry, addr))
return;
 
for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index ee696efec99f..c4dfafd43e11 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -172,6 +172,7 @@ static inline unsigned long kernel_stack_pointer(struct 
pt_regs *regs)
return regs->sp;
 }
 #endif
+#define stack_addr(regs) ((unsigned long *) kernel_stack_pointer(regs))
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..eb4da885020c 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -69,8 +69,6 @@
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
-#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
-
 #define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
@@ -568,7 +566,8 @@ void arch_prepare_kretprobe(struct kretprobe_instance *ri, 
struct pt_regs *regs)
 {
unsigned long *sara = stack_addr(regs);
 
-   ri->ret_addr = (kprobe_opcode_t *) *sara;
+   ri->ret_addrp = (kprobe_opcode_t **) sara;
+   ri->ret_addr = *ri->ret_addrp;
 
/* Replace the return addr with trampoline addr */
*sara = (unsigned long) &kretprobe_trampoline;
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 7627455047c2..8a4fb3109d6b 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,8 +38,13 @@ static void noinline __save_stack_trace(struct stack_trace 
*trace,
struct unwind_state state;
unsigned long addr;
 
-   if (regs)
-   save_stack_address(trace, regs->ip, nosched);
+   if (regs) {
+   /* XXX: Currently broken -- stack_addr(regs) doesn't match 
entry. */
+   addr = regs->ip;
+   //addr = ftrace_graph_ret_addr(current, &state.graph_idx, addr, 
stack_addr(regs));
+   addr = kretprobe_ret_addr(current, addr, stack_addr(regs));
+   save_stack_address(trace, addr, nosched);
+   }
 
for (unwind_start(&state, task, regs, NULL); !unwind_done(&state);
 unwind_next_frame(&state))

Re: [PATCH v14 00/12] Enable cpuset controller in default hierarchy

2018-11-08 Thread Peter Zijlstra
On Wed, Nov 07, 2018 at 01:32:29PM -0800, Tejun Heo wrote:

> How about the following?
> 
> Interface file is named "cpuset.cpus.partition" as it's a partition of
> cpus.  The writeable keys are "root" and "member" - a partition root
> and a partition member.  When "root" is requested but can't be
> satisfied it shows "root invalid".  This seems pretty consistent with
> cgroup.type and other cpuset files.

I can live with that, thanks!


RE: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread David Laight
From: Martin Steigerwald
> Sent: 07 November 2018 17:05
...
> Its not quite on-topic, but I am curious now: AFAIK PID limit is 16
> bits. Right? Could it be raised to 32 bits? I bet it would be a major
> change throughout different parts of the kernel.

It is probably 15 bits (since -ve pid numbers are used for process groups).

My guess is that userspace and the system call interface will handle 32bit
(signed) pid numbers.
(I don't remember 'linux emulation' being one of the emulations that
would truncate 32bit pids when one of the BDSs went to 32bit pids.)
The main problem will be that big numbers will mess up the alignment
of printouts from ps and top (etc).
This can be mitigated by only allocating 'big' numbers on systems
that have a lot of pids.
You also really want an O(1) allocator.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v10 12/22] kasan, arm64: fix up fault handling logic

2018-11-08 Thread Mark Rutland
On Tue, Nov 06, 2018 at 06:30:27PM +0100, Andrey Konovalov wrote:
> show_pte in arm64 fault handling relies on the fact that the top byte of
> a kernel pointer is 0xff, which isn't always the case with tag-based
> KASAN.

That's for the TTBR1 check, right?

i.e. for the following to work:

if (addr >= VA_START)

... we need the tag bits to be an extension of bit 55...

> 
> This patch resets the top byte in show_pte.
> 
> Reviewed-by: Andrey Ryabinin 
> Reviewed-by: Dmitry Vyukov 
> Signed-off-by: Andrey Konovalov 
> ---
>  arch/arm64/mm/fault.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 7d9571f4ae3d..d9a84d6f3343 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -141,6 +142,8 @@ void show_pte(unsigned long addr)
>   pgd_t *pgdp;
>   pgd_t pgd;
>  
> + addr = (unsigned long)kasan_reset_tag((void *)addr);

... but this ORs in (0xffUL << 56), which is not correct for addresses
which aren't TTBR1 addresses to begin with, where bit 55 is clear, and
throws away useful information.

We could use untagged_addr() here, but that wouldn't be right for
kernels which don't use TBI1, and we'd erroneously report addresses
under the TTBR1 range as being in the TTBR1 range.

I also see that the entry assembly for el{1,0}_{da,ia} clears the tag
for EL0 addresses.

So we could have:

static inline bool is_ttbr0_addr(unsigned long addr)
{
/* entry assembly clears tags for TTBR0 addrs */
return addr < TASK_SIZE_64;
}

static inline bool is_ttbr1_addr(unsigned long addr)
{
/* TTBR1 addresses may have a tag if HWKASAN is in use */
return arch_kasan_reset_tag(addr) >= VA_START;
}

... and use those in the conditionals, leaving the addr as-is for
reporting purposes.

Thanks,
Mark.


Re: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread Matthew Wilcox
On Thu, Nov 08, 2018 at 12:02:49PM +, David Laight wrote:
> From: Martin Steigerwald
> > Sent: 07 November 2018 17:05
> ...
> > Its not quite on-topic, but I am curious now: AFAIK PID limit is 16
> > bits. Right? Could it be raised to 32 bits? I bet it would be a major
> > change throughout different parts of the kernel.
> 
> It is probably 15 bits (since -ve pid numbers are used for process groups).
> 
> My guess is that userspace and the system call interface will handle 32bit
> (signed) pid numbers.
> (I don't remember 'linux emulation' being one of the emulations that
> would truncate 32bit pids when one of the BDSs went to 32bit pids.)
> The main problem will be that big numbers will mess up the alignment
> of printouts from ps and top (etc).
> This can be mitigated by only allocating 'big' numbers on systems
> that have a lot of pids.
> You also really want an O(1) allocator.

The allocator is O(log n) -- it's the IDR allocator, used in cyclic mode.
n in this case is the highest ID which is still in use.  The tree is
log_64(n) levels high.  It walks to the bottom of the tree and puts a
pointer into the tree.  If the cursor has wrapped to the beginning of
the tree, it may encounter a PID which is still in use; if it does,
it does a bitmap scan of that node, and will then walk up the tree,
doing a bitmap scan forward at each level until it finds a free PID.

So it's not exactly O(log(n)), but it's close enough for all practical
purposes.  And more importantly, it doesn't touch a lot of cachelines.
Two or three at each level of the tree that it accesses.  If we went
all the way to a 32-bit PID, the tree would grow to 6 levels deep,
and worst-case would touch 6 + 5 + 4 levels of the tree (starting with
trying to allocate PID 0x, failing, trying to allocate PID 300,
then having to walk all the way forward to find PID 0xe000), so
that's only 45 cachelines.

People care a little too much about O(1)/O(n) behaviour.  Cacheline
behaviour, and good average-case performance without falling off a cliff
in the worst case is much more important.


Re: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread Michal Hocko
On Wed 07-11-18 18:04:59, Martin Steigerwald wrote:
> Michal Hocko - 07.11.18, 17:00:
> > > > otherwise anybody could simply DoS the system
> > > > by consuming all available pids.
> > > 
> > > People can do that today using the instrument of terror widely known
> > > as fork(2). The only thing standing between fork(2) and a full
> > > process table is RLIMIT_NPROC.
> > 
> > not really. If you really do care about pid space depletion then you
> > should use pid cgroup controller.
> 
> Its not quite on-topic, but I am curious now: AFAIK PID limit is 16 
> bits. Right? Could it be raised to 32 bits? I bet it would be a major 
> change throughout different parts of the kernel.
> 
> 16 bits sound a bit low these days, not only for PIDs, but also for 
> connections / ports.

Do you have any specific example of the pid space exhaustion? Well
except for a fork bomb attacks that could be mitigated by the pid cgroup
controller.
-- 
Michal Hocko
SUSE Labs


RE: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread David Laight
From: Matthew Wilcox
> Sent: 08 November 2018 12:28
>
> On Thu, Nov 08, 2018 at 12:02:49PM +, David Laight wrote:
> > From: Martin Steigerwald
> > > Sent: 07 November 2018 17:05
> > ...
> > > Its not quite on-topic, but I am curious now: AFAIK PID limit is 16
> > > bits. Right? Could it be raised to 32 bits? I bet it would be a major
> > > change throughout different parts of the kernel.
> >
> > It is probably 15 bits (since -ve pid numbers are used for process groups).
> >
> > My guess is that userspace and the system call interface will handle 32bit
> > (signed) pid numbers.
> > (I don't remember 'linux emulation' being one of the emulations that
> > would truncate 32bit pids when one of the BDSs went to 32bit pids.)
> > The main problem will be that big numbers will mess up the alignment
> > of printouts from ps and top (etc).
> > This can be mitigated by only allocating 'big' numbers on systems
> > that have a lot of pids.
> > You also really want an O(1) allocator.
> 
> The allocator is O(log n) -- it's the IDR allocator, used in cyclic mode.
> n in this case is the highest ID which is still in use.  The tree is
> log_64(n) levels high.  It walks to the bottom of the tree and puts a
> pointer into the tree.  If the cursor has wrapped to the beginning of
> the tree, it may encounter a PID which is still in use; if it does,
> it does a bitmap scan of that node, and will then walk up the tree,
> doing a bitmap scan forward at each level until it finds a free PID.

Right, but you can choose the pid so that you get a perfect hash.
You can then put a FIFO free list through the unused entries of
the hash table (just an array).
Then pid allocate just picks the oldest free entry and ups the
high bits (that the hash masks out) to make the old value stale.
Almost no cache lines are involved in the whole operation.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread Matthew Wilcox
On Thu, Nov 08, 2018 at 01:42:41PM +, David Laight wrote:
> From: Matthew Wilcox
> > On Thu, Nov 08, 2018 at 12:02:49PM +, David Laight wrote:
> > > This can be mitigated by only allocating 'big' numbers on systems
> > > that have a lot of pids.
> > > You also really want an O(1) allocator.
> > 
> > The allocator is O(log n) -- it's the IDR allocator, used in cyclic mode.
> > n in this case is the highest ID which is still in use.  The tree is
> > log_64(n) levels high.  It walks to the bottom of the tree and puts a
> > pointer into the tree.  If the cursor has wrapped to the beginning of
> > the tree, it may encounter a PID which is still in use; if it does,
> > it does a bitmap scan of that node, and will then walk up the tree,
> > doing a bitmap scan forward at each level until it finds a free PID.
> 
> Right, but you can choose the pid so that you get a perfect hash.
> You can then put a FIFO free list through the unused entries of
> the hash table (just an array).
> Then pid allocate just picks the oldest free entry and ups the
> high bits (that the hash masks out) to make the old value stale.
> Almost no cache lines are involved in the whole operation.

You'd be looking for something like dynamic perfect hashing then?
I think that'd make a great project for someone to try out.  I don't
have the time to look into it myself, and I'm not convinced there's
a real workload that would benefit.  But it'd be a great project for
a university student.



RE: [PATCH v2] Document /proc/pid PID reuse behavior

2018-11-08 Thread David Laight
From: Matthew Wilcox
> Sent: 08 November 2018 14:07
...
> 
> You'd be looking for something like dynamic perfect hashing then?
> I think that'd make a great project for someone to try out.  I don't
> have the time to look into it myself, and I'm not convinced there's
> a real workload that would benefit.  But it'd be a great project for
> a university student.

Been there, done that several times.
Half an afternoon's work.

I'm surprised there isn't a C++ container class that will give
you a 'id' for a pointer (or other large value).
But they all do it the other way around.

It is also trivial to double the size of the array and 'unzip'
the allocated items into their correct halves while adding the
other entry to the free list.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



[PATCH v8 4/8] mm, arm64: untag user addresses in mm/gup.c

2018-11-08 Thread Andrey Konovalov
mm/gup.c provides a kernel interface that accepts user addresses and
manipulates user pages directly (for example get_user_pages, that is used
by the futex syscall). Since a user can provided tagged addresses, we need
to handle such case.

Add untagging to gup.c functions that use user addresses for vma lookup.

Signed-off-by: Andrey Konovalov 
---
 mm/gup.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f76e77a2d34b..6ff310818616 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -677,6 +677,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!nr_pages)
return 0;
 
+   start = untagged_addr(start);
+
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 
/*
@@ -840,6 +842,8 @@ int fixup_user_fault(struct task_struct *tsk, struct 
mm_struct *mm,
struct vm_area_struct *vma;
vm_fault_t ret, major = 0;
 
+   address = untagged_addr(address);
+
if (unlocked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 5/8] lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user

2018-11-08 Thread Andrey Konovalov
strncpy_from_user and strnlen_user accept user addresses as arguments, and
do not go through the same path as copy_from_user and others, so here we
need to handle the case of tagged user addresses separately.

Untag user pointers passed to these functions.

Signed-off-by: Andrey Konovalov 
---
 lib/strncpy_from_user.c | 2 ++
 lib/strnlen_user.c  | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index b53e1b5d80f4..97467cd2bc59 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -106,6 +106,8 @@ long strncpy_from_user(char *dst, const char __user *src, 
long count)
if (unlikely(count <= 0))
return 0;
 
+   src = untagged_addr(src);
+
max_addr = user_addr_max();
src_addr = (unsigned long)src;
if (likely(src_addr < max_addr)) {
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 60d0bbda8f5e..8b5f56466e00 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -108,6 +108,8 @@ long strnlen_user(const char __user *str, long count)
if (unlikely(count <= 0))
return 0;
 
+   str = untagged_addr(str);
+
max_addr = user_addr_max();
src_addr = (unsigned long)str;
if (likely(src_addr < max_addr)) {
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 8/8] selftests, arm64: add a selftest for passing tagged pointers to kernel

2018-11-08 Thread Andrey Konovalov
This patch adds a simple test, that calls the uname syscall with a
tagged user pointer as an argument. Without the kernel accepting tagged
user pointers the test fails with EFAULT.

Signed-off-by: Andrey Konovalov 
---
 tools/testing/selftests/arm64/.gitignore  |  1 +
 tools/testing/selftests/arm64/Makefile| 11 +++
 .../testing/selftests/arm64/run_tags_test.sh  | 12 
 tools/testing/selftests/arm64/tags_test.c | 19 +++
 4 files changed, 43 insertions(+)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c

diff --git a/tools/testing/selftests/arm64/.gitignore 
b/tools/testing/selftests/arm64/.gitignore
new file mode 100644
index ..e8fae8d61ed6
--- /dev/null
+++ b/tools/testing/selftests/arm64/.gitignore
@@ -0,0 +1 @@
+tags_test
diff --git a/tools/testing/selftests/arm64/Makefile 
b/tools/testing/selftests/arm64/Makefile
new file mode 100644
index ..a61b2e743e99
--- /dev/null
+++ b/tools/testing/selftests/arm64/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# ARCH can be overridden by the user for cross compiling
+ARCH ?= $(shell uname -m 2>/dev/null || echo not)
+
+ifneq (,$(filter $(ARCH),aarch64 arm64))
+TEST_GEN_PROGS := tags_test
+TEST_PROGS := run_tags_test.sh
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/arm64/run_tags_test.sh 
b/tools/testing/selftests/arm64/run_tags_test.sh
new file mode 100755
index ..745f11379930
--- /dev/null
+++ b/tools/testing/selftests/arm64/run_tags_test.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo ""
+echo "running tags test"
+echo ""
+./tags_test
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+else
+   echo "[PASS]"
+fi
diff --git a/tools/testing/selftests/arm64/tags_test.c 
b/tools/testing/selftests/arm64/tags_test.c
new file mode 100644
index ..1452ed7d33f9
--- /dev/null
+++ b/tools/testing/selftests/arm64/tags_test.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+
+#define SHIFT_TAG(tag) ((uint64_t)(tag) << 56)
+#define SET_TAG(ptr, tag)  (((uint64_t)(ptr) & ~SHIFT_TAG(0xff)) | \
+   SHIFT_TAG(tag))
+
+int main(void)
+{
+   struct utsname utsname;
+   void *ptr = &utsname;
+   void *tagged_ptr = (void *)SET_TAG(ptr, 0x42);
+   int err = uname(tagged_ptr);
+   return err;
+}
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 0/8] arm64: untag user pointers passed to the kernel

2018-11-08 Thread Andrey Konovalov
arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
 tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
  pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
  pointers")

When passing tagged pointers to syscalls, there's a special case of such a
pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
These syscalls don't do memory accesses but rather deal with memory
ranges, hence an untagged pointer is better suited.

This patchset extends tagged pointer support to non-memory syscalls. This
is done by reusing the untagged_addr macro to untag user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel.

One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.

The following testing approaches has been taken to find potential issues
with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static
   analyzer based on Clang) to track casts of __user pointers to integer
   types to find places where untagging needs to be done.

2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
   a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the requried patches have been added
to the patchset.

This patchset has been merged into the Pixel 2 kernel tree and is now
being used to enable testing of Pixel 2 phones with HWASan.

This patchset is a prerequisite for ARM's memory tagging hardware feature
support [3].

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[2] 
https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292

[3] 
https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
  user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
  the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
  support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
  the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
  passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
  patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).

Changes in v5:
- Added 3 new patches that add untagging to places found with static
  analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
  kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
  defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Reviewed-by: Luc Van Oostenryck 
Signed-off-by: Andrey Konovalov 

Andrey Konovalov (8):
  arm64: add type casts to untagged_addr macro
  uaccess: add untagged_addr definition for other arches
  arm64: untag user addresses in access_ok and __uaccess_mask_ptr
  mm, arm64: untag user addresses in mm/gup.c
  lib, arm64: untag addrs passed to strnc

[PATCH v8 6/8] fs, arm64: untag user address in copy_mount_options

2018-11-08 Thread Andrey Konovalov
In copy_mount_options a user address is being subtracted from TASK_SIZE.
If the address is lower than TASK_SIZE, the size is calculated to not
allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
However if the address is tagged, then the size will be calculated
incorrectly.

Untag the address before subtracting.

Signed-off-by: Andrey Konovalov 
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 98d27da43304..1f1f998d15ed 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2674,7 +2674,7 @@ void *copy_mount_options(const void __user * data)
 * the remainder of the page.
 */
/* copy_from_user cannot cross TASK_SIZE ! */
-   size = TASK_SIZE - (unsigned long)data;
+   size = TASK_SIZE - (unsigned long)untagged_addr(data);
if (size > PAGE_SIZE)
size = PAGE_SIZE;
 
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 3/8] arm64: untag user addresses in access_ok and __uaccess_mask_ptr

2018-11-08 Thread Andrey Konovalov
copy_from_user (and a few other similar functions) are used to copy data
from user memory into the kernel memory or vice versa. Since a user can
provided a tagged pointer to one of the syscalls that use copy_from_user,
we need to correctly handle such pointers.

Do this by untagging user pointers in access_ok and in __uaccess_mask_ptr,
before performing access validity checks.

Signed-off-by: Andrey Konovalov 
---
 arch/arm64/include/asm/uaccess.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index c1325271e368..abc35cba134b 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -104,7 +104,8 @@ static inline unsigned long __range_ok(const void __user 
*addr, unsigned long si
 #define untagged_addr(addr)\
((__typeof__(addr))sign_extend64((__u64)(addr), 55))
 
-#define access_ok(type, addr, size)__range_ok(addr, size)
+#define access_ok(type, addr, size)\
+   __range_ok(untagged_addr(addr), size)
 #define user_addr_max  get_fs
 
 #define _ASM_EXTABLE(from, to) \
@@ -236,7 +237,8 @@ static inline void uaccess_enable_not_uao(void)
 
 /*
  * Sanitise a uaccess pointer such that it becomes NULL if above the
- * current addr_limit.
+ * current addr_limit. In case the pointer is tagged (has the top byte set),
+ * untag the pointer before checking.
  */
 #define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr)
 static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
@@ -244,10 +246,11 @@ static inline void __user *__uaccess_mask_ptr(const void 
__user *ptr)
void __user *safe_ptr;
 
asm volatile(
-   "   bicsxzr, %1, %2\n"
+   "   bicsxzr, %3, %2\n"
"   csel%0, %1, xzr, eq\n"
: "=&r" (safe_ptr)
-   : "r" (ptr), "r" (current_thread_info()->addr_limit)
+   : "r" (ptr), "r" (current_thread_info()->addr_limit),
+ "r" (untagged_addr(ptr))
: "cc");
 
csdb();
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 7/8] arm64: update Documentation/arm64/tagged-pointers.txt

2018-11-08 Thread Andrey Konovalov
Document the changes in Documentation/arm64/tagged-pointers.txt.

Signed-off-by: Andrey Konovalov 
---
 Documentation/arm64/tagged-pointers.txt | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/Documentation/arm64/tagged-pointers.txt 
b/Documentation/arm64/tagged-pointers.txt
index a25a99e82bb1..f4cf1f5cf362 100644
--- a/Documentation/arm64/tagged-pointers.txt
+++ b/Documentation/arm64/tagged-pointers.txt
@@ -17,13 +17,22 @@ this byte for application use.
 Passing tagged addresses to the kernel
 --
 
-All interpretation of userspace memory addresses by the kernel assumes
-an address tag of 0x00.
+The kernel supports tags in pointer arguments (including pointers in
+structures) for a limited set of syscalls, the exceptions are:
 
-This includes, but is not limited to, addresses found in:
+ - memory syscalls: brk, madvise, mbind, mincore, mlock, mlock2, move_pages,
+   mprotect, mremap, msync, munlock, munmap, pkey_mprotect, process_vm_readv,
+   process_vm_writev, remap_file_pages;
 
- - pointer arguments to system calls, including pointers in structures
-   passed to system calls,
+ - ioctls that accept user pointers that describe virtual memory ranges;
+
+ - TCP_ZEROCOPY_RECEIVE setsockopt.
+
+The kernel supports tags in user fault addresses. However the fault_address
+field in the sigcontext struct will contain an untagged address.
+
+All other interpretations of userspace memory addresses by the kernel
+assume an address tag of 0x00, in particular:
 
  - the stack pointer (sp), e.g. when interpreting it to deliver a
signal,
@@ -33,11 +42,7 @@ This includes, but is not limited to, addresses found in:
 
 Using non-zero address tags in any of these locations may result in an
 error code being returned, a (fatal) signal being raised, or other modes
-of failure.
-
-For these reasons, passing non-zero address tags to the kernel via
-system calls is forbidden, and using a non-zero address tag for sp is
-strongly discouraged.
+of failure. Using a non-zero address tag for sp is strongly discouraged.
 
 Programs maintaining a frame pointer and frame records that use non-zero
 address tags may suffer impaired or inaccurate debug and profiling
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 2/8] uaccess: add untagged_addr definition for other arches

2018-11-08 Thread Andrey Konovalov
To allow arm64 syscalls accept tagged pointers from userspace, we must
untag them when they are passed to the kernel. Since untagging is done in
generic parts of the kernel, the untagged_addr macro needs to be defined
for all architectures.

Define it as a noop for other architectures besides arm64.

Signed-off-by: Andrey Konovalov 
---
 include/linux/uaccess.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index efe79c1cdd47..c045b4eff95e 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -13,6 +13,10 @@
 
 #include 
 
+#ifndef untagged_addr
+#define untagged_addr(addr) addr
+#endif
+
 /*
  * Architectures should provide two primitives (raw_copy_{to,from}_user())
  * and get rid of their private instances of copy_{to,from}_user() and
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v8 1/8] arm64: add type casts to untagged_addr macro

2018-11-08 Thread Andrey Konovalov
This patch makes the untagged_addr macro accept all kinds of address types
(void *, unsigned long, etc.) and allows not to specify type casts in each
place where it is used. This is done by using __typeof__.

Signed-off-by: Andrey Konovalov 
---
 arch/arm64/include/asm/uaccess.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 07c34087bd5e..c1325271e368 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -101,7 +101,8 @@ static inline unsigned long __range_ok(const void __user 
*addr, unsigned long si
  * up with a tagged userland pointer. Clear the tag to get a sane pointer to
  * pass on to access_ok(), for instance.
  */
-#define untagged_addr(addr)sign_extend64(addr, 55)
+#define untagged_addr(addr)\
+   ((__typeof__(addr))sign_extend64((__u64)(addr), 55))
 
 #define access_ok(type, addr, size)__range_ok(addr, size)
 #define user_addr_max  get_fs
-- 
2.19.1.930.g4563a0d9d0-goog



Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-08 Thread Jarkko Sakkinen
On Wed, Nov 07, 2018 at 09:09:37AM -0800, Dave Hansen wrote:
> On 11/7/18 8:30 AM, Jarkko Sakkinen wrote:
> >> Does this code run when I type "make kselftest"?  If not, I think we
> >> should rectify that.
> > No, it doesn't. It is just my backup for the non-SDK user space code
> > that I've made that I will use to fork my user space SGX projects in
> > the future.
> > 
> > I can work-out a selftest (and provide a new patch in the series) but
> > I'm still wondering what the enclave should do. I would suggest that
> > we start with an enclave that does just EEXIT and nothing else.
> 
> Yeah, that's a start.  But, a good selftest would include things like
> faults and error conditions.

Great. We can add more entry points to the enclave for different tests
but I'll start with a bare minimum. And yeah but ever goes into next
version I'll document the fault handling.

/Jarkko


Re: [PATCH v3 1/2] kretprobe: produce sane stack traces

2018-11-08 Thread Josh Poimboeuf
On Thu, Nov 08, 2018 at 07:04:48PM +1100, Aleksa Sarai wrote:
> On 2018-11-08, Aleksa Sarai  wrote:
> > I will attach what I have at the moment to hopefully explain what the
> > issue I've found is (re-using the kretprobe architecture but with the
> > shadow-stack idea).
> 
> Here is the patch I have at the moment (it works, except for the
> question I have about how to handle the top-level pt_regs -- I've marked
> that code with XXX).
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> 
> 
> --8<-
> 
> Since the return address is modified by kretprobe, the various unwinders
> can produce invalid and confusing stack traces. ftrace mostly solved
> this problem by teaching each unwinder how to find the original return
> address for stack trace purposes. This same technique can be applied to
> kretprobes by simply adding a pointer to where the return address was
> replaced in the stack, and then looking up the relevant
> kretprobe_instance when a stack trace is requested.
> 
> [WIP: This is currently broken because the *first entry* will not be
>   overwritten since it looks like the stack pointer is different
>   when we are provided pt_regs. All other addresses are correctly
>   handled.]

When you see this problem, what does regs->ip point to?  If it's
pointing to generated code, then we don't _currently_ have a way of
dealing with that.  If it's pointing to a real function, we can fix that
with unwind hints.

-- 
Josh


Re: [PATCH v8 0/8] arm64: untag user pointers passed to the kernel

2018-11-08 Thread Andrey Konovalov
On Thu, Nov 8, 2018 at 3:36 PM, Andrey Konovalov  wrote:

[...]

> Changes in v8:
> - Rebased onto 65102238 (4.20-rc1).
> - Added a note to the cover letter on why syscall wrappers/shims that untag
>   user pointers won't work.
> - Added a note to the cover letter that this patchset has been merged into
>   the Pixel 2 kernel tree.
> - Documentation fixes, in particular added a list of syscalls that don't
>   support tagged user pointers.

Hi Catalin,

I've changed the documentation to be more specific, please take a look.

I haven't done anything about adding a way for the user to find out
that the kernel supports this ABI extension. I don't know what would
the the preferred way to do this, and we haven't received any comments
on that from anybody else. Probing "on some innocuous syscall
currently returning -EFAULT on tagged pointer arguments" works though,
as you mentioned.

As mentioned in the cover letter, this patchset has been merged into
the Pixel 2 kernel tree.

Thanks!


[PATCH v15 00/12] Enable cpuset controller in default hierarchy

2018-11-08 Thread Waiman Long
v15:
 - Change the partition state descriptive text to "root", "member" and
   "root invalid" as suggested by Tejun.
 - Move the descriptive text patch up to before the documentation patch.
 - Address comments raised by PeterZ.

v14:
 - Fix a bug about cpumask handling in patch 4 by using
   CONFIG_CPUMASK_OFFSTACK #ifdef block.
 - Add patch 12 to show descriptive name when reading
   cpuset.sched.partition (suggested by Tejun).
 - Change the function prototype of free_cpumasks() to match that of
   alloc_cpumasks() (suggested by Tom Hromatka).

v13:
 - A major rewrite of the partition code so that there will be no
   auto-turning off anymore. Instead, the partition root can enter into
   an error state that can be restored back to a partition root later on.
 - Patches 1 and 9 are the same as previous version, the rests are
   either new or substantially revised.

v12:
 - Take out the debugging patch to print partitions.
 - Add a patch to force turning off partition flag if newly modified CPU
   list doesn't meet the requirement of being a partition root.
 - Remove some unneeded checking code in update_reserved_cpumask().

v12 patch: https://lkml.org/lkml/2018/8/27/423
v13 patch: https://lkml.org/lkml/2018/10/12/861
v14 patch: https://lkml.org/lkml/2018/10/16/28

The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems" and "sched.partition". The "cpus.effective" and "mems.effective"
will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.partition". It is
a tristate flag file that designates if a cgroup is the root of a new
scheduling domain or partition with its own set of unique list of CPUs
from scheduling perspective disjointed from other partitions. An user
can write only "1" or "0" or the corresponding descriptive text "root" or
"member" into this file to turn on and off partition root.  Depending on
circumstances, a partition root may become invalid and has a flag value
of -1 ("root invalid"). However, if condition becomes favorable again,
it can be changed back to a partition root automatically.

The root cgroup is always a partition root. Multiple levels of partitions
are supported with some limitations. So a container partition root can
behave like a real root.

When a partition root cgroup is removed, its list of exclusive CPUs
will be returned back to the parent's cpus.effective automatically.

A container root can be a partition root with sub-partitions
created underneath it. One difference from the real root is that the
"cpuset.sched.partition" flag isn't present in the real root, but is
present in a container root. This is also true for other cpuset control
files as well as those from the other controllers. This is a general
issue that is not going to be addressed here in this patchset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their effective
counterparts.

Patch 2 defines new data structures to support partitioning.

Patch 3 simplifies the allocation and freeing of cpumasks in the cpuset
code and prepares for use by subsequent patches.

Patch 4 adds a new "sched.partition" control file for setting up multiple
scheduling domains or partitions. A partition root implies cpu_exclusive.

Patch 5 makes new "sched.partition" file to have a new error value of -1
which indicates that the partition root enters into an invalid state
where some of the constraints of a partition root (like cpu_exclusive)
will still hold but it is not a real partition root anymore. This allows
the cpuset to change back to a partition root later on automatically
if the conditions become favorable again.

Patch 6 adds tracking of the number of cpusets that use the parent's
effective_cpus in order to make sure that those cpusets will be properly
updated if their parents effective cpus changes because of changes in
sibling partitions.

Patch 7 makes the hotplug code deal with partition root properly.

Patch 8 updates the scheduling domain genaration code to work with
the new partition feature.

Patch 9 exposes cpus.effective and mems.effective to the root cgroup
as enabling child partitions will take CPUs away from the root cgroup.
So it will be nice to monitor what CPUs are left there.

Patch 10 changes the output of reading cpuset.sched.partition from
integer to the descriptive text "root", "member" or "root invalid"
similar to what cgroup.type is doing. The first two text strings are
also accepted when writing into cpuset.sched.partition.

Patch 11 updates the cgroup v2 documentation file with information
about the new "sched.partition" file.

Patch 12 adds a new read-only "cpus.subpartitions" file that list the
CPUs in the subparts_cpus mask in the cpuset data structure when the
command line option "cgroup_debug" is specified. This is mostly used
for debugging and v

[PATCH v15 02/12] cpuset: Define data structures to support scheduling partition

2018-11-08 Thread Waiman Long
>From a cpuset point of view, a scheduling partition is a group of
cpusets with their own set of exclusive CPUs that are not shared by
other tasks outside the scheduling partition.

In the legacy hierarchy, scheduling partitions are supported indirectly
via the right use of the load balancing and the exclusive CPUs flag
which is not intuitive and can be hard to use.

To fully support the concept of scheduling partitions in the default
hierarchy, we need to add some new field into the cpuset structure as
well as a new tmpmasks structure that is used to pre-allocate cpumasks
at the top level cpuset functions to avoid memory allocation in inner
functions as memory allocation failure in those inner functions may
cause a cpuset to have inconsistent states.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2b5c447..29a2bdc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -110,6 +110,13 @@ struct cpuset {
nodemask_t effective_mems;
 
/*
+* CPUs allocated to child sub-partitions (default hierarchy only)
+* - CPUs granted by the parent = effective_cpus U subparts_cpus
+* - effective_cpus and subparts_cpus are mutually exclusive.
+*/
+   cpumask_var_t subparts_cpus;
+
+   /*
 * This is old Memory Nodes tasks took on.
 *
 * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
@@ -134,6 +141,30 @@ struct cpuset {
 
/* for custom sched domain */
int relax_domain_level;
+
+   /* number of CPUs in subparts_cpus */
+   int nr_subparts_cpus;
+
+   /* partition root state */
+   int partition_root_state;
+};
+
+/*
+ * Partition root states:
+ *
+ *   0 - not a partition root
+ *   1 - partition root
+ */
+#define PRS_DISABLED   0
+#define PRS_ENABLED1
+
+/*
+ * Temporary cpumasks for working with partitions that are passed among
+ * functions to avoid memory allocation in inner functions.
+ */
+struct tmpmasks {
+   cpumask_var_t addmask, delmask; /* For partition root */
+   cpumask_var_t new_cpus; /* For update_cpumasks_hier() */
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -218,9 +249,15 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_partition_root(const struct cpuset *cs)
+{
+   return cs->partition_root_state;
+}
+
 static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
  (1 << CS_MEM_EXCLUSIVE)),
+   .partition_root_state = PRS_ENABLED,
 };
 
 /**
-- 
1.8.3.1



[PATCH v15 08/12] cpuset: Make generate_sched_domains() work with partition

2018-11-08 Thread Waiman Long
The generate_sched_domains() function is modified to make it work
correctly with the newly introduced subparts_cpus mask for scheduling
domains generation.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 82c88c6..3960de7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -769,13 +769,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
+   bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
doms = NULL;
dattr = NULL;
csa = NULL;
 
/* Special case for the 99% of systems with one, full, sched domain */
-   if (is_sched_load_balance(&top_cpuset)) {
+   if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
@@ -798,6 +799,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
csn = 0;
 
rcu_read_lock();
+   if (root_load_balance)
+   csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
@@ -808,6 +811,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * parent's cpus, so just skip them, and then we call
 * update_domain_attr_tree() to calc relax_domain_level of
 * the corresponding sched domain.
+*
+* If root is load-balancing, we can skip @cp if it
+* is a subset of the root's effective_cpus.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
@@ -815,11 +821,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 housekeeping_cpumask(HK_FLAG_DOMAIN
continue;
 
+   if (root_load_balance &&
+   cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+   continue;
+
if (is_sched_load_balance(cp))
csa[csn++] = cp;
 
-   /* skip @cp's subtree */
-   pos_css = css_rightmost_descendant(pos_css);
+   /* skip @cp's subtree if not a partition root */
+   if (!is_partition_root(cp))
+   pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
 
@@ -947,7 +958,12 @@ static void rebuild_sched_domains_locked(void)
 * passing doms with offlined cpu to partition_sched_domains().
 * Anyways, hotplug work item will rebuild sched domains.
 */
-   if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   if (!top_cpuset.nr_subparts_cpus &&
+   !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   goto out;
+
+   if (top_cpuset.nr_subparts_cpus &&
+  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
 
/* Generate domain masks and attrs */
@@ -1367,11 +1383,15 @@ static void update_cpumasks_hier(struct cpuset *cs, 
struct tmpmasks *tmp)
update_tasks_cpumask(cp);
 
/*
-* If the effective cpumask of any non-empty cpuset is changed,
-* we need to rebuild sched domains.
+* On legacy hierarchy, if the effective cpumask of any non-
+* empty cpuset is changed, we need to rebuild sched domains.
+* On default hierarchy, the cpuset needs to be a partition
+* root as well.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
-   is_sched_load_balance(cp))
+   is_sched_load_balance(cp) &&
+  (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+   is_partition_root(cp)))
need_rebuild_sched_domains = true;
 
rcu_read_lock();
-- 
1.8.3.1



[PATCH v15 05/12] cpuset: Add an error state to cpuset.sched.partition

2018-11-08 Thread Waiman Long
When external events like CPU offlining or user events like changing
the cpu list of an ancestor cpuset happen, update_cpumasks_hier()
will be called to update the effective cpus of each of the affected
cpusets. That will then call update_parent_subparts_cpumask() if
partitions are impacted.

Currently, these events may cause update_parent_subparts_cpumask()
to return error if none of the requested cpus are available or it will
consume all the cpus in the parent partition root. Handling these errors
is problematic as the states may become inconsistent.

Instead of letting update_parent_subparts_cpumask() return error, a new
error state (-1) is added to the partition_root_state flag to designate
the fact that the partition is no longer valid. IOW, it is no longer a
real partition root, but the CS_CPU_EXCLUSIVE flag will still be set
as it can be changed back to a real one if favorable change happens
later on.

This new error state is set internally and user cannot write this new
value to "cpuset.sched.partition".

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 153 +
 1 file changed, 129 insertions(+), 24 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 85e0416..ef41f58 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -153,10 +153,19 @@ struct cpuset {
  * Partition root states:
  *
  *   0 - not a partition root
+ *
  *   1 - partition root
+ *
+ *  -1 - invalid partition root
+ *   None of the cpus in cpus_allowed can be put into the parent's
+ *   subparts_cpus. In this case, the cpuset is not a real partition
+ *   root anymore.  However, the CPU_EXCLUSIVE bit will still be set
+ *   and the cpuset can be restored back to a partition root if the
+ *   parent cpuset can give more CPUs back to this child cpuset.
  */
 #define PRS_DISABLED   0
 #define PRS_ENABLED1
+#define PRS_ERROR  -1
 
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -251,7 +260,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static inline int is_partition_root(const struct cpuset *cs)
 {
-   return cs->partition_root_state;
+   return cs->partition_root_state > 0;
 }
 
 static struct cpuset top_cpuset = {
@@ -1021,9 +1030,12 @@ enum subparts_cmd {
  *
  * For partcmd_update, if the optional newmask is specified, the cpu
  * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same.  The function will return
- * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
- * otherwise. In case of error, an error code will be returned.
+ * cpus_allowed is assumed to remain the same. The cpuset should either
+ * be a partition root or an invalid partition root. The partition root
+ * state may change if newmask is NULL and none of the requested CPUs can
+ * be granted by the parent. The function will return 1 if changes to
+ * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
+ * Error code should only be returned when newmask is non-NULL.
  *
  * The partcmd_enable and partcmd_disable commands are used by
  * update_prstate(). The partcmd_update command is used by
@@ -1046,6 +1058,7 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
struct cpuset *parent = parent_cs(cpuset);
int adding; /* Moving cpus from effective_cpus to subparts_cpus */
int deleting;   /* Moving cpus from subparts_cpus to effective_cpus */
+   bool part_error = false;/* Partition error? */
 
lockdep_assert_held(&cpuset_mutex);
 
@@ -1114,13 +1127,48 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
 * addmask = cpus_allowed & parent->effectiveb_cpus
 *
 * Note that parent's subparts_cpus may have been
-* pre-shrunk in case the CPUs granted to the parent
-* by the grandparent changes. So no deletion is needed.
+* pre-shrunk in case there is a change in the cpu list.
+* So no deletion is needed.
 */
adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 parent->effective_cpus);
-   if (cpumask_equal(tmp->addmask, parent->effective_cpus))
-   return -EINVAL;
+   part_error = cpumask_equal(tmp->addmask,
+  parent->effective_cpus);
+   }
+
+   if (cmd == partcmd_update) {
+   int prev_prs = cpuset->partition_root_state;
+
+   /*
+* Check for possible transition between PRS_ENABLED
+* and PRS_ERROR.
+*/
+   switch (cpuset->partition_root_state) {
+   case PRS_ENABLED:
+   if (part_error)
+ 

[PATCH v15 12/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug

2018-11-08 Thread Waiman Long
For debugging purpose, it will be useful to expose the content of the
subparts_cpus as a read-only file to see if the code work correctly.
However, subparts_cpus will not be used at all in most use cases. So
adding a new cpuset file that clutters the cgroup directory may not be
desirable.  This is now being done by using the hidden "cgroup_debug"
kernel command line option to expose a new "cpuset.cpus.subpartitions"
file.

That option was originally used by the debug controller to expose
itself when configured into the kernel. This is now extended to set an
internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
can now be used to specify that a cgroup file should only be created
when the "cgroup_debug" option is specified.

Signed-off-by: Waiman Long 
---
 include/linux/cgroup-defs.h |  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup.c  | 14 +-
 kernel/cgroup/cpuset.c  | 11 +++
 kernel/cgroup/debug.c   |  4 +---
 5 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694f..8fcbae1 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -92,6 +92,7 @@ enum {
 
CFTYPE_NO_PREFIX= (1 << 3), /* (DON'T USE FOR NEW FILES) no 
subsys prefix */
CFTYPE_WORLD_WRITABLE   = (1 << 4), /* (DON'T USE FOR NEW FILES) 
S_IWUGO */
+   CFTYPE_DEBUG= (1 << 5), /* create when cgroup_debug */
 
/* internal flags, do not use outside cgroup core proper */
__CFTYPE_ONLY_ON_DFL= (1 << 16),/* only on default hierarchy */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 75568fc..c950864 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -11,6 +11,8 @@
 #define TRACE_CGROUP_PATH_LEN 1024
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+extern bool cgroup_debug;
+extern void __init enable_debug_cgroup(void);
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6aaf5dd..f0a8c68 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -86,6 +86,7 @@
 
 DEFINE_SPINLOCK(trace_cgroup_path_lock);
 char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
+bool cgroup_debug __read_mostly;
 
 /*
  * Protects cgroup_idr and css_idr so that IDs can be released without
@@ -3639,7 +3640,8 @@ static int cgroup_addrm_files(struct cgroup_subsys_state 
*css,
continue;
if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
continue;
-
+   if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug)
+   continue;
if (is_add) {
ret = cgroup_add_file(css, cgrp, cft);
if (ret) {
@@ -5743,6 +5745,16 @@ static int __init cgroup_disable(char *str)
 }
 __setup("cgroup_disable=", cgroup_disable);
 
+void __init __weak enable_debug_cgroup(void) { }
+
+static int __init enable_cgroup_debug(char *str)
+{
+   cgroup_debug = true;
+   enable_debug_cgroup();
+   return 1;
+}
+__setup("cgroup_debug", enable_cgroup_debug);
+
 /**
  * css_tryget_online_from_dir - get corresponding css from a cgroup dentry
  * @dentry: directory dentry of interest
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c739fda..b897314 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2204,6 +2204,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
FILE_EFFECTIVE_MEMLIST,
+   FILE_SUBPARTS_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
@@ -2382,6 +2383,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, 
void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", 
nodemask_pr_args(&cs->effective_mems));
break;
+   case FILE_SUBPARTS_CPULIST:
+   seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus));
+   break;
default:
ret = -EINVAL;
}
@@ -2634,6 +2638,13 @@ static ssize_t sched_partition_write(struct 
kernfs_open_file *of, char *buf,
.flags = CFTYPE_NOT_ON_ROOT,
},
 
+   {
+   .name = "cpus.subpartitions",
+   .seq_show = cpuset_common_seq_show,
+   .private = FILE_SUBPARTS_CPULIST,
+   .flags = CFTYPE_DEBUG,
+   },
+
{ } /* terminate */
 };
 
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 9caeda6..5f1b873 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -373,11 +373,9 @@ struct cgroup_subsys debug_cgrp_subsys = {
  * On v2, debug is an implicit controller enabled by "cgroup_debu

[PATCH v15 10/12] cpuset: Use descriptive text when reading/writing cpuset.sched.partition

2018-11-08 Thread Waiman Long
Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
read. A person who is not familiar with the partition code may not
understand what they mean.

In order to make cpuset.sched.partition more user-friendly, it will
now display the following descriptive text on read:

  "root" - A partition root (top cpuset of a partition)
  "member" - A non-root member of a partition
  "root invalid" - An invalid partition root

Note that there is at least one partition in the whole cgroup hierarchy.
The top cpuset is the root of that partition.  The rests are either a
root if it starts a new partition or a member of a partition.

The cpuset.sched.partition file will now also accept "root" and
"member" besides 1 and 0 as valid input values. The "root invalid"
value is internal only and cannot be written to the file.

Suggested-by: Tejun Heo 
Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 58 --
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fc1a809..c739fda 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2278,9 +2278,6 @@ static int cpuset_write_s64(struct cgroup_subsys_state 
*css, struct cftype *cft,
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
retval = update_relax_domain_level(cs, val);
break;
-   case FILE_PARTITION_ROOT:
-   retval = update_prstate(cs, val);
-   break;
default:
retval = -EINVAL;
break;
@@ -2431,8 +2428,6 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state 
*css, struct cftype *cft)
switch (type) {
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
return cs->relax_domain_level;
-   case FILE_PARTITION_ROOT:
-   return cs->partition_root_state;
default:
BUG();
}
@@ -2441,6 +2436,55 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state 
*css, struct cftype *cft)
return 0;
 }
 
+static int sched_partition_show(struct seq_file *seq, void *v)
+{
+   struct cpuset *cs = css_cs(seq_css(seq));
+
+   switch (cs->partition_root_state) {
+   case PRS_ENABLED:
+   seq_puts(seq, "root\n");
+   break;
+   case PRS_DISABLED:
+   seq_puts(seq, "member\n");
+   break;
+   case PRS_ERROR:
+   seq_puts(seq, "root invalid\n");
+   break;
+   }
+   return 0;
+}
+
+static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
+size_t nbytes, loff_t off)
+{
+   struct cpuset *cs = css_cs(of_css(of));
+   int val;
+   int retval = -ENODEV;
+
+   buf = strstrip(buf);
+
+   /*
+* Convert "root"/"1" to 1, and convert "member"/"0" to 0.
+*/
+   if (!strcmp(buf, "root") || !strcmp(buf, "1"))
+   val = PRS_ENABLED;
+   else if (!strcmp(buf, "member") || !strcmp(buf, "0"))
+   val = PRS_DISABLED;
+   else
+   return -EINVAL;
+
+   css_get(&cs->css);
+   mutex_lock(&cpuset_mutex);
+   if (!is_cpuset_online(cs))
+   goto out_unlock;
+
+   retval = update_prstate(cs, val);
+out_unlock:
+   mutex_unlock(&cpuset_mutex);
+   css_put(&cs->css);
+   return retval ?: nbytes;
+}
+
 /*
  * for the common functions, 'private' gives the type of file
  */
@@ -2584,8 +2628,8 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state 
*css, struct cftype *cft)
 
{
.name = "sched.partition",
-   .read_s64 = cpuset_read_s64,
-   .write_s64 = cpuset_write_s64,
+   .seq_show = sched_partition_show,
+   .write = sched_partition_write,
.private = FILE_PARTITION_ROOT,
.flags = CFTYPE_NOT_ON_ROOT,
},
-- 
1.8.3.1



[PATCH v15 11/12] cpuset: Add documentation about the new "cpuset.sched.partition" flag

2018-11-08 Thread Waiman Long
The cgroup-v2.rst file is updated to document the purpose of the new
"cpuset.sched.partition" flag and how its usage.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/cgroup-v2.rst | 73 +
 1 file changed, 73 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 595b075..f83a523 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1708,6 +1708,79 @@ Cpuset Interface Files
 
Its value will be affected by memory nodes hotplug events.
 
+  cpuset.sched.partition
+   A read-write single value file which exists on non-root
+   cpuset-enabled cgroups.  This flag is owned by the parent cgroup
+   and is not delegatable.
+
+It accepts only the following input values when written to.
+
+"root" or "1"   - a paritition root
+"member" or "0" - a non-root member of a partition
+
+   When set to be a partition root, the current cgroup is the
+   root of a new partition or scheduling domain that comprises
+   itself and all its descendants except those that are separate
+   partition roots themselves and their descendants.  The root
+   cgroup is always a partition root.
+
+   There are constraints on where a partition root can be set.
+   It can only be set in a cgroup if all the following conditions
+   are true.
+
+   1) The "cpuset.cpus" is not empty and the list of CPUs are
+  exclusive, i.e. they are not shared by any of its siblings.
+   2) The parent cgroup is a partition root.
+   3) The "cpuset.cpus" is also a proper subset of the parent's
+  "cpuset.cpus.effective".
+   4) There is no child cgroups with cpuset enabled.  This is for
+  eliminating corner cases that have to be handled if such a
+  condition is allowed.
+
+   Setting it to partition root will take the CPUs away from the
+   effective CPUs of the parent cgroup.  Once it is set, this
+   file cannot be reverted back to "member" if there are any child
+   cgroups with cpuset enabled.
+
+   A parent partition cannot distribute all its CPUs to its
+   child partitions.  There must be at least one cpu left in the
+   parent partition.
+
+   Once becoming a partition root, changes to "cpuset.cpus" is
+   generally allowed as long as the first condition above is true,
+   the change will not take away all the CPUs from the parent
+   partition and the new "cpuset.cpus" value is a superset of its
+   children's "cpuset.cpus" values.
+
+   Sometimes, external factors like changes to ancestors'
+   "cpuset.cpus" or cpu hotplug can cause the state of the partition
+   root to change.  On read, the "cpuset.sched.partition" file
+   can show the following values.
+
+   "member"   Non-root member of a partition
+   "root" Partition root
+   "root invalid" Invalid partition root
+
+   It is a partition root if the first 2 partition root conditions
+   above are true and at least one CPU from "cpuset.cpus" is
+   granted by the parent cgroup.
+
+   A partition root can become invalid if none of CPUs requested
+   in "cpuset.cpus" can be granted by the parent cgroup or the
+   parent cgroup is no longer a partition root itself.  In this
+   case, it is not a real partition even though the restriction
+   of the first partition root condition above will still apply.
+   The cpu affinity of all the tasks in the cgroup will then be
+   associated with CPUs in the nearest ancestor partition.
+
+   An invalid partition root can be transitioned back to a
+   real partition root if at least one of the requested CPUs
+   can now be granted by its parent.  In this case, the cpu
+   affinity of all the tasks in the formerly invalid partition
+   will be associated to the CPUs of the newly formed partition.
+   Changing the partition state of an invalid partition root to
+   "member" is always allowed even if child cpusets are present.
+
 
 Device controller
 -
-- 
1.8.3.1



[PATCH v15 01/12] cpuset: Enable cpuset controller in default hierarchy

2018-11-08 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hiearachy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or may be to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++--
 kernel/cgroup/cpuset.c  |  48 +-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 476722b..01b70f6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
  5-3-3-2. IO Latency Interface Files
  5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-5. Cpuset
+   5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1610,6 +1612,103 @@ through fork() or clone(). These will return -EAGAIN if 
the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+--
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the requested CPUs to be used by tasks within this
+   cgroup.  The actual list of CPUs to be granted, however, is
+   subjected to constraints imposed by its parent and can differ
+   from the requested CPUs.
+
+   The CPU numbers are comma-separated numbers or ranges.
+   For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+   A read-only multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the onlined CPUs that are actually granted to this
+   cgroup by its parent.  These CPUs are allowed to be used by
+   tasks within the current cgroup.
+
+   If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+   all the CPUs from the parent cgroup that can be available to
+   be used by this cgroup.  Otherwise, it should be a subset of
+   "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+   can be granted.  In this case, it will be treated just like an
+   empty "cpuset.cpus".
+
+   Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the requested memory nodes to be used by tasks within
+   this cgroup.  The actual list of memory nodes granted, however,
+   is subjected to constraints imposed by its parent and can differ
+   from the requested memory nodes.
+
+   The memory node numbers are comma-separated numbers or ranges.
+   For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is us

[PATCH v15 04/12] cpuset: Add new v2 cpuset.sched.partition flag

2018-11-08 Thread Waiman Long
A new cpuset.sched.partition boolean flag is added to cpuset v2.
This new flag, if set, indicates that the cgroup is the root of a
new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new subparts_cpus mask that
holds the CPUs belonging to child partitions so that:

subparts_cpus | effective_cpus = cpus_allowed
subparts_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
partition root itself. The state of this flag cannot be changed if the
cpuset has children.

Once turned on, further changes to "cpuset.cpus" is allowed as long
as there is at least one CPU left that can be granted from the parent
and a child partition root cannot use up all the CPUs in the parent's
effective_cpus.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 365 +++--
 1 file changed, 352 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2ac9437..85e0416 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -970,10 +970,191 @@ static void update_tasks_cpumask(struct cpuset *cs)
css_task_iter_end(&it);
 }
 
+/**
+ * compute_effective_cpumask - Compute the effective cpumask of the cpuset
+ * @new_cpus: the temp variable for the new effective_cpus mask
+ * @cs: the cpuset the need to recompute the new effective_cpus mask
+ * @parent: the parent cpuset
+ *
+ * If the parent has subpartition CPUs, include them in the list of
+ * allowable CPUs in computing the new effective_cpus mask.
+ */
+static void compute_effective_cpumask(struct cpumask *new_cpus,
+ struct cpuset *cs, struct cpuset *parent)
+{
+   if (parent->nr_subparts_cpus) {
+   cpumask_or(new_cpus, parent->effective_cpus,
+  parent->subparts_cpus);
+   cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+   } else {
+   cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+   }
+}
+
+/*
+ * Commands for update_parent_subparts_cpumask
+ */
+enum subparts_cmd {
+   partcmd_enable, /* Enable partition root */
+   partcmd_disable,/* Disable partition root*/
+   partcmd_update, /* Update parent's subparts_cpus */
+};
+
+/**
+ * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests change in partition root state
+ * @cmd: Partition root state change command
+ * @newmask: Optional new cpumask for partcmd_update
+ * @tmp: Temporary addmask and delmask
+ * Return:   0, 1 or an error code
+ *
+ * For partcmd_enable, the cpuset is being transformed from a non-partition
+ * root to a partition root. The cpus_allowed mask of the given cpuset will
+ * be put into parent's subparts_cpus and taken away from parent's
+ * effective_cpus. The function will return 0 if all the CPUs listed in
+ * cpus_allowed can be granted or an error code will be returned.
+ *
+ * For partcmd_disable, the cpuset is being transofrmed from a partition
+ * root back to a non-partition root. any CPUs in cpus_allowed that are in
+ * parent's subparts_cpus will be taken away from that cpumask and put back
+ * into parent's effective_cpus. 0 should always be returned.
+ *
+ * For partcmd_update, if the optional newmask is specified, the cpu
+ * list is to be changed from cpus_allowed to newmask. Otherwise,
+ * cpus_allowed is assumed to remain the same.  The function will return
+ * 1 if changes to parent's subparts_cpus and effective_cpus happen or 0
+ * otherwise. In case of error, an error code will be returned.
+ *
+ * The partcmd_enable and partcmd_disable commands are used by
+ * update_prstate(). The partcmd_update command is used by
+ * update_cpumasks_hier() with newmask NULL and update_cpumask() with
+ * newmask set.
+ *
+ * The checking is more strict when enabling partition root than the
+ * other two commands.
+ *
+ * Because of the implicit cpu exclusive nature of a partition root,
+ * cpumask changes that violates the cpu exclusivity rule will not be
+ * permitted when checked by validate_change(). The validate_change()
+ * function will also prevent any changes to the cpu list if it is not
+ * a superset of children's cpu lists.
+ */
+static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+ struct cpumask *newmask,
+ struct 

[PATCH v15 07/12] cpuset: Make CPU hotplug work with partition

2018-11-08 Thread Waiman Long
When there is a cpu hotplug event (CPU online or offline), the partitions
may need to be reconfigured and regenerated. So code is added to the
hotplug functions to make them work with new subparts_cpus mask to
compute the right effective_cpus for each of the affected cpusets.
It may also change the state of a partition root from real one to an
erroneous one or vice versa.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 131 +++--
 1 file changed, 116 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 21eaa89..82c88c6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -113,6 +113,9 @@ struct cpuset {
 * CPUs allocated to child sub-partitions (default hierarchy only)
 * - CPUs granted by the parent = effective_cpus U subparts_cpus
 * - effective_cpus and subparts_cpus are mutually exclusive.
+*
+* effective_cpus contains only onlined CPUs, but subparts_cpus
+* may have offlined ones.
 */
cpumask_var_t subparts_cpus;
 
@@ -994,7 +997,9 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has subpartition CPUs, include them in the list of
- * allowable CPUs in computing the new effective_cpus mask.
+ * allowable CPUs in computing the new effective_cpus mask. Since offlined
+ * CPUs are not removed from subparts_cpus, we have to use cpu_active_mask
+ * to mask those out.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
  struct cpuset *cs, struct cpuset *parent)
@@ -1003,6 +1008,7 @@ static void compute_effective_cpumask(struct cpumask 
*new_cpus,
cpumask_or(new_cpus, parent->effective_cpus,
   parent->subparts_cpus);
cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+   cpumask_and(new_cpus, new_cpus, cpu_active_mask);
} else {
cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
}
@@ -1125,9 +1131,20 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
/*
 * Return error if the new effective_cpus could become empty.
 */
-   if (adding && !deleting &&
-   cpumask_equal(parent->effective_cpus, tmp->addmask))
-   return -EINVAL;
+   if (adding &&
+   cpumask_equal(parent->effective_cpus, tmp->addmask)) {
+   if (!deleting)
+   return -EINVAL;
+   /*
+* As some of the CPUs in subparts_cpus might have
+* been offlined, we need to compute the real delmask
+* to confirm that.
+*/
+   if (!cpumask_and(tmp->addmask, tmp->delmask,
+cpu_active_mask))
+   return -EINVAL;
+   cpumask_copy(tmp->addmask, parent->effective_cpus);
+   }
} else {
/*
 * partcmd_update w/o newmask:
@@ -1197,6 +1214,10 @@ static int update_parent_subparts_cpumask(struct cpuset 
*cpuset, int cmd,
if (deleting) {
cpumask_andnot(parent->subparts_cpus,
   parent->subparts_cpus, tmp->delmask);
+   /*
+* Some of the CPUs in subparts_cpus might have been offlined.
+*/
+   cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask);
cpumask_or(parent->effective_cpus,
   parent->effective_cpus, tmp->delmask);
}
@@ -2863,20 +2884,29 @@ static void remove_tasks_in_empty_cpuset(struct cpuset 
*cs)
update_tasks_nodemask(cs);
 }
 
+static bool force_rebuild;
+
+void cpuset_force_rebuild(void)
+{
+   force_rebuild = true;
+}
+
 /**
  * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
  * @cs: cpuset in interest
+ * @tmp: the tmpmasks structure pointer
  *
  * Compare @cs's cpu and mem masks against top_cpuset and if some have gone
  * offline, update @cs accordingly.  If @cs ends up with no CPU or memory,
  * all its tasks are moved to the nearest ancestor with both resources.
  */
-static void cpuset_hotplug_update_tasks(struct cpuset *cs)
+static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks 
*tmp)
 {
static cpumask_t new_cpus;
static nodemask_t new_mems;
bool cpus_updated;
bool mems_updated;
+   struct cpuset *parent;
 retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
@@ -2891,9 +2921,60 @@ static void cpuset_hotplug_update_tasks(struct cpuset 
*cs)
goto retry;
}
 
-   cpumask_and(&new_cpus, cs-

[PATCH v15 03/12] cpuset: Simply allocation and freeing of cpumasks

2018-11-08 Thread Waiman Long
The previous commit introduces a new subparts_cpus mask into the cpuset
data structure and a new tmpmasks structure.  Managing the allocation
and freeing of those cpumasks is becoming more complex.

So a number of helper functions are added to simplify and streamline
the management of those cpumasks. To make it simple, all the cpumasks
are now pre-cleared on allocation.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 110 ++---
 1 file changed, 77 insertions(+), 33 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 29a2bdc..2ac9437 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -456,6 +456,65 @@ static int is_cpuset_subset(const struct cpuset *p, const 
struct cpuset *q)
 }
 
 /**
+ * alloc_cpumasks - allocate three cpumasks for cpuset
+ * @cs:  the cpuset that have cpumasks to be allocated.
+ * @tmp: the tmpmasks structure pointer
+ * Return: 0 if successful, -ENOMEM otherwise.
+ *
+ * Only one of the two input arguments should be non-NULL.
+ */
+static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
+{
+   cpumask_var_t *pmask1, *pmask2, *pmask3;
+
+   if (cs) {
+   pmask1 = &cs->cpus_allowed;
+   pmask2 = &cs->effective_cpus;
+   pmask3 = &cs->subparts_cpus;
+   } else {
+   pmask1 = &tmp->new_cpus;
+   pmask2 = &tmp->addmask;
+   pmask3 = &tmp->delmask;
+   }
+
+   if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
+   return -ENOMEM;
+
+   if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
+   goto free_one;
+
+   if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
+   goto free_two;
+
+   return 0;
+
+free_two:
+   free_cpumask_var(*pmask2);
+free_one:
+   free_cpumask_var(*pmask1);
+   return -ENOMEM;
+}
+
+/**
+ * free_cpumasks - free cpumasks in a tmpmasks structure
+ * @cs:  the cpuset that have cpumasks to be free.
+ * @tmp: the tmpmasks structure pointer
+ */
+static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
+{
+   if (cs) {
+   free_cpumask_var(cs->cpus_allowed);
+   free_cpumask_var(cs->effective_cpus);
+   free_cpumask_var(cs->subparts_cpus);
+   }
+   if (tmp) {
+   free_cpumask_var(tmp->new_cpus);
+   free_cpumask_var(tmp->addmask);
+   free_cpumask_var(tmp->delmask);
+   }
+}
+
+/**
  * alloc_trial_cpuset - allocate a trial cpuset
  * @cs: the cpuset that the trial cpuset duplicates
  */
@@ -467,31 +526,24 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset 
*cs)
if (!trial)
return NULL;
 
-   if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
-   goto free_cs;
-   if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL))
-   goto free_cpus;
+   if (alloc_cpumasks(trial, NULL)) {
+   kfree(trial);
+   return NULL;
+   }
 
cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
cpumask_copy(trial->effective_cpus, cs->effective_cpus);
return trial;
-
-free_cpus:
-   free_cpumask_var(trial->cpus_allowed);
-free_cs:
-   kfree(trial);
-   return NULL;
 }
 
 /**
- * free_trial_cpuset - free the trial cpuset
- * @trial: the trial cpuset to be freed
+ * free_cpuset - free the cpuset
+ * @cs: the cpuset to be freed
  */
-static void free_trial_cpuset(struct cpuset *trial)
+static inline void free_cpuset(struct cpuset *cs)
 {
-   free_cpumask_var(trial->effective_cpus);
-   free_cpumask_var(trial->cpus_allowed);
-   kfree(trial);
+   free_cpumasks(cs, NULL);
+   kfree(cs);
 }
 
 /*
@@ -1385,7 +1437,7 @@ static int update_flag(cpuset_flagbits_t bit, struct 
cpuset *cs,
if (spread_flag_changed)
update_tasks_flags(cs);
 out:
-   free_trial_cpuset(trialcs);
+   free_cpuset(trialcs);
return err;
 }
 
@@ -1769,7 +1821,7 @@ static ssize_t cpuset_write_resmask(struct 
kernfs_open_file *of,
break;
}
 
-   free_trial_cpuset(trialcs);
+   free_cpuset(trialcs);
 out_unlock:
mutex_unlock(&cpuset_mutex);
kernfs_unbreak_active_protection(of->kn);
@@ -2024,26 +2076,19 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state 
*css, struct cftype *cft)
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
-   if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
-   goto free_cs;
-   if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
-   goto free_cpus;
+
+   if (alloc_cpumasks(cs, NULL)) {
+   kfree(cs);
+   return ERR_PTR(-ENOMEM);
+   }
 
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
-   cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
-   cpumask_clear(c

[PATCH v15 06/12] cpuset: Track cpusets that use parent's effective_cpus

2018-11-08 Thread Waiman Long
In the default hierarchy, a cpuset will use the parent's effective_cpus
if none of the requested CPUs can be granted from the parent. That can
be a problem if a parent is a partition root with children partition
roots. Changes to a parent's effective_cpus list due to changes in a
child partition root may not be properly reflected in a child cpuset
that use parent's effective_cpus because the cpu_exclusive rule of a
partition root will not guard against that.

In order to avoid the mismatch, two new tracking variables are added to
the cpuset structure to track if a cpuset uses parent's effective_cpus
and the number of children cpusets that use its effective_cpus. So
whenever cpumask changes are made to a parent, it will also check to
see if it has other children cpusets that use its effective_cpus and
call update_cpumasks_hier() if that is the case.

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cpuset.c | 71 +-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ef41f58..21eaa89 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -147,6 +147,14 @@ struct cpuset {
 
/* partition root state */
int partition_root_state;
+
+   /*
+* Default hierarchy only:
+* use_parent_ecpus - set if using parent's effective_cpus
+* child_ecpus_count - # of children with use_parent_ecpus set
+*/
+   int use_parent_ecpus;
+   int child_ecpus_count;
 };
 
 /*
@@ -1227,8 +1235,17 @@ static void update_cpumasks_hier(struct cpuset *cs, 
struct tmpmasks *tmp)
 * If it becomes empty, inherit the effective mask of the
 * parent, which is guaranteed to have some CPUs.
 */
-   if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus))
+   if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
+   if (!cp->use_parent_ecpus) {
+   cp->use_parent_ecpus = true;
+   parent->child_ecpus_count++;
+   }
+   } else if (cp->use_parent_ecpus) {
+   cp->use_parent_ecpus = false;
+   WARN_ON_ONCE(!parent->child_ecpus_count);
+   parent->child_ecpus_count--;
+   }
 
/*
 * Skip the whole subtree if the cpumask remains the same
@@ -1346,6 +1363,35 @@ static void update_cpumasks_hier(struct cpuset *cs, 
struct tmpmasks *tmp)
 }
 
 /**
+ * update_sibling_cpumasks - Update siblings cpumasks
+ * @parent:  Parent cpuset
+ * @cs:  Current cpuset
+ * @tmp: Temp variables
+ */
+static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
+   struct tmpmasks *tmp)
+{
+   struct cpuset *sibling;
+   struct cgroup_subsys_state *pos_css;
+
+   /*
+* Check all its siblings and call update_cpumasks_hier()
+* if their use_parent_ecpus flag is set in order for them
+* to use the right effective_cpus value.
+*/
+   rcu_read_lock();
+   cpuset_for_each_child(sibling, pos_css, parent) {
+   if (sibling == cs)
+   continue;
+   if (!sibling->use_parent_ecpus)
+   continue;
+
+   update_cpumasks_hier(sibling, tmp);
+   }
+   rcu_read_unlock();
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in 
it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -1420,6 +1466,17 @@ static int update_cpumask(struct cpuset *cs, struct 
cpuset *trialcs,
spin_unlock_irq(&callback_lock);
 
update_cpumasks_hier(cs, &tmp);
+
+   if (cs->partition_root_state) {
+   struct cpuset *parent = parent_cs(cs);
+
+   /*
+* For partition root, update the cpumasks of sibling
+* cpusets if they use parent's effective_cpus.
+*/
+   if (parent->child_ecpus_count)
+   update_sibling_cpumasks(parent, cs, &tmp);
+   }
return 0;
 }
 
@@ -1856,6 +1913,9 @@ static int update_prstate(struct cpuset *cs, int val)
if (parent != &top_cpuset)
update_tasks_cpumask(parent);
 
+   if (parent->child_ecpus_count)
+   update_sibling_cpumasks(parent, cs, &tmp);
+
rebuild_sched_domains_locked();
 out:
free_cpumasks(NULL, &tmp);
@@ -2550,6 +2610,8 @@ static int cpuset_css_online(struct cgroup_subsys_state 
*css)
if (is_in_v2_mode()) {
cpumask_copy(cs->effective_cpus, parent->effective_cpus);
cs->effective_mems = parent->effective_mems;
+   cs->use_parent_ecpus = true;
+   parent->child_ecpus_count++;

[PATCH v15 09/12] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root

2018-11-08 Thread Waiman Long
Because of the fact that setting the "cpuset.sched.partition" in
a direct child of root can remove CPUs from the root's effective CPU
list, it makes sense to know what CPUs are left in the root cgroup for
scheduling purpose. So the "cpuset.cpus.effective" control file is now
exposed in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++--
 kernel/cgroup/cpuset.c  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 01b70f6..595b075 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1653,7 +1653,7 @@ Cpuset Interface Files
and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-   A read-only multiple values file which exists on non-root
+   A read-only multiple values file which exists on all
cpuset-enabled cgroups.
 
It lists the onlined CPUs that are actually granted to this
@@ -1693,7 +1693,7 @@ Cpuset Interface Files
and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-   A read-only multiple values file which exists on non-root
+   A read-only multiple values file which exists on all
cpuset-enabled cgroups.
 
It lists the onlined memory nodes that are actually granted to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3960de7..fc1a809 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2574,14 +2574,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state 
*css, struct cftype *cft)
.name = "cpus.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_CPULIST,
-   .flags = CFTYPE_NOT_ON_ROOT,
},
 
{
.name = "mems.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_MEMLIST,
-   .flags = CFTYPE_NOT_ON_ROOT,
},
 
{
-- 
1.8.3.1



Re: [PATCH v15 00/12] Enable cpuset controller in default hierarchy

2018-11-08 Thread Peter Zijlstra
On Thu, Nov 08, 2018 at 10:08:34AM -0500, Waiman Long wrote:
> Waiman Long (12):
>   cpuset: Enable cpuset controller in default hierarchy
>   cpuset: Define data structures to support scheduling partition
>   cpuset: Simply allocation and freeing of cpumasks
>   cpuset: Add new v2 cpuset.sched.partition flag
>   cpuset: Add an error state to cpuset.sched.partition
>   cpuset: Track cpusets that use parent's effective_cpus
>   cpuset: Make CPU hotplug work with partition
>   cpuset: Make generate_sched_domains() work with partition
>   cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
>   cpuset: Use descriptive text when reading/writing
> cpuset.sched.partition
>   cpuset: Add documentation about the new "cpuset.sched.partition" flag
>   cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
> 
>  Documentation/admin-guide/cgroup-v2.rst | 182 +-
>  include/linux/cgroup-defs.h |   1 +
>  kernel/cgroup/cgroup-internal.h |   2 +
>  kernel/cgroup/cgroup.c  |  14 +-
>  kernel/cgroup/cpuset.c  | 942 
> +---
>  kernel/cgroup/debug.c   |   4 +-
>  6 files changed, 1070 insertions(+), 75 deletions(-)

Acked-by: Peter Zijlstra (Intel) 

Thanks Waiman!


Re: [PATCH v15 12/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug

2018-11-08 Thread Peter Zijlstra
On Thu, Nov 08, 2018 at 10:08:46AM -0500, Waiman Long wrote:
> For debugging purpose, it will be useful to expose the content of the
> subparts_cpus as a read-only file to see if the code work correctly.
> However, subparts_cpus will not be used at all in most use cases. So
> adding a new cpuset file that clutters the cgroup directory may not be
> desirable.  This is now being done by using the hidden "cgroup_debug"
> kernel command line option to expose a new "cpuset.cpus.subpartitions"
> file.

One thought I had; would it make sense to make these debug files hidden
("." prefix) ?


Re: [PATCH v8 2/2] overlayfs: override_creds=off option bypass creator_cred

2018-11-08 Thread Vivek Goyal
On Tue, Nov 06, 2018 at 03:01:15PM -0800, Mark Salyzyn wrote:
> By default, all access to the upper, lower and work directories is the
> recorded mounter's MAC and DAC credentials.  The incoming accesses are
> checked against the caller's credentials.
> 
> If the principles of least privilege are applied, the mounter's
> credentials might not overlap the credentials of the caller's when
> accessing the overlayfs filesystem.  For example, a file that a lower
> DAC privileged caller can execute, is MAC denied to the generally
> higher DAC privileged mounter, to prevent an attack vector.
> 
> We add the option to turn off override_creds in the mount options; all
> subsequent operations after mount on the filesystem will be only the
> caller's credentials.  The module boolean parameter and mount option
> override_creds is also added as a presence check for this "feature",
> existence of /sys/module/overlay/parameters/override_creds.
> 
> It was not always this way.  Circa 4.4 there was no recorded mounter's
> credentials, instead privileged access to upper or work directories
> were temporarily increased to perform the operations.  The MAC
> (selinux) policies were caller's in all cases.  override_creds=off
> partially returns us to this older access model minus the insecure
> temporary credential increases.


So what will have to bunch of overlay operations now which require
priviliges or additional capabilities. To make all that succeed,
you will have to give those capabilities to processes accessing
overlayfs. If yes, I am wondering how is that more secure. IOW,
having to give these capabilities to all processes accessing
fs as opposed to giving those capabilities to only mounter.

You also mentioned that you want init to be least priviliged as
it could be attacked. But now all the processses are priviliged
and these could be attacked as well.

The way I am looking at it as follows.

- With current model, only mounter needs to have capabilities to
  do some operations such as whiteout creation etc.

- With override_creds=off, all services accessing overlayfs needs
  to have increased priviliges.

Is that not increasing attack surface. Previously, you will have to
attach init and now you can get away by breaking one of the services.

/me is wondering how does that increase security.

Thanks
Vivek


> This is to permit use in a system
> with non-overlapping security models for each executable including
> the agent that mounts the overlayfs filesystem.  In Android
> this is the case since init, which performs the mount operations,
> has a minimal MAC set of privileges to reduce any attack surface,
> and services that use the content have a different set of MAC
> privileges (eg: read, for vendor labelled configuration, execute for
> vendor libraries and modules).
> 
> Signed-off-by: Mark Salyzyn 
> Cc: Miklos Szeredi 
> Cc: Jonathan Corbet 
> Cc: Vivek Goyal 
> Cc: Eric W. Biederman 
> Cc: Amir Goldstein 
> Cc: Randy Dunlap 
> Cc: Stephen Smalley 
> Cc: linux-unio...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> ---
> v8:
> - drop pr_warn message after straw poll to remove it.
> - added a use case in the commit message
> 
> v7:
> - change name of internal parameter to ovl_override_creds_def
> - report override_creds only if different than default
> 
> v6:
> - Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
> - Do better with the documentation.
> - pr_warn message adjusted to report consequences.
> 
> v5:
> - beefed up the caveats in the Documentation
> - Is dependent on
>   "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
>   "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
> - Added prwarn when override_creds=off
> 
> v4:
> - spelling and grammar errors in text
> 
> v3:
> - Change name from caller_credentials / creator_credentials to the
>   boolean override_creds.
> - Changed from creator to mounter credentials.
> - Updated and fortified the documentation.
> - Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS
> 
> v2:
> - Forward port changed attr to stat, resulting in a build error.
> - altered commit message.
> 
>  Documentation/filesystems/overlayfs.txt | 17 +
>  fs/overlayfs/copy_up.c  |  2 +-
>  fs/overlayfs/dir.c  |  9 +
>  fs/overlayfs/inode.c| 16 
>  fs/overlayfs/namei.c|  6 +++---
>  fs/overlayfs/overlayfs.h|  1 +
>  fs/overlayfs/ovl_entry.h|  1 +
>  fs/overlayfs/readdir.c  |  4 ++--
>  fs/overlayfs/super.c| 22 +-
>  fs/overlayfs/util.c | 12 ++--
>  10 files changed, 69 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/filesystems/overlayfs.txt 
> b/Documentation/filesystems/overlayfs.txt
> index eef7d9d259e8..5cc299df4436 100644
> --- a/Documentation/filesystems/overlayfs.txt
> +

[PATCH v3 0/3] Add driver for Synopsys DesignWare I3C master IP

2018-11-08 Thread Vitor soares
This patch series is a proposal for the I3C master driver for Synopsys IP.
This patch is to be applied on top of I3C subsystem RFC V10 submitted by
Boris Brezillon.

Supported features:
  Regular CCC commands.
  I3C private transfers.
  I2C transfers.

Missing functionalities:
  Support DMA interface.
  Support for I3C_BUS_MODE_MIXED_SLOW.
  Hot-join.
  IBI.

Main change between v2 and v3:
- Minor fixes. They are described in each patch

Main change between v1 and v2:
- Add controller version on dt-binding
- The driver now calls writesl/readsl() instead readl/writel
- Rename some variables in the driver

Vitor soares (3):
  i3c: master: Add driver for Synopsys DesignWare IP
  dt-bindings: i3c: Document Synopsys DesignWare I3C master bindings
  MAINTAINERS: Add myself as the dw-i3c-master module maintainer

 .../devicetree/bindings/i3c/snps,dw-i3c-master.txt |   42 +
 MAINTAINERS|6 +
 drivers/i3c/master/Kconfig |   15 +
 drivers/i3c/master/Makefile|1 +
 drivers/i3c/master/dw-i3c-master.c | 1216 
 5 files changed, 1280 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
 create mode 100644 drivers/i3c/master/dw-i3c-master.c

-- 
2.7.4




[PATCH v3 1/3] i3c: master: Add driver for Synopsys DesignWare IP

2018-11-08 Thread Vitor soares
Add driver for Synopsys DesignWare I3C master IP

Signed-off-by: Vitor soares 
---
Change in v3:
- Use struct_size() (suggested by Matthew)

Change in v2:
- Rename some variables
- Remove dw_i3c_master_dev_set_info()
- Ajust code to match the changes made of i3c subsystem
- Use readsl/writesl() to populate TX/RX buffers

 drivers/i3c/master/Kconfig |   15 +
 drivers/i3c/master/Makefile|1 +
 drivers/i3c/master/dw-i3c-master.c | 1216 
 3 files changed, 1232 insertions(+)
 create mode 100644 drivers/i3c/master/dw-i3c-master.c

diff --git a/drivers/i3c/master/Kconfig b/drivers/i3c/master/Kconfig
index 959b837..2702751 100644
--- a/drivers/i3c/master/Kconfig
+++ b/drivers/i3c/master/Kconfig
@@ -4,3 +4,18 @@ config CDNS_I3C_MASTER
depends on !(ALPHA || PARISC)
help
  Enable this driver if you want to support Cadence I3C master block.
+
+config DW_I3C_MASTER
+   tristate "Synospsys DesignWare I3C master driver"
+   depends on I3C
+   depends on !(ALPHA || PARISC)
+   # ALPHA and PARISC needs {read,write}sl()
+   help
+ Support for Synopsys DesignWare MIPI I3C Controller.
+
+ For details please see
+ https://www.synopsys.com/dw/ipdir.php?ds=mipi_i3c
+
+ This driver can also be built as a module.  If so, the module
+ will be called dw-i3c-master.
diff --git a/drivers/i3c/master/Makefile b/drivers/i3c/master/Makefile
index 4c4304a..fc53939 100644
--- a/drivers/i3c/master/Makefile
+++ b/drivers/i3c/master/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_CDNS_I3C_MASTER)  += i3c-master-cdns.o
+obj-$(CONFIG_DW_I3C_MASTER)+= dw-i3c-master.o
diff --git a/drivers/i3c/master/dw-i3c-master.c 
b/drivers/i3c/master/dw-i3c-master.c
new file mode 100644
index 000..a436d0d
--- /dev/null
+++ b/drivers/i3c/master/dw-i3c-master.c
@@ -0,0 +1,1216 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Synopsys, Inc. and/or its affiliates.
+ *
+ * Author: Vitor Soares 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DEVICE_CTRL0x0
+#define DEV_CTRL_ENABLEBIT(31)
+#define DEV_CTRL_RESUMEBIT(30)
+#define DEV_CTRL_HOT_JOIN_NACK BIT(8)
+#define DEV_CTRL_I2C_SLAVE_PRESENT BIT(7)
+
+#define DEVICE_ADDR0x4
+#define DEV_ADDR_DYNAMIC_ADDR_VALIDBIT(31)
+#define DEV_ADDR_DYNAMIC(x)(((x) << 16) & GENMASK(22, 16))
+
+#define HW_CAPABILITY  0x8
+#define COMMAND_QUEUE_PORT 0xc
+#define COMMAND_PORT_TOC   BIT(30)
+#define COMMAND_PORT_READ_TRANSFER BIT(28)
+#define COMMAND_PORT_SDAP  BIT(27)
+#define COMMAND_PORT_ROC   BIT(26)
+#define COMMAND_PORT_SPEED(x)  (((x) << 21) & GENMASK(23, 21))
+#define COMMAND_PORT_DEV_INDEX(x)  (((x) << 16) & GENMASK(20, 16))
+#define COMMAND_PORT_CPBIT(15)
+#define COMMAND_PORT_CMD(x)(((x) << 7) & GENMASK(14, 7))
+#define COMMAND_PORT_TID(x)(((x) << 3) & GENMASK(6, 3))
+
+#define COMMAND_PORT_ARG_DATA_LEN(x)   (((x) << 16) & GENMASK(31, 16))
+#define COMMAND_PORT_ARG_DATA_LEN_MAX  65536
+#define COMMAND_PORT_TRANSFER_ARG  0x01
+
+#define COMMAND_PORT_SDA_DATA_BYTE_3(x)(((x) << 24) & GENMASK(31, 24))
+#define COMMAND_PORT_SDA_DATA_BYTE_2(x)(((x) << 16) & GENMASK(23, 16))
+#define COMMAND_PORT_SDA_DATA_BYTE_1(x)(((x) << 8) & GENMASK(15, 8))
+#define COMMAND_PORT_SDA_BYTE_STRB_3   BIT(5)
+#define COMMAND_PORT_SDA_BYTE_STRB_2   BIT(4)
+#define COMMAND_PORT_SDA_BYTE_STRB_1   BIT(3)
+#define COMMAND_PORT_SHORT_DATA_ARG0x02
+
+#define COMMAND_PORT_DEV_COUNT(x)  (((x) << 21) & GENMASK(25, 21))
+#define COMMAND_PORT_ADDR_ASSGN_CMD0x03
+
+#define RESPONSE_QUEUE_PORT0x10
+#define RESPONSE_PORT_ERR_STATUS(x)(((x) & GENMASK(31, 28)) >> 28)
+#define RESPONSE_NO_ERROR  0
+#define RESPONSE_ERROR_CRC 1
+#define RESPONSE_ERROR_PARITY  2
+#define RESPONSE_ERROR_FRAME   3
+#define RESPONSE_ERROR_IBA_NACK4
+#define RESPONSE_ERROR_ADDRESS_NACK5
+#define RESPONSE_ERROR_OVER_UNDER_FLOW 6
+#define RESPONSE_ERROR_TRANSF_ABORT8
+#define RESPONSE_ERROR_I2C_W_NACK_ERR  9
+#define RESPONSE_PORT_TID(x)   (((x) & GENMASK(27, 24)) >> 24)
+#define RESPONSE_PORT_DATA_LEN(x)  ((x) & GENMASK(15, 0))
+
+#define RX_TX_DATA_PORT0x14
+#define IBI_QUEUE_STATUS   0x18
+#define QUEUE_THLD_CTRL0x1c
+#define QUEUE_THLD_CTRL_RESP_BUF_MASK  GENMASK(15, 8)
+#define QUEUE_THLD_CTRL_RESP_BUF(x)(((x) - 1) << 8)
+
+#define DATA_BUFFER_THLD_CTRL  0x20
+#define DATA_BUFFER_THLD_CTRL_RX_BUF   GENMASK(11, 8)
+
+#define IBI_QUEUE_CTRL 

[PATCH v3 2/3] dt-bindings: i3c: Document Synopsys DesignWare I3C master bindings

2018-11-08 Thread Vitor soares
Document Synopsys DesignWare I3C master module

Signed-off-by: Vitor soares 
---
Changes in v3:
- Remove dummy characters

Changes in v2:
- Address the changes in Documentation/devicetree/bindings/i3c/i3c.txt
- Add controller version on compatible string

 .../devicetree/bindings/i3c/snps,dw-i3c-master.txt | 42 ++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt

diff --git a/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt 
b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
new file mode 100644
index 000..b930c09
--- /dev/null
+++ b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
@@ -0,0 +1,42 @@
+Bindings for Synopsys DesignWare I3C master block
+=
+
+Required properties:
+
+- compatible: shall be "snps,dw-i3c-master-1.00a"
+- clocks: shall reference the core_clk
+- interrupts: the interrupt line connected to this I3C master
+- reg: Offset and length of I3C master registers
+
+Mandatory properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- #address-cells: shall be set to 3
+- #size-cells: shall be set to 0
+
+Optional properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- i2c-scl-hz
+- i3c-scl-hz
+
+I3C device connected on the bus follow the generic description (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details).
+
+Example:
+
+   i3c-master@2000 {
+   compatible = "snps,dw-i3c-master-1.00a";
+   #address-cells = <3>;
+   #size-cells = <0>;
+   reg = <0x02000 0x1000>;
+   interrupts = <0>;
+   clocks = <&i3cclk>;
+
+   eeprom@57{
+   compatible = "atmel,24c01";
+   reg = < 0x57 0x0 0x10>;
+   pagesize = <0x8>;
+   };
+   };
+
-- 
2.7.4




[PATCH v3 3/3] MAINTAINERS: Add myself as the dw-i3c-master module maintainer

2018-11-08 Thread Vitor soares
Signed-off-by: Vitor soares 
---
Change in v3:
- Remove dummy characters

Change in v2:
- None

 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 293c863..39910b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6930,6 +6930,12 @@ F:   drivers/i3c/
 F: include/linux/i3c/
 F: include/dt-bindings/i3c/
 
+I3C DRIVER FOR SYNOPSYS DESIGNWARE
+M: Vitor Soares 
+S: Maintained
+F: Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
+F: drivers/i3c/master/dw*
+
 IA64 (Itanium) PLATFORM
 M: Tony Luck 
 M: Fenghua Yu 
-- 
2.7.4




Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Borislav Petkov
On Thu, Oct 11, 2018 at 08:15:00AM -0700, Yu-cheng Yu wrote:
> Intel Control-flow Enforcement Technology (CET) introduces the
> following MSRs into the XSAVES system states.
> 
> IA32_U_CET (user-mode CET settings),
> IA32_PL3_SSP (user-mode shadow stack),
> IA32_PL0_SSP (kernel-mode shadow stack),
> IA32_PL1_SSP (ring-1 shadow stack),
> IA32_PL2_SSP (ring-2 shadow stack).

And?

That commit message got chopped off here, it seems.

> Signed-off-by: Yu-cheng Yu 
> ---
>  arch/x86/include/asm/fpu/types.h| 22 +
>  arch/x86/include/asm/fpu/xstate.h   |  4 +++-
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c| 10 ++
>  4 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/fpu/types.h 
> b/arch/x86/include/asm/fpu/types.h
> index 202c53918ecf..e55d51d172f1 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -114,6 +114,9 @@ enum xfeature {
>   XFEATURE_Hi16_ZMM,
>   XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
>   XFEATURE_PKRU,
> + XFEATURE_RESERVED,
> + XFEATURE_SHSTK_USER,
> + XFEATURE_SHSTK_KERNEL,
>  
>   XFEATURE_MAX,
>  };
> @@ -128,6 +131,8 @@ enum xfeature {
>  #define XFEATURE_MASK_Hi16_ZMM   (1 << XFEATURE_Hi16_ZMM)
>  #define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
>  #define XFEATURE_MASK_PKRU   (1 << XFEATURE_PKRU)
> +#define XFEATURE_MASK_SHSTK_USER (1 << XFEATURE_SHSTK_USER)
> +#define XFEATURE_MASK_SHSTK_KERNEL   (1 << XFEATURE_SHSTK_KERNEL)
>  
>  #define XFEATURE_MASK_FPSSE  (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
>  #define XFEATURE_MASK_AVX512 (XFEATURE_MASK_OPMASK \
> @@ -229,6 +234,23 @@ struct pkru_state {
>   u32 pad;
>  } __packed;
>  
> +/*
> + * State component 11 is Control flow Enforcement user states

Why the Camel-cased naming?

"Control" then "flow" then capitalized again "Enforcement".

Fix all occurrences pls, especially the user-visible strings.

> + */
> +struct cet_user_state {
> + u64 u_cet;  /* user control flow settings */
> + u64 user_ssp;   /* user shadow stack pointer */

Prefix both with "usr_" instead.

> +} __packed;
> +
> +/*
> + * State component 12 is Control flow Enforcement kernel states
> + */
> +struct cet_kernel_state {
> + u64 kernel_ssp; /* kernel shadow stack */
> + u64 pl1_ssp;/* ring-1 shadow stack */
> + u64 pl2_ssp;/* ring-2 shadow stack */

Just write "privilege level" everywhere - not "ring".

Btw, do you see how the type and the name of all those other fields in
that file are tabulated? Except yours...

> +} __packed;
> +
>  struct xstate_header {
>   u64 xfeatures;
>   u64 xcomp_bv;
> diff --git a/arch/x86/include/asm/fpu/xstate.h 
> b/arch/x86/include/asm/fpu/xstate.h
> index d8e2ec99f635..18b60748a34d 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -28,7 +28,9 @@
> XFEATURE_MASK_Hi16_ZMM | \
> XFEATURE_MASK_PKRU | \
> XFEATURE_MASK_BNDREGS | \
> -   XFEATURE_MASK_BNDCSR)
> +   XFEATURE_MASK_BNDCSR | \
> +   XFEATURE_MASK_SHSTK_USER | \
> +   XFEATURE_MASK_SHSTK_KERNEL)
>  
>  #ifdef CONFIG_X86_64
>  #define REX_PREFIX   "0x48, "
> diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
> b/arch/x86/include/uapi/asm/processor-flags.h
> index bcba3c643e63..25311ec4b731 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -130,6 +130,8 @@
>  #define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
>  #define X86_CR4_PKE_BIT  22 /* enable Protection Keys support */
>  #define X86_CR4_PKE  _BITUL(X86_CR4_PKE_BIT)
> +#define X86_CR4_CET_BIT  23 /* enable Control flow Enforcement */
> +#define X86_CR4_CET  _BITUL(X86_CR4_CET_BIT)
>  
>  /*
>   * x86-64 Task Priority Register, CR8
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 605ec6decf3e..ad36ea28bfd1 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -35,6 +35,9 @@ static const char *xfeature_names[] =
>   "Processor Trace (unused)"  ,
>   "Protection Keys User registers",
>   "unknown xstate feature",
> + "Control flow User registers"   ,
> + "Control flow Kernel registers" ,
> + "unknown xstate feature",

So there are two "unknown xstate feature" array elems now...

>  static short xsave_cpuid_features[] __initdata = {
> @@ -48,6 +51,9 @@ static short xsave_cpuid_features[] __initdata = {
>   X86_FEATURE_AVX512F,
>   X86_FE

Re: [PATCH v15 12/12] cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug

2018-11-08 Thread Waiman Long
On 11/08/2018 10:46 AM, Peter Zijlstra wrote:
> On Thu, Nov 08, 2018 at 10:08:46AM -0500, Waiman Long wrote:
>> For debugging purpose, it will be useful to expose the content of the
>> subparts_cpus as a read-only file to see if the code work correctly.
>> However, subparts_cpus will not be used at all in most use cases. So
>> adding a new cpuset file that clutters the cgroup directory may not be
>> desirable.  This is now being done by using the hidden "cgroup_debug"
>> kernel command line option to expose a new "cpuset.cpus.subpartitions"
>> file.
> One thought I had; would it make sense to make these debug files hidden
> ("." prefix) ?


These debug files will only appear when the cgroup_debug command line
option is used. So I don't think we really need to add an extra "."
prefix as it will break the naming convention.

Cheers,
Longman



Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-08 Thread Jarkko Sakkinen
On Thu, Nov 08, 2018 at 04:39:42PM +0200, Jarkko Sakkinen wrote:
> On Wed, Nov 07, 2018 at 09:09:37AM -0800, Dave Hansen wrote:
> > On 11/7/18 8:30 AM, Jarkko Sakkinen wrote:
> > >> Does this code run when I type "make kselftest"?  If not, I think we
> > >> should rectify that.
> > > No, it doesn't. It is just my backup for the non-SDK user space code
> > > that I've made that I will use to fork my user space SGX projects in
> > > the future.
> > > 
> > > I can work-out a selftest (and provide a new patch in the series) but
> > > I'm still wondering what the enclave should do. I would suggest that
> > > we start with an enclave that does just EEXIT and nothing else.
> > 
> > Yeah, that's a start.  But, a good selftest would include things like
> > faults and error conditions.
> 
> Great. We can add more entry points to the enclave for different tests
> but I'll start with a bare minimum. And yeah but ever goes into next
> version I'll document the fault handling.

For the v17 I'll add exactly two test cases:

1. EENTER/EEXIT
2. EENTER/exception

So that it will easier to evaluate and demonstrate exception handling.

/Jarkko


Re: [PATCH v8 2/2] overlayfs: override_creds=off option bypass creator_cred

2018-11-08 Thread Vivek Goyal
On Tue, Nov 06, 2018 at 03:01:15PM -0800, Mark Salyzyn wrote:
> By default, all access to the upper, lower and work directories is the
> recorded mounter's MAC and DAC credentials.  The incoming accesses are
> checked against the caller's credentials.

Ok, I am trying to think of scenarios where override_creds=off can 
provide any privilege escalation. How about following.

$ mkdir lower lower/foo upper upper/foo work merged
$ touch lower/foo/bar.txt
$ chmod 700 lower/foo/

# Mount overlay with override_creds=off

$ mount -t overlay -o
lowerdir=lower,upperdir=upper,workdir=work,override_creds=off none merged

# Try to read lower/foo as unpriviliged user. Say "test"
# su test
# ls merged/foo/
ls: cannot access 'merged/foo/': Operation not permitted

# Now first try to do same operation as root and retry as test user.
$ exit
$ ls merged/foo
bar.txt
$ su test
$ ls merged/foo
bar.txt

lower/foo/ is not readable by user "test". So it fails in first try. Later
"root" accesses it and it populates cache in overlayfs. When test retries,
it gets these entries from cache.

With override_creds=on this is not a problem because overlay provides
this as functionality as long as mounter as access to lower/foo/.

But with override_creds=off, mounter is not providing any such
functionality and we are exposing an issue where cache will make
something available which is not normally available.

I think it probably is a good idea to do something about it?

Thanks
Vivek


> 
> If the principles of least privilege are applied, the mounter's
> credentials might not overlap the credentials of the caller's when
> accessing the overlayfs filesystem.  For example, a file that a lower
> DAC privileged caller can execute, is MAC denied to the generally
> higher DAC privileged mounter, to prevent an attack vector.
> 
> We add the option to turn off override_creds in the mount options; all
> subsequent operations after mount on the filesystem will be only the
> caller's credentials.  The module boolean parameter and mount option
> override_creds is also added as a presence check for this "feature",
> existence of /sys/module/overlay/parameters/override_creds.
> 
> It was not always this way.  Circa 4.4 there was no recorded mounter's
> credentials, instead privileged access to upper or work directories
> were temporarily increased to perform the operations.  The MAC
> (selinux) policies were caller's in all cases.  override_creds=off
> partially returns us to this older access model minus the insecure
> temporary credential increases.  This is to permit use in a system
> with non-overlapping security models for each executable including
> the agent that mounts the overlayfs filesystem.  In Android
> this is the case since init, which performs the mount operations,
> has a minimal MAC set of privileges to reduce any attack surface,
> and services that use the content have a different set of MAC
> privileges (eg: read, for vendor labelled configuration, execute for
> vendor libraries and modules).
> 
> Signed-off-by: Mark Salyzyn 
> Cc: Miklos Szeredi 
> Cc: Jonathan Corbet 
> Cc: Vivek Goyal 
> Cc: Eric W. Biederman 
> Cc: Amir Goldstein 
> Cc: Randy Dunlap 
> Cc: Stephen Smalley 
> Cc: linux-unio...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> ---
> v8:
> - drop pr_warn message after straw poll to remove it.
> - added a use case in the commit message
> 
> v7:
> - change name of internal parameter to ovl_override_creds_def
> - report override_creds only if different than default
> 
> v6:
> - Drop CONFIG_OVERLAY_FS_OVERRIDE_CREDS.
> - Do better with the documentation.
> - pr_warn message adjusted to report consequences.
> 
> v5:
> - beefed up the caveats in the Documentation
> - Is dependent on
>   "overlayfs: check CAP_DAC_READ_SEARCH before issuing exportfs_decode_fh"
>   "overlayfs: check CAP_MKNOD before issuing vfs_whiteout"
> - Added prwarn when override_creds=off
> 
> v4:
> - spelling and grammar errors in text
> 
> v3:
> - Change name from caller_credentials / creator_credentials to the
>   boolean override_creds.
> - Changed from creator to mounter credentials.
> - Updated and fortified the documentation.
> - Added CONFIG_OVERLAY_FS_OVERRIDE_CREDS
> 
> v2:
> - Forward port changed attr to stat, resulting in a build error.
> - altered commit message.
> 
>  Documentation/filesystems/overlayfs.txt | 17 +
>  fs/overlayfs/copy_up.c  |  2 +-
>  fs/overlayfs/dir.c  |  9 +
>  fs/overlayfs/inode.c| 16 
>  fs/overlayfs/namei.c|  6 +++---
>  fs/overlayfs/overlayfs.h|  1 +
>  fs/overlayfs/ovl_entry.h|  1 +
>  fs/overlayfs/readdir.c  |  4 ++--
>  fs/overlayfs/super.c| 22 +-
>  fs/overlayfs/util.c | 12 ++--
>  10 files changed, 69 insertions(+), 21

Re: [PATCH V6 0/8] KVM: X86: Introducing ROE Protection Kernel Hardening

2018-11-08 Thread Paolo Bonzini
On 04/11/2018 18:11, Ahmed Abd El Mawgood wrote:
> Our model assumes that an attacker got full root access to a running guest and
> his goal is to manipulate kernel code/data (hook syscalls, overwrite IDT 
> ..etc).
> 
> There is future work in progress to also put some sort of protection on the 
> page
> table register CR3 and other critical registers that can be intercepted by 
> KVM.
> This way it won't be possible for an attacker to manipulate any part of the
> guests page table.
> 

Do you have patches that enable usage of ROE in the kernel?
Alternatively you can write testcases in tools/testing/selftests/kvm to
test how guests should use it.

I would remove CONFIG_KVM_ROE altogether.  You can enable it
unconditionally.

I will continue reviewing the patches soon.

Paolo


Re: [PATCH v15 00/12] Enable cpuset controller in default hierarchy

2018-11-08 Thread Tejun Heo
Applied to cgroup/for-4.21.

Thanks a lot, Waiman!

-- 
tejun


Re: [PATCH v15 00/12] Enable cpuset controller in default hierarchy

2018-11-08 Thread Waiman Long
On 11/08/2018 03:28 PM, Tejun Heo wrote:
> Applied to cgroup/for-4.21.
>
> Thanks a lot, Waiman!
>
Thanks for applying it. It is a long way to get to this point.

Cheers,
Longman



Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Yu-cheng Yu
On Thu, 2018-11-08 at 19:40 +0100, Borislav Petkov wrote:
> On Thu, Oct 11, 2018 at 08:15:00AM -0700, Yu-cheng Yu wrote:
> > [...] 
> > +/*
> > + * State component 11 is Control flow Enforcement user states
> 
> Why the Camel-cased naming?
> 
> "Control" then "flow" then capitalized again "Enforcement".
> 
> Fix all occurrences pls, especially the user-visible strings.

I will change it to "Control-flow Enforcement" everywhere.

> > + */
> > +struct cet_user_state {
> > +   u64 u_cet;  /* user control flow settings */
> > +   u64 user_ssp;   /* user shadow stack pointer */
> 
> Prefix both with "usr_" instead.

Ok.

> [...]
> 
> Just write "privilege level" everywhere - not "ring".
> 
> Btw, do you see how the type and the name of all those other fields in
> that file are tabulated? Except yours...

I will fix it.

[...] 
> > 
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index 605ec6decf3e..ad36ea28bfd1 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -35,6 +35,9 @@ static const char *xfeature_names[] =
> > "Processor Trace (unused)"  ,
> > "Protection Keys User registers",
> > "unknown xstate feature",
> > +   "Control flow User registers"   ,
> > +   "Control flow Kernel registers" ,
> > +   "unknown xstate feature",
> 
> So there are two "unknown xstate feature" array elems now...
> 
> >  static short xsave_cpuid_features[] __initdata = {
> > @@ -48,6 +51,9 @@ static short xsave_cpuid_features[] __initdata = {
> > X86_FEATURE_AVX512F,
> > X86_FEATURE_INTEL_PT,
> > X86_FEATURE_PKU,
> > +   0, /* Unused */
> 
> What's that for?

In fpu_init_system_xstate(), we test and clear features that are not enabled.
There we depend on the order of these elements.  This is the tenth "unknown
xstate feature".

Yu-cheng


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Andy Lutomirski
On Thu, Oct 11, 2018 at 8:20 AM Yu-cheng Yu  wrote:
>
> Intel Control-flow Enforcement Technology (CET) introduces the
> following MSRs into the XSAVES system states.
>
> IA32_U_CET (user-mode CET settings),
> IA32_PL3_SSP (user-mode shadow stack),
> IA32_PL0_SSP (kernel-mode shadow stack),
> IA32_PL1_SSP (ring-1 shadow stack),
> IA32_PL2_SSP (ring-2 shadow stack).
>
> Signed-off-by: Yu-cheng Yu 
> ---
>  arch/x86/include/asm/fpu/types.h| 22 +
>  arch/x86/include/asm/fpu/xstate.h   |  4 +++-
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c| 10 ++
>  4 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/fpu/types.h 
> b/arch/x86/include/asm/fpu/types.h
> index 202c53918ecf..e55d51d172f1 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -114,6 +114,9 @@ enum xfeature {
> XFEATURE_Hi16_ZMM,
> XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
> XFEATURE_PKRU,
> +   XFEATURE_RESERVED,
> +   XFEATURE_SHSTK_USER,
> +   XFEATURE_SHSTK_KERNEL,
>
> XFEATURE_MAX,
>  };
> @@ -128,6 +131,8 @@ enum xfeature {
>  #define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
>  #define XFEATURE_MASK_PT   (1 << 
> XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
>  #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
> +#define XFEATURE_MASK_SHSTK_USER   (1 << XFEATURE_SHSTK_USER)
> +#define XFEATURE_MASK_SHSTK_KERNEL (1 << XFEATURE_SHSTK_KERNEL)
>
>  #define XFEATURE_MASK_FPSSE(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
>  #define XFEATURE_MASK_AVX512   (XFEATURE_MASK_OPMASK \
> @@ -229,6 +234,23 @@ struct pkru_state {
> u32 pad;
>  } __packed;
>
> +/*
> + * State component 11 is Control flow Enforcement user states
> + */
> +struct cet_user_state {
> +   u64 u_cet;  /* user control flow settings */
> +   u64 user_ssp;   /* user shadow stack pointer */
> +} __packed;
> +
> +/*
> + * State component 12 is Control flow Enforcement kernel states
> + */
> +struct cet_kernel_state {
> +   u64 kernel_ssp; /* kernel shadow stack */
> +   u64 pl1_ssp;/* ring-1 shadow stack */
> +   u64 pl2_ssp;/* ring-2 shadow stack */
> +} __packed;
> +

Why are these __packed?  It seems like it'll generate bad code for no
obvious purpose.


Re: [PATCH V6 0/8] KVM: X86: Introducing ROE Protection Kernel Hardening

2018-11-08 Thread Ahmed Soliman
Hello,

> Do you have patches that enable usage of ROE in the kernel?
> Alternatively you can write testcases in tools/testing/selftests/kvm to
> test how guests should use it.

As for now ROE isn't integrated yet into the kernel, I did have some
tests I will
have them in  selftests/kvm in the next patch set

> I would remove CONFIG_KVM_ROE altogether.  You can enable it
> unconditionally.

Noted, I will have it enabled unconditionally in the next patch set.

> I will continue reviewing the patches soon.
>
> Paolo


[PATCH v6 01/10] dt-bindings: fsi: Add P9 OCC device documentation

2018-11-08 Thread Eddie James
From: Eddie James 

Document the bindings for the FSI-attached POWER9 On-Chip Controller.

Signed-off-by: Eddie James 
---
 Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt | 16 
 1 file changed, 16 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt

diff --git a/Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt 
b/Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt
new file mode 100644
index 000..99ca986
--- /dev/null
+++ b/Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt
@@ -0,0 +1,16 @@
+Device-tree bindings for FSI-attached POWER9 On-Chip Controller (OCC)
+-
+
+This is the binding for the P9 On-Chip Controller accessed over FSI from a
+service processor. See fsi.txt for details on bindings for FSI slave and CFAM
+nodes. The OCC is not an FSI slave device itself, rather it is accessed
+through the SBE fifo.
+
+Required properties:
+ - compatible = "ibm,p9-occ"
+
+Examples:
+
+occ {
+compatible = "ibm,p9-occ";
+};
-- 
1.8.3.1



[PATCH v6 07/10] hwmon (occ): Parse OCC poll response

2018-11-08 Thread Eddie James
From: Eddie James 

Add method to parse the response from the OCC poll command. This only
needs to be done during probe(), since the OCC shouldn't change the
number or format of sensors while it's running. The parsed response
allows quick access to sensor data, as well as information on the
number and version of sensors, which we need to instantiate hwmon
attributes.

Signed-off-by: Eddie James 
---
 drivers/hwmon/occ/common.c | 61 ++
 drivers/hwmon/occ/common.h | 55 +
 2 files changed, 116 insertions(+)

diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
index 81acd4b..a066509 100644
--- a/drivers/hwmon/occ/common.c
+++ b/drivers/hwmon/occ/common.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include 
+#include 
 
 #include "common.h"
 
@@ -22,6 +23,64 @@ static int occ_poll(struct occ *occ)
return occ->send_cmd(occ, cmd);
 }
 
+/* only need to do this once at startup, as OCC won't change sensors on us */
+static void occ_parse_poll_response(struct occ *occ)
+{
+   unsigned int i, old_offset, offset = 0, size = 0;
+   struct occ_sensor *sensor;
+   struct occ_sensors *sensors = &occ->sensors;
+   struct occ_response *resp = &occ->resp;
+   struct occ_poll_response *poll =
+   (struct occ_poll_response *)&resp->data[0];
+   struct occ_poll_response_header *header = &poll->header;
+   struct occ_sensor_data_block *block = &poll->block;
+
+   dev_info(occ->bus_dev, "OCC found, code level: %.16s\n",
+header->occ_code_level);
+
+   for (i = 0; i < header->num_sensor_data_blocks; ++i) {
+   block = (struct occ_sensor_data_block *)((u8 *)block + offset);
+   old_offset = offset;
+   offset = (block->header.num_sensors *
+ block->header.sensor_length) + sizeof(block->header);
+   size += offset;
+
+   /* validate all the length/size fields */
+   if ((size + sizeof(*header)) >= OCC_RESP_DATA_BYTES) {
+   dev_warn(occ->bus_dev, "exceeded response buffer\n");
+   return;
+   }
+
+   dev_dbg(occ->bus_dev, " %04x..%04x: %.4s (%d sensors)\n",
+   old_offset, offset - 1, block->header.eye_catcher,
+   block->header.num_sensors);
+
+   /* match sensor block type */
+   if (strncmp(block->header.eye_catcher, "TEMP", 4) == 0)
+   sensor = &sensors->temp;
+   else if (strncmp(block->header.eye_catcher, "FREQ", 4) == 0)
+   sensor = &sensors->freq;
+   else if (strncmp(block->header.eye_catcher, "POWR", 4) == 0)
+   sensor = &sensors->power;
+   else if (strncmp(block->header.eye_catcher, "CAPS", 4) == 0)
+   sensor = &sensors->caps;
+   else if (strncmp(block->header.eye_catcher, "EXTN", 4) == 0)
+   sensor = &sensors->extended;
+   else {
+   dev_warn(occ->bus_dev, "sensor not supported %.4s\n",
+block->header.eye_catcher);
+   continue;
+   }
+
+   sensor->num_sensors = block->header.num_sensors;
+   sensor->version = block->header.sensor_format;
+   sensor->data = &block->data;
+   }
+
+   dev_dbg(occ->bus_dev, "Max resp size: %u+%zd=%zd\n", size,
+   sizeof(*header), size + sizeof(*header));
+}
+
 int occ_setup(struct occ *occ, const char *name)
 {
int rc;
@@ -36,5 +95,7 @@ int occ_setup(struct occ *occ, const char *name)
return rc;
}
 
+   occ_parse_poll_response(occ);
+
return 0;
 }
diff --git a/drivers/hwmon/occ/common.h b/drivers/hwmon/occ/common.h
index b44fee2..0a7a107 100644
--- a/drivers/hwmon/occ/common.h
+++ b/drivers/hwmon/occ/common.h
@@ -20,10 +20,65 @@ struct occ_response {
__be16 checksum;
 } __packed;
 
+struct occ_sensor_data_block_header {
+   u8 eye_catcher[4];
+   u8 reserved;
+   u8 sensor_format;
+   u8 sensor_length;
+   u8 num_sensors;
+} __packed;
+
+struct occ_sensor_data_block {
+   struct occ_sensor_data_block_header header;
+   u32 data;
+} __packed;
+
+struct occ_poll_response_header {
+   u8 status;
+   u8 ext_status;
+   u8 occs_present;
+   u8 config_data;
+   u8 occ_state;
+   u8 mode;
+   u8 ips_status;
+   u8 error_log_id;
+   __be32 error_log_start_address;
+   __be16 error_log_length;
+   u16 reserved;
+   u8 occ_code_level[16];
+   u8 eye_catcher[6];
+   u8 num_sensor_data_blocks;
+   u8 sensor_data_block_header_version;
+} __packed;
+
+struct occ_poll_response {
+   struct occ_poll_response_header header;
+   struct occ_sensor_data_block 

[PATCH v6 09/10] hwmon (occ): Add sensor attributes and register hwmon device

2018-11-08 Thread Eddie James
From: Eddie James 

Setup the sensor attributes for every OCC sensor found by the first poll
response. Register the attributes with hwmon.

Signed-off-by: Eddie James 
---
 drivers/hwmon/occ/common.c | 337 +
 drivers/hwmon/occ/common.h |  16 +++
 2 files changed, 353 insertions(+)

diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
index f7220132..c6c8161 100644
--- a/drivers/hwmon/occ/common.c
+++ b/drivers/hwmon/occ/common.c
@@ -1,11 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "common.h"
@@ -641,6 +643,324 @@ static ssize_t occ_show_extended(struct device *dev,
return rc;
 }
 
+/*
+ * Some helper macros to make it easier to define an occ_attribute. Since these
+ * are dynamically allocated, we shouldn't use the existing kernel macros which
+ * stringify the name argument.
+ */
+#define ATTR_OCC(_name, _mode, _show, _store) {
\
+   .attr   = { \
+   .name = _name,  \
+   .mode = VERIFY_OCTAL_PERMISSIONS(_mode),\
+   },  \
+   .show   = _show,\
+   .store  = _store,   \
+}
+
+#define SENSOR_ATTR_OCC(_name, _mode, _show, _store, _nr, _index) {\
+   .dev_attr   = ATTR_OCC(_name, _mode, _show, _store),\
+   .index  = _index,   \
+   .nr = _nr,  \
+}
+
+#define OCC_INIT_ATTR(_name, _mode, _show, _store, _nr, _index)
\
+   ((struct sensor_device_attribute_2) \
+   SENSOR_ATTR_OCC(_name, _mode, _show, _store, _nr, _index))
+
+/*
+ * Allocate and instatiate sensor_device_attribute_2s. It's most efficient to
+ * use our own instead of the built-in hwmon attribute types.
+ */
+static int occ_setup_sensor_attrs(struct occ *occ)
+{
+   unsigned int i, s, num_attrs = 0;
+   struct device *dev = occ->bus_dev;
+   struct occ_sensors *sensors = &occ->sensors;
+   struct occ_attribute *attr;
+   struct temp_sensor_2 *temp;
+   ssize_t (*show_temp)(struct device *, struct device_attribute *,
+char *) = occ_show_temp_1;
+   ssize_t (*show_freq)(struct device *, struct device_attribute *,
+char *) = occ_show_freq_1;
+   ssize_t (*show_power)(struct device *, struct device_attribute *,
+ char *) = occ_show_power_1;
+   ssize_t (*show_caps)(struct device *, struct device_attribute *,
+char *) = occ_show_caps_1_2;
+
+   switch (sensors->temp.version) {
+   case 1:
+   num_attrs += (sensors->temp.num_sensors * 2);
+   break;
+   case 2:
+   num_attrs += (sensors->temp.num_sensors * 4);
+   show_temp = occ_show_temp_2;
+   break;
+   default:
+   sensors->temp.num_sensors = 0;
+   }
+
+   switch (sensors->freq.version) {
+   case 2:
+   show_freq = occ_show_freq_2;
+   /* fall through */
+   case 1:
+   num_attrs += (sensors->freq.num_sensors * 2);
+   break;
+   default:
+   sensors->freq.num_sensors = 0;
+   }
+
+   switch (sensors->power.version) {
+   case 2:
+   show_power = occ_show_power_2;
+   /* fall through */
+   case 1:
+   num_attrs += (sensors->power.num_sensors * 4);
+   break;
+   case 0xA0:
+   num_attrs += (sensors->power.num_sensors * 16);
+   show_power = occ_show_power_a0;
+   break;
+   default:
+   sensors->power.num_sensors = 0;
+   }
+
+   switch (sensors->caps.version) {
+   case 1:
+   num_attrs += (sensors->caps.num_sensors * 7);
+   break;
+   case 3:
+   show_caps = occ_show_caps_3;
+   /* fall through */
+   case 2:
+   num_attrs += (sensors->caps.num_sensors * 8);
+   break;
+   default:
+   sensors->caps.num_sensors = 0;
+   }
+
+   switch (sensors->extended.version) {
+   case 1:
+   num_attrs += (sensors->extended.num_sensors * 3);
+   break;
+   default:
+   sensors->extended.num_sensors = 0;
+   }
+
+   occ->attrs = devm_kzalloc(dev, sizeof(*occ->attrs) * num_attrs,
+ GFP_KERNEL);
+   if (!occ->attrs)
+   return -ENOMEM;
+
+   /* null-terminated list */
+   o

[PATCH v6 00/10] hwmon and fsi: Add On-Chip Controller Driver

2018-11-08 Thread Eddie James
From: Eddie James 

This series adds a hwmon driver to support the OCC on POWER8 and POWER9
processors. The OCC is an embedded processor that provides realtime power and
thermal monitoring and management.

The series also adds a "bus" driver to handle atomic communication between the
service processor and the OCC on a POWER9 chip. This communication takes place
over FSI bus to the SBE (Self-Boot engine) FIFO, which in turn communicates
with the OCC. The driver for the SBEFIFO is already available as an FSI client
driver.

For POWER8 OCCs, communication between the service processor and the OCC is
achieved over I2C bus.

Changes since v5:
 * Makefile fix when compiling both P8 and P9 versions
 * Spelling fix in hwmon doc
 * Added an additional sentence for P9 binding doc to explain that OCC isn't
   an FSI slave device.

Changes since v4:
 * Make the hwmon attributes conform almost completely to standard names and
   values. The only exception is powerX_cap_user and powerX_cap_user_source.
 * Improve hwmon documentation.
 * Add ibm,p9-occ dt documentation.

Changes since v3:
 * Add the FSI OCC driver.
 * Pull the sysfs attribute code into it's own file for cleanliness.
 * Various fixes for attribute creation and integer overflow.

Changes since v2:
 * Add sysfs_notify for the error and throttling attributes when change is
   detected.
 * Removed occs_present counting of devices bound.
 * Improved remove() of P9 driver to avoid bad behavior with relation to OCC
   driver when unbound.
 * Added default cases (return EINVAL) for all sensor show functions.
 * Added temperature fault sensor.
 * Added back dt binding documentation for P9 to address checkpatch warning.
 * Added occs_present attribute from the poll response.

Changes since v1:
 * Remove wait loop in P9 code, as that is now handled by FSI OCC driver.
 * Removed dt binding documentation for P9, FSI OCC driver will probe OCC hwmon
   driver automatically.
 * Moved OCC response code definitions to the OCC include file.
 * Fixed includes.
 * Changed some structure fields to __beXX as that is what they are.
 * Changed some errnos.
 * Removed some dev_err().
 * Refactored P8 code a bit to use #defined addresses and magic values, and
   changed "goto retry" to a loop.
 * Refactored error handling a bit.

Eddie James (10):
  dt-bindings: fsi: Add P9 OCC device documentation
  fsi: Add On-Chip Controller (OCC) driver
  Documentation: hwmon: Add OCC documentation
  dt-bindings: i2c: Add P8 OCC hwmon device documentation
  hwmon: Add On-Chip Controller (OCC) hwmon driver
  hwmon (occ): Add command transport method for P8 and P9
  hwmon (occ): Parse OCC poll response
  hwmon (occ): Add sensor types and versions
  hwmon (occ): Add sensor attributes and register hwmon device
  hwmon (occ): Add sysfs attributes for additional OCC data

 .../devicetree/bindings/fsi/ibm,p9-occ.txt |   16 +
 .../devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt   |   25 +
 Documentation/hwmon/occ|  112 ++
 drivers/fsi/Kconfig|   10 +
 drivers/fsi/Makefile   |1 +
 drivers/fsi/fsi-occ.c  |  599 +++
 drivers/hwmon/Kconfig  |2 +
 drivers/hwmon/Makefile |1 +
 drivers/hwmon/occ/Kconfig  |   31 +
 drivers/hwmon/occ/Makefile |5 +
 drivers/hwmon/occ/common.c | 1098 
 drivers/hwmon/occ/common.h |  128 +++
 drivers/hwmon/occ/p8_i2c.c |  255 +
 drivers/hwmon/occ/p9_sbe.c |  106 ++
 drivers/hwmon/occ/sysfs.c  |  188 
 include/linux/fsi-occ.h|   25 +
 16 files changed, 2602 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fsi/ibm,p9-occ.txt
 create mode 100644 Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt
 create mode 100644 Documentation/hwmon/occ
 create mode 100644 drivers/fsi/fsi-occ.c
 create mode 100644 drivers/hwmon/occ/Kconfig
 create mode 100644 drivers/hwmon/occ/Makefile
 create mode 100644 drivers/hwmon/occ/common.c
 create mode 100644 drivers/hwmon/occ/common.h
 create mode 100644 drivers/hwmon/occ/p8_i2c.c
 create mode 100644 drivers/hwmon/occ/p9_sbe.c
 create mode 100644 drivers/hwmon/occ/sysfs.c
 create mode 100644 include/linux/fsi-occ.h

-- 
1.8.3.1



[PATCH v6 10/10] hwmon (occ): Add sysfs attributes for additional OCC data

2018-11-08 Thread Eddie James
From: Eddie James 

The OCC provides a variety of additional information about the state of
the host processor, such as throttling, error conditions, and the number
of OCCs detected in the system. This information is essential to service
processor applications such as fan control and host management.
Therefore, export this data in the form of sysfs attributes attached to
the platform device (to which the hwmon device is also attached).

Signed-off-by: Eddie James 
---
 drivers/hwmon/occ/Makefile |   4 +-
 drivers/hwmon/occ/common.c |  45 ++-
 drivers/hwmon/occ/common.h |  17 
 drivers/hwmon/occ/p8_i2c.c |  10 +++
 drivers/hwmon/occ/p9_sbe.c |   1 +
 drivers/hwmon/occ/sysfs.c  | 188 +
 6 files changed, 260 insertions(+), 5 deletions(-)
 create mode 100644 drivers/hwmon/occ/sysfs.c

diff --git a/drivers/hwmon/occ/Makefile b/drivers/hwmon/occ/Makefile
index 57c0e91..3fec12d 100644
--- a/drivers/hwmon/occ/Makefile
+++ b/drivers/hwmon/occ/Makefile
@@ -1,5 +1,5 @@
-occ-p8-hwmon-objs := common.o p8_i2c.o
-occ-p9-hwmon-objs := common.o p9_sbe.o
+occ-p8-hwmon-objs := common.o sysfs.o p8_i2c.o
+occ-p9-hwmon-objs := common.o sysfs.o p9_sbe.o
 
 obj-$(CONFIG_SENSORS_OCC_P8_I2C) += occ-p8-hwmon.o
 obj-$(CONFIG_SENSORS_OCC_P9_SBE) += occ-p9-hwmon.o
diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
index c6c8161..423903f 100644
--- a/drivers/hwmon/occ/common.c
+++ b/drivers/hwmon/occ/common.c
@@ -14,6 +14,11 @@
 
 #define EXTN_FLAG_SENSOR_IDBIT(7)
 
+#define OCC_ERROR_COUNT_THRESHOLD  2   /* required by OCC spec */
+
+#define OCC_STATE_SAFE 4
+#define OCC_SAFE_TIMEOUT   msecs_to_jiffies(6) /* 1 min */
+
 #define OCC_UPDATE_FREQUENCY   msecs_to_jiffies(1000)
 
 #define OCC_TEMP_SENSOR_FAULT  0xFF
@@ -115,8 +120,10 @@ struct extended_sensor {
 
 static int occ_poll(struct occ *occ)
 {
+   int rc;
u16 checksum = occ->poll_cmd_data + 1;
u8 cmd[8];
+   struct occ_poll_response_header *header;
 
/* big endian */
cmd[0] = 0; /* sequence number */
@@ -129,7 +136,35 @@ static int occ_poll(struct occ *occ)
cmd[7] = 0;
 
/* mutex should already be locked if necessary */
-   return occ->send_cmd(occ, cmd);
+   rc = occ->send_cmd(occ, cmd);
+   if (rc) {
+   if (occ->error_count++ > OCC_ERROR_COUNT_THRESHOLD)
+   occ->error = rc;
+
+   goto done;
+   }
+
+   /* clear error since communication was successful */
+   occ->error_count = 0;
+   occ->error = 0;
+
+   /* check for safe state */
+   header = (struct occ_poll_response_header *)occ->resp.data;
+   if (header->occ_state == OCC_STATE_SAFE) {
+   if (occ->last_safe) {
+   if (time_after(jiffies,
+  occ->last_safe + OCC_SAFE_TIMEOUT))
+   occ->error = -EHOSTDOWN;
+   } else {
+   occ->last_safe = jiffies;
+   }
+   } else {
+   occ->last_safe = 0;
+   }
+
+done:
+   occ_sysfs_poll_done(occ);
+   return rc;
 }
 
 static int occ_set_user_power_cap(struct occ *occ, u16 user_power_cap)
@@ -161,7 +196,7 @@ static int occ_set_user_power_cap(struct occ *occ, u16 
user_power_cap)
return rc;
 }
 
-static int occ_update_response(struct occ *occ)
+int occ_update_response(struct occ *occ)
 {
int rc = mutex_lock_interruptible(&occ->lock);
 
@@ -1055,5 +1090,9 @@ int occ_setup(struct occ *occ, const char *name)
return rc;
}
 
-   return 0;
+   rc = occ_setup_sysfs(occ);
+   if (rc)
+   dev_err(occ->bus_dev, "failed to setup sysfs: %d\n", rc);
+
+   return rc;
 }
diff --git a/drivers/hwmon/occ/common.h b/drivers/hwmon/occ/common.h
index 00ac101..da80e65 100644
--- a/drivers/hwmon/occ/common.h
+++ b/drivers/hwmon/occ/common.h
@@ -104,8 +104,25 @@ struct occ {
struct occ_attribute *attrs;
struct attribute_group group;
const struct attribute_group *groups[2];
+
+   int error;  /* latest transfer error */
+   unsigned int error_count;   /* number of xfr errors observed */
+   unsigned long last_safe;/* time OCC entered "safe" state */
+
+   /*
+* Store the previous state data for comparison in order to notify
+* sysfs readers of state changes.
+*/
+   int prev_error;
+   u8 prev_stat;
+   u8 prev_ext_stat;
+   u8 prev_occs_present;
 };
 
 int occ_setup(struct occ *occ, const char *name);
+int occ_setup_sysfs(struct occ *occ);
+void occ_shutdown(struct occ *occ);
+void occ_sysfs_poll_done(struct occ *occ);
+int occ_update_response(struct occ *occ);
 
 #endif /* OCC_COMMON_H */
diff --git a/drivers/hwmon/occ/p8_i2c.c b/drivers/hwmon/occ/p8_i2c.c
index e4c2c04..b5

[PATCH v6 03/10] Documentation: hwmon: Add OCC documentation

2018-11-08 Thread Eddie James
From: Eddie James 

Document the hwmon interface for the OCC.

Signed-off-by: Eddie James 
---
 Documentation/hwmon/occ | 112 
 1 file changed, 112 insertions(+)
 create mode 100644 Documentation/hwmon/occ

diff --git a/Documentation/hwmon/occ b/Documentation/hwmon/occ
new file mode 100644
index 000..e787596
--- /dev/null
+++ b/Documentation/hwmon/occ
@@ -0,0 +1,112 @@
+Kernel driver occ-hwmon
+===
+
+Supported chips:
+  * POWER8
+  * POWER9
+
+Author: Eddie James 
+
+Description
+---
+
+This driver supports hardware monitoring for the On-Chip Controller (OCC)
+embedded on POWER processors. The OCC is a device that collects and aggregates
+sensor data from the processor and the system. The OCC can provide the raw
+sensor data as well as perform thermal and power management on the system.
+
+The P8 version of this driver is a client driver of I2C. It may be probed
+manually if an "ibm,p8-occ-hwmon" compatible device is found under the
+appropriate I2C bus node in the device-tree.
+
+The P9 version of this driver is a client driver of the FSI-based OCC driver.
+It will be probed automatically by the FSI-based OCC driver.
+
+Sysfs entries
+-
+
+The following attributes are supported. All attributes are read-only unless
+specified.
+
+The OCC sensor ID is an integer that represents the unique identifier of the
+sensor with respect to the OCC. For example, a temperature sensor for the third
+DIMM slot in the system may have a sensor ID of 7. This mapping is unavailable
+to the device driver, which must therefore export the sensor ID as-is.
+
+Some entries are only present with certain OCC sensor versions or only on
+certain OCCs in the system. The version number is not exported to the user
+but can be inferred.
+
+temp[1-n]_labelOCC sensor ID.
+[with temperature sensor version 1]
+temp[1-n]_inputMeasured temperature of the component in millidegrees
+   Celsius.
+[with temperature sensor version >= 2]
+temp[1-n]_type The FRU (Field Replaceable Unit) type
+   (represented by an integer) for the component
+   that this sensor measures.
+temp[1-n]_faultTemperature sensor fault boolean; 1 to indicate
+   that a fault is present or 0 to indicate that
+   no fault is present.
+[with type == 3 (FRU type is VRM)]
+temp[1-n]_alarmVRM temperature alarm boolean; 1 to 
indicate
+   alarm, 0 to indicate no alarm
+[else]
+temp[1-n]_inputMeasured temperature of the component in
+   millidegrees Celsius.
+
+freq[1-n]_labelOCC sensor ID.
+freq[1-n]_inputMeasured frequency of the component in MHz.
+
+power[1-n]_input   Latest measured power reading of the component in
+   microwatts.
+power[1-n]_average Average power of the component in microwatts.
+power[1-n]_average_intervalThe amount of time over which the power average
+   was taken in microseconds.
+[with power sensor version < 2]
+power[1-n]_label   OCC sensor ID.
+[with power sensor version >= 2]
+power[1-n]_label   OCC sensor ID + function ID + channel in the form
+   of a string, delimited by underscores, i.e. "0_15_1".
+   Both the function ID and channel are integers that
+   further identify the power sensor.
+[with power sensor version 0xa0]
+power[1-n]_label   OCC sensor ID + sensor type in the form of a string,
+   delimited by an underscore, i.e. "0_system". Sensor
+   type will be one of "system", "proc", "vdd" or "vdn".
+   For this sensor version, OCC sensor ID will be the same
+   for all power sensors.
+[present only on "master" OCC; represents the whole system power; only one of
+ this type of power sensor will be present]
+power[1-n]_label   "system"
+power[1-n]_input   Latest system output power in microwatts.
+power[1-n]_cap Current system power cap in microwatts.
+power[1-n]_cap_not_redundant   System power cap in microwatts when
+   there is not redundant power.
+power[1-n]_cap_max Maximum power cap that the OCC can enforce in
+   microwatts.
+power[1-n]_cap_min Minimum power cap that the OCC can enforce in
+   microwatts.
+power[1-n]_cap_userThe power cap set by the user, in 
microwatts.
+   This attribute will return 0 if no user power
+   cap has been set. This attribute is read-write,
+

[PATCH v6 02/10] fsi: Add On-Chip Controller (OCC) driver

2018-11-08 Thread Eddie James
From: Eddie James 

The OCC is a device embedded on a POWER processor that collects and
aggregates sensor data from the processor and system. The OCC can
provide the raw sensor data as well as perform thermal and power
management on the system.

This driver provides an atomic communications channel between a service
processor (e.g. a BMC) and the OCC. The driver is dependent on the FSI
SBEFIFO driver to get hardware access through the SBE to the OCC SRAM.
Commands are issued to the SBE to send or fetch data to the SRAM.

Signed-off-by: Eddie James 
Signed-off-by: Andrew Jeffery 
Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Joel Stanley 
---
 drivers/fsi/Kconfig |  10 +
 drivers/fsi/Makefile|   1 +
 drivers/fsi/fsi-occ.c   | 599 
 include/linux/fsi-occ.h |  25 ++
 4 files changed, 635 insertions(+)
 create mode 100644 drivers/fsi/fsi-occ.c
 create mode 100644 include/linux/fsi-occ.h

diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
index af3a20d..ea2f4a1 100644
--- a/drivers/fsi/Kconfig
+++ b/drivers/fsi/Kconfig
@@ -64,4 +64,14 @@ config FSI_SBEFIFO
a pipe-like FSI device for communicating with the self boot engine
(SBE) on POWER processors.
 
+config FSI_OCC
+   tristate "OCC SBEFIFO client device driver"
+   depends on FSI_SBEFIFO
+   ---help---
+   This option enables an SBEFIFO based On-Chip Controller (OCC) device
+   driver. The OCC is a device embedded on a POWER processor that collects
+   and aggregates sensor data from the processor and system. The OCC can
+   provide the raw sensor data as well as perform thermal and power
+   management on the system.
+
 endif
diff --git a/drivers/fsi/Makefile b/drivers/fsi/Makefile
index a50d6ce..62687ec 100644
--- a/drivers/fsi/Makefile
+++ b/drivers/fsi/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_FSI_MASTER_GPIO) += fsi-master-gpio.o
 obj-$(CONFIG_FSI_MASTER_AST_CF) += fsi-master-ast-cf.o
 obj-$(CONFIG_FSI_SCOM) += fsi-scom.o
 obj-$(CONFIG_FSI_SBEFIFO) += fsi-sbefifo.o
+obj-$(CONFIG_FSI_OCC) += fsi-occ.o
diff --git a/drivers/fsi/fsi-occ.c b/drivers/fsi/fsi-occ.c
new file mode 100644
index 000..a2301ce
--- /dev/null
+++ b/drivers/fsi/fsi-occ.c
@@ -0,0 +1,599 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define OCC_SRAM_BYTES 4096
+#define OCC_CMD_DATA_BYTES 4090
+#define OCC_RESP_DATA_BYTES4089
+
+#define OCC_SRAM_CMD_ADDR  0xFFFBE000
+#define OCC_SRAM_RSP_ADDR  0xFFFBF000
+
+/*
+ * Assume we don't have much FFDC, if we do we'll overflow and
+ * fail the command. This needs to be big enough for simple
+ * commands as well.
+ */
+#define OCC_SBE_STATUS_WORDS   32
+
+#define OCC_TIMEOUT_MS 1000
+#define OCC_CMD_IN_PRG_WAIT_MS 50
+
+struct occ {
+   struct device *dev;
+   struct device *sbefifo;
+   char name[32];
+   int idx;
+   struct miscdevice mdev;
+   struct mutex occ_lock;
+};
+
+#define to_occ(x)  container_of((x), struct occ, mdev)
+
+struct occ_response {
+   u8 seq_no;
+   u8 cmd_type;
+   u8 return_status;
+   __be16 data_length;
+   u8 data[OCC_RESP_DATA_BYTES + 2];   /* two bytes checksum */
+} __packed;
+
+struct occ_client {
+   struct occ *occ;
+   struct mutex lock;
+   size_t data_size;
+   size_t read_offset;
+   u8 *buffer;
+};
+
+#define to_client(x)   container_of((x), struct occ_client, xfr)
+
+static DEFINE_IDA(occ_ida);
+
+static int occ_open(struct inode *inode, struct file *file)
+{
+   struct occ_client *client = kzalloc(sizeof(*client), GFP_KERNEL);
+   struct miscdevice *mdev = file->private_data;
+   struct occ *occ = to_occ(mdev);
+
+   if (!client)
+   return -ENOMEM;
+
+   client->buffer = (u8 *)__get_free_page(GFP_KERNEL);
+   if (!client->buffer) {
+   kfree(client);
+   return -ENOMEM;
+   }
+
+   client->occ = occ;
+   mutex_init(&client->lock);
+   file->private_data = client;
+
+   /* We allocate a 1-page buffer, make sure it all fits */
+   BUILD_BUG_ON((OCC_CMD_DATA_BYTES + 3) > PAGE_SIZE);
+   BUILD_BUG_ON((OCC_RESP_DATA_BYTES + 7) > PAGE_SIZE);
+
+   return 0;
+}
+
+static ssize_t occ_read(struct file *file, char __user *buf, size_t len,
+   loff_t *offset)
+{
+   struct occ_client *client = file->private_data;
+   ssize_t rc = 0;
+
+   if (!client)
+   return -ENODEV;
+
+   if (len > OCC_SRAM_BYTES)
+   return -EINVAL;
+
+   mutex_lock(&client->lock);
+
+   /* This should not be possible ... */
+   if (WARN_ON_ONCE(client->read_offset > client->data_size)) {
+   rc = -EIO;
+   goto done;
+   }
+
+   /*

[PATCH v6 08/10] hwmon (occ): Add sensor types and versions

2018-11-08 Thread Eddie James
From: Eddie James 

Add structures to define all sensor types and versions. Add sysfs show
and store functions for each sensor type. Add a method to construct the
"set user power cap" command and send it to the OCC. Add rate limit to
polling the OCC (in case user-space reads our hwmon entries rapidly).

Signed-off-by: Eddie James 
---
 drivers/hwmon/occ/common.c | 621 +
 drivers/hwmon/occ/common.h |   6 +
 drivers/hwmon/occ/p8_i2c.c |   1 +
 drivers/hwmon/occ/p9_sbe.c |   1 +
 4 files changed, 629 insertions(+)

diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
index a066509..f7220132 100644
--- a/drivers/hwmon/occ/common.c
+++ b/drivers/hwmon/occ/common.c
@@ -1,10 +1,116 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include 
+#include 
+#include 
 #include 
+#include 
+#include 
+#include 
 
 #include "common.h"
 
+#define EXTN_FLAG_SENSOR_IDBIT(7)
+
+#define OCC_UPDATE_FREQUENCY   msecs_to_jiffies(1000)
+
+#define OCC_TEMP_SENSOR_FAULT  0xFF
+
+#define OCC_FRU_TYPE_VRM   3
+
+/* OCC sensor type and version definitions */
+
+struct temp_sensor_1 {
+   u16 sensor_id;
+   u16 value;
+} __packed;
+
+struct temp_sensor_2 {
+   u32 sensor_id;
+   u8 fru_type;
+   u8 value;
+} __packed;
+
+struct freq_sensor_1 {
+   u16 sensor_id;
+   u16 value;
+} __packed;
+
+struct freq_sensor_2 {
+   u32 sensor_id;
+   u16 value;
+} __packed;
+
+struct power_sensor_1 {
+   u16 sensor_id;
+   u32 update_tag;
+   u32 accumulator;
+   u16 value;
+} __packed;
+
+struct power_sensor_2 {
+   u32 sensor_id;
+   u8 function_id;
+   u8 apss_channel;
+   u16 reserved;
+   u32 update_tag;
+   u64 accumulator;
+   u16 value;
+} __packed;
+
+struct power_sensor_data {
+   u16 value;
+   u32 update_tag;
+   u64 accumulator;
+} __packed;
+
+struct power_sensor_data_and_time {
+   u16 update_time;
+   u16 value;
+   u32 update_tag;
+   u64 accumulator;
+} __packed;
+
+struct power_sensor_a0 {
+   u32 sensor_id;
+   struct power_sensor_data_and_time system;
+   u32 reserved;
+   struct power_sensor_data_and_time proc;
+   struct power_sensor_data vdd;
+   struct power_sensor_data vdn;
+} __packed;
+
+struct caps_sensor_2 {
+   u16 cap;
+   u16 system_power;
+   u16 n_cap;
+   u16 max;
+   u16 min;
+   u16 user;
+   u8 user_source;
+} __packed;
+
+struct caps_sensor_3 {
+   u16 cap;
+   u16 system_power;
+   u16 n_cap;
+   u16 max;
+   u16 hard_min;
+   u16 soft_min;
+   u16 user;
+   u8 user_source;
+} __packed;
+
+struct extended_sensor {
+   union {
+   u8 name[4];
+   u32 sensor_id;
+   };
+   u8 flags;
+   u8 reserved;
+   u8 data[6];
+} __packed;
+
 static int occ_poll(struct occ *occ)
 {
u16 checksum = occ->poll_cmd_data + 1;
@@ -20,9 +126,521 @@ static int occ_poll(struct occ *occ)
cmd[6] = checksum & 0xFF;   /* checksum lsb */
cmd[7] = 0;
 
+   /* mutex should already be locked if necessary */
return occ->send_cmd(occ, cmd);
 }
 
+static int occ_set_user_power_cap(struct occ *occ, u16 user_power_cap)
+{
+   int rc;
+   u8 cmd[8];
+   u16 checksum = 0x24;
+   __be16 user_power_cap_be = cpu_to_be16(user_power_cap);
+
+   cmd[0] = 0;
+   cmd[1] = 0x22;
+   cmd[2] = 0;
+   cmd[3] = 2;
+
+   memcpy(&cmd[4], &user_power_cap_be, 2);
+
+   checksum += cmd[4] + cmd[5];
+   cmd[6] = checksum >> 8;
+   cmd[7] = checksum & 0xFF;
+
+   rc = mutex_lock_interruptible(&occ->lock);
+   if (rc)
+   return rc;
+
+   rc = occ->send_cmd(occ, cmd);
+
+   mutex_unlock(&occ->lock);
+
+   return rc;
+}
+
+static int occ_update_response(struct occ *occ)
+{
+   int rc = mutex_lock_interruptible(&occ->lock);
+
+   if (rc)
+   return rc;
+
+   /* limit the maximum rate of polling the OCC */
+   if (time_after(jiffies, occ->last_update + OCC_UPDATE_FREQUENCY)) {
+   rc = occ_poll(occ);
+   occ->last_update = jiffies;
+   }
+
+   mutex_unlock(&occ->lock);
+   return rc;
+}
+
+static ssize_t occ_show_temp_1(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   int rc;
+   u32 val = 0;
+   struct temp_sensor_1 *temp;
+   struct occ *occ = dev_get_drvdata(dev);
+   struct occ_sensors *sensors = &occ->sensors;
+   struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr);
+
+   rc = occ_update_response(occ);
+   if (rc)
+   return rc;
+
+   temp = ((struct temp_sensor_1 *)sensors->temp.data) + sattr->index;
+
+   switch (sattr->nr) {
+   case 0:
+   val = get_unaligned_be16(&temp->sensor_id);
+   break;
+   case 1:
+ 

[PATCH v6 04/10] dt-bindings: i2c: Add P8 OCC hwmon device documentation

2018-11-08 Thread Eddie James
From: Eddie James 

Document the bindings for I2C-based OCC hwmon device.

Signed-off-by: Eddie James 
Acked-by: Rob Herring 
---
 .../devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt   | 25 ++
 1 file changed, 25 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt

diff --git a/Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt 
b/Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt
new file mode 100644
index 000..5dc5d2e
--- /dev/null
+++ b/Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt
@@ -0,0 +1,25 @@
+Device-tree bindings for I2C-based On-Chip Controller hwmon device
+--
+
+Required properties:
+ - compatible = "ibm,p8-occ-hwmon";
+ - reg = ;: I2C bus address
+
+Examples:
+
+i2c-bus@100 {
+#address-cells = <1>;
+#size-cells = <0>;
+clock-frequency = <10>;
+< more properties >
+
+occ-hwmon@1 {
+compatible = "ibm,p8-occ-hwmon";
+reg = <0x50>;
+};
+
+occ-hwmon@2 {
+compatible = "ibm,p8-occ-hwmon";
+reg = <0x51>;
+};
+};
-- 
1.8.3.1



Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Yu-cheng Yu
On Thu, 2018-11-08 at 12:46 -0800, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 8:20 AM Yu-cheng Yu  wrote:
> > 
> > Intel Control-flow Enforcement Technology (CET) introduces the
> > following MSRs into the XSAVES system states.
> > 
> > IA32_U_CET (user-mode CET settings),
> > IA32_PL3_SSP (user-mode shadow stack),
> > IA32_PL0_SSP (kernel-mode shadow stack),
> > IA32_PL1_SSP (ring-1 shadow stack),
> > IA32_PL2_SSP (ring-2 shadow stack).
> > 
> > Signed-off-by: Yu-cheng Yu 
> > ---
> >  arch/x86/include/asm/fpu/types.h| 22 +
> >  arch/x86/include/asm/fpu/xstate.h   |  4 +++-
> >  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
> >  arch/x86/kernel/fpu/xstate.c| 10 ++
> >  4 files changed, 37 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/fpu/types.h
> > b/arch/x86/include/asm/fpu/types.h
> > index 202c53918ecf..e55d51d172f1 100644
> > --- a/arch/x86/include/asm/fpu/types.h
> > +++ b/arch/x86/include/asm/fpu/types.h
> > @@ -114,6 +114,9 @@ enum xfeature {
> > XFEATURE_Hi16_ZMM,
> > XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
> > XFEATURE_PKRU,
> > +   XFEATURE_RESERVED,
> > +   XFEATURE_SHSTK_USER,
> > +   XFEATURE_SHSTK_KERNEL,
> > 
> > XFEATURE_MAX,
> >  };
> > @@ -128,6 +131,8 @@ enum xfeature {
> >  #define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
> >  #define XFEATURE_MASK_PT   (1 <<
> > XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
> >  #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
> > +#define XFEATURE_MASK_SHSTK_USER   (1 << XFEATURE_SHSTK_USER)
> > +#define XFEATURE_MASK_SHSTK_KERNEL (1 << XFEATURE_SHSTK_KERNEL)
> > 
> >  #define XFEATURE_MASK_FPSSE(XFEATURE_MASK_FP |
> > XFEATURE_MASK_SSE)
> >  #define XFEATURE_MASK_AVX512   (XFEATURE_MASK_OPMASK \
> > @@ -229,6 +234,23 @@ struct pkru_state {
> > u32 pad;
> >  } __packed;
> > 
> > +/*
> > + * State component 11 is Control flow Enforcement user states
> > + */
> > +struct cet_user_state {
> > +   u64 u_cet;  /* user control flow settings */
> > +   u64 user_ssp;   /* user shadow stack pointer */
> > +} __packed;
> > +
> > +/*
> > + * State component 12 is Control flow Enforcement kernel states
> > + */
> > +struct cet_kernel_state {
> > +   u64 kernel_ssp; /* kernel shadow stack */
> > +   u64 pl1_ssp;/* ring-1 shadow stack */
> > +   u64 pl2_ssp;/* ring-2 shadow stack */
> > +} __packed;
> > +
> 
> Why are these __packed?  It seems like it'll generate bad code for no
> obvious purpose.

That prevents any possibility that the compiler will insert padding, although in
64-bit kernel this should not happen to either struct.  Also all xstate
components here are packed.

Yu-cheng


[PATCH v6 06/10] hwmon (occ): Add command transport method for P8 and P9

2018-11-08 Thread Eddie James
From: Eddie James 

For the P8 OCC, add the procedure to send a command to the OCC over I2C
bus. This involves writing the OCC command registers with serial
communication operations (SCOMs) interpreted by the I2C slave. For the
P9 OCC, add a procedure to use the OCC in-kernel API to send a command
to the OCC through the SBE.

Signed-off-by: Eddie James 
---
 drivers/hwmon/occ/p8_i2c.c | 185 -
 drivers/hwmon/occ/p9_sbe.c |  38 +-
 2 files changed, 221 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/occ/p8_i2c.c b/drivers/hwmon/occ/p8_i2c.c
index 4f3937d..e3326ff 100644
--- a/drivers/hwmon/occ/p8_i2c.c
+++ b/drivers/hwmon/occ/p8_i2c.c
@@ -2,11 +2,29 @@
 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
+#include 
 
 #include "common.h"
 
+#define OCC_TIMEOUT_MS 1000
+#define OCC_CMD_IN_PRG_WAIT_MS 50
+
+/* OCB (on-chip control bridge - interface to OCC) registers */
+#define OCB_DATA1  0x6B035
+#define OCB_ADDR   0x6B070
+#define OCB_DATA3  0x6B075
+
+/* OCC SRAM address space */
+#define OCC_SRAM_ADDR_CMD  0x6000
+#define OCC_SRAM_ADDR_RESP 0x7000
+
+#define OCC_DATA_ATTN  0x2001
+
 struct p8_i2c_occ {
struct occ occ;
struct i2c_client *client;
@@ -14,9 +32,174 @@ struct p8_i2c_occ {
 
 #define to_p8_i2c_occ(x)   container_of((x), struct p8_i2c_occ, occ)
 
+static int p8_i2c_occ_getscom(struct i2c_client *client, u32 address, u8 *data)
+{
+   ssize_t rc;
+   __be64 buf;
+   struct i2c_msg msgs[2];
+
+   /* p8 i2c slave requires shift */
+   address <<= 1;
+
+   msgs[0].addr = client->addr;
+   msgs[0].flags = client->flags & I2C_M_TEN;
+   msgs[0].len = sizeof(u32);
+   /* address is a scom address; bus-endian */
+   msgs[0].buf = (char *)&address;
+
+   /* data from OCC is big-endian */
+   msgs[1].addr = client->addr;
+   msgs[1].flags = (client->flags & I2C_M_TEN) | I2C_M_RD;
+   msgs[1].len = sizeof(u64);
+   msgs[1].buf = (char *)&buf;
+
+   rc = i2c_transfer(client->adapter, msgs, 2);
+   if (rc < 0)
+   return rc;
+
+   *(u64 *)data = be64_to_cpu(buf);
+
+   return 0;
+}
+
+static int p8_i2c_occ_putscom(struct i2c_client *client, u32 address, u8 *data)
+{
+   u32 buf[3];
+   ssize_t rc;
+
+   /* p8 i2c slave requires shift */
+   address <<= 1;
+
+   /* address is bus-endian; data passed through from user as-is */
+   buf[0] = address;
+   memcpy(&buf[1], &data[4], sizeof(u32));
+   memcpy(&buf[2], data, sizeof(u32));
+
+   rc = i2c_master_send(client, (const char *)buf, sizeof(buf));
+   if (rc < 0)
+   return rc;
+   else if (rc != sizeof(buf))
+   return -EIO;
+
+   return 0;
+}
+
+static int p8_i2c_occ_putscom_u32(struct i2c_client *client, u32 address,
+ u32 data0, u32 data1)
+{
+   u8 buf[8];
+
+   memcpy(buf, &data0, 4);
+   memcpy(buf + 4, &data1, 4);
+
+   return p8_i2c_occ_putscom(client, address, buf);
+}
+
+static int p8_i2c_occ_putscom_be(struct i2c_client *client, u32 address,
+u8 *data)
+{
+   __be32 data0, data1;
+
+   memcpy(&data0, data, 4);
+   memcpy(&data1, data + 4, 4);
+
+   return p8_i2c_occ_putscom_u32(client, address, be32_to_cpu(data0),
+ be32_to_cpu(data1));
+}
+
 static int p8_i2c_occ_send_cmd(struct occ *occ, u8 *cmd)
 {
-   return -EOPNOTSUPP;
+   int i, rc;
+   unsigned long start;
+   u16 data_length;
+   const unsigned long timeout = msecs_to_jiffies(OCC_TIMEOUT_MS);
+   const long wait_time = msecs_to_jiffies(OCC_CMD_IN_PRG_WAIT_MS);
+   struct p8_i2c_occ *ctx = to_p8_i2c_occ(occ);
+   struct i2c_client *client = ctx->client;
+   struct occ_response *resp = &occ->resp;
+
+   start = jiffies;
+
+   /* set sram address for command */
+   rc = p8_i2c_occ_putscom_u32(client, OCB_ADDR, OCC_SRAM_ADDR_CMD, 0);
+   if (rc)
+   return rc;
+
+   /* write command (expected to already be BE), we need bus-endian... */
+   rc = p8_i2c_occ_putscom_be(client, OCB_DATA3, cmd);
+   if (rc)
+   return rc;
+
+   /* trigger OCC attention */
+   rc = p8_i2c_occ_putscom_u32(client, OCB_DATA1, OCC_DATA_ATTN, 0);
+   if (rc)
+   return rc;
+
+   do {
+   /* set sram address for response */
+   rc = p8_i2c_occ_putscom_u32(client, OCB_ADDR,
+   OCC_SRAM_ADDR_RESP, 0);
+   if (rc)
+   return rc;
+
+   rc = p8_i2c_occ_getscom(client, OCB_DATA3, (u8 *)resp);
+   if (rc)
+   return rc;
+
+   /* wait for OC

[PATCH v6 05/10] hwmon: Add On-Chip Controller (OCC) hwmon driver

2018-11-08 Thread Eddie James
From: Eddie James 

The OCC is a device embedded on a POWER processor that collects and
aggregates sensor data from the processor and system. The OCC can
provide the raw sensor data as well as perform thermal and power
management on the system.

This driver provides a hwmon interface to the OCC from a service
processor (e.g. a BMC). The driver supports both POWER8 and POWER9 OCCs.
Communications with the POWER8 OCC are established over standard I2C
bus. The driver communicates with the POWER9 OCC through the FSI-based
OCC driver, which handles the lower-level communication details.

This patch lays out the structure of the OCC hwmon driver. There are two
platform drivers, one each for P8 and P9 OCCs. These are probed through
the I2C tree and the FSI-based OCC driver, respectively. The patch also
defines the first common structures and methods between the two OCC
versions.

Signed-off-by: Eddie James 
---
 drivers/hwmon/Kconfig  |  2 ++
 drivers/hwmon/Makefile |  1 +
 drivers/hwmon/occ/Kconfig  | 31 +
 drivers/hwmon/occ/Makefile |  5 
 drivers/hwmon/occ/common.c | 40 +++
 drivers/hwmon/occ/common.h | 34 +++
 drivers/hwmon/occ/p8_i2c.c | 61 +
 drivers/hwmon/occ/p9_sbe.c | 68 ++
 8 files changed, 242 insertions(+)
 create mode 100644 drivers/hwmon/occ/Kconfig
 create mode 100644 drivers/hwmon/occ/Makefile
 create mode 100644 drivers/hwmon/occ/common.c
 create mode 100644 drivers/hwmon/occ/common.h
 create mode 100644 drivers/hwmon/occ/p8_i2c.c
 create mode 100644 drivers/hwmon/occ/p9_sbe.c

diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 81da17a..532a053 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -1293,6 +1293,8 @@ config SENSORS_NSA320
  This driver can also be built as a module. If so, the module
  will be called nsa320-hwmon.
 
+source drivers/hwmon/occ/Kconfig
+
 config SENSORS_PCF8591
tristate "Philips PCF8591 ADC/DAC"
depends on I2C
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index 93f7f41..f5c7b44 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -178,6 +178,7 @@ obj-$(CONFIG_SENSORS_WM831X)+= wm831x-hwmon.o
 obj-$(CONFIG_SENSORS_WM8350)   += wm8350-hwmon.o
 obj-$(CONFIG_SENSORS_XGENE)+= xgene-hwmon.o
 
+obj-$(CONFIG_SENSORS_OCC)  += occ/
 obj-$(CONFIG_PMBUS)+= pmbus/
 
 ccflags-$(CONFIG_HWMON_DEBUG_CHIP) := -DDEBUG
diff --git a/drivers/hwmon/occ/Kconfig b/drivers/hwmon/occ/Kconfig
new file mode 100644
index 000..6668662
--- /dev/null
+++ b/drivers/hwmon/occ/Kconfig
@@ -0,0 +1,31 @@
+#
+# On-Chip Controller configuration
+#
+
+config SENSORS_OCC_P8_I2C
+   tristate "POWER8 OCC through I2C"
+   depends on I2C
+   select SENSORS_OCC
+   help
+This option enables support for monitoring sensors provided by the
+On-Chip Controller (OCC) on a POWER8 processor. Communications with
+the OCC are established through I2C bus.
+
+This driver can also be built as a module. If so, the module will be
+called occ-p8-hwmon.
+
+config SENSORS_OCC_P9_SBE
+   tristate "POWER9 OCC through SBE"
+   depends on FSI_OCC
+   select SENSORS_OCC
+   help
+This option enables support for monitoring sensors provided by the
+On-Chip Controller (OCC) on a POWER9 processor. Communications with
+the OCC are established through SBE fifo on an FSI bus.
+
+This driver can also be built as a module. If so, the module will be
+called occ-p9-hwmon.
+
+config SENSORS_OCC
+   bool "POWER On-Chip Controller"
+   depends on SENSORS_OCC_P8_I2C || SENSORS_OCC_P9_SBE
diff --git a/drivers/hwmon/occ/Makefile b/drivers/hwmon/occ/Makefile
new file mode 100644
index 000..57c0e91
--- /dev/null
+++ b/drivers/hwmon/occ/Makefile
@@ -0,0 +1,5 @@
+occ-p8-hwmon-objs := common.o p8_i2c.o
+occ-p9-hwmon-objs := common.o p9_sbe.o
+
+obj-$(CONFIG_SENSORS_OCC_P8_I2C) += occ-p8-hwmon.o
+obj-$(CONFIG_SENSORS_OCC_P9_SBE) += occ-p9-hwmon.o
diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
new file mode 100644
index 000..81acd4b
--- /dev/null
+++ b/drivers/hwmon/occ/common.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+
+#include "common.h"
+
+static int occ_poll(struct occ *occ)
+{
+   u16 checksum = occ->poll_cmd_data + 1;
+   u8 cmd[8];
+
+   /* big endian */
+   cmd[0] = 0; /* sequence number */
+   cmd[1] = 0; /* cmd type */
+   cmd[2] = 0; /* data length msb */
+   cmd[3] = 1; /* data length lsb */
+   cmd[4] = occ->poll_cmd_data;/* data */
+   cmd[5] = checksum >> 8; /* checksum msb */
+   cmd[6] = checksum & 0xFF;   /* checksum lsb */
+   cmd[7] = 0;
+
+   retur

Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Andy Lutomirski
On Thu, Nov 8, 2018 at 1:06 PM Yu-cheng Yu  wrote:
>
> On Thu, 2018-11-08 at 12:46 -0800, Andy Lutomirski wrote:
> > On Thu, Oct 11, 2018 at 8:20 AM Yu-cheng Yu  wrote:
> > >
> > > Intel Control-flow Enforcement Technology (CET) introduces the
> > > following MSRs into the XSAVES system states.
> > >
> > > IA32_U_CET (user-mode CET settings),
> > > IA32_PL3_SSP (user-mode shadow stack),
> > > IA32_PL0_SSP (kernel-mode shadow stack),
> > > IA32_PL1_SSP (ring-1 shadow stack),
> > > IA32_PL2_SSP (ring-2 shadow stack).
> > >
> > > Signed-off-by: Yu-cheng Yu 
> > > ---
> > >  arch/x86/include/asm/fpu/types.h| 22 +
> > >  arch/x86/include/asm/fpu/xstate.h   |  4 +++-
> > >  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
> > >  arch/x86/kernel/fpu/xstate.c| 10 ++
> > >  4 files changed, 37 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/include/asm/fpu/types.h
> > > b/arch/x86/include/asm/fpu/types.h
> > > index 202c53918ecf..e55d51d172f1 100644
> > > --- a/arch/x86/include/asm/fpu/types.h
> > > +++ b/arch/x86/include/asm/fpu/types.h
> > > @@ -114,6 +114,9 @@ enum xfeature {
> > > XFEATURE_Hi16_ZMM,
> > > XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
> > > XFEATURE_PKRU,
> > > +   XFEATURE_RESERVED,
> > > +   XFEATURE_SHSTK_USER,
> > > +   XFEATURE_SHSTK_KERNEL,
> > >
> > > XFEATURE_MAX,
> > >  };
> > > @@ -128,6 +131,8 @@ enum xfeature {
> > >  #define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
> > >  #define XFEATURE_MASK_PT   (1 <<
> > > XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
> > >  #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
> > > +#define XFEATURE_MASK_SHSTK_USER   (1 << XFEATURE_SHSTK_USER)
> > > +#define XFEATURE_MASK_SHSTK_KERNEL (1 << XFEATURE_SHSTK_KERNEL)
> > >
> > >  #define XFEATURE_MASK_FPSSE(XFEATURE_MASK_FP |
> > > XFEATURE_MASK_SSE)
> > >  #define XFEATURE_MASK_AVX512   (XFEATURE_MASK_OPMASK \
> > > @@ -229,6 +234,23 @@ struct pkru_state {
> > > u32 pad;
> > >  } __packed;
> > >
> > > +/*
> > > + * State component 11 is Control flow Enforcement user states
> > > + */
> > > +struct cet_user_state {
> > > +   u64 u_cet;  /* user control flow settings */
> > > +   u64 user_ssp;   /* user shadow stack pointer */
> > > +} __packed;
> > > +
> > > +/*
> > > + * State component 12 is Control flow Enforcement kernel states
> > > + */
> > > +struct cet_kernel_state {
> > > +   u64 kernel_ssp; /* kernel shadow stack */
> > > +   u64 pl1_ssp;/* ring-1 shadow stack */
> > > +   u64 pl2_ssp;/* ring-2 shadow stack */
> > > +} __packed;
> > > +
> >
> > Why are these __packed?  It seems like it'll generate bad code for no
> > obvious purpose.
>
> That prevents any possibility that the compiler will insert padding, although 
> in
> 64-bit kernel this should not happen to either struct.  Also all xstate
> components here are packed.
>

They both seem like bugs, perhaps.  As I understand it, __packed
removes padding, but it also forces the compiler to expect the fields
to be unaligned even if they are actually aligned.


Re: [PATCH v10 1/4] ipc: Allow boot time extension of IPCMNI from 32k to 8M

2018-11-08 Thread Waiman Long
On 11/06/2018 08:20 AM, Matthew Wilcox wrote:
> On Mon, Nov 05, 2018 at 10:43:43AM -0500, Waiman Long wrote:
>> The maximum number of unique System V IPC identifiers was limited to
>> 32k.  That limit should be big enough for most use cases.
>>
>> However, there are some users out there requesting for more, especially
>> those that are migrating from Solaris which uses 24 bits for unique
>> identifiers. To satisfy the need of those users, a new boot time kernel
>> option "ipcmni_extend" is added to extend the IPCMNI value to 8M. This
>> is a 256X increase which hopefully is big enough for them.
> Why go to 23 bits when people are coming from systems with 24 bits?
> Let's just go to 24 bits.  This happens to fit well with the underlying
> data structure which uses 6 bits per layer of the tree.

Sure. I can move it up to 24 bits leave 7 bits for the sequence number.

>
>> The use of this new option will change the pattern of the IPC identifiers
>> returned by functions like shmget(2). An application that depends on
>> such pattern may not work properly.  So it should only be used if the
>> users really need more than 32k of unique IPC numbers.
> Are there applications out there that rely on the internal structure of
> the IPC identifiers?!

That is a question that may not have a clear answer. Most applications
won't do that, but there are always some outliners that do crazy thing.
So you never know for sure.

>
> How about scrapping all this and just doing the following:
>
> Allocate 24 bits of the ID cyclically.  Increment the top 7 bits of the
> ID every time the cursor wraps.  That's not going to give us a perfect
> progression from 0-2 billion, because it'll skip the ones in use.
> But it'll ensure the ID isn't reused particularly quickly unless the
> application is really using millions of IDs.

Eric Biederman had sent out a patch somewhat like that before. Again,
there is a slight chance that it may break existing applications. So the
question is whether we are willing to take the risk. I won't mind if
upstream decide to go this route.

Cheers,
Longman




Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Cyrill Gorcunov
On Thu, Nov 08, 2018 at 01:22:54PM -0800, Andy Lutomirski wrote:
> > >
> > > Why are these __packed?  It seems like it'll generate bad code for no
> > > obvious purpose.
> >
> > That prevents any possibility that the compiler will insert padding, 
> > although in
> > 64-bit kernel this should not happen to either struct.  Also all xstate
> > components here are packed.
> >
> 
> They both seem like bugs, perhaps.  As I understand it, __packed
> removes padding, but it also forces the compiler to expect the fields
> to be unaligned even if they are actually aligned.

How is that? Andy, mind to point where you get that this
attribute forces compiler to make such assumption?


Re: [PATCH v10 4/4] ipc: Add a cyclic mode for id generation

2018-11-08 Thread Waiman Long
On 11/06/2018 08:02 AM, Matthew Wilcox wrote:
> On Mon, Nov 05, 2018 at 10:43:46AM -0500, Waiman Long wrote:
>> The idea of using the cyclic mode to reduce id reuse came from Manfred
>> Spraul . There may be a little bit of
>> additional memory/performance overhead in doing cyclic id allocation,
>> but it is a slow path anyway and a bit of overhead shouldn't be an issue.
>>
>> This patch differs from his as the cyclic mode is not the default and
>> has to be explicitly opted in for users who want that.
> I really don't like the "make the sysadmin choose" approach.  It's
> shifting blame to someone who has no idea what the tradeoffs are.
> We should try to get it right rather than force more people to learn
> why we're incapable of making the right decision for them.
>
My patch is conservative as there will be no behavior change by default.
I won't mind changing the default to cyclic if we all agree.

Cheers,
Longman




Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Dave Hansen
On 11/8/18 1:22 PM, Andy Lutomirski wrote:
>> +struct cet_kernel_state {
>> +   u64 kernel_ssp; /* kernel shadow stack */
>> +   u64 pl1_ssp;/* ring-1 shadow stack */
>> +   u64 pl2_ssp;/* ring-2 shadow stack */
>> +} __packed;
>> +
> Why are these __packed?  It seems like it'll generate bad code for no
> obvious purpose.

It's a hardware-defined in-memory structure.  Granted, we'd need a
really wonky compiler to make that anything *other* than a nicely-packed
24-byte structure, but the __packed makes it explicit.

It is probably a really useful long-term thing to stop using __packed
and start using "__hw_defined" or something that #defines down to __packed.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Cyrill Gorcunov
On Thu, Nov 08, 2018 at 02:01:42PM -0800, Andy Lutomirski wrote:
> > >
> > > They both seem like bugs, perhaps.  As I understand it, __packed
> > > removes padding, but it also forces the compiler to expect the fields
> > > to be unaligned even if they are actually aligned.
> >
> > How is that? Andy, mind to point where you get that this
> > attribute forces compiler to make such assumption?
> 
> It's from memory.  But gcc seems to agree with me I compiled this:
> 

Indeed, thanks!


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Andy Lutomirski
On Thu, Nov 8, 2018 at 1:31 PM Cyrill Gorcunov  wrote:
>
> On Thu, Nov 08, 2018 at 01:22:54PM -0800, Andy Lutomirski wrote:
> > > >
> > > > Why are these __packed?  It seems like it'll generate bad code for no
> > > > obvious purpose.
> > >
> > > That prevents any possibility that the compiler will insert padding, 
> > > although in
> > > 64-bit kernel this should not happen to either struct.  Also all xstate
> > > components here are packed.
> > >
> >
> > They both seem like bugs, perhaps.  As I understand it, __packed
> > removes padding, but it also forces the compiler to expect the fields
> > to be unaligned even if they are actually aligned.
>
> How is that? Andy, mind to point where you get that this
> attribute forces compiler to make such assumption?

It's from memory.  But gcc seems to agree with me I compiled this:

struct foo {
int x;
} __attribute__((packed));

int read_foo(struct foo *f)
{
return f->x;
}

int align_of_foo_x(struct foo *f)
{
return __alignof__(f->x);
}

Compiling with -O2 gives:

.globlread_foo
.typeread_foo, @function
read_foo:
movl(%rdi), %eax
ret
.sizeread_foo, .-read_foo

.p2align 4,,15
.globlalign_of_foo_x
.typealign_of_foo_x, @function
align_of_foo_x:
movl$1, %eax
ret
.sizealign_of_foo_x, .-align_of_foo_x

So gcc thinks that the x field is one-byte-aligned, but the code is
okay (at least in this instance) on x86.
Building for armv5 gives:

.typeread_foo, %function
read_foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldrbr3, [r0]@ zero_extendqisi2
ldrbr1, [r0, #1]@ zero_extendqisi2
ldrbr2, [r0, #2]@ zero_extendqisi2
orrr3, r3, r1, lsl #8
ldrbr0, [r0, #3]@ zero_extendqisi2
orrr3, r3, r2, lsl #16
orrr0, r3, r0, lsl #24
bxlr
.sizeread_foo, .-read_foo
.align2
.globalalign_of_foo_x
.syntax unified
.arm
.fpu vfpv3-d16
.typealign_of_foo_x, %function

So I'm pretty sure I'm right.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Matthew Wilcox
On Thu, Nov 08, 2018 at 01:48:54PM -0800, Dave Hansen wrote:
> On 11/8/18 1:22 PM, Andy Lutomirski wrote:
> >> +struct cet_kernel_state {
> >> +   u64 kernel_ssp; /* kernel shadow stack */
> >> +   u64 pl1_ssp;/* ring-1 shadow stack */
> >> +   u64 pl2_ssp;/* ring-2 shadow stack */
> >> +} __packed;
> >> +
> > Why are these __packed?  It seems like it'll generate bad code for no
> > obvious purpose.
> 
> It's a hardware-defined in-memory structure.  Granted, we'd need a
> really wonky compiler to make that anything *other* than a nicely-packed
> 24-byte structure, but the __packed makes it explicit.
> 
> It is probably a really useful long-term thing to stop using __packed
> and start using "__hw_defined" or something that #defines down to __packed.

packed doesn't mean "don't leave gaps".  It means:

'packed'
 The 'packed' attribute specifies that a variable or structure field
 should have the smallest possible alignment--one byte for a
 variable, and one bit for a field, unless you specify a larger
 value with the 'aligned' attribute.

So Andy's right.  It tells the compiler, "this struct will not be naturally 
aligned, it will be aligned to a 1-byte boundary".  Which is silly.  If we have

struct b {
unsigned long x;
} __packed;

struct a {
char c;
struct b b;
};

we want struct b to start at offset 8, but with __packed, it will start
at offset 1.

Delete __packed.  It doesn't do what you think it does.


Re: [PATCH V2] doc:it_IT: add some process/* translations

2018-11-08 Thread Federico Vaga
On Wednesday, November 7, 2018 11:22:58 PM CET Jonathan Corbet wrote:
> On Fri,  5 Oct 2018 20:51:57 +0200
> 
> Federico Vaga  wrote:
> > This patch does not translate entirely the subfolder "process/"
> > but only part of it (to begin with).
> > 
> > In order to avoid broken links, I included empty documents
> > for those which are not yet translated.
> > 
> > In order to be able to refer to all documents in "process/",
> > I added a sphinx label to those which had not one.
> > 
> > Translated documents:
> > - howto
> > - 1.Intro
> > - clang-format
> > - coding-style
> > - kernel-driver-statement
> > - magic-number
> > - volatile-considered-harmful
> > - development-process
> 
> So I'm (finally) ready to apply this, but it has conflicts against current
> docs-next.  Could I ask you, please, to send me a respin?

Ok, I'm going to prepare it. I will send also a couple of minor fixes that I 
ported to the Italian translation from the main documents.





[PATCH 0/3] doc:it_IT: update italian documentation

2018-11-08 Thread Federico Vaga
In this small patch set there is the V3 of
"doc:it_IT: add some process/* translations" in which I fixed
the recent conflicts in docs-next.
I took the occasion to add other two simple patches that fix:
- a missing translation in locking.rst
- apply a patch from the official documentation to the italian one




[PATCH 3/3] doc:it_IT:doc-guide: fix reference to foobar

2018-11-08 Thread Federico Vaga
Replicate the following patch into italian translation

1bb37a35671c  doc-guide:kernel-doc.rst: Reference to foobar

Signed-off-by: Federico Vaga 
---
 Documentation/translations/it_IT/doc-guide/kernel-doc.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/translations/it_IT/doc-guide/kernel-doc.rst 
b/Documentation/translations/it_IT/doc-guide/kernel-doc.rst
index 2bf1c1e2f394..a4ecd8f27631 100644
--- a/Documentation/translations/it_IT/doc-guide/kernel-doc.rst
+++ b/Documentation/translations/it_IT/doc-guide/kernel-doc.rst
@@ -107,7 +107,7 @@ macro simil-funzioni è il seguente::
* Context: Describes whether the function can sleep, what locks it takes,
*  releases, or expects to be held. It can extend over multiple
*  lines.
-   * Return: Describe the return value of foobar.
+   * Return: Describe the return value of function_name.
*
* The return value description can also have multiple paragraphs, and should
* be placed at the end of the comment block.
-- 
2.19.1



[PATCH 2/3] doc:it_IT: fix locking.rst section title

2018-11-08 Thread Federico Vaga
This title was still in English. I just translated it

Signed-off-by: Federico Vaga 
---
 Documentation/translations/it_IT/kernel-hacking/locking.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/translations/it_IT/kernel-hacking/locking.rst 
b/Documentation/translations/it_IT/kernel-hacking/locking.rst
index 753643622c23..0ef3163b 100644
--- a/Documentation/translations/it_IT/kernel-hacking/locking.rst
+++ b/Documentation/translations/it_IT/kernel-hacking/locking.rst
@@ -593,8 +593,8 @@ l'opzione ``GFP_KERNEL`` che è permessa solo in contesto 
utente. Ho supposto
 che :c:func:`cache_add()` venga chiamata dal contesto utente, altrimenti
 questa opzione deve diventare un parametro di :c:func:`cache_add()`.
 
-Exposing Objects Outside This File
---
+Esporre gli oggetti al di fuori del file
+
 
 Se i vostri oggetti contengono più informazioni, potrebbe non essere
 sufficiente copiare i dati avanti e indietro: per esempio, altre parti del
-- 
2.19.1



Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Dave Hansen
On 11/8/18 2:00 PM, Matthew Wilcox wrote:
> struct a {
>   char c;
>   struct b b;
> };
> 
> we want struct b to start at offset 8, but with __packed, it will start
> at offset 1.

You're talking about how we want the struct laid out in memory if we
have control over the layout.  I'm talking about what happens if
something *else* tells us the layout, like a hardware specification
which is what is in play with the XSAVE instruction dictated layout
that's in question here.

What I'm concerned about is a structure like this:

struct foo {
u32 i1;
u64 i2;
};

If we leave that to natural alignment, we end up with a 16-byte
structure laid out like this:

0-3 i1
3-8 alignment gap
8-15i2

Which isn't what we want.  We want a 12-byte structure, laid out like this:

0-3 i1
4-11i2

Which we get with:


struct foo {
u32 i1;
u64 i2;
} __packed;

Now, looking at Yu-cheng's specific example, it doesn't matter.  We've
got 64-bit types and natural 64-bit alignment.  Without __packed, we
need to look out for natural alignment screwing us up.  With __packed,
it just does what it *looks* like it does.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Borislav Petkov
On Thu, Nov 08, 2018 at 12:40:02PM -0800, Yu-cheng Yu wrote:
> In fpu_init_system_xstate(), we test and clear features that are not enabled.
> There we depend on the order of these elements.  This is the tenth "unknown
> xstate feature".

Aha, those are *reserved* bits - not unused, in XCR0.

Do an s/unused/reserved/g pls.

Now let's see, you have 0 for the 10th bit which happens to be

#define X86_FEATURE_FPU ( 0*32+ 0) /* Onboard FPU */

too. And if we look at the code:

for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
if (!boot_cpu_has(xsave_cpuid_features[i]))
xfeatures_mask_all &= ~BIT_ULL(i);

guess what happens if i == 10.

I know, the subsequent & SUPPORTED_XFEATURES_MASK saves you from the
#GP but that's still not good enough. The loop should not even call
boot_cpu_has() for reserved feature bits.

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Matthew Wilcox
On Thu, Nov 08, 2018 at 03:35:02PM -0800, Dave Hansen wrote:
> On 11/8/18 2:00 PM, Matthew Wilcox wrote:
> > struct a {
> > char c;
> > struct b b;
> > };
> > 
> > we want struct b to start at offset 8, but with __packed, it will start
> > at offset 1.
> 
> You're talking about how we want the struct laid out in memory if we
> have control over the layout.  I'm talking about what happens if
> something *else* tells us the layout, like a hardware specification
> which is what is in play with the XSAVE instruction dictated layout
> that's in question here.
> 
> What I'm concerned about is a structure like this:
> 
> struct foo {
> u32 i1;
> u64 i2;
> };
> 
> If we leave that to natural alignment, we end up with a 16-byte
> structure laid out like this:
> 
>   0-3 i1
>   3-8 alignment gap
>   8-15i2

I know you actually meant:

0-3 i1
4-7 pad
8-15i2

> Which isn't what we want.  We want a 12-byte structure, laid out like this:
> 
>   0-3 i1
>   4-11i2
> 
> Which we get with:
> 
> struct foo {
> u32 i1;
> u64 i2;
> } __packed;

But we _also_ get pessimised accesses to i1 and i2.  Because gcc can't
rely on struct foo being aligned to a 4 or even 8 byte boundary (it
might be embedded in "struct a" from above).

> Now, looking at Yu-cheng's specific example, it doesn't matter.  We've
> got 64-bit types and natural 64-bit alignment.  Without __packed, we
> need to look out for natural alignment screwing us up.  With __packed,
> it just does what it *looks* like it does.

The question is whether Yu-cheng's struct is ever embedded in another
struct.  And if so, what does the hardware do?


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Andy Lutomirski
On Thu, Nov 8, 2018 at 4:32 PM Matthew Wilcox  wrote:
>
> On Thu, Nov 08, 2018 at 03:35:02PM -0800, Dave Hansen wrote:
> > On 11/8/18 2:00 PM, Matthew Wilcox wrote:
> > > struct a {
> > > char c;
> > > struct b b;
> > > };
> > >
> > > we want struct b to start at offset 8, but with __packed, it will start
> > > at offset 1.
> >
> > You're talking about how we want the struct laid out in memory if we
> > have control over the layout.  I'm talking about what happens if
> > something *else* tells us the layout, like a hardware specification
> > which is what is in play with the XSAVE instruction dictated layout
> > that's in question here.
> >
> > What I'm concerned about is a structure like this:
> >
> > struct foo {
> > u32 i1;
> > u64 i2;
> > };
> >
> > If we leave that to natural alignment, we end up with a 16-byte
> > structure laid out like this:
> >
> >   0-3 i1
> >   3-8 alignment gap
> >   8-15i2
>
> I know you actually meant:
>
> 0-3 i1
> 4-7 pad
> 8-15i2
>
> > Which isn't what we want.  We want a 12-byte structure, laid out like this:
> >
> >   0-3 i1
> >   4-11i2
> >
> > Which we get with:
> >
> > struct foo {
> > u32 i1;
> > u64 i2;
> > } __packed;
>
> But we _also_ get pessimised accesses to i1 and i2.  Because gcc can't
> rely on struct foo being aligned to a 4 or even 8 byte boundary (it
> might be embedded in "struct a" from above).
>

In the event we end up with a hardware structure that has
not-really-aligned elements, I suspect we can ask gcc for a new
extension to help.  Or maybe some hack like:

struct foo {
  u32 i1;
  struct {
u64 i2;
  } __attribute__((packed));
};

would do the trick.


Re: [PATCH v8 2/2] overlayfs: override_creds=off option bypass creator_cred

2018-11-08 Thread Amir Goldstein
On Thu, Nov 8, 2018 at 11:28 PM Mark Salyzyn  wrote:
>
> On 11/08/2018 12:01 PM, Vivek Goyal wrote:
>
> On Tue, Nov 06, 2018 at 03:01:15PM -0800, Mark Salyzyn wrote:
>
> By default, all access to the upper, lower and work directories is the
> recorded mounter's MAC and DAC credentials.  The incoming accesses are
> checked against the caller's credentials.
>
> Ok, I am trying to think of scenarios where override_creds=off can
> provide any privilege escalation. How about following.
>
> $ mkdir lower lower/foo upper upper/foo work merged
> $ touch lower/foo/bar.txt
> $ chmod 700 lower/foo/
>
> # Mount overlay with override_creds=off
>
> $ mount -t overlay -o
> lowerdir=lower,upperdir=upper,workdir=work,override_creds=off none merged
>
> # Try to read lower/foo as unpriviliged user. Say "test"
> # su test
> # ls merged/foo/
> ls: cannot access 'merged/foo/': Operation not permitted
>
> # Now first try to do same operation as root and retry as test user.
> $ exit
> $ ls merged/foo
> bar.txt
> $ su test
> $ ls merged/foo
> bar.txt
>
> lower/foo/ is not readable by user "test". So it fails in first try. Later
> "root" accesses it and it populates cache in overlayfs. When test retries,
> it gets these entries from cache.
>
> With override_creds=on this is not a problem because overlay provides
> this as functionality as long as mounter as access to lower/foo/.
>
> But with override_creds=off, mounter is not providing any such
> functionality and we are exposing an issue where cache will make
> something available which is not normally available.
>
> I think it probably is a good idea to do something about it?
>
> Thanks
> Vivek
>
> Good stuff.
>
> That sounds like a bug in cache (!) to not recheck caller's credentials. 
> Currently unsure how/where to force bypass of the cache (performance hit) as 
> it is wired in throughout the code without a clear off switch, or rechecking 
> of the credentials at access. This does need to be addressed to make this 
> 'feature' more useful/trusted for non-MAC controlled, use cases.
>
> This is not a problem in the Android usage case since DAC is simple and all 
> can be read, execute bits might be controlled, the owners and perms are 
> otherwise unremarkable in the affected arenas that are utilizing overlayfs. 
> Not using it for a generic r/w backing except in rooted debug scenarios by 
> developers, otherwise everything is r/o, MAC on the other hand is complex and 
> heavily inspected. We do, however, want multi level security in that both DAC 
> and MAC can be trusted and can protect each other from holes.
>
> Sounds like the caveats in the documentation need to be expanded if _no_ 
> solution for this kind of access pattern becomes apparent.
>

I think maybe you could append the "behavior of the overlay is undefined" clause
to the caveats to cover issues like the one raised by Vivek.

Mark,

I have some Android internals background, so I have a general
understanding of the
use case, but I can understand why people have a hard time connecting to the
motivation, thinking "their security model must be flawed".

I am not sure if you are avoiding laying out the details of the model
because you
are not allowed to expose details or because you feel details may confuse us.
If the latter, then I think that actually listing the details of the
overlays used in Android
and some concrete examples of access policies to those overlays could
benefit the
discussion on the feature.
To clarify, this only a suggestion. I have no objection to the patch.

Thanks,
Amir.


Re: [PATCH v3 1/2] kretprobe: produce sane stack traces

2018-11-08 Thread Masami Hiramatsu
On Thu, 8 Nov 2018 19:04:48 +1100
Aleksa Sarai  wrote:

> On 2018-11-08, Aleksa Sarai  wrote:
> > I will attach what I have at the moment to hopefully explain what the
> > issue I've found is (re-using the kretprobe architecture but with the
> > shadow-stack idea).
> 
> Here is the patch I have at the moment (it works, except for the
> question I have about how to handle the top-level pt_regs -- I've marked
> that code with XXX).
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> 
> 
> --8<-
> 
> Since the return address is modified by kretprobe, the various unwinders
> can produce invalid and confusing stack traces. ftrace mostly solved
> this problem by teaching each unwinder how to find the original return
> address for stack trace purposes. This same technique can be applied to
> kretprobes by simply adding a pointer to where the return address was
> replaced in the stack, and then looking up the relevant
> kretprobe_instance when a stack trace is requested.
> 
> [WIP: This is currently broken because the *first entry* will not be
>   overwritten since it looks like the stack pointer is different
>   when we are provided pt_regs. All other addresses are correctly
>   handled.]
> 
> Signed-off-by: Aleksa Sarai 
> ---
>  arch/x86/events/core.c |  6 +++-
>  arch/x86/include/asm/ptrace.h  |  1 +
>  arch/x86/kernel/kprobes/core.c |  5 ++--
>  arch/x86/kernel/stacktrace.c   | 10 +--
>  arch/x86/kernel/unwind_frame.c |  2 ++
>  arch/x86/kernel/unwind_guess.c |  5 +++-
>  arch/x86/kernel/unwind_orc.c   |  2 ++
>  include/linux/kprobes.h| 15 +-
>  kernel/kprobes.c   | 55 ++
>  9 files changed, 93 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index de32741d041a..d71062702179 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2371,7 +2371,11 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
>   return;
>   }
>  
> - if (perf_callchain_store(entry, regs->ip))
> + /* XXX: Currently broken -- stack_addr(regs) doesn't match entry. */
> + addr = regs->ip;
> + //addr = ftrace_graph_ret_addr(current, &state.graph_idx, addr, 
> stack_addr(regs));
> + addr = kretprobe_ret_addr(current, addr, stack_addr(regs));
> + if (perf_callchain_store(entry, addr))
>   return;
>  
>   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index ee696efec99f..c4dfafd43e11 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -172,6 +172,7 @@ static inline unsigned long kernel_stack_pointer(struct 
> pt_regs *regs)
>   return regs->sp;
>  }
>  #endif
> +#define stack_addr(regs) ((unsigned long *) kernel_stack_pointer(regs))

No, you should use kernel_stack_pointer(regs) itself instead of stack_addr().

> 
>  #define GET_IP(regs) ((regs)->ip)
>  #define GET_FP(regs) ((regs)->bp)
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index b0d1e81c96bb..eb4da885020c 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -69,8 +69,6 @@
>  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  
> -#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))

I don't like keeping this meaningless macro... this should be replaced with 
generic
kernel_stack_pointer() macro.

> -
>  #define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, 
> bf)\
>   (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
> (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
> @@ -568,7 +566,8 @@ void arch_prepare_kretprobe(struct kretprobe_instance 
> *ri, struct pt_regs *regs)
>  {
>   unsigned long *sara = stack_addr(regs);
>  
> - ri->ret_addr = (kprobe_opcode_t *) *sara;
> + ri->ret_addrp = (kprobe_opcode_t **) sara;
> + ri->ret_addr = *ri->ret_addrp;
>  
>   /* Replace the return addr with trampoline addr */
>   *sara = (unsigned long) &kretprobe_trampoline;
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index 7627455047c2..8a4fb3109d6b 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -37,8 +38,13 @@ static void noinline __save_stack_trace(struct stack_trace 
> *trace,
>   struct unwind_state state;
>   unsigned long addr;
>  
> - if (regs)
> - save_stack_address(trace, regs->ip, nosched);
> + if (regs) {
> + /* XXX: Currently 

Re: [PATCH v3 1/2] kretprobe: produce sane stack traces

2018-11-08 Thread Masami Hiramatsu
On Thu, 8 Nov 2018 08:44:37 -0600
Josh Poimboeuf  wrote:

> On Thu, Nov 08, 2018 at 07:04:48PM +1100, Aleksa Sarai wrote:
> > On 2018-11-08, Aleksa Sarai  wrote:
> > > I will attach what I have at the moment to hopefully explain what the
> > > issue I've found is (re-using the kretprobe architecture but with the
> > > shadow-stack idea).
> > 
> > Here is the patch I have at the moment (it works, except for the
> > question I have about how to handle the top-level pt_regs -- I've marked
> > that code with XXX).
> > 
> > -- 
> > Aleksa Sarai
> > Senior Software Engineer (Containers)
> > SUSE Linux GmbH
> > 
> > 
> > --8<-
> > 
> > Since the return address is modified by kretprobe, the various unwinders
> > can produce invalid and confusing stack traces. ftrace mostly solved
> > this problem by teaching each unwinder how to find the original return
> > address for stack trace purposes. This same technique can be applied to
> > kretprobes by simply adding a pointer to where the return address was
> > replaced in the stack, and then looking up the relevant
> > kretprobe_instance when a stack trace is requested.
> > 
> > [WIP: This is currently broken because the *first entry* will not be
> >   overwritten since it looks like the stack pointer is different
> >   when we are provided pt_regs. All other addresses are correctly
> >   handled.]
> 
> When you see this problem, what does regs->ip point to?  If it's
> pointing to generated code, then we don't _currently_ have a way of
> dealing with that.  If it's pointing to a real function, we can fix that
> with unwind hints.

As I replied, If the stackdump is called from kretprobe event, regs->ip
always points trampoline function. Otherwise (maybe from kprobe event,
or panic, BUG etc.) it always be the address which the event occurs.

So fixing regs->ip is correct.

Thank you,


-- 
Masami Hiramatsu