Re: [PATCH] powerpc/xmon: add read-only mode

2019-04-02 Thread Christopher M Riedl
> On March 29, 2019 at 3:41 AM Christophe Leroy  wrote:
> 
> 
> 
> 
> Le 29/03/2019 à 05:21, cmr a écrit :
> > Operations which write to memory should be restricted on secure systems
> > and optionally to avoid self-destructive behaviors.
> > 
> > Add a config option, XMON_RO, to control default xmon behavior along
> > with kernel cmdline options xmon=ro and xmon=rw for explicit control.
> > The default is to enable read-only mode.
> > 
> > The following xmon operations are affected:
> > memops:
> > disable memmove
> > disable memset
> > memex:
> > no-op'd mwrite
> > super_regs:
> > no-op'd write_spr
> > bpt_cmds:
> > disable
> > proc_call:
> > disable
> > 
> > Signed-off-by: cmr 
> 
> A Fully qualified name should be used.

What do you mean by fully-qualified here? PPC_XMON_RO? (PPC_)XMON_READONLY?

> 
> > ---
> >   arch/powerpc/Kconfig.debug |  7 +++
> >   arch/powerpc/xmon/xmon.c   | 24 
> >   2 files changed, 31 insertions(+)
> > 
> > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> > index 4e00cb0a5464..33cc01adf4cb 100644
> > --- a/arch/powerpc/Kconfig.debug
> > +++ b/arch/powerpc/Kconfig.debug
> > @@ -117,6 +117,13 @@ config XMON_DISASSEMBLY
> >   to say Y here, unless you're building for a memory-constrained
> >   system.
> >   
> > +config XMON_RO
> > +   bool "Set xmon read-only mode"
> > +   depends on XMON
> > +   default y
> 
> Should it really be always default y ?
> I would set default 'y' only when some security options are also set.
> 

This is a good point. I based this on an internal Slack suggestion, but giving
this more thought, disabling read-only mode by default makes more sense. I'm
not sure which security options could be set, though?


Re: [PATCH] powerpc/xmon: add read-only mode

2019-04-03 Thread Christopher M Riedl


> On March 29, 2019 at 12:49 AM Andrew Donnellan  
> wrote:
> 
> 
> On 29/3/19 3:21 pm, cmr wrote:
> > Operations which write to memory should be restricted on secure systems
> > and optionally to avoid self-destructive behaviors.
> 
> For reference:
>   - https://github.com/linuxppc/issues/issues/219
>   - https://github.com/linuxppc/issues/issues/232
> 
> Perhaps clarify what is meant here by "secure systems".
> 
> Otherwise commit message looks good.
> 

I will reword this for the next patch to reflect the verbiage in the referenced
GitHub issue -- i.e., Secure Boot, and not violating secure boot integrity by
using xmon.

> 
> > ---
> >   arch/powerpc/Kconfig.debug |  7 +++
> >   arch/powerpc/xmon/xmon.c   | 24 
> >   2 files changed, 31 insertions(+)
> > 
> > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> > index 4e00cb0a5464..33cc01adf4cb 100644
> > --- a/arch/powerpc/Kconfig.debug
> > +++ b/arch/powerpc/Kconfig.debug
> > @@ -117,6 +117,13 @@ config XMON_DISASSEMBLY
> >   to say Y here, unless you're building for a memory-constrained
> >   system.
> >   
> > +config XMON_RO
> > +   bool "Set xmon read-only mode"
> > +   depends on XMON
> > +   default y
> > +   help
> > + Disable state- and memory-altering write operations in xmon.
> 
> The meaning of this option is a bit unclear.
> 
>  From the code - it looks like what this option actually does is enable 
> RO mode *by default*. In which case it should probably be called 
> XMON_RO_DEFAULT and the description should note that RW mode can still 
> be enabled via a cmdline option.
>

Based on Christophe's feedback the default will change for this option in the
next patch. I will also add the cmdline options to the description for clarity.

>
> > +
> >   config DEBUGGER
> > bool
> > depends on KGDB || XMON
> > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> > index a0f44f992360..c13ee73cdfd4 100644
> > --- a/arch/powerpc/xmon/xmon.c
> > +++ b/arch/powerpc/xmon/xmon.c
> > @@ -80,6 +80,7 @@ static int set_indicator_token = RTAS_UNKNOWN_SERVICE;
> >   #endif
> >   static unsigned long in_xmon __read_mostly = 0;
> >   static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT);
> > +static int xmon_ro = IS_ENABLED(CONFIG_XMON_RO);
> >   
> >   static unsigned long adrs;
> >   static int size = 1;
> > @@ -1042,6 +1043,8 @@ cmds(struct pt_regs *excp)
> > set_lpp_cmd();
> > break;
> > case 'b':
> > +   if (xmon_ro == 1)
> > +   break;
> 
> For all these cases - it would be much better to print an error message 
> somewhere when we abort due to read-only mode.
> 

I included print messages initially but then thought about how xmon is intended
for "power" users. I can add print statements to avoid confusion and frustration
since the operations are just "silently" dropped -- *if* that aligns with xmon's
"philosophy".


Re: [PATCH] powerpc/xmon: add read-only mode

2019-04-03 Thread Christopher M Riedl
> On April 3, 2019 at 12:15 AM Christophe Leroy  wrote:
> 
> 
> 
> 
> Le 03/04/2019 à 05:38, Christopher M Riedl a écrit :
> >> On March 29, 2019 at 3:41 AM Christophe Leroy  
> >> wrote:
> >>
> >>
> >>
> >>
> >> Le 29/03/2019 à 05:21, cmr a écrit :
> >>> Operations which write to memory should be restricted on secure systems
> >>> and optionally to avoid self-destructive behaviors.
> >>>
> >>> Add a config option, XMON_RO, to control default xmon behavior along
> >>> with kernel cmdline options xmon=ro and xmon=rw for explicit control.
> >>> The default is to enable read-only mode.
> >>>
> >>> The following xmon operations are affected:
> >>> memops:
> >>>   disable memmove
> >>>   disable memset
> >>> memex:
> >>>   no-op'd mwrite
> >>> super_regs:
> >>>   no-op'd write_spr
> >>> bpt_cmds:
> >>>   disable
> >>> proc_call:
> >>>   disable
> >>>
> >>> Signed-off-by: cmr 
> >>
> >> A Fully qualified name should be used.
> > 
> > What do you mean by fully-qualified here? PPC_XMON_RO? (PPC_)XMON_READONLY?
> 
> I mean it should be
> 
> Signed-off-by: Christopher M Riedl 
> 
> instead of
> 
> Signed-off-by: cmr 
> 

Hehe, thanks :)

> > 
> >>
> >>> ---
> >>>arch/powerpc/Kconfig.debug |  7 +++
> >>>arch/powerpc/xmon/xmon.c   | 24 
> >>>2 files changed, 31 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> >>> index 4e00cb0a5464..33cc01adf4cb 100644
> >>> --- a/arch/powerpc/Kconfig.debug
> >>> +++ b/arch/powerpc/Kconfig.debug
> >>> @@ -117,6 +117,13 @@ config XMON_DISASSEMBLY
> >>> to say Y here, unless you're building for a memory-constrained
> >>> system.
> >>>
> >>> +config XMON_RO
> >>> + bool "Set xmon read-only mode"
> >>> + depends on XMON
> >>> + default y
> >>
> >> Should it really be always default y ?
> >> I would set default 'y' only when some security options are also set.
> >>
> > 
> > This is a good point, I based this on an internal Slack suggestion but 
> > giving this more thought, disabling read-only mode by default makes more 
> > sense. I'm not sure what security options could be set though?
> > 
> 
> Maybe starting with CONFIG_STRICT_KERNEL_RWX
> 
> Another point that may also be addressed by your patch is the definition 
> of PAGE_KERNEL_TEXT:
> 
> #if defined(CONFIG_KGDB) || defined(CONFIG_XMON) || defined(CONFIG_BDI_SWITCH) ||\
>   defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE)
> #define PAGE_KERNEL_TEXT  PAGE_KERNEL_X
> #else
> #define PAGE_KERNEL_TEXT  PAGE_KERNEL_ROX
> #endif
> 
> The above let me think that it would be better if you add a config 
> XMON_RW instead of XMON_RO, with default !STRICT_KERNEL_RWX
> 
> Christophe

Thanks! I like that a lot better; this, along with your other suggestions
from the initial review, will be in the next version.


[PATCH v2] powerpc/xmon: add read-only mode

2019-04-07 Thread Christopher M. Riedl
Operations which write to memory and special purpose registers should be
restricted on systems with integrity guarantees (such as Secure Boot)
and, optionally, to avoid self-destructive behaviors.

Add a config option, XMON_RW, to control default xmon behavior along
with kernel cmdline options xmon=ro and xmon=rw for explicit control.
Use XMON_RW instead of XMON in the condition to set PAGE_KERNEL_TEXT to
allow xmon in read-only mode alongside write-protected kernel text.
XMON_RW defaults to !STRICT_KERNEL_RWX.

The following xmon operations are affected:
memops:
disable memmove
disable memset
disable memzcan
memex:
no-op'd mwrite
super_regs:
no-op'd write_spr
bpt_cmds:
disable
proc_call:
disable

Signed-off-by: Christopher M. Riedl 
---
v1->v2:
Use bool type for xmon_is_ro flag
Replace XMON_RO with XMON_RW config option
Make XMON_RW dependent on STRICT_KERNEL_RWX
Use XMON_RW to control PAGE_KERNEL_TEXT
Add printf in xmon read-only mode when dropping/skipping writes
Disable memzcan (zero-fill memop) in xmon read-only mode

 arch/powerpc/Kconfig.debug   | 10 +
 arch/powerpc/include/asm/book3s/32/pgtable.h |  5 ++-
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++-
 arch/powerpc/include/asm/nohash/pgtable.h|  5 ++-
 arch/powerpc/xmon/xmon.c | 42 
 5 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 4e00cb0a5464..0c7f21476018 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -117,6 +117,16 @@ config XMON_DISASSEMBLY
  to say Y here, unless you're building for a memory-constrained
  system.
 
+config XMON_RW
+   bool "Allow xmon read and write operations"
+   depends on XMON
+   default !STRICT_KERNEL_RWX
+   help
+ Allow xmon to read and write to memory and special-purpose registers.
+  Conversely, prevent xmon write access when set to N. Read and write
+  access can also be explicitly controlled with 'xmon=rw' or 'xmon=ro'
+  (read-only) cmdline options. Default is !STRICT_KERNEL_RWX.
+
 config DEBUGGER
bool
depends on KGDB || XMON
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index aa8406b8f7ba..615144ad667d 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -86,8 +86,9 @@ static inline bool pte_user(pte_t pte)
  * set breakpoints anywhere, so don't write protect the kernel text
  * on platforms where such control is possible.
  */
-#if defined(CONFIG_KGDB) || defined(CONFIG_XMON) || defined(CONFIG_BDI_SWITCH) ||\
-   defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE)
+#if defined(CONFIG_KGDB) || defined(CONFIG_XMON_RW) || \
+   defined(CONFIG_BDI_SWITCH) || defined(CONFIG_KPROBES) || \
+   defined(CONFIG_DYNAMIC_FTRACE)
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_X
 #else
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_ROX
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..bc4655122f6b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -168,8 +168,9 @@
  * set breakpoints anywhere, so don't write protect the kernel text
  * on platforms where such control is possible.
  */
-#if defined(CONFIG_KGDB) || defined(CONFIG_XMON) || defined(CONFIG_BDI_SWITCH) || \
-   defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE)
+#if defined(CONFIG_KGDB) || defined(CONFIG_XMON_RW) || \
+   defined(CONFIG_BDI_SWITCH) || defined(CONFIG_KPROBES) || \
+   defined(CONFIG_DYNAMIC_FTRACE)
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_X
 #else
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_ROX
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index 1ca1c1864b32..c052931bd243 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -22,8 +22,9 @@
  * set breakpoints anywhere, so don't write protect the kernel text
  * on platforms where such control is possible.
  */
-#if defined(CONFIG_KGDB) || defined(CONFIG_XMON) || defined(CONFIG_BDI_SWITCH) ||\
-   defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE)
+#if defined(CONFIG_KGDB) || defined(CONFIG_XMON_RW) || \
+   defined(CONFIG_BDI_SWITCH) || defined(CONFIG_KPROBES) || \
+   defined(CONFIG_DYNAMIC_FTRACE)
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_X
 #else
 #define PAGE_KERNEL_TEXT   PAGE_KERNEL_ROX
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a0f44f992360..224ca0b3506b 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/power

Re: [PATCH v2] powerpc/xmon: add read-only mode

2019-04-09 Thread Christopher M Riedl


> On April 8, 2019 at 1:34 AM Oliver  wrote:
> 
> 
> On Mon, Apr 8, 2019 at 1:06 PM Christopher M. Riedl  
> wrote:
> >
> > Operations which write to memory and special purpose registers should be
> > restricted on systems with integrity guarantees (such as Secure Boot)
> > and, optionally, to avoid self-destructive behaviors.
> >
> > Add a config option, XMON_RW, to control default xmon behavior along
> > with kernel cmdline options xmon=ro and xmon=rw for explicit control.
> > Use XMON_RW instead of XMON in the condition to set PAGE_KERNEL_TEXT to
> > allow xmon in read-only mode alongside write-protected kernel text.
> > XMON_RW defaults to !STRICT_KERNEL_RWX.
> >
> > The following xmon operations are affected:
> > memops:
> > disable memmove
> > disable memset
> > disable memzcan
> > memex:
> > no-op'd mwrite
> > super_regs:
> > no-op'd write_spr
> > bpt_cmds:
> > disable
> > proc_call:
> > disable
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> > v1->v2:
> > Use bool type for xmon_is_ro flag
> > Replace XMON_RO with XMON_RW config option
> > Make XMON_RW dependent on STRICT_KERNEL_RWX
> Do you mean make it dependent on XMON?
>

Yeah that's really not clear at all -- XMON_RW is set based on the value of
STRICT_KERNEL_RWX.

> 
> > Use XMON_RW to control PAGE_KERNEL_TEXT
> > Add printf in xmon read-only mode when dropping/skipping writes
> > Disable memzcan (zero-fill memop) in xmon read-only mode
> >
> >  arch/powerpc/Kconfig.debug   | 10 +
> >  arch/powerpc/include/asm/book3s/32/pgtable.h |  5 ++-
> >  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++-
> >  arch/powerpc/include/asm/nohash/pgtable.h|  5 ++-
> >  arch/powerpc/xmon/xmon.c | 42 
> >  5 files changed, 61 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> > index 4e00cb0a5464..0c7f21476018 100644
> > --- a/arch/powerpc/Kconfig.debug
> > +++ b/arch/powerpc/Kconfig.debug
> > @@ -117,6 +117,16 @@ config XMON_DISASSEMBLY
> >   to say Y here, unless you're building for a memory-constrained
> >   system.
> >
> 
> > +config XMON_RW
> > +   bool "Allow xmon read and write operations"
> > +   depends on XMON
> > +   default !STRICT_KERNEL_RWX
> > +   help
> > + Allow xmon to read and write to memory and special-purpose 
> > registers.
> > +  Conversely, prevent xmon write access when set to N. Read and 
> > write
> > +  access can also be explicitly controlled with 'xmon=rw' or 
> > 'xmon=ro'
> > +  (read-only) cmdline options. Default is !STRICT_KERNEL_RWX.
> 
> Maybe I am a dumb, but I found this *extremely* confusing.
> Conventionally Kconfig options will control what code is and is not
> included in the kernel (see XMON_DISASSEMBLY) rather than changing the
> default behaviour of code. It's not wrong to do so and I'm going to
> assume that you were following the pattern of XMON_DEFAULT, but I
> think you need to be a little more clear about what option actually
> does. Renaming it to XMON_DEFAULT_RO_MODE and re-wording the
> description to indicate it's a only a mode change would help a lot.
> 
> Sorry if this comes across as pointless bikeshedding since it's the
> opposite of what Christophe said in the last patch, but this was a bit
> of a head scratcher.
> 

If anyone is dumb here it's me for making this confusing :)
I chatted with Michael Ellerman about this, so let me try to explain this more 
clearly.

There are two things I am trying to address with XMON_RW:
1) provide a default access mode for xmon based on system "security"
2) replace XMON in the decision to write-protect kernel text at compile-time

I think a single Kconfig for both of those things is sensible as ultimately the
point is to allow xmon to operate in read-only mode on "secure" systems --
without violating any integrity/security guarantees (such as write-protected
kernel text).

Christophe suggested looking at STRICT_KERNEL_RWX and I think that option makes
the most sense to base XMON_RW on since the description for STRICT_KERNEL_RWX
states:

> If this is set, kernel text and rodata memory will be made read-only,
> and non-text memory will be made non-executable. This provides
> protection against certain security exploits

Re: [PATCH v2] powerpc/xmon: add read-only mode

2019-04-09 Thread Christopher M Riedl


> On April 8, 2019 at 2:37 AM Andrew Donnellan  
> wrote:
> 
> 
> On 8/4/19 1:08 pm, Christopher M. Riedl wrote:
> > Operations which write to memory and special purpose registers should be
> > restricted on systems with integrity guarantees (such as Secure Boot)
> > and, optionally, to avoid self-destructive behaviors.
> > 
> > Add a config option, XMON_RW, to control default xmon behavior along
> > with kernel cmdline options xmon=ro and xmon=rw for explicit control.
> > Use XMON_RW instead of XMON in the condition to set PAGE_KERNEL_TEXT to
> > allow xmon in read-only mode alongside write-protected kernel text.
> > XMON_RW defaults to !STRICT_KERNEL_RWX.
> > 
> > The following xmon operations are affected:
> > memops:
> > disable memmove
> > disable memset
> > disable memzcan
> > memex:
> > no-op'd mwrite
> > super_regs:
> > no-op'd write_spr
> > bpt_cmds:
> > disable
> > proc_call:
> > disable
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> > v1->v2:
> > Use bool type for xmon_is_ro flag
> > Replace XMON_RO with XMON_RW config option
> > Make XMON_RW dependent on STRICT_KERNEL_RWX
> > Use XMON_RW to control PAGE_KERNEL_TEXT
> > Add printf in xmon read-only mode when dropping/skipping writes
> > Disable memzcan (zero-fill memop) in xmon read-only mode
> > 
> >   arch/powerpc/Kconfig.debug   | 10 +
> >   arch/powerpc/include/asm/book3s/32/pgtable.h |  5 ++-
> >   arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++-
> >   arch/powerpc/include/asm/nohash/pgtable.h|  5 ++-
> >   arch/powerpc/xmon/xmon.c | 42 
> >   5 files changed, 61 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> > index 4e00cb0a5464..0c7f21476018 100644
> > --- a/arch/powerpc/Kconfig.debug
> > +++ b/arch/powerpc/Kconfig.debug
> > @@ -117,6 +117,16 @@ config XMON_DISASSEMBLY
> >   to say Y here, unless you're building for a memory-constrained
> >   system.
> >   
> > +config XMON_RW
> > +   bool "Allow xmon read and write operations"
> 
> "Allow xmon write operations" would be clearer. This option has no 
> impact on read operations.
> 

Agreed, if the option isn't renamed again I will fix this in the next version :)

>
> > +   depends on XMON
> > +   default !STRICT_KERNEL_RWX
> > +   help
> > + Allow xmon to read and write to memory and special-purpose registers.
> > +  Conversely, prevent xmon write access when set to N. Read and 
> > write
> > +  access can also be explicitly controlled with 'xmon=rw' or 
> > 'xmon=ro'
> > +  (read-only) cmdline options. Default is !STRICT_KERNEL_RWX.
> 
> This is an improvement but still doesn't clearly explain the 
> relationship between selecting this option and using the cmdline options.
> 

I will reword this in the next version.

> 
> > +
> >   config DEBUGGER
> > bool
> > depends on KGDB || XMON
> > diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
> > b/arch/powerpc/include/asm/book3s/32/pgtable.h
> > index aa8406b8f7ba..615144ad667d 100644
> > --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
> > +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
> > @@ -86,8 +86,9 @@ static inline bool pte_user(pte_t pte)
> >* set breakpoints anywhere, so don't write protect the kernel text
> >* on platforms where such control is possible.
> >*/
> > -#if defined(CONFIG_KGDB) || defined(CONFIG_XMON) || 
> > defined(CONFIG_BDI_SWITCH) ||\
> > -   defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE)
> > +#if defined(CONFIG_KGDB) || defined(CONFIG_XMON_RW) || \
> > +   defined(CONFIG_BDI_SWITCH) || defined(CONFIG_KPROBES) || \
> > +   defined(CONFIG_DYNAMIC_FTRACE)
> >   #define PAGE_KERNEL_TEXT  PAGE_KERNEL_X
> >   #else
> >   #define PAGE_KERNEL_TEXT  PAGE_KERNEL_ROX
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> > b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > index 581f91be9dd4..bc4655122f6b 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > @@ -168,8 +168,9 @@
> >* set breakpoints anywhere, so don't write protect the kernel text
> >* on platforms where such control is possible.
> >*/
>

Re: [PATCH v2] powerpc/xmon: add read-only mode

2019-04-11 Thread Christopher M Riedl


> On April 11, 2019 at 8:37 AM Michael Ellerman  wrote:
> 
> 
> Christopher M Riedl  writes:
> >> On April 8, 2019 at 1:34 AM Oliver  wrote:
> >> On Mon, Apr 8, 2019 at 1:06 PM Christopher M. Riedl  
> >> wrote:
> ...
> >> >
> >> > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> >> > index 4e00cb0a5464..0c7f21476018 100644
> >> > --- a/arch/powerpc/Kconfig.debug
> >> > +++ b/arch/powerpc/Kconfig.debug
> >> > @@ -117,6 +117,16 @@ config XMON_DISASSEMBLY
> >> >   to say Y here, unless you're building for a memory-constrained
> >> >   system.
> >> >
> >> 
> >> > +config XMON_RW
> >> > +   bool "Allow xmon read and write operations"
> >> > +   depends on XMON
> >> > +   default !STRICT_KERNEL_RWX
> >> > +   help
> >> > + Allow xmon to read and write to memory and special-purpose 
> >> > registers.
> >> > +  Conversely, prevent xmon write access when set to N. Read and 
> >> > write
> >> > +  access can also be explicitly controlled with 'xmon=rw' or 
> >> > 'xmon=ro'
> >> > +  (read-only) cmdline options. Default is !STRICT_KERNEL_RWX.
> >> 
> >> Maybe I am a dumb, but I found this *extremely* confusing.
> >> Conventionally Kconfig options will control what code is and is not
> >> included in the kernel (see XMON_DISASSEMBLY) rather than changing the
> >> default behaviour of code. It's not wrong to do so and I'm going to
> >> assume that you were following the pattern of XMON_DEFAULT, but I
> >> think you need to be a little more clear about what option actually
> >> does. Renaming it to XMON_DEFAULT_RO_MODE and re-wording the
> >> description to indicate it's a only a mode change would help a lot.
> >> 
> >> Sorry if this comes across as pointless bikeshedding since it's the
> >> opposite of what Christophe said in the last patch, but this was a bit
> >> of a head scratcher.
> >
> > If anyone is dumb here it's me for making this confusing :)
> > I chatted with Michael Ellerman about this, so let me try to explain this 
> > more clearly.
> 
> Yeah it's my fault :)
>

"Signed-off-by: Christopher M. Riedl" -- I take full responsibility hah.

> 
> > There are two things I am trying to address with XMON_RW:
> > 1) provide a default access mode for xmon based on system "security"
> 
> I think I've gone off this idea. Tying them together is just enforcing a
> linkage that people may not want.
> 
> I think XMON_RW should just be an option that stands on its own. It
> should probably be default n, to give people a safe default.
> 

The next version includes this, along with making it clear that this option
provides the default mode for xmon.

>
> > 2) replace XMON in the decision to write-protect kernel text at compile-time
> 
> We should do that as a separate patch. That's actually a bug in the
> current STRICT_KERNEL_RWX support.
> 
> ie. STRICT_KERNEL_RWX should always give you PAGE_KERNEL_ROX, regardless
> of XMON or anything else.
> 
> > I think a single Kconfig for both of those things is sensible as ultimately 
> > the
> > point is to allow xmon to operate in read-only mode on "secure" systems -- 
> > without
> > violating any integrity/security guarantees (such as write-protected kernel 
> > text).
> >
> > Christophe suggested looking at STRICT_KERNEL_RWX and I think that option 
> > makes the
> > most sense to base XMON_RW on since the description for STRICT_KERNEL_RWX 
> > states:
> 
> Once we fix the bugs in STRICT_KERNEL_RWX people are going to enable
> that by default, so it will essentially be always on in future.
> 
> 
> > With that said, I will remove the 'xmon=rw' cmdline option as it really 
> > doesn't work
> > since kernel text is write-protected at compile time.
> 
> I think 'xmon=rw' still makes sense. Only some of the RW functionality
> relies on being able to patch kernel text.
> 
> And once you have proccall() you can just call a function to make it
> read/write anyway, or use memex to manually frob the page tables.
> 
> cheers

Great, adding this back in the next version.


[PATCH v3] powerpc/xmon: add read-only mode

2019-04-14 Thread Christopher M. Riedl
Operations which write to memory and special purpose registers should be
restricted on systems with integrity guarantees (such as Secure Boot)
and, optionally, to avoid self-destructive behaviors.

Add a config option, XMON_DEFAULT_RO_MODE, to set default xmon behavior.
The kernel cmdline options xmon=ro and xmon=rw override this default.

The following xmon operations are affected:
memops:
disable memmove
disable memset
disable memzcan
memex:
no-op'd mwrite
super_regs:
no-op'd write_spr
bpt_cmds:
disable
proc_call:
disable

Signed-off-by: Christopher M. Riedl 
---

v2->v3:
Use XMON_DEFAULT_RO_MODE to set xmon read-only mode
Untangle read-only mode from STRICT_KERNEL_RWX and PAGE_KERNEL_ROX
Update printed msg string for write ops in read-only mode

 arch/powerpc/Kconfig.debug |  8 
 arch/powerpc/xmon/xmon.c   | 42 ++
 2 files changed, 50 insertions(+)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 4e00cb0a5464..8de4823dfb86 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -117,6 +117,14 @@ config XMON_DISASSEMBLY
  to say Y here, unless you're building for a memory-constrained
  system.
 
+config XMON_DEFAULT_RO_MODE
+   bool "Restrict xmon to read-only operations"
+   depends on XMON
+   default y
+   help
+  Operate xmon in read-only mode. The cmdline options 'xmon=rw' and
+  'xmon=ro' override this default.
+
 config DEBUGGER
bool
depends on KGDB || XMON
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a0f44f992360..ce98c8049eb6 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -80,6 +80,7 @@ static int set_indicator_token = RTAS_UNKNOWN_SERVICE;
 #endif
 static unsigned long in_xmon __read_mostly = 0;
 static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT);
+static bool xmon_is_ro = IS_ENABLED(CONFIG_XMON_DEFAULT_RO_MODE);
 
 static unsigned long adrs;
 static int size = 1;
@@ -202,6 +203,8 @@ static void dump_tlb_book3e(void);
 #define GETWORD(v)	(((v)[0] << 24) + ((v)[1] << 16) + ((v)[2] << 8) + (v)[3])
 #endif
 
+static const char *xmon_ro_msg = "Operation disabled: xmon in read-only mode\n";
+
 static char *help_string = "\
 Commands:\n\
  b	show breakpoints\n\
@@ -989,6 +992,10 @@ cmds(struct pt_regs *excp)
memlocate();
break;
case 'z':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memzcan();
break;
case 'i':
@@ -1042,6 +1049,10 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
bpt_cmds();
break;
case 'C':
@@ -1055,6 +1066,10 @@ cmds(struct pt_regs *excp)
bootcmds();
break;
case 'p':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
proccall();
break;
case 'P':
@@ -1777,6 +1792,11 @@ read_spr(int n, unsigned long *vp)
 static void
 write_spr(int n, unsigned long val)
 {
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   return;
+   }
+
if (setjmp(bus_error_jmp) == 0) {
catch_spr_faults = 1;
sync();
@@ -2016,6 +2036,12 @@ mwrite(unsigned long adrs, void *buf, int size)
char *p, *q;
 
n = 0;
+
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   return n;
+   }
+
if (setjmp(bus_error_jmp) == 0) {
catch_memory_errors = 1;
sync();
@@ -2884,9 +2910,17 @@ memops(int cmd)
scanhex((void *)&mcount);
switch( cmd ){
case 'm':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memmove((void *)mdest, (void *)msrc, mcount);
break;
case 's':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memset((void *)mdest, m

[PATCH v4] powerpc/xmon: add read-only mode

2019-04-15 Thread Christopher M. Riedl
Operations which write to memory and special purpose registers should be
restricted on systems with integrity guarantees (such as Secure Boot)
and, optionally, to avoid self-destructive behaviors.

Add a config option, XMON_DEFAULT_RO_MODE, to set default xmon behavior.
The kernel cmdline options xmon=ro and xmon=rw override this default.

The following xmon operations are affected:
memops:
disable memmove
disable memset
disable memzcan
memex:
no-op'd mwrite
super_regs:
no-op'd write_spr
bpt_cmds:
disable
proc_call:
disable

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Oliver O'Halloran 
---

v3->v4:
Address Andrew's nitpick.

 arch/powerpc/Kconfig.debug |  8 
 arch/powerpc/xmon/xmon.c   | 42 ++
 2 files changed, 50 insertions(+)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 4e00cb0a5464..326ac5ea3f72 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -117,6 +117,14 @@ config XMON_DISASSEMBLY
  to say Y here, unless you're building for a memory-constrained
  system.
 
+config XMON_DEFAULT_RO_MODE
+   bool "Restrict xmon to read-only operations by default"
+   depends on XMON
+   default y
+   help
+  Operate xmon in read-only mode. The cmdline options 'xmon=rw' and
+  'xmon=ro' override this default.
+
 config DEBUGGER
bool
depends on KGDB || XMON
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a0f44f992360..ce98c8049eb6 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -80,6 +80,7 @@ static int set_indicator_token = RTAS_UNKNOWN_SERVICE;
 #endif
 static unsigned long in_xmon __read_mostly = 0;
 static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT);
+static bool xmon_is_ro = IS_ENABLED(CONFIG_XMON_DEFAULT_RO_MODE);
 
 static unsigned long adrs;
 static int size = 1;
@@ -202,6 +203,8 @@ static void dump_tlb_book3e(void);
 #define GETWORD(v)	(((v)[0] << 24) + ((v)[1] << 16) + ((v)[2] << 8) + (v)[3])
 #endif
 
+static const char *xmon_ro_msg = "Operation disabled: xmon in read-only mode\n";
+
 static char *help_string = "\
 Commands:\n\
  b	show breakpoints\n\
@@ -989,6 +992,10 @@ cmds(struct pt_regs *excp)
memlocate();
break;
case 'z':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memzcan();
break;
case 'i':
@@ -1042,6 +1049,10 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
bpt_cmds();
break;
case 'C':
@@ -1055,6 +1066,10 @@ cmds(struct pt_regs *excp)
bootcmds();
break;
case 'p':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
proccall();
break;
case 'P':
@@ -1777,6 +1792,11 @@ read_spr(int n, unsigned long *vp)
 static void
 write_spr(int n, unsigned long val)
 {
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   return;
+   }
+
if (setjmp(bus_error_jmp) == 0) {
catch_spr_faults = 1;
sync();
@@ -2016,6 +2036,12 @@ mwrite(unsigned long adrs, void *buf, int size)
char *p, *q;
 
n = 0;
+
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   return n;
+   }
+
if (setjmp(bus_error_jmp) == 0) {
catch_memory_errors = 1;
sync();
@@ -2884,9 +2910,17 @@ memops(int cmd)
scanhex((void *)&mcount);
switch( cmd ){
case 'm':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memmove((void *)mdest, (void *)msrc, mcount);
break;
case 's':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
memset((void *)mdest, mval, mcount);
break;
case 'd':
@@ -3796,6 +3830,14 @@ static int __init early_parse_xmon(char
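
The final hunk is cut off by the archive. As a rough sketch (not the actual
hunk), the cmdline handling implied by the commit message -- accepting
'xmon=ro' and 'xmon=rw' and flipping the xmon_is_ro flag -- would look
something like the following; the surrounding structure of early_parse_xmon()
is an assumption here:

	static int __init early_parse_xmon(char *p)
	{
		/* Existing "early"/"on"/"off" handling is assumed unchanged. */
		if (p && strncmp(p, "ro", 2) == 0)
			xmon_is_ro = true;	/* force read-only mode */
		else if (p && strncmp(p, "rw", 2) == 0)
			xmon_is_ro = false;	/* explicitly allow writes */

		return 0;
	}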

[RFC PATCH 2/3] powerpc/lib: Initialize a temporary mm for code patching

2020-03-22 Thread Christopher M. Riedl
When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped with permissive memory
protections. Currently, a per-cpu vmalloc patch area is used for this
purpose. While the patch area is per-cpu, the temporary page mapping is
inserted into the kernel page tables for the duration of the patching.
The mapping is exposed to CPUs other than the patching CPU - this is
undesirable from a hardening perspective.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
portion. The next patch uses the temporary mm and patching address for
code patching.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/lib/code-patching.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 3345f039a876..18b88ecfc5a8 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -39,6 +41,30 @@ int raw_patch_instruction(unsigned int *addr, unsigned int instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+__ro_after_init struct mm_struct *patching_mm;
+__ro_after_init unsigned long patching_addr;
+
+void __init poking_init(void)
+{
+   spinlock_t *ptl; /* for protecting pte table */
+   pte_t *ptep;
+
+   patching_mm = copy_init_mm();
+   BUG_ON(!patching_mm);
+
+   /*
+* In hash we cannot go above DEFAULT_MAP_WINDOW easily.
+* XXX: Do we want additional bits of entropy for radix?
+*/
+   patching_addr = (get_random_long() & PAGE_MASK) %
+   (DEFAULT_MAP_WINDOW - PAGE_SIZE);
+
+   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+   BUG_ON(!ptep);
+   pte_unmap_unlock(ptep, ptl);
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 static int text_area_cpu_up(unsigned int cpu)
-- 
2.25.1



[RFC PATCH 0/3] Use per-CPU temporary mappings for patching

2020-03-22 Thread Christopher M. Riedl
When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

1. Map page of text to be patched to per-CPU VM area w/
   PAGE_KERNEL protection
2. Patch text
3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either maliciously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching (or any
other sensitive operations requiring temporarily overriding memory
protections) [0].

x86 introduced "temporary mm" structs which allow the creation of
mappings local to a particular CPU [1]. This series intends to bring the
notion of a temporary mm to powerpc and harden powerpc by using such a
mapping for patching a kernel with strict RWX permissions.

The first patch introduces the temporary mm struct and API for powerpc
along with a new function to retrieve a current hw breakpoint.

The second patch uses the `poking_init` init hook added by the x86
patches to initialize a temporary mm and patching address. The patching
address is randomized between 0 and DEFAULT_MAP_WINDOW-PAGE_SIZE. The
upper limit is necessary due to how the hash MMU operates - by default
the space above DEFAULT_MAP_WINDOW is not available. For now, both hash
and radix randomize inside this range. The number of possible random
addresses is dependent on PAGE_SIZE and limited by DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

bits-o-entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits-o-entropy = log2(128TB / 64K)
bits-o-entropy = 31

Currently, randomization occurs only once during initialization at boot.

The third patch replaces the VM area with the temporary mm in the
patching code. The page for patching has to be mapped PAGE_SHARED with
the hash MMU since hash prevents the kernel from accessing userspace
pages with the PAGE_PRIVILEGED bit set. There is ongoing work on my side to
explore if this is actually necessary in the hash codepath.
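
Taken together, the flow after this series looks roughly like the sketch below
(function names follow patches 2 and 3; error handling and hash-specific
details are elided, so treat this as an outline rather than the literal patch):

	/* Sketch: patch one instruction via the temporary mm (IRQs off). */
	static int patch_instruction_tempmm(unsigned int *addr, unsigned int instr)
	{
		struct patch_mapping patch_mapping;
		unsigned int *patch_addr;
		unsigned long flags;
		int err;

		local_irq_save(flags);

		/*
		 * Map the target page into patching_mm at the randomized
		 * patching_addr and switch to the temporary mm.
		 */
		err = map_patch(addr, &patch_mapping);
		if (!err) {
			patch_addr = (unsigned int *)(patching_addr +
						      offset_in_page(addr));
			__patch_instruction(addr, instr, patch_addr);

			/*
			 * Clear the PTE (which flushes the TLB on hash) and
			 * switch back to the previous mm.
			 */
			unmap_patch(&patch_mapping);
		}

		local_irq_restore(flags);
		return err;
	}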

Testing so far is limited to booting on QEMU (power8 and power9 targets)
and a POWER8 VM along with setting some simple xmon breakpoints (which
makes use of code patching). A PoC lkdtm test is in progress to actually
exploit the existing vulnerability (i.e. the mapping during patching is
exposed in kernel page tables and accessible by other CPUs) - this will
accompany a future v1 of this series.

[0]: https://github.com/linuxppc/issues/issues/224
[1]: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Christopher M. Riedl (3):
  powerpc/mm: Introduce temporary mm
  powerpc/lib: Initialize a temporary mm for code patching
  powerpc/lib: Use a temporary mm for code patching

 arch/powerpc/include/asm/debug.h   |   1 +
 arch/powerpc/include/asm/mmu_context.h |  56 +-
 arch/powerpc/kernel/process.c  |   5 +
 arch/powerpc/lib/code-patching.c   | 140 ++---
 4 files changed, 137 insertions(+), 65 deletions(-)

-- 
2.25.1



[RFC PATCH 3/3] powerpc/lib: Use a temporary mm for code patching

2020-03-22 Thread Christopher M. Riedl
Currently, code patching a STRICT_KERNEL_RWX kernel exposes the temporary
mappings to other CPUs. These mappings should be kept local to the CPU
doing the patching. Use the pre-initialized temporary mm and patching
address for this purpose. Also add a check after patching to ensure the
patch succeeded.

Based on x86 implementation:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/lib/code-patching.c | 128 ++-
 1 file changed, 57 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 18b88ecfc5a8..f156132e8975 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int __patch_instruction(unsigned int *exec_addr, unsigned int instr,
   unsigned int *patch_addr)
@@ -65,99 +66,79 @@ void __init poking_init(void)
pte_unmap_unlock(ptep, ptl);
 }
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
-
-static int text_area_cpu_up(unsigned int cpu)
-{
-   struct vm_struct *area;
-
-   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
-   if (!area) {
-   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
-   cpu);
-   return -1;
-   }
-   this_cpu_write(text_poke_area, area);
-
-   return 0;
-}
-
-static int text_area_cpu_down(unsigned int cpu)
-{
-   free_vm_area(this_cpu_read(text_poke_area));
-   return 0;
-}
-
-/*
- * Run as a late init call. This allows all the boot time patching to be done
- * simply by patching the code, and then we're called here prior to
- * mark_rodata_ro(), which happens after all init calls are run. Although
- * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
- * it as being preferable to a kernel that will crash later when someone tries
- * to use patch_instruction().
- */
-static int __init setup_text_poke_area(void)
-{
-   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-   "powerpc/text_poke:online", text_area_cpu_up,
-   text_area_cpu_down));
-
-   return 0;
-}
-late_initcall(setup_text_poke_area);
+struct patch_mapping {
+   spinlock_t *ptl; /* for protecting pte table */
+   struct temp_mm temp_mm;
+};
 
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
 {
-   unsigned long pfn;
-   int err;
+   struct page *page;
+   pte_t pte, *ptep;
+   pgprot_t pgprot;
 
if (is_vmalloc_addr(addr))
-   pfn = vmalloc_to_pfn(addr);
+   page = vmalloc_to_page(addr);
else
-   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
+   page = virt_to_page(addr);
 
-   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+   if (radix_enabled())
+   pgprot = __pgprot(pgprot_val(PAGE_KERNEL));
+   else
+   pgprot = PAGE_SHARED;
 
-   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
-   if (err)
+   ptep = get_locked_pte(patching_mm, patching_addr, &patch_mapping->ptl);
+   if (unlikely(!ptep)) {
+   pr_warn("map patch: failed to allocate pte for patching\n");
return -1;
+   }
+
+   pte = mk_pte(page, pgprot);
+   set_pte_at(patching_mm, patching_addr, ptep, pte);
+
+   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+   use_temporary_mm(&patch_mapping->temp_mm);
 
return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static int unmap_patch(struct patch_mapping *patch_mapping)
 {
pte_t *ptep;
pmd_t *pmdp;
pud_t *pudp;
pgd_t *pgdp;
 
-   pgdp = pgd_offset_k(addr);
+   pgdp = pgd_offset(patching_mm, patching_addr);
if (unlikely(!pgdp))
return -EINVAL;
 
-   pudp = pud_offset(pgdp, addr);
+   pudp = pud_offset(pgdp, patching_addr);
if (unlikely(!pudp))
return -EINVAL;
 
-   pmdp = pmd_offset(pudp, addr);
+   pmdp = pmd_offset(pudp, patching_addr);
if (unlikely(!pmdp))
return -EINVAL;
 
-   ptep = pte_offset_kernel(pmdp, addr);
+   ptep = pte_offset_kernel(pmdp, patching_addr);
if (unlikely(!ptep))
return -EINVAL;
 
-   pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
+   /*
+* In hash, pte_clear flushes the tlb
+*/
+   pte_clear(patching_mm, patching_addr, ptep);
+   unuse_temporary_mm(&patch_mapping->temp_mm);
 
/*
-* In hash, pte_c

[RFC PATCH 1/3] powerpc/mm: Introduce temporary mm

2020-03-22 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. A side benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.
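
In terms of the API added below, the intended call pattern is roughly the
following (a sketch only; patching_mm and the surrounding patching sequence
come from the follow-up patches):

	struct temp_mm temp_mm;
	unsigned long flags;

	init_temp_mm(&temp_mm, patching_mm);	/* associate the mm to switch to */

	local_irq_save(flags);			/* IRQs must be off while in use */
	use_temporary_mm(&temp_mm);		/* switch; saves/disables HW breakpoints */

	/* ... access the temporary mapping, e.g. write the patched instruction ... */

	unuse_temporary_mm(&temp_mm);		/* switch back; restores breakpoints */
	local_irq_restore(flags);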

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/debug.h   |  1 +
 arch/powerpc/include/asm/mmu_context.h | 56 +-
 arch/powerpc/kernel/process.c  |  5 +++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 7756026b95ca..b945bc16c932 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs *regs) { return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
+void __get_breakpoint(struct arch_hw_breakpoint *brk);
 void __set_breakpoint(struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 360367c579de..3e6381d04c28 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -7,9 +7,10 @@
 #include 
 #include 
 #include 
-#include
+#include 
 #include 
 #include 
+#include 
 
 /*
  * Most if the context management is out of line
@@ -270,5 +271,58 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   bool is_kernel_thread;
+   struct arch_hw_breakpoint brk;
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   temp_mm->is_kernel_thread = false;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->is_kernel_thread = current->mm == NULL;
+   if (temp_mm->is_kernel_thread)
+   temp_mm->prev = current->active_mm;
+   else
+   temp_mm->prev = current->mm;
+
+   /*
+* Hash requires a non-NULL current->mm to allocate a userspace address
+* when handling a page fault. Does not appear to hurt in Radix either.
+*/
+   current->mm = temp_mm->temp;
+   switch_mm_irqs_off(NULL, temp_mm->temp, current);
+
+   if (ppc_breakpoint_available()) {
+   __get_breakpoint(&temp_mm->brk);
+   if (temp_mm->brk.type != 0)
+   hw_breakpoint_disable();
+   }
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   if (temp_mm->is_kernel_thread)
+   current->mm = NULL;
+   else
+   current->mm = temp_mm->prev;
+   switch_mm_irqs_off(NULL, temp_mm->prev, current);
+
+   if (ppc_breakpoint_available() && temp_mm->brk.type != 0)
+   __set_breakpoint(&temp_mm->brk);
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index fad50db9dcf2..5e5cf33fc358 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -793,6 +793,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk), sizeof(*brk));
+}
+
 void __set_breakpoint(struct arch_hw_breakpoint *brk)
 {
	memcpy(this_cpu_ptr(&current_brk), brk, sizeof(*brk));
-- 
2.25.1



Re: [RFC PATCH 1/3] powerpc/mm: Introduce temporary mm

2020-03-30 Thread Christopher M Riedl
> On March 24, 2020 11:07 AM Christophe Leroy  wrote:
> 
>  
> Le 23/03/2020 à 05:52, Christopher M. Riedl a écrit :
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. A side benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> > 
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> > 
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use.
> > 
> > Based on x86 implementation:
> > 
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/include/asm/debug.h   |  1 +
> >   arch/powerpc/include/asm/mmu_context.h | 56 +-
> >   arch/powerpc/kernel/process.c  |  5 +++
> >   3 files changed, 61 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/debug.h 
> > b/arch/powerpc/include/asm/debug.h
> > index 7756026b95ca..b945bc16c932 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs 
> > *regs) { return 0; }
> >   static inline int debugger_fault_handler(struct pt_regs *regs) { return 
> > 0; }
> >   #endif
> >   
> > +void __get_breakpoint(struct arch_hw_breakpoint *brk);
> >   void __set_breakpoint(struct arch_hw_breakpoint *brk);
> >   bool ppc_breakpoint_available(void);
> >   #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> > b/arch/powerpc/include/asm/mmu_context.h
> > index 360367c579de..3e6381d04c28 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -7,9 +7,10 @@
> >   #include 
> >   #include 
> >   #include 
> > -#include
> > +#include 
> 
> What's this change ?
> I see you are removing a space at the end of the line, but it shouldn't 
> be part of this patch.
> 

Overly aggressive "helpful" editor setting apparently.
Removed this change in the next version.

> >   #include 
> >   #include 
> > +#include 
> >   
> >   /*
> >* Most if the context management is out of line
> > @@ -270,5 +271,58 @@ static inline int arch_dup_mmap(struct mm_struct 
> > *oldmm,
> > return 0;
> >   }
> >   
> > +struct temp_mm {
> > +   struct mm_struct *temp;
> > +   struct mm_struct *prev;
> > +   bool is_kernel_thread;
> > +   struct arch_hw_breakpoint brk;
> > +};
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct 
> > *mm)
> > +{
> > +   temp_mm->temp = mm;
> > +   temp_mm->prev = NULL;
> > +   temp_mm->is_kernel_thread = false;
> > +   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   temp_mm->is_kernel_thread = current->mm == NULL;
> > +   if (temp_mm->is_kernel_thread)
> > +   temp_mm->prev = current->active_mm;
> > +   else
> > +   temp_mm->prev = current->mm;
> > +
> > +   /*
> > +* Hash requires a non-NULL current->mm to allocate a userspace address
> > +* when handling a page fault. Does not appear to hurt in Radix either.
> > +*/
> > +   current->mm = temp_mm->temp;
> > +   switch_mm_irqs_off(NULL, temp_mm->temp, current);
> > +
> > +   if (ppc_breakpoint_available()) {
> > +   __get_breakpoint(&temp_mm->brk);
> > +   if (temp_mm->brk.type != 0)
> > +   hw_breakpoint_disable();
> > +   }
> > +}
> > +
> > +static inline void unuse_temporary_mm(struct temp_mm *tem

Re: [RFC PATCH 2/3] powerpc/lib: Initialize a temporary mm for code patching

2020-03-30 Thread Christopher M Riedl
> On March 24, 2020 11:10 AM Christophe Leroy  wrote:
> 
>  
> Le 23/03/2020 à 05:52, Christopher M. Riedl a écrit :
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped with permissive memory
> > protections. Currently, a per-cpu vmalloc patch area is used for this
> > purpose. While the patch area is per-cpu, the temporary page mapping is
> > inserted into the kernel page tables for the duration of the patching.
> > The mapping is exposed to CPUs other than the patching CPU - this is
> > undesirable from a hardening perspective.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > portion. The next patch uses the temporary mm and patching address for
> > code patching.
> > 
> > Based on x86 implementation:
> > 
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/lib/code-patching.c | 26 ++
> >   1 file changed, 26 insertions(+)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 3345f039a876..18b88ecfc5a8 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,8 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> > +#include 
> >   
> >   #include 
> >   #include 
> > @@ -39,6 +41,30 @@ int raw_patch_instruction(unsigned int *addr, unsigned 
> > int instr)
> >   }
> >   
> >   #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +__ro_after_init struct mm_struct *patching_mm;
> > +__ro_after_init unsigned long patching_addr;
> 
> Can we make those those static ?
> 

Yes, makes sense to me.

> > +
> > +void __init poking_init(void)
> > +{
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   pte_t *ptep;
> > +
> > +   patching_mm = copy_init_mm();
> > +   BUG_ON(!patching_mm);
> 
> Does it needs to be a BUG_ON() ? Can't we fail gracefully with just a 
> WARN_ON ?
>

I'm not sure what failing gracefully means here? The main reason this could
fail is if there is not enough memory to allocate the patching_mm. The
previous implementation had this justification for BUG_ON():

/*
 * Run as a late init call. This allows all the boot time patching to be done
 * simply by patching the code, and then we're called here prior to
 * mark_rodata_ro(), which happens after all init calls are run. Although
 * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
 * it as being preferable to a kernel that will crash later when someone tries
 * to use patch_instruction().
 */
static int __init setup_text_poke_area(void)
{
BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"powerpc/text_poke:online", text_area_cpu_up,
text_area_cpu_down));

return 0;
}
late_initcall(setup_text_poke_area);

I think the BUG_ON() is appropriate even if only to adhere to the previous
judgement call. I can add a similar comment explaining the reasoning if
that helps.

> > +
> > +   /*
> > +* In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> > +* XXX: Do we want additional bits of entropy for radix?
> > +*/
> > +   patching_addr = (get_random_long() & PAGE_MASK) %
> > +   (DEFAULT_MAP_WINDOW - PAGE_SIZE);
> > +
> > +   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +   BUG_ON(!ptep);
> 
> Same here, can we fail gracefully instead ?
>

Same reasoning as above.

> > +   pte_unmap_unlock(ptep, ptl);
> > +}
> > +
> >   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >   
> >   static int text_area_cpu_up(unsigned int cpu)
> > 
> 
> Christophe


Re: [RFC PATCH 2/3] powerpc/lib: Initialize a temporary mm for code patching

2020-04-14 Thread Christopher M Riedl
> On April 8, 2020 6:01 AM Christophe Leroy  wrote:
> 
>  
> Le 31/03/2020 à 05:19, Christopher M Riedl a écrit :
> >> On March 24, 2020 11:10 AM Christophe Leroy  
> >> wrote:
> >>
> >>   
> >> Le 23/03/2020 à 05:52, Christopher M. Riedl a écrit :
> >>> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> >>> address to be patched is temporarily mapped with permissive memory
> >>> protections. Currently, a per-cpu vmalloc patch area is used for this
> >>> purpose. While the patch area is per-cpu, the temporary page mapping is
> >>> inserted into the kernel page tables for the duration of the patching.
> >>> The mapping is exposed to CPUs other than the patching CPU - this is
> >>> undesirable from a hardening perspective.
> >>>
> >>> Use the `poking_init` init hook to prepare a temporary mm and patching
> >>> address. Initialize the temporary mm by copying the init mm. Choose a
> >>> randomized patching address inside the temporary mm userspace address
> >>> portion. The next patch uses the temporary mm and patching address for
> >>> code patching.
> >>>
> >>> Based on x86 implementation:
> >>>
> >>> commit 4fc19708b165
> >>> ("x86/alternatives: Initialize temporary mm for patching")
> >>>
> >>> Signed-off-by: Christopher M. Riedl 
> >>> ---
> >>>arch/powerpc/lib/code-patching.c | 26 ++
> >>>1 file changed, 26 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/lib/code-patching.c 
> >>> b/arch/powerpc/lib/code-patching.c
> >>> index 3345f039a876..18b88ecfc5a8 100644
> >>> --- a/arch/powerpc/lib/code-patching.c
> >>> +++ b/arch/powerpc/lib/code-patching.c
> >>> @@ -11,6 +11,8 @@
> >>>#include 
> >>>#include 
> >>>#include 
> >>> +#include 
> >>> +#include 
> >>>
> >>>#include 
> >>>#include 
> >>> @@ -39,6 +41,30 @@ int raw_patch_instruction(unsigned int *addr, unsigned 
> >>> int instr)
> >>>}
> >>>
> >>>#ifdef CONFIG_STRICT_KERNEL_RWX
> >>> +
> >>> +__ro_after_init struct mm_struct *patching_mm;
> >>> +__ro_after_init unsigned long patching_addr;
> >>
> >> Can we make those those static ?
> >>
> > 
> > Yes, makes sense to me.
> > 
> >>> +
> >>> +void __init poking_init(void)
> >>> +{
> >>> + spinlock_t *ptl; /* for protecting pte table */
> >>> + pte_t *ptep;
> >>> +
> >>> + patching_mm = copy_init_mm();
> >>> + BUG_ON(!patching_mm);
> >>
> >> Does it needs to be a BUG_ON() ? Can't we fail gracefully with just a
> >> WARN_ON ?
> >>
> > 
> > I'm not sure what failing gracefully means here? The main reason this could
> > fail is if there is not enough memory to allocate the patching_mm. The
> > previous implementation had this justification for BUG_ON():
> 
> But the system can continue running just fine after this failure.
> Only the things that make use of code patching will fail (ftrace, kgdb, ...)
> 
> Checkpatch tells: "Avoid crashing the kernel - try using WARN_ON & 
> recovery code rather than BUG() or BUG_ON()"
> 
> All vital code patching has already been done previously, so I think a 
> WARN_ON() should be enough, plus returning non 0 to indicate that the 
> late_initcall failed.
> 
> 

Got it, makes sense to me. I will make these changes in the next version.
Thanks!
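
Roughly what I have in mind for the next spin - just a sketch of the
direction (untested, names are the ones already in this RFC):

void __init poking_init(void)
{
	spinlock_t *ptl; /* for protecting pte table */
	pte_t *ptep;

	/*
	 * Failing here only means we cannot patch under
	 * STRICT_KERNEL_RWX later; warn and bail out instead of
	 * killing the boot, leaving patching_mm NULL so later
	 * patching can notice the setup failed.
	 */
	patching_mm = copy_init_mm();
	if (WARN_ON(!patching_mm))
		return;

	/* In hash we cannot go above DEFAULT_MAP_WINDOW easily. */
	patching_addr = (get_random_long() & PAGE_MASK) %
			(DEFAULT_MAP_WINDOW - PAGE_SIZE);

	/* Pre-allocate the PTE so patching cannot fail on allocation later. */
	ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
	if (WARN_ON(!ptep))
		return;
	pte_unmap_unlock(ptep, ptl);
}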

> > 
> > /*
> >   * Run as a late init call. This allows all the boot time patching to be 
> > done
> >   * simply by patching the code, and then we're called here prior to
> >   * mark_rodata_ro(), which happens after all init calls are run. Although
> >   * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we 
> > judge
> >   * it as being preferable to a kernel that will crash later when someone 
> > tries
> >   * to use patch_instruction().
> >   */
> > static int __init setup_text_poke_area(void)
> > {
> >  BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> >  "powerpc/text_poke:online", text_area_cpu_up,
> >  text_area_cpu_down));
> > 
> >  return 0;
> > }
> > late_initcall(setup_text_poke_area);
> > 
> > I think the BUG_ON() is appropriate even if only to adhere to the previous
> > judgement call. I can add a similar comment explaining the reasoning if
> > that helps.
> > 
> >>> +
> >>> + /*
> >>> +  * In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> >>> +  * XXX: Do we want additional bits of entropy for radix?
> >>> +  */
> >>> + patching_addr = (get_random_long() & PAGE_MASK) %
> >>> + (DEFAULT_MAP_WINDOW - PAGE_SIZE);
> >>> +
> >>> + ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> >>> + BUG_ON(!ptep);
> >>
> >> Same here, can we fail gracefully instead ?
> >>
> > 
> > Same reasoning as above.
> 
> Here as well, a WARN_ON() should be enough, the system will continue 
> running after that.
> 
> > 
> >>> + pte_unmap_unlock(ptep, ptl);
> >>> +}
> >>> +
> >>>static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >>>
> >>>static int text_area_cpu_up(unsigned int cpu)
> >>>
> >>
> >> Christophe
> 
> Christophe


Re: [RFC PATCH 3/3] powerpc/lib: Use a temporary mm for code patching

2020-04-14 Thread Christopher M Riedl
> On March 24, 2020 11:25 AM Christophe Leroy  wrote:
> 
>  
> Le 23/03/2020 à 05:52, Christopher M. Riedl a écrit :
> > Currently, code patching a STRICT_KERNEL_RWX kernel exposes the temporary
> > mappings to other CPUs. These mappings should be kept local to the CPU
> > doing the patching. Use the pre-initialized temporary mm and patching
> > address for this purpose. Also add a check after patching to ensure the
> > patch succeeded.
> > 
> > Based on x86 implementation:
> > 
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/lib/code-patching.c | 128 ++-
> >   1 file changed, 57 insertions(+), 71 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 18b88ecfc5a8..f156132e8975 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -19,6 +19,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   
> >   static int __patch_instruction(unsigned int *exec_addr, unsigned int 
> > instr,
> >unsigned int *patch_addr)
> > @@ -65,99 +66,79 @@ void __init poking_init(void)
> > pte_unmap_unlock(ptep, ptl);
> >   }
> >   
> > -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > -
> > -static int text_area_cpu_up(unsigned int cpu)
> > -{
> > -   struct vm_struct *area;
> > -
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > -
> > -   return 0;
> > -}
> > -
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > -}
> > -
> > -/*
> > - * Run as a late init call. This allows all the boot time patching to be 
> > done
> > - * simply by patching the code, and then we're called here prior to
> > - * mark_rodata_ro(), which happens after all init calls are run. Although
> > - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we 
> > judge
> > - * it as being preferable to a kernel that will crash later when someone 
> > tries
> > - * to use patch_instruction().
> > - */
> > -static int __init setup_text_poke_area(void)
> > -{
> > -   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > -   "powerpc/text_poke:online", text_area_cpu_up,
> > -   text_area_cpu_down));
> > -
> > -   return 0;
> > -}
> > -late_initcall(setup_text_poke_area);
> > +struct patch_mapping {
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   struct temp_mm temp_mm;
> > +};
> >   
> >   /*
> >* This can be called for kernel text or a module.
> >*/
> > -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> > +static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
> 
> Why change the name ?
> 

It's not really an "area" anymore.

> >   {
> > -   unsigned long pfn;
> > -   int err;
> > +   struct page *page;
> > +   pte_t pte, *ptep;
> > +   pgprot_t pgprot;
> >   
> > if (is_vmalloc_addr(addr))
> > -   pfn = vmalloc_to_pfn(addr);
> > +   page = vmalloc_to_page(addr);
> > else
> > -   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> > +   page = virt_to_page(addr);
> >   
> > -   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
> > +   if (radix_enabled())
> > +   pgprot = __pgprot(pgprot_val(PAGE_KERNEL));
> > +   else
> > +   pgprot = PAGE_SHARED;
> 
> Can you explain the difference between radix and non radix ?
> 
> Why PAGE_KERNEL for a page that is mapped in userspace ?
> 
> Why do you need to do __pgprot(pgprot_val(PAGE_KERNEL)) instead of just 
> using PAGE_KERNEL ?
> 

On hash there is a manual check which prevents setting _PAGE_PRIVILEGED for
kernel to userspace access in __hash_page - hence we cannot access the mapping
if the page is mapped PAGE_KERNEL on hash. However, I would like to use
PAGE_KERNEL here as well and am working on understanding why this check is
done in hash and i

Re: [RFC PATCH] powerpc/lib: Fixing use a temporary mm for code patching

2020-04-14 Thread Christopher M Riedl
> On March 26, 2020 9:42 AM Christophe Leroy  wrote:
> 
>  
> This patch fixes the RFC series identified below.
> It fixes three points:
> - Failure with CONFIG_PPC_KUAP
> - Failure to write due to lack of DIRTY bit set on the 8xx
> - Inadequately complex WARN post verification
> 
> However, it has an impact on the CPU load. Here is the time
> needed on an 8xx to run the ftrace selftests without and
> with this series:
> - Without CONFIG_STRICT_KERNEL_RWX==> 38 seconds
> - With CONFIG_STRICT_KERNEL_RWX   ==> 40 seconds
> - With CONFIG_STRICT_KERNEL_RWX + this series ==> 43 seconds
> 
> Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=166003
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/lib/code-patching.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c 
> b/arch/powerpc/lib/code-patching.c
> index f156132e8975..4ccff427592e 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -97,6 +97,7 @@ static int map_patch(const void *addr, struct patch_mapping 
> *patch_mapping)
>   }
>  
>   pte = mk_pte(page, pgprot);
> + pte = pte_mkdirty(pte);
>   set_pte_at(patching_mm, patching_addr, ptep, pte);
>  
>   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> @@ -168,7 +169,9 @@ static int do_patch_instruction(unsigned int *addr, 
> unsigned int instr)
>   (offset_in_page((unsigned long)addr) /
>   sizeof(unsigned int));
>  
> + allow_write_to_user(patch_addr, sizeof(instr));
>   __patch_instruction(addr, instr, patch_addr);
> + prevent_write_to_user(patch_addr, sizeof(instr));
> 

On radix we can map the page with PAGE_KERNEL protection which ends up
setting EAA[0] in the radix PTE. This means the KUAP (AMR) protection is
ignored (ISA v3.0b Fig. 35) since we are accessing the page from MSR[PR]=0.

Can we employ a similar approach on the 8xx? I would prefer *not* to wrap
the __patch_instruction() with the allow_/prevent_write_to_user() KUAP things
because this is a temporary kernel mapping which really isn't userspace in
the usual sense.
 
>   err = unmap_patch(&patch_mapping);
>   if (err)
> @@ -179,7 +182,7 @@ static int do_patch_instruction(unsigned int *addr, 
> unsigned int instr)
>* think we just wrote.
>* XXX: BUG_ON() instead?
>*/
> - WARN_ON(memcmp(addr, &instr, sizeof(instr)));
> + WARN_ON(*addr != instr);
>  
>  out:
>   local_irq_restore(flags);
> -- 
> 2.25.0


Re: [RFC PATCH] powerpc/lib: Fixing use a temporary mm for code patching

2020-04-15 Thread Christopher M Riedl
> On April 15, 2020 4:12 AM Christophe Leroy  wrote:
> 
>  
> Le 15/04/2020 à 07:16, Christopher M Riedl a écrit :
> >> On March 26, 2020 9:42 AM Christophe Leroy  wrote:
> >>
> >>   
> >> This patch fixes the RFC series identified below.
> >> It fixes three points:
> >> - Failure with CONFIG_PPC_KUAP
> >> - Failure to write due to lack of DIRTY bit set on the 8xx
> >> - Inadequately complex WARN post verification
> >>
> >> However, it has an impact on the CPU load. Here is the time
> >> needed on an 8xx to run the ftrace selftests without and
> >> with this series:
> >> - Without CONFIG_STRICT_KERNEL_RWX ==> 38 seconds
> >> - With CONFIG_STRICT_KERNEL_RWX==> 40 seconds
> >> - With CONFIG_STRICT_KERNEL_RWX + this series  ==> 43 seconds
> >>
> >> Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=166003
> >> Signed-off-by: Christophe Leroy 
> >> ---
> >>   arch/powerpc/lib/code-patching.c | 5 -
> >>   1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/lib/code-patching.c 
> >> b/arch/powerpc/lib/code-patching.c
> >> index f156132e8975..4ccff427592e 100644
> >> --- a/arch/powerpc/lib/code-patching.c
> >> +++ b/arch/powerpc/lib/code-patching.c
> >> @@ -97,6 +97,7 @@ static int map_patch(const void *addr, struct 
> >> patch_mapping *patch_mapping)
> >>}
> >>   
> >>pte = mk_pte(page, pgprot);
> >> +  pte = pte_mkdirty(pte);
> >>set_pte_at(patching_mm, patching_addr, ptep, pte);
> >>   
> >>init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> >> @@ -168,7 +169,9 @@ static int do_patch_instruction(unsigned int *addr, 
> >> unsigned int instr)
> >>(offset_in_page((unsigned long)addr) /
> >>sizeof(unsigned int));
> >>   
> >> +  allow_write_to_user(patch_addr, sizeof(instr));
> >>__patch_instruction(addr, instr, patch_addr);
> >> +  prevent_write_to_user(patch_addr, sizeof(instr));
> >>
> > 
> > On radix we can map the page with PAGE_KERNEL protection which ends up
> > setting EAA[0] in the radix PTE. This means the KUAP (AMR) protection is
> > ignored (ISA v3.0b Fig. 35) since we are accessing the page from MSR[PR]=0.
> > 
> > Can we employ a similar approach on the 8xx? I would prefer *not* to wrap
> > the __patch_instruction() with the allow_/prevent_write_to_user() KUAP 
> > things
> > because this is a temporary kernel mapping which really isn't userspace in
> > the usual sense.
> 
> On the 8xx, that's pretty different.
> 
> The PTE doesn't control whether a page is user page or a kernel page. 
> The only thing that is set in the PTE is whether a page is linked to a 
> given PID or not.
> PAGE_KERNEL tells that the page can be addressed with any PID.
> 
> The user access right is given by a kind of zone, which is in the PGD 
> entry. Every pages above PAGE_OFFSET are defined as belonging to zone 0. 
> Every pages below PAGE_OFFSET are defined as belonging to zone 1.
> 
> By default, zone 0 can only be accessed by kernel, and zone 1 can only 
> be accessed by user. When kernel wants to access zone 1, it temporarily 
> changes properties of zone 1 to allow both kernel and user accesses.
> 
> So, if your mapping is below PAGE_OFFSET, it is in zone 1 and kernel 
> must unlock it to access it.
> 
> 
> And this is more or less the same on hash/32. This is managed by segment 
> registers. One segment register corresponds to a 256Mbytes area. Every 
> pages below PAGE_OFFSET can only be read by default by kernel. Only user 
> can write if the PTE allows it. When the kernel needs to write at an 
> address below PAGE_OFFSET, it must change the segment properties in the 
> corresponding segment register.
> 
> So, for both cases, if we want to have it local to a task while still 
> allowing kernel access, it means we have to define a new special area 
> between TASK_SIZE and PAGE_OFFSET which belongs to kernel zone.
> 
> That looks complex to me for a small benefit, especially as 8xx is not 
> SMP and neither are most of the hash/32 targets.
> 

Agreed. So I guess the solution is to differentiate between radix/non-radix
and use PAGE_SHARED for non-radix along with the KUAP functions when KUAP
is enabled. Hmm, I need to think about this some more, especially if it's
acceptable to temporarily map kernel text as PAGE_SHARED for patching. Do
you see any obvious problems on 8xx and hash/32 w/ using PAGE_SHARED?

I don't necessarily want to drop the local mm patching idea for non-radix
platforms since that means we would have to maintain two implementations.
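
To make that concrete, something like this for the actual write - just a
sketch, and "patch_text_at" is only a placeholder name, not something in
the series:

/*
 * Sketch only: radix maps the patch page PAGE_KERNEL and can skip the
 * KUAP helpers entirely; everything else maps it PAGE_SHARED and opens
 * a user-access window just around the write.
 */
static int patch_text_at(unsigned int *exec_addr, unsigned int instr,
			 unsigned int *patch_addr)
{
	int err;

	if (radix_enabled())
		return __patch_instruction(exec_addr, instr, patch_addr);

	allow_write_to_user(patch_addr, sizeof(instr));
	err = __patch_instruction(exec_addr, instr, patch_addr);
	prevent_write_to_user(patch_addr, sizeof(instr));

	return err;
}

That keeps the KUAP open/close window as small as possible on 8xx and
hash/32.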

> Christophe


Re: [RFC PATCH 3/3] powerpc/lib: Use a temporary mm for code patching

2020-04-15 Thread Christopher M Riedl
> On April 15, 2020 3:45 AM Christophe Leroy  wrote:
> 
>  
> Le 15/04/2020 à 07:11, Christopher M Riedl a écrit :
> >> On March 24, 2020 11:25 AM Christophe Leroy  
> >> wrote:
> >>
> >>   
> >> Le 23/03/2020 à 05:52, Christopher M. Riedl a écrit :
> >>> Currently, code patching a STRICT_KERNEL_RWX kernel exposes the temporary
> >>> mappings to other CPUs. These mappings should be kept local to the CPU
> >>> doing the patching. Use the pre-initialized temporary mm and patching
> >>> address for this purpose. Also add a check after patching to ensure the
> >>> patch succeeded.
> >>>
> >>> Based on x86 implementation:
> >>>
> >>> commit b3fd8e83ada0
> >>> ("x86/alternatives: Use temporary mm for text poking")
> >>>
> >>> Signed-off-by: Christopher M. Riedl 
> >>> ---
> >>>arch/powerpc/lib/code-patching.c | 128 ++-
> >>>1 file changed, 57 insertions(+), 71 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/lib/code-patching.c 
> >>> b/arch/powerpc/lib/code-patching.c
> >>> index 18b88ecfc5a8..f156132e8975 100644
> >>> --- a/arch/powerpc/lib/code-patching.c
> >>> +++ b/arch/powerpc/lib/code-patching.c
> >>> @@ -19,6 +19,7 @@
> >>>#include 
> >>>#include 
> >>>#include 
> >>> +#include 
> >>>
> >>>static int __patch_instruction(unsigned int *exec_addr, unsigned int 
> >>> instr,
> >>>  unsigned int *patch_addr)
> >>> @@ -65,99 +66,79 @@ void __init poking_init(void)
> >>>   pte_unmap_unlock(ptep, ptl);
> >>>}
> >>>
> >>> -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >>> -
> >>> -static int text_area_cpu_up(unsigned int cpu)
> >>> -{
> >>> - struct vm_struct *area;
> >>> -
> >>> - area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >>> - if (!area) {
> >>> - WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> >>> - cpu);
> >>> - return -1;
> >>> - }
> >>> - this_cpu_write(text_poke_area, area);
> >>> -
> >>> - return 0;
> >>> -}
> >>> -
> >>> -static int text_area_cpu_down(unsigned int cpu)
> >>> -{
> >>> - free_vm_area(this_cpu_read(text_poke_area));
> >>> - return 0;
> >>> -}
> >>> -
> >>> -/*
> >>> - * Run as a late init call. This allows all the boot time patching to be 
> >>> done
> >>> - * simply by patching the code, and then we're called here prior to
> >>> - * mark_rodata_ro(), which happens after all init calls are run. Although
> >>> - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and 
> >>> we judge
> >>> - * it as being preferable to a kernel that will crash later when someone 
> >>> tries
> >>> - * to use patch_instruction().
> >>> - */
> >>> -static int __init setup_text_poke_area(void)
> >>> -{
> >>> - BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> >>> - "powerpc/text_poke:online", text_area_cpu_up,
> >>> - text_area_cpu_down));
> >>> -
> >>> - return 0;
> >>> -}
> >>> -late_initcall(setup_text_poke_area);
> >>> +struct patch_mapping {
> >>> + spinlock_t *ptl; /* for protecting pte table */
> >>> + struct temp_mm temp_mm;
> >>> +};
> >>>
> >>>/*
> >>> * This can be called for kernel text or a module.
> >>> */
> >>> -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> >>> +static int map_patch(const void *addr, struct patch_mapping 
> >>> *patch_mapping)
> >>
> >> Why change the name ?
> >>
> > 
> > It's not really an "area" anymore.
> > 
> >>>{
> >>> - unsigned long pfn;
> >>> - int err;
> >>> + struct page *page;
> >>> + pte_t pte, *ptep;
> >>> + pgprot_t pgprot;
> >>>
> >>>   if (is_vmalloc_addr(addr))
> >>> - pfn = vmalloc_to_pfn(addr);
> >>> + 

Re: [RFC PATCH] powerpc/lib: Fixing use a temporary mm for code patching

2020-04-20 Thread Christopher M. Riedl
On Sat Apr 18, 2020 at 12:27 PM, Christophe Leroy wrote:
>
> 
>
> 
> Le 15/04/2020 à 18:22, Christopher M Riedl a écrit :
> >> On April 15, 2020 4:12 AM Christophe Leroy  wrote:
> >>
> >>   
> >> Le 15/04/2020 à 07:16, Christopher M Riedl a écrit :
> >>>> On March 26, 2020 9:42 AM Christophe Leroy  
> >>>> wrote:
> >>>>
> >>>>
> >>>> This patch fixes the RFC series identified below.
> >>>> It fixes three points:
> >>>> - Failure with CONFIG_PPC_KUAP
> >>>> - Failure to write due to lack of DIRTY bit set on the 8xx
> >>>> - Inadequately complex WARN post verification
> >>>>
> >>>> However, it has an impact on the CPU load. Here is the time
> >>>> needed on an 8xx to run the ftrace selftests without and
> >>>> with this series:
> >>>> - Without CONFIG_STRICT_KERNEL_RWX   ==> 38 seconds
> >>>> - With CONFIG_STRICT_KERNEL_RWX  ==> 40 seconds
> >>>> - With CONFIG_STRICT_KERNEL_RWX + this series==> 43 seconds
> >>>>
> >>>> Link: 
> >>>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=166003
> >>>> Signed-off-by: Christophe Leroy 
> >>>> ---
> >>>>arch/powerpc/lib/code-patching.c | 5 -
> >>>>1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/arch/powerpc/lib/code-patching.c 
> >>>> b/arch/powerpc/lib/code-patching.c
> >>>> index f156132e8975..4ccff427592e 100644
> >>>> --- a/arch/powerpc/lib/code-patching.c
> >>>> +++ b/arch/powerpc/lib/code-patching.c
> >>>> @@ -97,6 +97,7 @@ static int map_patch(const void *addr, struct 
> >>>> patch_mapping *patch_mapping)
> >>>>  }
> >>>>
> >>>>  pte = mk_pte(page, pgprot);
> >>>> +pte = pte_mkdirty(pte);
> >>>>  set_pte_at(patching_mm, patching_addr, ptep, pte);
> >>>>
> >>>>  init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> >>>> @@ -168,7 +169,9 @@ static int do_patch_instruction(unsigned int *addr, 
> >>>> unsigned int instr)
> >>>>  (offset_in_page((unsigned long)addr) /
> >>>>  sizeof(unsigned int));
> >>>>
> >>>> +allow_write_to_user(patch_addr, sizeof(instr));
> >>>>  __patch_instruction(addr, instr, patch_addr);
> >>>> +prevent_write_to_user(patch_addr, sizeof(instr));
> >>>>
> >>>
> >>> On radix we can map the page with PAGE_KERNEL protection which ends up
> >>> setting EAA[0] in the radix PTE. This means the KUAP (AMR) protection is
> >>> ignored (ISA v3.0b Fig. 35) since we are accessing the page from 
> >>> MSR[PR]=0.
> >>>
> >>> Can we employ a similar approach on the 8xx? I would prefer *not* to wrap
> >>> the __patch_instruction() with the allow_/prevent_write_to_user() KUAP 
> >>> things
> >>> because this is a temporary kernel mapping which really isn't userspace in
> >>> the usual sense.
> >>
> >> On the 8xx, that's pretty different.
> >>
> >> The PTE doesn't control whether a page is user page or a kernel page.
> >> The only thing that is set in the PTE is whether a page is linked to a
> >> given PID or not.
> >> PAGE_KERNEL tells that the page can be addressed with any PID.
> >>
> >> The user access right is given by a kind of zone, which is in the PGD
> >> entry. Every pages above PAGE_OFFSET are defined as belonging to zone 0.
> >> Every pages below PAGE_OFFSET are defined as belonging to zone 1.
> >>
> >> By default, zone 0 can only be accessed by kernel, and zone 1 can only
> >> be accessed by user. When kernel wants to access zone 1, it temporarily
> >> changes properties of zone 1 to allow both kernel and user accesses.
> >>
> >> So, if your mapping is below PAGE_OFFSET, it is in zone 1 and kernel
> >> must unlock it to access it.
> >>
> >>
> >> And this is more or less the same on hash/32. This is managed by segment
> >> registers. One segment register corresponds to a 256Mbytes area. Every
> >> pages below PAGE_OFFSET can only

Re: [PATCH 1/3] powerpc: Properly return error code from do_patch_instruction()

2020-04-24 Thread Christopher M. Riedl
On Fri Apr 24, 2020 at 9:15 AM, Steven Rostedt wrote:
> On Thu, 23 Apr 2020 18:21:14 +0200
> Christophe Leroy  wrote:
>
> 
> > Le 23/04/2020 à 17:09, Naveen N. Rao a écrit :
> > > With STRICT_KERNEL_RWX, we are currently ignoring return value from
> > > __patch_instruction() in do_patch_instruction(), resulting in the error
> > > not being propagated back. Fix the same.  
> > 
> > Good patch.
> > 
> > Be aware that there is ongoing work which tends toward replacing error 
> > reporting with BUG_ON(). See 
> > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=166003
>
> 
> Thanks for the reference. I still believe that WARN_ON() should be used
> in
> 99% of the cases, including here. And only do a BUG_ON() when you know
> there's no recovering from it.
>
> 
> In fact, there's still BUG_ON()s in my code that I need to convert to
> WARN_ON() (it was written when BUG_ON() was still acceptable ;-)
>
Figured I'd chime in since I am working on that other series :) The
BUG_ON()s are _only_ in the init code to set things up to allow a
temporary mapping for patching a STRICT_RWX kernel later. There's no
ongoing work to "replace error reporting by BUG_ON()". If that initial
setup fails we cannot patch under STRICT_KERNEL_RWX at all which imo
warrants a BUG_ON(). I am still working on v2 of my RFC which does
return any __patch_instruction() error back to the caller of
patch_instruction() similar to this patch.
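
The shape of it is roughly the following (a sketch, not the exact v2 code;
map_patch()/unmap_patch() handle the temporary-mm switch and the final
verification is elided):

static int do_patch_instruction(unsigned int *addr, unsigned int instr)
{
	struct patch_mapping patch_mapping;
	unsigned int *patch_addr;
	unsigned long flags;
	int err;

	local_irq_save(flags);

	err = map_patch(addr, &patch_mapping);
	if (err)
		goto out;

	patch_addr = (unsigned int *)patching_addr +
		     offset_in_page(addr) / sizeof(instr);

	/* Propagate any failure instead of silently dropping it. */
	err = __patch_instruction(addr, instr, patch_addr);

	unmap_patch(&patch_mapping);

out:
	local_irq_restore(flags);
	return err;
}
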
> 
> -- Steve
>
> 
>
> 



[RFC PATCH v2 0/5] Use per-CPU temporary mappings for patching

2020-04-28 Thread Christopher M. Riedl
When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

1. Map page of text to be patched to per-CPU VM area w/
   PAGE_KERNEL protection
2. Patch text
3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either maliciously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching (or any
other sensitive operations requiring temporarily overriding memory
protections) [0].

x86 introduced "temporary mm" structs which allow the creation of
mappings local to a particular CPU [1]. This series intends to bring the
notion of a temporary mm to powerpc and harden powerpc by using such a
mapping for patching a kernel with strict RWX permissions.

The first patch introduces the temporary mm struct and API for powerpc
along with a new function to retrieve a current hw breakpoint.

The second patch uses the `poking_init` init hook added by the x86
patches to initialize a temporary mm and patching address. The patching
address is randomized between 0 and DEFAULT_MAP_WINDOW-PAGE_SIZE. The
upper limit is necessary due to how the hash MMU operates - by default
the space above DEFAULT_MAP_WINDOW is not available. For now, both hash
and radix randomize inside this range. The number of possible random
addresses is dependent on PAGE_SIZE and limited by DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31

Randomization occurs only once during initialization at boot.

The third patch replaces the VM area with the temporary mm in the
patching code. With the hash MMU the page for patching has to be mapped
PAGE_SHARED, since hash prevents the kernel from accessing userspace pages
that have the _PAGE_PRIVILEGED bit set. On the radix MMU the page is mapped
with PAGE_KERNEL, which has the added benefit that we can skip KUAP.

The fourth and fifth patches implement a "proof-of-concept" LKDTM test which
exploits the previous vulnerability (i.e. the mapping during patching is exposed
in kernel page tables and accessible by other CPUs). The LKDTM test is somewhat
"rough" in that it uses a brute-force approach - I am open to any suggestions
and/or ideas to improve this. Currently, the LKDTM test passes with this series
on POWER8 (hash) and POWER9 (radix, hash) and fails without this series (i.e.
the temporary mapping for patching is exposed to CPUs other than the patching
CPU).

The test can be applied to a tree without this new series by first
adding this in /arch/powerpc/lib/code-patching.c:

@@ -41,6 +41,13 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);

+#ifdef CONFIG_LKDTM
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;

And then applying the last patch of this series which adds the LKDTM test,
(powerpc: Add LKDTM test to hijack a patch mapping).

Tested on QEMU (POWER8, POWER9), POWER8 VM, and a Blackbird (8-core POWER9).

v2: Many fixes and improvements mostly based on extensive feedback and testing
by Christophe Leroy (thanks!).
* Make patching_mm and patching_addr static and move '__ro_after_init'
  to after the variable name (more common in other parts of the kernel)
* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to fix
  PPC64e compile
* Add comment explaining why we use BUG_ON() during the init call to
  setup for patching later
* Move ptep into patch_mapping to avoid walking page tables a second
  time when unmapping the temporary mapping
* Use KUAP under non-radix, also manually dirty the PTE for patch
  mapping on non-BOOK3S_64 platforms
* Properly return any error from __patch_instruction
* Do not use 'memcmp' where a simple comparison is appropriate
* Simplify expression for patch address by removing pointer maths
* Add LKDTM test


[0]: https://github.com/linuxppc/issues/issues/224
[1]: 
https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Christopher M. Riedl (5):
  powerpc/mm: Introduce temporary mm

[RFC PATCH v2 3/5] powerpc/lib: Use a temporary mm for code patching

2020-04-28 Thread Christopher M. Riedl
Currently, code patching a STRICT_KERNEL_RWX kernel exposes the temporary
mappings to other CPUs. These mappings should be kept local to the CPU
doing the patching. Use the pre-initialized temporary mm and patching
address for this purpose. Also add a check after patching to ensure the
patch succeeded.

Use the KUAP functions on non-BOOK3S_64 platforms since the temporary
mapping for patching uses a userspace address (to keep the mapping
local). On BOOK3S_64 platforms hash does not implement KUAP and on radix
the use of PAGE_KERNEL sets EAA[0] for the PTE which means the AMR
(KUAP) protection is ignored (see PowerISA v3.0b, Fig. 35).

Based on x86 implementation:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/lib/code-patching.c | 149 ---
 1 file changed, 55 insertions(+), 94 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 259c19480a85..26f06cdb5d7e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int __patch_instruction(unsigned int *exec_addr, unsigned int instr,
   unsigned int *patch_addr)
@@ -72,101 +73,58 @@ void __init poking_init(void)
pte_unmap_unlock(ptep, ptl);
 }
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
-
-static int text_area_cpu_up(unsigned int cpu)
-{
-   struct vm_struct *area;
-
-   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
-   if (!area) {
-   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
-   cpu);
-   return -1;
-   }
-   this_cpu_write(text_poke_area, area);
-
-   return 0;
-}
-
-static int text_area_cpu_down(unsigned int cpu)
-{
-   free_vm_area(this_cpu_read(text_poke_area));
-   return 0;
-}
-
-/*
- * Run as a late init call. This allows all the boot time patching to be done
- * simply by patching the code, and then we're called here prior to
- * mark_rodata_ro(), which happens after all init calls are run. Although
- * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
- * it as being preferable to a kernel that will crash later when someone tries
- * to use patch_instruction().
- */
-static int __init setup_text_poke_area(void)
-{
-   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-   "powerpc/text_poke:online", text_area_cpu_up,
-   text_area_cpu_down));
-
-   return 0;
-}
-late_initcall(setup_text_poke_area);
+struct patch_mapping {
+   spinlock_t *ptl; /* for protecting pte table */
+   pte_t *ptep;
+   struct temp_mm temp_mm;
+};
 
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
 {
-   unsigned long pfn;
-   int err;
+   struct page *page;
+   pte_t pte;
+   pgprot_t pgprot;
 
if (is_vmalloc_addr(addr))
-   pfn = vmalloc_to_pfn(addr);
+   page = vmalloc_to_page(addr);
else
-   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
+   page = virt_to_page(addr);
 
-   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+   if (radix_enabled())
+   pgprot = PAGE_KERNEL;
+   else
+   pgprot = PAGE_SHARED;
 
-   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
-   if (err)
+   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
+&patch_mapping->ptl);
+   if (unlikely(!patch_mapping->ptep)) {
+   pr_warn("map patch: failed to allocate pte for patching\n");
return -1;
+   }
+
+   pte = mk_pte(page, pgprot);
+   if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+   pte = pte_mkdirty(pte);
+   set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
+
+   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+   use_temporary_mm(&patch_mapping->temp_mm);
 
return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static void unmap_patch(struct patch_mapping *patch_mapping)
 {
-   pte_t *ptep;
-   pmd_t *pmdp;
-   pud_t *pudp;
-   pgd_t *pgdp;
-
-   pgdp = pgd_offset_k(addr);
-   if (unlikely(!pgdp))
-   return -EINVAL;
-
-   pudp = pud_offset(pgdp, addr);
-   if (unlikely(!pudp))
-   return -EINVAL;
-
-   pmdp = pmd_offset(pudp, addr);
-   if (unlikely(!pmdp))
-   return -EINVAL;
-
-   ptep = pte_offset_kernel(pmdp, addr);
-   if (unlikely(!ptep))
-   return -EINV

[RFC PATCH v2 5/5] powerpc: Add LKDTM test to hijack a patch mapping

2020-04-28 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX, the CPU doing the patching
must use a temporary mapping which allows for writing to kernel text.
During the entire window of time when this temporary mapping is in use,
another CPU could write to the same mapping and maliciously alter kernel
text. Implement an LKDTM test to attempt to exploit such an opening while
a CPU is patching under STRICT_KERNEL_RWX. The test is only implemented
on powerpc for now.

The LKDTM "hijack" test works as follows:

1. A CPU executes an infinite loop to patch an instruction.
   This is the "patching" CPU.
2. Another CPU attempts to write to the address of the temporary
   mapping used by the "patching" CPU. This other CPU is the
   "hijacker" CPU. The hijack either fails with a segfault or
   succeeds, in which case some kernel text is now overwritten.

How to run the test:

mount -t debugfs none /sys/kernel/debug
(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/core.c  |  1 +
 drivers/misc/lkdtm/lkdtm.h |  1 +
 drivers/misc/lkdtm/perms.c | 99 ++
 3 files changed, 101 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index a5e344df9166..482e72f6a1e1 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -145,6 +145,7 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
CRASHTYPE(WRITE_KERN),
+   CRASHTYPE(HIJACK_PATCH),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 601a2156a0d4..bfcf3542370d 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* lkdtm_refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 62f76d506f04..547ce16e03e5 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -213,6 +214,104 @@ void lkdtm_ACCESS_NULL(void)
*ptr = tmp;
 }
 
+#if defined(CONFIG_PPC) && defined(CONFIG_STRICT_KERNEL_RWX)
+#include 
+
+extern unsigned long read_cpu_patching_addr(unsigned int cpu);
+
+static unsigned int * const patch_site = (unsigned int * const)&do_nothing;
+
+static int lkdtm_patching_cpu(void *data)
+{
+   int err = 0;
+
+   pr_info("starting patching_cpu=%d\n", smp_processor_id());
+   do {
+   err = patch_instruction(patch_site, 0xdeadbeef);
+   } while (*READ_ONCE(patch_site) == 0xdeadbeef &&
+   !err && !kthread_should_stop());
+
+   if (err)
+   pr_warn("patch_instruction returned error: %d\n", err);
+
+   set_current_state(TASK_INTERRUPTIBLE);
+   while (!kthread_should_stop()) {
+   schedule();
+   set_current_state(TASK_INTERRUPTIBLE);
+   }
+
+   return err;
+}
+
+void lkdtm_HIJACK_PATCH(void)
+{
+   struct task_struct *patching_kthrd;
+   int patching_cpu, hijacker_cpu, original_insn, attempts;
+   unsigned long addr;
+   bool hijacked;
+
+   if (num_online_cpus() < 2) {
+   pr_warn("need at least two cpus\n");
+   return;
+   }
+
+   original_insn = *READ_ONCE(patch_site);
+
+   hijacker_cpu = smp_processor_id();
+   patching_cpu = cpumask_any_but(cpu_online_mask, hijacker_cpu);
+
+   patching_kthrd = kthread_create_on_node(&lkdtm_patching_cpu, NULL,
+   cpu_to_node(patching_cpu),
+   "lkdtm_patching_cpu");
+   kthread_bind(patching_kthrd, patching_cpu);
+   wake_up_process(patching_kthrd);
+
+   addr = offset_in_page(patch_site) | 
read_cpu_patching_addr(patching_cpu);
+
+   pr_info("starting hijacker_cpu=%d\n", hijacker_cpu);
+   for (attempts = 0; attempts < 10; ++attempts) {
+   /* Use __put_user to catch faults without an Oops */
+   hijacked = !__put_user(0xbad00bad, (unsigned int *)addr);
+
+   if (hijacked) {
+   if (kthread_stop(patching_kthrd))
+   goto out;
+   break;
+   }
+   }
+   pr_info("hijack attempts: %d\n", attempts)

[RFC PATCH v2 2/5] powerpc/lib: Initialize a temporary mm for code patching

2020-04-28 Thread Christopher M. Riedl
When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped with permissive memory
protections. Currently, a per-cpu vmalloc patch area is used for this
purpose. While the patch area is per-cpu, the temporary page mapping is
inserted into the kernel page tables for the duration of the patching.
The mapping is exposed to CPUs other than the patching CPU - this is
undesirable from a hardening perspective.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
portion. The next patch uses the temporary mm and patching address for
code patching.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/lib/code-patching.c | 33 
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 3345f039a876..259c19480a85 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -39,6 +41,37 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+static struct mm_struct *patching_mm __ro_after_init;
+static unsigned long patching_addr __ro_after_init;
+
+void __init poking_init(void)
+{
+   spinlock_t *ptl; /* for protecting pte table */
+   pte_t *ptep;
+
+   /*
+* Some parts of the kernel (static keys for example) depend on
+* successful code patching. Code patching under STRICT_KERNEL_RWX
+* requires this setup - otherwise we cannot patch at all. We use
+* BUG_ON() here and later since an early failure is preferred to
+* buggy behavior and/or strange crashes later.
+*/
+   patching_mm = copy_init_mm();
+   BUG_ON(!patching_mm);
+
+   /*
+* In hash we cannot go above DEFAULT_MAP_WINDOW easily.
+* XXX: Do we want additional bits of entropy for radix?
+*/
+   patching_addr = (get_random_long() & PAGE_MASK) %
+   (DEFAULT_MAP_WINDOW - PAGE_SIZE);
+
+   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+   BUG_ON(!ptep);
+   pte_unmap_unlock(ptep, ptl);
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 static int text_area_cpu_up(unsigned int cpu)
-- 
2.26.1



[RFC PATCH v2 4/5] powerpc/lib: Add LKDTM accessor for patching addr

2020-04-28 Thread Christopher M. Riedl
When live patching a STRICT_RWX kernel, a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/lib/code-patching.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 26f06cdb5d7e..cfbdef90384e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -46,6 +46,13 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 static struct mm_struct *patching_mm __ro_after_init;
 static unsigned long patching_addr __ro_after_init;
 
+#ifdef CONFIG_LKDTM
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return patching_addr;
+}
+#endif
+
 void __init poking_init(void)
 {
spinlock_t *ptl; /* for protecting pte table */
-- 
2.26.1



[RFC PATCH v2 1/5] powerpc/mm: Introduce temporary mm

2020-04-28 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. A side benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/debug.h   |  1 +
 arch/powerpc/include/asm/mmu_context.h | 54 ++
 arch/powerpc/kernel/process.c  |  5 +++
 3 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 7756026b95ca..b945bc16c932 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs *regs) 
{ return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
+void __get_breakpoint(struct arch_hw_breakpoint *brk);
 void __set_breakpoint(struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 360367c579de..57a8695fe63f 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -10,6 +10,7 @@
 #include
 #include 
 #include 
+#include 
 
 /*
  * Most if the context management is out of line
@@ -270,5 +271,58 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   bool is_kernel_thread;
+   struct arch_hw_breakpoint brk;
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   temp_mm->is_kernel_thread = false;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->is_kernel_thread = current->mm == NULL;
+   if (temp_mm->is_kernel_thread)
+   temp_mm->prev = current->active_mm;
+   else
+   temp_mm->prev = current->mm;
+
+   /*
+* Hash requires a non-NULL current->mm to allocate a userspace address
+* when handling a page fault. Does not appear to hurt in Radix either.
+*/
+   current->mm = temp_mm->temp;
+   switch_mm_irqs_off(NULL, temp_mm->temp, current);
+
+   if (ppc_breakpoint_available()) {
+   __get_breakpoint(&temp_mm->brk);
+   if (temp_mm->brk.type != 0)
+   hw_breakpoint_disable();
+   }
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   if (temp_mm->is_kernel_thread)
+   current->mm = NULL;
+   else
+   current->mm = temp_mm->prev;
+   switch_mm_irqs_off(NULL, temp_mm->prev, current);
+
+   if (ppc_breakpoint_available() && temp_mm->brk.type != 0)
+   __set_breakpoint(&temp_mm->brk);
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9c21288f8645..ec4cf890d92c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -800,6 +800,11 @@ static inline int set_breakpoint_8xx(struct 
arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk), sizeof(*brk));
+}
+
 void __set_breakpoint(struct arch_hw_breakpoint *brk)
 {
memcpy(this_cpu_ptr(&current_brk), brk, sizeof(*brk));
-- 
2.26.1



Re: [RFC PATCH v2 3/5] powerpc/lib: Use a temporary mm for code patching

2020-05-01 Thread Christopher M. Riedl
On Wed Apr 29, 2020 at 7:52 AM, Christophe Leroy wrote:
>
> 
>
> 
> Le 29/04/2020 à 04:05, Christopher M. Riedl a écrit :
> > Currently, code patching a STRICT_KERNEL_RWX kernel exposes the temporary
> > mappings to other CPUs. These mappings should be kept local to the CPU
> > doing the patching. Use the pre-initialized temporary mm and patching
> > address for this purpose. Also add a check after patching to ensure the
> > patch succeeded.
> > 
> > Use the KUAP functions on non-BOOK3S_64 platforms since the temporary
> > mapping for patching uses a userspace address (to keep the mapping
> > local). On BOOK3S_64 platforms hash does not implement KUAP and on radix
> > the use of PAGE_KERNEL sets EAA[0] for the PTE which means the AMR
> > (KUAP) protection is ignored (see PowerISA v3.0b, Fig. 35).
> > 
> > Based on x86 implementation:
> > 
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/lib/code-patching.c | 149 ---
> >   1 file changed, 55 insertions(+), 94 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 259c19480a85..26f06cdb5d7e 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -19,6 +19,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   
> >   static int __patch_instruction(unsigned int *exec_addr, unsigned int 
> > instr,
> >unsigned int *patch_addr)
> > @@ -72,101 +73,58 @@ void __init poking_init(void)
> > pte_unmap_unlock(ptep, ptl);
> >   }
> >   
> > -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > -
> > -static int text_area_cpu_up(unsigned int cpu)
> > -{
> > -   struct vm_struct *area;
> > -
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > -
> > -   return 0;
> > -}
> > -
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > -}
> > -
> > -/*
> > - * Run as a late init call. This allows all the boot time patching to be 
> > done
> > - * simply by patching the code, and then we're called here prior to
> > - * mark_rodata_ro(), which happens after all init calls are run. Although
> > - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we 
> > judge
> > - * it as being preferable to a kernel that will crash later when someone 
> > tries
> > - * to use patch_instruction().
> > - */
> > -static int __init setup_text_poke_area(void)
> > -{
> > -   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > -   "powerpc/text_poke:online", text_area_cpu_up,
> > -   text_area_cpu_down));
> > -
> > -   return 0;
> > -}
> > -late_initcall(setup_text_poke_area);
> > +struct patch_mapping {
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   pte_t *ptep;
> > +   struct temp_mm temp_mm;
> > +};
> >   
> >   /*
> >* This can be called for kernel text or a module.
> >*/
> > -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> > +static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
> >   {
> > -   unsigned long pfn;
> > -   int err;
> > +   struct page *page;
> > +   pte_t pte;
> > +   pgprot_t pgprot;
> >   
> > if (is_vmalloc_addr(addr))
> > -   pfn = vmalloc_to_pfn(addr);
> > +   page = vmalloc_to_page(addr);
> > else
> > -   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> > +   page = virt_to_page(addr);
> >   
> > -   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
> > +   if (radix_enabled())
> > +   pgprot = PAGE_KERNEL;
> > +   else
> > +   pgprot = PAGE_SHARED;
> >   
> > -   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
> > -   if (err)
> > +   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +

Re: [RFC PATCH v2 1/5] powerpc/mm: Introduce temporary mm

2020-05-01 Thread Christopher M. Riedl
On Wed Apr 29, 2020 at 7:39 AM, Christophe Leroy wrote:
>
> 
>
> 
> Le 29/04/2020 à 04:05, Christopher M. Riedl a écrit :
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. A side benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> > 
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> > 
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use.
> > 
> > Based on x86 implementation:
> > 
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/include/asm/debug.h   |  1 +
> >   arch/powerpc/include/asm/mmu_context.h | 54 ++
> >   arch/powerpc/kernel/process.c  |  5 +++
> >   3 files changed, 60 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/debug.h 
> > b/arch/powerpc/include/asm/debug.h
> > index 7756026b95ca..b945bc16c932 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs 
> > *regs) { return 0; }
> >   static inline int debugger_fault_handler(struct pt_regs *regs) { return 
> > 0; }
> >   #endif
> >   
> > +void __get_breakpoint(struct arch_hw_breakpoint *brk);
> >   void __set_breakpoint(struct arch_hw_breakpoint *brk);
> >   bool ppc_breakpoint_available(void);
> >   #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> > b/arch/powerpc/include/asm/mmu_context.h
> > index 360367c579de..57a8695fe63f 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -10,6 +10,7 @@
> >   #include   
> >   #include 
> >   #include 
> > +#include 
> >   
> >   /*
> >* Most if the context management is out of line
> > @@ -270,5 +271,58 @@ static inline int arch_dup_mmap(struct mm_struct 
> > *oldmm,
> > return 0;
> >   }
> >   
> > +struct temp_mm {
> > +   struct mm_struct *temp;
> > +   struct mm_struct *prev;
> > +   bool is_kernel_thread;
> > +   struct arch_hw_breakpoint brk;
> > +};
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct 
> > *mm)
> > +{
> > +   temp_mm->temp = mm;
> > +   temp_mm->prev = NULL;
> > +   temp_mm->is_kernel_thread = false;
> > +   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   temp_mm->is_kernel_thread = current->mm == NULL;
> > +   if (temp_mm->is_kernel_thread)
> > +   temp_mm->prev = current->active_mm;
> > +   else
> > +   temp_mm->prev = current->mm;
> > +
> > +   /*
> > +* Hash requires a non-NULL current->mm to allocate a userspace address
> > +* when handling a page fault. Does not appear to hurt in Radix either.
> > +*/
> > +   current->mm = temp_mm->temp;
> > +   switch_mm_irqs_off(NULL, temp_mm->temp, current);
> > +
> > +   if (ppc_breakpoint_available()) {
> > +   __get_breakpoint(&temp_mm->brk);
> > +   if (temp_mm->brk.type != 0)
> > +   hw_breakpoint_disable();
> > +   }
> > +}
> > +
> > +static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
>
> 
> Not sure "unuse" is a best naming, allthought I don't have a better
> suggestion a the moment. If not using temporary_mm anymore, what are we
> using now ?
>
> 

I'm not too fond of 'unuse' either, but it's what x86 uses and I
couldn't come up with anything better on the spot

Re: [RFC PATCH v2 1/5] powerpc/mm: Introduce temporary mm

2020-05-01 Thread Christopher M. Riedl
On Wed Apr 29, 2020 at 7:48 AM, Christophe Leroy wrote:
>
> 
>
> 
> Le 29/04/2020 à 04:05, Christopher M. Riedl a écrit :
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. A side benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> > 
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> > 
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use.
>
> 
> Why do we need to use a temporary mm all the time ?
>

Not sure I understand, the temporary mm is only in use for kernel
patching in this series. We could have other uses in the future maybe
where it's beneficial to keep mappings local.

> 
> Doesn't each CPU have its own mm already ? Only the upper address space
> is shared between all mm's but each mm has its own lower address space,
> at least when it is running a user process. Why not just use that mm ?
> As we are mapping then unmapping with interrupts disabled, there is no
> risk at all that the user starts running while the patch page is mapped,
> so I'm not sure why switching to a temporary mm is needed.
>
> 

I suppose that's an option, but then we have to save and restore the
mapping which we temporarily "steal" from userspace. I admit I didn't
consider that as an option when I started this series based on the x86
patches. I think it's cleaner to switch mm, but that's a rather weak
argument. Are you concerned about performance with the temporary mm?

>
> 
> > 
> > Based on x86 implementation:
> > 
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> > 
> > Signed-off-by: Christopher M. Riedl 
>
> 
> Christophe
>
> 
>
> 



[PATCH 2/3] powerpc/spinlocks: Rename SPLPAR-only spinlocks

2019-07-28 Thread Christopher M. Riedl
The __rw_yield and __spin_yield functions only pertain to SPLPAR mode.
Rename them to make this relationship obvious.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/spinlock.h | 6 --
 arch/powerpc/lib/locks.c| 6 +++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 8631b0b4e109..1e7721176f39 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,8 +101,10 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#define __spin_yield(x) splpar_spin_yield(x)
+#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6550b9e5ce5f..6440d5943c00 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-void __spin_yield(arch_spinlock_t *lock)
+void splpar_spin_yield(arch_spinlock_t *lock)
 {
unsigned int lock_value, holder_cpu, yield_count;
 
@@ -36,14 +36,14 @@ void __spin_yield(arch_spinlock_t *lock)
plpar_hcall_norets(H_CONFER,
get_hard_smp_processor_id(holder_cpu), yield_count);
 }
-EXPORT_SYMBOL_GPL(__spin_yield);
+EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
 /*
  * Waiting for a read lock or a write lock on a rwlock...
  * This turns out to be the same for read and write locks, since
  * we only know the holder if it is write-locked.
  */
-void __rw_yield(arch_rwlock_t *rw)
+void splpar_rw_yield(arch_rwlock_t *rw)
 {
int lock_value;
unsigned int holder_cpu, yield_count;
-- 
2.22.0



[PATCH 0/3] Fix oops in shared-processor spinlocks

2019-07-28 Thread Christopher M. Riedl
Fixes an oops when calling the shared-processor spinlock implementation
from a non-SP LPAR. Also take this opportunity to refactor
SHARED_PROCESSOR a bit.

Reference:  https://github.com/linuxppc/issues/issues/229

Christopher M. Riedl (3):
  powerpc/spinlocks: Refactor SHARED_PROCESSOR
  powerpc/spinlocks: Rename SPLPAR-only spinlocks
  powerpc/spinlocks: Fix oops in shared-processor spinlocks

 arch/powerpc/include/asm/spinlock.h | 59 -
 arch/powerpc/lib/locks.c|  6 +--
 2 files changed, 45 insertions(+), 20 deletions(-)

-- 
2.22.0



[PATCH 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-07-28 Thread Christopher M. Riedl
Determining if a processor is in shared processor mode is not a constant
so don't hide it behind a #define.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/spinlock.h | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index a47f827bc5f1..8631b0b4e109 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,15 +101,24 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
 extern void __spin_yield(arch_spinlock_t *lock);
 extern void __rw_yield(arch_rwlock_t *lock);
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
-#define SHARED_PROCESSOR   0
 #endif
 
+static inline bool is_shared_processor(void)
+{
+/* Only server processors have an lppaca struct */
+#ifdef CONFIG_PPC_BOOK3S
+   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
+   lppaca_shared_proc(local_paca->lppaca_ptr));
+#else
+   return false;
+#endif
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -117,7 +126,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -136,7 +145,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_irq_restore(flags);
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -226,7 +235,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock < 0));
HMT_medium();
@@ -240,7 +249,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock != 0));
HMT_medium();
-- 
2.22.0



[PATCH 3/3] powerpc/spinlock: Fix oops in shared-processor spinlocks

2019-07-28 Thread Christopher M. Riedl
Booting w/ ppc64le_defconfig + CONFIG_PREEMPT results in the attached
kernel trace due to calling shared-processor spinlocks while not running
in an SPLPAR. Previously, the out-of-line spinlocks implementations were
selected based on CONFIG_PPC_SPLPAR at compile time without a runtime
shared-processor LPAR check.

To fix, call the actual spinlock implementations from a set of common
functions, spin_yield() and rw_yield(), which check for shared-processor
LPAR during runtime and select the appropriate lock implementation.

[0.430878] BUG: Kernel NULL pointer dereference at 0x0100
[0.431991] Faulting instruction address: 0xc0097f88
[0.432934] Oops: Kernel access of bad area, sig: 7 [#1]
[0.433448] LE PAGE_SIZE=64K MMU=Radix MMU=Hash PREEMPT SMP NR_CPUS=2048 
NUMA PowerNV
[0.434479] Modules linked in:
[0.435055] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
5.2.0-rc6-00491-g249155c20f9b #28
[0.435730] NIP:  c0097f88 LR: c0c07a88 CTR: c015ca10
[0.436383] REGS: c000727079f0 TRAP: 0300   Not tainted  
(5.2.0-rc6-00491-g249155c20f9b)
[0.437004] MSR:  92009033   CR: 
84000424  XER: 2004
[0.437874] CFAR: c0c07a84 DAR: 0100 DSISR: 0008 
IRQMASK: 1
[0.437874] GPR00: c0c07a88 c00072707c80 c1546300 
c0007be38a80
[0.437874] GPR04: c000726f0c00 0002 c0007279c980 
0100
[0.437874] GPR08: c1581b78 8001 0008 
c0007279c9b0
[0.437874] GPR12:  c173 c0142558 

[0.437874] GPR16:    

[0.437874] GPR20:    

[0.437874] GPR24: c0007be38a80 c0c002f4  

[0.437874] GPR28: c00072221a00 c000726c2600 c0007be38a80 
c0007be38a80
[0.443992] NIP [c0097f88] __spin_yield+0x48/0xa0
[0.444523] LR [c0c07a88] __raw_spin_lock+0xb8/0xc0
[0.445080] Call Trace:
[0.445670] [c00072707c80] [c00072221a00] 0xc00072221a00 
(unreliable)
[0.446425] [c00072707cb0] [c0bffb0c] __schedule+0xbc/0x850
[0.447078] [c00072707d70] [c0c002f4] schedule+0x54/0x130
[0.447694] [c00072707da0] [c01427dc] kthreadd+0x28c/0x2b0
[0.448389] [c00072707e20] [c000c1cc] 
ret_from_kernel_thread+0x5c/0x70
[0.449143] Instruction dump:
[0.449821] 4d9e0020 552a043e 210a07ff 79080fe0 0b08 3d020004 3908b878 
794a1f24
[0.450587] e8e8 7ce7502a e8e7 38e70100 <7ca03c2c> 70a70001 78a50020 
4d820020
[0.452808] ---[ end trace 474d6b2b8fc5cb7e ]---

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/spinlock.h | 36 -
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 1e7721176f39..8161809c6be1 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -103,11 +103,9 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 /* We only yield to the hypervisor if we are in shared processor mode */
 void splpar_spin_yield(arch_spinlock_t *lock);
 void splpar_rw_yield(arch_rwlock_t *lock);
-#define __spin_yield(x) splpar_spin_yield(x)
-#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
-#define __spin_yield(x)barrier()
-#define __rw_yield(x)  barrier()
+#define splpar_spin_yield(lock)
+#define splpar_rw_yield(lock)
 #endif
 
 static inline bool is_shared_processor(void)
@@ -121,6 +119,22 @@ static inline bool is_shared_processor(void)
 #endif
 }
 
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   else
+   barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_rw_yield(lock);
+   else
+   barrier();
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -129,7 +143,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
}
@@ -148,7 +162,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   spin_yield(lock);
} while (unlikely(lock->slock != 0));

Re: [PATCH 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-07-30 Thread Christopher M Riedl
> On July 30, 2019 at 4:31 PM Thiago Jung Bauermann  
> wrote:
> 
> 
> 
> Christopher M. Riedl  writes:
> 
> > Determining if a processor is in shared processor mode is not a constant
> > so don't hide it behind a #define.
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/include/asm/spinlock.h | 21 +++--
> >  1 file changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> > b/arch/powerpc/include/asm/spinlock.h
> > index a47f827bc5f1..8631b0b4e109 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -101,15 +101,24 @@ static inline int arch_spin_trylock(arch_spinlock_t 
> > *lock)
> >
> >  #if defined(CONFIG_PPC_SPLPAR)
> >  /* We only yield to the hypervisor if we are in shared processor mode */
> > -#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
> >  extern void __spin_yield(arch_spinlock_t *lock);
> >  extern void __rw_yield(arch_rwlock_t *lock);
> >  #else /* SPLPAR */
> >  #define __spin_yield(x)barrier()
> >  #define __rw_yield(x)  barrier()
> > -#define SHARED_PROCESSOR   0
> >  #endif
> >
> > +static inline bool is_shared_processor(void)
> > +{
> > +/* Only server processors have an lppaca struct */
> > +#ifdef CONFIG_PPC_BOOK3S
> > +   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> > +   lppaca_shared_proc(local_paca->lppaca_ptr));
> > +#else
> > +   return false;
> > +#endif
> > +}
> > +
> 
> CONFIG_PPC_SPLPAR depends on CONFIG_PPC_PSERIES, which depends on
> CONFIG_PPC_BOOK3S so the #ifdef above is unnecessary:
> 
> if CONFIG_PPC_BOOK3S is unset then CONFIG_PPC_SPLPAR will be unset as
> well and the return expression should short-circuit to false.
> 

Agreed, but the #ifdef is necessary to compile platforms which include
this header but do not implement lppaca_shared_proc(...) and friends.
I can reword the comment if that helps.

> --
> Thiago Jung Bauermann
> IBM Linux Technology Center


Re: [PATCH 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-07-30 Thread Christopher M Riedl


> On July 30, 2019 at 7:11 PM Thiago Jung Bauermann  
> wrote:
> 
> 
> 
> Christopher M Riedl  writes:
> 
> >> On July 30, 2019 at 4:31 PM Thiago Jung Bauermann  
> >> wrote:
> >>
> >>
> >>
> >> Christopher M. Riedl  writes:
> >>
> >> > Determining if a processor is in shared processor mode is not a constant
> >> > so don't hide it behind a #define.
> >> >
> >> > Signed-off-by: Christopher M. Riedl 
> >> > ---
> >> >  arch/powerpc/include/asm/spinlock.h | 21 +++--
> >> >  1 file changed, 15 insertions(+), 6 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> >> > b/arch/powerpc/include/asm/spinlock.h
> >> > index a47f827bc5f1..8631b0b4e109 100644
> >> > --- a/arch/powerpc/include/asm/spinlock.h
> >> > +++ b/arch/powerpc/include/asm/spinlock.h
> >> > @@ -101,15 +101,24 @@ static inline int 
> >> > arch_spin_trylock(arch_spinlock_t *lock)
> >> >
> >> >  #if defined(CONFIG_PPC_SPLPAR)
> >> >  /* We only yield to the hypervisor if we are in shared processor mode */
> >> > -#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
> >> >  extern void __spin_yield(arch_spinlock_t *lock);
> >> >  extern void __rw_yield(arch_rwlock_t *lock);
> >> >  #else /* SPLPAR */
> >> >  #define __spin_yield(x) barrier()
> >> >  #define __rw_yield(x)   barrier()
> >> > -#define SHARED_PROCESSOR0
> >> >  #endif
> >> >
> >> > +static inline bool is_shared_processor(void)
> >> > +{
> >> > +/* Only server processors have an lppaca struct */
> >> > +#ifdef CONFIG_PPC_BOOK3S
> >> > +return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> >> > +lppaca_shared_proc(local_paca->lppaca_ptr));
> >> > +#else
> >> > +return false;
> >> > +#endif
> >> > +}
> >> > +
> >>
> >> CONFIG_PPC_SPLPAR depends on CONFIG_PPC_PSERIES, which depends on
> >> CONFIG_PPC_BOOK3S so the #ifdef above is unnecessary:
> >>
> >> if CONFIG_PPC_BOOK3S is unset then CONFIG_PPC_SPLPAR will be unset as
> >> well and the return expression should short-circuit to false.
> >>
> >
> > Agreed, but the #ifdef is necessary to compile platforms which include
> > this header but do not implement lppaca_shared_proc(...) and friends.
> > I can reword the comment if that helps.
> 
> Ah, indeed. Yes, if you could mention that in the commit I think it
> would help. These #ifdefs are becoming démodé so it's good to know why
> they're there.
> 
> Another alternative is to provide a dummy lppaca_shared_proc() which
> always returns false when CONFIG_PPC_BOOK3S isn't set (just mentioning
> it, I don't have a preference).
> 

Yeah, I tried that first, but the declaration and definition of
lppaca_shared_proc() and its arguments are nested within several includes
and arch/platform #ifdefs, so I decided the #ifdef in is_shared_processor()
is simpler. I am not sure if unraveling all that makes sense for
implementing this fix, but maybe someone can convince me hah.
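
For reference, the stub alternative mentioned above would look roughly like
the sketch below. It is only a sketch -- part of the complication is that
the paca/lppaca types and the local_paca->lppaca_ptr field used by
is_shared_processor() are themselves hidden behind the same arch/platform
#ifdefs, so a stub for lppaca_shared_proc() alone would not be enough:

/* Sketch of the "dummy helper" alternative -- not what this series does. */
#ifdef CONFIG_PPC_BOOK3S
/* the real lppaca_shared_proc() comes from the lppaca/paca headers */
#else
struct lppaca;  /* an opaque forward declaration suffices for a pointer */

static inline bool lppaca_shared_proc(struct lppaca *l)
{
        return false;
}
#endif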

In any case, next version will have an improved commit message and comment.

> --
> Thiago Jung Bauermann
> IBM Linux Technology Center


[PATCH v2 0/3] Fix oops in shared-processor spinlocks

2019-08-01 Thread Christopher M. Riedl
Fixes an oops when calling the shared-processor spinlock implementation
from a non-SP LPAR. Also take this opportunity to refactor
SHARED_PROCESSOR a bit.

Reference:  https://github.com/linuxppc/issues/issues/229

Changes since v1:
 - Improve comment wording to make it clear why the BOOK3S #ifdef is
   required in is_shared_processor() in spinlock.h
 - Replace empty #define of splpar_*_yield() with actual functions with
   empty bodies.

Christopher M. Riedl (3):
  powerpc/spinlocks: Refactor SHARED_PROCESSOR
  powerpc/spinlocks: Rename SPLPAR-only spinlocks
  powerpc/spinlocks: Fix oops in shared-processor spinlocks

 arch/powerpc/include/asm/spinlock.h | 62 +
 arch/powerpc/lib/locks.c|  6 +--
 2 files changed, 48 insertions(+), 20 deletions(-)

-- 
2.22.0



[PATCH v2 2/3] powerpc/spinlocks: Rename SPLPAR-only spinlocks

2019-08-01 Thread Christopher M. Riedl
The __rw_yield and __spin_yield functions only pertain to SPLPAR mode.
Rename them to make this relationship obvious.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 6 --
 arch/powerpc/lib/locks.c| 6 +++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index dc5fcea1f006..0a8270183770 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,8 +101,10 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#define __spin_yield(x) splpar_spin_yield(x)
+#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6550b9e5ce5f..6440d5943c00 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-void __spin_yield(arch_spinlock_t *lock)
+void splpar_spin_yield(arch_spinlock_t *lock)
 {
unsigned int lock_value, holder_cpu, yield_count;
 
@@ -36,14 +36,14 @@ void __spin_yield(arch_spinlock_t *lock)
plpar_hcall_norets(H_CONFER,
get_hard_smp_processor_id(holder_cpu), yield_count);
 }
-EXPORT_SYMBOL_GPL(__spin_yield);
+EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
 /*
  * Waiting for a read lock or a write lock on a rwlock...
  * This turns out to be the same for read and write locks, since
  * we only know the holder if it is write-locked.
  */
-void __rw_yield(arch_rwlock_t *rw)
+void splpar_rw_yield(arch_rwlock_t *rw)
 {
int lock_value;
unsigned int holder_cpu, yield_count;
-- 
2.22.0



[PATCH v2 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-08-01 Thread Christopher M. Riedl
Determining if a processor is in shared processor mode is not a constant
so don't hide it behind a #define.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index a47f827bc5f1..dc5fcea1f006 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,15 +101,27 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
 extern void __spin_yield(arch_spinlock_t *lock);
 extern void __rw_yield(arch_rwlock_t *lock);
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
-#define SHARED_PROCESSOR   0
 #endif
 
+static inline bool is_shared_processor(void)
+{
+/*
+ * LPPACA is only available on BOOK3S so guard anything LPPACA related to
+ * allow other platforms (which include this common header) to compile.
+ */
+#ifdef CONFIG_PPC_BOOK3S
+   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
+   lppaca_shared_proc(local_paca->lppaca_ptr));
+#else
+   return false;
+#endif
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -117,7 +129,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -136,7 +148,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_irq_restore(flags);
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -226,7 +238,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock < 0));
HMT_medium();
@@ -240,7 +252,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock != 0));
HMT_medium();
-- 
2.22.0



[PATCH v2 3/3] powerpc/spinlocks: Fix oops in shared-processor spinlocks

2019-08-01 Thread Christopher M. Riedl
Booting w/ ppc64le_defconfig + CONFIG_PREEMPT results in the attached
kernel trace due to calling shared-processor spinlocks while not running
in an SPLPAR. Previously, the out-of-line spinlocks implementations were
selected based on CONFIG_PPC_SPLPAR at compile time without a runtime
shared-processor LPAR check.

To fix, call the actual spinlock implementations from a set of common
functions, spin_yield() and rw_yield(), which check for shared-processor
LPAR during runtime and select the appropriate lock implementation.

[0.430878] BUG: Kernel NULL pointer dereference at 0x0100
[0.431991] Faulting instruction address: 0xc0097f88
[0.432934] Oops: Kernel access of bad area, sig: 7 [#1]
[0.433448] LE PAGE_SIZE=64K MMU=Radix MMU=Hash PREEMPT SMP NR_CPUS=2048 
NUMA PowerNV
[0.434479] Modules linked in:
[0.435055] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
5.2.0-rc6-00491-g249155c20f9b #28
[0.435730] NIP:  c0097f88 LR: c0c07a88 CTR: c015ca10
[0.436383] REGS: c000727079f0 TRAP: 0300   Not tainted  
(5.2.0-rc6-00491-g249155c20f9b)
[0.437004] MSR:  92009033   CR: 
84000424  XER: 2004
[0.437874] CFAR: c0c07a84 DAR: 0100 DSISR: 0008 
IRQMASK: 1
[0.437874] GPR00: c0c07a88 c00072707c80 c1546300 
c0007be38a80
[0.437874] GPR04: c000726f0c00 0002 c0007279c980 
0100
[0.437874] GPR08: c1581b78 8001 0008 
c0007279c9b0
[0.437874] GPR12:  c173 c0142558 

[0.437874] GPR16:    

[0.437874] GPR20:    

[0.437874] GPR24: c0007be38a80 c0c002f4  

[0.437874] GPR28: c00072221a00 c000726c2600 c0007be38a80 
c0007be38a80
[0.443992] NIP [c0097f88] __spin_yield+0x48/0xa0
[0.444523] LR [c0c07a88] __raw_spin_lock+0xb8/0xc0
[0.445080] Call Trace:
[0.445670] [c00072707c80] [c00072221a00] 0xc00072221a00 
(unreliable)
[0.446425] [c00072707cb0] [c0bffb0c] __schedule+0xbc/0x850
[0.447078] [c00072707d70] [c0c002f4] schedule+0x54/0x130
[0.447694] [c00072707da0] [c01427dc] kthreadd+0x28c/0x2b0
[0.448389] [c00072707e20] [c000c1cc] 
ret_from_kernel_thread+0x5c/0x70
[0.449143] Instruction dump:
[0.449821] 4d9e0020 552a043e 210a07ff 79080fe0 0b08 3d020004 3908b878 
794a1f24
[0.450587] e8e8 7ce7502a e8e7 38e70100 <7ca03c2c> 70a70001 78a50020 
4d820020
[0.452808] ---[ end trace 474d6b2b8fc5cb7e ]---

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/spinlock.h | 36 -
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 0a8270183770..6aed8a83b180 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -103,11 +103,9 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 /* We only yield to the hypervisor if we are in shared processor mode */
 void splpar_spin_yield(arch_spinlock_t *lock);
 void splpar_rw_yield(arch_rwlock_t *lock);
-#define __spin_yield(x) splpar_spin_yield(x)
-#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
-#define __spin_yield(x)barrier()
-#define __rw_yield(x)  barrier()
+static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
+static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
 static inline bool is_shared_processor(void)
@@ -124,6 +122,22 @@ static inline bool is_shared_processor(void)
 #endif
 }
 
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   else
+   barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_rw_yield(lock);
+   else
+   barrier();
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -132,7 +146,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
}
@@ -151,7 +165,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   spin_yield(lock);

Re: [PATCH v2 3/3] powerpc/spinlocks: Fix oops in shared-processor spinlocks

2019-08-02 Thread Christopher M Riedl


> On August 2, 2019 at 6:38 AM Michael Ellerman  wrote:
> 
> 
> "Christopher M. Riedl"  writes:
> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> > b/arch/powerpc/include/asm/spinlock.h
> > index 0a8270183770..6aed8a83b180 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -124,6 +122,22 @@ static inline bool is_shared_processor(void)
> >  #endif
> >  }
> >  
> > +static inline void spin_yield(arch_spinlock_t *lock)
> > +{
> > +   if (is_shared_processor())
> > +   splpar_spin_yield(lock);
> > +   else
> > +   barrier();
> > +}
> ...
> >  static inline void arch_spin_lock(arch_spinlock_t *lock)
> >  {
> > while (1) {
> > @@ -132,7 +146,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> > do {
> > HMT_low();
> > if (is_shared_processor())
> > -   __spin_yield(lock);
> > +   spin_yield(lock);
> 
> This leaves us with a double test of is_shared_processor() doesn't it?

Yep, and that's no good. Hmm, executing the barrier() in the 
non-shared-processor
case probably hurts performance here?
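
To spell out the double test: with the v2 change, inlining spin_yield()
inside the existing is_shared_processor() guard gives roughly the shape
below (an illustration of the resulting logic, not a proposed change):

/* arch_spin_lock() inner loop with spin_yield() inlined (illustration): */
do {
        HMT_low();
        if (is_shared_processor()) {            /* outer check already in the loop */
                if (is_shared_processor())      /* redundant inner check from spin_yield() */
                        splpar_spin_yield(lock);
                else
                        barrier();              /* effectively dead when the outer check is true */
        }
} while (unlikely(lock->slock != 0));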


[RFC PATCH v3] powerpc/xmon: Restrict when kernel is locked down

2019-08-03 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Xmon checks the lockdown
state and takes appropriate action:

 (1) during xmon_setup to prevent early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent re-entry into xmon

Suggested-by: Andrew Donnellan 
Signed-off-by: Christopher M. Riedl 
---
Changes since v1:
 - Rebased onto v36 of https://patchwork.kernel.org/cover/11049461/
   (based on: f632a8170a6b667ee4e3f552087588f0fe13c4bb)
 - Do not clear existing breakpoints when transitioning from
   lockdown=none to lockdown=integrity
 - Remove line continuation and dangling quote (confuses checkpatch.pl)
   from the xmon command help/usage string

 arch/powerpc/xmon/xmon.c | 59 ++--
 include/linux/security.h |  2 ++
 security/lockdown/lockdown.c |  2 ++
 3 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d0620d762a5a..1a5e43d664ca 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -187,6 +188,9 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+static void xmon_init(int);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -283,10 +287,41 @@ Commands:\n\
 "  U   show uptime information\n"
 "  ?   help\n"
 "  # n limit output to n lines per page (for dp, dpa, dl)\n"
-"  zr  reboot\n\
-  zh   halt\n"
+"  zr  reboot\n"
+"  zh  halt\n"
 ;
 
+#ifdef CONFIG_SECURITY
+static bool xmon_is_locked_down(void)
+{
+   static bool lockdown;
+
+   if (!lockdown) {
+   lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+   if (lockdown) {
+   printf("xmon: Disabled due to kernel lockdown\n");
+   xmon_is_ro = true;
+   xmon_on = 0;
+   xmon_init(0);
+   clear_all_bpt();
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+   if (xmon_is_ro)
+   printf("xmon: Read-only due to kernel lockdown\n");
+   }
+
+   return lockdown;
+}
+#else /* CONFIG_SECURITY */
+static inline bool xmon_is_locked_down(void)
+{
+   return false;
+}
+#endif
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -704,6 +739,9 @@ static int xmon_bpt(struct pt_regs *regs)
struct bpt *bp;
unsigned long offset;
 
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
 
@@ -735,6 +773,9 @@ static int xmon_sstep(struct pt_regs *regs)
 
 static int xmon_break_match(struct pt_regs *regs)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (dabr.enabled == 0)
@@ -745,6 +786,9 @@ static int xmon_break_match(struct pt_regs *regs)
 
 static int xmon_iabr_match(struct pt_regs *regs)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (iabr == NULL)
@@ -3741,6 +3785,9 @@ static void xmon_init(int enable)
 #ifdef CONFIG_MAGIC_SYSRQ
 static void sysrq_handle_xmon(int key)
 {
+   if (xmon_is_locked_down())
+   return;
+
/* ensure xmon is enabled */
xmon_init(1);
debugger(get_irq_regs());
@@ -3762,7 +3809,6 @@ static int __init setup_xmon_sysrq(void)
 device_initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */
 
-#ifdef CONFIG_DEBUG_FS
 static void clear_all_bpt(void)
 {
int i;
@@ -3784,8 +3830,12 @@ static void clear_all_bpt(void)
printf("xmon: All breakpoints cleared\n");
 }
 
+#ifdef CONFIG_DEBUG_FS
 static int xmon_dbgfs_set(void *data, u64 val)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
xmon_on = !!val;
xmon_init(xmon_on);
 
@@ -3844,6 +3894,9 @@ early_param("xmon", early_parse_xmon);

Re: [RFC PATCH v2] powerpc/xmon: restrict when kernel is locked down

2019-08-03 Thread Christopher M Riedl
> On July 29, 2019 at 2:00 AM Daniel Axtens  wrote:
> 
> Would you be able to send a v2 with these changes? (that is, not purging
> breakpoints when entering integrity mode)
> 

Just sent out a v3 with that change among a few others and a rebase.

Thanks,
Chris R.


[PATCH v3 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-08-05 Thread Christopher M. Riedl
Determining if a processor is in shared processor mode is not a constant
so don't hide it behind a #define.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index a47f827bc5f1..dc5fcea1f006 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,15 +101,27 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
 extern void __spin_yield(arch_spinlock_t *lock);
 extern void __rw_yield(arch_rwlock_t *lock);
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
-#define SHARED_PROCESSOR   0
 #endif
 
+static inline bool is_shared_processor(void)
+{
+/*
+ * LPPACA is only available on BOOK3S so guard anything LPPACA related to
+ * allow other platforms (which include this common header) to compile.
+ */
+#ifdef CONFIG_PPC_BOOK3S
+   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
+   lppaca_shared_proc(local_paca->lppaca_ptr));
+#else
+   return false;
+#endif
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -117,7 +129,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -136,7 +148,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_irq_restore(flags);
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -226,7 +238,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock < 0));
HMT_medium();
@@ -240,7 +252,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock != 0));
HMT_medium();
-- 
2.22.0



[PATCH v3 3/3] powerpc/spinlocks: Fix oops in shared-processor spinlocks

2019-08-05 Thread Christopher M. Riedl
Booting w/ ppc64le_defconfig + CONFIG_PREEMPT results in the attached
kernel trace due to calling shared-processor spinlocks while not running
in an SPLPAR. Previously, the out-of-line spinlocks implementations were
selected based on CONFIG_PPC_SPLPAR at compile time without a runtime
shared-processor LPAR check.

To fix, call the actual spinlock implementations from a set of common
functions, spin_yield() and rw_yield(), which check for shared-processor
LPAR during runtime and select the appropriate lock implementation.

[0.430878] BUG: Kernel NULL pointer dereference at 0x0100
[0.431991] Faulting instruction address: 0xc0097f88
[0.432934] Oops: Kernel access of bad area, sig: 7 [#1]
[0.433448] LE PAGE_SIZE=64K MMU=Radix MMU=Hash PREEMPT SMP NR_CPUS=2048 
NUMA PowerNV
[0.434479] Modules linked in:
[0.435055] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
5.2.0-rc6-00491-g249155c20f9b #28
[0.435730] NIP:  c0097f88 LR: c0c07a88 CTR: c015ca10
[0.436383] REGS: c000727079f0 TRAP: 0300   Not tainted  
(5.2.0-rc6-00491-g249155c20f9b)
[0.437004] MSR:  92009033   CR: 
84000424  XER: 2004
[0.437874] CFAR: c0c07a84 DAR: 0100 DSISR: 0008 
IRQMASK: 1
[0.437874] GPR00: c0c07a88 c00072707c80 c1546300 
c0007be38a80
[0.437874] GPR04: c000726f0c00 0002 c0007279c980 
0100
[0.437874] GPR08: c1581b78 8001 0008 
c0007279c9b0
[0.437874] GPR12:  c173 c0142558 

[0.437874] GPR16:    

[0.437874] GPR20:    

[0.437874] GPR24: c0007be38a80 c0c002f4  

[0.437874] GPR28: c00072221a00 c000726c2600 c0007be38a80 
c0007be38a80
[0.443992] NIP [c0097f88] __spin_yield+0x48/0xa0
[0.444523] LR [c0c07a88] __raw_spin_lock+0xb8/0xc0
[0.445080] Call Trace:
[0.445670] [c00072707c80] [c00072221a00] 0xc00072221a00 
(unreliable)
[0.446425] [c00072707cb0] [c0bffb0c] __schedule+0xbc/0x850
[0.447078] [c00072707d70] [c0c002f4] schedule+0x54/0x130
[0.447694] [c00072707da0] [c01427dc] kthreadd+0x28c/0x2b0
[0.448389] [c00072707e20] [c000c1cc] 
ret_from_kernel_thread+0x5c/0x70
[0.449143] Instruction dump:
[0.449821] 4d9e0020 552a043e 210a07ff 79080fe0 0b08 3d020004 3908b878 
794a1f24
[0.450587] e8e8 7ce7502a e8e7 38e70100 <7ca03c2c> 70a70001 78a50020 
4d820020
[0.452808] ---[ end trace 474d6b2b8fc5cb7e ]---

Signed-off-by: Christopher M. Riedl 
---
Changes since v2:
 - Directly call splpar_*_yield() to avoid duplicate call to
   is_shared_processor() in some cases

 arch/powerpc/include/asm/spinlock.h | 36 -
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 0a8270183770..8935315c80ff 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -103,11 +103,9 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 /* We only yield to the hypervisor if we are in shared processor mode */
 void splpar_spin_yield(arch_spinlock_t *lock);
 void splpar_rw_yield(arch_rwlock_t *lock);
-#define __spin_yield(x) splpar_spin_yield(x)
-#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
-#define __spin_yield(x)barrier()
-#define __rw_yield(x)  barrier()
+static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
+static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
 static inline bool is_shared_processor(void)
@@ -124,6 +122,22 @@ static inline bool is_shared_processor(void)
 #endif
 }
 
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   else
+   barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_rw_yield(lock);
+   else
+   barrier();
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -132,7 +146,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   splpar_spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
}
@@ -151,7 +165,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
do {
HMT_low();
 

[PATCH v3 2/3] powerpc/spinlocks: Rename SPLPAR-only spinlocks

2019-08-05 Thread Christopher M. Riedl
The __rw_yield and __spin_yield functions only pertain to SPLPAR mode.
Rename them to make this relationship obvious.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 6 --
 arch/powerpc/lib/locks.c| 6 +++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index dc5fcea1f006..0a8270183770 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,8 +101,10 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#define __spin_yield(x) splpar_spin_yield(x)
+#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6550b9e5ce5f..6440d5943c00 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-void __spin_yield(arch_spinlock_t *lock)
+void splpar_spin_yield(arch_spinlock_t *lock)
 {
unsigned int lock_value, holder_cpu, yield_count;
 
@@ -36,14 +36,14 @@ void __spin_yield(arch_spinlock_t *lock)
plpar_hcall_norets(H_CONFER,
get_hard_smp_processor_id(holder_cpu), yield_count);
 }
-EXPORT_SYMBOL_GPL(__spin_yield);
+EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
 /*
  * Waiting for a read lock or a write lock on a rwlock...
  * This turns out to be the same for read and write locks, since
  * we only know the holder if it is write-locked.
  */
-void __rw_yield(arch_rwlock_t *rw)
+void splpar_rw_yield(arch_rwlock_t *rw)
 {
int lock_value;
unsigned int holder_cpu, yield_count;
-- 
2.22.0



[PATCH v3 0/3] Fix oops in shared-processor spinlocks

2019-08-05 Thread Christopher M. Riedl
Fixes an oops when calling the shared-processor spinlock implementation
from a non-SP LPAR. Also take this opportunity to refactor
SHARED_PROCESSOR a bit.

Reference:  https://github.com/linuxppc/issues/issues/229

Changes since v2:
 - Directly call splpar_*_yield() to avoid duplicate call to
   is_shared_processor() in some cases

Changes since v1:
 - Improve comment wording to make it clear why the BOOK3S #ifdef is
   required in is_shared_processor() in spinlock.h
 - Replace empty #define of splpar_*_yield() with actual functions with
   empty bodies

Christopher M. Riedl (3):
  powerpc/spinlocks: Refactor SHARED_PROCESSOR
  powerpc/spinlocks: Rename SPLPAR-only spinlocks
  powerpc/spinlocks: Fix oops in shared-processor spinlocks

 arch/powerpc/include/asm/spinlock.h | 62 +
 arch/powerpc/lib/locks.c|  6 +--
 2 files changed, 48 insertions(+), 20 deletions(-)

-- 
2.22.0



Re: [PATCH v2 3/3] powerpc/spinlocks: Fix oops in shared-processor spinlocks

2019-08-06 Thread Christopher M Riedl


> On August 6, 2019 at 7:14 AM Michael Ellerman  wrote:
> 
> 
> Christopher M Riedl  writes:
> >> On August 2, 2019 at 6:38 AM Michael Ellerman  wrote:
> >> "Christopher M. Riedl"  writes:
> >> 
> >> This leaves us with a double test of is_shared_processor() doesn't it?
> >
> > Yep, and that's no good. Hmm, executing the barrier() in the 
> > non-shared-processor
> > case probably hurts performance here?
> 
> It's only a "compiler barrier", so it shouldn't generate any code.
> 
> But it does have the effect of telling the compiler it can't optimise
> across that barrier, which can be important.
> 
> In those spin loops all we're doing is checking lock->slock which is
> already marked volatile in the definition of arch_spinlock_t, so the
> extra barrier shouldn't really make any difference.
> 
> But still the current code doesn't have a barrier() there, so we should
> make sure we don't introduce one as part of this refactor.

Thank you for taking the time to explain this. I have some more reading to
do about compiler-barriers it seems :)
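
For reference, barrier() is a compiler-level memory clobber, approximately
the definition below, so it emits no machine instructions but does prevent
the compiler from caching memory values in registers across it:

/* Approximate definition from include/linux/compiler.h: */
#define barrier() __asm__ __volatile__("" : : : "memory")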

> 
> So I think you just want to change the call to spin_yield() above to
> splpar_spin_yield(), which avoids the double check, and also avoids the
> barrier() in the SPLPAR=n case.
> 
> And then arch_spin_relax() calls spin_yield() etc.

I submitted a v3 before your reply with this change already - figured this
is the best way to avoid the double check and maintain legacy behavior.

> 
> cheers


[PATCH v4 0/3] Fix oops in shared-processor spinlocks

2019-08-12 Thread Christopher M. Riedl
Fixes an oops when calling the shared-processor spinlock implementation
from a non-SP LPAR. Also take this opportunity to refactor
SHARED_PROCESSOR a bit.

Reference:  https://github.com/linuxppc/issues/issues/229

Changes since v3:
 - Replace CONFIG_BOOK3S #ifdef with CONFIG_PPC_PSERIES in
   is_shared_processor() to fix compile error reported by 0day-ci

Changes since v2:
 - Directly call splpar_*_yield() to avoid duplicate call to
   is_shared_processor() in some cases

Changes since v1:
 - Improve comment wording to make it clear why the BOOK3S #ifdef is
   required in is_shared_processor() in spinlock.h
 - Replace empty #define of splpar_*_yield() with actual functions with
   empty bodies

Christopher M. Riedl (3):
  powerpc/spinlocks: Refactor SHARED_PROCESSOR
  powerpc/spinlocks: Rename SPLPAR-only spinlocks
  powerpc/spinlocks: Fix oops in shared-processor spinlocks

 arch/powerpc/include/asm/spinlock.h | 62 +
 arch/powerpc/lib/locks.c|  6 +--
 2 files changed, 48 insertions(+), 20 deletions(-)

-- 
2.22.0



[PATCH v4 1/3] powerpc/spinlocks: Refactor SHARED_PROCESSOR

2019-08-12 Thread Christopher M. Riedl
Determining if a processor is in shared processor mode is not a constant
so don't hide it behind a #define.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index a47f827bc5f1..e9c60fbcc8fe 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,15 +101,27 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
 extern void __spin_yield(arch_spinlock_t *lock);
 extern void __rw_yield(arch_rwlock_t *lock);
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
-#define SHARED_PROCESSOR   0
 #endif
 
+static inline bool is_shared_processor(void)
+{
+/*
+ * LPPACA is only available on Pseries so guard anything LPPACA related to
+ * allow other platforms (which include this common header) to compile.
+ */
+#ifdef CONFIG_PPC_PSERIES
+   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
+   lppaca_shared_proc(local_paca->lppaca_ptr));
+#else
+   return false;
+#endif
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -117,7 +129,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -136,7 +148,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_irq_restore(flags);
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
@@ -226,7 +238,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock < 0));
HMT_medium();
@@ -240,7 +252,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
break;
do {
HMT_low();
-   if (SHARED_PROCESSOR)
+   if (is_shared_processor())
__rw_yield(rw);
} while (unlikely(rw->lock != 0));
HMT_medium();
-- 
2.22.0



[PATCH v4 3/3] powerpc/spinlocks: Fix oops in shared-processor spinlocks

2019-08-12 Thread Christopher M. Riedl
Booting w/ ppc64le_defconfig + CONFIG_PREEMPT results in the attached
kernel trace due to calling shared-processor spinlocks while not running
in an SPLPAR. Previously, the out-of-line spinlocks implementations were
selected based on CONFIG_PPC_SPLPAR at compile time without a runtime
shared-processor LPAR check.

To fix, call the actual spinlock implementations from a set of common
functions, spin_yield() and rw_yield(), which check for shared-processor
LPAR during runtime and select the appropriate lock implementation.

[0.430878] BUG: Kernel NULL pointer dereference at 0x0100
[0.431991] Faulting instruction address: 0xc0097f88
[0.432934] Oops: Kernel access of bad area, sig: 7 [#1]
[0.433448] LE PAGE_SIZE=64K MMU=Radix MMU=Hash PREEMPT SMP NR_CPUS=2048 
NUMA PowerNV
[0.434479] Modules linked in:
[0.435055] CPU: 0 PID: 2 Comm: kthreadd Not tainted 
5.2.0-rc6-00491-g249155c20f9b #28
[0.435730] NIP:  c0097f88 LR: c0c07a88 CTR: c015ca10
[0.436383] REGS: c000727079f0 TRAP: 0300   Not tainted  
(5.2.0-rc6-00491-g249155c20f9b)
[0.437004] MSR:  92009033   CR: 
84000424  XER: 2004
[0.437874] CFAR: c0c07a84 DAR: 0100 DSISR: 0008 
IRQMASK: 1
[0.437874] GPR00: c0c07a88 c00072707c80 c1546300 
c0007be38a80
[0.437874] GPR04: c000726f0c00 0002 c0007279c980 
0100
[0.437874] GPR08: c1581b78 8001 0008 
c0007279c9b0
[0.437874] GPR12:  c173 c0142558 

[0.437874] GPR16:    

[0.437874] GPR20:    

[0.437874] GPR24: c0007be38a80 c0c002f4  

[0.437874] GPR28: c00072221a00 c000726c2600 c0007be38a80 
c0007be38a80
[0.443992] NIP [c0097f88] __spin_yield+0x48/0xa0
[0.444523] LR [c0c07a88] __raw_spin_lock+0xb8/0xc0
[0.445080] Call Trace:
[0.445670] [c00072707c80] [c00072221a00] 0xc00072221a00 
(unreliable)
[0.446425] [c00072707cb0] [c0bffb0c] __schedule+0xbc/0x850
[0.447078] [c00072707d70] [c0c002f4] schedule+0x54/0x130
[0.447694] [c00072707da0] [c01427dc] kthreadd+0x28c/0x2b0
[0.448389] [c00072707e20] [c000c1cc] 
ret_from_kernel_thread+0x5c/0x70
[0.449143] Instruction dump:
[0.449821] 4d9e0020 552a043e 210a07ff 79080fe0 0b08 3d020004 3908b878 
794a1f24
[0.450587] e8e8 7ce7502a e8e7 38e70100 <7ca03c2c> 70a70001 78a50020 
4d820020
[0.452808] ---[ end trace 474d6b2b8fc5cb7e ]---

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/spinlock.h | 36 -
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 0d04d468f660..e9a960e28f3c 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -103,11 +103,9 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 /* We only yield to the hypervisor if we are in shared processor mode */
 void splpar_spin_yield(arch_spinlock_t *lock);
 void splpar_rw_yield(arch_rwlock_t *lock);
-#define __spin_yield(x) splpar_spin_yield(x)
-#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
-#define __spin_yield(x)barrier()
-#define __rw_yield(x)  barrier()
+static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
+static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
 static inline bool is_shared_processor(void)
@@ -124,6 +122,22 @@ static inline bool is_shared_processor(void)
 #endif
 }
 
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   else
+   barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_rw_yield(lock);
+   else
+   barrier();
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (1) {
@@ -132,7 +146,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   splpar_spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
}
@@ -151,7 +165,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
do {
HMT_low();
if (is_shared_processor())
-   __spin_yield(lock);
+   splpar_spin_yield(lock);

[PATCH v4 2/3] powerpc/spinlocks: Rename SPLPAR-only spinlocks

2019-08-12 Thread Christopher M. Riedl
The __rw_yield and __spin_yield functions only pertain to SPLPAR mode.
Rename them to make this relationship obvious.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/spinlock.h | 6 --
 arch/powerpc/lib/locks.c| 6 +++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index e9c60fbcc8fe..0d04d468f660 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -101,8 +101,10 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#define __spin_yield(x) splpar_spin_yield(x)
+#define __rw_yield(x) splpar_rw_yield(x)
 #else /* SPLPAR */
 #define __spin_yield(x)barrier()
 #define __rw_yield(x)  barrier()
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6550b9e5ce5f..6440d5943c00 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-void __spin_yield(arch_spinlock_t *lock)
+void splpar_spin_yield(arch_spinlock_t *lock)
 {
unsigned int lock_value, holder_cpu, yield_count;
 
@@ -36,14 +36,14 @@ void __spin_yield(arch_spinlock_t *lock)
plpar_hcall_norets(H_CONFER,
get_hard_smp_processor_id(holder_cpu), yield_count);
 }
-EXPORT_SYMBOL_GPL(__spin_yield);
+EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
 /*
  * Waiting for a read lock or a write lock on a rwlock...
  * This turns out to be the same for read and write locks, since
  * we only know the holder if it is write-locked.
  */
-void __rw_yield(arch_rwlock_t *rw)
+void splpar_rw_yield(arch_rwlock_t *rw)
 {
int lock_value;
unsigned int holder_cpu, yield_count;
-- 
2.22.0



[RFC PATCH v4 2/2] powerpc/xmon: Restrict when kernel is locked down

2019-08-14 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Xmon checks the lockdown
state and takes appropriate action:

 (1) during xmon_setup to prevent early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent re-entry into xmon

Suggested-by: Andrew Donnellan 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 59 ++--
 include/linux/security.h |  2 ++
 security/lockdown/lockdown.c |  2 ++
 3 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index bb63ecc599fd..8fd79369974e 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -187,6 +188,9 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+static void xmon_init(int);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -283,10 +287,41 @@ Commands:\n\
 "  U   show uptime information\n"
 "  ?   help\n"
 "  # n limit output to n lines per page (for dp, dpa, dl)\n"
-"  zr  reboot\n\
-  zh   halt\n"
+"  zr  reboot\n"
+"  zh  halt\n"
 ;
 
+#ifdef CONFIG_SECURITY
+static bool xmon_is_locked_down(void)
+{
+   static bool lockdown;
+
+   if (!lockdown) {
+   lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+   if (lockdown) {
+   printf("xmon: Disabled due to kernel lockdown\n");
+   xmon_is_ro = true;
+   xmon_on = 0;
+   xmon_init(0);
+   clear_all_bpt();
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+   if (xmon_is_ro)
+   printf("xmon: Read-only due to kernel lockdown\n");
+   }
+
+   return lockdown;
+}
+#else /* CONFIG_SECURITY */
+static inline bool xmon_is_locked_down(void)
+{
+   return false;
+}
+#endif
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -704,6 +739,9 @@ static int xmon_bpt(struct pt_regs *regs)
struct bpt *bp;
unsigned long offset;
 
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
 
@@ -735,6 +773,9 @@ static int xmon_sstep(struct pt_regs *regs)
 
 static int xmon_break_match(struct pt_regs *regs)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (dabr.enabled == 0)
@@ -745,6 +786,9 @@ static int xmon_break_match(struct pt_regs *regs)
 
 static int xmon_iabr_match(struct pt_regs *regs)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (iabr == NULL)
@@ -3750,6 +3794,9 @@ static void xmon_init(int enable)
 #ifdef CONFIG_MAGIC_SYSRQ
 static void sysrq_handle_xmon(int key)
 {
+   if (xmon_is_locked_down())
+   return;
+
/* ensure xmon is enabled */
xmon_init(1);
debugger(get_irq_regs());
@@ -3771,7 +3818,6 @@ static int __init setup_xmon_sysrq(void)
 device_initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */
 
-#ifdef CONFIG_DEBUG_FS
 static void clear_all_bpt(void)
 {
int i;
@@ -3793,8 +3839,12 @@ static void clear_all_bpt(void)
printf("xmon: All breakpoints cleared\n");
 }
 
+#ifdef CONFIG_DEBUG_FS
 static int xmon_dbgfs_set(void *data, u64 val)
 {
+   if (xmon_is_locked_down())
+   return 0;
+
xmon_on = !!val;
xmon_init(xmon_on);
 
@@ -3853,6 +3903,9 @@ early_param("xmon", early_parse_xmon);
 
 void __init xmon_setup(void)
 {
+   if (xmon_is_locked_down())
+   return;
+
if (xmon_on)
xmon_init(1);
if (xmon_early)
diff --git a/include/linux/security.h b/include/linux/security.h
index 807dc0d24982..379b74b5d545 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -116,12 +11

[RFC PATCH v4 0/2] Restrict xmon when kernel is locked down

2019-08-14 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Since this can occur
dynamically, there may be pre-existing, active breakpoints in xmon when
transitioning into read-only mode. These breakpoints will still trigger,
so allow them to be listed, but not cleared or altered, using xmon.

Changes since v3:
 - Allow active breakpoints to be shown/listed in read-only mode

Changes since v2:
 - Rebased onto v36 of https://patchwork.kernel.org/cover/11049461/
   (based on: f632a8170a6b667ee4e3f552087588f0fe13c4bb)
 - Do not clear existing breakpoints when transitioning from
   lockdown=none to lockdown=integrity
 - Remove line continuation and dangling quote (confuses checkpatch.pl)
   from the xmon command help/usage string

Christopher M. Riedl (2):
  powerpc/xmon: Allow listing active breakpoints in read-only mode
  powerpc/xmon: Restrict when kernel is locked down

 arch/powerpc/xmon/xmon.c | 78 
 include/linux/security.h |  2 +
 security/lockdown/lockdown.c |  2 +
 3 files changed, 74 insertions(+), 8 deletions(-)

-- 
2.22.0



[RFC PATCH v4 1/2] powerpc/xmon: Allow listing active breakpoints in read-only mode

2019-08-14 Thread Christopher M. Riedl
Xmon can enter read-only mode dynamically due to changes in kernel
lockdown state. This transition does not clear active breakpoints, and
any such breakpoints should remain visible to the xmon'er.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d0620d762a5a..bb63ecc599fd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1045,10 +1045,6 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
-   if (xmon_is_ro) {
-   printf(xmon_ro_msg);
-   break;
-   }
bpt_cmds();
break;
case 'C':
@@ -1317,11 +1313,16 @@ bpt_cmds(void)
struct bpt *bp;
 
cmd = inchar();
+
switch (cmd) {
 #ifndef CONFIG_PPC_8xx
static const char badaddr[] = "Only kernel addresses are permitted for 
breakpoints\n";
int mode;
case 'd':   /* bd - hardware data breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!ppc_breakpoint_available()) {
printf("Hardware data breakpoint not supported on this 
cpu\n");
break;
@@ -1349,6 +1350,10 @@ bpt_cmds(void)
break;
 
case 'i':   /* bi - hardware instr breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
printf("Hardware instruction breakpoint "
   "not supported on this cpu\n");
@@ -1372,6 +1377,10 @@ bpt_cmds(void)
 #endif
 
case 'c':
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!scanhex(&a)) {
/* clear all breakpoints */
for (i = 0; i < NBPTS; ++i)
@@ -1407,7 +1416,7 @@ bpt_cmds(void)
break;
}
termch = cmd;
-   if (!scanhex(&a)) {
+   if (xmon_is_ro || !scanhex(&a)) {
/* print all breakpoints */
printf("   typeaddress\n");
if (dabr.enabled) {
-- 
2.22.0



[PATCH v5 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode

2019-08-27 Thread Christopher M. Riedl
Read-only mode should not prevent listing and clearing any active
breakpoints.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d0620d762a5a..a98a354d46ac 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1045,10 +1045,6 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
-   if (xmon_is_ro) {
-   printf(xmon_ro_msg);
-   break;
-   }
bpt_cmds();
break;
case 'C':
@@ -1317,11 +1313,16 @@ bpt_cmds(void)
struct bpt *bp;
 
cmd = inchar();
+
switch (cmd) {
 #ifndef CONFIG_PPC_8xx
static const char badaddr[] = "Only kernel addresses are permitted for 
breakpoints\n";
int mode;
case 'd':   /* bd - hardware data breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!ppc_breakpoint_available()) {
printf("Hardware data breakpoint not supported on this 
cpu\n");
break;
@@ -1349,6 +1350,10 @@ bpt_cmds(void)
break;
 
case 'i':   /* bi - hardware instr breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
printf("Hardware instruction breakpoint "
   "not supported on this cpu\n");
@@ -1407,7 +1412,7 @@ bpt_cmds(void)
break;
}
termch = cmd;
-   if (!scanhex(&a)) {
+   if (xmon_is_ro || !scanhex(&a)) {
/* print all breakpoints */
printf("   typeaddress\n");
if (dabr.enabled) {
-- 
2.23.0



[PATCH v5 2/2] powerpc/xmon: Restrict when kernel is locked down

2019-08-27 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and prevent user
entry into xmon when lockdown=confidentiality. Xmon checks the lockdown
state on every attempted entry:

 (1) during early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

Suggested-by: Andrew Donnellan 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 85 
 include/linux/security.h |  2 +
 security/lockdown/lockdown.c |  2 +
 3 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a98a354d46ac..94a5fada3034 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -187,6 +188,8 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -283,10 +286,38 @@ Commands:\n\
 "  U   show uptime information\n"
 "  ?   help\n"
 "  # n limit output to n lines per page (for dp, dpa, dl)\n"
-"  zr  reboot\n\
-  zh   halt\n"
+"  zr  reboot\n"
+"  zh  halt\n"
 ;
 
+#ifdef CONFIG_SECURITY
+static bool xmon_is_locked_down(void)
+{
+   static bool lockdown;
+
+   if (!lockdown) {
+   lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+   if (lockdown) {
+   printf("xmon: Disabled due to kernel lockdown\n");
+   xmon_is_ro = true;
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+   if (xmon_is_ro)
+   printf("xmon: Read-only due to kernel lockdown\n");
+   }
+
+   return lockdown;
+}
+#else /* CONFIG_SECURITY */
+static inline bool xmon_is_locked_down(void)
+{
+   return false;
+}
+#endif
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -438,7 +469,10 @@ static bool wait_for_other_cpus(int ncpus)
 
return false;
 }
-#endif /* CONFIG_SMP */
+#else /* CONFIG_SMP */
+static inline void get_output_lock(void) {}
+static inline void release_output_lock(void) {}
+#endif
 
 static inline int unrecoverable_excp(struct pt_regs *regs)
 {
@@ -455,6 +489,7 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
int cmd = 0;
struct bpt *bp;
long recurse_jmp[JMP_BUF_LEN];
+   bool locked_down;
unsigned long offset;
unsigned long flags;
 #ifdef CONFIG_SMP
@@ -465,6 +500,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
local_irq_save(flags);
hard_irq_disable();
 
+   locked_down = xmon_is_locked_down();
+
tracing_enabled = tracing_is_on();
tracing_off();
 
@@ -516,7 +553,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
if (!fromipi) {
get_output_lock();
-   excprint(regs);
+   if (!locked_down)
+   excprint(regs);
if (bp) {
printf("cpu 0x%x stopped at breakpoint 0x%tx (",
   cpu, BP_NUM(bp));
@@ -568,10 +606,14 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
}
remove_bpts();
disable_surveillance();
-   /* for breakpoint or single step, print the current instr. */
-   if (bp || TRAP(regs) == 0xd00)
-   ppc_inst_dump(regs->nip, 1, 0);
-   printf("enter ? for help\n");
+
+   if (!locked_down) {
+   /* for breakpoint or single step, print curr insn */
+   if (bp || TRAP(regs) == 0xd00)
+   ppc_inst_dump(regs->nip, 1, 0);
+   printf("enter ? for help\n");
+   }
+
mb();
xmon_gate = 1;
barrier();
@@ -595,8 +637,9 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
spin_cpu_relax();
touch_nmi_watchdog();
} else {
-   cmd = cmds(regs);
-   if (cmd != 0) {
+  

[PATCH v5 0/2] Restrict xmon when kernel is locked down

2019-08-27 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Since this can occur
dynamically, there may be pre-existing, active breakpoints in xmon when
transitioning into read-only mode. These breakpoints will still trigger,
so allow them to be listed, but not cleared or altered, using xmon.

Changes since v4:
 - Move lockdown state checks into xmon_core
 - Allow clearing of breakpoints in xmon read-only mode
 - Test simple scenarios (combinations of xmon and lockdown cmdline
   options, setting breakpoints and changing lockdown state, etc) in
   QEMU and on an actual POWER8 VM
 - Rebase onto security/next-lockdown
   b602614a81078bf29c82b2671bb96a63488f68d6

Changes since v3:
 - Allow active breakpoints to be shown/listed in read-only mode

Changes since v2:
 - Rebased onto v36 of https://patchwork.kernel.org/cover/11049461/
   (based on: f632a8170a6b667ee4e3f552087588f0fe13c4bb)
 - Do not clear existing breakpoints when transitioning from
   lockdown=none to lockdown=integrity
 - Remove line continuation and dangling quote (confuses checkpatch.pl)
   from the xmon command help/usage string

Christopher M. Riedl (2):
  powerpc/xmon: Allow listing active breakpoints in read-only mode
  powerpc/xmon: Restrict when kernel is locked down

 arch/powerpc/xmon/xmon.c | 104 +++
 include/linux/security.h |   2 +
 security/lockdown/lockdown.c |   2 +
 3 files changed, 86 insertions(+), 22 deletions(-)

-- 
2.23.0



Re: [PATCH v5 2/2] powerpc/xmon: Restrict when kernel is locked down

2019-08-29 Thread Christopher M Riedl


> On August 29, 2019 at 2:43 AM Daniel Axtens  wrote:
> 
> 
> Hi,
> 
> > Xmon should be either fully or partially disabled depending on the
> > kernel lockdown state.
> 
> I've been kicking the tyres of this, and it seems to work well:
> 
> Tested-by: Daniel Axtens 
> 

Thank you for taking the time to test this!

>
> I have one small nit: if I enter confidentiality mode and then try to
> enter xmon, I get 32 messages about clearing the breakpoints each time I
> try to enter xmon:
>

Ugh, that's annoying. I tested this on a vm w/ 2 vcpus but should have
considered the case of more vcpus :(

> 
> root@dja-guest:~# echo confidentiality > /sys/kernel/security/lockdown 
> root@dja-guest:~# echo x >/proc/sysrq-trigger 
> [  489.585400] sysrq: Entering xmon
> xmon: Disabled due to kernel lockdown
> xmon: All breakpoints cleared
> xmon: All breakpoints cleared
> xmon: All breakpoints cleared
> xmon: All breakpoints cleared
> xmon: All breakpoints cleared
> ...
> 
> Investigating, I see that this is because my vm has 32 vcpus, and I'm
> getting one per CPU.
> 
> Looking at the call sites, there's only one other caller, so I think you
> might be better served with this:
> 
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 94a5fada3034..fcaf1d568162 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -3833,10 +3833,6 @@ static void clear_all_bpt(void)
> iabr = NULL;
> dabr.enabled = 0;
> }
> -
> -   get_output_lock();
> -   printf("xmon: All breakpoints cleared\n");
> -   release_output_lock();
>  }
>  
>  #ifdef CONFIG_DEBUG_FS
> @@ -3846,8 +3842,13 @@ static int xmon_dbgfs_set(void *data, u64 val)
> xmon_init(xmon_on);
>  
> /* make sure all breakpoints removed when disabling */
> -   if (!xmon_on)
> +   if (!xmon_on) {
> clear_all_bpt();
> +   get_output_lock();
> +   printf("xmon: All breakpoints cleared\n");
> +   release_output_lock();
> +   }
> +
> return 0;
>  }
>

Good point, I will add this to the next version, thanks!  

>
> Apart from that:
> Reviewed-by: Daniel Axtens 
> 
> Regards,
> Daniel
>


Re: [PATCH v5 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode

2019-08-29 Thread Christopher M Riedl
> On August 29, 2019 at 1:40 AM Daniel Axtens  wrote:
> 
> 
> Hi Chris,
> 
> > Read-only mode should not prevent listing and clearing any active
> > breakpoints.
> 
> I tested this and it works for me:
> 
> Tested-by: Daniel Axtens 
> 
> > +   if (xmon_is_ro || !scanhex(&a)) {
> 
> It took me a while to figure out what this line does: as I understand
> it, the 'b' command can also be used to install a breakpoint (as well as
> bi/bd). If we are in ro mode or if the input after 'b' doesn't scan as a
> hex string, print the list of breakpoints instead. Anyway, I'm now
> happy with it, so:
>

I can add a comment to that effect in the next version. That entire section
of code could probably be cleaned up a bit - but that's for another patch.
Thanks for testing!
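
For reference, the comment I have in mind would be something along these lines
(just a sketch of the intent, not the final wording):

	termch = cmd;

	/*
	 * A 'b' command with no address - or any 'b' command while in
	 * read-only mode - falls through to listing the existing
	 * breakpoints instead of installing a new one.
	 */
	if (xmon_is_ro || !scanhex(&a)) {
		/* print all breakpoints */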

> 
> Reviewed-by: Daniel Axtens 
> 
> Regards,
> Daniel
> 
> > /* print all breakpoints */
> > printf("   typeaddress\n");
> > if (dabr.enabled) {
> > -- 
> > 2.23.0


[PATCH v6 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode

2019-08-29 Thread Christopher M. Riedl
Read-only mode should not prevent listing and clearing any active
breakpoints.

Tested-by: Daniel Axtens 
Reviewed-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d0620d762a5a..ed94de614938 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1045,10 +1045,6 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
-   if (xmon_is_ro) {
-   printf(xmon_ro_msg);
-   break;
-   }
bpt_cmds();
break;
case 'C':
@@ -1317,11 +1313,16 @@ bpt_cmds(void)
struct bpt *bp;
 
cmd = inchar();
+
switch (cmd) {
 #ifndef CONFIG_PPC_8xx
static const char badaddr[] = "Only kernel addresses are permitted for 
breakpoints\n";
int mode;
case 'd':   /* bd - hardware data breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!ppc_breakpoint_available()) {
printf("Hardware data breakpoint not supported on this 
cpu\n");
break;
@@ -1349,6 +1350,10 @@ bpt_cmds(void)
break;
 
case 'i':   /* bi - hardware instr breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
printf("Hardware instruction breakpoint "
   "not supported on this cpu\n");
@@ -1407,7 +1412,8 @@ bpt_cmds(void)
break;
}
termch = cmd;
-   if (!scanhex(&a)) {
+
+   if (xmon_is_ro || !scanhex(&a)) {
/* print all breakpoints */
printf("   typeaddress\n");
if (dabr.enabled) {
-- 
2.23.0



[PATCH v6 0/2] Restrict xmon when kernel is locked down

2019-08-29 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Since this can occur
dynamically, there may be pre-existing, active breakpoints in xmon when
transitioning into read-only mode. These breakpoints will still trigger,
so allow them to be listed and cleared using xmon.

Changes since v5:
 - Do not spam print messages when attempting to enter xmon when
   lockdown=confidentiality

Changes since v4:
 - Move lockdown state checks into xmon_core
 - Allow clearing of breakpoints in xmon read-only mode
 - Test simple scenarios (combinations of xmon and lockdown cmdline
   options, setting breakpoints and changing lockdown state, etc) in
   QEMU and on an actual POWER8 VM
 - Rebase onto security/next-lockdown
   b602614a81078bf29c82b2671bb96a63488f68d6

Changes since v3:
 - Allow active breakpoints to be shown/listed in read-only mode

Changes since v2:
 - Rebased onto v36 of https://patchwork.kernel.org/cover/11049461/
   (based on: f632a8170a6b667ee4e3f552087588f0fe13c4bb)
 - Do not clear existing breakpoints when transitioning from
   lockdown=none to lockdown=integrity
 - Remove line continuation and dangling quote (confuses checkpatch.pl)
   from the xmon command help/usage string

Christopher M. Riedl (2):
  powerpc/xmon: Allow listing and clearing breakpoints in read-only mode
  powerpc/xmon: Restrict when kernel is locked down

 arch/powerpc/xmon/xmon.c | 108 +++
 include/linux/security.h |   2 +
 security/lockdown/lockdown.c |   2 +
 3 files changed, 87 insertions(+), 25 deletions(-)

-- 
2.23.0



[PATCH v6 2/2] powerpc/xmon: Restrict when kernel is locked down

2019-08-29 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and prevent user
entry into xmon when lockdown=confidentiality. Xmon checks the lockdown
state on every attempted entry:

 (1) during early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

Suggested-by: Andrew Donnellan 
Tested-by: Daniel Axtens 
Reviewed-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 92 
 include/linux/security.h |  2 +
 security/lockdown/lockdown.c |  2 +
 3 files changed, 76 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index ed94de614938..335718d0b777 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -187,6 +188,8 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -283,10 +286,38 @@ Commands:\n\
 "  U   show uptime information\n"
 "  ?   help\n"
 "  # n limit output to n lines per page (for dp, dpa, dl)\n"
-"  zr  reboot\n\
-  zh   halt\n"
+"  zr  reboot\n"
+"  zh  halt\n"
 ;
 
+#ifdef CONFIG_SECURITY
+static bool xmon_is_locked_down(void)
+{
+   static bool lockdown;
+
+   if (!lockdown) {
+   lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+   if (lockdown) {
+   printf("xmon: Disabled due to kernel lockdown\n");
+   xmon_is_ro = true;
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+   if (xmon_is_ro)
+   printf("xmon: Read-only due to kernel lockdown\n");
+   }
+
+   return lockdown;
+}
+#else /* CONFIG_SECURITY */
+static inline bool xmon_is_locked_down(void)
+{
+   return false;
+}
+#endif
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -438,7 +469,10 @@ static bool wait_for_other_cpus(int ncpus)
 
return false;
 }
-#endif /* CONFIG_SMP */
+#else /* CONFIG_SMP */
+static inline void get_output_lock(void) {}
+static inline void release_output_lock(void) {}
+#endif
 
 static inline int unrecoverable_excp(struct pt_regs *regs)
 {
@@ -455,6 +489,7 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
int cmd = 0;
struct bpt *bp;
long recurse_jmp[JMP_BUF_LEN];
+   bool locked_down;
unsigned long offset;
unsigned long flags;
 #ifdef CONFIG_SMP
@@ -465,6 +500,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
local_irq_save(flags);
hard_irq_disable();
 
+   locked_down = xmon_is_locked_down();
+
tracing_enabled = tracing_is_on();
tracing_off();
 
@@ -516,7 +553,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
if (!fromipi) {
get_output_lock();
-   excprint(regs);
+   if (!locked_down)
+   excprint(regs);
if (bp) {
printf("cpu 0x%x stopped at breakpoint 0x%tx (",
   cpu, BP_NUM(bp));
@@ -568,10 +606,14 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
}
remove_bpts();
disable_surveillance();
-   /* for breakpoint or single step, print the current instr. */
-   if (bp || TRAP(regs) == 0xd00)
-   ppc_inst_dump(regs->nip, 1, 0);
-   printf("enter ? for help\n");
+
+   if (!locked_down) {
+   /* for breakpoint or single step, print curr insn */
+   if (bp || TRAP(regs) == 0xd00)
+   ppc_inst_dump(regs->nip, 1, 0);
+   printf("enter ? for help\n");
+   }
+
mb();
xmon_gate = 1;
barrier();
@@ -595,8 +637,9 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
spin_cpu_relax();
touch_nmi_watchdog();
} else {
-   cmd = cmds(regs);
- 

[PATCH v7 1/2] powerpc/xmon: Allow listing and clearing breakpoints in read-only mode

2019-09-06 Thread Christopher M. Riedl
Read-only mode should not prevent listing and clearing any active
breakpoints.

Tested-by: Daniel Axtens 
Reviewed-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d0620d762a5a..ed94de614938 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1045,10 +1045,6 @@ cmds(struct pt_regs *excp)
set_lpp_cmd();
break;
case 'b':
-   if (xmon_is_ro) {
-   printf(xmon_ro_msg);
-   break;
-   }
bpt_cmds();
break;
case 'C':
@@ -1317,11 +1313,16 @@ bpt_cmds(void)
struct bpt *bp;
 
cmd = inchar();
+
switch (cmd) {
 #ifndef CONFIG_PPC_8xx
static const char badaddr[] = "Only kernel addresses are permitted for 
breakpoints\n";
int mode;
case 'd':   /* bd - hardware data breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!ppc_breakpoint_available()) {
printf("Hardware data breakpoint not supported on this 
cpu\n");
break;
@@ -1349,6 +1350,10 @@ bpt_cmds(void)
break;
 
case 'i':   /* bi - hardware instr breakpoint */
+   if (xmon_is_ro) {
+   printf(xmon_ro_msg);
+   break;
+   }
if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
printf("Hardware instruction breakpoint "
   "not supported on this cpu\n");
@@ -1407,7 +1412,8 @@ bpt_cmds(void)
break;
}
termch = cmd;
-   if (!scanhex(&a)) {
+
+   if (xmon_is_ro || !scanhex(&a)) {
/* print all breakpoints */
printf("   typeaddress\n");
if (dabr.enabled) {
-- 
2.23.0



[PATCH v7 2/2] powerpc/xmon: Restrict when kernel is locked down

2019-09-06 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and prevent user
entry into xmon when lockdown=confidentiality. Xmon checks the lockdown
state on every attempted entry:

 (1) during early xmon'ing

 (2) when triggered via sysrq

 (3) when toggled via debugfs

 (4) when triggered via a previously enabled breakpoint

The following lockdown state transitions are handled:

 (1) lockdown=none -> lockdown=integrity
 set xmon read-only mode

 (2) lockdown=none -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

 (3) lockdown=integrity -> lockdown=confidentiality
 clear all breakpoints, set xmon read-only mode,
 prevent user re-entry into xmon

Suggested-by: Andrew Donnellan 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/xmon/xmon.c | 103 ---
 include/linux/security.h |   2 +
 security/lockdown/lockdown.c |   2 +
 3 files changed, 86 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index ed94de614938..6eaf8ab532f6 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -187,6 +188,8 @@ static void dump_tlb_44x(void);
 static void dump_tlb_book3e(void);
 #endif
 
+static void clear_all_bpt(void);
+
 #ifdef CONFIG_PPC64
 #define REG"%.16lx"
 #else
@@ -283,10 +286,38 @@ Commands:\n\
 "  U   show uptime information\n"
 "  ?   help\n"
 "  # n limit output to n lines per page (for dp, dpa, dl)\n"
-"  zr  reboot\n\
-  zh   halt\n"
+"  zr  reboot\n"
+"  zh  halt\n"
 ;
 
+#ifdef CONFIG_SECURITY
+static bool xmon_is_locked_down(void)
+{
+   static bool lockdown;
+
+   if (!lockdown) {
+   lockdown = !!security_locked_down(LOCKDOWN_XMON_RW);
+   if (lockdown) {
+   printf("xmon: Disabled due to kernel lockdown\n");
+   xmon_is_ro = true;
+   }
+   }
+
+   if (!xmon_is_ro) {
+   xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR);
+   if (xmon_is_ro)
+   printf("xmon: Read-only due to kernel lockdown\n");
+   }
+
+   return lockdown;
+}
+#else /* CONFIG_SECURITY */
+static inline bool xmon_is_locked_down(void)
+{
+   return false;
+}
+#endif
+
 static struct pt_regs *xmon_regs;
 
 static inline void sync(void)
@@ -438,7 +469,10 @@ static bool wait_for_other_cpus(int ncpus)
 
return false;
 }
-#endif /* CONFIG_SMP */
+#else /* CONFIG_SMP */
+static inline void get_output_lock(void) {}
+static inline void release_output_lock(void) {}
+#endif
 
 static inline int unrecoverable_excp(struct pt_regs *regs)
 {
@@ -455,6 +489,7 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
int cmd = 0;
struct bpt *bp;
long recurse_jmp[JMP_BUF_LEN];
+   bool locked_down;
unsigned long offset;
unsigned long flags;
 #ifdef CONFIG_SMP
@@ -465,6 +500,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
local_irq_save(flags);
hard_irq_disable();
 
+   locked_down = xmon_is_locked_down();
+
tracing_enabled = tracing_is_on();
tracing_off();
 
@@ -516,7 +553,8 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
 
if (!fromipi) {
get_output_lock();
-   excprint(regs);
+   if (!locked_down)
+   excprint(regs);
if (bp) {
printf("cpu 0x%x stopped at breakpoint 0x%tx (",
   cpu, BP_NUM(bp));
@@ -568,10 +606,14 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
}
remove_bpts();
disable_surveillance();
-   /* for breakpoint or single step, print the current instr. */
-   if (bp || TRAP(regs) == 0xd00)
-   ppc_inst_dump(regs->nip, 1, 0);
-   printf("enter ? for help\n");
+
+   if (!locked_down) {
+   /* for breakpoint or single step, print curr insn */
+   if (bp || TRAP(regs) == 0xd00)
+   ppc_inst_dump(regs->nip, 1, 0);
+   printf("enter ? for help\n");
+   }
+
mb();
xmon_gate = 1;
barrier();
@@ -595,8 +637,9 @@ static int xmon_core(struct pt_regs *regs, int fromipi)
spin_cpu_relax();
touch_nmi_watchdog();
} else {
-   cmd = cmds(regs);
-   if (cmd != 0) {
+  

[PATCH v7 0/2] Restrict xmon when kernel is locked down

2019-09-06 Thread Christopher M. Riedl
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.

Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Since this can occur
dynamically, there may be pre-existing, active breakpoints in xmon when
transitioning into read-only mode. These breakpoints will still trigger,
so allow them to be listed and cleared using xmon.

Changes since v6:
 - Add lockdown check in sysrq-trigger to prevent entry into xmon_core
 - Add lockdown check during init xmon setup for the case when booting
   with compile-time or cmdline lockdown=confidentiality

Changes since v5:
 - Do not spam print messages when attempting to enter xmon when
   lockdown=confidentiality

Changes since v4:
 - Move lockdown state checks into xmon_core
 - Allow clearing of breakpoints in xmon read-only mode
 - Test simple scenarios (combinations of xmon and lockdown cmdline
   options, setting breakpoints and changing lockdown state, etc) in
   QEMU and on an actual POWER8 VM
 - Rebase onto security/next-lockdown
   b602614a81078bf29c82b2671bb96a63488f68d6

Changes since v3:
 - Allow active breakpoints to be shown/listed in read-only mode

Changes since v2:
 - Rebased onto v36 of https://patchwork.kernel.org/cover/11049461/
   (based on: f632a8170a6b667ee4e3f552087588f0fe13c4bb)
 - Do not clear existing breakpoints when transitioning from
   lockdown=none to lockdown=integrity
 - Remove line continuation and dangling quote (confuses checkpatch.pl)
   from the xmon command help/usage string

Christopher M. Riedl (2):
  powerpc/xmon: Allow listing and clearing breakpoints in read-only mode
  powerpc/xmon: Restrict when kernel is locked down

 arch/powerpc/xmon/xmon.c | 119 +++
 include/linux/security.h |   2 +
 security/lockdown/lockdown.c |   2 +
 3 files changed, 97 insertions(+), 26 deletions(-)

-- 
2.23.0



Re: [PATCH v8 5/6] powerpc/code-patching: Use temporary mm for Radix MMU

2022-10-24 Thread Christopher M. Riedl
On Mon Oct 24, 2022 at 12:17 AM CDT, Benjamin Gray wrote:
> On Mon, 2022-10-24 at 14:45 +1100, Russell Currey wrote:
> > On Fri, 2022-10-21 at 16:22 +1100, Benjamin Gray wrote:
> > > From: "Christopher M. Riedl" 
> > >

-%<--

> > >
> > > ---
> >
> > Is the section following the --- your addendum to Chris' patch?  That
> > cuts it off from git, including your signoff.  It'd be better to have
> > it together as one commit message and note the bits you contributed
> > below the --- after your signoff.
> >
> > Commits where you're modifying someone else's previous work should
> > include their signoff above yours, as well.
>
> Addendum to his wording, to break it off from the "From..." section
> (which is me splicing together his comments from previous patches with
> some minor changes to account for the patch changes). I found out
> earlier today that Git will treat it as a comment :(
>
> I'll add the signed off by back, I wasn't sure whether to leave it
> there after making changes (same in patch 2).
>  

This commit has lots of my words so should probably keep the sign-off - if only
to guarantee that blame is properly directed at me for any nonsense therein ^^.

Patch 2 probably doesn't need my sign-off any more - iirc, I actually defended
the BUG_ON()s (which are WARN_ON()s now) at some point.


Re: [PATCH v3 4/6] powerpc: Introduce temporary mm

2020-09-06 Thread Christopher M. Riedl
On Thu Aug 27, 2020 at 11:15 AM CDT, Jann Horn wrote:
> On Thu, Aug 27, 2020 at 7:24 AM Christopher M. Riedl 
> wrote:
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. A side benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> >
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> [...]
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> [...]
> > @@ -44,6 +45,70 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
> > ppc_inst instr)
> >  }
> >
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +struct temp_mm {
> > +   struct mm_struct *temp;
> > +   struct mm_struct *prev;
> > +   bool is_kernel_thread;
> > +   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> > +};
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct 
> > *mm)
> > +{
> > +   temp_mm->temp = mm;
> > +   temp_mm->prev = NULL;
> > +   temp_mm->is_kernel_thread = false;
> > +   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   temp_mm->is_kernel_thread = current->mm == NULL;
>
> (That's a somewhat misleading variable name - kernel threads can have
> a non-NULL ->mm, too.)
>

Oh, I didn't know that; in that case this is not a good name. I am
considering some changes (based on your comments about current->mm
below) which would make this variable superfluous.

> > +   if (temp_mm->is_kernel_thread)
> > +   temp_mm->prev = current->active_mm;
> > +   else
> > +   temp_mm->prev = current->mm;
>
> Why the branch? Shouldn't current->active_mm work in both cases?
>
>

Yes you are correct.
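
Something like this for v4, then - just a sketch against the code quoted
above, and leaving aside the ->mm reassignment which is a separate question
below:

static inline void use_temporary_mm(struct temp_mm *temp_mm)
{
	lockdep_assert_irqs_disabled();

	/*
	 * current->active_mm is valid for user tasks and kernel threads
	 * alike, so there is no need to branch on current->mm here.
	 */
	temp_mm->prev = current->active_mm;

	switch_mm_irqs_off(NULL, temp_mm->temp, current);

	/* breakpoint save/clear unchanged from v3 */
	...
}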

> > +   /*
> > +* Hash requires a non-NULL current->mm to allocate a userspace 
> > address
> > +* when handling a page fault. Does not appear to hurt in Radix 
> > either.
> > +*/
> > +   current->mm = temp_mm->temp;
>
> This looks dangerous to me. There are various places that attempt to
> find all userspace tasks that use a given mm by iterating through all
> tasks on the system and comparing each task's ->mm pointer to
> current's. Things like current_is_single_threaded() as part of various
> security checks, mm_update_next_owner(), zap_threads(), and so on. So
> if this is reachable from userspace task context (which I think it
> is?), I don't think we're allowed to switch out the ->mm pointer here.
>
>

Thanks for pointing this out! I took a step back and looked at this
again in more detail. The only reason for reassigning the ->mm pointer
is that, when patching, we need to hash the page and allocate an SLB
entry with the hash MMU, and that codepath includes a check to ensure
that ->mm is not NULL. Overwriting ->mm temporarily and restoring it is
pretty crappy in retrospect. I _think_ a better approach is to just call
the hashing and SLB allocation functions from `map_patch` directly - this
both removes the need to overwrite ->mm (since those functions take an mm
parameter) and avoids taking two exceptions when doing the actual
patching.

This works fine on Power9 and Power8 at least, but it needs some testing
on PPC32 before I can send a v4.
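
The rough shape I have in mind for map_patch() is something like the sketch
below. The two helper names are placeholders - I still need to check exactly
what the hash MMU code exposes for pre-faulting with an explicit mm - so
don't read them as the real API:

	/* inside map_patch(), after mapping patching_addr in patching_mm: */
	if (!radix_enabled()) {
		/*
		 * Pre-fault the patching address in the temporary mm so the
		 * hash fault path never needs to look at current->mm.
		 * slb_prefault()/hash_prefault() are placeholder names; the
		 * real SLB/hash helpers would be called with the mm passed
		 * explicitly.
		 */
		slb_prefault(patching_mm, patching_addr);
		hash_prefault(patching_mm, patching_addr);
	}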

> > +   switch_mm_irqs_off(NULL, temp_mm->temp, current);
>
> switch_mm_irqs_off() calls switch_mmu_context(), which in the nohash
> implementation increments next->context.active and decrements
> prev->context.active if prev is non-NULL, right? So this would
> increase temp_mm->temp->context.active...
>
> > +   if (ppc_breakpoint_available()) {
> > +   struct arch_hw_breakpoint null_brk = {0};
> > +   int i = 0;
> > +
> > +   for (; i < nr_wp_slots(); ++i) {
> > +   __get_breakpoint(i, &temp_mm->brk[i]);
> > +   if (temp_mm->brk[i].type != 0)
> > +   __set_breakpoint(i, &null_brk);
> > +   }
> > +   }
> > +}
> > +
> > +static in

Re: [PATCH v2 25/25] powerpc/signal32: Transform save_user_regs() and save_tm_user_regs() in 'unsafe' version

2020-09-28 Thread Christopher M. Riedl
On Tue Aug 18, 2020 at 12:19 PM CDT, Christophe Leroy wrote:
> Change those two functions to be used within a user access block.
>
> For that, change save_general_regs() to an unsafe_save_general_regs(),
> then replace all user accesses by unsafe_ versions.
>
> This series leads to a reduction from 2.55s to 1.73s of
> the system CPU time with the following microbench app
> on an mpc832x with KUAP (approx 32%)
>
> Without KUAP, the difference is in the noise.
>
> void sigusr1(int sig) { }
>
> int main(int argc, char **argv)
> {
> int i = 10;
>
> signal(SIGUSR1, sigusr1);
> for (;i--;)
> raise(SIGUSR1);
> exit(0);
> }
>
> An additional 0.10s reduction is achieved by removing
> CONFIG_PPC_FPU, as the mpc832x has no FPU.
>
> A bit less spectacular on an 8xx as KUAP is less heavy, prior to
> the series (with KUAP) it ran in 8.10 ms. Once applies the removal
> of FPU regs handling, we get 7.05s. With the full series, we get 6.9s.
> If artificially re-activating FPU regs handling with the full series,
> we get 7.6s.
>
> So for the 8xx, the removal of the FPU regs copy is what makes the
> difference, but the rework of handle_signal also have a benefit.
>
> Same as above, without KUAP the difference is in the noise.
>
> Signed-off-by: Christophe Leroy 
> ---
> arch/powerpc/kernel/signal_32.c | 224 
> 1 file changed, 111 insertions(+), 113 deletions(-)
>
> diff --git a/arch/powerpc/kernel/signal_32.c
> b/arch/powerpc/kernel/signal_32.c
> index 86539a4e0514..f795fe0240a1 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -93,8 +93,8 @@ static inline int get_sigset_t(sigset_t *set,
> #define to_user_ptr(p) ptr_to_compat(p)
> #define from_user_ptr(p) compat_ptr(p)
>  
> -static inline int save_general_regs(struct pt_regs *regs,
> - struct mcontext __user *frame)
> +static __always_inline int
> +save_general_regs_unsafe(struct pt_regs *regs, struct mcontext __user
> *frame)
> {
> elf_greg_t64 *gregs = (elf_greg_t64 *)regs;
> int val, i;
> @@ -108,10 +108,12 @@ static inline int save_general_regs(struct pt_regs
> *regs,
> else
> val = gregs[i];
>  
> - if (__put_user(val, &frame->mc_gregs[i]))
> - return -EFAULT;
> + unsafe_put_user(val, &frame->mc_gregs[i], failed);
> }
> return 0;
> +
> +failed:
> + return 1;
> }
>  
> static inline int restore_general_regs(struct pt_regs *regs,
> @@ -148,11 +150,15 @@ static inline int get_sigset_t(sigset_t *set,
> const sigset_t __user *uset)
> #define to_user_ptr(p) ((unsigned long)(p))
> #define from_user_ptr(p) ((void __user *)(p))
>  
> -static inline int save_general_regs(struct pt_regs *regs,
> - struct mcontext __user *frame)
> +static __always_inline int
> +save_general_regs_unsafe(struct pt_regs *regs, struct mcontext __user
> *frame)
> {
> WARN_ON(!FULL_REGS(regs));
> - return __copy_to_user(&frame->mc_gregs, regs, GP_REGS_SIZE);
> + unsafe_copy_to_user(&frame->mc_gregs, regs, GP_REGS_SIZE, failed);
> + return 0;
> +
> +failed:
> + return 1;
> }
>  
> static inline int restore_general_regs(struct pt_regs *regs,
> @@ -170,6 +176,11 @@ static inline int restore_general_regs(struct
> pt_regs *regs,
> }
> #endif
>  
> +#define unsafe_save_general_regs(regs, frame, label) do { \
> + if (save_general_regs_unsafe(regs, frame)) \

Minor nitpick (sorry): this naming seems a bit strange to me; in x86 it
is "__unsafe_" as a prefix instead of "_unsafe" as a suffix. The prefix
sounds a bit better to me, what do you think? Unless there is some
convention here I am not aware of, apart from "unsafe_" implying that a
goto label is used for errors.

> + goto label; \
> +} while (0)
> +
> /*
> * When we have signals to deliver, we set up on the
> * user stack, going down from the original stack pointer:
> @@ -249,21 +260,19 @@ static void prepare_save_user_regs(int
> ctx_has_vsx_region)
> #endif
> }
>  
> -static int save_user_regs(struct pt_regs *regs, struct mcontext __user
> *frame,
> - struct mcontext __user *tm_frame, int ctx_has_vsx_region)
> +static int save_user_regs_unsafe(struct pt_regs *regs, struct mcontext
> __user *frame,
> + struct mcontext __user *tm_frame, int ctx_has_vsx_region)
> {
> unsigned long msr = regs->msr;
>  
> /* save general registers */
> - if (save_general_regs(regs, frame))
> - return 1;
> + unsafe_save_general_regs(regs, frame, failed);
>  
> #ifdef CONFIG_ALTIVEC
> /* save altivec registers */
> if (current->thread.used_vr) {
> - if (__copy_to_user(&frame->mc_vregs, &current->thread.vr_state,
> - ELF_NVRREG * sizeof(vector128)))
> - return 1;
> + unsafe_copy_to_user(&frame->mc_vregs, &current->thread.vr_state,
> + ELF_NVRREG * sizeof(vector128), failed);
> /* set MSR_VEC in the saved MSR value to indicate that
> frame->mc_vregs contains valid data */
> msr |= MSR_VEC;
> @@ -276,11 +285,10 @@ static int save_user_regs(struct pt_regs *regs,
> struct mcontext __user *frame,
> * most significant bits of that same vector. --BenH
> * Note that the current VRSAVE value is in the SPR at this point.
> */
> - if (__put_user(c

Re: [PATCH v2 23/25] powerpc/signal: Create 'unsafe' versions of copy_[ck][fpr/vsx]_to_user()

2020-09-28 Thread Christopher M. Riedl
On Tue Aug 18, 2020 at 12:19 PM CDT, Christophe Leroy wrote:
> For the non VSX version, that's trivial. Just use unsafe_copy_to_user()
> instead of __copy_to_user().
>
> For the VSX version, remove the intermediate step through a buffer and
> use unsafe_put_user() directly. This generates a far smaller code which
> is acceptable to inline, see below:
>
> Standard VSX version:
>
>  <.copy_fpr_to_user>:
> 0: 7c 08 02 a6 mflr r0
> 4: fb e1 ff f8 std r31,-8(r1)
> 8: 39 00 00 20 li r8,32
> c: 39 24 0b 80 addi r9,r4,2944
> 10: 7d 09 03 a6 mtctr r8
> 14: f8 01 00 10 std r0,16(r1)
> 18: f8 21 fe 71 stdu r1,-400(r1)
> 1c: 39 41 00 68 addi r10,r1,104
> 20: e9 09 00 00 ld r8,0(r9)
> 24: 39 4a 00 08 addi r10,r10,8
> 28: 39 29 00 10 addi r9,r9,16
> 2c: f9 0a 00 00 std r8,0(r10)
> 30: 42 00 ff f0 bdnz 20 <.copy_fpr_to_user+0x20>
> 34: e9 24 0d 80 ld r9,3456(r4)
> 38: 3d 42 00 00 addis r10,r2,0
> 3a: R_PPC64_TOC16_HA .toc
> 3c: eb ea 00 00 ld r31,0(r10)
> 3e: R_PPC64_TOC16_LO_DS .toc
> 40: f9 21 01 70 std r9,368(r1)
> 44: e9 3f 00 00 ld r9,0(r31)
> 48: 81 29 00 20 lwz r9,32(r9)
> 4c: 2f 89 00 00 cmpwi cr7,r9,0
> 50: 40 9c 00 18 bge cr7,68 <.copy_fpr_to_user+0x68>
> 54: 4c 00 01 2c isync
> 58: 3d 20 40 00 lis r9,16384
> 5c: 79 29 07 c6 rldicr r9,r9,32,31
> 60: 7d 3d 03 a6 mtspr 29,r9
> 64: 4c 00 01 2c isync
> 68: 38 a0 01 08 li r5,264
> 6c: 38 81 00 70 addi r4,r1,112
> 70: 48 00 00 01 bl 70 <.copy_fpr_to_user+0x70>
> 70: R_PPC64_REL24 .__copy_tofrom_user
> 74: 60 00 00 00 nop
> 78: e9 3f 00 00 ld r9,0(r31)
> 7c: 81 29 00 20 lwz r9,32(r9)
> 80: 2f 89 00 00 cmpwi cr7,r9,0
> 84: 40 9c 00 18 bge cr7,9c <.copy_fpr_to_user+0x9c>
> 88: 4c 00 01 2c isync
> 8c: 39 20 ff ff li r9,-1
> 90: 79 29 00 44 rldicr r9,r9,0,1
> 94: 7d 3d 03 a6 mtspr 29,r9
> 98: 4c 00 01 2c isync
> 9c: 38 21 01 90 addi r1,r1,400
> a0: e8 01 00 10 ld r0,16(r1)
> a4: eb e1 ff f8 ld r31,-8(r1)
> a8: 7c 08 03 a6 mtlr r0
> ac: 4e 80 00 20 blr
>
> 'unsafe' simulated VSX version (The ... are only nops) using
> unsafe_copy_fpr_to_user() macro:
>
> unsigned long copy_fpr_to_user(void __user *to,
> struct task_struct *task)
> {
> unsafe_copy_fpr_to_user(to, task, failed);
> return 0;
> failed:
> return 1;
> }
>
>  <.copy_fpr_to_user>:
> 0: 39 00 00 20 li r8,32
> 4: 39 44 0b 80 addi r10,r4,2944
> 8: 7d 09 03 a6 mtctr r8
> c: 7c 69 1b 78 mr r9,r3
> ...
> 20: e9 0a 00 00 ld r8,0(r10)
> 24: f9 09 00 00 std r8,0(r9)
> 28: 39 4a 00 10 addi r10,r10,16
> 2c: 39 29 00 08 addi r9,r9,8
> 30: 42 00 ff f0 bdnz 20 <.copy_fpr_to_user+0x20>
> 34: e9 24 0d 80 ld r9,3456(r4)
> 38: f9 23 01 00 std r9,256(r3)
> 3c: 38 60 00 00 li r3,0
> 40: 4e 80 00 20 blr
> ...
> 50: 38 60 00 01 li r3,1
> 54: 4e 80 00 20 blr
>
> Signed-off-by: Christophe Leroy 
> ---
> arch/powerpc/kernel/signal.h | 53 
> 1 file changed, 53 insertions(+)
>
> diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> index f610cfafa478..2559a681536e 100644
> --- a/arch/powerpc/kernel/signal.h
> +++ b/arch/powerpc/kernel/signal.h
> @@ -32,7 +32,54 @@ unsigned long copy_fpr_to_user(void __user *to,
> struct task_struct *task);
> unsigned long copy_ckfpr_to_user(void __user *to, struct task_struct
> *task);
> unsigned long copy_fpr_from_user(struct task_struct *task, void __user
> *from);
> unsigned long copy_ckfpr_from_user(struct task_struct *task, void __user
> *from);
> +
> +#define unsafe_copy_fpr_to_user(to, task, label) do { \
> + struct task_struct *__t = task; \
> + u64 __user *buf = (u64 __user *)to; \
> + int i; \
> + \
> + for (i = 0; i < ELF_NFPREG - 1 ; i++) \
> + unsafe_put_user(__t->thread.TS_FPR(i), &buf[i], label); \
> + unsafe_put_user(__t->thread.fp_state.fpscr, &buf[i], label); \
> +} while (0)
> +

I've been working on the PPC64 side of this "unsafe" rework using this
series as a basis. One question here: what is the benefit of
re-implementing this logic in macros (similarly for the other copy_*
functions below)?

I am considering a "__unsafe_copy_*" implementation in signal.c for
each (just the original implementation using the "unsafe_" variants of
the uaccess helpers) which gets called by the "safe" functions with the
appropriate user_*_access_begin()/user_*_access_end(). Something like
(pseudo-ish code):

/* signal.c */
unsigned long __unsafe_copy_fpr_to_user(...)
{
...
unsafe_copy_to_user(..., bad);
return 0;
bad:
return 1; /* -EFAULT? */
}

unsigned long copy_fpr_to_user(...)
{
unsigned long err;
if (!user_write_access_begin(...))
return 1; /* -EFAULT? */

err = __unsafe_copy_fpr_to_user(...);

user_write_access_end();
return err;
}

/* signal.h */
unsigned long __unsafe_copy_fpr_to_user(...);
#define unsafe_copy_fpr_to_user(..., label) \

[PATCH 0/8] Improve signal performance on PPC64 with KUAP

2020-10-15 Thread Christopher M. Riedl
As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

There is existing work done on x86 and by Christophe Leroy for PPC32 to
instead open up user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix systems.

This series applies on top of Christophe Leroy's work for PPC32 [1] (I'm
sure patchwork won't be too happy about that).

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto, they still allow for
avoiding the repeated uaccess switches. The third patch adds 'notrace'
to any functions expected to be called in a uaccess block context.
Normally functions called in such a context should be inlined, but this
is not feasible everywhere. Marking them 'notrace' should provide _some_
protection against leaving the user access window open.
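
In practice this just means the out-of-line helpers carry the notrace
attribute; for example (sketch of the form used later in this series):

	/*
	 * Must not be traced: a traced callee could fault or leave the
	 * user access window open.
	 */
	static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
			struct task_struct *tsk, int signr, sigset_t *set,
			unsigned long handler, int ctx_has_vsx_region);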

The next three patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last two patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.
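
For anyone not familiar with the uaccess-block API, the general shape of the
conversion looks roughly like this - an illustrative fragment only, not a hunk
from this series:

	/* Before: every helper opens and closes user access (mtspr each time) */
	err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
	err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);

	/* After: one open/close around the whole block, goto label on fault */
	if (!user_write_access_begin(sc, sizeof(*sc)))
		return -EFAULT;
	unsafe_put_user(msr, &sc->gp_regs[PT_MSR], failed);
	unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], failed);
	user_write_access_end();
	return 0;

failed:
	user_write_access_end();
	return -EFAULT;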

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from a P9 Blackbird system. Note
that currently hash does not support KUAP and is therefore used as the
"baseline" comparison. Bigger numbers are better:

signal1_threads -t1 -s10

| branch  | hash   | radix  |
| --- | -- | -- |
| linuxppc/next   | 289014 | 158408 |
| unsafe-signal64 | 298506 | 253053 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

Christopher M. Riedl (5):
  powerpc/uaccess: Add unsafe_copy_from_user
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc: Mark functions called inside uaccess blocks w/ 'notrace'
  powerpc/signal64: Replace setup_sigcontext() w/
unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
unsafe_restore_sigcontext()

Daniel Axtens (3):
  powerpc/signal64: Replace setup_trampoline() w/
unsafe_setup_trampoline()
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/uaccess.h |  28 ++--
 arch/powerpc/kernel/process.c  |  20 +--
 arch/powerpc/kernel/signal.h   |  33 +
 arch/powerpc/kernel/signal_64.c| 216 +
 arch/powerpc/mm/mem.c  |   4 +-
 5 files changed, 194 insertions(+), 107 deletions(-)

-- 
2.28.0



[PATCH 6/8] powerpc/signal64: Replace setup_trampoline() w/ unsafe_setup_trampoline()

2020-10-15 Thread Christopher M. Riedl
From: Daniel Axtens 

Previously setup_trampoline() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_trampoline() to assume that a userspace write access
window is open. Replace all uaccess functions with their 'unsafe'
versions to avoid the repeated uaccess switches.

Signed-off-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index bd92064e5576..6d4f7a5c4fbf 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -600,30 +600,33 @@ static long restore_tm_sigcontexts(struct task_struct 
*tsk,
 /*
  * Setup the trampoline code on the stack
  */
-static long setup_trampoline(unsigned int syscall, unsigned int __user *tramp)
+#define unsafe_setup_trampoline(syscall, tramp, e) \
+   unsafe_op_wrap(__unsafe_setup_trampoline(syscall, tramp), e)
+static long notrace __unsafe_setup_trampoline(unsigned int syscall,
+   unsigned int __user *tramp)
 {
int i;
-   long err = 0;
 
/* bctrl # call the handler */
-   err |= __put_user(PPC_INST_BCTRL, &tramp[0]);
+   unsafe_put_user(PPC_INST_BCTRL, &tramp[0], err);
/* addi r1, r1, __SIGNAL_FRAMESIZE  # Pop the dummy stackframe */
-   err |= __put_user(PPC_INST_ADDI | __PPC_RT(R1) | __PPC_RA(R1) |
- (__SIGNAL_FRAMESIZE & 0xffff), &tramp[1]);
+   unsafe_put_user(PPC_INST_ADDI | __PPC_RT(R1) | __PPC_RA(R1) |
+ (__SIGNAL_FRAMESIZE & 0xffff), &tramp[1], err);
/* li r0, __NR_[rt_]sigreturn| */
-   err |= __put_user(PPC_INST_ADDI | (syscall & 0xffff), &tramp[2]);
+   unsafe_put_user(PPC_INST_ADDI | (syscall & 0xffff), &tramp[2], err);
/* sc */
-   err |= __put_user(PPC_INST_SC, &tramp[3]);
+   unsafe_put_user(PPC_INST_SC, &tramp[3], err);
 
/* Minimal traceback info */
for (i=TRAMP_TRACEBACK; i < TRAMP_SIZE ;i++)
-   err |= __put_user(0, &tramp[i]);
+   unsafe_put_user(0, &tramp[i], err);
 
-   if (!err)
-   flush_icache_range((unsigned long) &tramp[0],
-  (unsigned long) &tramp[TRAMP_SIZE]);
+   flush_icache_range((unsigned long)&tramp[0],
+  (unsigned long)&tramp[TRAMP_SIZE]);
 
-   return err;
+   return 0;
+err:
+   return 1;
 }
 
 /*
@@ -888,7 +891,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
if (vdso64_rt_sigtramp && tsk->mm->context.vdso_base) {
regs->nip = tsk->mm->context.vdso_base + vdso64_rt_sigtramp;
} else {
-   err |= setup_trampoline(__NR_rt_sigreturn, &frame->tramp[0]);
+   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
+   return -EFAULT;
+   err |= __unsafe_setup_trampoline(__NR_rt_sigreturn, 
&frame->tramp[0]);
+   user_write_access_end();
if (err)
goto badframe;
regs->nip = (unsigned long) &frame->tramp[0];
-- 
2.28.0



[PATCH 5/8] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2020-10-15 Thread Christopher M. Riedl
Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open. Replace all uaccess functions with their 'unsafe'
versions which avoid the repeated uaccess switches.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 68 -
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 26934ceeb925..bd92064e5576 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -318,14 +318,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
- struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
+   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
sigset_t *set,
+   int sig, struct sigcontext 
__user *sc)
 {
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs;
 #endif
-   unsigned long err = 0;
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
@@ -340,27 +340,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
save_r13 = regs->gpr[13];
 
/* copy the GPRs */
-   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
+ efault_out);
+   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
/* get MSR separately, transfer the LE bit if doing signal return */
-   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
if (sig)
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
/* Don't allow userspace to set SOFTE */
set_trap_norestart(regs);
-   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
if (!sig)
regs->gpr[13] = save_r13;
if (set != NULL)
-   err |=  __get_user(set->sig[0], &sc->oldmask);
+   unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
/*
 * Force reload of FP/VEC.
@@ -370,29 +371,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __get_user(v_regs, &sc->v_regs);
-   if (err)
-   return err;
+   unsafe_get_user(v_regs, &sc->v_regs, efault_out);
if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
return -EFAULT;
/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-   err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-   33 * sizeof(vector128));
+   unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
+ 33 * sizeof(vector128), efault_out);
tsk->thread.used_vr = true;
} else if (

[PATCH 4/8] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2020-10-15 Thread Christopher M. Riedl
Previously setup_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_sigcontext() to assume that a userspace write access window
is open. Replace all uaccess functions with their 'unsafe' versions
which avoid the repeated uaccess switches.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 71 -
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 7df088b9ad0f..26934ceeb925 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -83,9 +83,13 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
sigcontext __user *sc)
  * Set up the sigcontext for the signal frame.
  */
 
-static long setup_sigcontext(struct sigcontext __user *sc,
-   struct task_struct *tsk, int signr, sigset_t *set,
-   unsigned long handler, int ctx_has_vsx_region)
+#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler,  \
+   ctx_has_vsx_region, e)  \
+   unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set,   \
+   handler, ctx_has_vsx_region), e)
+static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
+   struct task_struct *tsk, int signr, 
sigset_t *set,
+   unsigned long handler, int 
ctx_has_vsx_region)
 {
/* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
 * process never used altivec yet (MSR_VEC is zero in pt_regs of
@@ -101,21 +105,20 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
-   long err = 0;
/* Force usr to alway see softe as 1 (interrupts enabled) */
unsigned long softe = 0x1;
 
BUG_ON(tsk != current);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __put_user(v_regs, &sc->v_regs);
+   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
 
/* save altivec registers */
if (tsk->thread.used_vr) {
flush_altivec_to_thread(tsk);
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
-   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
- 33 * sizeof(vector128));
+   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
+   33 * sizeof(vector128), efault_out);
/* set MSR_VEC in the MSR value in the frame to indicate that 
sc->v_reg)
 * contains valid data.
 */
@@ -130,13 +133,13 @@ static long setup_sigcontext(struct sigcontext __user *sc,
tsk->thread.vrsave = vrsave;
}
 
-   err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+   unsafe_put_user(vrsave, (u32 __user *)&v_regs[33], efault_out);
 #else /* CONFIG_ALTIVEC */
-   err |= __put_user(0, &sc->v_regs);
+   unsafe_put_user(0, &sc->v_regs, efault_out);
 #endif /* CONFIG_ALTIVEC */
flush_fp_to_thread(tsk);
/* copy fpr regs and fpscr */
-   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
+   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
 
/*
 * Clear the MSR VSX bit to indicate there is no valid state attached
@@ -152,24 +155,27 @@ static long setup_sigcontext(struct sigcontext __user *sc,
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
flush_vsx_to_thread(tsk);
v_regs += ELF_NVRREG;
-   err |= copy_vsx_to_user(v_regs, tsk);
+   unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);
/* set MSR_VSX in the MSR value in the frame to
 * indicate that sc->vs_reg) contains valid data.
 */
msr |= MSR_VSX;
}
 #endif /* CONFIG_VSX */
-   err |= __put_user(&sc->gp_regs, &sc->regs);
+   unsafe_put_user(&sc->gp_regs, &sc->regs, efault_out);
WARN_ON(!FULL_REGS(regs));
-   err |= __copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE);
-   err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
-   err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);
-   err |= __put_user(signr, &sc->signal);
-   err |= __put_user(handler, &sc->handler);
+   unsafe_copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE, efault_out);
+   unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
+   unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);
+   unsafe_put_user(signr, &sc->signal, efault_out);
+   uns

[PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2020-10-15 Thread Christopher M. Riedl
Reuse the "safe" implementation from signal.c except for calling
unsafe_copy_from_user() to copy into a local buffer. Unlike the
unsafe_copy_{vsx,fpr}_to_user() functions, the "copy from" functions
cannot use unsafe_get_user() directly to bypass the local buffer since
doing so significantly reduces signal handling performance.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index 2559a681536e..e9aaeac0da37 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, 
void __user *from);
&buf[i], label);\
 } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *__f = (u64 __user *)from;   \
+   u64 buf[ELF_NFPREG];\
+   int i;  \
+   \
+   unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\
+   label); \
+   for (i = 0; i < ELF_NFPREG - 1; i++)\
+   __t->thread.TS_FPR(i) = buf[i]; \
+   __t->thread.fp_state.fpscr = buf[i];\
+} while (0)
+
+#define unsafe_copy_vsx_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *__f = (u64 __user *)from;   \
+   u64 buf[ELF_NVSRHALFREG];   \
+   int i;  \
+   \
+   unsafe_copy_from_user(buf, __f, \
+   ELF_NVSRHALFREG * sizeof(double),   \
+   label); \
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
+   __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];  \
+} while (0)
+
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define unsafe_copy_ckfpr_to_user(to, task, label) do {\
struct task_struct *__t = task; \
@@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
*task, void __user *from);
unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
ELF_NFPREG * sizeof(double), label)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   \
+   unsafe_copy_from_user((task)->thread.fp_state.fpr, from, \
+   ELF_NFPREG * sizeof(double), label)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
@@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void __user 
*from)
 #else
 #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
-- 
2.28.0



[PATCH 8/8] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2020-10-15 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 3b97e3681a8f..0f4ff7a5bfc1 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -779,18 +779,22 @@ SYSCALL_DEFINE0(rt_sigreturn)
 */
regs->msr &= ~MSR_TS_MASK;
 
-   if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
+   if (!user_read_access_begin(uc, sizeof(*uc)))
goto badframe;
+
+   unsafe_get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR], badframe_block);
+
if (MSR_TM_ACTIVE(msr)) {
/* We recheckpoint on return. */
struct ucontext __user *uc_transact;
 
/* Trying to start TM on non TM system */
if (!cpu_has_feature(CPU_FTR_TM))
-   goto badframe;
+   goto badframe_block;
+
+   unsafe_get_user(uc_transact, &uc->uc_link, badframe_block);
+   user_read_access_end();
 
-   if (__get_user(uc_transact, &uc->uc_link))
-   goto badframe;
if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
   &uc_transact->uc_mcontext))
goto badframe;
@@ -810,12 +814,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
 * causing a TM bad thing.
 */
current->thread.regs->msr &= ~MSR_TS_MASK;
+
+#ifndef CONFIG_PPC_TRANSACTIONAL_MEM
if (!user_read_access_begin(uc, sizeof(*uc)))
-   return -EFAULT;
-   if (__unsafe_restore_sigcontext(current, NULL, 1, 
&uc->uc_mcontext)) {
-   user_read_access_end();
goto badframe;
-   }
+#endif
+   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+ badframe_block);
user_read_access_end();
}
 
@@ -825,6 +830,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
set_thread_flag(TIF_RESTOREALL);
return 0;
 
+badframe_block:
+   user_read_access_end();
 badframe:
signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.28.0



[PATCH 7/8] powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

2020-10-15 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

There is no 'unsafe' version of copy_siginfo_to_user, so move it
slightly to allow for a "longer" uaccess block.
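
The intended shape is roughly the sketch below (illustrative only; it
assumes the frame layout used elsewhere in this series). The write-access
window is closed first, and copy_siginfo_to_user() then manages its own
uaccess internally:

        user_write_access_end();

        /* copy_siginfo_to_user() opens/closes uaccess itself, so it is
         * called outside the user_write_access block.
         */
        if (copy_siginfo_to_user(&frame->info, &ksig->info))
                goto badframe;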

Signed-off-by: Daniel Axtens 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 54 -
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 6d4f7a5c4fbf..3b97e3681a8f 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -843,46 +843,42 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
*set,
/* Save the thread's msr before get_tm_stackpointer() changes it */
unsigned long msr = regs->msr;
 #endif
-
frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
-   if (!access_ok(frame, sizeof(*frame)))
+   if (!user_write_access_begin(frame, sizeof(*frame)))
goto badframe;
 
-   err |= __put_user(&frame->info, &frame->pinfo);
-   err |= __put_user(&frame->uc, &frame->puc);
-   err |= copy_siginfo_to_user(&frame->info, &ksig->info);
-   if (err)
-   goto badframe;
+   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
+   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
 
/* Create the ucontext.  */
-   err |= __put_user(0, &frame->uc.uc_flags);
-   err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
+   unsafe_put_user(0, &frame->uc.uc_flags, badframe_block);
+   unsafe_save_altstack(&frame->uc.uc_stack, regs->gpr[1], badframe_block);
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
if (MSR_TM_ACTIVE(msr)) {
/* The ucontext_t passed to userland points to the second
 * ucontext_t (for transactional state) with its uc_link ptr.
 */
-   err |= __put_user(&frame->uc_transact, &frame->uc.uc_link);
+   unsafe_put_user(&frame->uc_transact, &frame->uc.uc_link, 
badframe_block);
+   user_write_access_end();
err |= setup_tm_sigcontexts(&frame->uc.uc_mcontext,
&frame->uc_transact.uc_mcontext,
tsk, ksig->sig, NULL,
(unsigned 
long)ksig->ka.sa.sa_handler,
msr);
+   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
+   goto badframe;
+
} else
 #endif
{
-   err |= __put_user(0, &frame->uc.uc_link);
-
-   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
-   return -EFAULT;
-   err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
-   ksig->sig, NULL,
-   (unsigned 
long)ksig->ka.sa.sa_handler, 1);
-   user_write_access_end();
+   unsafe_put_user(0, &frame->uc.uc_link, badframe_block);
+   unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
+   NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
+   1, badframe_block);
}
-   err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-   if (err)
-   goto badframe;
+
+   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
badframe_block);
 
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
@@ -891,15 +887,17 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
*set,
if (vdso64_rt_sigtramp && tsk->mm->context.vdso_base) {
regs->nip = tsk->mm->context.vdso_base + vdso64_rt_sigtramp;
} else {
-   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
-   return -EFAULT;
-   err |= __unsafe_setup_trampoline(__NR_rt_sigreturn, 
&frame->tramp[0]);
-   user_write_access_end();
-   if (err)
-   goto badframe;
+   unsafe_setup_trampoline(__NR_rt_sigreturn, &frame->tramp[0],
+   badframe_block);
regs->nip = (unsigned long) &frame->tramp[0];
}
 
+   user_write_access_end();
+
+   /* Save the siginfo outside of the unsafe block. */
+   if (copy_siginfo_to_user(&frame

[PATCH 1/8] powerpc/uaccess: Add unsafe_copy_from_user

2020-10-15 Thread Christopher M. Riedl
Implement raw_copy_from_user_allowed() which assumes that userspace read
access is open. Use this new function to implement raw_copy_from_user().
Finally, wrap the new function to follow the usual "unsafe_" convention
of taking a label argument. The new raw_copy_from_user_allowed() calls
__copy_tofrom_user() internally, but this is still safe to call in user
access blocks formed with user_*_access_begin()/user_*_access_end()
since asm functions are not instrumented for tracing.
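
A minimal usage sketch (not part of this diff; the names follow the later
patches in this series): the helper is only valid between
user_read_access_begin() and user_read_access_end(), with faults routed to
a label:

        if (!user_read_access_begin(sc, sizeof(*sc)))
                return -EFAULT;

        unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr), efault_out);
        unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);

        user_read_access_end();
        return 0;

efault_out:
        user_read_access_end();
        return -EFAULT;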

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/uaccess.h | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 26781b044932..66940b4eb692 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -418,38 +418,45 @@ raw_copy_in_user(void __user *to, const void __user 
*from, unsigned long n)
 }
 #endif /* __powerpc64__ */
 
-static inline unsigned long raw_copy_from_user(void *to,
-   const void __user *from, unsigned long n)
+static inline unsigned long
+raw_copy_from_user_allowed(void *to, const void __user *from, unsigned long n)
 {
-   unsigned long ret;
if (__builtin_constant_p(n) && (n <= 8)) {
-   ret = 1;
+   unsigned long ret = 1;
 
switch (n) {
case 1:
barrier_nospec();
-   __get_user_size(*(u8 *)to, from, 1, ret);
+   __get_user_size_allowed(*(u8 *)to, from, 1, ret);
break;
case 2:
barrier_nospec();
-   __get_user_size(*(u16 *)to, from, 2, ret);
+   __get_user_size_allowed(*(u16 *)to, from, 2, ret);
break;
case 4:
barrier_nospec();
-   __get_user_size(*(u32 *)to, from, 4, ret);
+   __get_user_size_allowed(*(u32 *)to, from, 4, ret);
break;
case 8:
barrier_nospec();
-   __get_user_size(*(u64 *)to, from, 8, ret);
+   __get_user_size_allowed(*(u64 *)to, from, 8, ret);
break;
}
if (ret == 0)
return 0;
}
 
+   return __copy_tofrom_user((__force void __user *)to, from, n);
+}
+
+static inline unsigned long
+raw_copy_from_user(void *to, const void __user *from, unsigned long n)
+{
+   unsigned long ret;
+
barrier_nospec();
allow_read_from_user(from, n);
-   ret = __copy_tofrom_user((__force void __user *)to, from, n);
+   ret = raw_copy_from_user_allowed(to, from, n);
prevent_read_from_user(from, n);
return ret;
 }
@@ -571,6 +578,9 @@ user_write_access_begin(const void __user *ptr, size_t len)
 #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
 #define unsafe_put_user(x, p, e) __put_user_goto(x, p, e)
 
+#define unsafe_copy_from_user(d, s, l, e) \
+   unsafe_op_wrap(raw_copy_from_user_allowed(d, s, l), e)
+
 #define unsafe_copy_to_user(d, s, l, e) \
 do {   \
u8 __user *_dst = (u8 __user *)(d); \
-- 
2.28.0



[PATCH 3/8] powerpc: Mark functions called inside uaccess blocks w/ 'notrace'

2020-10-15 Thread Christopher M. Riedl
Functions called between user_*_access_begin() and user_*_access_end()
should be either inlined or marked 'notrace' to prevent leaving
userspace access exposed. Mark any such functions relevant to signal
handling so that subsequent patches can call them inside uaccess blocks.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/process.c | 20 ++--
 arch/powerpc/mm/mem.c |  4 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ba2c987b8403..bf5d9654bd2c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -84,7 +84,7 @@ extern unsigned long _get_SP(void);
  */
 bool tm_suspend_disabled __ro_after_init = false;
 
-static void check_if_tm_restore_required(struct task_struct *tsk)
+static void notrace check_if_tm_restore_required(struct task_struct *tsk)
 {
/*
 * If we are saving the current thread's registers, and the
@@ -151,7 +151,7 @@ void notrace __msr_check_and_clear(unsigned long bits)
 EXPORT_SYMBOL(__msr_check_and_clear);
 
 #ifdef CONFIG_PPC_FPU
-static void __giveup_fpu(struct task_struct *tsk)
+static void notrace __giveup_fpu(struct task_struct *tsk)
 {
unsigned long msr;
 
@@ -163,7 +163,7 @@ static void __giveup_fpu(struct task_struct *tsk)
tsk->thread.regs->msr = msr;
 }
 
-void giveup_fpu(struct task_struct *tsk)
+void notrace giveup_fpu(struct task_struct *tsk)
 {
check_if_tm_restore_required(tsk);
 
@@ -177,7 +177,7 @@ EXPORT_SYMBOL(giveup_fpu);
  * Make sure the floating-point register state in the
  * the thread_struct is up to date for task tsk.
  */
-void flush_fp_to_thread(struct task_struct *tsk)
+void notrace flush_fp_to_thread(struct task_struct *tsk)
 {
if (tsk->thread.regs) {
/*
@@ -234,7 +234,7 @@ static inline void __giveup_fpu(struct task_struct *tsk) { }
 #endif /* CONFIG_PPC_FPU */
 
 #ifdef CONFIG_ALTIVEC
-static void __giveup_altivec(struct task_struct *tsk)
+static void notrace __giveup_altivec(struct task_struct *tsk)
 {
unsigned long msr;
 
@@ -246,7 +246,7 @@ static void __giveup_altivec(struct task_struct *tsk)
tsk->thread.regs->msr = msr;
 }
 
-void giveup_altivec(struct task_struct *tsk)
+void notrace giveup_altivec(struct task_struct *tsk)
 {
check_if_tm_restore_required(tsk);
 
@@ -285,7 +285,7 @@ EXPORT_SYMBOL(enable_kernel_altivec);
  * Make sure the VMX/Altivec register state in the
  * the thread_struct is up to date for task tsk.
  */
-void flush_altivec_to_thread(struct task_struct *tsk)
+void notrace flush_altivec_to_thread(struct task_struct *tsk)
 {
if (tsk->thread.regs) {
preempt_disable();
@@ -300,7 +300,7 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread);
 #endif /* CONFIG_ALTIVEC */
 
 #ifdef CONFIG_VSX
-static void __giveup_vsx(struct task_struct *tsk)
+static void notrace __giveup_vsx(struct task_struct *tsk)
 {
unsigned long msr = tsk->thread.regs->msr;
 
@@ -317,7 +317,7 @@ static void __giveup_vsx(struct task_struct *tsk)
__giveup_altivec(tsk);
 }
 
-static void giveup_vsx(struct task_struct *tsk)
+static void notrace giveup_vsx(struct task_struct *tsk)
 {
check_if_tm_restore_required(tsk);
 
@@ -352,7 +352,7 @@ void enable_kernel_vsx(void)
 }
 EXPORT_SYMBOL(enable_kernel_vsx);
 
-void flush_vsx_to_thread(struct task_struct *tsk)
+void notrace flush_vsx_to_thread(struct task_struct *tsk)
 {
if (tsk->thread.regs) {
preempt_disable();
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ddc32cc1b6cf..da2345a2abc6 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -378,7 +378,7 @@ static inline bool flush_coherent_icache(unsigned long addr)
  * @start: the start address
  * @stop: the stop address (exclusive)
  */
-static void invalidate_icache_range(unsigned long start, unsigned long stop)
+static void notrace invalidate_icache_range(unsigned long start, unsigned long 
stop)
 {
unsigned long shift = l1_icache_shift();
unsigned long bytes = l1_icache_bytes();
@@ -402,7 +402,7 @@ static void invalidate_icache_range(unsigned long start, 
unsigned long stop)
  * @start: the start address
  * @stop: the stop address (exclusive)
  */
-void flush_icache_range(unsigned long start, unsigned long stop)
+void notrace flush_icache_range(unsigned long start, unsigned long stop)
 {
if (flush_coherent_icache(start))
return;
-- 
2.28.0



Re: [PATCH 3/8] powerpc: Mark functions called inside uaccess blocks w/ 'notrace'

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 4:02 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > Functions called between user_*_access_begin() and user_*_access_end()
> > should be either inlined or marked 'notrace' to prevent leaving
> > userspace access exposed. Mark any such functions relevant to signal
> > handling so that subsequent patches can call them inside uaccess blocks.
>
> Is it enough to mark it "notrace" ? I see that when I activate KASAN,
> there are still KASAN calls in
> those functions.
>

Maybe not enough after all :(

> In my series for 32 bits, I re-ordered stuff in order to do all those
> calls before doing the
> _access_begin(), can't you do the same on PPC64 ? (See
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f6eac65781b4a57220477c8864bca2b57f29a5d5.1597770847.git.christophe.le...@csgroup.eu/)
>

Yes, I will give this another shot in the next spin.
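
Roughly, the reordering would look like the sketch below (using the
prepare_setup_sigcontext()-style helper from later in this series): do the
non-inline flushes before opening the window, then only use inline
'unsafe' accessors inside it.

        /* Non-inline (and instrumented) calls happen first, with no
         * uaccess window open.
         */
        prepare_setup_sigcontext(tsk, 1);       /* flush_fp_to_thread() etc. */

        if (!user_write_access_begin(frame, sizeof(*frame)))
                goto badframe;

        /* Only inline unsafe_*() accessors from here on. */
        unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig, NULL,
                                (unsigned long)ksig->ka.sa.sa_handler, 1,
                                badframe_block);

        user_write_access_end();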

> Christophe
>
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/process.c | 20 ++--
> >   arch/powerpc/mm/mem.c |  4 ++--
> >   2 files changed, 12 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index ba2c987b8403..bf5d9654bd2c 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -84,7 +84,7 @@ extern unsigned long _get_SP(void);
> >*/
> >   bool tm_suspend_disabled __ro_after_init = false;
> >   
> > -static void check_if_tm_restore_required(struct task_struct *tsk)
> > +static void notrace check_if_tm_restore_required(struct task_struct *tsk)
> >   {
> > /*
> >  * If we are saving the current thread's registers, and the
> > @@ -151,7 +151,7 @@ void notrace __msr_check_and_clear(unsigned long bits)
> >   EXPORT_SYMBOL(__msr_check_and_clear);
> >   
> >   #ifdef CONFIG_PPC_FPU
> > -static void __giveup_fpu(struct task_struct *tsk)
> > +static void notrace __giveup_fpu(struct task_struct *tsk)
> >   {
> > unsigned long msr;
> >   
> > @@ -163,7 +163,7 @@ static void __giveup_fpu(struct task_struct *tsk)
> > tsk->thread.regs->msr = msr;
> >   }
> >   
> > -void giveup_fpu(struct task_struct *tsk)
> > +void notrace giveup_fpu(struct task_struct *tsk)
> >   {
> > check_if_tm_restore_required(tsk);
> >   
> > @@ -177,7 +177,7 @@ EXPORT_SYMBOL(giveup_fpu);
> >* Make sure the floating-point register state in the
> >* the thread_struct is up to date for task tsk.
> >*/
> > -void flush_fp_to_thread(struct task_struct *tsk)
> > +void notrace flush_fp_to_thread(struct task_struct *tsk)
> >   {
> > if (tsk->thread.regs) {
> > /*
> > @@ -234,7 +234,7 @@ static inline void __giveup_fpu(struct task_struct 
> > *tsk) { }
> >   #endif /* CONFIG_PPC_FPU */
> >   
> >   #ifdef CONFIG_ALTIVEC
> > -static void __giveup_altivec(struct task_struct *tsk)
> > +static void notrace __giveup_altivec(struct task_struct *tsk)
> >   {
> > unsigned long msr;
> >   
> > @@ -246,7 +246,7 @@ static void __giveup_altivec(struct task_struct *tsk)
> > tsk->thread.regs->msr = msr;
> >   }
> >   
> > -void giveup_altivec(struct task_struct *tsk)
> > +void notrace giveup_altivec(struct task_struct *tsk)
> >   {
> > check_if_tm_restore_required(tsk);
> >   
> > @@ -285,7 +285,7 @@ EXPORT_SYMBOL(enable_kernel_altivec);
> >* Make sure the VMX/Altivec register state in the
> >* the thread_struct is up to date for task tsk.
> >*/
> > -void flush_altivec_to_thread(struct task_struct *tsk)
> > +void notrace flush_altivec_to_thread(struct task_struct *tsk)
> >   {
> > if (tsk->thread.regs) {
> > preempt_disable();
> > @@ -300,7 +300,7 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread);
> >   #endif /* CONFIG_ALTIVEC */
> >   
> >   #ifdef CONFIG_VSX
> > -static void __giveup_vsx(struct task_struct *tsk)
> > +static void notrace __giveup_vsx(struct task_struct *tsk)
> >   {
> > unsigned long msr = tsk->thread.regs->msr;
> >   
> > @@ -317,7 +317,7 @@ static void __giveup_vsx(struct task_struct *tsk)
> > __giveup_altivec(tsk);
> >   }
> >   
> > -static void giveup_vsx(struct task_struct *tsk)
> > +static void notrace giveup_vsx(struct task_struct *tsk)
> >   {
> > check_if_tm_restore_required(tsk);

Re: [PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 10:48 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > Reuse the "safe" implementation from signal.c except for calling
> > unsafe_copy_from_user() to copy into a local buffer. Unlike the
> > unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
> > cannot use unsafe_get_user() directly to bypass the local buffer since
> > doing so significantly reduces signal handling performance.
>
> Why can't the functions use unsafe_get_user(), why does it significantly
> reduces signal handling
> performance ? How much significant ? I would expect that not going
> through an intermediate memory
> area would be more efficient
>

Here is a comparison; the 'unsafe-signal64-regs' variant avoids the intermediate buffer:

|  | hash   | radix  |
|  | -- | -- |
| linuxppc/next| 289014 | 158408 |
| unsafe-signal64  | 298506 | 253053 |
| unsafe-signal64-regs | 254898 | 220831 |

I have not figured out the 'why' yet. As you mentioned in your series,
technically calling __copy_tofrom_user() is overkill for these
operations. The only obvious difference between unsafe_put_user() and
unsafe_get_user() is that we don't have asm-goto for the 'get' variant.
Instead we wrap with unsafe_op_wrap(), which inserts a conditional and
then a goto to the label.

Implementations:

#define unsafe_copy_fpr_from_user(task, from, label)   do { \
        struct task_struct *__t = task; \
        u64 __user *buf = (u64 __user *)from; \
        int i; \
 \
        for (i = 0; i < ELF_NFPREG - 1; i++) \
                unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
        unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label); \
} while (0)

#define unsafe_copy_vsx_from_user(task, from, label)   do { \
        struct task_struct *__t = task; \
        u64 __user *buf = (u64 __user *)from; \
        int i; \
 \
        for (i = 0; i < ELF_NVSRHALFREG; i++) \
                unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
                                &buf[i], label); \
} while (0)
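
For reference, without asm-goto the 'get' side goes through
unsafe_op_wrap(), which is roughly:

#define unsafe_op_wrap(op, label)       do {    \
        if (unlikely(op))                       \
                goto label;                     \
} while (0)

so every unsafe_get_user() still costs a test and a branch, unlike the
asm-goto based unsafe_put_user().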

> Christophe
>
>
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal.h | 33 +
> >   1 file changed, 33 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> > index 2559a681536e..e9aaeac0da37 100644
> > --- a/arch/powerpc/kernel/signal.h
> > +++ b/arch/powerpc/kernel/signal.h
> > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> > *task, void __user *from);
> > &buf[i], label);\
> >   } while (0)
> >   
> > +#define unsafe_copy_fpr_from_user(task, from, label)   do {
> > \
> > +   struct task_struct *__t = task; \
> > +   u64 __user *__f = (u64 __user *)from;   \
> > +   u64 buf[ELF_NFPREG];\
> > +   int i;  \
> > +   \
> > +   unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\
> > +   label); \
> > +   for (i = 0; i < ELF_NFPREG - 1; i++)\
> > +   __t->thread.TS_FPR(i) = buf[i]; \
> > +   __t->thread.fp_state.fpscr = buf[i];\
> > +} while (0)
> > +
> > +#define unsafe_copy_vsx_from_user(task, from, label)   do {
> > \
> > +   struct task_struct *__t = task; \
> > +   u64 __user *__f = (u64 __user *)from;   \
> > +   u64 buf[ELF_NVSRHALFREG];   \
> > +   int i;  \
> > +  

Re: [PATCH 6/8] powerpc/signal64: Replace setup_trampoline() w/ unsafe_setup_trampoline()

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 10:56 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > From: Daniel Axtens 
> > 
> > Previously setup_trampoline() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> > 
> > Rewrite setup_trampoline() to assume that a userspace write access
> > window is open. Replace all uaccess functions with their 'unsafe'
> > versions to avoid the repeated uaccess switches.
> > 
> > Signed-off-by: Daniel Axtens 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal_64.c | 32 +++-
> >   1 file changed, 19 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index bd92064e5576..6d4f7a5c4fbf 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -600,30 +600,33 @@ static long restore_tm_sigcontexts(struct task_struct 
> > *tsk,
> >   /*
> >* Setup the trampoline code on the stack
> >*/
> > -static long setup_trampoline(unsigned int syscall, unsigned int __user 
> > *tramp)
> > +#define unsafe_setup_trampoline(syscall, tramp, e) \
> > +   unsafe_op_wrap(__unsafe_setup_trampoline(syscall, tramp), e)
> > +static long notrace __unsafe_setup_trampoline(unsigned int syscall,
> > +   unsigned int __user *tramp)
> >   {
> > int i;
> > -   long err = 0;
> >   
> > /* bctrl # call the handler */
> > -   err |= __put_user(PPC_INST_BCTRL, &tramp[0]);
> > +   unsafe_put_user(PPC_INST_BCTRL, &tramp[0], err);
> > /* addi r1, r1, __SIGNAL_FRAMESIZE  # Pop the dummy stackframe */
> > -   err |= __put_user(PPC_INST_ADDI | __PPC_RT(R1) | __PPC_RA(R1) |
> > - (__SIGNAL_FRAMESIZE & 0x), &tramp[1]);
> > +   unsafe_put_user(PPC_INST_ADDI | __PPC_RT(R1) | __PPC_RA(R1) |
> > + (__SIGNAL_FRAMESIZE & 0x), &tramp[1], err);
> > /* li r0, __NR_[rt_]sigreturn| */
> > -   err |= __put_user(PPC_INST_ADDI | (syscall & 0x), &tramp[2]);
> > +   unsafe_put_user(PPC_INST_ADDI | (syscall & 0x), &tramp[2], err);
> > /* sc */
> > -   err |= __put_user(PPC_INST_SC, &tramp[3]);
> > +   unsafe_put_user(PPC_INST_SC, &tramp[3], err);
> >   
> > /* Minimal traceback info */
> > for (i=TRAMP_TRACEBACK; i < TRAMP_SIZE ;i++)
> > -   err |= __put_user(0, &tramp[i]);
> > +   unsafe_put_user(0, &tramp[i], err);
> >   
> > -   if (!err)
> > -   flush_icache_range((unsigned long) &tramp[0],
> > -  (unsigned long) &tramp[TRAMP_SIZE]);
> > +   flush_icache_range((unsigned long)&tramp[0],
> > +  (unsigned long)&tramp[TRAMP_SIZE]);
>
> This flush should be done outside the user_write_access block.
>

Hmm, I suppose that means setup_trampoline() cannot be completely
"unsafe". I'll see if I can re-arrange the code which calls this
function to avoid an additional uaccess block instead and push the
start()/end() into setup_trampoline() directly.
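
Something along these lines, as a sketch only (the middle instruction
writes are elided; sizes and labels follow the existing code):

static long setup_trampoline(unsigned int syscall, unsigned int __user *tramp)
{
        int i;

        if (!user_write_access_begin(tramp, TRAMP_SIZE * sizeof(unsigned int)))
                return 1;

        /* bctrl # call the handler */
        unsafe_put_user(PPC_INST_BCTRL, &tramp[0], err);
        /* ... the addi/li/sc writes as before ... */
        unsafe_put_user(PPC_INST_SC, &tramp[3], err);

        /* Minimal traceback info */
        for (i = TRAMP_TRACEBACK; i < TRAMP_SIZE; i++)
                unsafe_put_user(0, &tramp[i], err);

        user_write_access_end();

        /* Flush the icache with the uaccess window closed. */
        flush_icache_range((unsigned long)&tramp[0],
                           (unsigned long)&tramp[TRAMP_SIZE]);
        return 0;

err:
        user_write_access_end();
        return 1;
}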

> >   
> > -   return err;
> > +   return 0;
> > +err:
> > +   return 1;
> >   }
> >   
> >   /*
> > @@ -888,7 +891,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
> > *set,
> > if (vdso64_rt_sigtramp && tsk->mm->context.vdso_base) {
> > regs->nip = tsk->mm->context.vdso_base + vdso64_rt_sigtramp;
> > } else {
> > -   err |= setup_trampoline(__NR_rt_sigreturn, &frame->tramp[0]);
> > +   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
> > +   return -EFAULT;
> > +   err |= __unsafe_setup_trampoline(__NR_rt_sigreturn, 
> > &frame->tramp[0]);
> > +   user_write_access_end();
> > if (err)
> > goto badframe;
> > regs->nip = (unsigned long) &frame->tramp[0];
> > 
>
> Christophe



Re: [PATCH 7/8] powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 11:00 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > From: Daniel Axtens 
> > 
> > Add uaccess blocks and use the 'unsafe' versions of functions doing user
> > access where possible to reduce the number of times uaccess has to be
> > opened/closed.
> > 
> > There is no 'unsafe' version of copy_siginfo_to_user, so move it
> > slightly to allow for a "longer" uaccess block.
> > 
> > Signed-off-by: Daniel Axtens 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal_64.c | 54 -
> >   1 file changed, 27 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 6d4f7a5c4fbf..3b97e3681a8f 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -843,46 +843,42 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
> > *set,
> > /* Save the thread's msr before get_tm_stackpointer() changes it */
> > unsigned long msr = regs->msr;
> >   #endif
> > -
> > frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
> > -   if (!access_ok(frame, sizeof(*frame)))
> > +   if (!user_write_access_begin(frame, sizeof(*frame)))
> > goto badframe;
> >   
> > -   err |= __put_user(&frame->info, &frame->pinfo);
> > -   err |= __put_user(&frame->uc, &frame->puc);
> > -   err |= copy_siginfo_to_user(&frame->info, &ksig->info);
> > -   if (err)
> > -   goto badframe;
> > +   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
> > +   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
> >   
> > /* Create the ucontext.  */
> > -   err |= __put_user(0, &frame->uc.uc_flags);
> > -   err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
> > +   unsafe_put_user(0, &frame->uc.uc_flags, badframe_block);
> > +   unsafe_save_altstack(&frame->uc.uc_stack, regs->gpr[1], badframe_block);
> > +
> >   #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > if (MSR_TM_ACTIVE(msr)) {
> > /* The ucontext_t passed to userland points to the second
> >  * ucontext_t (for transactional state) with its uc_link ptr.
> >  */
> > -   err |= __put_user(&frame->uc_transact, &frame->uc.uc_link);
> > +   unsafe_put_user(&frame->uc_transact, &frame->uc.uc_link, 
> > badframe_block);
> > +   user_write_access_end();
>
> Whaou. Doing this inside an #ifdef sequence is dirty.
> Can you reorganise code to avoid that and to avoid nesting #ifdef/#endif
> and the if/else as I did in
> signal32 ?

Hopefully yes - next spin!

>
> > err |= setup_tm_sigcontexts(&frame->uc.uc_mcontext,
> > &frame->uc_transact.uc_mcontext,
> > tsk, ksig->sig, NULL,
> > (unsigned 
> > long)ksig->ka.sa.sa_handler,
> > msr);
> > +   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
> > +   goto badframe;
> > +
> > } else
> >   #endif
> > {
> > -   err |= __put_user(0, &frame->uc.uc_link);
> > -
> > -   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
> > -   return -EFAULT;
> > -   err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
> > -   ksig->sig, NULL,
> > -   (unsigned 
> > long)ksig->ka.sa.sa_handler, 1);
> > -   user_write_access_end();
> > +   unsafe_put_user(0, &frame->uc.uc_link, badframe_block);
> > +   unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
> > +   NULL, (unsigned 
> > long)ksig->ka.sa.sa_handler,
> > +   1, badframe_block);
> > }
> > -   err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
> > -   if (err)
> > -   goto badframe;
> > +
> > +   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
> > badframe_block);
> >   
> >

Re: [PATCH 8/8] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 11:07 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > From: Daniel Axtens 
> > 
> > Add uaccess blocks and use the 'unsafe' versions of functions doing user
> > access where possible to reduce the number of times uaccess has to be
> > opened/closed.
> > 
> > Signed-off-by: Daniel Axtens 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal_64.c | 23 +++
> >   1 file changed, 15 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 3b97e3681a8f..0f4ff7a5bfc1 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -779,18 +779,22 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >  */
> > regs->msr &= ~MSR_TS_MASK;
> >   
> > -   if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> > +   if (!user_read_access_begin(uc, sizeof(*uc)))
> > goto badframe;
> > +
> > +   unsafe_get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR], badframe_block);
> > +
> > if (MSR_TM_ACTIVE(msr)) {
> > /* We recheckpoint on return. */
> > struct ucontext __user *uc_transact;
> >   
> > /* Trying to start TM on non TM system */
> > if (!cpu_has_feature(CPU_FTR_TM))
> > -   goto badframe;
> > +   goto badframe_block;
> > +
> > +   unsafe_get_user(uc_transact, &uc->uc_link, badframe_block);
> > +   user_read_access_end();
>
> user_access_end() only in the if branch ?
>
> >   
> > -   if (__get_user(uc_transact, &uc->uc_link))
> > -   goto badframe;
> > if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
> >&uc_transact->uc_mcontext))
> > goto badframe;
> > @@ -810,12 +814,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >  * causing a TM bad thing.
> >  */
> > current->thread.regs->msr &= ~MSR_TS_MASK;
> > +
> > +#ifndef CONFIG_PPC_TRANSACTIONAL_MEM
> > if (!user_read_access_begin(uc, sizeof(*uc)))
>
> The matching user_read_access_end() is not in the same #ifndef ? That's
> dirty and hard to follow.
> Can you re-organise the code to avoid all those nesting ?

Yes, thanks for pointing this out. I really wanted to avoid changing too
much of the logic inside these functions. But I suppose I ended up
creating a mess - I will fix this in the next spin.

>
> > -   return -EFAULT;
> > -   if (__unsafe_restore_sigcontext(current, NULL, 1, 
> > &uc->uc_mcontext)) {
> > -   user_read_access_end();
> > goto badframe;
> > -   }
> > +#endif
> > +   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
> > + badframe_block);
> > user_read_access_end();
> > }
> >   
> > @@ -825,6 +830,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > set_thread_flag(TIF_RESTOREALL);
> > return 0;
> >   
> > +badframe_block:
> > +   user_read_access_end();
> >   badframe:
> > signal_fault(current, regs, "rt_sigreturn", uc);
> >   
> > 
>
> Christophe



Re: [PATCH 1/8] powerpc/uaccess: Add unsafe_copy_from_user

2020-10-19 Thread Christopher M. Riedl
On Fri Oct 16, 2020 at 10:17 AM CDT, Christophe Leroy wrote:
>
>
> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> > Implement raw_copy_from_user_allowed() which assumes that userspace read
> > access is open. Use this new function to implement raw_copy_from_user().
> > Finally, wrap the new function to follow the usual "unsafe_" convention
> > of taking a label argument. The new raw_copy_from_user_allowed() calls
> > __copy_tofrom_user() internally, but this is still safe to call in user
> > access blocks formed with user_*_access_begin()/user_*_access_end()
> > since asm functions are not instrumented for tracing.
>
> Would objtool accept that if it was implemented on powerpc ?
>
> __copy_tofrom_user() is a function which is optimised for larger memory
> copies (using dcbz, etc ...)
> Do we need such an optimisation for unsafe_copy_from_user() ? Or can we
> do a simple loop as done for
> unsafe_copy_to_user() instead ?

I tried using a simple loop based on your unsafe_copy_to_user()
implementation. Similar to the copy_{vsx,fpr}_from_user() results, there
is a hit to signal handling performance. The results with the loop are
in the 'unsafe-signal64-copy' column:

|  | hash   | radix  |
|  | -- | -- |
| linuxppc/next| 289014 | 158408 |
| unsafe-signal64  | 298506 | 253053 |
| unsafe-signal64-copy | 197029 | 177002 |

As with the copy_{vsx,fpr}_from_user() patch, I don't yet fully understand
why this performs so badly.

Implementation:

unsafe_copy_from_user(d, s, l, e) \
do { \
        u8 *_dst = (u8 *)(d); \
        const u8 __user *_src = (u8 __user *)(s); \
        size_t _len = (l); \
        int _i; \
 \
        for (_i = 0; _i < (_len & ~(sizeof(long) - 1)); _i += sizeof(long)) \
                unsafe_get_user(*(long *)(_dst + _i), (long __user *)(_src + _i), e); \
        if (IS_ENABLED(CONFIG_PPC64) && (_len & 4)) { \
                unsafe_get_user(*(u32 *)(_dst + _i), (u32 __user *)(_src + _i), e); \
                _i += 4; \
        } \
        if (_len & 2) { \
                unsafe_get_user(*(u16 *)(_dst + _i), (u16 __user *)(_src + _i), e); \
                _i += 2; \
        } \
        if (_len & 1) \
                unsafe_get_user(*(u8 *)(_dst + _i), (u8 __user *)(_src + _i), e); \
} while (0)

>
> Christophe
>
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/include/asm/uaccess.h | 28 +++-
> >   1 file changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/uaccess.h 
> > b/arch/powerpc/include/asm/uaccess.h
> > index 26781b044932..66940b4eb692 100644
> > --- a/arch/powerpc/include/asm/uaccess.h
> > +++ b/arch/powerpc/include/asm/uaccess.h
> > @@ -418,38 +418,45 @@ raw_copy_in_user(void __user *to, const void __user 
> > *from, unsigned long n)
> >   }
> >   #endif /* __powerpc64__ */
> >   
> > -static inline unsigned long raw_copy_from_user(void *to,
> > -   const void __user *from, unsigned long n)
> > +static inline unsigned long
> > +raw_copy_from_user_allowed(void *to, const void __user *from, unsigned 
> > long n)
> >   {
> > -   unsigned long ret;
> > if (__builtin_constant_p(n) && (n <= 8)) {
> > -   ret = 1;
> > +   unsigned long ret = 1;
> >   
> > switch (n) {
> > case 1:
> > barrier_nospec();
> > -   __get_user_size(*(u8 *)to, from, 1, ret);
> > +   __get_user_size_allowed(*(u8 *)to, from, 1, ret);
> > break;
> > case 2:
> > barrier_nospec();
> > -   __get_user_size(*(u16 *)to, from, 2, ret);
> > +   __get_user_siz

[PATCH v2 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2020-11-04 Thread Christopher M. Riedl
Reuse the "safe" implementation from signal.c except for calling
unsafe_copy_from_user() to copy into a local buffer. Unlike the
unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
cannot use unsafe_get_user() directly to bypass the local buffer since
doing so significantly reduces signal handling performance.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index 2559a681536e..e9aaeac0da37 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, 
void __user *from);
&buf[i], label);\
 } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *__f = (u64 __user *)from;   \
+   u64 buf[ELF_NFPREG];\
+   int i;  \
+   \
+   unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\
+   label); \
+   for (i = 0; i < ELF_NFPREG - 1; i++)\
+   __t->thread.TS_FPR(i) = buf[i]; \
+   __t->thread.fp_state.fpscr = buf[i];\
+} while (0)
+
+#define unsafe_copy_vsx_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *__f = (u64 __user *)from;   \
+   u64 buf[ELF_NVSRHALFREG];   \
+   int i;  \
+   \
+   unsafe_copy_from_user(buf, __f, \
+   ELF_NVSRHALFREG * sizeof(double),   \
+   label); \
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
+   __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];  \
+} while (0)
+
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define unsafe_copy_ckfpr_to_user(to, task, label) do {\
struct task_struct *__t = task; \
@@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
*task, void __user *from);
unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
ELF_NFPREG * sizeof(double), label)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   \
+   unsafe_copy_from_user((task)->thread.fp_state.fpr, from, \
+   ELF_NFPREG * sizeof(double), label)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
@@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void __user 
*from)
 #else
 #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
-- 
2.29.0



[PATCH v2 3/8] powerpc/signal64: Move non-inline functions out of setup_sigcontext()

2020-11-04 Thread Christopher M. Riedl
There are non-inline functions which get called in setup_sigcontext() to
save register state to the thread struct. Move these calls into a
separate prepare_setup_sigcontext() function so that setup_sigcontext()
can later be refactored into an "unsafe" version which assumes an open
uaccess window. Non-inline functions should be avoided while a uaccess
window is open.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 7df088b9ad0f..ece1f982dd05 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
sigcontext __user *sc)
 }
 #endif
 
+static void prepare_setup_sigcontext(struct task_struct *tsk, int 
ctx_has_vsx_region)
+{
+#ifdef CONFIG_ALTIVEC
+   /* save altivec registers */
+   if (tsk->thread.used_vr)
+   flush_altivec_to_thread(tsk);
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
+#endif /* CONFIG_ALTIVEC */
+
+   flush_fp_to_thread(tsk);
+
+#ifdef CONFIG_VSX
+   if (tsk->thread.used_vsr && ctx_has_vsx_region)
+   flush_vsx_to_thread(tsk);
+#endif /* CONFIG_VSX */
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs = sigcontext_vmx_regs(sc);
-   unsigned long vrsave;
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
@@ -112,7 +129,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 
/* save altivec registers */
if (tsk->thread.used_vr) {
-   flush_altivec_to_thread(tsk);
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
  33 * sizeof(vector128));
@@ -124,17 +140,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   vrsave = 0;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC)) {
-   vrsave = mfspr(SPRN_VRSAVE);
-   tsk->thread.vrsave = vrsave;
-   }
-
-   err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
 #else /* CONFIG_ALTIVEC */
err |= __put_user(0, &sc->v_regs);
 #endif /* CONFIG_ALTIVEC */
-   flush_fp_to_thread(tsk);
/* copy fpr regs and fpscr */
err |= copy_fpr_to_user(&sc->fp_regs, tsk);
 
@@ -150,7 +159,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 * VMX data.
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
-   flush_vsx_to_thread(tsk);
v_regs += ELF_NVRREG;
err |= copy_vsx_to_user(v_regs, tsk);
/* set MSR_VSX in the MSR value in the frame to
@@ -655,6 +663,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
ctx_has_vsx_region = 1;
 
if (old_ctx != NULL) {
+   prepare_setup_sigcontext(current, ctx_has_vsx_region);
if (!access_ok(old_ctx, ctx_size)
|| setup_sigcontext(&old_ctx->uc_mcontext, current, 0, 
NULL, 0,
ctx_has_vsx_region)
@@ -842,6 +851,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 #endif
{
err |= __put_user(0, &frame->uc.uc_link);
+   prepare_setup_sigcontext(tsk, 1);
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
1);
-- 
2.29.0



[PATCH v2 4/8] powerpc/signal64: Remove TM ifdefery in middle of if/else block

2020-11-04 Thread Christopher M. Riedl
Similar to commit 1c32940f5220 ("powerpc/signal32: Remove ifdefery in
middle of if/else") for PPC32, remove the messy ifdef. Unlike PPC32, the
ifdef cannot be removed entirely since the uc_transact member of the
sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM=y.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index ece1f982dd05..d3e9519b2e62 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -710,9 +710,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct ucontext __user *uc = (struct ucontext __user *)regs->gpr[1];
sigset_t set;
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
unsigned long msr;
-#endif
 
/* Always make any pending restarted system calls return -EINTR */
current->restart_block.fn = do_no_restart_syscall;
@@ -762,10 +760,12 @@ SYSCALL_DEFINE0(rt_sigreturn)
 * restore_tm_sigcontexts.
 */
regs->msr &= ~MSR_TS_MASK;
+#endif
 
if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
goto badframe;
if (MSR_TM_ACTIVE(msr)) {
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* We recheckpoint on return. */
struct ucontext __user *uc_transact;
 
@@ -778,9 +778,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
   &uc_transact->uc_mcontext))
goto badframe;
-   } else
 #endif
-   {
+   } else {
/*
 * Fall through, for non-TM restore
 *
@@ -818,10 +817,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
unsigned long newsp = 0;
long err = 0;
struct pt_regs *regs = tsk->thread.regs;
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* Save the thread's msr before get_tm_stackpointer() changes it */
-   unsigned long msr = regs->msr;
-#endif
+   unsigned long msr __maybe_unused = regs->msr;
 
frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
if (!access_ok(frame, sizeof(*frame)))
@@ -836,8 +833,9 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
/* Create the ucontext.  */
err |= __put_user(0, &frame->uc.uc_flags);
err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+
if (MSR_TM_ACTIVE(msr)) {
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* The ucontext_t passed to userland points to the second
 * ucontext_t (for transactional state) with its uc_link ptr.
 */
@@ -847,9 +845,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
tsk, ksig->sig, NULL,
(unsigned 
long)ksig->ka.sa.sa_handler,
msr);
-   } else
 #endif
-   {
+   } else {
err |= __put_user(0, &frame->uc.uc_link);
prepare_setup_sigcontext(tsk, 1);
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
-- 
2.29.0



[PATCH v2 0/8] Improve signal performance on PPC64 with KUAP

2020-11-04 Thread Christopher M. Riedl
As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

There is existing work done on x86 and by Christophe Leroy for PPC32 to
instead open up user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix systems.

This series applies on top of Christophe Leroy's work for PPC32 [1] (I'm
sure patchwork won't be too happy about that).

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto, they still avoid
the repeated uaccess switches.

The third patch moves the non-inline calls made by setup_sigcontext() into
a new prepare_setup_sigcontext() to simplify converting setup_sigcontext()
into an 'unsafe' version which assumes an open uaccess window later.

The fourth patch cleans up some of the Transactional Memory ifdef code
to simplify using uaccess blocks later.

The next two patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last two patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from a P9 Blackbird system. Note
that currently hash does not support KUAP and is therefore used as the
"baseline" comparison. Bigger numbers are better:

signal1_threads -t1 -s10

| | hash   | radix  |
| --- | -- | -- |
| linuxppc/next   | 289014 | 158408 |
| unsafe-signal64 | 298506 | 253053 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
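
For context on what the numbers measure: the signal1 testcase counts how
many signal deliveries (handler entry plus rt_sigreturn) complete per
measurement interval. A rough standalone approximation, not the actual
will-it-scale source, looks like:

        #include <signal.h>
        #include <stdio.h>

        static volatile unsigned long long iterations;

        static void handler(int sig)
        {
                iterations++;
        }

        int main(void)
        {
                signal(SIGUSR1, handler);
                /* Each raise() is one full handle_rt_signal64() +
                 * rt_sigreturn() round trip on ppc64.
                 */
                while (iterations < 1000000)
                        raise(SIGUSR1);
                printf("%llu signals delivered\n", iterations);
                return 0;
        }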

v2: * Rebase on latest linuxppc/next + Christophe Leroy's PPC32
  signal series
* Simplify/remove TM ifdefery similar to PPC32 series and clean
  up the uaccess begin/end calls
* Isolate non-inline functions so they are not called when
  uaccess window is open

Christopher M. Riedl (6):
  powerpc/uaccess: Add unsafe_copy_from_user
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc/signal64: Move non-inline functions out of setup_sigcontext()
  powerpc/signal64: Remove TM ifdefery in middle of if/else block
  powerpc/signal64: Replace setup_sigcontext() w/
unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
unsafe_restore_sigcontext()

Daniel Axtens (2):
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/uaccess.h |  28 ++--
 arch/powerpc/kernel/signal.h   |  33 
 arch/powerpc/kernel/signal_64.c| 239 ++---
 3 files changed, 201 insertions(+), 99 deletions(-)

-- 
2.29.0



[PATCH v2 8/8] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2020-11-04 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d17f2d5436d2..82e68a508e5c 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -784,8 +784,11 @@ SYSCALL_DEFINE0(rt_sigreturn)
regs->msr &= ~MSR_TS_MASK;
 #endif
 
-   if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
+   if (!user_read_access_begin(uc, sizeof(*uc)))
goto badframe;
+
+   unsafe_get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR], badframe_block);
+
if (MSR_TM_ACTIVE(msr)) {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* We recheckpoint on return. */
@@ -793,10 +796,12 @@ SYSCALL_DEFINE0(rt_sigreturn)
 
/* Trying to start TM on non TM system */
if (!cpu_has_feature(CPU_FTR_TM))
-   goto badframe;
+   goto badframe_block;
+
+   unsafe_get_user(uc_transact, &uc->uc_link, badframe_block);
+
+   user_read_access_end();
 
-   if (__get_user(uc_transact, &uc->uc_link))
-   goto badframe;
if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
   &uc_transact->uc_mcontext))
goto badframe;
@@ -815,12 +820,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 * causing a TM bad thing.
 */
current->thread.regs->msr &= ~MSR_TS_MASK;
-   if (!user_read_access_begin(uc, sizeof(*uc)))
-   return -EFAULT;
-   if (__unsafe_restore_sigcontext(current, NULL, 1, 
&uc->uc_mcontext)) {
-   user_read_access_end();
-   goto badframe;
-   }
+   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+ badframe_block);
+
user_read_access_end();
}
 
@@ -830,6 +832,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
set_thread_flag(TIF_RESTOREALL);
return 0;
 
+badframe_block:
+   user_read_access_end();
 badframe:
signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.29.0



[PATCH v2 6/8] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2020-11-04 Thread Christopher M. Riedl
Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open. Replace all uaccess functions with their 'unsafe'
versions which avoid the repeated uaccess switches.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 68 -
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 3f25309826b6..d72153825719 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
- struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
+   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
sigset_t *set,
+   int sig, struct sigcontext 
__user *sc)
 {
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs;
 #endif
-   unsigned long err = 0;
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
@@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
save_r13 = regs->gpr[13];
 
/* copy the GPRs */
-   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
+ efault_out);
+   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
/* get MSR separately, transfer the LE bit if doing signal return */
-   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
if (sig)
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
/* Don't allow userspace to set SOFTE */
set_trap_norestart(regs);
-   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
if (!sig)
regs->gpr[13] = save_r13;
if (set != NULL)
-   err |=  __get_user(set->sig[0], &sc->oldmask);
+   unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
/*
 * Force reload of FP/VEC.
@@ -378,29 +379,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __get_user(v_regs, &sc->v_regs);
-   if (err)
-   return err;
+   unsafe_get_user(v_regs, &sc->v_regs, efault_out);
if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
return -EFAULT;
/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-   err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-   33 * sizeof(vector128));
+   unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
+ 33 * sizeof(vector128), efault_out);
tsk->thread.used_vr = true;
} else if (

[PATCH v2 1/8] powerpc/uaccess: Add unsafe_copy_from_user

2020-11-04 Thread Christopher M. Riedl
Implement raw_copy_from_user_allowed() which assumes that userspace read
access is open. Use this new function to implement raw_copy_from_user().
Finally, wrap the new function to follow the usual "unsafe_" convention
of taking a label argument. The new raw_copy_from_user_allowed() calls
__copy_tofrom_user() internally, but this is still safe to call in user
access blocks formed with user_*_access_begin()/user_*_access_end()
since asm functions are not instrumented for tracing.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/uaccess.h | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index ef5bbb705c08..96b4abab4f5a 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -403,38 +403,45 @@ raw_copy_in_user(void __user *to, const void __user 
*from, unsigned long n)
 }
 #endif /* __powerpc64__ */
 
-static inline unsigned long raw_copy_from_user(void *to,
-   const void __user *from, unsigned long n)
+static inline unsigned long
+raw_copy_from_user_allowed(void *to, const void __user *from, unsigned long n)
 {
-   unsigned long ret;
if (__builtin_constant_p(n) && (n <= 8)) {
-   ret = 1;
+   unsigned long ret = 1;
 
switch (n) {
case 1:
barrier_nospec();
-   __get_user_size(*(u8 *)to, from, 1, ret);
+   __get_user_size_allowed(*(u8 *)to, from, 1, ret);
break;
case 2:
barrier_nospec();
-   __get_user_size(*(u16 *)to, from, 2, ret);
+   __get_user_size_allowed(*(u16 *)to, from, 2, ret);
break;
case 4:
barrier_nospec();
-   __get_user_size(*(u32 *)to, from, 4, ret);
+   __get_user_size_allowed(*(u32 *)to, from, 4, ret);
break;
case 8:
barrier_nospec();
-   __get_user_size(*(u64 *)to, from, 8, ret);
+   __get_user_size_allowed(*(u64 *)to, from, 8, ret);
break;
}
if (ret == 0)
return 0;
}
 
+   return __copy_tofrom_user((__force void __user *)to, from, n);
+}
+
+static inline unsigned long
+raw_copy_from_user(void *to, const void __user *from, unsigned long n)
+{
+   unsigned long ret;
+
barrier_nospec();
allow_read_from_user(from, n);
-   ret = __copy_tofrom_user((__force void __user *)to, from, n);
+   ret = raw_copy_from_user_allowed(to, from, n);
prevent_read_from_user(from, n);
return ret;
 }
@@ -542,6 +549,9 @@ user_write_access_begin(const void __user *ptr, size_t len)
 #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
 #define unsafe_put_user(x, p, e) __put_user_goto(x, p, e)
 
+#define unsafe_copy_from_user(d, s, l, e) \
+   unsafe_op_wrap(raw_copy_from_user_allowed(d, s, l), e)
+
 #define unsafe_copy_to_user(d, s, l, e) \
 do {   \
u8 __user *_dst = (u8 __user *)(d); \
-- 
2.29.0



[PATCH v2 5/8] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2020-11-04 Thread Christopher M. Riedl
Previously setup_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_sigcontext() to assume that a userspace write access window
is open. Replace all uaccess functions with their 'unsafe' versions
which avoid the repeated uaccess switches.
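
The change boils down to the following pattern (an illustrative sketch;
the field and label names are taken from the diff below):

        /* Before: each helper opens and closes the user access window. */
        err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
        err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);

        /* After: the caller opens the window once and the 'unsafe' helpers
         * stay inside it, branching to efault_out on failure.
         */
        unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
        unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);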

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 70 -
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d3e9519b2e62..3f25309826b6 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -101,9 +101,13 @@ static void prepare_setup_sigcontext(struct task_struct 
*tsk, int ctx_has_vsx_re
  * Set up the sigcontext for the signal frame.
  */
 
-static long setup_sigcontext(struct sigcontext __user *sc,
-   struct task_struct *tsk, int signr, sigset_t *set,
-   unsigned long handler, int ctx_has_vsx_region)
+#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler,  \
+   ctx_has_vsx_region, e)  \
+   unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set,   \
+   handler, ctx_has_vsx_region), e)
+static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
+   struct task_struct *tsk, int signr, 
sigset_t *set,
+   unsigned long handler, int 
ctx_has_vsx_region)
 {
/* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
 * process never used altivec yet (MSR_VEC is zero in pt_regs of
@@ -118,20 +122,19 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
-   long err = 0;
/* Force usr to alway see softe as 1 (interrupts enabled) */
unsigned long softe = 0x1;
 
BUG_ON(tsk != current);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __put_user(v_regs, &sc->v_regs);
+   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
 
/* save altivec registers */
if (tsk->thread.used_vr) {
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
-   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
- 33 * sizeof(vector128));
+   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
+   33 * sizeof(vector128), efault_out);
/* set MSR_VEC in the MSR value in the frame to indicate that 
sc->v_reg)
 * contains valid data.
 */
@@ -140,12 +143,12 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
+   unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], 
efault_out);
 #else /* CONFIG_ALTIVEC */
-   err |= __put_user(0, &sc->v_regs);
+   unsafe_put_user(0, &sc->v_regs, efault_out);
 #endif /* CONFIG_ALTIVEC */
/* copy fpr regs and fpscr */
-   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
+   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
 
/*
 * Clear the MSR VSX bit to indicate there is no valid state attached
@@ -160,24 +163,27 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
v_regs += ELF_NVRREG;
-   err |= copy_vsx_to_user(v_regs, tsk);
+   unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);
/* set MSR_VSX in the MSR value in the frame to
 * indicate that sc->vs_reg) contains valid data.
 */
msr |= MSR_VSX;
}
 #endif /* CONFIG_VSX */
-   err |= __put_user(&sc->gp_regs, &sc->regs);
+   unsafe_put_user(&sc->gp_regs, &sc->regs, efault_out);
WARN_ON(!FULL_REGS(regs));
-   err |= __copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE);
-   err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
-   err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);
-   err |= __put_user(signr, &sc->signal);
-   err |= __put_user(handler, &sc->handler);
+   unsafe_copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE, efault_out);
+   unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
+   unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);
+   unsafe_put_user(signr, &sc->signal, efault_out);
+

[PATCH v2 7/8] powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

2020-11-04 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

There is no 'unsafe' version of copy_siginfo_to_user, so move it
slightly to allow for a "longer" uaccess block.
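
The resulting shape of handle_rt_signal64() is roughly (a sketch; see
the diff below for the real thing):

        if (!user_write_access_begin(frame, sizeof(*frame)))
                goto badframe;
        /* ... unsafe_put_user()/unsafe_setup_sigcontext()/... calls ... */
        user_write_access_end();

        /* No unsafe_ variant exists, so call this with the window closed. */
        if (copy_siginfo_to_user(&frame->info, &ksig->info))
                goto badframe;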

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 54 +
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d72153825719..d17f2d5436d2 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -848,44 +848,51 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
*set,
unsigned long msr __maybe_unused = regs->msr;
 
frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
-   if (!access_ok(frame, sizeof(*frame)))
-   goto badframe;
 
-   err |= __put_user(&frame->info, &frame->pinfo);
-   err |= __put_user(&frame->uc, &frame->puc);
-   err |= copy_siginfo_to_user(&frame->info, &ksig->info);
-   if (err)
+   /* This only applies when calling unsafe_setup_sigcontext() and must be
+* called before opening the uaccess window.
+*/
+   if (!MSR_TM_ACTIVE(msr))
+   prepare_setup_sigcontext(tsk, 1);
+
+   if (!user_write_access_begin(frame, sizeof(*frame)))
goto badframe;
 
+   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
+   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
+
/* Create the ucontext.  */
-   err |= __put_user(0, &frame->uc.uc_flags);
-   err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
+   unsafe_put_user(0, &frame->uc.uc_flags, badframe_block);
+   unsafe_save_altstack(&frame->uc.uc_stack, regs->gpr[1], badframe_block);
 
if (MSR_TM_ACTIVE(msr)) {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* The ucontext_t passed to userland points to the second
 * ucontext_t (for transactional state) with its uc_link ptr.
 */
-   err |= __put_user(&frame->uc_transact, &frame->uc.uc_link);
+   unsafe_put_user(&frame->uc_transact, &frame->uc.uc_link, 
badframe_block);
+
+   user_write_access_end();
+
err |= setup_tm_sigcontexts(&frame->uc.uc_mcontext,
&frame->uc_transact.uc_mcontext,
tsk, ksig->sig, NULL,
(unsigned 
long)ksig->ka.sa.sa_handler,
msr);
+
+   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
+   goto badframe;
+
 #endif
} else {
-   err |= __put_user(0, &frame->uc.uc_link);
-   prepare_setup_sigcontext(tsk, 1);
-   if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
-   return -EFAULT;
-   err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
-   ksig->sig, NULL,
-   (unsigned 
long)ksig->ka.sa.sa_handler, 1);
-   user_write_access_end();
+   unsafe_put_user(0, &frame->uc.uc_link, badframe_block);
+   unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
+   NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
+   1, badframe_block);
}
-   err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-   if (err)
-   goto badframe;
+
+   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
badframe_block);
+   user_write_access_end();
 
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
@@ -900,6 +907,11 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
regs->nip = (unsigned long) &frame->tramp[0];
}
 
+
+   /* Save the siginfo outside of the unsafe block. */
+   if (copy_siginfo_to_user(&frame->info, &ksig->info))
+   goto badframe;
+
/* Allocate a dummy caller frame for the signal handler. */
newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
@@ -939,6 +951,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 
return 0;
 
+badframe_block:
+   user_write_access_end();
 badframe:
signal_fault(current, regs, "handle_rt_signal64", frame);
 
-- 
2.29.0



Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-04 Thread Christopher M. Riedl
On Wed Feb 3, 2021 at 12:43 PM CST, Christopher M. Riedl wrote:
> Usually sigset_t is exactly 8B which is a "trivial" size and does not
> warrant using __copy_from_user(). Use __get_user() directly in
> anticipation of future work to remove the trivial size optimizations
> from __copy_from_user(). Calling __get_user() also results in a small
> boost to signal handling throughput here.
>
> Signed-off-by: Christopher M. Riedl 

This patch triggered sparse warnings about 'different address spaces'.
This minor fixup cleans that up:

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 42fdc4a7ff72..1dfda6403e14 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,7 +97,7 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, 
int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }

-static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+static inline int get_user_sigset(sigset_t *dst, const sigset_t __user *src)
 {
if (sizeof(sigset_t) <= 8)
return __get_user(dst->sig[0], &src->sig[0]);


[PATCH v2] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-05 Thread Christopher M. Riedl
The idle entry/exit code saves/restores GPRs in the stack "red zone"
(Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
used for the first GPR is incorrect and overwrites the back chain - the
Protected Zone actually starts below the current SP. In practice this is
probably not an issue, but it's still incorrect so fix it.

Also expand the comments to explain why using the stack "red zone"
instead of creating a new stackframe is appropriate here.
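
For reference, the relevant part of the stack layout under the ELF ABI
v2 is roughly:

          0(r1)  ->  back chain word (must not be clobbered)
         -8(r1)  ->  first usable slot of the Protected Zone
          ...        288 bytes of volatile storage below the SP

so the first saved GPR belongs at -8*1(r1) rather than -8*0(r1).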

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/idle_book3s.S | 138 --
 1 file changed, 73 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 22f249b6f58d..f9e6d83e6720 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -52,28 +52,32 @@ _GLOBAL(isa300_idle_stop_mayloss)
std r1,PACAR1(r13)
mflr r4
mfcr r5
-   /* use stack red zone rather than a new frame for saving regs */
-   std r2,-8*0(r1)
-   std r14,-8*1(r1)
-   std r15,-8*2(r1)
-   std r16,-8*3(r1)
-   std r17,-8*4(r1)
-   std r18,-8*5(r1)
-   std r19,-8*6(r1)
-   std r20,-8*7(r1)
-   std r21,-8*8(r1)
-   std r22,-8*9(r1)
-   std r23,-8*10(r1)
-   std r24,-8*11(r1)
-   std r25,-8*12(r1)
-   std r26,-8*13(r1)
-   std r27,-8*14(r1)
-   std r28,-8*15(r1)
-   std r29,-8*16(r1)
-   std r30,-8*17(r1)
-   std r31,-8*18(r1)
-   std r4,-8*19(r1)
-   std r5,-8*20(r1)
+   /*
+* Use the stack red zone rather than a new frame for saving regs since
+* in the case of no GPR loss the wakeup code branches directly back to
+* the caller without deallocating the stack frame first.
+*/
+   std r2,-8*1(r1)
+   std r14,-8*2(r1)
+   std r15,-8*3(r1)
+   std r16,-8*4(r1)
+   std r17,-8*5(r1)
+   std r18,-8*6(r1)
+   std r19,-8*7(r1)
+   std r20,-8*8(r1)
+   std r21,-8*9(r1)
+   std r22,-8*10(r1)
+   std r23,-8*11(r1)
+   std r24,-8*12(r1)
+   std r25,-8*13(r1)
+   std r26,-8*14(r1)
+   std r27,-8*15(r1)
+   std r28,-8*16(r1)
+   std r29,-8*17(r1)
+   std r30,-8*18(r1)
+   std r31,-8*19(r1)
+   std r4,-8*20(r1)
+   std r5,-8*21(r1)
/* 168 bytes */
PPC_STOP
b   .   /* catch bugs */
@@ -89,8 +93,8 @@ _GLOBAL(isa300_idle_stop_mayloss)
  */
 _GLOBAL(idle_return_gpr_loss)
ld  r1,PACAR1(r13)
-   ld  r4,-8*19(r1)
-   ld  r5,-8*20(r1)
+   ld  r4,-8*20(r1)
+   ld  r5,-8*21(r1)
mtlr r4
mtcr r5
/*
@@ -98,25 +102,25 @@ _GLOBAL(idle_return_gpr_loss)
 * from PACATOC. This could be avoided for that less common case
 * if KVM saved its r2.
 */
-   ld  r2,-8*0(r1)
-   ld  r14,-8*1(r1)
-   ld  r15,-8*2(r1)
-   ld  r16,-8*3(r1)
-   ld  r17,-8*4(r1)
-   ld  r18,-8*5(r1)
-   ld  r19,-8*6(r1)
-   ld  r20,-8*7(r1)
-   ld  r21,-8*8(r1)
-   ld  r22,-8*9(r1)
-   ld  r23,-8*10(r1)
-   ld  r24,-8*11(r1)
-   ld  r25,-8*12(r1)
-   ld  r26,-8*13(r1)
-   ld  r27,-8*14(r1)
-   ld  r28,-8*15(r1)
-   ld  r29,-8*16(r1)
-   ld  r30,-8*17(r1)
-   ld  r31,-8*18(r1)
+   ld  r2,-8*1(r1)
+   ld  r14,-8*2(r1)
+   ld  r15,-8*3(r1)
+   ld  r16,-8*4(r1)
+   ld  r17,-8*5(r1)
+   ld  r18,-8*6(r1)
+   ld  r19,-8*7(r1)
+   ld  r20,-8*8(r1)
+   ld  r21,-8*9(r1)
+   ld  r22,-8*10(r1)
+   ld  r23,-8*11(r1)
+   ld  r24,-8*12(r1)
+   ld  r25,-8*13(r1)
+   ld  r26,-8*14(r1)
+   ld  r27,-8*15(r1)
+   ld  r28,-8*16(r1)
+   ld  r29,-8*17(r1)
+   ld  r30,-8*18(r1)
+   ld  r31,-8*19(r1)
blr
 
 /*
@@ -154,28 +158,32 @@ _GLOBAL(isa206_idle_insn_mayloss)
std r1,PACAR1(r13)
mflr r4
mfcr r5
-   /* use stack red zone rather than a new frame for saving regs */
-   std r2,-8*0(r1)
-   std r14,-8*1(r1)
-   std r15,-8*2(r1)
-   std r16,-8*3(r1)
-   std r17,-8*4(r1)
-   std r18,-8*5(r1)
-   std r19,-8*6(r1)
-   std r20,-8*7(r1)
-   std r21,-8*8(r1)
-   std r22,-8*9(r1)
-   std r23,-8*10(r1)
-   std r24,-8*11(r1)
-   std r25,-8*12(r1)
-   std r26,-8*13(r1)
-   std r27,-8*14(r1)
-   std r28,-8*15(r1)
-   std r29,-8*16(r1)
-   std r30,-8*17(r1)
-   std r31,-8*18(r1)
-   std r4,-8*

Re: [PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2021-02-06 Thread Christopher M. Riedl
On Sat Feb 6, 2021 at 10:32 AM CST, Christophe Leroy wrote:
>
>
> Le 20/10/2020 à 04:01, Christopher M. Riedl a écrit :
> > On Fri Oct 16, 2020 at 10:48 AM CDT, Christophe Leroy wrote:
> >>
> >>
> >> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> >>> Reuse the "safe" implementation from signal.c except for calling
> >>> unsafe_copy_from_user() to copy into a local buffer. Unlike the
> >>> unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
> >>> cannot use unsafe_get_user() directly to bypass the local buffer since
> >>> doing so significantly reduces signal handling performance.
> >>
> >> Why can't the functions use unsafe_get_user(), why does it significantly
> >> reduces signal handling
> >> performance ? How much significant ? I would expect that not going
> >> through an intermediate memory
> >> area would be more efficient
> >>
> > 
> > Here is a comparison, 'unsafe-signal64-regs' avoids the intermediate buffer:
> > 
> > |  | hash   | radix  |
> > |  | -- | -- |
> > | linuxppc/next| 289014 | 158408 |
> > | unsafe-signal64  | 298506 | 253053 |
> > | unsafe-signal64-regs | 254898 | 220831 |
> > 
> > I have not figured out the 'why' yet. As you mentioned in your series,
> > technically calling __copy_tofrom_user() is overkill for these
> > operations. The only obvious difference between unsafe_put_user() and
> > unsafe_get_user() is that we don't have asm-goto for the 'get' variant.
> > Instead we wrap with unsafe_op_wrap() which inserts a conditional and
> > then goto to the label.
> > 
> > Implementations:
> > 
> > #define unsafe_copy_fpr_from_user(task, from, label)   do {\
> >struct task_struct *__t = task; \
> >u64 __user *buf = (u64 __user *)from;   \
> >int i;  \
> >\
> >for (i = 0; i < ELF_NFPREG - 1; i++)\
> >unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
> >unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
> > } while (0)
> > 
> > #define unsafe_copy_vsx_from_user(task, from, label)   do {\
> >struct task_struct *__t = task; \
> >u64 __user *buf = (u64 __user *)from;   \
> >int i;  \
> >\
> >for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> >
> > unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
> >&buf[i], label);\
> > } while (0)
> > 
>
> Do you have CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled in
> your config ?

I don't have these set in my config (ppc64le_defconfig). I think I
figured this out - the reason for the lower signal throughput is the
barrier_nospec() in __get_user_nocheck(). When looping we incur that
cost on every iteration. Commenting it out results in signal performance
of ~316K w/ hash on the unsafe-signal64-regs branch. Obviously the
barrier is there for a reason but it is quite costly.

This also explains why the copy_{fpr,vsx}_to_user() direction does not
suffer from the slowdown because there is no need for barrier_nospec().
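
To make the cost concrete, the unsafe_copy_fpr_from_user() loop quoted
above effectively behaves like this (a rough sketch, not the actual
generated code):

        for (i = 0; i < ELF_NFPREG - 1; i++) {
                barrier_nospec();   /* added by __get_user_nocheck() on every pass */
                /* ... load buf[i] into __t->thread.TS_FPR(i) ... */
        }

whereas a single __copy_from_user() of the whole register block only
pays the barrier once up front.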
>
> If yes, could you try together with the patch from Alexey
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210204121612.32721-1-...@ozlabs.ru/
> ?
>
> Thanks
> Christophe



Re: [PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2021-02-08 Thread Christopher M. Riedl
On Sun Feb 7, 2021 at 4:12 AM CST, Christophe Leroy wrote:
>
>
> Le 06/02/2021 à 18:39, Christopher M. Riedl a écrit :
> > On Sat Feb 6, 2021 at 10:32 AM CST, Christophe Leroy wrote:
> >>
> >>
> >> Le 20/10/2020 à 04:01, Christopher M. Riedl a écrit :
> >>> On Fri Oct 16, 2020 at 10:48 AM CDT, Christophe Leroy wrote:
> >>>>
> >>>>
> >>>> Le 15/10/2020 à 17:01, Christopher M. Riedl a écrit :
> >>>>> Reuse the "safe" implementation from signal.c except for calling
> >>>>> unsafe_copy_from_user() to copy into a local buffer. Unlike the
> >>>>> unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
> >>>>> cannot use unsafe_get_user() directly to bypass the local buffer since
> >>>>> doing so significantly reduces signal handling performance.
> >>>>
> >>>> Why can't the functions use unsafe_get_user(), why does it significantly
> >>>> reduces signal handling
> >>>> performance ? How much significant ? I would expect that not going
> >>>> through an intermediate memory
> >>>> area would be more efficient
> >>>>
> >>>
> >>> Here is a comparison, 'unsafe-signal64-regs' avoids the intermediate 
> >>> buffer:
> >>>
> >>>   |  | hash   | radix  |
> >>>   |  | -- | -- |
> >>>   | linuxppc/next| 289014 | 158408 |
> >>>   | unsafe-signal64  | 298506 | 253053 |
> >>>   | unsafe-signal64-regs | 254898 | 220831 |
> >>>
> >>> I have not figured out the 'why' yet. As you mentioned in your series,
> >>> technically calling __copy_tofrom_user() is overkill for these
> >>> operations. The only obvious difference between unsafe_put_user() and
> >>> unsafe_get_user() is that we don't have asm-goto for the 'get' variant.
> >>> Instead we wrap with unsafe_op_wrap() which inserts a conditional and
> >>> then goto to the label.
> >>>
> >>> Implementations:
> >>>
> >>>   #define unsafe_copy_fpr_from_user(task, from, label)   do {\
> >>>  struct task_struct *__t = task; \
> >>>  u64 __user *buf = (u64 __user *)from;   \
> >>>  int i;  \
> >>>  \
> >>>  for (i = 0; i < ELF_NFPREG - 1; i++)\
> >>>  unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
> >>>  unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
> >>>   } while (0)
> >>>
> >>>   #define unsafe_copy_vsx_from_user(task, from, label)   do {\
> >>>  struct task_struct *__t = task; \
> >>>  u64 __user *buf = (u64 __user *)from;   \
> >>>  int i;  \
> >>>  \
> >>>  for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> >>>  
> >>> unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
> >>>  &buf[i], label);\
> >>>   } while (0)
> >>>
> >>
> >> Do you have CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled in
> >> your config ?
> > 
> > I don't have these set in my config (ppc64le_defconfig). I think I
> > figured this out - the reason for the lower signal throughput is the
> > barrier_nospec() in __get_user_nocheck(). When looping we incur that
> > cost on every iteration. Commenting it out results in signal performance
> > of ~316K w/ hash on the unsafe-signal64-regs branch. Obviously the
> > barrier is there for a reason but it is quite costly.
>
> Interesting.
>
> Can you try with the patch I just sent out
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/c72f014730823b413528e90ab6c4d3bcb79f8497.1612692067.git.christophe.le...@csgroup.eu/

Yeah that patch solves the problem. Using unsafe_get_user() in a loop is
actually faster on radix than going through the intermediate buffer. A
summary of results below (unsafe-signal64-v6 uses unsafe_get_user() and
avoids the local buffer):

|  | hash   | radix  |
|  | -- | -- |
| unsafe-signal64-v5   | 194533 | 230089 |
| unsafe-signal64-v6   | 176739 | 202840 |
| unsafe-signal64-v5+barrier patch | 203037 | 234936 |
| unsafe-signal64-v6+barrier patch | 205484 | 241030 |

I am still expecting some comments/feedback on my v5 before sending out
v6. Should I include your patch in my series as well?

>
> Thanks
> Christophe



Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-09 Thread Christopher M. Riedl
On Tue Feb 9, 2021 at 3:45 PM CST, Christophe Leroy wrote:
> "Christopher M. Riedl"  a écrit :
>
> > Usually sigset_t is exactly 8B which is a "trivial" size and does not
> > warrant using __copy_from_user(). Use __get_user() directly in
> > anticipation of future work to remove the trivial size optimizations
> > from __copy_from_user(). Calling __get_user() also results in a small
> > boost to signal handling throughput here.
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 14 --
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c  
> > b/arch/powerpc/kernel/signal_64.c
> > index 817b64e1e409..42fdc4a7ff72 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct  
> > task_struct *tsk, int ctx_has_vsx_re
> >  #endif /* CONFIG_VSX */
> >  }
> >
> > +static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
>
> Should be called __get_user_sigset() as it is a helper for __get_user()

Ok makes sense.

>
> > +{
> > +   if (sizeof(sigset_t) <= 8)
>
> We should always use __get_user(), see below.
>
> > +   return __get_user(dst->sig[0], &src->sig[0]);
>
> I think the above will not work on ppc32, it will only copy 4 bytes.
> You must cast the source to u64*

Well this is signal_64.c :) Looks like ppc32 needs the same thing so
I'll just move this into signal.h and use it for both. 

The only exception would be the COMPAT case in signal_32.c which ends up
calling the common get_compat_sigset(). Updating that is probably
outside the scope of this series.

>
> > +   else
> > +   return __copy_from_user(dst, src, sizeof(sigset_t));
>
> I see no point in keeping this alternative. Today sigset_ t is fixed.
> If you fear one day someone might change it to something different
> than a u64, just add a BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));

Ah yes that is much better - thanks for the suggestion.
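Something along these lines, then (just a sketch for now, pending v6):

        static inline int
        __get_user_sigset(sigset_t *dst, const sigset_t __user *src)
        {
                /* sigset_t is 8 bytes on ppc; catch any future change. */
                BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));

                return __get_user(dst->sig[0], (u64 __user *)&src->sig[0]);
        }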

>
> > +}
> > +
> >  /*
> >   * Set up the sigcontext for the signal frame.
> >   */
> > @@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext  
> > __user *, old_ctx,
> >  * We kill the task with a SIGSEGV in this situation.
> >  */
> >
> > -   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &new_ctx->uc_sigmask))
> > do_exit(SIGSEGV);
> > +
>
> This white space is not part of the change, keep patches to the
> minimum, avoid cosmetic

Just a (bad?) habit on my part that I missed - I'll remove this one and
the one further below.

>
> > set_current_blocked(&set);
> >
> > if (!user_read_access_begin(new_ctx, ctx_size))
> > @@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (!access_ok(uc, sizeof(*uc)))
> > goto badframe;
> >
> > -   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &uc->uc_sigmask))
> > goto badframe;
> > +
>
> Same
>
> > set_current_blocked(&set);
> >
> >  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > --
> > 2.26.1


