Re: [PATCH] arch: configuration, deleting 'CONFIG_BUG' since always need it.
Arnd Bergmann writes: > On Thursday 23 May 2013, Geert Uytterhoeven wrote: >> > The problem is: trying to fix that will mean the result is a larger >> > kernel than if you just do the usual arch-implemented thing of placing >> > an defined faulting instruction at the BUG() site - which defeats the >> > purpose of turning off CONFIG_BUG. >> >> Is __builtin_unreachable() working well these days? >> > > Hmm, I just tried the trivial patch below, which seemed to do the right thing. > Needs a little more investigation, but that might actually be the correct > solution. I thought that at some point __builtin_unreachable() was the same > as "do {} while (1)", but this is not the case with the gcc I was using -- > it just tells gcc that we don't expect to ever get here. Yes. We already have this abstracted in compiler.h as the macro unreachable, so the slight modification of your patch below should handle this case. For compilers without __builtin_unreachable() unreachable() expands to do {} while(1) but an infinite loop seems reasonable and preserves the semantics of the code, unlike the current noop that is do {} while(0). > Signed-off-by: Arnd Bergmann > diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h index 7d10f96..9afff7d 100644 --- a/include/asm-generic/bug.h +++ b/include/asm-generic/bug.h @@ -108,11 +108,11 @@ extern void warn_slowpath_null(const char *file, const int line); #else /* !CONFIG_BUG */ #ifndef HAVE_ARCH_BUG -#define BUG() do {} while(0) +#define BUG() unreachable () #endif #ifndef HAVE_ARCH_BUG_ON -#define BUG_ON(condition) do { if (condition) ; } while(0) +#define BUG_ON(condition) do { if (condition) unreachable(); } while(0) #endif #ifndef HAVE_ARCH_WARN_ON ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] arch: configuration, deleting 'CONFIG_BUG' since always need it.
Chen Gang writes: > The crazy user can unset 'CONFIG_BUG' in menuconfig: "> General setup > > Configure standard kernel features (expert users) > BUG() Support". > > But in fact, we always need it, and quite a few of architectures have > already implemented it (e.g. alpha, arc, arm, avr32, blackfin, cris, > frv, ia64, m68k, mips, mn10300, parisc, powerpc, s390, sh, sparc, x86). > > And kernel also already has prepared a default effective implementation > for the architectures which is unwilling to implement it by themselves > (e.g. arm64, c6x, h8300, hexagon, m32r, metag, microblaze, openrisc, > score, tile, um, unicore32, xtensa). > > So need get rid of 'CONFIG_BUG', and let it always enabled everywhere. This looks like the right way to handle this to me. If the BUG annotations are too big and not needed they should simply be deleted from the code base. Disabling CONFIG_BUG which removes the BUG annotations from the binaries without modifying the source code seems like the wrong approach. Acked-by: "Eric W. Biederman" ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available
Hidehiro Kawai writes: > You can call panic notifiers and kmsg dumpers before kdump by > specifying "crash_kexec_post_notifiers" as a boot parameter. > However, it doesn't make sense if kdump is not available. In that > case, disable "crash_kexec_post_notifiers" boot parameter so that > you can't change the value of the parameter. Nacked-by: "Eric W. Biederman" You are confusing kexec on panic and CONFIG_CRASH_DUMP which is about the tools for reading the state of the previous kernel. Eric > Signed-off-by: Hidehiro Kawai > Cc: Andrew Morton > Cc: Eric Biederman > Cc: Vivek Goyal > --- > kernel/panic.c |2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/panic.c b/kernel/panic.c > index 04e91ff..5252331 100644 > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -502,12 +502,14 @@ __visible void __stack_chk_fail(void) > core_param(pause_on_oops, pause_on_oops, int, 0644); > core_param(panic_on_warn, panic_on_warn, int, 0644); > > +#ifdef CONFIG_CRASH_DUMP > static int __init setup_crash_kexec_post_notifiers(char *s) > { > crash_kexec_post_notifiers = true; > return 0; > } > early_param("crash_kexec_post_notifiers", setup_crash_kexec_post_notifiers); > +#endif > > static int __init oops_setup(char *s) > { ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available
dwal...@fifo99.com writes: > On Fri, Jul 10, 2015 at 08:41:28AM -0500, Eric W. Biederman wrote: >> Hidehiro Kawai writes: >> >> > You can call panic notifiers and kmsg dumpers before kdump by >> > specifying "crash_kexec_post_notifiers" as a boot parameter. >> > However, it doesn't make sense if kdump is not available. In that >> > case, disable "crash_kexec_post_notifiers" boot parameter so that >> > you can't change the value of the parameter. >> >> Nacked-by: "Eric W. Biederman" > > I think it would make sense if he just replaced "kdump" with "kexec". It would be less insane, however it still makes no sense as without kexec on panic support crash_kexec is a noop. So the value of the seeting makes no difference. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available
Vivek Goyal writes: > On Tue, Jul 14, 2015 at 03:48:33PM +, dwal...@fifo99.com wrote: >> On Tue, Jul 14, 2015 at 11:40:40AM -0400, Vivek Goyal wrote: >> > On Tue, Jul 14, 2015 at 03:34:30PM +, dwal...@fifo99.com wrote: >> > > On Tue, Jul 14, 2015 at 11:02:08AM -0400, Vivek Goyal wrote: >> > > > On Tue, Jul 14, 2015 at 01:59:19PM +, dwal...@fifo99.com wrote: >> > > > > On Mon, Jul 13, 2015 at 08:19:45PM -0500, Eric W. Biederman wrote: >> > > > > > dwal...@fifo99.com writes: >> > > > > > >> > > > > > > On Fri, Jul 10, 2015 at 08:41:28AM -0500, Eric W. Biederman >> > > > > > > wrote: >> > > > > > >> Hidehiro Kawai writes: >> > > > > > >> >> > > > > > >> > You can call panic notifiers and kmsg dumpers before kdump by >> > > > > > >> > specifying "crash_kexec_post_notifiers" as a boot parameter. >> > > > > > >> > However, it doesn't make sense if kdump is not available. In >> > > > > > >> > that >> > > > > > >> > case, disable "crash_kexec_post_notifiers" boot parameter so >> > > > > > >> > that >> > > > > > >> > you can't change the value of the parameter. >> > > > > > >> >> > > > > > >> Nacked-by: "Eric W. Biederman" >> > > > > > > >> > > > > > > I think it would make sense if he just replaced "kdump" with >> > > > > > > "kexec". >> > > > > > >> > > > > > It would be less insane, however it still makes no sense as without >> > > > > > kexec on panic support crash_kexec is a noop. So the value of the >> > > > > > seeting makes no difference. >> > > > > >> > > > > Can you explain more, I don't really understand what you mean. Are >> > > > > you suggesting >> > > > > the whole "crash_kexec_post_notifiers" feature has no value ? >> > > > >> > > > Daniel, >> > > > >> > > > BTW, why are you using crash_kexec_post_notifiers commandline? Why not >> > > > without it? >> > > >> > > It was explained in the prior thread but to rehash, the notifiers are >> > > used to do a switch >> > > over from the crashed machine to another redundant machine. >> > >> > So why not detect failure using polling or issue notifications from second >> > kernel. >> > >> > IOW, expecting that a crashed machine will be able to deliver notification >> > reliably is falwed to begin with, IMHO. >> >> It's flawed to think you can kexec, but you still do it right ? I've not >> gotten into >> the deep details of this switching process, but that's how this interface is >> used. > > Sure. But the deal here is that users of interface know that sometimes it > can be unreliable. And in the absence of more reliable mechanism, somewhat > less reliable mechanism is fine. > >> >> > If a machine is failing, there are high chance it can't deliver you the >> > notification. Detecting that failure suing some kind of polling mechanism >> > might be more reliable. And it will make even kdump mechanism more >> > reliable so that it does not have to run panic notifiers after the crash. >> >> I think what your suggesting is that my company should change how it's >> hardware works >> and that's not really an option for me. This isn't a simple thing like >> checking over the >> network if the machine is down or not, this is way more complex hardware >> design. > > That means you are ready to live with an unreliable design. There might be > cases where notifier does not get run properly and you will not do switch > despite the fact that OS has failed. I was just trying to nudge you in > a direction which could be more reliable mechanism. Sigh I see some deep confusion going on here. The panic notifiers are just that panic notifiers. They have not been nor should they be tied to kexec. If those notifiers force a switch over of between machines I fail to see why you would care if it was kexec or another panic situation that is forcing that switchover. Now if you want a reliable design, I strongly recommend as I have been recommending for the 15 years that magic failover code be placed in either the new kernel or a stub that preceedes the new kernel. That gives the greatest reliabilty we know how to engineer, and it lets you do whatever you need to do. Especially if it is not desirable for the panic notifiers to run without the presence of kexec, I very strongly recommend not using them at all and just writing a stub of code that can run before a new kernel starts. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] panic: Disable crash_kexec_post_notifiers if kdump is not available
Vivek Goyal writes: > On Tue, Jul 14, 2015 at 05:29:53PM +, dwal...@fifo99.com wrote: > > [..] >> > >> > If a machine is failing, there are high chance it can't deliver you >> > >> > the >> > >> > notification. Detecting that failure suing some kind of polling >> > >> > mechanism >> > >> > might be more reliable. And it will make even kdump mechanism more >> > >> > reliable so that it does not have to run panic notifiers after the >> > >> > crash. >> > >> >> > >> I think what your suggesting is that my company should change how it's >> > >> hardware works >> > >> and that's not really an option for me. This isn't a simple thing like >> > >> checking over the >> > >> network if the machine is down or not, this is way more complex >> > >> hardware design. >> > > >> > > That means you are ready to live with an unreliable design. There might >> > > be >> > > cases where notifier does not get run properly and you will not do switch >> > > despite the fact that OS has failed. I was just trying to nudge you in >> > > a direction which could be more reliable mechanism. >> > >> > Sigh I see some deep confusion going on here. >> > >> > The panic notifiers are just that panic notifiers. They have not been >> > nor should they be tied to kexec. If those notifiers force a switch >> > over of between machines I fail to see why you would care if it was >> > kexec or another panic situation that is forcing that switchover. >> >> Hidehiro isn't fixing the failover situation on my side, he's fixing register >> information collection when crash_kexec_post_notifiers is used. > > Sure. Given that we have created this new parameter, let us fix it so that > we can capture the other cpu register state in crash dump. > > I am little disappointed that it was not tested well when this parameter was > introuced. We should have atleast tested it to the extent to see if there > is proper cpu state present for all cpus in the crash dump. > > At that point of time it looked like a simple modification > to allow panic notifiers before crash_kexec(). Either that or we say no one cares enough, and it known broken so let's just revert the fool thing. I honestly can't see how to support panic notifiers, before kexec. There is no way to tell what is being done and all of the pieces including smp_send_stop are known to be buggy. It isn't like this latest set of patches was reviewed/tested much better, as the first patch was wrong. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/8] binfmt_elf: open code copy_siginfo_to_user to kernelspace buffer
Christoph Hellwig writes: > On Wed, Apr 15, 2020 at 10:20:11AM +0200, Arnd Bergmann wrote: >> > I'd rather keep it out of this series and to >> > an interested party. Then again x32 doesn't seem to have a whole lot >> > of interested parties.. >> >> Fine with me. It's on my mental list of things that we want to kill off >> eventually as soon as the remaining users stop replying to questions >> about it. >> >> In fact I should really turn that into a properly maintained list in >> Documentation/... that contains any options that someone has >> asked about removing in the past, along with the reasons for keeping >> it around and a time at which we should ask about it again. > > To the newly added x86 maintainers: Arnd brought up the point that > elf_core_dump writes the ABI siginfo format into the core dump. That > format differs for i386 vs x32. Is there any good way to find out > which is the right format when are not in a syscall? > > As far a I can tell x32 vs i386 just seems to be based around what > syscall table was used for the current syscall, but core dumps aren't > always in syscall context. I don't think this matters. The i386 and x32 signal structures only differ for SIGCHLD. The SIGCHLD signal does cause coredumps. So as long as we get the 32bit vs 64bit distinct correct all should be well. Eric
Re: [PATCH 2/8] signal: clean up __copy_siginfo_to_user32
Christoph Hellwig writes: > Instead of an architecture specific calling convention in common code > just pass a flags argument with architecture specific values. This bothers me because it makes all architectures pay for the sins of x32. Further it starts burying the details of the what is happening in architecture specific helpers. Hiding the fact that there is only one niche architecture that does anything weird. I am very sensitive to hiding away signal handling details right now because way to much of the signal handling code got hidden in architecture specific files and was quite buggy because as a result. My general sense is putting all of the weird details up front and center in kernel/signal.c is the only way for this code will be looked at and successfully maintained. How about these patches to solve set_fs with binfmt_elf instead: Eric W. Biederman (2): signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32 signal: Remove the set_fs in binfmt_elf.c:fill_siginfo_note fs/binfmt_elf.c| 5 + fs/compat_binfmt_elf.c | 2 +- include/linux/compat.h | 1 + include/linux/signal.h | 7 +++ kernel/signal.c| 108 ++-- Eric
[PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32
To remove the use of set_fs in the coredump code there needs to be a way to convert a kernel siginfo to a userspace compat siginfo. Call that function copy_siginfo_to_compat and factor it out of copy_siginfo_to_user32. The existence of x32 complicates this code. On x32 SIGCHLD uses 64bit times for utime and stime. As only SIGCHLD is affected and SIGCHLD never causes a coredump I have avoided handling that case. Signed-off-by: "Eric W. Biederman" --- include/linux/compat.h | 1 + kernel/signal.c| 108 +++-- 2 files changed, 63 insertions(+), 46 deletions(-) diff --git a/include/linux/compat.h b/include/linux/compat.h index 0480ba4db592..4962b254e550 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -402,6 +402,7 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask, unsigned long bitmap_size); long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask, unsigned long bitmap_size); +void copy_siginfo_to_external32(struct compat_siginfo *to, const struct kernel_siginfo *from); int copy_siginfo_from_user32(kernel_siginfo_t *to, const struct compat_siginfo __user *from); int copy_siginfo_to_user32(struct compat_siginfo __user *to, const kernel_siginfo_t *from); int get_compat_sigevent(struct sigevent *event, diff --git a/kernel/signal.c b/kernel/signal.c index e58a6c619824..578f196898cb 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3235,90 +3235,106 @@ int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from) } #ifdef CONFIG_COMPAT -int copy_siginfo_to_user32(struct compat_siginfo __user *to, - const struct kernel_siginfo *from) -#if defined(CONFIG_X86_X32_ABI) || defined(CONFIG_IA32_EMULATION) +void copy_siginfo_to_external32(struct compat_siginfo *to, + const struct kernel_siginfo *from) { - return __copy_siginfo_to_user32(to, from, in_x32_syscall()); -} -int __copy_siginfo_to_user32(struct compat_siginfo __user *to, -const struct kernel_siginfo *from, bool x32_ABI) -#endif -{ - struct compat_siginfo new; - memset(&new, 0, sizeof(new)); + /* +* This function does not work properly for SIGCHLD on x32, +* but it does not need to as SIGCHLD never causes a coredump. +*/ + memset(to, 0, sizeof(*to)); - new.si_signo = from->si_signo; - new.si_errno = from->si_errno; - new.si_code = from->si_code; + to->si_signo = from->si_signo; + to->si_errno = from->si_errno; + to->si_code = from->si_code; switch(siginfo_layout(from->si_signo, from->si_code)) { case SIL_KILL: - new.si_pid = from->si_pid; - new.si_uid = from->si_uid; + to->si_pid = from->si_pid; + to->si_uid = from->si_uid; break; case SIL_TIMER: - new.si_tid = from->si_tid; - new.si_overrun = from->si_overrun; - new.si_int = from->si_int; + to->si_tid = from->si_tid; + to->si_overrun = from->si_overrun; + to->si_int = from->si_int; break; case SIL_POLL: - new.si_band = from->si_band; - new.si_fd = from->si_fd; + to->si_band = from->si_band; + to->si_fd = from->si_fd; break; case SIL_FAULT: - new.si_addr = ptr_to_compat(from->si_addr); + to->si_addr = ptr_to_compat(from->si_addr); #ifdef __ARCH_SI_TRAPNO - new.si_trapno = from->si_trapno; + to->si_trapno = from->si_trapno; #endif break; case SIL_FAULT_MCEERR: - new.si_addr = ptr_to_compat(from->si_addr); + to->si_addr = ptr_to_compat(from->si_addr); #ifdef __ARCH_SI_TRAPNO - new.si_trapno = from->si_trapno; + to->si_trapno = from->si_trapno; #endif - new.si_addr_lsb = from->si_addr_lsb; + to->si_addr_lsb = from->si_addr_lsb; break; case SIL_FAULT_BNDERR: - new.si_addr = ptr_to_compat(from->si_addr); + to->si_addr = ptr_to_compat(from->si_addr); #ifdef __ARCH_SI_TRAPNO - new.si_trapno = from->si_trapno; + to->si_trapno = from->si_trapno; #endif - new.si_lower = ptr_to_compat(from->si_lower); - new.si_upper = ptr_to_compat(from->si_upper); + to->si_lower = ptr_to_compat(from->si_lower); + to->si_upper = ptr_to_compat(from->si_upper);
[PATCH 2/2] signal: Remove the set_fs in binfmt_elf.c:fill_siginfo_note
The code in binfmt_elf.c is differnt from the rest of the code that processes siginfo, as it sends siginfo from a kernel buffer to a file rather than from kernel memory to userspace buffers. To remove it's use of set_fs the code needs some different siginfo helpers. Add the helper copy_siginfo_to_external to copy from the kernel's internal siginfo layout to a buffer in the siginfo layout that userspace expects. Modify fill_siginfo_note to use copy_siginfo_to_external instead of set_fs and copy_siginfo_to_user. Update compat_binfmt_elf.c to use the previously added copy_siginfo_to_external32 to handle the compat case. Signed-off-by: "Eric W. Biederman" --- fs/binfmt_elf.c| 5 + fs/compat_binfmt_elf.c | 2 +- include/linux/signal.h | 7 +++ 3 files changed, 9 insertions(+), 5 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 13f25e241ac4..a1f57e20c3cf 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1556,10 +1556,7 @@ static void fill_auxv_note(struct memelfnote *note, struct mm_struct *mm) static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata, const kernel_siginfo_t *siginfo) { - mm_segment_t old_fs = get_fs(); - set_fs(KERNEL_DS); - copy_siginfo_to_user((user_siginfo_t __user *) csigdata, siginfo); - set_fs(old_fs); + copy_siginfo_to_external(csigdata, siginfo); fill_note(note, "CORE", NT_SIGINFO, sizeof(*csigdata), csigdata); } diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c index aaad4ca1217e..fa0e24e1b726 100644 --- a/fs/compat_binfmt_elf.c +++ b/fs/compat_binfmt_elf.c @@ -39,7 +39,7 @@ */ #define user_long_tcompat_long_t #define user_siginfo_t compat_siginfo_t -#define copy_siginfo_to_user copy_siginfo_to_user32 +#define copy_siginfo_to_external copy_siginfo_to_external32 /* * The machine-dependent core note format types are defined in elfcore-compat.h, diff --git a/include/linux/signal.h b/include/linux/signal.h index 05bacd2ab135..c1796321cadb 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -24,6 +24,13 @@ static inline void clear_siginfo(kernel_siginfo_t *info) #define SI_EXPANSION_SIZE (sizeof(struct siginfo) - sizeof(struct kernel_siginfo)) +static inline void copy_siginfo_to_external(siginfo_t *to, + const kernel_siginfo_t *from) +{ + memcpy(to, from, sizeof(*from)); + memset(((char *)to) + sizeof(struct kernel_siginfo), 0, SI_EXPANSION_SIZE); +} + int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from); int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from); -- 2.25.0
Re: remove set_fs calls from the exec and coredump code v2
Christoph Hellwig writes: > Hi all, > > this series gets rid of playing with the address limit in the exec and > coredump code. Most of this was fairly trivial, the biggest changes are > those to the spufs coredump code. > > Changes since v1: > - properly spell NUL > - properly handle the compat siginfo case in ELF coredumps Quick question is exec from a kernel thread within the scope of what you are looking at? There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears to be to allow exec from kernel threads. Where the kernel threads run with set_fs(KERNEL_DS) until they call exec. Eric
Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32
Christophe Leroy writes: > Le 17/04/2020 à 23:09, Eric W. Biederman a écrit : >> >> To remove the use of set_fs in the coredump code there needs to be a >> way to convert a kernel siginfo to a userspace compat siginfo. >> >> Call that function copy_siginfo_to_compat and factor it out of >> copy_siginfo_to_user32. > > I find it a pitty to do that. > > The existing function could have been easily converted to using > user_access_begin() + user_access_end() and use unsafe_put_user() to copy to > userspace to avoid copying through a temporary structure on the stack. > > With your change, it becomes impossible to do that. I don't follow. You don't like temporary structures in the coredump code or temporary structures in copy_siginfo_to_user32? A temporary structure in copy_siginfo_to_user is pretty much required so that it can be zeroed to guarantee we don't pass a structure with holes to userspace. The implementation of copy_siginfo_to_user32 used to use the equivalent of user_access_begin() and user_access_end() and the code was a mess that was very difficult to reason about. I recall their being holes in the structure that were being copied to userspace. Meanwhile if you are going to set all of the bytes a cache hot temporary structure is quite cheap. > Is that really an issue to use that set_fs() in the coredump code ? Using set_fs() is pretty bad and something that we would like to remove from the kernel entirely. The fewer instances of set_fs() we have the better. I forget all of the details but set_fs() is both a type violation and an attack point when people are attacking the kernel. The existence of set_fs() requires somethings that should be constants to be variables. Something about that means that our current code is difficult to protect from spectre style vulnerabilities. There was a very good thread about it all in I think 2018 but unfortunately I can't find it now. Eric
Re: remove set_fs calls from the exec and coredump code v2
Christoph Hellwig writes: > On Fri, Apr 17, 2020 at 05:41:52PM -0500, Eric W. Biederman wrote: >> > this series gets rid of playing with the address limit in the exec and >> > coredump code. Most of this was fairly trivial, the biggest changes are >> > those to the spufs coredump code. >> > >> > Changes since v1: >> > - properly spell NUL >> > - properly handle the compat siginfo case in ELF coredumps >> >> Quick question is exec from a kernel thread within the scope of what you >> are looking at? >> >> There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears >> to be to allow exec from kernel threads. Where the kernel threads >> run with set_fs(KERNEL_DS) until they call exec. > > This series doesn't really look at that area. But I don't think exec > from a kernel thread makes any sense, and cleaning up how to set the > initial USER_DS vs KERNEL_DS state is something I'll eventually get to, > it seems like a major mess at the moment. Fair enough. I just wanted to make certain that it is on people's radar that when the kernel exec's init the arguments are read from kernel memory and the set_fs(USER_DS) in flush_old_exec() that makes that not work later. It is subtle and easy to miss. So I figured I would mention it since I have been staring at the exec code a lot lately. Eric
Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
David Hildenbrand writes: >>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't >>> pass the efi, it won't get the SRAT table correctly, if I remember >>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with >>> ACPI only, this won't happen on bare metal though. Need check carefully. >>> I have been using kvm guest with uefi firmwire recently. >> >> Yeah, I can imagine that bare metal is different. kvm only uses ACPI. >> >> I'm also asking because of virtio-mem. Memory added via virtio-mem is >> not part of any efi tables or whatsoever. So I assume the kexec kernel >> will not detect it automatically (good!), instead load the virtio-mem >> driver and let it add memory back to the system. >> >> I should probably play with kexec and virtio-mem once I have some spare >> cycles ... to find out what's broken and needs to be addressed :) > > FWIW, I just gave virtio-mem and kexec/kdump a try. > > a) kdump seems to work. Memory added by virtio-mem is getting dumped. > The kexec kernel only uses memory in the crash region. The virtio-mem > driver properly bails out due to is_kdump_kernel(). > > b) "kexec -s -l" seems to work fine. For now, the kernel does not seem > to get placed on virtio-mem memory (pure luck due to the left-to-right > search). Memory added by virtio-mem is not getting added to the e820 > map. Once the virtio-mem driver comes back up in the kexec kernel, the > right memory is readded. This sounds like a bug. > c) "kexec -c -l" does not work properly. All memory added by virtio-mem > is added to the e820 map, which is wrong. Memory that should not be > touched will be touched by the kexec kernel. I assume kexec-tools just > goes ahead and adds anything it can find in /proc/iomem (or > /sys/firmware/memmap/) to the e820 map of the new kernel. > > Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is > similarly added to the e820 map and, therefore, won't be able to be > onlined MOVABLE easily. This sounds like correct behavior to me. If you add memory to the system it is treated as memory to the system. If we need to make it a special kind of memory with special rules we can have some kind of special marking for the memory. But hotplugged is not in itself a sufficient criteria to say don't use this as normal memory. If take a huge server and I plug in an extra dimm it is just memory. For a similarly huge server I might want to have memory that the system booted with unpluggable, in case hardware error reporting notices a dimm generating a lot of memory errors. Now perhaps virtualization needs a special tier of memory that should only be used for cases where the memory is easily movable. I am not familiar with virtio-mem but my skim of the initial design is that virtio-mem was not designed to be such a special tier of memory. Perhaps something has changed? https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html > At least for virtio-mem, I would either have to > a) Not support "kexec -c -l". A viable option if we would be planning on > not supporting it either way in the long term. I could block this > in-kernel somehow eventually. No. > b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by > indicating it in /proc/iomem in a special way ("System RAM > (hotplugged)"/"System RAM (virtio-mem)"). How does the kernel memory allocator treat this memory? The logic is simple. If the kernel memory allocator treats that memory as ordinary memory available for all uses it should be presented as ordinary memory available for all uses. If the kernel memory allocator treats that memory as special memory only available for uses that we can easily free later and give back to the system. AKA it is special and not oridinary memory we should mark it as such. Eric p.s. Please excuse me for jumping in I may be missing some important context, but what I read when I saw this message in my inbox just seemed very wrong.
Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
David Hildenbrand writes: > Some devices/drivers that add memory via add_memory() and friends (e.g., > dax/kmem, but also virtio-mem in the future) don't want to create entries > in /sys/firmware/memmap/ - primarily to hinder kexec from adding this > memory to the boot memmap of the kexec kernel. > > In fact, such memory is never exposed via the firmware memmap as System > RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is > wrong: > "kexec needs the raw firmware-provided memory map to setup the > parameter segment of the kernel that should be booted with > kexec. Also, the raw memory map is useful for debugging. For > that reason, /sys/firmware/memmap is an interface that provides > the raw memory map to userspace." [1] > > We don't have to worry about firmware_map_remove() on the removal path. > If there is no entry, it will simply return with -EINVAL. > > [1] > https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap You know what this justification is rubbish, and I have previously explained why it is rubbish. Nacked-by: "Eric W. Biederman" This needs to be based on weather the added memory is ultimately normal ram or is something special. At least when we are talking memory resources. Keeping it out of the firmware map that is fine. If the hotplugged memory is the result of plugging a stick of ram into the kernel and can and should used be like any other memory it should be treated like any normal memory. If the hotplugged memory is something special it should be treated as something special. Justifying behavior by documentation that does not consider memory hotplug is bad thinking. > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Pankaj Gupta > Cc: Wei Yang > Cc: Baoquan He > Cc: Eric Biederman > Signed-off-by: David Hildenbrand > --- > include/linux/memory_hotplug.h | 8 > mm/memory_hotplug.c| 3 ++- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 0151fb935c09..4ca418a731eb 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -68,6 +68,14 @@ struct mhp_params { > pgprot_t pgprot; > }; > > +/* Flags used for add_memory() and friends. */ > + > +/* > + * Don't create entries in /sys/firmware/memmap/. The memory is detected and > + * added via a device driver, not via the initial (firmware) memmap. > + */ > +#define MHP_NO_FIRMWARE_MEMMAP 1 > + > /* > * Zone resizing functions > * > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index c01be92693e3..e94ede9cad00 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1062,7 +1062,8 @@ int __ref add_memory_resource(int nid, struct resource > *res, > BUG_ON(ret); > > /* create new memmap entry */ > - firmware_map_add_hotplug(start, start + size, "System RAM"); > + if (!(flags & MHP_NO_FIRMWARE_MEMMAP)) > + firmware_map_add_hotplug(start, start + size, "System RAM"); > > /* device_online() will take the lock when calling online_pages() */ > mem_hotplug_done();
Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
David Hildenbrand writes: > On 30.04.20 17:38, Eric W. Biederman wrote: >> David Hildenbrand writes: >> >>> Some devices/drivers that add memory via add_memory() and friends (e.g., >>> dax/kmem, but also virtio-mem in the future) don't want to create entries >>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this >>> memory to the boot memmap of the kexec kernel. >>> >>> In fact, such memory is never exposed via the firmware memmap as System >>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is >>> wrong: >>> "kexec needs the raw firmware-provided memory map to setup the >>> parameter segment of the kernel that should be booted with >>> kexec. Also, the raw memory map is useful for debugging. For >>> that reason, /sys/firmware/memmap is an interface that provides >>> the raw memory map to userspace." [1] >>> >>> We don't have to worry about firmware_map_remove() on the removal path. >>> If there is no entry, it will simply return with -EINVAL. >>> >>> [1] >>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap >> >> >> You know what this justification is rubbish, and I have previously >> explained why it is rubbish. > > Actually, no, I don't think it is rubbish. See patch #3 and the cover > letter why this is the right thing to do *for special memory*, *not > ordinary DIMMs*. > > And to be quite honest, I think your response is a little harsh. I don't > recall you replying to my virtio-mem-related comments. > >> >> Nacked-by: "Eric W. Biederman" >> >> This needs to be based on weather the added memory is ultimately normal >> ram or is something special. > > Yes, that's what the caller are expected to decide, see patch #3. > > kexec should try to be as closely as possible to a real reboot - IMHO. That is very fuzzy in terms of hotplug memory. The kexec'd kernel should see the hotplugged memory assuming it is ordinary memory. But kexec is not a reboot although it is quite similar. Kexec is swapping one running kernel and it's state for another kernel without rebooting. >> Justifying behavior by documentation that does not consider memory >> hotplug is bad thinking. > > Are you maybe confusing this patch series with the arm64 approach? This > is not about ordinary hotplugged DIMMs. I think I am. My challenge is that I don't see anything in the description that says this isn't about ordinary hotplugged DIMMs. All I saw was hotplug memory. If the class of memory is different then please by all means let's mark it differently in struct resource so everyone knows it is different. But that difference needs to be more than hotplug. That difference needs to be the hypervisor loaned us memory and might take it back at any time, or this memory is persistent and so it has these different characteristics so don't use it as ordinary ram. That information is also useful to other people looking at the system and seeing what is going on. Just please don't muddle the concepts, or assume that whatever subset of hotplug memory you are dealing with is the only subset. I didn't see that flag making the distinction about the kind of memory it is. Eric
Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
David Hildenbrand writes: > On 30.04.20 18:33, Eric W. Biederman wrote: >> David Hildenbrand writes: >> >>> On 30.04.20 17:38, Eric W. Biederman wrote: >>>> David Hildenbrand writes: >>>> >>>>> Some devices/drivers that add memory via add_memory() and friends (e.g., >>>>> dax/kmem, but also virtio-mem in the future) don't want to create entries >>>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this >>>>> memory to the boot memmap of the kexec kernel. >>>>> >>>>> In fact, such memory is never exposed via the firmware memmap as System >>>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is >>>>> wrong: >>>>> "kexec needs the raw firmware-provided memory map to setup the >>>>> parameter segment of the kernel that should be booted with >>>>> kexec. Also, the raw memory map is useful for debugging. For >>>>> that reason, /sys/firmware/memmap is an interface that provides >>>>> the raw memory map to userspace." [1] >>>>> >>>>> We don't have to worry about firmware_map_remove() on the removal path. >>>>> If there is no entry, it will simply return with -EINVAL. >>>>> >>>>> [1] >>>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap >>>> >>>> >>>> You know what this justification is rubbish, and I have previously >>>> explained why it is rubbish. >>> >>> Actually, no, I don't think it is rubbish. See patch #3 and the cover >>> letter why this is the right thing to do *for special memory*, *not >>> ordinary DIMMs*. >>> >>> And to be quite honest, I think your response is a little harsh. I don't >>> recall you replying to my virtio-mem-related comments. >>> >>>> >>>> Nacked-by: "Eric W. Biederman" >>>> >>>> This needs to be based on weather the added memory is ultimately normal >>>> ram or is something special. >>> >>> Yes, that's what the caller are expected to decide, see patch #3. >>> >>> kexec should try to be as closely as possible to a real reboot - IMHO. >> >> That is very fuzzy in terms of hotplug memory. The kexec'd kernel >> should see the hotplugged memory assuming it is ordinary memory. >> >> But kexec is not a reboot although it is quite similar. Kexec is >> swapping one running kernel and it's state for another kernel without >> rebooting. > > I agree (especially regarding the arm64 DIMM hotplug discussion). > However, for the two cases > > a) dax/kmem > b) virtio-mem > > We really want to let the driver take back control and figure out "what > to do with the memory". >From reading your v1 cover letter (the description appears missing in v2) I see what you are talking about with respect to virtio-mem. So I will count virt-io mem as something different. >>>> Justifying behavior by documentation that does not consider memory >>>> hotplug is bad thinking. >>> >>> Are you maybe confusing this patch series with the arm64 approach? This >>> is not about ordinary hotplugged DIMMs. >> >> I think I am. >> >> My challenge is that I don't see anything in the description that says >> this isn't about ordinary hotplugged DIMMs. All I saw was hotplug >> memory. > > I'm sorry if that was confusing, I tried to stress that kmem and > virtio-mem is special in the description. > > I squeezed a lot of that information into the cover letter and into > patch #3. >> If the class of memory is different then please by all means let's mark >> it differently in struct resource so everyone knows it is different. >> But that difference needs to be more than hotplug. >> >> That difference needs to be the hypervisor loaned us memory and might >> take it back at any time, or this memory is persistent and so it has >> these different characteristics so don't use it as ordinary ram. > > Yes, and I think kmem took an excellent approach of explicitly putting > that "System RAM" into a resource hierarchy. That "System RAM" won't > show up as a root node under /proc/iomem (see patch #3), which already > results in kexec-tools to treat it in a special way. I am thinking about > doing the same for virtio-mem. Reading this and your patch cover letters again my
Re: remove set_fs calls from the coredump code v6
Linus Torvalds writes: > On Tue, May 5, 2020 at 3:13 AM Christoph Hellwig wrote: >> >> this series gets rid of playing with the address limit in the exec and >> coredump code. Most of this was fairly trivial, the biggest changes are >> those to the spufs coredump code. > > Ack, nice, and looks good. > > The only part I dislike is how we have that 'struct compat_siginfo' on > the stack, which is a huge waste (most of it is the nasty padding to > 128 bytes). > > But that's not new, I only reacted to it because the code moved a bit. > We cleaned up the regular siginfo to not have the padding in the > kernel (and by "we" I mean "Eric Biederman did it after some prodding > as part of his siginfo cleanups" - see commit 4ce5f9c9e754 "signal: > Use a smaller struct siginfo in the kernel"), and I wonder if we > could do something similar with that compat thing. > > 128 bytes of wasted kernel stack isn't the end of the world, but it's > sad when the *actual* data is only 32 bytes or so. We probably can. After introducing a kernel_compat_siginfo that is the size that userspace actually would need. It isn't something I want to mess with until this code gets merged, as I think the set_fs cleanups are more important. Christoph made some good points about how ugly the #ifdefs are in the generic copy_siginfo_to_user32 implementation. I am thinking the right fix is to introduce. - TS_X32 as a companion to TS_COMPAT in the x86_64. - Modify in_x32_syscall() to test TS_X32 - Implement x32_copy_siginfo_to_user32 that forces TS_X32 to be set. AKA: x32_copy_siginfo_to_user32() { unsigned long state = current_thread_info()->state; current_thread_info()->state |= TS_X32; copy_siginfo_to_user32(); current_thread_info()->state = state; } That would make the #ifdefs go away, but I don't yet know what the x86 maintainers would say about that scheme. I think it is a good path as it would isolate the runtime cost of that weird SIGCHLD siginfo format to just x32. Then ia32 in compat mode would not need to pay. Once I get that then it will be easier to introduce a yet another helper of copy_siginfo_to_user32 that generates just the kernel_compat_siginfo part, and the two visible derivatives can call memset and clear_user to clear the unset parts. I am assuming you don't don't mind having a full siginfo in elf_note_info that ultimately gets copied into the core dump? Eric
Re: remove set_fs calls from the coredump code v6
Christoph Hellwig writes: > On Tue, May 05, 2020 at 03:28:50PM -0500, Eric W. Biederman wrote: >> We probably can. After introducing a kernel_compat_siginfo that is >> the size that userspace actually would need. >> >> It isn't something I want to mess with until this code gets merged, as I >> think the set_fs cleanups are more important. >> >> >> Christoph made some good points about how ugly the #ifdefs are in >> the generic copy_siginfo_to_user32 implementation. > > Take a look at the series you are replying to, the magic x86 ifdefs are > entirely gone from the common code :) Interesting. That is a different way of achieving that, and I don't hate it. I still want whatever you are doing to settle before I touch that code again. Removing the set_fs is important and I have other fish to fry at the moment. Eric
[REVIEW][PATCH 1/9] signal/powerpc: Use force_sig_mceerr as appropriate
In do_sigbus isolate the mceerr signaling code and call force_sig_mceerr instead of falling through to the force_sig_info that works for all of the other signals. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/mm/fault.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index d51cf5f4e45e..22d7f8748cd7 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -158,7 +158,6 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address, vm_fault_t fault) { siginfo_t info; - unsigned int lsb = 0; if (!user_mode(regs)) return SIGBUS; @@ -171,17 +170,22 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address, info.si_addr = (void __user *)address; #ifdef CONFIG_MEMORY_FAILURE if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) { + unsigned int lsb = 0; /* shutup gcc */ + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n", current->comm, current->pid, address); - info.si_code = BUS_MCEERR_AR; + + if (fault & VM_FAULT_HWPOISON_LARGE) + lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault)); + if (fault & VM_FAULT_HWPOISON) + lsb = PAGE_SHIFT; + + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, +current); + return 0; } - if (fault & VM_FAULT_HWPOISON_LARGE) - lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault)); - if (fault & VM_FAULT_HWPOISON) - lsb = PAGE_SHIFT; #endif - info.si_addr_lsb = lsb; force_sig_info(SIGBUS, &info, current); return 0; } -- 2.17.1
[REVIEW][PATCH 2/9] signal/powerpc: Remove pkey parameter from __bad_area
There are no callers of __bad_area that pass in a pkey parameter so it makes no sense to take one. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/mm/fault.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 22d7f8748cd7..e5725fa96a48 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -124,8 +124,7 @@ static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long add return __bad_area_nosemaphore(regs, address, SEGV_MAPERR, 0); } -static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code, - int pkey) +static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code) { struct mm_struct *mm = current->mm; @@ -135,12 +134,12 @@ static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code, */ up_read(&mm->mmap_sem); - return __bad_area_nosemaphore(regs, address, si_code, pkey); + return __bad_area_nosemaphore(regs, address, si_code, 0); } static noinline int bad_area(struct pt_regs *regs, unsigned long address) { - return __bad_area(regs, address, SEGV_MAPERR, 0); + return __bad_area(regs, address, SEGV_MAPERR); } static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address, @@ -151,7 +150,7 @@ static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address, static noinline int bad_access(struct pt_regs *regs, unsigned long address) { - return __bad_area(regs, address, SEGV_ACCERR, 0); + return __bad_area(regs, address, SEGV_ACCERR); } static int do_sigbus(struct pt_regs *regs, unsigned long address, -- 2.17.1
[REVIEW][PATCH 3/9] signal/powerpc: Call _exception_pkey directly from bad_key_fault_exception
This removes the need for other code paths to deal with pkey exceptions. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/mm/fault.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index e5725fa96a48..5afc1ee55043 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -145,7 +145,17 @@ static noinline int bad_area(struct pt_regs *regs, unsigned long address) static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address, int pkey) { - return __bad_area_nosemaphore(regs, address, SEGV_PKUERR, pkey); + /* +* If we are in kernel mode, bail out with a SEGV, this will +* be caught by the assembly which will restore the non-volatile +* registers before calling bad_page_fault() +*/ + if (!user_mode(regs)) + return SIGSEGV; + + _exception_pkey(SIGSEGV, regs, SEGV_PKUERR, address, pkey); + + return 0; } static noinline int bad_access(struct pt_regs *regs, unsigned long address) -- 2.17.1
[REVIEW][PATCH 4/9] signal/powerpc: Remove pkey parameter from __bad_area_nosemaphore
Now that bad_key_fault_exception no longer calls __bad_area_nosemaphore there is no reason for __bad_area_nosemaphore to handle pkeys. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/mm/fault.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5afc1ee55043..a84d06b7d50d 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -103,8 +103,7 @@ static bool store_updates_sp(unsigned int inst) */ static int -__bad_area_nosemaphore(struct pt_regs *regs, unsigned long address, int si_code, - int pkey) +__bad_area_nosemaphore(struct pt_regs *regs, unsigned long address, int si_code) { /* * If we are in kernel mode, bail out with a SEGV, this will @@ -114,14 +113,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long address, int si_code, if (!user_mode(regs)) return SIGSEGV; - _exception_pkey(SIGSEGV, regs, si_code, address, pkey); + _exception(SIGSEGV, regs, si_code, address); return 0; } static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long address) { - return __bad_area_nosemaphore(regs, address, SEGV_MAPERR, 0); + return __bad_area_nosemaphore(regs, address, SEGV_MAPERR); } static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code) @@ -134,7 +133,7 @@ static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code) */ up_read(&mm->mmap_sem); - return __bad_area_nosemaphore(regs, address, si_code, 0); + return __bad_area_nosemaphore(regs, address, si_code); } static noinline int bad_area(struct pt_regs *regs, unsigned long address) -- 2.17.1
[REVIEW][PATCH 5/9] signal/powerpc: Factor the common exception code into exception_common
It is brittle and wrong to populate si_pkey when there was not a pkey exception. The field does not exist for all si_codes and in some cases another field exists in the same memory location. So factor out the code that all exceptions handlers must run into exception_common, leaving the individual exception handlers to generate the signals themselves. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/kernel/traps.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index f651fa91cdc9..f6c778b5144f 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -338,14 +338,12 @@ static void show_signal_msg(int signr, struct pt_regs *regs, int code, show_user_instructions(regs); } -void _exception_pkey(int signr, struct pt_regs *regs, int code, -unsigned long addr, int key) +static bool exception_common(int signr, struct pt_regs *regs, int code, + unsigned long addr) { - siginfo_t info; - if (!user_mode(regs)) { die("Exception in kernel mode", regs, signr); - return; + return false; } show_signal_msg(signr, regs, code, addr); @@ -361,6 +359,16 @@ void _exception_pkey(int signr, struct pt_regs *regs, int code, */ thread_pkey_regs_save(¤t->thread); + return true; +} + +void _exception_pkey(int signr, struct pt_regs *regs, int code, unsigned long addr, int key) +{ + siginfo_t info; + + if (!exception_common(signr, regs, code, addr)) + return; + clear_siginfo(&info); info.si_signo = signr; info.si_code = code; -- 2.17.1
[REVIEW][PATCH 6/9] signal/powerpc: Call force_sig_fault from _exception
The callers of _exception don't need the pkey exception logic because they are not processing a pkey exception. So just call exception_common directly and then call force_sig_fault to generate the appropriate siginfo and deliver the appropriate signal. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/kernel/traps.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index f6c778b5144f..c38bec51dd84 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -380,7 +380,10 @@ void _exception_pkey(int signr, struct pt_regs *regs, int code, unsigned long ad void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr) { - _exception_pkey(signr, regs, code, addr, 0); + if (!exception_common(signr, regs, code, addr)) + return; + + force_sig_fault(signr, code, (void __user *)addr, current); } void system_reset_exception(struct pt_regs *regs) -- 2.17.1
[REVIEW][PATCH 7/9] signal/poewrpc: Specialize _exception_pkey for handling pkey exceptions
Now that _exception no longer calls _exception_pkey it is no longer necessary to handle any signal with any si_code. All pkey exceptions are SIGSEGV with paired with SEGV_PKUERR. So just handle that case and remove the now unnecessary parameters from _exception_pkey. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/include/asm/bug.h | 2 +- arch/powerpc/kernel/traps.c| 10 +- arch/powerpc/mm/fault.c| 2 +- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index fd06dbe7d7d3..fed7e6241349 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -133,7 +133,7 @@ struct pt_regs; extern int do_page_fault(struct pt_regs *, unsigned long, unsigned long); extern void bad_page_fault(struct pt_regs *, unsigned long, int); extern void _exception(int, struct pt_regs *, int, unsigned long); -extern void _exception_pkey(int, struct pt_regs *, int, unsigned long, int); +extern void _exception_pkey(struct pt_regs *, unsigned long, int); extern void die(const char *, struct pt_regs *, long); extern bool die_will_crash(void); extern void panic_flush_kmsg_start(void); diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index c38bec51dd84..e5ea69222459 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -362,20 +362,20 @@ static bool exception_common(int signr, struct pt_regs *regs, int code, return true; } -void _exception_pkey(int signr, struct pt_regs *regs, int code, unsigned long addr, int key) +void _exception_pkey(struct pt_regs *regs, unsigned long addr, int key) { siginfo_t info; - if (!exception_common(signr, regs, code, addr)) + if (!exception_common(SIGSEGV, regs, SEGV_PKUERR, addr)) return; clear_siginfo(&info); - info.si_signo = signr; - info.si_code = code; + info.si_signo = SIGSEGV; + info.si_code = SEGV_PKUERR; info.si_addr = (void __user *) addr; info.si_pkey = key; - force_sig_info(signr, &info, current); + force_sig_info(info.si_signo, &info, current); } void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index a84d06b7d50d..406d0e0ef096 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -152,7 +152,7 @@ static int bad_key_fault_exception(struct pt_regs *regs, unsigned long address, if (!user_mode(regs)) return SIGSEGV; - _exception_pkey(SIGSEGV, regs, SEGV_PKUERR, address, pkey); + _exception_pkey(regs, address, pkey); return 0; } -- 2.17.1
[REVIEW][PATCH 8/9] signal/powerpc: Simplify _exception_pkey by using force_sig_pkuerr
Call force_sig_pkuerr directly instead of rolling it by hand in _exception_pkey. Signed-off-by: "Eric W. Biederman" --- arch/powerpc/kernel/traps.c | 10 +- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index e5ea69222459..ab1bd06d7c44 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -364,18 +364,10 @@ static bool exception_common(int signr, struct pt_regs *regs, int code, void _exception_pkey(struct pt_regs *regs, unsigned long addr, int key) { - siginfo_t info; - if (!exception_common(SIGSEGV, regs, SEGV_PKUERR, addr)) return; - clear_siginfo(&info); - info.si_signo = SIGSEGV; - info.si_code = SEGV_PKUERR; - info.si_addr = (void __user *) addr; - info.si_pkey = key; - - force_sig_info(info.si_signo, &info, current); + force_sig_pkuerr((void __user *) addr, key); } void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr) -- 2.17.1
[REVIEW][PATCH 9/9] signal/powerpc: Use force_sig_fault where appropriate
Signed-off-by: "Eric W. Biederman" --- arch/powerpc/kernel/process.c | 9 +--- arch/powerpc/mm/fault.c | 9 +--- arch/powerpc/platforms/cell/spu_base.c| 4 ++-- arch/powerpc/platforms/cell/spufs/fault.c | 26 +++ 4 files changed, 12 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 913c5725cdb2..553a396e7fc1 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -620,8 +620,6 @@ void do_send_trap(struct pt_regs *regs, unsigned long address, void do_break (struct pt_regs *regs, unsigned long address, unsigned long error_code) { - siginfo_t info; - current->thread.trap_nr = TRAP_HWBKPT; if (notify_die(DIE_DABR_MATCH, "dabr_match", regs, error_code, 11, SIGSEGV) == NOTIFY_STOP) @@ -634,12 +632,7 @@ void do_break (struct pt_regs *regs, unsigned long address, hw_breakpoint_disable(); /* Deliver the signal to userspace */ - clear_siginfo(&info); - info.si_signo = SIGTRAP; - info.si_errno = 0; - info.si_code = TRAP_HWBKPT; - info.si_addr = (void __user *)address; - force_sig_info(SIGTRAP, &info, current); + force_sig_fault(SIGTRAP, TRAP_HWBKPT, (void __user *)address, current); } #endif /* CONFIG_PPC_ADV_DEBUG_REGS */ diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 406d0e0ef096..1697e903bbf2 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -165,17 +165,10 @@ static noinline int bad_access(struct pt_regs *regs, unsigned long address) static int do_sigbus(struct pt_regs *regs, unsigned long address, vm_fault_t fault) { - siginfo_t info; - if (!user_mode(regs)) return SIGBUS; current->thread.trap_nr = BUS_ADRERR; - clear_siginfo(&info); - info.si_signo = SIGBUS; - info.si_errno = 0; - info.si_code = BUS_ADRERR; - info.si_addr = (void __user *)address; #ifdef CONFIG_MEMORY_FAILURE if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) { unsigned int lsb = 0; /* shutup gcc */ @@ -194,7 +187,7 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address, } #endif - force_sig_info(SIGBUS, &info, current); + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address, current); return 0; } diff --git a/arch/powerpc/platforms/cell/spu_base.c b/arch/powerpc/platforms/cell/spu_base.c index 0c45cdbac4cf..7f12c7b78c0f 100644 --- a/arch/powerpc/platforms/cell/spu_base.c +++ b/arch/powerpc/platforms/cell/spu_base.c @@ -50,11 +50,11 @@ struct cbe_spu_info cbe_spu_info[MAX_NUMNODES]; EXPORT_SYMBOL_GPL(cbe_spu_info); /* - * The spufs fault-handling code needs to call force_sig_info to raise signals + * The spufs fault-handling code needs to call force_sig_fault to raise signals * on DMA errors. Export it here to avoid general kernel-wide access to this * function */ -EXPORT_SYMBOL_GPL(force_sig_info); +EXPORT_SYMBOL_GPL(force_sig_fault); /* * Protects cbe_spu_info and spu->number. diff --git a/arch/powerpc/platforms/cell/spufs/fault.c b/arch/powerpc/platforms/cell/spufs/fault.c index 83cf58daaa79..971ac43b5d60 100644 --- a/arch/powerpc/platforms/cell/spufs/fault.c +++ b/arch/powerpc/platforms/cell/spufs/fault.c @@ -36,42 +36,32 @@ static void spufs_handle_event(struct spu_context *ctx, unsigned long ea, int type) { - siginfo_t info; - if (ctx->flags & SPU_CREATE_EVENTS_ENABLED) { ctx->event_return |= type; wake_up_all(&ctx->stop_wq); return; } - clear_siginfo(&info); - switch (type) { case SPE_EVENT_INVALID_DMA: - info.si_signo = SIGBUS; - info.si_code = BUS_OBJERR; + force_sig_fault(SIGBUS, BUS_OBJERR, NULL, current); break; case SPE_EVENT_SPE_DATA_STORAGE: - info.si_signo = SIGSEGV; - info.si_addr = (void __user *)ea; - info.si_code = SEGV_ACCERR; ctx->ops->restart_dma(ctx); + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *)ea, + current); break; case SPE_EVENT_DMA_ALIGNMENT: - info.si_signo = SIGBUS; /* DAR isn't set for an alignment fault :( */ - info.si_code = BUS_ADRALN; + force_sig_fault(SIGBUS, BUS_ADRALN, NULL, current); break; case SPE_EVENT_SPE_ERROR: - info.si_signo = SIGILL; - info.si_addr = (void __user *)(unsigned long) - ctx->ops->npc_read(ctx) - 4; - inf
[REVIEW][PATCH 0/9] signal/powerpc: siginfo cleanups
This is the continuation of my work to sort out signaling of exceptions with siginfo. The old functions by passing siginfo resulted in many cases of fields of siginfo that were not initialized and then passed to userspace, and also resulted in callers getting confused and initializing the wrong fields. My remedy is to have specific functions for sending each different kind of signal with siginfo. Those functions take the information needed to fill in siginfo and do the work themselves. This is my set of changes to update powerpc to use those functions. Along with some refactoring so those functions can be cleanly used. Folks please review and double check me. I think I have kept these changes simple and obviously correct but I am human and mess up sometimes. After these patches have had a chance to be reviewed I plan to merge them by my siginfo tree. If you would rather take them in the powerpc tree let me know. All of the prerequisites should have been merged through Linus's tree several releases ago. Eric W. Biederman (9): signal/powerpc: Use force_sig_mceerr as appropriate signal/powerpc: Remove pkey parameter from __bad_area signal/powerpc: Call _exception_pkey directly from bad_key_fault_exception signal/powerpc: Remove pkey parameter from __bad_area_nosemaphore signal/powerpc: Factor the common exception code into exception_common signal/powerpc: Call force_sig_fault from _exception signal/poewrpc: Specialize _exception_pkey for handling pkey exceptions signal/powerpc: Simplify _exception_pkey by using force_sig_pkuerr signal/powerpc: Use force_sig_fault where appropriate arch/powerpc/include/asm/bug.h| 2 +- arch/powerpc/kernel/process.c | 9 + arch/powerpc/kernel/traps.c | 27 --- arch/powerpc/mm/fault.c | 55 +-- arch/powerpc/platforms/cell/spu_base.c| 4 +-- arch/powerpc/platforms/cell/spufs/fault.c | 26 +-- 6 files changed, 57 insertions(+), 66 deletions(-) Eric
Re: [PATCH v2 0/3] Kernel handling of Dynamic Logical Partitioning
Nathan Fontenot writes: > version 2 of the patch set with updates from comments. > > The Dynamic Logical Partitioning (DLPAR) capabilities of the powerpc pseries > platform allows for the addition and removal of resources (i.e. cpus, > memory, pci devices) from a partition. The removal of a resource involves > removing the resource's node from the device tree and then returning the > resource to firmware via the rtas set-indicator call. To add a resource, it > is first obtained from firmware via the rtas set-indicator call and then a > new device tree node is created using the ibm,configure-coinnector rtas call > and added to the device tree. > > The following set of patches implements the needed infrastructure to have the > kernel handle the DLPAR addition and removal of cpus (other DLPAR'able items > to > follow in future patches). The framework for this is to create a set of > probe/release sysfs files that will facilitate arch-specific call-outs to > handle > addition and removal of cpus to the system. How does this differ from /sys/devices/system/cpu/cpuX/online ? >From the descriptions it appears that we already have what you are trying to add and you just need to tie it into DLPAR on ppc. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Ian Campbell writes: > On Wed, 2010-03-10 at 12:06 +, Yinghai Lu wrote: >> On Wed, Mar 10, 2010 at 2:55 AM, wrote: >> > From: Ian Campbell >> > >> > Move arch_init_copy_chip_data and arch_free_chip_data into function >> > pointers in struct irq_chip since they operate on irq_desc->chip_data. >> > >> > arch_init_chip_data cannot be moved into struct irq_chip at this time >> > because irq_desc->chip is not known at the time the irq_desc is >> > setup. For now rename arch_init_chip_data to arch_init_irq_desc (for >> > PowerPC, the only other user, whose usage better matches the new name) >> > and on x86 convert arch_init_chip_data to ioapic_init_chip_data and >> > call this whenever the IO APIC code allocates a new IRQ. >> > >> > I've retained the chip_data behaviour for uv_irq although it isn't >> > clear to me if these interrupt types support migration or how closely >> > related to the APIC modes they really are. If it weren't for this the >> > ioapic_{init,copy,free}_chip_data functions could be static to >> > io_apic.c. >> > >> > I've tested by booting on a 64 bit system, but it's not clear to me >> > what actions I need to take to actually exercise some of these code >> > paths. >> > >> >> can you just add another pointer field in irq_desc? >> >> some kind of *irq_info etc. > > I think I don't understand what you are suggesting. YH another field doesn't make much sense. Xen is a bizarre subarch with an incompatible irq model. Xen simply needs the ability to handle the entire lifetime of an irq_chip. All we need between the Xen and the rest of x86 is a convention so that we never manage the same irqs. At least for domU we are in an either/or situation so I don't see even that being a problem. > There is already a pointer for irq_chip specific use i.e. > irq_desc->chip_data. This patchset is just about ensuring that the field > really is available to any chip implementation rather than just assuming > it is always used for the acpi chip types (on x86 at least). Ian Xen in this sense is simply not x86. irq_cfg is not acpi or ioapic or anything but x86 specific. It has everything to do with having a per cpu vector table of 256 entries and architecturally receiving a vector number when an interrupt is fired. It totally makes sense for Xen to do something different because architecturally it has a completely different irq subsystem. At the same time let's not pretend that the reason for this is anything except that Xen has a completely different notion of interrupt delivery than the rest of x86 and so it is it's own bizarre subarch. This is not a case where you simply need a driver because something is a bit different but fits into the existing model. So the best solution here seems to be a parameter that we pass into irq_to_desc_alloc_node that does what is needed. The second best would be to have arch_init_chip_data to call something like platfrom_init_chip_data().But I think we can avoid that in this case. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Ian Campbell writes: > On Wed, 2010-03-10 at 17:18 +0000, Eric W. Biederman wrote: >> Ian Campbell writes: >> >> > On Wed, 2010-03-10 at 10:55 +, i...@hellion.org.uk wrote: >> >> >> >> arch_init_chip_data cannot be moved into struct irq_chip at this time >> >> because irq_desc->chip is not known at the time the irq_desc is >> >> setup. For now rename arch_init_chip_data to arch_init_irq_desc (for >> >> PowerPC, the only other user, whose usage better matches the new name) >> >> and on x86 convert arch_init_chip_data to ioapic_init_chip_data and >> >> call this whenever the IO APIC code allocates a new IRQ. >> > >> > One idea I had to improve this was to add a struct irq_chip * as a >> > parameter to irq_to_desc_alloc_node. The new parameter potentially could >> > be NULL for current behaviour. Does that sound like a reasonable >> > approach? >> >> I don't follow why we have the restriction that irq_to_desc_alloc_node >> must call arch_init_chip_data. Assuming that requirement to call >> arch_init_chip_data >> is valid, passing something into init_one_irq_desc seems appropriate. > > Yes, I suspect that could also be made to work. > > The lifecycle of the irq_desc and chip_data isn't really clear to me -- > I guess once allocated an irq_desc never gets freed (at least > currently)? The associated chip_data can be freed on migrate and > replaced with a new one, but is not freed otherwise. Yes. That actually looks like a bug. > My concern is that if the caller asks for an IRQ which already exists > (is that valid?) then you will get that existing irq_desc back, > including its existing chip_data, which potentially leaks the new one > which was passed in. Or is it the case that the only way this could > happen would be for legacy IRQs? In which case perhaps it is simply > invalid to pass a new chip data in for such an IRQ. The only irqs that should be allocated/freed are probably the msi irqs as those are the only ones that dynamically come and go in the system. Unfortunately there appears to be a bigger mess here than first appeared. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Ian Campbell writes: > On Wed, 2010-03-10 at 17:42 +0000, Eric W. Biederman wrote: >> >> >> Ian Xen in this sense is simply not x86. irq_cfg is not acpi or >> ioapic or anything but x86 specific. It has everything to do with >> having a per cpu vector table of 256 entries and architecturally >> receiving a vector number when an interrupt is fired. >> >> It totally makes sense for Xen to do something different because >> architecturally it has a completely different irq subsystem. > > OK, so that sounds like you would like the same patchset but without the > irq_cfg renaming? or potentially with renaming to x86_blah instead (I'll > rework to your preference). Currently the renaming really makes it unclear what you are doing and for some reason the description of the renaming rubbed me the wrong way. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Ian Campbell writes: > On Wed, 2010-03-10 at 10:55 +, i...@hellion.org.uk wrote: >> >> arch_init_chip_data cannot be moved into struct irq_chip at this time >> because irq_desc->chip is not known at the time the irq_desc is >> setup. For now rename arch_init_chip_data to arch_init_irq_desc (for >> PowerPC, the only other user, whose usage better matches the new name) >> and on x86 convert arch_init_chip_data to ioapic_init_chip_data and >> call this whenever the IO APIC code allocates a new IRQ. > > One idea I had to improve this was to add a struct irq_chip * as a > parameter to irq_to_desc_alloc_node. The new parameter potentially could > be NULL for current behaviour. Does that sound like a reasonable > approach? I don't follow why we have the restriction that irq_to_desc_alloc_node must call arch_init_chip_data. Assuming that requirement to call arch_init_chip_data is valid, passing something into init_one_irq_desc seems appropriate. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Yinghai Lu writes: > On 03/10/2010 04:51 AM, Ian Campbell wrote: >> On Wed, 2010-03-10 at 12:06 +, Yinghai Lu wrote: >>> On Wed, Mar 10, 2010 at 2:55 AM, wrote: From: Ian Campbell Move arch_init_copy_chip_data and arch_free_chip_data into function pointers in struct irq_chip since they operate on irq_desc->chip_data. arch_init_chip_data cannot be moved into struct irq_chip at this time because irq_desc->chip is not known at the time the irq_desc is setup. For now rename arch_init_chip_data to arch_init_irq_desc (for PowerPC, the only other user, whose usage better matches the new name) and on x86 convert arch_init_chip_data to ioapic_init_chip_data and call this whenever the IO APIC code allocates a new IRQ. I've retained the chip_data behaviour for uv_irq although it isn't clear to me if these interrupt types support migration or how closely related to the APIC modes they really are. If it weren't for this the ioapic_{init,copy,free}_chip_data functions could be static to io_apic.c. I've tested by booting on a 64 bit system, but it's not clear to me what actions I need to take to actually exercise some of these code paths. >>> >>> can you just add another pointer field in irq_desc? >>> >>> some kind of *irq_info etc. >> >> I think I don't understand what you are suggesting. >> >> There is already a pointer for irq_chip specific use i.e. >> irq_desc->chip_data. This patchset is just about ensuring that the field >> really is available to any chip implementation rather than just assuming >> it is always used for the acpi chip types (on x86 at least). >> >> Does adding a second pointer with the same (intended?) semantics as the >> existing one buy us anything? > > > #ifdef CONFIG_INTR_REMAP > struct irq_2_iommu *irq_2_iommu; > #endif > struct irq_chip *chip; > struct msi_desc *msi_desc; > > we already have that for irq_2_iommu and msi_desc Those are at different levels of the hierarchy. Adding another pointer for Xen is like having a different iommu and so adding another pointer to handle that kind of iommu. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Yinghai Lu writes: > On 03/12/2010 01:45 AM, Ian Campbell wrote: >> Move arch_init_copy_chip_data and arch_free_chip_data into function >> pointers in struct irq_chip since they operate on irq_desc->chip_data. >> >> arch_init_chip_data cannot be moved into struct irq_chip at this time >> because irq_desc->chip is not known at the time the irq_desc is >> setup. For now rename arch_init_chip_data to arch_init_irq_desc (for >> PowerPC, the only other user, whose usage better matches the new name) >> and on x86 convert arch_init_chip_data to x86_init_chip_data and >> call this whenever a new struct irq_desc is allocated. >> >> I've retained the chip_data behaviour for uv_irq although it isn't >> clear to me if these interrupt types support migration or how closely >> related to the APIC modes they really are. If it weren't for this the >> x86_{init,copy,free}_chip_data functions could be static to >> io_apic.c. >> >> I've tested by booting on a 64 bit system, but it's not clear to me >> what actions I need to take to actually exercise some of these code >> paths. >> >> Signed-off-by: Ian Campbell >> Acked-by: Michael Ellerman [PowerPC rename portion] >> Cc: Thomas Gleixner >> Cc: Ingo Molnar >> Cc: H. Peter Anvin >> Cc: Eric W. Biederman >> Cc: Yinghai Lu >> Cc: Jeremy Fitzhardinge >> Cc: Benjamin Herrenschmidt >> Cc: Paul Mackerras >> Cc: x...@kernel.org >> Cc: linuxppc-...@ozlabs.org >> Cc: linux-ker...@vger.kernel.org >> --- >> arch/powerpc/kernel/irq.c |2 +- >> arch/x86/include/asm/hw_irq.h | 11 +++- >> arch/x86/kernel/apic/io_apic.c | 58 >> +--- >> arch/x86/kernel/uv_irq.c |5 +++ >> include/linux/interrupt.h |2 +- >> include/linux/irq.h| 12 +--- >> kernel/irq/handle.c|2 +- >> kernel/irq/numa_migrate.c | 12 +++- >> kernel/softirq.c |3 +- >> 9 files changed, 91 insertions(+), 16 deletions(-) >> >> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c >> index 64f6f20..002d87f 100644 >> --- a/arch/powerpc/kernel/irq.c >> +++ b/arch/powerpc/kernel/irq.c >> @@ -1088,7 +1088,7 @@ int arch_early_irq_init(void) >> return 0; >> } >> >> -int arch_init_chip_data(struct irq_desc *desc, int node) >> +int arch_init_irq_desc(struct irq_desc *desc, int node) >> { >> desc->status |= IRQ_NOREQUEST; >> return 0; >> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h >> index a929c9e..1bc7063 100644 >> --- a/arch/x86/include/asm/hw_irq.h >> +++ b/arch/x86/include/asm/hw_irq.h >> @@ -20,9 +20,9 @@ >> #include >> #include >> #include >> +#include >> >> #include >> -#include >> #include >> >> /* Interrupt handlers registered during init_IRQ */ >> @@ -61,6 +61,15 @@ extern void init_VISWS_APIC_irqs(void); >> extern void setup_IO_APIC(void); >> extern void disable_IO_APIC(void); >> >> +extern int x86_init_chip_data(struct irq_desc *desc, int node); >> + >> +#ifdef CONFIG_SPARSE_IRQ >> +extern void x86_copy_chip_data(struct irq_desc *old_desc, >> + struct irq_desc *desc, int node); >> +extern void x86_free_chip_data(struct irq_desc *old_desc, >> + struct irq_desc *desc); >> +#endif >> + >> struct io_apic_irq_attr { >> int ioapic; >> int ioapic_pin; >> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c >> index e4e0ddc..12e5cf4 100644 >> --- a/arch/x86/kernel/apic/io_apic.c >> +++ b/arch/x86/kernel/apic/io_apic.c >> @@ -211,7 +211,7 @@ static struct irq_cfg *get_one_free_irq_cfg(int node) >> return cfg; >> } >> >> -int arch_init_chip_data(struct irq_desc *desc, int node) >> +int x86_init_chip_data(struct irq_desc *desc, int node) >> { >> struct irq_cfg *cfg; >> >> @@ -287,8 +287,8 @@ static void free_irq_2_pin(struct irq_cfg *old_cfg, >> struct irq_cfg *cfg) >> old_cfg->irq_2_pin = NULL; >> } >> >> -void arch_init_copy_chip_data(struct irq_desc *old_desc, >> - struct irq_desc *desc, int node) >> +void x86_copy_chip_data(struct irq_desc *old_desc, >> + struct irq_desc *desc, int node) >> { >> struct irq_cfg *cfg;
Re: [PATCH] irq: move some interrupt arch_* functions into struct irq_chip.
Ian Campbell writes: > On Sat, 2010-03-13 at 00:29 +0000, Eric W. Biederman wrote: >> [...] >> > after that xen could use >> > irq_to_desc_alloc_node_f(irq, node, xen_init_chip_data); >> > >> > as need... >> > >> > at last we don't need to call x86_init_chip_data everywhere. > > This was one of the things I was considering. It seems like one of the > easiest solutions to make work correctly with the current locking in > e.g. irq_to_desc_alloc_node since the callback would always happen under > the lock taken in that function. > >> So trying to evaluate races. The worse case for this particular piece >> of code appears to be create_irq_nr. As this is the only place where >> we are setting up irqs and possibly repurposing the structure. > > Yes, create_irq_nr was one of the functions I was struggling to solve > cleanly. There is a similar construct in the Xen code as well. > > Part of the problem I'm having is the combination of lookup and allocate > in irq_to_desc_alloc_node but also the kind of "implicit" repurposing is > tricky to deal with. (by implicit I just mean that I can't find where > the previous user explicitly says they are finished with it, if you see > what I mean) > > What do you think of adding an explicit free operation for the irq_desc > structs? (does one already exist? I couldn't find it). This would go > along with some tracking of allocation state, trivial in the sparse case > where you can treat a NULL node in the radix tree as unallocated, I > guess a flag would suffice in the static array non-sparse case? We have free_one_irq_desc used in the numa irq migration code. We have dynamic_irq_cleanup. I don't expect it would be fundamentally hard to put all of the pieces together. > Going further could we split the alloc and lookup functions into > separate operations instead of combining them in irq_to_desc_alloc_node? > We already have irq_to_desc for the lookup portion so this would largely > involve changes to the semantics of irq_to_desc_alloc_node, perhaps > returning ERR_PTR(-EBUSY) if the node was already allocated. > > Having a variant which found a free IRQ rather than operating on a > specific requested IRQ could also be useful for create_irq_nr as well as > find_unbound_irq on the Xen side. I'm not convinced irq_alloc_virt on > powerpc isn't implementing broadly the same concept as well, although it > seems to work very differently from the other two. > >> Today >> we figure out if an irq has been assigned by looking at irq_cfg->vector, >> and if it is non-zero the irq has been assigned. > > Which is tricky to move into generic code hence my suggestions of > explicitly freeing the irq_desc and tracking the allocation status in > the generic code. Agreed. Getting an allocation assist into the generic code sounds nlike something for the next round of patches after your current one. At least for msi irqs, hi irqs and other irqs that are totally software constructs this makes a lot of sense. We really need to get to the point on x86 where sparse irq has no penalties and stop making it an option. Otherwise it gets a bit tricky to rely on sparse irq for an allocation helper. I need to go look up my old patchset. I did something different with create_irq(), that I think was cleaner and it may be worth resurrecting. At the very least I generalized the msi case and the ht irq case into using the same pointer in struct irq_desc. >> The logic in x86_init_chip_data is correct we only assign desc->chip_data >> if the generic layers are above it. However we need a lock to ensure that >> two paths don't race in that comparison and that assignment. There is >> no lock in x86_init_chip_data. Which unfortunately means as it stands >> this patchset is buggy. > > Yes, unfortunately I think you are right. The callback idea fixes this. > I'll respin with that. Thanks. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: 2.6.34-rc3: Badness at kernel/lockdep.c:2706
Could you try the fix Wolfram Sang sent to linux-kernel yesterday? Peter Zijlstra writes: > On Fri, 2010-04-02 at 11:33 +0530, Sachin Sant wrote: >> With 2.6.34-rc3 boot on a Power5 box : >> >> loop: module loaded >> BUG: key 6b6b6b6b6b6b6b6b not in .data! >> [ cut here ] >> Badness at kernel/lockdep.c:2706 >> NIP: c010a0a8 LR: c010a08c CTR: c00704a4 >> REGS: c001096d76f0 TRAP: 0700 Not tainted (2.6.34-rc3) >> MSR: 80029032 CR: 2882 XER: 0009 >> TASK = c001fa6f8000[1] 'swapper' THREAD: c001096d4000 CPU: 12 >> GPR00: c001096d7970 c1322e38 0001 >> GPR04: 0001 c00c1ea8 0002 >> GPR08: c1cc5a00 2422 c137d768 >> GPR12: 0002 cf66e400 00d47940 01c0 >> GPR16: 02673148 018ff984 0060 >> GPR20: 0253a291 018ff990 >> GPR24: 018ff988 c1229ef0 c0010793a730 >> GPR28: 6b6b6b6b6b6b6b6b c0010793afd8 c1299de0 c001096d7970 >> NIP [c010a0a8] .lockdep_init_map+0x12c/0x538 >> LR [c010a08c] .lockdep_init_map+0x110/0x538 >> Call Trace: >> [c001096d7970] [c010a074] .lockdep_init_map+0xf8/0x538 >> (unreliable) >> [c001096d7a30] [c0274eac] .sysfs_add_file_mode+0x90/0x124 >> [c001096d7af0] [c0274f7c] .sysfs_add_file+0x3c/0x50 >> [c001096d7b90] [c02750d8] .sysfs_create_file+0x5c/0x74 >> [c001096d7c30] [c04b36a0] .device_create_file+0x40/0x5c >> [c001096d7cd0] [c04db384] .wf_register_control+0x108/0x178 >> [c001096d7d80] [c0a5f7cc] .wf_cpufreq_clamp_init+0x11c/0x16c >> [c001096d7e20] [c000a104] .do_one_initcall+0xb0/0x208 >> [c001096d7ed0] [c0a12568] .kernel_init+0x230/0x2f0 >> [c001096d7f90] [c0033b20] .kernel_thread+0x54/0x70 >> Instruction dump: >> e93e8190 8009 2f80 409e03f8 48280539 6000 2fa3 419e03e8 >> e93e8198 8009 2f80 409e03d8 <0fe0> 480003d0 e93e8140 fb9d >> Uniform Multi-Platform E-IDE driver > > 0x6b is POISON_FREE, so it seems to me sysfs made a boo-boo. Interesting combination of debugging options there. First time I have seen BUG: xyz not in .data with a 0x6b address... Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [Patch v2] kexec: increase max of kexec segments and use dynamic allocation
Milton Miller writes: > [ Added kexec at lists.infradead.org and linuxppc-dev@lists.ozlabs.org ] > >> >> Currently KEXEC_SEGMENT_MAX is only 16 which is too small for machine with >> many memory ranges. When hibernate on a machine with disjoint memory we do >> need one segment for each memory region. Increase this hard limit to 16K >> which is reasonably large. >> >> And change ->segment from a static array to a dynamically allocated memory. >> >> Cc: Neil Horman >> Cc: huang ying >> Cc: Eric W. Biederman >> Signed-off-by: WANG Cong >> >> --- >> diff --git a/arch/powerpc/kernel/machine_kexec_64.c >> b/arch/powerpc/kernel/machine_kexec_64.c >> index ed31a29..f115585 100644 >> --- a/arch/powerpc/kernel/machine_kexec_64.c >> +++ b/arch/powerpc/kernel/machine_kexec_64.c >> @@ -131,10 +131,7 @@ static void copy_segments(unsigned long ind) >> void kexec_copy_flush(struct kimage *image) >> { >> long i, nr_segments = image->nr_segments; >> -struct kexec_segment ranges[KEXEC_SEGMENT_MAX]; >> - >> -/* save the ranges on the stack to efficiently flush the icache */ >> -memcpy(ranges, image->segment, sizeof(ranges)); >> +struct kexec_segment range; > > I'm glad you found our copy on the stack and removed the stack overflow > that comes with this bump, but ... > >> >> /* >> * After this call we may not use anything allocated in dynamic >> @@ -148,9 +145,11 @@ void kexec_copy_flush(struct kimage *image) >> * we need to clear the icache for all dest pages sometime, >> * including ones that were in place on the original copy >> */ >> -for (i = 0; i < nr_segments; i++) >> -flush_icache_range((unsigned long)__va(ranges[i].mem), >> -(unsigned long)__va(ranges[i].mem + ranges[i].memsz)); >> +for (i = 0; i < nr_segments; i++) { >> +memcpy(&range, &image->segment[i], sizeof(range)); >> +flush_icache_range((unsigned long)__va(range.mem), >> +(unsigned long)__va(range.mem + range.memsz)); >> +} >> } > > This is executed after the copy, so as it says, > "we may not use anything allocated in dynamic memory". > > We could allocate control pages to copy the segment list into. > Actually ppc64 doesn't use the existing control page, but that > is only 4kB today. > > We need the list to icache flush all the pages in all the segments. > The as the indirect list doesn't have pages that were allocated at > their destination. An interesting point. > Or maybe the icache flush should be done in the generic code > like it does for crash load segments? Please. I don't quite understand the icache flush requirement. But we really should not be looking at the segments in the architecture specific code. Ideally we would only keep the segment information around for the duration of the kexec_load syscall and not have it when it comes time to start the second kernel. I am puzzled. We should be completely replacing the page tables so can't we just do a global flush? Perhaps I am being naive about what is required for a ppc flush. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] kdump: Allow shrinking of kdump region to be overridden
Anton Blanchard writes: > On ppc64 the crashkernel region almost always overlaps an area of firmware. > This works fine except when using the sysfs interface to reduce the kdump > region. If we free the firmware area we are guaranteed to crash. That is ppc64 bug. firmware should not be in the reserved region. Any random kernel like thing can be put in to that region at any valid address and the fact that shrinking the region frees your firmware means that using that region could also stomp your firmware (which I assume would be a bad thing). So please fix the ppc64 reservation. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/4] fs: always build llseek.
ebied...@xmission.com (Eric W. Biederman) writes: > Michal Suchanek writes: > >> 64bit !COMPAT does not build because the llseek syscall is in the >> tables. > > Do I read this right you have a 128 bit offset to llseek on ppc64? > > Looking at the signature it does not appear to make sense to build this > function on any 64bit platform. > > Perhaps the proper fix to to take llseek out of your syscall tables? Sorry I missed seeing the 2 newer version of the patchset in my inbox it seems you have already made the change I am suggesting. Never mind. Eric
Re: [PATCH 1/4] fs: always build llseek.
Michal Suchanek writes: > 64bit !COMPAT does not build because the llseek syscall is in the > tables. Do I read this right you have a 128 bit offset to llseek on ppc64? Looking at the signature it does not appear to make sense to build this function on any 64bit platform. Perhaps the proper fix to to take llseek out of your syscall tables? Eric > Signed-off-by: Michal Suchanek > --- > fs/read_write.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 5bbf587f5bc1..9db56931eb26 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -331,7 +331,6 @@ COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, > compat_off_t, offset, unsigned i > } > #endif > > -#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT) > SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high, > unsigned long, offset_low, loff_t __user *, result, > unsigned int, whence) > @@ -360,7 +359,6 @@ SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, > offset_high, > fdput_pos(f); > return retval; > } > -#endif > > int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, > size_t count) > {
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
Jeff Layton writes: > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote: >> v2: >> - prepend patches to add missing ctime updates >> - add simple_rename_timestamp helper function >> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_* >> - drop individual inode_ctime_set_{sec,nsec} helpers >> >> I've been working on a patchset to change how the inode->i_ctime is >> accessed in order to give us conditional, high-res timestamps for the >> ctime and mtime. struct timespec64 has unused bits in it that we can use >> to implement this. In order to do that however, we need to wrap all >> accesses of inode->i_ctime to ensure that bits used as flags are >> appropriately handled. >> >> The patchset starts with reposts of some missing ctime updates that I >> spotted in the tree. It then adds a new helper function for updating the >> timestamp after a successful rename, and new ctime accessor >> infrastructure. >> >> The bulk of the patchset is individual conversions of different >> subsysteme to use the new infrastructure. Finally, the patchset renames >> the i_ctime field to __i_ctime to help ensure that I didn't miss >> anything. >> >> This should apply cleanly to linux-next as of this morning. >> >> Most of this conversion was done via 5 different coccinelle scripts, run >> in succession, with a large swath of by-hand conversions to clean up the >> remainder. >> > > A couple of other things I should note: > > If you sent me an Acked-by or Reviewed-by in the previous set, then I > tried to keep it on the patch here, since the respun patches are mostly > just renaming stuff from v1. Let me know if I've missed any. > > I've also pushed the pile to my tree as this tag: > > > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705 > > In case that's easier to work with. Are there any preliminary patches showing what you want your introduced accessors to turn into? It is hard to judge the sanity of the introduction of wrappers without seeing what the wrappers are ultimately going to do. Eric
Re: [PATCH RFC PKS/PMEM 51/58] kernel: Utilize new kmap_thread()
ira.we...@intel.com writes: > From: Ira Weiny > > This kmap() call is localized to a single thread. To avoid the over > head of global PKRS updates use the new kmap_thread() call. Acked-by: "Eric W. Biederman" > > Cc: Eric Biederman > Signed-off-by: Ira Weiny > --- > kernel/kexec_core.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index c19c0dad1ebe..272a9920c0d6 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -815,7 +815,7 @@ static int kimage_load_normal_segment(struct kimage > *image, > if (result < 0) > goto out; > > - ptr = kmap(page); > + ptr = kmap_thread(page); > /* Start with a clear page */ > clear_page(ptr); > ptr += maddr & ~PAGE_MASK; > @@ -828,7 +828,7 @@ static int kimage_load_normal_segment(struct kimage > *image, > memcpy(ptr, kbuf, uchunk); > else > result = copy_from_user(ptr, buf, uchunk); > - kunmap(page); > + kunmap_thread(page); > if (result) { > result = -EFAULT; > goto out; > @@ -879,7 +879,7 @@ static int kimage_load_crash_segment(struct kimage *image, > goto out; > } > arch_kexec_post_alloc_pages(page_address(page), 1, 0); > - ptr = kmap(page); > + ptr = kmap_thread(page); > ptr += maddr & ~PAGE_MASK; > mchunk = min_t(size_t, mbytes, > PAGE_SIZE - (maddr & ~PAGE_MASK)); > @@ -895,7 +895,7 @@ static int kimage_load_crash_segment(struct kimage *image, > else > result = copy_from_user(ptr, buf, uchunk); > kexec_flush_icache_page(page); > - kunmap(page); > + kunmap_thread(page); > arch_kexec_pre_free_pages(page_address(page), 1); > if (result) { > result = -EFAULT;
[PATCH] sysctl: Kill binary sysctl KERN_PPC_L2CR
> From: Stefan Roese <[EMAIL PROTECTED]> > Subject: ppc: 4xx: sysctl table check failed: /kernel/l2cr .1.31 Missing > strategy > > I'm seeing this error message when booting an recent arch/ppc kernel on > 4xx platforms (tested on Ocotea and other 4xx platforms). Booting NFS > rootfs still works fine, but this message kind of makes me "nervous". > This is not seen on 4xx arch/powerpc platforms. Here the bootlog: Because the data field was never filled and a binary sysctl handler was never written this sysctl has never been usable through the sys_sysctl interface. So just remove the binary sysctl number. Making the kernel sanity checks happy. Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]> --- arch/ppc/kernel/ppc_htab.c |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/ppc/kernel/ppc_htab.c b/arch/ppc/kernel/ppc_htab.c index aa07b63..9ed36dd 100644 --- a/arch/ppc/kernel/ppc_htab.c +++ b/arch/ppc/kernel/ppc_htab.c @@ -436,7 +436,6 @@ int proc_dol2crvec(ctl_table *table, int write, struct file *filp, */ static ctl_table htab_ctl_table[]={ { - .ctl_name = KERN_PPC_L2CR, .procname = "l2cr", .mode = 0644, .proc_handler = &proc_dol2crvec, -- 1.5.3.rc6.17.g1911 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/5] Move elfcorehdr_addr out of vmcore.c (Was: Re: [patch] crashdump: fix undefined reference to `elfcorehdr_addr')
Vivek Goyal <[EMAIL PROTECTED]> writes: > Hi All, > > How does following series of patches look like. I have moved > elfcorehdr_addr out of vmcore.c and pushed it to arch dependent section > of crash dump to make sure that it can be worked with even when > CONFIG_PROC_VMCORE is disabled and CONFIG_CRASH_DUMP is enabled. > > I tested it on x86_64. Compile tested it on i386 and ppc64. ia64 and > sh versions are completely untested. Given the current state of the code: Acked-by: "Eric W. Biederman" <[EMAIL PROTECTED]> To process a kernel crash dump we pass the kernel elfcorehdr option, so testing to see if it was passed seems reasonable. In general I think this method of handling the problems with kdump is too brittle to live, but in the case of iommus we certainly need to do something different, and unfortunately iommus were not common on x86 when the original code was merged so we have not handled them well. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/5] Move elfcorehdr_addr out of vmcore.c (Was: Re: [patch] crashdump: fix undefined reference to `elfcorehdr_addr')
___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] crash in 2.6.22-git2 sysctl_set_parent()
[EMAIL PROTECTED] (Linas Vepstas) writes: > This is a patch (& bug report) for a crash in sysctl_set_parent() > in 2.6.22-git2. > > Problem: 2.6.22-git2 crashes with a stack trace > [c1d0fb00] c0067b4c .sysctl_set_parent+0x48/0x7c > [c1d0fb90] c0069b40 .register_sysctl_table+0x7c/0xf4 > [c1d0fc30] c065e710 .devinet_init+0x88/0xb0 > [c1d0fcc0] c065db74 .ip_rt_init+0x17c/0x32c > [c1d0fd70] c065deec .ip_init+0x10/0x34 > [c1d0fdf0] c065e898 .inet_init+0x160/0x3dc > [c1d0fea0] c0630bc4 .kernel_init+0x204/0x3c8 > > A bit of poking around makes it clear what the problem is: > In sysctl_set_parent(), the for loop > >for (; table->ctl_name || table->procname; table++) { > > walks off the end of the table, and into garbage. Basically, > this for-loop iterator expects all table arrays to be > "null terminated". However, net/ipv4/devinet.c statically > declares an array that is not null-terminated. The patch > below fixes that; it works for me. Its somewhat conservative; > if one wishes to assume that the compiler will always zero out > the empty parts of the structure, then this pach can be shrunk > to one line: +ctl_table devinet_root_dir[3]; > > Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> > > > I tried to audit some of the code to see where else there > might be similar badly-formed static declarations. This is hard, > as there's a lot of code. Most seems fine. Right. I believe I performed a similar audit not to long ago when everything was converted to C99 initializers and everything was fine then. > net/core/neighbour.c |4 > net/ipv4/devinet.c |7 ++- > 2 files changed, 10 insertions(+), 1 deletion(-) > > Index: linux-2.6.22-git2/net/ipv4/devinet.c > === > --- linux-2.6.22-git2.orig/net/ipv4/devinet.c 2007-07-13 14:23:21.0 > -0500 > +++ linux-2.6.22-git2/net/ipv4/devinet.c 2007-07-13 14:24:15.0 -0500 > @@ -1424,7 +1424,7 @@ static struct devinet_sysctl_table { > ctl_table devinet_dev[2]; > ctl_table devinet_conf_dir[2]; > ctl_table devinet_proto_dir[2]; > - ctl_table devinet_root_dir[2]; > + ctl_table devinet_root_dir[3]; > } devinet_sysctl = { > .devinet_vars = { > DEVINET_SYSCTL_COMPLEX_ENTRY(FORWARDING, "forwarding", > @@ -1493,8 +1493,13 @@ static struct devinet_sysctl_table { > .data = &ipv4_devconf.loop, > .maxlen = sizeof(int), > .mode = 0644, > + .child = 0x0, > .proc_handler = &proc_dointvec, > }, Where did this entry above in devinet_sysctl come from? > + { > + .ctl_name = 0, > + .procname = 0, > + }, I probably would have just done: + {}, > }, > }; > What added the additional entry to devinet_root_dir? I don't see that in Linus' tree? The result may be fine but if it isn't named in a per network device manner we are adding duplicate entries to the root /proc/sys directory which is wrong. Actually come to think of it I am concerned that someone added a settable entry into /proc/sys/ it should at least be in /proc/sys/net/ where it won't conflict with other uses of that directory. Especially as things like network devices have user controlled names. Eric ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/4] exec: simplify the compat syscall handling
Christoph Hellwig writes: > diff --git a/fs/exec.c b/fs/exec.c > index 06e07278b456fa..b34c1eb9e7ad8e 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -391,47 +391,34 @@ static int bprm_mm_init(struct linux_binprm *bprm) > return err; > } > > -struct user_arg_ptr { > -#ifdef CONFIG_COMPAT > - bool is_compat; > -#endif > - union { > - const char __user *const __user *native; > -#ifdef CONFIG_COMPAT > - const compat_uptr_t __user *compat; > -#endif > - } ptr; > -}; > - > -static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr) > +static const char __user * > +get_user_arg_ptr(const char __user *const __user *argv, int nr) > { > - const char __user *native; > - > -#ifdef CONFIG_COMPAT > - if (unlikely(argv.is_compat)) { > + if (in_compat_syscall()) { > + const compat_uptr_t __user *compat_argv = > + compat_ptr((unsigned long)argv); Ouch! Passing a pointer around as the wrong type through the kernel! Perhaps we should reduce everything to do_execveat and do_execveat_compat. Then there would be no need for anything to do anything odd with the pointer types. I think the big change would be to factor out a copy_string out of copy_strings, that performs all of the work once we know the proper pointer value. Casting pointers from one type to another scares me as one mistake means we are doing something wrong and probably exploitable. Eric > compat_uptr_t compat; > > - if (get_user(compat, argv.ptr.compat + nr)) > + if (get_user(compat, compat_argv + nr)) > return ERR_PTR(-EFAULT); > - > return compat_ptr(compat); > - } > -#endif > - > - if (get_user(native, argv.ptr.native + nr)) > - return ERR_PTR(-EFAULT); > + } else { > + const char __user *native; > > - return native; > + if (get_user(native, argv + nr)) > + return ERR_PTR(-EFAULT); > + return native; > + } > } >
Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]
Looking at this the pr_err is absolutely needed. If an unsupported case winds up in the purgatory blob and the code can't handle it things will fail silently much worse later. So the proposed patch is unfortunately the wrong direction. "Naveen N. Rao" writes: > Baoquan He wrote: >> On 04/25/22 at 11:11pm, Naveen N. Rao wrote: >>> kexec_load_purgatory() can fail for many reasons - there is no need to >>> print an error when encountering unsupported relocations. >>> This solves a build issue on powerpc with binutils v2.36 and newer [1]. >>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section >>> symbols") [2], binutils started dropping section symbols that it thought >> I am not familiar with binutils, while wondering if this exists in other >> ARCHes except of ppc. Arm64 doesn't have the ARCH override either, do we >> have problem with it? > > I'm not aware of this specific file causing a problem on other architectures - > perhaps the config options differ enough. There are however more reports of > similar issues affecting other architectures with the llvm integrated > assembler: > https://github.com/ClangBuiltLinux/linux/issues/981 > >> >>> were unused. This isn't an issue in general, but with kexec_file.c, gcc >>> is placing kexec_arch_apply_relocations[_add] into a separate >>> .text.unlikely section and the section symbol ".text.unlikely" is being >>> dropped. Due to this, recordmcount is unable to find a non-weak symbol >> But arch_kexec_apply_relocations_add is weak symbol on ppc. > > Yes. Note that it is just the section symbol that gets dropped. The section is > still present and will continue to hold the symbols for the functions > themselves. So we have a case where binutils thinks it is doing something useful and our kernel specific tool gets tripped up by it. Reading the recordmcount code it looks like it is finding any symbol within a section but ignoring weak symbols. So I suspect the only remaining symbol in the section is __weak and that confuses recordmcount. Does removing the __weak annotation on those functions fix the build error? If so we can restructure the kexec code to simply not use __weak symbols. Otherwise the fix needs to be in recordmcount or binutils, and we should loop whoever maintains recordmcount in to see what they can do. Eric
Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]
Michael Ellerman writes: > "Eric W. Biederman" writes: >> Looking at this the pr_err is absolutely needed. If an unsupported case >> winds up in the purgatory blob and the code can't handle it things >> will fail silently much worse later. > > It won't fail later, it will fail the syscall. > > sys_kexec_file_load() > kimage_file_alloc_init() > kimage_file_prepare_segments() > arch_kexec_kernel_image_load() > kexec_image_load_default() > image->fops->load() > elf64_load()# powerpc > bzImage64_load()# x86 > kexec_load_purgatory() > kexec_apply_relocations() > > Which does: > > if (relsec->sh_type == SHT_RELA) > ret = arch_kexec_apply_relocations_add(pi, section, > relsec, symtab); > else if (relsec->sh_type == SHT_REL) > ret = arch_kexec_apply_relocations(pi, section, > relsec, symtab); > if (ret) > return ret; > > And that error is bubbled all the way back up. So as long as > arch_kexec_apply_relocations() returns an error the syscall will fail > back to userspace and there'll be an error message at that level. > > It's true that having nothing printed in dmesg makes it harder to work > out why the syscall failed. But it's a kernel bug if there are unhandled > relocations in the kernel-supplied purgatory code, so a user really has > no way to do anything about the error even if it is printed. Good point. I really hadn't noticed the error code in there when I looked. I still don't think changing the functionality of the code because of a tool issue is the right solution. >> "Naveen N. Rao" writes: >> >>> Baoquan He wrote: >>>> On 04/25/22 at 11:11pm, Naveen N. Rao wrote: >>>>> kexec_load_purgatory() can fail for many reasons - there is no need to >>>>> print an error when encountering unsupported relocations. >>>>> This solves a build issue on powerpc with binutils v2.36 and newer [1]. >>>>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section >>>>> symbols") [2], binutils started dropping section symbols that it thought >>>> I am not familiar with binutils, while wondering if this exists in other >>>> ARCHes except of ppc. Arm64 doesn't have the ARCH override either, do we >>>> have problem with it? >>> >>> I'm not aware of this specific file causing a problem on other >>> architectures - >>> perhaps the config options differ enough. There are however more reports of >>> similar issues affecting other architectures with the llvm integrated >>> assembler: >>> https://github.com/ClangBuiltLinux/linux/issues/981 >>> >>>> >>>>> were unused. This isn't an issue in general, but with kexec_file.c, gcc >>>>> is placing kexec_arch_apply_relocations[_add] into a separate >>>>> .text.unlikely section and the section symbol ".text.unlikely" is being >>>>> dropped. Due to this, recordmcount is unable to find a non-weak symbol >>>> But arch_kexec_apply_relocations_add is weak symbol on ppc. >>> >>> Yes. Note that it is just the section symbol that gets dropped. The section >>> is >>> still present and will continue to hold the symbols for the functions >>> themselves. >> >> So we have a case where binutils thinks it is doing something useful >> and our kernel specific tool gets tripped up by it. > > It's not just binutils, the LLVM assembler has the same behavior. > >> Reading the recordmcount code it looks like it is finding any symbol >> within a section but ignoring weak symbols. So I suspect the only >> remaining symbol in the section is __weak and that confuses >> recordmcount. >> >> Does removing the __weak annotation on those functions fix the build >> error? If so we can restructure the kexec code to simply not use __weak >> symbols. >> >> Otherwise the fix needs to be in recordmcount or binutils, and we should >> loop whoever maintains recordmcount in to see what they can do. > > It seems that recordmcount is not really maintained anymore now that x86 > uses objtool? > > There've been several threads about fixing recordmcount, but none of > them seem to have lead to a solution. That is unfortunate. > These weak symbol vs recordmcount problems ha
Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]
"Naveen N. Rao" writes: > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section > symbols") [1], binutils (v2.36+) started dropping section symbols that > it thought were unused. This isn't an issue in general, but with > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a > separate .text.unlikely section and the section symbol ".text.unlikely" > is being dropped. Due to this, recordmcount is unable to find a non-weak > symbol in .text.unlikely to generate a relocation record against. > > Address this by dropping the weak attribute from these functions: > - arch_kexec_apply_relocations() is not overridden by any architecture > today, so just drop the weak attribute. > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390. > Retain the function prototype for those and move the weak > implementation into the header as a static inline for other > architectures. > > [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1 Any chance you can also get machine_kexec_post_load, crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres, arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe, arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup, arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well. That is everything in kexec that uses a __weak symbol. If we can't count on them working we might as well just get rid of the rest preemptively. Could you also address Andrews concerns by using a Kconfig symbol that the architectures that implement the symbol can select. I don't want to ask too much of a volunteer but if you are willing addressing both of those would be a great help. Eric > Signed-off-by: Naveen N. Rao > --- > include/linux/kexec.h | 28 > kernel/kexec_file.c | 19 +-- > 2 files changed, 25 insertions(+), 22 deletions(-) > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index 58d1b58a971e34..e656f981f43a73 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -193,10 +193,6 @@ void *kexec_purgatory_get_symbol_addr(struct kimage > *image, const char *name); > int arch_kexec_kernel_image_probe(struct kimage *image, void *buf, > unsigned long buf_len); > void *arch_kexec_kernel_image_load(struct kimage *image); > -int arch_kexec_apply_relocations_add(struct purgatory_info *pi, > - Elf_Shdr *section, > - const Elf_Shdr *relsec, > - const Elf_Shdr *symtab); > int arch_kexec_apply_relocations(struct purgatory_info *pi, >Elf_Shdr *section, >const Elf_Shdr *relsec, > @@ -229,6 +225,30 @@ extern int crash_exclude_mem_range(struct crash_mem *mem, > unsigned long long mend); > extern int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map, > void **addr, unsigned long *sz); > + > +#if defined(CONFIG_X86_64) || defined(CONFIG_S390) > +int arch_kexec_apply_relocations_add(struct purgatory_info *pi, > + Elf_Shdr *section, > + const Elf_Shdr *relsec, > + const Elf_Shdr *symtab); > +#else > +/* > + * arch_kexec_apply_relocations_add - apply relocations of type RELA > + * @pi: Purgatory to be relocated. > + * @section: Section relocations applying to. > + * @relsec: Section containing RELAs. > + * @symtab: Corresponding symtab. > + * > + * Return: 0 on success, negative errno on error. > + */ > +static inline int > +arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr > *section, > + const Elf_Shdr *relsec, const Elf_Shdr *symtab) > +{ > + pr_err("RELA relocation unsupported.\n"); > + return -ENOEXEC; > +} > +#endif /* CONFIG_X86_64 || CONFIG_S390 */ > #endif /* CONFIG_KEXEC_FILE */ > > #ifdef CONFIG_KEXEC_ELF > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 8347fc158d2b96..6bae253b4d315e 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -108,23 +108,6 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage > *image, void *buf, > } > #endif > > -/* > - * arch_kexec_apply_relocations_add - apply relocations of type RELA > - * @pi: Purgatory to be relocated. > - * @section: Section relocations applying to. > - * @relsec: Section containing RELAs. > - * @symtab: Corresponding symtab. > - * > - * Return: 0 on success, negative errno on error. > - */ > -int __weak > -arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr > *section, > - const Elf_Shdr *relsec, const Elf_Shdr *symtab) > -{ > - pr_err("RELA relocation unsupported.\n"); > - return -ENOEXEC; > -} > -
Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]
Baoquan He writes: > Hi Eric, > > On 05/18/22 at 04:59pm, Eric W. Biederman wrote: >> "Naveen N. Rao" writes: >> >> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section >> > symbols") [1], binutils (v2.36+) started dropping section symbols that >> > it thought were unused. This isn't an issue in general, but with >> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a >> > separate .text.unlikely section and the section symbol ".text.unlikely" >> > is being dropped. Due to this, recordmcount is unable to find a non-weak >> > symbol in .text.unlikely to generate a relocation record against. >> > >> > Address this by dropping the weak attribute from these functions: >> > - arch_kexec_apply_relocations() is not overridden by any architecture >> > today, so just drop the weak attribute. >> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390. >> > Retain the function prototype for those and move the weak >> > implementation into the header as a static inline for other >> > architectures. >> > >> > [1] >> > https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1 >> >> Any chance you can also get machine_kexec_post_load, >> crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres, >> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe, >> arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup, >> arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well. >> >> That is everything in kexec that uses a __weak symbol. If we can't >> count on them working we might as well just get rid of the rest >> preemptively. > > Is there a new rule that __weak is not suggested in kernel any more? > Please help provide a pointer if yes, so that I can learn that. > > In my mind, __weak is very simple and clear as a mechanism to add > ARCH related functionality. You should be able to trace the conversation back for all of the details but if you can't here is the summary. There is a tool that some architectures use called recordmcount. The recordmcount looks for a symbol in a section, and ignores all weak symbols. In certain cases sections become so simple there are only weak symbols. At which point recordmcount fails. Which means in practice __weak symbols are unreliable and don't work to add ARCH related functionality. Given that __weak symbols fail randomly I would much rather have simpler code that doesn't fail. It has never been the case that __weak symbols have been very common in the kernel. I expect they are something like bool that have been gaining traction. Still given that __weak symbols don't work. I don't want them. Eric
Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]
Baoquan He writes: > On 05/19/22 at 12:59pm, Eric W. Biederman wrote: >> Baoquan He writes: >> >> > Hi Eric, >> > >> > On 05/18/22 at 04:59pm, Eric W. Biederman wrote: >> >> "Naveen N. Rao" writes: >> >> >> >> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section >> >> > symbols") [1], binutils (v2.36+) started dropping section symbols that >> >> > it thought were unused. This isn't an issue in general, but with >> >> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a >> >> > separate .text.unlikely section and the section symbol ".text.unlikely" >> >> > is being dropped. Due to this, recordmcount is unable to find a non-weak >> >> > symbol in .text.unlikely to generate a relocation record against. >> >> > >> >> > Address this by dropping the weak attribute from these functions: >> >> > - arch_kexec_apply_relocations() is not overridden by any architecture >> >> > today, so just drop the weak attribute. >> >> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390. >> >> > Retain the function prototype for those and move the weak >> >> > implementation into the header as a static inline for other >> >> > architectures. >> >> > >> >> > [1] >> >> > https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1 >> >> >> >> Any chance you can also get machine_kexec_post_load, >> >> crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres, >> >> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe, >> >> arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup, >> >> arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well. >> >> >> >> That is everything in kexec that uses a __weak symbol. If we can't >> >> count on them working we might as well just get rid of the rest >> >> preemptively. >> > >> > Is there a new rule that __weak is not suggested in kernel any more? >> > Please help provide a pointer if yes, so that I can learn that. >> > >> > In my mind, __weak is very simple and clear as a mechanism to add >> > ARCH related functionality. >> >> You should be able to trace the conversation back for all of the details >> but if you can't here is the summary. >> >> There is a tool that some architectures use called recordmcount. The >> recordmcount looks for a symbol in a section, and ignores all weak >> symbols. In certain cases sections become so simple there are only weak >> symbols. At which point recordmcount fails. >> >> Which means in practice __weak symbols are unreliable and don't work >> to add ARCH related functionality. >> >> Given that __weak symbols fail randomly I would much rather have simpler >> code that doesn't fail. It has never been the case that __weak symbols >> have been very common in the kernel. I expect they are something like >> bool that have been gaining traction. Still given that __weak symbols >> don't work. I don't want them. > > Thanks for the summary, Eric. > > From Naveen's reply, what I got is, llvm's recent change makes > symbol of section .text.unlikely lost, If I have read the thread correctly this change happened in both llvm and binutils. So both tools chains that are used to build the kernel. > but the secton .text.unlikely > still exists. The __weak symbol will be put in .text.unlikely partly, > when arch_kexec_apply_relocations_add() includes the pr_err line. While > removing the pr_err() line will put __weak symbol > arch_kexec_apply_relocations_add() in .text instead. Yes. Calling pr_err has some effect. Either causing an mcount entry to be ommitted, or causing the symbols in the function to be placed in .text.unlikely. > Now the status is that not only recordmcount got this problem, objtools > met it too and got an appropriate fix. Means objtools's fix doesn't need > kernel's adjustment. Recordmcount need kernel to adjust because it lacks > continuous support and developement. Naveen also told that they are > converting to objtools, just the old CI cases rely on recordmcount. In > fact, if someone stands up to get an appropriate recordmcount fix too, > the problem will be gone too. If the descriptions are correct I suspect recoredmcount could just decided to use the weak sy
Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > In the same spirit as commit fb05121fd6a2 ("signal: Add > unsafe_get_compat_sigset()"), implement an 'unsafe' version of > copy_siginfo_to_user() in order to use it within user access blocks. > > For that, also add an 'unsafe' version of clear_user(). Looking at your use cases you need the 32bit compat version of this as well. The 32bit compat version is too complicated to become a macro, so I don't think you can make this work correctly for the 32bit compat case. Probably-Not-by: "Eric W. Biederman" Eric > Signed-off-by: Christophe Leroy > --- > include/linux/signal.h | 15 +++ > include/linux/uaccess.h | 1 + > kernel/signal.c | 5 - > 3 files changed, 16 insertions(+), 5 deletions(-) > > diff --git a/include/linux/signal.h b/include/linux/signal.h > index 3454c7ff0778..659bd43daf10 100644 > --- a/include/linux/signal.h > +++ b/include/linux/signal.h > @@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t *to, > int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from); > int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user > *from); > > +static __always_inline char __user *si_expansion(const siginfo_t __user > *info) > +{ > + return ((char __user *)info) + sizeof(struct kernel_siginfo); > +} > + > +#define unsafe_copy_siginfo_to_user(to, from, label) do {\ > + siginfo_t __user *__ucs_to = to;\ > + const kernel_siginfo_t *__ucs_from = from; \ > + char __user *__ucs_expansion = si_expansion(__ucs_to); \ > + \ > + unsafe_copy_to_user(__ucs_to, __ucs_from, \ > + sizeof(struct kernel_siginfo), label); \ > + unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label); \ > +} while (0) > + > enum siginfo_layout { > SIL_KILL, > SIL_TIMER, > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h > index c05e903cef02..37073caac474 100644 > --- a/include/linux/uaccess.h > +++ b/include/linux/uaccess.h > @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, > long count); > #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e) > #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e) > #define unsafe_copy_from_user(d,s,l,e) > unsafe_op_wrap(__copy_from_user(d,s,l),e) > +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e) > static inline unsigned long user_access_save(void) { return 0UL; } > static inline void user_access_restore(unsigned long flags) { } > #endif > diff --git a/kernel/signal.c b/kernel/signal.c > index a3229add4455..83b5971e4304 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -3261,11 +3261,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int > si_code) > return layout; > } > > -static inline char __user *si_expansion(const siginfo_t __user *info) > -{ > - return ((char __user *)info) + sizeof(struct kernel_siginfo); > -} > - > int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from) > { > char __user *expansion = si_expansion(to);
Re: [PATCH v2 5/5] powerpc/signal: Use unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > Use unsafe_copy_siginfo_to_user() in order to do the copy > within the user access block. > > On an mpc 8321 (book3s/32) the improvment is about 5% on a process > sending a signal to itself. Nacked-by: "Eric W. Biederman" copy_siginfo_to_user is not the same as copy_siginfo_to_user32. As in this patch breaks 32bit userspace on powerpc. > Signed-off-by: Christophe Leroy > --- > arch/powerpc/kernel/signal_32.c | 13 ++--- > arch/powerpc/kernel/signal_64.c | 5 + > 2 files changed, 7 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c > index ff101e2b3bab..f9e16d108bc8 100644 > --- a/arch/powerpc/kernel/signal_32.c > +++ b/arch/powerpc/kernel/signal_32.c > @@ -710,12 +710,6 @@ static long restore_tm_user_regs(struct pt_regs *regs, > struct mcontext __user *s > } > #endif > > -#ifdef CONFIG_PPC64 > - > -#define copy_siginfo_to_user copy_siginfo_to_user32 > - > -#endif /* CONFIG_PPC64 */ > - > /* > * Set up a signal frame for a "real-time" signal handler > * (one which gets siginfo). > @@ -779,14 +773,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t > *oldset, > asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0])); > } > unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed); > +#ifndef CONFIG_COMPAT > + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed); > +#endif > > /* create a stack frame for the caller of the handler */ > unsafe_put_user(regs->gpr[1], newsp, failed); > > user_access_end(); > > - if (copy_siginfo_to_user(&frame->info, &ksig->info)) > +#ifdef CONFIG_COMPAT > + if (copy_siginfo_to_user32(&frame->info, &ksig->info)) > goto badframe; > +#endif > > regs->link = tramp; > > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index 2cca6c8febe1..82b73fbd937d 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t > *set, > } > > unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), > badframe_block); > + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block); > /* Allocate a dummy caller frame for the signal handler. */ > unsafe_put_user(regs->gpr[1], newsp, badframe_block); > > user_write_access_end(); > > - /* Save the siginfo outside of the unsafe block. */ > - if (copy_siginfo_to_user(&frame->info, &ksig->info)) > - goto badframe; > - > /* Make sure signal handler doesn't get spurious FP exceptions */ > tsk->thread.fp_state.fpscr = 0;
Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > Le 02/09/2021 à 20:43, Eric W. Biederman a écrit : >> Christophe Leroy writes: >> >>> In the same spirit as commit fb05121fd6a2 ("signal: Add >>> unsafe_get_compat_sigset()"), implement an 'unsafe' version of >>> copy_siginfo_to_user() in order to use it within user access blocks. >>> >>> For that, also add an 'unsafe' version of clear_user(). >> >> Looking at your use cases you need the 32bit compat version of this >> as well. >> >> The 32bit compat version is too complicated to become a macro, so I >> don't think you can make this work correctly for the 32bit compat case. > > When looking into patch 5/5 that you nacked, I think you missed the fact that > we > keep using copy_siginfo_to_user32() as it for the 32 bit compat case. I did. My mistake. However that mistake was so easy I think it mirrors the comments others have made that this looks like a maintenance hazard. Is improving the performance of 32bit kernels interesting? Is improving the performance of 32bit compat support interesting? If performance one or either of those cases is interesting it looks like we already have copy_siginfo_to_external32 the factor you would need to build unsafe_copy_siginfo_to_user32. So I am not going to say impossible but please make something maintainable. I unified all of the compat 32bit siginfo logic because it simply did not get enough love and attention when it was implemented per architecture. In general I think that concern applies to this case as well. We really need an implementation that shares as much burden as possible with other architectures. Eric >> Probably-Not-by: "Eric W. Biederman" >> >> Eric >> >>> Signed-off-by: Christophe Leroy >>> --- >>> include/linux/signal.h | 15 +++ >>> include/linux/uaccess.h | 1 + >>> kernel/signal.c | 5 - >>> 3 files changed, 16 insertions(+), 5 deletions(-) >>> >>> diff --git a/include/linux/signal.h b/include/linux/signal.h >>> index 3454c7ff0778..659bd43daf10 100644 >>> --- a/include/linux/signal.h >>> +++ b/include/linux/signal.h >>> @@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t >>> *to, >>> int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t >>> *from); >>> int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user >>> *from); >>> +static __always_inline char __user *si_expansion(const siginfo_t __user >>> *info) >>> +{ >>> + return ((char __user *)info) + sizeof(struct kernel_siginfo); >>> +} >>> + >>> +#define unsafe_copy_siginfo_to_user(to, from, label) do { \ >>> + siginfo_t __user *__ucs_to = to;\ >>> + const kernel_siginfo_t *__ucs_from = from; \ >>> + char __user *__ucs_expansion = si_expansion(__ucs_to); \ >>> + \ >>> + unsafe_copy_to_user(__ucs_to, __ucs_from, \ >>> + sizeof(struct kernel_siginfo), label); \ >>> + unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label); \ >>> +} while (0) >>> + >>> enum siginfo_layout { >>> SIL_KILL, >>> SIL_TIMER, >>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h >>> index c05e903cef02..37073caac474 100644 >>> --- a/include/linux/uaccess.h >>> +++ b/include/linux/uaccess.h >>> @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user >>> *unsafe_addr, long count); >>> #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e) >>> #define unsafe_copy_to_user(d,s,l,e) >>> unsafe_op_wrap(__copy_to_user(d,s,l),e) >>> #define unsafe_copy_from_user(d,s,l,e) >>> unsafe_op_wrap(__copy_from_user(d,s,l),e) >>> +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e) >>> static inline unsigned long user_access_save(void) { return 0UL; } >>> static inline void user_access_restore(unsigned long flags) { } >>> #endif >>> diff --git a/kernel/signal.c b/kernel/signal.c >>> index a3229add4455..83b5971e4304 100644 >>> --- a/kernel/signal.c >>> +++ b/kernel/signal.c >>> @@ -3261,11 +3261,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int >>> si_code) >>> return layout; >>> } >>> -static inline char __user *si_expansion(const siginfo_t __user *info) >>> -{ >>> - return ((char __user *)info) + sizeof(struct kernel_siginfo); >>> -} >>> - >>> int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t >>> *from) >>> { >>> char __user *expansion = si_expansion(to);
Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > On 9/8/21 6:17 PM, Eric W. Biederman wrote: >> Christophe Leroy writes: >> >>> Le 02/09/2021 à 20:43, Eric W. Biederman a écrit : >>>> Christophe Leroy writes: >>>> >>>>> In the same spirit as commit fb05121fd6a2 ("signal: Add >>>>> unsafe_get_compat_sigset()"), implement an 'unsafe' version of >>>>> copy_siginfo_to_user() in order to use it within user access blocks. >>>>> >>>>> For that, also add an 'unsafe' version of clear_user(). >>>> >>>> Looking at your use cases you need the 32bit compat version of this >>>> as well. >>>> >>>> The 32bit compat version is too complicated to become a macro, so I >>>> don't think you can make this work correctly for the 32bit compat case. >>> >>> When looking into patch 5/5 that you nacked, I think you missed the fact >>> that we >>> keep using copy_siginfo_to_user32() as it for the 32 bit compat case. >> >> I did. My mistake. >> >> However that mistake was so easy I think it mirrors the comments others >> have made that this looks like a maintenance hazard. >> >> Is improving the performance of 32bit kernels interesting? > > Yes it is, and that's what this series do. > >> Is improving the performance of 32bit compat support interesting? > > For me this is a corner case, so I left it aside for now. > >> >> If performance one or either of those cases is interesting it looks like >> we already have copy_siginfo_to_external32 the factor you would need >> to build unsafe_copy_siginfo_to_user32. > > I'm not sure I understand your saying here. What do you expect me to > do with copy_siginfo_to_external32() ? Implement unsafe_copy_siginfo_to_user32. > copy_siginfo_to_user32() is for compat only. > > Native 32 bits powerpc use copy_siginfo_to_user() What you implemented doubles the number of test cases necessary to compile test the 32bit ppc signal code, and makes the code noticeably harder to follow. Having a unsafe_copy_to_siginfo_to_user32 at least would allow the number of test cases to remain the same as the current code. >> So I am not going to say impossible but please make something >> maintainable. I unified all of the compat 32bit siginfo logic because >> it simply did not get enough love and attention when it was implemented >> per architecture. > > Yes, and ? I didn't do any modification to the compat case, so what > you did remains. You undid the unification between the 32bit code and the 32bit compat code. >> In general I think that concern applies to this case as well. We really >> need an implementation that shares as much burden as possible with other >> architectures. > > I think yes, that's the reason why I made a generic > unsafe_copy_siginfo_to_user() and didn't make a powerpc dedicated > change. > > Once this is merged any other architecture can use > unsafe_copy_siginfo_to_user(). > > Did I miss something ? Not dealing with the compat case and making the code signal stack frame code noticeably more complicated. If this optimization profitably applies to other architectures we need to figure out how to implement unsafe_copy_siginfo_to_user32 or risk making them all much worse to maintain. Eric
Re: [PATCH RESEND v3 4/6] signal: Add unsafe_copy_siginfo_to_user32()
Christophe Leroy writes: > In the same spirit as commit fb05121fd6a2 ("signal: Add > unsafe_get_compat_sigset()"), implement an 'unsafe' version of > copy_siginfo_to_user32() in order to use it within user access blocks. > > To do so, we need inline version of copy_siginfo_to_external32() as we > don't want any function call inside user access blocks. I don't understand. What is wrong with? #define unsafe_copy_siginfo_to_user32(to, from, label) do {\ struct compat_siginfo __user *__ucs_to = to;\ const struct kernel_siginfo *__ucs_from = from; \ struct compat_siginfo __ucs_new;\ \ copy_siginfo_to_external32(&__ucs_new, __ucs_from); \ unsafe_copy_to_user(__ucs_to, &__ucs_new, \ sizeof(struct compat_siginfo), label); \ } while (0) Your replacement of "memset(to, 0, sizeof(*to))" with "struct compat_siginfo __ucs_new = {0}". is actively unsafe as the compiler is free not to initialize any holes in the structure to 0 in the later case. Is there something about the unsafe macros that I am not aware of that makes it improper to actually call C functions? Is that a requirement for the instructions that change the user space access behavior? >From the looks of this change all that you are doing is making it so that all of copy_siginfo_to_external32 is being inlined. If that is not a hard requirement of the instructions it seems like the wrong thing to do here. copy_siginfo_to_external32 has not failures so it does not need to be inlined so you can jump to the label. Eric > > Signed-off-by: Christophe Leroy > --- > include/linux/compat.h | 83 +- > include/linux/signal.h | 58 + > kernel/signal.c| 114 + > 3 files changed, 141 insertions(+), 114 deletions(-) > > diff --git a/include/linux/compat.h b/include/linux/compat.h > index 8e0598c7d1d1..68823f4b86ee 100644 > --- a/include/linux/compat.h > +++ b/include/linux/compat.h > @@ -412,6 +412,19 @@ int __copy_siginfo_to_user32(struct compat_siginfo > __user *to, > #ifndef copy_siginfo_to_user32 > #define copy_siginfo_to_user32 __copy_siginfo_to_user32 > #endif > + > +#ifdef CONFIG_COMPAT > +#define unsafe_copy_siginfo_to_user32(to, from, label) do { > \ > + struct compat_siginfo __user *__ucs_to = to;\ > + const struct kernel_siginfo *__ucs_from = from; \ > + struct compat_siginfo __ucs_new = {0}; \ > + \ > + __copy_siginfo_to_external32(&__ucs_new, __ucs_from); \ > + unsafe_copy_to_user(__ucs_to, &__ucs_new, \ > + sizeof(struct compat_siginfo), label); \ > +} while (0) > +#endif > + > int get_compat_sigevent(struct sigevent *event, > const struct compat_sigevent __user *u_event); > > @@ -992,15 +1005,81 @@ static inline bool in_compat_syscall(void) { return > false; } > * appropriately converted them already. > */ > #ifndef compat_ptr > -static inline void __user *compat_ptr(compat_uptr_t uptr) > +static __always_inline void __user *compat_ptr(compat_uptr_t uptr) > { > return (void __user *)(unsigned long)uptr; > } > #endif > > -static inline compat_uptr_t ptr_to_compat(void __user *uptr) > +static __always_inline compat_uptr_t ptr_to_compat(void __user *uptr) > { > return (u32)(unsigned long)uptr; > } > > +static __always_inline void > +__copy_siginfo_to_external32(struct compat_siginfo *to, > + const struct kernel_siginfo *from) > +{ > + to->si_signo = from->si_signo; > + to->si_errno = from->si_errno; > + to->si_code = from->si_code; > + switch(__siginfo_layout(from->si_signo, from->si_code)) { > + case SIL_KILL: > + to->si_pid = from->si_pid; > + to->si_uid = from->si_uid; > + break; > + case SIL_TIMER: > + to->si_tid = from->si_tid; > + to->si_overrun = from->si_overrun; > + to->si_int = from->si_int; > + break; > + case SIL_POLL: > + to->si_band = from->si_band; > + to->si_fd = from->si_fd; > + break; > + case SIL_FAULT: > + to->si_addr = ptr_to_compat(from->si_addr); > + break; > + case SIL_FAULT_TRAPNO: > + to->si_addr = ptr_to_compat(from->si_addr); > + to->si_trapno = from->si_trapno; > + break; > + case SIL_FAULT_MCEERR: > + to->si_addr = ptr_to_compat(from->si_addr); > + to->si_addr_lsb = from->si_addr_lsb; > + break; > + case S
Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > Use unsafe_copy_siginfo_to_user() in order to do the copy > within the user access block. > > On an mpc 8321 (book3s/32) the improvment is about 5% on a process > sending a signal to itself. > > Signed-off-by: Christophe Leroy > --- > v3: Don't leave compat aside, use the new unsafe_copy_siginfo_to_user32() > --- > arch/powerpc/kernel/signal_32.c | 8 +++- > arch/powerpc/kernel/signal_64.c | 5 + > 2 files changed, 4 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c > index ff101e2b3bab..3a2db8af2d65 100644 > --- a/arch/powerpc/kernel/signal_32.c > +++ b/arch/powerpc/kernel/signal_32.c > @@ -710,9 +710,9 @@ static long restore_tm_user_regs(struct pt_regs *regs, > struct mcontext __user *s > } > #endif > > -#ifdef CONFIG_PPC64 > +#ifndef CONFIG_PPC64 > > -#define copy_siginfo_to_user copy_siginfo_to_user32 > +#define unsafe_copy_siginfo_to_user32unsafe_copy_siginfo_to_user > > #endif /* CONFIG_PPC64 */ Any particular reason to reverse the sense of this #ifdef? Otherwise this change looks much cleaner. Eric > > @@ -779,15 +779,13 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t > *oldset, > asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0])); > } > unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed); > + unsafe_copy_siginfo_to_user32(&frame->info, &ksig->info, failed); > > /* create a stack frame for the caller of the handler */ > unsafe_put_user(regs->gpr[1], newsp, failed); > > user_access_end(); > > - if (copy_siginfo_to_user(&frame->info, &ksig->info)) > - goto badframe; > - > regs->link = tramp; > > #ifdef CONFIG_PPC_FPU_REGS > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index d80ff83cacb9..56c0c74aa28c 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t > *set, > } > > unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), > badframe_block); > + unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block); > /* Allocate a dummy caller frame for the signal handler. */ > unsafe_put_user(regs->gpr[1], newsp, badframe_block); > > user_write_access_end(); > > - /* Save the siginfo outside of the unsafe block. */ > - if (copy_siginfo_to_user(&frame->info, &ksig->info)) > - goto badframe; > - > /* Make sure signal handler doesn't get spurious FP exceptions */ > tsk->thread.fp_state.fpscr = 0;
Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()
ebied...@xmission.com (Eric W. Biederman) writes: > Christophe Leroy writes: > >> Use unsafe_copy_siginfo_to_user() in order to do the copy >> within the user access block. >> >> On an mpc 8321 (book3s/32) the improvment is about 5% on a process >> sending a signal to itself. If you can't make function calls from an unsafe macro there is another way to handle this that doesn't require everything to be inline. >From a safety perspective it is probably even a better approach. Eric diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index 0608581967f0..1b30bb78b863 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -205,6 +205,12 @@ struct sigframe { int abigap[56]; }; +#ifdef CONFIG_PPC64 +# define user_siginfo_tcompat_siginfo_t +#else +# define user_siginfo_tsiginfo_t +#endif + /* * When we have rt signals to deliver, we set up on the * user stack, going down from the original stack pointer: @@ -217,11 +223,7 @@ struct sigframe { * */ struct rt_sigframe { -#ifdef CONFIG_PPC64 - compat_siginfo_t info; -#else - struct siginfo info; -#endif + user_siginfo_t info; struct ucontext uc; #ifdef CONFIG_PPC_TRANSACTIONAL_MEM struct ucontext uc_transact; @@ -712,7 +714,7 @@ static long restore_tm_user_regs(struct pt_regs *regs, struct mcontext __user *s #ifdef CONFIG_PPC64 -#define copy_siginfo_to_user copy_siginfo_to_user32 +#define copy_siginfo_to_external copy_siginfo_to_external32 #endif /* CONFIG_PPC64 */ @@ -731,6 +733,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, struct pt_regs *regs = tsk->thread.regs; /* Save the thread's msr before get_tm_stackpointer() changes it */ unsigned long msr = regs->msr; + user_siginfo_t uinfo; /* Set up Signal Frame */ frame = get_sigframe(ksig, tsk, sizeof(*frame), 1); @@ -743,6 +746,8 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, else prepare_save_user_regs(1); + copy_siginfo_to_external(&uinfo, &ksig->info); + if (!user_access_begin(frame, sizeof(*frame))) goto badframe; @@ -778,12 +783,10 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset, asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0])); } unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed); + unsafe_copy_to_user(&frame->info, &uinfo, failed); user_access_end(); - if (copy_siginfo_to_user(&frame->info, &ksig->info)) - goto badframe; - regs->link = tramp; #ifdef CONFIG_PPC_FPU_REGS Eric
Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()
Christophe Leroy writes: > Le 13/09/2021 à 18:21, Eric W. Biederman a écrit : >> ebied...@xmission.com (Eric W. Biederman) writes: >> >>> Christophe Leroy writes: >>> >>>> Use unsafe_copy_siginfo_to_user() in order to do the copy >>>> within the user access block. >>>> >>>> On an mpc 8321 (book3s/32) the improvment is about 5% on a process >>>> sending a signal to itself. >> >> If you can't make function calls from an unsafe macro there is another >> way to handle this that doesn't require everything to be inline. >> >> From a safety perspective it is probably even a better approach. > > Yes but that's exactly what I wanted to avoid for the native ppc32 case: this > double hop means useless pressure on the cache. The siginfo_t structure is 128 > bytes large, that means 8 lines of cache on powerpc 8xx. > > But maybe it is acceptable to do that only for the compat case. Let me think > about it, it might be quite easy. The places get_signal is called tend to be well known. So I think we are safe from a capacity standpoint. I am not certain it makes a difference in capacity as there is a high probability that the stack was deeper recently than it is now which suggests the cache blocks might already be in the cache. My sense it is worth benchmarking before optimizing out the extra copy like that. On the extreme side there is simply building the entire sigframe on the stack and then just calling it copy_to_user. As the stack cache lines are likely to be hot, and copy_to_user is quite well optimized there is a real possibility that it is faster to build everything on the kernel stack, and then copy it to the user space stack. It is also possible that I am wrong and we may want to figure out how far up we can push the conversion to the 32bit siginfo format. If could move the work into collect_signal we could guarantee there would be no extra work. That would require adjusting the sigframe generation code on all of the architectures. There is a lot we can do but we need benchmarking to tell if it is worth it. Eric
[PATCH 07/20] signal/powerpc: On swapcontext failure force SIGSEGV
If the register state may be partial and corrupted instead of calling do_exit, call force_sigsegv(SIGSEGV). Which properly kills the process with SIGSEGV and does not let any more userspace code execute, instead of just killing one thread of the process and potentially confusing everything. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.") Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt") Signed-off-by: "Eric W. Biederman" --- arch/powerpc/kernel/signal_32.c | 6 -- arch/powerpc/kernel/signal_64.c | 9 ++--- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index 0608581967f0..666f3da41232 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1062,8 +1062,10 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, * or if another thread unmaps the region containing the context. * We kill the task with a SIGSEGV in this situation. */ - if (do_setcontext(new_ctx, regs, 0)) - do_exit(SIGSEGV); + if (do_setcontext(new_ctx, regs, 0)) { + force_sigsegv(SIGSEGV); + return -EFAULT; + } set_thread_flag(TIF_RESTOREALL); return 0; diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 1831bba0582e..d8de622c9e4a 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -703,15 +703,18 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, * We kill the task with a SIGSEGV in this situation. */ - if (__get_user_sigset(&set, &new_ctx->uc_sigmask)) - do_exit(SIGSEGV); + if (__get_user_sigset(&set, &new_ctx->uc_sigmask)) { + force_sigsegv(SIGSEGV); + return -EFAULT; + } set_current_blocked(&set); if (!user_read_access_begin(new_ctx, ctx_size)) return -EFAULT; if (__unsafe_restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext)) { user_read_access_end(); - do_exit(SIGSEGV); + force_sigsegv(SIGSEGV); + return -EFAULT; } user_read_access_end(); -- 2.20.1
[PATCH 00/20] exit cleanups
While looking at some issues related to the exit path in the kernel I found several instances where the code is not using the existing abstractions properly. This set of changes introduces force_fatal_sig a way of sending a signal and not allowing it to be caught, and corrects the misuse of the existing abstractions that I found. A lot of the misuse of the existing abstractions are silly things such as doing something after calling a no return function, rolling BUG by hand, doing more work than necessary to terminate a kernel thread, or calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL). It is my plan after sending all of these changes out for review to place them in a topic branch for sending Linus. Especially for the changes that depend upon the new helper force_fatal_sig this is important. Eric W. Biederman (20): exit/doublefault: Remove apparently bogus comment about rewind_stack_do_exit exit: Remove calls of do_exit after noreturn versions of die reboot: Remove the unreachable panic after do_exit in reboot(2) signal/sparc32: Remove unreachable do_exit in do_sparc_fault signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) signal/powerpc: On swapcontext failure force SIGSEGV signal/sparc: In setup_tsb_params convert open coded BUG into BUG signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved. signal/s390: Use force_sigsegv in default_trap_handler exit/kthread: Have kernel threads return instead of calling do_exit signal: Implement force_fatal_sig exit/syscall_user_dispatch: Send ordinary signals on failure signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig signal/x86: In emulate_vsyscall force a signal instead of calling do_exit exit/rtl8723bs: Replace the macro thread_exit with a simple return 0 exit/rtl8712: Replace the macro thread_exit with a simple return 0 exit/r8188eu: Replace the macro thread_exit with a simple return 0 arch/mips/kernel/r2300_fpu.S | 4 ++-- arch/mips/kernel/syscall.c | 9 arch/nds32/kernel/traps.c | 2 +- arch/nds32/mm/fault.c | 6 + arch/openrisc/kernel/traps.c | 2 +- arch/openrisc/mm/fault.c | 4 +--- arch/powerpc/kernel/signal_32.c| 6 +++-- arch/powerpc/kernel/signal_64.c| 9 +--- arch/s390/include/asm/kdebug.h | 2 +- arch/s390/kernel/dumpstack.c | 2 +- arch/s390/kernel/traps.c | 2 +- arch/s390/mm/fault.c | 2 -- arch/sh/kernel/cpu/fpu.c | 10 + arch/sh/kernel/traps.c | 2 +- arch/sh/mm/fault.c | 2 -- arch/sparc/kernel/signal_32.c | 4 ++-- arch/sparc/kernel/windows.c| 6 +++-- arch/sparc/mm/fault_32.c | 1 - arch/sparc/mm/tsb.c| 2 +- arch/x86/entry/vsyscall/vsyscall_64.c | 3 ++- arch/x86/kernel/doublefault_32.c | 3 --- arch/x86/kernel/signal.c | 6 - arch/x86/kernel/vm86_32.c | 8 +++ arch/xtensa/kernel/traps.c | 2 +- arch/xtensa/mm/fault.c | 3 +-- drivers/firmware/stratix10-svc.c | 4 ++-- drivers/soc/ti/wkup_m3_ipc.c | 2 +- drivers/staging/r8188eu/core/rtw_cmd.c | 2 +- drivers/staging/r8188eu/core/rtw_mp.c | 2 +- drivers/staging/r8188eu/include/osdep_service.h| 2 -- drivers/staging/rtl8712/osdep_service.h| 1 - drivers/staging/rtl8712/rtl8712_cmd.c | 2 +- drivers/staging/rtl8723bs/core/rtw_cmd.c | 2 +- drivers/staging/rtl8723bs/core/rtw_xmit.c | 2 +- drivers/staging/rtl8723bs/hal/rtl8723bs_xmit.c | 2 +- .../rtl8723bs/include/osdep_service_linux.h| 2 -- fs/ocfs2/journal.c | 5 + include/linux/sched/signal.h | 1 + kernel/entry/syscall_user_dispatch.c | 12 ++ kernel/kthread.c | 2 +- kernel/reboot.c| 1 - kernel/signal.c| 26 ++ net/batman-adv/tp_meter.c | 2 +- 43 files changed, 83 insertions(+), 91 deletions(-) Eric
[PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
Now that force_fatal_sig exists it is unnecessary and a bit confusing to use force_sigsegv in cases where the simpler force_fatal_sig is wanted. So change every instance we can to make the code clearer. Signed-off-by: "Eric W. Biederman" --- arch/arc/kernel/process.c | 2 +- arch/m68k/kernel/traps.c| 2 +- arch/powerpc/kernel/signal_32.c | 2 +- arch/powerpc/kernel/signal_64.c | 4 ++-- arch/s390/kernel/traps.c| 2 +- arch/um/kernel/trap.c | 2 +- arch/x86/kernel/vm86_32.c | 2 +- fs/exec.c | 2 +- 8 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index 3793876f42d9..8e90052f6f05 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -294,7 +294,7 @@ int elf_check_arch(const struct elf32_hdr *x) eflags = x->e_flags; if ((eflags & EF_ARC_OSABI_MSK) != EF_ARC_OSABI_CURRENT) { pr_err("ABI mismatch - you need newer toolchain\n"); - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); return 0; } diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c index 5b19fcdcd69e..74045d164ddb 100644 --- a/arch/m68k/kernel/traps.c +++ b/arch/m68k/kernel/traps.c @@ -1150,7 +1150,7 @@ asmlinkage void set_esp0(unsigned long ssp) */ asmlinkage void fpsp040_die(void) { - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); } #ifdef CONFIG_M68KFPU_EMU diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index 666f3da41232..933ab95805a6 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1063,7 +1063,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, * We kill the task with a SIGSEGV in this situation. */ if (do_setcontext(new_ctx, regs, 0)) { - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); return -EFAULT; } diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index d8de622c9e4a..8ead9b3f47c6 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -704,7 +704,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, */ if (__get_user_sigset(&set, &new_ctx->uc_sigmask)) { - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); return -EFAULT; } set_current_blocked(&set); @@ -713,7 +713,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx, return -EFAULT; if (__unsafe_restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext)) { user_read_access_end(); - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); return -EFAULT; } user_read_access_end(); diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c index 51729ea2cf8e..01a7c68dcfb6 100644 --- a/arch/s390/kernel/traps.c +++ b/arch/s390/kernel/traps.c @@ -84,7 +84,7 @@ static void default_trap_handler(struct pt_regs *regs) { if (user_mode(regs)) { report_user_fault(regs, SIGSEGV, 0); - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); } else die(regs, "Unknown program exception"); } diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 3198c4767387..c32efb09db21 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -158,7 +158,7 @@ static void bad_segv(struct faultinfo fi, unsigned long ip) void fatal_sigsegv(void) { - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); do_signal(¤t->thread.regs); /* * This is to tell gcc that we're not returning - do_signal diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index 040fd01be8b3..7ff0f622abd4 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -159,7 +159,7 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval) user_access_end(); Efault: pr_alert("could not access userspace vm86 info\n"); - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); } static int do_vm86_irq_handling(int subfunction, int irqnumber); diff --git a/fs/exec.c b/fs/exec.c index a098c133d8d7..ac7b51b51f38 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1852,7 +1852,7 @@ static int bprm_execve(struct linux_binprm *bprm, * SIGSEGV. */ if (bprm->point_of_no_return && !fatal_signal_pending(current)) - force_sigsegv(SIGSEGV); + force_fatal_sig(SIGSEGV); out_unmark: current->fs->in_exec = 0; -- 2.20.1
Re: [PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
Geert Uytterhoeven writes: > Hi Eric, > > Patch 21/20? In reviewing another part of the patchset Linus asked if force_sigsegv could go away. It can't completely but I can get this far. Given that it is just a cleanup it makes most sense to me as an additional patch on top of what is already here. > On Wed, Oct 20, 2021 at 11:52 PM Eric W. Biederman > wrote: >> Now that force_fatal_sig exists it is unnecessary and a bit confusing >> to use force_sigsegv in cases where the simpler force_fatal_sig is >> wanted. So change every instance we can to make the code clearer. >> >> Signed-off-by: "Eric W. Biederman" > >> arch/m68k/kernel/traps.c| 2 +- > > Acked-by: Geert Uytterhoeven Thank you. Eric
Re: [PATCH 01/13] sysctl: add new register_sysctl_subdir() helper
Luis Chamberlain writes: > Often enough all we need to do is create a subdirectory so that > we can stuff sysctls underneath it. However, *if* that directory > was already created early on the boot sequence we really have no > need to use the full boiler plate code for it, we can just use > local variables to help us guide sysctl to place the new leaf files. > > So use a helper to do precisely this. Reset restart. This is patch is total nonsense. - You are using register_sysctl_table which as I believe I have mentioned is a deprecated compatibility wrapper. The point of spring house cleaning is to get off of the deprecated functions isn't it? - You are using the old nasty form for creating directories instead of just passing in a path. - None of this is even remotely necessary. The directories are created automatically if you just register their entries. Eric
Re: [PATCH 01/13] sysctl: add new register_sysctl_subdir() helper
ebied...@xmission.com (Eric W. Biederman) writes: > Luis Chamberlain writes: > >> Often enough all we need to do is create a subdirectory so that >> we can stuff sysctls underneath it. However, *if* that directory >> was already created early on the boot sequence we really have no >> need to use the full boiler plate code for it, we can just use >> local variables to help us guide sysctl to place the new leaf files. >> >> So use a helper to do precisely this. > > Reset restart. This is patch is total nonsense. > > - You are using register_sysctl_table which as I believe I have > mentioned is a deprecated compatibility wrapper. The point of > spring house cleaning is to get off of the deprecated functions > isn't it? > > - You are using the old nasty form for creating directories instead > of just passing in a path. > > - None of this is even remotely necessary. The directories > are created automatically if you just register their entries. Oh. *blink* The poor naming threw me off. This is a clumsy and poorly named version of register_sysctl(); Yes. This change is totally unnecessary. Eric
Re: [PATCH 02/13] cdrom: use new sysctl subdir helper register_sysctl_subdir()
Luis Chamberlain writes: > This simplifies the code considerably. The following coccinelle With register_sysctl the code would read: cdrom_sysctl_header = register_sysctl("dev/cdrom", cdrom_table); Please go that direction. Thank you. Eric
Re: [PATCH 12/13] sysctl: add helper to register empty subdir
Luis Chamberlain writes: > The way to create a subdirectory from the base set of directories > is a bit obscure, so provide a helper which makes this clear, and > also helps remove boiler plate code required to do this work. I agreee calling: register_sysctl("fs/binfmt_misc", sysctl_mount_point) is a bit obscure but if you are going to make a wrapper please make it the trivial one liner above. Say something that looks like: struct sysctl_header *register_sysctl_mount_point(const char *path) { return register_sysctl(path, sysctl_mount_point); } And yes please talk about a mount point and not an empty dir, as these are permanently empty directories to serve as mount points. There are some subtle but important permission checks this allows in the case of unprivileged mounts. Further code like this belong in proc_sysctl.c next to all of the code it is related to so that it is easier to see how to refactor the code if necessary. Eric > > Signed-off-by: Luis Chamberlain > --- > include/linux/sysctl.h | 7 +++ > kernel/sysctl.c| 16 +--- > 2 files changed, 20 insertions(+), 3 deletions(-) > > diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h > index 33a471b56345..89c92390e6de 100644 > --- a/include/linux/sysctl.h > +++ b/include/linux/sysctl.h > @@ -208,6 +208,8 @@ extern void register_sysctl_init(const char *path, struct > ctl_table *table, > extern struct ctl_table_header *register_sysctl_subdir(const char *base, > const char *subdir, > struct ctl_table *table); > +extern void register_sysctl_empty_subdir(const char *base, const char > *subdir); > + > void do_sysctl_args(void); > > extern int pwrsw_enabled; > @@ -231,6 +233,11 @@ inline struct ctl_table_header > *register_sysctl_subdir(const char *base, > return NULL; > } > > +static inline void register_sysctl_empty_subdir(const char *base, > + const char *subdir) > +{ > +} > + > static inline struct ctl_table_header *register_sysctl_paths( > const struct ctl_path *path, struct ctl_table *table) > { > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index f9a35325d5d5..460532cd5ac8 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -3188,13 +3188,17 @@ struct ctl_table_header *register_sysctl_subdir(const > char *base, > { } > }; > > - if (!table->procname) > + if (table != sysctl_mount_point && !table->procname) > goto out; > > hdr = register_sysctl_table(base_table); > if (unlikely(!hdr)) { > - pr_err("failed when creating subdirectory sysctl %s/%s/%s\n", > -base, subdir, table->procname); > + if (table != sysctl_mount_point) > + pr_err("failed when creating subdirectory sysctl > %s/%s/%s\n", > +base, subdir, table->procname); > + else > + pr_err("failed when creating empty subddirectory > %s/%s\n", > +base, subdir); > goto out; > } > kmemleak_not_leak(hdr); > @@ -3202,6 +3206,12 @@ struct ctl_table_header *register_sysctl_subdir(const > char *base, > return hdr; > } > EXPORT_SYMBOL_GPL(register_sysctl_subdir); > + > +void register_sysctl_empty_subdir(const char *base, > + const char *subdir) > +{ > + register_sysctl_subdir(base, subdir, sysctl_mount_point); > +} > #endif /* CONFIG_SYSCTL */ > /* > * No sense putting this after each symbol definition, twice,
Re: [PATCH v2] All arch: remove system call sys_sysctl
Xiaoming Ni writes: > Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), > sys_sysctl is actually unavailable: any input can only return an error. > > We have been warning about people using the sysctl system call for years > and believe there are no more users. Even if there are users of this > interface if they have not complained or fixed their code by now they > probably are not going to, so there is no point in warning them any > longer. > > So completely remove sys_sysctl on all architectures. > > Signed-off-by: Xiaoming Ni > > changes in v2: > According to Kees Cook's suggestion, completely remove sys_sysctl on all > arch > According to Eric W. Biederman's suggestion, update the commit log > > V1: > https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/ > Delete the code of sys_sysctl and return -ENOSYS directly at the function > entry > --- > include/uapi/linux/sysctl.h| 15 -- [snip] > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h > index 27c1ed2..84b44c3 100644 > --- a/include/uapi/linux/sysctl.h > +++ b/include/uapi/linux/sysctl.h > @@ -27,21 +27,6 @@ > #include > #include > > -#define CTL_MAXNAME 10 /* how many path components do we allow > in a > -call to sysctl? In other words, what is > -the largest acceptable value for the nlen > -member of a struct __sysctl_args to have? */ > - > -struct __sysctl_args { > - int __user *name; > - int nlen; > - void __user *oldval; > - size_t __user *oldlenp; > - void __user *newval; > - size_t newlen; > - unsigned long __unused[4]; > -}; > - > /* Define sysctl names first */ > > /* Top-level names: */ [snip] The uapi header change does not make sense. The entire point of the header is to allow userspace programs to be able to call sys_sysctl. It either needs to all stay or all go. As the concern with the uapi header is about userspace programs being able to compile please leave the header for now. We should leave auditing userspace and seeing if userspace code will still compile if we remove this header for a separate patch. The concerns and justifications for the uapi header are completely different then for the removing the sys_sysctl implementation. Otherwise Acked-by: "Eric W. Biederman" Eric
Re: [PATCH v2] All arch: remove system call sys_sysctl
Rich Felker writes: > On Thu, Jun 11, 2020 at 06:43:00AM -0500, Eric W. Biederman wrote: >> Xiaoming Ni writes: >> >> > Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), >> > sys_sysctl is actually unavailable: any input can only return an error. >> > >> > We have been warning about people using the sysctl system call for years >> > and believe there are no more users. Even if there are users of this >> > interface if they have not complained or fixed their code by now they >> > probably are not going to, so there is no point in warning them any >> > longer. >> > >> > So completely remove sys_sysctl on all architectures. >> >> >> >> > >> > Signed-off-by: Xiaoming Ni >> > >> > changes in v2: >> > According to Kees Cook's suggestion, completely remove sys_sysctl on all >> > arch >> > According to Eric W. Biederman's suggestion, update the commit log >> > >> > V1: >> > https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/ >> > Delete the code of sys_sysctl and return -ENOSYS directly at the >> > function entry >> > --- >> > include/uapi/linux/sysctl.h| 15 -- >> [snip] >> >> > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h >> > index 27c1ed2..84b44c3 100644 >> > --- a/include/uapi/linux/sysctl.h >> > +++ b/include/uapi/linux/sysctl.h >> > @@ -27,21 +27,6 @@ >> > #include >> > #include >> > >> > -#define CTL_MAXNAME 10/* how many path components do we allow >> > in a >> > - call to sysctl? In other words, what is >> > - the largest acceptable value for the nlen >> > - member of a struct __sysctl_args to have? */ >> > - >> > -struct __sysctl_args { >> > - int __user *name; >> > - int nlen; >> > - void __user *oldval; >> > - size_t __user *oldlenp; >> > - void __user *newval; >> > - size_t newlen; >> > - unsigned long __unused[4]; >> > -}; >> > - >> > /* Define sysctl names first */ >> > >> > /* Top-level names: */ >> [snip] >> >> The uapi header change does not make sense. The entire point of the >> header is to allow userspace programs to be able to call sys_sysctl. >> It either needs to all stay or all go. >> >> As the concern with the uapi header is about userspace programs being >> able to compile please leave the header for now. >> >> We should leave auditing userspace and seeing if userspace code will >> still compile if we remove this header for a separate patch. The >> concerns and justifications for the uapi header are completely different >> then for the removing the sys_sysctl implementation. >> >> Otherwise >> Acked-by: "Eric W. Biederman" > > The UAPI header should be kept because it's defining an API not just > for the kernel the headers are supplied with, but for all past > kernels. In particular programs needing a failsafe CSPRNG source that > works on old kernels may (do) use this as a fallback only if modern > syscalls are missing. Removing the syscall is no problem since it > won't be used, but if you remove the types/macros from the UAPI > headers, they'll have to copy that into their own sources. May we assume you know of a least one piece of userspace that will fail to compile if this header file is removed? Eric
Re: [PATCH v2] All arch: remove system call sys_sysctl
Rich Felker writes: > On Thu, Jun 11, 2020 at 12:01:11PM -0500, Eric W. Biederman wrote: >> Rich Felker writes: >> >> > On Thu, Jun 11, 2020 at 06:43:00AM -0500, Eric W. Biederman wrote: >> >> Xiaoming Ni writes: >> >> >> >> > Since the commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system >> >> > call"), >> >> > sys_sysctl is actually unavailable: any input can only return an error. >> >> > >> >> > We have been warning about people using the sysctl system call for years >> >> > and believe there are no more users. Even if there are users of this >> >> > interface if they have not complained or fixed their code by now they >> >> > probably are not going to, so there is no point in warning them any >> >> > longer. >> >> > >> >> > So completely remove sys_sysctl on all architectures. >> >> >> >> >> >> >> >> > >> >> > Signed-off-by: Xiaoming Ni >> >> > >> >> > changes in v2: >> >> > According to Kees Cook's suggestion, completely remove sys_sysctl on >> >> > all arch >> >> > According to Eric W. Biederman's suggestion, update the commit log >> >> > >> >> > V1: >> >> > https://lore.kernel.org/lkml/1591683605-8585-1-git-send-email-nixiaom...@huawei.com/ >> >> > Delete the code of sys_sysctl and return -ENOSYS directly at the >> >> > function entry >> >> > --- >> >> > include/uapi/linux/sysctl.h| 15 -- >> >> [snip] >> >> >> >> > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h >> >> > index 27c1ed2..84b44c3 100644 >> >> > --- a/include/uapi/linux/sysctl.h >> >> > +++ b/include/uapi/linux/sysctl.h >> >> > @@ -27,21 +27,6 @@ >> >> > #include >> >> > #include >> >> > >> >> > -#define CTL_MAXNAME 10 /* how many path components do we allow >> >> > in a >> >> > - call to sysctl? In other words, >> >> > what is >> >> > - the largest acceptable value for the >> >> > nlen >> >> > - member of a struct __sysctl_args to >> >> > have? */ >> >> > - >> >> > -struct __sysctl_args { >> >> > - int __user *name; >> >> > - int nlen; >> >> > - void __user *oldval; >> >> > - size_t __user *oldlenp; >> >> > - void __user *newval; >> >> > - size_t newlen; >> >> > - unsigned long __unused[4]; >> >> > -}; >> >> > - >> >> > /* Define sysctl names first */ >> >> > >> >> > /* Top-level names: */ >> >> [snip] >> >> >> >> The uapi header change does not make sense. The entire point of the >> >> header is to allow userspace programs to be able to call sys_sysctl. >> >> It either needs to all stay or all go. >> >> >> >> As the concern with the uapi header is about userspace programs being >> >> able to compile please leave the header for now. >> >> >> >> We should leave auditing userspace and seeing if userspace code will >> >> still compile if we remove this header for a separate patch. The >> >> concerns and justifications for the uapi header are completely different >> >> then for the removing the sys_sysctl implementation. >> >> >> >> Otherwise >> >> Acked-by: "Eric W. Biederman" >> > >> > The UAPI header should be kept because it's defining an API not just >> > for the kernel the headers are supplied with, but for all past >> > kernels. In particular programs needing a failsafe CSPRNG source that >> > works on old kernels may (do) use this as a fallback only if modern >> > syscalls are missing. Removing the syscall is no problem since it >> > won't be used, but if you remove the types/macros from the UAPI >> > headers, they'll have to copy that into their own sources. >> >> May we assume you know of a least one piece of userspace that will fail >> to compile if
Re: [PATCH v5 2/6] kexec: avoid compat_alloc_user_space
Arnd Bergmann writes: > From: Arnd Bergmann > > kimage_alloc_init() expects a __user pointer, so compat_sys_kexec_load() > uses compat_alloc_user_space() to convert the layout and put it back > onto the user space caller stack. > > Moving the user space access into the syscall handler directly actually > makes the code simpler, as the conversion for compat mode can now be > done on kernel memory. Acked-by: "Eric W. Biederman" > > Co-developed-by: Eric Biederman > Co-developed-by: Christoph Hellwig > Link: https://lore.kernel.org/lkml/ypbtsu4gx6pl7%2...@infradead.org/ > Link: https://lore.kernel.org/lkml/m1y2cbzmnw@fess.ebiederm.org/ > Signed-off-by: Arnd Bergmann > --- > kernel/kexec.c | 61 +- > 1 file changed, 25 insertions(+), 36 deletions(-) > > diff --git a/kernel/kexec.c b/kernel/kexec.c > index 9c7aef8f4bb6..b5e40f069768 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -19,26 +19,9 @@ > > #include "kexec_internal.h" > > -static int copy_user_segment_list(struct kimage *image, > - unsigned long nr_segments, > - struct kexec_segment __user *segments) > -{ > - int ret; > - size_t segment_bytes; > - > - /* Read in the segments */ > - image->nr_segments = nr_segments; > - segment_bytes = nr_segments * sizeof(*segments); > - ret = copy_from_user(image->segment, segments, segment_bytes); > - if (ret) > - ret = -EFAULT; > - > - return ret; > -} > - > static int kimage_alloc_init(struct kimage **rimage, unsigned long entry, >unsigned long nr_segments, > - struct kexec_segment __user *segments, > + struct kexec_segment *segments, >unsigned long flags) > { > int ret; > @@ -58,10 +41,8 @@ static int kimage_alloc_init(struct kimage **rimage, > unsigned long entry, > return -ENOMEM; > > image->start = entry; > - > - ret = copy_user_segment_list(image, nr_segments, segments); > - if (ret) > - goto out_free_image; > + image->nr_segments = nr_segments; > + memcpy(image->segment, segments, nr_segments * sizeof(*segments)); > > if (kexec_on_panic) { > /* Enable special crash kernel control page alloc policy. */ > @@ -104,7 +85,7 @@ static int kimage_alloc_init(struct kimage **rimage, > unsigned long entry, > } > > static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > - struct kexec_segment __user *segments, unsigned long flags) > + struct kexec_segment *segments, unsigned long flags) > { > struct kimage **dest_image, *image; > unsigned long i; > @@ -250,7 +231,8 @@ static inline int kexec_load_check(unsigned long > nr_segments, > SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments, > struct kexec_segment __user *, segments, unsigned long, flags) > { > - int result; > + struct kexec_segment *ksegments; > + unsigned long result; > > result = kexec_load_check(nr_segments, flags); > if (result) > @@ -261,7 +243,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, > unsigned long, nr_segments, > ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT)) > return -EINVAL; > > - result = do_kexec_load(entry, nr_segments, segments, flags); > + ksegments = memdup_user(segments, nr_segments * sizeof(ksegments[0])); > + if (IS_ERR(ksegments)) > + return PTR_ERR(ksegments); > + > + result = do_kexec_load(entry, nr_segments, ksegments, flags); > + kfree(ksegments); > > return result; > } > @@ -273,7 +260,7 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry, > compat_ulong_t, flags) > { > struct compat_kexec_segment in; > - struct kexec_segment out, __user *ksegments; > + struct kexec_segment *ksegments; > unsigned long i, result; > > result = kexec_load_check(nr_segments, flags); > @@ -286,24 +273,26 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, > entry, > if ((flags & KEXEC_ARCH_MASK) == KEXEC_ARCH_DEFAULT) > return -EINVAL; > > - ksegments = compat_alloc_user_space(nr_segments * sizeof(out)); > + ksegments = kmalloc_array(nr_segments, sizeof(ksegments[0]), > + GFP_KERNEL); > + if (!ksegments) > + return -ENOMEM; > + > for (i
Re: [PATCH v5 1/6] kexec: move locking into do_kexec_load
Arnd Bergmann writes: > From: Arnd Bergmann > > The locking is the same between the native and compat version of > sys_kexec_load(), so it can be done in the common implementation > to reduce duplication. Acked-by: "Eric W. Biederman" > > Co-developed-by: Eric Biederman > Co-developed-by: Christoph Hellwig > Signed-off-by: Arnd Bergmann > --- > kernel/kexec.c | 44 > 1 file changed, 16 insertions(+), 28 deletions(-) > > diff --git a/kernel/kexec.c b/kernel/kexec.c > index c82c6c06f051..9c7aef8f4bb6 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -110,6 +110,17 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > unsigned long i; > int ret; > > + /* > + * Because we write directly to the reserved memory region when loading > + * crash kernels we need a mutex here to prevent multiple crash kernels > + * from attempting to load simultaneously, and to prevent a crash kernel > + * from loading over the top of a in use crash kernel. > + * > + * KISS: always take the mutex. > + */ > + if (!mutex_trylock(&kexec_mutex)) > + return -EBUSY; > + > if (flags & KEXEC_ON_CRASH) { > dest_image = &kexec_crash_image; > if (kexec_crash_image) > @@ -121,7 +132,8 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > if (nr_segments == 0) { > /* Uninstall image */ > kimage_free(xchg(dest_image, NULL)); > - return 0; > + ret = 0; > + goto out_unlock; > } > if (flags & KEXEC_ON_CRASH) { > /* > @@ -134,7 +146,7 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > > ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags); > if (ret) > - return ret; > + goto out_unlock; > > if (flags & KEXEC_PRESERVE_CONTEXT) > image->preserve_context = 1; > @@ -171,6 +183,8 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > arch_kexec_protect_crashkres(); > > kimage_free(image); > +out_unlock: > + mutex_unlock(&kexec_mutex); > return ret; > } > > @@ -247,21 +261,8 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, > unsigned long, nr_segments, > ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT)) > return -EINVAL; > > - /* Because we write directly to the reserved memory > - * region when loading crash kernels we need a mutex here to > - * prevent multiple crash kernels from attempting to load > - * simultaneously, and to prevent a crash kernel from loading > - * over the top of a in use crash kernel. > - * > - * KISS: always take the mutex. > - */ > - if (!mutex_trylock(&kexec_mutex)) > - return -EBUSY; > - > result = do_kexec_load(entry, nr_segments, segments, flags); > > - mutex_unlock(&kexec_mutex); > - > return result; > } > > @@ -301,21 +302,8 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry, > return -EFAULT; > } > > - /* Because we write directly to the reserved memory > - * region when loading crash kernels we need a mutex here to > - * prevent multiple crash kernels from attempting to load > - * simultaneously, and to prevent a crash kernel from loading > - * over the top of a in use crash kernel. > - * > - * KISS: always take the mutex. > - */ > - if (!mutex_trylock(&kexec_mutex)) > - return -EBUSY; > - > result = do_kexec_load(entry, nr_segments, ksegments, flags); > > - mutex_unlock(&kexec_mutex); > - > return result; > } > #endif
Re: [PATCH v2 18/18] uaccess: drop maining CONFIG_SET_FS users
Arnd Bergmann writes: > From: Arnd Bergmann > > There are no remaining callers of set_fs(), so CONFIG_SET_FS > can be removed globally, along with the thread_info field and > any references to it. > > This turns access_ok() into a cheaper check against TASK_SIZE_MAX. > > With CONFIG_SET_FS gone, so drop all remaining references to > set_fs()/get_fs(), mm_segment_t and uaccess_kernel(). For the bits I have looked at recently, and think I understand. Acked-by: "Eric W. Biederman" > > Signed-off-by: Arnd Bergmann > --- > fs/exec.c | 6 -- > kernel/exit.c | 14 - > kernel/kthread.c | 5 -- > > diff --git a/fs/exec.c b/fs/exec.c > index 79f2c9483302..bc68a0c089ac 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1303,12 +1303,6 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > goto out_unlock; > > - /* > - * Ensure that the uaccess routines can actually operate on userspace > - * pointers: > - */ > - force_uaccess_begin(); > - > if (me->flags & PF_KTHREAD) > free_kthread_struct(me); > me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | > diff --git a/kernel/exit.c b/kernel/exit.c > index b00a25bb4ab9..0884a75bc2f8 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -737,20 +737,6 @@ void __noreturn do_exit(long code) > > WARN_ON(blk_needs_flush_plug(tsk)); > > - /* > - * If do_dead is called because this processes oopsed, it's possible > - * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before > - * continuing. Amongst other possible reasons, this is to prevent > - * mm_release()->clear_child_tid() from writing to a user-controlled > - * kernel address. > - * > - * On uptodate architectures force_uaccess_begin is a noop. On > - * architectures that still have set_fs/get_fs in addition to handling > - * oopses handles kernel threads that run as set_fs(KERNEL_DS) by > - * default. > - */ > - force_uaccess_begin(); > - > kcov_task_exit(tsk); > > coredump_task_exit(tsk); > diff --git a/kernel/kthread.c b/kernel/kthread.c > index 38c6dd822da8..16c2275d4b50 100644 > --- a/kernel/kthread.c > +++ b/kernel/kthread.c > @@ -55,7 +55,6 @@ struct kthread { > int result; > int (*threadfn)(void *); > void *data; > - mm_segment_t oldfs; > struct completion parked; > struct completion exited; > #ifdef CONFIG_BLK_CGROUP > @@ -1441,8 +1440,6 @@ void kthread_use_mm(struct mm_struct *mm) > mmdrop(active_mm); > else > smp_mb(); > - > - to_kthread(tsk)->oldfs = force_uaccess_begin(); > } > EXPORT_SYMBOL_GPL(kthread_use_mm); > > @@ -1457,8 +1454,6 @@ void kthread_unuse_mm(struct mm_struct *mm) > WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); > WARN_ON_ONCE(!tsk->mm); > > - force_uaccess_end(to_kthread(tsk)->oldfs); > - > task_lock(tsk); > /* >* When a kthread stops operating on an address space, the loop
Re: [PATCH v2 2/5] ia64: reuse append_elf_note() and final_note() functions
Hari Bathini writes: > Hi Dave, > > > Thanks for the review. > > > On Thursday 01 December 2016 10:26 AM, Dave Young wrote: >> Hi Hari >> >> Personally I like V1 more, but split the patch 2 is easier for ia64 >> people to reivew. I did basic x86 testing, it runs ok. >> >> On 11/25/16 at 05:24pm, Hari Bathini wrote: >>> Get rid of multiple definitions of append_elf_note() & final_note() >>> functions. Reuse these functions compiled under CONFIG_CRASH_CORE. >>> >>> Signed-off-by: Hari Bathini >>> --- >>> arch/ia64/kernel/crash.c | 22 -- >>> include/linux/crash_core.h |4 >>> kernel/crash_core.c|6 +++--- >>> kernel/kexec_core.c| 28 >>> 4 files changed, 7 insertions(+), 53 deletions(-) >>> >>> diff --git a/arch/ia64/kernel/crash.c b/arch/ia64/kernel/crash.c >>> index 2955f35..75859a0 100644 >>> --- a/arch/ia64/kernel/crash.c >>> +++ b/arch/ia64/kernel/crash.c >>> @@ -27,28 +27,6 @@ static int kdump_freeze_monarch; >>> static int kdump_on_init = 1; >>> static int kdump_on_fatal_mca = 1; >>> -static inline Elf64_Word >>> -*append_elf_note(Elf64_Word *buf, char *name, unsigned type, void *data, >>> - size_t data_len) >>> -{ >>> - struct elf_note *note = (struct elf_note *)buf; >>> - note->n_namesz = strlen(name) + 1; >>> - note->n_descsz = data_len; >>> - note->n_type = type; >>> - buf += (sizeof(*note) + 3)/4; >>> - memcpy(buf, name, note->n_namesz); >>> - buf += (note->n_namesz + 3)/4; >>> - memcpy(buf, data, data_len); >>> - buf += (data_len + 3)/4; >>> - return buf; >>> -} >>> - >>> -static void >>> -final_note(void *buf) >>> -{ >>> - memset(buf, 0, sizeof(struct elf_note)); >>> -} >>> - >> The above IA64 version looks better than the functions in kexec_core.c >> about the Elf64_Word type usage and the simpler final_note function. > > Hmmm.. Is void* better over Elf64_Word* to be agnostic of Elf32 or > Elf64 type? Both Elf64_Word and Elf32_Word result in a u32. So I expect the right solution is to add a definition of Elf_Word to include/linux/elf.h and to make the buffer "Elf_Word *buf". That way we preserve the alignment knowledge, while making the code depend on 32bit or 64bit. Eric
[PATCH 06/11] signal/powerpc: Document conflicts with SI_USER and SIGFPE and SIGTRAP
Setting si_code to 0 results in a userspace seeing an si_code of 0. This is the same si_code as SI_USER. Posix and common sense requires that SI_USER not be a signal specific si_code. As such this use of 0 for the si_code is a pretty horribly broken ABI. Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a value of __SI_KILL and now sees a value of SIL_KILL with the result that uid and pid fields are copied and which might copying the si_addr field by accident but certainly not by design. Making this a very flakey implementation. Utilizing FPE_FIXME and TRAP_FIXME, siginfo_layout() will now return SIL_FAULT and the appropriate fields will be reliably copied. Possible ABI fixes includee: - Send the signal without siginfo - Don't generate a signal - Possibly assign and use an appropriate si_code - Don't handle cases which can't happen Cc: Paul Mackerras Cc: Kumar Gala Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: linuxppc-dev@lists.ozlabs.org Ref: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx") Ref: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- arch/powerpc/include/uapi/asm/siginfo.h | 15 +++ arch/powerpc/kernel/traps.c | 10 +- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/uapi/asm/siginfo.h b/arch/powerpc/include/uapi/asm/siginfo.h index 1a691141e49f..444ca6c9989a 100644 --- a/arch/powerpc/include/uapi/asm/siginfo.h +++ b/arch/powerpc/include/uapi/asm/siginfo.h @@ -18,4 +18,19 @@ #undef NSIGTRAP #define NSIGTRAP 4 +/* + * SIGFPE si_codes + */ +#ifdef __KERNEL__ +#define FPE_FIXME 0 /* Broken dup of SI_USER */ +#endif /* __KERNEL__ */ + +/* + * SIGTRAP si_codes + */ +#ifdef __KERNEL__ +#define TRAP_FIXME 0 /* Broken dup of SI_USER */ +#endif /* __KERNEL__ */ + + #endif /* _ASM_POWERPC_SIGINFO_H */ diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index f3eb61be0d30..f2e6e1838952 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -917,7 +917,7 @@ void unknown_exception(struct pt_regs *regs) printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); - _exception(SIGTRAP, regs, 0, 0); + _exception(SIGTRAP, regs, TRAP_FIXME, 0); exception_exit(prev_state); } @@ -939,7 +939,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs) void RunModeException(struct pt_regs *regs) { - _exception(SIGTRAP, regs, 0, 0); + _exception(SIGTRAP, regs, TRAP_FIXME, 0); } void single_step_exception(struct pt_regs *regs) @@ -978,7 +978,7 @@ static void emulate_single_step(struct pt_regs *regs) static inline int __parse_fpscr(unsigned long fpscr) { - int ret = 0; + int ret = FPE_FIXME; /* Invalid operation */ if ((fpscr & FPSCR_VE) && (fpscr & FPSCR_VX)) @@ -1929,7 +1929,7 @@ void SPEFloatingPointException(struct pt_regs *regs) extern int do_spe_mathemu(struct pt_regs *regs); unsigned long spefscr; int fpexc_mode; - int code = 0; + int code = FPE_FIXME; int err; flush_spe_to_thread(current); @@ -1998,7 +1998,7 @@ void SPEFloatingPointRoundException(struct pt_regs *regs) printk(KERN_ERR "unrecognized spe instruction " "in %s at %lx\n", current->comm, regs->nip); } else { - _exception(SIGFPE, regs, 0, regs->nip); + _exception(SIGFPE, regs, FPE_FIXME, regs->nip); return; } } -- 2.14.1
Re: [PATCH v10 27/27] mm: display pkey in smaps if arch_pkeys_enabled() is true
Ram Pai writes: > Currently the architecture specific code is expected to > display the protection keys in smap for a given vma. > This can lead to redundant code and possibly to divergent > formats in which the key gets displayed. > > This patch changes the implementation. It displays the > pkey only if the architecture support pkeys. > > x86 arch_show_smap() function is not needed anymore. > Delete it. > > Signed-off-by: Ram Pai > --- > arch/x86/kernel/setup.c |8 > fs/proc/task_mmu.c | 11 ++- > 2 files changed, 6 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index 8af2e8d..ddf945a 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -1326,11 +1326,3 @@ static int __init register_kernel_offset_dumper(void) > return 0; > } > __initcall(register_kernel_offset_dumper); > - > -void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma) > -{ > - if (!boot_cpu_has(X86_FEATURE_OSPKE)) > - return; > - > - seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); > -} > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 0edd4da..4b39a94 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -728,10 +729,6 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long > hmask, > } > #endif /* HUGETLB_PAGE */ > > -void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma) > -{ > -} > - > static int show_smap(struct seq_file *m, void *v, int is_pid) > { > struct proc_maps_private *priv = m->private; > @@ -851,9 +848,13 @@ static int show_smap(struct seq_file *m, void *v, int > is_pid) > (unsigned long)(mss->pss >> (10 + PSS_SHIFT))); > > if (!rollup_mode) { > - arch_show_smap(m, vma); > +#ifdef CONFIG_ARCH_HAS_PKEYS > + if (arch_pkeys_enabled()) > + seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); > +#endif Would it be worth it making vma_pkey a noop on architectures that don't support protection keys so that we don't need the #ifdef here? Eric > show_smap_vma_flags(m, vma); > } > + > m_cache_vma(m, vma); > return ret; > }
Re: [PATCH v10 27/27] mm: display pkey in smaps if arch_pkeys_enabled() is true
Ram Pai writes: > On Fri, Jan 19, 2018 at 10:09:41AM -0600, Eric W. Biederman wrote: >> Ram Pai writes: >> >> > Currently the architecture specific code is expected to >> > display the protection keys in smap for a given vma. >> > This can lead to redundant code and possibly to divergent >> > formats in which the key gets displayed. >> > >> > This patch changes the implementation. It displays the >> > pkey only if the architecture support pkeys. >> > >> > x86 arch_show_smap() function is not needed anymore. >> > Delete it. >> > >> > Signed-off-by: Ram Pai >> > --- >> > arch/x86/kernel/setup.c |8 >> > fs/proc/task_mmu.c | 11 ++- >> > 2 files changed, 6 insertions(+), 13 deletions(-) >> > >> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >> > index 8af2e8d..ddf945a 100644 >> > --- a/arch/x86/kernel/setup.c >> > +++ b/arch/x86/kernel/setup.c >> > @@ -1326,11 +1326,3 @@ static int __init >> > register_kernel_offset_dumper(void) >> >return 0; >> > } >> > __initcall(register_kernel_offset_dumper); >> > - >> > -void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma) >> > -{ >> > - if (!boot_cpu_has(X86_FEATURE_OSPKE)) >> > - return; >> > - >> > - seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); >> > -} >> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> > index 0edd4da..4b39a94 100644 >> > --- a/fs/proc/task_mmu.c >> > +++ b/fs/proc/task_mmu.c >> > @@ -18,6 +18,7 @@ >> > #include >> > #include >> > #include >> > +#include >> > >> > #include >> > #include >> > @@ -728,10 +729,6 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned >> > long hmask, >> > } >> > #endif /* HUGETLB_PAGE */ >> > >> > -void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma) >> > -{ >> > -} >> > - >> > static int show_smap(struct seq_file *m, void *v, int is_pid) >> > { >> >struct proc_maps_private *priv = m->private; >> > @@ -851,9 +848,13 @@ static int show_smap(struct seq_file *m, void *v, int >> > is_pid) >> > (unsigned long)(mss->pss >> (10 + PSS_SHIFT))); >> > >> >if (!rollup_mode) { >> > - arch_show_smap(m, vma); >> > +#ifdef CONFIG_ARCH_HAS_PKEYS >> > + if (arch_pkeys_enabled()) >> > + seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); >> > +#endif >> >> Would it be worth it making vma_pkey a noop on architectures that don't >> support protection keys so that we don't need the #ifdef here? > > You mean something like this? > #define vma_pkey(vma) > It will lead to compilation error. > > > I can make it > #define vma_pkey(vma) 0 > > and that will work and get rid of the #ifdef Yes the second is what I was thinking. I don't know if it is worth it but #ifdefs can be problematic as the result in code not being compile tested. Eric
Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7
Khalid Aziz writes: > V11 changes: > This series is same as v10 and was simply rebased on 4.15 kernel. Can > mm maintainers please review patches 2, 7, 8 and 9 which are arch > independent, and include/linux/mm.h and mm/ksm.c changes in patch 10 > and ack these if everything looks good? I am a bit puzzled how this differs from the pkey's that other architectures are implementing to achieve a similar result. I am a bit mystified why you don't store the tag in a vma instead of inventing a new way to store data on page out. Can you please use force_sig_fault to send these signals instead of force_sig_info. Emperically I have found that it is very error prone to generate siginfo's by hand, especially on code paths where several different si_codes may apply. So it helps to go through a helper function to ensure the fiddly bits are all correct. AKA the unused bits all need to be set to zero before struct siginfo is copied to userspace. Eric
Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7
Khalid Aziz writes: > On 02/01/2018 07:29 PM, ebied...@xmission.com wrote: >> Khalid Aziz writes: >> >>> V11 changes: >>> This series is same as v10 and was simply rebased on 4.15 kernel. Can >>> mm maintainers please review patches 2, 7, 8 and 9 which are arch >>> independent, and include/linux/mm.h and mm/ksm.c changes in patch 10 >>> and ack these if everything looks good? >> >> I am a bit puzzled how this differs from the pkey's that other >> architectures are implementing to achieve a similar result. >> >> I am a bit mystified why you don't store the tag in a vma >> instead of inventing a new way to store data on page out. > > Hello Eric, > > As Steven pointed out, sparc sets tags per cacheline unlike pkey. This results > in much finer granularity for tags that pkey and hence requires larger tag > storage than what we can do in a vma. *Nod* I am a bit mystified where you keep the information in memory. I would think the tags would need to be stored per cacheline or per tlb entry, in some kind of cache that could overflow. So I would be surprised if swapping is the only time this information needs stored in memory. Which makes me wonder if you have the proper data structures. I would think an array per vma or something in the page tables would tend to make sense. But perhaps I am missing something. >> Can you please use force_sig_fault to send these signals instead >> of force_sig_info. Emperically I have found that it is very >> error prone to generate siginfo's by hand, especially on code >> paths where several different si_codes may apply. So it helps >> to go through a helper function to ensure the fiddly bits are >> all correct. AKA the unused bits all need to be set to zero before >> struct siginfo is copied to userspace. >> > > What you say makes sense. I followed the same code as other fault handlers for > sparc. I could change just the fault handlers for ADI related faults. Would it > make more sense to change all the fault handlers in a separate patch and keep > the code in arch/sparc/kernel/traps_64.c consistent? Dave M, do you have a > preference? It is my intention post -rc1 to start sending out patches to get the rest of not just sparc but all of the architectures using the new helpers. I have the code I just ran out of time befor the merge window opened to ensure everything had a good thorough review. So if you can handle the your new changes I expect I will handle the rest. Eric
Re: [PATCH v11 00/10] Application Data Integrity feature introduced by SPARC M7
Khalid Aziz writes: > On 02/07/2018 12:38 AM, ebied...@xmission.com wrote: >> Khalid Aziz writes: >> >>> On 02/01/2018 07:29 PM, ebied...@xmission.com wrote: Khalid Aziz writes: > V11 changes: > This series is same as v10 and was simply rebased on 4.15 kernel. Can > mm maintainers please review patches 2, 7, 8 and 9 which are arch > independent, and include/linux/mm.h and mm/ksm.c changes in patch 10 > and ack these if everything looks good? I am a bit puzzled how this differs from the pkey's that other architectures are implementing to achieve a similar result. I am a bit mystified why you don't store the tag in a vma instead of inventing a new way to store data on page out. >>> >>> Hello Eric, >>> >>> As Steven pointed out, sparc sets tags per cacheline unlike pkey. This >>> results >>> in much finer granularity for tags that pkey and hence requires larger tag >>> storage than what we can do in a vma. >> >> *Nod* I am a bit mystified where you keep the information in memory. >> I would think the tags would need to be stored per cacheline or per >> tlb entry, in some kind of cache that could overflow. So I would be >> surprised if swapping is the only time this information needs stored >> in memory. Which makes me wonder if you have the proper data >> structures. >> >> I would think an array per vma or something in the page tables would >> tend to make sense. >> >> But perhaps I am missing something. > > The ADI tags are stored in spare bits in the RAM. ADI tag storage is > managed entirely by memory controller which maintains these tags per > ADI block. An ADI block is the same size as cacheline on M7. Tags for > each ADI block are associated with the physical ADI block, not the > virtual address. When a physical page is reused, the physical ADI tag > storage for that page is overwritten with new ADI tags, hence we need > to store away the tags when we swap out a page. Kernel updates the ADI > tags for physical page when it swaps a new page in. Each vma can cover > variable number of pages so it is best to store a pointer to the tag > storage in vma as opposed to actual tags in an array. Each 8K page can > have 128 tags on it. Since each tag is 4 bits, we need 64 bytes per > page to store the tags. That can add up for a large vma. If the tags are already stored in RAM I can see why it does not make any sense to store them except on page out. Management wise this feels a lot like the encrypted memory options I have been seeing on x86. Can you please use force_sig_fault to send these signals instead of force_sig_info. Emperically I have found that it is very error prone to generate siginfo's by hand, especially on code paths where several different si_codes may apply. So it helps to go through a helper function to ensure the fiddly bits are all correct. AKA the unused bits all need to be set to zero before struct siginfo is copied to userspace. >>> >>> What you say makes sense. I followed the same code as other fault handlers >>> for >>> sparc. I could change just the fault handlers for ADI related faults. Would >>> it >>> make more sense to change all the fault handlers in a separate patch and >>> keep >>> the code in arch/sparc/kernel/traps_64.c consistent? Dave M, do you have a >>> preference? >> >> It is my intention post -rc1 to start sending out patches to get the >> rest of not just sparc but all of the architectures using the new >> helpers. I have the code I just ran out of time befor the merge >> window opened to ensure everything had a good thorough review. >> >> So if you can handle the your new changes I expect I will handle the >> rest. >> > > I can add a patch at the end of my series to update all > force_sig_info() in my patchset to force_sig_fault(). That will sync > my patches up with your changes cleanly. Does that work for you? I can > send an updated series with this change. Can you review and ack the > patches after this change. One additional patch would be fine. I can certainly review and ack that part. You probably want to wait until post -rc1 so that you have a clean base to work off of. Eric
[RFC PATCH 0/3] Dealing with the aliases of SI_USER
Linus, Would you consider the patchset below for -rc2? Dealing with the aliases of SI_USER has been a challenge as we have had a b0rked ABI in some cases since 2.5. So far no one except myself has suggested that changing the si_code of from 0 to something else for those problematic aliases of SI_USER is going to be a problem. So it looks like just fixing the issue is a real possibility. Fixing the cases that do kill(SIGFPE, ...) because at least test cases care seems important. The best fixes to backport appear to be the real architecture fixes that remove the aliases for SI_USER as those are deep fixes that fundamentally fix the problems, and are also very small changes. I am not yet brave enough to merge architectural fixes like that without arch maintainers buy-in. Getting at least an ack if nothing else takes a little bit of time. Still we have a arm fix upthread and David Miller has given his nod to a sparc fix that uses FPE_FLTUNK. So it appears real architecture fixes are progressing. Further I have looked and that leaves only powerpc, parisc, ia64, and alpha. The new si_code FPE_FLTUNK appears to address most of those, and there is an untested patch for parisc. So real progress appears possible. The generic code can do better, and that is what this rfc patchset is about. It ensures siginfo is fully initialized and uses copy_to_user to copy siginfo to userspace. This takes siginfo_layout out of the picture and so for non-compat non-signalfd siginfos the status quo returns to what it was before I introduced siginfo_layout (AKA regressions go bye-bye). I believe given the issues these changes are a candiate for -rc2. Otherwise I will keep these changes for the next merge window. Eric W. Biederman (3): signal: Ensure every siginfo we send has all bits initialized signal: Reduce copy_siginfo_to_user to just copy_to_user signal: Stop special casing TRAP_FIXME and FPE_FIXME in siginfo_layout arch/alpha/kernel/osf_sys.c | 1 + arch/alpha/kernel/signal.c| 2 + arch/alpha/kernel/traps.c | 5 ++ arch/alpha/mm/fault.c | 2 + arch/arc/mm/fault.c | 2 + arch/arm/kernel/ptrace.c | 1 + arch/arm/kernel/swp_emulate.c | 1 + arch/arm/kernel/traps.c | 5 ++ arch/arm/mm/alignment.c | 1 + arch/arm/mm/fault.c | 5 ++ arch/arm/vfp/vfpmodule.c | 3 +- arch/arm64/kernel/fpsimd.c| 2 +- arch/arm64/kernel/sys_compat.c| 1 + arch/arm64/kernel/traps.c | 1 + arch/arm64/mm/fault.c | 18 -- arch/c6x/kernel/traps.c | 1 + arch/hexagon/kernel/traps.c | 1 + arch/hexagon/mm/vm_fault.c| 1 + arch/ia64/kernel/brl_emu.c| 1 + arch/ia64/kernel/signal.c | 2 + arch/ia64/kernel/traps.c | 27 - arch/ia64/kernel/unaligned.c | 1 + arch/ia64/mm/fault.c | 4 +- arch/m68k/kernel/traps.c | 2 + arch/microblaze/kernel/exceptions.c | 1 + arch/microblaze/mm/fault.c| 4 +- arch/mips/mm/fault.c | 1 + arch/nds32/kernel/traps.c | 6 +- arch/nds32/mm/fault.c | 1 + arch/nios2/kernel/traps.c | 1 + arch/openrisc/kernel/traps.c | 5 +- arch/openrisc/mm/fault.c | 1 + arch/parisc/kernel/ptrace.c | 1 + arch/parisc/kernel/traps.c| 2 + arch/parisc/kernel/unaligned.c| 1 + arch/parisc/math-emu/driver.c | 1 + arch/parisc/mm/fault.c| 1 + arch/powerpc/kernel/process.c | 1 + arch/powerpc/kernel/traps.c | 3 +- arch/powerpc/mm/fault.c | 1 + arch/powerpc/platforms/cell/spufs/fault.c | 2 +- arch/riscv/kernel/traps.c | 1 + arch/s390/kernel/traps.c | 5 +- arch/s390/mm/fault.c | 2 + arch/sh/kernel/hw_breakpoint.c| 1 + arch/sh/kernel/traps_32.c | 2 + arch/sh/math-emu/math.c | 1 + arch/sh/mm/fault.c| 1 + arch/sparc/kernel/process_64.c| 1 + arch/sparc/kernel/sys_sparc_32.c | 1 + arch/sparc/kernel/traps_32.c | 10 arch/sparc/kernel/traps_64.c | 14 + arch/sparc/kernel/unaligned_32.c | 1 + arch/sparc/mm/fault_32.c | 1 + arch/sparc/mm/fault_64.c | 1 + arch/um/kernel/trap.c | 2 + arch/unicore32/kernel/fpu-ucf64.c | 2 +- arch/unicore32/mm/fault.c | 3 + arch/x86/entry/vsyscall/vsyscall_64.c | 2 +- arch/x86/kernel/ptrace.c | 2
[RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized
Call clear_siginfo to ensure every stack allocated siginfo is properly initialized before being passed to the signal sending functions. Note: It is not safe to depend on C initializers to initialize struct siginfo on the stack because C is allowed to skip holes when initializing a structure. The initialization of struct siginfo in tracehook_report_syscall_exit was moved from the helper user_single_step_siginfo into tracehook_report_syscall_exit itself, to make it clear that the local variable siginfo gets fully initialized. In a few cases the scope of struct siginfo has been reduced to make it clear that siginfo siginfo is not used on other paths in the function in which it is declared. Instances of using memset to initialize siginfo have been replaced with calls clear_siginfo for clarity. Signed-off-by: "Eric W. Biederman" --- arch/alpha/kernel/osf_sys.c | 1 + arch/alpha/kernel/signal.c| 2 ++ arch/alpha/kernel/traps.c | 5 + arch/alpha/mm/fault.c | 2 ++ arch/arc/mm/fault.c | 2 ++ arch/arm/kernel/ptrace.c | 1 + arch/arm/kernel/swp_emulate.c | 1 + arch/arm/kernel/traps.c | 5 + arch/arm/mm/alignment.c | 1 + arch/arm/mm/fault.c | 5 + arch/arm/vfp/vfpmodule.c | 3 +-- arch/arm64/kernel/fpsimd.c| 2 +- arch/arm64/kernel/sys_compat.c| 1 + arch/arm64/kernel/traps.c | 1 + arch/arm64/mm/fault.c | 18 -- arch/c6x/kernel/traps.c | 1 + arch/hexagon/kernel/traps.c | 1 + arch/hexagon/mm/vm_fault.c| 1 + arch/ia64/kernel/brl_emu.c| 1 + arch/ia64/kernel/signal.c | 2 ++ arch/ia64/kernel/traps.c | 27 --- arch/ia64/kernel/unaligned.c | 1 + arch/ia64/mm/fault.c | 4 +++- arch/m68k/kernel/traps.c | 2 ++ arch/microblaze/kernel/exceptions.c | 1 + arch/microblaze/mm/fault.c| 4 +++- arch/mips/mm/fault.c | 1 + arch/nds32/kernel/traps.c | 6 +- arch/nds32/mm/fault.c | 1 + arch/nios2/kernel/traps.c | 1 + arch/openrisc/kernel/traps.c | 5 - arch/openrisc/mm/fault.c | 1 + arch/parisc/kernel/ptrace.c | 1 + arch/parisc/kernel/traps.c| 2 ++ arch/parisc/kernel/unaligned.c| 1 + arch/parisc/math-emu/driver.c | 1 + arch/parisc/mm/fault.c| 1 + arch/powerpc/kernel/process.c | 1 + arch/powerpc/kernel/traps.c | 3 +-- arch/powerpc/mm/fault.c | 1 + arch/powerpc/platforms/cell/spufs/fault.c | 2 +- arch/riscv/kernel/traps.c | 1 + arch/s390/kernel/traps.c | 5 - arch/s390/mm/fault.c | 2 ++ arch/sh/kernel/hw_breakpoint.c| 1 + arch/sh/kernel/traps_32.c | 2 ++ arch/sh/math-emu/math.c | 1 + arch/sh/mm/fault.c| 1 + arch/sparc/kernel/process_64.c| 1 + arch/sparc/kernel/sys_sparc_32.c | 1 + arch/sparc/kernel/traps_32.c | 10 ++ arch/sparc/kernel/traps_64.c | 14 ++ arch/sparc/kernel/unaligned_32.c | 1 + arch/sparc/mm/fault_32.c | 1 + arch/sparc/mm/fault_64.c | 1 + arch/um/kernel/trap.c | 2 ++ arch/unicore32/kernel/fpu-ucf64.c | 2 +- arch/unicore32/mm/fault.c | 3 +++ arch/x86/entry/vsyscall/vsyscall_64.c | 2 +- arch/x86/kernel/ptrace.c | 2 +- arch/x86/kernel/traps.c | 3 +++ arch/x86/kernel/umip.c| 1 + arch/x86/kvm/mmu.c| 1 + arch/x86/mm/fault.c | 1 + arch/xtensa/kernel/traps.c| 1 + arch/xtensa/mm/fault.c| 1 + include/linux/ptrace.h| 1 - include/linux/tracehook.h | 1 + virt/kvm/arm/mmu.c| 1 + 69 files changed, 163 insertions(+), 24 deletions(-) diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index 89faa6f4de47..8ad689d6a0e4 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -881,6 +881,7 @@ SYSCALL_DEFINE5(osf_setsysinfo, unsigned long, op, void __user *, buffer, if (fex & IEEE_TRAP_ENABLE_DZE) si_code = FPE_FLTDIV; if (fex & IEEE_TRAP_ENABLE_INV) si_code = FPE_FLTINV; + clear_siginfo(&info);
[RFC PATCH 2/3] signal: Reduce copy_siginfo_to_user to just copy_to_user
Now that every instance of struct siginfo is now initialized it is no longer necessary to copy struct siginfo piece by piece to userspace but instead the entire structure can be copied. As well as making the code simpler and more efficient this means that copy_sinfo_to_user no longer cares which union member of struct siginfo is in use. In practice this means that all 32bit architectures that define FPE_FIXME will handle properly send SI_USER when kill(SIGFPE) is sent. While still performing their historic architectural brokenness when 0 is used a floating pointer signal. This matches the current behavior of 64bit architectures that define FPE_FIXME who get lucky and an overloaded SI_USER has continuted to work through copy_siginfo_to_user because the 8 byte si_addr occupies the same bytes in struct siginfo as the 4 byte si_pid and the 4 byte si_uid. Problematic architectures still need to fix their ABI so that signalfd and 32bit compat code will work properly. Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 84 ++--- 1 file changed, 2 insertions(+), 82 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index d4ccea599692..d56f4d496c89 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2850,89 +2850,9 @@ enum siginfo_layout siginfo_layout(int sig, int si_code) int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from) { - int err; - - if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t))) + if (copy_to_user(to, from , sizeof(struct siginfo))) return -EFAULT; - if (from->si_code < 0) - return __copy_to_user(to, from, sizeof(siginfo_t)) - ? -EFAULT : 0; - /* -* If you change siginfo_t structure, please be sure -* this code is fixed accordingly. -* Please remember to update the signalfd_copyinfo() function -* inside fs/signalfd.c too, in case siginfo_t changes. -* It should never copy any pad contained in the structure -* to avoid security leaks, but must copy the generic -* 3 ints plus the relevant union member. -*/ - err = __put_user(from->si_signo, &to->si_signo); - err |= __put_user(from->si_errno, &to->si_errno); - err |= __put_user(from->si_code, &to->si_code); - switch (siginfo_layout(from->si_signo, from->si_code)) { - case SIL_KILL: - err |= __put_user(from->si_pid, &to->si_pid); - err |= __put_user(from->si_uid, &to->si_uid); - break; - case SIL_TIMER: - /* Unreached SI_TIMER is negative */ - break; - case SIL_POLL: - err |= __put_user(from->si_band, &to->si_band); - err |= __put_user(from->si_fd, &to->si_fd); - break; - case SIL_FAULT: - err |= __put_user(from->si_addr, &to->si_addr); -#ifdef __ARCH_SI_TRAPNO - err |= __put_user(from->si_trapno, &to->si_trapno); -#endif -#ifdef __ia64__ - err |= __put_user(from->si_imm, &to->si_imm); - err |= __put_user(from->si_flags, &to->si_flags); - err |= __put_user(from->si_isr, &to->si_isr); -#endif - /* -* Other callers might not initialize the si_lsb field, -* so check explicitly for the right codes here. -*/ -#ifdef BUS_MCEERR_AR - if (from->si_signo == SIGBUS && from->si_code == BUS_MCEERR_AR) - err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb); -#endif -#ifdef BUS_MCEERR_AO - if (from->si_signo == SIGBUS && from->si_code == BUS_MCEERR_AO) - err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb); -#endif -#ifdef SEGV_BNDERR - if (from->si_signo == SIGSEGV && from->si_code == SEGV_BNDERR) { - err |= __put_user(from->si_lower, &to->si_lower); - err |= __put_user(from->si_upper, &to->si_upper); - } -#endif -#ifdef SEGV_PKUERR - if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR) - err |= __put_user(from->si_pkey, &to->si_pkey); -#endif - break; - case SIL_CHLD: - err |= __put_user(from->si_pid, &to->si_pid); - err |= __put_user(from->si_uid, &to->si_uid); - err |= __put_user(from->si_status, &to->si_status); - err |= __put_user(from->si_utime, &to->si_utime); - err |= __put_user(from->si_stime, &to->si_stime); - break; - case SIL_RT: -
[RFC PATCH 3/3] signal: Stop special casing TRAP_FIXME and FPE_FIXME in siginfo_layout
After more experience with the cases where no one the si_code of 0 is used both as a signal specific si_code, and as SI_USER it appears that no one cares about the signal specific si_code case and the good solution is to just fix the architectures by using a different si_code. In none of the conversations has anyone even suggested that anything depends on the signal specific redefinition of SI_USER. There are at least test cases that care when si_code as 0 does not work as si_user. So make things simple and keep the generic code from introducing problems by removing the special casing of TRAP_FIXME and FPE_FIXME. This will ensure the generic case of sending a signal with kill will always set SI_USER and work. The architecture specific, and signal specific overloads that set si_code to 0 will now have problems with signalfd and the 32bit compat versions of siginfo copying. At least until they are fixed. Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 9 - 1 file changed, 9 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index d56f4d496c89..fc82d2c0918f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2835,15 +2835,6 @@ enum siginfo_layout siginfo_layout(int sig, int si_code) layout = SIL_POLL; else if (si_code < 0) layout = SIL_RT; - /* Tests to support buggy kernel ABIs */ -#ifdef TRAP_FIXME - if ((sig == SIGTRAP) && (si_code == TRAP_FIXME)) - layout = SIL_FAULT; -#endif -#ifdef FPE_FIXME - if ((sig == SIGFPE) && (si_code == FPE_FIXME)) - layout = SIL_FAULT; -#endif } return layout; } -- 2.14.1
Re: sparc/ppc/arm compat siginfo ABI regressions: sending SIGFPE via kill() returns wrong values in si_pid and si_uid
Russell King - ARM Linux writes: > On Fri, Apr 13, 2018 at 12:53:49PM -0700, Linus Torvalds wrote: >> On Fri, Apr 13, 2018 at 11:45 AM, Dave Martin wrote: >> > >> > Most uses I've seen do nothing more than use the FPE_xyz value to >> > format diagnostic messages while dying. I struggled to find code that >> > made a meaningful functional decision based on the value, though that's >> > not proof... >> >> Yeah. I've seen code that cares about SIGFPE deeply, but it's almost >> invariably about some emulated environment (eg Java VM, or CPU >> emulation). >> >> And the siginfo data is basically never good enough for those >> environments anyway on its own, so they will go and look at the actual >> instruction that caused the fault and the register state instead, >> because they need *all* the information. >> >> The cases that use si_code are the ones that just trapped signals in >> order to give a more helpful abort message. >> >> So I could certainly imagine that si_code is actually used by somebody >> who then decides to actuall act differently on it, but aside from >> perhaps printing out a different message, it sounds far-fetched. > > Okay, in that case let's just use FPE_FLTINV. That makes the patch > easily back-portable for stable kernels. If we want to I don't think backporting 266da65e9156 ("signal: Add FPE_FLTUNK si_code for undiagnosable fp exceptions") would be at all difficult. What it is changing has been stable for quite a while. The surroundings might change and so it might require some trivial manual fixup but I don't expect any problems. Not that I want to derail the consensus but if we want to backport similar fixes for arm64 or the other architectures that wind up using FPE_FLTUNK for their fix we would need to backport 266da65e9156 anyway. Eric
Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER
Linus Torvalds writes: > ( > > On Sun, Apr 15, 2018 at 8:56 AM, Eric W. Biederman > wrote: >> >> Would you consider the patchset below for -rc2? > > Ugh. The point of this series is to squash the potential for regressions even from the weird broken code that fills in fields for si_code 0 that do not match SI_USER. The copy_siginfo_to_user32 never handled that so we don't need to worry about that. All of the signals that abuse si_code 0 go through force_sig_info so signalfd can not catch them. Which means that only copy_siginfo_to_user needs to be worried about. Last time I was benchmarking I could not see a difference between copying the entire siginfo and only a small part so I don't see the point of optimizing now. For an actual cleanup I intend to go farther than you are proposing and convert everthing to the set of helper functions I have added to kernel/signal.c force_sig_fault, force_sig_mceerr, force_sig_bnderr, force_sig_pkuerr. When I am done there will be maybe 5 instances of clear_siginfo outside of those helper functions. At which point I won't care if we remove clear_siginfo and just use memset. That is a real improvement that looks something like: "105 files changed, 933 insertions(+), 1698 deletions(-)" That is my end goal with all of this. You complained to me about regressions and you are right with the current handling of FPE_FIXME TRAP_FIXME the code is not bug compatible with old versions of linux, and people have noticed. What this patchset represents is the bare minimum needed to convert copy_siginfo_to_user to a copy_to_user, and remove the possibility of regressions. To that end I need to ensure every struct siginfo in the entire kernel is memset to 0. A structure initializer is absolutely not enough. So I don't mind people thinking clear_siginfo is necessary. To that end clear_siginfo where it does not take the size of struct siginfo as a parameter is much less suceptible to typos, and allows me to ensure every instance of struct siginfo is fully initialized. I fully intend to remove practically every instance of struct siginfo from the architecture specific code, and use the aforementioned helpers. The trouble is that some of the architecture code need refactoring for that to happen. Even your proposed init_siginfo can't be widely used as a common pattern is to fill in siginfo partially in the code with si_code and si_signo being filled in almost last. So in short. I intend to remove most of these clear_siginfo calls when I remove the siginfos later. This series is focused on making copy_siginfo_to_user simple enough we don't need to use siginfo_layout, and thus can be fully backwards compatibile with ourselves and we won't need to worry about regressions. I have aimed to keep this simple enough we can merge this after -rc1 because we don't want regressions. Eric ps. I intend to place this change first in my series wether or not it makes it into -rc2 so that I can be certain we remove any possible regressions in behavior on the buggy architectures. Then I can take my time and ensure the non-trivial changes of refactoring etc are done carefully so I don't introduce other bugs. I need that so I can sleep at night. pps. I can look at some of your other suggestions but cleverness leads to regressions, and if you are going to complain at me harshly when I have been being careful and taking things seriously I am not particularly willing to take unnecessary chances.
Re: [RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized
Dave Martin writes: > Hmmm > > memset()/clear_siginfo() may ensure that there are no uninitialised > explicit fields except for those in inactive union members, but I'm not > sure that this approach is guaranteed to sanitise the padding seen by > userspace. > > Rationale below, though it's a bit theoretical... > > With this in mind, I tend agree with Linus that hiding memset() calls > from the maintainer may be a bad idea unless they are also hidden from > the compiler. If the compiler sees the memset() it may be able to > optimise it in ways that wouldn't be possible for some other random > external function call, including optimising all or part of the call > out. > > As a result, the breakdown into individual put_user()s etc. in > copy_siginfo_to_user() may still be valuable even if all paths have the > memset(). The breakdown into individual put_user()s is known to be problematically slow, and is actually wrong. Even exclusing the SI_USER duplication in a small number of cases the fields filled out in siginfo by architecture code are not the fields that copy_siginfo_to_user is copying. Which is much worse. The code looks safe but is not. My intention is to leave 0 instances of clear_siginfo in the architecture specific code. Ideally struct siginfo will be limited to kernel/signal.c but I am not certain I can quite get that far. The function do_coredump appears to have a legit need for siginfo. > (Rationale for an arch/arm example:) > >> diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c >> index 4c375e11ae95..adda3fc2dde8 100644 >> --- a/arch/arm/vfp/vfpmodule.c >> +++ b/arch/arm/vfp/vfpmodule.c >> @@ -218,8 +218,7 @@ static void vfp_raise_sigfpe(unsigned int sicode, struct >> pt_regs *regs) >> { >> siginfo_t info; >> >> -memset(&info, 0, sizeof(info)); >> - >> +clear_siginfo(&info); >> info.si_signo = SIGFPE; > > /* by c11 (n1570) 6.2.6.1 para 6 [1], all padding bytes in info now take >unspecified values */ > >> info.si_code = sicode; >> info.si_addr = (void __user *)(instruction_pointer(regs) - 4); > > /* by c11 (n1570) 6.2.6.1 para 7 [2], all bytes of the union info._sifields >other than than those corresponding to _sigfault take unspecified >values */ > > So I don't see why the compiler needs to ensure that any of the affected > bytes are zero: it could potentially skip a lot of the memset() as a > result, in theory. > > I've not seen a compiler actually take advantage of that, but I'm now > not sure what forbids it. I took a quick look at gcc-4.9 which I have handy. The passes -f-no-strict-aliasing which helps, and gcc actually documents that if you access things through the union it will not take advantage of c11. gcc-4.9 Documents it this way: > -fstrict-aliasing' > Allow the compiler to assume the strictest aliasing rules > applicable to the language being compiled. For C (and C++), this > activates optimizations based on the type of expressions. In > particular, an object of one type is assumed never to reside at the > same address as an object of a different type, unless the types are > almost the same. For example, an 'unsigned int' can alias an > 'int', but not a 'void*' or a 'double'. A character type may alias > any other type. > > Pay special attention to code like this: > union a_union { > int i; > double d; > }; > > int f() { > union a_union t; > t.d = 3.0; > return t.i; > } > The practice of reading from a different union member than the one > most recently written to (called "type-punning") is common. Even > with '-fstrict-aliasing', type-punning is allowed, provided the > memory is accessed through the union type. So, the code above > works as expected. > If this can happen, I only see two watertight workarounds: > > 1) Ensure that there is no implicit padding in any UAPI structure, e.g. > aeb1f39d814b: ("arm64/ptrace: Avoid uninitialised struct padding in > fpr_set()"). This would include tail-padding of any union member that > is smaller than the containing union. > > It would be significantly more effort to ensure this for siginfo though. > > 2) Poke all values directly into allocated or user memory directly > via pointers to paddingless types; never assign to objects on the kernel > stack if you care what ends up in the padding, e.g., what your > copy_siginfo_to_user() does prior to this series. > > > If I'm not barking up the wrong tree, memset() cannot generally be > used to determine the value of padding bytes, but it may still be > useful for forcing otherwise uninitialised members to sane initial > values. > > This likely affects many more things than just siginfo. Unless gcc has changed it's stance on type-punning through unions or it's semantics with -fno-strict_aliasing we should be good. Eric
Re: [RFC PATCH 1/3] signal: Ensure every siginfo we send has all bits initialized
Dave Martin writes: > On Tue, Apr 17, 2018 at 02:37:38PM -0500, Eric W. Biederman wrote: >> Dave Martin writes: >> >> > Hmmm >> > >> > memset()/clear_siginfo() may ensure that there are no uninitialised >> > explicit fields except for those in inactive union members, but I'm not >> > sure that this approach is guaranteed to sanitise the padding seen by >> > userspace. >> > >> > Rationale below, though it's a bit theoretical... >> > >> > With this in mind, I tend agree with Linus that hiding memset() calls >> > from the maintainer may be a bad idea unless they are also hidden from >> > the compiler. If the compiler sees the memset() it may be able to >> > optimise it in ways that wouldn't be possible for some other random >> > external function call, including optimising all or part of the call >> > out. >> > >> > As a result, the breakdown into individual put_user()s etc. in >> > copy_siginfo_to_user() may still be valuable even if all paths have the >> > memset(). >> >> The breakdown into individual put_user()s is known to be problematically >> slow, and is actually wrong. > > Slowness certainly looked like a potential problem. > >> Even exclusing the SI_USER duplication in a small number of cases the >> fields filled out in siginfo by architecture code are not the fields >> that copy_siginfo_to_user is copying. Which is much worse. The code >> looks safe but is not. >> >> My intention is to leave 0 instances of clear_siginfo in the >> architecture specific code. Ideally struct siginfo will be limited to >> kernel/signal.c but I am not certain I can quite get that far. >> The function do_coredump appears to have a legit need for siginfo. > > So, you mean we can't detect that the caller didn't initialise all the > members, or initialised the wrong union member? Correct. Even when we smuggled the the union member in the upper bits of si_code we got it wrong. So an interface that helps out and does more and is harder to misues looks desirable. > What would be the alternative? Have a separate interface for each SIL_ > type, with only kernel/signal.c translating that into the siginfo_t that > userspace sees? Yes. It really isn't bad as architecture specific code only generates faults. In general faults only take a pointer. I have already merged the needed helpers into kernel/signal.c > Either way, I don't see how we force the caller to initilise the whole > structure. In general the plan is to convert the callers to call force_sig_fault, and then there is no need to have siginfo in the architecture specific code. I have all of the necessary helpers are already merged into kernel/signal.c > >> > (Rationale for an arch/arm example:) >> > >> >> diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c >> >> index 4c375e11ae95..adda3fc2dde8 100644 >> >> --- a/arch/arm/vfp/vfpmodule.c >> >> +++ b/arch/arm/vfp/vfpmodule.c >> >> @@ -218,8 +218,7 @@ static void vfp_raise_sigfpe(unsigned int sicode, >> >> struct pt_regs *regs) >> >> { >> >> siginfo_t info; >> >> >> >> - memset(&info, 0, sizeof(info)); >> >> - >> >> + clear_siginfo(&info); >> >> info.si_signo = SIGFPE; >> > >> > /* by c11 (n1570) 6.2.6.1 para 6 [1], all padding bytes in info now take >> >unspecified values */ >> > >> >> info.si_code = sicode; >> >> info.si_addr = (void __user *)(instruction_pointer(regs) - 4); >> > >> > /* by c11 (n1570) 6.2.6.1 para 7 [2], all bytes of the union info._sifields >> >other than than those corresponding to _sigfault take unspecified >> >values */ >> > >> > So I don't see why the compiler needs to ensure that any of the affected >> > bytes are zero: it could potentially skip a lot of the memset() as a >> > result, in theory. >> > >> > I've not seen a compiler actually take advantage of that, but I'm now >> > not sure what forbids it. >> >> I took a quick look at gcc-4.9 which I have handy. >> >> The passes -f-no-strict-aliasing which helps, and gcc actually >> documents that if you access things through the union it will >> not take advantage of c11. >> >> gcc-4.9 Documents it this way: >> >> > -fstrict-aliasing' >> > Allow the compiler to assume the strictest aliasing rules >> > applicable to the la
Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER
Linus Torvalds writes: > ( > > On Sun, Apr 15, 2018 at 8:56 AM, Eric W. Biederman > wrote: [snip bit about wanting what is effectively force_sig_fault instead of clear_siginfo everywhere] > The other thing we should do is to get rid of the stupid padding. > Right now "struct siginfo" is pointlessly padded to 128 bytes. That is > completely insane, when it's always just zero in the kernel. > > So put that _pad[] thing inside #ifndef __KERNEL__, and make > copy_siginfo_to_user() write the padding zeroes when copying to user > space. The reason for the padding is "future expansion", so we do want > to tell the user space that it's maybe up to 128 bytes in size, but if > we don't fill it all, we shouldn't waste time and memory on clearing > the padding internally. > > I'm certainly *hoping* nobody depends on the whole 128 bytes in > rt_sigqueueinfo(). In theory you can fill it all (as long as si_code > is negative), but the man-page only says "si_value", and the compat > function doesn't copy any more than that either, so any user that > tries to fill in more than si_value is already broken. In fact, it > might even be worth enforcing that in rt_sigqueueinfo(), just to see > if anybody plays any games.. >From my earlier looking I think this is all doable except detecting if > On x86-64, without the pointless padding, the size of 'struct siginfo' > inside the kernel would be 48 bytes. That's quite a big difference for > something that is often allocated on the kernel stack. >From my earlier looking I can say that I know of no case where signals are injected into the kernel that we need more bytes than the kernel currently provides. The two cases I know of are signal reinjection for checkpoint/restart or something that just uses pid, uid and ptr. Generally that is enough. If we just truncate siginfo to everything except the pad bytes in the kernel there should be no problems. What I don't see how to do is to limit the size of struct siginfo the kernel accepts in a forward compatible way. Something that would fail if for some reason you used more siginfo bytes. Looking at userspace. glibc always memsets siginfo_t to 0. Criu just uses whatever it captured with PTRACE_PEEKSIGINFO, and criu uses unitialized memory for the target of PTRACE_PEEKSIGINFO. If we truncate siginfo in the kernel then we check for a known si_code in which case we are safe to truncate siginfo. If the si_code is unknown then we should check to see if the extra bytes are 0. That should work with everything. Plus it is a natural point to test to see if userspace is using signals that the kernel does not currently support. I will put that in my queue. > So I'm certainly willing to make those kinds of changes, but let's > make them real *improvements* now, ok? Wasn't that the point of all > the cleanups in the end? Definitely. With the strace test case causing people to talk about regressions I was just looking to see if it would make sense to do something minimal for -rc2 so reduce concerns about regressions. Now I am going to focus on getting what I can ready for the next merge window. Eric
Re: [RFC PATCH 0/3] Dealing with the aliases of SI_USER
Dave Martin writes: > On Sun, Apr 15, 2018 at 11:16:04AM -0700, Linus Torvalds wrote: > > [...] > >> The other thing we should do is to get rid of the stupid padding. >> Right now "struct siginfo" is pointlessly padded to 128 bytes. That is >> completely insane, when it's always just zero in the kernel. > > Agreed, inside the kernel the padding achieves nothing. > >> So put that _pad[] thing inside #ifndef __KERNEL__, and make >> copy_siginfo_to_user() write the padding zeroes when copying to user >> space. The reason for the padding is "future expansion", so we do want >> to tell the user space that it's maybe up to 128 bytes in size, but if >> we don't fill it all, we shouldn't waste time and memory on clearing >> the padding internally. >> >> I'm certainly *hoping* nobody depends on the whole 128 bytes in >> rt_sigqueueinfo(). In theory you can fill it all (as long as si_code >> is negative), but the man-page only says "si_value", and the compat >> function doesn't copy any more than that either, so any user that >> tries to fill in more than si_value is already broken. In fact, it >> might even be worth enforcing that in rt_sigqueueinfo(), just to see >> if anybody plays any games.. > > [...] > > Digression: > > Since we don't traditionally zero the tail-padding in the user sigframe, > is there a reliable way for userspace to detect newly-added fields in > siginfo other than by having an explicit sigaction sa_flags flag to > request them? I imagine something like [1] below from the userspace > perspective. > > On a separate thread, the issue of how to report syndrome information > for SIGSEGV came up [2] (such as whether the faulting instruction was a > read or a write). This information is useful (and used) by things like > userspace sanitisers and qemu. Currently, reporting this to userspace > relies on arch-specific cruft in the sigframe. > > We're committed to maintaining what's already in each arch sigframe, > but it would be preferable to have a portable way of adding information > to siginfo in a generic way. si_code doesn't really work for that, > since si_codes are mutually exclusive: I can't see a way of adding > supplementary information using si_code. > > Anyway, that would be a separate RFC in the future (if ever). So far what I have seen done is ``registers'' added to sigcontext. Which it looks like you have done with esr. Scrubbing information from faults to where the addresses point outside of the userspace mapping makes sense. I think before I would pursue what you are talking about on a generic level I would want to look at the fact that we handle unblockable faults wrong. While unlikely it is possible for someone to send a thread specific signal at just the right time, and have that signal delivered before the synchronous fault. Then we could pass through additional arguments through that new ``generic'' path. Especially what are arguments such as tsk->thread.fault_address and tsk->thread.fault_code. We can do anything we like with a new SA_flag as that allows us to change the format of the sigframe. If we are very careful we can add generic fields after that crazy union anonymous union in the _sigfault case of struct siginfo. The trick would be to find something that would be enough so that people don't need to implement their own instruction decoder to see what is going on. Something that is applicable to every sigfault case not just SIGSEGV. Something that we can and want to implement on multiple architectures. The point being doing something generic can be a lot of work, even if it is worth it in the end. > [1] > > static volatile int have_extflags = 0; > > static void handler(int n, siginfo_t *si, void *uc) > { > /* ... */ > > if (have_extflags) { > /* Check si->si_extflags */ > } else { > /* fallback */ > } > > /* ... */ > } > > int main(void) > { > /* ... */ > > struct sigaction sa; > > /* ... */ > > sa.sa_flags = SA_SIGINFO | SA_SIGINFO_EXTFLAGS; > sa.sa_sigaction = handler; > if (!sigaction(SIGSEGV, &sa, NULL)) { > have_extflags = 1; > } else { > sa.sa_flags &= ~SA_SIGINFO_EXTFLAGS; > if (sigaction(SIGSEGV, &sa, NULL)) > goto error; > } > > /* ... */ > } > > [2] [RFC PATCH] arm64: fault: Don't leak data in ESR context for user fault > on kernel VA > http://lists.infradead.org/pipermail/linux-arm-kernel/2018-April/571428.html Eric
Re: [PATCH v3 01/17] y2038: asm-generic: Extend sysvipc data structures
Arnd Bergmann writes: > Most architectures now use the asm-generic copy of the sysvipc data > structures (msqid64_ds, semid64_ds, shmid64_ds), which use 32-bit > __kernel_time_t on 32-bit architectures but have padding behind them to > allow extending the type to 64-bit. > > Unfortunately, that fails on all big-endian architectures, which have the > padding on the wrong side. As so many of them get it wrong, we decided to > not bother even trying to fix it up when we introduced the asm-generic > copy. Instead we always use the padding word now to provide the upper > 32 bits of the seconds value, regardless of the endianess. > > A libc implementation on a typical big-endian system can deal with > this by providing its own copy of the structure definition to user > space, and swapping the two 32-bit words before returning from the > semctl/shmctl/msgctl system calls. > > ARM64 and s/390 are architectures that use these generic headers and > also provide support for compat mode on 64-bit kernels, so we adapt > their copies here as well. > > Signed-off-by: Arnd Bergmann > --- > include/uapi/asm-generic/msgbuf.h | 17 - > include/uapi/asm-generic/sembuf.h | 26 -- > include/uapi/asm-generic/shmbuf.h | 17 - > 3 files changed, 32 insertions(+), 28 deletions(-) > > diff --git a/include/uapi/asm-generic/msgbuf.h > b/include/uapi/asm-generic/msgbuf.h > index fb306ebdb36f..d2169cae93b8 100644 > --- a/include/uapi/asm-generic/msgbuf.h > +++ b/include/uapi/asm-generic/msgbuf.h > @@ -18,23 +18,22 @@ > * On big-endian systems, the padding is in the wrong place. > * > * Pad space is left for: > - * - 64-bit time_t to solve y2038 problem > * - 2 miscellaneous 32-bit values > */ > > struct msqid64_ds { > struct ipc64_perm msg_perm; > +#if __BITS_PER_LONG == 64 > __kernel_time_t msg_stime; /* last msgsnd time */ > -#if __BITS_PER_LONG != 64 > - unsigned long __unused1; > -#endif > __kernel_time_t msg_rtime; /* last msgrcv time */ > -#if __BITS_PER_LONG != 64 > - unsigned long __unused2; > -#endif > __kernel_time_t msg_ctime; /* last change time */ > -#if __BITS_PER_LONG != 64 > - unsigned long __unused3; > +#else > + unsigned long msg_stime; /* last msgsnd time */ > + unsigned long msg_stime_high; > + unsigned long msg_rtime; /* last msgrcv time */ > + unsigned long msg_rtime_high; > + unsigned long msg_ctime; /* last change time */ > + unsigned long msg_ctime_high; > #endif I suspect you want to use __kernel_ulong_t here instead of a raw unsigned long. If nothing else it seems inconsistent to use typedefs in one half of the structure and no typedefs in the other half. > __kernel_ulong_t msg_cbytes;/* current number of bytes on queue */ > __kernel_ulong_t msg_qnum; /* number of messages in queue */ > diff --git a/include/uapi/asm-generic/sembuf.h > b/include/uapi/asm-generic/sembuf.h > index cbf9cfe977d6..0bae010f1b64 100644 > --- a/include/uapi/asm-generic/sembuf.h > +++ b/include/uapi/asm-generic/sembuf.h > @@ -13,23 +13,29 @@ > * everyone just ended up making identical copies without specific > * optimizations, so we may just as well all use the same one. > * > - * 64 bit architectures typically define a 64 bit __kernel_time_t, > + * 64 bit architectures use a 64-bit __kernel_time_t here, while > + * 32 bit architectures have a pair of unsigned long values. > * so they do not need the first two padding words. > - * On big-endian systems, the padding is in the wrong place. > * > - * Pad space is left for: > - * - 64-bit time_t to solve y2038 problem > - * - 2 miscellaneous 32-bit values > + * On big-endian systems, the padding is in the wrong place for > + * historic reasons, so user space has to reconstruct a time_t > + * value using > + * > + * user_semid_ds.sem_otime = kernel_semid64_ds.sem_otime + > + * ((long long)kernel_semid64_ds.sem_otime_high << 32) > + * > + * Pad space is left for 2 miscellaneous 32-bit values > */ > struct semid64_ds { > struct ipc64_perm sem_perm; /* permissions .. see ipc.h */ > +#if __BITS_PER_LONG == 64 > __kernel_time_t sem_otime; /* last semop time */ > -#if __BITS_PER_LONG != 64 > - unsigned long __unused1; > -#endif > __kernel_time_t sem_ctime; /* last change time */ > -#if __BITS_PER_LONG != 64 > - unsigned long __unused2; > +#else > + unsigned long sem_otime; /* last semop time */ > + unsigned long sem_otime_high; > + unsigned long sem_ctime; /* last change time */ > + unsigned long sem_ctime_high; > #endif > unsigned long sem_nsems; /* no. of semaphores in array */ > unsigned long __unused3; > diff --git a/include/uapi/asm-generic/shmbuf.h > b/include/uapi/asm-generic/shmbuf.h > index 2b6c3bb97f97..602f1b5b462b 100644 > --- a/include/uapi/asm-generic/shmbuf
Re: [PATCH v3 01/17] y2038: asm-generic: Extend sysvipc data structures
Arnd Bergmann writes: > On Thu, Apr 19, 2018 at 5:20 PM, Arnd Bergmann wrote: >> On Thu, Apr 19, 2018 at 4:59 PM, Eric W. Biederman >> wrote: >>> I suspect you want to use __kernel_ulong_t here instead of a raw >>> unsigned long. If nothing else it seems inconsistent to use typedefs >>> in one half of the structure and no typedefs in the other half. >> >> Good catch, there is definitely something wrong here, but I think using >> __kernel_ulong_t for all members would also be wrong, as that >> still changes the layout on x32, which effectively is >> >> struct msqid64_ds { >> ipc64_perm msg_perm; >> u64 msg_stime; >> u32 __unused1; >> /* 32 bit implict padding */ >> u64 msg_rtime; >> u32 __unused2; >> /* 32 bit implict padding */ >> u64 msg_ctime; >> u32 __unused3; >> /* 32 bit implict padding */ >> __kernel_pid_t shm_cpid; /* pid of creator */ >> __kernel_pid_t shm_lpid; /* pid of last operator */ >> >> }; >> >> The choices here would be to either use a mix of >> __kernel_ulong_t and unsigned long, or taking the x32 >> version back into arch/x86/include/uapi/asm/ so the >> generic version at least makes some sense. >> >> I can't use __kernel_time_t for the lower half on 32-bit >> since it really should be unsigned. > > After thinking about it some more, I conclude that the structure is simply > incorrect on x32: The __kernel_ulong_t usage was introduced in 2013 > in commit b9cd5ca22d67 ("uapi: Use __kernel_ulong_t in struct > msqid64_ds") and apparently was correct initially as __BITS_PER_LONG > evaluated to 64, but it broke with commit f4b4aae18288 ("x86/headers/uapi: > Fix __BITS_PER_LONG value for x32 builds") that changed the value > of __BITS_PER_LONG and introduced the extra padding in 2015. > > The same change apparently also broke a lot of other definitions, e.g. > > $ echo "#include " | gcc -mx32 -E -xc - | grep -A3 > __kernel_size_t > typedef unsigned int __kernel_size_t; > typedef int __kernel_ssize_t; > typedef int __kernel_ptrdiff_t; > > Those used to be defined as 'unsigned long long' and 'long long' > respectively, so now all kernel interfaces using those on x32 > became incompatible! That seems like a real mess. Is this just for the uapi header as seen by userspace? I expect we are using the a normal kernel interface with 64bit longs and 64bit pointers when we build the kernel. If this is just a header as seen from userspace mess it seems unfortunate but fixable. Eric
[REVIEW][PATCH 17/17] signal/powerpc: Replace TRAP_FIXME with TRAP_UNK
Using an si_code of 0 that aliases with SI_USER is clearly the wrong thing todo, and causes problems in interesting ways. For use in unknown_exception the recently defined TRAP_UNK semantically is a perfect fit. For use in RunModeException it looks like something more specific than TRAP_UNK could be used. No one has bothered to find a better fit than the broken si_code of 0 in all of these years and I don't see an obvious better fit so TRAP_UNK is switching RunModeException to return TRAP_UNK is clearly an improvement. Recent history suggests no actually cares about crazy corner cases of the kernel behavior like this so I don't expect any regressions from changing this. However if something does happen this change is easy to revert. Though I wonder if SIGKILL might not be a better fit. Cc: Paul Mackerras Cc: Kumar Gala Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: linuxppc-dev@lists.ozlabs.org Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx") Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- arch/powerpc/include/uapi/asm/siginfo.h | 8 arch/powerpc/kernel/traps.c | 4 ++-- 2 files changed, 2 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/uapi/asm/siginfo.h b/arch/powerpc/include/uapi/asm/siginfo.h index 0437afc9ef3c..1d51d9b88221 100644 --- a/arch/powerpc/include/uapi/asm/siginfo.h +++ b/arch/powerpc/include/uapi/asm/siginfo.h @@ -15,12 +15,4 @@ #include -/* - * SIGTRAP si_codes - */ -#ifdef __KERNEL__ -#define TRAP_FIXME 0 /* Broken dup of SI_USER */ -#endif /* __KERNEL__ */ - - #endif /* _ASM_POWERPC_SIGINFO_H */ diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index fdf9400beec8..0e17dcb48720 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -969,7 +969,7 @@ void unknown_exception(struct pt_regs *regs) printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); - _exception(SIGTRAP, regs, TRAP_FIXME, 0); + _exception(SIGTRAP, regs, TRAP_UNK, 0); exception_exit(prev_state); } @@ -991,7 +991,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs) void RunModeException(struct pt_regs *regs) { - _exception(SIGTRAP, regs, TRAP_FIXME, 0); + _exception(SIGTRAP, regs, TRAP_UNK, 0); } void single_step_exception(struct pt_regs *regs) -- 2.14.1
[REVIEW][PATCH 13/17] signal/powerpc: Replace FPE_FIXME with FPE_FLTUNK
Using an si_code of 0 that aliases with SI_USER is clearly the wrong thing todo, and causes problems in interesting ways. The newly defined FPE_FLTUNK semantically appears to fit the bill so use it instead. Cc: Paul Mackerras Cc: Kumar Gala Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: linuxppc-dev@lists.ozlabs.org Fixes: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx") Fixes: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" --- arch/powerpc/include/uapi/asm/siginfo.h | 7 --- arch/powerpc/kernel/traps.c | 6 +++--- 2 files changed, 3 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/uapi/asm/siginfo.h b/arch/powerpc/include/uapi/asm/siginfo.h index 9f142451a01f..0437afc9ef3c 100644 --- a/arch/powerpc/include/uapi/asm/siginfo.h +++ b/arch/powerpc/include/uapi/asm/siginfo.h @@ -15,13 +15,6 @@ #include -/* - * SIGFPE si_codes - */ -#ifdef __KERNEL__ -#define FPE_FIXME 0 /* Broken dup of SI_USER */ -#endif /* __KERNEL__ */ - /* * SIGTRAP si_codes */ diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 087855caf6a9..fdf9400beec8 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -1031,7 +1031,7 @@ static void emulate_single_step(struct pt_regs *regs) static inline int __parse_fpscr(unsigned long fpscr) { - int ret = FPE_FIXME; + int ret = FPE_FLTUNK; /* Invalid operation */ if ((fpscr & FPSCR_VE) && (fpscr & FPSCR_VX)) @@ -1972,7 +1972,7 @@ void SPEFloatingPointException(struct pt_regs *regs) extern int do_spe_mathemu(struct pt_regs *regs); unsigned long spefscr; int fpexc_mode; - int code = FPE_FIXME; + int code = FPE_FLTUNK; int err; flush_spe_to_thread(current); @@ -2041,7 +2041,7 @@ void SPEFloatingPointRoundException(struct pt_regs *regs) printk(KERN_ERR "unrecognized spe instruction " "in %s at %lx\n", current->comm, regs->nip); } else { - _exception(SIGFPE, regs, FPE_FIXME, regs->nip); + _exception(SIGFPE, regs, FPE_FLTUNK, regs->nip); return; } } -- 2.14.1
[REVIEW][PATCH 15/17] signal: Add TRAP_UNK si_code for undiagnosted trap exceptions
Both powerpc and alpha have cases where they wronly set si_code to 0 in combination with SIGTRAP and don't mean SI_USER. About half the time this is because the architecture can not report accurately what kind of trap exception triggered the trap exception. The other half the time it looks like no one has bothered to figure out an appropriate si_code. For the cases where the architecture does not have enough information or is too lazy to figure out exactly what kind of trap exception it is define TRAP_UNK. Cc: linux-...@vger.kernel.org Cc: linux-a...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: "Eric W. Biederman" --- arch/x86/kernel/signal_compat.c| 2 +- include/uapi/asm-generic/siginfo.h | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c index 14c057f29979..9ccbf0576cd0 100644 --- a/arch/x86/kernel/signal_compat.c +++ b/arch/x86/kernel/signal_compat.c @@ -29,7 +29,7 @@ static inline void signal_compat_build_tests(void) BUILD_BUG_ON(NSIGFPE != 15); BUILD_BUG_ON(NSIGSEGV != 7); BUILD_BUG_ON(NSIGBUS != 5); - BUILD_BUG_ON(NSIGTRAP != 4); + BUILD_BUG_ON(NSIGTRAP != 5); BUILD_BUG_ON(NSIGCHLD != 6); BUILD_BUG_ON(NSIGSYS != 1); diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h index 558b902f18d4..80e2a7227205 100644 --- a/include/uapi/asm-generic/siginfo.h +++ b/include/uapi/asm-generic/siginfo.h @@ -249,7 +249,8 @@ typedef struct siginfo { #define TRAP_TRACE 2 /* process trace trap */ #define TRAP_BRANCH 3 /* process taken branch trap */ #define TRAP_HWBKPT 4 /* hardware breakpoint/watchpoint */ -#define NSIGTRAP 4 +#define TRAP_UNK 5 /* undiagnosed trap */ +#define NSIGTRAP 5 /* * There is an additional set of SIGTRAP si_codes used by ptrace -- 2.14.1
Re: [PATHC v2 0/9] ima: carry the measurement list across kexec
Mimi Zohar writes: > Hi Andrew, > > On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote: >> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote: >> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar >> > wrote: >> > >> > > The TPM PCRs are only reset on a hard reboot. In order to validate a >> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list >> > > of the running kernel must be saved and then restored on the subsequent >> > > boot, possibly of a different architecture. >> > > >> > > The existing securityfs binary_runtime_measurements file conveniently >> > > provides a serialized format of the IMA measurement list. This patch >> > > set serializes the measurement list in this format and restores it. >> > > >> > > Up to now, the binary_runtime_measurements was defined as architecture >> > > native format. The assumption being that userspace could and would >> > > handle any architecture conversions. With the ability of carrying the >> > > measurement list across kexec, possibly from one architecture to a >> > > different one, the per boot architecture information is lost and with it >> > > the ability of recalculating the template digest hash. To resolve this >> > > problem, without breaking the existing ABI, this patch set introduces >> > > the boot command line option "ima_canonical_fmt", which is arbitrarily >> > > defined as little endian. >> > > >> > > The need for this boot command line option will be limited to the >> > > existing version 1 format of the binary_runtime_measurements. >> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0 >> > > support for larger digests). >> > > >> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer >> > > hand-over for the next kernel" patch set. >> > > >> > > These patches can also be found in the next-kexec-restore branch of: >> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git >> > >> > I'll merge these into -mm to get some linux-next exposure. I don't >> > know what your upstream merge plans will be? >> >> Sounds good. I'm hoping to get some review/comments on this patch set >> as well. At the moment, I'm chasing down a kernel test robot report >> from this afternoon. > > My concern about changing the canonical format as originally defined in > patch 9/9 from big endian to little endian never materialized. Andreas > Steffan, the patch author, is happy either way. > > We proposed two methods of addressing Eric Biederman's concerns of not > including the IMA measurement list segment in the kexec hash as > described in https://lkml.org/lkml/2016/9/9/355. > > - defer calculating and verifying the serialized IMA measurement list > buffer hash to IMA > - calculate the kexec hash on load, verify it on the kexec execute, > before re-calculating and updating it. I need to ask: How this is anticipated to interact with kexec on panic? Because honestly I can't see this ever working in that case. The assumption is that the original kernel has gone crazy. So from a practical standpoint any trusted path should have been invalided. This entire idea of updating the kexec image makes me extremely extremely nervious. It feels like sticking a screw driver through the spokes of your bicicle tires while ridding down the road. I can see tracking to see if the list has changed at some point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail. At least the common bootloader cases that I know of using kexec are very minimal distributions that live in a ramdisk and as such it should be very straight forward to measure what is needed at or before sys_kexec_load. But that was completely dismissed as unrealistic so I don't have a clue what actual problem you are trying to solve. If there is anyway we can start small and not with this big scary infrastructure change I would very much prefer it. Eric
Re: [PATHC v2 0/9] ima: carry the measurement list across kexec
ebied...@xmission.com (Eric W. Biederman) writes: > Mimi Zohar writes: > >> Hi Andrew, >> >> On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote: >>> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote: >>> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar >>> > wrote: >>> > >>> > > The TPM PCRs are only reset on a hard reboot. In order to validate a >>> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list >>> > > of the running kernel must be saved and then restored on the subsequent >>> > > boot, possibly of a different architecture. >>> > > >>> > > The existing securityfs binary_runtime_measurements file conveniently >>> > > provides a serialized format of the IMA measurement list. This patch >>> > > set serializes the measurement list in this format and restores it. >>> > > >>> > > Up to now, the binary_runtime_measurements was defined as architecture >>> > > native format. The assumption being that userspace could and would >>> > > handle any architecture conversions. With the ability of carrying the >>> > > measurement list across kexec, possibly from one architecture to a >>> > > different one, the per boot architecture information is lost and with it >>> > > the ability of recalculating the template digest hash. To resolve this >>> > > problem, without breaking the existing ABI, this patch set introduces >>> > > the boot command line option "ima_canonical_fmt", which is arbitrarily >>> > > defined as little endian. >>> > > >>> > > The need for this boot command line option will be limited to the >>> > > existing version 1 format of the binary_runtime_measurements. >>> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0 >>> > > support for larger digests). >>> > > >>> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer >>> > > hand-over for the next kernel" patch set. >>> > > >>> > > These patches can also be found in the next-kexec-restore branch of: >>> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git >>> > >>> > I'll merge these into -mm to get some linux-next exposure. I don't >>> > know what your upstream merge plans will be? >>> >>> Sounds good. I'm hoping to get some review/comments on this patch set >>> as well. At the moment, I'm chasing down a kernel test robot report >>> from this afternoon. >> >> My concern about changing the canonical format as originally defined in >> patch 9/9 from big endian to little endian never materialized. Andreas >> Steffan, the patch author, is happy either way. >> >> We proposed two methods of addressing Eric Biederman's concerns of not >> including the IMA measurement list segment in the kexec hash as >> described in https://lkml.org/lkml/2016/9/9/355. >> >> - defer calculating and verifying the serialized IMA measurement list >> buffer hash to IMA >> - calculate the kexec hash on load, verify it on the kexec execute, >> before re-calculating and updating it. > > I need to ask: How this is anticipated to interact with kexec on panic? > Because honestly I can't see this ever working in that case. The > assumption is that the original kernel has gone crazy. So from a > practical standpoint any trusted path should have been invalided. > > This entire idea of updating the kexec image makes me extremely > extremely nervious. It feels like sticking a screw driver through the > spokes of your bicicle tires while ridding down the road. > > I can see tracking to see if the list has changed at some > point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail. > > At least the common bootloader cases that I know of using kexec are very > minimal distributions that live in a ramdisk and as such it should be > very straight forward to measure what is needed at or before > sys_kexec_load. But that was completely dismissed as unrealistic so I > don't have a clue what actual problem you are trying to solve. > > If there is anyway we can start small and not with this big scary > infrastructure change I would very much prefer it. I have thought about this a little more and the entire reason for updating things on the fly really really disturbs me. To prove you are trusted the new kernel is going to have to present that whole trusted list of files to someone. Which means in my naive understanding of the situation that any change in that list of files is going to have to be tracked and audited. So the idea that we have to be super flexible in the kernel (when we are inflexible in userspace) does not make a bit of sense to me. So no. I am not in favor of adding a mechanism to kexec that gives me the screaming heebie jeebies and that appears to be complete at odds with what that mechanism is trying to do. AKA if you are going to trust any old thing, or any old change on the reboot path than it doesn't make sense to track them. If you are tracking them it doesn't make sense to have a mechanism where anything goes. Eric