Re: [PATCH v2] audit: create explicit AUDIT_SECCOMP event type
Thanks! Acked-by: Will Drewry On Wed, Nov 28, 2012 at 5:15 PM, Kees Cook wrote: > The seccomp path was using AUDIT_ANOM_ABEND from when seccomp mode 1 > could only kill a process. While we still want to make sure an audit > record is forced on a kill, this should use a separate record type since > seccomp mode 2 introduces other behaviors. In the case of "handled" > behaviors (process wasn't killed), only emit a record if the process is > under inspection. This change also fixes userspace examination of seccomp > audit events, since it was considered malformed due to missing fields of > the AUDIT_ANOM_ABEND event type. > > Cc: Julien Tinnes > Cc: Will Drewry > Cc: sta...@vger.kernel.org > Signed-off-by: Kees Cook > Acked-by: Steve Grubb > --- > v2: > - update commit message and add Cc to stable, suggested by Steve Grubb > > --- > include/linux/audit.h |3 ++- > include/uapi/linux/audit.h |1 + > kernel/auditsc.c | 14 +++--- > 3 files changed, 14 insertions(+), 4 deletions(-) > > diff --git a/include/linux/audit.h b/include/linux/audit.h > index bce729a..9d5104d 100644 > --- a/include/linux/audit.h > +++ b/include/linux/audit.h > @@ -157,7 +157,8 @@ void audit_core_dumps(long signr); > > static inline void audit_seccomp(unsigned long syscall, long signr, int code) > { > - if (unlikely(!audit_dummy_context())) > + /* Force a record to be reported if a signal was delivered. */ > + if (signr || unlikely(!audit_dummy_context())) > __audit_seccomp(syscall, signr, code); > } > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > index 76352ac..09a2d94 100644 > --- a/include/uapi/linux/audit.h > +++ b/include/uapi/linux/audit.h > @@ -106,6 +106,7 @@ > #define AUDIT_MMAP 1323/* Record showing descriptor and > flags in mmap */ > #define AUDIT_NETFILTER_PKT1324/* Packets traversing netfilter > chains */ > #define AUDIT_NETFILTER_CFG1325/* Netfilter chain modifications */ > +#define AUDIT_SECCOMP 1326/* Secure Computing event */ > > #define AUDIT_AVC 1400/* SE Linux avc denial or grant */ > #define AUDIT_SELINUX_ERR 1401/* Internal SE Linux Errors */ > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index 2f186ed..157e989 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -2735,7 +2735,7 @@ void __audit_mmap_fd(int fd, int flags) > context->type = AUDIT_MMAP; > } > > -static void audit_log_abend(struct audit_buffer *ab, char *reason, long > signr) > +static void audit_log_task(struct audit_buffer *ab) > { > kuid_t auid, uid; > kgid_t gid; > @@ -2753,6 +2753,11 @@ static void audit_log_abend(struct audit_buffer *ab, > char *reason, long signr) > audit_log_task_context(ab); > audit_log_format(ab, " pid=%d comm=", current->pid); > audit_log_untrustedstring(ab, current->comm); > +} > + > +static void audit_log_abend(struct audit_buffer *ab, char *reason, long > signr) > +{ > + audit_log_task(ab); > audit_log_format(ab, " reason="); > audit_log_string(ab, reason); > audit_log_format(ab, " sig=%ld", signr); > @@ -2783,8 +2788,11 @@ void __audit_seccomp(unsigned long syscall, long > signr, int code) > { > struct audit_buffer *ab; > > - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND); > - audit_log_abend(ab, "seccomp", signr); > + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_SECCOMP); > + if (unlikely(!ab)) > + return; > + audit_log_task(ab); > + audit_log_format(ab, " sig=%ld", signr); > audit_log_format(ab, " syscall=%ld", syscall); > audit_log_format(ab, " compat=%d", is_compat_task()); > audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current)); > -- > 1.7.9.5 > > > -- > Kees Cook > Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/3] seccomp: add generic code for jitted seccomp filters.
On Mon, Apr 1, 2013 at 4:53 PM, Kees Cook wrote: > On Mon, Mar 18, 2013 at 7:50 AM, Nicolas Schichan > wrote: >> Architecture must select HAVE_SECCOMP_FILTER_JIT and implement >> seccomp_jit_compile() and seccomp_jit_free() if they intend to support >> jitted seccomp filters. >> >> struct seccomp_filter has been moved to to make its >> content available to the jit compilation code. >> >> In a way similar to the net BPF, the jit compilation code is expected >> to updates struct seccomp_filter.bpf_func pointer to the generated >> code. >> >> Signed-off-by: Nicolas Schichan > > Acked-by: Kees Cook > > I'd love to see this for x86 too. I suspect it'd be a small change > after this series lands. Agreed - and thanks for working through the necessary changes! Acked-By: Will Drewry (for the series) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 14/15] samples/seccomp: be less stupid about cross compiling
On Tue, Jan 22, 2013 at 3:20 PM, Kees Cook wrote: > On Mon, Jan 21, 2013 at 9:16 AM, Arnd Bergmann wrote: >> The seccomp filters are currently built for the build >> host, not for the machine that they are going to run >> on, but they are also built for with the -m32 flag >> if the kernel is built for a 32 bit machine, both >> of which seems rather odd. >> >> It broke allyesconfig on my machine, which is x86-64, but >> building for 32 bit ARM, with this error message: >> >> In file included from /usr/include/stdio.h:28:0, >> from samples/seccomp/bpf-fancy.c:15: >> /usr/include/features.h:324:26: fatal error: bits/predefs.h: No such file or >> directory >> >> because there are no 32 bit libc headers installed on >> this machine. We should really be building all the >> samples for the target machine rather than the build >> host, but since the infrastructure for that appears >> to be missing right now, let's be a little bit smarter >> and not pass the '-m32' flag to the HOSTCC when cross- >> compiling. >> >> Signed-off-by: Arnd Bergmann >> Cc: Heiko Carstens >> Cc: Kees Cook >> Cc: James Morris > > I'm fine with this. Thanks! > > Acked-by: Kees Cook Thanks! This makes a lot of sense and maybe buys a bit more time to figure out the best way to build these (all?) samples for the target. Acked-by: Will Drewry > >> --- >> samples/seccomp/Makefile |2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile >> index bbbd276..7203e66 100644 >> --- a/samples/seccomp/Makefile >> +++ b/samples/seccomp/Makefile >> @@ -19,6 +19,7 @@ bpf-direct-objs := bpf-direct.o >> >> # Try to match the kernel target. >> ifndef CONFIG_64BIT >> +ifndef CROSS_COMPILE >> >> # s390 has -m31 flag to build 31 bit binaries >> ifndef CONFIG_S390 >> @@ -35,6 +36,7 @@ HOSTLOADLIBES_bpf-direct += $(MFLAG) >> HOSTLOADLIBES_bpf-fancy += $(MFLAG) >> HOSTLOADLIBES_dropper += $(MFLAG) >> endif >> +endif >> >> # Tell kbuild to always build the programs >> always := $(hostprogs-y) >> -- >> 1.7.10.4 >> > > > > -- > Kees Cook > Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] seccomp: Add SECCOMP_RET_INFO return value
Thanks for the patch! On Tue, Dec 18, 2012 at 3:50 PM, Corey Bryant wrote: > Adds a new return value to seccomp filters that causes an > informational kernel message to be printed. The message > includes the system call number. I don't have strong opinions about this either way, but here are the points that led me to drop a _LOG return value in the past: - ptrace can cover this awkwardly (user) - ftrace can cover this awkwardly (system/root) - audit can cover this without an allow - _TRAP can be used to implement this - There's no good way to give back the log data. I've been relying on SECCOMP_RET_TRAP: - trap on failure, log, then die - trap on failure, log, then jump to a whitelisted re-entry point to resume the syscall while others I've spoken with have been using the audit path to track denied values -- not so great for soft-failures :) [snip] > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 5af44b5..854f628 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -433,6 +433,10 @@ int __secure_computing(int this_syscall) > goto skip; /* Explicit request to skip. */ > > return 0; > + case SECCOMP_RET_INFO: > + if (printk_ratelimit()) > + pr_info("seccomp: syscall=%d\n", > this_syscall); The arch value will also be needed to make this reliably meaningful (how was the syscall called). That aside, I worry that pr_info is the wrong place for a random user on the machine to log to for this, but I may be wrong, rather than a dedicated ringbufffer, etc. So if this is for a user with privs, then a SECCOMP_RET_AUDIT might make sense. Feedback to a local user seems tricky in general. I don't know :) I just decided to deal with it in userland even if it is slightly painful. Thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 0/4] arch/arm: support seccomp
On Thu, Nov 15, 2012 at 7:25 AM, Will Deacon wrote: > On Wed, Nov 14, 2012 at 07:07:13PM +, Kees Cook wrote: >> Hi, any more thoughts on this series? I'd really like to get it into >> -next. It's been running happily for a while now in the Chrome OS ARM >> devices. > > I'm not familiar with seccomp in general, but the changes look ok now from > an ARM point-of-view: > > Reviewed-by: Will Deacon > > These should probably go via Russell (who will put them into -next), so > please send them to the patch system. > For the seccomp side, Acked-by: Will Drewry for the whole series. Thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] dm: verity: Add support for emitting uevents on dm-verity errors.
IX "verity" > > #define DM_VERITY_IO_VEC_INLINE16 > @@ -137,6 +139,30 @@ static void dm_bufio_alloc_callback(struct dm_buffer > *buf) > } > > /* > + * Trigger userspace data corruption handler. > + */ > +#ifdef CONFIG_DM_UEVENT > +static void verity_data_error(struct dm_verity *v, unsigned long long > block_nr) > +{ > + dm_send_verity_uevent(DM_UEVENT_VERITY_DATA_ERROR, v->ti, block_nr); > +} > + > +static void verity_hash_error(struct dm_verity *v, unsigned long long > block_nr) > +{ > + dm_send_verity_uevent(DM_UEVENT_VERITY_HASH_ERROR, v->ti, block_nr); > +} > +#else > +static inline void verity_data_error(struct dm_verity *v, > + unsigned long long block_nr) > +{ > +} > +static inline void verity_hash_error(struct dm_verity *v, > + unsigned long long block_nr) > +{ > +} > +#endif > + > +/* > * Translate input sector number to the sector number on the target device. > */ > static sector_t verity_map_sector(struct dm_verity *v, sector_t bi_sector) > @@ -255,6 +281,7 @@ static int verity_verify_level(struct dm_verity_io *io, > sector_t block, > (unsigned long long)hash_block); > v->hash_failed = 1; > r = -EIO; > + verity_hash_error(v, (unsigned long long)hash_block); > goto release_ret_r; > } else > aux->hash_verified = 1; > @@ -375,6 +402,7 @@ test_block_hash: > DMERR_LIMIT("data block %llu is corrupted", > (unsigned long long)(io->block + b)); > v->hash_failed = 1; > + verity_data_error(v, (unsigned long long)(io->block + > b)); > return -EIO; > } > } > -- > 1.8.3 Thanks for doing this! (I much prefer this to using notifiers chains!) Acked-by: Will Drewry -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] seccomp: allow BPF_XOR based ALU instructions.
Better late than never .. right? Acked-by: Will Drewry Thanks! On Fri, Mar 15, 2013 at 1:12 PM, Kees Cook wrote: > On Fri, Mar 15, 2013 at 10:02 AM, Nicolas Schichan > wrote: >> >> Signed-off-by: Nicolas Schichan > > Ah, good catch. Thanks! > > Acked-by: Kees Cook > > -- > Kees Cook > Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
On Thu, Jul 26, 2012 at 10:41 AM, Andy Lutomirski wrote: > On Tue, Jul 17, 2012 at 4:19 PM, Andy Lutomirski wrote: >> Currently, if a tracer changes a syscall nr to __NR_future_enosys, >> behavior will differ between kernels that know about >> __NR_future_enosys (and return -ENOSYS) and older kernels (which >> return the value from pt_regs). This is silly; we should just >> return -ENOSYS. >> >> This is unlikely to ever happen on x86 because the return value in >> pt_regs starts out as -ENOSYS, but a silly tracer can change that. >> >> Signed-off-by: Andy Lutomirski >> Cc: Will Drewry >> --- >> arch/x86/include/asm/syscall.h | 11 +++ >> kernel/seccomp.c | 15 +++ >> 2 files changed, 26 insertions(+), 0 deletions(-) > > Will, can you pick this, or some version of it, up in your > seccomp-for-ARM tree or wherever your development is? I'm still not sure about this change though the end result is nice. Regardless, I'll explore it when I can -- my family has just increased in size, so I'm going to be a bit delayed! cheers! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: root=PARTUUID for MBR/NT disk signatures?
On Mon, Aug 20, 2012 at 1:30 PM, Stephen Warren wrote: > On 08/20/2012 12:22 PM, Tejun Heo wrote: >> Hello, >> >> On Fri, Aug 17, 2012 at 04:10:52PM -0600, Stephen Warren wrote: >>> I was considering extending the kernel command-line option >>> root=PARTUUID= to also support MBR (NT disk signatures). I was thinking >>> of a syntax along the lines of: >>> >>> root=PARTUUID=-PP[/PARTNROFF=%d] >>> >>> ... where is the hex representation of the NT disk signature, >>> and PP is the hex representation of the partition number. Like GPT, >>> /PARTNROFF could be used too if desired. >>> >>> Related, I was thinking of changing struct partition_meta_info's uuid >>> field to be a string, so that it could simply be strcmp'd against the >>> UUID value on the kernel command-line. That way, the type of the UUID is >>> irrelevant. >>> >>> Does anyone have any objection to that? >> >> Wouldn't that be able to break setups which work currently? > > I don't believe so: > > Since the newly supported UUID syntax wouldn't ever match any EFI UUID > (the lengths differ in all cases), I don't believe the new syntax would > affect behavior for any existing usage. > > Obviously, part_efi.c would be modified to initialize struct > partition_meta_info's uuid field to the appropriate string > representation of the UUID so that the str(case)cmp would still succeed > for existing command-lines. I ended up coding up that part of the change > late Friday, and the feature was certainly still working OK. Functionally, I suspect this will work fine, but I am concerned that it is a bad move from an efficiency perspective (not unfixable though). Right now, the user-supplied value is converted from string-uuid to packed-uuid. This is then memcmp'd across any and all partitions - be it 2 or 200 - across all attached storage. If we move to a pure string, then we end up needing to unpack every packed UUID at disk scan time (or search, depending on impl) rather than just the one user supplied value. Perhaps the cost is negligible on modern machines, but it seems like the wrong place to put the cost (per entry rather than per search value). I'd be happy to test out any proposed patch to see if I can measure any differences in my specific environments, but I don't know if it will slow down partition scanning for other EFI UUID users out there. Maybe the NT disk sigs could be massaged to be memcmp friendly instead of the opposite? thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: root=PARTUUID for MBR/NT disk signatures?
On Tue, Aug 21, 2012 at 1:08 PM, Stephen Warren wrote: > On 08/20/2012 10:47 PM, Will Drewry wrote: >> On Mon, Aug 20, 2012 at 1:30 PM, Stephen Warren >> wrote: >>> On 08/20/2012 12:22 PM, Tejun Heo wrote: >>>> Hello, >>>> >>>> On Fri, Aug 17, 2012 at 04:10:52PM -0600, Stephen Warren wrote: >>>>> I was considering extending the kernel command-line option >>>>> root=PARTUUID= to also support MBR (NT disk signatures). I was thinking >>>>> of a syntax along the lines of: >>>>> >>>>> root=PARTUUID=-PP[/PARTNROFF=%d] >>>>> >>>>> ... where is the hex representation of the NT disk signature, >>>>> and PP is the hex representation of the partition number. Like GPT, >>>>> /PARTNROFF could be used too if desired. >>>>> >>>>> Related, I was thinking of changing struct partition_meta_info's uuid >>>>> field to be a string, so that it could simply be strcmp'd against the >>>>> UUID value on the kernel command-line. That way, the type of the UUID is >>>>> irrelevant. >>>>> >>>>> Does anyone have any objection to that? >>>> >>>> Wouldn't that be able to break setups which work currently? >>> >>> I don't believe so: >>> >>> Since the newly supported UUID syntax wouldn't ever match any EFI UUID >>> (the lengths differ in all cases), I don't believe the new syntax would >>> affect behavior for any existing usage. >>> >>> Obviously, part_efi.c would be modified to initialize struct >>> partition_meta_info's uuid field to the appropriate string >>> representation of the UUID so that the str(case)cmp would still succeed >>> for existing command-lines. I ended up coding up that part of the change >>> late Friday, and the feature was certainly still working OK. >> >> Functionally, I suspect this will work fine, but I am concerned that >> it is a bad move from an efficiency perspective (not unfixable >> though). Right now, the user-supplied value is converted from >> string-uuid to packed-uuid. This is then memcmp'd across any and all >> partitions - be it 2 or 200 - across all attached storage. If we move >> to a pure string, then we end up needing to unpack every packed UUID >> at disk scan time (or search, depending on impl) rather than just the >> one user supplied value. > > The EFI partition code actually does the following already: > > 1) Unpack the UUID from the binary on-disk representation to a temporary > string. > 2) Repack the temporary string into the internal UUID buffer. > > The comments imply this is in order to do endian conversions. > > Switching the internal representation to a string avoids step (2) above, > plus avoids having to pack the string on the kernel command-line into a > binary UUID before the comparison. I doubt the difference between memcmp > vs. strcasecmp is worth considering. So, I think it's overall a win. Sounds reasonable to me then. Thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86/vsyscall: allow seccomp filter in vsyscall=emulate
If a seccomp filter program is installed, older static binaries and distributions with older libc implementations (glibc 2.13 and earlier) that rely on vsyscall use will be terminated regardless of the filter program policy when executing time, gettimeofday, or getcpu. This is only the case when vsyscall emulation is in use (vsyscall=emulate is the default). This patch emulates system call entry inside a vsyscall=emulate trap such that seccomp can properly evaluate the system call. Reported-by: Owen Kibel Signed-off-by: Will Drewry --- arch/x86/kernel/vsyscall_64.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 7515cf0..433545f 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -139,6 +139,14 @@ static int addr_to_vsyscall_nr(unsigned long addr) return nr; } +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) +{ + if (!seccomp_mode(&tsk->seccomp)) + return 0; + task_pt_regs(tsk)->orig_ax = syscall_nr; + return __secure_computing(syscall_nr); +} + static bool write_ok_or_segv(unsigned long ptr, size_t size) { /* @@ -174,6 +182,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) int vsyscall_nr; int prev_sig_on_uaccess_error; long ret; + int skip; /* * No point in checking CS -- the only way to get here is a user mode @@ -205,9 +214,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) } tsk = current; - if (seccomp_mode(&tsk->seccomp)) - do_exit(SIGKILL); - /* * With a real vsyscall, page faults cause SIGSEGV. We want to * preserve that behavior to make writing exploits harder. @@ -222,8 +228,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) * address 0". */ ret = -EFAULT; + skip = 0; switch (vsyscall_nr) { case 0: + skip = vsyscall_seccomp(tsk, __NR_gettimeofday); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) || !write_ok_or_segv(regs->si, sizeof(struct timezone))) break; @@ -234,6 +245,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 1: + skip = vsyscall_seccomp(tsk, __NR_time); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(time_t))) break; @@ -241,6 +256,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 2: + skip = vsyscall_seccomp(tsk, __NR_getcpu); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(unsigned)) || !write_ok_or_segv(regs->si, sizeof(unsigned))) break; @@ -253,6 +272,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error; + if (skip) + goto do_ret; + if (ret == -EFAULT) { /* Bad news -- userspace fed a bad pointer to a vsyscall. */ warn_bad_vsyscall(KERN_INFO, regs, @@ -271,6 +293,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) regs->ax = ret; +do_ret: /* Emulate a ret instruction. */ regs->ip = caller; regs->sp += 8; -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 0/2] dm: boot a mapped device without an initramfs
Hi Richard, Helen, On Sat, Nov 3, 2018 at 4:10 AM Richard Weinberger wrote: > > Helen, > > Am Samstag, 3. November 2018, 04:53:39 CET schrieb Helen Koike: > > As mentioned in the discussion from the previous version of this patch, > > Android > > and Chrome OS do not use initramfs mostly due to boot time and size > > liability. > > Do you have numbers on that? Originally, we saved ~200 ms, but I don't think we have recent numbers. (Unless Helen has some!) We first authored and posted this patch in 2010: - https://marc.info/?l=dm-devel&m=127429492521964&w=2 - https://marc.info/?l=dm-devel&m=127429499422096&w=2 - https://marc.info/?l=dm-devel&m=127429493922000&w=2 Every Chrome OS device uses a variant of this patch as well as Android devices starting last year (if they use AVB 2.0). Originally, the intent was the measured latency reduction. We get a linear speed improvement when doing a cryptographic verification of the kernel and initramfs. Why? More data == more hashes (sha256 w/compute per block). There's additional overhead from bringing up early userspace, but those are the numbers I don't have. > I understand that using something like dracut with systemd inside is not what > you > want from a boot time point of view. > But having an initramfs embedded into the kernel image which contains only a > single > static linked binary can be *very* small and fast. > If you invest a little more time, you don't even need a libc, just fire up > some > syscalls to setup your dm. I use this technique regularly on deeply embedded > systems > to setup non-trivial UBIFS/crypto stuff. > > Want I'm trying to say, before adding ad-hoc a feature to the kernel, we > should be > very sure that there is no other way to solve this in a sane manner. > We have initramfs support for reasons. I very much appreciate the perspective, but after 8 years in shipping devices after integrating feedback from kernel maintainers over the subsequent years, this doesn't feel like an "ad-hoc" feature. It's been effective and fit in well with the existing kernel functionality, etc (imho :). What level of performance improvement or other changes might be necessary to make the cut? Thanks! will
Re: [PATCH] x86/vsyscall: allow seccomp filter in vsyscall=emulate
On Fri, Jul 13, 2012 at 9:32 AM, Andrew Lutomirski wrote: > On Thu, Jul 12, 2012 at 10:17 PM, Will Drewry wrote: >> If a seccomp filter program is installed, older static binaries and >> distributions with older libc implementations (glibc 2.13 and earlier) >> that rely on vsyscall use will be terminated regardless of the filter >> program policy when executing time, gettimeofday, or getcpu. This is >> only the case when vsyscall emulation is in use (vsyscall=emulate is the >> default). >> >> This patch emulates system call entry inside a vsyscall=emulate trap >> such that seccomp can properly evaluate the system call. >> >> Reported-by: Owen Kibel >> Signed-off-by: Will Drewry >> --- >> arch/x86/kernel/vsyscall_64.c | 29 ++--- >> 1 file changed, 26 insertions(+), 3 deletions(-) >> >> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c >> index 7515cf0..433545f 100644 >> --- a/arch/x86/kernel/vsyscall_64.c >> +++ b/arch/x86/kernel/vsyscall_64.c >> @@ -139,6 +139,14 @@ static int addr_to_vsyscall_nr(unsigned long addr) >> return nr; >> } >> >> +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) >> +{ >> + if (!seccomp_mode(&tsk->seccomp)) >> + return 0; >> + task_pt_regs(tsk)->orig_ax = syscall_nr; >> + return __secure_computing(syscall_nr); >> +} >> + >> static bool write_ok_or_segv(unsigned long ptr, size_t size) >> { >> /* >> @@ -174,6 +182,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> int vsyscall_nr; >> int prev_sig_on_uaccess_error; >> long ret; >> + int skip; >> >> /* >> * No point in checking CS -- the only way to get here is a user mode >> @@ -205,9 +214,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> } >> >> tsk = current; >> - if (seccomp_mode(&tsk->seccomp)) >> - do_exit(SIGKILL); >> - >> /* >> * With a real vsyscall, page faults cause SIGSEGV. We want to >> * preserve that behavior to make writing exploits harder. >> @@ -222,8 +228,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> * address 0". >> */ >> ret = -EFAULT; >> + skip = 0; >> switch (vsyscall_nr) { >> case 0: >> + skip = vsyscall_seccomp(tsk, __NR_gettimeofday); >> + if (skip) >> + break; >> + >> if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) || >> !write_ok_or_segv(regs->si, sizeof(struct timezone))) >> break; >> @@ -234,6 +245,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> break; >> >> case 1: >> + skip = vsyscall_seccomp(tsk, __NR_time); >> + if (skip) >> + break; >> + >> if (!write_ok_or_segv(regs->di, sizeof(time_t))) >> break; >> >> @@ -241,6 +256,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> break; >> >> case 2: >> + skip = vsyscall_seccomp(tsk, __NR_getcpu); >> + if (skip) >> + break; >> + >> if (!write_ok_or_segv(regs->di, sizeof(unsigned)) || >> !write_ok_or_segv(regs->si, sizeof(unsigned))) >> break; >> @@ -253,6 +272,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> >> current_thread_info()->sig_on_uaccess_error = >> prev_sig_on_uaccess_error; >> >> + if (skip) >> + goto do_ret; >> + >> if (ret == -EFAULT) { >> /* Bad news -- userspace fed a bad pointer to a vsyscall. */ >> warn_bad_vsyscall(KERN_INFO, regs, >> @@ -271,6 +293,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> >> regs->ax = ret; >> >> +do_ret: >> /* Emulate a ret instruction. */ >> regs->ip = caller; >> regs->sp += 8; > > Does this work correctly in SECCOMP_RET_TRAP, TRA
[PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate
If a seccomp filter program is installed, older static binaries and distributions with older libc implementations (glibc 2.13 and earlier) that rely on vsyscall use will be terminated regardless of the filter program policy when executing time, gettimeofday, or getcpu. This is only the case when vsyscall emulation is in use (vsyscall=emulate is the default). This patch emulates system call entry inside a vsyscall=emulate by populating regs->ax and regs->orig_ax with the system call number prior to calling into seccomp such that all seccomp-dependencies function normally. Additionally, system call return behavior is emulated in line with other vsyscall entrypoints for the trace/trap cases. Reported-by: Owen Kibel Signed-off-by: Will Drewry v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu) --- arch/x86/kernel/vsyscall_64.c | 35 +++ 1 file changed, 31 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 7515cf0..08a18d0 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -139,6 +139,15 @@ static int addr_to_vsyscall_nr(unsigned long addr) return nr; } +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) +{ + if (!seccomp_mode(&tsk->seccomp)) + return 0; + task_pt_regs(tsk)->orig_ax = syscall_nr; + task_pt_regs(tsk)->ax = syscall_nr; + return __secure_computing(syscall_nr); +} + static bool write_ok_or_segv(unsigned long ptr, size_t size) { /* @@ -174,6 +183,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) int vsyscall_nr; int prev_sig_on_uaccess_error; long ret; + int skip; /* * No point in checking CS -- the only way to get here is a user mode @@ -205,9 +215,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) } tsk = current; - if (seccomp_mode(&tsk->seccomp)) - do_exit(SIGKILL); - /* * With a real vsyscall, page faults cause SIGSEGV. We want to * preserve that behavior to make writing exploits harder. @@ -222,8 +229,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) * address 0". */ ret = -EFAULT; + skip = 0; switch (vsyscall_nr) { case 0: + skip = vsyscall_seccomp(tsk, __NR_gettimeofday); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) || !write_ok_or_segv(regs->si, sizeof(struct timezone))) break; @@ -234,6 +246,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 1: + skip = vsyscall_seccomp(tsk, __NR_time); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(time_t))) break; @@ -241,6 +257,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 2: + skip = vsyscall_seccomp(tsk, __NR_getcpu); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(unsigned)) || !write_ok_or_segv(regs->si, sizeof(unsigned))) break; @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error; + if (skip) { + if ((long)regs->ax <= 0L) /* seccomp errno emulation */ + goto do_ret; + goto done; /* seccomp trace/trap */ + } + if (ret == -EFAULT) { /* Bad news -- userspace fed a bad pointer to a vsyscall. */ warn_bad_vsyscall(KERN_INFO, regs, @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) regs->ax = ret; +do_ret: /* Emulate a ret instruction. */ regs->ip = caller; regs->sp += 8; - +done: return true; sigsegv: -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate
On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski wrote: > On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry wrote: >> If a seccomp filter program is installed, older static binaries and >> distributions with older libc implementations (glibc 2.13 and earlier) >> that rely on vsyscall use will be terminated regardless of the filter >> program policy when executing time, gettimeofday, or getcpu. This is >> only the case when vsyscall emulation is in use (vsyscall=emulate is the >> default). >> >> This patch emulates system call entry inside a vsyscall=emulate by >> populating regs->ax and regs->orig_ax with the system call number prior >> to calling into seccomp such that all seccomp-dependencies function >> normally. Additionally, system call return behavior is emulated in line >> with other vsyscall entrypoints for the trace/trap cases. >> >> Reported-by: Owen Kibel >> Signed-off-by: Will Drewry >> >> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu) > >> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> >> current_thread_info()->sig_on_uaccess_error = >> prev_sig_on_uaccess_error; >> >> + if (skip) { >> + if ((long)regs->ax <= 0L) /* seccomp errno emulation */ >> + goto do_ret; >> + goto done; /* seccomp trace/trap */ >> + } >> + >> if (ret == -EFAULT) { >> /* Bad news -- userspace fed a bad pointer to a vsyscall. */ >> warn_bad_vsyscall(KERN_INFO, regs, >> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> >> regs->ax = ret; >> >> +do_ret: >> /* Emulate a ret instruction. */ >> regs->ip = caller; >> regs->sp += 8; >> - >> +done: >> return true; >> >> sigsegv: >> -- >> 1.7.9.5 >> > > This has the same odd property as the sigsegv path that the faulting > instruction will appear to be the mov, not the syscall. That seems to > be okay, though -- various pieces of code that try to restart the segv > are okay with that. Yeah - I would otherwise do regs->ip += 9; but I wanted to match the code that was therefor SIGSEGV. If regs->ip += 9 _just_ for the SIGSYS case is fine, then I'll make that change shortly. Since any code that sees the vsyscall address should be wise enough to avoid it, perhaps that's why the SIGSEGV hasn't had a problem so far. > Is there any code that assumes that changing rax (i.e. the syscall > number) and restarting a syscall after SIGSYS will invoke the new > syscall? (The RET_TRACE path might be similar -- does the > ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger > a chance to synchronously cancel or change the syscall? Unfortunately, it does in normal interception. I don't see any way out of that quirk with vsyscall=emulate. As is without seccomp, vsyscall=emulate doesn't allow ptrace interception (or syscall auditing for that matter) while vsyscall=native does. So the option here is to document the quirky interaction in Documentation/prctl/seccomp_filter.txt. In particular, if the tracer sees either (time|gettimeofday|getcpu) and rip in the vsyscall page, it will know it can't rewrite or bypass the call.Is there a better option? Given that, I will include a tweak to the documentation to indicate that behavior so that userspace authors of BPF programs that use SECCOMP_RET_TRACE will be aware of the behavior. > If those issues aren't problems, then: > > Reviewed-by: Andy Lutomirski > > (If the syscall number needs to change after the fact in the > SECCOMP_RET_TRAP case, it'll be a mess.) Nah - traps are delivered like the forced sigsegv path. I'll spin a v3 soon including the documentation tweak and the ip offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the skip case). Of course, any better ideas for the trace-case will be more than welcome, but it seems to me to be an acceptable tradeoff - I hope others agree. I'll make the changes and then put it through its paces to see if any other little idiosyncrasies emerge. Thanks for the close review! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate
On Fri, Jul 13, 2012 at 7:48 PM, Will Drewry wrote: > On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski wrote: >> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry wrote: >>> If a seccomp filter program is installed, older static binaries and >>> distributions with older libc implementations (glibc 2.13 and earlier) >>> that rely on vsyscall use will be terminated regardless of the filter >>> program policy when executing time, gettimeofday, or getcpu. This is >>> only the case when vsyscall emulation is in use (vsyscall=emulate is the >>> default). >>> >>> This patch emulates system call entry inside a vsyscall=emulate by >>> populating regs->ax and regs->orig_ax with the system call number prior >>> to calling into seccomp such that all seccomp-dependencies function >>> normally. Additionally, system call return behavior is emulated in line >>> with other vsyscall entrypoints for the trace/trap cases. >>> >>> Reported-by: Owen Kibel >>> Signed-off-by: Will Drewry >>> >>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu) >> >>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >>> long address) >>> >>> current_thread_info()->sig_on_uaccess_error = >>> prev_sig_on_uaccess_error; >>> >>> + if (skip) { >>> + if ((long)regs->ax <= 0L) /* seccomp errno emulation */ >>> + goto do_ret; >>> + goto done; /* seccomp trace/trap */ >>> + } >>> + >>> if (ret == -EFAULT) { >>> /* Bad news -- userspace fed a bad pointer to a vsyscall. */ >>> warn_bad_vsyscall(KERN_INFO, regs, >>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >>> long address) >>> >>> regs->ax = ret; >>> >>> +do_ret: >>> /* Emulate a ret instruction. */ >>> regs->ip = caller; >>> regs->sp += 8; >>> - >>> +done: >>> return true; >>> >>> sigsegv: >>> -- >>> 1.7.9.5 >>> >> >> This has the same odd property as the sigsegv path that the faulting >> instruction will appear to be the mov, not the syscall. That seems to >> be okay, though -- various pieces of code that try to restart the segv >> are okay with that. > > Yeah - I would otherwise do > regs->ip += 9; > but I wanted to match the code that was therefor SIGSEGV. If regs->ip > += 9 _just_ for the SIGSYS case is fine, then I'll make that change > shortly. Since any code that sees the vsyscall address should be wise > enough to avoid it, perhaps that's why the SIGSEGV hasn't had a > problem so far. I dashed this off without more thought. It's best to leave it as is because any return to the emulated page will cause a vsyscall fault event. >> Is there any code that assumes that changing rax (i.e. the syscall >> number) and restarting a syscall after SIGSYS will invoke the new >> syscall? (The RET_TRACE path might be similar -- does the >> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger >> a chance to synchronously cancel or change the syscall? > > Unfortunately, it does in normal interception. I don't see any way out > of that quirk with vsyscall=emulate. As is without seccomp, > vsyscall=emulate doesn't allow ptrace interception (or syscall > auditing for that matter) while vsyscall=native does. So the option > here is to document the quirky interaction in > Documentation/prctl/seccomp_filter.txt. In particular, if the tracer > sees either (time|gettimeofday|getcpu) and rip in the vsyscall page, > it will know it can't rewrite or bypass the call.Is there a better > option? > > Given that, I will include a tweak to the documentation to indicate > that behavior so that userspace authors of BPF programs that use > SECCOMP_RET_TRACE will be aware of the behavior. > >> If those issues aren't problems, then: >> >> Reviewed-by: Andy Lutomirski >> >> (If the syscall number needs to change after the fact in the >> SECCOMP_RET_TRAP case, it'll be a mess.) > > Nah - traps are delivered like the forced sigsegv path. > > I'll spin a v3 soon including the documentation tweak and the ip > offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the > skip case). Of course, any better ideas for the trace-case will be >
[PATCH v3 1/2] vsyscall: allow seccomp in vsyscall=emulate
If a seccomp filter program is installed, older static binaries and distributions with older libc implementations (glibc 2.13 and earlier) that rely on vsyscall use will be terminated regardless of the filter program policy when executing time, gettimeofday, or getcpu. This is only the case when vsyscall emulation is in use (vsyscall=emulate is the default). This patch emulates system call entry inside a vsyscall=emulate by populating regs->ax and regs->orig_ax with the system call number prior to calling into seccomp such that all seccomp-dependencies function normally. Additionally, system call return behavior is emulated in line with other vsyscall entrypoints for the trace/trap cases. Note, v3 adds support for a ptracer to skip and emulate vsyscalls. This is not required behavior but the documentation should reflect the behavior for whichever is preferred (v2 or v3). Reported-by: Owen Kibel Signed-off-by: Will Drewry v3: - allow ptrace orig_ax changes to skip the syscall since changing it is not an option. (result of discussions with luto) - ensure ptrace register modification doesn't change return behavior taking the "normal" return path - add some comments v2: - fixed ip and sp on SECCOMP_RET_TRAP/ERRNO (thanks to l...@mit.edu) --- arch/x86/kernel/vsyscall_64.c | 42 + 1 file changed, 38 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 7515cf0..c56a8dc 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -139,6 +139,22 @@ static int addr_to_vsyscall_nr(unsigned long addr) return nr; } +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) +{ + int ret; + struct pt_regs *regs; + if (!seccomp_mode(&tsk->seccomp)) + return 0; + regs = task_pt_regs(tsk); + regs->orig_ax = syscall_nr; + regs->ax = syscall_nr; /* ensure consistency */ + /* 0 if allowed, -1 on SECCOMP_RET_ERRNO and SECCOMP_RET_TRAP */ + ret = __secure_computing(syscall_nr); + if (regs->orig_ax != syscall_nr) + return 1; /* ptrace requested skip */ + return ret; +} + static bool write_ok_or_segv(unsigned long ptr, size_t size) { /* @@ -174,6 +190,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) int vsyscall_nr; int prev_sig_on_uaccess_error; long ret; + int skip; /* * No point in checking CS -- the only way to get here is a user mode @@ -205,9 +222,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) } tsk = current; - if (seccomp_mode(&tsk->seccomp)) - do_exit(SIGKILL); - /* * With a real vsyscall, page faults cause SIGSEGV. We want to * preserve that behavior to make writing exploits harder. @@ -222,8 +236,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) * address 0". */ ret = -EFAULT; + skip = 0; switch (vsyscall_nr) { case 0: + skip = vsyscall_seccomp(tsk, __NR_gettimeofday); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) || !write_ok_or_segv(regs->si, sizeof(struct timezone))) break; @@ -234,6 +253,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 1: + skip = vsyscall_seccomp(tsk, __NR_time); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(time_t))) break; @@ -241,6 +264,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) break; case 2: + skip = vsyscall_seccomp(tsk, __NR_getcpu); + if (skip) + break; + if (!write_ok_or_segv(regs->di, sizeof(unsigned)) || !write_ok_or_segv(regs->si, sizeof(unsigned))) break; @@ -253,6 +280,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error; + if (skip) { + if ((long)regs->ax <= 0L || skip == 1) /* seccomp errno/trace */ + goto do_ret; + goto done; /* seccomp trap */ + } + if (ret == -EFAULT) { /* Bad news -- userspace fed a bad pointer to a vsyscall. */ warn_bad_vsyscall(KERN_INFO, regs, @@ -271,10 +304,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) regs->ax = ret; +do_ret: /* Emulate a ret instruction. */
[PATCH v3 2/2] Documentation: add a caveat for seccomp filter and vsyscall emulation
With the addition of seccomp support to vsyscall emulation: http://permalink.gmane.org/gmane.linux.kernel/1327732 with some minor changes in the first patch in this series. Update the documentation to indicate quirky behaviors when the 'ip' is in the vsyscall page and vsyscall emulation is in effect. If v2 of the first patch is preferred, then this patch will need to be changed to indicate that SECCOMP_RET_TRACE does not allow system calls to be remapped _or_ skipped. Signed-off-by: Will Drewry --- Documentation/prctl/seccomp_filter.txt | 22 ++ 1 file changed, 22 insertions(+) diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt index 597c3c5..67ed88b 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/prctl/seccomp_filter.txt @@ -161,3 +161,25 @@ architecture supports both ptrace_event and seccomp, it will be able to support seccomp filter with minor fixup: SIGSYS support and seccomp return value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig. + + +Caveats +--- + +On x86-64 with vsyscall emulation enabled and while servicing a +vsyscall-emulated system call: +- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to + the vsyscall entry for the given call and not the address after the + 'syscall' instruction. Any code which wants to restart the call + should return to that address and code wishing to return simulating + completion may either sigreturn normally or simulate a ret instruction + and use the return address from the stack. +- A return value of SECCOMP_RET_TRACE will signal the tracer as usual, + but the syscall may not be changed to another system call using the + orig_rax register. It may only be changed to a different value in + order to skip the currently emulated call and any change will result + in that behavior. The remainder of the registers may be altered as + usual. +- Detection of this quirky behavior may be done by checking for getcpu, + time, or gettimeofday and if the si_call_addr or rip is in the + vsyscall page, specifically at the start of the specific entry call. -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] vsyscall_64: add missing ifdef CONFIG_SECCOMP
vsyscall_seccomp introduced a dependency on __secure_computing. On configurations with CONFIG_SECCOMP disabled, compilation will fail. Reported-by: feng xiangjun Signed-off-by: Will Drewry --- arch/x86/kernel/vsyscall_64.c |4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 08a18d0..5db36ca 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -139,6 +139,7 @@ static int addr_to_vsyscall_nr(unsigned long addr) return nr; } +#ifdef CONFIG_SECCOMP static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) { if (!seccomp_mode(&tsk->seccomp)) @@ -147,6 +148,9 @@ static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) task_pt_regs(tsk)->ax = syscall_nr; return __secure_computing(syscall_nr); } +#else +#define vsyscall_seccomp(_tsk, _nr) 0 +#endif static bool write_ok_or_segv(unsigned long ptr, size_t size) { -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] Documentation: add a caveat for seccomp filter and vsyscall emulation
With the addition of seccomp support to vsyscall emulation: http://permalink.gmane.org/gmane.linux.kernel/1327732 and the prior patch in this series. Update the documentation to indicate quirky behaviors when the 'ip' is in the vsyscall page and vsyscall emulation is in effect. Signed-off-by: Will Drewry --- Documentation/prctl/seccomp_filter.txt | 22 ++ 1 file changed, 22 insertions(+) diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt index 597c3c5..67ed88b 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/prctl/seccomp_filter.txt @@ -161,3 +161,25 @@ architecture supports both ptrace_event and seccomp, it will be able to support seccomp filter with minor fixup: SIGSYS support and seccomp return value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig. + + +Caveats +--- + +On x86-64 with vsyscall emulation enabled and while servicing a +vsyscall-emulated system call: +- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to + the vsyscall entry for the given call and not the address after the + 'syscall' instruction. Any code which wants to restart the call + should return to that address and code wishing to return simulating + completion may either sigreturn normally or simulate a ret instruction + and use the return address from the stack. +- A return value of SECCOMP_RET_TRACE will signal the tracer as usual, + but the syscall may not be changed to another system call using the + orig_rax register. It may only be changed to a different value in + order to skip the currently emulated call and any change will result + in that behavior. The remainder of the registers may be altered as + usual. +- Detection of this quirky behavior may be done by checking for getcpu, + time, or gettimeofday and if the si_call_addr or rip is in the + vsyscall page, specifically at the start of the specific entry call. -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip
Current quirky ptrace behavior with vsyscall and seccomp does not allow tracers to bypass the call. This change provides that ability by checking if orig_ax changed. Signed-off-by: Will Drewry --- arch/x86/kernel/vsyscall_64.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 5db36ca..5f9640c 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -142,11 +142,15 @@ static int addr_to_vsyscall_nr(unsigned long addr) #ifdef CONFIG_SECCOMP static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) { + int ret; if (!seccomp_mode(&tsk->seccomp)) return 0; task_pt_regs(tsk)->orig_ax = syscall_nr; task_pt_regs(tsk)->ax = syscall_nr; - return __secure_computing(syscall_nr); + ret = __secure_computing(syscall_nr); + if (task_pt_regs(tsk)->orig_ax != syscall_nr) + return 1; /* ptrace syscall skip */ + return ret; } #else #define vsyscall_seccomp(_tsk, _nr) 0 @@ -278,9 +282,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address) current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error; if (skip) { - if ((long)regs->ax <= 0L) /* seccomp errno emulation */ + if ((long)regs->ax <= 0L || skip == 1) /* seccomp errno/trace */ goto do_ret; - goto done; /* seccomp trace/trap */ + goto done; /* seccomp trap */ } if (ret == -EFAULT) { -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip
On Sat, Jul 14, 2012 at 10:44 AM, Andrew Lutomirski wrote: > I think I'd prefer if changing to something other than whatever value is > used to cancel the syscall resulted in a crash rather than just being > ignored. I was trying to keep as much seccomp-ptrace behavior intact rather than making it terminal in this special case. Is there a reason why it'd make more sense to crash? > How hard is it for a page fault to return into the syscall entry path? It > should be possible to do this for rel, although it could be messy and not > worth it. Not sure, tbh. I think given vsyscall's status and the fact that ptrace+seccomp+vsyscall=emulate isn't horrible, I think it's fine to either ignore (what is in tree now) or to allow ptrace to skip, without providing full functionality. But obviously, my view my be biased! thanks! will > > On Jul 14, 2012 10:35 AM, "Will Drewry" wrote: >> >> Current quirky ptrace behavior with vsyscall and seccomp >> does not allow tracers to bypass the call. This change >> provides that ability by checking if orig_ax changed. >> >> Signed-off-by: Will Drewry >> --- >> arch/x86/kernel/vsyscall_64.c | 10 +++--- >> 1 file changed, 7 insertions(+), 3 deletions(-) >> >> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c >> index 5db36ca..5f9640c 100644 >> --- a/arch/x86/kernel/vsyscall_64.c >> +++ b/arch/x86/kernel/vsyscall_64.c >> @@ -142,11 +142,15 @@ static int addr_to_vsyscall_nr(unsigned long addr) >> #ifdef CONFIG_SECCOMP >> static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) >> { >> + int ret; >> if (!seccomp_mode(&tsk->seccomp)) >> return 0; >> task_pt_regs(tsk)->orig_ax = syscall_nr; >> task_pt_regs(tsk)->ax = syscall_nr; >> - return __secure_computing(syscall_nr); >> + ret = __secure_computing(syscall_nr); >> + if (task_pt_regs(tsk)->orig_ax != syscall_nr) >> + return 1; /* ptrace syscall skip */ >> + return ret; >> } >> #else >> #define vsyscall_seccomp(_tsk, _nr) 0 >> @@ -278,9 +282,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >> long address) >> current_thread_info()->sig_on_uaccess_error = >> prev_sig_on_uaccess_error; >> >> if (skip) { >> - if ((long)regs->ax <= 0L) /* seccomp errno emulation */ >> + if ((long)regs->ax <= 0L || skip == 1) /* seccomp >> errno/trace */ >> goto do_ret; >> - goto done; /* seccomp trace/trap */ >> + goto done; /* seccomp trap */ >> } >> >> if (ret == -EFAULT) { >> -- >> 1.7.9.5 >> > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip
On Sat, Jul 14, 2012 at 10:50 AM, Will Drewry wrote: > On Sat, Jul 14, 2012 at 10:44 AM, Andrew Lutomirski wrote: >> I think I'd prefer if changing to something other than whatever value is >> used to cancel the syscall resulted in a crash rather than just being >> ignored. > > I was trying to keep as much seccomp-ptrace behavior intact rather > than making it terminal in this special case. Is there a reason why > it'd make more sense to crash? Unless you meant something the tracer could catch? That may make sense, but they could also use singlestep or whatever else to get similar behavior. But maybe I'm missing the bigger picture! thanks! will >> How hard is it for a page fault to return into the syscall entry path? It >> should be possible to do this for rel, although it could be messy and not >> worth it. > > Not sure, tbh. I think given vsyscall's status and the fact that > ptrace+seccomp+vsyscall=emulate isn't horrible, I think it's fine to > either ignore (what is in tree now) or to allow ptrace to skip, > without providing full functionality. But obviously, my view my be > biased! > > thanks! > will > >> >> On Jul 14, 2012 10:35 AM, "Will Drewry" wrote: >>> >>> Current quirky ptrace behavior with vsyscall and seccomp >>> does not allow tracers to bypass the call. This change >>> provides that ability by checking if orig_ax changed. >>> >>> Signed-off-by: Will Drewry >>> --- >>> arch/x86/kernel/vsyscall_64.c | 10 +++--- >>> 1 file changed, 7 insertions(+), 3 deletions(-) >>> >>> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c >>> index 5db36ca..5f9640c 100644 >>> --- a/arch/x86/kernel/vsyscall_64.c >>> +++ b/arch/x86/kernel/vsyscall_64.c >>> @@ -142,11 +142,15 @@ static int addr_to_vsyscall_nr(unsigned long addr) >>> #ifdef CONFIG_SECCOMP >>> static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr) >>> { >>> + int ret; >>> if (!seccomp_mode(&tsk->seccomp)) >>> return 0; >>> task_pt_regs(tsk)->orig_ax = syscall_nr; >>> task_pt_regs(tsk)->ax = syscall_nr; >>> - return __secure_computing(syscall_nr); >>> + ret = __secure_computing(syscall_nr); >>> + if (task_pt_regs(tsk)->orig_ax != syscall_nr) >>> + return 1; /* ptrace syscall skip */ >>> + return ret; >>> } >>> #else >>> #define vsyscall_seccomp(_tsk, _nr) 0 >>> @@ -278,9 +282,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned >>> long address) >>> current_thread_info()->sig_on_uaccess_error = >>> prev_sig_on_uaccess_error; >>> >>> if (skip) { >>> - if ((long)regs->ax <= 0L) /* seccomp errno emulation */ >>> + if ((long)regs->ax <= 0L || skip == 1) /* seccomp >>> errno/trace */ >>> goto do_ret; >>> - goto done; /* seccomp trace/trap */ >>> + goto done; /* seccomp trap */ >>> } >>> >>> if (ret == -EFAULT) { >>> -- >>> 1.7.9.5 >>> >> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip
On Sat, Jul 14, 2012 at 11:21 AM, Andrew Lutomirski wrote: > On Sat, Jul 14, 2012 at 9:15 AM, Andrew Lutomirski wrote: >> On Sat, Jul 14, 2012 at 8:57 AM, Will Drewry wrote: >>> On Sat, Jul 14, 2012 at 10:50 AM, Will Drewry wrote: >>>> On Sat, Jul 14, 2012 at 10:44 AM, Andrew Lutomirski wrote: >>>>> I think I'd prefer if changing to something other than whatever value is >>>>> used to cancel the syscall resulted in a crash rather than just being >>>>> ignored. >>>> >>>> I was trying to keep as much seccomp-ptrace behavior intact rather >>>> than making it terminal in this special case. Is there a reason why >>>> it'd make more sense to crash? >>> >>> Unless you meant something the tracer could catch? That may make >>> sense, but they could also use singlestep or whatever else to get >>> similar behavior. But maybe I'm missing the bigger picture! >> >> I think it would be nice to not introduce any special behavior that >> things might rely on if we do this better in the future. Similarly, >> for almost all purposes, a tracer could change gettimeofday to write, >> but there would be a silent behavior change if gettimeofday were >> entered via vsyscall. >> >> What's the standard way of skipping a syscall? sigreturn? (I don't >> know off the top of my head what sigreturn does.) sys_ni_syscall? Most implementations I've seen just set orig_rax to -1, but in practice any system call that is invalid will result in the system call being skipped in the normal syscall path. So I could tweak it to check <0 > __NR_syscalls and add a sigsys generator if it is valid. Do you think that'd be the most correct move? (As you say below, the impact of _any_ changes here are very minor :) >> I'd be all for making those continue to work but making anything that >> can't be emulated correctly do something sufficiently unpleasant that >> people won't do it. Is there a syscall that does nothing at all? > > Here's my suggestion for now: any attempt by the seccomp filter to do > anything other than executing the syscall unchanged, skipping it > entirely, or killing results in a trap or sigsys with the syscall > number unchanged. > > In the RET_TRACE case, the SIGSYS restarts at the mov nr,%rax > instruction. The tracer can deal with it accordingly (via checking > the rip) -- if the tracer changes rax, it'll get changed right back by > the emulated mov nr,%rax. If this results in an infinite loop, so be > it. > > In the RET_TRAP case, do the same thing but via the special ptrace > event instead of SIGSYS. You've lost me here. The flow of a RET_TRAP is: - enter vsyscall at 0xff600[]00 - run seccomp - get ret_trap - queue up a synchronous trap - skip the vsyscall - return Even if the trap were to interrupt the current vsyscall trap handler, the skip value would still be -1 and the syscall would still get skipped without touching sp or ip. Traps cannot provide kernel-side arbitration of allowed syscalls, they can just emulate them in userspace before "returning" to the original callsite. In this case, vsyscall makes returning there harder and the quirk is to return to the prior address if the ip is in vsyscall's range. > This way, the special rip & ~0x0c00 == 0xff60 handling > becomes part of the ABI, but it looks like the mov instruction > trapped, which it more or less did. If we want RET_TRAP to be able to > skip the syscall, let it happen the normal way. I *think* trap is ok? > This stuff probably barely matters. For example, ptrace currently > doesn't work right when vsyscalls are being emulated. No one appears > to care. Agreed :) I don't mind making tweaks to get it right, but this only matters to users that want to: - use seccomp filter - with ptrace (or trap with resumption and not sigreturn) - of time, gettimeofday, and getcpu since they will then have to include quirk management _just_ in case their code is linked against something using vsyscall and vsyscall=emulate is in effect! thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.5 0/2] seccomp and vsyscall fixes
On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski wrote: > Apologies for the lateness of this stuff. I was at a conference last > week when the Chrome issue was discovered and I couldn't do this > properly until I got back. > > Will, can you confirm that this version is okay and passes your tests? > It passes mine. I'll pull it and test it! > While there are no known seccomp users that will have trouble, > SECCOMP_RET_TRAP and SECCOMP_RET_TRACE currently interact oddly with > emulated vsyscalls. This might lead to ABI issues down the road (if > something starts to rely on current behavior) or unexpected malfunctions > (if something tries to change, say, sys_gettimeofday, into a different > syscall and gets completely bogus results on a vsyscall-using distro. > > It's unlikely that fixing this later will cause issues, but it would be > nice to nail down and document the vsyscall quirks for the first > released kernel with seccomp mode 2 support. > > (Patch 2/2 is very much optional. It fixes a strange corner case. It > ought to be fine for 3.6, since I very much doubt that any real code > will hit that corner case and cause ABI problems.) > > Andy Lutomirski (2): > seccomp: Make syscall skipping and nr changes more consistent > seccomp: Future-proof against silly tracers > > Documentation/prctl/seccomp_filter.txt | 74 -- > arch/x86/include/asm/syscall.h | 11 +++ > arch/x86/kernel/vsyscall_64.c | 110 > +--- > kernel/seccomp.c | 28 +++- > 4 files changed, 163 insertions(+), 60 deletions(-) > > -- > 1.7.7.6 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski wrote: > Currently, if a tracer changes a syscall nr to __NR_future_enosys, > behavior will differ between kernels that know about > __NR_future_enosys (and return -ENOSYS) and older kernels (which > return the value from pt_regs). This is silly; we should just > return -ENOSYS. > > This is unlikely to ever happen on x86 because the return value in > pt_regs starts out as -ENOSYS, but a silly tracer can change that. > > Signed-off-by: Andy Lutomirski > Cc: Will Drewry > --- > arch/x86/include/asm/syscall.h | 11 +++ > kernel/seccomp.c | 15 +++ > 2 files changed, 26 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h > index 1ace47b..8191e057 100644 > --- a/arch/x86/include/asm/syscall.h > +++ b/arch/x86/include/asm/syscall.h Since this is used in an arch-wide location, the prototype and comment should be in asm-generic/syscall.h too. > @@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct > task_struct *task, > regs->ax = (long) error ?: val; > } > > +static inline bool syscall_is_nr_future(struct task_struct *task, > + struct pt_regs *regs) > +{ > +#ifdef CONFIG_IA32_EMULATION > + if (task_thread_info(task)->status & TS_COMPAT) > + return syscall_get_nr(task, regs) > __NR_ia32_syscall_max; > +#endif > + > + return syscall_get_nr(task, regs) > __NR_syscall_max; I'm not sure how easy this will be to implement on some of the arches where this data isn't bubbled up. It'd be good if some non-x86 arch maintainers chimed in (since x86 is easy and already works as expected :). > +} > + > #ifdef CONFIG_X86_32 > > static inline void syscall_get_arguments(struct task_struct *task, > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 5af44b5..bd7527d 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -429,6 +429,21 @@ int __secure_computing(int this_syscall) > */ > if (fatal_signal_pending(current)) > break; > + > + if (syscall_is_nr_future(current, regs)) { > + /* > +* If the tracer selects a system call that > +* only future kernels know about, we still > need > +* to execute that syscall by returning > +* -ENOSYS. (On x86, this only matters if the > +* tracer changed the return value, which > would > +* be silly, but user code can be silly.) > +*/ > + syscall_set_return_value(current, regs, > +-ENOSYS, 0); > + goto skip; > + } > + > if (syscall_get_nr(current, regs) < 0) > goto skip; /* Explicit request to skip. */ > > -- > 1.7.7.6 > thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent
On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski wrote: > This fixes two issues that could cause incompatibility between > kernel versions: > > - If a tracer uses SECCOMP_RET_TRACE to select a syscall number >higher than the largest known syscall, emulate the unknown >vsyscall by returning -ENOSYS. (This is unlikely to make a >noticeable difference on x86-64 due to the way the system call >entry works.) > > - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy. > > This updates the documentation accordingly. > > Signed-off-by: Andy Lutomirski > Cc: Will Drewry > --- > Documentation/prctl/seccomp_filter.txt | 74 -- > arch/x86/kernel/vsyscall_64.c | 110 > +--- > kernel/seccomp.c | 13 +++- > 3 files changed, 137 insertions(+), 60 deletions(-) > > diff --git a/Documentation/prctl/seccomp_filter.txt > b/Documentation/prctl/seccomp_filter.txt > index 597c3c5..1e469ef 100644 > --- a/Documentation/prctl/seccomp_filter.txt > +++ b/Documentation/prctl/seccomp_filter.txt > @@ -95,12 +95,15 @@ SECCOMP_RET_KILL: > > SECCOMP_RET_TRAP: > Results in the kernel sending a SIGSYS signal to the triggering > - task without executing the system call. The kernel will > - rollback the register state to just before the system call > - entry such that a signal handler in the task will be able to > - inspect the ucontext_t->uc_mcontext registers and emulate > - system call success or failure upon return from the signal > - handler. > + task without executing the system call. siginfo->si_call_addr > + will show the address of the system call instruction, and > + siginfo->si_syscall and siginfo->si_arch will indicate which > + syscall was attempted. The program counter will be as though > + the syscall happened (i.e. it will not point to the syscall > + instruction). The return value register will contain an arch- > + dependent value -- if resuming execution, set it to something > + sensible. (The architecture dependency is because replacing > + it with -ENOSYS could overwrite some useful information.) > > The SECCOMP_RET_DATA portion of the return value will be passed > as si_errno. > @@ -123,6 +126,18 @@ SECCOMP_RET_TRACE: > the BPF program return value will be available to the tracer > via PTRACE_GETEVENTMSG. > > + The tracer can skip the system call by changing the syscall number > + to -1. Alternatively, the tracer can change the system call > + requested by changing the system call to a valid syscall number. If > + the tracer asks to skip the system call, then the system call will > + appear to return the value that the tracer puts in the return value > + register. > + > + The seccomp check will not be run again after the tracer is > + notified. (This means that seccomp-based sandboxes MUST NOT > + allow use of ptrace, even of other sandboxed processes, without > + extreme care; ptracers can use this mechanism to escape.) > + > SECCOMP_RET_ALLOW: > Results in the system call being executed. > > @@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it > will be able to > support seccomp filter with minor fixup: SIGSYS support and seccomp return > value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER > to its arch-specific Kconfig. > + > + > + > +Caveats > +--- > + > +The vDSO can cause some system calls to run entirely in userspace, > +leading to surprises when you run programs on different machines that > +fall back to real syscalls. To minimize these surprises on x86, make > +sure you test with > +/sys/devices/system/clocksource/clocksource0/current_clocksource set to > +something like acpi_pm. > + > +On x86-64, vsyscall emulation is enabled by default. (vsyscalls are > +legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor > seccomp, with a few oddities: > + > +- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to > + the vsyscall entry for the given call and not the address after the > + 'syscall' instruction. Any code which wants to restart the call > + should be aware that (a) a ret instruction has been emulated and (b) > + trying to resume the syscall will again trigger the standard vsyscall > + emulation security checks, making resuming the syscall mostly > + pointless. > + > +- A return value of SECCOMP_RET_TRACE will signal the tracer as usual, > + but the syscall may not be changed t
Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
On Tue, Jul 17, 2012 at 9:13 PM, Will Drewry wrote: > On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski wrote: >> Currently, if a tracer changes a syscall nr to __NR_future_enosys, >> behavior will differ between kernels that know about >> __NR_future_enosys (and return -ENOSYS) and older kernels (which >> return the value from pt_regs). This is silly; we should just >> return -ENOSYS. >> >> This is unlikely to ever happen on x86 because the return value in >> pt_regs starts out as -ENOSYS, but a silly tracer can change that. >> >> Signed-off-by: Andy Lutomirski >> Cc: Will Drewry >> --- >> arch/x86/include/asm/syscall.h | 11 +++ >> kernel/seccomp.c | 15 +++ >> 2 files changed, 26 insertions(+), 0 deletions(-) >> >> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h >> index 1ace47b..8191e057 100644 >> --- a/arch/x86/include/asm/syscall.h >> +++ b/arch/x86/include/asm/syscall.h > > Since this is used in an arch-wide location, the prototype and comment > should be in asm-generic/syscall.h too. > >> @@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct >> task_struct *task, >> regs->ax = (long) error ?: val; >> } >> >> +static inline bool syscall_is_nr_future(struct task_struct *task, >> + struct pt_regs *regs) >> +{ >> +#ifdef CONFIG_IA32_EMULATION >> + if (task_thread_info(task)->status & TS_COMPAT) >> + return syscall_get_nr(task, regs) > __NR_ia32_syscall_max; >> +#endif >> + >> + return syscall_get_nr(task, regs) > __NR_syscall_max; > > I'm not sure how easy this will be to implement on some of the arches > where this data isn't bubbled up. It'd be good if some non-x86 arch > maintainers chimed in (since x86 is easy and already works as expected > :). > Since x86 always returns -ENOSYS with an invalid syscall and only x86 supports seccomp filter in 3.5, I'd propose pushing this off for 3.6+ to get more feedback from the relevant parties. Not doing it now doesn't expose any users of 3.5 to any sort of changing ABI. (Otherwise, it seems fine but may make adding new arches slightly more onerous, but I suspect ftrace needs this sort of info too as it spreads to other arches!) >> +} >> + >> #ifdef CONFIG_X86_32 >> >> static inline void syscall_get_arguments(struct task_struct *task, >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index 5af44b5..bd7527d 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -429,6 +429,21 @@ int __secure_computing(int this_syscall) >> */ >> if (fatal_signal_pending(current)) >> break; >> + >> + if (syscall_is_nr_future(current, regs)) { >> + /* >> +* If the tracer selects a system call that >> +* only future kernels know about, we still >> need >> +* to execute that syscall by returning >> +* -ENOSYS. (On x86, this only matters if >> the >> +* tracer changed the return value, which >> would >> +* be silly, but user code can be silly.) >> +*/ >> + syscall_set_return_value(current, regs, >> +-ENOSYS, 0); >> + goto skip; >> + } >> + >> if (syscall_get_nr(current, regs) < 0) >> goto skip; /* Explicit request to skip. */ >> >> -- >> 1.7.7.6 >> > > thanks! > will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [libseccomp-discuss] ARM audit, seccomp, etc are broken wrt OABI syscalls
On Wed, Nov 6, 2013 at 9:51 AM, Russell King - ARM Linux wrote: > On Wed, Nov 06, 2013 at 10:32:31AM -0500, Eric Paris wrote: >> On Tue, 2013-11-05 at 14:36 -0800, Andy Lutomirski wrote: >> > 1. Set a different audit arch for OABI syscalls (e.g. >> > AUDIT_ARCH_ARMOABI). That is, treat OABI syscall entries the same way >> > that x86_64 treats int 80. >> >> As the audit maintainer, I like #1. It might break ABI, but the ABI is >> flat wrong now and not maintainable... > > If you read the whole thread, you will see that this corner case is just > not worth the effort to support. Audit may as well be disabled by > kernel config if any OABI support is enabled. This might be the best move for seccomp too (as Kees suggested). I'd love to have audit arch visibility, but it's not clear that it's worth any sort of larger changes ... ... like adding a task_thread_info.compat flag that bubbles up to syscall_get_arch(), or if we assume consumers of syscall_get_nr() are broken today (I haven't checked), then it would be possible to at least re-add the 0x90 bits, if compat, before handing back the system call number but leave the audit arch pieces alone. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] seccomp: not compatible with ARM OABI
Thanks! Reviewed-By: Will Drewry On Wed, Nov 6, 2013 at 5:31 PM, Kees Cook wrote: > Make sure that seccomp filter won't be built when ARM OABI is in use, > since there is work needed to distinguish calling conventions. Until > that is done (which is likely never since OABI is deprecated), make > sure seccomp filter is unavailable in the OABI compat world. > > Signed-off-by: Kees Cook > --- > arch/Kconfig |5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index af2cc6eabcc7..6eaca7d92399 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -331,12 +331,15 @@ config HAVE_ARCH_SECCOMP_FILTER > > config SECCOMP_FILTER > def_bool y > - depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET > + depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET && !OABI_COMPAT > help > Enable tasks to build secure computing environments defined > in terms of Berkeley Packet Filter programs which implement > task-defined system call filtering polices. > > + Not available on ARM when built with OABI compatibility due to > + lack of a sensible way to distinguish the calling conventions. > + > See Documentation/prctl/seccomp_filter.txt for details. > > config HAVE_CONTEXT_TRACKING > -- > 1.7.9.5 > > > -- > Kees Cook > Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 35/46] kernel: Mark function as static in kernel/seccomp.c
On Thu, Feb 27, 2014 at 9:33 AM, Kees Cook wrote: > On Thu, Feb 27, 2014 at 4:20 AM, Rashika Kheria > wrote: >> Mark function as static in kernel/seccomp.c because it is not used >> outside this file. >> >> This eliminates the following warning in kernel/seccomp.c: >> kernel/seccomp.c:296:6: warning: no previous prototype for >> 'seccomp_attach_user_filter' [-Wmissing-prototypes] >> >> Signed-off-by: Rashika Kheria >> Reviewed-by: Josh Triplett > > Acked-by: Kees Cook Acked-by: Will Drewry It amazing what things slip through. Thanks! >> --- >> kernel/seccomp.c |2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index b7a1004..0e004a7 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -293,7 +293,7 @@ fail: >> * >> * Returns 0 on success and non-zero otherwise. >> */ >> -long seccomp_attach_user_filter(char __user *user_filter) >> +static long seccomp_attach_user_filter(char __user *user_filter) >> { >> struct sock_fprog fprog; >> long ret = -EFAULT; >> -- >> 1.7.9.5 >> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] seccomp: protect seccomp.filter pointer (w) with the task_lock
Normally, task_struck.seccomp.filter is only ever read or modified by the task that owns it (current). This property aids in fast access during system call filtering as read access is lockless. Updating the pointer from another task, however, opens up race conditions. To allow cross-task filter pointer updates, writes to the pointer are now protected by the task_lock. Read access remains lockless because pointer updates themselves are atomic. However, writes often entail additional checking (like maximum instruction counts) which require locking to perform safely. Signed-off-by: Will Drewry --- include/linux/seccomp.h |4 ++-- kernel/seccomp.c| 22 +- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 6f19cfd..85c0895 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -17,8 +17,8 @@ struct seccomp_filter; * @filter: The metadata and ruleset for determining what system calls * are allowed for a task. * - * @filter must only be accessed from the context of current as there - * is no locking. + * @filter must always point to a valid seccomp_filter or NULL as it is + * accessed without locking during system call entry. */ struct seccomp { int mode; diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b7a1004..71512e4 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -201,18 +201,18 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) */ static u32 seccomp_run_filters(int syscall) { - struct seccomp_filter *f; + struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter); u32 ret = SECCOMP_RET_ALLOW; /* Ensure unexpected behavior doesn't result in failing open. */ - if (WARN_ON(current->seccomp.filter == NULL)) + if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL; /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ - for (f = current->seccomp.filter; f; f = f->prev) { + for (; f; f = ACCESS_ONCE(f->prev)) { u32 cur_ret = sk_run_filter(NULL, f->insns); if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION)) ret = cur_ret; @@ -228,7 +228,7 @@ static u32 seccomp_run_filters(int syscall) */ static long seccomp_attach_filter(struct sock_fprog *fprog) { - struct seccomp_filter *filter; + struct seccomp_filter *filter, *f; unsigned long fp_size = fprog->len * sizeof(struct sock_filter); unsigned long total_insns = fprog->len; long ret; @@ -236,11 +236,6 @@ static long seccomp_attach_filter(struct sock_fprog *fprog) if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return -EINVAL; - for (filter = current->seccomp.filter; filter; filter = filter->prev) - total_insns += filter->len + 4; /* include a 4 instr penalty */ - if (total_insns > MAX_INSNS_PER_PATH) - return -ENOMEM; - /* * Installing a seccomp filter requires that the task have * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. @@ -260,6 +255,13 @@ static long seccomp_attach_filter(struct sock_fprog *fprog) atomic_set(&filter->usage, 1); filter->len = fprog->len; + task_lock(current); /* protect current->seccomp.filter from updates */ + for (f = current->seccomp.filter; f; f = f->prev) + total_insns += f->len + 4; /* include a 4 instr penalty */ + ret = -ENOMEM; + if (total_insns > MAX_INSNS_PER_PATH) + goto fail; + /* Copy the instructions from fprog. */ ret = -EFAULT; if (copy_from_user(filter->insns, fprog->filter, fp_size)) @@ -281,8 +283,10 @@ static long seccomp_attach_filter(struct sock_fprog *fprog) */ filter->prev = current->seccomp.filter; current->seccomp.filter = filter; + task_unlock(current); return 0; fail: + task_unlock(current); kfree(filter); return ret; } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
Applying restrictive seccomp filter programs to large or diverse codebases often requires handling threads which may be started early in the process lifetime (e.g., by code that is linked in). While it is possible to apply permissive programs prior to process start up, it is difficult to further restrict the kernel ABI to those threads after that point. This change adds a new seccomp "extension" for synchronizing thread group seccomp filters and a prctl() for accessing that functionality. The need for the added prctl() is due to the lack of reserved arguments in PR_SET_SECCOMP. When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it will attempt to synchronize all threads in current's threadgroup to its seccomp filter program. This is possible iff all threads are using a filter that is an ancestor to the filter current is attempting to synchronize to. NULL filters (where the task is running as SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On failure, the pid of one of the failing threads will be returned. Suggested-by: Julien Tinnes Signed-off-by: Will Drewry --- include/linux/seccomp.h |7 +++ include/uapi/linux/prctl.h |6 ++ include/uapi/linux/seccomp.h |6 ++ kernel/seccomp.c | 128 ++ kernel/sys.c |3 + 5 files changed, 150 insertions(+) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 85c0895..3163db6 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s) extern void put_seccomp_filter(struct task_struct *tsk); extern void get_seccomp_filter(struct task_struct *tsk); extern u32 seccomp_bpf_load(int off); +extern long prctl_seccomp_ext(unsigned long, unsigned long, + unsigned long, unsigned long); #else /* CONFIG_SECCOMP_FILTER */ static inline void put_seccomp_filter(struct task_struct *tsk) { @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk) { return; } +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3, +unsigned long arg4, unsigned long arg5) +{ + return -EINVAL; +} #endif /* CONFIG_SECCOMP_FILTER */ #endif /* _LINUX_SECCOMP_H */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 289760f..5dcd5d3 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -149,4 +149,10 @@ #define PR_GET_TID_ADDRESS 40 +/* + * Access seccomp extensions + * See Documentation/prctl/seccomp_filter.txt for more details. + */ +#define PR_SECCOMP_EXT 41 + #endif /* _LINUX_PRCTL_H */ diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index ac2dc9f..49b5279 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -10,6 +10,12 @@ #define SECCOMP_MODE_STRICT1 /* uses hard-coded filter. */ #define SECCOMP_MODE_FILTER2 /* uses user-supplied filter. */ +/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */ +#define SECCOMP_EXT_ACT1 + +/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */ +#define SECCOMP_EXT_ACT_TSYNC 1 /* attempt to synchronize thread filters */ + /* * All BPF programs must return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 71512e4..8a0de7b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -24,6 +24,7 @@ #ifdef CONFIG_SECCOMP_FILTER #include #include +#include #include #include #include @@ -220,6 +221,108 @@ static u32 seccomp_run_filters(int syscall) return ret; } +/* Returns 1 if the candidate is an ancestor. */ +static int is_ancestor(struct seccomp_filter *candidate, + struct seccomp_filter *child) +{ + /* NULL is the root ancestor. */ + if (candidate == NULL) + return 1; + for (; child; child = child->prev) + if (child == candidate) + return 1; + return 0; +} + +/** + * seccomp_sync_threads: sets all threads to use current's filter + * + * Returns 0 on success or the pid of a thread which was either not + * in the correct seccomp mode or it did not have an ancestral + * seccomp filter. current must be in seccomp.mode=2 already. + */ +static pid_t seccomp_sync_threads(void) +{ + struct task_struct *thread, *caller; + pid_t failed = 0; + thread = caller = current; + + read_lock(&tasklist_lock); + if (thread_group_empty(caller)) + goto done; + while_each_thread(caller, thread) { + task_lock(thread); + /* +* All threads must not be in SECCOMP_MODE_STRICT to +* be e
[PATCH 3/3] Documentation/prctl/seccomp_filter.txt: document extensions
(missed this on the first run) Add an entry for the PR_SECCOMP_EXT entry point and the only existing consumer, SECCOMP_EXT_ACT_TSYNC. Signed-off-by: Will Drewry --- Documentation/prctl/seccomp_filter.txt | 28 +++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt index 1e469ef..b296701 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/prctl/seccomp_filter.txt @@ -166,10 +166,36 @@ The samples/seccomp/ directory contains both an x86-specific example and a more generic example of a higher level macro interface for BPF program generation. +Extensions +-- + +SECCOMP_MODE_FILTER supports an additional entry point for accessing +extended behavior through prctl(PR_SECCOMP_EXT). Only one extension +exists today: + +SECCOMP_EXT_ACT_TSYNC: + If the calling task is running under SECCOMP_MODE_FILTER, it + may call prctl() to synchronize the seccomp filter of its + threads. As seccomp behavior is per-task, any thread under + SECCOMP_MODE_STRICT will be unaffected, as will any thread + under SECCOMP_MODE_FILTER that does not have a filter that is + in the filter tree ancestry for the caller. Any threads that + are in SECCOMP_MODE_NONE will be transitioned to + SECCOMP_MODE_FILTER if possible. + + Usage: + prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0); + + If any threads cannot be transitioned, the call will return one + of the process ids. All other threads will have been transitioned. + A return value of 0 indicates success. On a negative return value, + the errno will be populated appropriately: + EINVAL indicates invalid arguments. + EACCES indicates invalid seccomp mode. Adding architecture support +--- See arch/Kconfig for the authoritative requirements. In general, if an architecture supports both ptrace_event and seccomp, it will be able to -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Mon, Jan 13, 2014 at 5:36 PM, Andy Lutomirski wrote: > On 01/13/2014 12:30 PM, Will Drewry wrote: >> Applying restrictive seccomp filter programs to large or diverse >> codebases often requires handling threads which may be started early in >> the process lifetime (e.g., by code that is linked in). While it is >> possible to apply permissive programs prior to process start up, it is >> difficult to further restrict the kernel ABI to those threads after that >> point. >> >> This change adds a new seccomp "extension" for synchronizing thread >> group seccomp filters and a prctl() for accessing that functionality. >> The need for the added prctl() is due to the lack of reserved arguments >> in PR_SET_SECCOMP. >> >> When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it >> will attempt to synchronize all threads in current's threadgroup to its >> seccomp filter program. This is possible iff all threads are using a >> filter that is an ancestor to the filter current is attempting to >> synchronize to. NULL filters (where the task is running as >> SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be >> transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On >> failure, the pid of one of the failing threads will be returned. >> >> Suggested-by: Julien Tinnes >> Signed-off-by: Will Drewry >> --- >> include/linux/seccomp.h |7 +++ >> include/uapi/linux/prctl.h |6 ++ >> include/uapi/linux/seccomp.h |6 ++ >> kernel/seccomp.c | 128 >> ++ >> kernel/sys.c |3 + >> 5 files changed, 150 insertions(+) >> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> index 85c0895..3163db6 100644 >> --- a/include/linux/seccomp.h >> +++ b/include/linux/seccomp.h >> @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s) >> extern void put_seccomp_filter(struct task_struct *tsk); >> extern void get_seccomp_filter(struct task_struct *tsk); >> extern u32 seccomp_bpf_load(int off); >> +extern long prctl_seccomp_ext(unsigned long, unsigned long, >> + unsigned long, unsigned long); >> #else /* CONFIG_SECCOMP_FILTER */ >> static inline void put_seccomp_filter(struct task_struct *tsk) >> { >> @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct task_struct >> *tsk) >> { >> return; >> } >> +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3, >> + unsigned long arg4, unsigned long arg5) >> +{ >> + return -EINVAL; >> +} >> #endif /* CONFIG_SECCOMP_FILTER */ >> #endif /* _LINUX_SECCOMP_H */ >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index 289760f..5dcd5d3 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -149,4 +149,10 @@ >> >> #define PR_GET_TID_ADDRESS 40 >> >> +/* >> + * Access seccomp extensions >> + * See Documentation/prctl/seccomp_filter.txt for more details. >> + */ >> +#define PR_SECCOMP_EXT 41 >> + >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h >> index ac2dc9f..49b5279 100644 >> --- a/include/uapi/linux/seccomp.h >> +++ b/include/uapi/linux/seccomp.h >> @@ -10,6 +10,12 @@ >> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ >> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ >> >> +/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */ >> +#define SECCOMP_EXT_ACT 1 >> + >> +/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, >> SECCOMP_EXT_ACT) */ >> +#define SECCOMP_EXT_ACT_TSYNC1 /* attempt to synchronize thread >> filters */ >> + >> /* >> * All BPF programs must return a 32-bit value. >> * The bottom 16-bits are for optional return data. >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index 71512e4..8a0de7b 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -24,6 +24,7 @@ >> #ifdef CONFIG_SECCOMP_FILTER >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -220,6 +221,108 @@ static u32 seccomp_run_filters(int syscall) >> return ret; >> } >> >> +/* Returns 1 if the candidate is an ancestor. */ >> +static int
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Tue, Jan 14, 2014 at 2:13 PM, Oleg Nesterov wrote: > On 01/14, Oleg Nesterov wrote: >> >> On 01/14, Oleg Nesterov wrote: >> > >> > > + get_seccomp_filter(caller); >> > > + /* >> > > + * Drop the task reference to the shared ancestor since >> > > + * current's path will hold a reference. (This also >> > > + * allows a put before the assignment.) >> > > + */ >> > > + put_seccomp_filter(thread); >> > > + thread->seccomp.filter = caller->seccomp.filter; >> > >> > As I said, I do not understand this patch yet, but this looks suspicious. >> > >> > Why we can't race with this thread doing clone(CLONE_THREAD) ? We do >> > not the the new thread yet, but its ->seccomp can be already copied >> > by copy_process(), no? Ah - I thought the tasklist_lock would catch that, but of course that happens before the tasklist_lock is needed. >> >> And it seems that this can obviously race with seccomp_attach_filter() >> called by this "thread". And... I was hoping the task_lock would cover any attach cases, but missing the copy_process() is a problem. > > Heh. I just noticed that this patch is not first in series, and I wasn't > cc'ed. I found this one on marc.info, Sorry! I shouldn't have relied on cc-cmd, I usually mess it up. > > http://marc.info/?l=linux-kernel&m=138964557211277 > > this explains task_lock(). But this can't fix the race with copy_process, > and the patch itself doesn't look right... if nothing else, we can't do > copy_from_user() under task_lock(). Thanks -- I'll take a more critical look! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Tue, Jan 14, 2014 at 2:24 PM, Andy Lutomirski wrote: > On Tue, Jan 14, 2014 at 10:59 AM, Will Drewry wrote: >> On Mon, Jan 13, 2014 at 5:36 PM, Andy Lutomirski wrote: >>> On 01/13/2014 12:30 PM, Will Drewry wrote: >>>> Applying restrictive seccomp filter programs to large or diverse >>>> codebases often requires handling threads which may be started early in >>>> the process lifetime (e.g., by code that is linked in). While it is >>>> possible to apply permissive programs prior to process start up, it is >>>> difficult to further restrict the kernel ABI to those threads after that >>>> point. >>>> >>>> This change adds a new seccomp "extension" for synchronizing thread >>>> group seccomp filters and a prctl() for accessing that functionality. >>>> The need for the added prctl() is due to the lack of reserved arguments >>>> in PR_SET_SECCOMP. >>>> >>>> When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it >>>> will attempt to synchronize all threads in current's threadgroup to its >>>> seccomp filter program. This is possible iff all threads are using a >>>> filter that is an ancestor to the filter current is attempting to >>>> synchronize to. NULL filters (where the task is running as >>>> SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be >>>> transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On >>>> failure, the pid of one of the failing threads will be returned. >>>> >>>> Suggested-by: Julien Tinnes >>>> Signed-off-by: Will Drewry >>>> --- >>>> include/linux/seccomp.h |7 +++ >>>> include/uapi/linux/prctl.h |6 ++ >>>> include/uapi/linux/seccomp.h |6 ++ >>>> kernel/seccomp.c | 128 >>>> ++ >>>> kernel/sys.c |3 + >>>> 5 files changed, 150 insertions(+) >>>> >>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >>>> index 85c0895..3163db6 100644 >>>> --- a/include/linux/seccomp.h >>>> +++ b/include/linux/seccomp.h >>>> @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s) >>>> extern void put_seccomp_filter(struct task_struct *tsk); >>>> extern void get_seccomp_filter(struct task_struct *tsk); >>>> extern u32 seccomp_bpf_load(int off); >>>> +extern long prctl_seccomp_ext(unsigned long, unsigned long, >>>> + unsigned long, unsigned long); >>>> #else /* CONFIG_SECCOMP_FILTER */ >>>> static inline void put_seccomp_filter(struct task_struct *tsk) >>>> { >>>> @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct >>>> task_struct *tsk) >>>> { >>>> return; >>>> } >>>> +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long >>>> arg3, >>>> + unsigned long arg4, unsigned long arg5) >>>> +{ >>>> + return -EINVAL; >>>> +} >>>> #endif /* CONFIG_SECCOMP_FILTER */ >>>> #endif /* _LINUX_SECCOMP_H */ >>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >>>> index 289760f..5dcd5d3 100644 >>>> --- a/include/uapi/linux/prctl.h >>>> +++ b/include/uapi/linux/prctl.h >>>> @@ -149,4 +149,10 @@ >>>> >>>> #define PR_GET_TID_ADDRESS 40 >>>> >>>> +/* >>>> + * Access seccomp extensions >>>> + * See Documentation/prctl/seccomp_filter.txt for more details. >>>> + */ >>>> +#define PR_SECCOMP_EXT 41 >>>> + >>>> #endif /* _LINUX_PRCTL_H */ >>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h >>>> index ac2dc9f..49b5279 100644 >>>> --- a/include/uapi/linux/seccomp.h >>>> +++ b/include/uapi/linux/seccomp.h >>>> @@ -10,6 +10,12 @@ >>>> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ >>>> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ >>>> >>>> +/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */ >>>> +#define SECCOMP_EXT_ACT 1 >>>> + >>>> +/* Valid extension actions as arg3 to
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Tue, Jan 14, 2014 at 1:21 PM, Oleg Nesterov wrote: > On 01/13, Will Drewry wrote: >> >> +static pid_t seccomp_sync_threads(void) >> +{ >> + struct task_struct *thread, *caller; >> + pid_t failed = 0; >> + thread = caller = current; >> + >> + read_lock(&tasklist_lock); >> + if (thread_group_empty(caller)) >> + goto done; >> + while_each_thread(caller, thread) { >> + task_lock(thread); > > perhaps we take task_lock() to serialize with another caller of > seccomp_sync_threads()... Sorry for the patch being unclear! The task_lock is meant to protect against the assignment of a seccomp.filter pointer which was meant to protect against the target task calling seccomp_attach_filter while another task is changing its filter pointer. > If yes, then perhaps you can use ->siglock instead of tasklist_lock > and do not use task_lock(). It would be even better to rely on rcu, > but: Exactly. The siglock seems like it makes more sense. >> + get_seccomp_filter(caller); >> + /* >> + * Drop the task reference to the shared ancestor since >> + * current's path will hold a reference. (This also >> + * allows a put before the assignment.) >> + */ >> + put_seccomp_filter(thread); >> + thread->seccomp.filter = caller->seccomp.filter; > > As I said, I do not understand this patch yet, but this looks suspicious. > > Why we can't race with this thread doing clone(CLONE_THREAD) ? We do > not the the new thread yet, but its ->seccomp can be already copied > by copy_process(), no? Yeah I missed that. That said, I think the worst of it would be that the new thread gets the old filter. It should still get_seccomp_filter() its own reference to the filter. I'm not clear if it's possible to ensure that there is no pathological condition where a thread races and is created without being synchronized. I'll see if the siglock helps here and walk the clone() code again to see what else I missed. thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Tue, Jan 14, 2014 at 3:09 PM, Andy Lutomirski wrote: > On Tue, Jan 14, 2014 at 12:59 PM, Will Drewry wrote: >> On Tue, Jan 14, 2014 at 2:24 PM, Andy Lutomirski wrote: >>> On Tue, Jan 14, 2014 at 10:59 AM, Will Drewry wrote: >>>> On Mon, Jan 13, 2014 at 5:36 PM, Andy Lutomirski >>>> wrote: >>>>> On 01/13/2014 12:30 PM, Will Drewry wrote: >>>>>> Applying restrictive seccomp filter programs to large or diverse >>>>>> codebases often requires handling threads which may be started early in >>>>>> the process lifetime (e.g., by code that is linked in). While it is >>>>>> possible to apply permissive programs prior to process start up, it is >>>>>> difficult to further restrict the kernel ABI to those threads after that >>>>>> point. >>>>>> >>>>>> This change adds a new seccomp "extension" for synchronizing thread >>>>>> group seccomp filters and a prctl() for accessing that functionality. >>>>>> The need for the added prctl() is due to the lack of reserved arguments >>>>>> in PR_SET_SECCOMP. >>>>>> >>>>>> When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it >>>>>> will attempt to synchronize all threads in current's threadgroup to its >>>>>> seccomp filter program. This is possible iff all threads are using a >>>>>> filter that is an ancestor to the filter current is attempting to >>>>>> synchronize to. NULL filters (where the task is running as >>>>>> SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be >>>>>> transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On >>>>>> failure, the pid of one of the failing threads will be returned. >>>>>> >>>>>> Suggested-by: Julien Tinnes >>>>>> Signed-off-by: Will Drewry >>>>>> --- >>>>>> include/linux/seccomp.h |7 +++ >>>>>> include/uapi/linux/prctl.h |6 ++ >>>>>> include/uapi/linux/seccomp.h |6 ++ >>>>>> kernel/seccomp.c | 128 >>>>>> ++ >>>>>> kernel/sys.c |3 + >>>>>> 5 files changed, 150 insertions(+) >>>>>> >>>>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >>>>>> index 85c0895..3163db6 100644 >>>>>> --- a/include/linux/seccomp.h >>>>>> +++ b/include/linux/seccomp.h >>>>>> @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s) >>>>>> extern void put_seccomp_filter(struct task_struct *tsk); >>>>>> extern void get_seccomp_filter(struct task_struct *tsk); >>>>>> extern u32 seccomp_bpf_load(int off); >>>>>> +extern long prctl_seccomp_ext(unsigned long, unsigned long, >>>>>> + unsigned long, unsigned long); >>>>>> #else /* CONFIG_SECCOMP_FILTER */ >>>>>> static inline void put_seccomp_filter(struct task_struct *tsk) >>>>>> { >>>>>> @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct >>>>>> task_struct *tsk) >>>>>> { >>>>>> return; >>>>>> } >>>>>> +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long >>>>>> arg3, >>>>>> + unsigned long arg4, unsigned long >>>>>> arg5) >>>>>> +{ >>>>>> + return -EINVAL; >>>>>> +} >>>>>> #endif /* CONFIG_SECCOMP_FILTER */ >>>>>> #endif /* _LINUX_SECCOMP_H */ >>>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >>>>>> index 289760f..5dcd5d3 100644 >>>>>> --- a/include/uapi/linux/prctl.h >>>>>> +++ b/include/uapi/linux/prctl.h >>>>>> @@ -149,4 +149,10 @@ >>>>>> >>>>>> #define PR_GET_TID_ADDRESS 40 >>>>>> >>>>>> +/* >>>>>> + * Access seccomp extensions >>>>>> + * See Documentation/prctl/seccomp_filter.txt for more details. >>>>>> + */ >>>>>> +#define PR_SECCOMP_EXT 41
Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Wed, Jan 15, 2014 at 1:04 PM, Oleg Nesterov wrote: > On 01/14, Will Drewry wrote: >> >> On Tue, Jan 14, 2014 at 1:21 PM, Oleg Nesterov wrote: >> >> >> + get_seccomp_filter(caller); >> >> + /* >> >> + * Drop the task reference to the shared ancestor >> >> since >> >> + * current's path will hold a reference. (This also >> >> + * allows a put before the assignment.) >> >> + */ >> >> + put_seccomp_filter(thread); >> >> + thread->seccomp.filter = caller->seccomp.filter; >> > >> > As I said, I do not understand this patch yet, but this looks suspicious. >> > >> > Why we can't race with this thread doing clone(CLONE_THREAD) ? We do >> > not the the new thread yet, but its ->seccomp can be already copied >> > by copy_process(), no? >> >> Yeah I missed that. That said, I think the worst of it would be that >> the new thread >> gets the old filter. > > Yes, but this means you can trust SECCOMP_EXT_ACT_TSYNC. > >> I'll see if >> the siglock helps >> here and walk the clone() code again to see what else I missed. > > No, siglock itself can't help to avoid this race. Unless you move > copy_process()->get_seccomp_filter() under the same lock, and in > this case it should also re-copy ->seccomp. Not nice. Yeah - not at all. I'll rethink it. I was too excited about how easy is_ancestor works, but the locking is really the hard part. > But note task_lock() (or any other per-thread locking) is wrong. > Just look at the code above. We hold task_lock(thread) but not > task_lock(caller). What if another thread calls seccomp_sync_threads() > and changes caller->seccomp right after get_seccomp_filter(caller). Yup - I was thinking of tasklist_lock as a non-multi-reader lock, which is wrong. The task_lock(current) would clearly cover that case, but I need to walk through all the interactions paying more attention to the lock being used. > And even get_seccomp_filter() itself becomes racy. I think the > locking is seriously broken in this series. It certainly needs to be better applied :) thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [libseccomp-discuss] ARM seccomp filters and EABI/OABI
On Mon, Oct 28, 2013 at 5:02 PM, Paul Moore wrote: > On Thursday, October 24, 2013 02:14:15 PM Andy Lutomirski wrote: >> On Thu, Oct 24, 2013 at 12:11 PM, Paul Moore wrote: >> > On Wednesday, October 23, 2013 02:02:00 PM Andy Lutomirski wrote: >> >> I'm looking at the seccomp code, the ARM entry code, and the >> >> syscall(2) manpage, and I'm a bit lost. (The fact that I don't really >> >> speak ARM assembly doesn't help.) >> > >> > I suspect Kees, and perhaps Will, will be able to provide the best >> > answers, but my thoughts are below. >> > >> >> My basic question is: what happens if an OABI syscall happens? >> > >> > Well, libseccomp doesn't support ARM OABI and since all the new ARM stuff >> > is EABI I don't think there is much reason to worry about OABI. I know >> > this doesn't answer your question, but perhaps this provides some >> > context. >> >> Are you sure you don't support it? > > Yep, I said libseccomp doesn't support it so we don't ;) > > It may build and function to some extent, but I'm making no claims for OABI > support; if someone tries it on a OABI system they do so as their own risk. > >> What actually happens if someone writes an EABI program that issues an OABI >> syscall? (I'm hoping that the syscall nr ends up offset by 0x90, but >> I'm not sure.) > > Like you, I expect there would be a syscall mismatch but I can't say for > certain. It looks like entry-common.S masks off the 0x90 so that the numbers are the same, but the entry path is determined by the software interrupt instruction (SWI) 8-bit immediate which indicates the type (which it looks like is read by loading the address(-4) in the link register). http://lxr.free-electrons.com/source/arch/arm/kernel/entry-common.S?a=arm#L420 Then it just swaps the call table if the immediate is non-zero and uses the arg to call into the oabi version of the syscall table. I don't think there's a clear way to detect if the oabi table has been swapped in. I think that the reason that the AUDIT_ARCH is the same for OABI compat/not compat is that there is no argument ordering or syscall numbering difference. I'm not 100% though. >> >> AFAICS, the syscall arguments for EABI are r0..r5, although their >> >> ordering is a bit odd*. >> > >> > Hmmm, that could complicate things a bit - do you know if they are put in >> > a more "standard" order by the time they are accessed in >> > seccomp_bpf_load() via task_pt_regs()? If not, we likely need to come up >> > with some special handling in libseccomp to account for this. >> >> I don't think that such a think is possible. It depends on the >> signature of the particular syscall, and I don't know if there's any >> table of these things. > > Oh, that's all sorts of awesome. > > Well, at least in libseccomp we do have a syscall table for each arch so it > should be possible to track what per-syscall fixups are needed (assuming some > augmentation of our syscall table structures) and apply them at runtime. The > hard part is going to be determining what fixups are needed and recording them > in the table. > > Gr. >From looking at the oabi compat calls, it may be that no fixup is really needed in the BPF side, but only in places where other argument introspection occurs -- ptrace or sigsys handlers. >> >> I'm a bit surprised to see that both the EABI and OABI ABIs show up as >> >> AUDIT_ARCH_ARM. >> > >> > Yeah, the usage of AUDIT_ARCH_* is not really ideal for seccomp. There >> > are similar issues with x32; not quite as bad as with ARM, but still ... >> >> As long as the combination of AUDIT_ARCH and nr uniquely identifies a >> syscall and its ABI, life should be good. > > Ha! Life may be good, but the code to handle it was annoying* ;) > > Largely because I made the assumption (which turned out to be a bad) that an > AUDIT_ARCH_* value uniquely identified a single ABI. Removing that assumption > was both annoying and painful; the code still isn't very good in dealing with > multiple ABIs sharing a single AUDIT_ARCH_* token but it works. It seems like a problem for the audit infrastucture if the calling conventions are yielding improper system call information -- but I don't think it is in this case (which is why it seemed to make sense to connect the two subsystems...). The calls don't seem different, but the structs are organized, aligned, or padded differently (which shouldn't matter _too_ much to the seccomp filter unless the arg ordering is different, etc). Bleh. [ http://lxr.free-electrons.com/source/arch/arm/kernel/sys_oabi-compat.c ] I'm not sure if pt_regs ends up looking the same, but I was hoping it did. I realize that the seventh argument is dropped for OABI, but IIRC, the only call that used it was the syscall() syscall: http://lxr.free-electrons.com/source/arch/arm/kernel/entry-common.S#L524 which is marked as obsolete in calls.S. I don't think any other OABI calls (or any other calls in general) pass a seventh argument. *think* is the operative w
Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
On Wed, Jun 11, 2014 at 5:32 PM, Kees Cook wrote: > On Wed, Jun 11, 2014 at 3:28 PM, Andy Lutomirski wrote: >> On Wed, Jun 11, 2014 at 3:27 PM, H. Peter Anvin wrote: >>> On 06/11/2014 03:22 PM, Andy Lutomirski wrote: On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin wrote: > On 06/11/2014 02:56 PM, Andy Lutomirski wrote: >> >> 13ns is with the simplest nonempty filter. I hope that empty filters >> don't work. >> > > Why wouldn't they? Is it permissible to fall off the end of a BPF program? I'm getting EINVAL trying to install an actual empty filter. The filter I tested with was: >>> >>> What I meant was that there has to be a well-defined behavior for the >>> program falling off the end anyway, and that that should be preserved. >>> >>> I guess it is possible to require that all code paths must provably >>> reach a termination point. >>> >> >> Dunno. I haven't ever touched any of the actual BPF code. This whole >> patchset only changes the code that invokes the BPF evaluator. > > Yes, this is how BPF works: runs to the end or exit early. With > seccomp BPF specifically, the return value defaults to kill the > process. If a filter was missing (NULL), or empty, or didn't > explicitly return with a new value, the default (kill) should be > taken. Yup - this is just a property of BPF (and a nice one :) On seccomp_attach_filter this check fires: if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return -EINVAL; As well as in sk_chk_filter: if (flen == 0 || flen > BPF_MAXINSNS) return -EINVAL; And: /* last instruction must be a RET code */ switch (filter[flen - 1].code) { case BPF_S_RET_K: case BPF_S_RET_A: return check_load_and_stores(filter, flen); } cheers! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v7 8/9] seccomp: Introduce addfd ioctl to seccomp user notifier
On Thu, Jul 9, 2020 at 1:26 PM Kees Cook wrote: > > From: Sargun Dhillon > > The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over > an fd. It is often used in settings where a supervising task emulates > syscalls on behalf of a supervised task in userspace, either to further > restrict the supervisee's syscall abilities or to circumvent kernel > enforced restrictions the supervisor deems safe to lift (e.g. actually > performing a mount(2) for an unprivileged container). > > While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall, > only a certain subset of syscalls could be correctly emulated. Over the > last few development cycles, the set of syscalls which can't be emulated > has been reduced due to the addition of pidfd_getfd(2). With this we are > now able to, for example, intercept syscalls that require the supervisor > to operate on file descriptors of the supervisee such as connect(2). > > However, syscalls that cause new file descriptors to be installed can not > currently be correctly emulated since there is no way for the supervisor > to inject file descriptors into the supervisee. This patch adds a > new addfd ioctl to remove this restriction by allowing the supervisor to > install file descriptors into the intercepted task. By implementing this > feature via seccomp the supervisor effectively instructs the supervisee > to install a set of file descriptors into its own file descriptor table > during the intercepted syscall. This way it is possible to intercept > syscalls such as open() or accept(), and install (or replace, like > dup2(2)) the supervisor's resulting fd into the supervisee. One > replacement use-case would be to redirect the stdout and stderr of a > supervisee into log file descriptors opened by the supervisor. > > The ioctl handling is based on the discussions[1] of how Extensible > Arguments should interact with ioctls. Instead of building size into > the addfd structure, make it a function of the ioctl command (which > is how sizes are normally passed to ioctls). To support forward and > backward compatibility, just mask out the direction and size, and match > everything. The size (and any future direction) checks are done along > with copy_struct_from_user() logic. > > As a note, the seccomp_notif_addfd structure is laid out based on 8-byte > alignment without requiring packing as there have been packing issues > with uapi highlighted before[2][3]. Although we could overload the > newfd field and use -1 to indicate that it is not to be used, doing > so requires changing the size of the fd field, and introduces struct > packing complexity. > > [1]: https://lore.kernel.org/lkml/87o8w9bcaf@mid.deneb.enyo.de/ > [2]: > https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f1...@rasmusvillemoes.dk/ > [3]: > https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal > > Suggested-by: Matt Denton > Link: https://lore.kernel.org/r/20200603011044.7972-4-sar...@sargun.me > Signed-off-by: Sargun Dhillon > Co-developed-by: Kees Cook > Signed-off-by: Kees Cook Reviewed-by: Will Drewry
Re: [PATCH v1] selftests: Make test_harness.h more generally available
On Sun, Apr 30, 2017 at 12:39 PM, Kees Cook wrote: > > On Sun, Apr 30, 2017 at 5:26 AM, Mickaël Salaün wrote: > > The seccomp/test_harness.h file contains useful helpers to build tests. > > Moving it to the selftest directory should benefit to other test > > components. > > Unless Shuah thinks this should live in a new include/ directory, this > looks fine to me. > > Acked-by: Kees Cook Same here! Acked-by: Will Drewry > > Thanks! > > -Kees > > > > > Signed-off-by: Mickaël Salaün > > Cc: Andy Lutomirski > > Cc: Kees Cook > > Cc: Shuah Khan > > Cc: Will Drewry > > Link: > > https://lkml.kernel.org/r/CAGXu5j+8CVz8vL51DRYXqOY=xc3zuKFf=ptene88xyhzfyi...@mail.gmail.com > > --- > > tools/testing/selftests/seccomp/seccomp_bpf.c| 2 +- > > tools/testing/selftests/{seccomp => }/test_harness.h | 0 > > 2 files changed, 1 insertion(+), 1 deletion(-) > > rename tools/testing/selftests/{seccomp => }/test_harness.h (100%) > > > > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c > > b/tools/testing/selftests/seccomp/seccomp_bpf.c > > index 03f1fa495d74..d7095ff58e72 100644 > > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > > @@ -37,7 +37,7 @@ > > #include > > #include > > > > -#include "test_harness.h" > > +#include "../test_harness.h" > > > > #ifndef PR_SET_PTRACER > > # define PR_SET_PTRACER 0x59616d61 > > diff --git a/tools/testing/selftests/seccomp/test_harness.h > > b/tools/testing/selftests/test_harness.h > > similarity index 100% > > rename from tools/testing/selftests/seccomp/test_harness.h > > rename to tools/testing/selftests/test_harness.h > > -- > > 2.11.0 > > > > > > -- > Kees Cook > Pixel Security