Re: [RFC PATCH v5 0/7] mseal system mappings
On Wed, Feb 12, 2025 at 11:25 AM Lorenzo Stoakes wrote: > > On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote: > > From: Jeff Xu > > > > The commit message in the first patch contains the full description of > > this series. > > Sorry to nit, but it'd be useful to reproduce in the cover letter too! But > this obviously isn't urgent, just be nice when we un-RFC. > > Thanks for sending as RFC, appreciated, keen to figure out a way forward > with this series and this gives us space to discuss. > > One thing that came up recently with the LWN article (...!) was that rr is > also impacted by this [0]. > > I think with this behind a config flag we're fine (this refers to my > 'opt-in' comment in the reply on LWN) as my concerns about this being > enabled in a broken way without an explicit kernel configuration are > addressed, and actually we do expose a means by which a user can detect if > the VDSO for instance is sealed via /proc/$pid/[s]maps. > > So tools like rr and such can be updated to check for this. I wonder if we > ought to try to liaise with the known problematic ones? > > It'd be nice to update the documentation to have a list of 'known > problematic userland software with sealed VDSO' so we make people aware. > > Hopefully we are acheiving the opt-in nature of the thing here, but it > makes me wonder whether we need a prctl() interface to optionally disable > even if the system has it enabled as a whole. Just noting that (as we discussed off-list) doing prctl() would not work, because that would effectively be an munseal for those vdso regions. Possibly something like a personality() flag (that's *not* inherited when AT_SECURE/secureexec). But personalities have other issues... FWIW, although it would (at the moment) be hard to pull off in the libc, I still much prefer it to playing these weird games with CONFIG options and kernel command line options and prctl and personality and whatnot. It seems to me like we're trying to stick policy where it doesn't belong. -- Pedro
Re: [RFC PATCH v5 0/7] mseal system mappings
(sorry I really am struggling to reply to mail as lore still seems to be broken). On Wed, Feb 12, 2025 at 12:37:50PM +, Pedro Falcato wrote: > On Wed, Feb 12, 2025 at 11:25 AM Lorenzo Stoakes > wrote: > > > > On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote: > > > From: Jeff Xu > > > > > > The commit message in the first patch contains the full description of > > > this series. > > > > Sorry to nit, but it'd be useful to reproduce in the cover letter too! But > > this obviously isn't urgent, just be nice when we un-RFC. > > > > Thanks for sending as RFC, appreciated, keen to figure out a way forward > > with this series and this gives us space to discuss. > > > > One thing that came up recently with the LWN article (...!) was that rr is > > also impacted by this [0]. > > > > I think with this behind a config flag we're fine (this refers to my > > 'opt-in' comment in the reply on LWN) as my concerns about this being > > enabled in a broken way without an explicit kernel configuration are > > addressed, and actually we do expose a means by which a user can detect if > > the VDSO for instance is sealed via /proc/$pid/[s]maps. > > > > So tools like rr and such can be updated to check for this. I wonder if we > > ought to try to liaise with the known problematic ones? > > > > It'd be nice to update the documentation to have a list of 'known > > problematic userland software with sealed VDSO' so we make people aware. > > > > Hopefully we are acheiving the opt-in nature of the thing here, but it > > makes me wonder whether we need a prctl() interface to optionally disable > > even if the system has it enabled as a whole. > > Just noting that (as we discussed off-list) doing prctl() would not > work, because that would effectively be an munseal for those vdso > regions. > Possibly something like a personality() flag (that's *not* inherited > when AT_SECURE/secureexec). But personalities have other issues... Thanks, yeah that's a good point, it would have to be implemented as a personality or something similar otherwise you're essentially relying on 'unsealing' which can't be permitted. I'm not sure how useful that'd be for the likes of rr though. But I suppose if it makes everything exec'd by a child inherit it then maybe that works for a debugging session etc.? > > FWIW, although it would (at the moment) be hard to pull off in the > libc, I still much prefer it to playing these weird games with CONFIG > options and kernel command line options and prctl and personality and > whatnot. It seems to me like we're trying to stick policy where it > doesn't belong. The problem is, as a security feature, you don't want to make it trivially easy to disable. I mean we _need_ a config option to be able to strictly enforce only making the feature enable-able on architectures and configuration option combinations that work. But if there is userspace that will be broken, we really have to have some way of avoiding the disconnect between somebody making policy decision at the kernel level and somebody trying to run something. Because I can easily envision somebody enabling this as a 'good security feature' for a distro release or such, only for somebody else to later try rr, CRIU, or whatever else and for it to just not work or fail subtly and to have no idea why. I mean one option is to have it as a CONFIG_ flag _and_ you have to enable it via a tunable, so then it can become sysctl.d policy for instance. The CONFIG_ flag dependency is critical because we don't want to enable this on arches that have not been tested against it. It's vital at any rate that we document everywhere we can that _this might break some userland that depends on remapping the VDSO_. > > -- > Pedro
Re: [RFC PATCH v5 0/7] mseal system mappings
On Wed, 2025-02-12 at 14:01 +, Lorenzo Stoakes wrote: > Thanks, yeah that's a good point, it would have to be implemented as a > personality or something similar otherwise you're essentially relying on > 'unsealing' which can't be permitted. > > I'm not sure how useful that'd be for the likes of rr though. But I suppose > if it makes everything exec'd by a child inherit it then maybe that works > for a debugging session etc.? For whatever that's worth, ARCH=um should not need 'unsealing' or 'not sealing' it for *itself*, but rather only for the *children* it starts, which are for the userspace processes inside of it. Which I suppose could actually start without a VDSO in the first place, but I don't think that's possible now? Which I'll note should not have access to the host, so in a way this outer security feature (sealing) breaks the inner ARCH=um security, I think. johannes
[PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info
Remove the hidden assumption that options are allocated at the end of the struct, and teach the compiler about them using a flexible array. With this, we can revert the unsafe_memcpy() call we have in tun_dst_unclone() [1], and resolve the false field-spanning write warning caused by the memcpy() in ip_tunnel_info_opts_set(). The layout of struct ip_tunnel_info remains the same with this patch. Before this patch, there was an implicit padding at the end of the struct, options would be written at 'info + 1' which is after the padding. This will remain the same as this patch explicitly aligns 'options'. The alignment is needed as the options are later casted to different structs, and might result in unaligned memory access. Pahole output before this patch: struct ip_tunnel_info { struct ip_tunnel_key key; /* 064 */ /* XXX last struct has 1 byte of padding */ /* --- cacheline 1 boundary (64 bytes) --- */ struct ip_tunnel_encap encap;/*64 8 */ struct dst_cache dst_cache;/*7216 */ u8 options_len; /*88 1 */ u8 mode; /*89 1 */ /* size: 96, cachelines: 2, members: 5 */ /* padding: 6 */ /* paddings: 1, sum paddings: 1 */ /* last cacheline: 32 bytes */ }; Pahole output after this patch: struct ip_tunnel_info { struct ip_tunnel_key key; /* 064 */ /* XXX last struct has 1 byte of padding */ /* --- cacheline 1 boundary (64 bytes) --- */ struct ip_tunnel_encap encap;/*64 8 */ struct dst_cache dst_cache;/*7216 */ u8 options_len; /*88 1 */ u8 mode; /*89 1 */ /* XXX 6 bytes hole, try to pack */ u8 options[] __attribute__((__aligned__(8))); /* 96 0 */ /* size: 96, cachelines: 2, members: 6 */ /* sum members: 90, holes: 1, sum holes: 6 */ /* paddings: 1, sum paddings: 1 */ /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */ /* last cacheline: 32 bytes */ } __attribute__((__aligned__(8))); [1] Commit 13cfd6a6d7ac ("net: Silence false field-spanning write warning in metadata_dst memcpy") Link: https://lore.kernel.org/all/53d1d353-b8f6-4adc-8f29-8c48a7c9c...@kernel.org/ Suggested-by: Kees Cook Reviewed-by: Cosmin Ratiu Reviewed-by: Tariq Toukan Signed-off-by: Gal Pressman --- Changelog - v1->v2: https://lore.kernel.org/netdev/20250209101853.15828-1-...@nvidia.com/ * Remove change in struct layout, align 'options' field explicitly (Ilya, Kees, Jakub). * Change allocation I missed in v1 in metadata_dst_alloc_percpu(). --- .../mellanox/mlx5/core/en/tc_tun_encap.c | 4 +--- .../mellanox/mlx5/core/en/tc_tun_vxlan.c | 2 +- .../ethernet/netronome/nfp/flower/action.c| 4 ++-- drivers/net/pfcp.c| 2 +- drivers/net/vxlan/vxlan_core.c| 4 ++-- include/net/dst_metadata.h| 7 ++ include/net/ip_tunnels.h | 11 +++-- net/core/dst.c| 6 +++-- net/ipv4/ip_gre.c | 4 ++-- net/ipv4/ip_tunnel_core.c | 24 +-- net/ipv6/ip6_gre.c| 4 ++-- net/openvswitch/flow_netlink.c| 4 ++-- net/psample/psample.c | 2 +- net/sched/act_tunnel_key.c| 12 +- 14 files changed, 41 insertions(+), 49 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c index e7e01f3298ef..d9f40cf8198d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c @@ -620,9 +620,7 @@ bool mlx5e_tc_tun_encap_info_equal_options(struct mlx5e_encap_key *a, b_info = container_of(b->ip_tun_key, struct ip_tunnel_info, key); return a_info->options_len == b_info->options_len && - !memcmp(ip_tunnel_info_opts(a_info), - ip_tunnel_info_opts(b_info), - a_info->options_len); + !memcmp(a_info->options, b_info->options, a_info->options_len); } static int cmp_decap_info(struct mlx5e_decap_key *a, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c index e4e487c8431b..561c874b0825 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c @@ -100,7 +100,7 @@ static int mlx5e_gen_ip_tunnel_header_vxlan(char buf[], vxh->vx_flags = VXLAN_HF_VNI; vxh->vx_vni = vxlan_vni_field(tun_id);
Re: [RFC PATCH v5 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed
On Wed, Feb 12, 2025 at 03:21:50AM +, jef...@chromium.org wrote: > From: Jeff Xu > > Add code to detect if the vdso is memory sealed, skip the test > if it is. > > Signed-off-by: Jeff Xu > --- > .../testing/selftests/x86/test_mremap_vdso.c | 38 +++ > 1 file changed, 38 insertions(+) > > diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c > b/tools/testing/selftests/x86/test_mremap_vdso.c > index d53959e03593..c68077c56b22 100644 > --- a/tools/testing/selftests/x86/test_mremap_vdso.c > +++ b/tools/testing/selftests/x86/test_mremap_vdso.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -55,13 +56,50 @@ static int try_to_remap(void *vdso_addr, unsigned long > size) > > } > > +#define VDSO_NAME "[vdso]" > +#define VMFLAGS "VmFlags:" > +#define MSEAL_FLAGS "sl" > +#define MAX_LINE_LEN 512 > + > +bool vdso_sealed(FILE *maps) > +{ > + char line[MAX_LINE_LEN]; > + bool has_vdso = false; > + > + while (fgets(line, sizeof(line), maps)) { > + if (strstr(line, VDSO_NAME)) > + has_vdso = true; > + > + if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) { > + if (strstr(line, MSEAL_FLAGS)) > + return true; > + > + return false; This only tests that any mapping after the vdso is sealed. There is a real parser for /proc/self/smaps in tools/testing/selftests/mm/vm_util.[ch], see check_vmflag_io(). > + } > + } > + > + return false; > +}
Re: [RFC PATCH v5 1/7] mseal, system mappings: kernel config and header change
* jef...@chromium.org [250211 22:22]: > From: Jeff Xu > > Provide infrastructure to mseal system mappings. Establish > two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS, > ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and a header file (userprocess.h) > for future patches. > > As discussed during mseal() upstream process [1], mseal() protects > the VMAs of a given virtual memory range against modifications, such > as the read/write (RW) and no-execute (NX) bits. For complete > descriptions of memory sealing, please see mseal.rst [2]. > > The mseal() is useful to mitigate memory corruption issues where a > corrupted pointer is passed to a memory management system. For > example, such an attacker primitive can break control-flow integrity > guarantees since read-only memory that is supposed to be trusted can > become writable or .text pages can get remapped. > > The system mappings are readonly only, memory sealing can protect > them from ever changing to writable or unmmap/remapped as different > attributes. > > System mappings such as vdso, vvar, and sigpage (arm), vectors (arm) > are created by the kernel during program initialization, and could > be sealed after creation. > > Unlike the aforementioned mappings, the uprobe mapping is not > established during program startup. However, its lifetime is the same > as the process's lifetime [3]. It could be sealed from creation. > > The vsyscall on x86-64 uses a special address (0xff60), > which is outside the mm managed range. This means mprotect, munmap, and > mremap won't work on the vsyscall. Since sealing doesn't enhance > the vsyscall's security, it is skipped in this patch. If we ever seal > the vsyscall, it is probably only for decorative purpose, i.e. showing > the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored. > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > alter the system mappings during restore operations. UML(User Mode Linux) > and gVisor are also known to change the vdso/vvar mappings. Consequently, > this feature cannot be universally enabled across all systems. As such, > CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default. > > To support mseal of system mappings, architectures must define > CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings > calls to pass mseal flag. Additionally, architectures must confirm they > do not unmap/remap system mappings during the process lifetime. > > In this version, we've improved the handling of system mapping sealing from > previous versions, instead of modifying the _install_special_mapping > function itself, which would affect all architectures, we now call > _install_special_mapping with a sealing flag only within the specific > architecture that requires it. This targeted approach offers two key > advantages: 1) It limits the code change's impact to the necessary > architectures, and 2) It aligns with the software architecture by keeping > the core memory management within the mm layer, while delegating the > decision of sealing system mappings to the individual architecture, which > is particularly relevant since 32-bit architectures never require sealing. > > Additionally, this patch introduces a new header, > include/linux/usrprocess.h, which contains the mseal_system_mappings() > function. This function helps the architecture determine if system > mapping is enabled within the current kernel configuration. It can be > extended in the future to handle opt-in/out prctl for enabling system > mapping sealing at the process level or a kernel cmdline feature. > > A new header file was introduced because it was difficult to find the > best location for this function. Although include/linux/mm.h was > considered, this feature is more closely related to user processes > than core memory management. Additionally, future prctl or kernel > cmd-line implementations for this feature would not fit within the > scope of core memory management or mseal.c. This is relevant because > if we add unit-tests for mseal.c in the future, we would want to limit > mseal.c's dependencies to core memory management. > > Prior to this patch series, we explored sealing special mappings from > userspace using glibc's dynamic linker. This approach revealed several > issues: > - The PT_LOAD header may report an incorrect length for vdso, (smaller > than its actual size). The dynamic linker, which relies on PT_LOAD > information to determine mapping size, would then split and partially > seal the vdso mapping. Since each architecture has its own vdso/vvar > code, fixing this in the kernel would require going through each > archiecture. Our initial goal was to enable sealing readonly mappings, > e.g. .text, across all architectures, sealing vdso from kernel since > creation appears to be simpler than sealing vdso at glibc. > - The [vvar] mapping header only contains address information, not length > information. Similar issues might exist for
Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc
On Wed, Feb 12, 2025 at 04:15:03PM +0800, GONG Ruiqi wrote: > Hi, > > v3: > - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into > mm/slub.c > - some rewording for commit logs > v2: > https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/ > - change the implementation as Vlastimil suggested > v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/ > > Tamás reported [1] that kmalloc cache randomization doesn't actually > work for those kmalloc invoked via kvmalloc. For more details, see the > commit log of patch 2. > > The current solution requires a direct call from __kvmalloc_node_noprof > to __do_kmalloc_node, a static function in a different .c file. As > suggested by Vlastimil [2], it's achieved by simply moving > __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some > other functions of the same family. Hi, GONG! Sorry for my late review. This patch series looks good to me (with a nit), Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> Also, I verified that the problem you described exists on slab/for-next, and the patch series fixes the problem. Please feel free to add, Tested-by: Hyeonggon Yoo <42.hye...@gmail.com> nit: Does it make sense to call __kvmalloc_node_track_caller_noprof() instead of __do_kmalloc_node() to avoid bloating the code size? My simple build test says it saves 1592 bytes: $ ./scripts/bloat-o-meter slub.o.before slub.o.after add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-1592 (-1592) Function old new delta __kvmalloc_node_noprof.cold 39 - -39 __kvmalloc_node_noprof 1755 202 -1553 Total: Before=79723, After=78131, chg -2.00% > Link: > https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 > [1] > Link: > https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2] -- Harry
Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc
On 2/12/25 15:20, Harry Yoo wrote: > On Wed, Feb 12, 2025 at 04:15:03PM +0800, GONG Ruiqi wrote: >> Hi, >> >> v3: >> - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into >> mm/slub.c >> - some rewording for commit logs >> v2: >> https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/ >> - change the implementation as Vlastimil suggested >> v1: >> https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/ >> >> Tamás reported [1] that kmalloc cache randomization doesn't actually >> work for those kmalloc invoked via kvmalloc. For more details, see the >> commit log of patch 2. >> >> The current solution requires a direct call from __kvmalloc_node_noprof >> to __do_kmalloc_node, a static function in a different .c file. As >> suggested by Vlastimil [2], it's achieved by simply moving >> __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some >> other functions of the same family. > > Hi, GONG! > Sorry for my late review. > > This patch series looks good to me (with a nit), > Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com> > > Also, I verified that the problem you described exists on slab/for-next, > and the patch series fixes the problem. Please feel free to add, > Tested-by: Hyeonggon Yoo <42.hye...@gmail.com> Thanks! > nit: Does it make sense to call __kvmalloc_node_track_caller_noprof() > instead of __do_kmalloc_node() to avoid bloating the code size? Hm I think it would be a bit arbitrary to make kvmalloc special like this here. But we should probably change __do_kmalloc_node() to __fastpath_inline. Or even check if not making it inline at all results in other callers (not kvmalloc probably due to its complexity) doing a tail call there, which should be fast enough. > My simple build test says it saves 1592 bytes: > $ ./scripts/bloat-o-meter slub.o.before slub.o.after > add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-1592 (-1592) > Function old new delta > __kvmalloc_node_noprof.cold 39 - -39 > __kvmalloc_node_noprof 1755 202 -1553 > Total: Before=79723, After=78131, chg -2.00% > >> Link: >> https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 >> [1] >> Link: >> https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2] >
Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc
On 2/12/25 09:15, GONG Ruiqi wrote: > Hi, > > v3: > - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into > mm/slub.c > - some rewording for commit logs > v2: > https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/ > - change the implementation as Vlastimil suggested > v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/ > > Tamás reported [1] that kmalloc cache randomization doesn't actually > work for those kmalloc invoked via kvmalloc. For more details, see the > commit log of patch 2. > > The current solution requires a direct call from __kvmalloc_node_noprof > to __do_kmalloc_node, a static function in a different .c file. As > suggested by Vlastimil [2], it's achieved by simply moving > __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some > other functions of the same family. > > Link: > https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 > [1] > Link: > https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2] > > GONG Ruiqi (2): > slab: Adjust placement of __kvmalloc_node_noprof > slab: Achieve better kmalloc caches randomization in kvmalloc Applied to slab/for-next, thanks! > > mm/slub.c | 162 ++ > mm/util.c | 162 -- > 2 files changed, 162 insertions(+), 162 deletions(-) >
Re: [RFC PATCH v5 0/7] mseal system mappings
On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote: > From: Jeff Xu > > The commit message in the first patch contains the full description of > this series. Sorry to nit, but it'd be useful to reproduce in the cover letter too! But this obviously isn't urgent, just be nice when we un-RFC. Thanks for sending as RFC, appreciated, keen to figure out a way forward with this series and this gives us space to discuss. One thing that came up recently with the LWN article (...!) was that rr is also impacted by this [0]. I think with this behind a config flag we're fine (this refers to my 'opt-in' comment in the reply on LWN) as my concerns about this being enabled in a broken way without an explicit kernel configuration are addressed, and actually we do expose a means by which a user can detect if the VDSO for instance is sealed via /proc/$pid/[s]maps. So tools like rr and such can be updated to check for this. I wonder if we ought to try to liaise with the known problematic ones? It'd be nice to update the documentation to have a list of 'known problematic userland software with sealed VDSO' so we make people aware. Hopefully we are acheiving the opt-in nature of the thing here, but it makes me wonder whether we need a prctl() interface to optionally disable even if the system has it enabled as a whole. That way, rr for instance can just turn it off for debugging purposes. To be clear I am not trying to add additional extranous tasks here - my one and only motive is to ensure that the feature works and we address concerns about any possible breakage. And I _want the series to land_ :>) I suspect we're close now. I am tied up with a number of other tasks at the moment so apologies if I take a second to get back to this series, but just wanted to say thanks for addressing my various points, and that I will definitely provide review (it's on the whiteboard, the only true guarantee I will do something :P). I will also come back to your testing series which I owe you review on, which is equally on the same whiteboard... Thanks, Lorenzo [0]:https://lwn.net/Articles/1007984/ > > -- > History: > > V5 > - Remove kernel cmd line (Lorenzo Stoakes) > - Add test info (Lorenzo Stoakes) > - Add threat model info (Lorenzo Stoakes) > - Fix x86 selftest: test_mremap_vdso > - Restrict code change to ARM64/x86-64/UM arch only. > - Add userprocess.h to include seal_system_mapping(). > - Remove sealing vsyscall. > - Split the patch. > > V4: > https://lore.kernel.org/all/20241125202021.3684919-1-jef...@google.com/ > > V3: > https://lore.kernel.org/all/20241113191602.3541870-1-jef...@google.com/ > > V2: > https://lore.kernel.org/all/20241014215022.68530-1-jef...@google.com/ > > V1: > https://lore.kernel.org/all/20241004163155.3493183-1-jef...@google.com/ > > Jeff Xu (7): > mseal, system mappings: kernel config and header change > selftests: x86: test_mremap_vdso: skip if vdso is msealed > mseal, system mappings: enable x86-64 > mseal, system mappings: enable arm64 > mseal, system mappings: enable uml architecture > mseal, system mappings: uprobe mapping > mseal, system mappings: update mseal.rst > > Documentation/userspace-api/mseal.rst | 5 +++ > arch/arm64/Kconfig| 1 + > arch/arm64/kernel/vdso.c | 23 +++ > arch/um/Kconfig | 1 + > arch/x86/Kconfig | 1 + > arch/x86/entry/vdso/vma.c | 17 ++--- > arch/x86/um/vdso/vma.c| 7 +++- > include/linux/userprocess.h | 18 + > init/Kconfig | 18 + > kernel/events/uprobes.c | 6 ++- > security/Kconfig | 18 + > .../testing/selftests/x86/test_mremap_vdso.c | 38 +++ > 12 files changed, 137 insertions(+), 16 deletions(-) > create mode 100644 include/linux/userprocess.h > > -- > 2.48.1.502.g6dc24dfdaf-goog >
Re: [PATCH v2 6/6] unicode: kunit: change tests filename and path
Kees Cook writes: > From: Gabriela Bittencourt > > Change utf8 kunit test filename and path to follow the style > convention on Documentation/dev-tools/kunit/style.rst > > Co-developed-by: Pedro Orlando > Signed-off-by: Pedro Orlando > Co-developed-by: Danilo Pereira > Signed-off-by: Danilo Pereira > Signed-off-by: Gabriela Bittencourt > Reviewed-by: David Gow > Acked-by: Gabriel Krisman Bertazi > Reviewed-by: Shuah Khan > Reviewed-by: Rae Moar > Link: https://lore.kernel.org/r/20241202075545.3648096-7-david...@google.com > Signed-off-by: Kees Cook Hi Kees, Last time this was submitted, I have reported this patch breaks the build when the kunit is built as a module, and it wasn't fixed before resubmission. Thorsten Leemhuis just reported the same issue is now on linux-next and I confirmed it is still there. https://lore.kernel.org/lkml/2c26c5e4-9cf3-4020-b0be-637dc826b...@leemhuis.info/T/#m2e2ddb3b2600274e5eae50b93ea821914dc85f20 The build error is: fs/unicode/tests/utf8_kunit.c:11:10: fatal error: utf8n.h: No such file or directory 11 | #include "utf8n.h" | ^ The problem is that the patch moves utf8_kunit.c into tests/ but doesn't fix the include line above. Can you fix it in your tree since it is already merged? Thanks, > --- > fs/unicode/Makefile| 2 +- > fs/unicode/{ => tests}/.kunitconfig| 0 > fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} | 0 > 3 files changed, 1 insertion(+), 1 deletion(-) > rename fs/unicode/{ => tests}/.kunitconfig (100%) > rename fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} (100%) > > diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile > index 37bbcbc628a1..d95be7fb9f6b 100644 > --- a/fs/unicode/Makefile > +++ b/fs/unicode/Makefile > @@ -4,7 +4,7 @@ ifneq ($(CONFIG_UNICODE),) > obj-y+= unicode.o > endif > obj-$(CONFIG_UNICODE)+= utf8data.o > -obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += utf8-selftest.o > +obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += tests/utf8_kunit.o > > unicode-y := utf8-norm.o utf8-core.o > > diff --git a/fs/unicode/.kunitconfig b/fs/unicode/tests/.kunitconfig > similarity index 100% > rename from fs/unicode/.kunitconfig > rename to fs/unicode/tests/.kunitconfig > diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/tests/utf8_kunit.c > similarity index 100% > rename from fs/unicode/utf8-selftest.c > rename to fs/unicode/tests/utf8_kunit.c -- Gabriel Krisman Bertazi
Re: [RFC] Circumventing FineIBT Via Entrypoints
On Wed, Feb 12, 2025 at 11:29:02PM +0100, Jann Horn wrote: > +Andy Lutomirski (X86 entry code maintainer) > > On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller wrote: > > As part of a recently accepted paper we demonstrated that syscall > > entrypoints can be misused on x86-64 systems to generically bypass > > FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We > > communicated this finding to s@k.o before submitting the paper and were > > encouraged to bring the issue to hardening after the paper was accepted to > > have a discussion on how to address the issue. > > > > The bypass takes advantage of the architectural requirement of entrypoints > > to begin with the endbr64 instruction and the ability to control GS_BASE > > from userspace via wrgsbase, from to the FSGSBASE extension, in order to > > perform a stack pivot to a ROP-chain. > > Oh, fun, that's a gnarly quirk. yeag :) > Since the kernel, as far as I understand, uses FineIBT without > backwards control flow protection (in other words, I think we assume > that the kernel stack is trusted?), could we build a cheaper > check on that basis somehow? For example, maybe we could do something like: > > ``` > endbr64 > test rsp, rsp > js slowpath > swapgs > ``` > > So we'd have the fast normal case where RSP points to userspace > (meaning we can't be coming from the kernel unless our stack has > already been pivoted, in which case forward edge protection alone > can't help anymore), and the slow case where RSP points to kernel > memory - in that case we'd then have to do some slower checks to > figure out whether weird userspace is making a syscall with RSP > pointing to the kernel, or whether we're coming from hijacked kernel > control flow. I've been tinkering this idea a bit and came with something. In short, we could have the slowpath branch as you suggested, in the slowpath permit the stack switch and preserving of the registers on the stack, but then do a sanity check according to the __per_cpu_offset array and decide from there whether we should continue executing the entrypoint or die/attempt to recover. Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint, I think more or less the same could be done for the other entrypoints. ``` endbr64 test rsp, rsp js slowpath swapgs ~~fastpath continues~~ ; path taken when rsp was a kernel address ; we have no choice really but to switch to the stack from the untrusted ; gsbase but after doing so we have to be careful about what we put on the ; stack slowpath: swapgs ; swap stacks as normal movQWORD PTR gs:[rip+0x7f005f85],rsp # 0x6014 movrsp,QWORD PTR gs:[rip+0x7f02c56d] # 0x2c618 ~~normal push and clear GPRs sequence here~~ ; we entered with an rsp in the kernel address range. ; we already did swapgs but we don't know if we can trust our gsbase yet. ; we should be able to trust the ro_after_init __per_cpu_offset array ; though. ; check that gsbase is the expected value for our current cpu rdtscp mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset> rdgsbase rbx cmp rbx, rax je fastpath_after_regs_preserved wrgsbase rax ; if we reach here we are being exploited and should explode or attempt ; to recover ``` The unfortunate part is that it would still result in the register state being dumped on top of some attacker controlled address, so if the error path is recoverable someone could still use entrypoints to convert control flow hijacking into memory corruption via register dump. So it would kill the ability to get ROP but it would still be possible to dump regs over modprobe_path, core_pattern, etc. Does this seem feasible and any better than the alternative of overwriting and restoring KERNEL_GS_BASE? ~Jennifer
Re: [RFC] Circumventing FineIBT Via Entrypoints
On 13/02/2025 2:09 am, Jann Horn wrote: > On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper > wrote: Assuming this is an issue you all feel is worth addressing, I will continue working on providing a patch. I'm concerned though that the overhead from adding a wrmsr on both syscall entry and exit to overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the approach or suggestions of alternate approaches to patching are welcome :) >>> Since the kernel, as far as I understand, uses FineIBT without >>> backwards control flow protection (in other words, I think we assume >>> that the kernel stack is trusted?), >> This is fun indeed. Linux cannot use supervisor shadow stacks because >> the mess around NMI re-entrancy (and IST more generally) requires ROP >> gadgets in order to function safely. Implementing this with shadow >> stacks active, while not impossible, is deemed to be prohibitively >> complicated. >> >> Linux's supervisor shadow stack support is waiting for FRED support, >> which fixes both the NMI re-entrancy problem, and other exceptions >> nesting within NMIs, as well as prohibiting the use of the SWAPGS >> instruction as FRED tries to make sure that the correct GS is always in >> context. >> >> But, FRED support is slated for PantherLake/DiamondRapids which haven't >> shipped yet, so are no use to the problem right now. >> >>> could we build a cheaper >>> check on that basis somehow? For example, maybe we could do something like: >>> >>> ``` >>> endbr64 >>> test rsp, rsp >>> js slowpath >>> swapgs >>> ``` >> I presume it's been pointed out already, but there are 3 related >> entrypoints here. SYSCALL64 which is discussed, SYSCALL32 and SYSENTER >> which are related. >> >> But, any other IDT entry is in a similar bucket. If we're corrupting a >> function pointer or return address to redirect here, then the check of >> CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers >> stack frame. >> >> For IDT entries, checking %rsp is reasonable, because userspace can't >> forge a kernel-like %rsp. However, SYSCALL64 specifically leaves %rsp >> entirely attacker controlled (and even potentially non-canonical), so >> I'm wondering what you hand in mind for the slowpath to truly >> distinguish kernel context from user context? > Hm, yeah, that seems hard - maybe the best we could do is to make sure > that the inactive gsbase has the correct value for our CPU's kernel > gsbase? Kinda like a paranoid_entry, except more painful because we'd > first have to figure out a place to spill registers to before we can > start using stuff like rdmsr... Then a function pointer overwrite > might still turn into returning to userspace with a sysret with GPRs > full of kernel pointers, but at least we wouldn't run off of a bogus > gsbase anymore? Thinking about this some more, I think it's impossible to distinguish. One of the many sharp edges of SYSCALL (and SYSENTER for that matter) is that they're instructions expected to be only be used by userspace, but that be executed in supervisor too[1]. They're asymmetric with their SYSRET (and SYSEXIT) counterparts which are CPL0 instructions that strictly transition into CPL3. The SYSCALL behaviour TLDR is: %rcx = %rip %r11 = %eflags %cs = fixed attr %ss = fixed attr %rip = MSR_LSTAR which means that %rcx (old rip) is the only piece of state which userspace can't feasibly forge (and therefore could distinguish a SYSCALL from user vs kernel mode), yet if we're talking about a JOP chain to get here, then %rcx is under attacker control too. There are a variety of solutions to this problem that involve not using %gs for per-cpu data. I also expect that to be wholly unpopular and dismissed as an approach. ~Andrew [1] No-one back then was brave enough to design CPL3-only instructions.
[PATCH][next] UAPI: ndctl / acpi: intel: Avoid multiple -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in order to avoid ending up with flexible-array members in the middle of other structs, we use the `__struct_group()` helper to separate the flexible array from the rest of the members in the flexible structure. We then use the newly created tagged `struct nd_cmd_pkg_hdr` to replace the type of the objects causing trouble (`pkg`) in multiple structs. So, with these changes, fix the following warnings: drivers/acpi/nfit/intel.c:692:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:59:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:586:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:522:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:411:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:358:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:322:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:281:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:238:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:199:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:157:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/acpi/nfit/intel.c:124:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva --- drivers/acpi/nfit/intel.c | 24 include/uapi/linux/ndctl.h | 15 +-- 2 files changed, 21 insertions(+), 18 deletions(-) diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index 3902759abcba..fe561ce0ddec 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -56,7 +56,7 @@ static unsigned long intel_security_flags(struct nvdimm *nvdimm, struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); unsigned long security_flags = 0; struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd_intel_get_security_state cmd; } nd_cmd = { .pkg = { @@ -121,7 +121,7 @@ static int intel_security_freeze(struct nvdimm *nvdimm) { struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd_intel_freeze_lock cmd; } nd_cmd = { .pkg = { @@ -154,7 +154,7 @@ static int intel_security_change_key(struct nvdimm *nvdimm, NVDIMM_INTEL_SET_MASTER_PASSPHRASE : NVDIMM_INTEL_SET_PASSPHRASE; struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd_intel_set_passphrase cmd; } nd_cmd = { .pkg = { @@ -196,7 +196,7 @@ static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm, { struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd_intel_unlock_unit cmd; } nd_cmd = { .pkg = { @@ -235,7 +235,7 @@ static int intel_security_disable(struct nvdimm *nvdimm, int rc; struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd_intel_disable_passphrase cmd; } nd_cmd = { .pkg = { @@ -278,7 +278,7 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm, unsigned int cmd = ptype == NVDIMM_MASTER ? NVDIMM_INTEL_MASTER_SECURE_ERASE : NVDIMM_INTEL_SECURE_ERASE; struct { - struct nd_cmd_pkg pkg; + struct nd_cmd_pkg_hdr pkg; struct nd
Re: [PATCH 00/10] Annotate arguments of memtostr/strtomem with __nonstring
Kees, > The memtostr*() and strtomem*() helpers are designed to move between C > strings (NUL-terminated) and byte arrays (that may just be zero padded > and may not be NUL-terminated). The "nonstring" attribute is used to > annotated these kinds of byte arrays, and we can validate the > annotation on the arguments of the helpers. Add the the infrastructure > to do this, and then update all the places where these annotations are > currently missing. Reviewed-by: Martin K. Petersen # SCSI -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info
On 12/02/2025 18:29, Alexander Lobakin wrote: > From: Gal Pressman > Date: Wed, 12 Feb 2025 16:09:53 +0200 > >> Remove the hidden assumption that options are allocated at the end of >> the struct, and teach the compiler about them using a flexible array. > > [...] > >> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h >> index 84c15402931c..4160731dcb6e 100644 >> --- a/include/net/dst_metadata.h >> +++ b/include/net/dst_metadata.h >> @@ -163,11 +163,8 @@ static inline struct metadata_dst >> *tun_dst_unclone(struct sk_buff *skb) >> if (!new_md) >> return ERR_PTR(-ENOMEM); >> >> -unsafe_memcpy(&new_md->u.tun_info, &md_dst->u.tun_info, >> - sizeof(struct ip_tunnel_info) + md_size, >> - /* metadata_dst_alloc() reserves room (md_size bytes) for >> - * options right after the ip_tunnel_info struct. >> - */); >> +memcpy(&new_md->u.tun_info, &md_dst->u.tun_info, >> + sizeof(struct ip_tunnel_info) + md_size); >> #ifdef CONFIG_DST_CACHE >> /* Unclone the dst cache if there is one */ >> if (new_md->u.tun_info.dst_cache.cache) { >> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h >> index 1aa31bdb2b31..517f78070be0 100644 >> --- a/include/net/ip_tunnels.h >> +++ b/include/net/ip_tunnels.h >> @@ -93,12 +93,6 @@ struct ip_tunnel_encap { >> GENMASK((sizeof_field(struct ip_tunnel_info,\ >>options_len) * BITS_PER_BYTE) - 1, 0) >> >> -#define ip_tunnel_info_opts(info) \ >> -_Generic(info, \ >> - const struct ip_tunnel_info * : ((const void *)((info) + 1)),\ >> - struct ip_tunnel_info * : ((void *)((info) + 1))\ >> -) > > You could leave this macro inplace and just change `(info) + 1` to > `(info)->options` avoiding changes in lots of files and adding casts > everywhere. I'd rather not, having a macro to do 'info->options' doesn't help code readability IMHO. > >> - >> struct ip_tunnel_info { >> struct ip_tunnel_keykey; >> struct ip_tunnel_encap encap; >> @@ -107,6 +101,7 @@ struct ip_tunnel_info { >> #endif >> u8 options_len; >> u8 mode; >> +u8 options[] __aligned(sizeof(void *)) >> __counted_by(options_len); > > Since 96 % 16 == 0, I'd check if __aligned_largest would change the > bytecode anyhow... Sometimes it does. __aligned_largest aligns the field to 16 bytes, so the struct remains the same (96 bytes). When should __aligned_larget be used? What needs 16 bytes alignment?
[RFC] Circumventing FineIBT Via Entrypoints
Hi All, As part of a recently accepted paper we demonstrated that syscall entrypoints can be misused on x86-64 systems to generically bypass FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We communicated this finding to s@k.o before submitting the paper and were encouraged to bring the issue to hardening after the paper was accepted to have a discussion on how to address the issue. The bypass takes advantage of the architectural requirement of entrypoints to begin with the endbr64 instruction and the ability to control GS_BASE from userspace via wrgsbase, from to the FSGSBASE extension, in order to perform a stack pivot to a ROP-chain. Here is a snippet of the 64-bit entrypoint code: ``` entry_SYSCALL_64: <+0>: endbr64 <+4>: swapgs <+7>: movQWORD PTR gs:0x6014,rsp <+16>:jmp <+18>:movrsp,cr3 <+21>:nop <+26>:andrsp,0xe7ff <+33>:movcr3,rsp <+36>:movrsp,QWORD PTR gs:0x32c98 ``` This is a valid target from any indirect callsite under FineIBT due to the endbr64 instruction and the lack of a software CFI check. After hijacking control flow to the entrypoint, executing swapgs will swap to the user controlled GS_BASE, which will be used to set the stack pointer, leading to a stack pivot. The rest of the entrypoint will execute with a hijacked GS_BASE on a user controlled stack. The stack page we use is one mapped in the user address space and from another thread we race overwriting returns addresses on the stack to pivot a second time to a ROP-chain. For this to succeed we required a large area of user-controlled kernel memory that can serve as the forged GS_BASE address, we did this by spraying 2MB Transparent Huge Pages to fill the kernel physical memory map with controlled 2MB allocations and guessing relative to the base address of the area to hit a page we control. We evaluated an approach to patching the issue in the paper but it touched the userspace API a bit, added an error code returned by syscalls if they are invoked with a kernel address in GS_BASE, which is not a great solution. Linus provided some thoughts on how to potentially address this issue in our communication with s@k.o, suggesting the kernel could make the KERNEL_GS_BASE match the GS_BASE value so both registers always contain a valid kernel address and a confusion induced by executing swapgs an extra time cannot occur, and restore the value of KERNEL_GS_BASE ahead of executing swapgs in the exit path. I started working on a patch based on the approach suggested by Linus but I haven't been able to get it passing the relevant x86 selftests yet. It turned out that it's more than the entrypoint code that needs to be modified for it to work, we need to correctly save and restore the user's GS_BASE across task switches and ensure it is updated correctly when set via arch_prctl and ptrace. Unfortunately, I lack familiarity with those parts of the kernel, and my understanding is that the paper will be made public in a couple weeks so I didn't want to delay too long on bringing the issue to this list. Assuming this is an issue you all feel is worth addressing, I will continue working on providing a patch. I'm concerned though that the overhead from adding a wrmsr on both syscall entry and exit to overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the approach or suggestions of alternate approaches to patching are welcome :) ~Jennifer
Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info
From: Gal Pressman Date: Wed, 12 Feb 2025 16:09:53 +0200 > Remove the hidden assumption that options are allocated at the end of > the struct, and teach the compiler about them using a flexible array. [...] > diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h > index 84c15402931c..4160731dcb6e 100644 > --- a/include/net/dst_metadata.h > +++ b/include/net/dst_metadata.h > @@ -163,11 +163,8 @@ static inline struct metadata_dst > *tun_dst_unclone(struct sk_buff *skb) > if (!new_md) > return ERR_PTR(-ENOMEM); > > - unsafe_memcpy(&new_md->u.tun_info, &md_dst->u.tun_info, > - sizeof(struct ip_tunnel_info) + md_size, > - /* metadata_dst_alloc() reserves room (md_size bytes) for > -* options right after the ip_tunnel_info struct. > -*/); > + memcpy(&new_md->u.tun_info, &md_dst->u.tun_info, > +sizeof(struct ip_tunnel_info) + md_size); > #ifdef CONFIG_DST_CACHE > /* Unclone the dst cache if there is one */ > if (new_md->u.tun_info.dst_cache.cache) { > diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h > index 1aa31bdb2b31..517f78070be0 100644 > --- a/include/net/ip_tunnels.h > +++ b/include/net/ip_tunnels.h > @@ -93,12 +93,6 @@ struct ip_tunnel_encap { > GENMASK((sizeof_field(struct ip_tunnel_info,\ > options_len) * BITS_PER_BYTE) - 1, 0) > > -#define ip_tunnel_info_opts(info)\ > - _Generic(info, \ > - const struct ip_tunnel_info * : ((const void *)((info) + 1)),\ > - struct ip_tunnel_info * : ((void *)((info) + 1))\ > - ) You could leave this macro inplace and just change `(info) + 1` to `(info)->options` avoiding changes in lots of files and adding casts everywhere. > - > struct ip_tunnel_info { > struct ip_tunnel_keykey; > struct ip_tunnel_encap encap; > @@ -107,6 +101,7 @@ struct ip_tunnel_info { > #endif > u8 options_len; > u8 mode; > + u8 options[] __aligned(sizeof(void *)) > __counted_by(options_len); Since 96 % 16 == 0, I'd check if __aligned_largest would change the bytecode anyhow... Sometimes it does. Thanks, Olek
Re: [RFC] Circumventing FineIBT Via Entrypoints
+Andy Lutomirski (X86 entry code maintainer) On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller wrote: > As part of a recently accepted paper we demonstrated that syscall > entrypoints can be misused on x86-64 systems to generically bypass > FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We > communicated this finding to s@k.o before submitting the paper and were > encouraged to bring the issue to hardening after the paper was accepted to > have a discussion on how to address the issue. > > The bypass takes advantage of the architectural requirement of entrypoints > to begin with the endbr64 instruction and the ability to control GS_BASE > from userspace via wrgsbase, from to the FSGSBASE extension, in order to > perform a stack pivot to a ROP-chain. Oh, fun, that's a gnarly quirk. > Here is a snippet of the 64-bit entrypoint code: > ``` > entry_SYSCALL_64: > <+0>: endbr64 > <+4>: swapgs > <+7>: movQWORD PTR gs:0x6014,rsp > <+16>:jmp > <+18>:movrsp,cr3 > <+21>:nop > <+26>:andrsp,0xe7ff > <+33>:movcr3,rsp > <+36>:movrsp,QWORD PTR gs:0x32c98 > ``` > > This is a valid target from any indirect callsite under FineIBT due to the > endbr64 instruction and the lack of a software CFI check. After hijacking > control flow to the entrypoint, executing swapgs will swap to the user > controlled GS_BASE, which will be used to set the stack pointer, leading to > a stack pivot. The rest of the entrypoint will execute with a hijacked > GS_BASE on a user controlled stack. The stack page we use is one mapped in > the user address space and from another thread we race overwriting returns > addresses on the stack to pivot a second time to a ROP-chain. For this to > succeed we required a large area of user-controlled kernel memory that can > serve as the forged GS_BASE address, we did this by spraying 2MB > Transparent Huge Pages to fill the kernel physical memory map with > controlled 2MB allocations and guessing relative to the base address of the > area to hit a page we control. > > We evaluated an approach to patching the issue in the paper but it touched > the userspace API a bit, added an error code returned by syscalls if they > are invoked with a kernel address in GS_BASE, which is not a great > solution. > > Linus provided some thoughts on how to potentially address this issue > in our communication with s@k.o, suggesting the kernel could make the > KERNEL_GS_BASE match the GS_BASE value so both registers always contain a > valid kernel address and a confusion induced by executing swapgs an extra > time cannot occur, and restore the value of KERNEL_GS_BASE ahead of > executing swapgs in the exit path. > > I started working on a patch based on the approach suggested by Linus but I > haven't been able to get it passing the relevant x86 selftests yet. It > turned out that it's more than the entrypoint code that needs to be > modified for it to work, we need to correctly save and restore the user's > GS_BASE across task switches and ensure it is updated correctly when set > via arch_prctl and ptrace. Unfortunately, I lack familiarity with those > parts of the kernel, and my understanding is that the paper will be made > public in a couple weeks so I didn't want to delay too long on bringing the > issue to this list. > > Assuming this is an issue you all feel is worth addressing, I will continue > working on providing a patch. I'm concerned though that the overhead from > adding a wrmsr on both syscall entry and exit to overwrite and restore the > KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the > approach or suggestions of alternate approaches to patching are welcome :) Since the kernel, as far as I understand, uses FineIBT without backwards control flow protection (in other words, I think we assume that the kernel stack is trusted?), could we build a cheaper check on that basis somehow? For example, maybe we could do something like: ``` endbr64 test rsp, rsp js slowpath swapgs ``` So we'd have the fast normal case where RSP points to userspace (meaning we can't be coming from the kernel unless our stack has already been pivoted, in which case forward edge protection alone can't help anymore), and the slow case where RSP points to kernel memory - in that case we'd then have to do some slower checks to figure out whether weird userspace is making a syscall with RSP pointing to the kernel, or whether we're coming from hijacked kernel control flow.
[PATCH] scsi: hpsa: Replace deprecated strncpy() with strscpy_pad()
strncpy() is deprecated for NUL-terminated destination buffers [1]. Replace memset() and strncpy() with strscpy_pad() to copy the version string and fill the remaining bytes in the destination buffer with NUL bytes. This avoids zeroing the memory before copying the string. Compile-tested only. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Cc: linux-hardening@vger.kernel.org Signed-off-by: Thorsten Blum --- drivers/scsi/hpsa.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c index 84d8de07b7ae..1334886b68d0 100644 --- a/drivers/scsi/hpsa.c +++ b/drivers/scsi/hpsa.c @@ -7238,8 +7238,7 @@ static int hpsa_controller_hard_reset(struct pci_dev *pdev, static void init_driver_version(char *driver_version, int len) { - memset(driver_version, 0, len); - strncpy(driver_version, HPSA " " HPSA_DRIVER_VERSION, len - 1); + strscpy_pad(driver_version, HPSA " " HPSA_DRIVER_VERSION, len - 1); } static int write_driver_ver_to_cfgtable(struct CfgTable __iomem *cfgtable) -- 2.48.1
[PATCH v3 2/2] slab: Achieve better kmalloc caches randomization in kvmalloc
As revealed by this writeup[1], due to the fact that __kmalloc_node (now renamed to __kmalloc_node_noprof) is an exported symbol and will never get inlined, using it in kvmalloc_node (now is __kvmalloc_node_noprof) would make the RET_IP inside always point to the same address: upper_caller kvmalloc kvmalloc_node kvmalloc_node_noprof __kvmalloc_node_noprof <-- all macros all the way down here __kmalloc_node_noprof __do_kmalloc_node(.., _RET_IP_) ... <-- _RET_IP_ points to That literally means all kmalloc invoked via kvmalloc would use the same seed for cache randomization (CONFIG_RANDOM_KMALLOC_CACHES), which makes this hardening non-functional. The root cause of this problem, IMHO, is that using RET_IP only cannot identify the actual allocation site in case of kmalloc being called inside non-inlined wrappers or helper functions. And I believe there could be similar cases in other functions. Nevertheless, I haven't thought of any good solution for this. So for now let's solve this specific case first. For __kvmalloc_node_noprof, replace __kmalloc_node_noprof and call __do_kmalloc_node directly instead, so that RET_IP can take the return address of kvmalloc and differentiate each kvmalloc invocation: upper_caller kvmalloc kvmalloc_node kvmalloc_node_noprof __kvmalloc_node_noprof <-- all macros all the way down here __do_kmalloc_node(.., _RET_IP_) ... <-- _RET_IP_ points to Thanks to Tamás Koczka for the report and discussion! Link: https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 [1] Reported-by: Tamás Koczka Signed-off-by: GONG Ruiqi --- mm/slub.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index abc982d68feb..1f7d1d260eeb 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4925,9 +4925,9 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) * It doesn't really make sense to fallback to vmalloc for sub page * requests */ - ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b), - kmalloc_gfp_adjust(flags, size), - node); + ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), + kmalloc_gfp_adjust(flags, size), + node, _RET_IP_); if (ret || size <= PAGE_SIZE) return ret; -- 2.25.1
[PATCH v3 1/2] slab: Adjust placement of __kvmalloc_node_noprof
Move __kvmalloc_node_noprof (as well as kvfree*, kvrealloc_noprof and kmalloc_gfp_adjust for consistency) into mm/slub.c so that it can directly invoke __do_kmalloc_node, which is needed for the next patch. No functional changes intended. Signed-off-by: GONG Ruiqi --- mm/slub.c | 162 ++ mm/util.c | 162 -- 2 files changed, 162 insertions(+), 162 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1f50129dcfb3..abc982d68feb 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4878,6 +4878,168 @@ void *krealloc_noprof(const void *p, size_t new_size, gfp_t flags) } EXPORT_SYMBOL(krealloc_noprof); +static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size) +{ + /* +* We want to attempt a large physically contiguous block first because +* it is less likely to fragment multiple larger blocks and therefore +* contribute to a long term fragmentation less than vmalloc fallback. +* However make sure that larger requests are not too disruptive - no +* OOM killer and no allocation failure warnings as we have a fallback. +*/ + if (size > PAGE_SIZE) { + flags |= __GFP_NOWARN; + + if (!(flags & __GFP_RETRY_MAYFAIL)) + flags |= __GFP_NORETRY; + + /* nofail semantic is implemented by the vmalloc fallback */ + flags &= ~__GFP_NOFAIL; + } + + return flags; +} + +/** + * __kvmalloc_node - attempt to allocate physically contiguous memory, but upon + * failure, fall back to non-contiguous (vmalloc) allocation. + * @size: size of the request. + * @b: which set of kmalloc buckets to allocate from. + * @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL. + * @node: numa node to allocate from + * + * Uses kmalloc to get the memory but if the allocation fails then falls back + * to the vmalloc allocator. Use kvfree for freeing the memory. + * + * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY modifier. + * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is + * preferable to the vmalloc fallback, due to visible performance drawbacks. + * + * Return: pointer to the allocated memory of %NULL in case of failure + */ +void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) +{ + void *ret; + + /* +* It doesn't really make sense to fallback to vmalloc for sub page +* requests +*/ + ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b), + kmalloc_gfp_adjust(flags, size), + node); + if (ret || size <= PAGE_SIZE) + return ret; + + /* non-sleeping allocations are not supported by vmalloc */ + if (!gfpflags_allow_blocking(flags)) + return NULL; + + /* Don't even allow crazy sizes */ + if (unlikely(size > INT_MAX)) { + WARN_ON_ONCE(!(flags & __GFP_NOWARN)); + return NULL; + } + + /* +* kvmalloc() can always use VM_ALLOW_HUGE_VMAP, +* since the callers already cannot assume anything +* about the resulting pointer, and cannot play +* protection games. +*/ + return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END, + flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP, + node, __builtin_return_address(0)); +} +EXPORT_SYMBOL(__kvmalloc_node_noprof); + +/** + * kvfree() - Free memory. + * @addr: Pointer to allocated memory. + * + * kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). + * It is slightly more efficient to use kfree() or vfree() if you are certain + * that you know which one to use. + * + * Context: Either preemptible task context or not-NMI interrupt. + */ +void kvfree(const void *addr) +{ + if (is_vmalloc_addr(addr)) + vfree(addr); + else + kfree(addr); +} +EXPORT_SYMBOL(kvfree); + +/** + * kvfree_sensitive - Free a data object containing sensitive information. + * @addr: address of the data object to be freed. + * @len: length of the data object. + * + * Use the special memzero_explicit() function to clear the content of a + * kvmalloc'ed object containing sensitive data to make sure that the + * compiler won't optimize out the data clearing. + */ +void kvfree_sensitive(const void *addr, size_t len) +{ + if (likely(!ZERO_OR_NULL_PTR(addr))) { + memzero_explicit((void *)addr, len); + kvfree(addr); + } +} +EXPORT_SYMBOL(kvfree_sensitive); + +/** + * kvrealloc - reallocate memory; contents remain unchanged + * @p: object to reallocate memory for + * @size: the size to reallocate + * @flags: the flags for the page level allocator + * + * If @p is %NULL, kvrealloc() behaves exactly
[PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc
Hi, v3: - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into mm/slub.c - some rewording for commit logs v2: https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/ - change the implementation as Vlastimil suggested v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/ Tamás reported [1] that kmalloc cache randomization doesn't actually work for those kmalloc invoked via kvmalloc. For more details, see the commit log of patch 2. The current solution requires a direct call from __kvmalloc_node_noprof to __do_kmalloc_node, a static function in a different .c file. As suggested by Vlastimil [2], it's achieved by simply moving __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some other functions of the same family. Link: https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 [1] Link: https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2] GONG Ruiqi (2): slab: Adjust placement of __kvmalloc_node_noprof slab: Achieve better kmalloc caches randomization in kvmalloc mm/slub.c | 162 ++ mm/util.c | 162 -- 2 files changed, 162 insertions(+), 162 deletions(-) -- 2.25.1
Re: [PATCH] net/mlx4_core: Avoid impossible mlx4_db_alloc() order value
On Tue, Feb 11, 2025 at 6:22 AM Tariq Toukan wrote: > > > > On 11/02/2025 2:01, Justin Stitt wrote: > > On Mon, Feb 10, 2025 at 09:45:05AM -0800, Kees Cook wrote: > >> GCC can see that the value range for "order" is capped, but this leads > >> it to consider that it might be negative, leading to a false positive > >> warning (with GCC 15 with -Warray-bounds -fdiagnostics-details): > >> > >> ../drivers/net/ethernet/mellanox/mlx4/alloc.c:691:47: error: array > >> subscript -1 is below array bounds of 'long unsigned int *[2]' > >> [-Werror=array-bounds=] > >>691 | i = find_first_bit(pgdir->bits[o], > >> MLX4_DB_PER_PAGE >> o); > >>|~~~^~~ > >>'mlx4_alloc_db_from_pgdir': events 1-2 > >>691 | i = find_first_bit(pgdir->bits[o], > >> MLX4_DB_PER_PAGE >> o);| > >> ^ > >>| | | > >> | | > >> (2) out of array bounds here > >>| (1) when the condition is evaluated to true > >> In file included from > >> ../drivers/net/ethernet/mellanox/mlx4/mlx4.h:53, > >> from ../drivers/net/ethernet/mellanox/mlx4/alloc.c:42: > >> ../include/linux/mlx4/device.h:664:33: note: while referencing 'bits' > >>664 | unsigned long *bits[2]; > >>| ^~~~ > >> > >> Switch the argument to unsigned int, which removes the compiler needing > >> to consider negative values. > >> > >> Signed-off-by: Kees Cook > >> --- > >> Cc: Tariq Toukan > >> Cc: Andrew Lunn > >> Cc: "David S. Miller" > >> Cc: Eric Dumazet > >> Cc: Jakub Kicinski > >> Cc: Paolo Abeni > >> Cc: Yishai Hadas > >> Cc: net...@vger.kernel.org > >> Cc: linux-r...@vger.kernel.org > >> --- > >> drivers/net/ethernet/mellanox/mlx4/alloc.c | 6 +++--- > >> include/linux/mlx4/device.h| 2 +- > >> 2 files changed, 4 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c > >> b/drivers/net/ethernet/mellanox/mlx4/alloc.c > >> index b330020dc0d6..f2bded847e61 100644 > >> --- a/drivers/net/ethernet/mellanox/mlx4/alloc.c > >> +++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c > >> @@ -682,9 +682,9 @@ static struct mlx4_db_pgdir > >> *mlx4_alloc_db_pgdir(struct device *dma_device) > >> } > >> > >> static int mlx4_alloc_db_from_pgdir(struct mlx4_db_pgdir *pgdir, > >> -struct mlx4_db *db, int order) > >> +struct mlx4_db *db, unsigned int order) > >> { > >> -int o; > >> +unsigned int o; > >> int i; > >> > >> for (o = order; o <= 1; ++o) { > > > >^ Knowing now that @order can only be 0 or 1 can this for loop (and > >goto) be dropped entirely? > > > > Maybe I'm missing something... > Can you please explain why you think this can be dropped? I meant "rewritten to use two if statements" instead of "dropped". I think "replaced" or "refactored" was the word I wanted. > > > >The code is already short and sweet so I don't feel strongly either > >way. > > > >> @@ -712,7 +712,7 @@ static int mlx4_alloc_db_from_pgdir(struct > >> mlx4_db_pgdir *pgdir, > >> return 0; > >> } > >> > >> -int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order) > >> +int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, unsigned int > >> order) > >> { > >> struct mlx4_priv *priv = mlx4_priv(dev); > >> struct mlx4_db_pgdir *pgdir; > >> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h > >> index 27f42f713c89..86f0f2a25a3d 100644 > >> --- a/include/linux/mlx4/device.h > >> +++ b/include/linux/mlx4/device.h > >> @@ -1135,7 +1135,7 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct > >> mlx4_mtt *mtt, > >> int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt, > >> struct mlx4_buf *buf); > >> > >> -int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order); > >> +int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, unsigned int > >> order); > >> void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db); > >> > >> int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources > >> *wqres, > >> -- > >> 2.34.1 > >> > > > > Justin > > >
Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info
On Wed, 12 Feb 2025 20:13:28 +0200 Gal Pressman wrote: > > You could leave this macro inplace and just change `(info) + 1` to > > `(info)->options` avoiding changes in lots of files and adding casts > > everywhere. +1 > I'd rather not, having a macro to do 'info->options' doesn't help code > readability IMHO. It'd avoid having to explicitly cast the types everywhere, too.
Re: [RFC] Circumventing FineIBT Via Entrypoints
On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper wrote: > >> Assuming this is an issue you all feel is worth addressing, I will > >> continue working on providing a patch. I'm concerned though that the > >> overhead from adding a wrmsr on both syscall entry and exit to > >> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so > >> any feedback in regards to the approach or suggestions of alternate > >> approaches to patching are welcome :) > > > > Since the kernel, as far as I understand, uses FineIBT without > > backwards control flow protection (in other words, I think we assume > > that the kernel stack is trusted?), > > This is fun indeed. Linux cannot use supervisor shadow stacks because > the mess around NMI re-entrancy (and IST more generally) requires ROP > gadgets in order to function safely. Implementing this with shadow > stacks active, while not impossible, is deemed to be prohibitively > complicated. > > Linux's supervisor shadow stack support is waiting for FRED support, > which fixes both the NMI re-entrancy problem, and other exceptions > nesting within NMIs, as well as prohibiting the use of the SWAPGS > instruction as FRED tries to make sure that the correct GS is always in > context. > > But, FRED support is slated for PantherLake/DiamondRapids which haven't > shipped yet, so are no use to the problem right now. > > > could we build a cheaper > > check on that basis somehow? For example, maybe we could do something like: > > > > ``` > > endbr64 > > test rsp, rsp > > js slowpath > > swapgs > > ``` > > I presume it's been pointed out already, but there are 3 related > entrypoints here. SYSCALL64 which is discussed, SYSCALL32 and SYSENTER > which are related. > > But, any other IDT entry is in a similar bucket. If we're corrupting a > function pointer or return address to redirect here, then the check of > CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers > stack frame. > > For IDT entries, checking %rsp is reasonable, because userspace can't > forge a kernel-like %rsp. However, SYSCALL64 specifically leaves %rsp > entirely attacker controlled (and even potentially non-canonical), so > I'm wondering what you hand in mind for the slowpath to truly > distinguish kernel context from user context? Hm, yeah, that seems hard - maybe the best we could do is to make sure that the inactive gsbase has the correct value for our CPU's kernel gsbase? Kinda like a paranoid_entry, except more painful because we'd first have to figure out a place to spill registers to before we can start using stuff like rdmsr... Then a function pointer overwrite might still turn into returning to userspace with a sysret with GPRs full of kernel pointers, but at least we wouldn't run off of a bogus gsbase anymore?
Re: [RFC] Circumventing FineIBT Via Entrypoints
>> Assuming this is an issue you all feel is worth addressing, I will >> continue working on providing a patch. I'm concerned though that the >> overhead from adding a wrmsr on both syscall entry and exit to >> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so >> any feedback in regards to the approach or suggestions of alternate >> approaches to patching are welcome :) > > Since the kernel, as far as I understand, uses FineIBT without > backwards control flow protection (in other words, I think we assume > that the kernel stack is trusted?), This is fun indeed. Linux cannot use supervisor shadow stacks because the mess around NMI re-entrancy (and IST more generally) requires ROP gadgets in order to function safely. Implementing this with shadow stacks active, while not impossible, is deemed to be prohibitively complicated. Linux's supervisor shadow stack support is waiting for FRED support, which fixes both the NMI re-entrancy problem, and other exceptions nesting within NMIs, as well as prohibiting the use of the SWAPGS instruction as FRED tries to make sure that the correct GS is always in context. But, FRED support is slated for PantherLake/DiamondRapids which haven't shipped yet, so are no use to the problem right now. > could we build a cheaper > check on that basis somehow? For example, maybe we could do something like: > > ``` > endbr64 > test rsp, rsp > js slowpath > swapgs > ``` I presume it's been pointed out already, but there are 3 related entrypoints here. SYSCALL64 which is discussed, SYSCALL32 and SYSENTER which are related. But, any other IDT entry is in a similar bucket. If we're corrupting a function pointer or return address to redirect here, then the check of CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers stack frame. For IDT entries, checking %rsp is reasonable, because userspace can't forge a kernel-like %rsp. However, SYSCALL64 specifically leaves %rsp entirely attacker controlled (and even potentially non-canonical), so I'm wondering what you hand in mind for the slowpath to truly distinguish kernel context from user context? ~Andrew
RE: [PATCH] wifi: ath12k: Fix uninitialized variable and remove goto
Jeff Johnson wrote: > On 2/10/2025 8:09 PM, Ping-Ke Shih wrote: > > Addresses-Coverity-ID: 1642337 ("Uninitialized scalar variable") > >>> > >>> Is that an official kernel tag? IMO the proper tag would be > >> So, it isn't "official" as far as I can tell, but it is widely used in > >> other commits, especially by Gustavo Silva. > >> > >> Also: > >> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=778e2478d19574508861bcb > >> 1806d6e34d095587c > >> > >> Coverity-IDs: is another option I have found. I have seen Closes: a few > >> times as well. I'm not really sure what the best option is, honestly. > > > > In my patch, I used and treated Addresses-Coverity-ID as a unofficial tag, > > so additional empty line is added. > > > > Days ago I have received Coverity issues sent to mailing list, so I used > > Closes tag at that time. But recently I have not seen that kind of mails. > > Instead, I visit Coverity web site to check issues and use > > Addresses-Coverity-ID tag, since Coverity link is not visible to everyone. > > That is just my thought. > > The problem I have is that I get Coverity fixes both from the linux and the > linux-next projects: > > https://scan.coverity.com/projects/linux?tab=overview > https://scan.coverity.com/projects/linux-next-weekly-scan?tab=overview > > The Coverity IDs from these projects are allocated independently, so a > Coverity ID does not uniquely identify an issue. > > The URL uniquely identifies an issue, and also utilizes an official tag. > > That is my thought. Yes, I have the same problem as yours. For me, I only annotate Coverity IDs from linux project, and the linux-next project is as a reference to check if issues are still existing in -next tree.
Re: [PATCH v2] wifi: ath12k: cleanup ath12k_mac_mlo_ready()
On 2/10/2025 6:49 PM, Ethan Carter Edwards wrote: > There is a possibility for an uninitialized *ret* variable to be > returned in some code paths. > > This explicitly returns 0 without an error. Also removes goto that > returned *ret* and simply returns in place. > > Closes: > https://scan5.scan.coverity.com/#/project-view/63541/10063?selectedIssue=1642337 > Fixes: b716a10d99a28 ("wifi: ath12k: enable MLO setup and teardown from core") In the pending branch I changed this to: Fixes: b716a10d99a2 ("wifi: ath12k: enable MLO setup and teardown from core") To fix the checkpatch issue: WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("")' - ie: 'Fixes: b716a10d99a2 ("wifi: ath12k: enable MLO setup and teardown from core")' https://git.kernel.org/pub/scm/linux/kernel/git/ath/ath.git/commit/?h=pending&id=c2f7ae223cd3d781c69337dc804f1fae95789cdd
[PATCH] scsi: hpsa: Replace deprecated strncpy() with strscpy()
strncpy() is deprecated for NUL-terminated destination buffers [1]. Use strscpy() instead and remove the manual NUL-termination. Use min() to simplify the size calculation. Compile-tested only. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Cc: linux-hardening@vger.kernel.org Signed-off-by: Thorsten Blum Suggested-by: Bart Van Assche --- drivers/scsi/hpsa.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c index 84d8de07b7ae..9399e101f150 100644 --- a/drivers/scsi/hpsa.c +++ b/drivers/scsi/hpsa.c @@ -460,9 +460,8 @@ static ssize_t host_store_hp_ssd_smart_path_status(struct device *dev, if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; - len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count; - strncpy(tmpbuf, buf, len); - tmpbuf[len] = '\0'; + len = min(count, sizeof(tmpbuf) - 1); + strscpy(tmpbuf, buf, len); if (sscanf(tmpbuf, "%d", &status) != 1) return -EINVAL; h = shost_to_hba(shost); @@ -484,9 +483,8 @@ static ssize_t host_store_raid_offload_debug(struct device *dev, if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; - len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count; - strncpy(tmpbuf, buf, len); - tmpbuf[len] = '\0'; + len = min(count, sizeof(tmpbuf) - 1); + strscpy(tmpbuf, buf, len); if (sscanf(tmpbuf, "%d", &debug_level) != 1) return -EINVAL; if (debug_level < 0) -- 2.48.1
Re: [PATCH] wifi: ath12k: Fix uninitialized variable and remove goto
On 2/10/2025 8:09 PM, Ping-Ke Shih wrote: Addresses-Coverity-ID: 1642337 ("Uninitialized scalar variable") >>> >>> Is that an official kernel tag? IMO the proper tag would be >> So, it isn't "official" as far as I can tell, but it is widely used in >> other commits, especially by Gustavo Silva. >> >> Also: >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=778e2478d19574508861bcb >> 1806d6e34d095587c >> >> Coverity-IDs: is another option I have found. I have seen Closes: a few >> times as well. I'm not really sure what the best option is, honestly. > > In my patch, I used and treated Addresses-Coverity-ID as a unofficial tag, > so additional empty line is added. > > Days ago I have received Coverity issues sent to mailing list, so I used > Closes tag at that time. But recently I have not seen that kind of mails. > Instead, I visit Coverity web site to check issues and use > Addresses-Coverity-ID tag, since Coverity link is not visible to everyone. > That is just my thought. The problem I have is that I get Coverity fixes both from the linux and the linux-next projects: https://scan.coverity.com/projects/linux?tab=overview https://scan.coverity.com/projects/linux-next-weekly-scan?tab=overview The Coverity IDs from these projects are allocated independently, so a Coverity ID does not uniquely identify an issue. The URL uniquely identifies an issue, and also utilizes an official tag. That is my thought. /jeff
Re: [PATCH v2 6/6] unicode: kunit: change tests filename and path
On Wed, Feb 12, 2025 at 10:40:34AM -0500, Gabriel Krisman Bertazi wrote: > Kees Cook writes: > > > From: Gabriela Bittencourt > > > > Change utf8 kunit test filename and path to follow the style > > convention on Documentation/dev-tools/kunit/style.rst > > > > Co-developed-by: Pedro Orlando > > Signed-off-by: Pedro Orlando > > Co-developed-by: Danilo Pereira > > Signed-off-by: Danilo Pereira > > Signed-off-by: Gabriela Bittencourt > > Reviewed-by: David Gow > > Acked-by: Gabriel Krisman Bertazi > > Reviewed-by: Shuah Khan > > Reviewed-by: Rae Moar > > Link: https://lore.kernel.org/r/20241202075545.3648096-7-david...@google.com > > Signed-off-by: Kees Cook > > Hi Kees, > > Last time this was submitted, I have reported this patch breaks the > build when the kunit is built as a module, and it wasn't fixed before > resubmission. Thorsten Leemhuis just reported the same issue is now on > linux-next and I confirmed it is still there. > > https://lore.kernel.org/lkml/2c26c5e4-9cf3-4020-b0be-637dc826b...@leemhuis.info/T/#m2e2ddb3b2600274e5eae50b93ea821914dc85f20 > > The build error is: > > fs/unicode/tests/utf8_kunit.c:11:10: fatal error: utf8n.h: No such file or > directory >11 | #include "utf8n.h" > | ^ > > The problem is that the patch moves utf8_kunit.c into tests/ but doesn't > fix the include line above. Can you fix it in your tree since it is > already merged? Ah-ha! Thanks. Yes, I've updated this now. -Kees > > Thanks, > > > --- > > fs/unicode/Makefile| 2 +- > > fs/unicode/{ => tests}/.kunitconfig| 0 > > fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} | 0 > > 3 files changed, 1 insertion(+), 1 deletion(-) > > rename fs/unicode/{ => tests}/.kunitconfig (100%) > > rename fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} (100%) > > > > diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile > > index 37bbcbc628a1..d95be7fb9f6b 100644 > > --- a/fs/unicode/Makefile > > +++ b/fs/unicode/Makefile > > @@ -4,7 +4,7 @@ ifneq ($(CONFIG_UNICODE),) > > obj-y += unicode.o > > endif > > obj-$(CONFIG_UNICODE) += utf8data.o > > -obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += utf8-selftest.o > > +obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += tests/utf8_kunit.o > > > > unicode-y := utf8-norm.o utf8-core.o > > > > diff --git a/fs/unicode/.kunitconfig b/fs/unicode/tests/.kunitconfig > > similarity index 100% > > rename from fs/unicode/.kunitconfig > > rename to fs/unicode/tests/.kunitconfig > > diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/tests/utf8_kunit.c > > similarity index 100% > > rename from fs/unicode/utf8-selftest.c > > rename to fs/unicode/tests/utf8_kunit.c > > -- > Gabriel Krisman Bertazi -- Kees Cook
Re: [RFC PATCH v5 0/7] mseal system mappings
On Wed, Feb 12, 2025 at 11:24:35AM +, Lorenzo Stoakes wrote: > On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote: > > From: Jeff Xu > > > > The commit message in the first patch contains the full description of > > this series. > > Sorry to nit, but it'd be useful to reproduce in the cover letter too! But > this obviously isn't urgent, just be nice when we un-RFC. I advised Jeff against this because I've found it can sometimes cause "thread splitting" in that some people reply to the cover letter, and some people reply to the first patch, etc. I've tended to try to keep cover letters very general, with the bulk of the prose in the first patch. > It'd be nice to update the documentation to have a list of 'known > problematic userland software with sealed VDSO' so we make people aware. I like this idea! Probably in mseal.rst, as the Kconfig help already points there. -Kees -- Kees Cook