Re: [RFC PATCH v5 0/7] mseal system mappings

2025-02-12 Thread Pedro Falcato
On Wed, Feb 12, 2025 at 11:25 AM Lorenzo Stoakes
 wrote:
>
> On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote:
> > From: Jeff Xu 
> >
> > The commit message in the first patch contains the full description of
> > this series.
>
> Sorry to nit, but it'd be useful to reproduce in the cover letter too! But
> this obviously isn't urgent, just be nice when we un-RFC.
>
> Thanks for sending as RFC, appreciated, keen to figure out a way forward
> with this series and this gives us space to discuss.
>
> One thing that came up recently with the LWN article (...!) was that rr is
> also impacted by this [0].
>
> I think with this behind a config flag we're fine (this refers to my
> 'opt-in' comment in the reply on LWN) as my concerns about this being
> enabled in a broken way without an explicit kernel configuration are
> addressed, and actually we do expose a means by which a user can detect if
> the VDSO for instance is sealed via /proc/$pid/[s]maps.
>
> So tools like rr and such can be updated to check for this. I wonder if we
> ought to try to liaise with the known problematic ones?
>
> It'd be nice to update the documentation to have a list of 'known
> problematic userland software with sealed VDSO' so we make people aware.
>
> Hopefully we are acheiving the opt-in nature of the thing here, but it
> makes me wonder whether we need a prctl() interface to optionally disable
> even if the system has it enabled as a whole.

Just noting that (as we discussed off-list) doing prctl() would not
work, because that would effectively be an munseal for those vdso
regions.
Possibly something like a personality() flag (that's *not* inherited
when AT_SECURE/secureexec). But personalities have other issues...

FWIW, although it would (at the moment) be hard to pull off in the
libc, I still much prefer it to playing these weird games with CONFIG
options and kernel command line options and prctl and personality and
whatnot. It seems to me like we're trying to stick policy where it
doesn't belong.

-- 
Pedro



Re: [RFC PATCH v5 0/7] mseal system mappings

2025-02-12 Thread Lorenzo Stoakes
(sorry I really am struggling to reply to mail as lore still seems to be
broken).

On Wed, Feb 12, 2025 at 12:37:50PM +, Pedro Falcato wrote:
> On Wed, Feb 12, 2025 at 11:25 AM Lorenzo Stoakes
>  wrote:
> >
> > On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote:
> > > From: Jeff Xu 
> > >
> > > The commit message in the first patch contains the full description of
> > > this series.
> >
> > Sorry to nit, but it'd be useful to reproduce in the cover letter too! But
> > this obviously isn't urgent, just be nice when we un-RFC.
> >
> > Thanks for sending as RFC, appreciated, keen to figure out a way forward
> > with this series and this gives us space to discuss.
> >
> > One thing that came up recently with the LWN article (...!) was that rr is
> > also impacted by this [0].
> >
> > I think with this behind a config flag we're fine (this refers to my
> > 'opt-in' comment in the reply on LWN) as my concerns about this being
> > enabled in a broken way without an explicit kernel configuration are
> > addressed, and actually we do expose a means by which a user can detect if
> > the VDSO for instance is sealed via /proc/$pid/[s]maps.
> >
> > So tools like rr and such can be updated to check for this. I wonder if we
> > ought to try to liaise with the known problematic ones?
> >
> > It'd be nice to update the documentation to have a list of 'known
> > problematic userland software with sealed VDSO' so we make people aware.
> >
> > Hopefully we are acheiving the opt-in nature of the thing here, but it
> > makes me wonder whether we need a prctl() interface to optionally disable
> > even if the system has it enabled as a whole.
>
> Just noting that (as we discussed off-list) doing prctl() would not
> work, because that would effectively be an munseal for those vdso
> regions.
> Possibly something like a personality() flag (that's *not* inherited
> when AT_SECURE/secureexec). But personalities have other issues...

Thanks, yeah that's a good point, it would have to be implemented as a
personality or something similar otherwise you're essentially relying on
'unsealing' which can't be permitted.

I'm not sure how useful that'd be for the likes of rr though. But I suppose
if it makes everything exec'd by a child inherit it then maybe that works
for a debugging session etc.?

>
> FWIW, although it would (at the moment) be hard to pull off in the
> libc, I still much prefer it to playing these weird games with CONFIG
> options and kernel command line options and prctl and personality and
> whatnot. It seems to me like we're trying to stick policy where it
> doesn't belong.

The problem is, as a security feature, you don't want to make it trivially
easy to disable.

I mean we _need_ a config option to be able to strictly enforce only making
the feature enable-able on architectures and configuration option
combinations that work.

But if there is userspace that will be broken, we really have to have some
way of avoiding the disconnect between somebody making policy decision at
the kernel level and somebody trying to run something.

Because I can easily envision somebody enabling this as a 'good security
feature' for a distro release or such, only for somebody else to later try
rr, CRIU, or whatever else and for it to just not work or fail subtly and
to have no idea why.

I mean one option is to have it as a CONFIG_ flag _and_ you have to enable
it via a tunable, so then it can become sysctl.d policy for instance.

The CONFIG_ flag dependency is critical because we don't want to enable
this on arches that have not been tested against it.

It's vital at any rate that we document everywhere we can that _this might
break some userland that depends on remapping the VDSO_.

>
> --
> Pedro



Re: [RFC PATCH v5 0/7] mseal system mappings

2025-02-12 Thread Johannes Berg
On Wed, 2025-02-12 at 14:01 +, Lorenzo Stoakes wrote:
> Thanks, yeah that's a good point, it would have to be implemented as a
> personality or something similar otherwise you're essentially relying on
> 'unsealing' which can't be permitted.
> 
> I'm not sure how useful that'd be for the likes of rr though. But I suppose
> if it makes everything exec'd by a child inherit it then maybe that works
> for a debugging session etc.?

For whatever that's worth, ARCH=um should not need 'unsealing' or 'not
sealing' it for *itself*, but rather only for the *children* it starts,
which are for the userspace processes inside of it. Which I suppose
could actually start without a VDSO in the first place, but I don't
think that's possible now?

Which I'll note should not have access to the host, so in a way this
outer security feature (sealing) breaks the inner ARCH=um security, I
think.

johannes



[PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info

2025-02-12 Thread Gal Pressman
Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

With this, we can revert the unsafe_memcpy() call we have in
tun_dst_unclone() [1], and resolve the false field-spanning write
warning caused by the memcpy() in ip_tunnel_info_opts_set().

The layout of struct ip_tunnel_info remains the same with this patch.
Before this patch, there was an implicit padding at the end of the
struct, options would be written at 'info + 1' which is after the
padding.
This will remain the same as this patch explicitly aligns 'options'.
The alignment is needed as the options are later casted to different
structs, and might result in unaligned memory access.

Pahole output before this patch:
struct ip_tunnel_info {
struct ip_tunnel_key   key;  /* 064 */

/* XXX last struct has 1 byte of padding */

/* --- cacheline 1 boundary (64 bytes) --- */
struct ip_tunnel_encap encap;/*64 8 */
struct dst_cache   dst_cache;/*7216 */
u8 options_len;  /*88 1 */
u8 mode; /*89 1 */

/* size: 96, cachelines: 2, members: 5 */
/* padding: 6 */
/* paddings: 1, sum paddings: 1 */
/* last cacheline: 32 bytes */
};

Pahole output after this patch:
struct ip_tunnel_info {
   struct ip_tunnel_key   key;  /* 064 */

   /* XXX last struct has 1 byte of padding */

   /* --- cacheline 1 boundary (64 bytes) --- */
   struct ip_tunnel_encap encap;/*64 8 */
   struct dst_cache   dst_cache;/*7216 */
   u8 options_len;  /*88 1 */
   u8 mode; /*89 1 */

   /* XXX 6 bytes hole, try to pack */

   u8 options[] __attribute__((__aligned__(8))); /*
96 0 */

   /* size: 96, cachelines: 2, members: 6 */
   /* sum members: 90, holes: 1, sum holes: 6 */
   /* paddings: 1, sum paddings: 1 */
   /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
   /* last cacheline: 32 bytes */
} __attribute__((__aligned__(8)));

[1] Commit 13cfd6a6d7ac ("net: Silence false field-spanning write warning in 
metadata_dst memcpy")

Link: 
https://lore.kernel.org/all/53d1d353-b8f6-4adc-8f29-8c48a7c9c...@kernel.org/
Suggested-by: Kees Cook 
Reviewed-by: Cosmin Ratiu 
Reviewed-by: Tariq Toukan 
Signed-off-by: Gal Pressman 
---
Changelog -
v1->v2: https://lore.kernel.org/netdev/20250209101853.15828-1-...@nvidia.com/
* Remove change in struct layout, align 'options' field explicitly (Ilya, Kees, 
Jakub).
* Change allocation I missed in v1 in metadata_dst_alloc_percpu().
---
 .../mellanox/mlx5/core/en/tc_tun_encap.c  |  4 +---
 .../mellanox/mlx5/core/en/tc_tun_vxlan.c  |  2 +-
 .../ethernet/netronome/nfp/flower/action.c|  4 ++--
 drivers/net/pfcp.c|  2 +-
 drivers/net/vxlan/vxlan_core.c|  4 ++--
 include/net/dst_metadata.h|  7 ++
 include/net/ip_tunnels.h  | 11 +++--
 net/core/dst.c|  6 +++--
 net/ipv4/ip_gre.c |  4 ++--
 net/ipv4/ip_tunnel_core.c | 24 +--
 net/ipv6/ip6_gre.c|  4 ++--
 net/openvswitch/flow_netlink.c|  4 ++--
 net/psample/psample.c |  2 +-
 net/sched/act_tunnel_key.c| 12 +-
 14 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
index e7e01f3298ef..d9f40cf8198d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_encap.c
@@ -620,9 +620,7 @@ bool mlx5e_tc_tun_encap_info_equal_options(struct 
mlx5e_encap_key *a,
b_info = container_of(b->ip_tun_key, struct ip_tunnel_info, key);
 
return a_info->options_len == b_info->options_len &&
-  !memcmp(ip_tunnel_info_opts(a_info),
-  ip_tunnel_info_opts(b_info),
-  a_info->options_len);
+  !memcmp(a_info->options, b_info->options, a_info->options_len);
 }
 
 static int cmp_decap_info(struct mlx5e_decap_key *a,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c
index e4e487c8431b..561c874b0825 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c
@@ -100,7 +100,7 @@ static int mlx5e_gen_ip_tunnel_header_vxlan(char buf[],
vxh->vx_flags = VXLAN_HF_VNI;
vxh->vx_vni = vxlan_vni_field(tun_id);

Re: [RFC PATCH v5 2/7] selftests: x86: test_mremap_vdso: skip if vdso is msealed

2025-02-12 Thread Thomas Weißschuh
On Wed, Feb 12, 2025 at 03:21:50AM +, jef...@chromium.org wrote:
> From: Jeff Xu 
> 
> Add code to detect if the vdso is memory sealed, skip the test
> if it is.
> 
> Signed-off-by: Jeff Xu 
> ---
>  .../testing/selftests/x86/test_mremap_vdso.c  | 38 +++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/tools/testing/selftests/x86/test_mremap_vdso.c 
> b/tools/testing/selftests/x86/test_mremap_vdso.c
> index d53959e03593..c68077c56b22 100644
> --- a/tools/testing/selftests/x86/test_mremap_vdso.c
> +++ b/tools/testing/selftests/x86/test_mremap_vdso.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -55,13 +56,50 @@ static int try_to_remap(void *vdso_addr, unsigned long 
> size)
>  
>  }
>  
> +#define VDSO_NAME "[vdso]"
> +#define VMFLAGS "VmFlags:"
> +#define MSEAL_FLAGS "sl"
> +#define MAX_LINE_LEN 512
> +
> +bool vdso_sealed(FILE *maps)
> +{
> + char line[MAX_LINE_LEN];
> + bool has_vdso = false;
> +
> + while (fgets(line, sizeof(line), maps)) {
> + if (strstr(line, VDSO_NAME))
> + has_vdso = true;
> +
> + if (has_vdso && !strncmp(line, VMFLAGS, strlen(VMFLAGS))) {
> + if (strstr(line, MSEAL_FLAGS))
> + return true;
> +
> + return false;

This only tests that any mapping after the vdso is sealed.
There is a real parser for /proc/self/smaps in
tools/testing/selftests/mm/vm_util.[ch], see check_vmflag_io().

> + }
> + }
> +
> + return false;
> +}





Re: [RFC PATCH v5 1/7] mseal, system mappings: kernel config and header change

2025-02-12 Thread Liam R. Howlett
* jef...@chromium.org  [250211 22:22]:
> From: Jeff Xu 
> 
> Provide infrastructure to mseal system mappings. Establish
> two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and a header file (userprocess.h)
> for future patches.
> 
> As discussed during mseal() upstream process [1], mseal() protects
> the VMAs of a given virtual memory range against modifications, such
> as the read/write (RW) and no-execute (NX) bits. For complete
> descriptions of memory sealing, please see mseal.rst [2].
> 
> The mseal() is useful to mitigate memory corruption issues where a
> corrupted pointer is passed to a memory management system. For
> example, such an attacker primitive can break control-flow integrity
> guarantees since read-only memory that is supposed to be trusted can
> become writable or .text pages can get remapped.
> 
> The system mappings are readonly only, memory sealing can protect
> them from ever changing to writable or unmmap/remapped as different
> attributes.
> 
> System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
> are created by the kernel during program initialization, and could
> be sealed after creation.
> 
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [3]. It could be sealed from creation.
> 
> The vsyscall on x86-64 uses a special address (0xff60),
> which is outside the mm managed range. This means mprotect, munmap, and
> mremap won't work on the vsyscall. Since sealing doesn't enhance
> the vsyscall's security, it is skipped in this patch. If we ever seal
> the vsyscall, it is probably only for decorative purpose, i.e. showing
> the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
> 
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the system mappings during restore operations. UML(User Mode Linux)
> and gVisor are also known to change the vdso/vvar mappings. Consequently,
> this feature cannot be universally enabled across all systems. As such,
> CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
> 
> To support mseal of system mappings, architectures must define
> CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
> calls to pass mseal flag. Additionally, architectures must confirm they
> do not unmap/remap system mappings during the process lifetime.
> 
> In this version, we've improved the handling of system mapping sealing from
> previous versions, instead of modifying the _install_special_mapping
> function itself, which would affect all architectures, we now call
> _install_special_mapping with a sealing flag only within the specific
> architecture that requires it. This targeted approach offers two key
> advantages: 1) It limits the code change's impact to the necessary
> architectures, and 2) It aligns with the software architecture by keeping
> the core memory management within the mm layer, while delegating the
> decision of sealing system mappings to the individual architecture, which
> is particularly relevant since 32-bit architectures never require sealing.
> 
> Additionally, this patch introduces a new header,
> include/linux/usrprocess.h, which contains the mseal_system_mappings()
> function. This function helps the architecture determine if system
> mapping is enabled within the current kernel configuration. It can be
> extended in the future to handle opt-in/out prctl for enabling system
> mapping sealing at the process level or a kernel cmdline feature.
> 
> A new header file was introduced because it was difficult to find the
> best location for this function. Although include/linux/mm.h was
> considered, this feature is more closely related to user processes
> than core memory management. Additionally, future prctl or kernel
> cmd-line implementations for this feature would not fit within the
> scope of core memory management or mseal.c. This is relevant because
> if we add unit-tests for mseal.c in the future, we would want to limit
> mseal.c's dependencies to core memory management.
> 
> Prior to this patch series, we explored sealing special mappings from
> userspace using glibc's dynamic linker. This approach revealed several
> issues:
> - The PT_LOAD header may report an incorrect length for vdso, (smaller
>   than its actual size). The dynamic linker, which relies on PT_LOAD
>   information to determine mapping size, would then split and partially
>   seal the vdso mapping. Since each architecture has its own vdso/vvar
>   code, fixing this in the kernel would require going through each
>   archiecture. Our initial goal was to enable sealing readonly mappings,
>   e.g. .text, across all architectures, sealing vdso from kernel since
>   creation appears to be simpler than sealing vdso at glibc.
> - The [vvar] mapping header only contains address information, not length
>   information. Similar issues might exist for 

Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc

2025-02-12 Thread Harry Yoo
On Wed, Feb 12, 2025 at 04:15:03PM +0800, GONG Ruiqi wrote:
> Hi,
> 
> v3:
>   - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into
> mm/slub.c
>   - some rewording for commit logs
> v2: 
> https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/
>   - change the implementation as Vlastimil suggested
> v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/
> 
> Tamás reported [1] that kmalloc cache randomization doesn't actually
> work for those kmalloc invoked via kvmalloc. For more details, see the
> commit log of patch 2.
> 
> The current solution requires a direct call from __kvmalloc_node_noprof
> to __do_kmalloc_node, a static function in a different .c file. As
> suggested by Vlastimil [2], it's achieved by simply moving
> __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some
> other functions of the same family.

Hi, GONG!
Sorry for my late review.

This patch series looks good to me (with a nit),
Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>

Also, I verified that the problem you described exists on slab/for-next,
and the patch series fixes the problem. Please feel free to add,
Tested-by: Hyeonggon Yoo <42.hye...@gmail.com>

nit: Does it make sense to call __kvmalloc_node_track_caller_noprof()
instead of __do_kmalloc_node() to avoid bloating the code size?

My simple build test says it saves 1592 bytes:
  $ ./scripts/bloat-o-meter slub.o.before slub.o.after
  add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-1592 (-1592)
  Function old new   delta
  __kvmalloc_node_noprof.cold   39   - -39
  __kvmalloc_node_noprof  1755 202   -1553
  Total: Before=79723, After=78131, chg -2.00%

> Link: 
> https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259
>  [1]
> Link: 
> https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2]

-- 
Harry



Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc

2025-02-12 Thread Vlastimil Babka
On 2/12/25 15:20, Harry Yoo wrote:
> On Wed, Feb 12, 2025 at 04:15:03PM +0800, GONG Ruiqi wrote:
>> Hi,
>> 
>> v3:
>>   - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into
>> mm/slub.c
>>   - some rewording for commit logs
>> v2: 
>> https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/
>>   - change the implementation as Vlastimil suggested
>> v1: 
>> https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/
>> 
>> Tamás reported [1] that kmalloc cache randomization doesn't actually
>> work for those kmalloc invoked via kvmalloc. For more details, see the
>> commit log of patch 2.
>> 
>> The current solution requires a direct call from __kvmalloc_node_noprof
>> to __do_kmalloc_node, a static function in a different .c file. As
>> suggested by Vlastimil [2], it's achieved by simply moving
>> __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some
>> other functions of the same family.
> 
> Hi, GONG!
> Sorry for my late review.
> 
> This patch series looks good to me (with a nit),
> Reviewed-by: Hyeonggon Yoo <42.hye...@gmail.com>
> 
> Also, I verified that the problem you described exists on slab/for-next,
> and the patch series fixes the problem. Please feel free to add,
> Tested-by: Hyeonggon Yoo <42.hye...@gmail.com>

Thanks!

> nit: Does it make sense to call __kvmalloc_node_track_caller_noprof()
> instead of __do_kmalloc_node() to avoid bloating the code size?

Hm I think it would be a bit arbitrary to make kvmalloc special like this
here. But we should probably change __do_kmalloc_node() to
__fastpath_inline. Or even check if not making it inline at all results in
other callers (not kvmalloc probably due to its complexity) doing a tail
call there, which should be fast enough.

> My simple build test says it saves 1592 bytes:
>   $ ./scripts/bloat-o-meter slub.o.before slub.o.after
>   add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-1592 (-1592)
>   Function old new   delta
>   __kvmalloc_node_noprof.cold   39   - -39
>   __kvmalloc_node_noprof  1755 202   -1553
>   Total: Before=79723, After=78131, chg -2.00%
> 
>> Link: 
>> https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259
>>  [1]
>> Link: 
>> https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2]
> 




Re: [PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc

2025-02-12 Thread Vlastimil Babka
On 2/12/25 09:15, GONG Ruiqi wrote:
> Hi,
> 
> v3:
>   - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into
> mm/slub.c
>   - some rewording for commit logs
> v2: 
> https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/
>   - change the implementation as Vlastimil suggested
> v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/
> 
> Tamás reported [1] that kmalloc cache randomization doesn't actually
> work for those kmalloc invoked via kvmalloc. For more details, see the
> commit log of patch 2.
> 
> The current solution requires a direct call from __kvmalloc_node_noprof
> to __do_kmalloc_node, a static function in a different .c file. As
> suggested by Vlastimil [2], it's achieved by simply moving
> __kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some
> other functions of the same family.
> 
> Link: 
> https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259
>  [1]
> Link: 
> https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ [2]
> 
> GONG Ruiqi (2):
>   slab: Adjust placement of __kvmalloc_node_noprof
>   slab: Achieve better kmalloc caches randomization in kvmalloc

Applied to slab/for-next, thanks!

> 
>  mm/slub.c | 162 ++
>  mm/util.c | 162 --
>  2 files changed, 162 insertions(+), 162 deletions(-)
> 




Re: [RFC PATCH v5 0/7] mseal system mappings

2025-02-12 Thread Lorenzo Stoakes
On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote:
> From: Jeff Xu 
>
> The commit message in the first patch contains the full description of
> this series.

Sorry to nit, but it'd be useful to reproduce in the cover letter too! But
this obviously isn't urgent, just be nice when we un-RFC.

Thanks for sending as RFC, appreciated, keen to figure out a way forward
with this series and this gives us space to discuss.

One thing that came up recently with the LWN article (...!) was that rr is
also impacted by this [0].

I think with this behind a config flag we're fine (this refers to my
'opt-in' comment in the reply on LWN) as my concerns about this being
enabled in a broken way without an explicit kernel configuration are
addressed, and actually we do expose a means by which a user can detect if
the VDSO for instance is sealed via /proc/$pid/[s]maps.

So tools like rr and such can be updated to check for this. I wonder if we
ought to try to liaise with the known problematic ones?

It'd be nice to update the documentation to have a list of 'known
problematic userland software with sealed VDSO' so we make people aware.

Hopefully we are acheiving the opt-in nature of the thing here, but it
makes me wonder whether we need a prctl() interface to optionally disable
even if the system has it enabled as a whole.

That way, rr for instance can just turn it off for debugging purposes. To
be clear I am not trying to add additional extranous tasks here - my one
and only motive is to ensure that the feature works and we address concerns
about any possible breakage.

And I _want the series to land_ :>) I suspect we're close now.

I am tied up with a number of other tasks at the moment so apologies if I
take a second to get back to this series, but just wanted to say thanks for
addressing my various points, and that I will definitely provide review
(it's on the whiteboard, the only true guarantee I will do something :P).

I will also come back to your testing series which I owe you review on,
which is equally on the same whiteboard...

Thanks, Lorenzo

[0]:https://lwn.net/Articles/1007984/

>
> --
> History:
>
> V5
>   - Remove kernel cmd line (Lorenzo Stoakes)
>   - Add test info (Lorenzo Stoakes)
>   - Add threat model info (Lorenzo Stoakes)
>   - Fix x86 selftest: test_mremap_vdso
>   - Restrict code change to ARM64/x86-64/UM arch only.
>   - Add userprocess.h to include seal_system_mapping().
>   - Remove sealing vsyscall.
>   - Split the patch.
>
> V4:
>   https://lore.kernel.org/all/20241125202021.3684919-1-jef...@google.com/
>
> V3:
>   https://lore.kernel.org/all/20241113191602.3541870-1-jef...@google.com/
>
> V2:
>   https://lore.kernel.org/all/20241014215022.68530-1-jef...@google.com/
>
> V1:
>   https://lore.kernel.org/all/20241004163155.3493183-1-jef...@google.com/
>
> Jeff Xu (7):
>   mseal, system mappings: kernel config and header change
>   selftests: x86: test_mremap_vdso: skip if vdso is msealed
>   mseal, system mappings: enable x86-64
>   mseal, system mappings: enable arm64
>   mseal, system mappings: enable uml architecture
>   mseal, system mappings: uprobe mapping
>   mseal, system mappings: update mseal.rst
>
>  Documentation/userspace-api/mseal.rst |  5 +++
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/kernel/vdso.c  | 23 +++
>  arch/um/Kconfig   |  1 +
>  arch/x86/Kconfig  |  1 +
>  arch/x86/entry/vdso/vma.c | 17 ++---
>  arch/x86/um/vdso/vma.c|  7 +++-
>  include/linux/userprocess.h   | 18 +
>  init/Kconfig  | 18 +
>  kernel/events/uprobes.c   |  6 ++-
>  security/Kconfig  | 18 +
>  .../testing/selftests/x86/test_mremap_vdso.c  | 38 +++
>  12 files changed, 137 insertions(+), 16 deletions(-)
>  create mode 100644 include/linux/userprocess.h
>
> --
> 2.48.1.502.g6dc24dfdaf-goog
>



Re: [PATCH v2 6/6] unicode: kunit: change tests filename and path

2025-02-12 Thread Gabriel Krisman Bertazi
Kees Cook  writes:

> From: Gabriela Bittencourt 
>
> Change utf8 kunit test filename and path to follow the style
> convention on Documentation/dev-tools/kunit/style.rst
>
> Co-developed-by: Pedro Orlando 
> Signed-off-by: Pedro Orlando 
> Co-developed-by: Danilo Pereira 
> Signed-off-by: Danilo Pereira 
> Signed-off-by: Gabriela Bittencourt 
> Reviewed-by: David Gow 
> Acked-by: Gabriel Krisman Bertazi 
> Reviewed-by: Shuah Khan 
> Reviewed-by: Rae Moar 
> Link: https://lore.kernel.org/r/20241202075545.3648096-7-david...@google.com
> Signed-off-by: Kees Cook 

Hi Kees,

Last time this was submitted, I have reported this patch breaks the
build when the kunit is built as a module, and it wasn't fixed before
resubmission.  Thorsten Leemhuis just reported the same issue is now on
linux-next and I confirmed it is still there.

https://lore.kernel.org/lkml/2c26c5e4-9cf3-4020-b0be-637dc826b...@leemhuis.info/T/#m2e2ddb3b2600274e5eae50b93ea821914dc85f20

The build error is:

fs/unicode/tests/utf8_kunit.c:11:10: fatal error: utf8n.h: No such file or 
directory
   11 | #include "utf8n.h"
  |  ^

The problem is that the patch moves utf8_kunit.c into tests/ but doesn't
fix the include line above.  Can you fix it in your tree since it is
already merged?

Thanks,

> ---
>  fs/unicode/Makefile| 2 +-
>  fs/unicode/{ => tests}/.kunitconfig| 0
>  fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} | 0
>  3 files changed, 1 insertion(+), 1 deletion(-)
>  rename fs/unicode/{ => tests}/.kunitconfig (100%)
>  rename fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} (100%)
>
> diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
> index 37bbcbc628a1..d95be7fb9f6b 100644
> --- a/fs/unicode/Makefile
> +++ b/fs/unicode/Makefile
> @@ -4,7 +4,7 @@ ifneq ($(CONFIG_UNICODE),)
>  obj-y+= unicode.o
>  endif
>  obj-$(CONFIG_UNICODE)+= utf8data.o
> -obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += utf8-selftest.o
> +obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += tests/utf8_kunit.o
>  
>  unicode-y := utf8-norm.o utf8-core.o
>  
> diff --git a/fs/unicode/.kunitconfig b/fs/unicode/tests/.kunitconfig
> similarity index 100%
> rename from fs/unicode/.kunitconfig
> rename to fs/unicode/tests/.kunitconfig
> diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/tests/utf8_kunit.c
> similarity index 100%
> rename from fs/unicode/utf8-selftest.c
> rename to fs/unicode/tests/utf8_kunit.c

-- 
Gabriel Krisman Bertazi



Re: [RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Jennifer Miller
On Wed, Feb 12, 2025 at 11:29:02PM +0100, Jann Horn wrote:
> +Andy Lutomirski (X86 entry code maintainer)
> 
> On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller  wrote:
> > As part of a recently accepted paper we demonstrated that syscall
> > entrypoints can be misused on x86-64 systems to generically bypass
> > FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
> > communicated this finding to s@k.o before submitting the paper and were
> > encouraged to bring the issue to hardening after the paper was accepted to
> > have a discussion on how to address the issue.
> >
> > The bypass takes advantage of the architectural requirement of entrypoints
> > to begin with the endbr64 instruction and the ability to control GS_BASE
> > from userspace via wrgsbase, from to the FSGSBASE extension, in order to
> > perform a stack pivot to a ROP-chain.
> 
> Oh, fun, that's a gnarly quirk.

yeag :)

> Since the kernel, as far as I understand, uses FineIBT without
> backwards control flow protection (in other words, I think we assume
> that the kernel stack is trusted?), could we build a cheaper
> check on that basis somehow? For example, maybe we could do something like:
> 
> ```
> endbr64
> test rsp, rsp
> js slowpath
> swapgs
> ```
> 
> So we'd have the fast normal case where RSP points to userspace
> (meaning we can't be coming from the kernel unless our stack has
> already been pivoted, in which case forward edge protection alone
> can't help anymore), and the slow case where RSP points to kernel
> memory - in that case we'd then have to do some slower checks to
> figure out whether weird userspace is making a syscall with RSP
> pointing to the kernel, or whether we're coming from hijacked kernel
> control flow.

I've been tinkering this idea a bit and came with something.

In short, we could have the slowpath branch as you suggested, in the 
slowpath permit the stack switch and preserving of the registers on the
stack, but then do a sanity check according to the __per_cpu_offset array
and decide from there whether we should continue executing the entrypoint
or die/attempt to recover.

Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint, 
I think more or less the same could be done for the other entrypoints.

```
endbr64
test rsp, rsp
js slowpath

swapgs
~~fastpath continues~~

; path taken when rsp was a kernel address
; we have no choice really but to switch to the stack from the untrusted
; gsbase but after doing so we have to be careful about what we put on the
; stack
slowpath:
swapgs

; swap stacks as normal
movQWORD PTR gs:[rip+0x7f005f85],rsp   # 0x6014 
movrsp,QWORD PTR gs:[rip+0x7f02c56d]   # 0x2c618 

~~normal push and clear GPRs sequence here~~

; we entered with an rsp in the kernel address range.
; we already did swapgs but we don't know if we can trust our gsbase yet.
; we should be able to trust the ro_after_init __per_cpu_offset array
; though.

; check that gsbase is the expected value for our current cpu
rdtscp
mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset>

rdgsbase rbx

cmp rbx, rax
je fastpath_after_regs_preserved

wrgsbase rax

; if we reach here we are being exploited and should explode or attempt
; to recover
```

The unfortunate part is that it would still result in the register state
being dumped on top of some attacker controlled address, so if the error
path is recoverable someone could still use entrypoints to convert control
flow hijacking into memory corruption via register dump. So it would kill
the ability to get ROP but it would still be possible to dump regs over
modprobe_path, core_pattern, etc.

Does this seem feasible and any better than the alternative of overwriting
and restoring KERNEL_GS_BASE?

~Jennifer



Re: [RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Andrew Cooper
On 13/02/2025 2:09 am, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper  
> wrote:
 Assuming this is an issue you all feel is worth addressing, I will
 continue working on providing a patch. I'm concerned though that the
 overhead from adding a wrmsr on both syscall entry and exit to
 overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
 any feedback in regards to the approach or suggestions of alternate
 approaches to patching are welcome :)
>>> Since the kernel, as far as I understand, uses FineIBT without
>>> backwards control flow protection (in other words, I think we assume
>>> that the kernel stack is trusted?),
>> This is fun indeed.  Linux cannot use supervisor shadow stacks because
>> the mess around NMI re-entrancy (and IST more generally) requires ROP
>> gadgets in order to function safely.  Implementing this with shadow
>> stacks active, while not impossible, is deemed to be prohibitively
>> complicated.
>>
>> Linux's supervisor shadow stack support is waiting for FRED support,
>> which fixes both the NMI re-entrancy problem, and other exceptions
>> nesting within NMIs, as well as prohibiting the use of the SWAPGS
>> instruction as FRED tries to make sure that the correct GS is always in
>> context.
>>
>> But, FRED support is slated for PantherLake/DiamondRapids which haven't
>> shipped yet, so are no use to the problem right now.
>>
>>> could we build a cheaper
>>> check on that basis somehow? For example, maybe we could do something like:
>>>
>>> ```
>>> endbr64
>>> test rsp, rsp
>>> js slowpath
>>> swapgs
>>> ```
>> I presume it's been pointed out already, but there are 3 related
>> entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
>> which are related.
>>
>> But, any other IDT entry is in a similar bucket.  If we're corrupting a
>> function pointer or return address to redirect here, then the check of
>> CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
>> stack frame.
>>
>> For IDT entries, checking %rsp is reasonable, because userspace can't
>> forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
>> entirely attacker controlled (and even potentially non-canonical), so
>> I'm wondering what you hand in mind for the slowpath to truly
>> distinguish kernel context from user context?
> Hm, yeah, that seems hard - maybe the best we could do is to make sure
> that the inactive gsbase has the correct value for our CPU's kernel
> gsbase? Kinda like a paranoid_entry, except more painful because we'd
> first have to figure out a place to spill registers to before we can
> start using stuff like rdmsr... Then a function pointer overwrite
> might still turn into returning to userspace with a sysret with GPRs
> full of kernel pointers, but at least we wouldn't run off of a bogus
> gsbase anymore?

Thinking about this some more, I think it's impossible to distinguish.

One of the many sharp edges of SYSCALL (and SYSENTER for that matter) is
that they're instructions expected to be only be used by userspace, but
that be executed in supervisor too[1].  They're asymmetric with their
SYSRET (and SYSEXIT) counterparts which are CPL0 instructions that
strictly transition into CPL3.

The SYSCALL behaviour TLDR is:

    %rcx = %rip
    %r11 = %eflags
    %cs = fixed attr
    %ss = fixed attr
    %rip = MSR_LSTAR

which means that %rcx (old rip) is the only piece of state which
userspace can't feasibly forge (and therefore could distinguish a
SYSCALL from user vs kernel mode), yet if we're talking about a JOP
chain to get here, then %rcx is under attacker control too.

There are a variety of solutions to this problem that involve not using
%gs for per-cpu data.  I also expect that to be wholly unpopular and
dismissed as an approach.

~Andrew

[1] No-one back then was brave enough to design CPL3-only instructions.



[PATCH][next] UAPI: ndctl / acpi: intel: Avoid multiple -Wflex-array-member-not-at-end warnings

2025-02-12 Thread Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

So, in order to avoid ending up with flexible-array members in the
middle of other structs, we use the `__struct_group()` helper to
separate the flexible array from the rest of the members in the
flexible structure. We then use the newly created tagged `struct
nd_cmd_pkg_hdr` to replace the type of the objects causing trouble
(`pkg`) in multiple structs.

So, with these changes, fix the following warnings:

drivers/acpi/nfit/intel.c:692:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:59:35: warning: structure containing a flexible array 
member is not at the end of another structure [-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:586:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:522:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:411:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:358:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:322:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:281:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:238:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:199:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:157:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]
drivers/acpi/nfit/intel.c:124:35: warning: structure containing a flexible 
array member is not at the end of another structure 
[-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/acpi/nfit/intel.c  | 24 
 include/uapi/linux/ndctl.h | 15 +--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 3902759abcba..fe561ce0ddec 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -56,7 +56,7 @@ static unsigned long intel_security_flags(struct nvdimm 
*nvdimm,
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
unsigned long security_flags = 0;
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd_intel_get_security_state cmd;
} nd_cmd = {
.pkg = {
@@ -121,7 +121,7 @@ static int intel_security_freeze(struct nvdimm *nvdimm)
 {
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd_intel_freeze_lock cmd;
} nd_cmd = {
.pkg = {
@@ -154,7 +154,7 @@ static int intel_security_change_key(struct nvdimm *nvdimm,
NVDIMM_INTEL_SET_MASTER_PASSPHRASE :
NVDIMM_INTEL_SET_PASSPHRASE;
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd_intel_set_passphrase cmd;
} nd_cmd = {
.pkg = {
@@ -196,7 +196,7 @@ static int __maybe_unused intel_security_unlock(struct 
nvdimm *nvdimm,
 {
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd_intel_unlock_unit cmd;
} nd_cmd = {
.pkg = {
@@ -235,7 +235,7 @@ static int intel_security_disable(struct nvdimm *nvdimm,
int rc;
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd_intel_disable_passphrase cmd;
} nd_cmd = {
.pkg = {
@@ -278,7 +278,7 @@ static int __maybe_unused intel_security_erase(struct 
nvdimm *nvdimm,
unsigned int cmd = ptype == NVDIMM_MASTER ?
NVDIMM_INTEL_MASTER_SECURE_ERASE : NVDIMM_INTEL_SECURE_ERASE;
struct {
-   struct nd_cmd_pkg pkg;
+   struct nd_cmd_pkg_hdr pkg;
struct nd

Re: [PATCH 00/10] Annotate arguments of memtostr/strtomem with __nonstring

2025-02-12 Thread Martin K. Petersen


Kees,

> The memtostr*() and strtomem*() helpers are designed to move between C
> strings (NUL-terminated) and byte arrays (that may just be zero padded
> and may not be NUL-terminated). The "nonstring" attribute is used to
> annotated these kinds of byte arrays, and we can validate the
> annotation on the arguments of the helpers. Add the the infrastructure
> to do this, and then update all the places where these annotations are
> currently missing.

Reviewed-by: Martin K. Petersen  # SCSI

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info

2025-02-12 Thread Gal Pressman
On 12/02/2025 18:29, Alexander Lobakin wrote:
> From: Gal Pressman 
> Date: Wed, 12 Feb 2025 16:09:53 +0200
> 
>> Remove the hidden assumption that options are allocated at the end of
>> the struct, and teach the compiler about them using a flexible array.
> 
> [...]
> 
>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>> index 84c15402931c..4160731dcb6e 100644
>> --- a/include/net/dst_metadata.h
>> +++ b/include/net/dst_metadata.h
>> @@ -163,11 +163,8 @@ static inline struct metadata_dst 
>> *tun_dst_unclone(struct sk_buff *skb)
>>  if (!new_md)
>>  return ERR_PTR(-ENOMEM);
>>  
>> -unsafe_memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>> -  sizeof(struct ip_tunnel_info) + md_size,
>> -  /* metadata_dst_alloc() reserves room (md_size bytes) for
>> -   * options right after the ip_tunnel_info struct.
>> -   */);
>> +memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>> +   sizeof(struct ip_tunnel_info) + md_size);
>>  #ifdef CONFIG_DST_CACHE
>>  /* Unclone the dst cache if there is one */
>>  if (new_md->u.tun_info.dst_cache.cache) {
>> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
>> index 1aa31bdb2b31..517f78070be0 100644
>> --- a/include/net/ip_tunnels.h
>> +++ b/include/net/ip_tunnels.h
>> @@ -93,12 +93,6 @@ struct ip_tunnel_encap {
>>  GENMASK((sizeof_field(struct ip_tunnel_info,\
>>options_len) * BITS_PER_BYTE) - 1, 0)
>>  
>> -#define ip_tunnel_info_opts(info)   \
>> -_Generic(info,  \
>> - const struct ip_tunnel_info * : ((const void *)((info) + 1)),\
>> - struct ip_tunnel_info * : ((void *)((info) + 1))\
>> -)
> 
> You could leave this macro inplace and just change `(info) + 1` to
> `(info)->options` avoiding changes in lots of files and adding casts
> everywhere.

I'd rather not, having a macro to do 'info->options' doesn't help code
readability IMHO.

> 
>> -
>>  struct ip_tunnel_info {
>>  struct ip_tunnel_keykey;
>>  struct ip_tunnel_encap  encap;
>> @@ -107,6 +101,7 @@ struct ip_tunnel_info {
>>  #endif
>>  u8  options_len;
>>  u8  mode;
>> +u8  options[] __aligned(sizeof(void *)) 
>> __counted_by(options_len);
> 
> Since 96 % 16 == 0, I'd check if __aligned_largest would change the
> bytecode anyhow... Sometimes it does.

__aligned_largest aligns the field to 16 bytes, so the struct remains
the same (96 bytes).
When should __aligned_larget be used? What needs 16 bytes alignment?



[RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Jennifer Miller
Hi All,

As part of a recently accepted paper we demonstrated that syscall
entrypoints can be misused on x86-64 systems to generically bypass
FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
communicated this finding to s@k.o before submitting the paper and were
encouraged to bring the issue to hardening after the paper was accepted to
have a discussion on how to address the issue.

The bypass takes advantage of the architectural requirement of entrypoints
to begin with the endbr64 instruction and the ability to control GS_BASE
from userspace via wrgsbase, from to the FSGSBASE extension, in order to
perform a stack pivot to a ROP-chain.

Here is a snippet of the 64-bit entrypoint code:
```
entry_SYSCALL_64:
<+0>: endbr64
<+4>: swapgs
<+7>: movQWORD PTR gs:0x6014,rsp
<+16>:jmp
<+18>:movrsp,cr3
<+21>:nop
<+26>:andrsp,0xe7ff
<+33>:movcr3,rsp
<+36>:movrsp,QWORD PTR gs:0x32c98
```

This is a valid target from any indirect callsite under FineIBT due to the
endbr64 instruction and the lack of a software CFI check. After hijacking
control flow to the entrypoint, executing swapgs will swap to the user
controlled GS_BASE, which will be used to set the stack pointer, leading to
a stack pivot. The rest of the entrypoint will execute with a hijacked
GS_BASE on a user controlled stack. The stack page we use is one mapped in
the user address space and from another thread we race overwriting returns
addresses on the stack to pivot a second time to a ROP-chain. For this to
succeed we required a large area of user-controlled kernel memory that can
serve as the forged GS_BASE address, we did this by spraying 2MB
Transparent Huge Pages to fill the kernel physical memory map with
controlled 2MB allocations and guessing relative to the base address of the
area to hit a page we control.

We evaluated an approach to patching the issue in the paper but it touched
the userspace API a bit, added an error code returned by syscalls if they
are invoked with a kernel address in GS_BASE, which is not a great 
solution.

Linus provided some thoughts on how to potentially address this issue
in our communication with s@k.o, suggesting the kernel could make the
KERNEL_GS_BASE match the GS_BASE value so both registers always contain a
valid kernel address and a confusion induced by executing swapgs an extra
time cannot occur, and restore the value of KERNEL_GS_BASE ahead of 
executing swapgs in the exit path.

I started working on a patch based on the approach suggested by Linus but I
haven't been able to get it passing the relevant x86 selftests yet. It 
turned out that it's more than the entrypoint code that needs to be 
modified for it to work, we need to correctly save and restore the user's 
GS_BASE across task switches and ensure it is updated correctly when set 
via arch_prctl and ptrace. Unfortunately, I lack familiarity with those 
parts of the kernel, and my understanding is that the paper will be made 
public in a couple weeks so I didn't want to delay too long on bringing the 
issue to this list.

Assuming this is an issue you all feel is worth addressing, I will continue
working on providing a patch. I'm concerned though that the overhead from 
adding a wrmsr on both syscall entry and exit to overwrite and restore the 
KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the
approach or suggestions of alternate approaches to patching are welcome :)

~Jennifer



Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info

2025-02-12 Thread Alexander Lobakin
From: Gal Pressman 
Date: Wed, 12 Feb 2025 16:09:53 +0200

> Remove the hidden assumption that options are allocated at the end of
> the struct, and teach the compiler about them using a flexible array.

[...]

> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> index 84c15402931c..4160731dcb6e 100644
> --- a/include/net/dst_metadata.h
> +++ b/include/net/dst_metadata.h
> @@ -163,11 +163,8 @@ static inline struct metadata_dst 
> *tun_dst_unclone(struct sk_buff *skb)
>   if (!new_md)
>   return ERR_PTR(-ENOMEM);
>  
> - unsafe_memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
> -   sizeof(struct ip_tunnel_info) + md_size,
> -   /* metadata_dst_alloc() reserves room (md_size bytes) for
> -* options right after the ip_tunnel_info struct.
> -*/);
> + memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
> +sizeof(struct ip_tunnel_info) + md_size);
>  #ifdef CONFIG_DST_CACHE
>   /* Unclone the dst cache if there is one */
>   if (new_md->u.tun_info.dst_cache.cache) {
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index 1aa31bdb2b31..517f78070be0 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -93,12 +93,6 @@ struct ip_tunnel_encap {
>   GENMASK((sizeof_field(struct ip_tunnel_info,\
> options_len) * BITS_PER_BYTE) - 1, 0)
>  
> -#define ip_tunnel_info_opts(info)\
> - _Generic(info,  \
> -  const struct ip_tunnel_info * : ((const void *)((info) + 1)),\
> -  struct ip_tunnel_info * : ((void *)((info) + 1))\
> - )

You could leave this macro inplace and just change `(info) + 1` to
`(info)->options` avoiding changes in lots of files and adding casts
everywhere.

> -
>  struct ip_tunnel_info {
>   struct ip_tunnel_keykey;
>   struct ip_tunnel_encap  encap;
> @@ -107,6 +101,7 @@ struct ip_tunnel_info {
>  #endif
>   u8  options_len;
>   u8  mode;
> + u8  options[] __aligned(sizeof(void *)) 
> __counted_by(options_len);

Since 96 % 16 == 0, I'd check if __aligned_largest would change the
bytecode anyhow... Sometimes it does.

Thanks,
Olek



Re: [RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Jann Horn
+Andy Lutomirski (X86 entry code maintainer)

On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller  wrote:
> As part of a recently accepted paper we demonstrated that syscall
> entrypoints can be misused on x86-64 systems to generically bypass
> FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
> communicated this finding to s@k.o before submitting the paper and were
> encouraged to bring the issue to hardening after the paper was accepted to
> have a discussion on how to address the issue.
>
> The bypass takes advantage of the architectural requirement of entrypoints
> to begin with the endbr64 instruction and the ability to control GS_BASE
> from userspace via wrgsbase, from to the FSGSBASE extension, in order to
> perform a stack pivot to a ROP-chain.

Oh, fun, that's a gnarly quirk.

> Here is a snippet of the 64-bit entrypoint code:
> ```
> entry_SYSCALL_64:
> <+0>: endbr64
> <+4>: swapgs
> <+7>: movQWORD PTR gs:0x6014,rsp
> <+16>:jmp
> <+18>:movrsp,cr3
> <+21>:nop
> <+26>:andrsp,0xe7ff
> <+33>:movcr3,rsp
> <+36>:movrsp,QWORD PTR gs:0x32c98
> ```
>
> This is a valid target from any indirect callsite under FineIBT due to the
> endbr64 instruction and the lack of a software CFI check. After hijacking
> control flow to the entrypoint, executing swapgs will swap to the user
> controlled GS_BASE, which will be used to set the stack pointer, leading to
> a stack pivot. The rest of the entrypoint will execute with a hijacked
> GS_BASE on a user controlled stack. The stack page we use is one mapped in
> the user address space and from another thread we race overwriting returns
> addresses on the stack to pivot a second time to a ROP-chain. For this to
> succeed we required a large area of user-controlled kernel memory that can
> serve as the forged GS_BASE address, we did this by spraying 2MB
> Transparent Huge Pages to fill the kernel physical memory map with
> controlled 2MB allocations and guessing relative to the base address of the
> area to hit a page we control.
>
> We evaluated an approach to patching the issue in the paper but it touched
> the userspace API a bit, added an error code returned by syscalls if they
> are invoked with a kernel address in GS_BASE, which is not a great
> solution.
>
> Linus provided some thoughts on how to potentially address this issue
> in our communication with s@k.o, suggesting the kernel could make the
> KERNEL_GS_BASE match the GS_BASE value so both registers always contain a
> valid kernel address and a confusion induced by executing swapgs an extra
> time cannot occur, and restore the value of KERNEL_GS_BASE ahead of
> executing swapgs in the exit path.
>
> I started working on a patch based on the approach suggested by Linus but I
> haven't been able to get it passing the relevant x86 selftests yet. It
> turned out that it's more than the entrypoint code that needs to be
> modified for it to work, we need to correctly save and restore the user's
> GS_BASE across task switches and ensure it is updated correctly when set
> via arch_prctl and ptrace. Unfortunately, I lack familiarity with those
> parts of the kernel, and my understanding is that the paper will be made
> public in a couple weeks so I didn't want to delay too long on bringing the
> issue to this list.
>
> Assuming this is an issue you all feel is worth addressing, I will continue
> working on providing a patch. I'm concerned though that the overhead from
> adding a wrmsr on both syscall entry and exit to overwrite and restore the
> KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the
> approach or suggestions of alternate approaches to patching are welcome :)

Since the kernel, as far as I understand, uses FineIBT without
backwards control flow protection (in other words, I think we assume
that the kernel stack is trusted?), could we build a cheaper
check on that basis somehow? For example, maybe we could do something like:

```
endbr64
test rsp, rsp
js slowpath
swapgs
```

So we'd have the fast normal case where RSP points to userspace
(meaning we can't be coming from the kernel unless our stack has
already been pivoted, in which case forward edge protection alone
can't help anymore), and the slow case where RSP points to kernel
memory - in that case we'd then have to do some slower checks to
figure out whether weird userspace is making a syscall with RSP
pointing to the kernel, or whether we're coming from hijacked kernel
control flow.



[PATCH] scsi: hpsa: Replace deprecated strncpy() with strscpy_pad()

2025-02-12 Thread Thorsten Blum
strncpy() is deprecated for NUL-terminated destination buffers [1].

Replace memset() and strncpy() with strscpy_pad() to copy the version
string and fill the remaining bytes in the destination buffer with NUL
bytes. This avoids zeroing the memory before copying the string.

Compile-tested only.

Link: 
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
 [1]
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Thorsten Blum 
---
 drivers/scsi/hpsa.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 84d8de07b7ae..1334886b68d0 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7238,8 +7238,7 @@ static int hpsa_controller_hard_reset(struct pci_dev 
*pdev,
 
 static void init_driver_version(char *driver_version, int len)
 {
-   memset(driver_version, 0, len);
-   strncpy(driver_version, HPSA " " HPSA_DRIVER_VERSION, len - 1);
+   strscpy_pad(driver_version, HPSA " " HPSA_DRIVER_VERSION, len - 1);
 }
 
 static int write_driver_ver_to_cfgtable(struct CfgTable __iomem *cfgtable)
-- 
2.48.1




[PATCH v3 2/2] slab: Achieve better kmalloc caches randomization in kvmalloc

2025-02-12 Thread GONG Ruiqi
As revealed by this writeup[1], due to the fact that __kmalloc_node (now
renamed to __kmalloc_node_noprof) is an exported symbol and will never
get inlined, using it in kvmalloc_node (now is __kvmalloc_node_noprof)
would make the RET_IP inside always point to the same address:

upper_caller
kvmalloc
kvmalloc_node
kvmalloc_node_noprof
__kvmalloc_node_noprof  <-- all macros all the way down here
__kmalloc_node_noprof
__do_kmalloc_node(.., _RET_IP_)
... <-- _RET_IP_ points to

That literally means all kmalloc invoked via kvmalloc would use the same
seed for cache randomization (CONFIG_RANDOM_KMALLOC_CACHES), which makes
this hardening non-functional.

The root cause of this problem, IMHO, is that using RET_IP only cannot
identify the actual allocation site in case of kmalloc being called
inside non-inlined wrappers or helper functions. And I believe there
could be similar cases in other functions. Nevertheless, I haven't
thought of any good solution for this. So for now let's solve this
specific case first.

For __kvmalloc_node_noprof, replace __kmalloc_node_noprof and call
__do_kmalloc_node directly instead, so that RET_IP can take the return
address of kvmalloc and differentiate each kvmalloc invocation:

upper_caller
kvmalloc
kvmalloc_node
kvmalloc_node_noprof
__kvmalloc_node_noprof  <-- all macros all the way down here
__do_kmalloc_node(.., _RET_IP_)
... <-- _RET_IP_ points to

Thanks to Tamás Koczka for the report and discussion!

Link: 
https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259
 [1]
Reported-by: Tamás Koczka 
Signed-off-by: GONG Ruiqi 
---
 mm/slub.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index abc982d68feb..1f7d1d260eeb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4925,9 +4925,9 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), 
gfp_t flags, int node)
 * It doesn't really make sense to fallback to vmalloc for sub page
 * requests
 */
-   ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b),
-   kmalloc_gfp_adjust(flags, size),
-   node);
+   ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b),
+   kmalloc_gfp_adjust(flags, size),
+   node, _RET_IP_);
if (ret || size <= PAGE_SIZE)
return ret;
 
-- 
2.25.1




[PATCH v3 1/2] slab: Adjust placement of __kvmalloc_node_noprof

2025-02-12 Thread GONG Ruiqi
Move __kvmalloc_node_noprof (as well as kvfree*, kvrealloc_noprof and
kmalloc_gfp_adjust for consistency) into mm/slub.c so that it can
directly invoke __do_kmalloc_node, which is needed for the next patch.

No functional changes intended.

Signed-off-by: GONG Ruiqi 
---
 mm/slub.c | 162 ++
 mm/util.c | 162 --
 2 files changed, 162 insertions(+), 162 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1f50129dcfb3..abc982d68feb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4878,6 +4878,168 @@ void *krealloc_noprof(const void *p, size_t new_size, 
gfp_t flags)
 }
 EXPORT_SYMBOL(krealloc_noprof);
 
+static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
+{
+   /*
+* We want to attempt a large physically contiguous block first because
+* it is less likely to fragment multiple larger blocks and therefore
+* contribute to a long term fragmentation less than vmalloc fallback.
+* However make sure that larger requests are not too disruptive - no
+* OOM killer and no allocation failure warnings as we have a fallback.
+*/
+   if (size > PAGE_SIZE) {
+   flags |= __GFP_NOWARN;
+
+   if (!(flags & __GFP_RETRY_MAYFAIL))
+   flags |= __GFP_NORETRY;
+
+   /* nofail semantic is implemented by the vmalloc fallback */
+   flags &= ~__GFP_NOFAIL;
+   }
+
+   return flags;
+}
+
+/**
+ * __kvmalloc_node - attempt to allocate physically contiguous memory, but upon
+ * failure, fall back to non-contiguous (vmalloc) allocation.
+ * @size: size of the request.
+ * @b: which set of kmalloc buckets to allocate from.
+ * @flags: gfp mask for the allocation - must be compatible (superset) with 
GFP_KERNEL.
+ * @node: numa node to allocate from
+ *
+ * Uses kmalloc to get the memory but if the allocation fails then falls back
+ * to the vmalloc allocator. Use kvfree for freeing the memory.
+ *
+ * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY 
modifier.
+ * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
+ * preferable to the vmalloc fallback, due to visible performance drawbacks.
+ *
+ * Return: pointer to the allocated memory of %NULL in case of failure
+ */
+void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int 
node)
+{
+   void *ret;
+
+   /*
+* It doesn't really make sense to fallback to vmalloc for sub page
+* requests
+*/
+   ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b),
+   kmalloc_gfp_adjust(flags, size),
+   node);
+   if (ret || size <= PAGE_SIZE)
+   return ret;
+
+   /* non-sleeping allocations are not supported by vmalloc */
+   if (!gfpflags_allow_blocking(flags))
+   return NULL;
+
+   /* Don't even allow crazy sizes */
+   if (unlikely(size > INT_MAX)) {
+   WARN_ON_ONCE(!(flags & __GFP_NOWARN));
+   return NULL;
+   }
+
+   /*
+* kvmalloc() can always use VM_ALLOW_HUGE_VMAP,
+* since the callers already cannot assume anything
+* about the resulting pointer, and cannot play
+* protection games.
+*/
+   return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
+   flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+   node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kvmalloc_node_noprof);
+
+/**
+ * kvfree() - Free memory.
+ * @addr: Pointer to allocated memory.
+ *
+ * kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc().
+ * It is slightly more efficient to use kfree() or vfree() if you are certain
+ * that you know which one to use.
+ *
+ * Context: Either preemptible task context or not-NMI interrupt.
+ */
+void kvfree(const void *addr)
+{
+   if (is_vmalloc_addr(addr))
+   vfree(addr);
+   else
+   kfree(addr);
+}
+EXPORT_SYMBOL(kvfree);
+
+/**
+ * kvfree_sensitive - Free a data object containing sensitive information.
+ * @addr: address of the data object to be freed.
+ * @len: length of the data object.
+ *
+ * Use the special memzero_explicit() function to clear the content of a
+ * kvmalloc'ed object containing sensitive data to make sure that the
+ * compiler won't optimize out the data clearing.
+ */
+void kvfree_sensitive(const void *addr, size_t len)
+{
+   if (likely(!ZERO_OR_NULL_PTR(addr))) {
+   memzero_explicit((void *)addr, len);
+   kvfree(addr);
+   }
+}
+EXPORT_SYMBOL(kvfree_sensitive);
+
+/**
+ * kvrealloc - reallocate memory; contents remain unchanged
+ * @p: object to reallocate memory for
+ * @size: the size to reallocate
+ * @flags: the flags for the page level allocator
+ *
+ * If @p is %NULL, kvrealloc() behaves exactly 

[PATCH v3 0/2] Refine kmalloc caches randomization in kvmalloc

2025-02-12 Thread GONG Ruiqi
Hi,

v3:
  - move all the way from kmalloc_gfp_adjust to kvrealloc_noprof into
mm/slub.c
  - some rewording for commit logs
v2: https://lore.kernel.org/all/20250208014723.1514049-1-gongrui...@huawei.com/
  - change the implementation as Vlastimil suggested
v1: https://lore.kernel.org/all/20250122074817.991060-1-gongrui...@huawei.com/

Tamás reported [1] that kmalloc cache randomization doesn't actually
work for those kmalloc invoked via kvmalloc. For more details, see the
commit log of patch 2.

The current solution requires a direct call from __kvmalloc_node_noprof
to __do_kmalloc_node, a static function in a different .c file. As
suggested by Vlastimil [2], it's achieved by simply moving
__kvmalloc_node_noprof from mm/util.c to mm/slub.c, together with some
other functions of the same family.

Link: 
https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259
 [1]
Link: https://lore.kernel.org/all/62044279-0c56-4185-97f7-7afac65ff...@suse.cz/ 
[2]

GONG Ruiqi (2):
  slab: Adjust placement of __kvmalloc_node_noprof
  slab: Achieve better kmalloc caches randomization in kvmalloc

 mm/slub.c | 162 ++
 mm/util.c | 162 --
 2 files changed, 162 insertions(+), 162 deletions(-)

-- 
2.25.1




Re: [PATCH] net/mlx4_core: Avoid impossible mlx4_db_alloc() order value

2025-02-12 Thread Justin Stitt
On Tue, Feb 11, 2025 at 6:22 AM Tariq Toukan  wrote:
>
>
>
> On 11/02/2025 2:01, Justin Stitt wrote:
> > On Mon, Feb 10, 2025 at 09:45:05AM -0800, Kees Cook wrote:
> >> GCC can see that the value range for "order" is capped, but this leads
> >> it to consider that it might be negative, leading to a false positive
> >> warning (with GCC 15 with -Warray-bounds -fdiagnostics-details):
> >>
> >> ../drivers/net/ethernet/mellanox/mlx4/alloc.c:691:47: error: array 
> >> subscript -1 is below array bounds of 'long unsigned int *[2]' 
> >> [-Werror=array-bounds=]
> >>691 | i = find_first_bit(pgdir->bits[o], 
> >> MLX4_DB_PER_PAGE >> o);
> >>|~~~^~~
> >>'mlx4_alloc_db_from_pgdir': events 1-2
> >>691 | i = find_first_bit(pgdir->bits[o], 
> >> MLX4_DB_PER_PAGE >> o);| 
> >> ^
> >>| | |   
> >> | |
> >>  (2) out of array bounds here
> >>| (1) when the condition is evaluated to true   
> >>   In file included from 
> >> ../drivers/net/ethernet/mellanox/mlx4/mlx4.h:53,
> >>   from ../drivers/net/ethernet/mellanox/mlx4/alloc.c:42:
> >> ../include/linux/mlx4/device.h:664:33: note: while referencing 'bits'
> >>664 | unsigned long  *bits[2];
> >>| ^~~~
> >>
> >> Switch the argument to unsigned int, which removes the compiler needing
> >> to consider negative values.
> >>
> >> Signed-off-by: Kees Cook 
> >> ---
> >> Cc: Tariq Toukan 
> >> Cc: Andrew Lunn 
> >> Cc: "David S. Miller" 
> >> Cc: Eric Dumazet 
> >> Cc: Jakub Kicinski 
> >> Cc: Paolo Abeni 
> >> Cc: Yishai Hadas 
> >> Cc: net...@vger.kernel.org
> >> Cc: linux-r...@vger.kernel.org
> >> ---
> >>   drivers/net/ethernet/mellanox/mlx4/alloc.c | 6 +++---
> >>   include/linux/mlx4/device.h| 2 +-
> >>   2 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c 
> >> b/drivers/net/ethernet/mellanox/mlx4/alloc.c
> >> index b330020dc0d6..f2bded847e61 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
> >> @@ -682,9 +682,9 @@ static struct mlx4_db_pgdir 
> >> *mlx4_alloc_db_pgdir(struct device *dma_device)
> >>   }
> >>
> >>   static int mlx4_alloc_db_from_pgdir(struct mlx4_db_pgdir *pgdir,
> >> -struct mlx4_db *db, int order)
> >> +struct mlx4_db *db, unsigned int order)
> >>   {
> >> -int o;
> >> +unsigned int o;
> >>  int i;
> >>
> >>  for (o = order; o <= 1; ++o) {
> >
> >^ Knowing now that @order can only be 0 or 1 can this for loop (and
> >goto) be dropped entirely?
> >
>
> Maybe I'm missing something...
> Can you please explain why you think this can be dropped?

I meant "rewritten to use two if statements" instead of "dropped". I
think "replaced" or "refactored" was the word I wanted.

>
>
> >The code is already short and sweet so I don't feel strongly either
> >way.
> >
> >> @@ -712,7 +712,7 @@ static int mlx4_alloc_db_from_pgdir(struct 
> >> mlx4_db_pgdir *pgdir,
> >>  return 0;
> >>   }
> >>
> >> -int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
> >> +int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, unsigned int 
> >> order)
> >>   {
> >>  struct mlx4_priv *priv = mlx4_priv(dev);
> >>  struct mlx4_db_pgdir *pgdir;
> >> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> >> index 27f42f713c89..86f0f2a25a3d 100644
> >> --- a/include/linux/mlx4/device.h
> >> +++ b/include/linux/mlx4/device.h
> >> @@ -1135,7 +1135,7 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct 
> >> mlx4_mtt *mtt,
> >>   int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
> >> struct mlx4_buf *buf);
> >>
> >> -int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order);
> >> +int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, unsigned int 
> >> order);
> >>   void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db);
> >>
> >>   int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources 
> >> *wqres,
> >> --
> >> 2.34.1
> >>
> >
> > Justin
> >
>



Re: [PATCH net-next v2] net: Add options as a flexible array to struct ip_tunnel_info

2025-02-12 Thread Jakub Kicinski
On Wed, 12 Feb 2025 20:13:28 +0200 Gal Pressman wrote:
> > You could leave this macro inplace and just change `(info) + 1` to
> > `(info)->options` avoiding changes in lots of files and adding casts
> > everywhere.  

+1

> I'd rather not, having a macro to do 'info->options' doesn't help code
> readability IMHO.

It'd avoid having to explicitly cast the types everywhere, too.



Re: [RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Jann Horn
On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper  wrote:
> >> Assuming this is an issue you all feel is worth addressing, I will
> >> continue working on providing a patch. I'm concerned though that the
> >> overhead from adding a wrmsr on both syscall entry and exit to
> >> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> >> any feedback in regards to the approach or suggestions of alternate
> >> approaches to patching are welcome :)
> >
> > Since the kernel, as far as I understand, uses FineIBT without
> > backwards control flow protection (in other words, I think we assume
> > that the kernel stack is trusted?),
>
> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> the mess around NMI re-entrancy (and IST more generally) requires ROP
> gadgets in order to function safely.  Implementing this with shadow
> stacks active, while not impossible, is deemed to be prohibitively
> complicated.
>
> Linux's supervisor shadow stack support is waiting for FRED support,
> which fixes both the NMI re-entrancy problem, and other exceptions
> nesting within NMIs, as well as prohibiting the use of the SWAPGS
> instruction as FRED tries to make sure that the correct GS is always in
> context.
>
> But, FRED support is slated for PantherLake/DiamondRapids which haven't
> shipped yet, so are no use to the problem right now.
>
> > could we build a cheaper
> > check on that basis somehow? For example, maybe we could do something like:
> >
> > ```
> > endbr64
> > test rsp, rsp
> > js slowpath
> > swapgs
> > ```
>
> I presume it's been pointed out already, but there are 3 related
> entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
> which are related.
>
> But, any other IDT entry is in a similar bucket.  If we're corrupting a
> function pointer or return address to redirect here, then the check of
> CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
> stack frame.
>
> For IDT entries, checking %rsp is reasonable, because userspace can't
> forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
> entirely attacker controlled (and even potentially non-canonical), so
> I'm wondering what you hand in mind for the slowpath to truly
> distinguish kernel context from user context?

Hm, yeah, that seems hard - maybe the best we could do is to make sure
that the inactive gsbase has the correct value for our CPU's kernel
gsbase? Kinda like a paranoid_entry, except more painful because we'd
first have to figure out a place to spill registers to before we can
start using stuff like rdmsr... Then a function pointer overwrite
might still turn into returning to userspace with a sysret with GPRs
full of kernel pointers, but at least we wouldn't run off of a bogus
gsbase anymore?



Re: [RFC] Circumventing FineIBT Via Entrypoints

2025-02-12 Thread Andrew Cooper
>> Assuming this is an issue you all feel is worth addressing, I will
>> continue working on providing a patch. I'm concerned though that the
>> overhead from adding a wrmsr on both syscall entry and exit to
>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
>> any feedback in regards to the approach or suggestions of alternate
>> approaches to patching are welcome :) 
>
> Since the kernel, as far as I understand, uses FineIBT without
> backwards control flow protection (in other words, I think we assume
> that the kernel stack is trusted?),

This is fun indeed.  Linux cannot use supervisor shadow stacks because
the mess around NMI re-entrancy (and IST more generally) requires ROP
gadgets in order to function safely.  Implementing this with shadow
stacks active, while not impossible, is deemed to be prohibitively
complicated.

Linux's supervisor shadow stack support is waiting for FRED support,
which fixes both the NMI re-entrancy problem, and other exceptions
nesting within NMIs, as well as prohibiting the use of the SWAPGS
instruction as FRED tries to make sure that the correct GS is always in
context.

But, FRED support is slated for PantherLake/DiamondRapids which haven't
shipped yet, so are no use to the problem right now.

> could we build a cheaper
> check on that basis somehow? For example, maybe we could do something like:
>
> ```
> endbr64
> test rsp, rsp
> js slowpath
> swapgs
> ```

I presume it's been pointed out already, but there are 3 related
entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
which are related.

But, any other IDT entry is in a similar bucket.  If we're corrupting a
function pointer or return address to redirect here, then the check of
CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
stack frame.

For IDT entries, checking %rsp is reasonable, because userspace can't
forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
entirely attacker controlled (and even potentially non-canonical), so
I'm wondering what you hand in mind for the slowpath to truly
distinguish kernel context from user context?

~Andrew



RE: [PATCH] wifi: ath12k: Fix uninitialized variable and remove goto

2025-02-12 Thread Ping-Ke Shih
Jeff Johnson  wrote:
> On 2/10/2025 8:09 PM, Ping-Ke Shih wrote:
> 
>  Addresses-Coverity-ID: 1642337 ("Uninitialized scalar variable")
> >>>
> >>> Is that an official kernel tag? IMO the proper tag would be
> >> So, it isn't "official" as far as I can tell, but it is widely used in
> >> other commits, especially by Gustavo Silva.
> >>
> >> Also:
> >>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=778e2478d19574508861bcb
> >> 1806d6e34d095587c
> >>
> >> Coverity-IDs: is another option I have found. I have seen Closes: a few
> >> times as well. I'm not really sure what the best option is, honestly.
> >
> > In my patch, I used and treated Addresses-Coverity-ID as a unofficial tag,
> > so additional empty line is added.
> >
> > Days ago I have received Coverity issues sent to mailing list, so I used
> > Closes tag at that time. But recently I have not seen that kind of mails.
> > Instead, I visit Coverity web site to check issues and use
> > Addresses-Coverity-ID tag, since Coverity link is not visible to everyone.
> > That is just my thought.
> 
> The problem I have is that I get Coverity fixes both from the linux and the
> linux-next projects:
> 
> https://scan.coverity.com/projects/linux?tab=overview
> https://scan.coverity.com/projects/linux-next-weekly-scan?tab=overview
> 
> The Coverity IDs from these projects are allocated independently, so a
> Coverity ID does not uniquely identify an issue.
> 
> The URL uniquely identifies an issue, and also utilizes an official tag.
> 
> That is my thought.

Yes, I have the same problem as yours. For me, I only annotate Coverity
IDs from linux project, and the linux-next project is as a reference
to check if issues are still existing in -next tree.




Re: [PATCH v2] wifi: ath12k: cleanup ath12k_mac_mlo_ready()

2025-02-12 Thread Jeff Johnson
On 2/10/2025 6:49 PM, Ethan Carter Edwards wrote:
> There is a possibility for an uninitialized *ret* variable to be
> returned in some code paths.
> 
> This explicitly returns 0 without an error. Also removes goto that
> returned *ret* and simply returns in place.
> 
> Closes: 
> https://scan5.scan.coverity.com/#/project-view/63541/10063?selectedIssue=1642337
> Fixes: b716a10d99a28 ("wifi: ath12k: enable MLO setup and teardown from core")

In the pending branch I changed this to:
Fixes: b716a10d99a2 ("wifi: ath12k: enable MLO setup and teardown from core")

To fix the checkpatch issue:
WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("")' - ie: 'Fixes: b716a10d99a2 ("wifi: ath12k: enable MLO setup and 
teardown from core")'

https://git.kernel.org/pub/scm/linux/kernel/git/ath/ath.git/commit/?h=pending&id=c2f7ae223cd3d781c69337dc804f1fae95789cdd




[PATCH] scsi: hpsa: Replace deprecated strncpy() with strscpy()

2025-02-12 Thread Thorsten Blum
strncpy() is deprecated for NUL-terminated destination buffers [1]. Use
strscpy() instead and remove the manual NUL-termination.

Use min() to simplify the size calculation.

Compile-tested only.

Link: 
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
 [1]
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Thorsten Blum 
Suggested-by: Bart Van Assche 
---
 drivers/scsi/hpsa.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 84d8de07b7ae..9399e101f150 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -460,9 +460,8 @@ static ssize_t host_store_hp_ssd_smart_path_status(struct 
device *dev,
 
if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
return -EACCES;
-   len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count;
-   strncpy(tmpbuf, buf, len);
-   tmpbuf[len] = '\0';
+   len = min(count, sizeof(tmpbuf) - 1);
+   strscpy(tmpbuf, buf, len);
if (sscanf(tmpbuf, "%d", &status) != 1)
return -EINVAL;
h = shost_to_hba(shost);
@@ -484,9 +483,8 @@ static ssize_t host_store_raid_offload_debug(struct device 
*dev,
 
if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
return -EACCES;
-   len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count;
-   strncpy(tmpbuf, buf, len);
-   tmpbuf[len] = '\0';
+   len = min(count, sizeof(tmpbuf) - 1);
+   strscpy(tmpbuf, buf, len);
if (sscanf(tmpbuf, "%d", &debug_level) != 1)
return -EINVAL;
if (debug_level < 0)
-- 
2.48.1




Re: [PATCH] wifi: ath12k: Fix uninitialized variable and remove goto

2025-02-12 Thread Jeff Johnson
On 2/10/2025 8:09 PM, Ping-Ke Shih wrote:

 Addresses-Coverity-ID: 1642337 ("Uninitialized scalar variable")
>>>
>>> Is that an official kernel tag? IMO the proper tag would be
>> So, it isn't "official" as far as I can tell, but it is widely used in
>> other commits, especially by Gustavo Silva.
>>
>> Also:
>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=778e2478d19574508861bcb
>> 1806d6e34d095587c
>>
>> Coverity-IDs: is another option I have found. I have seen Closes: a few
>> times as well. I'm not really sure what the best option is, honestly.
> 
> In my patch, I used and treated Addresses-Coverity-ID as a unofficial tag,
> so additional empty line is added.
> 
> Days ago I have received Coverity issues sent to mailing list, so I used
> Closes tag at that time. But recently I have not seen that kind of mails. 
> Instead, I visit Coverity web site to check issues and use
> Addresses-Coverity-ID tag, since Coverity link is not visible to everyone.
> That is just my thought. 

The problem I have is that I get Coverity fixes both from the linux and the
linux-next projects:

https://scan.coverity.com/projects/linux?tab=overview
https://scan.coverity.com/projects/linux-next-weekly-scan?tab=overview

The Coverity IDs from these projects are allocated independently, so a
Coverity ID does not uniquely identify an issue.

The URL uniquely identifies an issue, and also utilizes an official tag.

That is my thought.

/jeff



Re: [PATCH v2 6/6] unicode: kunit: change tests filename and path

2025-02-12 Thread Kees Cook
On Wed, Feb 12, 2025 at 10:40:34AM -0500, Gabriel Krisman Bertazi wrote:
> Kees Cook  writes:
> 
> > From: Gabriela Bittencourt 
> >
> > Change utf8 kunit test filename and path to follow the style
> > convention on Documentation/dev-tools/kunit/style.rst
> >
> > Co-developed-by: Pedro Orlando 
> > Signed-off-by: Pedro Orlando 
> > Co-developed-by: Danilo Pereira 
> > Signed-off-by: Danilo Pereira 
> > Signed-off-by: Gabriela Bittencourt 
> > Reviewed-by: David Gow 
> > Acked-by: Gabriel Krisman Bertazi 
> > Reviewed-by: Shuah Khan 
> > Reviewed-by: Rae Moar 
> > Link: https://lore.kernel.org/r/20241202075545.3648096-7-david...@google.com
> > Signed-off-by: Kees Cook 
> 
> Hi Kees,
> 
> Last time this was submitted, I have reported this patch breaks the
> build when the kunit is built as a module, and it wasn't fixed before
> resubmission.  Thorsten Leemhuis just reported the same issue is now on
> linux-next and I confirmed it is still there.
> 
> https://lore.kernel.org/lkml/2c26c5e4-9cf3-4020-b0be-637dc826b...@leemhuis.info/T/#m2e2ddb3b2600274e5eae50b93ea821914dc85f20
> 
> The build error is:
> 
> fs/unicode/tests/utf8_kunit.c:11:10: fatal error: utf8n.h: No such file or 
> directory
>11 | #include "utf8n.h"
>   |  ^
> 
> The problem is that the patch moves utf8_kunit.c into tests/ but doesn't
> fix the include line above.  Can you fix it in your tree since it is
> already merged?

Ah-ha! Thanks. Yes, I've updated this now.

-Kees

> 
> Thanks,
> 
> > ---
> >  fs/unicode/Makefile| 2 +-
> >  fs/unicode/{ => tests}/.kunitconfig| 0
> >  fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} | 0
> >  3 files changed, 1 insertion(+), 1 deletion(-)
> >  rename fs/unicode/{ => tests}/.kunitconfig (100%)
> >  rename fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} (100%)
> >
> > diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
> > index 37bbcbc628a1..d95be7fb9f6b 100644
> > --- a/fs/unicode/Makefile
> > +++ b/fs/unicode/Makefile
> > @@ -4,7 +4,7 @@ ifneq ($(CONFIG_UNICODE),)
> >  obj-y  += unicode.o
> >  endif
> >  obj-$(CONFIG_UNICODE)  += utf8data.o
> > -obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += utf8-selftest.o
> > +obj-$(CONFIG_UNICODE_NORMALIZATION_KUNIT_TEST) += tests/utf8_kunit.o
> >  
> >  unicode-y := utf8-norm.o utf8-core.o
> >  
> > diff --git a/fs/unicode/.kunitconfig b/fs/unicode/tests/.kunitconfig
> > similarity index 100%
> > rename from fs/unicode/.kunitconfig
> > rename to fs/unicode/tests/.kunitconfig
> > diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/tests/utf8_kunit.c
> > similarity index 100%
> > rename from fs/unicode/utf8-selftest.c
> > rename to fs/unicode/tests/utf8_kunit.c
> 
> -- 
> Gabriel Krisman Bertazi

-- 
Kees Cook



Re: [RFC PATCH v5 0/7] mseal system mappings

2025-02-12 Thread Kees Cook
On Wed, Feb 12, 2025 at 11:24:35AM +, Lorenzo Stoakes wrote:
> On Wed, Feb 12, 2025 at 03:21:48AM +, jef...@chromium.org wrote:
> > From: Jeff Xu 
> >
> > The commit message in the first patch contains the full description of
> > this series.
> 
> Sorry to nit, but it'd be useful to reproduce in the cover letter too! But
> this obviously isn't urgent, just be nice when we un-RFC.

I advised Jeff against this because I've found it can sometimes cause
"thread splitting" in that some people reply to the cover letter, and
some people reply to the first patch, etc. I've tended to try to keep
cover letters very general, with the bulk of the prose in the first
patch.

> It'd be nice to update the documentation to have a list of 'known
> problematic userland software with sealed VDSO' so we make people aware.

I like this idea! Probably in mseal.rst, as the Kconfig help already
points there.

-Kees

-- 
Kees Cook