Re: "test_ip_fast_csum: ASSERTION FAILED at lib/checksum_kunit.c:589" at boot with CONFIG_CHECKSUM_KUNIT=y enabled on a Talos II, kernel 6.8-rc5
On Fri, Feb 23, 2024 at 05:59:07AM +, Christophe Leroy wrote:
> Hi Erhard, hi Charlie,
>
> Le 23/02/2024 à 02:26, Erhard Furtner a écrit :
> > Greetings!
> >
> > Looks like my Talos II (running a BE kernel+system) fails some of the
> > kernels internal unit tests. One of the failing tests is checksum_kunit,
> > enabled via CONFIG_CHECKSUM_KUNIT=y:
> >
> > [...]
> > KTAP version 1
> > # Subtest: checksum
> > # module: checksum_kunit
> > 1..5
> > entry-flush: disabled on command line.
> > ok 1 test_csum_fixed_random_inputs
> > ok 2 test_csum_all_carry_inputs
> > ok 3 test_csum_no_carry_inputs
> > # test_ip_fast_csum: ASSERTION FAILED at lib/checksum_kunit.c:589
> > Expected (__u64)expected == (__u64)csum_result, but
> > (__u64)expected == 55939 (0xda83)
> > (__u64)csum_result == 33754 (0x83da)
> > not ok 4 test_ip_fast_csum
> > # test_csum_ipv6_magic: ASSERTION FAILED at lib/checksum_kunit.c:617
> > Expected (__u64)expected_csum_ipv6_magic[i] ==
> > (__u64)csum_ipv6_magic(saddr, daddr, len, proto, csum), but
> > (__u64)expected_csum_ipv6_magic[i] == 6356 (0x18d4)
> > (__u64)csum_ipv6_magic(saddr, daddr, len, proto, csum) == 43586 (0xaa42)
> > not ok 5 test_csum_ipv6_magic
> > # checksum: pass:3 fail:2 skip:0 total:5
> > # Totals: pass:3 fail:2 skip:0 total:5
> > not ok 4 checksum
> > [...]
> >
> > Full dmesg + kernel .config attached.
>
> Looks like the same problem as the one I fixed with commit b38460bc463c
> ("kunit: Fix checksum tests on big endian CPUs")
>
> The new tests implemented through commit 6f4c45cbcb00 ("kunit: Add tests
> for csum_ipv6_magic and ip_fast_csum") create a lot of type issues as
> reported by sparse when built with C=2 (see below).
>
> Once those issues are fixed, it should work.
>
> Charlie, can you provide a fix ?
>
> Thanks,
> Christophe

The "lib: checksum: Fix issues with checksum tests" patch should fix all
of these issues [1].
[1] https://lore.kernel.org/all/20240221-fix_sparse_errors_checksum_tests-v9-1-bff4d73ab...@rivosinc.com/T/#m189783a9b2a7d12e3c34c4a412e65408658db2c9

- Charlie

> >   CC      lib/checksum_kunit.o
> >   CHECK   lib/checksum_kunit.c
> lib/checksum_kunit.c:219:9: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:9:    expected restricted __sum16
> lib/checksum_kunit.c:219:9:    got int
> lib/checksum_kunit.c:219:17: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:17:    expected restricted __sum16
> lib/checksum_kunit.c:219:17:    got int
> lib/checksum_kunit.c:219:25: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:25:    expected restricted __sum16
> lib/checksum_kunit.c:219:25:    got int
> lib/checksum_kunit.c:219:33: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:33:    expected restricted __sum16
> lib/checksum_kunit.c:219:33:    got int
> lib/checksum_kunit.c:219:41: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:41:    expected restricted __sum16
> lib/checksum_kunit.c:219:41:    got int
> lib/checksum_kunit.c:219:49: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:49:    expected restricted __sum16
> lib/checksum_kunit.c:219:49:    got int
> lib/checksum_kunit.c:219:57: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:57:    expected restricted __sum16
> lib/checksum_kunit.c:219:57:    got int
> lib/checksum_kunit.c:219:65: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:65:    expected restricted __sum16
> lib/checksum_kunit.c:219:65:    got int
> lib/checksum_kunit.c:219:73: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:219:73:    expected restricted __sum16
> lib/checksum_kunit.c:219:73:    got int
> lib/checksum_kunit.c:220:9: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:220:9:    expected restricted __sum16
> lib/checksum_kunit.c:220:9:    got int
> lib/checksum_kunit.c:220:17: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:220:17:    expected restricted __sum16
> lib/checksum_kunit.c:220:17:    got int
> lib/checksum_kunit.c:220:25: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:220:25:    expected restricted __sum16
> lib/checksum_kunit.c:220:25:    got int
> lib/checksum_kunit.c:220:33: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:220:33:    expected restricted __sum16
> lib/checksum_kunit.c:220:33:    got int
> lib/checksum_kunit.c:220:41: warning: incorrect type in initializer (different base types)
> lib/checksum_kunit.c:220:41:    expected restricted __sum16
> li
Re: "test_ip_fast_csum: ASSERTION FAILED at lib/checksum_kunit.c:589" at boot with CONFIG_CHECKSUM_KUNIT=y enabled on a Talos II, kernel 6.8-rc5
On Fri, Feb 23, 2024 at 06:58:14AM +, Christophe Leroy wrote:
>
>
> Le 23/02/2024 à 07:12, Charlie Jenkins a écrit :
> > On Fri, Feb 23, 2024 at 05:59:07AM +, Christophe Leroy wrote:
> >> Hi Erhard, hi Charlie,
> >>
> >> Le 23/02/2024 à 02:26, Erhard Furtner a écrit :
> >>> Greetings!
> >>>
> >>> Looks like my Talos II (running a BE kernel+system) fails some of the
> >>> kernels internal unit tests. One of the failing tests is checksum_kunit,
> >>> enabled via CONFIG_CHECKSUM_KUNIT=y:
> >>>
> >>> [... KTAP output quoted above ...]
> >>>
> >>> Full dmesg + kernel .config attached.
> >>
> >> Looks like the same problem as the one I fixed with commit b38460bc463c
> >> ("kunit: Fix checksum tests on big endian CPUs")
> >>
> >> The new tests implemented through commit 6f4c45cbcb00 ("kunit: Add tests
> >> for csum_ipv6_magic and ip_fast_csum") create a lot of type issues as
> >> reported by sparse when built with C=2 (see below).
> >>
> >> Once those issues are fixed, it should work.
> >>
> >> Charlie, can you provide a fix ?
> >>
> >> Thanks,
> >> Christophe
> >
> > The "lib: checksum: Fix issues with checksum tests" patch should fix all of
> > these issues [1].
> >
> > [1]
> > https://lore.kernel.org/all/20240221-fix_sparse_errors_checksum_tests-v9-1-bff4d73ab...@rivosinc.com/T/#m189783a9b2a7d12e3c34c4a412e65408658db2c9
>
> It doesn't fix the issues, I still get the following with your patch 1/2
> applied:
>
> [    6.893141] KTAP version 1
> [    6.896118] 1..1
> [    6.897764] KTAP version 1
> [    6.900800] # Subtest: checksum
> [    6.904518] # module: checksum_kunit
> [    6.904601] 1..5
> [    7.139784] ok 1 test_csum_fixed_random_inputs
> [    7.590056] ok 2 test_csum_all_carry_inputs
> [    8.064415] ok 3 test_csum_no_carry_inputs
> [    8.070065] # test_ip_fast_csum: ASSERTION FAILED at
> lib/checksum_kunit.c:589
> [    8.070065] Expected (__u64)expected == (__u64)csum_result, but
> [    8.070065] (__u64)expected == 55939 (0xda83)
> [    8.070065] (__u64)csum_result == 33754 (0x83da)
> [    8.075836] not ok 4 test_ip_fast_csum
> [    8.101039] # test_csum_ipv6_magic: ASSERTION FAILED at
> lib/checksum_kunit.c:617
> [    8.101039] Expected (__u64)(__sum16)expected_csum_ipv6_magic[i]
> == (__u64)csum_ipv6_magic(saddr, daddr, len, proto, (__wsum)csum), but
> [    8.101039] (__u64)(__sum16)expected_csum_ipv6_magic[i] == 6356 (0x18d4)
> [    8.101039] (__u64)csum_ipv6_magic(saddr, daddr, len, proto,
> (__wsum)csum) == 43586 (0xaa42)
> [    8.106446] not ok 5 test_csum_ipv6_magic
> [    8.143829] # checksum: pass:3 fail:2 skip:0 total:5
> [    8.148334] # Totals: pass:3 fail:2 skip:0 total:5
> [    8.153173] not ok 1 checksum
>
> All your patch does is to hide the sparse warnings. But forcing a cast
> doesn't fix byte orders.
>
> Please have a look at commit b38460bc463c ("kunit: Fix checksum tests on
> big endian CPUs"), there are helpers to put checksums in the correct
> byte order.
>
> Christophe

Well that's what the second patch is for. Is it failing with the second
patch applied?

- Charlie
Re: "test_ip_fast_csum: ASSERTION FAILED at lib/checksum_kunit.c:589" at boot with CONFIG_CHECKSUM_KUNIT=y enabled on a Talos II, kernel 6.8-rc5
On Fri, Feb 23, 2024 at 09:06:56AM +, Christophe Leroy wrote:
>
>
> Le 23/02/2024 à 08:00, Charlie Jenkins a écrit :
> > On Fri, Feb 23, 2024 at 06:58:14AM +, Christophe Leroy wrote:
> >>
> >>
> >> Le 23/02/2024 à 07:12, Charlie Jenkins a écrit :
> >>> On Fri, Feb 23, 2024 at 05:59:07AM +, Christophe Leroy wrote:
> >>>> Hi Erhard, hi Charlie,
> >>>>
> >>>> Le 23/02/2024 à 02:26, Erhard Furtner a écrit :
> >>>>> Greetings!
> >>>>>
> >>>>> Looks like my Talos II (running a BE kernel+system) fails some of the
> >>>>> kernels internal unit tests. One of the failing tests is
> >>>>> checksum_kunit, enabled via CONFIG_CHECKSUM_KUNIT=y:
> >>>>>
> >>>>> [... KTAP output quoted above ...]
> >>>>>
> >>>>> Full dmesg + kernel .config attached.
> >>>>
> >>>> Looks like the same problem as the one I fixed with commit b38460bc463c
> >>>> ("kunit: Fix checksum tests on big endian CPUs")
> >>>>
> >>>> The new tests implemented through commit 6f4c45cbcb00 ("kunit: Add tests
> >>>> for csum_ipv6_magic and ip_fast_csum") create a lot of type issues as
> >>>> reported by sparse when built with C=2 (see below).
> >>>>
> >>>> Once those issues are fixed, it should work.
> >>>>
> >>>> Charlie, can you provide a fix ?
> >>>>
> >>>> Thanks,
> >>>> Christophe
> >>>
> >>> The "lib: checksum: Fix issues with checksum tests" patch should fix all
> >>> of these issues [1].
> >>>
> >>> [1]
> >>> https://lore.kernel.org/all/20240221-fix_sparse_errors_checksum_tests-v9-1-bff4d73ab...@rivosinc.com/T/#m189783a9b2a7d12e3c34c4a412e65408658db2c9
> >>
> >> It doesn't fix the issues, I still get the following with your patch 1/2
> >> applied:
> >>
> >> [    6.893141] KTAP version 1
> >> [    6.896118] 1..1
> >> [    6.897764] KTAP version 1
> >> [    6.900800] # Subtest: checksum
> >> [    6.904518] # module: checksum_kunit
> >> [    6.904601] 1..5
> >> [    7.139784] ok 1 test_csum_fixed_random_inputs
> >> [    7.590056] ok 2 test_csum_all_carry_inputs
> >> [    8.064415] ok 3 test_csum_no_carry_inputs
> >> [    8.070065] # test_ip_fast_csum: ASSERTION FAILED at
> >> lib/checksum_kunit.c:589
> >> [    8.070065] Expected (__u64)expected == (__u64)csum_result, but
> >> [    8.070065] (__u64)expected == 55939 (0xda83)
> >> [    8.070065] (__u64)csum_result == 33754 (0x83da)
> >> [    8.075836] not ok 4 test_ip_fast_csum
> >> [    8.101039] # test_csum_ipv6_magic: ASSERTION FAILED at
> >> lib/checksum_kunit.c:617
> >> [    8.101039] Expected (__u64)(__sum16)expected_csum_ipv6_magic[i]
> >> == (__u64)csum_ipv6_magic(saddr, dadd
Re: [PATCH v9 2/2] lib: checksum: Use aligned accesses for ip_fast_csum and csum_ipv6_magic tests
On Fri, Feb 23, 2024 at 10:06:54AM +, Christophe Leroy wrote:
>
>
> Le 22/02/2024 à 03:55, Charlie Jenkins a écrit :
> > The test cases for ip_fast_csum and csum_ipv6_magic were failing on a
> > variety of architectures that are big endian or do not support
> > misaligned accesses. Both of these test cases are changed to support big
> > and little endian architectures.
>
> It is unclear. The endianness issue and the alignment issue are two
> independent subjects that should be handled in separate patches.
>
> According to the subject of this patch, only misaligned accesses should
> be handled here. Endianness should have been fixed by patch 1.
>
> Also, would be nice to give an example of an architecture that has such
> a problem, and explain what the problem is exactly.
>
> > The test for ip_fast_csum is changed to align the data along (14 +
> > NET_IP_ALIGN) bytes which is the alignment of an IP header. The test for
> > csum_ipv6_magic aligns the data using a struct. An extra padding field
> > is added to the struct to ensure that the size of the struct is the same
> > on all architectures (44 bytes).
>
> What is the purpose of that padding ? You take fields one by one and
> never do anything with the full struct.

sizeof(struct csum_ipv6_magic_data) takes into account the full struct.

> > Fixes: 6f4c45cbcb00 ("kunit: Add tests for csum_ipv6_magic and
> > ip_fast_csum")
> > Signed-off-by: Charlie Jenkins
> > Reviewed-by: Guenter Roeck
> > Tested-by: Guenter Roeck
> > ---
> >  lib/checksum_kunit.c | 393 ++-
> >  1 file changed, 134 insertions(+), 259 deletions(-)
> >
> > diff --git a/lib/checksum_kunit.c b/lib/checksum_kunit.c
> > index 776ad3d6d5a1..f1b18e3628dd 100644
> > --- a/lib/checksum_kunit.c
> > +++ b/lib/checksum_kunit.c
> > @@ -13,8 +13,9 @@
> >
> >  #define IPv4_MIN_WORDS 5
> >  #define IPv4_MAX_WORDS 15
> > -#define NUM_IPv6_TESTS 200
> > -#define NUM_IP_FAST_CSUM_TESTS 181
> > +#define WORD_ALIGNMENT 4
>
> Is that macro really needed ? Can't you just use sizeof(u32) or
> something similar ?

It is for readability purposes. This was introduced to ensure that
alignment on a 32-bit boundary was happening, so I called this word
alignment.

> > +/* Ethernet headers are 14 bytes and NET_IP_ALIGN is used to align them */
> > +#define IP_ALIGNMENT (14 + NET_IP_ALIGN)
>
> Only if no VLAN.
>
> When using VLANs it is 4 bytes more. But why do you mind that at all ?

Architectures make assumptions about the alignment of the packets to
optimize code. Not doing this alignment will cause illegal misaligned
accesses on some ARM platforms. Yes, VLAN is ignored here, but this
alignment is required to be supported and that is what the test cases
are stressing.

> >  /* Values for a little endian CPU. Byte swap each half on big endian CPU. */
> >  static const u32 random_init_sum = 0x2847aab;

...

> > @@ -578,15 +451,19 @@ static void test_csum_no_carry_inputs(struct kunit *test)
> >  static void test_ip_fast_csum(struct kunit *test)
> >  {
> >  	__sum16 csum_result, expected;
> > -
> > -	for (int len = IPv4_MIN_WORDS; len < IPv4_MAX_WORDS; len++) {
> > -		for (int index = 0; index < NUM_IP_FAST_CSUM_TESTS; index++) {
> > -			csum_result = ip_fast_csum(random_buf + index, len);
> > -			expected = (__force __sum16)
> > -				expected_fast_csum[(len - IPv4_MIN_WORDS) *
> > -						   NUM_IP_FAST_CSUM_TESTS +
> > -						   index];
> > -			CHECK_EQ(expected, csum_result);
> > +	int num_tests = (MAX_LEN / WORD_ALIGNMENT - IPv4_MAX_WORDS * WORD_ALIGNMENT);
> > +
> > +	for (int i = 0; i < num_tests; i++) {
> > +		memcpy(&tmp_buf[IP_ALIGNMENT],
> > +		       random_buf + (i * WORD_ALIGNMENT),
> > +		       IPv4_MAX_WORDS * WORD_ALIGNMENT);
>
> That looks weird.

If you have constructive feedback then I would be happy to clarify.

> > +
> > +		for (int len = IPv4_MIN_WORDS; len <= IPv4_MAX_WORDS; len++) {
> > +			int index = (len - IPv4_MIN_WORDS) +
> > +				    i * ((IPv4_MAX_WORDS - IPv4_MIN_WORDS) + 1);
>
> Missing blank line after declaration.

> > +			csum_result = ip_fast_csum(tmp_buf + IP_ALIGNMENT, len);
>
Re: [PATCH v3 14/15] riscv: Add support for suppressing warning backtraces
On Wed, Apr 03, 2024 at 06:19:35AM -0700, Guenter Roeck wrote:
> Add name of functions triggering warning backtraces to the __bug_table
> object section to enable support for suppressing WARNING backtraces.
>
> To limit image size impact, the pointer to the function name is only added
> to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
> CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
> parameter is replaced with a (dummy) NULL parameter to avoid an image size
> increase due to unused __func__ entries (this is necessary because __func__
> is not a define but a virtual variable).
>
> To simplify the implementation, unify the __BUG_ENTRY_ADDR and
> __BUG_ENTRY_FILE macros into a single macro named __BUG_REL() which takes
> the address, file, or function reference as parameter.
>
> Tested-by: Linux Kernel Functional Testing
> Acked-by: Dan Carpenter
> Cc: Paul Walmsley
> Cc: Palmer Dabbelt
> Cc: Albert Ou
> Signed-off-by: Guenter Roeck
> ---
> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
> v3:
> - Rebased to v6.9-rc2
>
>  arch/riscv/include/asm/bug.h | 38
>  1 file changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
> index 1aaea81fb141..79f360af4ad8 100644
> --- a/arch/riscv/include/asm/bug.h
> +++ b/arch/riscv/include/asm/bug.h
> @@ -30,26 +30,39 @@
>  typedef u32 bug_insn_t;
>
>  #ifdef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
> -#define __BUG_ENTRY_ADDR	RISCV_INT " 1b - ."
> -#define __BUG_ENTRY_FILE	RISCV_INT " %0 - ."
> +#define __BUG_REL(val)	RISCV_INT " " __stringify(val) " - ."
>  #else
> -#define __BUG_ENTRY_ADDR	RISCV_PTR " 1b"
> -#define __BUG_ENTRY_FILE	RISCV_PTR " %0"
> +#define __BUG_REL(val)	RISCV_PTR " " __stringify(val)
>  #endif
>
>  #ifdef CONFIG_DEBUG_BUGVERBOSE
> +
> +#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
> +# define HAVE_BUG_FUNCTION
> +# define __BUG_FUNC_PTR	__BUG_REL(%1)
> +#else
> +# define __BUG_FUNC_PTR
> +#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
> +
>  #define __BUG_ENTRY				\
> -	__BUG_ENTRY_ADDR "\n\t"			\
> -	__BUG_ENTRY_FILE "\n\t"			\
> -	RISCV_SHORT " %1\n\t"			\
> -	RISCV_SHORT " %2"
> +	__BUG_REL(1b) "\n\t"			\
> +	__BUG_REL(%0) "\n\t"			\
> +	__BUG_FUNC_PTR "\n\t"			\
> +	RISCV_SHORT " %2\n\t"			\
> +	RISCV_SHORT " %3"
>  #else
>  #define __BUG_ENTRY				\
> -	__BUG_ENTRY_ADDR "\n\t"			\
> -	RISCV_SHORT " %2"
> +	__BUG_REL(1b) "\n\t"			\
> +	RISCV_SHORT " %3"
>  #endif
>
>  #ifdef CONFIG_GENERIC_BUG
> +#ifdef HAVE_BUG_FUNCTION
> +# define __BUG_FUNC	__func__
> +#else
> +# define __BUG_FUNC	NULL
> +#endif
> +
>  #define __BUG_FLAGS(flags)			\
>  do {						\
>  	__asm__ __volatile__ (			\
> @@ -58,10 +71,11 @@ do {				\
>  		".pushsection __bug_table,\"aw\"\n\t"	\
>  		"2:\n\t"			\
>  			__BUG_ENTRY "\n\t"	\
> -			".org 2b + %3\n\t"	\
> +			".org 2b + %4\n\t"	\
>  		".popsection"			\
>  		:				\
> -		: "i" (__FILE__), "i" (__LINE__),	\
> +		: "i" (__FILE__), "i" (__BUG_FUNC),	\
> +		  "i" (__LINE__),		\
>  		  "i" (flags),			\
>  		  "i" (sizeof(struct bug_entry)));	\
>  } while (0)
> --
> 2.39.2
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

Reviewed-by: Charlie Jenkins

- Charlie
[PATCH 00/16] mm: Introduce MAP_BELOW_HINT
Some applications rely on placing data in the free bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
restrict the address returned by mmap to be less than the maximum
address space, unless the hint address is greater than this value. On
arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag
allows applications a way to specify exactly how many bits they want to
be left unused by mmap. This eliminates the need for applications to
know the page table hierarchy of the system to be able to reason about
which addresses mmap will be allowed to return.

---

riscv made this feature of mmap returning addresses less than the hint
address the default behavior. This was in contrast to the implementation
of x86/arm64 that have a single boundary at the 5-level page table
region. However this restriction proved too great -- the reduced address
space when using a hint address was too small. A patch for riscv [1]
reverts the behavior that broke userspace.

This series serves to make this feature available to all architectures.
I have only tested on riscv and x86. There is a tremendous amount of
duplicated code in mmap, so I believe the implementations across
architectures should be mostly consistent. I added this feature to all
architectures that implement either
arch_get_mmap_end()/arch_get_mmap_base() or
arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added
it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().

Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe...@rivosinc.com/T/ [1]

To: Arnd Bergmann
To: Paul Walmsley
To: Palmer Dabbelt
To: Albert Ou
To: Catalin Marinas
To: Will Deacon
To: Michael Ellerman
To: Nicholas Piggin
To: Christophe Leroy
To: Naveen N Rao
To: Muchun Song
To: Andrew Morton
To: Liam R. Howlett
To: Vlastimil Babka
To: Lorenzo Stoakes
To: Thomas Gleixner
To: Ingo Molnar
To: Borislav Petkov
To: Dave Hansen
To: x...@kernel.org
To: H. Peter Anvin
To: Huacai Chen
To: WANG Xuerui
To: Russell King
To: Thomas Bogendoerfer
To: James E.J. Bottomley
To: Helge Deller
To: Alexander Gordeev
To: Gerald Schaefer
To: Heiko Carstens
To: Vasily Gorbik
To: Christian Borntraeger
To: Sven Schnelle
To: Yoshinori Sato
To: Rich Felker
To: John Paul Adrian Glaubitz
To: David S. Miller
To: Andreas Larsson
To: Shuah Khan
To: Alexandre Ghiti
Cc: linux-a...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: Palmer Dabbelt
Cc: linux-ri...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: Charlie Jenkins

---
Charlie Jenkins (16):
      mm: Add MAP_BELOW_HINT
      riscv: mm: Do not restrict mmap address based on hint
      mm: Add flag and len param to arch_get_mmap_base()
      mm: Add generic MAP_BELOW_HINT
      riscv: mm: Support MAP_BELOW_HINT
      arm64: mm: Support MAP_BELOW_HINT
      powerpc: mm: Support MAP_BELOW_HINT
      x86: mm: Support MAP_BELOW_HINT
      loongarch: mm: Support MAP_BELOW_HINT
      arm: mm: Support MAP_BELOW_HINT
      mips: mm: Support MAP_BELOW_HINT
      parisc: mm: Support MAP_BELOW_HINT
      s390: mm: Support MAP_BELOW_HINT
      sh: mm: Support MAP_BELOW_HINT
      sparc: mm: Support MAP_BELOW_HINT
      selftests/mm: Create MAP_BELOW_HINT test

 arch/arm/mm/mmap.c                           |  10 +
 arch/arm64/include/asm/processor.h           |  34 ++-
 arch/loongarch/mm/mmap.c                     |  11 +
 arch/mips/mm/mmap.c                          |   9 +
 arch/parisc/include/uapi/asm/mman.h          |   1 +
 arch/parisc/kernel/sys_parisc.c              |   9 +
 arch/powerpc/include/asm/task_size_64.h      |  36 ++-
 arch/riscv/include/asm/processor.h           |  32 -
 arch/s390/mm/mmap.c                          |  10 
 arch/sh/mm/mmap.c                            |  10 
 arch/sparc/kernel/sys_sparc_64.c             |   8 +
 arch/x86/kernel/sys_x86_64.c                 |  25 ---
 fs/hugetlbfs/inode.c                         |   2 +-
 include/linux/sched/mm.h                     |  34 --
 include/uapi/asm-generic/mman-common.h       |   1 +
 mm/mmap.c                                    |   2 +-
 tools/arch/parisc/include/uapi/asm/mman.h    |   1 +
 tools/include/uapi/asm-generic/mman-common.h |   1 +
 tools/testing/selftests/mm/Makefile          |   1 +
 tools/testing/selftests/mm/map_below_hint.c  |  29 ++
 20 files changed, 216 insertions(+), 50 deletions(-)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
--
- Charlie
[PATCH 01/16] mm: Add MAP_BELOW_HINT
Some applications rely on placing data in the free bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
restrict the address returned by mmap to be less than the maximum
address space, unless the hint address is greater than this value.

To make this behavior explicit and more versatile across all
architectures, define a mmap flag that allows users to define an
arbitrary upper limit on addresses returned by mmap.

Signed-off-by: Charlie Jenkins
---
 include/uapi/asm-generic/mman-common.h       | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
+#define MAP_BELOW_HINT	0x800	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
+#define MAP_BELOW_HINT	0x800	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
-- 
2.45.0
[PATCH 02/16] riscv: mm: Do not restrict mmap address based on hint
The hint address should not forcefully restrict the addresses returned
by mmap, as this causes mmap to report ENOMEM when there is memory still
available.

Signed-off-by: Charlie Jenkins
Fixes: b5b4287accd7 ("riscv: mm: Use hint address in mmap if available")
Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57")
Closes: https://lore.kernel.org/linux-kernel/ZbxTNjQPFKBatMq+@ghost/T/#mccb1890466bf5a488c9ce7441e57e42271895765
---
This patch is pulled from [1] for ease of review of this series.

Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe...@rivosinc.com/T/ [1]
---
 arch/riscv/include/asm/processor.h | 28 +++-
 1 file changed, 3 insertions(+), 25 deletions(-)

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 8702b8721a27..1015f2a49917 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,36 +14,14 @@
 
 #include
 
-/*
- * addr is a hint to the maximum userspace address that mmap should provide, so
- * this macro needs to return the largest address space available so that
- * mmap_end < addr, being mmap_end the top of that address space.
- * See Documentation/arch/riscv/vm-layout.rst for more details.
- */
 #define arch_get_mmap_end(addr, len, flags)			\
 ({								\
-	unsigned long mmap_end;					\
-	typeof(addr) _addr = (addr);				\
-	if ((_addr) == 0 || is_compat_task() ||			\
-	    ((_addr + len) > BIT(VA_BITS - 1)))			\
-		mmap_end = STACK_TOP_MAX;			\
-	else							\
-		mmap_end = (_addr + len);			\
-	mmap_end;						\
+	STACK_TOP_MAX;						\
 })
 
-#define arch_get_mmap_base(addr, base)				\
+#define arch_get_mmap_base(addr, base, flags)			\
 ({								\
-	unsigned long mmap_base;				\
-	typeof(addr) _addr = (addr);				\
-	typeof(base) _base = (base);				\
-	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base);	\
-	if ((_addr) == 0 || is_compat_task() ||			\
-	    ((_addr + len) > BIT(VA_BITS - 1)))			\
-		mmap_base = (_base);				\
-	else							\
-		mmap_base = (_addr + len) - rnd_gap;		\
-	mmap_base;						\
+	base;							\
 })
 
 #ifdef CONFIG_64BIT
-- 
2.45.0
[PATCH 03/16] mm: Add flag and len param to arch_get_mmap_base()
The flag and len params are required in arch_get_mmap_base() to
implement MAP_BELOW_HINT.

Signed-off-by: Charlie Jenkins
---
 arch/arm64/include/asm/processor.h      | 2 +-
 arch/powerpc/include/asm/task_size_64.h | 2 +-
 arch/riscv/include/asm/processor.h      | 2 +-
 fs/hugetlbfs/inode.c                    | 2 +-
 include/linux/sched/mm.h                | 2 +-
 mm/mmap.c                               | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c..a67ca119bb91 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -95,7 +95,7 @@
 #define arch_get_mmap_end(addr, len, flags) \
 		(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)
 
-#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
+#define arch_get_mmap_base(addr, len, base, flags) ((addr > DEFAULT_MAP_WINDOW) ? \
 					base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
 					base)
 #endif /* CONFIG_ARM64_FORCE_52BIT */
diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
index 5a709951c901..239b363841aa 100644
--- a/arch/powerpc/include/asm/task_size_64.h
+++ b/arch/powerpc/include/asm/task_size_64.h
@@ -72,7 +72,7 @@
 #define STACK_TOP_MAX TASK_SIZE_USER64
 #define STACK_TOP (is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)
 
-#define arch_get_mmap_base(addr, base) \
+#define arch_get_mmap_base(addr, len, base, flags) \
 	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
 
 #define arch_get_mmap_end(addr, len, flags) \
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 1015f2a49917..7ff559bf46f2 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -19,7 +19,7 @@
 	STACK_TOP_MAX;						\
 })
 
-#define arch_get_mmap_base(addr, base, flags)			\
+#define arch_get_mmap_base(addr, len, base, flags)		\
 ({								\
 	base;							\
 })
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9f6cff356796..05a52f85dba9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
+	info.high_limit = arch_get_mmap_base(addr, len, current->mm->mmap_base, flags);
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	addr = vm_unmapped_area(&info);
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..265b43855d0b 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -174,7 +174,7 @@
 #endif
 
 #ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
+#define arch_get_mmap_base(addr, len, base, flags) (base)
 #endif
 
 extern void arch_pick_mmap_layout(struct mm_struct *mm,
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..27a7f2be3f68 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1861,7 +1861,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
+	info.high_limit = arch_get_mmap_base(addr, len, mm->mmap_base, flags);
 	addr = vm_unmapped_area(&info);
 
 	/*
-- 
2.45.0
[PATCH 04/16] mm: Add generic MAP_BELOW_HINT
Make the generic implementation of arch_get_mmap_base() and
arch_get_mmap_end() support MAP_BELOW_HINT.

Signed-off-by: Charlie Jenkins
---
 include/linux/sched/mm.h | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 265b43855d0b..c350bb5ac0a2 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * Routines for handling mm_structs
@@ -170,11 +171,40 @@ static inline void mm_update_next_owner(struct mm_struct *mm)
 
 #ifdef CONFIG_MMU
 #ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr, len, flags)	(TASK_SIZE)
+#define arch_get_mmap_end(addr, len, flags)			\
+({								\
+	unsigned long mmap_end;					\
+	typeof(flags) _flags = (flags);				\
+	typeof(addr) _addr = (addr);				\
+	typeof(len) _len = (len);				\
+	mmap_end = TASK_SIZE;					\
+	if (_flags & MAP_BELOW_HINT && _addr != 0)		\
+		mmap_end = MIN(mmap_end, _addr + _len);		\
+	mmap_end;						\
+})
 #endif
 
 #ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, len, base, flags) (base)
+/*
+ * rnd_gap is defined to be (STACK_TOP - _base) due to the definition of
+ * mmap_base in mm/util.c
+ *
+ * Assumes ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, which all architectures that
+ * implement generic mmap use
+ */
+#define arch_get_mmap_base(addr, len, base, flags)		\
+({								\
+	unsigned long mmap_base;				\
+	typeof(flags) _flags = (flags);				\
+	typeof(addr) _addr = (addr);				\
+	typeof(base) _base = (base);				\
+	typeof(len) _len = (len);				\
+	unsigned long rnd_gap = STACK_TOP - _base;		\
+	mmap_base = _base;					\
+	if (_flags & MAP_BELOW_HINT && _addr != 0)		\
+		mmap_base = MIN(mmap_base, (_addr + _len) - rnd_gap); \
+	mmap_base;						\
+})
 #endif
 
 extern void arch_pick_mmap_layout(struct mm_struct *mm,
-- 
2.45.0
[PATCH 05/16] riscv: mm: Support MAP_BELOW_HINT
When adding support for MAP_BELOW_HINT, the riscv implementation becomes identical to the default implementation, so arch_get_mmap_base() and arch_get_mmap_end() can be removed. Signed-off-by: Charlie Jenkins --- arch/riscv/include/asm/processor.h | 10 -- 1 file changed, 10 deletions(-) diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h index 7ff559bf46f2..20b4ba7d32be 100644 --- a/arch/riscv/include/asm/processor.h +++ b/arch/riscv/include/asm/processor.h @@ -14,16 +14,6 @@ #include -#define arch_get_mmap_end(addr, len, flags)\ -({ \ - STACK_TOP_MAX; \ -}) - -#define arch_get_mmap_base(addr, len, base, flags) \ -({ \ - base; \ -}) - #ifdef CONFIG_64BIT #define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1)) #define STACK_TOP_MAX TASK_SIZE_64 -- 2.45.0
[PATCH 06/16] arm64: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to arch_get_mmap_base() and arch_get_mmap_end(). Signed-off-by: Charlie Jenkins --- arch/arm64/include/asm/processor.h | 34 +- 1 file changed, 29 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index a67ca119bb91..39aabb1619f6 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -92,12 +92,36 @@ #endif /* CONFIG_COMPAT */ #ifndef CONFIG_ARM64_FORCE_52BIT -#define arch_get_mmap_end(addr, len, flags) \ - (((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW) +#define arch_get_mmap_end(addr, len, flags) \ +({ \ + unsigned long mmap_end; \ + typeof(flags) _flags = (flags); \ + typeof(addr) _addr = (addr); \ + typeof(len) _len = (len); \ + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \ + mmap_end = (_addr + _len); \ + else \ + mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW); \ + mmap_end; \ +}) + +#define arch_get_mmap_base(addr, len, base, flags) \ +({ \ + unsigned long mmap_base; \ + typeof(flags) _flags = (flags); \ + typeof(addr) _addr = (addr); \ + typeof(base) _base = (base); \ + typeof(len) _len = (len); \ + unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \ + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\ + mmap_base = (_addr + _len) - rnd_gap; \ + else \ + mmap_base = ((_addr > DEFAULT_MAP_WINDOW) ? \ + _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \ + _base); \ + mmap_base; \ +}) -#define arch_get_mmap_base(addr, len, base, flags) ((addr > DEFAULT_MAP_WINDOW) ? \ - base + TASK_SIZE - DEFAULT_MAP_WINDOW :\ - base) #endif /* CONFIG_ARM64_FORCE_52BIT */ extern phys_addr_t arm64_dma_phys_limit; -- 2.45.0
[PATCH 07/16] powerpc: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to arch_get_mmap_base() and arch_get_mmap_end(). Signed-off-by: Charlie Jenkins --- arch/powerpc/include/asm/task_size_64.h | 36 +++-- 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h index 239b363841aa..a37a5a81365d 100644 --- a/arch/powerpc/include/asm/task_size_64.h +++ b/arch/powerpc/include/asm/task_size_64.h @@ -72,12 +72,36 @@ #define STACK_TOP_MAX TASK_SIZE_USER64 #define STACK_TOP (is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64) -#define arch_get_mmap_base(addr, len, base, flags) \ - (((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base)) +#define arch_get_mmap_base(addr, len, base, flags) \ +({ \ + unsigned long mmap_base; \ + typeof(flags) _flags = (flags); \ + typeof(addr) _addr = (addr); \ + typeof(base) _base = (base); \ + typeof(len) _len = (len); \ + unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \ + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\ + mmap_base = (_addr + _len) - rnd_gap; \ + else \ + mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \ + _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \ + _base); \ + mmap_end; \ +}) -#define arch_get_mmap_end(addr, len, flags) \ - (((addr) > DEFAULT_MAP_WINDOW) || \ -(((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \ - DEFAULT_MAP_WINDOW) +#define arch_get_mmap_end(addr, len, flags) \ +({ \ + unsigned long mmap_end; \ + typeof(flags) _flags = (flags); \ + typeof(addr) _addr = (addr); \ + typeof(len) _len = (len); \ + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \ + mmap_end = (_addr + _len); \ + else \ + mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || \ + (((_flags) & MAP_FIXED) && ((_addr) + (_len) > DEFAULT_MAP_WINDOW))\ + ? TASK_SIZE : DEFAULT_MAP_WINDOW) \ + mmap_end; \ +}) #endif /* _ASM_POWERPC_TASK_SIZE_64_H */ -- 2.45.0
[PATCH 08/16] x86: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/x86/kernel/sys_x86_64.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 01d7cd85ef97..fa16b38f3702 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -86,7 +86,7 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); } -static void find_start_end(unsigned long addr, unsigned long flags, +static void find_start_end(unsigned long addr, unsigned long len, unsigned long flags, unsigned long *begin, unsigned long *end) { if (!in_32bit_syscall() && (flags & MAP_32BIT)) { @@ -106,10 +106,14 @@ static void find_start_end(unsigned long addr, unsigned long flags, } *begin = get_mmap_base(1); + if (in_32bit_syscall()) *end = task_size_32bit(); else *end = task_size_64bit(addr > DEFAULT_MAP_WINDOW); + + if (flags & MAP_BELOW_HINT) + *end = MIN(*end, addr + len); } static inline unsigned long stack_guard_placement(vm_flags_t vm_flags) @@ -132,7 +136,7 @@ arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned l if (flags & MAP_FIXED) return addr; - find_start_end(addr, flags, &begin, &end); + find_start_end(addr, len, flags, &begin, &end); if (len > end) return -ENOMEM; @@ -166,6 +170,7 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0, struct mm_struct *mm = current->mm; unsigned long addr = addr0; struct vm_unmapped_area_info info = {}; + unsigned long task_size, mmap_base; /* requested length too big for entire address space */ if (len > TASK_SIZE) @@ -198,7 +203,8 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0, else info.low_limit = PAGE_SIZE; - info.high_limit = get_mmap_base(0); + mmap_base = get_mmap_base(0); + info.high_limit = mmap_base; 
info.start_gap = stack_guard_placement(vm_flags); /* @@ -210,6 +216,19 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0, */ if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall()) info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW; + if (flags & MAP_BELOW_HINT) { +#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES + task_size = task_size_32bit(); +#else + task_size = task_size_64bit(0); +#endif + /* +* mmap_base is defined by PAGE_ALIGN(task_size - gap - rnd) so +* subtract out the task_size to isolate the gap + rnd. +*/ + info.high_limit = MIN(info.high_limit, + (addr + len) - (task_size - mmap_base)); + } info.align_offset = pgoff << PAGE_SHIFT; if (filp) { -- 2.45.0
[PATCH 09/16] loongarch: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/loongarch/mm/mmap.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c index 889030985135..66a5badf849b 100644 --- a/arch/loongarch/mm/mmap.c +++ b/arch/loongarch/mm/mmap.c @@ -70,6 +70,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, info.flags = VM_UNMAPPED_AREA_TOPDOWN; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(mm->mmap_base, + (addr + len) - (STACK_TOP - mm->mmap_base)); addr = vm_unmapped_area(&info); if (!(addr & ~PAGE_MASK)) @@ -84,7 +91,11 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, } info.low_limit = mm->mmap_base; + info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); + return vm_unmapped_area(&info); } -- 2.45.0
[PATCH 10/16] arm: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/arm/mm/mmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c index d65d0e6ed10a..fa0c79447b78 100644 --- a/arch/arm/mm/mmap.c +++ b/arch/arm/mm/mmap.c @@ -71,6 +71,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, info.length = len; info.low_limit = mm->mmap_base; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0; info.align_offset = pgoff << PAGE_SHIFT; return vm_unmapped_area(&info); @@ -122,6 +124,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.length = len; info.low_limit = FIRST_USER_ADDRESS; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base)); info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0; info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -137,6 +145,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.flags = 0; info.low_limit = mm->mmap_base; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); addr = vm_unmapped_area(&info); } -- 2.45.0
[PATCH 11/16] mips: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/mips/mm/mmap.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c index 7e11d7b58761..1595fda284cd 100644 --- a/arch/mips/mm/mmap.c +++ b/arch/mips/mm/mmap.c @@ -79,6 +79,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, info.flags = VM_UNMAPPED_AREA_TOPDOWN; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(info.high_limit, + (addr + len) - (STACK_TOP - mm->mmap_base)); addr = vm_unmapped_area(&info); if (!(addr & ~PAGE_MASK)) @@ -94,6 +101,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, info.low_limit = mm->mmap_base; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); return vm_unmapped_area(&info); } -- 2.45.0
[PATCH 12/16] parisc: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/parisc/include/uapi/asm/mman.h | 1 + arch/parisc/kernel/sys_parisc.c | 9 + tools/arch/parisc/include/uapi/asm/mman.h | 1 + 3 files changed, 11 insertions(+) diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 68c44f99bc93..44925ef8ac44 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -26,6 +26,7 @@ #define MAP_HUGETLB0x8 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x10 /* MAP_FIXED which doesn't unmap underlying mapping */ #define MAP_UNINITIALIZED 0/* uninitialized anonymous mmap */ +#define MAP_BELOW_HINT 0x20 /* give out address that is below (inclusive) hint address */ #define MS_SYNC1 /* synchronous memory sync */ #define MS_ASYNC 2 /* sync memory asynchronously */ diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c index f7722451276e..feccb60cf746 100644 --- a/arch/parisc/kernel/sys_parisc.c +++ b/arch/parisc/kernel/sys_parisc.c @@ -148,6 +148,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, info.flags = VM_UNMAPPED_AREA_TOPDOWN; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(info.high_limit, + (addr + len) - (STACK_TOP - mm->mmap_base)); addr = vm_unmapped_area(&info); if (!(addr & ~PAGE_MASK)) return addr; @@ -163,6 +170,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, info.low_limit = mm->mmap_base; info.high_limit = mmap_upper_limit(NULL); + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); return vm_unmapped_area(&info); } diff --git a/tools/arch/parisc/include/uapi/asm/mman.h b/tools/arch/parisc/include/uapi/asm/mman.h index 
4cc88a642e10..297acc0f7b2a 100644 --- a/tools/arch/parisc/include/uapi/asm/mman.h +++ b/tools/arch/parisc/include/uapi/asm/mman.h @@ -40,4 +40,5 @@ /* MAP_32BIT is undefined on parisc, fix it for perf */ #define MAP_32BIT 0 #define MAP_UNINITIALIZED 0 +#define MAP_BELOW_HINT 0x20 #endif -- 2.45.0
[PATCH 13/16] s390: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/s390/mm/mmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c index 206756946589..29e20351ca85 100644 --- a/arch/s390/mm/mmap.c +++ b/arch/s390/mm/mmap.c @@ -105,6 +105,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, info.length = len; info.low_limit = mm->mmap_base; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); info.align_mask = get_align_mask(filp, flags); info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -143,6 +145,12 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base)); info.align_mask = get_align_mask(filp, flags); info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -158,6 +166,8 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad info.flags = 0; info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(TASK_SIZE, addr + len); addr = vm_unmapped_area(&info); if (offset_in_page(addr)) return addr; -- 2.45.0
[PATCH 14/16] sh: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/sh/mm/mmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c index bee329d4149a..867f6598b7d0 100644 --- a/arch/sh/mm/mmap.c +++ b/arch/sh/mm/mmap.c @@ -91,6 +91,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, info.length = len; info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0; info.align_offset = pgoff << PAGE_SHIFT; return vm_unmapped_area(&info); @@ -141,6 +143,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + /* +* Subtract (STACK_TOP - mm->mmap_base) to get random +* offset defined in mmap_base() in mm/util.c +*/ + info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base)); info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0; info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -156,6 +164,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.flags = 0; info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = TASK_SIZE; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); addr = vm_unmapped_area(&info); } -- 2.45.0
[PATCH 15/16] sparc: mm: Support MAP_BELOW_HINT
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled. Signed-off-by: Charlie Jenkins --- arch/sparc/kernel/sys_sparc_64.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c index d9c3b34ca744..b9ce9988551a 100644 --- a/arch/sparc/kernel/sys_sparc_64.c +++ b/arch/sparc/kernel/sys_sparc_64.c @@ -129,6 +129,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi info.length = len; info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = min(task_size, VA_EXCLUDE_START); + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0; info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -137,6 +139,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi VM_BUG_ON(addr != -ENOMEM); info.low_limit = VA_EXCLUDE_END; info.high_limit = task_size; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); addr = vm_unmapped_area(&info); } @@ -192,6 +196,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0; info.align_offset = pgoff << PAGE_SHIFT; addr = vm_unmapped_area(&info); @@ -207,6 +213,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.flags = 0; info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = STACK_TOP32; + if (flags & MAP_BELOW_HINT) + info.high_limit = MIN(info.high_limit, addr + len); addr = vm_unmapped_area(&info); } -- 2.45.0
[PATCH 16/16] selftests/mm: Create MAP_BELOW_HINT test
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address. Signed-off-by: Charlie Jenkins --- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 + 2 files changed, 30 insertions(+) diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..4e2de85267b5 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += hugepage-vmemmap TEST_GEN_FILES += khugepaged TEST_GEN_FILES += madv_populate +TEST_GEN_FILES += map_below_hint TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c new file mode 100644 index ..305274c5af49 --- /dev/null +++ b/tools/testing/selftests/mm/map_below_hint.c @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test the MAP_BELOW_HINT mmap flag. + */ +#include +#include "../kselftest.h" + +#define ADDR 0x100UL +#define LENGTH (ADDR / 100) + +#define MAP_BELOW_HINT 0x800 /* Not defined in all libc */ + +/* + * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned + * addresses are below the hint. + */ +int main(int argc, char **argv) +{ + void *addr; + + do { + addr = mmap((void *)ADDR, LENGTH, MAP_ANONYMOUS, MAP_BELOW_HINT, -1, 0); + } while (addr == MAP_FAILED && (unsigned long)addr <= ADDR); + + if (addr != MAP_FAILED && (unsigned long)addr > ADDR) + ksft_exit_fail_msg("mmap returned address above hint with MAP_BELOW_HINT\n"); + + ksft_test_result_pass("MAP_BELOW_HINT works\n"); +} -- 2.45.0
Re: [PATCH 07/16] powerpc: mm: Support MAP_BELOW_HINT
On Wed, Aug 28, 2024 at 08:34:49AM +0200, Christophe Leroy wrote: > Hi Charlie, > > Le 28/08/2024 à 07:49, Charlie Jenkins a écrit : > > Add support for MAP_BELOW_HINT to arch_get_mmap_base() and > > arch_get_mmap_end(). > > > > Signed-off-by: Charlie Jenkins > > --- > > arch/powerpc/include/asm/task_size_64.h | 36 > > +++-- > > 1 file changed, 30 insertions(+), 6 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/task_size_64.h > > b/arch/powerpc/include/asm/task_size_64.h > > index 239b363841aa..a37a5a81365d 100644 > > --- a/arch/powerpc/include/asm/task_size_64.h > > +++ b/arch/powerpc/include/asm/task_size_64.h > > @@ -72,12 +72,36 @@ > > #define STACK_TOP_MAX TASK_SIZE_USER64 > > #define STACK_TOP (is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64) > > -#define arch_get_mmap_base(addr, len, base, flags) \ > > - (((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - > > DEFAULT_MAP_WINDOW : (base)) > > +#define arch_get_mmap_base(addr, len, base, flags) > > \ > > This macro looks quite big for a macro, can it be a static inline function > instead ? Same for the other macro below. > I had overlooked that possibility, I think that's a great solution, I will change that. > > +({ > > \ > > + unsigned long mmap_base; > > \ > > + typeof(flags) _flags = (flags); > > \ > > + typeof(addr) _addr = (addr); > > \ > > + typeof(base) _base = (base); > > \ > > + typeof(len) _len = (len); > > \ > > + unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); > > \ > > + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > > > BIT(VA_BITS - 1)))\ > > + mmap_base = (_addr + _len) - rnd_gap; > > \ > > + else > > \ > > + mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? > > \ > > + _base + TASK_SIZE - DEFAULT_MAP_WINDOW : > > \ > > + _base); > > \ > > + mmap_end; > > \ > > mmap_end doesn't exist, did you mean mmap_base ? Oh whoops, thank you! 
- Charlie > > > +}) > > -#define arch_get_mmap_end(addr, len, flags) \ > > - (((addr) > DEFAULT_MAP_WINDOW) || \ > > -(((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? > > TASK_SIZE : \ > > - > > DEFAULT_MAP_WINDOW) > > +#define arch_get_mmap_end(addr, len, flags) > > \ > > +({ > > \ > > + unsigned long mmap_end; > > \ > > + typeof(flags) _flags = (flags); > > \ > > + typeof(addr) _addr = (addr); > > \ > > + typeof(len) _len = (len); > > \ > > + if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > > > BIT(VA_BITS - 1))) \ > > + mmap_end = (_addr + _len); > > \ > > + else > > \ > > + mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || > > \ > > + (((_flags) & MAP_FIXED) && ((_addr) + (_len) > > > DEFAULT_MAP_WINDOW))\ > > + ? TASK_SIZE : DEFAULT_MAP_WINDOW) > > \ > > + mmap_end; > > \ > > +}) > > #endif /* _ASM_POWERPC_TASK_SIZE_64_H */ > >
Re: [PATCH 16/16] selftests/mm: Create MAP_BELOW_HINT test
On Wed, Aug 28, 2024 at 06:48:33PM +0100, Lorenzo Stoakes wrote: > On Tue, Aug 27, 2024 at 10:49:22PM GMT, Charlie Jenkins wrote: > > Add a selftest for MAP_BELOW_HINT that maps until it runs out of space > > below the hint address. > > > > Signed-off-by: Charlie Jenkins > > --- > > tools/testing/selftests/mm/Makefile | 1 + > > tools/testing/selftests/mm/map_below_hint.c | 29 > > + > > 2 files changed, 30 insertions(+) > > > > diff --git a/tools/testing/selftests/mm/Makefile > > b/tools/testing/selftests/mm/Makefile > > index cfad627e8d94..4e2de85267b5 100644 > > --- a/tools/testing/selftests/mm/Makefile > > +++ b/tools/testing/selftests/mm/Makefile > > @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm > > TEST_GEN_FILES += hugepage-vmemmap > > TEST_GEN_FILES += khugepaged > > TEST_GEN_FILES += madv_populate > > +TEST_GEN_FILES += map_below_hint > > TEST_GEN_FILES += map_fixed_noreplace > > TEST_GEN_FILES += map_hugetlb > > TEST_GEN_FILES += map_populate > > diff --git a/tools/testing/selftests/mm/map_below_hint.c > > b/tools/testing/selftests/mm/map_below_hint.c > > new file mode 100644 > > index ..305274c5af49 > > --- /dev/null > > +++ b/tools/testing/selftests/mm/map_below_hint.c > > @@ -0,0 +1,29 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Test the MAP_BELOW_HINT mmap flag. > > + */ > > +#include > > +#include "../kselftest.h" > > + > > +#define ADDR 0x100UL > > +#define LENGTH (ADDR / 100) > > + > > +#define MAP_BELOW_HINT 0x800 /* Not defined in all libc */ > > + > > +/* > > + * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all > > returned > > + * addresses are below the hint. > > + */ > > +int main(int argc, char **argv) > > +{ > > + void *addr; > > + > > + do { > > + addr = mmap((void *)ADDR, LENGTH, MAP_ANONYMOUS, > > MAP_BELOW_HINT, -1, 0); > > How can this be correct? 
mmap() has parameters: > >void *mmap(void addr[.length], size_t length, int prot, int flags, > int fd, off_t offset); > > You'r setting prot = MAP_ANONYMOUS, flags = MAP_BELOW_HINT... > > This surely should be: > > mmap(..., PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_BELOW_HINT, > -1, 0); > > > + } while (addr == MAP_FAILED && (unsigned long)addr <= ADDR); > > How can addr == MAP_FAILED (i.e. ~0) and addr <= ADDR? This will just loop > through once... > > If you want to make sure you're getting mappings only below the hint until > you start getting MAP_FAILED's you'll need to handle this more robustly. > > > + > > + if (addr != MAP_FAILED && (unsigned long)addr > ADDR) > > + ksft_exit_fail_msg("mmap returned address above hint with > > MAP_BELOW_HINT\n"); > > This is just going to fail because your flags are wrong then wrongly claim > to have passed... I obviously didn't spend enough time thinking about this test case... You are correct that I wrote this incorrectly. I will make a proper test case and send out a new version. - Charlie > > > + > > + ksft_test_result_pass("MAP_BELOW_HINT works\n"); > > +} > > > > -- > > 2.45.0 > >
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
On Wed, Aug 28, 2024 at 11:29:56AM -0700, Dave Hansen wrote: > On 8/27/24 22:49, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > Which applications are these, btw? Java and Go require this feature. These applications store flags that represent the type of data a pointer holds in the upper bits of the pointer itself. > > Is this the same crowd as the folks who are using the address tagging > features like X86_FEATURE_LAM? Yes it is. LAM helps to mask the bits out on x86, and this feature could be used to ensure that mmap() doesn't return an address with bits that would be masked out. I chose not to tie this feature to x86 LAM, which only has masking boundaries at 57 and 48 bits, to allow it to be independent of architecture-specific address masking. > > Even if they are different, I also wonder if a per-mmap() thing > MAP_BELOW_HINT is really what we want. Or should the applications > you're trying to service here use a similar mechanism to how LAM affects > the *whole* address space as opposed to an individual mmap(). LAM is required to be enabled for entire address spaces because the hardware needs to be configured to mask out the bits. It is not possible to influence the granularity of LAM in the current implementation. However mmap() does not require any of this hardware configuration, so it is possible to have finer granularity. A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there are additional performance savings that could be achieved by having this be on a per-address-space basis.
Link: https://cdrdv2.intel.com/v1/dl/getContent/671368 [1] - Charlie
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote: > * Charlie Jenkins [240828 01:49]: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > Wait, what arch(s) allows for greater than the max? The passed hint > should be where we start searching, but we go to the lower limit then > start at the hint and search up (or vice-versa on the directions). > I worded this awkwardly. On arm64 there is a page-table boundary at 48 bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits. The max value mmap is able to return on arm64 is 48 bits if the hint address uses 48 bits or less, even if the architecture supports 5-level paging and thus addresses can be 52 bits. Applications can opt in to using up to 52 bits in an address by using a hint address greater than 48 bits. x86 has the same behavior but with 57 bits instead of 52. The reason this exists is that some applications arbitrarily replace bits in virtual addresses with data, with the assumption that the address will not be using any of the bits above bit 48 in the virtual address. As hardware with larger address spaces was released, x86 decided to build safety guards into the kernel to allow the applications that made these assumptions to continue to work on this different hardware. This causes all applications that use a hint address to silently be restricted to 48-bit addresses. The goal of this flag is to have a way for applications to explicitly request how many bits they want mmap to use. > I don't understand how unmapping works on a higher address; we would > fail to free it on termination of the application. 
> > Also, there are archs that map outside of the VMAs, which are freed by > freeing from the prev->vm_end to next->vm_start, so I don't understand > what that looks like in this reality as well. > > > > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > > flag allows applications a way to specify exactly how many bits they > > want to be left unused by mmap. This eliminates the need for > > applications to know the page table hierarchy of the system to be able > > to reason which addresses mmap will be allowed to return. > > But, why do they need to know today? We have a limit for this don't we? The limit is different for different architectures. On x86 the limit is 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an application requires 10 bits free in a virtual address, the application would always work on arm64 regardless of the hint address, but on x86 if the hint address is greater than 48 bits then the application will not work. The goal of this flag is to have consistent and tunable behavior of mmap() when it is desired to ensure that mmap() only returns addresses that use some number of bits. > > Also, these upper limits are how some archs use the upper bits that you > are trying to use. > It does not eliminate the existing behavior of the architectures of placing these upper limits; instead, it provides a way to have consistent behavior across all architectures.
> > > > > A patch for riscv [1] reverts the behavior that broke userspace. This > > series serves to make this feature available to all architectures. > > I don't fully understand this statement, you say it broke userspace so > now you are porting it to everyone? This reads as if you are braking > the userspace on all architectures :) It was the default for mmap on riscv. The difference here is that it is now enabled by a flag rather than by default. Instead of making the flag specific to riscv, I figured that other architectures might find it useful as well. > > If you fail to find room below, then your application fails as there is > no way to get the upper bits you need. It would be better to fix this > in userspace - if your application is returned too high an address, then > free it and exit because it's going to fail anyways. > This flag is trying to define an API that is more robust than the current behavior on that
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
On Wed, Aug 28, 2024 at 07:19:36PM +0100, Lorenzo Stoakes wrote: > On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the maximum address space, > > unless the hint address is greater than this value. > > > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > > flag allows applications a way to specify exactly how many bits they > > want to be left unused by mmap. This eliminates the need for > > applications to know the page table hierarchy of the system to be able > > to reason which addresses mmap will be allowed to return. > > > > --- > > riscv made this feature of mmap returning addresses less than the hint > > address the default behavior. This was in contrast to the implementation > > of x86/arm64 that have a single boundary at the 5-level page table > > region. However this restriction proved too great -- the reduced > > address space when using a hint address was too small. > > > > A patch for riscv [1] reverts the behavior that broke userspace. This > > series serves to make this feature available to all architectures. > > I'm a little confused as to the justification for this - you broke RISC V by > doing this, and have now reverted it, but now offer the same behaviour that > broke RISC V to all other architectures? > > I mean this is how this reads, so I might be being ungenerous here :) but > would > be good to clarify what the value-add is here. Yeah, I did not do a good job of explaining this! What was broken was making this the default behavior, not the feature in general. > > I also wonder at use of a new MAP_ flag, they're a limited resource and we > should only ever add them if we _really_ need to. 
This seems a bit niche and > specific to be making such a big change for including touching a bunch of > pretty > sensitive arch-specific code. > > We have the ability to change how mmap() functions through 'personalities' > though of course this would impact every mmap() call in the process. > > Overall I'm really not hugely convinced by this, it feels like userland > could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to > reserve a domain and mprotect() it on allocation or mmap() over it). > > So I just struggle to see the purpose myself. BUT absolutely I may be > missing context/others may have a view on the value of this. So happy to > stand corrected. > > > > > I have only tested on riscv and x86. There is a tremendous amount of > > Yeah, OK this is crazy, you can't really submit something as non-RFC that > touches every single arch and not test it. > > I also feel like we need more justification than 'this is a neat thing that > we use in RISC V sometimes' conceptually for such a big change. I will send out a new version that does a much better job at explaining! This is not currently done on riscv at all; it is done on other architectures such as x86 and arm64. This flag is to make similar behavior (the ability to force mmap to return addresses from a constrained address space) available to all architectures. > > Also your test program is currently completely broken afaict (have > commented on it directly). I also feel like your test program is a little > rudimentary, and should test some edge cases close to the limit etc. > > So I think this is a NACK until there is testing across the board and a little > more justification. > > Feel free to respin, but I think any future revisions should be RFC until > we're absolutely sure on testing/justification. 
> > I appreciate your efforts here so sorry to be negative, but just obviously > want to make sure this is functional and trades off added complexity for > value for the kernel and userland :) > Totally understand, thank you! After reviewing the comments I have realized that I made this much more complicated than it needs to be. This should be doable without changing any architecture-specific code. That will eliminate most of the complexity, but it still has the downside of consuming a MAP_ flag. - Charlie > Thanks! > > > duplicated code in mmap so the implementations across architectures I > > believe should be mostly consistent. I added this feature to all > > architectures that implement either > > arch_get_mmap_end()/arch_get_mmap_base() or > > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > > it to the default behavior for
[PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space). The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if it attempts to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2]. These applications work on x86 because x86 applies an implicit 47-bit restriction to mmap() calls whose hint address is less than 48 bits. Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because only a small class of applications does pointer masking. This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture. Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2] To: Arnd Bergmann To: Richard Henderson To: Ivan Kokshaysky To: Matt Turner To: Vineet Gupta To: Russell King To: Guo Ren To: Huacai Chen To: WANG Xuerui To: Thomas Bogendoerfer To: James E.J. 
Bottomley To: Helge Deller To: Michael Ellerman To: Nicholas Piggin To: Christophe Leroy To: Naveen N Rao To: Alexander Gordeev To: Gerald Schaefer To: Heiko Carstens To: Vasily Gorbik To: Christian Borntraeger To: Sven Schnelle To: Yoshinori Sato To: Rich Felker To: John Paul Adrian Glaubitz To: David S. Miller To: Andreas Larsson To: Thomas Gleixner To: Ingo Molnar To: Borislav Petkov To: Dave Hansen To: x...@kernel.org To: H. Peter Anvin To: Andy Lutomirski To: Peter Zijlstra To: Muchun Song To: Andrew Morton To: Liam R. Howlett To: Vlastimil Babka To: Lorenzo Stoakes To: Shuah Khan Cc: linux-a...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-snps-...@lists.infradead.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: loonga...@lists.linux.dev Cc: linux-m...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kselft...@vger.kernel.org Signed-off-by: Charlie Jenkins Changes in v2: - Added much greater detail to cover letter - Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info - Made this an RFC since I have only tested it on riscv and x86 - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com --- Charlie Jenkins (4): mm: Add MAP_BELOW_HINT mm: Add hint and mmap_flags to struct vm_unmapped_area_info mm: Support MAP_BELOW_HINT in vm_unmapped_area() selftests/mm: Create MAP_BELOW_HINT test arch/alpha/kernel/osf_sys.c | 2 ++ arch/arc/mm/mmap.c | 3 +++ arch/arm/mm/mmap.c | 7 ++ arch/csky/abiv1/mmap.c | 3 +++ arch/loongarch/mm/mmap.c | 3 +++ arch/mips/mm/mmap.c | 3 +++ arch/parisc/kernel/sys_parisc.c | 3 +++ arch/powerpc/mm/book3s64/slice.c | 7 ++ 
arch/s390/mm/hugetlbpage.c | 4 arch/s390/mm/mmap.c | 6 ++ arch/sh/mm/mmap.c| 6 ++ arch/sparc/kernel/sys_sparc_32.c | 3 +++ arch/sparc/kernel/sys_sparc_64.c | 6 ++ arch/sparc/mm/hugetlbpage.c | 4 arch/x86/kernel/sys_x86_64.c | 6 ++ arch/x86/mm/hugetlbpage.c| 4 fs/hugetlbfs/inode.c | 4 include/linux/mm.h | 2 ++ include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c| 9 tools/include/uapi/asm-generic/mman-common.h | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 32 23 files changed, 120
[PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space). To make this behavior explicit and more versatile across all architectures, define an mmap flag that allows users to define an arbitrary upper limit on addresses returned by mmap. Signed-off-by: Charlie Jenkins --- include/uapi/asm-generic/mman-common.h | 1 + tools/include/uapi/asm-generic/mman-common.h | 1 + 2 files changed, 2 insertions(+) diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..03ac13d9aa37 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -32,6 +32,7 @@ #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory could be * uninitialized */ +#define MAP_BELOW_HINT 0x800 /* give out address that is below (inclusive) hint address */ /* * Flags for mlock diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..03ac13d9aa37 100644 --- a/tools/include/uapi/asm-generic/mman-common.h +++ b/tools/include/uapi/asm-generic/mman-common.h @@ -32,6 +32,7 @@ #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory could be * uninitialized */ +#define MAP_BELOW_HINT 0x800 /* give out address that is below (inclusive) hint address */ /* * Flags for mlock -- 2.45.0
[PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
The hint address and mmap_flags are necessary to determine if MAP_BELOW_HINT requirements are satisfied. Signed-off-by: Charlie Jenkins --- arch/alpha/kernel/osf_sys.c | 2 ++ arch/arc/mm/mmap.c | 3 +++ arch/arm/mm/mmap.c | 7 +++ arch/csky/abiv1/mmap.c | 3 +++ arch/loongarch/mm/mmap.c | 3 +++ arch/mips/mm/mmap.c | 3 +++ arch/parisc/kernel/sys_parisc.c | 3 +++ arch/powerpc/mm/book3s64/slice.c | 7 +++ arch/s390/mm/hugetlbpage.c | 4 arch/s390/mm/mmap.c | 6 ++ arch/sh/mm/mmap.c| 6 ++ arch/sparc/kernel/sys_sparc_32.c | 3 +++ arch/sparc/kernel/sys_sparc_64.c | 6 ++ arch/sparc/mm/hugetlbpage.c | 4 arch/x86/kernel/sys_x86_64.c | 6 ++ arch/x86/mm/hugetlbpage.c| 4 fs/hugetlbfs/inode.c | 4 include/linux/mm.h | 2 ++ mm/mmap.c| 6 ++ 19 files changed, 82 insertions(+) diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index e5f881bc8288..6903700afd12 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1223,6 +1223,8 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len, info.length = len; info.low_limit = addr; info.high_limit = limit; + info.hint = addr; + info.mmap_flags = flags; return vm_unmapped_area(&info); } diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c index 69a915297155..5922cb51e029 100644 --- a/arch/arc/mm/mmap.c +++ b/arch/arc/mm/mmap.c @@ -29,6 +29,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We enforce the MAP_FIXED case. */ diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c index d65d0e6ed10a..04d9234f049a 100644 --- a/arch/arm/mm/mmap.c +++ b/arch/arm/mm/mmap.c @@ -36,6 +36,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. 
@@ -56,6 +59,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, if (len > TASK_SIZE) return -ENOMEM; + if (addr) { if (do_align) addr = COLOUR_ALIGN(addr, pgoff); @@ -88,6 +92,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c index 7f826331d409..0be4913c6cf3 100644 --- a/arch/csky/abiv1/mmap.c +++ b/arch/csky/abiv1/mmap.c @@ -35,6 +35,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, .align_offset = pgoff << PAGE_SHIFT }; + info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c index 889030985135..7d1e8be20519 100644 --- a/arch/loongarch/mm/mmap.c +++ b/arch/loongarch/mm/mmap.c @@ -27,6 +27,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c index 7e11d7b58761..22e8f9c8eaa0 100644 --- a/arch/mips/mm/mmap.c +++ b/arch/mips/mm/mmap.c @@ -36,6 +36,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {}; + info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c index f7722451276e..2ac53f148624 100644 --- a/arch/parisc/kernel/sys_parisc.c +++ b/arch/parisc/kernel/sys_parisc.c @@ -108,6 +108,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, .length = len }; + 
info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM; diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..f0e2550af6d0 100644 --- a/ar
[PATCH RFC v2 3/4] mm: Support MAP_BELOW_HINT in vm_unmapped_area()
To ensure that all memory allocations comply with the new MAP_BELOW_HINT flag, set the high_limit in vm_unmapped_area() to the hint address + length at most. All callers to this function set the high_limit to something reasonable, usually with space for a random offset and a gap for the stack. To respect the provided high_limit, take the minimum of hint+length and the given high_limit. Signed-off-by: Charlie Jenkins --- mm/mmap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 34ba0db23678..459ad380c673 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) { unsigned long addr; + if (info->hint != 0 && info->mmap_flags & MAP_BELOW_HINT) + info->high_limit = MIN(info->high_limit, info->hint + info->length); + if (info->flags & VM_UNMAPPED_AREA_TOPDOWN) addr = unmapped_area_topdown(info); else -- 2.45.0
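The clamp this patch adds can be modeled in plain userspace code. This is a sketch, not the kernel code: it reuses the proposed 0x800 flag value from patch 1 and spells out the min with an explicit comparison:

```c
#include <stdint.h>

#define MAP_BELOW_HINT 0x800	/* proposed flag value from patch 1 */

/* Model of the vm_unmapped_area() change: when MAP_BELOW_HINT is set and a
 * hint was given, the search ceiling drops to hint + length, unless the
 * architecture's own limit is already lower. */
static uint64_t below_hint_high_limit(uint64_t high_limit, uint64_t hint,
				      uint64_t length, unsigned long mmap_flags)
{
	if (hint != 0 && (mmap_flags & MAP_BELOW_HINT)) {
		uint64_t cap = hint + length;

		if (cap < high_limit)
			high_limit = cap;
	}
	return high_limit;
}
```

Capping at hint + length (rather than hint) still allows the mapping to start exactly at the hint, and a hint of zero leaves the limit untouched, matching the patch's info->hint != 0 guard.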
[PATCH RFC v2 4/4] selftests/mm: Create MAP_BELOW_HINT test
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address. Signed-off-by: Charlie Jenkins --- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 32 + 2 files changed, 33 insertions(+) diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..4e2de85267b5 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += hugepage-vmemmap TEST_GEN_FILES += khugepaged TEST_GEN_FILES += madv_populate +TEST_GEN_FILES += map_below_hint TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c new file mode 100644 index ..55d6cbf90645 --- /dev/null +++ b/tools/testing/selftests/mm/map_below_hint.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test the MAP_BELOW_HINT mmap flag. + */ +#include <sys/mman.h> +#include <errno.h> +#include "../kselftest.h" + +#define ADDR (1 << 20) +#define LENGTH (ADDR / 1) + +#define MAP_BELOW_HINT 0x800 /* Not defined in all libc */ + +/* + * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned + * addresses are below the hint. + */ +int main(int argc, char **argv) +{ + void *addr; + + do { + addr = mmap((void *)ADDR, LENGTH, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0); + } while (addr != MAP_FAILED && (unsigned long)addr <= ADDR); + + if (errno == ENOMEM) + ksft_test_result_pass("MAP_BELOW_HINT works\n"); + else + ksft_test_result_fail("mmap returned address above hint with MAP_BELOW_HINT with error: %s\n", + strerror(errno)); +} -- 2.45.0
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote: > On Thu 29-08-24 00:15:57, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the 48-bit address space, > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > for the kernel address space). > > > > The riscv architecture needs a way to similarly restrict the virtual > > address space. On the riscv port of OpenJDK an error is thrown if > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > has a comment that sv57 support is not complete, but there are some > > workarounds to get it to mostly work [2]. > > > > These applications work on x86 because x86 does an implicit 47-bit > > restriction of mmap() address that contain a hint address that is less > > than 48 bits. > > > > Instead of implicitly restricting the address space on riscv (or any > > current/future architecture), a flag would allow users to opt-in to this > > behavior rather than opt-out as is done on other architectures. This is > > desirable because it is a small class of applications that do pointer > > masking. > > IIRC this has been discussed at length when 5-level page tables support > has been proposed for x86. Sorry I do not have a link handy but lore > should help you. Linus was not really convinced and in the end vetoed it > and prefer that those few applications that benefit from greater address > space would do that explicitly than other way around. I believe I found the conversation you were referring to. Ingo Molnar recommended a flag similar to what I have proposed [1]. Catalin recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up MPX [3]. However these conversations are tangential to what I am proposing. arm64 and x86 decided to have the default address space be 48 bits. 
However this was done on a per-architecture basis with no way for applications to have guarantees between architectures. This 48-bit restriction does not even appear in the man pages, so discovering it requires reading the kernel source code. Then to opt-in to larger address spaces, applications have to know to provide a hint address that is greater than 47 bits; mmap() will then return an address that contains up to 56 bits on x86 and 52 bits on arm64. This difference of 4 bits causes inconsistency and is part of the problem I am trying to solve with this flag. I am not proposing to change x86 and arm64 away from using their opt-out feature; I am instead proposing a standard ABI for applications that need some guarantees of the bits used in pointers. - Charlie Link: https://lore.kernel.org/lkml/20161209050130.gc2...@gmail.com/ [1] Link: https://lore.kernel.org/lkml/20161209105120.ga3...@e104818-lin.cambridge.arm.com/ [2] Link: https://lore.kernel.org/lkml/a2f86495-b55f-fda0-40d2-242c45d3c...@intel.com/ [3] > > -- > Michal Hocko > SUSE Labs
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote: > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote: > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote: > > > Some applications rely on placing data in free bits addresses allocated > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > address returned by mmap to be less than the 48-bit address space, > > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > > for the kernel address space). > > > > I'm still confused as to why, if an mmap flag is desired, and thus programs > > are having to be heavily modified and controlled to be able to do this, why > > you can't just do an mmap() with PROT_NONE early, around a hinted address > > that, sits below the required limit, and then mprotect() or mmap() over it? > > > > Your feature is a major adjustment to mmap(), it needs to be pretty > > significantly justified, especially if taking up a new flag. > > > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > address space. On the riscv port of OpenJDK an error is thrown if > > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > > has a comment that sv57 support is not complete, but there are some > > > workarounds to get it to mostly work [2]. > > > > > > These applications work on x86 because x86 does an implicit 47-bit > > > restriction of mmap() address that contain a hint address that is less > > > than 48 bits. > > > > You mean x86 _has_ to limit to physically available bits in a canonical > > format :) this will not be the case for 5-page table levels though... I might be misunderstanding but I am not talking about pointer masking or canonical addresses here. I am referring to the pattern of: 1. Getting an address from mmap() 2. Writing data into bits assumed to be unused in the address 3. Using the data stored in the address 4. 
Clearing the data from the address and sign extending 5. Dereferencing the now sign-extended address to conform to canonical addresses I am just talking about steps 1 and 2 here -- getting an address from mmap() that only uses bits that will allow your application to not break. How canonicalization happens is a separate conversation that can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. While LAM for x86 is only capable of masking addresses to 48 or 57 bits, Ssnpm for riscv allows an arbitrary number of bits to be masked out. A design goal here is to be able to support all of the pointer masking flavors, and not just x86. > > > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > current/future architecture), a flag would allow users to opt-in to this > > > behavior rather than opt-out as is done on other architectures. This is > > > desirable because it is a small class of applications that do pointer > > > masking. > > > > I raised this last time and you didn't seem to address it so to be more > > blunt: > > > > I don't understand why this needs to be an mmap() flag. From this it seems > > the whole process needs allocations to be below a certain limit. Yeah making it per-process does seem logical, as it would help with pointer masking. > > > > That _could_ be achieved through a 'personality' or similar (though a > > personality is on/off, rather than allowing configuration so maybe > > something else would be needed). > > > > From what you're saying 57-bit is all you really need right? So maybe > > ADDR_LIMIT_57BIT? Addresses will always be limited to 57 bits on riscv and x86 (but not necessarily on other architectures). A flag like that would have no impact; I do not understand what you are suggesting. This patch is to have a configurable number of bits be restricted. If anything, a personality that was ADDR_LIMIT_48BIT would be the closest to what I am trying to achieve. 
Since the issue is that applications fail to work when the address space is greater than 48 bits. > > > > I don't see how you're going to actually enforce this in a process either > > via an mmap flag, as a library might decide not to use it, so you'd need to > > control the allocator, the thread library implementation, and everything > > that might allocate. It is reasonable to change the implementation to be per-process but that is not the current proposal. This flag was designed for applications which already directly manage all of their addresses like OpenJDK and Go. This flag implementation was an attempt to make
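The five-step tagging pattern Charlie walks through (tag, read, clear, sign-extend, dereference) can be sketched as follows. TAG_BITS = 16 is an assumption for illustration -- an application must know, e.g. via a mechanism like the proposed flag, that mmap() will leave those bits unused -- and the sign-extension relies on arithmetic right shift of signed values, which gcc and clang provide:

```c
#include <stdint.h>

#define TAG_BITS  16		/* illustrative: assume the top 16 bits are free */
#define TAG_SHIFT (64 - TAG_BITS)

/* Step 2: stash data in the top bits, assumed unused by the allocation. */
static uint64_t tag_pointer(uint64_t addr, uint64_t tag)
{
	return addr | (tag << TAG_SHIFT);
}

/* Step 3: recover the stashed data. */
static uint64_t pointer_tag(uint64_t tagged)
{
	return tagged >> TAG_SHIFT;
}

/* Step 4: drop the tag and sign-extend from the top untagged bit so the
 * pointer is canonical again before it is dereferenced (step 5). */
static uint64_t untag_pointer(uint64_t tagged)
{
	return (uint64_t)((int64_t)(tagged << TAG_BITS) >> TAG_BITS);
}
```

Hardware features like LAM, TBI, or Ssnpm make step 4 unnecessary by ignoring the tag bits on dereference; this sketch is the purely software version.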
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 09:54:08AM -0700, Dave Hansen wrote: > On 8/28/24 13:15, Charlie Jenkins wrote: > > A way to restrict mmap() to return LAM compliant addresses in an entire > > address space also doesn't have to be mutually exclusive with this flag. > > This flag allows for the greatest degree of control from applications. > > I don't believe there is additionally performance saving that could be > > achieved by having this be on a per address space basis. > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > But it's also rather complicated. Can you expand upon what you mean by it being complicated? Complicated for the kernel or complicated for a user? > > My _hope_ would be that a per-address-space property could share at > least some infrastructure with what x86/LAM and arm/TBI do to the > address space. Basically put the restrictions in place for purely > software reasons instead of the mostly hardware reasons for LAM/TBI. That is a good point, perhaps that would be a way to hook this into LAM, TBI, and any other architecture's specific address masking feature. - Charlie > > Lorenzo also raised some very valid points about a having a generic > address-restriction ABI. I'm certainly not discounting those concerns. > It's not something that can be done lightly.
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 03:36:43PM -0400, Liam R. Howlett wrote: > * Dave Hansen [240829 12:54]: > > On 8/28/24 13:15, Charlie Jenkins wrote: > > > A way to restrict mmap() to return LAM compliant addresses in an entire > > > address space also doesn't have to be mutually exclusive with this flag. > > > This flag allows for the greatest degree of control from applications. > > > I don't believe there is additionally performance saving that could be > > > achieved by having this be on a per address space basis. > > > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > > But it's also rather complicated. > > There is a (seldom used?) feature of mmap_min_addr, it seems like we > could have an mmap_max_addr. Would something like that work for your > use case? Perhaps it would be less intrusive to do something in this > way? I haven't looked at it in depth and this affects all address > spaces as well (new allocations only). > > There is a note on mmap_min_addr about applications that require the > lower addresses, would this mean we'll now have a note about upper > limits? I don't think that's a viable solution because that would change the mmap behavior for all applications running on the system, and wouldn't allow individual applications to have different configurations. > > I really don't understand why you need this at all, to be honest. If > you know the upper limit you could just MAP_FIXED map a huge guard at > the top of your address space then do whatever you want with those bits. > > This will create an entry in the vma tree that no one else will be able > to use, and you can do this in any process you want, for as many bits as > you want. Oh that's an interesting idea. I am not sure how that could work in practice though. The application would need to know it allocated all of the addresses in the upper address space, how would it be able to do that? 
> > > > > My _hope_ would be that a per-address-space property could share at > > least some infrastructure with what x86/LAM and arm/TBI do to the > > address space. Basically put the restrictions in place for purely > > software reasons instead of the mostly hardware reasons for LAM/TBI. > > > > Lorenzo also raised some very valid points about a having a generic > > address-restriction ABI. I'm certainly not discounting those concerns. > > It's not something that can be done lightly. > > Yes, I am concerned about supporting this (probably forever) and dancing > around special code that may cause issues, perhaps on an arch that few > have for testing. I already have so many qemu images for testing, some > of which no longer have valid install media - and basically none of them > use the same code in this area (or have special cases already). I think > you understand what we are dealing with considering your comments in > your cover letter. It is definitely not something to be taken lightly. The version 2 of this is completely generic so that should eliminate most of the concern of "special code" on various architectures. Unless I am misunderstanding something. - Charlie > > Thanks, > Liam
Re: [PATCH RFC v2 2/4] mm: Add hint and mmap_flags to struct vm_unmapped_area_info
On Thu, Aug 29, 2024 at 09:48:44AM +0100, Lorenzo Stoakes wrote: > On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote: > > [snip] > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index d0dfc85b209b..34ba0db23678 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned > > long addr, > > struct vm_unmapped_area_info info = {}; > > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > > > + info.hint = addr; > > + info.mmap_flags = flags; > > + > > if (len > mmap_end - mmap_min_addr) > > return -ENOMEM; > > > > @@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, > > unsigned long addr, > > struct vm_unmapped_area_info info = {}; > > const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); > > > > + info.hint = addr; > > + info.mmap_flags = flags; > > + > > /* requested length too big for entire address space */ > > if (len > mmap_end - mmap_min_addr) > > return -ENOMEM; > > > > These line numbers suggest you're working against Linus's tree, mm/mmap.c > has changed a lot recently, so to avoid conflicts please base your changes > on mm-unstable in Andrew's tree (if looking to go through that) or at least > -next. I will make sure that I base off of mm-unstable for future revisions. - Charlie > > > -- > > 2.45.0 > >
Re: [PATCH RFC v2 1/4] mm: Add MAP_BELOW_HINT
On Thu, Aug 29, 2024 at 06:26:41PM +1000, Michael Ellerman wrote: > Charlie Jenkins writes: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the 48-bit address space, > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > for the kernel address space). > > > > To make this behavior explicit and more versatile across all > > architectures, define a mmap flag that allows users to define an > > arbitrary upper limit on addresses returned by mmap. > > > > Signed-off-by: Charlie Jenkins > > --- > > include/uapi/asm-generic/mman-common.h | 1 + > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > You're not meant to update the headers in tools/ directly. There's a > mail somewhere from acme somewhere describing the proper process, but > the tldr is leave it up to him. Oh okay, thank you. > > > diff --git a/include/uapi/asm-generic/mman-common.h > > b/include/uapi/asm-generic/mman-common.h > > index 6ce1f1ceb432..03ac13d9aa37 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -32,6 +32,7 @@ > > > > #define MAP_UNINITIALIZED 0x400/* For anonymous mmap, memory > > could be > > * uninitialized */ > > +#define MAP_BELOW_HINT 0x800 /* give out address that is > > below (inclusive) hint address */ > > IMHO the API would be clearer if this actually forced the address to be > below the hint. That's what the flag name implies after all. > > It would also mean the application doesn't need to take into account the > length of the mapping when passing the hint. > > cheers That's a good point. The reason I did it this way was to allow mmap the possibility of returning the same address as the hint. If it must be strictly less than the hint then the hint address can never be returned. Maybe that doesn't matter though. - Charlie
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Fri, Aug 30, 2024 at 10:52:01AM +0100, Lorenzo Stoakes wrote: > On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote: > > On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote: > > > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote: > > > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote: > > > > > Some applications rely on placing data in free bits addresses > > > > > allocated > > > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > > > > address returned by mmap to be less than the 48-bit address space, > > > > > unless the hint address uses more than 47 bits (the 48th bit is > > > > > reserved > > > > > for the kernel address space). > > > > > > > > I'm still confused as to why, if an mmap flag is desired, and thus > > > > programs > > > > are having to be heavily modified and controlled to be able to do this, > > > > why > > > > you can't just do an mmap() with PROT_NONE early, around a hinted > > > > address > > > > that, sits below the required limit, and then mprotect() or mmap() over > > > > it? > > > > > > > > Your feature is a major adjustment to mmap(), it needs to be pretty > > > > significantly justified, especially if taking up a new flag. > > > > > > > > > > > > > > The riscv architecture needs a way to similarly restrict the virtual > > > > > address space. On the riscv port of OpenJDK an error is thrown if > > > > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > > > > has a comment that sv57 support is not complete, but there are some > > > > > workarounds to get it to mostly work [2]. > > > > > > > > > > These applications work on x86 because x86 does an implicit 47-bit > > > > > restriction of mmap() address that contain a hint address that is less > > > > > than 48 bits. > > > > > > > > You mean x86 _has_ to limit to physically available bits in a canonical > > > > format :) this will not be the case for 5-page table levels though... 
> > > > I might be misunderstanding but I am not talking about pointer masking > > or canonical addresses here. I am referring to the pattern of: > > > > 1. Getting an address from mmap() > > 2. Writing data into bits assumed to be unused in the address > > 3. Using the data stored in the address > > 4. Clearing the data from the address and sign extending > > 5. Dereferencing the now sign-extended address to conform to canonical > >addresses > > > > I am just talking about step 1 and 2 here -- getting an address from > > mmap() that only uses bits that will allow your application to not > > break. How canonicalization happens is a a separate conversation, that > > can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. > > While LAM for x86 is only capable of masking addresses to 48 or 57 bits, > > Ssnpm for riscv allow an arbitrary number of bits to be masked out. > > A design goal here is to be able to support all of the pointer masking > > flavors, and not just x86. > > Right I get that, I was just saying that the implicit limitation in x86 is > due to virtual addresses _having_ to be less than 48 bits. So that's why > that is right? I mean perhaps I'm mistaken? > > Or is it such that x86 can provide a space for tagging for CPU technology > that supports it (UAI perhaps?). > > I agree with what Michal and others said about the decision to default to > the reduced address space size and opt-in for higher bits. Your series > doesn't do this... > > > > > > > > > > > > > > > > > Instead of implicitly restricting the address space on riscv (or any > > > > > current/future architecture), a flag would allow users to opt-in to > > > > > this > > > > > behavior rather than opt-out as is done on other architectures. This > > > > > is > > > > > desirable because it is a small class of applications that do pointer > > > > > masking. 
> > > > > > > > I raised this last time and you didn't seem to address it so to be more > > > > blunt: > > > > > > > > I don't understand why this needs to be an mmap() flag. From this it > > > > seems > > > > the whole process needs allocations to be below a certain limit.
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote: > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote: > > Some applications rely on placing data in free bits addresses allocated > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > > address returned by mmap to be less than the 48-bit address space, > > unless the hint address uses more than 47 bits (the 48th bit is reserved > > for the kernel address space). > > > > The riscv architecture needs a way to similarly restrict the virtual > > address space. On the riscv port of OpenJDK an error is thrown if > > attempted to run on the 57-bit address space, called sv57 [1]. golang > > has a comment that sv57 support is not complete, but there are some > > workarounds to get it to mostly work [2]. > > > > These applications work on x86 because x86 does an implicit 47-bit > > restriction of mmap() address that contain a hint address that is less > > than 48 bits. > > > > Instead of implicitly restricting the address space on riscv (or any > > current/future architecture), a flag would allow users to opt-in to this > > behavior rather than opt-out as is done on other architectures. This is > > desirable because it is a small class of applications that do pointer > > masking. > > This argument looks broken to me. > > The "small class of applications" is going to be broken unless they got > patched to use your new mmap() flag. You are asking for bugs. > > Consider the case when you write, compile and validate a piece of software > on machine that has <=47bit VA. The binary got shipped to customers. > Later, customer gets a new shiny machine that supports larger address > space and your previously working software is broken. Such binaries might > exist today. > > It is bad idea to use >47bit VA by default. Most of software got tested on > x86 with 47bit VA. > > We can consider more options to opt-in into wider address space like > personality or prctl() handle. 
But opt-out is no-go from what I see. > > -- > Kiryl Shutsemau / Kirill A. Shutemov riscv is in an interesting state in regards to this because the software ecosystem is much less mature than other architectures. The existing riscv hardware supports either 38 or 47 bit userspace VAs, but a lot of people test on QEMU which defaults to 56 bit. As a result, a lot of code is tested with the larger address space. Applications that don't work on the larger address space, like OpenJDK, currently throw an error and exit. Since riscv does not currently have the address space default to 47 bits, some applications just don't work on 56 bits. We could change the kernel so that these applications start working without the need for them to change their code, but that seems like the kernel is overstepping and fixing binaries rather than providing users tools to fix the binaries themselves. This mmap flag was an attempt to provide a tool for these applications that work on the existing 47 bit VA hardware to also work on different hardware that supports a 56 bit VA space. After feedback, it looks like a better solution than the mmap flag is to use the personality syscall to set a process wide restriction to 47 bits instead, which matches the 32 bit flag that already exists. - Charlie
[PATCH RFC v3 0/2] mm: Introduce ADDR_LIMIT_47BIT personality flag
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space). The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2]. These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits. Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT. This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture. Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2] To: Arnd Bergmann To: Richard Henderson To: Ivan Kokshaysky To: Matt Turner To: Vineet Gupta To: Russell King To: Guo Ren To: Huacai Chen To: WANG Xuerui To: Thomas Bogendoerfer To: James E.J. 
Bottomley To: Helge Deller To: Michael Ellerman To: Nicholas Piggin To: Christophe Leroy To: Naveen N Rao To: Alexander Gordeev To: Gerald Schaefer To: Heiko Carstens To: Vasily Gorbik To: Christian Borntraeger To: Sven Schnelle To: Yoshinori Sato To: Rich Felker To: John Paul Adrian Glaubitz To: David S. Miller To: Andreas Larsson To: Thomas Gleixner To: Ingo Molnar To: Borislav Petkov To: Dave Hansen To: x...@kernel.org To: H. Peter Anvin To: Andy Lutomirski To: Peter Zijlstra To: Muchun Song To: Andrew Morton To: Liam R. Howlett To: Vlastimil Babka To: Lorenzo Stoakes To: Shuah Khan To: Christoph Hellwig To: Michal Hocko To: "Kirill A. Shutemov" To: Chris Torek Cc: linux-a...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-snps-...@lists.infradead.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: loonga...@lists.linux.dev Cc: linux-m...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-kselft...@vger.kernel.org Cc: linux-abi-de...@lists.sourceforge.net Signed-off-by: Charlie Jenkins Changes in v2: - Added much greater detail to cover letter - Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info - Made this an RFC since I have only tested it on riscv and x86 - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb90...@rivosinc.com Changes in v3: - Use a personality flag instead of an mmap flag - Link to v2: https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9e...@rivosinc.com --- Charlie Jenkins (2): mm: Add personality flag to limit address to 47 bits selftests/mm: Create ADDR_LIMIT_47BIT test include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 ++ 
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++
 5 files changed, 40 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
--
- Charlie
[PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.

Signed-off-by: Charlie Jenkins
---
 include/uapi/linux/personality.h | 1 +
 mm/mmap.c                        | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..cd3b8c154d9b 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@ enum {
 	WHOLE_SECONDS =		0x200,
 	STICKY_TIMEOUTS =	0x400,
 	ADDR_LIMIT_3GB =	0x800,
+	ADDR_LIMIT_47BIT =	0x1000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..a5c7544853e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 {
 	unsigned long addr;
 
+	if (current->personality & ADDR_LIMIT_47BIT)
+		info->high_limit = MIN(info->high_limit, BIT(47) - 1);
+
 	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
 		addr = unmapped_area_topdown(info);
 	else
--
2.45.0
[PATCH RFC v3 2/2] selftests/mm: Create ADDR_LIMIT_47BIT test
Add a selftest for the ADDR_LIMIT_47BIT personality flag that mmaps until it runs out of space and ensures no addresses are allocated above 47 bits.

Signed-off-by: Charlie Jenkins
---
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++
 3 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..918ef05e180d 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -32,6 +32,7 @@ mlock-random-test
 virtual_address_range
 gup_test
 va_128TBswitch
+map_47bit_personality
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..2e95fd545409 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_47bit_personality
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_47bit_personality.c b/tools/testing/selftests/mm/map_47bit_personality.c
new file mode 100644
index ..453412990c21
--- /dev/null
+++ b/tools/testing/selftests/mm/map_47bit_personality.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the ADDR_LIMIT_47BIT personality flag.
+ */
+#include
+#include
+#include
+#include "../kselftest.h"
+
+#define LENGTH (1)
+
+#define ADDR_LIMIT_47BIT 0x1000
+#define BIT47 1UL << 47
+
+/*
+ * Map memory with ADDR_LIMIT_47BIT until no memory left. Ensure that all returned
+ * addresses are below 47 bits.
+ */
+int main(int argc, char **argv)
+{
+	void *addr;
+
+	syscall(__NR_personality, ADDR_LIMIT_47BIT);
+
+	do {
+		addr = mmap(0, LENGTH, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	} while (addr != MAP_FAILED && (unsigned long)addr < BIT47);
+
+	if (errno == ENOMEM)
+		ksft_test_result_pass("ADDR_LIMIT_47BIT works\n");
+	else
+		ksft_test_result_fail("mmap returned address above 47 bits with ADDR_LIMIT_47BIT with addr: %p and err: %s\n",
+				      addr, strerror(errno));
+}
--
2.45.0
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote: > Charlie Jenkins writes: > > Create a personality flag ADDR_LIMIT_47BIT to support applications > > that wish to transition from running in environments that support at > > most 47-bit VAs to environments that support larger VAs. This > > personality can be set to cause all allocations to be below the 47-bit > > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > > > Signed-off-by: Charlie Jenkins > > --- > > include/uapi/linux/personality.h | 1 + > > mm/mmap.c| 3 +++ > > 2 files changed, 4 insertions(+) > > > > diff --git a/include/uapi/linux/personality.h > > b/include/uapi/linux/personality.h > > index 49796b7756af..cd3b8c154d9b 100644 > > --- a/include/uapi/linux/personality.h > > +++ b/include/uapi/linux/personality.h > > @@ -22,6 +22,7 @@ enum { > > WHOLE_SECONDS = 0x200, > > STICKY_TIMEOUTS = 0x400, > > ADDR_LIMIT_3GB =0x800, > > + ADDR_LIMIT_47BIT = 0x1000, > > }; > > I wonder if ADDR_LIMIT_128T would be clearer? > I don't follow, what does 128T represent? > Have you looked at writing an update for the personality(2) man page? :) I will write an update to the man page if this patch is approved! > > cheers - Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: > (Sorry having issues with my IPv6 setup that duplicated the original email... > > On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: > > On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote: > > > On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote: > > >> On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote: > > >> > Create a personality flag ADDR_LIMIT_47BIT to support applications > > >> > that wish to transition from running in environments that support at > > >> > most 47-bit VAs to environments that support larger VAs. This > > >> > personality can be set to cause all allocations to be below the 47-bit > > >> > boundary. Using MAP_FIXED with mmap() will bypass this restriction. > > >> > > > >> > Signed-off-by: Charlie Jenkins > > >> > > >> I think having an architecture-independent mechanism to limit the size > > >> of the 64-bit address space is useful in general, and we've discussed > > >> the same thing for arm64 in the past, though we have not actually > > >> reached an agreement on the ABI previously. > > > > > > The thread on the original proposals attests to this being rather a > > > fraught > > > topic, and I think the weight of opinion was more so in favour of opt-in > > > rather than opt-out. > > > > You mean opt-in to using the larger addresses like we do on arm64 and > > powerpc, while "opt-out" means a limit as Charlie suggested? > > I guess I'm not using brilliant terminology here haha! > > To clarify - the weight of opinion was for a situation where the address > space is limited, except if you set a hint above that (you could call that > opt-out or opt-in depending which way you look at it, so yeah ok very > unclear sorry!). > > It was against the MAP_ flag and also I think a _flexible_ per-process > limit is also questionable as you might end up setting a limit which breaks > something else, and this starts getting messy quick. 
> > To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and > practical suggestion. > > > > > >> > @@ -22,6 +22,7 @@ enum { > > >> >WHOLE_SECONDS = 0x200, > > >> >STICKY_TIMEOUTS = 0x400, > > >> >ADDR_LIMIT_3GB =0x800, > > >> > + ADDR_LIMIT_47BIT = 0x1000, > > >> > }; > > >> > > >> I'm a bit worried about having this done specifically in the > > >> personality flag bits, as they are rather limited. We obviously > > >> don't want to add many more such flags when there could be > > >> a way to just set the default limit. > > > > > > Since I'm the one who suggested it, I feel I should offer some kind of > > > vague defence here :) > > > > > > We shouldn't let perfect be the enemy of the good. This is a relatively > > > straightforward means of achieving the aim (assuming your concern about > > > arch_get_mmap_end() below isn't a blocker) which has the least impact on > > > existing code. > > > > > > Of course we can end up in absurdities where we start doing > > > ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an > > > egregious maintenance burden and is entirely opt-in so has things going > > > for > > > it. > > > > I'm more confused now, I think most importantly we should try to > > handle this consistently across all architectures. The proposed > > implementation seems to completely block addresses above BIT(47) > > even for applications that opt in by calling mmap(BIT(47), ...), > > which seems to break the existing applications. > > Hm, I thought the commit message suggested the hint overrides it still? > > The intent is to optionally be able to run a process that keeps higher bits > free for tagging and to be sure no memory mapping in the process will > clobber these (correct me if I'm wrong Charlie! :) > > So you really wouldn't want this if you are using tagged pointers, you'd > want to be sure literally nothing touches the higher bits. 
Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned.
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Tue, Sep 10, 2024 at 09:13:33AM +, Arnd Bergmann wrote: > On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote: > > On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote: > >> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote: > >> The intent is to optionally be able to run a process that keeps higher bits > >> free for tagging and to be sure no memory mapping in the process will > >> clobber these (correct me if I'm wrong Charlie! :) > >> > >> So you really wouldn't want this if you are using tagged pointers, you'd > >> want to be sure literally nothing touches the higher bits. > > My understanding was that the purpose of the existing design > is to allow applications to ask for a high address without having > to resort to the complexity of MAP_FIXED. > > In particular, I'm sure there is precedent for applications that > want both tagged pointers (for most mappings) and untagged pointers > (for large mappings). With a per-mm_struct or per-task_struct > setting you can't do that. > > > Various architectures handle the hint address differently, but it > > appears that the only case across any architecture where an address > > above 47 bits will be returned is if the application had a hint address > > with a value greater than 47 bits and was using the MAP_FIXED flag. > > MAP_FIXED bypasses all other checks so I was assuming that it would be > > logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, > > then the intent is for no hint address to cause a value greater than 47 > > bits to be returned. > > I don't think the MAP_FIXED case is that interesting here because > it has to work in both fixed and non-fixed mappings. > > >> This would be more consistent vs. other arches. > > > > Yes riscv is an outlier here. 
The reason I am pushing for something like > > a flag to restrict the address space rather than setting it to be the > > default is it seems like if applications are relying on upper bits to be > > free, then they should be explicitly asking the kernel to keep them free > > rather than assuming them to be free. > > Let's see what the other architectures do and then come up with > a way that fixes the pointer tagging case first on those that are > broken. We can see if there needs to be an extra flag after that. > Here is what I found: > > - x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit > address space when an addr hint is passed. > - arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns > higher 52-bit addresses when either a hint is passed or > CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this > is a debugging option) > - ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), > returns 52 bit address when an addr hint is passed > - riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses > it for allocating the stack below, ignoring it for normal > mappings > - s390 has no DEFAULT_MAP_WINDOW but tried to allocate in > the current number of pgtable levels and only upgrades to > the next level (31, 42, 53, 64 bits) if a hint is passed or > the current level is exhausted. > - loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA > space of 47 bits (16K pages, 3 levels), but can support > a 55 bit space (64K pages, 3 levels). > - sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. > It may allocate both positive and negative addresses in > there. (?) > - mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and > at most 48, 41 or 39 address bits, respectively. 
> > I would suggest these changes: > > - make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 >and ppc64, leave it at 47 > > - add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits > based on page size), sparc (48 bits) and s390 (unsure if > 42, 53, 47 or 48 bits) > > - leave the rest unchanged. > >Arnd Changing all architectures to have a standardized DEFAULT_MAP_WINDOW mostly solves the problem. However, I am concerned that it is fragile for applications to rely on a default like this. Having the personality bit flag is supposed to provide an intuitive ABI for users that guarantees that they will not accidentally request for memory outside of the boundary that they specified. Also you bring up that the DEFAULT_MAP_WINDOW would not be able to be standardized across architectures, so we still have the problem that this default behavior will be different across architectures which I am trying to solve. - Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > * Catalin Marinas [240906 07:44]: > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > > >> It's also unclear to me how we want this flag to interact with > > > >> the existing logic in arch_get_mmap_end(), which attempts to > > > >> limit the default mapping to a 47-bit address space already. > > > > > > > > To optimize RISC-V progress, I recommend: > > > > > > > > Step 1: Approve the patch. > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > > > > > I really want to first see a plausible explanation about why > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > > like all the other major architectures (x86, arm64, powerpc64), > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > > configuration. We end up with a 47-bit with 16K pages but for a > > different reason that has to do with LPA2 support (I doubt we need this > > for the user mapping but we need to untangle some of the macros there; > > that's for a separate discussion). > > > > That said, we haven't encountered any user space problems with a 48-bit > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > > approach (47 or 48 bit default limit). Better to have some ABI > > consistency between architectures. One can still ask for addresses above > > this default limit via mmap(). > > I think that is best as well. > > Can we please just do what x86 and arm64 does? > > Thanks, > Liam I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. 
The solution of defaulting to 47 bits does give applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.

The other issue I have with this is that on x86, if a hint address above 47 bits is specified but MAP_FIXED is not, mmap() may return an address greater than 47 bits even when the hinted area is unavailable. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says:

"If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window."

arm64 on the other hand defines this as only being able to opt in to the 52-bit VA space with the hint address, and my understanding is that mmap() will not fall back to the 52-bit address space. Please correct me if I am wrong. From Documentation/arch/arm64/memory.rst:

"To maintain compatibility with software that relies on the ARMv8.0 VA space maximum size of 48-bits, the kernel will, by default, return virtual addresses to userspace from a 48-bit range.

"Software can "opt-in" to receiving VAs from a 52-bit space by specifying an mmap hint parameter that is larger than 48-bit."

This is an inconsistency I am trying to solve with this personality flag.

- Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 07:25:08AM +, Arnd Bergmann wrote: > On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote: > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > > > > I responded to Arnd in the other thread, but I am still not convinced > > that the solution that x86 and arm64 have selected is the best solution. > > The solution of defaulting to 47 bits does allow applications the > > ability to get addresses that are below 47 bits. However, due to > > differences across architectures it doesn't seem possible to have all > > architectures default to the same value. Additionally, this flag will be > > able to help users avoid potential bugs where a hint address is passed > > that causes upper bits of a VA to be used. > > > > The other issue I have with this is that if there is not a hint address > > specified to be greater than 47 bits on x86, then mmap() may return an > > address that is greater than 47-bits. The documentation in > > Documentation/arch/x86/x86_64/5level-paging.rst says: > > > > "If hint address set above 47-bit, but MAP_FIXED is not specified, we try > > to look for unmapped area by specified address. If it's already > > occupied, we look for unmapped area in *full* address space, rather than > > from 47-bit window." > > This is also in the commit message of b569bab78d8d ("x86/mm: Prepare > to expose larger address space to userspace"), which introduced it. > However, I don't actually see the fallback to the full address space, > instead the actual behavior seems to be the same as arm64. > > Am I missing something in the x86 implementation, or do we just > need to update the documentation? > > Arnd Yeah I guess it is incorrect documentation then? It seems more reasonable to me to have a hint address fall back onto the larger address space because otherwise the "hint" address can cause allocations to fail even if there is space above the 47-bit limit. 
This is another reason I wanted to avoid having this default behavior on riscv, to not have this abuse of the hint address. - Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote: > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote: > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > > > * Catalin Marinas [240906 07:44]: > > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > > > > >> It's also unclear to me how we want this flag to interact with > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to > > > > > >> limit the default mapping to a 47-bit address space already. > > > > > > > > > > > > To optimize RISC-V progress, I recommend: > > > > > > > > > > > > Step 1: Approve the patch. > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > Point 4 is an ABI change. What guarantees that there isn't still > software out there that relies on the old behaviour? Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it. > > > > > > I really want to first see a plausible explanation about why > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > > > > like all the other major architectures (x86, arm64, powerpc64), > > > > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > > > > configuration. We end up with a 47-bit with 16K pages but for a > > > > different reason that has to do with LPA2 support (I doubt we need this > > > > for the user mapping but we need to untangle some of the macros there; > > > > that's for a separate discussion). > > > > > > > > That said, we haven't encountered any user space problems with a 48-bit > > > > DEFAULT_MAP_WINDOW. 
So I also think RISC-V should follow a similar > > > > approach (47 or 48 bit default limit). Better to have some ABI > > > > consistency between architectures. One can still ask for addresses above > > > > this default limit via mmap(). > > > > > > I think that is best as well. > > > > > > Can we please just do what x86 and arm64 does? > > > > I responded to Arnd in the other thread, but I am still not convinced > > that the solution that x86 and arm64 have selected is the best solution. > > The solution of defaulting to 47 bits does allow applications the > > ability to get addresses that are below 47 bits. However, due to > > differences across architectures it doesn't seem possible to have all > > architectures default to the same value. Additionally, this flag will be > > able to help users avoid potential bugs where a hint address is passed > > that causes upper bits of a VA to be used. > > The reason we added this limit on arm64 is that we noticed programs > using the top 8 bits of a 64-bit pointer for additional information. > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have > taught those programs of a new flag but since we couldn't tell how many > are out there, it was the safest to default to a smaller limit and opt > in to the higher one. Such opt-in is via mmap() but if you prefer a > prctl() flag, that's fine by me as well (though I think this should be > opt-in to higher addresses rather than opt-out of the higher addresses). The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits). Opting-in to the higher address space is reasonable. 
However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag. - Charlie > > -- > Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
> Geert Uytterhoeven writes:
> > Hi Christophe,
> >
> > On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy
> > wrote:
> >> >>> diff --git a/include/uapi/linux/personality.h
> >> >>> b/include/uapi/linux/personality.h
> >> >>> index 49796b7756af..cd3b8c154d9b 100644
> >> >>> --- a/include/uapi/linux/personality.h
> >> >>> +++ b/include/uapi/linux/personality.h
> >> >>> @@ -22,6 +22,7 @@ enum {
> >> >>>         WHOLE_SECONDS =         0x200,
> >> >>>         STICKY_TIMEOUTS =       0x400,
> >> >>>         ADDR_LIMIT_3GB =        0x800,
> >> >>> +       ADDR_LIMIT_47BIT =      0x1000,
> >> >>> };
> >> >>
> >> >> I wonder if ADDR_LIMIT_128T would be clearer?
> >> >>
> >> >
> >> > I don't follow, what does 128T represent?
> >>
> >> 128T is 128 Terabytes, that's the maximum size achievable with a 47BIT
> >> address, that naming would be more consistent with the ADDR_LIMIT_3GB
> >> just above that means a 3 Gigabytes limit.
> >
> > Hence ADDR_LIMIT_128TB?
>
> Yes it should be 128TB. Typo by me.
>
> cheers

47BIT was selected because the use case for this flag is applications that want to store data in the upper bits of a virtual address. For that use case the total size of the virtual address space is irrelevant; what matters is the number of bits in use, and hence the number of bits that are guaranteed to be free.

- Charlie
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote: > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote: > > Opting-in to the higher address space is reasonable. However, it is not > > my preference, because the purpose of this flag is to ensure that > > allocations do not exceed 47-bits, so it is a clearer ABI to have the > > applications that want this guarantee to be the ones setting the flag, > > rather than the applications that want the higher bits setting the flag. > > Yes, this would be ideal. Unfortunately those applications don't know > they need to set a flag in order to work. It's not a regression, the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them? > > A slightly better option is to leave the default 47-bit at the kernel > ABI level and have the libc/dynamic loader issue the prctl(). You can > control the default with environment variables if needed. Having glibc set the 47-bit requirement could make it slightly easier for applications since they would only have to set the environment variable. After the kernel interface is approved I can look into supporting that. - Charlie > > We do something similar in glibc for arm64 MTE. When MTE is enabled, the > top byte of an allocated pointer contains the tag that must not be > corrupted. We left the decision to the C library via the > glibc.mem.tagging tunable (Android has something similar via the app > manifest). An app can change the default if it wants but if you run with > old glibc or no environment variable to say otherwise, the default would > be safe. 
Distros can set the environment to be the maximum range by > default if they know the apps included have been upgraded and tested. > > -- > Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
> > >
> > > Yes, this would be ideal. Unfortunately those applications don't know
> > > they need to set a flag in order to work.
> >
> > It's not a regression, the applications never worked (on platforms that
> > do not have this default). The 47-bit default would allow applications
> > that didn't work to start working at the cost of a non-ideal ABI. That
> > doesn't seem like a reasonable tradeoff to me. If applications want to
> > run on new hardware that has different requirements, shouldn't they be
> > required to update rather than expect the kernel will solve their
> > problems for them?
>
> That's a valid point but it depends on the application and how much you
> want to spend updating user-space. OpenJDK is fine, if you need a JIT
> you'll have to add support for that architecture anyway. But others are
> arch-agnostic, you just recompile to your target. It's not an ABI
> problem, more of an API one.

The arch-agnosticism is my hope with this personality flag: it can be added to arch-agnostic userspace code and allow the application to work everywhere, though it does have the downside of requiring that change to user-space code.

>
> The x86 case (and powerpc/arm64) was different, the 47-bit worked for a
> long time before expanding it. So it made a lot of sense to keep the
> same default.
Yes it is very reasonable that this solution was selected for those architectures since the support for higher address spaces evolved in the manner that it did! - Charlie > > Anyway, the prctl() can go both ways, either expanding or limiting the > default address space. So I'd be fine with such interface. > > -- > Catalin
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote: > On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote: > > On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote: > > > On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote: > > > > On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote: > > > > > * Catalin Marinas [240906 07:44]: > > > > > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > > > > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > > > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann > > > > > > > > wrote: > > > > > > > >> It's also unclear to me how we want this flag to interact with > > > > > > > >> the existing logic in arch_get_mmap_end(), which attempts to > > > > > > > >> limit the default mapping to a 47-bit address space already. > > > > > > > > > > > > > > > > To optimize RISC-V progress, I recommend: > > > > > > > > > > > > > > > > Step 1: Approve the patch. > > > > > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > > > > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > > > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > > > > > Point 4 is an ABI change. What guarantees that there isn't still > > > software out there that relies on the old behaviour? > > > > Yeah I don't think it would be desirable to remove the 47 bit > > constraint in architectures that already have it. > > > > > > > > > > > > I really want to first see a plausible explanation about why > > > > > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > > > > > > like all the other major architectures (x86, arm64, powerpc64), > > > > > > > > > > > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the > > > > > > default > > > > > > configuration. 
We end up with a 47-bit with 16K pages but for a > > > > > > different reason that has to do with LPA2 support (I doubt we need > > > > > > this > > > > > > for the user mapping but we need to untangle some of the macros > > > > > > there; > > > > > > that's for a separate discussion). > > > > > > > > > > > > That said, we haven't encountered any user space problems with a > > > > > > 48-bit > > > > > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > > > > > > approach (47 or 48 bit default limit). Better to have some ABI > > > > > > consistency between architectures. One can still ask for addresses > > > > > > above > > > > > > this default limit via mmap(). > > > > > > > > > > I think that is best as well. > > > > > > > > > > Can we please just do what x86 and arm64 does? > > > > > > > > I responded to Arnd in the other thread, but I am still not convinced > > > > that the solution that x86 and arm64 have selected is the best solution. > > > > The solution of defaulting to 47 bits does allow applications the > > > > ability to get addresses that are below 47 bits. However, due to > > > > differences across architectures it doesn't seem possible to have all > > > > architectures default to the same value. Additionally, this flag will be > > > > able to help users avoid potential bugs where a hint address is passed > > > > that causes upper bits of a VA to be used. > > > > > > The reason we added this limit on arm64 is that we noticed programs > > > using the top 8 bits of a 64-bit pointer for additional information. > > > IIRC, it wasn't even openJDK but some JavaScript JIT. We could have > > > taught those programs of a new flag but since we couldn't tell how many > > > are out there, it was the safest to default to a smaller limit and opt > > > in to the higher one. 
Such opt-in is via mmap() but if you prefer a
> > > prctl() flag, that's fine by me as well (though I think this should be
> > > opt-in to higher addresses rather than opt-out of the higher addresses).
Re: [PATCH v2 3/7] syscall.h: add syscall_set_arguments() and syscall_set_return_value()
/include/asm/syscall.h
> index fff52205fb65..526449edd768 100644
> --- a/arch/nios2/include/asm/syscall.h
> +++ b/arch/nios2/include/asm/syscall.h
> @@ -58,6 +58,17 @@ static inline void syscall_get_arguments(struct task_struct *task,
>       *args = regs->r9;
> }
>
> +static inline void syscall_set_arguments(struct task_struct *task,
> +     struct pt_regs *regs, const unsigned long *args)
> +{
> +     regs->r4 = *args++;
> +     regs->r5 = *args++;
> +     regs->r6 = *args++;
> +     regs->r7 = *args++;
> +     regs->r8 = *args++;
> +     regs->r9 = *args;
> +}
> +
> static inline int syscall_get_arch(struct task_struct *task)
> {
>       return AUDIT_ARCH_NIOS2;
> }
> diff --git a/arch/openrisc/include/asm/syscall.h
> b/arch/openrisc/include/asm/syscall.h
> index 903ed882bdec..e6383be2a195 100644
> --- a/arch/openrisc/include/asm/syscall.h
> +++ b/arch/openrisc/include/asm/syscall.h
> @@ -57,6 +57,13 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
>       memcpy(args, &regs->gpr[3], 6 * sizeof(args[0]));
> }
>
> +static inline void
> +syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
> +                     const unsigned long *args)
> +{
> +     memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
> +}
> +
> static inline int syscall_get_arch(struct task_struct *task)
> {
>       return AUDIT_ARCH_OPENRISC;
> }
> diff --git a/arch/parisc/include/asm/syscall.h
> b/arch/parisc/include/asm/syscall.h
> index 00b127a5e09b..b146d0ae4c77 100644
> --- a/arch/parisc/include/asm/syscall.h
> +++ b/arch/parisc/include/asm/syscall.h
> @@ -29,6 +29,18 @@ static inline void syscall_get_arguments(struct task_struct *tsk,
>       args[0] = regs->gr[26];
> }
>
> +static inline void syscall_set_arguments(struct task_struct *tsk,
> +                                        struct pt_regs *regs,
> +                                        unsigned long *args)
> +{
> +     regs->gr[21] = args[5];
> +     regs->gr[22] = args[4];
> +     regs->gr[23] = args[3];
> +     regs->gr[24] = args[2];
> +     regs->gr[25] = args[1];
> +     regs->gr[26] = args[0];
> +}
> +
> static inline long syscall_get_error(struct task_struct *task,
>                                      struct pt_regs *regs)
> {
> diff --git a/arch/powerpc/include/asm/syscall.h
> b/arch/powerpc/include/asm/syscall.h
> index 422d7735ace6..521f279e6b33 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -114,6 +114,16 @@ static inline void syscall_get_arguments(struct task_struct *task,
>       }
> }
>
> +static inline void syscall_set_arguments(struct task_struct *task,
> +                                        struct pt_regs *regs,
> +                                        const unsigned long *args)
> +{
> +     memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
> +
> +     /* Also copy the first argument into orig_gpr3 */
> +     regs->orig_gpr3 = args[0];
> +}
> +
> static inline int syscall_get_arch(struct task_struct *task)
> {
>       if (is_tsk_32bit_task(task))
> diff --git a/arch/riscv/include/asm/syscall.h
> b/arch/riscv/include/asm/syscall.h
> index 121fff429dce..8d389ba995c8 100644
> --- a/arch/riscv/include/asm/syscall.h
> +++ b/arch/riscv/include/asm/syscall.h
> @@ -66,6 +66,15 @@ static inline void syscall_get_arguments(struct task_struct *task,
>       memcpy(args, &regs->a1, 5 * sizeof(args[0]));
> }
>
> +static inline void syscall_set_arguments(struct task_struct *task,
> +                                        struct pt_regs *regs,
> +                                        const unsigned long *args)
> +{
> +     regs->orig_a0 = args[0];
> +     args++;
> +     memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
> +}

Looks good for riscv.

Tested-by: Charlie Jenkins
Reviewed-by: Charlie Jenkins

> +
> static inline int syscall_get_arch(struct task_struct *task)
> {
> #ifdef CONFIG_64BIT
> diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
> index 27e3d804b311..b3dd883699e7 100644
> --- a/arch/s390/include/asm/syscall.h
> +++ b/arch/s390/include/asm/syscall.h
> @@ -78,6 +78,18 @@ static inline void syscall_get_arguments(struct task_struct *task,
>       args[0] = regs->orig_gpr2 & mask;
> }
>
> +static inline void syscall_set_arguments(struct task_struct *task,
> +                                        struct pt_regs *regs,
> +                                        const unsigned long *args)
> +{
> +     unsigned int n = 6;
> +
> +     while (n-- > 0)
> +
Re: [PATCH v2 4/7] syscall.h: introduce syscall_set_nr()
> static inline void syscall_rollback(struct task_struct *task,
>                                     struct pt_regs *regs)
> {
> diff --git a/arch/openrisc/include/asm/syscall.h
> b/arch/openrisc/include/asm/syscall.h
> index e6383be2a195..5e037d9659c5 100644
> --- a/arch/openrisc/include/asm/syscall.h
> +++ b/arch/openrisc/include/asm/syscall.h
> @@ -25,6 +25,12 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>       return regs->orig_gpr11;
> }
>
> +static inline void
> +syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +     regs->orig_gpr11 = nr;
> +}
> +
> static inline void
> syscall_rollback(struct task_struct *task, struct pt_regs *regs)
> {
> diff --git a/arch/parisc/include/asm/syscall.h
> b/arch/parisc/include/asm/syscall.h
> index b146d0ae4c77..c11222798ab2 100644
> --- a/arch/parisc/include/asm/syscall.h
> +++ b/arch/parisc/include/asm/syscall.h
> @@ -17,6 +17,13 @@ static inline long syscall_get_nr(struct task_struct *tsk,
>       return regs->gr[20];
> }
>
> +static inline void syscall_set_nr(struct task_struct *tsk,
> +                                 struct pt_regs *regs,
> +                                 int nr)
> +{
> +     regs->gr[20] = nr;
> +}
> +
> static inline void syscall_get_arguments(struct task_struct *tsk,
>                                          struct pt_regs *regs,
>                                          unsigned long *args)
> diff --git a/arch/powerpc/include/asm/syscall.h
> b/arch/powerpc/include/asm/syscall.h
> index 521f279e6b33..7505dcfed247 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -39,6 +39,16 @@ static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
>       return -1;
> }
>
> +static inline void syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
> +{
> +     /*
> +      * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +      * the target task is stopped for tracing on entering syscall, so
> +      * there is no need to have the same check syscall_get_nr() has.
> +      */
> +     regs->gpr[0] = nr;
> +}
> +
> static inline void syscall_rollback(struct task_struct *task,
>                                     struct pt_regs *regs)
> {
> diff --git a/arch/riscv/include/asm/syscall.h
> b/arch/riscv/include/asm/syscall.h
> index 8d389ba995c8..a5281cdf2b10 100644
> --- a/arch/riscv/include/asm/syscall.h
> +++ b/arch/riscv/include/asm/syscall.h
> @@ -30,6 +30,13 @@ static inline int syscall_get_nr(struct task_struct *task,
>       return regs->a7;
> }
>
> +static inline void syscall_set_nr(struct task_struct *task,
> +                                 struct pt_regs *regs,
> +                                 int nr)
> +{
> +     regs->a7 = nr;
> +}

Looks good for riscv.

Tested-by: Charlie Jenkins
Reviewed-by: Charlie Jenkins

> +
> static inline void syscall_rollback(struct task_struct *task,
>                                     struct pt_regs *regs)
> {
> diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
> index b3dd883699e7..12cd0c60c07b 100644
> --- a/arch/s390/include/asm/syscall.h
> +++ b/arch/s390/include/asm/syscall.h
> @@ -24,6 +24,18 @@ static inline long syscall_get_nr(struct task_struct *task,
>               (regs->int_code & 0xffff) : -1;
> }
>
> +static inline void syscall_set_nr(struct task_struct *task,
> +                                 struct pt_regs *regs,
> +                                 int nr)
> +{
> +     /*
> +      * Unlike syscall_get_nr(), syscall_set_nr() can be called only when
> +      * the target task is stopped for tracing on entering syscall, so
> +      * there is no need to have the same check syscall_get_nr() has.
> +      */
> +     regs->int_code = (regs->int_code & ~0xffff) | (nr & 0xffff);
> +}
> +
> static inline void syscall_rollback(struct task_struct *task,
>                                     struct pt_regs *regs)
> {
> diff --git a/arch/sh/include/asm/syscall_32.h
> b/arch/sh/include/asm/syscall_32.h
> index cb51a7528384..7027d87d901d 100644
> --- a/arch/sh/include/asm/syscall_32.h
> +++ b/arch/sh/include/asm/syscall_32.h
> @@ -15,6 +15,18 @@ static inline long syscall_get_nr(struct task_struct *task,
>       return (regs->tra >= 0) ? regs->regs[3] : -1L;
> }
>
> +static inline void syscall_set_nr(struct task_struct *task,
> +                                 struct pt_regs *regs,
> +                                 int nr)
> +{
> +     /*
> +      * Unlike syscall_get_nr(), syscall_set_nr(