Module Name:	src
Committed By:	martin
Date:		Wed Aug 7 11:01:57 UTC 2024
Modified Files:
	src/libexec/ld.elf_so [netbsd-9]: README.TLS tls.c
	src/libexec/ld.elf_so/arch/aarch64 [netbsd-9]: rtld_start.S
	src/tests/libexec/ld.elf_so [netbsd-9]: t_tls_extern.c

Log Message:
Pull up following revision(s) (requested by riastradh in ticket #1864):

	libexec/ld.elf_so/tls.c: revision 1.15
	libexec/ld.elf_so/arch/aarch64/rtld_start.S: revision 1.6
	libexec/ld.elf_so/arch/aarch64/rtld_start.S: revision 1.7
	tests/libexec/ld.elf_so/t_tls_extern.c: revision 1.15
	tests/libexec/ld.elf_so/t_tls_extern.c: revision 1.16
	libexec/ld.elf_so/README.TLS: revision 1.7
	libexec/ld.elf_so/tls.c: revision 1.20
	libexec/ld.elf_so/tls.c: revision 1.21

Alignment.  NFCI.

ld.elf_so: Sprinkle comments and references for thread-local storage.
Maybe this will help the TLS business to be less mysterious to the
next traveller to pass by here.
Prompted by PR lib/58154.

ld.elf_so: Add comments explaining DTV allocation size.
Patch by pho@ for PR lib/58154.

tests/libexec/ld.elf_so/t_tls_extern: Test PR lib/58154.

ld.elf_so aarch64/rtld_start.S: Sprinkle comments.
No functional change intended.
Prompted by PR lib/58154.

ld.elf_so aarch64/rtld_start.S: Fix dynamic TLS fast path branch.
Bug found and patch prepared by pho@.
PR lib/58154


To generate a diff of this commit:
cvs rdiff -u -r1.5.2.1 -r1.5.2.2 src/libexec/ld.elf_so/README.TLS
cvs rdiff -u -r1.12.2.2 -r1.12.2.3 src/libexec/ld.elf_so/tls.c
cvs rdiff -u -r1.4 -r1.4.2.1 src/libexec/ld.elf_so/arch/aarch64/rtld_start.S
cvs rdiff -u -r1.12.4.2 -r1.12.4.3 src/tests/libexec/ld.elf_so/t_tls_extern.c

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
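For orientation before reading the diff: the "2 +" DTV allocation
convention that the tls.c comments below document can be summarized in
a few lines of C.  This is a minimal illustrative sketch only, not the
actual ld.elf_so code (which uses xcalloc/xfree and its own macros);
the names dtv_alloc and dtv_free are hypothetical.  The vector carries
two bookkeeping slots: dtv[-1] holds the maximum allocated index,
dtv[0] the generation number, so module indices start at 1.

	#include <stdlib.h>

	/* Illustrative sketch; not ld.elf_so API. */
	static void **
	dtv_alloc(size_t max_index, size_t generation)
	{
		/* "2 +" for the max-index and generation slots */
		void **dtv = calloc(2 + max_index, sizeof(*dtv));

		if (dtv == NULL)
			abort();
		dtv++;				/* advance past max-index slot */
		dtv[-1] = (void *)max_index;	/* DTV_MAX_INDEX(dtv) */
		dtv[0] = (void *)generation;	/* DTV_GENERATION(dtv) */
		return dtv;	/* dtv[1] .. dtv[max_index] hold TLS pointers */
	}

	static void
	dtv_free(void **dtv)
	{
		free(dtv - 1);	/* retreat back to the max-index slot */
	}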
Modified files:

Index: src/libexec/ld.elf_so/README.TLS
diff -u src/libexec/ld.elf_so/README.TLS:1.5.2.1 src/libexec/ld.elf_so/README.TLS:1.5.2.2
--- src/libexec/ld.elf_so/README.TLS:1.5.2.1	Fri Aug 4 12:55:45 2023
+++ src/libexec/ld.elf_so/README.TLS	Wed Aug 7 11:01:57 2024
@@ -1,11 +1,111 @@
+Thread-local storage.
+
+Each thread has a thread control block, or TCB.  The TCB is a
+variable-size structure headed by `struct tls_tcb' from <sys/tls.h>,
+with:
+
+(a) static thread-local storage for the TLS data of initial objects,
+    i.e., those loaded at startup rather than those dynamically loaded
+    by dlopen
+
+(b) a pointer to a dynamic thread vector (DTV) for the TLS data
+    pointers of objects that use global-dynamic or local-dynamic models
+    (typically shared libraries or dlopenable modules)
+
+(c) the pthread_t pointer
+
+The per-thread lwp private pointer, also sometimes called TP (thread
+pointer), managed by the _lwp_getprivate and _lwp_setprivate syscalls,
+either points at the TCB directly, or, on some architectures, points at
+
+	tp = tcb + sizeof(struct tls_tcb) + TLS_TP_OFFSET.
+
+This bias is chosen for architectures where signed displacements from
+TP enable twice the range of static TLS offsets.  Architectures with
+such a tp/tcb offset must provide
+
+	void *__lwp_gettcb_fast(void);
+
+in machine/mcontext.h and must define __HAVE___LWP_GETTCB_FAST in
+machine/types.h to reflect this; otherwise they must provide
+__lwp_getprivate_fast to return the TCB pointer.
+
+Each architecture has one of two TLS variants, variant I or variant II.
+Variant I places the static thread-local storage _after_ the fixed
+content of the TCB, at increasing addresses (increasing addresses grow
+down in diagram):
+
+	+---------------+
+	| dtv pointer   |	tcb points here (struct tls_tcb)
+	+---------------+
+	| pthread_t     |
+	+---------------+
+	| obj0 tls      |	obj0->tlsoffset = 0
+	|               |
+	|               |
+	+---------------+
+	| obj1 tls      |	obj1->tlsoffset = 3
+	+---------------+
+	| obj2 tls      |	obj2->tlsoffset = 4
+	|               |
+	.               .
+	.               .
+	.               .
+	|               |
+	+---------------+
+	| objN tls      |	objN->tlsoffset = k
+	+---------------+
+
+Variant II places the static thread-local storage _before_ the fixed
+content of the TCB, at decreasing addresses:
+
+	+---------------+
+	| objN tls      |	objN->tlsoffset = k
+	+---------------+
+	| obj(N-1) tls  |	obj(N-1)->tlsoffset = k - 1
+	.               .
+	.               .
+	.               .
+	|               |
+	+---------------+
+	| obj2 tls      |	obj2->tlsoffset = 4
+	+---------------+
+	| obj1 tls      |	obj1->tlsoffset = 3
+	+---------------+
+	| obj0 tls      |	obj0->tlsoffset = 0
+	|               |
+	|               |
+	+---------------+
+	| tcb pointer   |	tcb points here (struct tls_tcb)
+	+---------------+
+	| dtv pointer   |
+	+---------------+
+	| pthread_t     |
+	+---------------+
+
+See [ELFTLS] Sec. 3 `Run-Time Handling of TLS', Figs 1 and 2, for
+bigger pictures including the DTV and dynamically allocated TLS blocks.
+
+Each architecture also has its own ELF ABI processor supplement with
+the architecture-specific relocations and TLS details.
+
+References:
+
+	[ELFTLS] Ulrich Drepper, `ELF Handling For Thread-Local
+	Storage', Version 0.21, 2023-08-22.
+	https://akkadia.org/drepper/tls.pdf
+	https://web.archive.org/web/20240718081934/https://akkadia.org/drepper/tls.pdf
+
 Steps for adding TLS support for a new platform:
 
 (1) Declare TLS variant in machine/types.h by defining either
 __HAVE_TLS_VARIANT_I or __HAVE_TLS_VARIANT_II.
 
-(2) _lwp_makecontext has to set the reserved register or kernel transfer
-variable in uc_mcontext to the provided value of 'private'.  See
-src/lib/libc/arch/$PLATFORM/gen/_lwp.c.
+(2) _lwp_makecontext has to set the reserved register or kernel
+transfer variable in uc_mcontext according to the provided value of
+`private'.  Note that _lwp_makecontext takes tcb, not tp, as an
+argument, so make sure to adjust it if needed for the tp/tcb offset.
+See src/lib/libc/arch/$PLATFORM/gen/_lwp.c.
 
 This is not possible on the VAX as there is no free space in ucontext_t.
 This requires either a special version of _lwp_create or versioning
@@ -60,9 +160,22 @@ def->st_value - defobj->tlsoffset + rela
 e.g. starting offset is counting down from the TCB.
 
-(6) Implement __lwp_getprivate_fast() in machine/mcontext.h and set
-__HAVE___LWP_GETPRIVATE_FAST in machine/types.h.
+(6) If there is a tp/tcb offset, implement
+
+	__lwp_gettcb_fast()
+	__lwp_settcb()
+
+in machine/mcontext.h and set
+
+	__HAVE___LWP_GETTCB_FAST
+	__HAVE___LWP_SETTCB
+
+in machine/types.h.
+
+Otherwise, implement __lwp_getprivate_fast() in machine/mcontext.h and
+set __HAVE___LWP_GETPRIVATE_FAST in machine/types.h.
 
-(7) Test using src/tests/lib/libc/tls.  Make sure with "objdump -R" that
-t_tls_dynamic has two TPOFF relocations and h_tls_dlopen.so.1 and
-libh_tls_dynamic.so.1 have both two DTPMOD and DTPOFF relocations.
+(7) Test using src/tests/lib/libc/tls and src/tests/libexec/ld.elf_so.
+Make sure with "objdump -R" that t_tls_dynamic has two TPOFF
+relocations and that h_tls_dlopen.so.1 and libh_tls_dynamic.so.1 each
+have two DTPMOD and two DTPOFF relocations.

Index: src/libexec/ld.elf_so/tls.c
diff -u src/libexec/ld.elf_so/tls.c:1.12.2.2 src/libexec/ld.elf_so/tls.c:1.12.2.3
--- src/libexec/ld.elf_so/tls.c:1.12.2.2	Fri Aug 4 12:55:45 2023
+++ src/libexec/ld.elf_so/tls.c	Wed Aug 7 11:01:57 2024
@@ -1,4 +1,4 @@
-/*	$NetBSD: tls.c,v 1.12.2.2 2023/08/04 12:55:45 martin Exp $	*/
+/*	$NetBSD: tls.c,v 1.12.2.3 2024/08/07 11:01:57 martin Exp $	*/
 
 /*-
  * Copyright (c) 2011 The NetBSD Foundation, Inc.
  * All rights reserved.
@@ -29,7 +29,18 @@
  */
 
 #include <sys/cdefs.h>
-__RCSID("$NetBSD: tls.c,v 1.12.2.2 2023/08/04 12:55:45 martin Exp $");
+__RCSID("$NetBSD: tls.c,v 1.12.2.3 2024/08/07 11:01:57 martin Exp $");
+
+/*
+ * Thread-local storage
+ *
+ * Reference:
+ *
+ *	[ELFTLS] Ulrich Drepper, `ELF Handling For Thread-Local
+ *	Storage', Version 0.21, 2023-08-22.
+ *	https://akkadia.org/drepper/tls.pdf
+ *	https://web.archive.org/web/20240718081934/https://akkadia.org/drepper/tls.pdf
+ */
 
 #include <sys/param.h>
 #include <sys/ucontext.h>
@@ -45,20 +56,93 @@ __RCSID("$NetBSD: tls.c,v 1.12.2.2 2023/
 static struct tls_tcb *_rtld_tls_allocate_locked(void);
 static void *_rtld_tls_module_allocate(struct tls_tcb *, size_t);
 
+/*
+ * DTV offset
+ *
+ *	On some architectures (m68k, mips, or1k, powerpc, and riscv),
+ *	the DTV offsets passed to __tls_get_addr have a bias relative
+ *	to the start of the DTV, in order to maximize the range of TLS
+ *	offsets that can be used by instruction encodings with signed
+ *	displacements.
+ */
 #ifndef TLS_DTV_OFFSET
 #define TLS_DTV_OFFSET 0
 #endif
 
 static size_t _rtld_tls_static_space;	/* Static TLS space allocated */
 static size_t _rtld_tls_static_offset;	/* Next offset for static TLS to use */
-size_t _rtld_tls_dtv_generation = 1;
-size_t _rtld_tls_max_index = 1;
+size_t _rtld_tls_dtv_generation = 1;	/* Bumped on each load of obj w/ TLS */
+size_t _rtld_tls_max_index = 1;		/* Max index into up-to-date DTV */
 
-#define	DTV_GENERATION(dtv)		((size_t)((dtv)[0]))
-#define	DTV_MAX_INDEX(dtv)		((size_t)((dtv)[-1]))
+/*
+ * DTV -- Dynamic Thread Vector
+ *
+ *	The DTV is a per-thread array that maps each module with
+ *	thread-local storage to a pointer into part of the thread's TCB
+ *	(thread control block), or dynamically loaded TLS blocks,
+ *	reserved for that module's storage.
+ *
+ *	The TCB itself, struct tls_tcb, has a pointer to the DTV at
+ *	tcb->tcb_dtv.
+ *
+ *	The layout is:
+ *
+ *		+---------------+
+ *		| max index     |	-1	max index i for which dtv[i] is alloced
+ *		+---------------+
+ *		| generation    |	 0	void **dtv points here
+ *		+---------------+
+ *		| obj 1 tls ptr |	 1	TLS pointer for obj w/ obj->tlsindex 1
+ *		+---------------+
+ *		| obj 2 tls ptr |	 2	TLS pointer for obj w/ obj->tlsindex 2
+ *		+---------------+
+ *		.
+ *		.
+ *		.
+ *
+ *	The values of obj->tlsindex start at 1; this way,
+ *	dtv[obj->tlsindex] works, when dtv[0] is the generation.  The
+ *	TLS pointers go either into the static thread-local storage,
+ *	for the initial objects (i.e., those loaded at startup), or
+ *	into TLS blocks dynamically allocated for objects that are
+ *	dynamically loaded by dlopen.
+ *
+ *	The generation field is a cache of the global generation number
+ *	_rtld_tls_dtv_generation, which is bumped every time an object
+ *	with TLS is loaded in _rtld_map_object, and cached by
+ *	__tls_get_addr (via _rtld_tls_get_addr) when a newly loaded
+ *	module lies outside the bounds of the current DTV.
+ *
+ *	XXX Why do we keep max index and generation separately?  They
+ *	appear to be initialized the same, always incremented together,
+ *	and always stored together.
+ *
+ *	XXX Why is this not a struct?
+ *
+ *		struct dtv {
+ *			size_t dtv_gen;
+ *			void *dtv_module[];
+ *		};
+ */
+#define	DTV_GENERATION(dtv)		((size_t)((dtv)[0]))
+#define	DTV_MAX_INDEX(dtv)		((size_t)((dtv)[-1]))
 #define	SET_DTV_GENERATION(dtv, val)	(dtv)[0] = (void *)(size_t)(val)
 #define	SET_DTV_MAX_INDEX(dtv, val)	(dtv)[-1] = (void *)(size_t)(val)
 
+/*
+ * _rtld_tls_get_addr(tcb, idx, offset)
+ *
+ *	Slow path for __tls_get_addr (see below), called to allocate
+ *	TLS space if needed for the object obj with obj->tlsindex idx,
+ *	at offset, which must be below obj->tlssize.
+ *
+ *	This may allocate a DTV if the current one is too old, and it
+ *	may allocate a dynamically loaded TLS block if there isn't one
+ *	already allocated for it.
+ *
+ *	XXX Why is the first argument passed as `void *tls' instead of
+ *	just `struct tls_tcb *tcb'?
+ */
 void *
 _rtld_tls_get_addr(void *tls, size_t idx, size_t offset)
 {
@@ -70,15 +154,26 @@ _rtld_tls_get_addr(void *tls, size_t idx
 
 	dtv = tcb->tcb_dtv;
 
+	/*
+	 * If the generation number has changed, we have to allocate a
+	 * new DTV.
+	 *
+	 * XXX Do we really?  Isn't it enough to check whether idx <=
+	 * DTV_MAX_INDEX(dtv)?
+	 */
 	if (__predict_false(DTV_GENERATION(dtv) != _rtld_tls_dtv_generation)) {
 		size_t to_copy = DTV_MAX_INDEX(dtv);
 
+		/*
+		 * "2 +" because the first element is the generation and
+		 * the second one is the maximum index.
+		 */
 		new_dtv = xcalloc((2 + _rtld_tls_max_index) * sizeof(*dtv));
-		++new_dtv;
-		if (to_copy > _rtld_tls_max_index)
+		++new_dtv;	/* advance past DTV_MAX_INDEX */
+		if (to_copy > _rtld_tls_max_index)	/* XXX How? */
 			to_copy = _rtld_tls_max_index;
 		memcpy(new_dtv + 1, dtv + 1, to_copy * sizeof(*dtv));
-		xfree(dtv - 1);
+		xfree(dtv - 1);	/* retreat back to DTV_MAX_INDEX */
 		dtv = tcb->tcb_dtv = new_dtv;
 		SET_DTV_MAX_INDEX(dtv, _rtld_tls_max_index);
 		SET_DTV_GENERATION(dtv, _rtld_tls_dtv_generation);
@@ -92,6 +187,18 @@ _rtld_tls_get_addr(void *tls, size_t idx
 	return (uint8_t *)dtv[idx] + offset;
 }
 
+/*
+ * _rtld_tls_initial_allocation()
+ *
+ *	Allocate the TCB (thread control block) for the initial thread,
+ *	once the static TLS space usage has been determined (plus some
+ *	slop to allow certain special cases like Mesa to be dlopened).
+ *
+ *	This must be done _after_ all initial objects (i.e., those
+ *	loaded at startup, as opposed to objects dynamically loaded by
+ *	dlopen) have had TLS offsets allocated if need be by
+ *	_rtld_tls_offset_allocate, and have had relocations processed.
+ */
 void
 _rtld_tls_initial_allocation(void)
 {
@@ -114,6 +221,20 @@ _rtld_tls_initial_allocation(void)
 #endif
 }
 
+/*
+ * _rtld_tls_allocate_locked()
+ *
+ *	Internal subroutine to allocate a TCB (thread control block)
+ *	for the current thread.
+ *
+ *	This allocates a DTV and a TCB that points to it, including
+ *	static space in the TCB for the TLS of the initial objects.
+ *	TLS blocks for dynamically loaded objects are allocated lazily.
+ *
+ *	Caller must either be single-threaded (at startup via
+ *	_rtld_tls_initial_allocation) or hold the rtld exclusive lock
+ *	(via _rtld_tls_allocate).
+ */
 static struct tls_tcb *
 _rtld_tls_allocate_locked(void)
 {
@@ -131,8 +252,12 @@ _rtld_tls_allocate_locked(void)
 	tcb->tcb_self = tcb;
 #endif
 	dbg(("lwp %d tls tcb %p", _lwp_self(), tcb));
+	/*
+	 * "2 +" because the first element is the generation and the second
+	 * one is the maximum index.
+	 */
 	tcb->tcb_dtv = xcalloc(sizeof(*tcb->tcb_dtv) * (2 + _rtld_tls_max_index));
-	++tcb->tcb_dtv;
+	++tcb->tcb_dtv;	/* advance past DTV_MAX_INDEX */
 	SET_DTV_MAX_INDEX(tcb->tcb_dtv, _rtld_tls_max_index);
 	SET_DTV_GENERATION(tcb->tcb_dtv, _rtld_tls_dtv_generation);
 
@@ -155,6 +280,14 @@ _rtld_tls_allocate_locked(void)
 	return tcb;
 }
 
+/*
+ * _rtld_tls_allocate()
+ *
+ *	Allocate a TCB (thread control block) for the current thread.
+ *
+ *	Called by pthread_create for non-initial threads.  (The initial
+ *	thread's TCB is allocated by _rtld_tls_initial_allocation.)
+ */
 struct tls_tcb *
 _rtld_tls_allocate(void)
 {
@@ -168,6 +301,14 @@ _rtld_tls_allocate(void)
 	return tcb;
 }
 
+/*
+ * _rtld_tls_free(tcb)
+ *
+ *	Free a TCB allocated with _rtld_tls_allocate.
+ *
+ *	Frees any TLS blocks for dynamically loaded objects that tcb's
+ *	DTV points to, and frees tcb's DTV, and frees tcb.
+ */
 void
 _rtld_tls_free(struct tls_tcb *tcb)
 {
@@ -190,12 +331,27 @@ _rtld_tls_free(struct tls_tcb *tcb)
 		    (uint8_t *)tcb->tcb_dtv[i] >= p_end)
 			xfree(tcb->tcb_dtv[i]);
 	}
-	xfree(tcb->tcb_dtv - 1);
+	xfree(tcb->tcb_dtv - 1);	/* retreat back to DTV_MAX_INDEX */
 	xfree(p);
 
 	_rtld_exclusive_exit(&mask);
 }
 
+/*
+ * _rtld_tls_module_allocate(tcb, idx)
+ *
+ *	Allocate thread-local storage in the thread with the given TCB
+ *	(thread control block) for the object obj whose obj->tlsindex
+ *	is idx.
+ *
+ *	If obj has had space in static TLS reserved (obj->tls_static),
+ *	return a pointer into that.  Otherwise, allocate a TLS block,
+ *	mark obj as having a TLS block allocated (obj->tls_dynamic),
+ *	and return it.
+ *
+ *	Called by _rtld_tls_get_addr to get the thread-local storage
+ *	for an object the first time around.
+ */
 static void *
 _rtld_tls_module_allocate(struct tls_tcb *tcb, size_t idx)
 {
@@ -228,6 +384,16 @@ _rtld_tls_module_allocate(struct tls_tcb
 	return p;
 }
 
+/*
+ * _rtld_tls_offset_allocate(obj)
+ *
+ *	Allocate a static thread-local storage offset for obj.
+ *
+ *	Called by _rtld at startup for all initial objects.  Called
+ *	also by MD relocation logic, which is allowed (for Mesa) to
+ *	allocate an additional 64 bytes (RTLD_STATIC_TLS_RESERVATION)
+ *	of static thread-local storage in dlopened objects.
+ */
 int
 _rtld_tls_offset_allocate(Obj_Entry *obj)
 {
@@ -284,6 +450,17 @@ _rtld_tls_offset_allocate(Obj_Entry *obj
 	return 0;
 }
 
+/*
+ * _rtld_tls_offset_free(obj)
+ *
+ *	Free a static thread-local storage offset for obj.
+ *
+ *	Called by dlclose (via _rtld_unload_object -> _rtld_obj_free).
+ *
+ *	Since static thread-local storage is normally not used by
+ *	dlopened objects (with the exception of Mesa), this doesn't do
+ *	anything to recycle the space right now.
+ */
 void
 _rtld_tls_offset_free(Obj_Entry *obj)
 {
@@ -297,10 +474,33 @@ _rtld_tls_offset_free(Obj_Entry *obj)
 
 #if defined(__HAVE_COMMON___TLS_GET_ADDR) && defined(RTLD_LOADER)
 /*
- * The fast path is access to an already allocated DTV entry.
- * This checks the current limit and the entry without needing any
- * locking.  Entries are only freed on dlclose() and it is an application
- * bug if code of the module is still running at that point.
+ * __tls_get_addr(tlsindex)
+ *
+ *	Symbol directly called by code generated by the compiler for
+ *	references to thread-local storage in the general-dynamic or
+ *	local-dynamic TLS models (but not initial-exec or local-exec).
+ *
+ *	The argument is a pointer to
+ *
+ *		struct {
+ *			unsigned long int ti_module;
+ *			unsigned long int ti_offset;
+ *		};
+ *
+ *	as in, e.g., [ELFTLS] Sec. 3.4.3.  This coincides with the
+ *	type size_t[2] on all architectures that use this common
+ *	__tls_get_addr definition (XXX but why do we write it as
+ *	size_t[2]?).
+ *
+ *	ti_module, i.e., arg[0], is the obj->tlsindex assigned at
+ *	load-time by _rtld_map_object, and ti_offset, i.e., arg[1], is
+ *	assigned at link-time by ld(1), possibly adjusted by
+ *	TLS_DTV_OFFSET.
+ *
+ *	Some architectures -- specifically IA-64 -- use a different
+ *	calling convention.  Some architectures -- specifically i386
+ *	-- also use another entry point ___tls_get_addr (that's three
+ *	leading underscores) with a different calling convention.
  */
 void *
 __tls_get_addr(void *arg_)
@@ -316,6 +516,13 @@ __tls_get_addr(void *arg_)
 
 	dtv = tcb->tcb_dtv;
 
+	/*
+	 * Fast path: access to an already allocated DTV entry.  This
+	 * checks the current limit and the entry without needing any
+	 * locking.  Entries are only freed on dlclose() and it is an
+	 * application bug if code of the module is still running at
+	 * that point.
+	 */
 	if (__predict_true(idx < DTV_MAX_INDEX(dtv) && dtv[idx] != NULL))
 		return (uint8_t *)dtv[idx] + offset;
 

Index: src/libexec/ld.elf_so/arch/aarch64/rtld_start.S
diff -u src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4 src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4.2.1
--- src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4	Fri Jan 18 11:59:03 2019
+++ src/libexec/ld.elf_so/arch/aarch64/rtld_start.S	Wed Aug 7 11:01:57 2024
@@ -1,4 +1,4 @@
-/* $NetBSD: rtld_start.S,v 1.4 2019/01/18 11:59:03 skrll Exp $ */
+/* $NetBSD: rtld_start.S,v 1.4.2.1 2024/08/07 11:01:57 martin Exp $ */
 
 /*-
  * Copyright (c) 2014 The NetBSD Foundation, Inc.
@@ -60,7 +60,7 @@
 
 #include <machine/asm.h>
 
-RCSID("$NetBSD: rtld_start.S,v 1.4 2019/01/18 11:59:03 skrll Exp $")
+RCSID("$NetBSD: rtld_start.S,v 1.4.2.1 2024/08/07 11:01:57 martin Exp $")
 
 /*
  * void _rtld_start(void (*cleanup)(void), const Obj_Entry *obj,
@@ -146,87 +146,121 @@ ENTRY_NP(_rtld_bind_start)
 END(_rtld_bind_start)
 
 /*
- * struct rel_tlsdesc {
- *	uint64_t resolver_fnc;
- *	uint64_t resolver_arg;
+ * Entry points used by _rtld_tlsdesc_fill.  They will be passed in x0
+ * a pointer to:
  *
+ *	struct rel_tlsdesc {
+ *		uint64_t resolver_fnc;
+ *		uint64_t resolver_arg;
+ *	};
  *
- * uint64_t _rtld_tlsdesc_static(struct rel_tlsdesc *);
+ * They are called with a nonstandard calling convention and must
+ * preserve all registers except x0.
+ */
+
+/*
+ * uint64_t@x0
+ * _rtld_tlsdesc_static(struct rel_tlsdesc *rel_tlsdesc@x0);
+ *
+ *	Resolver function for TLS symbols resolved at load time.
  *
- * Resolver function for TLS symbols resolved at load time
+ *	rel_tlsdesc->resolver_arg is the offset of the static
+ *	thread-local storage region, relative to the start of the TCB.
+ *
+ *	Nonstandard calling convention:  Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_static)
 	.cfi_startproc
-	ldr	x0, [x0, #8]
-	ret
+	ldr	x0, [x0, #8]	/* x0 := tcboffset */
+	ret			/* return x0 = tcboffset */
 	.cfi_endproc
 END(_rtld_tlsdesc_static)
 
 /*
- * uint64_t _rtld_tlsdesc_undef(void);
+ * uint64_t@x0
+ * _rtld_tlsdesc_undef(struct rel_tlsdesc *rel_tlsdesc@x0);
+ *
+ *	Resolver function for weak and undefined TLS symbols.
  *
- * Resolver function for weak and undefined TLS symbols
+ *	rel_tlsdesc->resolver_arg is the Elf_Rela rela->r_addend.
+ *
+ *	Nonstandard calling convention:  Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_undef)
 	.cfi_startproc
-	str	x1, [sp, #-16]!
+	str	x1, [sp, #-16]!		/* save x1 on stack */
 	.cfi_adjust_cfa_offset	16
 
-	mrs	x1, tpidr_el0
-	ldr	x0, [x0, #8]
-	sub	x0, x0, x1
+	mrs	x1, tpidr_el0		/* x1 := current thread tcb */
+	ldr	x0, [x0, #8]		/* x0 := rela->r_addend */
+	sub	x0, x0, x1		/* x0 := rela->r_addend - tcb */
 
-	ldr	x1, [sp], #16
-	.cfi_adjust_cfa_offset	-16
+	ldr	x1, [sp], #16		/* restore x1 from stack */
+	.cfi_adjust_cfa_offset	-16
 	.cfi_endproc
-	ret
+	ret		/* return x0 = rela->r_addend - tcb */
END(_rtld_tlsdesc_undef)
 
 /*
- * uint64_t _rtld_tlsdesc_dynamic(struct rel_tlsdesc *);
+ * uint64_t@x0
+ * _rtld_tlsdesc_dynamic(struct rel_tlsdesc *tlsdesc@x0);
+ *
+ *	Resolver function for TLS symbols from dlopen().
  *
- * Resolver function for TLS symbols from dlopen()
+ *	rel_tlsdesc->resolver_arg is a pointer to a struct tls_data
+ *	object allocated during relocation.
+ *
+ *	Nonstandard calling convention:  Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_startproc
 
 	/* Save registers used in fast path */
-	stp	x1, x2, [sp, #(-2 * 16)]!
-	stp	x3, x4, [sp, #(1 * 16)]
+	stp	x1, x2, [sp, #(-2 * 16)]!
+	stp	x3, x4, [sp, #(1 * 16)]
 	.cfi_adjust_cfa_offset	2 * 16
 	.cfi_rel_offset		x1, 0
 	.cfi_rel_offset		x2, 8
 	.cfi_rel_offset		x3, 16
 	.cfi_rel_offset		x4, 24
 
-	/* Test fastpath - inlined version of __tls_get_addr. */
+	/* Try for the fast path -- inlined version of __tls_get_addr. */
 
-	ldr	x1, [x0, #8]		/* tlsdesc ptr */
-	mrs	x4, tpidr_el0
-	ldr	x0, [x4]		/* DTV pointer (tcb->tcb_dtv) */
+	ldr	x1, [x0, #8]	/* x1 := tlsdesc (struct tls_data *) */
+	mrs	x4, tpidr_el0	/* x4 := tcb */
+	ldr	x0, [x4]	/* x0 := dtv = tcb->tcb_dtv */
 
-	ldr	x3, [x0, #-8]		/* DTV_MAX_INDEX(dtv) */
-	ldr	x2, [x1, #0]		/* tlsdesc->td_tlsindex */
+	ldr	x3, [x0, #-8]	/* x3 := max = DTV_MAX_INDEX(dtv) */
+	ldr	x2, [x1, #0]	/* x2 := idx = tlsdesc->td_tlsindex */
 	cmp	x2, x3
-	b.lt	1f			/* Slow path */
+	b.gt	1f		/* Slow path if idx > max */
+
+	ldr	x3, [x0, x2, lsl #3]	/* x3 := dtv[idx] */
+	cbz	x3, 1f			/* Slow path if dtv[idx] is null */
 
-	ldr	x3, [x0, x2, lsl #3]	/* dtv[tlsdesc->td_tlsindex] */
-	cbz	x3, 1f
+	/*
+	 * Fast path
+	 *
+	 * return (dtv[tlsdesc->td_tlsindex] + tlsdesc->td_tlsoffs - tcb)
+	 */
+	ldr	x2, [x1, #8]	/* x2 := offs = tlsdesc->td_tlsoffs */
+	add	x2, x2, x3	/* x2 := addr = dtv[idx] + offs */
+	sub	x0, x2, x4	/* x0 := addr - tcb */
 
-	/* Return (dtv[tlsdesc->td_tlsindex] + tlsdesc->td_tlsoffs - tp) */
-	ldr	x2, [x1, #8]		/* tlsdesc->td_tlsoffs */
-	add	x2, x2, x3
-	sub	x0, x2, x4
-
-	/* Restore registers and return */
-	ldp	x3, x4, [sp, #(1 * 16)]
-	ldp	x1, x2, [sp], #(2 * 16)
-	.cfi_adjust_cfa_offset	-2 * 16
-	ret
+	/* Restore fast path registers and return */
+	ldp	x3, x4, [sp, #(1 * 16)]
+	ldp	x1, x2, [sp], #(2 * 16)
+	.cfi_adjust_cfa_offset	-2 * 16
+	ret			/* return x0 = addr - tcb */
 
 	/*
 	 * Slow path
-	 * return _rtld_tls_get_addr(tp, tlsdesc->td_tlsindex, tlsdesc->td_tlsoffs);
+	 *
+	 * return _rtld_tls_get_addr(tp, tlsdesc->td_tlsindex,
+	 *     tlsdesc->td_tlsoffs);
 	 *
 	 */
 1:
@@ -236,18 +270,18 @@ ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_rel_offset		x29, 0
 	.cfi_rel_offset		x30, 8
 
-	stp	x5, x6, [sp, #(1 * 16)]
-	stp	x7, x8, [sp, #(2 * 16)]
-	stp	x9, x10, [sp, #(3 * 16)]
+	stp	x5, x6,   [sp, #(1 * 16)]
+	stp	x7, x8,   [sp, #(2 * 16)]
+	stp	x9, x10,  [sp, #(3 * 16)]
 	stp	x11, x12, [sp, #(4 * 16)]
 	stp	x13, x14, [sp, #(5 * 16)]
 	stp	x15, x16, [sp, #(6 * 16)]
 	stp	x17, x18, [sp, #(7 * 16)]
-	.cfi_rel_offset		x5, 16
-	.cfi_rel_offset		x6, 24
-	.cfi_rel_offset		x7, 32
-	.cfi_rel_offset		x8, 40
-	.cfi_rel_offset		x9, 48
+	.cfi_rel_offset		x5,  16
+	.cfi_rel_offset		x6,  24
+	.cfi_rel_offset		x7,  32
+	.cfi_rel_offset		x8,  40
+	.cfi_rel_offset		x9,  48
 	.cfi_rel_offset		x10, 56
 	.cfi_rel_offset		x11, 64
 	.cfi_rel_offset		x12, 72
@@ -259,31 +293,32 @@ ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_rel_offset		x18, 120
 
 	/* Find the tls offset */
-	mov	x0, x4			/* tp */
-	mov	x3, x1			/* tlsdesc ptr */
-	ldr	x1, [x3, #0]		/* tlsdesc->td_tlsindex */
-	ldr	x2, [x3, #8]		/* tlsdesc->td_tlsoffs */
-	bl	_rtld_tls_get_addr
-	mrs	x1, tpidr_el0
-	sub	x0, x0, x1
+	mov	x0, x4		/* x0 := tcb */
+	mov	x3, x1		/* x3 := tlsdesc */
+	ldr	x1, [x3, #0]	/* x1 := idx = tlsdesc->td_tlsindex */
+	ldr	x2, [x3, #8]	/* x2 := offs = tlsdesc->td_tlsoffs */
+	bl	_rtld_tls_get_addr	/* x0 := addr = _rtld_tls_get_addr(tcb,
+					 * idx, offs) */
+	mrs	x1, tpidr_el0	/* x1 := tcb */
+	sub	x0, x0, x1	/* x0 := addr - tcb */
 
 	/* Restore slow path registers */
 	ldp	x17, x18, [sp, #(7 * 16)]
 	ldp	x15, x16, [sp, #(6 * 16)]
 	ldp	x13, x14, [sp, #(5 * 16)]
 	ldp	x11, x12, [sp, #(4 * 16)]
-	ldp	x9, x10, [sp, #(3 * 16)]
-	ldp	x7, x8, [sp, #(2 * 16)]
-	ldp	x5, x6, [sp, #(1 * 16)]
+	ldp	x9, x10,  [sp, #(3 * 16)]
+	ldp	x7, x8,   [sp, #(2 * 16)]
+	ldp	x5, x6,   [sp, #(1 * 16)]
 	ldp	x29, x30, [sp], #(8 * 16)
-	.cfi_adjust_cfa_offset -8 * 16
+	.cfi_adjust_cfa_offset	-8 * 16
 	.cfi_restore	x29
 	.cfi_restore	x30
 
 	/* Restore fast path registers and return */
-	ldp	x3, x4, [sp, #16]
-	ldp	x1, x2, [sp], #(2 * 16)
+	ldp	x3, x4,	[sp, #16]
+	ldp	x1, x2,	[sp], #(2 * 16)
 	.cfi_adjust_cfa_offset	-2 * 16
 	.cfi_endproc
-	ret
+	ret			/* return x0 = addr - tcb */
 END(_rtld_tlsdesc_dynamic)

Index: src/tests/libexec/ld.elf_so/t_tls_extern.c
diff -u src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.2 src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.3
--- src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.2	Fri Aug 4 12:55:46 2023
+++ src/tests/libexec/ld.elf_so/t_tls_extern.c	Wed Aug 7 11:01:57 2024
@@ -1,4 +1,4 @@
-/*	$NetBSD: t_tls_extern.c,v 1.12.4.2 2023/08/04 12:55:46 martin Exp $	*/
+/*	$NetBSD: t_tls_extern.c,v 1.12.4.3 2024/08/07 11:01:57 martin Exp $	*/
 
 /*-
  * Copyright (c) 2023 The NetBSD Foundation, Inc.
@@ -382,6 +382,63 @@ ATF_TC_BODY(onlydef_static_dynamic_lazy,
 	    pstatic, pdynamic);
 }
 
+ATF_TC(opencloseloop_use);
+ATF_TC_HEAD(opencloseloop_use, tc)
+{
+	atf_tc_set_md_var(tc, "descr", "Testing opening and closing in a loop,"
+	    " then opening and using dynamic TLS");
+}
+ATF_TC_BODY(opencloseloop_use, tc)
+{
+	unsigned i;
+	void *def, *use;
+	int *(*fdef)(void), *(*fuse)(void);
+	int *pdef, *puse;
+
+	/*
+	 * Open and close the definition library repeatedly.  This
+	 * should trigger allocation of many DTV offsets, which are
+	 * (currently) not recycled, so the required DTV offsets should
+	 * become very long -- pages past what is actually allocated
+	 * before we attempt to use it.
+	 *
+	 * This way, we will exercise the wrong-way-conditional fast
+	 * path of PR lib/58154.
+	 */
+	for (i = sysconf(_SC_PAGESIZE); i --> 0;) {
+		ATF_REQUIRE_DL(def = dlopen("libh_def_dynamic.so", 0));
+		ATF_REQUIRE_EQ_MSG(dlclose(def), 0,
+		    "dlclose(def): %s", dlerror());
+	}
+
+	/*
+	 * Now open the definition library and keep it open.
+	 */
+	ATF_REQUIRE_DL(def = dlopen("libh_def_dynamic.so", 0));
+	ATF_REQUIRE_DL(fdef = dlsym(def, "fdef"));
+
+	/*
+	 * Open libraries that use the definition and verify they
+	 * observe the same pointer.
+	 */
+	ATF_REQUIRE_DL(use = dlopen("libh_use_dynamic.so", 0));
+	ATF_REQUIRE_DL(fuse = dlsym(use, "fuse"));
+	pdef = (*fdef)();
+	puse = (*fuse)();
+	ATF_CHECK_EQ_MSG(pdef, puse,
+	    "%p in defining library != %p in using library",
+	    pdef, puse);
+
+	/*
+	 * Also verify the pointer can be used.
+	 */
+	*pdef = 123;
+	*puse = 456;
+	ATF_CHECK_EQ_MSG(*pdef, *puse,
+	    "%d in defining library != %d in using library",
+	    *pdef, *puse);
+}
+
 ATF_TP_ADD_TCS(tp)
 {
 
@@ -398,6 +455,7 @@ ATF_TP_ADD_TCS(tp)
 	ATF_TP_ADD_TC(tp, onlydef_dynamic_static_lazy);
 	ATF_TP_ADD_TC(tp, onlydef_static_dynamic_eager);
 	ATF_TP_ADD_TC(tp, onlydef_static_dynamic_lazy);
+	ATF_TP_ADD_TC(tp, opencloseloop_use);
 	ATF_TP_ADD_TC(tp, static_abusedef);
 	ATF_TP_ADD_TC(tp, static_abusedefnoload);
 	ATF_TP_ADD_TC(tp, static_defabuse_eager);
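To make the aarch64 fix above concrete: the fast path must be taken
when idx <= DTV_MAX_INDEX(dtv) and dtv[idx] is non-null, but the old
code branched to the slow path on `b.lt' (idx < max), i.e., the
condition was the wrong way around; the open/close-loop test added
above drives idx and max far enough apart to expose that.  A rough C
rendition of the corrected logic follows.  It is a sketch only:
tlsdesc_dynamic and slow_path are hypothetical names standing in for
the assembly above and _rtld_tls_get_addr, not actual rtld code.

	#include <stddef.h>
	#include <stdint.h>

	/* Stand-in for the slow path, _rtld_tls_get_addr (hypothetical). */
	extern uint8_t *slow_path(void *tcb, size_t idx, size_t offs);

	/* Returns the offset from the TCB, as the resolver does in x0. */
	static ptrdiff_t
	tlsdesc_dynamic(void **dtv, uint8_t *tcb, size_t idx, size_t offs)
	{
		size_t max = (size_t)dtv[-1];	/* DTV_MAX_INDEX(dtv) */

		/* Old code effectively took the slow path when idx < max. */
		if (idx > max || dtv[idx] == NULL)	/* fixed condition */
			return slow_path(tcb, idx, offs) - tcb;
		return ((uint8_t *)dtv[idx] + offs) - tcb;
	}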