Module Name:    src
Committed By:   martin
Date:           Wed Aug  7 11:01:57 UTC 2024

Modified Files:
        src/libexec/ld.elf_so [netbsd-9]: README.TLS tls.c
        src/libexec/ld.elf_so/arch/aarch64 [netbsd-9]: rtld_start.S
        src/tests/libexec/ld.elf_so [netbsd-9]: t_tls_extern.c

Log Message:
Pull up following revision(s) (requested by riastradh in ticket #1864):

        libexec/ld.elf_so/tls.c: revision 1.15
        libexec/ld.elf_so/arch/aarch64/rtld_start.S: revision 1.6
        libexec/ld.elf_so/arch/aarch64/rtld_start.S: revision 1.7
        tests/libexec/ld.elf_so/t_tls_extern.c: revision 1.15
        tests/libexec/ld.elf_so/t_tls_extern.c: revision 1.16
        libexec/ld.elf_so/README.TLS: revision 1.7
        libexec/ld.elf_so/tls.c: revision 1.20
        libexec/ld.elf_so/tls.c: revision 1.21

Alignment. NFCI.

ld.elf_so: Sprinkle comments and references for thread-local storage.

Maybe this will help the TLS business to be less mysterious to the
next traveller to pass by here.
Prompted by PR lib/58154.

ld.elf_so: Add comments explaining DTV allocation size.
Patch by pho@ for PR lib/58154.

tests/libexec/ld.elf_so/t_tls_extern: Test PR lib/58154.

ld.elf_so aarch64/rtld_start.S: Sprinkle comments.
No functional change intended.
Prompted by PR lib/58154.

ld.elf_so aarch64/rtld_start.S: Fix dynamic TLS fast path branch.
Bug found and patch prepared by pho@.
PR lib/58154
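
For illustration (not part of the commit): the bug was a reversed
branch condition in the inlined __tls_get_addr fast path.  A minimal C
sketch of the check the aarch64 stub implements, with names taken from
tls.c and rtld_start.S:

	/*
	 * Use the DTV entry if the index is in bounds and the entry
	 * is already allocated; otherwise take the slow path, which
	 * may grow the DTV or allocate the module's TLS block.
	 */
	if (idx <= DTV_MAX_INDEX(dtv) && dtv[idx] != NULL)
		return (uint8_t *)dtv[idx] + offset;	/* fast path */
	return _rtld_tls_get_addr(tcb, idx, offset);	/* slow path */

Before the fix, the assembly branched to the slow path on idx < max
rather than idx > max, so an index beyond the thread's stale DTV could
fall through to the fast path and read out of bounds.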


To generate a diff of this commit:
cvs rdiff -u -r1.5.2.1 -r1.5.2.2 src/libexec/ld.elf_so/README.TLS
cvs rdiff -u -r1.12.2.2 -r1.12.2.3 src/libexec/ld.elf_so/tls.c
cvs rdiff -u -r1.4 -r1.4.2.1 src/libexec/ld.elf_so/arch/aarch64/rtld_start.S
cvs rdiff -u -r1.12.4.2 -r1.12.4.3 src/tests/libexec/ld.elf_so/t_tls_extern.c

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Modified files:

Index: src/libexec/ld.elf_so/README.TLS
diff -u src/libexec/ld.elf_so/README.TLS:1.5.2.1 src/libexec/ld.elf_so/README.TLS:1.5.2.2
--- src/libexec/ld.elf_so/README.TLS:1.5.2.1	Fri Aug  4 12:55:45 2023
+++ src/libexec/ld.elf_so/README.TLS	Wed Aug  7 11:01:57 2024
@@ -1,11 +1,111 @@
+Thread-local storage.
+
+Each thread has a thread control block, or TCB.  The TCB is a
+variable-size structure headed by `struct tls_tcb' from <sys/tls.h>,
+with:
+
+(a) static thread-local storage for the TLS data of initial objects,
+    i.e., those loaded at startup rather than those dynamically loaded
+    by dlopen
+
+(b) a pointer to a dynamic thread vector (DTV) for the TLS data
+    pointers of objects that use global-dynamic or local-dynamic models
+    (typically shared libraries or dlopenable modules)
+
+(c) the pthread_t pointer
+
+The per-thread lwp private pointer, also sometimes called TP (thread
+pointer), managed by the _lwp_setprivate and _lwp_getprivate syscalls,
+either points at the TCB directly, or, on some architectures, points at
+
+	tp = tcb + sizeof(struct tls_tcb) + TLS_TP_OFFSET.
+
+This bias is chosen because, on architectures with signed displacements
+from TP, it doubles the reachable range of static TLS offsets.
+Architectures with such a tp/tcb offset must provide
+
+void *__lwp_gettcb_fast(void);
+
+in machine/mcontext.h and must define __HAVE___LWP_GETTCB_FAST in
+machine/types.h to reflect this; otherwise they must provide
+__lwp_getprivate_fast to return the TCB pointer.
+
+Each architecture has one of two TLS variants, variant I or variant II.
+Variant I places the static thread-local storage _after_ the fixed
+content of the TCB, at increasing addresses (addresses grow downward
+in the diagram):
+
+	+---------------+
+	| dtv pointer   |       tcb points here (struct tls_tcb)
+	+---------------+
+	| pthread_t     |
+	+---------------+
+	| obj0 tls      |       obj0->tlsoffset = 0
+	|               |
+	|               |
+	+---------------+
+	| obj1 tls      |       obj1->tlsoffset = 3
+	+---------------+
+	| obj2 tls      |       obj2->tlsoffset = 4
+	|               |
+	.               .
+	.               .
+	.               .
+	|               |
+	+---------------+
+	| objN tls      |       objN->tlsoffset = k
+	+---------------+
+
+Variant II places the static thread-local storage _before_ the fixed
+content of the TCB, at decreasing addresses:
+
+	+---------------+
+	| objN tls      |       objN->tlsoffset = k
+	+---------------+
+	| obj(N-1) tls  |       obj(N-1)->tlsoffset = k - 1
+	.               .
+	.               .
+	.               .
+	|               |
+	+---------------+
+	| obj2 tls      |       obj2->tlsoffset = 4
+	+---------------+
+	| obj1 tls      |       obj1->tlsoffset = 3
+	+---------------+
+	| obj0 tls      |       obj0->tlsoffset = 0
+	|               |
+	|               |
+	+---------------+
+	| tcb pointer   |       tcb points here (struct tls_tcb)
+	+---------------+
+	| dtv pointer   |
+	+---------------+
+	| pthread_t     |
+	+---------------+
+
+See [ELFTLS] Sec. 3 `Run-Time Handling of TLS', Figs 1 and 2, for
+bigger pictures including the DTV and dynamically allocated TLS blocks.
+
+Each architecture also has its own ELF ABI processor supplement with
+the architecture-specific relocations and TLS details.
+
+References:
+
+	[ELFTLS] Ulrich Drepper, `ELF Handling For Thread-Local
+	Storage', Version 0.21, 2023-08-22.
+	https://akkadia.org/drepper/tls.pdf
+	https://web.archive.org/web/20240718081934/https://akkadia.org/drepper/tls.pdf
+
 Steps for adding TLS support for a new platform:
 
 (1) Declare TLS variant in machine/types.h by defining either
 __HAVE_TLS_VARIANT_I or __HAVE_TLS_VARIANT_II.
 
-(2) _lwp_makecontext has to set the reserved register or kernel transfer
-variable in uc_mcontext to the provided value of 'private'. See
-src/lib/libc/arch/$PLATFORM/gen/_lwp.c.
+(2) _lwp_makecontext has to set the reserved register or kernel
+transfer variable in uc_mcontext according to the provided value of
+`private'.  Note that _lwp_makecontext takes tcb, not tp, as an
+argument, so make sure to adjust it if needed for the tp/tcb offset.
+See src/lib/libc/arch/$PLATFORM/gen/_lwp.c.
 
 This is not possible on the VAX as there is no free space in ucontext_t.
 This requires either a special version of _lwp_create or versioning
@@ -60,9 +160,22 @@ def->st_value - defobj->tlsoffset + rela
 
 e.g. starting offset is counting down from the TCB.
 
-(6) Implement __lwp_getprivate_fast() in machine/mcontext.h and set
-__HAVE___LWP_GETPRIVATE_FAST in machine/types.h.
+(6) If there is a tp/tcb offset, implement
+
+	__lwp_gettcb_fast()
+	__lwp_settcb()
+
+in machine/mcontext.h and set
+
+	__HAVE___LWP_GETTCB_FAST
+	__HAVE___LWP_SETTCB
+
+in machine/types.h.
+
+Otherwise, implement __lwp_getprivate_fast() in machine/mcontext.h and
+set __HAVE___LWP_GETPRIVATE_FAST in machine/types.h.
 
-(7) Test using src/tests/lib/libc/tls.  Make sure with "objdump -R" that
-t_tls_dynamic has two TPOFF relocations and h_tls_dlopen.so.1 and
-libh_tls_dynamic.so.1 have both two DTPMOD and DTPOFF relocations.
+(7) Test using src/tests/lib/libc/tls and src/tests/libexec/ld.elf_so.
+Make sure with "objdump -R" that t_tls_dynamic has two TPOFF
+relocations and that h_tls_dlopen.so.1 and libh_tls_dynamic.so.1 each
+have two DTPMOD and two DTPOFF relocations.
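
(Illustrative aside, not part of the commit: on a port with a tp/tcb
offset as described above, converting between the two pointers is just
the fixed bias.  tp_from_tcb/tcb_from_tp are hypothetical names, not
NetBSD APIs; TLS_TP_OFFSET is the port's bias constant:

	static inline void *
	tp_from_tcb(struct tls_tcb *tcb)
	{
		return (uint8_t *)tcb + sizeof(struct tls_tcb) +
		    TLS_TP_OFFSET;
	}

	static inline struct tls_tcb *
	tcb_from_tp(void *tp)
	{
		return (struct tls_tcb *)((uint8_t *)tp -
		    sizeof(struct tls_tcb) - TLS_TP_OFFSET);
	}

This is the same adjustment _lwp_makecontext callers must account for
in step (2) above.)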

Index: src/libexec/ld.elf_so/tls.c
diff -u src/libexec/ld.elf_so/tls.c:1.12.2.2 src/libexec/ld.elf_so/tls.c:1.12.2.3
--- src/libexec/ld.elf_so/tls.c:1.12.2.2	Fri Aug  4 12:55:45 2023
+++ src/libexec/ld.elf_so/tls.c	Wed Aug  7 11:01:57 2024
@@ -1,4 +1,4 @@
-/*	$NetBSD: tls.c,v 1.12.2.2 2023/08/04 12:55:45 martin Exp $	*/
+/*	$NetBSD: tls.c,v 1.12.2.3 2024/08/07 11:01:57 martin Exp $	*/
 /*-
  * Copyright (c) 2011 The NetBSD Foundation, Inc.
  * All rights reserved.
@@ -29,7 +29,18 @@
  */
 
 #include <sys/cdefs.h>
-__RCSID("$NetBSD: tls.c,v 1.12.2.2 2023/08/04 12:55:45 martin Exp $");
+__RCSID("$NetBSD: tls.c,v 1.12.2.3 2024/08/07 11:01:57 martin Exp $");
+
+/*
+ * Thread-local storage
+ *
+ * Reference:
+ *
+ *	[ELFTLS] Ulrich Drepper, `ELF Handling For Thread-Local
+ *	Storage', Version 0.21, 2023-08-22.
+ *	https://akkadia.org/drepper/tls.pdf
+ *	https://web.archive.org/web/20240718081934/https://akkadia.org/drepper/tls.pdf
+ */
 
 #include <sys/param.h>
 #include <sys/ucontext.h>
@@ -45,20 +56,93 @@ __RCSID("$NetBSD: tls.c,v 1.12.2.2 2023/
 static struct tls_tcb *_rtld_tls_allocate_locked(void);
 static void *_rtld_tls_module_allocate(struct tls_tcb *, size_t);
 
+/*
+ * DTV offset
+ *
+ *	On some architectures (m68k, mips, or1k, powerpc, and riscv),
+ *	the DTV offsets passed to __tls_get_addr have a bias relative
+ *	to the start of the DTV, in order to maximize the range of TLS
+ *	offsets that can be used by instruction encodings with signed
+ *	displacements.
+ */
 #ifndef TLS_DTV_OFFSET
 #define	TLS_DTV_OFFSET	0
 #endif
 
 static size_t _rtld_tls_static_space;	/* Static TLS space allocated */
 static size_t _rtld_tls_static_offset;	/* Next offset for static TLS to use */
-size_t _rtld_tls_dtv_generation = 1;
-size_t _rtld_tls_max_index = 1;
+size_t _rtld_tls_dtv_generation = 1;	/* Bumped on each load of obj w/ TLS */
+size_t _rtld_tls_max_index = 1;		/* Max index into up-to-date DTV */
 
-#define	DTV_GENERATION(dtv)	((size_t)((dtv)[0]))
-#define	DTV_MAX_INDEX(dtv)	((size_t)((dtv)[-1]))
+/*
+ * DTV -- Dynamic Thread Vector
+ *
+ *	The DTV is a per-thread array that maps each module with
+ *	thread-local storage to a pointer into part of the thread's TCB
+ *	(thread control block), or dynamically loaded TLS blocks,
+ *	reserved for that module's storage.
+ *
+ *	The TCB itself, struct tls_tcb, has a pointer to the DTV at
+ *	tcb->tcb_dtv.
+ *
+ *	The layout is:
+ *
+ *		+---------------+
+ *		| max index     | -1    max index i for which dtv[i] is alloced
+ *		+---------------+
+ *		| generation    |  0    void **dtv points here
+ *		+---------------+
+ *		| obj 1 tls ptr |  1    TLS pointer for obj w/ obj->tlsindex 1
+ *		+---------------+
+ *		| obj 2 tls ptr |  2    TLS pointer for obj w/ obj->tlsindex 2
+ *		+---------------+
+ *		  .
+ *		  .
+ *		  .
+ *
+ *	The values of obj->tlsindex start at 1; this way,
+ *	dtv[obj->tlsindex] works, when dtv[0] is the generation.  The
+ *	TLS pointers go either into the static thread-local storage,
+ *	for the initial objects (i.e., those loaded at startup), or
+ *	into TLS blocks dynamically allocated for objects that are
+ *	dynamically loaded by dlopen.
+ *
+ *	The generation field is a cache of the global generation number
+ *	_rtld_tls_dtv_generation, which is bumped every time an object
+ *	with TLS is loaded in _rtld_map_object, and cached by
+ *	__tls_get_addr (via _rtld_tls_get_addr) when a newly loaded
+ *	module lies outside the bounds of the current DTV.
+ *
+ *	XXX Why do we keep max index and generation separately?  They
+ *	appear to be initialized the same, always incremented together,
+ *	and always stored together.
+ *
+ *	XXX Why is this not a struct?
+ *
+ *		struct dtv {
+ *			size_t	dtv_gen;
+ *			void	*dtv_module[];
+ *		};
+ */
+#define	DTV_GENERATION(dtv)		((size_t)((dtv)[0]))
+#define	DTV_MAX_INDEX(dtv)		((size_t)((dtv)[-1]))
 #define	SET_DTV_GENERATION(dtv, val)	(dtv)[0] = (void *)(size_t)(val)
 #define	SET_DTV_MAX_INDEX(dtv, val)	(dtv)[-1] = (void *)(size_t)(val)
 
+/*
+ * _rtld_tls_get_addr(tcb, idx, offset)
+ *
+ *	Slow path for __tls_get_addr (see below), called to allocate
+ *	TLS space if needed for the object obj with obj->tlsindex idx,
+ *	at offset, which must be below obj->tlssize.
+ *
+ *	This may allocate a DTV if the current one is too old, and it
+ *	may allocate a dynamically loaded TLS block if there isn't one
+ *	already allocated for it.
+ *
+ *	XXX Why is the first argument passed as `void *tls' instead of
+ *	just `struct tls_tcb *tcb'?
+ */
 void *
 _rtld_tls_get_addr(void *tls, size_t idx, size_t offset)
 {
@@ -70,15 +154,26 @@ _rtld_tls_get_addr(void *tls, size_t idx
 
 	dtv = tcb->tcb_dtv;
 
+	/*
+	 * If the generation number has changed, we have to allocate a
+	 * new DTV.
+	 *
+	 * XXX Do we really?  Isn't it enough to check whether idx <=
+	 * DTV_MAX_INDEX(dtv)?
+	 */
 	if (__predict_false(DTV_GENERATION(dtv) != _rtld_tls_dtv_generation)) {
 		size_t to_copy = DTV_MAX_INDEX(dtv);
 
+		/*
+		 * "2 +" because the first element is the generation and
+		 * the second one is the maximum index.
+		 */
 		new_dtv = xcalloc((2 + _rtld_tls_max_index) * sizeof(*dtv));
-		++new_dtv;
-		if (to_copy > _rtld_tls_max_index)
+		++new_dtv;		/* advance past DTV_MAX_INDEX */
+		if (to_copy > _rtld_tls_max_index)	/* XXX How? */
 			to_copy = _rtld_tls_max_index;
 		memcpy(new_dtv + 1, dtv + 1, to_copy * sizeof(*dtv));
-		xfree(dtv - 1);
+		xfree(dtv - 1);		/* retreat back to DTV_MAX_INDEX */
 		dtv = tcb->tcb_dtv = new_dtv;
 		SET_DTV_MAX_INDEX(dtv, _rtld_tls_max_index);
 		SET_DTV_GENERATION(dtv, _rtld_tls_dtv_generation);
@@ -92,6 +187,18 @@ _rtld_tls_get_addr(void *tls, size_t idx
 	return (uint8_t *)dtv[idx] + offset;
 }
 
+/*
+ * _rtld_tls_initial_allocation()
+ *
+ *	Allocate the TCB (thread control block) for the initial thread,
+ *	once the static TLS space usage has been determined (plus some
+ *	slop to allow certain special cases like Mesa to be dlopened).
+ *
+ *	This must be done _after_ all initial objects (i.e., those
+ *	loaded at startup, as opposed to objects dynamically loaded by
+ *	dlopen) have had TLS offsets allocated if need be by
+ *	_rtld_tls_offset_allocate, and have had relocations processed.
+ */
 void
 _rtld_tls_initial_allocation(void)
 {
@@ -114,6 +221,20 @@ _rtld_tls_initial_allocation(void)
 #endif
 }
 
+/*
+ * _rtld_tls_allocate_locked()
+ *
+ *	Internal subroutine to allocate a TCB (thread control block)
+ *	for the current thread.
+ *
+ *	This allocates a DTV and a TCB that points to it, including
+ *	static space in the TCB for the TLS of the initial objects.
+ *	TLS blocks for dynamically loaded objects are allocated lazily.
+ *
+ *	Caller must either be single-threaded (at startup via
+ *	_rtld_tls_initial_allocation) or hold the rtld exclusive lock
+ *	(via _rtld_tls_allocate).
+ */
 static struct tls_tcb *
 _rtld_tls_allocate_locked(void)
 {
@@ -131,8 +252,12 @@ _rtld_tls_allocate_locked(void)
 	tcb->tcb_self = tcb;
 #endif
 	dbg(("lwp %d tls tcb %p", _lwp_self(), tcb));
+	/*
+	 * "2 +" because the first element is the generation and the second
+	 * one is the maximum index.
+	 */
 	tcb->tcb_dtv = xcalloc(sizeof(*tcb->tcb_dtv) * (2 + _rtld_tls_max_index));
-	++tcb->tcb_dtv;
+	++tcb->tcb_dtv;		/* advance past DTV_MAX_INDEX */
 	SET_DTV_MAX_INDEX(tcb->tcb_dtv, _rtld_tls_max_index);
 	SET_DTV_GENERATION(tcb->tcb_dtv, _rtld_tls_dtv_generation);
 
@@ -155,6 +280,14 @@ _rtld_tls_allocate_locked(void)
 	return tcb;
 }
 
+/*
+ * _rtld_tls_allocate()
+ *
+ *	Allocate a TCB (thread control block) for the current thread.
+ *
+ *	Called by pthread_create for non-initial threads.  (The initial
+ *	thread's TCB is allocated by _rtld_tls_initial_allocation.)
+ */
 struct tls_tcb *
 _rtld_tls_allocate(void)
 {
@@ -168,6 +301,14 @@ _rtld_tls_allocate(void)
 	return tcb;
 }
 
+/*
+ * _rtld_tls_free(tcb)
+ *
+ *	Free a TCB allocated with _rtld_tls_allocate.
+ *
+ *	Frees any TLS blocks for dynamically loaded objects that tcb's
+ *	DTV points to, and frees tcb's DTV, and frees tcb.
+ */
 void
 _rtld_tls_free(struct tls_tcb *tcb)
 {
@@ -190,12 +331,27 @@ _rtld_tls_free(struct tls_tcb *tcb)
 		    (uint8_t *)tcb->tcb_dtv[i] >= p_end)
 			xfree(tcb->tcb_dtv[i]);
 	}
-	xfree(tcb->tcb_dtv - 1);
+	xfree(tcb->tcb_dtv - 1);	/* retreat back to DTV_MAX_INDEX */
 	xfree(p);
 
 	_rtld_exclusive_exit(&mask);
 }
 
+/*
+ * _rtld_tls_module_allocate(tcb, idx)
+ *
+ *	Allocate thread-local storage in the thread with the given TCB
+ *	(thread control block) for the object obj whose obj->tlsindex
+ *	is idx.
+ *
+ *	If obj has had space in static TLS reserved (obj->tls_static),
+ *	return a pointer into that.  Otherwise, allocate a TLS block,
+ *	mark obj as having a TLS block allocated (obj->tls_dynamic),
+ *	and return it.
+ *
+ *	Called by _rtld_tls_get_addr to get the thread-local storage
+ *	for an object the first time around.
+ */
 static void *
 _rtld_tls_module_allocate(struct tls_tcb *tcb, size_t idx)
 {
@@ -228,6 +384,16 @@ _rtld_tls_module_allocate(struct tls_tcb
 	return p;
 }
 
+/*
+ * _rtld_tls_offset_allocate(obj)
+ *
+ *	Allocate a static thread-local storage offset for obj.
+ *
+ *	Called by _rtld at startup for all initial objects.  Called
+ *	also by MD relocation logic, which is allowed (for Mesa) to
+ *	allocate an additional 64 bytes (RTLD_STATIC_TLS_RESERVATION)
+ *	of static thread-local storage in dlopened objects.
+ */
 int
 _rtld_tls_offset_allocate(Obj_Entry *obj)
 {
@@ -284,6 +450,17 @@ _rtld_tls_offset_allocate(Obj_Entry *obj
 	return 0;
 }
 
+/*
+ * _rtld_tls_offset_free(obj)
+ *
+ *	Free a static thread-local storage offset for obj.
+ *
+ *	Called by dlclose (via _rtld_unload_object -> _rtld_obj_free).
+ *
+ *	Since static thread-local storage is normally not used by
+ *	dlopened objects (with the exception of Mesa), this doesn't do
+ *	anything to recycle the space right now.
+ */
 void
 _rtld_tls_offset_free(Obj_Entry *obj)
 {
@@ -297,10 +474,33 @@ _rtld_tls_offset_free(Obj_Entry *obj)
 
 #if defined(__HAVE_COMMON___TLS_GET_ADDR) && defined(RTLD_LOADER)
 /*
- * The fast path is access to an already allocated DTV entry.
- * This checks the current limit and the entry without needing any
- * locking. Entries are only freed on dlclose() and it is an application
- * bug if code of the module is still running at that point.
+ * __tls_get_addr(tlsindex)
+ *
+ *	Symbol directly called by code generated by the compiler for
+ *	references to thread-local storage in the general-dynamic or
+ *	local-dynamic TLS models (but not initial-exec or local-exec).
+ *
+ *	The argument is a pointer to
+ *
+ *		struct {
+ *			unsigned long int ti_module;
+ *			unsigned long int ti_offset;
+ *		};
+ *
+ *	as in, e.g., [ELFTLS] Sec. 3.4.3.  This coincides with the
+ *	type size_t[2] on all architectures that use this common
+ *	__tls_get_addr definition (XXX but why do we write it as
+ *	size_t[2]?).
+ *
+ *	ti_module, i.e., arg[0], is the obj->tlsindex assigned at
+ *	load-time by _rtld_map_object, and ti_offset, i.e., arg[1], is
+ *	assigned at link-time by ld(1), possibly adjusted by
+ *	TLS_DTV_OFFSET.
+ *
+ *	Some architectures -- specifically IA-64 -- use a different
+ *	calling convention.  Some architectures -- specifically i386
+ *	-- also use another entry point ___tls_get_addr (that's three
+ *	leading underscores) with a different calling convention.
  */
 void *
 __tls_get_addr(void *arg_)
@@ -316,6 +516,13 @@ __tls_get_addr(void *arg_)
 
 	dtv = tcb->tcb_dtv;
 
+	/*
+	 * Fast path: access to an already allocated DTV entry.  This
+	 * checks the current limit and the entry without needing any
+	 * locking.  Entries are only freed on dlclose() and it is an
+	 * application bug if code of the module is still running at
+	 * that point.
+	 */
 	if (__predict_true(idx < DTV_MAX_INDEX(dtv) && dtv[idx] != NULL))
 		return (uint8_t *)dtv[idx] + offset;
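
(Illustrative aside, not part of the commit: the "2 +" bookkeeping the
new comments describe can be condensed into a sketch.  dtv_allocate
and dtv_free are hypothetical helpers; the macros and xcalloc/xfree
are ld.elf_so's:

	static void **
	dtv_allocate(size_t max_index, size_t generation)
	{
		/* "2 +": slot -1 holds max index, slot 0 the generation. */
		void **dtv = (void **)xcalloc((2 + max_index) *
		    sizeof(*dtv));

		dtv++;			/* advance past DTV_MAX_INDEX */
		SET_DTV_MAX_INDEX(dtv, max_index);
		SET_DTV_GENERATION(dtv, generation);
		return dtv;
	}

	static void
	dtv_free(void **dtv)
	{
		xfree(dtv - 1);		/* retreat to the allocation base */
	}

Entries dtv[1] through dtv[max_index], indexed by obj->tlsindex, hold
the per-module TLS pointers.)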
 

Index: src/libexec/ld.elf_so/arch/aarch64/rtld_start.S
diff -u src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4 src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4.2.1
--- src/libexec/ld.elf_so/arch/aarch64/rtld_start.S:1.4	Fri Jan 18 11:59:03 2019
+++ src/libexec/ld.elf_so/arch/aarch64/rtld_start.S	Wed Aug  7 11:01:57 2024
@@ -1,4 +1,4 @@
-/* $NetBSD: rtld_start.S,v 1.4 2019/01/18 11:59:03 skrll Exp $ */
+/* $NetBSD: rtld_start.S,v 1.4.2.1 2024/08/07 11:01:57 martin Exp $ */
 
 /*-
  * Copyright (c) 2014 The NetBSD Foundation, Inc.
@@ -60,7 +60,7 @@
 
 #include <machine/asm.h>
 
-RCSID("$NetBSD: rtld_start.S,v 1.4 2019/01/18 11:59:03 skrll Exp $")
+RCSID("$NetBSD: rtld_start.S,v 1.4.2.1 2024/08/07 11:01:57 martin Exp $")
 
 /*
  * void _rtld_start(void (*cleanup)(void), const Obj_Entry *obj,
@@ -146,87 +146,121 @@ ENTRY_NP(_rtld_bind_start)
 END(_rtld_bind_start)
 
 /*
- * struct rel_tlsdesc {
- *  uint64_t resolver_fnc;
- *  uint64_t resolver_arg;
+ * Entry points used by _rtld_tlsdesc_fill.  Each is passed in x0 a
+ * pointer to:
  *
+ *	struct rel_tlsdesc {
+ *		uint64_t resolver_fnc;
+ *		uint64_t resolver_arg;
+ *	};
  *
- * uint64_t _rtld_tlsdesc_static(struct rel_tlsdesc *);
+ * They are called with a nonstandard calling convention and must
+ * preserve all registers except x0.
+ */
+
+/*
+ * uint64_t@x0
+ * _rtld_tlsdesc_static(struct rel_tlsdesc *rel_tlsdesc@x0);
+ *
+ *	Resolver function for TLS symbols resolved at load time.
  *
- * Resolver function for TLS symbols resolved at load time
+ *	rel_tlsdesc->resolver_arg is the offset of the static
+ *	thread-local storage region, relative to the start of the TCB.
+ *
+ *	Nonstandard calling convention: Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_static)
 	.cfi_startproc
-	ldr	x0, [x0, #8]
-	ret
+	ldr	x0, [x0, #8]		/* x0 := tcboffset */
+	ret				/* return x0 = tcboffset */
 	.cfi_endproc
 END(_rtld_tlsdesc_static)
 
 /*
- * uint64_t _rtld_tlsdesc_undef(void);
+ * uint64_t@x0
+ * _rtld_tlsdesc_undef(struct rel_tlsdesc *rel_tlsdesc@x0);
+ *
+ *	Resolver function for weak and undefined TLS symbols.
  *
- * Resolver function for weak and undefined TLS symbols
+ *	rel_tlsdesc->resolver_arg is the Elf_Rela rela->r_addend.
+ *
+ *	Nonstandard calling convention: Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_undef)
 	.cfi_startproc
-	str	x1, [sp, #-16]!
+	str	x1, [sp, #-16]!		/* save x1 on stack */
 	.cfi_adjust_cfa_offset	16
 
-	mrs	x1, tpidr_el0
-	ldr	x0, [x0, #8]
-	sub	x0, x0, x1
+	mrs	x1, tpidr_el0		/* x1 := current thread tcb */
+	ldr	x0, [x0, #8]		/* x0 := rela->r_addend */
+	sub	x0, x0, x1		/* x0 := rela->r_addend - tcb */
 
-	ldr	x1, [sp], #16
-	.cfi_adjust_cfa_offset 	-16
+	ldr	x1, [sp], #16		/* restore x1 from stack */
+	.cfi_adjust_cfa_offset	-16
 	.cfi_endproc
-	ret
+	ret				/* return x0 = rela->r_addend - tcb */
 END(_rtld_tlsdesc_undef)
 
 /*
- * uint64_t _rtld_tlsdesc_dynamic(struct rel_tlsdesc *);
+ * uint64_t@x0
+ * _rtld_tlsdesc_dynamic(struct rel_tlsdesc *tlsdesc@x0);
+ *
+ *	Resolver function for TLS symbols from dlopen().
  *
- * Resolver function for TLS symbols from dlopen()
+ *	rel_tlsdesc->resolver_arg is a pointer to a struct tls_data
+ *	object allocated during relocation.
+ *
+ *	Nonstandard calling convention: Must preserve all registers
+ *	except x0.
  */
 ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_startproc
 
 	/* Save registers used in fast path */
-	stp	x1,  x2, [sp, #(-2 * 16)]!
-	stp	x3,  x4, [sp, #(1 * 16)]
+	stp	x1, x2, [sp, #(-2 * 16)]!
+	stp	x3, x4, [sp, #(1 * 16)]
 	.cfi_adjust_cfa_offset	2 * 16
 	.cfi_rel_offset		x1, 0
 	.cfi_rel_offset		x2, 8
 	.cfi_rel_offset		x3, 16
 	.cfi_rel_offset		x4, 24
 
-	/* Test fastpath - inlined version of __tls_get_addr. */
+	/* Try for the fast path -- inlined version of __tls_get_addr. */
 
-	ldr	x1, [x0, #8]		/* tlsdesc ptr */
-	mrs	x4, tpidr_el0
-	ldr	x0, [x4]		/* DTV pointer (tcb->tcb_dtv) */
+	ldr	x1, [x0, #8]		/* x1 := tlsdesc (struct tls_data *) */
+	mrs	x4, tpidr_el0		/* x4 := tcb */
+	ldr	x0, [x4]		/* x0 := dtv = tcb->tcb_dtv */
 
-	ldr	x3, [x0, #-8]		/* DTV_MAX_INDEX(dtv) */
-	ldr	x2, [x1, #0]		/* tlsdesc->td_tlsindex */
+	ldr	x3, [x0, #-8]		/* x3 := max = DTV_MAX_INDEX(dtv) */
+	ldr	x2, [x1, #0]		/* x2 := idx = tlsdesc->td_tlsindex */
 	cmp	x2, x3
-	b.lt	1f			/* Slow path */
+	b.gt	1f			/* Slow path if idx > max */
+
+	ldr	x3, [x0, x2, lsl #3]	/* x3 := dtv[idx] */
+	cbz	x3, 1f			/* Slow path if dtv[idx] is null */
 
-	ldr     x3, [x0, x2, lsl #3]	/* dtv[tlsdesc->td_tlsindex] */
-	cbz	x3, 1f
+	/*
+	 * Fast path
+	 *
+	 * return (dtv[tlsdesc->td_tlsindex] + tlsdesc->td_tlsoffs - tcb)
+	 */
+	ldr	x2, [x1, #8]		/* x2 := offs = tlsdesc->td_tlsoffs */
+	add	x2, x2, x3		/* x2 := addr = dtv[idx] + offs */
+	sub	x0, x2, x4		/* x0 := addr - tcb */
 
-	/* Return (dtv[tlsdesc->td_tlsindex] + tlsdesc->td_tlsoffs - tp) */
-	ldr	x2, [x1, #8]		/* tlsdesc->td_tlsoffs */
-	add 	x2, x2, x3
-	sub	x0, x2, x4
-
-	/* Restore registers and return */
-	ldp	 x3,  x4, [sp, #(1 * 16)]
-	ldp	 x1,  x2, [sp], #(2 * 16)
-	.cfi_adjust_cfa_offset 	-2 * 16
-	ret
+	/* Restore fast path registers and return */
+	ldp	x3, x4, [sp, #(1 * 16)]
+	ldp	x1, x2, [sp], #(2 * 16)
+	.cfi_adjust_cfa_offset	-2 * 16
+	ret				/* return x0 = addr - tcb */
 
 	/*
 	 * Slow path
-	 * return _rtld_tls_get_addr(tp, tlsdesc->td_tlsindex, tlsdesc->td_tlsoffs);
+	 *
+	 * return _rtld_tls_get_addr(tp, tlsdesc->td_tlsindex,
+	 *     tlsdesc->td_tlsoffs);
 	 *
 	 */
 1:
@@ -236,18 +270,18 @@ ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_rel_offset		x29, 0
 	.cfi_rel_offset		x30, 8
 
-	stp	x5,   x6, [sp, #(1 * 16)]
-	stp	x7,   x8, [sp, #(2 * 16)]
-	stp	x9,  x10, [sp, #(3 * 16)]
+	stp	x5, x6, [sp, #(1 * 16)]
+	stp	x7, x8, [sp, #(2 * 16)]
+	stp	x9, x10, [sp, #(3 * 16)]
 	stp	x11, x12, [sp, #(4 * 16)]
 	stp	x13, x14, [sp, #(5 * 16)]
 	stp	x15, x16, [sp, #(6 * 16)]
 	stp	x17, x18, [sp, #(7 * 16)]
-	.cfi_rel_offset		 x5, 16
-	.cfi_rel_offset		 x6, 24
-	.cfi_rel_offset		 x7, 32
-	.cfi_rel_offset		 x8, 40
-	.cfi_rel_offset		 x9, 48
+	.cfi_rel_offset		x5, 16
+	.cfi_rel_offset		x6, 24
+	.cfi_rel_offset		x7, 32
+	.cfi_rel_offset		x8, 40
+	.cfi_rel_offset		x9, 48
 	.cfi_rel_offset		x10, 56
 	.cfi_rel_offset		x11, 64
 	.cfi_rel_offset		x12, 72
@@ -259,31 +293,32 @@ ENTRY(_rtld_tlsdesc_dynamic)
 	.cfi_rel_offset		x18, 120
 
 	/* Find the tls offset */
-	mov	x0, x4			/* tp */
-	mov	x3, x1			/* tlsdesc ptr */
-	ldr	x1, [x3, #0]		/* tlsdesc->td_tlsindex */
-	ldr	x2, [x3, #8]		/* tlsdesc->td_tlsoffs */
-	bl	_rtld_tls_get_addr
-	mrs	x1, tpidr_el0
-	sub	x0, x0, x1
+	mov	x0, x4			/* x0 := tcb */
+	mov	x3, x1			/* x3 := tlsdesc */
+	ldr	x1, [x3, #0]		/* x1 := idx = tlsdesc->td_tlsindex */
+	ldr	x2, [x3, #8]		/* x2 := offs = tlsdesc->td_tlsoffs */
+	bl	_rtld_tls_get_addr	/* x0 := addr = _rtld_tls_get_addr(tcb,
+					 *     idx, offs) */
+	mrs	x1, tpidr_el0		/* x1 := tcb */
+	sub	x0, x0, x1		/* x0 := addr - tcb */
 
 	/* Restore slow path registers */
 	ldp	x17, x18, [sp, #(7 * 16)]
 	ldp	x15, x16, [sp, #(6 * 16)]
 	ldp	x13, x14, [sp, #(5 * 16)]
 	ldp	x11, x12, [sp, #(4 * 16)]
-	ldp	x9, x10,  [sp, #(3 * 16)]
-	ldp	x7, x8,   [sp, #(2 * 16)]
-	ldp	x5, x6,   [sp, #(1 * 16)]
+	ldp	x9, x10, [sp, #(3 * 16)]
+	ldp	x7, x8, [sp, #(2 * 16)]
+	ldp	x5, x6, [sp, #(1 * 16)]
 	ldp	x29, x30, [sp], #(8 * 16)
-	.cfi_adjust_cfa_offset 	-8 * 16
+	.cfi_adjust_cfa_offset	-8 * 16
 	.cfi_restore		x29
 	.cfi_restore		x30
 
 	/* Restore fast path registers and return */
-	ldp	 x3,  x4, [sp, #16]
-	ldp	 x1,  x2, [sp], #(2 * 16)
+	ldp	x3, x4, [sp, #16]
+	ldp	x1, x2, [sp], #(2 * 16)
 	.cfi_adjust_cfa_offset	-2 * 16
 	.cfi_endproc
-	ret
+	ret				/* return x0 = addr - tcb */
 END(_rtld_tlsdesc_dynamic)
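
(Illustrative aside, not part of the commit: in C, the fixed
_rtld_tlsdesc_dynamic computes roughly the following.  get_tcb() is a
stand-in for reading tpidr_el0; the tls_data field names follow the
assembly comments:

	uint64_t
	tlsdesc_dynamic(struct rel_tlsdesc *rel)
	{
		struct tls_data *td =
		    (struct tls_data *)(uintptr_t)rel->resolver_arg;
		struct tls_tcb *tcb = get_tcb();	/* tpidr_el0 */
		void **dtv = tcb->tcb_dtv;
		uint8_t *p;

		if (td->td_tlsindex <= DTV_MAX_INDEX(dtv) &&
		    dtv[td->td_tlsindex] != NULL)	/* fast path */
			p = (uint8_t *)dtv[td->td_tlsindex] +
			    td->td_tlsoffs;
		else					/* slow path */
			p = (uint8_t *)_rtld_tls_get_addr(tcb,
			    td->td_tlsindex, td->td_tlsoffs);
		return (uint64_t)(p - (uint8_t *)tcb);
	}

The return value is the TP-relative offset the caller adds to the
thread pointer to reach the thread-local variable.)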

Index: src/tests/libexec/ld.elf_so/t_tls_extern.c
diff -u src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.2 src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.3
--- src/tests/libexec/ld.elf_so/t_tls_extern.c:1.12.4.2	Fri Aug  4 12:55:46 2023
+++ src/tests/libexec/ld.elf_so/t_tls_extern.c	Wed Aug  7 11:01:57 2024
@@ -1,4 +1,4 @@
-/*	$NetBSD: t_tls_extern.c,v 1.12.4.2 2023/08/04 12:55:46 martin Exp $	*/
+/*	$NetBSD: t_tls_extern.c,v 1.12.4.3 2024/08/07 11:01:57 martin Exp $	*/
 
 /*-
  * Copyright (c) 2023 The NetBSD Foundation, Inc.
@@ -382,6 +382,63 @@ ATF_TC_BODY(onlydef_static_dynamic_lazy,
 	    pstatic, pdynamic);
 }
 
+ATF_TC(opencloseloop_use);
+ATF_TC_HEAD(opencloseloop_use, tc)
+{
+	atf_tc_set_md_var(tc, "descr", "Testing opening and closing in a loop,"
+	    " then opening and using dynamic TLS");
+}
+ATF_TC_BODY(opencloseloop_use, tc)
+{
+	unsigned i;
+	void *def, *use;
+	int *(*fdef)(void), *(*fuse)(void);
+	int *pdef, *puse;
+
+	/*
+	 * Open and close the definition library repeatedly.  This
+	 * should trigger allocation of many DTV offsets, which are
+	 * (currently) not recycled, so the required DTV offsets should
+	 * become very long -- pages past what is actually allocated
+	 * before we attempt to use it.
+	 *
+	 * This way, we will exercise the wrong-way-conditional fast
+	 * path of PR lib/58154.
+	 */
+	for (i = sysconf(_SC_PAGESIZE); i --> 0;) {
+		ATF_REQUIRE_DL(def = dlopen("libh_def_dynamic.so", 0));
+		ATF_REQUIRE_EQ_MSG(dlclose(def), 0,
+		    "dlclose(def): %s", dlerror());
+	}
+
+	/*
+	 * Now open the definition library and keep it open.
+	 */
+	ATF_REQUIRE_DL(def = dlopen("libh_def_dynamic.so", 0));
+	ATF_REQUIRE_DL(fdef = dlsym(def, "fdef"));
+
+	/*
+	 * Open libraries that use the definition and verify they
+	 * observe the same pointer.
+	 */
+	ATF_REQUIRE_DL(use = dlopen("libh_use_dynamic.so", 0));
+	ATF_REQUIRE_DL(fuse = dlsym(use, "fuse"));
+	pdef = (*fdef)();
+	puse = (*fuse)();
+	ATF_CHECK_EQ_MSG(pdef, puse,
+	    "%p in defining library != %p in using library",
+	    pdef, puse);
+
+	/*
+	 * Also verify the pointer can be used.
+	 */
+	*pdef = 123;
+	*puse = 456;
+	ATF_CHECK_EQ_MSG(*pdef, *puse,
+	    "%d in defining library != %d in using library",
+	    *pdef, *puse);
+}
+
 ATF_TP_ADD_TCS(tp)
 {
 
@@ -398,6 +455,7 @@ ATF_TP_ADD_TCS(tp)
 	ATF_TP_ADD_TC(tp, onlydef_dynamic_static_lazy);
 	ATF_TP_ADD_TC(tp, onlydef_static_dynamic_eager);
 	ATF_TP_ADD_TC(tp, onlydef_static_dynamic_lazy);
+	ATF_TP_ADD_TC(tp, opencloseloop_use);
 	ATF_TP_ADD_TC(tp, static_abusedef);
 	ATF_TP_ADD_TC(tp, static_abusedefnoload);
 	ATF_TP_ADD_TC(tp, static_defabuse_eager);
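
(Illustrative aside, not part of the commit: on a NetBSD system with
the test suite installed, the new test case can be exercised with,
e.g.:

	cd /usr/tests/libexec/ld.elf_so
	atf-run t_tls_extern | atf-report

On an unfixed aarch64 system, opencloseloop_use is expected to fail or
crash when the loop's subsequent TLS use hits the wrong-way fast path;
with this pullup it should pass.)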
