Hi Mark,

Mark H Weaver <m...@netris.org> skribis:
> l...@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0  0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR:
>> Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
>> #1  scm_ilength (sx=<optimized out>) at list.c:190
>
> [...]
>
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread.  In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absence of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets, without
any inter-thread communication whatsoever.

What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

I thought about it because that’s one way part of the VM stack could be
zeroed out.

>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack.
So I did this:

>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>>     code, or otherwise push anything on the stack, you will need to
>>     CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>>
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP() \
>> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.

I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where GCC stores
‘sp’.  Here’s an example:

--8<---------------cut here---------------start------------->8---
(gdb) frame
#16 0x00007fabf30af2ca in vm_regular_engine (thread=0x24e6000,
    vp=0x22de6c0, registers=0x0, resume=40) at vm-engine.c:785
785           ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax            0x1                 1
rbx            0xa                 10
rcx            0x28                40
rdx            0x0                 0
rsi            0x23f1920           37689632
rdi            0x24e6000           38690816
rbp            0x22de6c0           0x22de6c0
rsp            0x7fabcce18660      0x7fabcce18660
r8             0x1                 1
r9             0x1                 1
r10            0x100               256
r11            0x23f1920           37689632
r12            0x7fabf330b8c0      140376496191680
r13            0x7fabec158e30      140376376970800
r14            0x7fabf30c6d7c      140376493813116
r15            0x7fabf0fa7f28      140376459083560
rip            0x7fabf30af2ca      0x7fabf30af2ca <vm_regular_engine+14058>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
--8<---------------cut here---------------end--------------->8---

My hypothesis was that such a bug could lead live heap objects to be
reclaimed too early: the slots between the stale ‘vp->sp’ and the
engine’s local ‘sp’ are still live, yet a marker relying on ‘vp->sp’
would not trace them.  This is more likely to bite in a multi-threaded
context, because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync
‘vp->sp’.

Does that make sense?

> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?

We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often, as you might have noticed ;-)) I think it would be
nice to fix the issue.

Thanks,
Ludo’.