Hi Mark,

Mark H Weaver <m...@netris.org> skribis:
> l...@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0  0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR:
>> Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
>> #1  scm_ilength (sx=<optimized out>) at list.c:190
>
> [...]
>
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread.  In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absence of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets, without
any inter-thread communication whatsoever.

What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

I thought about it because that’s one way part of the VM stack could be
zeroed out.

>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack.
So I did this:

>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>>     code, or otherwise push anything on the stack, you will need to
>>     CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>>
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP() \
>> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.

I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where GCC stores
‘sp’.  Here’s an example:

--8<---------------cut here---------------start------------->8---
(gdb) frame
#16 0x00007fabf30af2ca in vm_regular_engine (thread=0x24e6000,
    vp=0x22de6c0, registers=0x0, resume=40) at vm-engine.c:785
785           ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax            0x1                 1
rbx            0xa                 10
rcx            0x28                40
rdx            0x0                 0
rsi            0x23f1920           37689632
rdi            0x24e6000           38690816
rbp            0x22de6c0           0x22de6c0
rsp            0x7fabcce18660      0x7fabcce18660
r8             0x1                 1
r9             0x1                 1
r10            0x100               256
r11            0x23f1920           37689632
r12            0x7fabf330b8c0      140376496191680
r13            0x7fabec158e30      140376376970800
r14            0x7fabf30c6d7c      140376493813116
r15            0x7fabf0fa7f28      140376459083560
rip            0x7fabf30af2ca      0x7fabf30af2ca <vm_regular_engine+14058>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
--8<---------------cut here---------------end--------------->8---

My hypothesis was that such a bug could lead live heap objects to be
reclaimed too early: the slots between the stale ‘vp->sp’ and the
engine’s local ‘sp’ are still live, yet a marker relying on ‘vp->sp’
would not trace them.  This is more likely to bite in a multi-threaded
context, because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync
‘vp->sp’.

Does that make sense?

> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?

We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often, as you might have noticed ;-)) I think it would be
nice to fix the issue.

Thanks,
Ludo’.