On Wed 09 May 2018 02:32, Mark H Weaver <m...@netris.org> writes: > However, I think it's _far_ more likely that the NULL argument on the > stack was copied from memory shared by multiple threads without proper > thread synchronization.
I think this is unlikely on x86 given its total-store-ordering memory model. I agree with you about the value of barriers, but I don't think they are part of this bug that Ludo is seeing. >> I commented out the MADV_DONTNEED call to be sure, but I can still >> reproduce the bug. > > I strongly doubt that the MADV_DONTNEED is relevant to this issue. It could be. It would zero out VM stack frames, and if GC is called when/if vp->sp is out of date, then that would be possible. However I think vp->sp is never out of date, so that's probably not it. The things that can be out of date are the on-heap copy of the IP (vp->ip) and the local register copy of the sp (sp). It's more likely for the local "sp" cache to be out of date -- if we recursed through Scheme in a call out from the VM, eventualy causing stack expansion and relocation, then on the return forgot to re-cache the sp value, that could be it. Similarly, forgetting to set vp->ip before calling out to something that could GC could likewise cause a problem because the stack map wouldn't be right and the precise stack marker could clear a value by mistake. This only happens for non-innermost frames though; the innermost frame is marked conservatively. The rules are: update vp->ip before something that can allocate, and update local "sp" after returning from a C function that could have recursively called Scheme. I did find a couple places in the VM where we forgot to do one of these, e.g. 07b7490f73fb4a6cb0c9577d082d37c8d9cee7b0 and just now 9a72e212622fa3bd118d7c02c4386601285b3224. These two patches aren't shipped yet fwiw. Andy