Hi Aurelien, thanks for your analysis.

On 2015-12-08 10:23, Aurelien Jarno wrote:
> I disagree it is supposed to be fixed. Intel got a few bugs in there
> TSX-NI implementation for Haswell and Broadwell and possibly early
> versions of Skylake, and to avoid data loss we have therefore disabled
> lock elision for some CPU revisions.

That's what I meant with "fixed". But obviously there are two problems
here: buggy hardware (blacklisted, #800574) and ...

> That said the bugs in the Intel
> implementation are corner cases, and it took quite some time for them to
> get discovered. If your program crashes reproducibly, it's definitely not
> an issue with the TSX-NI implementation. Disabling --enable-lock-elision
> it's just a workaround for the real issue. People now start to have CPUs
> with a working TSX-NI implementation which is therefore not blacklisted
> and thus the problem is appearing again.

... buggy software (#807244), which is only exposed by running on
hardware with working TSX-NI. That could also explain the fact that the
bug was introduced in 352+.

Jelle, I didn't dig through the nvidia forums, but if this info isn't
mentioned there already, maybe you could post it:

> According to the backtrace the problem is typical of a call to
> mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
> before.

(or was already unlocked.)

Andreas
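
For anyone who wants to see the "unlock without a prior lock" pattern from the quoted backtrace analysis in isolation, here is a minimal sketch. It is only an illustration, not code from the affected driver: the function name unlock_without_lock() is made up, and it deliberately uses a PTHREAD_MUTEX_ERRORCHECK mutex, because for that type POSIX requires pthread_mutex_unlock() to return an error instead of leaving the behaviour undefined.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: unlocks a mutex that was never locked and returns
 * the result of pthread_mutex_unlock(), or -1 if the setup itself failed.
 */
int unlock_without_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;

    /* Error-checking type: misuse is reported instead of being undefined. */
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }

    if (pthread_mutex_init(&mutex, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    /* The bug pattern: no matching pthread_mutex_lock() before this call. */
    rc = pthread_mutex_unlock(&mutex);

    pthread_mutex_destroy(&mutex);
    return rc;
}
```

With an error-checking mutex the call is expected to return EPERM. With a default mutex the same call is undefined behaviour per POSIX, which would fit a bug that stays hidden on some systems and only crashes once lock elision is actually in use.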