This is a fascinating approach to finding hard-to-reproduce
bugs caused by event interleaving.

I'm particularly interested in this approach
because rr record and replay plus chaos
mode is directly applicable to
Go programs, whereas deterministic simulation
testing (DST) is next to impossible for a Go program
that uses more than 4GB of memory (like most of my
programs), because that rules out wasm, which is
limited to a 4GB address space.

In contrast to DST, the rr+chaos approach
accepts that you are randomly
sampling executions, but because every run is recorded,
you can still reproduce the issue exactly once you do hit it.

rr is very efficient at recording. Green test runs can be quickly
discarded.

In a blog post from 2016, Robert O'Callahan, one of the principal rr authors,
talks about the design of rr's chaos mode for provoking hard-to-find
concurrency bugs:

> To cut a long story short, here's an approach that works. 
> Use just two thread priorities, "high" and "low". Make 
> most threads high-priority; I give each thread a 0.1 
> probability of being low priority. Periodically re-randomize 
> thread priorities. Randomize timeslice lengths. 
>
> Here's the good part: periodically choose a short random interval, 
> up to a few seconds long, and during that interval do not 
> allow low-priority threads to run at all, even if they're 
> the only runnable threads. Since these intervals can 
> prevent all forward progress (no control of priority inversion),
>  limit their length to no more than 20% of total run time. 
>
> The intuition is that many of our intermittent test failures 
> depend on CPU starvation (e.g. a machine temporarily 
> hanging), so we're emulating intense starvation of a few 
> "victim" threads, and allowing high-priority threads to 
> wait for timeouts or input from the environment 
> without interruption.
>
> With this approach, rr can reproduce my bug
> <https://bugzilla.mozilla.org/show_bug.cgi?id=1213938> in
> several runs out of a thousand. I've also been able
> to reproduce a top intermittent
> <https://bugzilla.mozilla.org/show_bug.cgi?id=1237176> (now being fixed),
> an intermittent test failure
> <https://bugzilla.mozilla.org/show_bug.cgi?id=1203417> that was assigned to me,
> and an intermittent shutdown hang in IndexedDB
> <https://bugzilla.mozilla.org/show_bug.cgi?id=1150737#c197>
> we've been chasing for a while. A couple of other 
> people have found this enabled reproducing their 
> bugs. I'm sure there are still bugs this approach 
> can't reproduce, but it's good progress.
> 
> I just landed all this work on rr master. The 
> normal scheduler doesn't do this randomization, 
> because it reduces throughput, i.e. slows down
> recording for easy-to-reproduce bugs. 
> Run rr record -h to enable chaos mode for 
> hard-to-reproduce bugs.
 -- https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

Links to more info and background on rr:

https://rr-project.org/
https://github.com/rr-debugger/rr
https://github.com/rr-debugger/rr/wiki/Usage
https://github.com/rr-debugger/rr/wiki/Testimonials
https://github.com/rr-debugger/rr/wiki/Building-And-Installing

https://arxiv.org/pdf/1705.05937

https://fitzgen.com/2015/11/02/back-to-the-futurre.html

https://www.percona.com/blog/replay-the-execution-of-mysql-with-rr-record-and-replay/
https://www.youtube.com/watch?v=61kD3x4Pu8I

Robert's talk, "Taming Non-determinism" from 9 years ago is
a good technical introduction to rr.
https://www.youtube.com/watch?v=H4iNuufAe_8

NB: The Delve debugger for Go supports rr as a backend, so you can get
goroutine stack traces while replaying a recording.

Enjoy.

- Jason
