[go-nuts] Re: rr's chaos mode for hard to reproduce bugs

Jason E. Aten Fri, 06 Jun 2025 00:33:00 -0700

Thanks for your note of appreciation, cpasmaboiteaspam.

On Friday, June 6, 2025 at 7:05:35 AM UTC+1 cpasmaboiteaspam wrote:


> Hello Jason,
>
> Intuitively, playing with the timing of goroutines,
> *manually injecting delays where none exist,*
> is what newcomers do,
> to experiment with deadlocks,
> or simply to build a better understanding;
> at least *I* did that very often.
> Therefore your suggestion definitely makes sense
> to me, from the perspective of a *non*-low-level programmer.
>
> (*Not a useful email*; I only want to show some support
> for your various posts, which I read with a lot of interest.
> Others' too... but, physically speaking,
> it is hard to follow everybody.
> I liked the last post from Robert Griesemer.
> I like all the posts on the Go blog anyway...
> and the coroutines... *it never ends*... =) )
>
> Thank you,
> Thank you all.
>
> On Thursday, June 5, 2025 at 22:28:07 UTC+2, Jason E. Aten wrote:
>
>> Hmm. Maybe rr found a bug in the runtime GC code(?)... or
>> maybe it will be hard to use rr on Go. Let's see what the runtime folks
>> say on this issue:
>>
>> https://github.com/golang/go/issues/74019
>>
>> On Wednesday, June 4, 2025 at 12:18:45 PM UTC+1 Jason E. Aten wrote:
>>
>>> This is a fascinating approach to finding hard-to-reproduce
>>> bugs that depend on event interleaving.
>>>
>>> I'm particularly interested in this approach
>>> because rr record and replay plus chaos
>>> mode is directly applicable to
>>> Go programs, whereas deterministic simulation
>>> testing (DST) is next to impossible in a Go program
>>> using more than 4GB of memory (like most of my
>>> programs), because that much memory rules out wasm
>>> (wasm32 can address at most 4GB).
>>>
>>> In contrast to DST, the rr+chaos approach
>>> accepts that you are randomly
>>> sampling executions, but by recording all of them you
>>> still get reproducibility when you do hit the issue.
>>>
>>> rr is very efficient at recording. Green test runs can be quickly
>>> discarded.
>>>
>>> In a 2016 blog post, Robert O'Callahan, one of the principal rr authors,
>>> talks about the design of rr's chaos mode for provoking hard-to-find
>>> concurrency bugs:
>>>
>>> > To cut a long story short, here's an approach that works. 
>>> > Use just two thread priorities, "high" and "low". Make 
>>> > most threads high-priority; I give each thread a 0.1 
>>> > probability of being low priority. Periodically re-randomize 
>>> > thread priorities. Randomize timeslice lengths. 
>>> >
>>> > Here's the good part: periodically choose a short random interval, 
>>> > up to a few seconds long, and during that interval do not 
>>> > allow low-priority threads to run at all, even if they're 
>>> > the only runnable threads. Since these intervals can 
>>> > prevent all forward progress (no control of priority inversion),
>>> >  limit their length to no more than 20% of total run time. 
>>> >
>>> > The intuition is that many of our intermittent test failures 
>>> > depend on CPU starvation (e.g. a machine temporarily 
>>> > hanging), so we're emulating intense starvation of a few 
>>> > "victim" threads, and allowing high-priority threads to 
>>> > wait for timeouts or input from the environment 
>>> > without interruption.
>>> >
>>> > With this approach, rr can reproduce my bug
>>> > <https://bugzilla.mozilla.org/show_bug.cgi?id=1213938> in
>>> > several runs out of a thousand. I've also been able
>>> > to reproduce a top intermittent
>>> > <https://bugzilla.mozilla.org/show_bug.cgi?id=1237176> (now being
>>> > fixed), an intermittent test failure
>>> > <https://bugzilla.mozilla.org/show_bug.cgi?id=1203417> that was
>>> > assigned to me, and an intermittent shutdown hang in IndexedDB
>>> > <https://bugzilla.mozilla.org/show_bug.cgi?id=1150737#c197>
>>> > we've been chasing for a while. A couple of other 
>>> > people have found this enabled reproducing their 
>>> > bugs. I'm sure there are still bugs this approach 
>>> > can't reproduce, but it's good progress.
>>> > 
>>> > I just landed all this work on rr master. The 
>>> > normal scheduler doesn't do this randomization, 
>>> > because it reduces throughput, i.e. slows down
>>> > recording for easy-to-reproduce bugs. 
>>> > Run rr record -h to enable chaos mode for 
>>> > hard-to-reproduce bugs.
>>>  -- https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>>>
>>> Links to more info and background on rr:
>>>
>>> https://rr-project.org/
>>> https://github.com/rr-debugger/rr
>>> https://github.com/rr-debugger/rr/wiki/Usage
>>> https://github.com/rr-debugger/rr/wiki/Testimonials
>>> https://github.com/rr-debugger/rr/wiki/Building-And-Installing
>>>
>>> https://arxiv.org/pdf/1705.05937
>>>
>>> https://fitzgen.com/2015/11/02/back-to-the-futurre.html
>>>
>>>
>>> https://www.percona.com/blog/replay-the-execution-of-mysql-with-rr-record-and-replay/
>>> https://www.youtube.com/watch?v=61kD3x4Pu8I
>>>
>>> Robert's talk, "Taming Non-determinism" from 9 years ago is
>>> a good technical introduction to rr.
>>> https://www.youtube.com/watch?v=H4iNuufAe_8
>>>
>>> NB: The Delve debugger for Go supports rr, so you can get goroutine
>>> stack traces.
>>>
>>> Enjoy.
>>>
>>> - Jason
>>>
>>

