And here's a proof-of-concept fix which seems to do the job: --- main.go.orig 2022-09-21 13:14:10.000000000 +0100 +++ main.go 2022-09-22 23:19:54.000000000 +0100 @@ -27,6 +27,7 @@ // not written after the WaitGroup is done. dups int chans []chan<- Result + forgotten bool }
// Group represents a class of work and forms a namespace in @@ -101,7 +102,9 @@ c.wg.Done() g.mu.Lock() - delete(g.m, key) + if !c.forgotten { + delete(g.m, key) + } for _, ch := range c.chans { ch <- Result{c.val, c.err, c.dups > 0} } @@ -121,6 +124,7 @@ return true } if c.dups == 0 { + c.forgotten = true delete(g.m, key) return true } On Thursday, 22 September 2022 at 23:16:22 UTC+1 Brian Candler wrote: > OK, I think I have it. It's ugly. > > Firstly, note that multiple instances of doCall can be running for the > same key. This happens when: > > 1. you invoke DoChan. This inserts a 'c' (call struct) into the map and > starts doCall in a goroutine. > 2. at this point it's not shared: i.e. you don't call DoChan again with > the same key (yet). > 3. you invoke ForgetUnshared on this key. This "detaches" it, but doCall > carries on running. It has its own local copy of 'c' so it knows where to > send the result, even though the map is now empty. > 4. you invoke DoChan again with the same key. This inserts a new 'c' into > the map and starts a new doCall goroutine. > > At this point, you have two instances of doCall running, and the map is > pointing at the second one. > > This is where it gets ugly. > > 5. you invoke DoChan yet again with the same key. This turns it into a > shared task, with c.dups > 0, len(c.chans) > 1. > 6. the first instance of doCall terminates. At this point it > unconditionally removes the key from the map - even though it had > previously been removed by ForgetUnshared! > > func (g *Group) doCall(c *call, key string, fn func() (interface{}, > error)) { > c.val, c.err = fn() > c.wg.Done() > > g.mu.Lock() > * delete(g.m, key) // <<<< NOTE* > for _, ch := range c.chans { > ch <- Result{c.val, c.err, c.dups > 0} > } > g.mu.Unlock() > } > > So, even though it's the first instance of doCall which is terminating, > it's removing the second instance of doCall from the map. This is now also > a detached task. > > 7. In one of the two goroutines, the timeout event occurs. It calls > ForgetUnshared, which happily returns true because the key does not exist > in the map - and therefore you proceed to cancel the context. > > But actually a task with this key *is* running; and furthermore, it is a > shared task, with 2 channel receivers. > > 8. Once the sleep has completed in the task function, it notices that the > context is cancelled and returns an error. > > 9. doCall sends the resulting error down multiple channels (those you > started in steps 4 and 5 above) > > 10. The select { case res := <-ch } triggers in the *other* goroutine - > the one which didn't have a timeout. Hence it receives the error, and > that's where you panic(). > > On Thursday, 22 September 2022 at 20:37:07 UTC+1 Brian Candler wrote: > >> OK, I see where you're coming from - and I agree, this is a difficult one! >> >> The point you were making is that >> >> if g.ForgetUnshared(key) { >> cancel() >> } >> >> should only invoke cancel() if this result wasn't shared: i.e. there's >> only one receiver in the c.chans array, and c.dups == 0. So where's the >> race, given that everything in g is done under a mutex? >> >> What I have discovered so far is: when g.ForgetUnshared(key) returns true >> and the problem occurs, the key is not present in the map (as opposed to >> being present with c.dups == 0). But I've not been able to work out why >> yet. >> >> Incidentally, a minor style observation: you passed in ctx to your go >> func(...), but not cancel. As far as I can see, both ctx and cancel are >> local variables which drop immediately out of scope - there's no way they >> can be modified later outside of the goroutine. So I believe you don't >> need to pass ctx at all: you can access it via the closure. But if you do >> pass one "to be on the safe side", then I think the other should be passed >> as well - otherwise it's confusing why you passed in only one. >> >> In fact, in this case, you could move the ctx/cancel creation inside the >> go func(...) anyway. The only thing which needs to be outside is >> the wg.Add(1). >> >> On Thursday, 22 September 2022 at 03:12:47 UTC+1 atomic wrote: >> >>> > Also notice that the random time you pick for cancelTime can be longer >>> than the different random time you sleep inside the goroutine (i.e. the >>> function which you pass to DoChan). Hence the goroutine can return a >>> result, before the cancelTime is reached. >>> >>> Although the goroutine can return a result before cancelTime arrives, >>> the returned result should not be err because I haven't had time to call >>> cancel(). >>> 在2022年9月21日星期三 UTC+8 20:18:30<Brian Candler> 写道: >>> >>>> Notice that DoChan starts a goroutine for the task... >>>> >>>> go g.doCall(c, key, fn) >>>> >>>> ... and then returns immediately. >>>> >>>> Also notice that the random time you pick for cancelTime can be longer >>>> than the different random time you sleep inside the goroutine (i.e. the >>>> function which you pass to DoChan). Hence the goroutine can return a >>>> result, before the cancelTime is reached. >>>> >>>> Try this modification: >>>> >>>> --- main.go.orig 2022-09-21 13:14:10.000000000 +0100 >>>> +++ main.go 2022-09-21 13:13:43.000000000 +0100 >>>> @@ -144,7 +144,7 @@ >>>> defer wg.Done() >>>> >>>> ch, _ := g.DoChan(key, func() (interface{}, error) { >>>> - time.Sleep(randTimeout()) >>>> + time.Sleep(5000 * time.Millisecond) >>>> if ctx.Err() == context.Canceled { >>>> return nil, fmt.Errorf("callUUID=[%d] err=[%s]", >>>> uuid, ctx.Err()) >>>> } >>>> @@ -152,7 +152,7 @@ >>>> }) >>>> >>>> // randomly choose a timeout to cancel >>>> - cancelTime := time.After(randTimeout()) >>>> + cancelTime := time.After(10 * time.Millisecond) >>>> select { >>>> case <-cancelTime: >>>> // cancel only if no other goroutines share >>>> >>>> On Wednesday, 21 September 2022 at 10:01:22 UTC+1 atomic wrote: >>>> >>>>> Thanks for your reply, but I still don't understand why time.Sleep is >>>>> causing my test program to panic. >>>>> >>>>> In fact, this is a real online environment problem. My application >>>>> uses http.Client.Do(), but it occasionally has errors: [lookup xxxxx >>>>> on xxxxx: dial udp xxxxx: operation was canceled], after looking at the >>>>> code, I found that it may be There is a problem with ForgetUnshared, >>>>> lookupIPAddr uses ForgetUnshared: >>>>> https://github.com/golang/go/blob/4a4127bccc826ebb6079af3252bc6bfeaec187c4/src/net/lookup.go#L336 >>>>> >>>>> 在2022年9月21日星期三 UTC+8 16:17:35<cuong.m...@gmail.com> 写道: >>>>> >>>>>> Hello, >>>>>> >>>>>> You use time.Sleep in your program, so the behavior is not >>>>>> predictable. In fact, I get it success or panic randomly. >>>>>> >>>>>> You can see https://go-review.googlesource.com/c/sync/+/424114 to >>>>>> see a predictable test of ForgetUnshared . >>>>>> >>>>>> On Wednesday, September 21, 2022 at 1:45:24 PM UTC+7 atomic wrote: >>>>>> >>>>>>> hello >>>>>>> >>>>>>> I find that the `src/internal/singleflight/singleflight.go >>>>>>> ForgetUnshared()` method returns results that are not always expected >>>>>>> >>>>>>> For this I wrote a test code, I copied the code in the >>>>>>> src/internal/singleflight/singleflight.go file to the main package, and >>>>>>> wrote a main function to test it, if ForgetUnshared() returns >>>>>>> correctly, >>>>>>> this code It should not panic, but the fact that it will panic every >>>>>>> time >>>>>>> it runs, is there something wrong with my understanding of >>>>>>> ForgetUnshared()? >>>>>>> >>>>>>> The test code cannot be run in goplay, so I posted a link: >>>>>>> https://gist.github.com/dchaofei/e07547bce17d94c3e05b1b2a7230f62f >>>>>>> >>>>>>> The go version I use for testing is 1.16, 1.19.1 >>>>>>> >>>>>>> result: >>>>>>> ``` >>>>>>> $ go run cmd/main.go >>>>>>> panic: callUUID=[9314284969 <(931)%20428-4969>] err=[context >>>>>>> canceled] currentUUId=[6980556786] >>>>>>> ``` >>>>>>> >>>>>> -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/5fb5a486-a46b-439b-b29f-ba40ebc59ad8n%40googlegroups.com.