And here's a proof-of-concept fix which seems to do the job:

--- main.go.orig    2022-09-21 13:14:10.000000000 +0100
+++ main.go    2022-09-22 23:19:54.000000000 +0100
@@ -27,6 +27,7 @@
     // not written after the WaitGroup is done.
     dups  int
     chans []chan<- Result
+    forgotten bool
 }

 // Group represents a class of work and forms a namespace in
@@ -101,7 +102,9 @@
     c.wg.Done()

     g.mu.Lock()
-    delete(g.m, key)
+    if !c.forgotten {
+        delete(g.m, key)
+    }
     for _, ch := range c.chans {
         ch <- Result{c.val, c.err, c.dups > 0}
     }
@@ -121,6 +124,7 @@
         return true
     }
     if c.dups == 0 {
+        c.forgotten = true
         delete(g.m, key)
         return true
     }

On Thursday, 22 September 2022 at 23:16:22 UTC+1 Brian Candler wrote:

> OK, I think I have it.  It's ugly.
>
> Firstly, note that multiple instances of doCall can be running for the 
> same key.  This happens when:
>
> 1. you invoke DoChan.  This inserts a 'c' (call struct) into the map and 
> starts doCall in a goroutine.
> 2. at this point it's not shared: i.e. you don't call DoChan again with 
> the same key (yet).
> 3. you invoke ForgetUnshared on this key. This "detaches" it, but doCall 
> carries on running. It has its own local copy of 'c' so it knows where to 
> send the result, even though the map is now empty.
> 4. you invoke DoChan again with the same key.  This inserts a new 'c' into 
> the map and starts a new doCall goroutine.
>
> At this point, you have two instances of doCall running, and the map is 
> pointing at the second one.
>
> This is where it gets ugly.
>
> 5. you invoke DoChan yet again with the same key. This turns it into a 
> shared task, with c.dups > 0, len(c.chans) > 1.
> 6. the first instance of doCall terminates.  At this point it 
> unconditionally removes the key from the map - even though it had 
> previously been removed by ForgetUnshared!
>
> func (g *Group) doCall(c *call, key string, fn func() (interface{}, 
> error)) {
>         c.val, c.err = fn()
>         c.wg.Done()
>
>         g.mu.Lock()
> *        delete(g.m, key)    // <<<< NOTE*
>         for _, ch := range c.chans {
>                 ch <- Result{c.val, c.err, c.dups > 0}
>         }
>         g.mu.Unlock()
> }
>
> So, even though it's the first instance of doCall which is terminating, 
> it's removing the second instance of doCall from the map.  This is now also 
> a detached task.
>
> 7. In one of the two goroutines, the timeout event occurs.  It calls 
> ForgetUnshared, which happily returns true because the key does not exist 
> in the map - and therefore you proceed to cancel the context.
>
> But actually a task with this key *is* running; and furthermore, it is a 
> shared task, with 2 channel receivers.
>
> 8. Once the sleep has completed in the task function, it notices that the 
> context is cancelled and returns an error.
>
> 9. doCall sends the resulting error down multiple channels (those you 
> started in steps 4 and 5 above)
>
> 10. The select { case res := <-ch } triggers in the *other* goroutine - 
> the one which didn't have a timeout. Hence it receives the error, and 
> that's where you panic().
>
> On Thursday, 22 September 2022 at 20:37:07 UTC+1 Brian Candler wrote:
>
>> OK, I see where you're coming from - and I agree, this is a difficult one!
>>
>> The point you were making is that
>>
>>                                 if g.ForgetUnshared(key) {
>>                                         cancel()
>>                                 }
>>
>> should only invoke cancel() if this result wasn't shared: i.e. there's 
>> only one receiver in the c.chans array, and c.dups == 0.  So where's the 
>> race, given that everything in g is done under a mutex?
>>
>> What I have discovered so far is: when g.ForgetUnshared(key) returns true 
>> and the problem occurs, the key is not present in the map (as opposed to 
>> being present with c.dups == 0).  But I've not been able to work out why 
>> yet.
>>
>> Incidentally, a minor style observation: you passed in ctx to your go 
>> func(...), but not cancel. As far as I can see, both ctx and cancel are 
>> local variables which drop immediately out of scope - there's no way they 
>> can be modified later outside of the goroutine.  So I believe you don't 
>> need to pass ctx at all: you can access it via the closure.  But if you do 
>> pass one "to be on the safe side", then I think the other should be passed 
>> as well - otherwise it's confusing why you passed in only one.
>>
>> In fact, in this case, you could move the ctx/cancel creation inside the 
>> go func(...) anyway.  The only thing which needs to be outside is 
>> the wg.Add(1).
>>
>> On Thursday, 22 September 2022 at 03:12:47 UTC+1 atomic wrote:
>>
>>> > Also notice that the random time you pick for cancelTime can be longer 
>>> than the different random time you sleep inside the goroutine (i.e. the 
>>> function which you pass to DoChan).  Hence the goroutine can return a 
>>> result, before the cancelTime is reached.
>>>
>>> Although the goroutine can return a result before cancelTime arrives, 
>>> the returned result should not be err because I haven't had time to call 
>>> cancel(). 
>>> 在2022年9月21日星期三 UTC+8 20:18:30<Brian Candler> 写道:
>>>
>>>> Notice that DoChan starts a goroutine for the task...
>>>>
>>>>         go g.doCall(c, key, fn)
>>>>
>>>> ... and then returns immediately.
>>>>
>>>> Also notice that the random time you pick for cancelTime can be longer 
>>>> than the different random time you sleep inside the goroutine (i.e. the 
>>>> function which you pass to DoChan).  Hence the goroutine can return a 
>>>> result, before the cancelTime is reached.
>>>>
>>>> Try this modification:
>>>>
>>>> --- main.go.orig    2022-09-21 13:14:10.000000000 +0100
>>>> +++ main.go    2022-09-21 13:13:43.000000000 +0100
>>>> @@ -144,7 +144,7 @@
>>>>              defer wg.Done()
>>>>
>>>>              ch, _ := g.DoChan(key, func() (interface{}, error) {
>>>> -                time.Sleep(randTimeout())
>>>> +                time.Sleep(5000 * time.Millisecond)
>>>>                  if ctx.Err() == context.Canceled {
>>>>                      return nil, fmt.Errorf("callUUID=[%d] err=[%s]", 
>>>> uuid, ctx.Err())
>>>>                  }
>>>> @@ -152,7 +152,7 @@
>>>>              })
>>>>
>>>>              // randomly choose a timeout to cancel
>>>> -            cancelTime := time.After(randTimeout())
>>>> +            cancelTime := time.After(10 * time.Millisecond)
>>>>              select {
>>>>              case <-cancelTime:
>>>>                  // cancel only if no other goroutines share
>>>>
>>>> On Wednesday, 21 September 2022 at 10:01:22 UTC+1 atomic wrote:
>>>>
>>>>> Thanks for your reply, but I still don't understand why time.Sleep is 
>>>>> causing my test program to panic.
>>>>>
>>>>> In fact, this is a real online environment problem. My application 
>>>>> uses http.Client.Do(), but it occasionally has errors: [lookup xxxxx 
>>>>> on xxxxx: dial udp xxxxx: operation was canceled], after looking at the 
>>>>> code, I found that it may be There is a problem with ForgetUnshared, 
>>>>> lookupIPAddr uses ForgetUnshared: 
>>>>> https://github.com/golang/go/blob/4a4127bccc826ebb6079af3252bc6bfeaec187c4/src/net/lookup.go#L336
>>>>>
>>>>> 在2022年9月21日星期三 UTC+8 16:17:35<cuong.m...@gmail.com> 写道:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> You use time.Sleep in your program, so the behavior is not 
>>>>>> predictable. In fact, I get it success or panic randomly.
>>>>>>
>>>>>> You can see https://go-review.googlesource.com/c/sync/+/424114 to 
>>>>>> see a predictable test of ForgetUnshared .
>>>>>>
>>>>>> On Wednesday, September 21, 2022 at 1:45:24 PM UTC+7 atomic wrote:
>>>>>>
>>>>>>> hello
>>>>>>>
>>>>>>> I find that the `src/internal/singleflight/singleflight.go 
>>>>>>> ForgetUnshared()` method returns results that are not always expected
>>>>>>>
>>>>>>> For this I wrote a test code, I copied the code in the 
>>>>>>> src/internal/singleflight/singleflight.go file to the main package, and 
>>>>>>> wrote a main function to test it, if ForgetUnshared() returns 
>>>>>>> correctly, 
>>>>>>> this code It should not panic, but the fact that it will panic every 
>>>>>>> time 
>>>>>>> it runs, is there something wrong with my understanding of 
>>>>>>> ForgetUnshared()?
>>>>>>>
>>>>>>> The test code cannot be run in goplay, so I posted a link: 
>>>>>>> https://gist.github.com/dchaofei/e07547bce17d94c3e05b1b2a7230f62f
>>>>>>>
>>>>>>> The go version I use for testing is 1.16, 1.19.1
>>>>>>>
>>>>>>> result:
>>>>>>> ```
>>>>>>> $ go run cmd/main.go
>>>>>>> panic: callUUID=[9314284969 <(931)%20428-4969>] err=[context 
>>>>>>> canceled] currentUUId=[6980556786]
>>>>>>> ```
>>>>>>>
>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/5fb5a486-a46b-439b-b29f-ba40ebc59ad8n%40googlegroups.com.

Reply via email to