On Wednesday, 12 July 2017 13:57:16 UTC+1, Ian Lance Taylor wrote:
>
> On Wed, Jul 12, 2017 at 12:28 AM,  <james...@gmail.com> wrote: 
> > 
> > On Wednesday, 12 July 2017 04:44:43 UTC+1, Ian Lance Taylor wrote: 
> >> 
> >> On Tue, Jul 11, 2017 at 12:27 PM,  <james...@gmail.com> wrote: 
> >> > 
> >> > This caught me out recently; there's definitely a gap in my 
> >> > understanding as to how select works. Maybe it's caught other 
> >> > people out too? 
> >> > 
> >> > For one of my clients, I'd written an application that would try to 
> >> > read data from a backend. There were multiple instances of this 
> >> > backend (say one in Oregon, and one in Virginia). 
> >> > 
> >> > In this context, it made sense to try reading from all of the 
> >> > backends concurrently, and return the first successful result. All 
> >> > of the slower responses could be ignored, as long as the code 
> >> > cleaned up any resources (goroutines, file descriptors etc). 
> >> > 
> >> > Reader, it did not. 
> >> > 
> >> > The code which demonstrates the failure is at 
> >> > https://github.com/jabley/fptp 
> >> 
> >> I think you neglected to say what went wrong.  Or, if you did say it, 
> >> I missed it. 
> >> 
> >> When I look at this: 
> >> 
> >>     select { 
> >>     case <-ctx.Done(): 
> >>         // Another Searcher already returned a successful response. 
> >>         // Ensure we don't leak things 
> >>         res.Close() 
> >>     default: 
> >>         respc <- searchResponse{res, err} 
> >>     } 
> >> 
> >> what I think is: if the context is not completed when you first enter 
> >> the select statement, then something had better hang around reading 
> >> from respc.  But, as far as I can tell, nothing does.  So you will 
> >> collect goroutines that will never go away.  Is that the problem you 
> >> were seeing? 
> >> 
> >> Ian 
> > 
> > 
> > The problem seen was that goroutines were leaked, and file descriptors 
> > were leaked. Over time, the process got to the limit of max file 
> > descriptors allowed by the OS, and started returning "too many open 
> > files" errors. 
> > 
> > The select statement in `waitForSearchCompletion` is trying to read 
> > from respc. If the response that gets read from the channel is deemed 
> > successful, that function returns. The calling Search function also 
> > returns, and the context.CancelFunc is called to stop further responses 
> > being returned by `search`. 
> > 
> > TestClosersAreNotLeaked does not appear to fail if 
> > 
> > `case respc <- searchResponse{res, err}:` 
> > 
> > is used in `search` instead of 
> > 
> > ``` 
> > default: 
> >         respc <- searchResponse{res, err} 
> > ``` 
> > 
> > Running the tests with -race does not highlight a problem. There is 
> > 100% test coverage in both the successful and failing variants. But it 
> > was only the addition of TestClosersAreNotLeaked (and the supporting 
> > plumbing code) which replicated the problem I was seeing in production. 
> > 
> > There does appear to be a race condition between the 2 select 
> > statements in `search` and `waitForSearchCompletion`. 
> > 
> > I don't (yet!) understand what I was doing wrong. 
>
> I have not looked at your whole program. 
>
> In the case where you write 
>     default: 
>         respc <- ... 
> then if the context is not finished when the goroutine reaches that 
> point, the goroutine is going to hang forever waiting for something to 
> read from the channel.  That will give you the leaked goroutines and 
> descriptors that you describe. 
>
> In the case where you write 
>     case <-ctx.Done(): 
>         ... 
>     case respc <- ... 
> then the goroutine will wait until either the context is done or it 
> can write to the channel.  When the context is done, the goroutine 
> will go away, and you will not see a leak. 
>
> If that does not explain your problem, why not? 
>
> Ian 

Labels for clarity:

1. goroutine 1 spawns goroutine 2 to read from Oregon, and goroutine 3 to 
read from Virginia. 
2. goroutine 1 calls waitForSearchCompletion and enters the select. Nothing 
can proceed, so that select blocks.
3. goroutine 3 completes its search, enters its select, sees that the 
context isn't done, and sends on the respc channel.
4. goroutine 1 can now proceed, by reading from the respc channel.
5. goroutine 2 completes its search and enters its select.
6. ???

At this point, it seems as though goroutine 2 blocks if it uses a `case` 
send, and proceeds if it uses `default`. 
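
To make those steps concrete, here's a minimal sketch of the shape of the 
code (simplified from github.com/jabley/fptp; the backend type and the 
plumbing here are stand-ins, not the exact code from the repo):

```
package fptp

import (
    "context"
    "io"
)

type searchResponse struct {
    res io.Closer // stands in for the real backend response
    err error
}

// backend stands in for whatever queries Oregon or Virginia.
type backend func(ctx context.Context) (io.Closer, error)

// Search is goroutine 1: fan out to every backend, return the first
// successful response, and cancel the rest on the way out.
func Search(ctx context.Context, backends []backend) (io.Closer, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // marks the context done once Search returns

    respc := make(chan searchResponse) // unbuffered

    for _, b := range backends {
        go search(ctx, b, respc) // goroutines 2 and 3
    }

    return waitForSearchCompletion(ctx, respc)
}

// waitForSearchCompletion blocks until a successful response arrives
// or the context is done (step 2).
func waitForSearchCompletion(ctx context.Context, respc <-chan searchResponse) (io.Closer, error) {
    for {
        select {
        case r := <-respc:
            if r.err == nil {
                return r.res, nil // first success wins (step 4)
            }
            // a failed response: keep waiting for a sibling
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }
}

// search is goroutine 2 (and 3): query one backend, then hand the
// result over (steps 3 and 5). Its final select is the part under
// discussion; the two variants are sketched below.
func search(ctx context.Context, b backend, respc chan<- searchResponse) {
    res, err := b(ctx)
    _, _ = res, err // ... the "case" or "default" select goes here ...
}
```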

If it blocks (is it blocked waiting on goroutine 1?), then goroutine 1 
exits from waitForSearchCompletion and from Search, the deferred cancel() 
runs, and the context is marked as done. The select in goroutine 2 sees 
the context as done and does not write to the channel.

(I have never seen this version fail, having run soak tests that exercised 
it for billions of iterations in a test application. It may be a correct 
program, or I may have been lucky so far.)
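
For reference, a sketch of that `case` variant, as the select that would 
fill the placeholder in `search` above:

```
// "case" variant: the send is itself a comm case, so this goroutine
// waits until either a receiver is ready or the context is done.
select {
case <-ctx.Done():
    // Another searcher already won, Search returned, and the deferred
    // cancel() made this case ready: clean up instead of sending.
    if res != nil {
        res.Close()
    }
case respc <- searchResponse{res, err}:
    // A receiver was (or became) ready: the send completed.
}
```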

If it proceeds, then goroutine 2 takes the default branch and attempts a 
bare send on the channel, which is never subsequently read; that goroutine 
blocks forever, its resources leak, and the process eventually starts 
failing hard.
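
And a sketch of the `default` variant, plugging into the same placeholder:

```
// "default" variant: if no receiver is ready at the instant the select
// is evaluated, the default branch runs immediately, and the bare send
// inside it blocks unconditionally; ctx.Done() can no longer release
// this goroutine.
select {
case <-ctx.Done():
    if res != nil {
        res.Close()
    }
default:
    respc <- searchResponse{res, err} // blocks forever if respc is never read again
}
```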

From https://golang.org/ref/spec#Send_statements:

> A send on an unbuffered channel can proceed if a receiver is ready

When is a receiver on respc ready for another send (by goroutine 2)? Is a 
receiver ready after `case r := <-respc:` has been evaluated in 
waitForSearchCompletion? Does goroutine 2 need to wait until the 
StatementList of the CommClause starts being run? Or is it only after that 
select exits?
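
As I understand the spec (my reading, not something from the repo), "a 
receiver is ready" means some goroutine is blocked in a receive on that 
channel at that instant. A toy program illustrating the rendezvous:

```
package main

import (
    "fmt"
    "time"
)

func main() {
    ch := make(chan int) // unbuffered

    go func() {
        fmt.Println("received", <-ch) // this goroutine becomes a ready receiver here
    }()

    time.Sleep(10 * time.Millisecond) // crude way to ensure the receiver is blocked first
    ch <- 1                           // rendezvous: the send proceeds because a receiver is waiting
    time.Sleep(10 * time.Millisecond) // let the print run before main exits
}
```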

The thing that's confusing me is that the code in waitForSearchCompletion 
hasn't changed. I do not understand how I've changed the "readiness" of 
the receiver. Presumably the choice of `case` versus `default` is what's 
causing that?
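
If my reading above is right, the receiver's readiness hasn't changed at 
all; what changes is that a `default` clause stops select from waiting, so 
the bare send afterwards blocks unconditionally. A toy illustration (again 
my assumption, not code from the repo):

```
package main

import "fmt"

func main() {
    ch := make(chan int) // unbuffered, and no goroutine is receiving

    select {
    case ch <- 1:
        fmt.Println("sent") // needs a receiver ready at this instant
    default:
        fmt.Println("no receiver ready; took default")
    }
}
```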
 
