Here are my comments for what they are worth. First, let me detail the issue this is trying to address.
The way most clients work with VCs is via a Processor::open_XXXX call, which calls back with an OPEN event; at that point the client sets the VIO's mutex field, and from then on all access to the VC is under that lock, including the do_io_close() that eventually results in the VC being deallocated.

There is, however, one situation where this simple and safe order of events is not followed: connection sharing to origin servers. Here the situation starts the same, but when the client is done with the connection it does not issue a do_io_close(), and this is where the problems can begin. The code that manages connection sharing is very brittle because it depends on the possible interleavings of events between three different threads: the thread on which the origin VC lives, the thread where the "previous" client lives, and the thread where the "next" client VC lives. This code worked fine for many years because it wasn't changed, but these interactions are subtle, and since the NetProcessor and origin server connection management have changed, it is little wonder that there is a problem now. As I vaguely remember, the old solution involved swapping the mutex while holding both it and the new mutex for a certain amount of time (to avoid races when the mutex is captured by the lock), and maybe some other voodoo involving chicken sacrifices on the full moon.

The bug we were seeing manifested when two of those three threads were active at the same time, reading and writing the NetVC. Any time two threads are scribbling on the same piece of memory, it isn't good.

So, this patch. What does it do? It uses smart pointers to prevent either of the two threads that are scribbling all over the shared NetVC from making one particular change to it: deleting it while the other is still running. It doesn't prevent any of the other horrors, or all other manner of crashes, race conditions, and unexpected behavior; just the one, deallocating.
It is a serious one, but not the only one. What would really solve this problem? Not having two threads running at the same time holding the same NetVC. I can think of three ways forward:

1) We can try to muddle through the various timelines of events, figure out where the race is, and work out the specific order of lock holds and lock substitutions etc. to prevent it. This is what we did originally, and while it produces a nice high-performance result, it is extremely brittle and unsupportable. I think this is the wrong way.

2) We can apply this patch, which will prevent the race from causing one particular kind of crash, but does nothing to eliminate the race and leaves us open to lots of other issues.

3) We can do something more heavy-handed and change the way locking is handled in the NetProcessor. In particular, wrap a lock around the code that allocates, frees, and *changes the mutex* of a NetVC, such that it is not possible for two threads to both be running while scribbling on the same NetVC. Once we do this we will not have to use smart pointers for NetVCs, because they will be no different from anything else that you call free() on and after which you can't use the pointer. They will be just like a normally allocated object, because it will no longer be the case that two threads hold the pointer: all transfers of ownership of a NetVC to a different mutex (which you can think of as representing a logical "mini-process") will be managed under locks. We can even add tracking such that a NetVC can be verified at runtime to be owned by a single mutex, for DEBUG builds/runs.

Anyway, that is my 2 cents. I'd rather have a final solution than a partial patch. The latter didn't work out so well for my roof :)

john

On Tue, Mar 13, 2012 at 10:36 AM, Alan M. Carroll < a...@network-geographics.com> wrote:

> I have a patch submitted for TS-857 which attempts to make a start on
> finer grained locking.
> The essence of it is
>
> 1) Provide more powerful and standards compliant reference counted smart
> pointers - lib/ts/IntrusivePtr.h
>
> 2) Provide reference counted lists that are compatible with both (1) and
> existing intrusive list support.
>
> The current IntrusivePtr.h does more than this but that is an artifact of
> the development process, as I will describe later.
>
> For TS-857 in particular, the change is to replace the existing NetHandler
> open_list with a reference counted list which uses the same internal links,
> and to change HttpServerSession to use reference counted smart pointers.
> These changes mean that NetHandler can detect when a VC is shared and avoid
> closing it underneath the other thread. The current Handle and callback
> mechanism is discarded because it's a hack.
>
> Unfortunately, for reasons I don't understand, I can no longer replicate
> the problem on my test setup, even going back to 3.1.1 where as before I
> could replicate it in under a minute. So, while this patch seems stable I
> cannot demonstrate that it is better with respect to TS-857. It passes
> regression, my load testing, and does not seem to leak connections
> (traffic_line reports a constant number of open connections even after much
> load testing).
>
> Reference counting is implemented only for UnixNetVConnections as the
> NetHandler open list contains only instances of that type. Unfortunately
> the actual mechanism must be introduced into the class tree at
> NetVConnection because that is what HttpServerSession uses. The normal
> reference counting cleanup is disabled at this point because it's either
> handled differently (UnixNetVConnection) or not wanted (all other
> NetVConnection subclasses).
>
> The additional types in IntrusivePtr.h that are not used were used during
> earlier work but it seemed later that changing only open_list was the best
> approach. The other types were left in as they seem potentially useful.
> The question going forward is, should we commit this patch? Any comments
> are appreciated. For now, my recommendation is to commit and resolve TS-857
> unless someone else can either replicate it or provide a field report of its
> occurrence.