Here are my comments for what they are worth.

First, let me detail the issue this is trying to address.

The way that most clients work with VCs is via a Processor::open_XXXX
call, which calls back with an OPEN event, at which point they set the
VIO's mutex field.  From that point on, access to the VC is under that
lock, including the do_io_close() which eventually results in the VC
being deallocated.
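
To make that concrete, here is a minimal sketch of that lifecycle.  It
assumes the usual iocore names (NET_EVENT_OPEN, do_io_read, do_io_close,
SET_HANDLER); the state machine itself is made up for illustration, not
lifted from the tree:

    #include "I_EventSystem.h"
    #include "I_Net.h"

    struct ClientSM : public Continuation {
      NetVConnection *vc;
      MIOBuffer *read_buffer; // assume this was allocated elsewhere

      ClientSM() : Continuation(new_ProxyMutex()), vc(NULL), read_buffer(NULL) {
        SET_HANDLER(&ClientSM::main_handler);
      }

      int main_handler(int event, void *data) {
        switch (event) {
        case NET_EVENT_OPEN:
          // The OPEN callback hands us the VC.  The VIO set up here carries
          // this continuation's mutex, so every later callback on the VC
          // arrives under that one lock.
          vc = static_cast<NetVConnection *>(data);
          vc->do_io_read(this, INT64_MAX, read_buffer);
          break;
        case VC_EVENT_EOS:
        case VC_EVENT_ERROR:
          // Still under the same lock: this eventually deallocates the VC.
          vc->do_io_close();
          vc = NULL;
          break;
        default:
          break;
        }
        return EVENT_DONE;
      }
    };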

There is, however, one situation where this simple and safe order of events
is not followed.  That is connection sharing to origin servers.  Here the
situation starts the same, but when the client is done with the connection
it does not issue a do_io_close(), and this is where the problems can begin.

The code that is run to manage connection sharing is very brittle as it
depends on the possible interleaving of events between three different
threads: the thread on which the origin VC lives, the thread where the
"previous" client lives, and the thread where the "next" client VC lives.
This code worked fine for many years because it wasn't changed, but
since these interactions are subtle, and since the NetProcessor and
origin server connection management have both changed, it is little
wonder that there is a problem now.  As I vaguely remember, the old
solution involved swapping
the mutex while holding it and the new mutex for a certain amount of time
(to avoid races when the mutex is captured by the lock) and maybe some
other voodoo involving chicken sacrifices on the full moon.
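
For illustration only, here is a guess at what that hand-off looked
like, with std::mutex standing in for ProxyMutex and every name
hypothetical:

    #include <mutex>

    struct VCish {
      std::mutex *mutex; // the lock under which this VC is currently driven
    };

    // The thread that currently owns the VC hands it to the next client.
    // Holding both the old and the new mutex for the duration of the swap
    // is what keeps the other thread from seeing a half-transferred VC.
    void handoff(VCish *vc, std::mutex *next) {
      std::mutex *prev = vc->mutex;
      if (prev == next)
        return;
      std::lock(*prev, *next); // acquire both without deadlocking
      std::lock_guard<std::mutex> g1(*prev, std::adopt_lock);
      std::lock_guard<std::mutex> g2(*next, std::adopt_lock);
      vc->mutex = next;
    }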

The bug we were seeing manifested when two of those three threads were
active at the same time, reading and writing the NetVC.  Any time two
threads are scribbling on the same piece of memory, it isn't good.

So, this patch.  What does it do?  It uses smart pointers to prevent
either of the two threads from making one particular change to the
shared NetVC that they are currently scribbling all over: deleting it
while the other is still running.  It doesn't prevent any of the other
horrors, all manner of other crashes, race conditions and unexpected
behavior; just that one, deallocation.  It is a serious one, but not
the only one.
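
The effect is roughly the following; this is only an illustration, with
a hand-rolled intrusive count standing in for the patch's
lib/ts/IntrusivePtr.h, and every name here is hypothetical:

    #include <atomic>

    struct SharedVC {
      std::atomic<int> refcount;
      SharedVC() : refcount(0) {}
    };

    inline void vc_acquire(SharedVC *vc) {
      vc->refcount.fetch_add(1, std::memory_order_relaxed);
    }

    inline void vc_release(SharedVC *vc) {
      // Only the thread that drops the last reference deallocates.  A second
      // thread still holding the VC keeps it alive, which is the one hazard
      // the patch removes; races on the VC's other state are untouched.
      if (vc->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete vc;
    }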

What would really solve this problem?  Not having two threads running
at the same time holding the same NetVC.

I can think of 3 ways forward:

1) we can try to muddle through the various timelines of events, figure
out where the race is, and work out the specific order of lock holds,
lock substitutions, etc. to prevent it.  This is what we did
originally, and while it produces a nice high-performance result, it is
extremely brittle and unsupportable.  I think this is the wrong way.

2) we can apply this patch, which will prevent the race from causing
one particular kind of crash, but does nothing to eliminate the race
and leaves us open to lots of other issues.

3) we can do something more heavy-handed and change the way locking is
handled in the NetProcessor.  In particular, wrap a lock around the
code which allocates, frees and *changes the mutex* of a NetVC, such
that it is not possible for two threads to both be running while
scribbling on the same NetVC (a rough sketch follows below).  Once we
do this we will not have to use smart pointers for NetVCs, because they
will be no different from anything else that you call free() on and
after which you can't use the pointer.  They will be just like a
normally allocated object, because it will no longer be the case that
two threads hold the pointer: all transfers of ownership of the NetVC
to a different mutex (which you can think of as representing a logical
"mini-process") will be managed under locks.  We can even add tracking
so that a NetVC can be verified at runtime to be owned by a single
mutex in DEBUG builds/runs.
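
Here is that rough sketch, with every name hypothetical and
std::mutex/assert standing in for the real ProxyMutex and debug
machinery:

    #include <mutex>
    #include <cassert>

    struct FakeMutex {};           // stand-in for ProxyMutex
    struct FakeVC {                // stand-in for UnixNetVConnection
      FakeMutex *owner;            // the mutex ("mini-process") that owns us
      FakeVC() : owner(0) {}
    };

    static std::mutex vc_ownership_lock; // serializes allocate/free/transfer

    // Hand the VC to a different mutex.  Under the ownership lock, two
    // threads can never both be mid-transfer on the same VC.
    void transfer_vc(FakeVC *vc, FakeMutex *from, FakeMutex *to) {
      std::lock_guard<std::mutex> guard(vc_ownership_lock);
      assert(vc->owner == from);   // DEBUG-style check: one owner at a time
      vc->owner = to;
    }

    // Free the VC.  After this the pointer is simply dead, like any other
    // free()'d object, so no smart pointers are needed to keep it alive.
    void free_vc(FakeVC *vc, FakeMutex *from) {
      std::lock_guard<std::mutex> guard(vc_ownership_lock);
      assert(vc->owner == from);
      delete vc;
    }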

Anyway, that is my 2 cents.   I'd rather have a final solution than a
partial patch.   The latter didn't work out so well for my roof :)

john

On Tue, Mar 13, 2012 at 10:36 AM, Alan M. Carroll <
a...@network-geographics.com> wrote:

> I have a patch submitted for TS-857 which attempts to make a start on
> finer grained locking. The essence of it is
>
> 1) Provide more powerful and standards compliant reference counted smart
> pointers - lib/ts/IntrusivePtr.h
>
> 2) Provide reference counted lists that are compatible with both (1) and
> existing intrusive list support.
>
> The current IntrusivePtr.h does more than this but that is an artifact of
> the development process, as I will describe later.
>
> For TS-857 in particular, the change is to replace the existing NetHandler
> open_list with a reference counted list which uses the same internal links,
> and to change HttpServerSession to use reference counted smart pointers.
> These changes mean that NetHandler can detect when a VC is shared and avoid
> closing it underneath the other thread. The current Handle and callback
> mechanism is discarded because it's a hack.
>
> Unfortunately, for reasons I don't understand, I can no longer replicate
> the problem on my test setup, even going back to 3.1.1, whereas before I
> could replicate it in under a minute. So, while this patch seems stable I
> cannot demonstrate that it is better with respect to TS-857. It passes
> regression, my load testing, and does not seem to leak connections
> (traffic_line reports a constant number of open connections even after much
> load testing).
>
> Reference counting is implemented only for UnixNetVConnections as the
> NetHandler open list contains only instances of that type. Unfortunately
> the actual mechanism must be introduced into the class tree at
> NetVConnection because that is what HttpServerSession uses. The normal
> reference counting cleanup is disabled at this point because it's either
> handled differently (UnixNetVConnection) or not wanted (all other
> NetVConnection subclasses).
>
> The additional types in IntrusivePtr.h that are not used were used during
> earlier work but it seemed later that changing only open_list was the best
> approach. The other types were left in as they seem potentially useful.
>
> The question going forward is, should we commit this patch? Any comments
> are appreciated. For now, my recommendation is to commit and resolve TS-857
> unless someone else can either replicate it or provide a field report of its
> occurrence.
>
>
