Tuesday, March 13, 2012, 11:46:15 PM, you wrote:

> Here are my comments for what they are worth.

> First, let me detail the issue this is trying to address.

> The way that most clients work with VCs is via and Processor::open_XXXX
> which calls back with and OPEN event at which point they set VIOs mutex
> field and from this point on, access to the VC is under that lock including
> the do_io_close() which eventually results in the VC being decallocated.

> There is, however, one situation where this simple and safe order of events
> is not followed.  That is connection sharing to origin servers.  Here the
> situations starts the same, but when the client is done with the connection
> it does not issue a do_io_close(), and this is where the problems can begin.

That's not my interpretation of the crashes. We tried various settings for 
connection sharing to no observed effect on crashing type or frequency. In fact 
all of the configurations I use for testing have connection sharing disabled.

> ming_zym wrote:
> give us more detail on the crashing back traces as much as possible.

As far as I can tell the problem arises when the VCs in a HttpServerSession are 
split across two threads. In this case, even with proper locking, the VC in the 
HttpServerSession that is owned by the "other" thread can be closed (via 
close_UnixNetVConnection) by that other thread. When the HttpServerSession 
thread then attempts to access the VC, it crashes. In fact many of the crashes 
occurred during mutex acquisition because the mutex had been cleared (all of 
the invalid access of memory location 0x18 are of this nature). I instrumented 
the VCs to verify this and consistently had the result that the crash was due 
to one thread calling close_UnixNetVConnection while a pointer to the VC was 
still "live" on another thread. Every crash I investigated was consistent with 
this scenario. Because only a raw pointer was used, there was no way to prevent 
or detect this. The kludge that is currently in place serves essentially to 
detect the situation and cleanup afterwards. The noticeable improvement after 
implementation is additional evidence for this theory. IMHO this is the single 
root cause of all the do_io_close related crashes.

If this is the case then all of the operations involved were done under lock. I 
have attached an illustration of a stack trace from such a crash where you can 
see the two threads and that the HttpServerSession has VCs from both. Even if 
full locking is done, it seems to me that if the 0x2b702ac4b010 thread has 
called already called close_UnixNetVConnection on the 0x6a6fc7 VC, then the 
0x2b702ab4a010 thread will crash on its call to do_io_close. It can't even 
acquire the mutex because that's been cleared.

Here's an example timeline, starting with the VC not locked or actively in use.

Thread A (..c4b010)                          Thread B (..b4a010)
(VC Owner)                                   (HttpServerSession owner) 
VC->mutex->acquire()
VC->do_io_close
 |- close_UnixNetVConnection
    |- mutex->clear()
                                             mutex->acquire(): crash

No lock is going to protect you if the lock itself can go stale.

One might ask, why is HttpServerSession split across threads like that? I have 
no idea. But it seems to happen much more with forward proxy (note: I have only 
indirect evidence for that).


> John Plevyak writes:
> So, this patch.  What does it do?  It uses smart points to prevent either
> of the two threads from making one particular change to the shared NetVC
> that they are currently scribbling all over: that of deleting it while the
> other is still running.  It doesn't prevent any of the other horrors, or
> all other manner of crashes, race conditions and unexpected behavior, just
> the one, deallocating.  It is a serious one, but not the only one.

My current view is that this is the only problem, because in all other cases 
the locking is working.

> 3) we can do something more heavy handed and change the way locking is
> handled in the NetProcessor.  In particular, wrap a lock around the code
> which allocates, frees and *changes the mutex* of a NetVC such that it is
> not possible for two threads to both be running while scribbling on the
> same NetVC.   Once we do this we will not have to us smart pointers for
> NetVCs because they will be no different from anything else that you call
> free() on and after which you can't use the pointer.
 
Could you describe how this would work for my example timeline? In that case 
thread A calls free() and thread B uses the pointer later, thread A owning the 
VC the entire time.

Or, alternatively (for anyone) if you could describe how to prevent 
HttpServerSession from referencing VCs in different threads. For all I know 
there's some simple change that would fix that.
 

Reply via email to