On Mon, Dec 12, 2011 at 7:21 AM, Alan M. Carroll <a...@network-geographics.com> wrote:
> Friday, December 9, 2011, 5:17:56 PM, you wrote:
>
> > NetHandler::mainNetEvent doesn't lock the VCs because it doesn't need to
> > until it wants to access the internal data, at which time it locks it.
> > Therefore the internal data is protected by the NetVC mutex. However, the
> > existence is NOT protected by that mutex.
>
> That's not what I have seen in debugging runs. I need to double check the
> actual call stack, but I see calls down through read and into close that
> don't lock the VC.

There are four locks of interest: the NetHandler, the NetVC, the read VIO and
the write VIO, and they are used to protect different things. There are
likely some misuses which are causing problems.

- The NetHandler lock protects the data structures of the NetHandler, which
  include the queues that the NetVC is in and the Link fields for those
  queues in the NetVC, but not the other internal values of the NetVC.
- The NetVC lock is only used during connecting, at the point where the
  NetVC has not yet been handed over to the protocol SM.
- The read VIO and write VIO mutexes control the internal data structures of
  the NetVC. They are typically the same (although they could be different).

To set the closed flag, a protocol engine needs to acquire both the read and
write VIO mutexes and clear all pointers to the NetVC. To delete the NetVC,
the NetHandler mutex must be acquired, the 'closed' flag (previously set
after the protocol engine acquired both VIO mutexes) must be checked, and the
NetVC removed from the NetHandler queues. (There is a sketch of this
discipline at the end of this message.)

read_from_net is called from the NetHandler under its lock. It takes out the
read VIO mutex. It does NOT take out the NetVC mutex, because that mutex is
solely used BEFORE the NET_EVENT_OPEN callback which passes the NetVC to the
protocol engine.

When I left Inktomi the inactivity timer was managed directly, i.e.
INACTIVITY_TIMEOUT was set. It looks like someone checked in an
InactivityCop. This is protected by the NetHandler mutex. It also takes out
the NetVC mutex in case there is a timeout between net_accept() and the
immediate event where the NetVC is transferred to the controlling Net
thread, but that can't happen, as inactivity_timeout_at is set to 0 in
close_UnixNetVConnection(), so this is strictly unnecessary (although I
don't think harmful). We could remove it.

> > The NetHandler lock is used to protect the allocation, deletion of the
> > NetVC. The owner of the NetVC lock can set the 'closed' flag, but can't
> > delete the NetVC unless they own the NetHandler lock.
>
> Sure it can, it just calls do_io_close. I direct your attention to
> HttpServerSession.cc:do_io_close which does precisely this. I have
> attached a diagram of this call chain, derived directly from an actual
> stack trace.

WTF: I have never in my life seen the code in
NetVConnection::Handler::do_locked_io_close(). I have never even seen a
class NetVConnection::Handler. This is extremely dangerous code. I have no
idea what it is trying to do, but I would strongly suggest removing it and
replacing it with netvc->do_io_close(). It calls "delete", for gosh sakes!
This is a complete no-no. Before this there were NO mallocs or frees on the
critical path. This code is both insanely inefficient and dangerous.
do_io_close sets the "closed" flag, which is then correctly handled while
the NetHandler mutex is held by the NetHandler. The code in
NetVConnection::Handler::do_locked_io_close is going to mangle the data
structures!
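
To make the intended discipline concrete, here is a toy model of it in
plain C++. This is a sketch, not actual ATS code: std::mutex and
std::atomic stand in for the real ProxyMutex machinery, and all the names
are illustrative.

    #include <atomic>
    #include <mutex>

    struct VIO_ { std::mutex mutex; };  // stands in for a read/write VIO

    struct NetVC_ {
      VIO_ read_vio, write_vio;
      std::atomic<bool> closed{false};  // written and read under different locks
    };

    // Protocol engine side: may only SET the flag, under both VIO mutexes,
    // after dropping every pointer it holds to the VC. It never deletes.
    void protocol_engine_close(NetVC_ *vc) {
      std::scoped_lock guard(vc->read_vio.mutex, vc->write_vio.mutex);
      vc->closed = true;  // the VC is now dead to the protocol engine
    }

    // NetHandler side: the only party allowed to reclaim. It checks the
    // flag under ITS OWN mutex; the NetVC mutex is not involved at all.
    struct NetHandler_ {
      std::mutex mutex;
      void reap(NetVC_ *vc) {
        std::lock_guard<std::mutex> guard(mutex);
        if (vc->closed) {
          // take vc off the NetHandler queues here, then reclaim it
          delete vc;
        }
      }
    };

The "closed" flag is the only thing the two sides share; neither side ever
reaches into the other's data structures.
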
> > A well functioning
> > protocol engine would hold the NetVC mutex, remove all references to
> > the NetVC, then call do_io_close().
>
> But isn't it the NetHandler mutex that controls de-allocation? Holding
> just the NetVC mutex and calling do_io_close doesn't follow that
> protocol.

Sorry, I meant the read and write VIO mutexes. And yes, I do mean that it
should call do_io_close, which DOES NOT delete the NetVC unless
nh->mutex->thread_holding == this_ethread() (in other words, when it is
currently holding the NetHandler mutex). do_io_close() does the right
thing. It sets the "closed" flag and then, *if* it holds the NetHandler
mutex, it removes the NetVC from the queues etc.

> > It looks like there are crashes because someone is calling do_io_close
> > without ensuring that all the holders of pointers to the NetVC have
> > been cleared. Once do_io_close() is called, the NetVC is dead; it
> > can't be accessed any more, because the NetHandler can access the
> > "closed" flag WITHOUT the NetVC lock, but WITH the NetHandler lock,
> > and delete the NetVC.
>
> But how is that to be done? In the case I attached, the NetVC is
> possibly present in a ready queue of a NetHandler which is different
> from the NH that is invoking do_io_close. How can it find the correct NH
> and remove the NetVC from those queues? One must also be concerned about
> deadlock, where two threads try to close the NetVC, each holding the
> NetHandler mutex the other thread wants.

You don't!!!! You just set the "closed" flag. The do_io_close() code checks
whether the current thread is holding the NetHandler mutex for the target
VC, so it is safe under these circumstances. You do not want to try to
delete the NetVC. Just call do_io_close and it will do the right thing.
Trying to forcibly delete something between threads with shared state is a
really, really bad idea.

Perhaps there is some confusion about what "close" means. In this context
it means close the VC, which only indirectly refers to the underlying
socket because this happens to be a NetVC. It is a signal to the
controlling processor, in this case represented by the NetHandler, that the
VC is no longer referenced by the protocol engine and should be reclaimed.
It very much does not mean to close the underlying socket immediately or to
reclaim the underlying memory immediately.

> > There are a variety of ways of throwing smart pointers at the
> > "problem" which basically confuse when the object is actually live and
> > let people continue to hold pointers to dead/closed objects in buggy
> > code.
>
> No, it's using them to help distinguish between "live" and
> "de-allocated". It seems to me to be the least costly / complex way for
> thread 1 to in effect cull entries from thread 2's internal queues. With
> smart pointers thread 1 doesn't have to know which thread is thread 2,
> or even if it exists. If you have an alternate coordination mechanism,
> please suggest it.

Do not cull entries from internal queues. Just call do_io_close() and the
owner of the memory and the NetVC will reclaim it in good time. Smart
pointers of this type are a crutch. Some companies (including the one I now
work at) have a style guide which explicitly states that shared smart
pointers should not be used except in extraordinary circumstances when
ownership really is shared (e.g. Mutex). In this case ownership is
extremely clear. The NetVC is owned by the NetHandler, period.
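
Since do_io_close() keeps coming up, here is roughly its shape, paraphrased
from memory in the same toy-model C++ as before. This is not the actual
UnixNetVConnection source, just the control flow I am describing.

    #include <atomic>
    #include <thread>

    // Same toy types as the earlier sketch, trimmed to what this needs.
    struct NetVC_ { std::atomic<bool> closed{false}; };

    struct NetHandler_ {
      std::thread::id thread_holding;  // stands in for nh->mutex->thread_holding
      void unlink_and_free(NetVC_ *vc) {
        // take vc off the NetHandler queues, then reclaim the memory
        delete vc;
      }
    };

    // Model of do_io_close(): always mark the VC closed; reclaim it
    // immediately ONLY if this thread already holds the NetHandler mutex
    // (the nh->mutex->thread_holding == this_ethread() test quoted above).
    void do_io_close_model(NetVC_ *vc, NetHandler_ *nh) {
      vc->closed = true;  // always safe: just marks the VC dead
      if (nh->thread_holding == std::this_thread::get_id()) {
        nh->unlink_and_free(vc);
      }
      // Otherwise nothing more happens here: the NetHandler sees the
      // closed flag under its own mutex on a later pass and reclaims the
      // VC then. No second lock is ever taken while waiting on a first,
      // which is why the deadlock Alan worries about cannot occur.
    }
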
Only the NetHandler (or, by extension, the holder of the NetHandler lock)
has any right to the NetHandler queues and can reclaim the NetVC memory.
This is most definitely a situation where a smart pointer should never be
used. Mutex, MIOBuffer: these are true shared objects which have no one
true owner. These are the objects which deserve and use smart pointers.
Only a handful of objects in ATS are reference counted, and all of them are
like this: objects with no single owner.

> > This doesn't change the fact that if they made sure not to hold stale
> > pointers, there wouldn't be a problem in the first place.
>
> Yes, but the question is, how can a thread make sure not to hold a stale
> pointer to an object which can be closed / de-allocated in another
> thread? Smart pointers make that simple, without deadlock and with
> minimal lock contention.

The answer is very, very simple. Remove all references and call
do_io_close, which does the right thing. It checks to see if it already has
the NetHandler lock; if so it can reclaim the NetVC, and if not it sets the
"closed" flag for later handling under the NetHandler mutex.
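
In the same toy-model terms, the entire obligation on the protocol engine
side is just this. ProtocolSM_, server_vc, and close_connection are made-up
names for illustration, not real ATS code.

    #include <atomic>

    struct NetVC_ {
      std::atomic<bool> closed{false};
      void do_io_close() { closed = true; }  // defers to the NetHandler, as above
    };

    struct ProtocolSM_ {
      NetVC_ *server_vc = nullptr;  // the one reference this SM holds

      void close_connection() {
        if (NetVC_ *vc = server_vc) {
          server_vc = nullptr;  // 1. remove ALL references to the NetVC
          vc->do_io_close();    // 2. then signal close; never `delete vc`
          // From here on the VC must be treated as dead: the NetHandler
          // may reclaim the memory at any time under its own mutex.
        }
      }
    };

No smart pointers, no cross-thread deletes, no reaching into another
thread's queues: drop your references, signal close, and let the owner
clean up.
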