On Mon, Dec 12, 2011 at 7:21 AM, Alan M. Carroll <a...@network-geographics.com> wrote:
> Friday, December 9, 2011, 5:17:56 PM, you wrote:
>
> > NetHandler::mainNetEvent doesn't lock the VCs because it doesn't need to
> > until it wants to access the internal data, at which time it locks it.
> > Therefore the internal data is protected by the NetVC mutex. However, the
> > existence is NOT protected by that mutex.
>
> That's not what I have seen in debugging runs. I need to double check the
> actual call stack, but I see calls down through read and into close that
> don't lock the VC.

There are four locks of interest: the NetHandler, the NetVC, the read VIO and
the write VIO, and they are used to protect different things. There are
likely some misuses which are causing problems.

- The NetHandler lock protects the data structures of the NetHandler, which
  include the queues that the NetVC is in and the Link fields for those
  queues in the NetVC, but not the other internal values of the NetVC.
- The NetVC lock is only used during connecting, at the point where the
  NetVC has not yet been handed over to the protocol SM.
- The read VIO and write VIO mutexes control the internal data structures of
  the NetVC. They are typically the same (although they could be different).

To set the closed flag, a protocol engine needs to acquire both the read and
write VIO mutexes and clear all pointers to the NetVC. To delete the NetVC,
the NetHandler mutex must be acquired, the 'closed' flag (previously set
after the protocol engine acquired both VIO mutexes) must be checked, and the
NetVC removed from the NetHandler queues. (There is a sketch of this
discipline at the end of this message.)

read_from_net is called from the NetHandler under its lock. It takes out the
read VIO mutex. It does NOT take out the NetVC mutex, because that mutex is
solely used BEFORE the NET_EVENT_OPEN callback which passes the NetVC to the
protocol engine.

When I left Inktomi the inactivity timer was managed directly, i.e.
INACTIVITY_TIMEOUT was set. It looks like someone checked in an
InactivityCop. This is protected by the NetHandler mutex. It also takes out
the NetVC mutex in case there is a timeout between net_accept() and the
immediate event where the NetVC is transferred to the controlling Net
thread, but that can't happen, as inactivity_timeout_at is set to 0 in
close_UnixNetVConnection(), so this is strictly unnecessary (although I
don't think harmful). We could remove it.

> > The NetHandler lock is used to protect the allocation, deletion of the
> > NetVC. The owner of the NetVC lock can set the 'closed' flag, but can't
> > delete the NetVC unless they own the NetHandler lock.
>
> Sure it can, it just calls do_io_close. I direct your attention to
> HttpServerSession.cc:do_io_close which does precisely this. I have
> attached a diagram of this call chain, derived directly from an actual
> stack trace.

WTF: I have never in my life seen the code in
NetVConnection::Handler::do_locked_io_close(). I have never even seen a
class NetVConnection::Handler. This is extremely dangerous code. I have no
idea what it is trying to do, but I would strongly suggest removing it and
replacing it with netvc->do_io_close(). It calls "delete", for gosh sakes!
This is a complete no-no. Before this there were NO mallocs or frees on the
critical path. This code is both insanely inefficient and dangerous.
do_io_close sets the "closed" flag, which is then correctly handled while
the NetHandler mutex is held by the NetHandler. The code in
NetVConnection::Handler::do_locked_io_close is going to mangle the data
structures!
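
To make the intended discipline concrete, here is a toy model of it in
plain C++. This is a sketch, not actual ATS code: std::mutex and
std::atomic stand in for the real ProxyMutex machinery, and all the names
are illustrative.

    #include <atomic>
    #include <mutex>

    struct VIO_ { std::mutex mutex; };  // stands in for a read/write VIO

    struct NetVC_ {
      VIO_ read_vio, write_vio;
      std::atomic<bool> closed{false};  // written and read under different locks
    };

    // Protocol engine side: may only SET the flag, under both VIO mutexes,
    // after dropping every pointer it holds to the VC. It never deletes.
    void protocol_engine_close(NetVC_ *vc) {
      std::scoped_lock guard(vc->read_vio.mutex, vc->write_vio.mutex);
      vc->closed = true;  // the VC is now dead to the protocol engine
    }

    // NetHandler side: the only party allowed to reclaim. It checks the
    // flag under ITS OWN mutex; the NetVC mutex is not involved at all.
    struct NetHandler_ {
      std::mutex mutex;
      void reap(NetVC_ *vc) {
        std::lock_guard<std::mutex> guard(mutex);
        if (vc->closed) {
          // take vc off the NetHandler queues here, then reclaim it
          delete vc;
        }
      }
    };

The "closed" flag is the only thing the two sides share; neither side ever
reaches into the other's data structures.
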
> > A well functioning
> > protocol engine would hold the NetVC mutex, remove all references to
> > the NetVC, then call do_io_close().
>
> But isn't it the NetHandler mutex that controls de-allocation? Holding
> just the NetVC mutex and calling do_io_close doesn't follow that
> protocol.

Sorry, I meant the read and write VIO mutexes. And yes, I do mean that it
should call do_io_close, which DOES NOT delete the NetVC unless
nh->mutex->thread_holding == this_ethread() (in other words, when it is
currently holding the NetHandler mutex). do_io_close() does the right
thing. It sets the "closed" flag and then, *if* it holds the NetHandler
mutex, it removes the NetVC from the queues etc.

> > It looks like there are crashes because someone is calling do_io_close
> > without ensuring that all the holders of pointers to the NetVC have
> > been cleared. Once do_io_close() is called, the NetVC is dead; it
> > can't be accessed any more, because the NetHandler can access the
> > "closed" flag WITHOUT the NetVC lock, but WITH the NetHandler lock,
> > and delete the NetVC.
>
> But how is that to be done? In the case I attached, the NetVC is
> possibly present in a ready queue of a NetHandler which is different
> from the NH that is invoking do_io_close. How can it find the correct NH
> and remove the NetVC from those queues? One must also be concerned about
> deadlock, where two threads try to close the NetVC, each holding the
> NetHandler mutex the other thread wants.

You don't!!!! You just set the "closed" flag. The do_io_close() code checks
whether the current thread is holding the NetHandler mutex for the target
VC, so it is safe under these circumstances. You do not want to try to
delete the NetVC. Just call do_io_close and it will do the right thing.
Trying to forcibly delete something between threads with shared state is a
really, really bad idea.

Perhaps there is some confusion about what "close" means. In this context
it means close the VC, which only indirectly refers to the underlying
socket because this happens to be a NetVC. It is a signal to the
controlling processor, in this case represented by the NetHandler, that the
VC is no longer referenced by the protocol engine and should be reclaimed.
It very much does not mean to close the underlying socket immediately or to
reclaim the underlying memory immediately.

> > There are a variety of ways of throwing smart pointers at the
> > "problem" which basically confuse when the object is actually live and
> > let people continue to hold pointers to dead/closed objects in buggy
> > code.
>
> No, it's using them to help distinguish between "live" and
> "de-allocated". It seems to me to be the least costly / complex way for
> thread 1 to in effect cull entries from thread 2's internal queues. With
> smart pointers thread 1 doesn't have to know which thread is thread 2,
> or even if it exists. If you have an alternate coordination mechanism,
> please suggest it.

Do not cull entries from internal queues. Just call do_io_close() and the
owner of the memory and the NetVC will reclaim it in good time. Smart
pointers of this type are a crutch. Some companies (including the one I now
work at) have a style guide which explicitly states that shared smart
pointers should not be used except in extraordinary circumstances when
ownership really is shared (e.g. Mutex). In this case ownership is
extremely clear. The NetVC is owned by the NetHandler, period.
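
Since do_io_close() keeps coming up, here is roughly its shape, paraphrased
from memory in the same toy-model C++ as before. This is not the actual
UnixNetVConnection source, just the control flow I am describing.

    #include <atomic>
    #include <thread>

    // Same toy types as the earlier sketch, trimmed to what this needs.
    struct NetVC_ { std::atomic<bool> closed{false}; };

    struct NetHandler_ {
      std::thread::id thread_holding;  // stands in for nh->mutex->thread_holding
      void unlink_and_free(NetVC_ *vc) {
        // take vc off the NetHandler queues, then reclaim the memory
        delete vc;
      }
    };

    // Model of do_io_close(): always mark the VC closed; reclaim it
    // immediately ONLY if this thread already holds the NetHandler mutex
    // (the nh->mutex->thread_holding == this_ethread() test quoted above).
    void do_io_close_model(NetVC_ *vc, NetHandler_ *nh) {
      vc->closed = true;  // always safe: just marks the VC dead
      if (nh->thread_holding == std::this_thread::get_id()) {
        nh->unlink_and_free(vc);
      }
      // Otherwise nothing more happens here: the NetHandler sees the
      // closed flag under its own mutex on a later pass and reclaims the
      // VC then. No second lock is ever taken while waiting on a first,
      // which is why the deadlock Alan worries about cannot occur.
    }
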
Only the NetHandler (or, by extension, the holder of the NetHandler lock)
has any right to the NetHandler queues and can reclaim the NetVC memory.
This is most definitely a situation where a smart pointer should never be
used. Mutex, MIOBuffer: these are true shared objects which have no one
true owner. These are the objects which deserve and use smart pointers.
Only a handful of objects in ATS are reference counted, and all of them are
like this: objects with no single owner.

> > This doesn't change the fact that if they made sure not to hold stale
> > pointers, there wouldn't be a problem in the first place.
>
> Yes, but the question is, how can a thread make sure not to hold a stale
> pointer to an object which can be closed / de-allocated in another
> thread? Smart pointers make that simple, without deadlock and with
> minimal lock contention.

The answer is very, very simple. Remove all references and call
do_io_close, which does the right thing. It checks to see if it already has
the NetHandler lock; if so it can reclaim the NetVC, and if not it sets the
"closed" flag for later handling under the NetHandler mutex.
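
In the same toy-model terms, the entire obligation on the protocol engine
side is just this. ProtocolSM_, server_vc, and close_connection are made-up
names for illustration, not real ATS code.

    #include <atomic>

    struct NetVC_ {
      std::atomic<bool> closed{false};
      void do_io_close() { closed = true; }  // defers to the NetHandler, as above
    };

    struct ProtocolSM_ {
      NetVC_ *server_vc = nullptr;  // the one reference this SM holds

      void close_connection() {
        if (NetVC_ *vc = server_vc) {
          server_vc = nullptr;  // 1. remove ALL references to the NetVC
          vc->do_io_close();    // 2. then signal close; never `delete vc`
          // From here on the VC must be treated as dead: the NetHandler
          // may reclaim the memory at any time under its own mutex.
        }
      }
    };

No smart pointers, no cross-thread deletes, no reaching into another
thread's queues: drop your references, signal close, and let the owner
clean up.
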