I've done some profiling before using Intel's VTune; you can ask Jason for details.
On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:

> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
>
>> What's the best tool for multi-threaded profiling on Linux?
>
> Probably the Intel tools. Probably the “perf” toolchain works well, at least on modern Linux; you could do perf lock, etc.
>
>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
>>
>>> Correct, it doesn't mean no lock contention. The claim was that it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
>
> So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I'd love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.
>
> We have to stop making complicated and invasive solutions without real problems, and without understanding those problems.
>
> My $0.01,
>
> -- Leif
>
>>> In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
>>>
>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>>>
>>>> I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations can't only get queued for the event thread from the event thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity are obviously goals. Seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
>>>>
>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
>>>>
>>>>> Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
>>>>>
>>>>> I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>>>>>
>>>>>> From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
>>>>>>
>>>>>> For better performance, I think that having a single to-run Continuation queue, or one per core, with a queue feeding multiple event threads, is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412). I'm not sure changing to queue-based mutexes would have a significant performance impact. But it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. That said, a different mutex implementation is not strictly necessary in order to consolidate to-run Continuation queues.
>>>>>>
>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
>>>>>>
>>>>>>> Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
>>>>>>>
>>>>>>> Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
>>>>>>>
>>>>>>> Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
>>>>>>>
>>>>>>> Very interested and wondering what this would mean for plugin developers.
>>>>>>>
>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
>>>>>>>
>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would put the Continuation at the front of the mutex's queue (if not empty) into the ready-to-run queue (transferring the lock to that Continuation). A drawback is that the queue would itself need a mutex (spinlock?), but the critical section would be very short.
>>>>>>>> There would be a function to lock a mutex directly. It would create a Continuation that had two condition variables. It would assign the mutex to this Continuation and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable and then block on the second condition variable. The direct-lock function would then exit, allowing the calling code to enter its critical section. At the end of the critical section, another function to release the direct lock would be called. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
>>>>>>>>
>>>>>>>> With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
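
To make the first part of the quoted proposal concrete, here is a minimal sketch of the per-mutex wait queue idea. It is not ATS code: QueuedMutex, acquire_or_park, and enqueue_ready_to_run are invented names, and a plain std::mutex stands in for the suggested spinlock. The point is only that a Continuation whose mutex is busy is parked on that mutex's own queue, and the unlock path hands ownership to the head of the queue and makes it ready to run.

    #include <deque>
    #include <iostream>
    #include <mutex>

    // Hypothetical stand-ins for ATS types; all names are invented for illustration.
    struct Continuation { const char *name; };

    // Stub for the scheduler hook that would move a Continuation onto a
    // ready-to-run queue in the real event system.
    static void enqueue_ready_to_run(Continuation *c) {
      std::cout << c->name << " is now ready to run\n";
    }

    struct QueuedMutex {
      std::mutex guard;                 // protects the very short critical sections below
      Continuation *holder = nullptr;   // Continuation currently owning the lock
      std::deque<Continuation *> waiters;

      // On dispatch: take the lock, or park the Continuation on this mutex's queue.
      bool acquire_or_park(Continuation *c) {
        std::lock_guard<std::mutex> lk(guard);
        if (holder == nullptr) {
          holder = c;
          return true;                  // the caller may run the Continuation now
        }
        waiters.push_back(c);           // parked; no rescheduling or lock polling
        return false;
      }

      // When the holding Continuation exits: hand the lock to the next waiter,
      // if any, and make that Continuation ready to run.
      void release() {
        Continuation *next = nullptr;
        {
          std::lock_guard<std::mutex> lk(guard);
          holder = nullptr;
          if (!waiters.empty()) {
            next = waiters.front();
            waiters.pop_front();
            holder = next;              // ownership transfers directly
          }
        }
        if (next != nullptr) {
          enqueue_ready_to_run(next);
        }
      }
    };

    int main() {
      QueuedMutex m;
      Continuation a{"A"}, b{"B"};
      m.acquire_or_park(&a);   // A gets the lock
      m.acquire_or_park(&b);   // B is parked on the mutex's queue
      m.release();             // lock passes to B, which becomes ready to run
      m.release();             // B exits; the queue is now empty
      return 0;
    }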
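
The direct-lock mechanism in the quoted proposal can be modelled the same way. The sketch below is not ATS code: DirectLockHelper, direct_lock, and direct_unlock are invented names, and a std::thread stands in for the event thread that would run the helper Continuation once the scheduler had granted it the mutex. It shows only the two-condition-variable handshake: the caller blocks on the first condition variable until the helper Continuation is running (and therefore the mutex is held), and releasing the direct lock signals the second condition variable so the helper's function can return, which is what would release the mutex.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    // Rough model of the proposed direct-lock helper (names invented).
    struct DirectLockHelper {
      std::mutex mtx;                        // protects the two flags below
      std::condition_variable cv_acquired;   // the "first" condition variable
      std::condition_variable cv_release;    // the "second" condition variable
      bool lock_held   = false;              // set when the helper Continuation runs
      bool release_req = false;              // set when the caller's critical section is done

      // Body of the helper Continuation.  In the real design it would be
      // dispatched by an event thread once the target mutex was acquired,
      // and the mutex would be released when this function returns.
      void continuation_body() {
        std::unique_lock<std::mutex> lk(mtx);
        lock_held = true;
        cv_acquired.notify_one();            // wake the thread blocked in direct_lock()
        cv_release.wait(lk, [this] { return release_req; });
        // returning here is what would release the mutex
      }

      // Called by code that wants to hold the mutex directly; blocks until
      // the helper Continuation is running (and thus the mutex is held).
      void direct_lock() {
        std::unique_lock<std::mutex> lk(mtx);
        cv_acquired.wait(lk, [this] { return lock_held; });
      }

      // Called at the end of the caller's critical section; lets the helper
      // Continuation's function return.
      void direct_unlock() {
        {
          std::lock_guard<std::mutex> lk(mtx);
          release_req = true;
        }
        cv_release.notify_one();
      }
    };

    int main() {
      DirectLockHelper helper;

      // A std::thread stands in for the event thread that would run the
      // helper Continuation after it was scheduled and given the mutex.
      std::thread event_thread([&] { helper.continuation_body(); });

      helper.direct_lock();                   // blocks until the helper runs
      std::cout << "in critical section\n";   // the caller's critical section
      helper.direct_unlock();                 // lets the helper return

      event_thread.join();
      return 0;
    }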
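
Finally, the "one ready-to-run list per core, with a pool of event threads pinned to that core" idea might look roughly like the sketch below. Again this is illustrative only: CoreRunQueue is an invented name, Continuations are modelled as plain callables rather than real ATS Continuations, and the Linux-specific pthread_setaffinity_np call (error handling omitted) does the core pinning. Because several threads drain the same per-core queue, one Continuation that blocks does not stall the rest of that core's ready-to-run work.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <pthread.h>   // pthread_setaffinity_np (Linux-specific)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET

    // One ready-to-run queue per core, drained by a pool of event threads
    // pinned to that core.  All names are invented for illustration.
    class CoreRunQueue {
    public:
      CoreRunQueue(int core_id, int n_threads) {
        for (int i = 0; i < n_threads; ++i) {
          workers_.emplace_back([this, core_id] {
            pin_to_core(core_id);
            run_loop();
          });
        }
      }

      ~CoreRunQueue() {
        {
          std::lock_guard<std::mutex> lk(mtx_);
          shutdown_ = true;
        }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
      }

      // Schedule a ready-to-run Continuation (modelled here as a callable).
      void schedule(std::function<void()> cont) {
        {
          std::lock_guard<std::mutex> lk(mtx_);
          ready_.push_back(std::move(cont));
        }
        cv_.notify_one();
      }

    private:
      static void pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // error handling omitted
      }

      void run_loop() {
        for (;;) {
          std::function<void()> cont;
          {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [this] { return shutdown_ || !ready_.empty(); });
            if (ready_.empty()) return;   // shutdown requested and nothing left to run
            cont = std::move(ready_.front());
            ready_.pop_front();
          }
          cont();   // if this blocks, the other threads in the pool keep draining the queue
        }
      }

      std::mutex mtx_;
      std::condition_variable cv_;
      std::deque<std::function<void()>> ready_;
      bool shutdown_ = false;
      std::vector<std::thread> workers_;
    };

    int main() {
      CoreRunQueue core0(0, 4);   // 4 event threads pinned to core 0
      core0.schedule([] { /* a Continuation body would run here */ });
      return 0;                   // destructor drains the queue and joins the threads
    }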