What's the best tool for multi-threaded profiling on Linux?

On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
Correct, it doesn't mean no lock contention. The claim was that it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.

In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.

On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:

I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations can't only get queued for an event thread from that same event thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity is obviously a goal. It seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.

On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:

Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.

I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.

On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:

From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.

For better performance, I think that having a single to-run Continuation queue, or one per core, with each queue feeding multiple event threads, is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412). I'm not sure changing to queue-based mutexes would have a significant performance impact. But it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. A different mutex implementation is not strictly necessary in order to consolidate the to-run Continuation queues, though.
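A minimal C++ sketch of that arrangement, assuming one shared ready-to-run queue per core drained by a pool of event threads. All of the names here (Continuation, RunQueue, start_event_threads) are hypothetical stand-ins, not the actual ATS event-system API:

    // Hypothetical sketch, not actual ATS code: a single ready-to-run
    // queue shared by a pool of event threads.  A Continuation whose
    // handler blocks ties up only the thread currently running it; the
    // remaining threads keep draining the same queue.
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>

    struct Continuation {            // stand-in for ATS's Continuation
      std::function<void()> handler;
    };

    class RunQueue {                 // one instance per core (or one global)
    public:
      void enqueue(Continuation *c) {
        {
          std::lock_guard<std::mutex> g(mtx_);
          ready_.push_back(c);
        }
        cv_.notify_one();
      }

      Continuation *dequeue() {      // blocks until work is available
        std::unique_lock<std::mutex> g(mtx_);
        cv_.wait(g, [this] { return !ready_.empty(); });
        Continuation *c = ready_.front();
        ready_.pop_front();
        return c;
      }

    private:
      std::mutex mtx_;               // short critical section: queue ops only
      std::condition_variable cv_;
      std::deque<Continuation *> ready_;
    };

    // A pool of event threads, all feeding from the same queue.
    void start_event_threads(RunQueue &q, unsigned n_threads) {
      for (unsigned i = 0; i < n_threads; ++i) {
        std::thread([&q] {
          for (;;) {
            Continuation *c = q.dequeue();
            c->handler();            // if this blocks, other threads still run
          }
        }).detach();
      }
    }

The point of the sketch is only the resilience argument above: with per-event-thread lists, a blocking handler stalls everything queued behind it on that thread, whereas with a shared queue the other threads in the pool keep making progress.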
On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:

Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.

Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, moving those around when we see imbalance and aiming for embarrassingly parallel processing? :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection. :/

Jumping threads within one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.

Very interested, and wondering what this would mean for plugin developers.

On Mon, 30 Sep 2019, 19:20 Walt Karas <wka...@verizonmedia.com.invalid> wrote:

If a Continuation is scheduled but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would move the Continuation at the front of the mutex's queue (if the queue is not empty) into the ready-to-run queue, transferring the lock to that Continuation. A drawback is that the queue would itself need a mutex (a spinlock?), but the critical section would be very short.
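A rough sketch of that mutex, assuming the scheduler exposes some hook for moving a Continuation to the ready-to-run queue. QueuedMutex, acquire_or_park, release, and make_ready_to_run are all hypothetical names, not existing ATS code:

    // Sketch of the proposed mutex: a Continuation that fails to take the
    // lock is parked on the mutex's own wait queue; release() hands the
    // lock directly to the head of that queue and makes it runnable.
    #include <deque>
    #include <mutex>

    struct Continuation;                       // as in the previous sketch
    void make_ready_to_run(Continuation *c);   // assumed scheduler hook

    class QueuedMutex {
    public:
      // Called when a Continuation is scheduled.  Returns true if the
      // Continuation now holds the lock; otherwise it has been parked.
      bool acquire_or_park(Continuation *c) {
        std::lock_guard<std::mutex> g(spin_);  // very short critical section
        if (owner_ == nullptr) {
          owner_ = c;
          return true;
        }
        waiting_.push_back(c);                 // queue specific to this mutex
        return false;
      }

      // Called when the owning Continuation exits.  Instead of simply
      // unlocking, transfer ownership to the first waiter and make it runnable.
      void release() {
        Continuation *next = nullptr;
        {
          std::lock_guard<std::mutex> g(spin_);
          if (waiting_.empty()) {
            owner_ = nullptr;
            return;
          }
          next   = waiting_.front();
          waiting_.pop_front();
          owner_ = next;                       // lock handed over, never "free"
        }
        make_ready_to_run(next);               // outside the internal lock
      }

    private:
      std::mutex spin_;                        // guards owner_ and waiting_
      Continuation *owner_ = nullptr;
      std::deque<Continuation *> waiting_;
    };

Because ownership is handed directly from the releasing Continuation to the first waiter, the mutex never appears free while waiters are parked, and the internal lock guards only the queue operations.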
There would also be a function to lock a mutex directly. It would create a Continuation that has two condition variables, assign the mutex to this Continuation, and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable and then block on the second condition variable. The direct-lock function would then return, allowing the calling code to enter its critical section. At the end of the critical section, another function would be called to release the direct lock. It would trigger the second condition variable, which would cause the handler of the Continuation created for the direct lock to return (thus releasing the mutex). (See the sketch after this message.)

With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same handler function should have the same core affinity.)
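A sketch of the direct-lock path described above, for code running outside the event threads. DirectLock, direct_lock_handler, direct_lock, direct_unlock, and schedule_with_mutex are hypothetical names, and the wiring that creates the proxy Continuation and attaches the handler to it is omitted:

    // Sketch: a thread outside the event system takes a Continuation mutex
    // by scheduling a small proxy Continuation that signals one condition
    // variable once it holds the mutex and waits on a second one until the
    // caller is done with its critical section.
    #include <condition_variable>
    #include <mutex>

    class QueuedMutex;                          // as in the previous sketch
    struct Continuation;
    void schedule_with_mutex(Continuation *c, QueuedMutex *m);  // assumed hook

    struct DirectLock {
      std::mutex              m;
      std::condition_variable acquired;   // signalled once the proxy holds the mutex
      std::condition_variable released;   // signalled when the caller is done
      bool locked   = false;
      bool finished = false;
    };

    // Handler run by the proxy Continuation once it is dispatched while
    // holding the target mutex.
    void direct_lock_handler(DirectLock *dl) {
      std::unique_lock<std::mutex> g(dl->m);
      dl->locked = true;
      dl->acquired.notify_one();                 // wake the blocked caller
      dl->released.wait(g, [dl] { return dl->finished; });
      // Returning from the handler releases the Continuation mutex.
    }

    // Called by non-event-thread code that needs the mutex directly.
    void direct_lock(DirectLock *dl, Continuation *proxy, QueuedMutex *target) {
      schedule_with_mutex(proxy, target);        // proxy queues for the mutex
      std::unique_lock<std::mutex> g(dl->m);
      dl->acquired.wait(g, [dl] { return dl->locked; });
      // Caller now effectively holds `target` and may enter its critical section.
    }

    void direct_unlock(DirectLock *dl) {
      {
        std::lock_guard<std::mutex> g(dl->m);
        dl->finished = true;
      }
      dl->released.notify_one();                 // lets the proxy handler return
    }

The cost of this scheme is that the proxy Continuation occupies an event thread for the entire critical section, which is why the message suggests putting it at the front of the mutex's wait queue.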