The difficulty with profiling is covering all the important traffic mixes and plugin usage (impossible in practice, because of proprietary plugins). It seems desirable to have a design that's robust even when Continuations block more than they ideally should. Having a single run queue per core, feeding N worker event threads on that core, goes a long way toward preventing one blocked Continuation from keeping other Continuations from starting (which drives up worst-case latency). Each core could have a supervisor thread with pseudo-code like this:

bool idle[Num_workers_per_core];
Condition_variable go[Num_workers_per_core];
Continuation *curr[Num_workers_per_core];

while (true) {
  blocking get on message queue;
  switch (msg.type) {

  case WORKER_THREAD_READY_FOR_NEXT:  // Sent by a worker thread when its Continuation returns.
    if (run queue empty) {
      idle[msg.worker_index] = true;
    } else {
      curr[msg.worker_index] = dequeue from run queue;
      trigger go[msg.worker_index];
    }
    break;

  case QUEUE_CONTINUATION_TO_RUN:  // Sent to hand this core a Continuation to execute.
    run_index = index where idle[index] is true, or none if all workers are busy;
    if (run_index == none) {
      enqueue msg.continuation in run queue;
    } else {
      idle[run_index] = false;
      curr[run_index] = msg.continuation;
      trigger go[run_index];
    }
    break;

  }  // end switch
}  // end while
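
Here's a rough C++ sketch of how the supervisor and its workers might fit together. It's illustrative only: Continuation is just a std::function placeholder (not the real ATS class), the message queue is a plain mutex-plus-condition-variable deque, and thread shutdown/joining is omitted.

#include <algorithm>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Continuation = std::function<void()>;  // placeholder for the real Continuation type

enum class MsgType { WORKER_THREAD_READY_FOR_NEXT, QUEUE_CONTINUATION_TO_RUN };

struct Msg {
  MsgType type;
  int worker_index;           // used by WORKER_THREAD_READY_FOR_NEXT
  Continuation continuation;  // used by QUEUE_CONTINUATION_TO_RUN
};

class CoreScheduler {
public:
  explicit CoreScheduler(int num_workers) : workers_(num_workers), idle_(num_workers, true) {
    for (int i = 0; i < num_workers; ++i)
      threads_.emplace_back([this, i] { worker_loop(i); });
    threads_.emplace_back([this] { supervisor_loop(); });  // shutdown/joining omitted for brevity
  }

  // Any thread (on any core) can hand this core a Continuation to run.
  void queue_continuation(Continuation c) {
    post(Msg{MsgType::QUEUE_CONTINUATION_TO_RUN, -1, std::move(c)});
  }

private:
  struct Worker {
    std::mutex mtx;
    std::condition_variable go;
    Continuation curr;
    bool has_work = false;
  };

  void post(Msg m) {
    { std::lock_guard<std::mutex> lk(msg_mtx_); msg_queue_.push_back(std::move(m)); }
    msg_cv_.notify_one();
  }

  // The supervisor is the only thread that touches idle_ and run_queue_, so they need no lock.
  void supervisor_loop() {
    for (;;) {
      Msg msg;
      { std::unique_lock<std::mutex> lk(msg_mtx_);
        msg_cv_.wait(lk, [this] { return !msg_queue_.empty(); });
        msg = std::move(msg_queue_.front());
        msg_queue_.pop_front(); }
      switch (msg.type) {
      case MsgType::WORKER_THREAD_READY_FOR_NEXT:
        if (run_queue_.empty()) {
          idle_[msg.worker_index] = true;                    // no pending work; park this worker
        } else {
          dispatch(msg.worker_index, std::move(run_queue_.front()));
          run_queue_.pop_front();
        }
        break;
      case MsgType::QUEUE_CONTINUATION_TO_RUN: {
        auto it = std::find(idle_.begin(), idle_.end(), true);
        if (it == idle_.end()) {
          run_queue_.push_back(std::move(msg.continuation));  // all workers busy
        } else {
          *it = false;
          dispatch(int(it - idle_.begin()), std::move(msg.continuation));
        }
        break;
      }
      }
    }
  }

  // Hand a Continuation to worker i and wake it.
  void dispatch(int i, Continuation c) {
    { std::lock_guard<std::mutex> lk(workers_[i].mtx);
      workers_[i].curr = std::move(c);
      workers_[i].has_work = true; }
    workers_[i].go.notify_one();
  }

  // Each worker runs one Continuation at a time, then reports back to the supervisor.
  void worker_loop(int i) {
    for (;;) {
      Continuation c;
      { std::unique_lock<std::mutex> lk(workers_[i].mtx);
        workers_[i].go.wait(lk, [&] { return workers_[i].has_work; });
        c = std::move(workers_[i].curr);
        workers_[i].has_work = false; }
      c();  // a blocked Continuation stalls only this worker
      post(Msg{MsgType::WORKER_THREAD_READY_FOR_NEXT, i, {}});
    }
  }

  std::vector<Worker> workers_;
  std::vector<bool> idle_;              // idle_[i]: worker i is waiting for work
  std::deque<Continuation> run_queue_;  // per-core to-run list
  std::mutex msg_mtx_;
  std::condition_variable msg_cv_;
  std::deque<Msg> msg_queue_;
  std::vector<std::thread> threads_;
};

A Continuation that blocks ties up only the one worker it's running on; the supervisor keeps handing new work to the remaining idle workers on that core.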

On Wed, Oct 2, 2019 at 12:23 PM Walt Karas <wka...@verizonmedia.com> wrote:

> What's the best tool for multi-threaded profiling on Linux?
>
> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
>
>> Correct, it doesn't mean no lock contention. The claim was that it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
>>
>> In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
>>
>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>>
>> > I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations can't only get queued for the event thread from the event thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity is obviously a goal. It seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
>> >
>> > On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
>> >
>> > > Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
>> > >
>> > > I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
>> > >
>> > > On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>> > >
>> > > > From the longer-term TSers I've heard comments about seeing profiling results showing that waiting on mutexes is a significant performance issue for TS, but I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
>> > > >
>> > > > For better performance, I think that having a single to-run Continuation queue (or one per core), with each queue feeding multiple event threads, is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412 ). I'm not sure changing to queue-based mutexes would have a significant performance impact, but it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. A different mutex implementation is not strictly necessary, though, in order to consolidate the to-run Continuation queues.
>> > > >
>> > > > On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
>> > > >
>> > > > > Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
>> > > > >
>> > > > > Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
>> > > > >
>> > > > > Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
>> > > > >
>> > > > > Very interested, and wondering what this would mean for plugin developers.
>> > > > >
>> > > > > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
>> > > > >
>> > > > > > If a Continuation is scheduled but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would move the Continuation at the front of the mutex's queue (if the queue is not empty) into the ready-to-run queue, transferring the lock to that Continuation. A drawback is that the queue would itself need a mutex (a spinlock?), but the critical section would be very short.
>> > > > > >
>> > > > > > There would be a function to lock a mutex directly. It would create a Continuation that has two condition variables, assign the mutex to that Continuation, and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable and then block on the second condition variable. The direct-lock function would then return, allowing the calling code to enter its critical section. At the end of the critical section, another function would be called to release the direct lock. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
>> > > > > >
>> > > > > > With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
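
P.S. To make the queued-mutex and direct-lock ideas in the quoted thread above more concrete, here's a rough C++ sketch. It's illustrative only: Cont, ContMutex, DirectLock, and schedule_ready() are made-up stand-ins, not existing ATS types or interfaces, and a real implementation would hook into the event loop rather than raw std::thread.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct Cont {                    // placeholder for a real Continuation
  std::function<void()> handler;
};

// Stand-in for handing a Continuation to an event thread's to-run queue.
void schedule_ready(Cont *c) { std::thread([c] { c->handler(); }).detach(); }

// A Continuation mutex with its own wait queue: a Continuation whose mutex is
// busy is parked here instead of sitting in a to-run list, and release() hands
// the lock to the next waiter.
class ContMutex {
public:
  // Returns true if the caller now holds the mutex on behalf of c (c can go
  // straight to a to-run queue); otherwise c was parked on the wait queue.
  bool acquire_or_park(Cont *c, bool at_front = false) {
    std::lock_guard<std::mutex> g(guard_);  // very short critical section
    if (locked_) {
      if (at_front) waiters_.push_front(c); else waiters_.push_back(c);
      return false;
    }
    locked_ = true;
    return true;
  }

  // Called when the Continuation holding the mutex finishes.  Ownership
  // transfers to the head of the wait queue, if any, which becomes ready to run.
  void release() {
    Cont *next = nullptr;
    {
      std::lock_guard<std::mutex> g(guard_);
      if (waiters_.empty()) {
        locked_ = false;
      } else {
        next = waiters_.front();  // lock stays held, now on behalf of next
        waiters_.pop_front();
      }
    }
    if (next) schedule_ready(next);  // next enters the to-run queue already holding the lock
  }

private:
  std::mutex guard_;
  bool locked_ = false;
  std::deque<Cont *> waiters_;
};

// Direct lock for code that is allowed to block: a helper Continuation holds
// the ContMutex while the blocked caller runs its critical section.
class DirectLock {
public:
  explicit DirectLock(ContMutex &m) : mutex_(m) {
    helper_.handler = [this] {
      // Runs on an event thread once the helper holds mutex_.
      std::unique_lock<std::mutex> lk(cv_mtx_);
      acquired_ = true;
      cv1_.notify_one();                            // unblock lock()
      cv2_.wait(lk, [this] { return released_; });  // wait for unlock()
      mutex_.release();  // released explicitly here; in TS the event loop
                         // would release it when the handler returns
    };
  }

  void lock() {
    // Park at the front of the wait queue, since an entire event thread is
    // blocked until the helper runs.
    if (mutex_.acquire_or_park(&helper_, /*at_front=*/true))
      schedule_ready(&helper_);                     // lock acquired immediately
    std::unique_lock<std::mutex> lk(cv_mtx_);
    cv1_.wait(lk, [this] { return acquired_; });    // wait until the helper holds the mutex
  }

  void unlock() {
    { std::lock_guard<std::mutex> lk(cv_mtx_); released_ = true; }
    cv2_.notify_one();                              // let the helper finish and release the mutex
  }

private:
  ContMutex &mutex_;
  Cont helper_;
  std::mutex cv_mtx_;
  std::condition_variable cv1_, cv2_;
  bool acquired_ = false, released_ = false;
};

The intent is that the to-run lists only ever hold Continuations that can actually run, with the per-mutex guard held for just a few instructions; DirectLock is the escape hatch for code that genuinely has to block while holding a Continuation mutex.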