On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
> > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >
> > I feel like you overestimate me if you think I could create a rathole
> > with just 30 lines of pseudo-code.
>
> We have faith in you.

What's an example of a rathole I've already created?  Seems like a very big
leap of faith if there isn't one.

Question: Why do we have more event threads than cores?  If it's to mitigate
the fact that ATS is not purely event-driven, and that Continuations
sometimes do block, then it makes sense to use the best design for that
mitigation.  Sorry, but I don't think any big profiling hoedown is needed to
see that what I'm proposing mitigates it better.  It's essentially the same
idea as having a single line to pay even when there are multiple cashiers.

> > We're making complex thread affinity changes, on top of event execution
> > code that already seems complex.  (I leave it to those with more rathole
> > background to say if it should be so labeled.)  Can I see the profiling
> > results used to justify and guide the thread affinity design?
>
> +1.
>
> — leif
>
> > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> >
> >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >>>
> >>> What's the best tool for multi-threaded profiling on Linux?
> >>
> >> Probably the Intel tools.  The “perf” toolchain also works well, at
> >> least on modern Linux; you could do perf lock, etc.
> >>>
> >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> >>> <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>
> >>>> Correct, it doesn't mean no lock contention.  The claim was that it
> >>>> reduced the lock contention to a level where it's not significant
> >>>> enough to warrant additional preventative measures.  The data Dan got
> >>>> wasn't from code analysis, but from run-time profiling.  That was a
> >>>> while ago, so if you'd like to perform another pass of measuring the
> >>>> level of lock contention, that would certainly be interesting data.
> >>
> >> So, before we dig ourselves into this rathole, can someone explain to me
> >> what the problem is?  Where are the metrics / analysis showing that we
> >> have a problem?  I’d love to see a comparison, too, between various
> >> versions of ATS, say v6 - v9, and understand where, if anywhere, we made
> >> things (lock contention) worse.
> >>
> >> We have to stop building complicated and invasive solutions without a
> >> real problem, and without understanding that problem.
> >>
> >> My $0.01,
> >>
> >> — Leif
> >>
> >>>> In addition, the push for thread affinity started from actual issues
> >>>> in production with Continuations being scheduled on different threads
> >>>> of the same type (that is, it was Kees' fault).  Those would not be
> >>>> resolved by faster scheduling on different threads.
> >>>>
> >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas
> >>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>
> >>>>> I assume thread affinity can't literally mean no lock contention.
> >>>>> You'd need a lock on the thread run queue, wouldn't you?
> >>>>> Continuations don't get queued for an event thread only from that
> >>>>> same event thread.  I don't think we can say conclusively that there
> >>>>> would be a significant difference due to lock contention.  I'm
> >>>>> guessing that Fei would agree that the Continuation dispatch code is
> >>>>> difficult to understand and work on.
> >>>>> Simplification and more modularity are obviously goals.  Seems like
> >>>>> it would be simpler if all the Continuations in a to-run list were
> >>>>> actually ready to run.
> >>>>>
> >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> >>>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>>>
> >>>>>> Do you have any more specific information on mutex contention?  We
> >>>>>> have in fact already looked at doing this, I think back in 2015 with
> >>>>>> Dan Xu.  The goal there was to have queues with the mutexes to avoid
> >>>>>> rescheduling.  As a precursor, Dan did some profiling, and the only
> >>>>>> significant lock contention he could find was in the cache.  That
> >>>>>> led to the partial object caching work setting up queues for the hot
> >>>>>> locks, but it was decided that, given the lack of contention
> >>>>>> elsewhere, it wasn't worth the complexity.
> >>>>>>
> >>>>>> I think thread affinity is a better choice because no lock
> >>>>>> contention will always beat even the most optimized lock contention
> >>>>>> resolution.  If Continuations related to the same constellation of
> >>>>>> data objects are on the same thread, then the locks are never
> >>>>>> contested, which makes it as fast as possible.
> >>>>>>
> >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas
> >>>>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>
> >>>>>>> From the longer-term TSers I've heard comments about seeing
> >>>>>>> profiling results that show that waiting on mutexes is a
> >>>>>>> significant performance issue with TS.  But I'm not aware of any
> >>>>>>> write-ups of such results.  Unfortunately, I'm relatively new to TS
> >>>>>>> and Linux, so I'm not currently familiar with the best approaches
> >>>>>>> to profiling TS.
> >>>>>>>
> >>>>>>> For better performance, I think that having a single to-run
> >>>>>>> Continuation queue, or one per core, with each queue feeding
> >>>>>>> multiple event threads, is the main thing.  It's more resilient to
> >>>>>>> Continuations that block.  There doesn't seem to be enthusiasm for
> >>>>>>> getting hard-core about not having blocking Continuations (see
> >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
> >>>>>>> changing to queue-based mutexes would have a significant
> >>>>>>> performance impact.  But it seems a cleaner design, making sure
> >>>>>>> Continuations in the to-run list(s) are actually ready to run.  But
> >>>>>>> a different mutex implementation is not strictly necessary in order
> >>>>>>> to consolidate to-run Continuation queues.
> >>>>>>>
> >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra
> >>>>>>> <kspoels...@we-amp.com> wrote:
> >>>>>>>
> >>>>>>>> Sounds very interesting.  But what is the problem we're trying to
> >>>>>>>> solve here?  I like the thread affinity because it gives us
> >>>>>>>> headache-free concurrency in some cases, and I'll bet that there
> >>>>>>>> is some code which doesn't have the proper continuation mutexes
> >>>>>>>> because we know it runs on the same thread.
> >>>>>>>>
> >>>>>>>> Are we seeing a lot of imbalanced threads (too much processing
> >>>>>>>> causing long queues of continuations, which I can imagine in some
> >>>>>>>> cases)?
> >>>>>>>> And shouldn't we balance based on transactions or connections,
> >>>>>>>> move those around when we see imbalance, and aim for
> >>>>>>>> embarrassingly parallel processing? :)  Come to think of it, this
> >>>>>>>> might introduce another set of problems: how to know which
> >>>>>>>> continuations are part of the life cycle of a connection :/
> >>>>>>>>
> >>>>>>>> Jumping threads in one transaction is not always ideal either;
> >>>>>>>> this can really hurt performance.  But your proposed model seems
> >>>>>>>> to handle that somewhat better than the current implementation.
> >>>>>>>>
> >>>>>>>> Very interested, and wondering what this would mean for plugin
> >>>>>>>> developers.
> >>>>>>>>
> >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas,
> >>>>>>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put
> >>>>>>>>> in a queue specific to that mutex.  The release function for the
> >>>>>>>>> mutex (called when a Continuation holding the mutex exits) would
> >>>>>>>>> move the Continuation at the front of the mutex's queue (if not
> >>>>>>>>> empty) into the ready-to-run queue (transferring the lock to that
> >>>>>>>>> Continuation).  A drawback is that the queue would itself need a
> >>>>>>>>> mutex (spinlock?), but the critical section would be very short.
> >>>>>>>>>
> >>>>>>>>> There would be a function to lock a mutex directly.  It would
> >>>>>>>>> create a Continuation that had two condition variables.  It would
> >>>>>>>>> assign the mutex to this Continuation and schedule it.  (In this
> >>>>>>>>> case, it might make sense to put this Continuation at the front
> >>>>>>>>> of the mutex's queue, since it would be blocking an entire event
> >>>>>>>>> thread.)  The direct-lock function would then block on the first
> >>>>>>>>> condition variable.  When the Continuation ran, it would trigger
> >>>>>>>>> the first condition variable, and then block on the second
> >>>>>>>>> condition variable.  The direct-lock function would then exit,
> >>>>>>>>> allowing the calling code to enter its critical section.  At the
> >>>>>>>>> end of the critical section, another function to release the
> >>>>>>>>> direct lock would be called.  It would trigger the second
> >>>>>>>>> condition variable, which would cause the function of the
> >>>>>>>>> Continuation created for the direct lock to exit (thus releasing
> >>>>>>>>> the mutex).
> >>>>>>>>>
> >>>>>>>>> With this approach, I'm not sure thread affinities would be of
> >>>>>>>>> any value.  I think perhaps each core should have its own list of
> >>>>>>>>> ready-to-run Continuations, and a pool of event threads with
> >>>>>>>>> affinity to that core.  Not having per-event-thread ready-to-run
> >>>>>>>>> lists means that a Continuation function that blocks is less
> >>>>>>>>> likely to block other ready-to-run Continuations.  If
> >>>>>>>>> Continuations had core affinities to some degree, this might
> >>>>>>>>> reduce evictions in the per-core memory cache.  (Multiple
> >>>>>>>>> Continuations having the same function should have the same core
> >>>>>>>>> affinity.)
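To make the queue-transfer part of the quoted proposal concrete, here is a
minimal C++ sketch.  The names (Continuation, RunQueue, QueueingMutex) are
stand-ins invented for illustration rather than the actual ATS ProxyMutex /
EThread interfaces, and std::mutex is used where the proposal suggests a
spinlock:

// Minimal sketch only; hypothetical names, not actual ATS code.
#include <deque>
#include <mutex>

struct Continuation {};  // stand-in for the ATS Continuation

// Stand-in for a ready-to-run list; a real one would be per-thread or
// per-core and have its own synchronization.
struct RunQueue {
  std::deque<Continuation *> ready;
  void push(Continuation *c) { ready.push_back(c); }
};

class QueueingMutex {
public:
  // Called when a Continuation is scheduled.  Returns true if the caller
  // now holds the mutex and can dispatch the Continuation immediately;
  // otherwise the Continuation is parked on this mutex's wait queue.
  bool acquire_or_park(Continuation *c) {
    std::lock_guard<std::mutex> g(queue_lock_);  // very short critical section
    if (!held_) {
      held_ = true;
      return true;
    }
    waiters_.push_back(c);
    return false;
  }

  // Called when the Continuation holding the mutex exits.  If a waiter
  // exists, ownership transfers directly to the waiter at the front of the
  // queue, and that Continuation is moved to the ready-to-run queue.
  void release(RunQueue &run_queue) {
    Continuation *next = nullptr;
    {
      std::lock_guard<std::mutex> g(queue_lock_);
      if (waiters_.empty()) {
        held_ = false;
        return;
      }
      next = waiters_.front();  // the lock is handed off, never dropped
      waiters_.pop_front();
    }
    run_queue.push(next);  // ready to run, already holding this mutex
  }

private:
  std::mutex queue_lock_;  // protects held_ and waiters_ only (a spinlock in the proposal)
  bool held_ = false;
  std::deque<Continuation *> waiters_;
};

The point of the hand-off in release() is that the Continuation at the front
of the wait queue reaches the ready-to-run list already holding the lock, so
nothing in the to-run list is ever waiting on a mutex.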
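A similarly rough sketch of the other half of the idea: one ready-to-run
list per core, shared by a pool of event threads pinned to that core (the
single-line-with-multiple-cashiers model).  Again the types are hypothetical,
std::function stands in for a Continuation, and actual CPU pinning (for
example with pthread_setaffinity_np) is only noted in a comment:

// Minimal sketch only; hypothetical names, not actual ATS code.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One ready-to-run queue shared by every event thread pinned to one core.
class CoreRunQueue {
public:
  void push(std::function<void()> cont) {
    {
      std::lock_guard<std::mutex> g(lock_);
      ready_.push_back(std::move(cont));
    }
    cv_.notify_one();
  }

  std::function<void()> pop() {
    std::unique_lock<std::mutex> g(lock_);
    cv_.wait(g, [this] { return !ready_.empty(); });
    std::function<void()> cont = std::move(ready_.front());
    ready_.pop_front();
    return cont;
  }

private:
  std::mutex lock_;  // short critical sections around the queue only
  std::condition_variable cv_;
  std::deque<std::function<void()>> ready_;
};

// A pool of event threads for one core, all draining the same queue
// ("one line, many cashiers").  A real implementation would also pin each
// thread to the core, e.g. with pthread_setaffinity_np; omitted here.
void run_event_threads(CoreRunQueue &queue, int threads_per_core) {
  std::vector<std::thread> pool;
  for (int i = 0; i < threads_per_core; ++i) {
    pool.emplace_back([&queue] {
      for (;;) {
        std::function<void()> cont = queue.pop();
        cont();  // if this blocks, sibling threads keep draining the queue
      }
    });
  }
  for (std::thread &t : pool) {
    t.join();
  }
}

Because all the event threads for a core pull from the same queue, a
Continuation that blocks ties up only the thread running it; the work queued
behind it is picked up by the other threads in the pool.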