It's reasonably likely this was discussed on the mailing list the last time around, probably in the 2015-2016 time frame. Alternatively, it might have been a JIRA ticket on the old ASF Jira.

On Thu, Oct 3, 2019 at 9:57 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>
> Thread affinity seems like a worthwhile incremental improvement for the current approach. I'm just offering alternative approaches for consideration.
>
> Were any measurements sent to this mailing list or recorded somewhere else?
>
> On Thu, Oct 3, 2019 at 9:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >
> > I didn't see the thread affinity as a rathole, but more as a regularization of existing practice. That kind of thing was already being done in many places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in the Thread support classes meant there was a single implementation, not several, with the consequent and highly desired consistency of operation.
> >
> > I also considered it very desirable for plugins to be able to do similar things, which is why I championed Kees' request on that. Any plugin doing asynchronous operations found it difficult or impossible to do the correct thing. Now it's easy and mostly automatic. As for tests on that, we certainly had mysterious crashes and other problems due to the lack of such a mechanism, and even now we're still cleaning up plugins that do bad things with scheduling Continuations. The difference now is we can put in a clean fix rather than obscure hacks. This went through because it addressed, as Kees noted, a well-known and chronic issue in a reasonable way, formalizing what had been unwritten rules.
> >
> > Let me note again: we considered this previously, did actual measurements, and concluded the gains were much too small to justify the code (except in the cache, which does have hot locks). And as noted above, given that we were already doing crude forms of thread affinity, it seemed making that more general and consistent was the best approach to reducing lock contention, especially as it provided other benefits.
> >
> > On Wed, Oct 2, 2019 at 4:30 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > >
> > > On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> > > >
> > > > I think we're all able to dig some ratholes with less code than 30 lines.
> > >
> > > Hmm, I thought "rathole" implied unneeded complexity, but it seems to be just a synonym for "bad".
> > >
> > > > Alan, it's only my fault that I pushed for thread affinity for plugins after I saw how it was in core... Can't keep all the goodies for the core :)
> > > >
> > > > One thing to watch out for is in-order execution of continuations. As I've mentioned earlier, there is a big possibility that there is code which does not properly lock (or does not need to) and might end up executing continuations at the same time, instead of in order, when run concurrently. Sometimes, to get out of the current executing context, which might be deep in the stack (x locks down and before some cleanup), you will schedule another continuation on the same thread, which can give you a cleaner context. Imagine the scheduler picking that up and running it concurrently on another thread on the same core; that might yield weird results. Multiple threads might break that implicit thread safety.
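> > > >
> > > > To make that concrete, here's a minimal sketch of the implicit assumption, with std::thread and plain functions standing in for EThreads and Continuations (cont_a, cont_b, and the shared vector are made up for illustration):
> > > >
> > > >   #include <iostream>
> > > >   #include <thread>
> > > >   #include <vector>
> > > >
> > > >   std::vector<int> state; // shared, deliberately unlocked
> > > >
> > > >   void cont_a() { state.push_back(1); }        // scheduled first
> > > >   void cont_b() { std::cout << state.back(); } // assumes cont_a already ran
> > > >
> > > >   int main()
> > > >   {
> > > >     // Pinned to one thread, the handlers run back to back -- safe
> > > >     // with no continuation mutex at all.
> > > >     cont_a();
> > > >     cont_b();
> > > >
> > > >     // Picked up by two workers instead: cont_b may run before or
> > > >     // during cont_a, so this is now a data race / ordering bug.
> > > >     std::thread w1(cont_a);
> > > >     std::thread w2(cont_b);
> > > >     w1.join();
> > > >     w2.join();
> > > >   }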
> > >
> > > When one Continuation schedules another, it could go into a queue of Continuations that become runnable only after the current Continuation exits. So the pseudo-code becomes:
> > >
> > >   bool idle[Num_workers_per_core];
> > >   Condition_variable go[Num_workers_per_core];
> > >   Continuation *curr[Num_workers_per_core];
> > >   Continuation_queue after[Num_workers_per_core]; // Run after current continuation.
> > >
> > >   while (true)
> > >   {
> > >     blocking get on message queue;
> > >     switch (msg.type)
> > >     {
> > >     case WORKER_THREAD_READY_FOR_NEXT: // Sent by worker thread.
> > >       // Successors deferred by the finished Continuation are now runnable.
> > >       while (after[msg.worker_index] is not empty) {
> > >         c = pop from after[msg.worker_index];
> > >         push c to run queue;
> > >       }
> > >       if (run queue empty) {
> > >         idle[msg.worker_index] = true;
> > >       } else {
> > >         curr[msg.worker_index] = dequeue from run queue;
> > >         trigger go[msg.worker_index];
> > >       }
> > >       break;
> > >
> > >     case QUEUE_CONTINUATION_TO_RUN:
> > >       run_index = index where idle[index] is true, or none;
> > >       if (run_index == none) {
> > >         enqueue msg.continuation in run queue;
> > >       } else {
> > >         idle[run_index] = false;
> > >         curr[run_index] = msg.continuation;
> > >         trigger go[run_index];
> > >       }
> > >       break;
> > >     } // end switch
> > >   } // end while
> > >
> > > On the other hand, it's not that hard for a specific Continuation to not trigger successive Continuations until they are actually ready to run. It would simply need to have its own "after" queue. We should be selective about how many capabilities we pull into the core Event execution code, and about how much we damage general behavior by doing so.
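> > >
> > > A minimal sketch of a Continuation managing its own "after" queue (Cont and run() here are simplified stand-ins, not the actual event-system classes):
> > >
> > >   #include <deque>
> > >   #include <functional>
> > >   #include <iostream>
> > >
> > >   struct Cont
> > >   {
> > >     std::function<void()> handler;
> > >     std::deque<Cont *>    after; // successors held until handler exits
> > >
> > >     void schedule_after(Cont *c) { after.push_back(c); }
> > >   };
> > >
> > >   void run(Cont *c)
> > >   {
> > >     c->handler();               // may call schedule_after() while running
> > >     while (!c->after.empty()) { // successors become runnable only now
> > >       Cont *next = c->after.front();
> > >       c->after.pop_front();
> > >       run(next);                // stand-in for pushing onto the run queue
> > >     }
> > >   }
> > >
> > >   int main()
> > >   {
> > >     Cont first, second;
> > >     second.handler = [] { std::cout << "second\n"; };
> > >     first.handler  = [&] {
> > >       first.schedule_after(&second); // queued, not run, until we return
> > >       std::cout << "first\n";
> > >     };
> > >     run(&first); // prints "first" then "second", never concurrently
> > >   }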
> > > >
> > > > Locking mostly gets/got screwy in the plugins because of possible thread/core jumping; the plugin affinity functions help a lot with that. There are patterns to prevent it, like try-lock, but they're a bit of a pain and I've seen people ignore them.
> > > >
> > > > Some findings we had when doing h2: thread affinity might not always yield the best latency on an almost-idle system. If the protocols on both sides are relatively heavyweight, the packing and unpacking will take time, and one can imagine there is some parallelism which could decrease latency. We've thought about controlled handover to other threads, but never made any work of it.
> > > >
> > > > I can imagine that with QUIC coming we might see some thread jumping again; I haven't been following progress on that one.
> > > >
> > > > On Wed, 2 Oct 2019, 21:33 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> > > > >
> > > > > On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
> > > > > >
> > > > > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> > > > > > >
> > > > > > > I feel like you overestimate me if you think I could create a rathole with just 30 lines of pseudo-code.
> > > > > >
> > > > > > We have faith in you.
> > > > >
> > > > > What's an example of a rathole I've already created? Seems like a very big leap of faith if there isn't one.
> > > > >
> > > > > Question: Why do we have more event threads than cores?
> > > > >
> > > > > If it's to mitigate the fact that ATS is not purist event-driven, i.e. that Continuations sometimes do block, then it makes sense to use the best design to mitigate this. Sorry, but I don't think any big profiling hoedown is needed to see that what I'm proposing mitigates it better. It's essentially the same idea as having a single line to pay even when there are multiple cashiers.
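> > > > >
> > > > > As a minimal sketch of that single-line model (plain std::thread workers and std::function continuations as stand-ins; a real version would pin the pool to one core and handle shutdown):
> > > > >
> > > > >   #include <condition_variable>
> > > > >   #include <deque>
> > > > >   #include <functional>
> > > > >   #include <mutex>
> > > > >   #include <thread>
> > > > >   #include <vector>
> > > > >
> > > > >   std::mutex                        q_mtx;
> > > > >   std::condition_variable           q_cv;
> > > > >   std::deque<std::function<void()>> run_q; // the one shared "line"
> > > > >
> > > > >   void schedule(std::function<void()> cont)
> > > > >   {
> > > > >     { std::lock_guard<std::mutex> lk(q_mtx); run_q.push_back(std::move(cont)); }
> > > > >     q_cv.notify_one();
> > > > >   }
> > > > >
> > > > >   void worker()
> > > > >   {
> > > > >     for (;;) {
> > > > >       std::unique_lock<std::mutex> lk(q_mtx);
> > > > >       q_cv.wait(lk, [] { return !run_q.empty(); });
> > > > >       auto cont = std::move(run_q.front());
> > > > >       run_q.pop_front();
> > > > >       lk.unlock();
> > > > >       cont(); // if this blocks, the other "cashiers" keep serving the line
> > > > >     }
> > > > >   }
> > > > >
> > > > >   int main()
> > > > >   {
> > > > >     std::vector<std::thread> pool;
> > > > >     for (int i = 0; i < 4; ++i) // several workers sharing one core's queue
> > > > >       pool.emplace_back(worker);
> > > > >     schedule([] { /* Continuation body */ });
> > > > >     for (auto &t : pool) t.join(); // sketch only: workers never exit
> > > > >   }
> > > > >
> > > > > A blocked Continuation takes one cashier out of service; the queue behind it keeps draining through the others.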
> > > > > > >
> > > > > > > We're making complex thread affinity changes on top of event execution code that already seems complex. (I leave it to those with more rathole background to say if it should be so labeled.) Can I see the profiling results used to justify and guide the thread affinity design?
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > — leif
> > > > > >
> > > > > > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> > > > > > > >
> > > > > > > > On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > > What's the best tool for multi-threaded profiling on Linux?
> > > > > > > >
> > > > > > > > Probably the Intel tools. Probably the "perf" toolchain works well, at least on modern Linux; you could do perf lock, etc.
> > > > > > > >
> > > > > > > > > On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Correct, it doesn't mean no lock contention. The claim was that it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
> > > > > > > >
> > > > > > > > So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I'd love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.
> > > > > > > >
> > > > > > > > We have to stop making complicated and invasive solutions without real problems, and without understanding such problems.
> > > > > > > >
> > > > > > > > My $0.01,
> > > > > > > >
> > > > > > > > — Leif
> > > > > > > >
> > > > > > > > > > In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations can't only get queued for the event thread from the event thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity is obviously a goal. Seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
> > > > > > > > > > > >
> > > > > > > > > > > > I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock-contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For better performance, I think that having a single to-run Continuation queue, or one per core, with each queue feeding multiple event threads, is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412). I'm not sure changing to queue-based mutexes would have a significant performance impact, but it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. In any case, a different mutex implementation is not strictly necessary in order to consolidate to-run Continuation queues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Very interested, and wondering what this would mean for plugin developers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If a Continuation is scheduled but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would put the Continuation at the front of the mutex's queue (if not empty) into the ready-to-run queue (transferring the lock to that Continuation). A drawback is that the queue would itself need a mutex (spinlock?), but the critical section would be very short.
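> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A rough sketch of that mutex queue and its release handoff (illustrative types only; make_ready_to_run() stands in for pushing onto the real ready-to-run queue):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   #include <deque>
> > > > > > > > > > > > > > >   #include <mutex>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct Cont; // a scheduled Continuation
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   // Stand-in: the real version pushes onto the ready-to-run queue.
> > > > > > > > > > > > > > >   static void make_ready_to_run(Cont *) {}
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct QueuedMutex
> > > > > > > > > > > > > > >   {
> > > > > > > > > > > > > > >     std::mutex         guard; // very short critical section
> > > > > > > > > > > > > > >     bool               held = false;
> > > > > > > > > > > > > > >     std::deque<Cont *> waiters;
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void schedule(Cont *c) // a Continuation wants to run under this mutex
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::lock_guard<std::mutex> lk(guard);
> > > > > > > > > > > > > > >       if (!held) {
> > > > > > > > > > > > > > >         held = true;
> > > > > > > > > > > > > > >         make_ready_to_run(c); // lock granted immediately
> > > > > > > > > > > > > > >       } else {
> > > > > > > > > > > > > > >         waiters.push_back(c); // parked on the mutex, not on a run queue
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void release() // called when the holding Continuation exits
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::lock_guard<std::mutex> lk(guard);
> > > > > > > > > > > > > > >       if (waiters.empty()) {
> > > > > > > > > > > > > > >         held = false;
> > > > > > > > > > > > > > >       } else {
> > > > > > > > > > > > > > >         Cont *next = waiters.front();
> > > > > > > > > > > > > > >         waiters.pop_front();
> > > > > > > > > > > > > > >         make_ready_to_run(next); // ownership transfers directly to next
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >   };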
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There would be a function to lock a mutex directly. It would create a Continuation that had two condition variables. It would assign the mutex to this Continuation and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable and then block on the second condition variable. The direct-lock function would then exit, allowing the calling code to enter its critical section. At the end of the critical section, another function to release the direct lock would be called. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
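> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A rough sketch of that direct-lock handshake (two std::condition_variables; the scheduling of the helper Continuation onto the mutex's queue is elided):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   #include <condition_variable>
> > > > > > > > > > > > > > >   #include <mutex>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct DirectLockCont
> > > > > > > > > > > > > > >   {
> > > > > > > > > > > > > > >     std::mutex              m;        // protects the two flags below
> > > > > > > > > > > > > > >     std::condition_variable entered;  // 1st: helper holds the target mutex
> > > > > > > > > > > > > > >     std::condition_variable released; // 2nd: caller left the critical section
> > > > > > > > > > > > > > >     bool in_section = false;
> > > > > > > > > > > > > > >     bool done       = false;
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     // Handler, run by the event system once the target mutex is acquired.
> > > > > > > > > > > > > > >     void handle_event()
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::unique_lock<std::mutex> lk(m);
> > > > > > > > > > > > > > >       in_section = true;
> > > > > > > > > > > > > > >       entered.notify_one(); // unblock the direct-lock caller
> > > > > > > > > > > > > > >       released.wait(lk, [this] { return done; });
> > > > > > > > > > > > > > >       // returning here is what releases the target mutex
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void direct_lock() // the real version would first schedule *this
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::unique_lock<std::mutex> lk(m);
> > > > > > > > > > > > > > >       entered.wait(lk, [this] { return in_section; });
> > > > > > > > > > > > > > >       // the caller now effectively holds the target mutex
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void direct_unlock()
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       {
> > > > > > > > > > > > > > >         std::lock_guard<std::mutex> lk(m);
> > > > > > > > > > > > > > >         done = true;
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >       released.notify_one();
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >   };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The calling code would just do direct_lock(), run its critical section, then direct_unlock(), with handle_event() pinning the target mutex for the whole span.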