I feel like you overestimate me if you think I could create a rathole with just 30 lines of pseudo-code.
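That said, since the per-mutex queue idea keeps coming up (it's quoted at the bottom of this thread), here is roughly the shape of it as a standalone sketch. To be clear, this is illustrative only: the type and function names (SchedMutex, ReadyQueue, schedule, release) are invented, and it ignores the real Continuation / ProxyMutex / EThread machinery.

// Standalone sketch only -- not the real ATS types.  "Cont", "SchedMutex", and
// "ReadyQueue" are stand-ins for Continuation, ProxyMutex, and a to-run list;
// every name here is invented for illustration.
#include <deque>
#include <memory>
#include <mutex>

struct Cont;  // stand-in for Continuation

struct ReadyQueue {
  std::mutex m;              // short critical section only
  std::deque<Cont *> ready;  // Continuations whose mutex is already held for them

  void push(Cont *c) {
    std::lock_guard<std::mutex> g(m);
    ready.push_back(c);
  }
};

struct SchedMutex {
  std::mutex q_mutex;          // protects holder and waiters (very brief hold)
  Cont *holder = nullptr;      // Continuation currently owning the lock
  std::deque<Cont *> waiters;  // Continuations scheduled while the lock was held
};

struct Cont {
  std::shared_ptr<SchedMutex> mutex;
  // handler, event codes, etc. elided
};

// Scheduling: if the Continuation's mutex is free, take it on the spot and go
// straight to the ready-to-run queue; otherwise park it on the mutex's own
// wait queue instead of rescheduling it.
void schedule(Cont *c, ReadyQueue &rq) {
  SchedMutex &mx = *c->mutex;
  bool acquired = false;
  {
    std::lock_guard<std::mutex> g(mx.q_mutex);
    if (mx.holder == nullptr) {
      mx.holder = c;  // lock acquired on behalf of this Continuation
      acquired = true;
    } else {
      mx.waiters.push_back(c);
    }
  }
  if (acquired) {
    rq.push(c);
  }
}

// Release: called when the Continuation holding the mutex exits.  Instead of
// unlocking and letting everyone retry, ownership is handed directly to the
// waiter at the front of the queue, which then becomes ready to run.
void release(SchedMutex &mx, ReadyQueue &rq) {
  Cont *next = nullptr;
  {
    std::lock_guard<std::mutex> g(mx.q_mutex);
    if (mx.waiters.empty()) {
      mx.holder = nullptr;
    } else {
      next = mx.waiters.front();
      mx.waiters.pop_front();
      mx.holder = next;  // lock transferred, never observably "free"
    }
  }
  if (next != nullptr) {
    rq.push(next);  // guaranteed to hold its mutex when it runs
  }
}

The point is just that a Continuation only ever lands on a to-run list already holding its mutex, and on release the lock is handed off to the next waiter rather than dropped and re-contended.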
We're making complex thread affinity changes, on top of event execution code that already seems complex. (I leave it to those with more rathole background to say whether it should be so labeled.) Can I see the profiling results used to justify and guide the thread affinity design?

On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:

> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
>
> > What's the best tool for multi-threaded profiling on Linux?
>
> Probably the Intel tools. Probably the “perf” toolchain works well, at least on modern linux, you could do perf lock etc.
>
> > On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >
> >> Correct, it doesn't mean no lock contention. The claim was it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
>
> So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I’d love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.
>
> We have to stop making complicated and invasive solutions without real problems, and without understanding such problems.
>
> My $0.01,
>
> — Leif
>
> >> In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
> >>
> >> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> >>
> >>> I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations aren't only queued to an event thread from that same thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity is obviously a goal. Seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
> >>>
> >>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>
> >>>> Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
> >>>> I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
> >>>>
> >>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> >>>>
> >>>>> From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
> >>>>>
> >>>>> For better performance, I think that having a single to-run Continuation queue, or one per core, with a queue feeding multiple event threads is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412 ). I'm not sure changing to queue-based mutexes would have a significant performance impact. But it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. A different mutex implementation is not strictly necessary, however, in order to consolidate to-run Continuation queues.
> >>>>>
> >>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> >>>>>
> >>>>>> Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
> >>>>>>
> >>>>>> Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
> >>>>>>
> >>>>>> Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
> >>>>>>
> >>>>>> Very interested, and wondering what this would mean for plugin developers.
> >>>>>>
> >>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>
> >>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put in a queue specific to that mutex.
> >>>>>>> The release function for the mutex (called when a Continuation holding the mutex exits) would put the Continuation at the front of the mutex's queue (if not empty) into the ready-to-run queue (transferring the lock to that Continuation). A drawback is that the queue would itself need a mutex (spinlock?), but the critical section would be very short.
> >>>>>>>
> >>>>>>> There would be a function to lock a mutex directly. It would create a Continuation that had two condition variables. It would assign the mutex to this Continuation and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable, and then block on the second condition variable. The direct-lock function would then exit, allowing the calling code to enter its critical section. At the end of the critical section, another function to release the direct lock would be called. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
> >>>>>>>
> >>>>>>> With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
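P.S. For reference, the direct-lock flow described in my Sep 30 mail (quoted above) comes out to roughly the following. Again the names are invented, and the step where the helper Continuation is scheduled onto an event thread (via the per-mutex queue) is elided.

// Sketch of the "direct lock" helper, for code that is not running as a
// Continuation but needs a Continuation mutex.  Invented names, standalone,
// and simplified: in the real thing, handler() would be the handler of a
// Continuation that the scheduler runs once it holds the target mutex.
#include <condition_variable>
#include <mutex>

// Two-phase rendezvous between the blocking caller and the helper Continuation.
struct DirectLockCont {
  std::mutex m;
  std::condition_variable cv_locked;   // phase 1: "the mutex is now held for you"
  std::condition_variable cv_release;  // phase 2: "caller is done, let go of it"
  bool locked   = false;
  bool released = false;

  // The helper Continuation's handler, run on an event thread once the
  // scheduler has granted it the target mutex.
  void handler() {
    std::unique_lock<std::mutex> g(m);
    locked = true;
    cv_locked.notify_one();  // wake the blocked direct_lock() caller
    cv_release.wait(g, [this] { return released; });
    // returning here is what lets the scheduler release / hand off the mutex
  }
};

// Called by the non-Continuation code that needs the mutex.  Blocks the
// calling thread until the helper Continuation has run and owns the mutex.
void direct_lock(DirectLockCont &c) {
  std::unique_lock<std::mutex> g(c.m);
  c.cv_locked.wait(g, [&c] { return c.locked; });
}

// Ends the caller's critical section: lets handler() return, which in turn
// releases (or hands off) the mutex it was holding.
void direct_unlock(DirectLockCont &c) {
  {
    std::lock_guard<std::mutex> g(c.m);
    c.released = true;
  }
  c.cv_release.notify_one();
}

Note that while the caller is inside its critical section, the event thread running the helper Continuation is parked on the second condition variable; that's why the original description suggests putting this Continuation at the front of the mutex's queue.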
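P.P.S. And the per-core layout from the end of that mail, sketched the same way: one ready-to-run list per core, drained by a small pool of event threads with affinity to that core. Names invented; CPU pinning and shutdown handling are omitted. (As noted earlier in the thread, the list itself still needs a lock, since other threads enqueue onto it, but the critical section is tiny.)

// Sketch of per-core ready-to-run lists fed to a pool of event threads.
// All names are invented; this is not the ATS EThread implementation.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Cont {
  void (*handler)(Cont *);  // stand-in for the Continuation handler
};

// One of these per core, shared by all event threads pinned to that core.
struct CoreRunQueue {
  std::mutex m;  // still needed: any thread may enqueue here
  std::condition_variable cv;
  std::deque<Cont *> ready;

  void push(Cont *c) {
    {
      std::lock_guard<std::mutex> g(m);
      ready.push_back(c);
    }
    cv.notify_one();
  }

  Cont *pop() {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [this] { return !ready.empty(); });
    Cont *c = ready.front();
    ready.pop_front();
    return c;
  }
};

// Each event thread just drains its core's queue.  If one handler blocks, the
// other threads on the same core keep draining the list.
void event_thread_loop(CoreRunQueue &q) {
  // (a real implementation would also pin this thread to its core, e.g. with
  // pthread_setaffinity_np)
  for (;;) {
    Cont *c = q.pop();
    c->handler(c);
  }
}

// Start a small pool of event threads for one core.
std::vector<std::thread> start_core_pool(CoreRunQueue &q, int nthreads) {
  std::vector<std::thread> pool;
  for (int i = 0; i < nthreads; ++i) {
    pool.emplace_back(event_thread_loop, std::ref(q));
  }
  return pool;
}

With several threads per core draining one list, a handler that blocks stalls only the thread it happens to be running on, not every Continuation queued behind it, which is the resilience argument from the Sep 30 mail.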