It's reasonably likely this was discussed on the mailing list the last time around, probably in the 2015-2016 time frame. Alternatively, it might have been a JIRA ticket on the old ASF Jira.

On Thu, Oct 3, 2019 at 9:57 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
>
> Thread affinity seems like a worthwhile incremental improvement for the current approach. I'm just offering alternative approaches for consideration.
>
> Were any measurements sent to this mailing list or recorded somewhere else?
>
> On Thu, Oct 3, 2019 at 9:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >
> > I didn't see the thread affinity as a rathole, but more as a regularization of existing practice. That kind of thing was already being done in many places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in the Thread support classes meant there was a single implementation, not several, with the consequent and highly desired consistency of operation.
> >
> > I also considered it very desirable for plugins to be able to do similar things, which is why I championed Kees' request on that. Any plugin doing asynchronous operations found it difficult or impossible to do the correct thing. Now it's easy and mostly automatic. As for tests on that, we certainly had mysterious crashes and other problems due to the lack of such a mechanism, and even now we're still cleaning up plugins that do bad things with scheduling Continuations. The difference now is we can put in a clean fix rather than obscure hacks. This went through because it addressed, as Kees noted, a well-known and chronic issue in a reasonable way, formalizing what had been unwritten rules.
> >
> > Let me note again: we considered this previously, did actual measurements, and concluded the gains were much too small to justify the code (except in the cache, which does have hot locks). And as noted above, given that we were already doing crude forms of thread affinity, it seemed making that more general and consistent was the best approach to reducing lock contention, especially as it provided other benefits.
> >
> > On Wed, Oct 2, 2019 at 4:30 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > >
> > > On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> > > >
> > > > I think we're all able to dig some ratholes with less code than 30 lines.
> > >
> > > Hmm, I thought "rathole" implied unneeded complexity, but it seems to be just a synonym for "bad".
> > >
> > > > Alan, it's only my fault that I pushed for thread affinity for plugins after I saw how it was in core... Can't keep all the goodies for the core :)
> > > >
> > > > One thing to watch out for is in-order execution of continuations. As I've mentioned earlier, there is a big possibility that there is code which does not properly lock (or does not need to) and might end up executing continuations at the same time, instead of in order, when run concurrently. Sometimes, to get out of the current executing context, which might be deep in the stack (x locks down and before some cleanup), you will schedule another continuation on the same thread, which can give you a cleaner context. Imagine the scheduler picking that up and running it concurrently on another thread on the same core; that might yield weird results. Multiple threads might break that implicit thread safety.
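> > > >
> > > > To make that concrete, here's a minimal sketch of the implicit assumption, with std::thread and plain functions standing in for EThreads and Continuations (cont_a, cont_b, and the shared vector are made up for illustration):
> > > >
> > > >   #include <iostream>
> > > >   #include <thread>
> > > >   #include <vector>
> > > >
> > > >   std::vector<int> state; // shared, deliberately unlocked
> > > >
> > > >   void cont_a() { state.push_back(1); }        // scheduled first
> > > >   void cont_b() { std::cout << state.back(); } // assumes cont_a already ran
> > > >
> > > >   int main()
> > > >   {
> > > >     // Pinned to one thread, the handlers run back to back -- safe
> > > >     // with no continuation mutex at all.
> > > >     cont_a();
> > > >     cont_b();
> > > >
> > > >     // Picked up by two workers instead: cont_b may run before or
> > > >     // during cont_a, so this is now a data race / ordering bug.
> > > >     std::thread w1(cont_a);
> > > >     std::thread w2(cont_b);
> > > >     w1.join();
> > > >     w2.join();
> > > >   }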
> > >
> > > When one Continuation schedules another, it could go into a queue of Continuations that become runnable only after the current Continuation exits. So the pseudo-code becomes:
> > >
> > >   bool idle[Num_workers_per_core];
> > >   Condition_variable go[Num_workers_per_core];
> > >   Continuation *curr[Num_workers_per_core];
> > >   Continuation_queue after[Num_workers_per_core]; // Run after current continuation.
> > >
> > >   while (true)
> > >   {
> > >     blocking get on message queue;
> > >     switch (msg.type)
> > >     {
> > >     case WORKER_THREAD_READY_FOR_NEXT: // Sent by worker thread.
> > >       // Successors deferred by the finished Continuation are now runnable.
> > >       while (after[msg.worker_index] is not empty) {
> > >         c = pop from after[msg.worker_index];
> > >         push c to run queue;
> > >       }
> > >       if (run queue empty) {
> > >         idle[msg.worker_index] = true;
> > >       } else {
> > >         curr[msg.worker_index] = dequeue from run queue;
> > >         trigger go[msg.worker_index];
> > >       }
> > >       break;
> > >
> > >     case QUEUE_CONTINUATION_TO_RUN:
> > >       run_index = index where idle[index] is true, or none;
> > >       if (run_index == none) {
> > >         enqueue msg.continuation in run queue;
> > >       } else {
> > >         idle[run_index] = false;
> > >         curr[run_index] = msg.continuation;
> > >         trigger go[run_index];
> > >       }
> > >       break;
> > >     } // end switch
> > >   } // end while
> > >
> > > On the other hand, it's not that hard for a specific Continuation to not trigger successive Continuations until they are actually ready to run. It would simply need to have its own "after" queue. We should be selective about how many capabilities we pull into the core Event execution code, and about how much we damage general behavior by doing so.
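> > >
> > > A minimal sketch of a Continuation managing its own "after" queue (Cont and run() here are simplified stand-ins, not the actual event-system classes):
> > >
> > >   #include <deque>
> > >   #include <functional>
> > >   #include <iostream>
> > >
> > >   struct Cont
> > >   {
> > >     std::function<void()> handler;
> > >     std::deque<Cont *>    after; // successors held until handler exits
> > >
> > >     void schedule_after(Cont *c) { after.push_back(c); }
> > >   };
> > >
> > >   void run(Cont *c)
> > >   {
> > >     c->handler();               // may call schedule_after() while running
> > >     while (!c->after.empty()) { // successors become runnable only now
> > >       Cont *next = c->after.front();
> > >       c->after.pop_front();
> > >       run(next);                // stand-in for pushing onto the run queue
> > >     }
> > >   }
> > >
> > >   int main()
> > >   {
> > >     Cont first, second;
> > >     second.handler = [] { std::cout << "second\n"; };
> > >     first.handler  = [&] {
> > >       first.schedule_after(&second); // queued, not run, until we return
> > >       std::cout << "first\n";
> > >     };
> > >     run(&first); // prints "first" then "second", never concurrently
> > >   }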
> > > >
> > > > Locking mostly gets/got screwy in the plugins because of possible thread/core jumping; the plugin affinity functions help a lot with that. There are patterns to prevent it, like try-lock, but they're a bit of a pain and I've seen people ignore them.
> > > >
> > > > Some findings we had when doing h2: thread affinity might not always yield the best latency on an almost-idle system. If the protocols on both sides are relatively heavyweight, the packing and unpacking will take time, and one can imagine there is some parallelism which could decrease latency. We've thought about controlled handover to other threads, but never made any work of it.
> > > >
> > > > I can imagine that with QUIC coming we might see some thread jumping again; I haven't been following progress on that one.
> > > >
> > > > On Wed, 2 Oct 2019, 21:33 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> > > > >
> > > > > On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
> > > > > >
> > > > > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> > > > > > >
> > > > > > > I feel like you overestimate me if you think I could create a rathole with just 30 lines of pseudo-code.
> > > > > >
> > > > > > We have faith in you.
> > > > >
> > > > > What's an example of a rathole I've already created? Seems like a very big leap of faith if there isn't one.
> > > > >
> > > > > Question: Why do we have more event threads than cores?
> > > > >
> > > > > If it's to mitigate the fact that ATS is not purist event-driven, i.e. that Continuations sometimes do block, then it makes sense to use the best design to mitigate this. Sorry, but I don't think any big profiling hoedown is needed to see that what I'm proposing mitigates it better. It's essentially the same idea as having a single line to pay even when there are multiple cashiers.
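> > > > >
> > > > > As a minimal sketch of that single-line model (plain std::thread workers and std::function continuations as stand-ins; a real version would pin the pool to one core and handle shutdown):
> > > > >
> > > > >   #include <condition_variable>
> > > > >   #include <deque>
> > > > >   #include <functional>
> > > > >   #include <mutex>
> > > > >   #include <thread>
> > > > >   #include <vector>
> > > > >
> > > > >   std::mutex                        q_mtx;
> > > > >   std::condition_variable           q_cv;
> > > > >   std::deque<std::function<void()>> run_q; // the one shared "line"
> > > > >
> > > > >   void schedule(std::function<void()> cont)
> > > > >   {
> > > > >     { std::lock_guard<std::mutex> lk(q_mtx); run_q.push_back(std::move(cont)); }
> > > > >     q_cv.notify_one();
> > > > >   }
> > > > >
> > > > >   void worker()
> > > > >   {
> > > > >     for (;;) {
> > > > >       std::unique_lock<std::mutex> lk(q_mtx);
> > > > >       q_cv.wait(lk, [] { return !run_q.empty(); });
> > > > >       auto cont = std::move(run_q.front());
> > > > >       run_q.pop_front();
> > > > >       lk.unlock();
> > > > >       cont(); // if this blocks, the other "cashiers" keep serving the line
> > > > >     }
> > > > >   }
> > > > >
> > > > >   int main()
> > > > >   {
> > > > >     std::vector<std::thread> pool;
> > > > >     for (int i = 0; i < 4; ++i) // several workers sharing one core's queue
> > > > >       pool.emplace_back(worker);
> > > > >     schedule([] { /* Continuation body */ });
> > > > >     for (auto &t : pool) t.join(); // sketch only: workers never exit
> > > > >   }
> > > > >
> > > > > A blocked Continuation takes one cashier out of service; the queue behind it keeps draining through the others.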
> > > > > > >
> > > > > > > We're making complex thread affinity changes on top of event execution code that already seems complex. (I leave it to those with more rathole background to say if it should be so labeled.) Can I see the profiling results used to justify and guide the thread affinity design?
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > — leif
> > > > > >
> > > > > > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> > > > > > > >
> > > > > > > > On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > > What's the best tool for multi-threaded profiling on Linux?
> > > > > > > >
> > > > > > > > Probably the Intel tools. Probably the "perf" toolchain works well, at least on modern Linux; you could do perf lock, etc.
> > > > > > > >
> > > > > > > > > On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Correct, it doesn't mean no lock contention. The claim was that it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
> > > > > > > >
> > > > > > > > So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I'd love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.
> > > > > > > >
> > > > > > > > We have to stop making complicated and invasive solutions without real problems, and without understanding such problems.
> > > > > > > >
> > > > > > > > My $0.01,
> > > > > > > >
> > > > > > > > — Leif
> > > > > > > >
> > > > > > > > > > In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations can't only get queued for the event thread from the event thread. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity is obviously a goal. Seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling. As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
> > > > > > > > > > > >
> > > > > > > > > > > > I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock-contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For better performance, I think that having a single to-run Continuation queue, or one per core, with each queue feeding multiple event threads, is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412). I'm not sure changing to queue-based mutexes would have a significant performance impact, but it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. In any case, a different mutex implementation is not strictly necessary in order to consolidate to-run Continuation queues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)? And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Very interested, and wondering what this would mean for plugin developers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If a Continuation is scheduled but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would put the Continuation at the front of the mutex's queue (if not empty) into the ready-to-run queue (transferring the lock to that Continuation). A drawback is that the queue would itself need a mutex (spinlock?), but the critical section would be very short.
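> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A rough sketch of that mutex queue and its release handoff (illustrative types only; make_ready_to_run() stands in for pushing onto the real ready-to-run queue):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   #include <deque>
> > > > > > > > > > > > > > >   #include <mutex>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct Cont; // a scheduled Continuation
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   // Stand-in: the real version pushes onto the ready-to-run queue.
> > > > > > > > > > > > > > >   static void make_ready_to_run(Cont *) {}
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct QueuedMutex
> > > > > > > > > > > > > > >   {
> > > > > > > > > > > > > > >     std::mutex         guard; // very short critical section
> > > > > > > > > > > > > > >     bool               held = false;
> > > > > > > > > > > > > > >     std::deque<Cont *> waiters;
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void schedule(Cont *c) // a Continuation wants to run under this mutex
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::lock_guard<std::mutex> lk(guard);
> > > > > > > > > > > > > > >       if (!held) {
> > > > > > > > > > > > > > >         held = true;
> > > > > > > > > > > > > > >         make_ready_to_run(c); // lock granted immediately
> > > > > > > > > > > > > > >       } else {
> > > > > > > > > > > > > > >         waiters.push_back(c); // parked on the mutex, not on a run queue
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void release() // called when the holding Continuation exits
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::lock_guard<std::mutex> lk(guard);
> > > > > > > > > > > > > > >       if (waiters.empty()) {
> > > > > > > > > > > > > > >         held = false;
> > > > > > > > > > > > > > >       } else {
> > > > > > > > > > > > > > >         Cont *next = waiters.front();
> > > > > > > > > > > > > > >         waiters.pop_front();
> > > > > > > > > > > > > > >         make_ready_to_run(next); // ownership transfers directly to next
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >   };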
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There would be a function to lock a mutex directly. It would create a Continuation that had two condition variables. It would assign the mutex to this Continuation and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable and then block on the second condition variable. The direct-lock function would then exit, allowing the calling code to enter its critical section. At the end of the critical section, another function to release the direct lock would be called. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With this approach, I'm not sure thread affinities would be of any value. I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
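> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A rough sketch of that direct-lock handshake (two std::condition_variables; the scheduling of the helper Continuation onto the mutex's queue is elided):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   #include <condition_variable>
> > > > > > > > > > > > > > >   #include <mutex>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   struct DirectLockCont
> > > > > > > > > > > > > > >   {
> > > > > > > > > > > > > > >     std::mutex              m;        // protects the two flags below
> > > > > > > > > > > > > > >     std::condition_variable entered;  // 1st: helper holds the target mutex
> > > > > > > > > > > > > > >     std::condition_variable released; // 2nd: caller left the critical section
> > > > > > > > > > > > > > >     bool in_section = false;
> > > > > > > > > > > > > > >     bool done       = false;
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     // Handler, run by the event system once the target mutex is acquired.
> > > > > > > > > > > > > > >     void handle_event()
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::unique_lock<std::mutex> lk(m);
> > > > > > > > > > > > > > >       in_section = true;
> > > > > > > > > > > > > > >       entered.notify_one(); // unblock the direct-lock caller
> > > > > > > > > > > > > > >       released.wait(lk, [this] { return done; });
> > > > > > > > > > > > > > >       // returning here is what releases the target mutex
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void direct_lock() // the real version would first schedule *this
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       std::unique_lock<std::mutex> lk(m);
> > > > > > > > > > > > > > >       entered.wait(lk, [this] { return in_section; });
> > > > > > > > > > > > > > >       // the caller now effectively holds the target mutex
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     void direct_unlock()
> > > > > > > > > > > > > > >     {
> > > > > > > > > > > > > > >       {
> > > > > > > > > > > > > > >         std::lock_guard<std::mutex> lk(m);
> > > > > > > > > > > > > > >         done = true;
> > > > > > > > > > > > > > >       }
> > > > > > > > > > > > > > >       released.notify_one();
> > > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > >   };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The calling code would just do direct_lock(), run its critical section, then direct_unlock(), with handle_event() pinning the target mutex for the whole span.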