I didn't see the thread affinity as a rathole, but more as a regularization of existing practice. That kind of thing was already being done in many places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in the Thread support classes meant there was a single implementation, not several, with the consequent and highly desired consistency of operation.
I also considered it very desirable for plugins to be able to do similar things, which is why I championed Kees' request on that. Any plugin doing asynchronous operations found it difficult to impossible to do the correct thing. Now it's easy and mostly automatic. As for tests on that, we certainly had mysterious crashes and other problems due to the lack of such a mechanism, and even now we're still cleaning up plugins that do bad things with scheduling Continuations. The difference now is we can put in a clean fix rather than obscure hacks. This went through because it addressed, as Kees noted, a well-known and chronic issue in a reasonable way, formalizing what had been unwritten rules. Let me note again: we considered this previously, did actual measurements, and concluded the gains were much too small to justify the code (except in the cache, which does have hot locks). And as noted above, given that we were already doing crude forms of thread affinity, it seemed making that more general and consistent was the best approach to reducing lock contention, especially as it provided other benefits.

On Wed, Oct 2, 2019 at 4:30 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:

> On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
>
> > I think we're all able to dig some ratholes with less code than 30 lines.
>
> Hmm, I thought "rathole" implied unneeded complexity but it seems to be just a synonym for "bad".
>
> > Alan, it's only my fault that I pushed for thread affinity for plugins after I saw how it was in core... Can't keep all the goodies for the core :)
> >
> > One thing to watch out for is in-order execution of continuations. As I've mentioned earlier, there is a big possibility that there is code which does not properly lock (or does not need to) and might end up executing continuations at the same time, instead of in order, when running concurrently.
> >
> > Sometimes, to get out of the current executing context, which might be deep in the stack (x locks down and before some cleanup), you will schedule another continuation on the same thread, which can give you a cleaner context. Imagine the scheduler picking that up and running it concurrently on another thread on the same core; that might yield weird results. Multiple threads might break that implicit thread safety.
>
> When one Continuation schedules another, it could go into a queue of Continuations that become runnable after the current Continuation exits. So the pseudo-code becomes:
>
> bool idle[Num_workers_per_core];
> Condition_variable go[Num_workers_per_core];
> Continuation *curr[Num_workers_per_core];
> Continuation_queue after[Num_workers_per_core]; // Run after current continuation.
>
> while (true)
> {
>   blocking get on message queue;
>   switch (msg.type)
>   {
>   case WORKER_THREAD_READY_FOR_NEXT: // Sent by worker thread.
>     while (after[msg.worker_index] is not empty) {
>       c = pop from after[msg.worker_index];
>       push c to run queue;
>     }
>     if (run queue empty) {
>       idle[msg.worker_index] = true;
>     } else {
>       curr[msg.worker_index] = dequeue from run queue;
>       trigger go[msg.worker_index];
>     }
>     break;
>
>   case QUEUE_CONTINUATION_TO_RUN:
>     run_index = index where idle[index] is true, or none;
>     if (run_index == none) {
>       enqueue msg.continuation in run queue;
>     } else {
>       idle[run_index] = false;
>       curr[run_index] = msg.continuation;
>       trigger go[run_index];
>     }
>     break;
>
>   } // end switch
> } // end while
>
> On the other hand, it's not that hard for a specific Continuation to not trigger successive Continuations until they are actually ready to run. It would simply need to have its own "after" queue. We should be selective about how many capabilities we pull into the core Event execution code, and how much we damage general behavior by doing so.
>
> > Locking mostly gets/got screwy in the plugins, because of possible thread/core jumping; the plugin affinity functions help a lot with that. There are patterns to prevent that (try-lock), but they're a bit of a pain and I've seen people ignore them.
> >
> > Some findings we had when doing h2: thread affinity might not always yield the best latency on an almost idle system. If the protocols on both sides are relatively heavyweight, the packing and unpacking will take time, and one can imagine there is some parallelism which could decrease latency. We've thought about controlled handover to other threads, but never made any work of it.
> >
> > I can imagine that with QUIC coming we might see some thread jumping again; I haven't been following progress on that one.
> >
> > On Wed, 2 Oct 2019, 21:33 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> >
> >> On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
> >>
> >> > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >> > >
> >> > > I feel like you overestimate me if you think I could create a rathole with just 30 lines of pseudo-code.
> >> >
> >> > We have faith in you.
> >>
> >> What's an example of a rathole I've already created? Seems like a very big leap of faith if there isn't one.
> >>
> >> Question: Why do we have more event threads than cores?
> >>
> >> If it's to mitigate the fact that ATS is not purely event-driven, that Continuations sometimes do block, it makes sense to use the best design to mitigate this. Sorry, but I don't think any big profiling hoedown is needed to see that what I'm proposing mitigates it better. It's essentially the same idea as having a single line to pay even when there are multiple cashiers.
> >>
> >> > > We're making complex thread affinity changes, on top of event execution code that already seems complex. (I leave it to those with more rathole background to say if it should be so labeled.) Can I see the profiling results used to justify and guide the thread affinity design?
> >> >
> >> > +1.
> >> >
> >> > — leif
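The per-Continuation "after" queue mentioned above can be made concrete with a short sketch. This is a hypothetical C++ illustration, not the ATS event API; Continuation, schedule_after(), and dispatch() are invented stand-ins. It shows a dispatcher that makes a Continuation's successors runnable only once its handler has returned, preserving in-order execution:

#include <deque>
#include <functional>
#include <iostream>

struct Continuation {
  std::function<void()> handler;
  std::deque<Continuation *> after; // successors held back until we exit

  // Park a successor here instead of scheduling it immediately.
  void schedule_after(Continuation *c) { after.push_back(c); }
};

std::deque<Continuation *> run_queue; // the dispatcher's ready-to-run list

// Dispatcher loop: a Continuation's "after" queue is flushed to the run
// queue only after its handler returns, so a successor can never run
// concurrently with, or ahead of, its predecessor.
void dispatch() {
  while (!run_queue.empty()) {
    Continuation *c = run_queue.front();
    run_queue.pop_front();
    c->handler();
    for (Continuation *succ : c->after) {
      run_queue.push_back(succ);
    }
    c->after.clear();
  }
}

int main() {
  Continuation second{[] { std::cout << "second\n"; }, {}};
  Continuation first{[] { std::cout << "first\n"; }, {}};
  first.schedule_after(&second);
  run_queue.push_back(&first);
  dispatch(); // prints "first" then "second", in scheduling order
}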
> >> > >
> >> > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> >> > >
> >> > >> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >> > >>>
> >> > >>> What's the best tool for multi-threaded profiling on Linux?
> >> > >>
> >> > >> Probably the Intel tools. The "perf" toolchain also works well, at least on modern Linux; you could do perf lock etc.
> >> > >>>
> >> > >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >> > >>>
> >> > >>>> Correct, it doesn't mean no lock contention. The claim was it reduced the lock contention to a level where it's not significant enough to warrant additional preventative measures. The data Dan got wasn't from code analysis, but from run-time profiling. That was a while ago, so if you'd like to perform another pass of measuring the level of lock contention, that would certainly be interesting data.
> >> > >>
> >> > >> So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I'd love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.
> >> > >>
> >> > >> We have to stop making complicated and invasive solutions without real problems, and without understanding such problems.
> >> > >>
> >> > >> My $0.01,
> >> > >>
> >> > >> — Leif
> >> > >>
> >> > >>>> In addition, the push for thread affinity started from actual issues in production with Continuations being scheduled on different threads of the same type (that is, it was Kees' fault). Those would not be resolved by faster scheduling on different threads.
> >> > >>>>
> >> > >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> >> > >>>>
> >> > >>>>> I assume thread affinity can't literally mean no lock contention. You'd need a lock on the thread run queue, wouldn't you? Continuations don't get queued for an event thread only from that event thread itself. I don't think we can say conclusively that there would be a significant difference due to lock contention. I'm guessing that Fei would agree that the Continuation dispatch code is difficult to understand and work on. Simplification and more modularity are obviously goals. Seems like it would be simpler if all the Continuations in a to-run list were actually ready to run.
> >> > >>>>>
> >> > >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:
> >> > >>>>>
> >> > >>>>>> Do you have any more specific information on mutex contention? We have in fact already looked at doing this, I think back in 2015 with Dan Xu. The goal there was to have queues with the mutexes to avoid rescheduling.
> >> > >>>>>> As a precursor, Dan did some profiling, and the only significant lock contention he could find was in the cache. That led to the partial object caching work setting up queues for the hot locks, but it was decided that, given the lack of contention elsewhere, it wasn't worth the complexity.
> >> > >>>>>>
> >> > >>>>>> I think thread affinity is a better choice because no lock contention will always beat even the most optimized lock contention resolution. If Continuations related to the same constellation of data objects are on the same thread, then the locks are never contested, which makes it as fast as possible.
> >> > >>>>>>
> >> > >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com.invalid> wrote:
> >> > >>>>>>
> >> > >>>>>>> From the longer-term TSers I've heard comments about seeing profiling results that show that waiting on mutexes is a significant performance issue with TS. But I'm not aware of any write-ups of such results. Unfortunately, I'm relatively new to TS and Linux, so I'm not currently familiar with the best approaches to profiling TS.
> >> > >>>>>>>
> >> > >>>>>>> For better performance, I think that having a single to-run Continuation queue, or one per core, with a queue feeding multiple event threads is the main thing. It's more resilient to Continuations that block. There doesn't seem to be enthusiasm for getting hard-core about not having blocking Continuations (see https://github.com/apache/trafficserver/pull/5412 ). I'm not sure changing to queue-based mutexes would have a significant performance impact. But it seems a cleaner design, making sure Continuations in the to-run list(s) are actually ready to run. In any case, a different mutex implementation is not strictly necessary in order to consolidate to-run Continuation queues.
> >> > >>>>>>>
> >> > >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:
> >> > >>>>>>>
> >> > >>>>>>>> Sounds very interesting. But what is the problem we're trying to solve here? I like the thread affinity because it gives us headache-free concurrency in some cases, and I'll bet that there is some code which doesn't have the proper continuation mutexes because we know it runs on the same thread.
> >> > >>>>>>>>
> >> > >>>>>>>> Are we seeing a lot of imbalanced threads (too much processing causing long queues of continuations, which I can imagine in some cases)?
> >> > >>>>>>>> And shouldn't we balance based on transactions or connections, move those around when we see imbalance, and aim for embarrassingly parallel processing :) Come to think of it, this might introduce another set of problems: how to know which continuations are part of the life cycle of a connection :/
> >> > >>>>>>>>
> >> > >>>>>>>> Jumping threads in one transaction is not always ideal either; this can really hurt performance. But your proposed model seems to handle that somewhat better than the current implementation.
> >> > >>>>>>>>
> >> > >>>>>>>> Very interested and wondering what this would mean for plugin developers.
> >> > >>>>>>>>
> >> > >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put in a queue specific to that mutex. The release function for the mutex (called when a Continuation holding the mutex exits) would move the Continuation at the front of the mutex's queue (if the queue is not empty) into the ready-to-run queue (transferring the lock to that Continuation). A drawback is that the queue would itself need a mutex (spinlock?), but the critical section would be very short.
> >> > >>>>>>>>>
> >> > >>>>>>>>> There would be a function to lock a mutex directly. It would create a Continuation that had two condition variables. It would assign the mutex to this Continuation and schedule it. (In this case, it might make sense to put this Continuation at the front of the mutex's queue, since it would be blocking an entire event thread.) The direct-lock function would then block on the first condition variable. When the Continuation ran, it would trigger the first condition variable, and then block on the second condition variable. The direct-lock function would then exit, allowing the calling code to enter its critical section. At the end of the critical section, another function to release the direct lock would be called. It would trigger the second condition variable, which would cause the function of the Continuation created for the direct lock to exit (thus releasing the mutex).
> >> > >>>>>>>>>
> >> > >>>>>>>>> With this approach, I'm not sure thread affinities would be of any value.
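The direct-lock handshake just described maps closely onto two condition variables. Below is a minimal, self-contained C++ sketch of that handshake only; DirectLock, proxy_handler(), lock(), and unlock() are invented names, and in the real design the event system, not a std::mutex, would hold the Continuation's mutex for as long as proxy_handler() runs:

#include <condition_variable>
#include <mutex>
#include <thread>

struct DirectLock {
  std::mutex cv_mtx;                // guards the two flags below
  std::condition_variable acquired; // first CV: proxy is running, mutex held
  std::condition_variable released; // second CV: caller's critical section done
  bool running = false;
  bool done = false;

  // Handler of the Continuation created for the direct lock. The event
  // system would acquire the target mutex before calling this, so parking
  // here keeps that mutex locked on the caller's behalf.
  void proxy_handler() {
    std::unique_lock<std::mutex> lk(cv_mtx);
    running = true;
    acquired.notify_one();                      // step 1: wake the caller
    released.wait(lk, [this] { return done; }); // step 2: park the event thread
  } // returning is what finally releases the target mutex

  // The direct-lock function: after scheduling the proxy Continuation,
  // block on the first condition variable until it is running.
  void lock() {
    std::unique_lock<std::mutex> lk(cv_mtx);
    acquired.wait(lk, [this] { return running; });
  }

  // The release function: trigger the second condition variable so that
  // proxy_handler() exits.
  void unlock() {
    {
      std::lock_guard<std::mutex> lk(cv_mtx);
      done = true;
    }
    released.notify_one();
  }
};

int main() {
  DirectLock dl;
  std::thread event_thread([&dl] { dl.proxy_handler(); }); // stand-in scheduler
  dl.lock();   // returns once the proxy is parked
  // ... critical section, mutex effectively held on our behalf ...
  dl.unlock(); // lets the proxy return, releasing the mutex
  event_thread.join();
}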
> >> > >>>>>>>>> I think perhaps each core should have its own list of ready-to-run Continuations, and a pool of event threads with affinity to that core. Not having per-event-thread ready-to-run lists means that a Continuation function that blocks is less likely to block other ready-to-run Continuations. If Continuations had core affinities to some degree, this might reduce evictions in the per-core memory cache. (Multiple Continuations having the same function should have the same core affinity.)
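As a rough sketch of the model in that last paragraph (one ready-to-run list per core, served by a small pool of event threads, so that a blocking handler stalls only one worker), here is a hypothetical C++ illustration. The names are invented, it is not the ATS implementation, and it omits actually pinning the workers to a core (which on Linux could be done with pthread_setaffinity_np):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Continuation = std::function<void()>;

class CoreRunQueue {
  std::mutex mtx;                // guards the ready list; critical sections are short
  std::condition_variable ready;
  std::deque<Continuation> queue;
  bool shutting_down = false;
  std::vector<std::thread> workers;

public:
  explicit CoreRunQueue(unsigned nworkers) {
    for (unsigned i = 0; i < nworkers; ++i) {
      workers.emplace_back([this] { worker_loop(); });
    }
  }

  // Any free worker picks up the next Continuation, so one blocking
  // handler cannot stall the rest of the core's ready list.
  void worker_loop() {
    for (;;) {
      Continuation c;
      {
        std::unique_lock<std::mutex> lk(mtx);
        ready.wait(lk, [this] { return shutting_down || !queue.empty(); });
        if (queue.empty()) {
          return; // shutting down and fully drained
        }
        c = std::move(queue.front());
        queue.pop_front();
      }
      c(); // run the handler outside the lock
    }
  }

  void schedule(Continuation c) {
    {
      std::lock_guard<std::mutex> lk(mtx);
      queue.push_back(std::move(c));
    }
    ready.notify_one();
  }

  ~CoreRunQueue() {
    {
      std::lock_guard<std::mutex> lk(mtx);
      shutting_down = true;
    }
    ready.notify_all();
    for (auto &w : workers) {
      w.join();
    }
  }
};

int main() {
  CoreRunQueue core0(4); // e.g. four event threads serving core 0's list
  core0.schedule([] { /* a Continuation handler */ });
} // the destructor drains the list and joins the workers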