On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
> > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >
> > I feel like you overestimate me if you think I could create a rathole
> > with just 30 lines of pseudo-code.
>
> We have faith in you.

What's an example of a rathole I've already created?  Seems like a very big
leap of faith if there isn't one.

Question: Why do we have more event threads than cores?  If it's to mitigate
the fact that ATS is not purely event-driven, and that Continuations
sometimes do block, then it makes sense to use the best design for that
mitigation.  Sorry, but I don't think any big profiling hoedown is needed to
see that what I'm proposing mitigates it better.  It's essentially the same
idea as having a single line to pay even when there are multiple cashiers.

> > We're making complex thread affinity changes, on top of event execution
> > code that already seems complex.  (I leave it to those with more rathole
> > background to say if it should be so labeled.)  Can I see the profiling
> > results used to justify and guide the thread affinity design?
>
> +1.
>
> — leif
>
> > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> >
> >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
> >>>
> >>> What's the best tool for multi-threaded profiling on Linux?
> >>
> >> Probably the Intel tools.  The “perf” toolchain also works well, at
> >> least on modern Linux; you could do perf lock, etc.
> >>>
> >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> >>> <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>
> >>>> Correct, it doesn't mean no lock contention.  The claim was that it
> >>>> reduced the lock contention to a level where it's not significant
> >>>> enough to warrant additional preventative measures.  The data Dan got
> >>>> wasn't from code analysis, but from run-time profiling.  That was a
> >>>> while ago, so if you'd like to perform another pass of measuring the
> >>>> level of lock contention, that would certainly be interesting data.
> >>
> >> So, before we dig ourselves into this rathole, can someone explain to me
> >> what the problem is?  Where are the metrics / analysis showing that we
> >> have a problem?  I’d love to see a comparison, too, between various
> >> versions of ATS, say v6 - v9, and understand where, if anywhere, we made
> >> things (lock contention) worse.
> >>
> >> We have to stop building complicated and invasive solutions without a
> >> real problem, and without understanding that problem.
> >>
> >> My $0.01,
> >>
> >> — Leif
> >>
> >>>> In addition, the push for thread affinity started from actual issues
> >>>> in production with Continuations being scheduled on different threads
> >>>> of the same type (that is, it was Kees' fault).  Those would not be
> >>>> resolved by faster scheduling on different threads.
> >>>>
> >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas
> >>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>
> >>>>> I assume thread affinity can't literally mean no lock contention.
> >>>>> You'd need a lock on the thread run queue, wouldn't you?
> >>>>> Continuations don't get queued for an event thread only from that
> >>>>> same event thread.  I don't think we can say conclusively that there
> >>>>> would be a significant difference due to lock contention.  I'm
> >>>>> guessing that Fei would agree that the Continuation dispatch code is
> >>>>> difficult to understand and work on.
> >>>>> Simplification and more modularity are obviously goals.  Seems like
> >>>>> it would be simpler if all the Continuations in a to-run list were
> >>>>> actually ready to run.
> >>>>>
> >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> >>>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>>>
> >>>>>> Do you have any more specific information on mutex contention?  We
> >>>>>> have in fact already looked at doing this, I think back in 2015 with
> >>>>>> Dan Xu.  The goal there was to have queues with the mutexes to avoid
> >>>>>> rescheduling.  As a precursor, Dan did some profiling, and the only
> >>>>>> significant lock contention he could find was in the cache.  That
> >>>>>> led to the partial object caching work setting up queues for the hot
> >>>>>> locks, but it was decided that, given the lack of contention
> >>>>>> elsewhere, it wasn't worth the complexity.
> >>>>>>
> >>>>>> I think thread affinity is a better choice because no lock
> >>>>>> contention will always beat even the most optimized lock contention
> >>>>>> resolution.  If Continuations related to the same constellation of
> >>>>>> data objects are on the same thread, then the locks are never
> >>>>>> contested, which makes it as fast as possible.
> >>>>>>
> >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas
> >>>>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>
> >>>>>>> From the longer-term TSers I've heard comments about seeing
> >>>>>>> profiling results that show that waiting on mutexes is a
> >>>>>>> significant performance issue with TS.  But I'm not aware of any
> >>>>>>> write-ups of such results.  Unfortunately, I'm relatively new to TS
> >>>>>>> and Linux, so I'm not currently familiar with the best approaches
> >>>>>>> to profiling TS.
> >>>>>>>
> >>>>>>> For better performance, I think that having a single to-run
> >>>>>>> Continuation queue, or one per core, with each queue feeding
> >>>>>>> multiple event threads, is the main thing.  It's more resilient to
> >>>>>>> Continuations that block.  There doesn't seem to be enthusiasm for
> >>>>>>> getting hard-core about not having blocking Continuations (see
> >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
> >>>>>>> changing to queue-based mutexes would have a significant
> >>>>>>> performance impact.  But it seems a cleaner design, making sure
> >>>>>>> Continuations in the to-run list(s) are actually ready to run.  But
> >>>>>>> a different mutex implementation is not strictly necessary in order
> >>>>>>> to consolidate to-run Continuation queues.
> >>>>>>>
> >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra
> >>>>>>> <kspoels...@we-amp.com> wrote:
> >>>>>>>
> >>>>>>>> Sounds very interesting.  But what is the problem we're trying to
> >>>>>>>> solve here?  I like the thread affinity because it gives us
> >>>>>>>> headache-free concurrency in some cases, and I'll bet that there
> >>>>>>>> is some code which doesn't have the proper continuation mutexes
> >>>>>>>> because we know it runs on the same thread.
> >>>>>>>>
> >>>>>>>> Are we seeing a lot of imbalanced threads (too much processing
> >>>>>>>> causing long queues of continuations, which I can imagine in some
> >>>>>>>> cases)?
> >>>>>>>> And shouldn't we balance based on transactions or connections,
> >>>>>>>> move those around when we see imbalance, and aim for
> >>>>>>>> embarrassingly parallel processing? :)  Come to think of it, this
> >>>>>>>> might introduce another set of problems: how to know which
> >>>>>>>> continuations are part of the life cycle of a connection :/
> >>>>>>>>
> >>>>>>>> Jumping threads in one transaction is not always ideal either;
> >>>>>>>> this can really hurt performance.  But your proposed model seems
> >>>>>>>> to handle that somewhat better than the current implementation.
> >>>>>>>>
> >>>>>>>> Very interested, and wondering what this would mean for plugin
> >>>>>>>> developers.
> >>>>>>>>
> >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas,
> >>>>>>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put
> >>>>>>>>> in a queue specific to that mutex.  The release function for the
> >>>>>>>>> mutex (called when a Continuation holding the mutex exits) would
> >>>>>>>>> move the Continuation at the front of the mutex's queue (if not
> >>>>>>>>> empty) into the ready-to-run queue (transferring the lock to that
> >>>>>>>>> Continuation).  A drawback is that the queue would itself need a
> >>>>>>>>> mutex (spinlock?), but the critical section would be very short.
> >>>>>>>>>
> >>>>>>>>> There would be a function to lock a mutex directly.  It would
> >>>>>>>>> create a Continuation that had two condition variables.  It would
> >>>>>>>>> assign the mutex to this Continuation and schedule it.  (In this
> >>>>>>>>> case, it might make sense to put this Continuation at the front
> >>>>>>>>> of the mutex's queue, since it would be blocking an entire event
> >>>>>>>>> thread.)  The direct-lock function would then block on the first
> >>>>>>>>> condition variable.  When the Continuation ran, it would trigger
> >>>>>>>>> the first condition variable, and then block on the second
> >>>>>>>>> condition variable.  The direct-lock function would then exit,
> >>>>>>>>> allowing the calling code to enter its critical section.  At the
> >>>>>>>>> end of the critical section, another function to release the
> >>>>>>>>> direct lock would be called.  It would trigger the second
> >>>>>>>>> condition variable, which would cause the function of the
> >>>>>>>>> Continuation created for the direct lock to exit (thus releasing
> >>>>>>>>> the mutex).
> >>>>>>>>>
> >>>>>>>>> With this approach, I'm not sure thread affinities would be of
> >>>>>>>>> any value.  I think perhaps each core should have its own list of
> >>>>>>>>> ready-to-run Continuations, and a pool of event threads with
> >>>>>>>>> affinity to that core.  Not having per-event-thread ready-to-run
> >>>>>>>>> lists means that a Continuation function that blocks is less
> >>>>>>>>> likely to block other ready-to-run Continuations.  If
> >>>>>>>>> Continuations had core affinities to some degree, this might
> >>>>>>>>> reduce evictions in the per-core memory cache.  (Multiple
> >>>>>>>>> Continuations having the same function should have the same core
> >>>>>>>>> affinity.)
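To make the queue-transfer part of the quoted proposal concrete, here is a
minimal C++ sketch.  The names (Continuation, RunQueue, QueueingMutex) are
stand-ins invented for illustration rather than the actual ATS ProxyMutex /
EThread interfaces, and std::mutex is used where the proposal suggests a
spinlock:

// Minimal sketch only; hypothetical names, not actual ATS code.
#include <deque>
#include <mutex>

struct Continuation {};  // stand-in for the ATS Continuation

// Stand-in for a ready-to-run list; a real one would be per-thread or
// per-core and have its own synchronization.
struct RunQueue {
  std::deque<Continuation *> ready;
  void push(Continuation *c) { ready.push_back(c); }
};

class QueueingMutex {
public:
  // Called when a Continuation is scheduled.  Returns true if the caller
  // now holds the mutex and can dispatch the Continuation immediately;
  // otherwise the Continuation is parked on this mutex's wait queue.
  bool acquire_or_park(Continuation *c) {
    std::lock_guard<std::mutex> g(queue_lock_);  // very short critical section
    if (!held_) {
      held_ = true;
      return true;
    }
    waiters_.push_back(c);
    return false;
  }

  // Called when the Continuation holding the mutex exits.  If a waiter
  // exists, ownership transfers directly to the waiter at the front of the
  // queue, and that Continuation is moved to the ready-to-run queue.
  void release(RunQueue &run_queue) {
    Continuation *next = nullptr;
    {
      std::lock_guard<std::mutex> g(queue_lock_);
      if (waiters_.empty()) {
        held_ = false;
        return;
      }
      next = waiters_.front();  // the lock is handed off, never dropped
      waiters_.pop_front();
    }
    run_queue.push(next);  // ready to run, already holding this mutex
  }

private:
  std::mutex queue_lock_;  // protects held_ and waiters_ only (a spinlock in the proposal)
  bool held_ = false;
  std::deque<Continuation *> waiters_;
};

The point of the hand-off in release() is that the Continuation at the front
of the wait queue reaches the ready-to-run list already holding the lock, so
nothing in the to-run list is ever waiting on a mutex.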
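A similarly rough sketch of the other half of the idea: one ready-to-run
list per core, shared by a pool of event threads pinned to that core (the
single-line-with-multiple-cashiers model).  Again the types are hypothetical,
std::function stands in for a Continuation, and actual CPU pinning (for
example with pthread_setaffinity_np) is only noted in a comment:

// Minimal sketch only; hypothetical names, not actual ATS code.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One ready-to-run queue shared by every event thread pinned to one core.
class CoreRunQueue {
public:
  void push(std::function<void()> cont) {
    {
      std::lock_guard<std::mutex> g(lock_);
      ready_.push_back(std::move(cont));
    }
    cv_.notify_one();
  }

  std::function<void()> pop() {
    std::unique_lock<std::mutex> g(lock_);
    cv_.wait(g, [this] { return !ready_.empty(); });
    std::function<void()> cont = std::move(ready_.front());
    ready_.pop_front();
    return cont;
  }

private:
  std::mutex lock_;  // short critical sections around the queue only
  std::condition_variable cv_;
  std::deque<std::function<void()>> ready_;
};

// A pool of event threads for one core, all draining the same queue
// ("one line, many cashiers").  A real implementation would also pin each
// thread to the core, e.g. with pthread_setaffinity_np; omitted here.
void run_event_threads(CoreRunQueue &queue, int threads_per_core) {
  std::vector<std::thread> pool;
  for (int i = 0; i < threads_per_core; ++i) {
    pool.emplace_back([&queue] {
      for (;;) {
        std::function<void()> cont = queue.pop();
        cont();  // if this blocks, sibling threads keep draining the queue
      }
    });
  }
  for (std::thread &t : pool) {
    t.join();
  }
}

Because all the event threads for a core pull from the same queue, a
Continuation that blocks ties up only the thread running it; the work queued
behind it is picked up by the other threads in the pool.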