I've done some profiling before using Intel's VTune; you can ask Jason for
details.

On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:

>
>
> > On Oct 2, 2019, at 11:23 AM, Walt Karas
> > <wka...@verizonmedia.com.INVALID> wrote:
> >
> > What's the best tool for multi-threaded profiling on Linux?
>
> Probably the Intel tools.  Probably the “perf” toolchain works well, at
> least on modern Linux; you could do perf lock, etc.
>
> >
> > On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> > <solidwallofc...@verizonmedia.com.invalid> wrote:
> >
> >> Correct, it doesn't mean no lock contention. The claim was it reduced
> >> the lock contention to a level where it's not significant enough to
> >> warrant additional preventative measures. The data Dan got wasn't from
> >> code analysis, but from run time profiling. That was a while ago so if
> >> you'd like to perform another pass of measuring the level of lock
> >> contention, that would certainly be interesting data.
>
>
> So, before we dig ourselves into this rathole, can someone explain to me
> what the problem is? Where are the metrics / analysis showing that we have
> a problem? I’d love to see a comparison too between various versions of ATS,
> say v6 - v9, and understand where, if anywhere, we made things (lock
> contention) worse.
>
> We have to stop making complicated and invasive solutions without real
> problems, and without understanding such problems.
>
> My $0.01,
>
> — Leif
>
> >>
> >> In addition, the push for thread affinity started from actual issues in
> >> production with Continuations being scheduled on different threads of
> >> the same type (that is, it was Kees' fault). Those would not be resolved
> >> by faster scheduling on different threads.
> >>
> >> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas
> >> <wka...@verizonmedia.com.invalid> wrote:
> >>
> >>> I assume thread affinity can't literally mean no lock contention.
> >>> You'd need a lock on the thread run queue, wouldn't you?  Continuations
> >>> can't only get queued for an event thread from that same event thread.
> >>> I don't think we can say conclusively that there would be a significant
> >>> difference due to lock contention.  I'm guessing that Fei would agree
> >>> that the Continuation dispatch code is difficult to understand and work
> >>> on.  Simplification and more modularity is obviously a goal.  It seems
> >>> like it would be simpler if all the Continuations in a to-run list were
> >>> actually ready to run.
> >>>
> >>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> >>> <solidwallofc...@verizonmedia.com.invalid> wrote:
> >>>
> >>>> Do you have any more specific information on mutex contention? We have
> >>>> in fact already looked at doing this, I think back in 2015 with Dan Xu.
> >>>> The goal there was to have queues with the mutexes to avoid
> >>>> rescheduling.  As a precursor Dan did some profiling and the only
> >>>> significant lock contention he could find was in the cache. That led to
> >>>> the partial object caching work setting up queues for the hot locks,
> >>>> but it was decided that given the lack of contention elsewhere, it
> >>>> wasn't worth the complexity.
> >>>>
> >>>> I think thread affinity is a better choice because no lock contention
> >>>> will always beat even the most optimized lock contention resolution.
> >>>> If Continuations related to the same constellation of data objects are
> >>>> on the same thread, then the locks are never contested, which makes it
> >>>> as fast as possible.
> >>>>
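
As a rough illustration of that thread-affinity argument, here is a small
C++ sketch (not ATS code; Continuation, EventThread, and dispatch are
illustrative names): if every Continuation that shares a mutex is dispatched
to the same event thread, the lock is only ever taken uncontested.

// Illustrative sketch only; names are not ATS APIs.
#include <cstddef>
#include <functional>
#include <vector>

struct Continuation {
  void *mutex;                     // the lock this Continuation needs
  std::function<void()> handler;
};

struct EventThread {
  void enqueue(const Continuation &c); // assumed per-thread run queue
};

// Choose the event thread by hashing the mutex: Continuations that share a
// mutex always land on the same thread, so its lock can never be contested.
void dispatch(std::vector<EventThread> &threads, const Continuation &c) {
  std::size_t idx = std::hash<void *>{}(c.mutex) % threads.size();
  threads[idx].enqueue(c);
}
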
> >>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas
> >>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>
> >>>>> From the longer-term TSers I've heard comments about seeing profiling
> >>>>> results that show that waiting on mutexes is a significant performance
> >>>>> issue with TS.  But I'm not aware of any write-ups of such results.
> >>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm not
> >>>>> currently familiar with the best approaches to profiling TS.
> >>>>>
> >>>>> For better performance, I think that having a single to-run
> >>>>> Continuation queue, or one per core, with a queue feeding multiple
> >>>>> event threads is the main thing.  It's more resilient to Continuations
> >>>>> that block.  There doesn't seem to be enthusiasm for getting hard-core
> >>>>> about not having blocking Continuations (see
> >>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
> >>>>> changing to queue-based mutexes would have a significant performance
> >>>>> impact.  But it seems a cleaner design, making sure Continuations in
> >>>>> the to-run list(s) are actually ready to run.  But a different mutex
> >>>>> implementation is not strictly necessary in order to consolidate
> >>>>> to-run Continuation queues.
> >>>>>
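
A minimal sketch of the "one to-run queue feeding multiple event threads"
layout described above (illustrative names, not ATS code). The point it shows:
a handler that blocks only ties up the one thread that popped it, while the
other threads keep draining the shared queue.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SharedRunQueue {
public:
  void push(std::function<void()> cont) {
    {
      std::lock_guard<std::mutex> g(m_);
      q_.push_back(std::move(cont));
    }
    cv_.notify_one();
  }

  // Start `nthreads` event threads, all pulling from this one queue.
  // (The threads run forever in this sketch.)
  void run(int nthreads) {
    for (int i = 0; i < nthreads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> cont;
          {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            cont = std::move(q_.front());
            q_.pop_front();
          }
          cont(); // if this blocks, only this one thread stalls
        }
      });
    }
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> q_;
  std::vector<std::thread> workers_;
};
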
> >>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra
> >>>>> <kspoels...@we-amp.com> wrote:
> >>>>>
> >>>>>> Sounds very interesting.
> >>>>>> But what is the problem we're trying to solve here? I like the thread
> >>>>>> affinity because it gives us headache-free concurrency in some cases,
> >>>>>> and I'll bet that there is some code which doesn't have the proper
> >>>>>> continuation mutexes because we know it runs on the same thread.
> >>>>>>
> >>>>>> Are we seeing a lot of imbalanced threads (too much processing
> >>>>>> causing long queues of continuations, which I can imagine in some
> >>>>>> cases)?  And shouldn't we balance based on transactions or
> >>>>>> connections, move those around when we see imbalance, and aim for
> >>>>>> embarrassingly parallel processing? :)  Come to think of it, this
> >>>>>> might introduce another set of problems: how to know which
> >>>>>> continuations are part of the life cycle of a connection :/
> >>>>>>
> >>>>>> Jumping threads in one transaction is not always ideal either; this
> >>>>>> can really hurt performance.  But your proposed model seems to handle
> >>>>>> that somewhat better than the current implementation.
> >>>>>>
> >>>>>> Very interested, and wondering what this would mean for plugin
> >>>>>> developers.
> >>>>>>
> >>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas,
> >>>>>> <wka...@verizonmedia.com.invalid> wrote:
> >>>>>>
> >>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put in
> >>>>>>> a queue specific to that mutex.  The release function for the mutex
> >>>>>>> (called when a Continuation holding the mutex exits) would put the
> >>>>>>> Continuation at the front of the mutex's queue (if not empty) into
> >>>>>>> the ready-to-run queue (transferring the lock to that Continuation).
> >>>>>>> A drawback is that the queue would itself need a mutex (spinlock?),
> >>>>>>> but the critical section would be very short.
> >>>>>>>
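
Here is a hedged C++ sketch of that mutex-with-queue idea (QueuedMutex,
Continuation, and ready_to_run are illustrative names, not ATS APIs). On
release, ownership is handed directly to the next waiting Continuation, which
then moves to the to-run queue already holding the lock.

#include <deque>
#include <functional>
#include <mutex>

struct Continuation {
  std::function<void()> handler;
};

// Stand-in for the event system's to-run queue (assumed to exist elsewhere).
void ready_to_run(Continuation *c);

class QueuedMutex {
public:
  // Called when a Continuation is scheduled: either it takes the lock and can
  // be dispatched right away, or it parks on this mutex's wait queue.
  bool lock_or_enqueue(Continuation *c) {
    std::lock_guard<std::mutex> g(guard_); // very short critical section
    if (!held_) {
      held_ = true;          // lock acquired; the caller may dispatch c now
      return true;
    }
    waiters_.push_back(c);   // parked until the current holder releases
    return false;
  }

  // Called when the Continuation holding the mutex exits its handler: the
  // lock is transferred to the front waiter, which moves to the to-run queue.
  void release() {
    Continuation *next = nullptr;
    {
      std::lock_guard<std::mutex> g(guard_);
      if (waiters_.empty()) {
        held_ = false;
        return;
      }
      next = waiters_.front();
      waiters_.pop_front();
      // held_ stays true: ownership is handed off, never observably free.
    }
    ready_to_run(next); // runs with the lock already held
  }

private:
  std::mutex guard_; // guards held_ and waiters_ only (spinlock in practice)
  bool held_ = false;
  std::deque<Continuation *> waiters_;
};

Because the lock is never released while waiters exist, a woken Continuation
cannot lose a race and need rescheduling, which is the point of the proposal.
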
> >>>>>>> There would be a function to lock a mutex directly.  It would create
> >>>>>>> a Continuation that had two condition variables.  It would assign
> >>>>>>> the mutex to this Continuation and schedule it.  (In this case, it
> >>>>>>> might make sense to put this Continuation at the front of the
> >>>>>>> mutex's queue, since it would be blocking an entire event thread.)
> >>>>>>> The direct-lock function would then block on the first condition
> >>>>>>> variable.  When the Continuation ran, it would trigger the first
> >>>>>>> condition variable, and then block on the second condition variable.
> >>>>>>> The direct-lock function would then exit, allowing the calling code
> >>>>>>> to enter its critical section.  At the end of the critical section,
> >>>>>>> another function to release the direct lock would be called.  It
> >>>>>>> would trigger the second condition variable, which would cause the
> >>>>>>> function of the Continuation created for the direct lock to exit
> >>>>>>> (thus releasing the mutex).
> >>>>>>>
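
A sketch of that two-condition-variable handshake. DirectLock and
schedule_with_mutex are assumptions made up for illustration, and QueuedMutex
refers to the earlier sketch; none of these are ATS APIs.

#include <condition_variable>
#include <functional>
#include <mutex>

class QueuedMutex; // from the sketch above

// Assumed hook: run fn as a Continuation that holds qm while fn executes.
void schedule_with_mutex(QueuedMutex &qm, std::function<void()> fn);

struct DirectLock {
  std::mutex m;
  std::condition_variable acquired_cv; // the "first" condition variable
  std::condition_variable released_cv; // the "second" condition variable
  bool acquired = false;
  bool released = false;
};

// Blocks the calling thread until a helper Continuation holds qm on its behalf.
void direct_lock(QueuedMutex &qm, DirectLock &dl) {
  schedule_with_mutex(qm, [&dl] {
    // Runs on an event thread, with qm held by this helper Continuation.
    std::unique_lock<std::mutex> lk(dl.m);
    dl.acquired = true;
    dl.acquired_cv.notify_one(); // wake the blocked caller
    // Hold qm (and this event thread) until the caller releases the lock.
    dl.released_cv.wait(lk, [&dl] { return dl.released; });
    // Returning ends the helper Continuation, which releases qm.
  });
  std::unique_lock<std::mutex> lk(dl.m);
  dl.acquired_cv.wait(lk, [&dl] { return dl.acquired; });
  // The caller now effectively holds qm and may enter its critical section.
}

// Called at the end of the critical section.
void direct_unlock(DirectLock &dl) {
  {
    std::lock_guard<std::mutex> g(dl.m);
    dl.released = true;
  }
  dl.released_cv.notify_one(); // lets the helper Continuation return
}
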
> >>>>>>> With this approach, I'm not sure thread affinities would be of any
> >>>>>>> value.  I think perhaps each core should have its own list of
> >>>>>>> ready-to-run Continuations, and a pool of event threads with
> >>>>>>> affinity to that core.  Not having per-event-thread ready-to-run
> >>>>>>> lists means that a Continuation function that blocks is less likely
> >>>>>>> to block other ready-to-run Continuations.  If Continuations had
> >>>>>>> core affinities to some degree, this might reduce evictions in
> >>>>>>> per-core memory cache.  (Multiple Continuations having the same
> >>>>>>> function should have the same core affinity.)
> >>>>>>>
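
A rough sketch of that per-core layout: one to-run list per core (the
SharedRunQueue from the earlier sketch) and a pool of event threads pinned to
that core. The pinning uses the Linux-specific pthread_setaffinity_np, so this
is a Linux-only illustration and CoreGroup/pin_to_core are made-up names.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

class SharedRunQueue; // per-core to-run list, as sketched earlier

struct CoreGroup {
  SharedRunQueue *queue;            // shared by every event thread in the group
  std::vector<std::thread> threads; // pool of event threads for this core
};

// Pin the calling thread to one core so the pool and its queue stay
// cache-local; Continuations with affinity to that core see warm caches.
inline void pin_to_core(int core_id) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core_id, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
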
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
