> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID>
> wrote:
>
> What's the best tool for multi-threaded profiling on Linux?
Probably the Intel tools, or the “perf” toolchain, which works well on modern
Linux; you could do perf lock etc.
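For a concrete starting point, something along these lines usually works
(illustrative invocations only; perf lock additionally needs a kernel with
lock events enabled):

    # CPU profile of a running traffic_server, with call graphs
    perf record -g -p $(pidof traffic_server) -- sleep 30
    perf report

    # Kernel lock events; records the given workload
    perf lock record <command-to-profile>
    perf lock report

    # Contended pthread mutexes fall back to futex(2), so this is a rough proxy
    perf record -e syscalls:sys_enter_futex -g -p $(pidof traffic_server) -- sleep 30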
>
> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> <solidwallofc...@verizonmedia.com.invalid> wrote:
>
>> Correct, it doesn't mean no lock contention. The claim was it reduced the
>> lock contention to a level where it's not significant enough to warrant
>> additional preventative measures. The data Dan got wasn't from code
>> analysis, but from run time profiling. That was a while ago so if you'd
>> like to perform another pass of measuring the level of lock contention,
>> that would certainly be interesting data.
So, before we dig ourselves into this rathole, can someone explain to me what
the problem is? Where are the metrics / analysis showing that we have a
problem? I’d love to see a comparison too between various versions of ATS, say
v6 - v9, and understand where, if anywhere, we made things (lock contention)
worse.
We have to stop building complicated and invasive solutions without a real
problem, and without understanding that problem.
My $0.01,
— Leif
>>
>> In addition, the push for thread affinity started from actual issues in
>> production with Continuations being scheduled on different threads of the
>> same type (that is, it was Kees' fault). Those would not be resolved by
>> faster scheduling on different threads.
>>
>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com
>> .invalid>
>> wrote:
>>
>>> I assume thread affinity can't literally mean no lock contention. You'd
>>> need a lock on the thread run queue, wouldn't you? Continuations aren't
>>> only queued for an event thread from that same event thread. I don't think
>>> we can say conclusively that there would be a significant difference due to
>>> lock contention. I'm guessing that Fei would agree that the Continuation
>>> dispatch code is difficult to understand and work on. Simplification and
>>> more modularity are obviously goals. It seems like it would be simpler if
>>> all the Continuations in a to-run list were actually ready to run.
>>>
>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
>>>
>>>> Do you have any more specific information on mutex contention? We have in
>>>> fact already looked at doing this, I think back in 2015 with Dan Xu. The
>>>> goal there was to have queues with the mutexes to avoid rescheduling. As a
>>>> precursor Dan did some profiling, and the only significant lock contention
>>>> he could find was in the cache. That led to the partial object caching
>>>> work setting up queues for the hot locks, but it was decided that, given
>>>> the lack of contention elsewhere, it wasn't worth the complexity.
>>>>
>>>> I think thread affinity is a better choice because no lock contention will
>>>> always beat even the most optimized lock contention resolution. If
>>>> Continuations related to the same constellation of data objects are on the
>>>> same thread, then the locks are never contested, which makes it as fast as
>>>> possible.
>>>>
>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com
>>>> .invalid>
>>>> wrote:
>>>>
>>>>> From the longer-term TSers I've heard comments about seeing profiling
>>>>> results that show that waiting on mutexes is a significant performance
>>>>> issue with TS. But I'm not aware of any write-ups of such results.
>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm not currently
>>>>> familiar with the best approaches to profiling TS.
>>>>>
>>>>> For better performance, I think that having a single to-run Continuation
>>>>> queue, or one per core, with each queue feeding multiple event threads, is
>>>>> the main thing. It's more resilient to Continuations that block. There
>>>>> doesn't seem to be enthusiasm for getting hard-core about not having
>>>>> blocking Continuations (see
>>>>> https://github.com/apache/trafficserver/pull/5412 ). I'm not sure changing
>>>>> to queue-based mutexes would have a significant performance impact, but it
>>>>> seems a cleaner design, making sure Continuations in the to-run list(s) are
>>>>> actually ready to run. However, a different mutex implementation is not
>>>>> strictly necessary in order to consolidate to-run Continuation queues.
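As a rough illustration of the shared-queue idea described above (stand-in
names, not actual ATS types): one ready-to-run queue feeding a pool of event
threads, so a handler that blocks only stalls the thread it is running on.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Illustrative stand-in for a scheduled Continuation.
    using Continuation = std::function<void()>;

    // One ready-to-run queue shared by several event threads.
    class RunQueue {
    public:
      void enqueue(Continuation c) {
        { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(c)); }
        cv_.notify_one();
      }
      Continuation dequeue() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !q_.empty(); });
        Continuation c = std::move(q_.front());
        q_.pop_front();
        return c;
      }
    private:
      std::mutex m_;
      std::condition_variable cv_;
      std::deque<Continuation> q_;
    };

    // A pool of event threads draining the shared queue; any idle thread can
    // pick up the next Continuation, so one blocked handler does not hold up
    // a fixed per-thread backlog.
    void run_event_threads(RunQueue &rq, int n) {
      std::vector<std::thread> pool;
      for (int i = 0; i < n; ++i)
        pool.emplace_back([&rq] { for (;;) rq.dequeue()(); });
      for (auto &t : pool) t.join();
    }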
>>>>>
>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
>> kspoels...@we-amp.com>
>>>>> wrote:
>>>>>
>>>>>> Sounds very interesting.
>>>>>> But what is the problem we're trying to solve here? I like the thread
>>>>>> affinity because it gives us headache-free concurrency in some cases, and
>>>>>> I'll bet that there is some code which doesn't have the proper Continuation
>>>>>> mutexes because we know it runs on the same thread.
>>>>>>
>>>>>> Are we seeing a lot of imbalanced threads (too much processing causing
>>>>>> long queues of continuations, which I can imagine in some cases)? And
>>>>>> shouldn't we balance based on transactions or connections, moving those
>>>>>> around when we see imbalance, and aim for embarrassingly parallel
>>>>>> processing? :) Come to think of it, this might introduce another set of
>>>>>> problems: how to know which continuations are part of the life cycle of a
>>>>>> connection :/
>>>>>>
>>>>>> Jumping threads in one transaction is not always ideal either; this can
>>>>>> really hurt performance. But your proposed model seems to handle that
>>>>>> somewhat better than the current implementation.
>>>>>>
>>>>>> Very interested, and wondering what this would mean for plugin
>>>>>> developers.
>>>>>>
>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com
>>>> .invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put in a
>>>>>>> queue specific to that mutex. The release function for the mutex (called
>>>>>>> when a Continuation holding the mutex exits) would put the Continuation
>>>>>>> at the front of the mutex's queue (if not empty) into the ready-to-run
>>>>>>> queue (transferring the lock to that Continuation). A drawback is that
>>>>>>> the queue would itself need a mutex (spinlock?), but the critical section
>>>>>>> would be very short.
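A minimal sketch of that per-mutex queue (the Continuation type and the
schedule_ready() hook are placeholders, not the real ProxyMutex/EThread API):
release() hands ownership straight to the head waiter and makes it ready to run.

    #include <deque>
    #include <mutex>

    struct Continuation;                  // placeholder forward declaration
    void schedule_ready(Continuation *);  // placeholder: put on the ready-to-run queue

    class QueueingMutex {
    public:
      // Returns true if the lock was taken; otherwise the Continuation is parked
      // on this mutex's wait queue and will be made ready again on release().
      bool lock_or_enqueue(Continuation *c) {
        std::lock_guard<std::mutex> g(q_lock_);  // very short critical section
        if (!held_) { held_ = true; return true; }
        waiters_.push_back(c);
        return false;
      }

      // Called when the owning Continuation's handler returns. If anyone is
      // waiting, ownership transfers directly to the head waiter, which goes
      // onto the ready-to-run queue already holding the lock.
      void release() {
        Continuation *next = nullptr;
        {
          std::lock_guard<std::mutex> g(q_lock_);
          if (waiters_.empty()) {
            held_ = false;
          } else {
            next = waiters_.front();
            waiters_.pop_front();   // held_ stays true for the new owner
          }
        }
        if (next) schedule_ready(next);
      }

    private:
      std::mutex q_lock_;   // protects the waiter queue; could be a spinlock
      bool held_ = false;
      std::deque<Continuation *> waiters_;
    };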
>>>>>>>
>>>>>>> There would be a function to lock a mutex directly. It would create a
>>>>>>> Continuation that had two condition variables. It would assign the mutex
>>>>>>> to this Continuation and schedule it. (In this case, it might make sense
>>>>>>> to put this Continuation at the front of the mutex's queue, since it
>>>>>>> would be blocking an entire event thread.) The direct-lock function would
>>>>>>> then block on the first condition variable. When the Continuation ran, it
>>>>>>> would trigger the first condition variable, and then block on the second
>>>>>>> condition variable. The direct-lock function would then exit, allowing
>>>>>>> the calling code to enter its critical section. At the end of the
>>>>>>> critical section, another function to release the direct lock would be
>>>>>>> called. It would trigger the second condition variable, which would cause
>>>>>>> the function of the Continuation created for the direct lock to exit
>>>>>>> (thus releasing the mutex).
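A sketch of that two-condition-variable handoff, using standard-library
primitives in place of the real Continuation machinery (every name here is
hypothetical): the helper's handler signals the caller once the scheduler
mutex is held, then parks its event thread until the caller unlocks.

    #include <condition_variable>
    #include <mutex>

    struct DirectLockHelper {
      std::mutex m;
      std::condition_variable acquired_cv, released_cv;
      bool acquired = false, released = false;

      // Runs as the helper Continuation's handler on an event thread, with the
      // target scheduler mutex already held on its behalf.
      void handler() {
        {
          std::lock_guard<std::mutex> g(m);
          acquired = true;
        }
        acquired_cv.notify_one();  // first condition variable: lock is yours
        std::unique_lock<std::mutex> g(m);
        released_cv.wait(g, [this] { return released; });  // second: wait for unlock
      }  // returning lets the scheduler release the mutex

      // Called by the direct-lock function; blocks until handler() signals.
      void wait_for_lock() {
        std::unique_lock<std::mutex> g(m);
        acquired_cv.wait(g, [this] { return acquired; });
      }

      // Called at the end of the caller's critical section.
      void unlock() {
        { std::lock_guard<std::mutex> g(m); released = true; }
        released_cv.notify_one();
      }
    };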
>>>>>>>
>>>>>>> With this approach, I'm not sure thread affinities would be of any
>>>>>>> value. I think perhaps each core should have its own list of ready-to-run
>>>>>>> Continuations, and a pool of event threads with affinity to that core.
>>>>>>> Not having per-event-thread ready-to-run lists means that a Continuation
>>>>>>> function that blocks is less likely to block other ready-to-run
>>>>>>> Continuations. If Continuations had core affinities to some degree, this
>>>>>>> might reduce evictions in per-core memory cache. (Multiple Continuations
>>>>>>> having the same function should have the same core affinity.)
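For the per-core pools, pinning each event thread in a core's pool to that
core could be done with pthread_setaffinity_np() (Linux-specific; a sketch,
not the existing ATS affinity code):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one CPU. A pool of threads all pinned to the
    // same core, draining that core's ready-to-run list, approximates the
    // per-core arrangement described above. (Build with g++, which defines
    // _GNU_SOURCE and so exposes cpu_set_t and the CPU_* macros.)
    void pin_to_core(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }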
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>