> On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com.INVALID>
> wrote:
>
> I feel like you overestimate me if you think I could create a rathole with
> just 30 lines of pseudo-code.
We have faith in you.
>
> We're making complex thread affinity changes on top of event execution
> code that already seems complex. (I leave it to those with more rathole
> background to say whether it should be so labeled.) Can I see the
> profiling results used to justify and guide the thread affinity design?
+1.
— leif
>
> On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
>
>>
>>
>>> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com.INVALID> wrote:
>>>
>>> What's the best tool for multi-threaded profiling on Linux?
>>
>> Probably the Intel tools. The “perf” toolchain also works well, at least
>> on modern Linux; you could do perf lock, etc.
>>
>>>
>>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
>>>
>>>> Correct, it doesn't mean no lock contention. The claim was that it
>>>> reduced the lock contention to a level where it's not significant enough
>>>> to warrant additional preventative measures. The data Dan got wasn't from
>>>> code analysis, but from run-time profiling. That was a while ago, so if
>>>> you'd like to perform another pass of measuring the level of lock
>>>> contention, that would certainly be interesting data.
>>
>>
>> So, before we dig ourselves into this rathole, can someone explain to me
>> what the problem is? Where are the metrics / analysis showing that we have
>> a problem? I’d love to see a comparison, too, between various versions of
>> ATS, say v6 - v9, and understand where, if anywhere, we made things (lock
>> contention) worse.
>>
>> We have to stop building complicated and invasive solutions without a real
>> problem, and without understanding that problem.
>>
>> My $0.01,
>>
>> — Leif
>>
>>>>
>>>> In addition, the push for thread affinity started from actual issues in
>>>> production with Continuations being scheduled on different threads of the
>>>> same type (that is, it was Kees' fault). Those would not be resolved by
>>>> faster scheduling on different threads.
>>>>
>>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com
>>>> .invalid>
>>>> wrote:
>>>>
>>>>> I assume thread affinity can't literally mean no lock contention. You'd
>>>>> need a lock on the thread run queue, wouldn't you? Continuations can't
>>>>> only get queued for the event thread from the event thread itself. I
>>>>> don't think we can say conclusively that there would be a significant
>>>>> difference due to lock contention. I'm guessing that Fei would agree
>>>>> that the Continuation dispatch code is difficult to understand and work
>>>>> on. Simplification and more modularity are obviously goals. It seems
>>>>> like it would be simpler if all the Continuations in a to-run list were
>>>>> actually ready to run.
>>>>>
>>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
>>>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
>>>>>
>>>>>> Do you have any more specific information on mutex contention? We have
>>>>>> in fact already looked at doing this, I think back in 2015 with Dan Xu.
>>>>>> The goal there was to have queues with the mutexes to avoid
>>>>>> rescheduling. As a precursor, Dan did some profiling, and the only
>>>>>> significant lock contention he could find was in the cache. That led to
>>>>>> the partial object caching work setting up queues for the hot locks,
>>>>>> but it was decided that, given the lack of contention elsewhere, it
>>>>>> wasn't worth the complexity.
>>>>>>
>>>>>> I think thread affinity is a better choice because no lock contention
>>>>>> will always beat even the most optimized lock contention resolution. If
>>>>>> Continuations related to the same constellation of data objects are on
>>>>>> the same thread, then the locks are never contested, which makes it as
>>>>>> fast as possible.
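
A minimal sketch of the dispatch rule being described here (plain C++, not the
actual ATS EThread API; the hash, pool size, and all names are assumptions made
for illustration):

    // Illustrative only: the rule "same connection -> same event thread".
    // If every Continuation touching a connection's state is dispatched this
    // way, that state is only ever accessed from one thread, so its mutex is
    // never contested.
    #include <cstdint>
    #include <cstdio>
    #include <functional>

    constexpr unsigned kNumEventThreads = 8;  // assumed pool size

    unsigned event_thread_for(uint64_t connection_id) {
        return static_cast<unsigned>(std::hash<uint64_t>{}(connection_id) % kNumEventThreads);
    }

    int main() {
        // Two Continuations belonging to connection 42 always land on the same
        // event thread, so they can never race on that connection's state.
        std::printf("read  handler -> event thread %u\n", event_thread_for(42));
        std::printf("write handler -> event thread %u\n", event_thread_for(42));
        return 0;
    }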
>>>>>>
>>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com
>>>>>> .invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> From the longer-term TSers I've heard comments about seeing profiling
>>>>>>> results that show that waiting on mutexes is a significant performance
>>>>>>> issue with TS. But I'm not aware of any write-ups of such results.
>>>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm not currently
>>>>>>> familiar with the best approaches to profiling TS.
>>>>>>>
>>>>>>> For better performance, I think that having a single to-run
>>>>>>> Continuation queue, or one per core, with each queue feeding multiple
>>>>>>> event threads, is the main thing. It's more resilient to Continuations
>>>>>>> that block. There doesn't seem to be enthusiasm for getting hard-core
>>>>>>> about not having blocking Continuations (see
>>>>>>> https://github.com/apache/trafficserver/pull/5412 ). I'm not sure
>>>>>>> changing to queue-based mutexes would have a significant performance
>>>>>>> impact. But it seems a cleaner design, making sure Continuations in the
>>>>>>> to-run list(s) are actually ready to run. A different mutex
>>>>>>> implementation is not strictly necessary, though, in order to
>>>>>>> consolidate the to-run Continuation queues.
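
A rough sketch of the shared to-run queue idea above (generic C++, not ATS
code; SharedRunQueue and everything in it are made-up names). Any idle worker
can take the next Continuation, so one that blocks ties up only the worker
running it rather than stalling a per-thread queue:

    // One shared ready-to-run queue drained by several event threads.
    #include <condition_variable>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class SharedRunQueue {
    public:
        explicit SharedRunQueue(unsigned nthreads) {
            for (unsigned i = 0; i < nthreads; ++i)
                workers_.emplace_back([this] { drain(); });
        }
        ~SharedRunQueue() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto &w : workers_) w.join();
        }
        // Any thread may schedule; any idle worker will pick it up.
        void schedule(std::function<void()> cont) {
            { std::lock_guard<std::mutex> lk(m_); ready_.push_back(std::move(cont)); }
            cv_.notify_one();
        }
    private:
        void drain() {
            for (;;) {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !ready_.empty(); });
                if (ready_.empty()) return;   // shutting down and fully drained
                auto cont = std::move(ready_.front());
                ready_.pop_front();
                lk.unlock();
                cont();                       // may block; other workers continue
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<std::function<void()>> ready_;
        bool done_ = false;
        std::vector<std::thread> workers_;
    };

    int main() {
        SharedRunQueue pool(4);
        for (int i = 0; i < 8; ++i)
            pool.schedule([i] { std::printf("continuation %d\n", i); });
    }   // destructor drains remaining work, then joins the workers

A per-core variant would simply be one of these queues per core, each with its
own small pool of workers.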
>>>>>>>
>>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <kspoels...@we-amp.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds very interesting.
>>>>>>>> But what is the problem we're trying to solve here? I like the thread
>>>>>>>> affinity because it gives us headache-free concurrency in some cases,
>>>>>>>> and I'll bet that there is some code which doesn't have the proper
>>>>>>>> continuation mutexes because we know it runs on the same thread.
>>>>>>>>
>>>>>>>> Are we seeing a lot of imbalanced threads (too much processing causing
>>>>>>>> long queues of continuations, which I can imagine in some cases)? And
>>>>>>>> shouldn't we balance based on transactions or connections, moving
>>>>>>>> those around when we see imbalance, and aim for embarrassingly
>>>>>>>> parallel processing? :) Come to think of it, this might introduce
>>>>>>>> another set of problems: how to know which continuations are part of
>>>>>>>> the life cycle of a connection :/
>>>>>>>>
>>>>>>>> Jumping threads within one transaction is not always ideal either;
>>>>>>>> this can really hurt performance. But your proposed model seems to
>>>>>>>> handle that somewhat better than the current implementation.
>>>>>>>>
>>>>>>>> Very interested, and wondering what this would mean for plugin
>>>>>>>> developers.
>>>>>>>>
>>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If a Continuation is scheduled but its mutex is locked, it's put in a
>>>>>>>>> queue specific to that mutex. The release function for the mutex
>>>>>>>>> (called when a Continuation holding the mutex exits) would move the
>>>>>>>>> Continuation at the front of the mutex's queue (if the queue is not
>>>>>>>>> empty) into the ready-to-run queue, transferring the lock to that
>>>>>>>>> Continuation. A drawback is that the queue would itself need a mutex
>>>>>>>>> (a spinlock?), but the critical section would be very short.
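
A sketch of that handoff in plain C++ (QueuedMutex, ReadyQueue, and the use of
std::function as a stand-in for a Continuation are all made up for
illustration; this is not the ATS ProxyMutex API):

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <utility>

    using Continuation = std::function<void()>;

    struct ReadyQueue {                   // stand-in for an event thread's run list
        std::mutex m;
        std::deque<Continuation> q;
        void push(Continuation c) { std::lock_guard<std::mutex> lk(m); q.push_back(std::move(c)); }
    };

    class QueuedMutex {
    public:
        // Returns true if the lock was granted immediately; otherwise the
        // continuation is parked on this mutex until the holder releases it.
        bool lock_or_park(Continuation cont) {
            std::lock_guard<std::mutex> lk(guard_);   // short critical section
            if (!held_) { held_ = true; return true; }
            waiters_.push_back(std::move(cont));
            return false;
        }
        // Called when the continuation holding the mutex finishes: the lock is
        // handed to the front waiter, which becomes ready to run.
        void release(ReadyQueue &ready) {
            Continuation next;
            {
                std::lock_guard<std::mutex> lk(guard_);
                if (waiters_.empty()) { held_ = false; return; }
                next = std::move(waiters_.front());   // held_ stays true: ownership moves
                waiters_.pop_front();
            }
            ready.push(std::move(next));
        }
    private:
        std::mutex guard_;                // the "spinlock" protecting the queue
        bool held_ = false;
        std::deque<Continuation> waiters_;
    };

    int main() {
        ReadyQueue ready;
        QueuedMutex m;
        m.lock_or_park([] {});            // first continuation takes the lock
        m.lock_or_park([] { std::puts("second continuation now holds the lock"); });
        m.release(ready);                 // hands the lock to the parked waiter
        for (auto &c : ready.q) c();      // the waiter runs, already owning the mutex
    }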
>>>>>>>>>
>>>>>>>>> There would be a function to lock a mutex directly. It would create
>>>>>>>>> a Continuation that had two condition variables. It would assign the
>>>>>>>>> mutex to this Continuation and schedule it. (In this case, it might
>>>>>>>>> make sense to put this Continuation at the front of the mutex's
>>>>>>>>> queue, since it would be blocking an entire event thread.) The
>>>>>>>>> direct-lock function would then block on the first condition
>>>>>>>>> variable. When the Continuation ran, it would trigger the first
>>>>>>>>> condition variable, and then block on the second condition variable.
>>>>>>>>> The direct-lock function would then exit, allowing the calling code
>>>>>>>>> to enter its critical section. At the end of the critical section,
>>>>>>>>> another function to release the direct lock would be called. It
>>>>>>>>> would trigger the second condition variable, which would cause the
>>>>>>>>> function of the Continuation created for the direct lock to exit
>>>>>>>>> (thus releasing the mutex).
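
A sketch of the two-condition-variable handshake using plain std:: primitives
(a std::thread stands in for the event thread running the proxy Continuation,
and a std::mutex for the Continuation mutex; all of the names are assumptions):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct DirectLock {
        std::mutex m;                      // protects the two flags below
        std::condition_variable acquired;  // first CV: "the proxy holds the lock"
        std::condition_variable released;  // second CV: "the caller is done"
        bool has_lock = false;
        bool done = false;
    };

    // The proxy continuation: runs on the "event thread", takes the target
    // mutex, signals the waiting caller, then parks until the caller is done.
    void proxy_continuation(std::mutex &target, DirectLock &dl) {
        std::lock_guard<std::mutex> hold(target);      // held for the whole body
        {
            std::lock_guard<std::mutex> lk(dl.m);
            dl.has_lock = true;
        }
        dl.acquired.notify_one();                      // wake direct_lock()
        std::unique_lock<std::mutex> lk(dl.m);
        dl.released.wait(lk, [&] { return dl.done; }); // wait for direct_unlock()
    }                                                  // leaving scope releases target

    void direct_lock(DirectLock &dl) {
        std::unique_lock<std::mutex> lk(dl.m);
        dl.acquired.wait(lk, [&] { return dl.has_lock; });
    }

    void direct_unlock(DirectLock &dl) {
        { std::lock_guard<std::mutex> lk(dl.m); dl.done = true; }
        dl.released.notify_one();
    }

    int main() {
        std::mutex target;                             // the Continuation mutex
        DirectLock dl;
        // Stand-in for scheduling the proxy continuation on an event thread.
        std::thread event_thread(proxy_continuation, std::ref(target), std::ref(dl));

        direct_lock(dl);                               // blocks until the proxy owns target
        std::puts("critical section: target mutex is held on our behalf");
        direct_unlock(dl);                             // lets the proxy return and unlock

        event_thread.join();
    }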
>>>>>>>>>
>>>>>>>>> With this approach, I'm not sure thread affinities would be of any
>>>>>>>>> value. I think perhaps each core should have its own list of
>>>>>>>>> ready-to-run Continuations, and a pool of event threads with affinity
>>>>>>>>> to that core. Not having per-event-thread ready-to-run lists means
>>>>>>>>> that a Continuation function that blocks is less likely to block
>>>>>>>>> other ready-to-run Continuations. If Continuations had core
>>>>>>>>> affinities to some degree, this might reduce evictions in the
>>>>>>>>> per-core memory cache. (Multiple Continuations having the same
>>>>>>>>> function should have the same core affinity.)
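
A sketch of pinning a pool of event threads to one core on Linux (this uses
glibc's pthread_setaffinity_np; the pool size and core choice are arbitrary
assumptions, and the worker body is just a placeholder for draining that
core's ready-to-run list):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <chrono>
    #include <thread>
    #include <vector>

    // Pin thread `t` to `core`; returns true on success.
    bool pin_to_core(std::thread &t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
    }

    int main() {
        const int core = 0;               // all three workers share core 0
        std::vector<std::thread> pool;
        for (int i = 0; i < 3; ++i) {
            pool.emplace_back([] {
                // Stand-in for draining this core's ready-to-run list.
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            });
            pin_to_core(pool.back(), core);
        }
        for (auto &t : pool) t.join();
    }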
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>