On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <kspoels...@we-amp.com> wrote:

> I think we're all able to dig some ratholes with less code than 30 lines.
>

Hmm, I thought "rathole" implied unneeded complexity, but it seems to be
just a synonym for "bad".


>
> Alan, it's only my fault that I pushed for thread affinity for plugins
> after I saw how it was in core... Can't keep all the goodies for the core
> :)
>
> One thing to watch out for is in-order execution of continuations. As I've
> mentioned earlier, there is a real possibility that there is code which does
> not properly lock (or does not need to) and might end up executing
> continuations at the same time instead of in order when running
> concurrently.
> Sometimes, to get out of the current executing context, which might be deep
> in the stack (x locks down and before some cleanup), you will schedule
> another continuation on the same thread, which can give you a cleaner
> context.
> Imagine the scheduler picking that up and running it concurrently on
> another thread on the same core; that might yield weird results.
> Multiple threads might break that implicit thread safety.
>

When one Continuation schedules another, it could go into a queue of
Continuations that become runnable after the current Continuation exits.
So the pseudo-code becomes:

bool idle[Num_workers_per_core];
Condition_variable go[Num_workers_per_core];
Continuation *curr[Num_workers_per_core];
Continuation_queue after[Num_workers_per_core]; // Run after current continuation.

while (true)
{
  blocking get on message queue;
  switch (msg.type)
  {
    case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
      while (after[msg.worker_index] is not empty) {
        c = pop from after[msg.worker_index];
        push c to run queue;
      }
      if (run queue empty) {
        idle[msg.worker_index] = true;
      } else {
        curr[msg.worker_index] = dequeue from run queue;
        trigger go[msg.worker_index];
      }
    break;

    case QUEUE_CONTINUATION_TO_RUN:
      run_index = index where idle[index] is true or none;
      if (run_index == none) {
        enqueue msg.continuation in run queue;
      } else {
        idle[run_index] = false;
        curr[run_index] = msg.continuation;
        trigger go[run_index];
      }
    break;

  } // end switch
} // end while
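
For concreteness, here is a rough compilable sketch of that loop (my own
sketch, not anything in the ATS tree): one dispatcher per core, a small
worker pool, and the per-worker "after" queues.  Continuation is reduced to
std::function<void()>, and the QUEUE_AFTER_CURRENT message type is something
I'm inventing here as the way a running Continuation defers its successors;
shutdown is handled only crudely.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

using Continuation = std::function<void()>;
constexpr int Num_workers_per_core = 2;

// Messages to a core's dispatcher; QUEUE_AFTER_CURRENT is invented here for deferring successors.
enum MsgType { WORKER_THREAD_READY_FOR_NEXT, QUEUE_CONTINUATION_TO_RUN, QUEUE_AFTER_CURRENT };
struct Msg { MsgType type; int worker_index; Continuation cont; };

std::mutex              mq_mtx;
std::condition_variable mq_cv;
std::deque<Msg>         msg_queue;

void post(Msg m)  // Any thread may post a message to this core's dispatcher.
{
  { std::lock_guard<std::mutex> lk(mq_mtx); msg_queue.push_back(std::move(m)); }
  mq_cv.notify_one();
}

// Per-worker hand-off slot: the dispatcher fills curr[] and signals go[].
std::mutex              go_mtx[Num_workers_per_core];
std::condition_variable go[Num_workers_per_core];
Continuation            curr[Num_workers_per_core];
thread_local int        my_worker = -1;  // Index of the worker running on this thread.

void worker(int idx)
{
  my_worker = idx;
  for (;;) {
    post({WORKER_THREAD_READY_FOR_NEXT, idx, nullptr});
    std::unique_lock<std::mutex> lk(go_mtx[idx]);
    go[idx].wait(lk, [idx] { return static_cast<bool>(curr[idx]); });
    Continuation c = std::move(curr[idx]);
    curr[idx] = nullptr;
    lk.unlock();
    c();  // May post QUEUE_AFTER_CURRENT to defer follow-up Continuations.
  }
}

void dispatcher()
{
  bool                     idle[Num_workers_per_core] = {};
  std::deque<Continuation> after[Num_workers_per_core];  // Run after current continuation.
  std::deque<Continuation> run_queue;
  auto hand_to = [](int idx, Continuation c) {  // Give c to worker idx and wake it.
    { std::lock_guard<std::mutex> lk(go_mtx[idx]); curr[idx] = std::move(c); }
    go[idx].notify_one();
  };
  for (;;) {
    std::unique_lock<std::mutex> lk(mq_mtx);
    mq_cv.wait(lk, [] { return !msg_queue.empty(); });
    Msg msg = std::move(msg_queue.front());
    msg_queue.pop_front();
    lk.unlock();
    switch (msg.type) {
    case WORKER_THREAD_READY_FOR_NEXT: {  // That worker's Continuation has exited.
      auto &aq = after[msg.worker_index];
      while (!aq.empty()) { run_queue.push_back(std::move(aq.front())); aq.pop_front(); }
      if (run_queue.empty()) {
        idle[msg.worker_index] = true;
      } else {
        hand_to(msg.worker_index, std::move(run_queue.front()));
        run_queue.pop_front();
      }
      break;
    }
    case QUEUE_CONTINUATION_TO_RUN: {  // New work for this core.
      int run_index = -1;
      for (int i = 0; i < Num_workers_per_core; ++i)
        if (idle[i]) { run_index = i; break; }
      if (run_index < 0) {
        run_queue.push_back(std::move(msg.cont));
      } else {
        idle[run_index] = false;
        hand_to(run_index, std::move(msg.cont));
      }
      break;
    }
    case QUEUE_AFTER_CURRENT:  // Deferred by the Continuation running on worker_index.
      after[msg.worker_index].push_back(std::move(msg.cont));
      break;
    }
  }
}

int main()
{
  std::thread(dispatcher).detach();
  for (int i = 0; i < Num_workers_per_core; ++i) std::thread(worker, i).detach();
  post({QUEUE_CONTINUATION_TO_RUN, 0, [] {
    post({QUEUE_AFTER_CURRENT, my_worker, [] { std::puts("deferred successor ran"); }});
  }});
  std::this_thread::sleep_for(std::chrono::seconds(1));  // Crude; a real system would shut down cleanly.
}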

On the other hand, it's not that hard for a specific Continuation to avoid
triggering successive Continuations until they are actually ready to run.  It
would simply need its own "after" queue.  We should be selective
about how many capabilities we pull into the core Event execution code, and
how much we damage general behavior by doing so.
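
A minimal sketch of that alternative (again just an illustration; schedule()
and DeferringCont are names I'm making up, with schedule() standing in for
however a Continuation normally gets onto the run queue):

#include <functional>
#include <utility>
#include <vector>

using Continuation = std::function<void()>;

// Stand-in for putting a Continuation on the event system's ordinary run queue.
void schedule(Continuation c) { c(); }  // Stub for this sketch: just runs it.

// A Continuation type that holds back its successors until its own handler returns.
class DeferringCont
{
public:
  void defer(Continuation c) { after_.push_back(std::move(c)); }

  void handle_event()
  {
    // ... the actual event work goes here, possibly calling defer() ...
    for (auto &c : after_) {  // Successors become runnable only now.
      schedule(std::move(c));
    }
    after_.clear();
  }

private:
  std::vector<Continuation> after_;  // This Continuation's private "after" queue.
};

int main()
{
  DeferringCont dc;
  dc.defer([] {});  // In real use, defer() would be called from inside handle_event().
  dc.handle_event();
}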


>
> Locking mostly gets/got screwy in the plugins because of possible
> thread/core jumping; the plugin affinity functions help a lot with that.
> There are patterns to prevent it (try-lock), but they're a bit of a pain
> and I've seen people ignore them.
>
> Some findings we had when doing h2: thread affinity might not always yield
> the best latency on an almost idle system. If the protocols on both sides
> are relatively heavyweight, the packing and unpacking will take time, and
> one can imagine there is some parallelism which could decrease latency.
> We've thought about controlled handover to other threads, but never made
> any work of it.
>
> I can imagine that with QUIC coming we might see some thread jumping
> again; I haven't been following progress on that one.
>
> On Wed, 2 Oct 2019, 21:33 Walt Karas, <wka...@verizonmedia.com.invalid>
> wrote:
>
>> On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:
>>
>> >
>> >
>> > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wka...@verizonmedia.com
>> .INVALID>
>> > wrote:
>> > >
>> > > I feel like you overestimate me if you think I could create a rathole
>> > with
>> > > just 30 lines of pseudo-code.
>> >
>> > We have faith in you.
>> >
>>
>> What's an example of a rathole I've already created?  Seems like a very
>> big
>> leap of faith if there isn't one.
>>
>> Question:  Why do we have more event threads than cores?
>>
>> If it's to mitigate the fact that ATS is not purely event-driven, that
>> Continuations sometimes do block, it makes sense to use the best design to
>> mitigate this.  Sorry but I don't think any big profiling hoedown is
>> needed
>> to see that what I'm proposing mitigates it better.  It's essentially the
>> same idea as having a single line to pay even when there are multiple
>> cashiers.
>>
>>
>> >
>> > >
>> > > We're making complex thread affinity changes, on top of an event
>> > execution
>> > > code that already seems complex.  (I leave it to those with more
>> rathole
>> > > background to say if it should be so labeled.)  Can I see the
>> profiling
>> > > results used to justify and guide the thread affinity design?
>> >
>> >
>> > +1.
>> >
>> > — leif
>> >
>> > >
>> > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org>
>> wrote:
>> > >
>> > >>
>> > >>
>> > >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <wka...@verizonmedia.com
>> > .INVALID>
>> > >> wrote:
>> > >>>
>> > >>> What's the best tool for multi-threaded profiling on Linux?
>> > >>
>> > >> Probably the Intel tools.  Probably the “perf” toolchain works well,
>> at
>> > >> least on modern linux, you could do perf lock etc.
>> > >>
>> > >>>
>> > >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
>> > >>> <solidwallofc...@verizonmedia.com.invalid> wrote:
>> > >>>
>> > >>>> Correct, it doesn't mean no lock contention. The claim was it
>> reduced
>> > >> the
>> > >>>> lock contention to a level where it's not significant enough to
>> > warrant
>> > >>>> additional preventative measures. The data Dan got wasn't from code
>> > >>>> analysis, but from run time profiling. That was a while ago so if
>> > you'd
>> > >>>> like to perform another pass of measuring the level of lock
>> > contention,
>> > >>>> that would certainly be interesting data.
>> > >>
>> > >>
>> > >> So, before we dig ourselves into this rathole, can someone explain
>> to me
>> > >> what the problem is? Where are the metrics / analysis showing that we
>> > have
>> > >> a problem? I’d love to see a comparison too between various versions
>> of
>> > ATS,
>> > >> say v6 - v9, and understand where, if anywhere, we made things (lock
>> > >> contention) worse.
>> > >>
>> > >> We have to stop making complicated and invasive solutions without
>> real
>> > >> problems, and without understanding such problems.
>> > >>
>> > >> My $0.01,
>> > >>
>> > >> — Leif
>> > >>
>> > >>>>
>> > >>>> In addition, the push for thread affinity started from actual
>> issues
>> > in
>> > >>>> production with Continuations being scheduled on different threads
>> of
>> > >> the
>> > >>>> same type (that is, it was Kees' fault). Those would not be
>> resolved
>> > by
>> > >>>> faster scheduling on different threads.
>> > >>>>
>> > >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <
>> wka...@verizonmedia.com
>> > >>>> .invalid>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> I assume thread affinity can't literally mean no lock contention.
>> > You'd
>> > >>>> need
>> > >>>>> a lock on the thread run queue wouldn't you?  Continuations can't
>> > only
>> > >>>> get
>> > >>>>> queued for the event thread from the event thread.  I don't think
>> we
>> > >> can
>> > >>>>> say conclusively that there would be a significant difference due
>> to
>> > >> lock
>> > >>>>> contention.  I'm guessing that Fei would agree that the
>> Continuation
>> > >>>>> dispatch code is difficult to understand and work on.
>> Simplification
>> > >> and
>> > >>>>> more modularity is obviously a goal.  Seems like it would be
>> simpler
>> > if
>> > >>>> all
>> > >>>>> the Continuations in a to-run list were actually ready to run.
>> > >>>>>
>> > >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
>> > >>>>> <solidwallofc...@verizonmedia.com.invalid> wrote:
>> > >>>>>
>> > >>>>>> Do you have any more specific information on mutex contention? We
>> > have
>> > >>>> in
>> > >>>>>> fact already looked at doing this, I think back in 2015 with Dan
>> Xu.
>> > >>>> The
>> > >>>>>> goal there was to have queues with the mutexes to avoid
>> > rescheduling.
>> > >>>> As
>> > >>>>> a
>> > >>>>>> precursor Dan did some profiling and the only significant lock
>> > >>>> contention
>> > >>>>>> he could find was in the cache. That led to the partial object
>> > >> caching
>> > >>>>>> work setting up queues for the hot locks, but it was decided that
>> > >> given
>> > >>>>> the
>> > >>>>>> lack of
>> > >>>>>> contention elsewhere, it wasn't worth the complexity.
>> > >>>>>>
>> > >>>>>> I think thread affinity is a better choice because no lock
>> > contention
>> > >>>>> will
>> > >>>>>> always beat even the most optimized lock contention resolution.
>> If
>> > >>>>>> Continuations related to the same constellation of data objects
>> are
>> > on
>> > >>>>> the
>> > >>>>>> same thread, then the locks are never contested, which makes it
>> as
>> > >> fast
>> > >>>>> as
>> > >>>>>> possible.
>> > >>>>>>
>> > >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <
>> wka...@verizonmedia.com
>> > >>>>>> .invalid>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> From the longer-term TSers I've heard comments about seeing
>> > profiling
>> > >>>>>>> results that show that waiting on mutexes is a significant
>> > >>>> performance
>> > >>>>>>> issue with TS.  But I'm not aware of any write-ups of such
>> results.
>> > >>>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm not
>> > >>>> currently
>> > >>>>>>> familiar with the best approaches to profiling TS.
>> > >>>>>>>
>> > >>>>>>> For better performance, I think that having a single to-run
>> > >>>>> Continuation
>> > >>>>>>> queue, or one per core, with a queue feeding multiple event
>> threads
>> > >>>> is
>> > >>>>>> the
>> > >>>>>>> main thing.  It's more resilient to Continuations that block.
>> > There
>> > >>>>>>> doesn't seem to be enthusiasm for getting hard-core about not
>> > having
>> > >>>>>>> blocking Continuations (see
>> > >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm not
>> sure
>> > >>>>>>> changing
>> > >>>>>>> to queue-based mutexes would have a significant performance
>> impact.
>> > >>>>> But
>> > >>>>>> it
>> > >>>>>>> seems a cleaner design, making sure Continuations in the to-run
>> > >>>> list(s)
>> > >>>>>> are
>> > >>>>>>> actually ready to run.  But a different mutex implementation is
>> not
>> > >>>>>>> strictly necessary in order to consolidate to-run Continuation
>> > >>>> queues.
>> > >>>>>>>
>> > >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
>> > >>>> kspoels...@we-amp.com>
>> > >>>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>>> Sounds very interesting.
>> > >>>>>>>> But what is the problem we're trying to solve here? I like the
>> > >>>> thread
>> > >>>>>>>> affinity because it gives us headache-free concurrency in some
>> > >>>>> cases,
>> > >>>>>>> and
>> > >>>>>>>> I'll bet that there is some code which doesn't have the proper
>> > >>>>>>> continuation
>> > >>>>>>>> mutexes because we know it runs on the same thread.
>> > >>>>>>>>
>> > >>>>>>>> Are we seeing a lot of imbalanced threads (too much processing
>> > >>>>> causing
>> > >>>>>>> long
>> > >>>>>>>> queues of continuations, which I can imagine in some cases) ?
>> And
>> > >>>>>>> shouldn't
>> > >>>>>>>> we balance based on transactions or connections, move those
>> around
>> > >>>>> when
>> > >>>>>>> we
>> > >>>>>>>> see imbalance and aim for embarrassingly parallel processing :)
>> > >>>> Come
>> > >>>>>>>> to think of it, this might introduce another set of problems,
>> how
>> > >>>> to
>> > >>>>>> know
>> > >>>>>>>> which continuations are part of the life cycle of a connection
>> :/
>> > >>>>>>>>
>> > >>>>>>>> Jumping threads in one transaction is not always ideal either;
>> > this
>> > >>>>> can
>> > >>>>>>>> really hurt performance. But your proposed model seems to
>> handle
>> > >>>> that
>> > >>>>>>>> somewhat better than the current implementation.
>> > >>>>>>>>
>> > >>>>>>>> Very interested and wondering what this would mean for plugin
>> > >>>>>> developers.
>> > >>>>>>>>
>> > >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <
>> wka...@verizonmedia.com
>> > >>>>>> .invalid>
>> > >>>>>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's
>> put
>> > >>>>> in
>> > >>>>>> a
>> > >>>>>>>>> queue specific to that mutex.  The release function for the
>> mutex
>> > >>>>>>> (called
>> > >>>>>>>>> when a Continuation holding the mutex exits) would put the
>> > >>>>>>> Continuation
>> > >>>>>>>> at
>> > >>>>>>>>> the front of the mutex's queue (if not empty) into the
>> > >>>> ready-to-run
>> > >>>>>>> queue
>> > >>>>>>>>> (transferring the lock to that Continuation).  A drawback is
>> that
>> > >>>>> the
>> > >>>>>>>> queue
>> > >>>>>>>>> would itself need a mutex (spinlock?), but the critical
>> section
>> > >>>>> would
>> > >>>>>>> be
>> > >>>>>>>>> very short.
>> > >>>>>>>>>
>> > >>>>>>>>> There would be a function to lock a mutex directly.  It would
>> > >>>>> create
>> > >>>>>> a
>> > >>>>>>>>> Continuation that had two condition variables.  It would
>> assign
>> > >>>> the
>> > >>>>>>> mutex
>> > >>>>>>>>> to this Continuation and schedule it.  (In this case, it might
>> > >>>> make
>> > >>>>>>> sense
>> > >>>>>>>>> to put this Continuation at the front of the mutex's queue,
>> since
>> > >>>>> it
>> > >>>>>>>> would
>> > >>>>>>>>> be blocking an entire event thread.)  The direct-lock function
>> > >>>>> would
>> > >>>>>>> then
>> > >>>>>>>>> block on the first condition variable.  When the Continuation
>> > >>>> ran,
>> > >>>>> it
>> > >>>>>>>> would
>> > >>>>>>>>> trigger the first condition variable, and then block on the
>> > >>>> second
>> > >>>>>>>>> condition variable.  The direct-lock function would then exit,
>> > >>>>>> allowing
>> > >>>>>>>> the
>> > >>>>>>>>> calling code to enter its critical section.  At the end of the
>> > >>>>>> critical
>> > >>>>>>>>> section, another function to release the direct lock would be
>> > >>>>> called.
>> > >>>>>>> It
>> > >>>>>>>>> would trigger the second condition variable, which would cause
>> > >>>> the
>> > >>>>>>>> function
>> > >>>>>>>>> of the Continuation created for the direct lock to exit (thus
>> > >>>>>> releasing
>> > >>>>>>>> the
>> > >>>>>>>>> mutex).
>> > >>>>>>>>>
>> > >>>>>>>>> With this approach, I'm not sure thread affinities would be of
>> > >>>> any
>> > >>>>>>> value.
>> > >>>>>>>>> I think perhaps each core should have its own list of
>> > >>>> ready-to-run
>> > >>>>>>>>> Continuations, and a pool of event threads with affinity to
>> that
>> > >>>>>> core.
>> > >>>>>>>> Not
>> > >>>>>>>>> having per-event-thread ready-to-run lists means that a
>> > >>>>> Continuation
>> > >>>>>>>>> function that blocks is less likely to block other
>> ready-to-run
>> > >>>>>>>>> Continuations.  If Continuations had core affinities to some
>> > >>>>> degree,
>> > >>>>>>> this
>> > >>>>>>>>> might reduce evictions in per-core memory cache.  (Multiple
>> > >>>>>>> Continuations
>> > >>>>>>>>> having the same function should have the same core affinity.)
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>
>> > >>
>> > >>
>> >
>> >
>>
>
