The difficulty with profiling is covering all the important traffic mixes
and plugin usage (impossible in practice because of proprietary plugins).
It seems desirable to have a design that's robust even when Continuations
block more than they ideally should.  Having a single run queue per core,
feeding N worker event threads on that core, goes a long way toward
preventing a blocked Continuation from keeping other Continuations from
starting (which is what drives up worst-case latency).  Each core could
have a supervisor thread with this pseudo-code:

bool idle[Num_workers_per_core];
Condition_variable go[Num_workers_per_core];
Continuation *curr[Num_workers_per_core];

while (true)
{
  blocking get on message queue;
  switch (msg.type)
  {
    case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
      if (run queue empty) {
        idle[msg.worker_index] = true;
      } else {
        curr[msg.worker_index] = dequeue from run queue;
        trigger go[msg.worker_index];
      }
    break;

    case QUEUE_CONTINUATION_TO_RUN:
      run_index = index where idle[index] is true or none;
      if (run_index == none) {
        enqueue msg.continuation in run queue;
      } else {
        idle[run_index] = false;
        curr[run_index] = msg.continuation;
        trigger go[run_index];
      }
    break;

  } // end switch
} // end while
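
To make that more concrete, here is a minimal, compilable C++ sketch of the
same supervisor/worker structure.  It is only an illustration of the idea,
not ATS code: Continuation is a std::function stand-in, the supervisor's
message queue is a plain std::queue guarded by a mutex, and names such as
post() and hand_off() are invented for the sketch.

// Minimal, compilable sketch with simplified stand-ins, not actual ATS code.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Continuation = std::function<void()>;  // stand-in for the real type

constexpr int Num_workers_per_core = 4;

enum MsgType { WORKER_THREAD_READY_FOR_NEXT, QUEUE_CONTINUATION_TO_RUN };

struct Msg { MsgType type; int worker_index; Continuation continuation; };

// The supervisor's inbound message queue (one supervisor per core).
std::mutex msg_mtx;
std::condition_variable msg_cv;
std::queue<Msg> msg_queue;

void post(Msg m) {
  { std::lock_guard<std::mutex> g(msg_mtx); msg_queue.push(std::move(m)); }
  msg_cv.notify_one();
}

// Per-worker handoff slots ("go" and "curr" from the pseudo-code).
std::mutex go_mtx[Num_workers_per_core];
std::condition_variable go[Num_workers_per_core];
Continuation curr[Num_workers_per_core];

void worker(int i) {
  while (true) {
    post({WORKER_THREAD_READY_FOR_NEXT, i, nullptr});  // report ready
    std::unique_lock<std::mutex> lk(go_mtx[i]);
    go[i].wait(lk, [&] { return static_cast<bool>(curr[i]); });
    Continuation c = std::move(curr[i]);
    curr[i] = nullptr;
    lk.unlock();
    c();  // run the Continuation; blocking here stalls only this worker
  }
}

void supervisor() {
  bool idle[Num_workers_per_core] = {};
  std::queue<Continuation> run_queue;  // the per-core run queue
  auto hand_off = [&](int w, Continuation c) {
    { std::lock_guard<std::mutex> g(go_mtx[w]); curr[w] = std::move(c); }
    go[w].notify_one();
  };
  while (true) {
    std::unique_lock<std::mutex> lk(msg_mtx);
    msg_cv.wait(lk, [] { return !msg_queue.empty(); });
    Msg msg = std::move(msg_queue.front());
    msg_queue.pop();
    lk.unlock();
    switch (msg.type) {
      case WORKER_THREAD_READY_FOR_NEXT:  // sent by a worker thread
        if (run_queue.empty()) {
          idle[msg.worker_index] = true;
        } else {
          hand_off(msg.worker_index, std::move(run_queue.front()));
          run_queue.pop();
        }
        break;
      case QUEUE_CONTINUATION_TO_RUN: {
        int w = -1;
        for (int i = 0; i < Num_workers_per_core; ++i) {
          if (idle[i]) { w = i; break; }
        }
        if (w < 0) {
          run_queue.push(std::move(msg.continuation));
        } else {
          idle[w] = false;
          hand_off(w, std::move(msg.continuation));
        }
        break;
      }
    }
  }
}

int main() {
  std::vector<std::thread> pool;
  pool.emplace_back(supervisor);
  for (int i = 0; i < Num_workers_per_core; ++i) pool.emplace_back(worker, i);
  post({QUEUE_CONTINUATION_TO_RUN, 0, [] { std::puts("continuation ran"); }});
  for (auto &t : pool) t.join();  // the sketch runs forever
}

The key property is the same as in the pseudo-code: a Continuation that
blocks ties up only the worker running it, while the supervisor keeps
handing queued Continuations to whichever workers are idle.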

On Wed, Oct 2, 2019 at 12:23 PM Walt Karas <wka...@verizonmedia.com> wrote:

> What's the best tool for multi-threaded profiling on Linux?
>
> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> <solidwallofc...@verizonmedia.com.invalid> wrote:
>
>> Correct, it doesn't mean no lock contention. The claim was it reduced the
>> lock contention to a level where it's not significant enough to warrant
>> additional preventative measures. The data Dan got wasn't from code
>> analysis, but from run time profiling. That was a while ago so if you'd
>> like to perform another pass of measuring the level of lock contention,
>> that would certainly be interesting data.
>>
>> In addition, the push for thread affinity started from actual issues in
>> production with Continuations being scheduled on different threads of the
>> same type (that is, it was Kees' fault). Those would not be resolved by
>> faster scheduling on different threads.
>>
>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wka...@verizonmedia.com
>> .invalid>
>> wrote:
>>
>> > I assume thread affinity can't literally mean no lock contention.  You'd
>> need
>> > a lock on the thread run queue wouldn't you?  Continuations can't only
>> get
>> > queued for the event thread from the event thread.  I don't think we can
>> > say conclusively that there would be a significant difference due to
>> lock
>> > contention.  I'm guessing that Fei would agree that the Continuation
>> > dispatch code is difficult to understand and work on.  Simplification
>> and
>> > more modularity is obviously a goal.  Seems like it would be simpler if
>> all
>> > the Continuations in a to-run list were actually ready to run.
>> >
>> > On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
>> > <solidwallofc...@verizonmedia.com.invalid> wrote:
>> >
>> > > Do you have any more specific information on mutex contention? We
>> have in
>> > > fact already looked at doing this, I think back in 2015 with Dan Xu.
>> The
>> > > goal there was to have queues with the mutexes to avoid rescheduling.
>> As
>> > a
>> > > precursor Dan did some profiling and the only significant lock
>> contention
>> > > he could find was in the cache. That led to the partial object
>> caching
>> > > work setting up queues for the hot locks, but it was decided that
>> given
>> > the
>> > > lack of
>> > > contention elsewhere, it wasn't worth the complexity.
>> > >
>> > > I think thread affinity is a better choice because no lock contention
>> > will
>> > > always beat even the most optimized lock contention resolution. If
>> > > Continuations related to the same constellation of data objects are on
>> > the
>> > > same thread, then the locks are never contested, which makes it as
>> fast
>> > as
>> > > possible.
>> > >
>> > > On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wka...@verizonmedia.com
>> > > .invalid>
>> > > wrote:
>> > >
>> > > > From the longer-term TSers I've heard comments about seeing
>> profiling
>> > > > results that show that waiting on mutexes is a significant
>> performance
>> > > > issue with TS.  But I'm not aware of any write-ups of such results.
>> > > > Unfortunately, I'm relatively new to TS and Linux, so I'm not
>> currently
>> > > > familiar with the best approaches to profiling TS.
>> > > >
>> > > > For better performance, I think that having a single to-run
>> > Continuation
>> > > > queue, or one per core, with a queue feeding multiple event threads
>> is
>> > > the
>> > > > main thing.  It's more resilient to Continuations that block.  There
>> > > > doesn't seem to be enthusiasm for getting hard-core about not having
>> > > > blocking Continuations (see
>> > > > https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
>> > > > changing
>> > > > to queue-based mutexes would have a significant performance impact.
>> > But
>> > > it
>> > > > seems a cleaner design, making sure Continuations in the to-run
>> list(s)
>> > > are
>> > > > actually ready to run.  But a different mutex implementation is not
>> > > > strictly necessary in order to consolidate to-run Continuation
>> queues.
>> > > >
>> > > > On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
>> kspoels...@we-amp.com>
>> > > > wrote:
>> > > >
>> > > > > Sounds very interesting.
>> > > > > But what is the problem we're trying to solve here, I like the
>> thread
>> > > > affinity because it gives us headache-free concurrency in some
>> > cases,
>> > > > and
>> > > > > I'll bet that there is some code which doesn't have the proper
>> > > > continuation
>> > > > > mutexes because we know it runs on the same thread.
>> > > > >
>> > > > > Are we seeing a lot of imbalanced threads (too much processing
>> > causing
>> > > > long
>> > > > > queues of continuations, which I can imagine in some cases) ? And
>> > > > shouldn't
>> > > > > we balance based on transactions or connections, move those around
>> > when
>> > > > we
>> > > > > see imbalance and aim for embarrassingly parallel processing :)
>> Come
>> > > > > to think of it, this might introduce another set of problems, how
>> to
>> > > know
>> > > > > which continuations are part of the life cycle of a connection :/
>> > > > >
>> > > > > Jumping threads in one transaction is not always ideal either,
>> this
>> > can
>> > > > > really hurt performance. But your proposed model seems to handle
>> that
>> > > > > somewhat better than the current implementation.
>> > > > >
>> > > > > Very interested and wondering what this would mean for plugin
>> > > developers.
>> > > > >
>> > > > > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wka...@verizonmedia.com
>> > > .invalid>
>> > > > > wrote:
>> > > > >
>> > > > > > If a Continuation is scheduled, but its mutex is locked, it's
>> put
>> > in
>> > > a
>> > > > > > queue specific to that mutex.  The release function for the
>> mutex
>> > > > (called
>> > > > > when a Continuation holding the mutex exits) would put the
>> > > > Continuation
>> > > > > at
>> > > > > > the front of the mutex's queue (if not empty) into the
>> ready-to-run
>> > > > queue
>> > > > > > (transferring the lock to that Continuation).  A drawback is
>> that
>> > the
>> > > > > queue
>> > > > > > would itself need a mutex (spinlock?), but the critical section
>> > would
>> > > > be
>> > > > > > very short.
>> > > > > >
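
[A rough, hypothetical sketch of the per-mutex wait queue described in the
quoted paragraph above.  ContMutex, enqueue_ready_to_run, and the other
names are invented stand-ins, not actual ATS types or APIs:]

// Rough sketch only: hypothetical names, not actual ATS types or APIs.
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>

using Continuation = std::function<void()>;  // stand-in for the real type

std::deque<Continuation> ready_to_run;       // stand-in for a per-core run queue

void enqueue_ready_to_run(Continuation c) { ready_to_run.push_back(std::move(c)); }

struct ContMutex {
  std::mutex guard;                  // protects the short critical sections below
  bool held = false;
  std::deque<Continuation> waiters;  // Continuations parked on this mutex

  // Called when a Continuation is scheduled.  Returns true if the mutex was
  // free, so the Continuation now holds it and can go on the run queue.
  bool acquire_or_enqueue(Continuation c) {
    std::lock_guard<std::mutex> g(guard);
    if (!held) {
      held = true;
      return true;
    }
    waiters.push_back(std::move(c));  // parked here instead of being rescheduled
    return false;
  }

  // Called when the Continuation holding the mutex exits.  Ownership passes
  // directly to the first waiter, which becomes ready to run.
  void release() {
    Continuation next;
    {
      std::lock_guard<std::mutex> g(guard);
      if (waiters.empty()) {
        held = false;
        return;
      }
      next = std::move(waiters.front());
      waiters.pop_front();
    }
    enqueue_ready_to_run(std::move(next));
  }
};

int main() {
  ContMutex m;
  Continuation a = [] { std::puts("A ran"); };
  Continuation b = [] { std::puts("B ran"); };
  if (m.acquire_or_enqueue(a)) enqueue_ready_to_run(a);  // mutex free: A is ready
  if (m.acquire_or_enqueue(b)) enqueue_ready_to_run(b);  // held: B parks on m
  ready_to_run.front()();  // A runs ...
  ready_to_run.pop_front();
  m.release();             // ... and exits, handing the mutex to B
  ready_to_run.front()();  // B runs, already holding the mutex
}
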
>> > > > > > There would be a function to lock a mutex directly.  It would
>> > create
>> > > a
>> > > > > > Continuation that had two condition variables.  It would assign
>> the
>> > > > mutex
>> > > > > > to this Continuation and schedule it.  (In this case, it might
>> make
>> > > > sense
>> > > > > > to put this Continuation at the front of the mutex's queue,
>> since
>> > it
>> > > > > would
>> > > > > > be blocking an entire event thread.)  The direct-lock function
>> > would
>> > > > then
>> > > > > > block on the first condition variable.  When the Continuation
>> ran,
>> > it
>> > > > > would
>> > > > > > trigger the first condition variable, and then block on the
>> second
>> > > > > > condition variable.  The direct-lock function would then exit,
>> > > allowing
>> > > > > the
>> > > > > > calling code to enter its critical section.  At the end of the
>> > > critical
>> > > > > > section, another function to release the direct lock would be
>> > called.
>> > > > It
>> > > > > > would trigger the second condition variable, which would cause
>> the
>> > > > > function
>> > > > > > of the Continuation created for the direct lock to exit (thus
>> > > releasing
>> > > > > the
>> > > > > > mutex).
>> > > > > >
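
[Likewise, a rough, hypothetical sketch of the two-condition-variable
direct-lock handshake described in the quoted paragraph above; the call
that would schedule the helper Continuation with the target mutex is only
indicated in a comment:]

// Rough sketch only: hypothetical names; the real scheduling call is
// indicated in a comment rather than implemented.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct DirectLock {
  std::mutex m;
  std::condition_variable cv1, cv2;  // the two condition variables
  bool locked = false;               // set once the helper Continuation runs
  bool done = false;                 // set when the caller releases the lock
};

// Body of the helper Continuation.  The scheduler would run it while holding
// the target ContMutex, so blocking here keeps that mutex held for the caller.
void direct_lock_continuation(DirectLock &dl) {
  std::unique_lock<std::mutex> lk(dl.m);
  dl.locked = true;
  dl.cv1.notify_one();                       // wake the direct-lock caller
  dl.cv2.wait(lk, [&] { return dl.done; });  // block until the caller is done
}                                            // returning releases the ContMutex

// Called by code that wants the ContMutex directly.  In real code this would
// first create and schedule a Continuation (with the target mutex) whose
// handler is direct_lock_continuation(), e.g.:
//   schedule_with_mutex(target_mutex, [&dl] { direct_lock_continuation(dl); });
void direct_lock(DirectLock &dl) {
  std::unique_lock<std::mutex> lk(dl.m);
  dl.cv1.wait(lk, [&] { return dl.locked; });  // returns with the mutex held
}

void direct_unlock(DirectLock &dl) {
  {
    std::lock_guard<std::mutex> g(dl.m);
    dl.done = true;
  }
  dl.cv2.notify_one();  // lets the helper Continuation return, releasing the mutex
}

int main() {
  DirectLock dl;
  // Simulate the scheduler running the helper Continuation on an event thread.
  std::thread event_thread([&dl] { direct_lock_continuation(dl); });
  direct_lock(dl);    // blocks until the helper holds the (simulated) mutex
  std::puts("critical section");
  direct_unlock(dl);
  event_thread.join();
}
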
>> > > > > > With this approach, I'm not sure thread affinities would be of
>> any
>> > > > value.
>> > > > > > I think perhaps each core should have its own list of
>> ready-to-run
>> > > > > > Continuations, and a pool of event threads with affinity to that
>> > > core.
>> > > > > Not
>> > > > > > having per-event-thread ready-to-run lists means that a
>> > Continuation
>> > > > > > function that blocks is less likely to block other ready-to-run
>> > > > > > Continuations.  If Continuations had core affinities to some
>> > degree,
>> > > > this
>> > > > > > might reduce evictions in per-core memory cache.  (Multiple
>> > > > Continuations
>> > > > > > having the same function should have the same core affinity.)
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
