We don't have any hard guidelines around that metric, simply because no
single rule works for every job. For example, some jobs are very bursty
and need enough headroom to keep up with huge traffic ramp-ups even
though they're underutilized the rest of the time.

That said, yes, I have used that metric to determine whether a job has too
much parallelism. But that was a job with very stable throughput patterns
and no major time spent in the window or commit methods, which can cause
periodic spikes in utilization.
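
To make the arithmetic concrete, here is a rough sketch of the kind of check I'd
run before scaling in. The metric names follow the 0.10.1 SamzaContainerMetrics
timers, but the helper class, the sample values, and the 0.3 cutoff are all made
up for illustration; this isn't anything shipped with Samza:

    // Hypothetical helper: event-loop-utilization in 0.10.1 is
    // (process-ns + window-ns + commit-ns) / event-loop-ns.
    public final class UtilizationCheck {

        static double utilization(long processNs, long windowNs, long commitNs, long eventLoopNs) {
            return eventLoopNs == 0 ? 0.0 : (double) (processNs + windowNs + commitNs) / eventLoopNs;
        }

        public static void main(String[] args) {
            // Made-up sample of one event-loop iteration, in nanoseconds.
            double u = utilization(2_000_000, 300_000, 200_000, 10_000_000);
            System.out.printf("event-loop-utilization ~= %.2f%n", u);  // prints ~0.25

            // Treat a low number as a scale-in signal only if throughput is stable
            // and window/commit don't cause periodic spikes.
            if (u < 0.3) {
                System.out.println("possibly over-provisioned; check for bursts before scaling in");
            }
        }
    }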



On Wed, Aug 24, 2016 at 2:55 PM, David Yu <david...@optimizely.com> wrote:

> Interesting.
>
> To me, "event-loop-utilization" looks like a good indicator that shows us
> how busy the containers are. Is it safe to use this metric as a reference
> when we need to scale out/in our job? For example, if I'm seeing around 0.3
> utilization most of the time, maybe I can decrease the # of containers and
> save some resources?
>
> Thanks,
> David
>
> On Wed, Aug 24, 2016 at 1:27 PM Jacob Maes <jacob.m...@gmail.com> wrote:
>
> > >
> > > Based on what you have described, the following should be true in
> 0.10.1:
> > > event-loop-ns = choose-ns + process-ns + window-ns (if necessary) +
> > > commit-ns (if necessary)
> >
> > Yes, plus any time (e.g. due to an unlucky GC at just the right moment)
> > that happens outside those timers.  And no "if necessary" for window or
> > commit. There will be a small value for those methods even if they don't do
> > anything significant, because the timer runs even for no-ops.
> >
> > Since you're on 10.1, there's another useful metric
> > "event-loop-utilization", which represents
> > (process-ns+window-ns+commit-ns)/event-loop-ns
> > (as you defined it). In other words, the proportion of time spent working
> > vs waiting.
> >
> > On Wed, Aug 24, 2016 at 10:18 AM, David Yu <david...@optimizely.com>
> > wrote:
> >
> > > Great. It all makes sense now.
> > >
> > > With the SSD fix, we also upgrade to 0.10.1. So we should see pretty
> > > consistent process-ns (which we do).
> > >
> > > Based on what you have described, the following should be true in
> 0.10.1:
> > > event-loop-ns = choose-ns + process-ns + window-ns (if necessary) +
> > > commit-ns (if necessary)
> > >
> > > Is this correct?
> > > Thanks,
> > > David
> > >
> > > On Wed, Aug 24, 2016 at 11:27 AM Jacob Maes <jacob.m...@gmail.com>
> > wrote:
> > >
> > > > A couple other notes.
> > > >
> > > > Prior to Samza 10.1, the choose-ns was part of process-ns. So when
> > > > choose-ns and process-ns are both high (around 10,000,000 == 10ms,
> > which
> > > is
> > > > the default poll timeout), that usually means the task is caught up.
> In
> > > > Samza 10.1 the same is true if ONLY choose-ns is high. process-ns is
> > > always
> > > > the time spent in the process() method.
> > > >
> > > > Based on the above, the metric numbers you provided after the SSD fix
> > all
> > > > look reasonable. They're all sub-millisecond and since choose-ns and
> > > > process-ns are low, it seems that the container is chewing through
> > > messages
> > > > at a good rate.
> > > >
> > > > So I would conclude that the SSD fix was probably the right one and
> it
> > > just
> > > > took the job a while to catch up to the backlog of messages. Once it
> > > caught
> > > > up, the choose-ns and process-ns increased, which is normal when the
> > > > processor is waiting for new messages.
> > > >
> > > > -Jake
> > > >
> > > > On Wed, Aug 24, 2016 at 9:05 AM, Jacob Maes <jacob.m...@gmail.com>
> > > wrote:
> > > >
> > > > > Hey David,
> > > > >
> > > > > Answering the most recent question first, since it's also the
> > easiest.
> > > > :-)
> > > > >
> > > > > Is choose-ns the total number of ms used to choose a message from
> the
> > > > input
> > > > >> stream? What are some gating factors (e.g. serialization?) for
> this
> > > > >> metric?
> > > > >
> > > > > It's the amount of time the event loop spent getting new messages
> > for
> > > > > process(). It includes deserialization time and poll time, which we added
> > > > > new metrics for in Samza 10.1. Typically deserialization time is
> > > pretty
> > > > > consistent, so when you see a spike in choose-ns, it's usually
> > because
> > > > the
> > > > > event loop is waiting for new messages. The two most common cases
> > when
> > > > it's
> > > > > waiting are:
> > > > > 1. There are no new messages in the topic partition. This is good
> > > because
> > > > > it means the processor is caught up.
> > > > > 2. The consumer is slow and/or the buffer isn't large enough so the
> > > > > BrokerProxy isn't able to keep enough messages buffered to keep the
> > > event
> > > > > loop busy. This is uncommon because the buffer is defaulted to
> 50,000
> > > > > messages, which should be plenty. But if it happens, it's bad. To
> > > control
> > > > > this behavior, see the following properties in the config table (
> > > > > http://samza.apache.org/learn/documentation/0.10/jobs/configuration-table.html)
> > > > > systems.system-name.samza.fetch.threshold
> > > > > task.poll.interval.ms
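
For illustration, and assuming the Kafka system in the job config is simply named
"kafka", bumping those two settings in the job's .properties file would look
something like this (the values here are made up, not recommendations):

    systems.kafka.samza.fetch.threshold=100000
    task.poll.interval.ms=100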
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 24, 2016 at 8:52 AM, David Yu <david...@optimizely.com
> >
> > > > wrote:
> > > > >
> > > > >> More updates:
> > > > >> 1. process-envelopes rate finally stabilized and converged. Consumer lag
> > > > >> is down to zero.
> > > > >> 2. avg choose-ns across containers dropped over time
> > > > >> <https://www.dropbox.com/s/z4iiilvd7c1wrjc/Screenshot%202016-08-24%2010.46.22.png?dl=0>,
> > > > >> which I assume is a good thing.
> > > > >>
> > > > >> My question:
> > > > >> Is choose-ns the total number of ms used to choose a message from
> > the
> > > > >> input
> > > > >> stream? What are some gating factors (e.g. serialization?) for
> this
> > > > >> metric?
> > > > >>
> > > > >> Thanks,
> > > > >> David
> > > > >>
> > > > >> On Wed, Aug 24, 2016 at 12:34 AM David Yu <
> david...@optimizely.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Some metric updates:
> > > > >> > 1. We started seeing some containers with a higher choose-ns
> > > > >> > <https://www.dropbox.com/s/06n3awdwn8ntfxd/Screenshot%202016-08-24%2000.26.07.png?dl=0>.
> > > > >> > Not sure what would be the cause of this.
> > > > >> > 2. We are seeing very different process-envelopes values across containers
> > > > >> > <https://www.dropbox.com/s/n1wxtngquv607nb/Screenshot%202016-08-24%2000.21.05.png?dl=0>.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Aug 23, 2016 at 5:56 PM David Yu <
> david...@optimizely.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> >> Hi, Jake,
> > > > >> >>
> > > > >> >> Thanks for your suggestions. Some of my answers inline:
> > > > >> >>
> > > > >> >> 1.
> > > > >> >> On Tue, Aug 23, 2016 at 11:53 AM Jacob Maes <
> > jacob.m...@gmail.com>
> > > > >> wrote:
> > > > >> >>
> > > > >> >>> Hey David,
> > > > >> >>>
> > > > >> >>> A few initial thoughts/questions:
> > > > >> >>>
> > > > >> >>>
> > > > >> >>>    1. Is this job using RocksDB to store the aggregations? If
> > so,
> > > is
> > > > >> it
> > > > >> >>>    running on a machine with SSDs? We've seen a few
> performance
> > > > issues
> > > > >> >>> related
> > > > >> >>>    to RocksDB.
> > > > >> >>>       1. Not running on SSD causes slowness on disk
> > > > >> >>
> > > > >> >>  - [David] This definitely pointed me in the right direction in my
> > > > >> >> investigation. We do use RocksDB, and we just realized that our YARN
> > > > >> >> cluster uses c3.xlarge EC2 instances with a mixture of EBS and local SSD
> > > > >> >> storage. After digging around, we noticed that some containers were
> > > > >> >> persisting their KV stores on SSD while others were using EBS. We just
> > > > >> >> updated our YARN config to use SSD only before redeploying our jobs. So
> > > > >> >> far everything looks good. Will report back on the performance update.
> > > > >> >>
> > > > >> >>>       2. Prior to Samza 10.1, samza would excessively flush
> the
> > > > store
> > > > >> to
> > > > >> >>>       disk, causing RocksDB compaction issues (stalls) -
> > SAMZA-957
> > > > >> >>>
> > > > >> >> - [David] We did notice that the log cleaner thread died on one of our
> > > > >> >> brokers. Not sure if this was the same problem you pointed out. The
> > > > >> >> following errors were logged:
> > > > >> >>
> > > > >> >> 2016-08-15 10:00:56,475 ERROR kafka.log.LogCleaner:
> > > > >> >> [kafka-log-cleaner-thread-0], Error due to
> > > > >> >>
> > > > >> >> java.lang.IllegalArgumentException: requirement failed: 5865800 messages
> > > > >> >> in segment session-store-2.0-tickets-changelog-9/00000000000009548937.log
> > > > >> >> but offset map can fit only 5033164. You can increase
> > > > >> >> log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
> > > > >> >>
> > > > >> >>         at scala.Predef$.require(Predef.scala:219)
> > > > >> >>
> > > > >> >> We had to clean up the changelog topic and restart the broker to bring
> > > > >> >> back the cleaner thread.
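
For reference, the two knobs named in that error are ordinary Kafka broker
settings in server.properties; a purely illustrative example of loosening them
(the sizes are made up, not recommendations):

    # give the log cleaner's dedupe buffer more room, or run fewer cleaner threads
    log.cleaner.dedupe.buffer.size=268435456
    log.cleaner.threads=1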
> > > > >> >>
> > > > >> >>>       3. When the RocksDB store is used as a queue, the
> iterator
> > > can
> > > > >> >>> suffer
> > > > >> >>>       performance issues due to RocksDBs tombstoning. (
> > > > >> >>>
> > > > >> >>> https://github.com/facebook/rocksdb/wiki/Implement-Queue-Service-Using-RocksDB
> > > > >> >>>       )
> > > > >> >>>
> > > > >> >> - [David] We use RocksDB to keep track of opening sessions and
> > use
> > > > >> >> sessionId (a random hash) as the key. In that sense, this does
> > not
> > > > >> sound
> > > > >> >> like a queue. But we do iterate and delete closed sessions
> during
> > > > >> windowing
> > > > >> >> on a minute by minute basis.
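
For concreteness, a minimal sketch of a task with that shape, mirroring the
process()/window() description from the original message: process() folds each
event into its session, and window() walks the store once a minute, emits
snapshots, and deletes closed sessions. The class, store, and stream names and
the Event/Session types are made up for illustration; this is not the actual job
code.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    public class SessionSnapshotTask implements StreamTask, WindowableTask, InitableTask {

        // Hypothetical stand-ins for the real Avro event/session records.
        public static class Event {
            String sessionId;
            String getSessionId() { return sessionId; }
        }

        public static class Session {
            final String id;
            boolean closed;
            Session(String id) { this.id = id; }
            void update(Event e) { /* fold the event into the session aggregate */ }
            boolean isClosed() { return closed; }
        }

        private static final SystemStream SNAPSHOTS = new SystemStream("kafka", "session-snapshots");

        private KeyValueStore<String, Session> sessions;  // RocksDB store, keyed by random sessionId

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
            sessions = (KeyValueStore<String, Session>) context.getStore("session-store");
        }

        @Override
        public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                            TaskCoordinator coordinator) {
            Event event = (Event) envelope.getMessage();
            Session session = sessions.get(event.getSessionId());
            if (session == null) {
                session = new Session(event.getSessionId());
            }
            session.update(event);
            sessions.put(event.getSessionId(), session);  // also written to the changelog topic
        }

        @Override
        public void window(MessageCollector collector, TaskCoordinator coordinator) {
            // Runs once a minute: emit a snapshot of every open session and delete the
            // closed ones. Each delete leaves a RocksDB tombstone that later full-store
            // iterations have to skip over, which is the concern in point 3 above.
            KeyValueIterator<String, Session> it = sessions.all();
            try {
                while (it.hasNext()) {
                    Entry<String, Session> entry = it.next();
                    collector.send(new OutgoingMessageEnvelope(SNAPSHOTS, entry.getKey(), entry.getValue()));
                    if (entry.getValue().isClosed()) {
                        sessions.delete(entry.getKey());
                    }
                }
            } finally {
                it.close();  // always release the RocksDB iterator
            }
        }
    }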
> > > > >> >>
> > > > >> >>    2. Is the "messages-behind-high-watermark" metric non-zero?
> > > > >> >>>
> > > > >> >> -[David] Yes.
> > > > >> >>
> > > > >> >>>    3. The SamzaContainerMetrics might be useful too.
> > Particularly
> > > > >> >>>    "choose-ns" and "commit-ns"
> > > > >> >>>
> > > > >> >> -[David] We are seeing the following from one of the containers
> > > > (after
> > > > >> >> the SSD fix mentioned above):
> > > > >> >> choose-ns=61353
> > > > >> >> commit-ns=306328 (what does this metric indicate? Is this in
> ms?)
> > > > >> >> process-ns=248260
> > > > >> >> window-ns=150717
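
(For scale: if these timers are in nanoseconds, as the "10,000,000 == 10ms"
example earlier in the thread suggests, then all four values are well under a
millisecond, e.g. commit-ns = 306328 ns is roughly 0.3 ms.)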
> > > > >> >>
> > > > >> >>>    4. The only time I've personally seen slowness on the
> > producer
> > > is
> > > > >> if
> > > > >> >>>    it's configured for acks="all". What is the producer config
> > > from
> > > > >> the
> > > > >> >>> log?
> > > > >> >>>
> > > > >> >> - [David] We did not override this. So it should be the default
> > > > >> >> value (1?).
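
For completeness, a sketch of how that setting would be made explicit, assuming
the producer's system is named "kafka" and uses the Kafka system factory (as far
as I know, Samza forwards systems.*.producer.* properties to the underlying
Kafka producer):

    systems.kafka.producer.acks=1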
> > > > >> >>
> > > > >> >>    5. The window time is high, but since it's only called once
> > per
> > > > >> minute,
> > > > >> >>>    it looks like it only represents 1% of the event loop
> > > > utilization.
> > > > >> So
> > > > >> >>> I
> > > > >> >>>    don't think that's a smoking gun.
> > > > >> >>>
> > > > >> >>> -Jake
> > > > >> >>>
> > > > >> >>> On Tue, Aug 23, 2016 at 9:18 AM, David Yu <
> > > david...@optimizely.com>
> > > > >> >>> wrote:
> > > > >> >>>
> > > > >> >>> > Dear Samza guys,
> > > > >> >>> >
> > > > >> >>> > We are here for some debugging suggestions on our Samza job
> > > > >> (0.10.0),
> > > > >> >>> which
> > > > >> >>> > lags behind on consumption after running for a couple of
> > hours,
> > > > >> >>> regardless
> > > > >> >>> > of the number of containers allocated (currently 5).
> > > > >> >>> >
> > > > >> >>> > Briefly, the job aggregates events into sessions (in Avro)
> > > during
> > > > >> >>> process()
> > > > >> >>> > and emits snapshots of the open sessions using window()
> every
> > > > >> minute.
> > > > >> >>> This
> > > > >> >>> > graph
> > > > >> >>> > <https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0>
> > > > >> >>> > shows
> > > > >> >>> > you where processing started to lag (red is the number of
> > events
> > > > >> >>> received
> > > > >> >>> > and green is the number of events processed). The end result
> > is a
> > > > >> steady
> > > > >> >>> > increase of the consumer lag
> > > > >> >>> > <https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.
> > > > >> >>> > What we are trying to track down is where the performance
> > > > bottleneck
> > > > >> >>> is.
> > > > >> >>> > But it's unclear at the moment if that's in Samza or in
> Kafka.
> > > > >> >>> >
> > > > >> >>> > What we know so far:
> > > > >> >>> >
> > > > >> >>> >    - Kafka producer seems to take a while writing to the
> > > > downstream
> > > > >> >>> topic
> > > > >> >>> >    (changelog and session snapshots) shown by various
> timers.
> > > Not
> > > > >> sure
> > > > >> >>> > which
> > > > >> >>> >    numbers are critical but here are the producer metrics
> > > > >> >>> >    <https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0>
> > > > >> >>> > from
> > > > >> >>> >    one container.
> > > > >> >>> >    - avg windowing duration peaks at one point during the
> day
> > > (due
> > > > >> to
> > > > >> >>> the
> > > > >> >>> >    number of open sessions) but everything is still
> > sub-seconds
> > > > >> >>> >    <https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>.
> > > > >> >>> >    - our Kafka cluster doesn't seem to be overloaded
> > > > >> >>> >    <https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>
> > > > >> >>> >     with writes < 60MB/s across all three brokers
> > > > >> >>> >
> > > > >> >>> > From all we know, we suspect that the bottleneck is in producing to
> > > > >> >>> > Kafka. But we need some help confirming that.
> > > > >> >>> >
> > > > >> >>> > Any suggestion is appreciated.
> > > > >> >>> >
> > > > >> >>> > David
> > > > >> >>> >
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
