I understand you are receiving messages from *all* partitions (but fewer messages from some partitions).
Some questions:

1. Is it possible that you may have saturated the capacity of the entire
container?

2. How much time do you spend inside *process* and *window* for the affected
container? (How does it compare with other containers?) The metrics of
interest are *process_ns* and *window_ns*.

3. How many messages per second do you process on the affected containers?
(How does it compare with other containers?) The metric of interest is
*process-calls*. You can also look at per-task process calls
<https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/TaskInstanceMetrics.scala>
and per-partition messages read
<https://github.com/apache/samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemConsumerMetrics.scala>.

4. How many partitions do you have on the input topics?

4.1 Have you tried increasing the number of partitions and re-deploying with
more containers? (I take it that you are running on Yarn.)

*>> Another thing to add is that the amount of time we spend blocked is
generally quite high, which we believe is mainly because our processing is
not fast enough?*

5. What metric are you referring to here? Do you mean the time spent in the
*process* and *window* calls? If you're not processing "fast enough", it will
be helpful to instrument where you are spending most of the time. You can
rely on Samza's MetricsRegistry to configure / report your custom metrics;
a rough sketch is below.
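For example, something like the following (a minimal, untested sketch against
the low-level StreamTask API; the store name "join-store", the metric group /
names, and the String key/value types are just placeholders for your setup).
It wraps the store get/put you described with custom timers registered on the
task's MetricsRegistry, so they are reported next to the built-in
*process_ns* / *window_ns* numbers:

    import org.apache.samza.config.Config;
    import org.apache.samza.metrics.Timer;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class InstrumentedJoinTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> store;  // placeholder types
      private Timer storeGetNs;
      private Timer storePutNs;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // "join-store" is a placeholder for whatever store name you configured.
        store = (KeyValueStore<String, String>) context.getStore("join-store");
        // Custom per-task timers, emitted with the rest of the task's metrics.
        storeGetNs = context.getMetricsRegistry().newTimer("my-join", "store-get-ns");
        storePutNs = context.getMetricsRegistry().newTimer("my-join", "store-put-ns");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
          TaskCoordinator coordinator) {
        String key = (String) envelope.getKey();

        long start = System.nanoTime();
        String other = store.get(key);                // one side of the join
        storeGetNs.update(System.nanoTime() - start);

        // ... your join logic here: emit via the collector when `other` != null ...

        start = System.nanoTime();
        store.put(key, (String) envelope.getMessage());
        storePutNs.update(System.nanoTime() - start);
      }
    }

Comparing these per-task timers between the lagging task and the healthy one
should tell you whether the extra time is really going into the RocksDB puts
(your 5x number suggests it might be) or somewhere else in *process*.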

On Thu, Mar 9, 2017 at 5:20 AM, Ankit Malhotra <amalho...@appnexus.com>
wrote:

> Replies inline.
>
> --
> Ankit
>
> > On Mar 9, 2017, at 12:34 AM, Jagadish Venkatraman <
> > jagadish1...@gmail.com> wrote:
> >
> > We can certainly help you debug this more. Some questions:
> >
> > 1. Are you processing messages (at all) from the "suffering" containers?
> > (You can verify that by observing metrics / logging etc.)
>
> Processing messages for sure, but mainly from one of the 2 partitions that
> the container is reading from.
>
> The overall number of process calls for task 1 is much higher than the
> number of process calls for task 2, and the difference is approximately
> the lag on the container, which is all from task 2.
>
> > 2. If you are indeed processing messages, is it possible the impacted
> > containers are not able to keep up with the surge in the data? You can
> > try re-partitioning your input topics (and increasing the number of
> > containers).
>
> We were trying the async loop by having 2 tasks on the same container and
> multiple threads processing messages. The process is a simple inner join
> with a get and a put into the RocksDB store.
>
> We saw that both get and put for the suffering task were higher than for
> the task that was chugging along.
>
> > 3. If you are not processing messages, maybe you can provide us with
> > your stack trace? It will be super-helpful to find out if (or where)
> > containers are stuck.
>
> Another thing to add is that the amount of time we spend blocked is
> generally quite high, which we believe is mainly because our processing is
> not fast enough? To add more color, our RocksDB store's average get across
> all tasks is around 20,000 ns BUT the average put is 5x or more. We have
> the object cache enabled.
>
> > Thanks,
> > Jagadish
> >
> >
> > On Wed, Mar 8, 2017 at 1:05 PM, Ankit Malhotra <amalho...@appnexus.com>
> > wrote:
> >
> >> Hi,
> >>
> >> While joining 2 streams across 2 partitions, we see that some
> >> containers start suffering, in that the lag (messages behind the high
> >> watermark) for one of the tasks starts skyrocketing while the other
> >> one is ~ 0.
> >>
> >> We are using default values for buffer sizes and the fetch threshold,
> >> are using 4 threads as part of the pool, and are using the default
> >> RoundRobinMessageChooser.
> >>
> >> Happy to share more details/config if it can help to debug this
> >> further.
> >>
> >> Thanks
> >> Ankit
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University