I understand you are receiving messages from *all* partitions (but fewer messages from some partitions).
Some questions:

1. Is it possible that you may have saturated the capacity of the entire
container?

2. How much time do you spend inside *process* and *window* for the affected
container? (How does it compare with other containers?) The metrics of
interest are *process_ns* and *window_ns*.

3. How many messages per second do you process on the affected containers?
(How does it compare with other containers?) The metric of interest is
*process-calls*. You can also look at per-task process calls
<https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/TaskInstanceMetrics.scala>
and per-partition messages read
<https://github.com/apache/samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemConsumerMetrics.scala>.

4. How many partitions do you have on the input topics?

4.1 Have you tried increasing the number of partitions and re-deploying with
more containers? (I take it that you are running on Yarn.)

*>> Another thing to add is that the amount of time we spend blocked is
generally quite high, which we believe is mainly because our processing is
not fast enough?*

5. What metric are you referring to here? Do you mean the time spent in the
*process* and *window* calls? If you're not processing "fast enough", it will
be helpful to instrument where you are spending most of the time. You can
rely on Samza's MetricsRegistry to configure / report your custom metrics;
a rough sketch is below.
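For example, something like the following (a minimal, untested sketch against
the low-level StreamTask API; the store name "join-store", the metric group /
names, and the String key/value types are just placeholders for your setup).
It wraps the store get/put you described with custom timers registered on the
task's MetricsRegistry, so they are reported next to the built-in
*process_ns* / *window_ns* numbers:

    import org.apache.samza.config.Config;
    import org.apache.samza.metrics.Timer;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class InstrumentedJoinTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> store;  // placeholder types
      private Timer storeGetNs;
      private Timer storePutNs;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // "join-store" is a placeholder for whatever store name you configured.
        store = (KeyValueStore<String, String>) context.getStore("join-store");
        // Custom per-task timers, emitted with the rest of the task's metrics.
        storeGetNs = context.getMetricsRegistry().newTimer("my-join", "store-get-ns");
        storePutNs = context.getMetricsRegistry().newTimer("my-join", "store-put-ns");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
          TaskCoordinator coordinator) {
        String key = (String) envelope.getKey();

        long start = System.nanoTime();
        String other = store.get(key);                // one side of the join
        storeGetNs.update(System.nanoTime() - start);

        // ... your join logic here: emit via the collector when `other` != null ...

        start = System.nanoTime();
        store.put(key, (String) envelope.getMessage());
        storePutNs.update(System.nanoTime() - start);
      }
    }

Comparing these per-task timers between the lagging task and the healthy one
should tell you whether the extra time is really going into the RocksDB puts
(your 5x number suggests it might be) or somewhere else in *process*.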

On Thu, Mar 9, 2017 at 5:20 AM, Ankit Malhotra <amalho...@appnexus.com>
wrote:

> Replies inline.
>
> --
> Ankit
>
> > On Mar 9, 2017, at 12:34 AM, Jagadish Venkatraman <
> > jagadish1...@gmail.com> wrote:
> >
> > We can certainly help you debug this more. Some questions:
> >
> > 1. Are you processing messages (at all) from the "suffering" containers?
> > (You can verify that by observing metrics / logging etc.)
>
> Processing messages for sure, but mainly from one of the 2 partitions that
> the container is reading from.
>
> The overall number of process calls for task 1 is much higher than the
> number of process calls for task 2, and the difference is approximately
> the lag on the container, which is all from task 2.
>
> > 2. If you are indeed processing messages, is it possible the impacted
> > containers are not able to keep up with the surge in the data? You can
> > try re-partitioning your input topics (and increasing the number of
> > containers).
>
> We were trying the async loop by having 2 tasks on the same container and
> multiple threads processing messages. The process is a simple inner join
> with a get and a put into the RocksDB store.
>
> We saw that both get and put for the suffering task were higher than for
> the task that was chugging along.
>
> > 3. If you are not processing messages, maybe you can provide us with
> > your stack trace? It will be super-helpful to find out if (or where)
> > containers are stuck.
>
> Another thing to add is that the amount of time we spend blocked is
> generally quite high, which we believe is mainly because our processing is
> not fast enough? To add more color, our RocksDB store's average get across
> all tasks is around 20,000 ns BUT the average put is 5x or more. We have
> the object cache enabled.
>
> > Thanks,
> > Jagadish
> >
> >
> > On Wed, Mar 8, 2017 at 1:05 PM, Ankit Malhotra <amalho...@appnexus.com>
> > wrote:
> >
> >> Hi,
> >>
> >> While joining 2 streams across 2 partitions, we see that some
> >> containers start suffering, in that the lag (messages behind the high
> >> watermark) for one of the tasks starts skyrocketing while the other
> >> one is ~ 0.
> >>
> >> We are using default values for buffer sizes and the fetch threshold,
> >> are using 4 threads as part of the pool, and are using the default
> >> RoundRobinMessageChooser.
> >>
> >> Happy to share more details/config if it can help to debug this
> >> further.
> >>
> >> Thanks
> >> Ankit
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University