Thanks for the quick reply. That sounds very much like what I'm seeing. I'm merging 0.14.1 into our branch now. I did try single-threaded mode, and unfortunately that didn't seem to make a significant difference. Perhaps I do need some multithreading? I'm seeing a task latency of 0.2 ms per message but still only achieving ~700/sec.
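
As a rough sanity check on those two numbers, the gap can be worked out directly. This is only a back-of-envelope sketch: it assumes the 0.2 ms figure covers just the time inside process() and that messages are handled one at a time on a single thread. The function and variable names are made up for illustration.

```python
# Back-of-envelope check of the latency and throughput figures quoted above.
# Assumption: "task latency" is time spent inside process() per message,
# and messages are handled serially on one thread.

def max_rate_per_sec(latency_ms: float) -> float:
    """Upper bound on messages/sec for one thread at this per-message latency."""
    return 1000.0 / latency_ms


def wall_clock_per_message_ms(rate_per_sec: float) -> float:
    """Wall-clock time per message implied by an observed rate."""
    return 1000.0 / rate_per_sec


process_latency_ms = 0.2  # reported process() latency
observed_rate = 700.0     # reported throughput, messages/sec

ceiling = max_rate_per_sec(process_latency_ms)
budget_ms = wall_clock_per_message_ms(observed_rate)
unaccounted_ms = budget_ms - process_latency_ms

print(f"single-thread ceiling: {ceiling:.0f} msg/s")               # 5000 msg/s
print(f"wall-clock per message at 700/sec: {budget_ms:.2f} ms")     # 1.43 ms
print(f"time per message outside process(): {unaccounted_ms:.2f} ms")  # 1.23 ms
```

Under those assumptions, roughly 1.2 ms per message is being spent somewhere other than process(), which would be consistent with the bottleneck sitting in the run loop or message choosing rather than in the task itself.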
-----Original Message-----
From: Prateek Maheshwari [mailto:prateek...@gmail.com]
Sent: Friday, June 8, 2018 13:54
To: dev@samza.apache.org
Subject: Re: Urgent : Help with latency / backlog / topic lag

Hi Thunder,

> What we believe may be happening is that most of the topics have no backlog, but one topic has all the backlog (this is because one of the topics accounts for ~60% of the total message rate). Could there be something inducing extra latency on processing the one topic with a backlog just having a bunch of other topics with NO backlog?

This seems very similar to this issue: https://issues.apache.org/jira/browse/SAMZA-1599

This was fixed in https://github.com/apache/samza/pull/436, and the fix should be available in the 0.14.1 version. Would it be possible to try upgrading to 0.14.1? It should be backwards compatible with 0.14.0.

For something you can try without upgrading: try setting "job.container.single.thread.mode" to true. From the configuration reference <https://samza.apache.org/learn/documentation/latest/jobs/configuration-table.html>:
"If set to true, samza will fallback to legacy single-threaded event loop. Default is false, which enables the multithreading execution."

Let us know if this doesn't help.

Thanks,
Prateek

On Fri, Jun 8, 2018 at 1:35 PM, Thunder Stumpges <tstump...@ntent.com> wrote:

> We have a new samza job which we just put into production. This job
> processes many topics (~30) but the total rate is not that high
> (~1200/sec in aggregate). I am unable to get above ~700/sec and have a
> growing backlog.
>
> We are running samza 0.12 (I have an update to 0.14 that is not tested
> or pushed yet). When we load tested with a single topic, we could
> easily do several thousand per second. The latency of a single message
> is about 0.5ms as recorded by our timer metric on our 'process' call.
>
> What we believe may be happening is that most of the topics have no
> backlog, but one topic has all the backlog (this is because one of the
> topics accounts for ~60% of the total message rate). Could there be
> something inducing extra latency on processing the one topic with a
> backlog just having a bunch of other topics with NO backlog?
>
> Some things I have tried:
>
> 1. Increasing thread pool (10->20->30), no change
> 2. Going from 1 container to 2, no help (the two containers run at
>    half the speed and total is the same)
> 3. Increasing task.max.concurrency from 1 -> 2 -> 3 (this had some
>    minor help going from 1 to 2, but not enough)
> 4. Increasing fetch.threshold.bytes (currently at 100,000 and we
>    have pretty small messages)
>
> Some observed metrics:
>
> * "Pending Messages" are > 0 (15+ on some partitions)
> * "Messages in flight" is almost always 0
> * Polls rate is ~50/sec
> * Message chooser "Choos Obj" is ~680-700/sec like our processing rate
> * Message chooser "choose null" is ~50/sec
>
> I'm somewhat at a loss because based on the actual processing latency
> we should easily be able to do 2000+ with just a small handful of threads.
>
> Thanks in advance, this is in production I really need a solution.
> Thunder