Hey Leah,

Thanks for the response. We are running Kafka 2.5.1, and if the topology will still be useful after the next few sentences, I will share it with you (it's messy!). It happens on a few partitions and a few internal topics - and it seems to be kind of random exactly which topics and which partitions. The business logic is prone to having "hot" partitions, since the identifier being used comes in at very different rates during different times of the day. We are using RocksDB, and I would like to know which metrics you think can help us (I haven't exposed the metrics outside in a clever way yet :/)
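For context, the change I have in mind for exposing them is roughly the following (just a sketch - the application id and bootstrap servers are placeholders, and I'm assuming the DEBUG recording level is what turns on the per-store RocksDB metrics over JMX in 2.5.x):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class MetricsConfigSketch {
        // Placeholder config: only the recording-level line is the actual change.
        // At DEBUG level the per-store (RocksDB) metrics should show up over JMX.
        public static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "our-streams-app");  // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
            props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
            return props;
        }
    }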
Since the topics and partitions keep changing, and a restart usually fixes the problem almost immediately, I find it hard to believe it has anything to do with the topology or business logic - but I might be missing something (since, after a restart, the lag disappears with no real effort).

Thanks

On Tue, Dec 8, 2020 at 9:35 PM Leah Thomas <ltho...@confluent.io> wrote:

> Hi Nitay,
>
> What version of Kafka are you running? If you could also give the topology
> you're using, that would be great. Do you have a sense of whether the lag is
> happening on all partitions or just a few? Also, if you're using RocksDB,
> there are some RocksDB metrics in newer versions of Kafka that could be
> helpful for diagnosing the issue.
>
> Cheers,
> Leah
>
> On Mon, Dec 7, 2020 at 8:59 AM Nitay Kufert <nita...@ironsrc.com> wrote:
>
> > Hey,
> > We are running a Kafka Streams based app in production where the input,
> > intermediate, and global topics have 36 partitions.
> > We have 17 sub-topologies (2 of them are for global stores, so they won't
> > generate tasks).
> > More tech details:
> > 6 machines with 16 CPUs and 30 stream threads each, so: 6 * 30 = 180 stream threads
> > 15 * 36 = 540 tasks
> > 540 / 180 = 3 tasks per thread
> >
> > Every once in a while, during our rush hours, some of the internal topics,
> > on specific partitions, start to lag - the lag usually keeps increasing
> > until I restart the application, and then it disappears very quickly.
> >
> > It seems like there is some problem in the work allocation, since the
> > machines are not loaded at all and have enough threads (more than double
> > the number of CPUs).
> >
> > Any idea what's going on there?
> >
> > Nitay Kufert
> > Backend Team Leader
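P.S. One thing I plan to check the next time it happens is whether the lagging partitions all end up on the same stream thread. A rough sketch of how I'd dump the live assignment (assuming access to the app's running KafkaStreams instance; names are placeholders):

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.processor.ThreadMetadata;

    public class TaskAssignmentDump {
        // Print which tasks/partitions each stream thread currently owns,
        // to see whether the lagging partitions pile up on a single thread.
        public static void dump(KafkaStreams streams) {
            for (ThreadMetadata thread : streams.localThreadsMetadata()) {
                System.out.println(thread.threadName() + " [" + thread.threadState() + "]");
                thread.activeTasks().forEach(task ->
                        System.out.println("  task " + task.taskId()
                                + " -> " + task.topicPartitions()));
            }
        }
    }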