Hey Nitay,

In terms of RocksDB metrics, 2.5.1 should have a number of debug-level metrics that could shed some light on the situation. In particular I'd recommend looking at WRITE_STALL_DURATION_AVG / WRITE_STALL_DURATION_TOTAL, as well as some of the compaction metrics such as COMPACTION_TIME_MAX, BYTES_READ_DURING_COMPACTION or BYTES_WRITTEN_DURING_COMPACTION. The compaction metrics, in particular, could alert you to RocksDB falling behind on compaction, which could be exactly what the restart you're doing is clearing up.
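One thing to keep in mind: those metrics are only recorded when metrics.recording.level is set to DEBUG, and then you can read them off KafkaStreams#metrics() (or over JMX). Here's a rough, untested sketch of what that could look like - the "stream-state-metrics" group and the substring filters are my assumption based on how the KIP-471 metrics are exposed, so double-check the names against what your build actually reports:

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class RocksDbMetricsProbe {

    public static KafkaStreams startWithDebugMetrics(final Topology topology) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // RocksDB metrics are only recorded at DEBUG recording level
        props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
        return new KafkaStreams(topology, props);
    }

    // Periodically dump write-stall and compaction-related metrics per store/task,
    // so you can see which stores are stalling before the lag shows up.
    public static void logRocksDbMetrics(final KafkaStreams streams) {
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            final boolean rocksDbMetric = "stream-state-metrics".equals(name.group())
                && (name.name().contains("write-stall-duration")
                    || name.name().contains("compaction"));
            if (rocksDbMetric) {
                System.out.printf("%s %s = %s%n",
                    name.tags(), name.name(), entry.getValue().metricValue());
            }
        }
    }
}

If I remember right these metrics are tagged with the task id, so you should be able to line up any write stalls with the specific partitions that are lagging.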
I do think it *could* still be something in your topology. Definitely confirm that your subtopologies have a fairly even processing load - overloaded tasks could well be impacting performance.

Good luck!
Leah

On Wed, Dec 9, 2020 at 3:00 PM Nitay Kufert <nita...@ironsrc.com> wrote:

> Hey Leah,
> Thanks for the response.
>
> We are running Kafka 2.5.1, and if the topology will still be useful after
> the next few sentences, I will share it with you (it's messy!).
> It happens on a few partitions and a few internal topics, and it seems to be
> fairly random which topics and which partitions exactly.
> The business logic is prone to having "hot" partitions, since the identifier
> being used arrives at very different rates during different times of the day.
> We are using RocksDB, and I would like to know which metrics you think can
> help us (I haven't exposed the metrics in a clever way outside yet :/).
>
> Since the topics and partitions keep changing, and a restart usually fixes the
> problem almost immediately, I find it hard to believe it has anything to do
> with the topology or business logic, but I might be missing something
> (since, after a restart, the lag disappears with no real effort).
>
> Thanks
>
> On Tue, Dec 8, 2020 at 9:35 PM Leah Thomas <ltho...@confluent.io> wrote:
>
> > Hi Nitay,
> >
> > What version of Kafka are you running? If you could also share the topology
> > you're using, that would be great. Do you have a sense of whether the lag is
> > happening on all partitions or just a few? Also, if you're using RocksDB,
> > there are some RocksDB metrics in newer versions of Kafka that could be
> > helpful for diagnosing the issue.
> >
> > Cheers,
> > Leah
> >
> > On Mon, Dec 7, 2020 at 8:59 AM Nitay Kufert <nita...@ironsrc.com> wrote:
> >
> > > Hey,
> > > We are running a Kafka Streams based app in production where the input,
> > > intermediate and global topics have 36 partitions.
> > > We have 17 sub-topologies (2 of them are for global stores, so they won't
> > > generate tasks).
> > > More tech details:
> > > 6 machines with 16 CPUs each, running 30 stream threads per machine, so 6 * 30 = 180 stream threads
> > > 15 * 36 = 540 tasks
> > > 3 tasks per thread
> > >
> > > Every once in a while, during our rush hours, some of the internal topics,
> > > on specific partitions, start to lag - the lag usually keeps increasing
> > > until I restart the application, and then it disappears very quickly.
> > >
> > > It seems like there is some problem in the work allocation, since the
> > > machines are not loaded at all and have enough threads (more than double
> > > the number of CPUs).
> > >
> > > Any idea what's going on there?