Hey Nitay,

In terms of RocksDB metrics, 2.5.1 should have a number of debug-level
metrics that could shed some light on the situation. In particular, I'd
recommend looking at WRITE_STALL_DURATION_AVG / WRITE_STALL_DURATION_TOTAL,
as well as some of the compaction metrics such as COMPACTION_TIME_MAX,
BYTES_READ_DURING_COMPACTION or BYTES_WRITTEN_DURING_COMPACTION. The
compaction metrics in particular could alert you to RocksDB falling
behind on compaction, which could be what the restart you're doing is
resolving.
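
If it helps, this is roughly how I'd pull those numbers out of the app
itself (just an untested sketch -- the RocksDB metrics are only recorded
at the DEBUG recording level, and the "stream-state-metrics" group and
metric names below are from memory, so double-check them against 2.5.1):

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class RocksDbMetricsProbe {

    public static KafkaStreams startWithDebugMetrics(final Topology topology,
                                                     final Properties props) {
        // RocksDB metrics are only recorded when the recording level is DEBUG
        props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
        return new KafkaStreams(topology, props);
    }

    // Print the write-stall and compaction-related metrics per store/task
    public static void dumpRocksDbMetrics(final KafkaStreams streams) {
        for (final Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            final MetricName name = e.getKey();
            // the per-store RocksDB metrics live in the "stream-state-metrics" group
            if (!"stream-state-metrics".equals(name.group())) {
                continue;
            }
            if (name.name().startsWith("write-stall-duration")
                    || name.name().contains("compaction")) {
                System.out.printf("%s %s = %s%n",
                        name.tags(), name.name(), e.getValue().metricValue());
            }
        }
    }
}

Calling dumpRocksDbMetrics periodically (or wiring the same filter into a
metrics reporter) should show whether the stall durations spike on the
same instances that start lagging.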

I do think it *could* still be something in your topology. Definitely
confirm that your subtopologies have a fairly even processing load;
overloaded tasks could certainly be hurting performance.
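
A quick way to eyeball that (again just a sketch, untested -- the
"stream-task-metrics" group and "process-rate" / "task-id" names are from
memory and are also DEBUG-level, so verify them against your version) is
to dump the per-task process rate and see whether a few task ids dominate:

import java.util.Map;
import java.util.TreeMap;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class TaskLoadCheck {

    // Collect the process rate per task; an uneven subtopology or a "hot"
    // partition shows up as a handful of task ids doing most of the work.
    public static Map<String, Double> processRateByTask(final KafkaStreams streams) {
        final Map<String, Double> rates = new TreeMap<>();
        for (final Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            final MetricName name = e.getKey();
            if ("stream-task-metrics".equals(name.group())
                    && "process-rate".equals(name.name())) {
                // task ids look like "3_17" = subtopology 3, partition 17
                final String taskId = name.tags().get("task-id");
                final Object value = e.getValue().metricValue();
                if (value instanceof Double && !((Double) value).isNaN()) {
                    rates.put(taskId, (Double) value);
                }
            }
        }
        return rates;
    }
}

If the hot task ids all share the same subtopology prefix, that points at
the topology; if they share the same partition suffix, that points back at
hot keys.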

Good luck!
Leah





On Wed, Dec 9, 2020 at 3:00 PM Nitay Kufert <nita...@ironsrc.com> wrote:

> Hey Leah, Thanks for the response.
>
> We are running Kafka 2.5.1, and if the topology is still useful after the
> next few sentences, I will share it with you (it's messy!).
> It happens on a few partitions and a few internal topics, and it seems to
> be fairly random exactly which topics and which partitions are affected.
> The business logic is prone to having "hot" partitions, since the
> identifier being used arrives at very different rates at different times
> of the day.
> We are using RocksDB, and I would like to know which metrics you think
> can help us (I haven't exposed the metrics externally in a useful way
> yet :/).
>
> Since the topics and partitions keep changing, and a restart usually
> fixes the problem almost immediately, I find it hard to believe it has
> anything to do with the topology or business logic, but I might be
> missing something (since, after a restart, the lag disappears with no
> real effort).
>
> Thanks
>
>
>
>
> On Tue, Dec 8, 2020 at 9:35 PM Leah Thomas <ltho...@confluent.io> wrote:
>
> > Hi Nitay,
> >
> > What version of Kafka are you running? If you could also share the
> > topology you're using, that would be great. Do you have a sense of
> > whether the lag is happening on all partitions or just a few? Also, if
> > you're using RocksDB, there are some RocksDB metrics in newer versions
> > of Kafka that could be helpful for diagnosing the issue.
> >
> > Cheers,
> > Leah
> >
> > On Mon, Dec 7, 2020 at 8:59 AM Nitay Kufert <nita...@ironsrc.com> wrote:
> >
> > > Hey,
> > > We are running a Kafka Streams based app in production where the
> > > input, intermediate and global topics have 36 partitions.
> > > We have 17 sub-topologies (2 of them are for global stores, so they
> > > won't generate tasks).
> > > More tech details:
> > > 6 machines with 16 CPUs each and 30 stream threads each, so: 6 * 30 =
> > > 180 stream threads
> > > 15 * 36 = 540 tasks
> > > 3 tasks per thread
> > >
> > > Every once in a while, during our rush hours, some of the internal
> > > topics, on specific partitions, start to lag - the lag usually keeps
> > > increasing until I restart the application - and then the lag
> > > disappears very quickly.
> > >
> > > It seems like there is some problem in the work allocation, since the
> > > machines are not loaded at all and have enough threads (more than
> > > double the number of CPUs).
> > >
> > > Any idea what's going on there?
> > >
> > > --
> > >
> > > Nitay Kufert
> > > Backend Team Leader
> > >
> >
>
>
> --
>
> Nitay Kufert
> Backend Team Leader
>
