That is right: since the client-id is used as the metrics name, it should be distinguishable.

https://kafka.apache.org/documentation/#streamsconfigs (I think we can improve on the explanation of the client.id config.) A common client-id could contain the machine's host and port; of course, if you have more than one Streams instance running on the same machine that won't work and you need to consider using more information.
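For illustration, a minimal sketch of setting an explicit client.id per instance; the application id, the hostname lookup and the "-1" instance suffix below are only placeholders, not anything Streams prescribes:

import java.net.InetAddress;
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    public static Properties build() throws Exception {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        // One readable id per instance, e.g. "new-part-advice-host1-1"; the thread-level
        // metrics should then be named "new-part-advice-host1-1-StreamThread-1" instead
        // of carrying a random UUID.
        props.put(StreamsConfig.CLIENT_ID_CONFIG,
                "new-part-advice-" + InetAddress.getLocalHost().getHostName() + "-1");
        return props;
    }
}

If two Streams instances share a host, the suffix (here "-1") has to differ between them, as noted above.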
Again, the client-id config is not required, and when not specified Streams will use a UUID suffix to achieve uniqueness, but as you observed it is less human-readable for monitoring.

Guozhang

On Fri, Mar 3, 2017 at 5:18 PM, Sachin Mittal <sjmit...@gmail.com> wrote:

> So if I am running my stream across a cluster of different machines, each
> machine should have a different client id?
>
> On 4 Mar 2017 12:36 a.m., "Guozhang Wang" <wangg...@gmail.com> wrote:
>
> > Sachin,
> >
> > The reason that you got the metrics name
> >
> > new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> >
> > is that you did not set the "CLIENT_ID_CONFIG" in your app, and
> > KafkaStreams has to use a default combo of "appID: new-part-advice" plus
> > "processID: a UUID to guarantee uniqueness across machines" as its
> > clientId.
> >
> > As for the metrics name, it is always set as clientId + "-" + threadName,
> > where "StreamThread-1" is your threadName, which is unique only WITHIN
> > the JVM; that is why we still need the globally unique clientId to
> > distinguish threads across instances.
> >
> > I just checked the source code and this logic was not changed from 0.10.1
> > to 0.10.2, so I guess you set your clientId as "new-advice-1" as well in
> > 0.10.1?
> >
> > Guozhang
> >
> > On Fri, Mar 3, 2017 at 4:02 AM, Eno Thereska <eno.there...@gmail.com>
> > wrote:
> >
> > > Hi Sachin,
> > >
> > > Now that the Confluent Platform 3.2 is out, we also have some more
> > > documentation on this here:
> > > http://docs.confluent.io/3.2.0/streams/monitoring.html
> > > We added a note on how to add other metrics.
> > >
> > > Yeah, your calculation on poll time makes sense. The important metrics
> > > are the "info" ones that are on by default. However, for stateful
> > > applications, if you suspect that state stores might be bottlenecking,
> > > you might want to collect those metrics too.
> > >
> > > On the benchmarks, the ones called "processstreamwithstatestore" and
> > > "count" are the closest to a benchmark of RocksDB with the default
> > > configs. The first writes each record to RocksDB, while the second
> > > performs simple aggregates (reads and writes from/to RocksDB).
> > >
> > > We might need to add more benchmarks here; it would be great to get
> > > some ideas and help from the community, e.g. a pure RocksDB benchmark
> > > that doesn't go through Streams at all.
> > >
> > > Could you open a JIRA on the name issue please? As an "improvement".
> > >
> > > Thanks
> > > Eno
> > >
> > > > On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > > I had checked the monitoring docs, but could not figure out which
> > > > metrics are the important ones.
> > > >
> > > > Also, mainly I am looking at the average time spent between 2
> > > > successive poll requests.
> > > > Can I say that the average time between 2 poll requests is the sum of
> > > >
> > > > commit + poll + process + punctuate (latency-avg)?
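A side note on the poll-time estimate quoted above: those four averages can also be read programmatically from KafkaStreams#metrics(). A rough sketch follows; the "stream-metrics" group and the "*-latency-avg" names are what the 0.10.x thread-level metrics appear to use, so verify them against your build (e.g. over JMX) before scripting around them.

import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PollGapEstimate {

    // Sums the commit + poll + process + punctuate latency averages for the
    // stream threads of this instance. With several threads in one JVM you
    // would additionally filter on the metric's "client-id" tag (tag name
    // assumed here, not taken from the thread above).
    public static double estimateMs(final KafkaStreams streams) {
        double sumMs = 0.0;
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            final MetricName m = entry.getKey();
            if ("stream-metrics".equals(m.group())
                    && (m.name().equals("commit-latency-avg")
                        || m.name().equals("poll-latency-avg")
                        || m.name().equals("process-latency-avg")
                        || m.name().equals("punctuate-latency-avg"))) {
                sumMs += entry.getValue().value(); // Metric#value() on the 0.10.x client API
            }
        }
        return sumMs;
    }
}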
> > > > On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <eno.there...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Sachin,
> > > >>
> > > >> The new Streams metrics are now documented at
> > > >> https://kafka.apache.org/documentation/#kafka_streams_monitoring
> > > >> Note that not all of them are turned on by default.
> > > >>
> > > >> We have several benchmarks that run nightly to monitor Streams
> > > >> performance. They all stem from the SimpleBenchmark.java benchmark.
> > > >> In addition, their results are published nightly at
> > > >> http://testing.confluent.io (e.g. under the trunk results). Looking
> > > >> at today's results:
> > > >> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
> > > >> (if you search for "benchmarks.streams") you'll see results from a
> > > >> series of benchmarks, ranging from simply consuming, to simple
> > > >> topologies with a source and a sink, to joins and a count aggregate.
> > > >> These run on AWS nightly, but you can also run them manually on your
> > > >> setup.
> > > >>
> > > >> In addition, programmatically the code can check KafkaStreams.state()
> > > >> and register listeners for when the state changes. For example, the
> > > >> state can change from "running" to "rebalancing".
> > > >>
> > > >> It is likely we'll need more metrics moving forward, and it would be
> > > >> great to get feedback from the community.
> > > >>
> > > >> Thanks
> > > >> Eno
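On the state-listener point above, a small sketch of what registering one could look like, assuming the application already builds a KafkaStreams instance (the class and method names here are placeholders):

import org.apache.kafka.streams.KafkaStreams;

public class RebalanceLogger {

    // Attach before streams.start(); logs transitions such as RUNNING -> REBALANCING,
    // which helps correlate the frequent rebalances with the other metrics collected.
    public static void attach(final KafkaStreams streams) {
        streams.setStateListener((newState, oldState) ->
                System.out.println("Streams state changed: " + oldState + " -> " + newState));
    }
}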
> > > >>
> > > >>> On 2 Mar 2017, at 11:54, Sachin Mittal <sjmit...@gmail.com> wrote:
> > > >>>
> > > >>> Hello All,
> > > >>> I had a few questions regarding monitoring of a Kafka Streams
> > > >>> application and what are some important metrics we should collect
> > > >>> in our case.
> > > >>>
> > > >>> Just a brief overview: we have a single-threaded application
> > > >>> (0.10.1.1) reading from a single-partition topic and it is working
> > > >>> all fine.
> > > >>> Then we have the same application (using 0.10.2.0) multi-threaded
> > > >>> with 4 threads per machine on a 3-machine cluster, reading from the
> > > >>> same but partitioned topic (12 partitions).
> > > >>> Thus we have each thread processing a single partition, the same
> > > >>> case as the earlier one.
> > > >>>
> > > >>> The new setup also works fine in steady state, but under load it
> > > >>> somehow triggers frequent rebalances, and then we run into all
> > > >>> sorts of issues, like a stream thread dying due to a
> > > >>> CommitFailedException or entering a deadlock state.
> > > >>> After a while we restart all the instances; then it works fine for
> > > >>> a while, and again we get the same problem, and it goes on.
> > > >>>
> > > >>> 1. So, just to monitor: when the first thread fails, what would be
> > > >>> some important metrics we should be collecting to get some sense of
> > > >>> what's going on?
> > > >>>
> > > >>> 2. Is there any metric that tells the time elapsed between
> > > >>> successive poll requests, so we can monitor that?
> > > >>>
> > > >>> Also, I did monitor RocksDB put and fetch times for these 2
> > > >>> instances, and here is the output I get:
> > > >>>
> > > >>> 0.10.1.1
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > >>> 206431.7497615029
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > >>> 2595394.2746129474
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > >>> 232.86299499317252
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > >>> 373.61071016166284
> > > >>>
> > > >>> The same values for 0.10.2.0:
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > >>> 1199859.5535022356
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > >>> 3679340.80748852
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > >>> 56.134778706069184
> > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
> > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > >>> 136.10721427931827
> > > >>>
> > > >>> I notice that the results in 10.2.0 are much worse than the same
> > > >>> for 10.1.1.
> > > >>>
> > > >>> I would like to know:
> > > >>> 1. Is there any benchmark for RocksDB as to at what rate/latency it
> > > >>> should be doing put/fetch operations?
> > > >>>
> > > >>> 2. What could be the cause of the inferior numbers in 10.2.0? Is it
> > > >>> because this application is also running three other threads doing
> > > >>> the same thing?
> > > >>>
> > > >>> 3. Also, what's with the name
> > > >>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1?
> > > >>> I wanted to put this as part of my cron job, so why can't we have a
> > > >>> simpler name like we have in 10.1.1, so it is easy to write the
> > > >>> script?
> > > >>>
> > > >>> Thanks
> > > >>> Sachin
> >
> > --
> > -- Guozhang

--
-- Guozhang
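A note on question 3 and the cron-job concern: until the naming is simplified, one workaround is to query the MBeans with an ObjectName pattern instead of a fixed name, so the script never needs to know the per-process UUID. A rough sketch, assuming JMX remote access is enabled on the application (port 9999 is only an example) and reusing the attribute names shown above ("key-table" is the store name from this particular topology):

import java.util.Set;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RocksDbMetricsDump {

    public static void main(final String[] args) throws Exception {
        // Placeholder JMX URL; use whatever host/port the Streams app exposes.
        final JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        final JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            final MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Property-value pattern: matches every client-id that starts with the
            // app id, so the random UUID suffix never has to appear in the script.
            final ObjectName pattern = new ObjectName(
                    "kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-*");
            final Set<ObjectName> names = mbsc.queryNames(pattern, null);
            for (final ObjectName name : names) {
                System.out.println(name + " key-table-put-latency-avg = "
                        + mbsc.getAttribute(name, "key-table-put-latency-avg"));
            }
        } finally {
            connector.close();
        }
    }
}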