Answer inline:

> I just wanted to understand: say in a single poll request, if it fetches n
> records, do the above values indicate the time computed for all n records or
> just a single record?

In 0.10.2, the process latency is that of a single record, not the sum of the n records. The commit latency is the latency for several requests. So your second statement is true:

> or is it the total average time to process these records = n * process
> latency + commit latency before making another poll request.

Correct.
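For a rough feel of that number, one could read the averages straight off KafkaStreams#metrics() and plug them into the formula above. A minimal, untested sketch (metric names as listed for the 0.10.2 Streams thread metrics; the class and method names are just illustrative, and avgRecordsPerPoll is an estimate you have to supply yourself, e.g. from max.poll.records or your own logging):

import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PollGapEstimate {

    // Applies the formula above (n * process-latency-avg + commit-latency-avg)
    // using the thread-level metrics exposed by KafkaStreams#metrics().
    // Assumes a single StreamThread; with more threads you would also filter
    // on the client-id tag of each MetricName instead of just the name.
    public static double estimateMs(final KafkaStreams streams, final double avgRecordsPerPoll) {
        double processAvg = 0.0;
        double commitAvg = 0.0;
        for (final Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            final String name = e.getKey().name();
            if ("process-latency-avg".equals(name)) {
                processAvg = e.getValue().value();   // per-record average, in ms
            } else if ("commit-latency-avg".equals(name)) {
                commitAvg = e.getValue().value();
            }
        }
        return avgRecordsPerPoll * processAvg + commitAvg;
    }
}

Comparing that estimate against max.poll.interval.ms should give a rough sense of how close you are to triggering a rebalance.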
Thanks
Eno

> Basically we just want to know how often poll is getting called, just to see
> how close it is to MAX_POLL_INTERVAL_MS_CONFIG.
>
> Thanks
> Sachin
>
> On Sun, Mar 5, 2017 at 11:42 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> That is right, since the client-id is used as the metrics name, which should be
>> distinguishable.
>>
>> https://kafka.apache.org/documentation/#streamsconfigs (I think we can
>> improve on the explanation of the client.id config)
>>
>> A common client-id could contain the machine's host-port; of course, if you
>> have more than one Streams instance running on the same machine that won't
>> work and you need to consider using more information.
>>
>> Again, the client-id config is not required, and when not specified Streams
>> will use a UUID suffix to achieve uniqueness, but as you observed it is
>> less human-readable for monitoring.
>>
>> Guozhang
>>
>> On Fri, Mar 3, 2017 at 5:18 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
>>
>>> So if I am running my stream across a cluster of different machines,
>>> each machine should have a different client id.
>>>
>>> On 4 Mar 2017 12:36 a.m., "Guozhang Wang" <wangg...@gmail.com> wrote:
>>>
>>>> Sachin,
>>>>
>>>> The reason that you got a metrics name like
>>>>
>>>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>>>
>>>> is that you did not set the "CLIENT_ID_CONFIG" in your app, and
>>>> KafkaStreams has to use a default combo of "appID:
>>>> new-part-advice"-"processID: a UUID to guarantee uniqueness across
>>>> machines" as its clientId.
>>>>
>>>> As for the metricsName, it is always set as clientId + "-" + threadName,
>>>> where "StreamThread-1" is your threadName, which is unique WITHIN the JVM;
>>>> that is why we still need the globally unique clientId to distinguish instances.
>>>>
>>>> I just checked the source code and this logic was not changed from 0.10.1
>>>> to 0.10.2, so I guess you set your clientId as "new-advice-1" as well in
>>>> 0.10.1?
>>>>
>>>> Guozhang
>>>>
>>>> On Fri, Mar 3, 2017 at 4:02 AM, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>
>>>>> Hi Sachin,
>>>>>
>>>>> Now that Confluent Platform 3.2 is out, we also have some more
>>>>> documentation on this here: http://docs.confluent.io/3.2.0/streams/monitoring.html.
>>>>> We added a note on how to add other metrics.
>>>>>
>>>>> Yeah, your calculation on poll time makes sense. The important metrics are
>>>>> the “info” ones that are on by default. However, for stateful applications,
>>>>> if you suspect that state stores might be bottlenecking, you might want to
>>>>> collect those metrics too.
>>>>>
>>>>> On the benchmarks, the ones called “processstreamwithstatestore” and
>>>>> “count” are the closest to benchmarking RocksDb with the default
>>>>> configs. The first writes each record to RocksDb, while the second performs
>>>>> simple aggregates (reads and writes from/to RocksDb).
>>>>>
>>>>> We might need to add more benchmarks here; it would be great to get some
>>>>> ideas and help from the community. E.g., a pure RocksDb benchmark that
>>>>> doesn’t go through streams at all.
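On that last point, a pure-RocksDB micro-benchmark outside of Streams can be as small as the rough sketch below. It assumes the rocksdbjni dependency (a reasonably recent version, where the handles are AutoCloseable) is on the classpath; the record count, value size, key format and DB path are arbitrary choices:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PureRocksDbBench {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-bench")) {
            final int n = 1_000_000;             // arbitrary number of records
            final byte[] value = new byte[100];  // arbitrary 100-byte value
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                db.put(Integer.toString(i).getBytes(), value);
            }
            double putSecs = (System.nanoTime() - start) / 1e9;
            start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                db.get(Integer.toString(i).getBytes());
            }
            double getSecs = (System.nanoTime() - start) / 1e9;
            System.out.printf("put: %.0f ops/s, get: %.0f ops/s%n",
                    n / putSecs, n / getSecs);
        }
    }
}

Numbers from a sketch like this only give a local baseline for raw put/get throughput; they do not include the serialization, changelogging and windowing work Streams adds on top.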
>>>>>
>>>>> Could you open a JIRA on the name issue please? As an “improvement”.
>>>>>
>>>>> Thanks
>>>>> Eno
>>>>>
>>>>>> On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I had checked the monitoring docs, but could not figure out which metrics
>>>>>> are the important ones.
>>>>>>
>>>>>> Also, mainly I am looking at the average time spent between 2 successive
>>>>>> poll requests.
>>>>>> Can I say that the average time between 2 poll requests is the sum of
>>>>>>
>>>>>> commit + poll + process + punctuate (latency-avg)?
>>>>>>
>>>>>> Also, I checked the benchmark test results but could not find any
>>>>>> information on rocksdb metrics for fetch and put operations.
>>>>>> Is there any benchmark for these, or can something be said about their
>>>>>> performance based on the values in my previous mail?
>>>>>>
>>>>>> Lastly, can we get some help on names like
>>>>>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>>>>> and have a more standard thread name like new-advice-1-StreamThread-1
>>>>>> (as in version 10.1.1), so we can log these metrics as part of our cron jobs?
>>>>>>
>>>>>> Thanks
>>>>>> Sachin
>>>>>>
>>>>>> On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sachin,
>>>>>>>
>>>>>>> The new streams metrics are now documented at
>>>>>>> https://kafka.apache.org/documentation/#kafka_streams_monitoring.
>>>>>>> Note that not all of them are turned on by default.
>>>>>>>
>>>>>>> We have several benchmarks that run nightly to monitor streams
>>>>>>> performance. They all stem from the SimpleBenchmark.java benchmark. In
>>>>>>> addition, their results are published nightly here:
>>>>>>> http://testing.confluent.io (e.g., under the trunk results). E.g., looking at today's results:
>>>>>>> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
>>>>>>> (if you search for "benchmarks.streams") you'll see results from a series
>>>>>>> of benchmarks, ranging from simply consuming, to simple topologies with a
>>>>>>> source and sink, to joins and count aggregates. These run on AWS nightly,
>>>>>>> but you can also run them manually on your setup.
>>>>>>>
>>>>>>> In addition, the code can programmatically check KafkaStreams.state()
>>>>>>> and register listeners for when the state changes. For example, the state
>>>>>>> can change from "running" to "rebalancing".
>>>>>>>
>>>>>>> It is likely we'll need more metrics moving forward and it would be great
>>>>>>> to get feedback from the community.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Eno
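On the state-listener point just above, a minimal sketch of hooking it up (the class and method names are illustrative, and logging via System.out is just a placeholder for whatever your monitoring uses):

import org.apache.kafka.streams.KafkaStreams;

public class StateChangeLogging {

    // Register before calling streams.start(). Frequent RUNNING -> REBALANCING
    // transitions under load are a hint that threads are being kicked out of
    // the consumer group, e.g. after exceeding max.poll.interval.ms.
    public static void logStateTransitions(final KafkaStreams streams) {
        streams.setStateListener(new KafkaStreams.StateListener() {
            @Override
            public void onChange(final KafkaStreams.State newState,
                                 final KafkaStreams.State oldState) {
                System.out.println("Streams state change: " + oldState + " -> " + newState);
            }
        });
    }
}

Counting those transitions over time is a cheap way to see how often the rebalances described below are actually happening.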
>>>>>>>> On 2 Mar 2017, at 11:54, Sachin Mittal <sjmit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello All,
>>>>>>>> I had a few questions regarding monitoring of a kafka streams application
>>>>>>>> and what are some important metrics we should collect in our case.
>>>>>>>>
>>>>>>>> Just a brief overview: we have a single-threaded application (0.10.1.1)
>>>>>>>> reading from a single-partition topic, and it is working all fine.
>>>>>>>> Then we have the same application (using 0.10.2.0) multi-threaded with 4
>>>>>>>> threads per machine and a 3-machine cluster setup, reading from the same
>>>>>>>> but partitioned topic (12 partitions).
>>>>>>>> Thus we have each thread processing a single partition, the same case as
>>>>>>>> the earlier one.
>>>>>>>>
>>>>>>>> The new setup also works fine in steady state, but under load somehow it
>>>>>>>> triggers frequent re-balances and then we run into all sorts of issues,
>>>>>>>> like a stream thread dying due to CommitFailedException or entering a
>>>>>>>> deadlock state.
>>>>>>>> After a while we restart all the instances, then it works fine for a
>>>>>>>> while, and again we get the same problem, and it goes on.
>>>>>>>>
>>>>>>>> 1. So, just to monitor, when the first thread fails, what would be some
>>>>>>>> important metrics we should be collecting to get some sense of what's
>>>>>>>> going on?
>>>>>>>>
>>>>>>>> 2. Is there any metric that tells the time elapsed between successive
>>>>>>>> poll requests, so we can monitor that?
>>>>>>>>
>>>>>>>> Also I did monitor rocksdb put and fetch times for these 2 instances and
>>>>>>>> here is the output I get:
>>>>>>>> 0.10.1.1
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>>>>>>> 206431.7497615029
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>>>>>>> 2595394.2746129474
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>>>>>>> 232.86299499317252
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>>>>>>> 373.61071016166284
>>>>>>>>
>>>>>>>> The same values for 0.10.2.0 I get:
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>>>>>>> 1199859.5535022356
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>>>>>>> 3679340.80748852
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>>>>>>> 56.134778706069184
>>>>>>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
>>>>>>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>>>>>>> 136.10721427931827
>>>>>>>>
>>>>>>>> I notice that the results in 10.2.0 are much worse than the same for 10.1.1.
>>>>>>>>
>>>>>>>> I would like to know:
>>>>>>>> 1. Is there any benchmark on rocksdb as to what rate/latency it should be
>>>>>>>> doing put/fetch operations at?
>>>>>>>>
>>>>>>>> 2. What could be the cause of the inferior numbers in 10.2.0? Is it
>>>>>>>> because this application is also running three other threads doing the
>>>>>>>> same thing?
>>>>>>>>
>>>>>>>> 3. Also, what's with the name
>>>>>>>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1?
>>>>>>>> I wanted to put this as a part of my cronjob, so why can't we have a
>>>>>>>> simpler name like we have in 10.1.1, so it is easy to write the script?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Sachin
>>>>
>>>> --
>>>> -- Guozhang
>>
>> --
>> -- Guozhang
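Picking up question 3 and Guozhang's client.id explanation near the top of the thread: setting the client id explicitly should give stable, script-friendly metric names of the form clientId-StreamThread-N. A rough 0.10.2-style sketch; the application id matches the thread above, while the bootstrap server, the "host1" suffix and the topic names are just placeholders:

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class ClientIdExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
        // With an explicit client.id, the metric names become
        // "<client.id>-StreamThread-<n>" instead of carrying a random process
        // UUID, so a cron/monitoring script can predict them.
        props.put(StreamsConfig.CLIENT_ID_CONFIG, "new-part-advice-host1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        final KStreamBuilder builder = new KStreamBuilder();
        builder.stream("input-topic").to("output-topic");   // placeholder topology

        new KafkaStreams(builder, props).start();
    }
}

As noted earlier in the thread, the value just has to be unique per instance, so something like host name plus an instance number works when several instances share a machine.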