Re: [DISCUSS] Would like to make collective intelligence about Metrics on Storm

Adam Meyerowitz (BLOOMBERG/ 731 LEX) Mon, 09 May 2016 06:53:37 -0700

Jungtaek, thanks for the followup response.

For #1, having this in the Storm UI would be very nice and, I think, of general 
interest to anyone who is tasked with maintaining Storm deployments, and 
certainly during development for capacity and stress testing.  I'm not sure 
what it takes to get it into the UI, but it sounds like a good change.


For #2, having metrics reporting affect the realtime system is not ideal.  
I'm not sure how this is all implemented or what challenges are involved, so 
it's easy for me to say, but it seems that periodic reporting of aggregated 
stats, done by each task itself in a separate thread, would be sufficient and 
hopefully would not impact performance.  That aggregation could include the 
things we are interested in, such as min/max/average and percentiles.
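
A minimal sketch of the per-task aggregation idea above, assuming a 
hypothetical LatencyAggregator class (none of these names are Storm API).  The 
hot path only appends to a buffer; a separate daemon thread periodically 
drains it and computes min/max/mean and a nearest-rank percentile.

```java
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of Storm.
final class LatencyAggregator {
    private long[] buf = new long[1024];
    private int size = 0;

    // Called from the task's processing path: just append one sample.
    synchronized void record(long latencyMs) {
        if (size == buf.length) buf = Arrays.copyOf(buf, size * 2);
        buf[size++] = latencyMs;
    }

    // Copy out the current window and reset it, under the same lock.
    private synchronized long[] drain() {
        long[] out = Arrays.copyOf(buf, size);
        size = 0;
        return out;
    }

    static final class Summary {
        final long count, min, max, p99;
        final double mean;
        Summary(long count, long min, long max, double mean, long p99) {
            this.count = count; this.min = min; this.max = max;
            this.mean = mean; this.p99 = p99;
        }
        @Override public String toString() {
            return "count=" + count + " min=" + min + " max=" + max
                 + " mean=" + mean + " p99=" + p99;
        }
    }

    // Sorting and aggregation happen here, off the hot path.
    // Returns null when no samples arrived in the window.
    Summary snapshot() {
        long[] s = drain();
        if (s.length == 0) return null;
        Arrays.sort(s);
        long sum = 0;
        for (long v : s) sum += v;
        int rank = (99 * s.length + 99) / 100; // ceil(0.99 * n), nearest rank
        return new Summary(s.length, s[0], s[s.length - 1],
                           (double) sum / s.length, s[rank - 1]);
    }

    // Report once per period from a dedicated daemon thread.
    ScheduledExecutorService startReporting(long periodSecs) {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "metrics-reporter");
                t.setDaemon(true);
                return t;
            });
        ses.scheduleAtFixedRate(() -> {
            Summary s = snapshot();
            if (s != null) System.out.println(s);
        }, periodSecs, periodSecs, TimeUnit.SECONDS);
        return ses;
    }
}
```

The synchronized record() call is still a cost on the hot path; a real 
implementation would probably want something cheaper, such as per-thread 
buffers or a fixed-size histogram.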

From: [email protected] At: May  8 2016 23:01:12
To: Adam Meyerowitz (BLOOMBERG/ 731 LEX), [email protected]
Subject: Re: [DISCUSS] Would like to make collective intelligence about Metrics 
on Storm

Hi Adam,

Thanks for the great input! Let me share my thoughts on two things.

1. There are metrics for the disruptor queue, so if you attach a metrics 
consumer they will be delivered to it. Sojourn time for the queue is also 
provided (kudos to Li Wang), but it is based on queueing theory and has one 
precondition, so its value sometimes looks unstable (especially for 
problematic tasks).

2. Agreed. There could be latency SLAs for a specific topology, and then we 
would really want to see outliers and percentiles, too. Since providing them 
may affect performance, we should address them with care. I believe we will 
eventually provide various kinds of latency information. Stay tuned.
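
On point 1: if the sojourn time follows Little's law, as "based on queueing 
theory" suggests, it can be sketched as below. This is an assumption about the 
approach, not a copy of Storm's code, and the class and method names are 
hypothetical. Little's law holds for a queue in steady state, which is 
presumably the precondition in question; for a task that is falling behind, 
the estimate will swing with the measured arrival rate.

```java
// Hypothetical illustration only; not part of Storm.
final class SojournEstimate {
    // Little's law: L = lambda * W, so W = L / lambda.
    //   population     = tuples currently in the queue (L)
    //   arrivalsPerSec = measured arrival rate (lambda)
    // Returns the estimated time a tuple spends in the queue, in ms.
    static double sojournMs(double population, double arrivalsPerSec) {
        // Guard against division by zero when nothing has arrived yet.
        double rate = Math.max(arrivalsPerSec, 1e-5);
        return population / rate * 1000.0;
    }
}
```

For example, 50 queued tuples at 1000 arrivals per second gives an estimate of 
about 50 ms.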

Thanks,
Jungtaek Lim (HeartSaVioR)
On Fri, May 6, 2016 at 10:29 PM, Adam Meyerowitz (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

I recall seeing in another thread a discussion about monitoring metrics for 
various queues within a worker.  For us this would be pretty key for each 
executor input and output LMAX queue as well as the worker level input and 
output queues.  In our topologies we run one task per executor so it would help 
us get a much better understanding of the performance of our components.  If 
acking is turned off, which it is for our topologies, it's hard to get the full 
picture of the performance of the various components we have.  The execute and 
process latency only tells part of a larger story.  For the queues, generally 
we would like to see queue utilization and how long tuples stayed on the queue.

Also, we would generally like more than the average: for example, 
min/max/average/standard deviation, percentiles, and so on.  The average 
smooths out the bumps, which is useful, but we'd gain more insight from 
understanding outliers and the larger performance picture.


From: [email protected] At: Apr 20 2016 00:30:05
To: [email protected]
Subject: Re: [DISCUSS] Would like to make collective intelligence about Metrics 
on Storm

Let me start by sharing my thoughts. :)

1. Need to enrich docs about metrics / stats.

In fact, when I was new to Apache Storm I could not find in the docs that 
topology stats are sampled by default with a sample rate of 0.05. That misled 
me and made me ask, "Why do the counts differ?" I have also seen mails on 
user@ asking the same question. It would be better to include this in the 
guide docs.
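
To illustrate why sampled counts differ from real ones, here is a small 
sketch. It assumes each sampled tuple is scaled up by 1/rate (so 20 at a rate 
of 0.05); the SampledCounter class is hypothetical and is not Storm's 
implementation.

```java
import java.util.Random;

// Hypothetical illustration only; not part of Storm.
final class SampledCounter {
    private final double rate;
    private final long step;   // how much one sampled tuple counts for
    private final Random rng;
    private long count = 0;

    SampledCounter(double rate, long seed) {
        this.rate = rate;
        this.step = Math.round(1.0 / rate); // 20 for a rate of 0.05
        this.rng = new Random(seed);
    }

    // Called once per tuple; only a fraction of calls touch the counter.
    void onTuple() {
        if (rng.nextDouble() < rate) count += step;
    }

    long count() { return count; }
}
```

With a rate of 0.05 the reported count is always a multiple of 20 and only 
approximates the true number of tuples, which is why two related counters can 
disagree. With a rate of 1.0 every tuple is counted and the value is exact.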

Also, the Metrics documentation page is not well written. I think it has 
appropriate headings, but it lacks content under each heading. 
That should be addressed, and introducing some external metrics consumer 
plugins (like storm-graphite from Verisign) would be great, too.

2. Need to increase the sample rate or (ideally) not sample at all.

Let's set the performance hit aside for now.
Ideally, we expect the precision of metrics to improve as we increase the 
sample rate. This affects the non-gauge kinds of metrics, such as counters and 
latencies.

By the way, I would like to hear opinions on latency, since I'm not an expert. 
Storm provides only average latency, and even that is based on sampling. Are 
we OK with this? If not, how much would having percentiles as well help us?

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Apr 20, 2016 at 10:55 AM, Jungtaek Lim <[email protected]> wrote:

Hi Storm users,

I'm Jungtaek Lim, a committer and PMC member of Apache Storm.

If you subscribe to the dev@ mailing list, you may have seen that we have 
recently been working on the metrics feature of Apache Storm.

For now, improvements are moving forward based on the current metrics feature.

- Improve (Topology) MetricsConsumer
- Provide topology metrics in detail (metrics per each stream)
- (WIP) Introduce Cluster Metrics Consumer

As I don't maintain a large cluster myself, I really want to collect ideas for 
improvement, inconveniences, and use cases of metrics from community members, 
so that we move forward in the right direction.

Let's talk!

Thanks in advance,
Jungtaek Lim (HeartSaVioR)
