Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Ah yes, right, I forgot that it exists. Thanks! I'm aware of some implementations of approximate calculations (I guess what we call approximate median is the approximate percentile at 50%), but I didn't know implementation details such as whether they support cumulative updates. Given current source values of

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
Yep, that's clear. That's a reasonable case. There are already approximate median computations that can be done cumulatively as you say, implemented in Spark. I think it's reasonable to consider this for performance, as it can be faster with just a small error tolerance. But yeah up to you if you h

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Thanks all for providing inputs! Maybe I wasn't clear about my intention. The issue I'm focusing on is this: there are plenty of metrics defined in a stage for SQL, and each metric has a value per task; these values are grouped later to calculate aggregated values. (e.g. metric for "elapsed time" is shown

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
How big is the overhead, at scale? If it has a non-trivial effect for most jobs, I could imagine reusing the existing approximate quantile support to more efficiently find a pretty-close median. On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim wrote: > > Hi Spark devs, > > The change might be specifi

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Mayur Rustagi
Another option could be to use a sketch to get an approximate median (extendable to quantiles as well) for a large number of tasks. A sketch would give an accurate value when tasks are few; for larger task counts the benefit will be good. Regards, Mayur Rustagi Ph: +1 (650) 937 9673 http://www.sigmoid.com
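
For readers of the archive, the sketch idea can be illustrated roughly as follows. This is a minimal Python illustration of a mergeable log-bucketed histogram (in the spirit of relative-error quantile sketches), not the implementation Spark uses. The class name `LogBucketSketch` and the `rel_err` parameter are hypothetical, and the sketch assumes metric values are strictly positive.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Hypothetical mergeable quantile sketch for strictly positive values."""

    def __init__(self, rel_err=0.01):
        # Bucket i covers (gamma^(i-1), gamma^i]; any value in a bucket is
        # within rel_err (relative) of that bucket's representative value.
        self.gamma = (1 + rel_err) / (1 - rel_err)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()  # bucket index -> number of values seen
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this sketch only handles positive values")
        idx = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[idx] += 1
        self.count += 1

    def merge(self, other):
        # Merging is just adding bucket counts, so updates are cumulative.
        if other.gamma != self.gamma:
            raise ValueError("sketches must share the same rel_err")
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen > rank:
                # Representative value of the bucket holding the ranked item.
                return 2 * self.gamma ** idx / (self.gamma + 1)
        raise ValueError("rank out of range")


# Two partial sketches (e.g. two batches of task values) merged into one.
first, second = LogBucketSketch(), LogBucketSketch()
for v in range(1, 501):
    first.add(v)
for v in range(501, 1002):
    second.add(v)
first.merge(second)
approx_median = first.quantile(0.5)
```

Memory is bounded by the number of distinct buckets rather than the number of tasks, which is where the benefit for large task counts would come from.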

Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Hi Spark devs, The change might be specific to the SQLAppStatusListener, but given that it may change the metric values shown in the UI, I would like to hear some voices on this. When we aggregate SQL metrics across tasks, we apply "sum", "min", "median", "max", all of which are cumulative excep
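
The distinction raised in this thread can be illustrated with a small Python sketch. It is not the SQLAppStatusListener code; the class name `RunningStats` is hypothetical. It shows that "sum", "min" and "max" need only constant-size state per metric, while an exact median, as discussed in the replies, requires keeping every task's value.

```python
import statistics


class RunningStats:
    """Hypothetical per-metric aggregator fed one task value at a time."""

    def __init__(self):
        self.total = 0
        self.minimum = None
        self.maximum = None
        # An exact median cannot be updated from a fixed-size state,
        # so every value has to be retained until it is computed.
        self.values = []

    def add(self, value):
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self.values.append(value)

    def median(self):
        # Sorts all retained values internally: O(n log n) time, O(n) memory.
        return statistics.median(self.values)


stats = RunningStats()
for task_value in [5, 1, 9, 3]:
    stats.add(task_value)
```

Dropping `values` would leave three fields per metric regardless of the number of tasks, which is the saving that relaxing the exact median would allow.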