Re: Loosen the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Ah yes, right, I forgot that it exists. Thanks! I'm aware of some implementations of approximate calculation (I guess what we call an approximate median is an approximate percentile at 50%), but I didn't know about implementation details like support for accumulative updates. Given current source values of
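
For context, the accumulative implementation being referred to is presumably something like Spark's internal Greenwald-Khanna summary. A minimal sketch, assuming the internal org.apache.spark.sql.catalyst.util.QuantileSummaries API (an internal class, so names and signatures may differ between Spark versions):

    import org.apache.spark.sql.catalyst.util.QuantileSummaries

    // Build a summary incrementally: values are inserted one at a time,
    // so the full set of per-task values never needs to be retained.
    var summary = new QuantileSummaries(
      QuantileSummaries.defaultCompressThreshold, relativeError = 0.01)

    Seq(12.0, 7.0, 42.0, 3.0, 19.0).foreach { v => summary = summary.insert(v) }
    summary = summary.compress()

    // Partial summaries built from different tasks can also be merged:
    // val merged = summary.merge(otherCompressedSummary)

    // An "approximate median" is just the approximate 50th percentile.
    // query returns Option[Double] in recent Spark versions.
    val median: Option[Double] = summary.query(0.5)
    println(s"approx median = $median")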

Re: Loosen the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
Yep, that's clear. That's a reasonable case. There are already approximate median computations, implemented in Spark, that can be done cumulatively as you say. I think it's reasonable to consider this for performance, as it can be faster with just a small error tolerance. But yeah, up to you if you h
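
For reference, the existing support Sean mentions is reachable through the public DataFrame API (available since Spark 2.0); a minimal sketch with made-up data and column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("approx-median").getOrCreate()
    import spark.implicits._

    val taskDurations = Seq(120L, 95L, 310L, 88L, 143L).toDF("durationMs")

    // approxQuantile(column, probabilities, relativeError): a relativeError
    // of 0.0 forces an exact (expensive) computation; a small positive value
    // trades a bounded error for much less work.
    val Array(approxMedian) =
      taskDurations.stat.approxQuantile("durationMs", Array(0.5), 0.01)
    println(s"approx median duration = $approxMedian ms")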

Re: Loosen the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
Thanks all for providing input! Maybe I wasn't clear about my intention. The issue I'm focusing on is this: there are plenty of metrics defined in a stage for SQL, and each metric has a value for each task; these are grouped later to calculate aggregated values. (e.g. the metric for "elapsed time" is shown
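
A rough sketch of the kind of aggregation being described (not the actual SQLMetric code): each metric has one value per task, and an exact median forces a full sort of those per-task values for every metric, whereas min/max/sum need only a single pass.

    // One value per task for a single metric; the UI derives summary stats.
    def exactStats(taskValues: Array[Long]): (Long, Long, Long) = {
      val sorted = taskValues.sorted          // O(n log n) per metric
      val median = sorted(sorted.length / 2)  // (upper) middle element
      (sorted.head, median, sorted.last)      // (min, median, max)
    }

    val elapsedTimes = Array(130L, 88L, 412L, 95L, 101L) // ms, one per task
    val (mn, med, mx) = exactStats(elapsedTimes)
    println(s"min: $mn ms, median: $med ms, max: $mx ms")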

Re: Loosen the requirement of "median" of the SQL metrics

2019-11-27 Thread Sean Owen
How big is the overhead, at scale? If it has a non-trivial effect on most jobs, I could imagine reusing the existing approximate quantile support to more efficiently find a pretty-close median. On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim wrote: > > Hi Spark devs, > > The change might be specifi
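
If the same question were posed in SQL terms, the approximate machinery is also exposed as a built-in function; the table and column names below are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("approx-percentile-sql").getOrCreate()
    import spark.implicits._

    Seq(120L, 95L, 310L, 88L, 143L).toDF("durationMs")
      .createOrReplaceTempView("task_metrics")

    // approx_percentile(col, percentage, accuracy): larger accuracy means
    // lower error at the cost of more memory; 10000 is the documented default.
    spark.sql(
      "SELECT approx_percentile(durationMs, 0.5, 10000) AS approxMedian FROM task_metrics"
    ).show()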

Re: Loosen the requirement of "median" of the SQL metrics

2019-11-27 Thread Mayur Rustagi
Another option could be to use a sketch to get an approximate median (extendable to quantiles as well). When the number of tasks is small, the sketch would give an accurate value, since there are few values to summarize; for larger task counts the benefit will be significant. Regards, Mayur Rustagi Ph: +1 (650) 937 9673 http://www.sigmoid.com
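
A toy, self-contained illustration of this idea: a mergeable sketch that stays exact while values are few and degrades to a bucketed-histogram approximation once they grow. Production sketches (t-digest, KLL, Greenwald-Khanna) are far more refined; the bucket width and threshold here are invented for the example.

    final case class MedianSketch(
        exact: Vector[Double],      // raw values: exact answers while small
        buckets: Map[Long, Long],   // bucket index -> count, once spilled
        width: Double,              // bucket width for the approximate mode
        maxExact: Int) {

      private def bucketOf(v: Double): Long = math.floor(v / width).toLong

      // Convert the retained raw values into histogram buckets.
      private def spilled: Map[Long, Long] =
        exact.foldLeft(buckets) { (m, v) =>
          val k = bucketOf(v)
          m.updated(k, m.getOrElse(k, 0L) + 1L)
        }

      def add(v: Double): MedianSketch =
        if (buckets.isEmpty && exact.length < maxExact) copy(exact = exact :+ v)
        else {
          val m = spilled
          val k = bucketOf(v)
          copy(exact = Vector.empty, buckets = m.updated(k, m.getOrElse(k, 0L) + 1L))
        }

      // Merging stays exact only while the combined size is still small.
      def merge(that: MedianSketch): MedianSketch =
        if (buckets.isEmpty && that.buckets.isEmpty &&
            exact.length + that.exact.length <= maxExact)
          copy(exact = exact ++ that.exact)
        else {
          val combined = that.spilled.foldLeft(spilled) { case (m, (k, c)) =>
            m.updated(k, m.getOrElse(k, 0L) + c)
          }
          copy(exact = Vector.empty, buckets = combined)
        }

      def median: Option[Double] =
        if (exact.nonEmpty) Some(exact.sorted.apply(exact.length / 2)) // exact
        else if (buckets.isEmpty) None
        else {
          // Walk buckets in order until half the total count is covered,
          // then report that bucket's midpoint as the approximate median.
          val total = buckets.values.sum
          var seen = 0L
          var result: Option[Double] = None
          for ((idx, count) <- buckets.toSeq.sortBy(_._1) if result.isEmpty) {
            seen += count
            if (seen * 2 >= total) result = Some(idx * width + width / 2)
          }
          result
        }
    }

    object MedianSketch {
      def empty: MedianSketch =
        MedianSketch(Vector.empty, Map.empty, width = 10.0, maxExact = 1000)
    }

    // Usage: accumulate per-task values, merge partial sketches, query once.
    val s = Seq(3.0, 17.0, 9.0, 25.0, 11.0).foldLeft(MedianSketch.empty)(_.add(_))
    println(s.median) // Some(11.0) -- still exact, since only 5 values were added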