How big is the overhead, at scale? If it has a non-trivial effect for most jobs, I could imagine reusing the existing approximate quantile support to more efficiently find a pretty-close median.
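To illustrate what "reusing the existing approximate quantile support" could look like, here is a minimal sketch using the public DataFrameStatFunctions.approxQuantile API (which is backed by Spark's Greenwald-Khanna summaries). This is not the SQLAppStatusListener code path; the column name, sample values, and error bound are made up for the example.

  import org.apache.spark.sql.SparkSession

  object ApproxMedianExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("approx-median")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      // Pretend these are the per-task values of one SQL metric.
      val metricValues = Seq(3L, 7L, 1L, 9L, 4L, 12L, 5L).toDF("value")

      // The 0.5 quantile with a 1% relative error bound: a "pretty-close" median
      // without sorting all values exactly.
      val Array(approxMedian) =
        metricValues.stat.approxQuantile("value", Array(0.5), 0.01)

      println(s"approximate median = $approxMedian")
      spark.stop()
    }
  }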
On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Hi Spark devs,
>
> The change might be specific to the SQLAppStatusListener, but since it may
> change the value of a metric shown in the UI, I would like to hear some
> voices on this.
>
> When we aggregate a SQL metric across tasks, we apply "sum", "min",
> "median", and "max", all of which are cumulative except "median". Median
> differs from "average" in that it helps to get rid of outliers, but if
> that's the only purpose, we may not strictly need the exact value of the
> median.
>
> I'm not sure how much the value would lose its meaning as a representative,
> but if it doesn't hurt much, what about taking a median of medians? For
> example, take the median of each group of 10 nearby tasks and store it as
> one of the median values, and finally take the median of those medians. If
> I calculate correctly, that would only require 11% of the slots when the
> number of tasks is 100, and would replace sorting 100 elements with sorting
> 10 elements 11 times. The difference would be bigger as the number of tasks
> grows.
>
> Just a rough idea, so any feedback is appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
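For reference, a rough, self-contained sketch of the median-of-medians idea described above; the object and method names are hypothetical, not actual SQLAppStatusListener code, and it assumes non-empty input.

  object MedianOfMedians {
    // Exact median of a non-empty sequence (upper middle for even lengths).
    private def exactMedian(values: Seq[Long]): Long = {
      val sorted = values.sorted
      sorted(sorted.length / 2)
    }

    // Approximate median: one median per group of `groupSize` tasks,
    // then the median of those group medians.
    def approxMedian(taskValues: Seq[Long], groupSize: Int = 10): Long = {
      val groupMedians = taskValues
        .grouped(groupSize)     // sort groupSize elements at a time...
        .map(exactMedian)       // ...instead of sorting everything at once
        .toSeq
      exactMedian(groupMedians)
    }
  }

For 100 tasks and groupSize = 10 this keeps 10 group medians plus the final result (11 slots, i.e. the ~11% mentioned above) and replaces one sort of 100 elements with 11 sorts of about 10 elements each.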