Hi Spark devs,

The change might be specific to SQLAppStatusListener, but since it may change the metric values shown in the UI, I'd like to hear some voices on this.
When we aggregate a SQL metric across tasks, we apply "sum", "min", "median", and "max". All of these can be computed incrementally except "median", which requires retaining every task's value.

The median differs from the average in that it is robust to outliers, but if that is its only purpose, it may not strictly need to be the exact median. I'm not sure how much an approximation would weaken its representativeness, but if it doesn't hurt much, what about taking a median of medians? For example, take the median of each group of 10 consecutive tasks, store it as one of the group medians, and finally take the median of those medians. If I calculate correctly, that would require only 11% of the slots when the number of tasks is 100, and it replaces sorting 100 elements with sorting 10 elements 11 times. The savings grow as the number of tasks grows.

Just a rough idea (there's a small sketch below in the P.S.), so any feedback is appreciated.

Thanks,
Jungtaek Lim (HeartSaVioR)
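
P.S. To make the idea concrete, here is a minimal sketch in Scala. This is not based on any existing Spark code; the class and method names are hypothetical, and the group size of 10 matches the example above.

import scala.collection.mutable.ArrayBuffer

// Approximate median: buffer task values in groups of `groupSize`,
// keep only each completed group's median, then report the median
// of those group medians.
class MedianOfMedians(groupSize: Int = 10) {
  private val buffer = ArrayBuffer.empty[Long]       // current, incomplete group
  private val groupMedians = ArrayBuffer.empty[Long] // one value per completed group

  // Upper-middle element for even counts; fine for an approximation.
  private def medianOf(values: Seq[Long]): Long = {
    val sorted = values.sorted
    sorted(sorted.length / 2)
  }

  def add(value: Long): Unit = {
    buffer += value
    if (buffer.length == groupSize) {
      groupMedians += medianOf(buffer.toSeq) // sorts only groupSize elements
      buffer.clear()
    }
  }

  def result(): Long = {
    // Fold any trailing partial group in as one more group median.
    val medians = groupMedians.toSeq ++
      (if (buffer.nonEmpty) Seq(medianOf(buffer.toSeq)) else Nil)
    require(medians.nonEmpty, "no values added")
    medianOf(medians)
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val agg = new MedianOfMedians()
    (1L to 100L).foreach(agg.add)
    // Exact median of 1..100 is ~50; this median-of-medians gives 56,
    // which may be close enough if the goal is just outlier robustness.
    println(agg.result())
  }
}

As the demo shows, the result can drift from the exact median (56 vs. ~50 for 1..100), which is the trade-off I'd like opinions on.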