Hi Spark devs,

The change might be specific to SQLAppStatusListener, but since it may change the metric values shown in the UI, I'd like to hear some voices on this.
When we aggregate a SQL metric across tasks, we apply "sum", "min", "median", and "max". All of these can be computed incrementally except "median", which requires retaining every task's value.

The median differs from the average in that it is robust to outliers, but if that is its only purpose, it may not strictly need to be the exact median. I'm not sure how much an approximation would weaken its representativeness, but if it doesn't hurt much, what about taking a median of medians? For example, take the median of each group of 10 consecutive tasks, store it as one of the group medians, and finally take the median of those medians. If I calculate correctly, that would require only 11% of the slots when the number of tasks is 100, and it replaces sorting 100 elements with sorting 10 elements 11 times. The savings grow as the number of tasks grows.

Just a rough idea (there's a small sketch below in the P.S.), so any feedback is appreciated.

Thanks,
Jungtaek Lim (HeartSaVioR)
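
P.S. To make the idea concrete, here is a minimal sketch in Scala. This is not based on any existing Spark code; the class and method names are hypothetical, and the group size of 10 matches the example above.

import scala.collection.mutable.ArrayBuffer

// Approximate median: buffer task values in groups of `groupSize`,
// keep only each completed group's median, then report the median
// of those group medians.
class MedianOfMedians(groupSize: Int = 10) {
  private val buffer = ArrayBuffer.empty[Long]       // current, incomplete group
  private val groupMedians = ArrayBuffer.empty[Long] // one value per completed group

  // Upper-middle element for even counts; fine for an approximation.
  private def medianOf(values: Seq[Long]): Long = {
    val sorted = values.sorted
    sorted(sorted.length / 2)
  }

  def add(value: Long): Unit = {
    buffer += value
    if (buffer.length == groupSize) {
      groupMedians += medianOf(buffer.toSeq) // sorts only groupSize elements
      buffer.clear()
    }
  }

  def result(): Long = {
    // Fold any trailing partial group in as one more group median.
    val medians = groupMedians.toSeq ++
      (if (buffer.nonEmpty) Seq(medianOf(buffer.toSeq)) else Nil)
    require(medians.nonEmpty, "no values added")
    medianOf(medians)
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val agg = new MedianOfMedians()
    (1L to 100L).foreach(agg.add)
    // Exact median of 1..100 is ~50; this median-of-medians gives 56,
    // which may be close enough if the goal is just outlier robustness.
    println(agg.result())
  }
}

As the demo shows, the result can drift from the exact median (56 vs. ~50 for 1..100), which is the trade-off I'd like opinions on.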