Have you looked at t-digests? Computing percentiles (including the median) is inherently difficult and inefficient in a distributed system, because an exact answer needs a global ordering of the data. A t-digest is a compact probabilistic summary: you can build one per partition, merge them, and estimate any percentile from the result with a tunable margin of error.
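For what it's worth, here is a rough sketch of the idea in plain Python. It is a toy that only borrows the t-digest approach of keeping small centroids near the tails and larger ones near the middle; it is not the API of the library linked in this message, and the names `compress`, `build`, `merge`, `quantile` and the parameter `delta` are my own assumptions for illustration.

```python
import random
import statistics

def compress(centroids, delta=100.0):
    """Greedily merge adjacent (mean, weight) centroids under a t-digest-style size bound."""
    centroids = sorted(centroids)
    total = sum(w for _, w in centroids)
    out = []
    seen = 0.0  # total weight of the centroids before out[-1]
    for mean, w in centroids:
        if out:
            m0, w0 = out[-1]
            # Quantile at the middle of the candidate merged centroid.
            q = (seen + (w0 + w) / 2.0) / total
            # Allowed size shrinks toward q = 0 and q = 1, so tails stay precise.
            limit = 4.0 * total * q * (1.0 - q) / delta
            if w0 + w <= limit:
                out[-1] = ((m0 * w0 + mean * w) / (w0 + w), w0 + w)
                continue
            seen += w0
        out.append((mean, w))
    return out

def build(values, delta=100.0):
    """Summarise one partition: every value starts as a centroid of weight 1."""
    return compress([(float(v), 1.0) for v in values], delta)

def merge(a, b, delta=100.0):
    """Combine two summaries; this is the step that makes the approach distributable."""
    return compress(a + b, delta)

def quantile(centroids, q):
    """Estimate the q-quantile by interpolating between centroid means."""
    if not centroids:
        raise ValueError("empty digest")
    total = sum(w for _, w in centroids)
    target = q * total
    cum = 0.0
    prev_mid = prev_mean = None
    for mean, w in centroids:
        mid = cum + w / 2.0  # cumulative weight at the centre of this centroid
        if target <= mid:
            if prev_mean is None:
                return mean
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mid, prev_mean = mid, mean
        cum += w
    return centroids[-1][0]

# Simulate four partitions, summarise each one locally, then merge the summaries.
random.seed(0)
partitions = [[random.random() for _ in range(5000)] for _ in range(4)]
digests = [build(p) for p in partitions]
merged = digests[0]
for d in digests[1:]:
    merged = merge(merged, d)

approx_median = quantile(merged, 0.5)
exact_median = statistics.median(v for p in partitions for v in p)
print(len(merged), approx_median, exact_median)
```

Only the small per-partition summaries have to be shipped and combined, never the raw values. A larger `delta` keeps more centroids and should give a tighter estimate at the cost of memory.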
https://github.com/tdunning/t-digest