Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Grega Kešpret Mon, 06 Apr 2015 00:53:18 -0700

Hi!

I'd like to get the community's opinion on implementing a generic quantile
approximation algorithm for Spark that runs in O(n) time and requires limited
memory. I would find it useful, and I haven't found an existing implementation.
The plan is basically to wrap t-digest <https://github.com/tdunning/t-digest>,
implement the serialization/deserialization boilerplate, and provide


def cdf(x: Double): Double
def quantile(q: Double): Double


on RDD[Double] and RDD[(K, Double)].
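
To make the idea a bit more concrete, here is a rough sketch of the
RDD[Double] case. It assumes t-digest 3.2 or later (for
TDigest.createMergingDigest and MergingDigest.fromBytes) and a Spark
dependency on the classpath. The object name ApproxQuantiles, the helper
names, and the compression value of 100 are placeholders made up for
illustration, not an agreed API. Each partition builds its own digest, the
digests travel between tasks as byte arrays, and they are merged pairwise.

```scala
import java.nio.ByteBuffer

import com.tdunning.math.stats.{MergingDigest, TDigest}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper; names and defaults are illustrative only.
object ApproxQuantiles {
  // Assumed default: larger values trade memory for accuracy.
  val Compression = 100.0

  // Serialize a digest into a plain byte array so Spark can ship it around.
  def toBytes(d: TDigest): Array[Byte] = {
    val buf = ByteBuffer.allocate(d.byteSize())
    d.asBytes(buf)
    buf.array()
  }

  // Rebuild a digest from the bytes produced by toBytes.
  def fromBytes(bytes: Array[Byte]): TDigest =
    MergingDigest.fromBytes(ByteBuffer.wrap(bytes))

  // One pass over the data: one digest per non-empty partition, then a merge.
  // Note: reduce throws on an RDD with no elements at all.
  def digest(rdd: RDD[Double]): TDigest = {
    val merged = rdd
      .mapPartitions { values =>
        if (values.isEmpty) Iterator.empty
        else {
          val d = TDigest.createMergingDigest(Compression)
          values.foreach(v => d.add(v))
          Iterator.single(toBytes(d))
        }
      }
      .reduce { (a, b) =>
        val d = fromBytes(a)
        d.add(fromBytes(b)) // merge the other partition's centroids
        toBytes(d)
      }
    fromBytes(merged)
  }
}
```

With that in place, ApproxQuantiles.digest(rdd).quantile(0.95) would give
an approximate 95th percentile and ApproxQuantiles.digest(rdd).cdf(x) an
approximate CDF value. The RDD[(K, Double)] case would presumably apply the
same merge step per key.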

Let me know what you think. Any other ideas/suggestions also welcome!

Best,
Grega
--
Grega Kešpret
Senior Software Engineer, Analytics

Skype: gregakespret
celtra.com <http://www.celtra.com/> | @celtramobile
<http://www.twitter.com/celtramobile>
