Hi Grega and Reynold,

Grega, if you still want to use t-digest, I filed this PR because I thought
your t-digest suggestion was a good idea.

https://github.com/tdunning/t-digest/pull/56

If it is helpful feel free to do whatever with it.

Regards,
Ray

On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:

> This email is good. Just one note -- a lot of people are swamped right
> before Spark Summit, so you might not get prompt responses this week.
>
>
> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> I have some time to work on it now. What's a good way to continue the
>> discussions before coding it?
>>
>> This e-mail list, JIRA or something else?
>>
>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I think those are great to have. I would put them in the DataFrame API
>>> though, since this is applying to structured data. Many of the advanced
>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>> now we have it.
>>>
>>> One thing that would be great to understand is what state-of-the-art
>>> alternatives are out there. I did a quick google scholar search using the
>>> keyword "approximate quantile" and found some older papers. Just the
>>> first few I found:
>>>
>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by bell labs
>>>
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>> by Bruce Lindsay, IBM
>>>
>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>
>>>
>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I'd like to get community's opinion on implementing a generic quantile
>>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>>> I would find it useful and I haven't found any existing implementation. The
>>>> plan was basically to wrap t-digest
>>>> <https://github.com/tdunning/t-digest>, implement the
>>>> serialization/deserialization boilerplate and provide
>>>>
>>>> def cdf(x: Double): Double
>>>> def quantile(q: Double): Double
>>>>
>>>>
>>>> on RDD[Double] and RDD[(K, Double)].
>>>>
>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>
>>>> Best,
>>>> Grega
>>>> --
>>>> *Grega Kešpret*
>>>> Senior Software Engineer, Analytics
>>>>
>>>> Skype: gregakespret
>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>> <http://www.twitter.com/celtramobile>
>>>>
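
For reference, the aggregation shape behind the proposed cdf/quantile helpers
can be sketched without Spark or t-digest on the classpath. This is only a
minimal sketch under stated assumptions: ExactDigest, summarize and
QuantileSketchDemo are hypothetical names, and the digest here is an exact
stand-in that stores every value, so it does not have the bounded memory that
t-digest is meant to provide. Only the add/merge/cdf/quantile structure is
meant to carry over.

```scala
// Hypothetical stand-in for a mergeable quantile summary. A real version
// would delegate to t-digest; this one keeps every value (exact, unbounded
// memory) so the example runs with no external dependencies.
final case class ExactDigest(sorted: Vector[Double]) {
  // Add one value, keeping the vector sorted.
  def add(x: Double): ExactDigest = {
    val (lo, hi) = sorted.span(_ <= x)
    ExactDigest((lo :+ x) ++ hi)
  }

  // Merge two summaries; this combines per-partition results.
  def merge(other: ExactDigest): ExactDigest =
    ExactDigest((sorted ++ other.sorted).sorted)

  // Fraction of values less than or equal to x.
  def cdf(x: Double): Double =
    if (sorted.isEmpty) Double.NaN
    else sorted.count(_ <= x).toDouble / sorted.size

  // Value at quantile q in [0, 1], using the nearest-rank definition.
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0, "q must be in [0, 1]")
    if (sorted.isEmpty) Double.NaN
    else {
      val rank = math.max(1, math.ceil(q * sorted.size).toInt)
      sorted(rank - 1)
    }
  }
}

object ExactDigest {
  val empty: ExactDigest = ExactDigest(Vector.empty)
}

object QuantileSketchDemo {
  // Build one summary per partition, then merge them. On an RDD[Double] this
  // has the same shape as rdd.aggregate(zero)(seqOp, combOp).
  def summarize(partitions: Seq[Seq[Double]]): ExactDigest =
    partitions
      .map(_.foldLeft(ExactDigest.empty)((d, x) => d.add(x)))
      .foldLeft(ExactDigest.empty)((a, b) => a.merge(b))

  def main(args: Array[String]): Unit = {
    // Plain collections stand in for RDD partitions.
    val partitions = Seq(Seq(5.0, 1.0, 3.0), Seq(2.0, 4.0), Seq.empty[Double])
    val digest = summarize(partitions)
    println(digest.quantile(0.5)) // 3.0
    println(digest.cdf(2.0))      // 0.4
  }
}
```

For the RDD[(K, Double)] case, the same add and merge steps would presumably
be applied per key (for example through aggregateByKey) rather than over the
whole dataset.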