If it's going into the DataFrame API (which it probably should, rather than into RDD itself), then it could become a UDT (similar to HyperLogLogUDT). That would mean it doesn't have to implement Serializable, since serialization appears to be taken care of in the UDT definition (e.g. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254 ).
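
To make that concrete, here is a rough sketch of the idea as I understand it. It deliberately does not use Spark's actual UDT classes: `ToyDigest`, `ToySerDe` and `ToyDigestSerDe` are made-up names, and the byte layout is only an assumption for illustration. The point is just that the digest class itself is not Serializable, while a separate object supplies the serialize/deserialize pair.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for a quantile sketch: a list of (mean, count) centroids.
// This is NOT t-digest, and note that it does not extend Serializable.
final class ToyDigest(val centroids: Array[(Double, Long)])

// Hypothetical trait mirroring the serialize/deserialize pair that a UDT-like
// wrapper would supply on behalf of the wrapped type.
trait ToySerDe[T] {
  def serialize(obj: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}

object ToyDigestSerDe extends ToySerDe[ToyDigest] {
  // Assumed layout: 4-byte centroid count, then (8-byte mean, 8-byte count) pairs.
  def serialize(d: ToyDigest): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + d.centroids.length * 16)
    buf.putInt(d.centroids.length)
    d.centroids.foreach { case (mean, count) =>
      buf.putDouble(mean)
      buf.putLong(count)
    }
    buf.array()
  }

  // Reads the centroids back in the same order they were written.
  def deserialize(bytes: Array[Byte]): ToyDigest = {
    val buf = ByteBuffer.wrap(bytes)
    val n = buf.getInt()
    new ToyDigest(Array.fill(n)((buf.getDouble(), buf.getLong())))
  }
}
```

With something of this shape, a round trip such as `ToyDigestSerDe.deserialize(ToyDigestSerDe.serialize(d))` should give back the same centroids, and the digest class never needs to know how it is shipped around.
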
Am I understanding UDT SerDe correctly?

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <rorti...@linkedin.com.invalid> wrote:

> Hi Grega and Reynold,
>
> Grega, if you still want to use t-digest, I filed this PR because I
> thought your t-digest suggestion was a good idea.
>
> https://github.com/tdunning/t-digest/pull/56
>
> If it is helpful feel free to do whatever with it.
>
> Regards,
> Ray
>
> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> This email is good. Just one note -- a lot of people are swamped right
>> before Spark Summit, so you might not get prompt responses this week.
>>
>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>>
>>> I have some time to work on it now. What's a good way to continue the
>>> discussions before coding it?
>>>
>>> This e-mail list, JIRA or something else?
>>>
>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> I think those are great to have. I would put them in the DataFrame API
>>>> though, since this is applying to structured data. Many of the advanced
>>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>>> now we have it.
>>>>
>>>> One thing that would be great to understand is what state-of-the-art
>>>> alternatives are out there. I did a quick google scholar search using the
>>>> keyword "approximate quantile" and found some older papers. Just the
>>>> first few I found:
>>>>
>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf by bell labs
>>>>
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>> by Bruce Lindsay, IBM
>>>>
>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>
>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I'd like to get community's opinion on implementing a generic quantile
>>>>> approximation algorithm for Spark that is O(n) and requires limited
>>>>> memory.
>>>>> I would find it useful and I haven't found any existing implementation.
>>>>> The plan was basically to wrap t-digest
>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>> serialization/deserialization boilerplate and provide
>>>>>
>>>>> def cdf(x: Double): Double
>>>>> def quantile(q: Double): Double
>>>>>
>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>
>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>
>>>>> Best,
>>>>> Grega
>>>>> --
>>>>> *Grega Kešpret*
>>>>> Senior Software Engineer, Analytics
>>>>>
>>>>> Skype: gregakespret
>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>> <http://www.twitter.com/celtramobile>