Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Nick Pentreath Thu, 18 Jun 2015 01:23:15 -0700

If it's going into the DataFrame API (which it probably should, rather than
into RDD itself), then it could become a UDT (similar to HyperLogLogUDT),
which would mean it doesn't have to implement Serializable, since it appears
that serialization is taken care of in the UDT definition (e.g.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
)


If I understand UDT SerDe correctly, that is.
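
To make the discussion a bit more concrete, the interface proposed further
down the thread (cdf and quantile, plus the serialization boilerplate) can be
sketched without Spark or t-digest. This is only a toy stand-in under stated
assumptions: it uses a fixed-width histogram over a range known in advance
instead of t-digest, and HistogramSummary, add, merge, toBytes and fromBytes
are hypothetical names, not Spark, UDT or t-digest API.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for a t-digest wrapper: a fixed-width histogram over a
// range [lo, hi) that is assumed to be known up front. Memory is bounded by the
// number of bins, and two summaries with the same layout can be merged.
final class HistogramSummary(val lo: Double, val hi: Double, val counts: Array[Long]) {
  require(hi > lo && counts.nonEmpty)

  private val width = (hi - lo) / counts.length

  def total: Long = counts.sum

  // Adds x in place; values outside [lo, hi) are clamped into the edge bins.
  def add(x: Double): HistogramSummary = {
    val i = math.min(counts.length - 1, math.max(0, ((x - lo) / width).toInt))
    counts(i) += 1
    this
  }

  // Combines two summaries built with the same range and bin count.
  def merge(other: HistogramSummary): HistogramSummary = {
    require(lo == other.lo && hi == other.hi && counts.length == other.counts.length)
    new HistogramSummary(lo, hi, counts.zip(other.counts).map { case (a, b) => a + b })
  }

  // Approximate fraction of values <= x, interpolating linearly inside a bin.
  def cdf(x: Double): Double = {
    if (total == 0) Double.NaN
    else if (x <= lo) 0.0
    else if (x >= hi) 1.0
    else {
      val pos = (x - lo) / width
      val i = math.min(counts.length - 1, pos.toInt)
      (counts.take(i).sum + counts(i) * (pos - i)) / total
    }
  }

  // Approximate value at rank q in [0, 1], the inverse of cdf.
  def quantile(q: Double): Double = {
    require(q >= 0.0 && q <= 1.0)
    val n = total
    if (n == 0) return Double.NaN
    val target = q * n
    var cum = 0.0
    var i = 0
    while (i < counts.length) {
      val c = counts(i)
      if (c > 0 && cum + c >= target) return lo + (i + (target - cum) / c) * width
      cum += c
      i += 1
    }
    hi
  }

  // Hand-rolled SerDe: range, bin count, then one long per bin.
  def toBytes: Array[Byte] = {
    val buf = ByteBuffer.allocate(8 + 8 + 4 + 8 * counts.length)
    buf.putDouble(lo).putDouble(hi).putInt(counts.length)
    counts.foreach(c => buf.putLong(c))
    buf.array()
  }
}

object HistogramSummary {
  def empty(lo: Double, hi: Double, bins: Int): HistogramSummary =
    new HistogramSummary(lo, hi, new Array[Long](bins))

  def fromBytes(bytes: Array[Byte]): HistogramSummary = {
    val buf = ByteBuffer.wrap(bytes)
    val lo = buf.getDouble()
    val hi = buf.getDouble()
    val n = buf.getInt()
    new HistogramSummary(lo, hi, Array.fill(n)(buf.getLong()))
  }
}

object QuantileSketchDemo {
  // Two in-memory "partitions" standing in for an RDD[Double].
  val partitions: Seq[Seq[Double]] =
    Seq((0 until 50).map(_.toDouble), (50 until 100).map(_.toDouble))

  // Summarize each partition, round-trip through bytes, then merge.
  val merged: HistogramSummary = partitions
    .map(p => p.foldLeft(HistogramSummary.empty(0.0, 100.0, 10))(_ add _))
    .map(s => HistogramSummary.fromBytes(s.toBytes))
    .reduce(_ merge _)
}
```

On an actual RDD[Double], add and merge would presumably play the roles of
seqOp and combOp in RDD.aggregate. With a UDT, the byte round-trip would
live in the UDT's serialize/deserialize rather than on the class itself.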

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <[email protected]>
wrote:

> Hi Grega and Reynold,
>
> Grega, if you still want to use t-digest, I filed this PR because I
> thought your t-digest suggestion was a good idea.
>
> https://github.com/tdunning/t-digest/pull/56
>
> If it is helpful, feel free to do whatever you like with it.
>
> Regards,
> Ray
>
>
> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <[email protected]> wrote:
>
>> This email is good. Just one note -- a lot of people are swamped right
>> before Spark Summit, so you might not get prompt responses this week.
>>
>>
>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <[email protected]> wrote:
>>
>>> I have some time to work on it now. What's a good way to continue the
>>> discussions before coding it?
>>>
>>> This e-mail list, JIRA or something else?
>>>
>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <[email protected]>
>>> wrote:
>>>
>>>> I think those are great to have. I would put them in the DataFrame API
>>>> though, since this is applying to structured data. Many of the advanced
>>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>>> now that we have it.
>>>>
>>>> One thing that would be great to understand is what state-of-the-art
>>>> alternatives are out there. I did a quick Google Scholar search using the
>>>> keyword "approximate quantile" and found some older papers. Just the
>>>> first few I found:
>>>>
>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs
>>>>
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>>  by Bruce Lindsay, IBM
>>>>
>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>
>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I'd like to get the community's opinion on implementing a generic quantile
>>>>> approximation algorithm for Spark that is O(n) and requires limited
>>>>> memory. I would find it useful and I haven't found any existing
>>>>> implementation. The plan was basically to wrap t-digest
>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>> serialization/deserialization boilerplate and provide
>>>>>
>>>>> def cdf(x: Double): Double
>>>>> def quantile(q: Double): Double
>>>>>
>>>>>
>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>
>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>
>>>>> Best,
>>>>> Grega
>>>>> --
>>>>> *Grega Kešpret*
>>>>> Senior Software Engineer, Analytics
>>>>>
>>>>> Skype: gregakespret
>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>> <http://www.twitter.com/celtramobile>
>>>>>
>>>>>
>>>>
>>>
>>
>
