Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Ray Ortigas Wed, 10 Jun 2015 17:48:06 -0700

Hi Grega and Reynold,

Grega, in case you still want to use t-digest: I thought your suggestion
was a good idea, so I filed this PR.


https://github.com/tdunning/t-digest/pull/56

If it is helpful, feel free to use it however you like.

Regards,
Ray


On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <[email protected]> wrote:

> This mailing list is good. Just one note -- a lot of people are swamped right
> before Spark Summit, so you might not get prompt responses this week.
>
>
> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <[email protected]> wrote:
>
>> I have some time to work on it now. What's a good way to continue the
>> discussions before coding it?
>>
>> This e-mail list, JIRA or something else?
>>
>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <[email protected]> wrote:
>>
>>> I think those are great to have. I would put them in the DataFrame API
>>> though, since this applies to structured data. Many of the advanced
>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>> now that we have it.
>>>
>>> One thing that would be great to understand is what state-of-the-art
>>> alternatives are out there. I did a quick Google Scholar search using the
>>> keyword "approximate quantile" and found some older papers. Just the
>>> first few I found:
>>>
>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by Bell Labs
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>  by Bruce Lindsay, IBM
>>>
>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>
>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <[email protected]> wrote:
>>>
>>>> Hi!
>>>>
>>>> I'd like to get the community's opinion on implementing a generic quantile
>>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>>> I would find it useful and I haven't found any existing implementation. The
>>>> plan was basically to wrap t-digest
>>>> <https://github.com/tdunning/t-digest>, implement the
>>>> serialization/deserialization boilerplate, and provide
>>>>
>>>> def cdf(x: Double): Double
>>>> def quantile(q: Double): Double
>>>>
>>>>
>>>> on RDD[Double] and RDD[(K, Double)].
>>>>
>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>
>>>> Best,
>>>> Grega
>>>> --
>>>> Grega Kešpret
>>>> Senior Software Engineer, Analytics
>>>>
>>>> Skype: gregakespret
>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>> <http://www.twitter.com/celtramobile>
>>>>
>>>>
>>>
>>
>
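
For reference, the shape of the API proposed in Grega's original message
above (cdf and quantile answered from per-partition summaries that are then
merged) can be sketched without Spark or t-digest. This is only an
illustration under stated assumptions: QuantileSketch, Summary and summarize
are hypothetical names, the summary keeps every value (so it is exact and not
memory-bounded, unlike t-digest), and plain Scala collections stand in for
RDD partitions.

```scala
object QuantileSketch {
  // Hypothetical stand-in for a t-digest. It stores every value, so it is
  // exact and NOT memory-bounded; only the add/merge/cdf/quantile shape matters.
  final case class Summary(values: Vector[Double] = Vector.empty) {
    // Fold one value into the summary (the per-partition step).
    def add(x: Double): Summary = Summary(values :+ x)

    // Combine two partial summaries (the cross-partition step).
    def merge(other: Summary): Summary = Summary(values ++ other.values)

    // Fraction of observed values that are <= x.
    def cdf(x: Double): Double =
      if (values.isEmpty) Double.NaN
      else values.count(_ <= x).toDouble / values.size

    // Smallest observed value v such that cdf(v) >= q, for q in [0, 1].
    def quantile(q: Double): Double = {
      require(q >= 0.0 && q <= 1.0, "q must be in [0, 1]")
      if (values.isEmpty) Double.NaN
      else {
        val sorted = values.sortWith(_ < _)
        sorted(math.max(0, math.ceil(q * sorted.size).toInt - 1))
      }
    }
  }

  // Each inner Seq plays the role of one partition: summarize each partition
  // on its own, then merge the partial summaries into one.
  def summarize(partitions: Seq[Seq[Double]]): Summary =
    partitions
      .map(_.foldLeft(Summary())((s, x) => s.add(x)))
      .foldLeft(Summary())((a, b) => a.merge(b))
}
```

With Spark, the same two steps (add within a partition, merge across
partitions) are what one would hand to an aggregate-style operation, with the
stand-in replaced by a real bounded-memory digest.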
