Re: percentile_approx slowness

2014-10-02 Thread Prasanth Jayachandran
You can look at the explode() and posexplode() UDFs in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode Thanks, Prasanth Jayachandran On Oct 2, 2014, at 7:15 AM, Kevin Weiler wrote: > Hi all, > > I wanted to note that I figured out a better soluti

Re: percentile_approx slowness

2014-10-02 Thread Kevin Weiler
Hi all, I wanted to note that I figured out a better solution to my problem. I was selecting each percentile I wanted to compute (0.1, 0.5, 0.9, etc.) as an individual percentile calculation, which was blowing up my query. It turns out that if you do it like this: SELECT PERCENTILE(col, array(0
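To illustrate why asking for several percentiles in one call can be cheaper than one calculation per percentile, here is a minimal Python sketch. It is not Hive's implementation; the function name `percentiles` and the linear-interpolation rule are assumptions made for illustration. The point is only that the expensive step (ordering the data) happens once, and each requested fraction is then read off the same ordered data.

```python
import math


def percentiles(values, fractions):
    """Return one percentile per requested fraction, sorting the data only once.

    Hypothetical helper for illustration; uses linear interpolation between
    the two nearest ranks.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)          # the single expensive pass
    last = len(ordered) - 1
    results = []
    for p in fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each fraction must be in [0, 1]")
        pos = p * last                # fractional rank of this percentile
        lo = math.floor(pos)
        hi = math.ceil(pos)
        weight = pos - lo
        results.append(ordered[lo] * (1 - weight) + ordered[hi] * weight)
    return results
```

Calling `percentiles(data, [0.1, 0.5, 0.9])` does one sort, whereas three separate single-percentile calls would each repeat it.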

Re: percentile_approx slowness

2014-09-25 Thread j.barrett Strausser
Not an answer to your question, but you can compute approximate percentiles with only the memory overhead of a single integer (two integers if you want better results): http://link.springer.com/chapter/10.1007/978-3-642-40273-9_7 So you could pretty easily implement the above algorithm as a pyth
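A rough Python sketch of a one-integer estimator of this kind follows. It assumes the linked chapter describes a "frugal streaming" style quantile estimator; the message does not name the algorithm, so treat the update rule, the function name `frugal_quantile`, and its parameters as assumptions rather than the exact method from the link. The only state kept between items is the integer `estimate`, which is nudged up or down by one with a probability that depends on the target quantile `q`.

```python
import random


def frugal_quantile(stream, q, rng=None, start=0):
    """Estimate the q-quantile of an integer stream using one integer of state.

    Hypothetical sketch of a frugal-style update: move the estimate up by one
    with probability q when an item is larger, and down by one with
    probability 1 - q when an item is smaller.
    """
    if rng is None:
        rng = random.Random()
    estimate = start
    for x in stream:
        r = rng.random()
        if x > estimate and r > 1 - q:
            estimate += 1
        elif x < estimate and r > q:
            estimate -= 1
    return estimate
```

Because the estimate moves by one unit per item, it needs a long stream (or a sensible `start` value) to get close to the true quantile, and the result is approximate by design.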

percentile_approx slowness

2014-09-25 Thread Kevin Weiler
Hi All, I have a query that attempts to compute percentiles on some datasets that are well in excess of 100,000,000 rows, and I have thus opted to use percentile_approx as we are routinely overrunning the memory. I’m having trouble finding the threshold at which I want to begin sampling. Before this da