Re: percentile_approx slowness

Prasanth Jayachandran Thu, 02 Oct 2014 11:31:30 -0700

Take a look at the explode() and posexplode() table-generating functions (UDTFs) in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
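
If you only need a small, fixed set of percentiles, indexing into the returned array may be enough on its own, since the array should come back in the same order as the percentiles you pass in. A rough sketch, assuming a hypothetical source table my_table with a grouping column grp, and a hypothetical target table percentile_summary (none of these names come from your query):

```sql
-- Hypothetical names: my_table, grp, col, percentile_summary.
-- Wide layout: one row per group, one column per percentile.
INSERT OVERWRITE TABLE percentile_summary
SELECT
  t.grp,
  t.p[0] AS p10,   -- array elements are 0-indexed
  t.p[1] AS p50,
  t.p[2] AS p90
FROM (
  SELECT grp, PERCENTILE(col, array(0.1, 0.5, 0.9)) AS p
  FROM my_table
  GROUP BY grp
) t;

-- Long layout: one row per (group, percentile position), via posexplode().
SELECT t.grp, e.pos, e.val
FROM (
  SELECT grp, PERCENTILE(col, array(0.1, 0.5, 0.9)) AS p
  FROM my_table
  GROUP BY grp
) t
LATERAL VIEW posexplode(t.p) e AS pos, val;
```

The first form fits when the list of percentiles is fixed. The second is probably easier to extend if you add more percentiles later, since pos tells you which entry of the input array each value belongs to.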


Thanks
Prasanth Jayachandran

On Oct 2, 2014, at 7:15 AM, Kevin Weiler <kevin.wei...@imc-chicago.com> wrote:

> Hi all,
> 
> I wanted to note that I figured out a better solution to my problem. I was 
> selecting each percentile I wanted to compute (0.1, 0.5, 0.9, etc.) as an 
> individual percentile calculation, which was blowing up my query. It turns 
> out that if you do it like this:
> 
> SELECT
>   PERCENTILE(col, array(0.1, 0.5, 0.9))
> 
> the aggregation doesn’t need to run multiple times and my query runs just 
> fine.
> 
> I now have one additional question.
> 
> I would like to store each percentile as a field in another Hive table. This 
> calculation returns an array. How can I break the array out into individual 
> fields to be put into a new table?
> 
> --
> Kevin Weiler
> IT
> 
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | 
> http://imc-chicago.com/
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: 
> kevin.wei...@imc-chicago.com
> 
> On Sep 25, 2014, at 3:35 PM, j.barrett Strausser 
> <j.barrett.straus...@gmail.com> wrote:
> 
>> Not an answer to your question, but you can compute approximate percentiles 
>> with the memory overhead of only a single integer (two integers if you want 
>> better results):
>> 
>> http://link.springer.com/chapter/10.1007/978-3-642-40273-9_7
>> 
>> So you could pretty easily implement the above algorithm as a Python UDF 
>> and then have a reduce step that averages the results.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Sep 25, 2014 at 3:06 PM, Kevin Weiler <kevin.wei...@imc-chicago.com> 
>> wrote:
>> Hi All,
>> 
>> I have a query that attempts to compute percentiles on some datasets that 
>> are well in excess of 100,000,000 rows, and I have thus opted to use 
>> percentile_approx, as we are routinely overrunning memory. I’m having 
>> trouble finding a threshold at which I want to begin sampling. Before this 
>> dataset got so large, the maximum number of rows I would need to include in 
>> the percentile was about 1,000,000. I’ve tried using 1,000,000 as a sampling 
>> threshold, then 100,000, and even the default 10,000. For some reason this 
>> query, which previously took about 20 minutes to run, is now taking around 
>> 13 hours to complete (in the case of 100,000 as my sampling threshold). Are 
>> there some Hive settings I should be investigating to see if I can have this 
>> query complete in a reasonable time?
>> 
>> --
>> Kevin Weiler
>> IT
>> 
>> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 
>> | http://imc-chicago.com/
>> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: 
>> kevin.wei...@imc-chicago.com
>> 
>> 
>> 
>> The information in this e-mail is intended only for the person or entity to 
>> which it is addressed.
>> 
>> It may contain confidential and /or privileged material. If someone other 
>> than the intended recipient should receive this e-mail, he / she shall not 
>> be entitled to read, disseminate, disclose or duplicate it.
>> 
>> If you receive this e-mail unintentionally, please inform us immediately by 
>> "reply" and then delete it from your system. Although this information has 
>> been compiled with great care, neither IMC Financial Markets & Asset 
>> Management nor any of its related entities shall accept any responsibility 
>> for any errors, omissions or other inaccuracies in this information or for 
>> the consequences thereof, nor shall it be bound in any way by the contents 
>> of this e-mail or its attachments. In the event of incomplete or incorrect 
>> transmission, please return the e-mail to the sender and permanently delete 
>> this message and any attachments.
>> 
>> Messages and attachments are scanned for all known viruses. Always scan 
>> attachments before opening them.
>> 
>> 
>> 
>> -- 
>> 
>> 
>> https://github.com/bearrito
>> @deepbearrito
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
