Need help with percentile calculation

Puneet Khatod Wed, 21 Nov 2012 01:08:35 -0800

Hi,

I am trying to use percentile function of HIVE but getting exception from 
Amazon EMR service.
I am using version 0.7.


Please assist. It is very critical and urgent.

Below is the code snippet:

CREATE EXTERNAL TABLE IF NOT EXISTS server_d
(
  ag_date STRING,
  median _time BIGINT,
  95percentile _time BIGINT
) COMMENT 'server'
PARTITIONED by (dt STRING, hh STRING, min STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${hiveconf:S3_INPUT_BUCKET}/server /';

INSERT OVERWRITE TABLE server_d PARTITION(dt='${hiveconf:DT}', 
hh='${hiveconf:HH}', min='${hiveconf:MIN}')
SELECT '${hiveconf:DT}' as ag_date,
percentile(total _time, 0.50) as median _time,
percentile(total_time, 0.95) as 95percentile_time
from my_log my
where my.date_req = '${hiveconf:DT}' and my.dt = '${hiveconf:DT}';


exception that I am getting:

Exception in thread "Thread-239" java.lang.RuntimeException: Error while 
reading from task log url
        at 
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
        at 
org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
        at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for 
URL: 
http://10.***.***.**:9103/tasklog?taskid=attempt_201****0835_0005_m_000001_3&start=-8193
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
        at java.net.URL.openStream(URL.java:1010)
        at 
org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
        ... 3 more

Thanks,
Puneet

From: Felix.徐 [mailto:ygnhz...@gmail.com]
Sent: Wednesday, November 21, 2012 2:22 PM
To: user@hive.apache.org
Subject: Bugs exist in SEMI JOIN?

Hi,
I am using the version 0.9.0 and my tables are the same with TPC-H benchmark:

Here is a simple query(works correctly):

Q1
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN(
  SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE, 'yyyy-MM-dd') 
> unix_timestamp('1995-12-31','yyyy-MM-dd')
 ) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

it means inserting the key of customers who has orders since 1995-12-31 into 
another table.
But if I write the query like this:

Q2
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 AND unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > 
unix_timestamp('1995-12-31','yyyy-MM-dd')

I will get exception from Hive:


FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
          at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
          at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
          at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
          at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
          at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
          at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
          at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
          at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
          at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
          at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
          at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Also,If I write the query like this:
Q3
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 WHERE unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > 
unix_timestamp('1995-12-31','yyyy-MM-dd')

Then this query can be executed(wondering the right hand of SEMI JOIN can be 
referenced in WHERE clause now?), but the result is wrong(comparing to Q1, Q1's 
result is the same with mysql).
Any comments or statements made in this email are not necessarily those of 
Tavant Technologies.
The information transmitted is intended only for the person or entity to which 
it is addressed and may
contain confidential and/or privileged material. If you have received this in 
error, please contact the
sender and delete the material from any computer. All e-mails sent from or to 
Tavant Technologies
may be subject to our monitoring procedures.

Need help with percentile calculation

Reply via email to