[ 
https://issues.apache.org/jira/browse/HIVE-21618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajveer Singh updated HIVE-21618:
---------------------------------
    Description: 
Hive table query with a UDF call, shows multiple calls to the UDF. The UDF is 
being called to as many times as the number of fields in sub-select statement.

 

This behavior leads to performance issues as well as it also gives incorrect 
results while collecting the duration or timings of any operation.

To show the problem TimestampUDF (sample code has been attached) has been 
created which returns three fields namely *start_time* (string), *end_time* 
(string) and *duration_milli* (long). To keep the things simple, fixed 2 second 
delay has been added. In the TimestampUDF duration_mill is calculated as 
(end_time - start_time).

The expectation is that *duration_milli* should be equal to *(end_time - 
start_time)*

 

I have tested following Hive Queries:
 # Select T.Candidate.startTime, T.Candidate.endTime, 
T.Candidate.duration_milli from (Select TimestampUDF() as Candidate) T;
 # Select T.Candidate.startTime, T.Candidate.endTime, 
T.Candidate.duration_milli from (Select TimestampUDF() as Candidate from 
*idtable*) T;

 

*Note*

"*idtable"* is having only one entry and has been created via followig hive 
queries

create table if not exists idTable (RowNum int);
insert overwrite table idTable values (1);

 

In Hive query (1)                              

I am getting correct values because TimestampUDF is being called once and if  
we calculate the  *(end_time - start_time)* manually and compare it with 
returned *duration_milli* then they both are same.

 

However, in Hive query (2)          

*duration_milli* is not equal to *(end_time - start_time)* because the 
TimestampUDF is being called thrice (the number of fields in sub-select query) 
and also
 * start_time is from 1^st^  call of TimestampUDF
 * end_time is from 2^nd^ call of TimestampUDF
 * duration_milli is from 3^rd^ call of TimestampUDF

 

I got following results  from hive query (2), same values can be validated in 
attached logs file or respective platform

 

*EMR* 

+----------------------------+--------------------------++-----------------

+
|starttime|endtime|duration_milli|
---------------------------+--------------------------++-----------------
|2019-04-16T11:53:13.595Z|2019-04-16T11:53:17.619Z|2000|

+----------------------------+--------------------------++-----------------

 

*HDP3*

+----------------------------+--------------------------++-----------------
|starttime|endtime|duration_milli|

+----------------------------+--------------------------++-----------------
|2019-04-16T12:37:33.593Z|2019-04-16T12:37:37.597Z|2000|

+----------------------------+--------------------------++-----------------

 

 

  was:
Hive table query with a UDF call, shows multiple calls to the UDF.

 

This behavior leads to performance issues as well as it also gives incorrect 
results while collecting the duration of any operation.

To show the problem TimestampUDF (sample code has been attached) has been 
created which returns three fields namely *start_time* (string), *end_time* 
(string) and *duration_milli* (long). To keep the things simple, fixed 2 second 
delay has been added. In the TimestampUDF duration_mill is calculated as 
(end_time - start_time).

The expectation is that *duration_milli* should be equal to *(end_time - 
start_time)*

 

I have tested following Hive Queries:
 # Select T.Candidate.startTime, T.Candidate.endTime, 
T.Candidate.duration_milli from (Select TimestampUDF() as Candidate) T;
 # Select T.Candidate.startTime, T.Candidate.endTime, 
T.Candidate.duration_milli from (Select TimestampUDF() as Candidate from 
*idtable*) T;

 

In Hive query (1)                              

I am getting correct values because TimestampUDF is being called once and if  
we calculate the  *(end_time - start_time)* with returned *duration_milli,* 
they both are same.

 

However, in Hive query (2)          

*duration_milli* is not equal to *(end_time - start_time)* because the 
TimestampUDF is being called thrice (the number of fields in sub-select query) 
and also
 * start_time is from 1^st^  call of TimestampUDF
 * end_time is from 2^nd^ call of TimestampUDF
 * duration_milli is from 3^rd^ call of TimestampUDF

 

I got following results ** from hive query (2), same values can be validated in 
attached logs file or respective platform

 

*EMR* 

+---------------------------+---------------------------+-----------------+
| starttime | endtime | duration_milli |
+---------------------------+---------------------------+-----------------+
| 2019-04-16T11:53:13.595Z | 2019-04-16T11:53:17.619Z | 2000 |
+---------------------------+---------------------------+-----------------+

 

*HDP3*

+---------------------------+---------------------------+-----------------+
| starttime | endtime | duration_milli |
+---------------------------+---------------------------+-----------------+
| 2019-04-16T12:37:33.593Z | 2019-04-16T12:37:37.597Z | 2000 |
+---------------------------+---------------------------+-----------------+

 

 


> Hive table query with a UDF call, shows multiple calls to the UDF.
> ------------------------------------------------------------------
>
>                 Key: HIVE-21618
>                 URL: https://issues.apache.org/jira/browse/HIVE-21618
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 3.0.1, 2.3.4
>         Environment: *Hortonworks (HDP3)*
>  * Hadoop 3.1.1.3.0.1.0-187
>  * Hive 3.1.0
>  * Tez 0.9.1
>  
> *EMR*
>  * Hadoop 2.8.5-amzn-1
>  * Hive 2.3.4
>  * Tez 0.9.1
>            Reporter: Rajveer Singh
>            Priority: Major
>         Attachments: TimestampUDF.java, emr_19_cluster_timestampUDF.log, 
> hdp_3_cluster_timestampUDF.log
>
>
> Hive table query with a UDF call, shows multiple calls to the UDF. The UDF is 
> being called to as many times as the number of fields in sub-select statement.
>  
> This behavior leads to performance issues as well as it also gives incorrect 
> results while collecting the duration or timings of any operation.
> To show the problem TimestampUDF (sample code has been attached) has been 
> created which returns three fields namely *start_time* (string), *end_time* 
> (string) and *duration_milli* (long). To keep the things simple, fixed 2 
> second delay has been added. In the TimestampUDF duration_mill is calculated 
> as (end_time - start_time).
> The expectation is that *duration_milli* should be equal to *(end_time - 
> start_time)*
>  
> I have tested following Hive Queries:
>  # Select T.Candidate.startTime, T.Candidate.endTime, 
> T.Candidate.duration_milli from (Select TimestampUDF() as Candidate) T;
>  # Select T.Candidate.startTime, T.Candidate.endTime, 
> T.Candidate.duration_milli from (Select TimestampUDF() as Candidate from 
> *idtable*) T;
>  
> *Note*
> "*idtable"* is having only one entry and has been created via followig hive 
> queries
> create table if not exists idTable (RowNum int);
> insert overwrite table idTable values (1);
>  
> In Hive query (1)                              
> I am getting correct values because TimestampUDF is being called once and if  
> we calculate the  *(end_time - start_time)* manually and compare it with 
> returned *duration_milli* then they both are same.
>  
> However, in Hive query (2)          
> *duration_milli* is not equal to *(end_time - start_time)* because the 
> TimestampUDF is being called thrice (the number of fields in sub-select 
> query) and also
>  * start_time is from 1^st^  call of TimestampUDF
>  * end_time is from 2^nd^ call of TimestampUDF
>  * duration_milli is from 3^rd^ call of TimestampUDF
>  
> I got following results  from hive query (2), same values can be validated in 
> attached logs file or respective platform
>  
> *EMR* 
> +----------------------------+--------------------------++-----------------
> +
> |starttime|endtime|duration_milli|
> ---------------------------+--------------------------++-----------------
> |2019-04-16T11:53:13.595Z|2019-04-16T11:53:17.619Z|2000|
> +----------------------------+--------------------------++-----------------
>  
> *HDP3*
> +----------------------------+--------------------------++-----------------
> |starttime|endtime|duration_milli|
> +----------------------------+--------------------------++-----------------
> |2019-04-16T12:37:33.593Z|2019-04-16T12:37:37.597Z|2000|
> +----------------------------+--------------------------++-----------------
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to