[ 
https://issues.apache.org/jira/browse/HIVE-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Doneriya updated HIVE-21915:
-----------------------------------
    Description: 
The HQL is as follows:

CREATE TEMPORARY TABLE tez_union_all_loss_data AS
SELECT xxx, yyy, zzz, 1 AS tag
FROM ods_1

UNION ALL

SELECT xxx, yyy, zzz, tag
FROM
(
  SELECT xxx,
         get_json_object(get_json_object(tb, '$.a'), '$.b') AS yyy,
         zzz,
         2 AS tag
  FROM ods_2
  LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
) tbl;

 

With the above HQL, we expect rows with both tag = 1 and tag = 2 to appear. 
In our case, however, all the rows with tag = 1 are lost.

Digging deeper, we found that the two generated maps have identical task tmp 
paths. This happens because, when a UDTF is present, the FileSinkOperator is 
processed twice while the tmp path is generated in 
GenTezUtils.removeUnionOperators().
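
One simple way to check for the symptom described above is a per-tag row count 
on the resulting table. This is only a minimal sketch: it assumes the CREATE 
TEMPORARY TABLE statement was run in the same session (temporary tables are 
session-scoped) and uses only the table and column names from the query above.

```sql
-- Count the rows per tag in the table created by the CTAS above.
-- Expected: one result row for tag = 1 and one for tag = 2.
-- When the problem occurs, no row for tag = 1 comes back.
SELECT tag, COUNT(*) AS row_count
FROM tez_union_all_loss_data
GROUP BY tag;
```

If the table is populated correctly, the count for tag = 1 should equal the 
result of SELECT COUNT(*) FROM ods_1.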

 


> Hive with TEZ UNION ALL and UDTF results in data loss
> -----------------------------------------------------
>
>                 Key: HIVE-21915
>                 URL: https://issues.apache.org/jira/browse/HIVE-21915
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 1.2.1
>            Reporter: Wei Zhang
>            Assignee: Wei Zhang
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21915.01.patch, HIVE-21915.02.patch, 
> HIVE-21915.03.patch, HIVE-21915.04.patch
>
>
> The HQL is as follows:
> CREATE TEMPORARY TABLE tez_union_all_loss_data AS
> SELECT xxx, yyy, zzz, 1 AS tag
> FROM ods_1
> UNION ALL
> SELECT xxx, yyy, zzz, tag
> FROM
> (
>   SELECT xxx,
>          get_json_object(get_json_object(tb, '$.a'), '$.b') AS yyy,
>          zzz,
>          2 AS tag
>   FROM ods_2
>   LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
> ) tbl;
>  
> With the above HQL, we expect rows with both tag = 1 and tag = 2 to appear. 
> In our case, however, all the rows with tag = 1 are lost.
> Digging deeper, we found that the two generated maps have identical task tmp 
> paths. This happens because, when a UDTF is present, the FileSinkOperator is 
> processed twice while the tmp path is generated in 
> GenTezUtils.removeUnionOperators().
>  



