[ https://issues.apache.org/jira/browse/HIVE-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei Zhang reassigned HIVE-21915:
--------------------------------

    Assignee: Ning Zhang

> Hive with TEZ UNION ALL and UDTF results in data loss
> -----------------------------------------------------
>
>                 Key: HIVE-21915
>                 URL: https://issues.apache.org/jira/browse/HIVE-21915
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1
>            Reporter: Wei Zhang
>            Assignee: Ning Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> The HQL looks like this:
> CREATE TEMPORARY TABLE tez_union_all_loss_data AS
> SELECT xxx, yyy, zzz, 1 AS tag
> FROM ods_1
> UNION ALL
> SELECT xxx, yyy, zzz, tag
> FROM
> (
>   SELECT xxx
>   ,get_json_object(get_json_object(tb,'$.a'),'$.b') AS yyy
>   ,zzz
>   ,2 AS tag
>   FROM ods_2
>   LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
> ) tbl
> ;
>
> With the above HQL, we expect rows with both tag = 1 and tag = 2 to
> appear. In our case, however, all the rows with tag = 1 are lost.
> Digging deeper, we find that the two generated maps have identical task tmp
> paths. This happens because, when a UDTF is present, the FileSinkOperator is
> processed twice while the tmp path is generated in
> GenTezUtils.removeUnionOperators().

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)