Hi, I'm having an issue with a query that applies a transform over a subquery with a distribute by. The query works in Hive 0.11 but not in Hive 0.13 (both on EMR).

Steps to reproduce:

    shell> echo -e "a\tb\tc" > /tmp/a.txt

    create table tbl (
      a string,
      b string,
      c string
    )
    row format delimited fields terminated by '\t';

    load data local inpath '/tmp/a.txt' overwrite into table tbl;

    select transform(a, b)
    using 'python foo.py'
    as (y, z)
    from (
      select a, b, c
      from tbl
      distribute by c
    ) tmp;

Error:

    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: cannot find field _col2 from [0:_col0, 1:_col1]
    Caused by: java.lang.RuntimeException: cannot find field _col2 from [0:_col0, 1:_col1]

However, it works if I add a sort by to the subquery:

    select transform(a, b)
    using 'python foo.py'
    as (y, z)
    from (
      select a, b, c
      from tbl
      distribute by c
      sort by c
    ) tmp;

Is this valid behavior or a bug?

Hive 0.13 plan (non-working query):

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: tbl
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: a (type: string), b (type: string)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Map-reduce partition columns: _col2 (type: string)
                    Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col0 (type: string), _col1 (type: string)
          Reduce Operator Tree:
            Extract
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                Transform Operator
                  command: python foo.py
                  output info:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Stage: Stage-0
        Fetch Operator
          limit: -1

Hive 0.13 plan (working query):

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: tbl
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: a (type: string), b (type: string), c (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col2 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col2 (type: string)
                    Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col0 (type: string), _col1 (type: string)
          Reduce Operator Tree:
            Extract
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                Transform Operator
                  command: python foo.py
                  output info:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Stage: Stage-0
        Fetch Operator
          limit: -1

Hive 0.11 plan:

    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME tbl))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_TABLE_OR_COL c))) (TOK_DISTRIBUTEBY (TOK_TABLE_OR_COL c)))) tmp)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TRANSFORM (TOK_EXPLIST (TOK_TABLE_OR_COL a) (TOK_TABLE_OR_COL b)) TOK_SERDE TOK_RECORDWRITER 'python foo.py' TOK_SERDE TOK_RECORDREADER (TOK_ALIASLIST y z))))))

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            tmp:tbl
              TableScan
                alias: tbl
                Select Operator
                  expressions:
                        expr: a
                        type: string
                        expr: b
                        type: string
                        expr: c
                        type: string
                  outputColumnNames: _col0, _col1, _col2
                  Reduce Output Operator
                    sort order:
                    Map-reduce partition columns:
                          expr: _col2
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
          Reduce Operator Tree:
            Extract
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: string
                outputColumnNames: _col0, _col1
                Transform Operator
                  command: python foo.py
                  output info:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0
        Fetch Operator
          limit: -1
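
In case it helps with reproducing: the contents of foo.py are not included in this post. The error appears to come from Hive's operator setup rather than from the script, so the exact script probably does not matter. A minimal stand-in, assuming a plain pass-through of the two input columns (a hypothetical sketch, not the actual script; `transform_line` and `main` are names made up for illustration), could look like this:

```python
#!/usr/bin/env python
# Hypothetical stand-in for foo.py; the real script is not shown in the post.
# Assumption: it simply passes the two input columns through unchanged.
import sys


def transform_line(line):
    # Hive's TRANSFORM streams each row to the script as one line of
    # tab-separated text; for transform(a, b) that is two fields.
    fields = line.rstrip("\n").split("\t")
    a, b = fields[0], fields[1]
    # Emit two tab-separated columns, which Hive reads back as (y, z).
    return "\t".join([a, b])


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Process rows one at a time until Hive closes the input stream.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

With a script like this in the working directory (or shipped via `add file foo.py;`), the queries above can be run as written.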